Offline Wikipedia on PocketPC

Click here to download the French Wikipedia for your PocketPC!

Note: this post has been sitting in « draft » status since 22 September 2008… Oh, by the way, this is an English version of this post.

Since last week, I’ve been struggling to compile the latest version of Wikipedia for my PocketPC. It is a bit painful, but it is really worth doing.

At first glance, it may seem impossible to fit the full Wikipedia on a small device like a PocketPC, but it can be done.

The Wikimedia Foundation gives away SQL and XML dumps of all their Wikipedia versions, including everything. That’s huge! The English XML file containing only the article text is already over 17 GB uncompressed, and the French one is nearly 4 GB… Needless to say, it’s nearly impossible to fit that on a PocketPC as is! But there is a magical way to achieve it… It takes time, but it works.

My own French version of Wikipedia, available offline, on a Micro-SD card ;)

Prerequisites:

  • ($$$) A *fast* PC with a lot of free disk space and *a lot* of RAM. A dual-core with 4 GB of RAM is barely enough! (That’s my config…)
  • ($0.00) A Wikimedia XML dump of the latest version.
  • ($0.00) ActiveState Perl, a Windows port of the Perl programming language.
  • ($0.00) The PsPad text editor, a freeware multipurpose text editor that also works as an IDE for Perl.
  • ($39.99) A specific text editor: EmEditor… Although it is shareware, it is the only Unicode text editor I could find that can quickly parse and modify 17 GB XML text files! Initially I didn’t plan to get it since I had PsPad, but I soon realized that, much as I love PsPad and the fact that almost every feature I can think of is already implemented, it cannot handle huge files.
  • (£15.00) TomeRaider for Windows, shareware too. A bit pricey, but there is an offer right now and it costs less if bundled with the PocketPC version.

1. Getting a dump of the Wikipedia

First of all, download one of those dumps. That may take a long time, but Wikipedia is huge! On the XML dump page, you’ll find almost all versions of Wikipedia in different languages, as well as Wikiquote, Wiktionary and all the other projects… To avoid any confusion: the French version is called « frwiki ». Go to the latest version (15 June 2009), then select the « Articles, templates, image descriptions, and primary meta-pages » file, which is here. It is more than 1 GB even though it is compressed with Bzip2! Be patient… Then free some space on your hard drive, as you’ll need several gigabytes to work with. Unpack the file… Whoops, it’s now more than 7 GB!
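If you would rather script the unpacking step than use an archive tool, here is a minimal Python sketch, assuming the dump was saved locally as a `.bz2` file. The function name `unpack_dump` and the file names in the usage comment are made up for illustration; only the standard `bz2` module is used. It decompresses in fixed-size chunks, so the 7 GB output never has to fit in RAM.

```python
import bz2


def unpack_dump(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress a bzip2 file to dst_path chunk by chunk.

    Returns the number of uncompressed bytes written.
    """
    written = 0
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)  # at most chunk_size uncompressed bytes
            if not chunk:
                break  # end of the compressed stream
            dst.write(chunk)
            written += len(chunk)
    return written


# Hypothetical usage (file names are placeholders, not the real dump names):
# size = unpack_dump("frwiki-dump.xml.bz2", "frwiki-dump.xml")
# print(size, "bytes unpacked")
```

Make sure the target drive has room for the uncompressed file before starting, since the sketch does not check free space.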

2. Running the scripts

The painful part of the work is converting the XML file into a format that TomeRaider can import. TomeRaider is able to import a sort of HTML file with its own tags and keywords, but definitely not plain XML. So there are some Perl scripts that simply parse the XML file and create an intermediate file, which TomeRaider then compiles into its final form.
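To give an idea of what the parsing half of that job involves, here is a rough sketch in Python rather than Perl. It is not the script used for this post. It assumes the usual layout of a MediaWiki export, where each `<page>` holds a `<title>` and a `<revision>` with a `<text>` element, and it streams the file instead of loading it whole. The names `iter_pages` and `local_name` are made up. Writing the TomeRaider import markup is left out on purpose, since that format is not described here.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    `source` is a file name or a binary file object. The tree is emptied
    after every page so memory use stays roughly constant.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title = text = ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title = text = ""
            root.clear()  # drop the pages already processed


# Tiny in-memory example; a real run would pass the path of the unpacked dump.
sample = io.BytesIO(
    b"<mediawiki><page><title>Paris</title>"
    b"<revision><text>Capital of France.</text></revision></page></mediawiki>"
)
for page_title, wikitext in iter_pages(sample):
    print(page_title, len(wikitext))
```

Each pair yielded this way would then have to be turned into whatever markup TomeRaider expects, and that is where most of the trouble starts.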

Unfortunately, the XML dumps are very often riddled with mistakes and oddities that crash the Perl scripts… That may seem weird, but with a little less than one million articles written by individuals, plus some specific formatting rules to (for example) display maps or country information, it is not easy for the Perl scripts to run smoothly and succeed on the first attempt.

So it takes many hours to fix the Perl scripts, adding « exceptions » for each page that they cannot handle properly. With the previous versions I compiled, about 50 pages had to be removed, which means re-running the script almost as many times… with each run taking between a couple of hours and a full day… quite painful!
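One way to cut down on those re-runs, sketched here in Python as an idea and not as what the Perl scripts actually do, is to catch the failure of each individual page, note its title, and carry on. A single pass then produces the whole list of problem pages at once. Everything named below (`convert_all`, `convert_page`, `skip_titles`, `fake_convert`) is hypothetical, and `convert_page` stands in for the real wikitext conversion.

```python
import io


def convert_all(pages, convert_page, out, skip_titles=()):
    """Convert every (title, wikitext) pair and write the results to `out`.

    Pages listed in `skip_titles` are ignored. A page whose conversion raises
    is recorded and skipped instead of aborting the run.
    Returns the list of titles that failed.
    """
    skip = set(skip_titles)
    failed = []
    for title, text in pages:
        if title in skip:
            continue  # known-bad page, excluded by hand
        try:
            converted = convert_page(title, text)
        except Exception:
            failed.append(title)  # remember it and keep going
            continue
        out.write(converted)
    return failed


# Stand-in converter that chokes on templates, just to show the control flow.
def fake_convert(title, text):
    if "{{" in text:
        raise ValueError("unsupported template")
    return title + "\n"


demo_out = io.StringIO()
demo_failed = convert_all(
    [("Paris", "plain text"), ("France", "{{country map}}")],
    fake_convert,
    demo_out,
)
print(demo_failed)
```

The titles returned can be reviewed afterwards and either fixed in the converter or added to the skip list for the next build.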

3. Compiling the final version

Once done (after a few days…), you should have the original XML file, which is no longer needed, and an intermediate file. You can now import the intermediate file into TomeRaider, which will convert it into the final .TR3 file. That too may take a long time, so be patient!

4. Enjoy

Done? Great, you should now be able to copy the final .tr3 file onto an SD card and use it directly on your PocketPC (or other device) with TomeRaider for Mobile. Enjoy!!

If you don’t want to go through all of those steps, just click here: I’m giving away a version of the French Wikipedia in TR3 format.

Bruno Kerouanton on June 25th 2009 in Culture

4 Responses to “Offline Wikipedia on PocketPC”

  1. Web Design - Offline Wikipedia on PocketPC | Web Design >> Freeware News responded on 25 June 2009 at 18:36 #

    [...] here to read the rest: Offline Wikipedia on PocketPC [...]

  2. Anna responded on 11 July 2009 at 8:07 #

    Thanks Bruno! …and enjoy your holidays :)

  3. tunde responded on 24 July 2009 at 21:22 #

    english wikipedia in TR3 version, pls. thank you

  4. Bruno Kerouanton responded on 26 July 2009 at 22:11 #

    @tunde: converting a full Wikipedia into TR3 format is a long process, more than one week for me. I managed to do it for the French version only, which is much smaller than the English one. I know that some other people have compiled the English version, but I cannot tell whether it’s recent and complete. Googling those terms should lead you to such TR3 files.
