Small thoughts, follow-up

I’m celebrating the 201st entry on my blog while waiting for my plane in the Basel airport lounge. I admit I’m quite tired, since I had a lot of things to get done before my departure: giving instructions to the team, sending my slides to the conference organizer (a very kind guy, by the way), and attending a quarterly CSO meeting in Lausanne this morning, among a thousand other futile or important things.

The IDA Pro Book

Three days ago, I received my IDA Pro book from a UK bookseller that’s cheaper than Amazon and still quite prompt in shipping. Very nice book. I obviously haven’t had time to read it, apart from the foreword and introduction, and to skim the table of contents briefly: it seems to be a really good book that fully covers every single aspect of IDA usage, plugins, the SDK and other things. The writing style is crisp and clear; it’s definitely a good acquisition for IDA users.

Recompiling the Wikipedia for my PocketPC

Among the other things I achieved (or began to do) in the last few weeks, I became interested in recompiling the latest French version of Wikipedia for my PocketPC. That is not so simple: I can tell you that my computer spent at least 40 hours on the matter… and I spent a few dozen (late) hours on it myself… Downloading the roughly 4 GB of compressed Wikipedia is okay, and uncompressing it to around 20 GB still goes fine. But when I had to run the Perl scripts that convert the XML dump structure into a usable HTML document that could be imported into TomeRaider, my PocketPC viewer, things got much worse.

Some articles among the ~1.6 million entries made the Perl interpreter hang and use 100% CPU. Although I didn’t fully understand why those entries were bad, I finally managed to isolate the categories that made this happen. Some weird XML entries in a category of articles related to Spain crashed the scripts… After carefully studying all the Perl scripts (a lot of lines!), I was able to patch them and remove the processing for those entries. But before this achievement I lost more than 15 hours waiting for the script to crash, logging the faulty article and retrying, hoping that the next crash would come sooner rather than later. Toward the end, the script was crashing only after more than 6 hours of running time, a bit frustrating ;)
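
The crash, log and retry loop described above can be sketched as follows. This is only an illustration in Python, not the patched Perl scripts: the function `convert_all`, the `convert` callback and the two file paths are hypothetical names introduced here. The idea is to write the title of the article in progress to a small file before converting it, so that after a crash (or after killing a hung process) the file names the culprit, and the next run adds it to a skip list.

```python
import os


def convert_all(pages, convert, log_path, skip_path):
    """Convert (title, text) pairs, skipping titles that broke an earlier run.

    All names here are hypothetical; `convert` stands in for whatever
    per-article conversion step is being run.
    """
    skip = set()
    if os.path.exists(skip_path):
        with open(skip_path, encoding="utf-8") as f:
            skip = {line.rstrip("\n") for line in f}

    # A title left in the log means the previous run died while converting it.
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as f:
            culprit = f.read().strip()
        if culprit and culprit not in skip:
            skip.add(culprit)
            with open(skip_path, "a", encoding="utf-8") as f:
                f.write(culprit + "\n")

    results = []
    for title, text in pages:
        if title in skip:
            continue
        # Record the title *before* the risky step, and close the file
        # so the record reaches the disk even if the process is killed.
        with open(log_path, "w", encoding="utf-8") as f:
            f.write(title)
        results.append(convert(title, text))

    # Clean finish: nothing is in progress any more.
    with open(log_path, "w", encoding="utf-8"):
        pass
    return results
```

Each rerun still starts from the first article, as in the story above, but it no longer needs anyone to read the logs by hand to find the offending entry.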

Here is a screen capture of the work in progress…

Usually, I use PSPad as my favorite text editor, but even though I really love it, I have to admit that it has some limitations. Opening a 20 GB XML text file and parsing it is not painful… it’s simply impossible, since PSPad tries to load the entire file into RAM, making the PC incredibly slow. I had to give up and try another editor, EmEditor (shareware this time), which has two great advantages over PSPad: it handles Unicode perfectly, which is mandatory for editing the Wikipedia XML files, and it’s the fastest editor around, able to handle text files of more than 200 GB (!!). I must admit that’s true: my 20 GB text file loaded in a few seconds, and editing it, or even worse, inserting and cutting and pasting blocks of XML text in such a huge file, went smoothly. So with EmEditor, I was able to understand the structure of the Wikipedia XML dump files, which is really simple and efficient:

  • a header containing some generic info
  • a « title » entry per page, with a simple structure. Great for parsing, indeed.
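
A layout like this, a header followed by one entry per page, can be read as a stream, so the whole file never has to sit in RAM. Here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree.iterparse`. The element names `mediawiki`, `siteinfo`, `page`, `revision` and `text` in the sample are assumptions about the dump layout (only the title entry is mentioned above), and `iter_pages` is a hypothetical helper, not part of the Perl scripts.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump; the element names are assumptions.
SAMPLE = b"""<mediawiki>
  <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  <page><title>Basel</title><revision><text>Swiss city.</text></revision></page>
  <page><title>Perl</title><revision><text>Language.</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each page element, one page at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            title, text = None, ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            yield title, text
            root.clear()  # drop pages already handled so memory use stays flat


if __name__ == "__main__":
    for title, text in iter_pages(io.BytesIO(SAMPLE)):
        print(title, len(text))
```

For a real dump, `iter_pages` would be given the path of the uncompressed XML file instead of the in-memory sample.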

And so, I patched the Perl files (which I ran inside the PSPad IDE; I also like PSPad for being able to actually run lengthy scripts and compile programs from within it), and finally managed to produce huge result files… ready to import into TomeRaider3, which should take a few additional days of computing time, if I’ve understood the people who have already tried! Compiling the world’s entire knowledge is no easy game, but it’s really worth doing: having the full French (or English) Wikipedia on a fraction of my PocketPC’s micro-SD card is simply great!

Reading spy novels

I’ve also finished reading Robert Littell’s book about the war between Palestinian and Israeli fundamentalists, Vicious Circle, which is just as absorbing as his previous book (I own The Company as an Audible audio book on my iPod… but it’s more than 50 hours of listening time and I had to stay motivated to get to the end of that one ;) ).

Okay, time for me to leave, as the plane’s boarding time is near. I’m thinking that within 15 days I’ll be landing 7 more times, yay!

PS: By the way, if you’re interested in my patched Perl files, or more simply in the resulting Wikipedia file for your PocketPC, I’m obviously giving them away. Just let me know, as I don’t yet have a 15 GB online repository to host my files…

PPS: I can’t wait to meet Ross Anderson… he’s the keynote speaker at the conference ;)

Bruno Kerouanton on September 26th 2008 in General

3 Responses to “Small thoughts, follow-up”

  1. miib responded on 29 sept 2008 at 22:58 #

    Having all of Wikipedia’s articles on your PocketPC must be possible, as I’ve seen in « Die Hard 4 » that the whole Internet can fit on a USB key…. you should ask Bruce Willis for technical assistance ;-)
    Hope you’ll enjoy your conference

  2. Bruno Kerouanton responded on 30 sept 2008 at 19:26 #

    That’s a funny allusion, thanks! I always laugh when I think back to that part of the movie…

  3. TCP 2.0 « C’est bien fait quand même responded on 18 oct 2008 at 10:52 #

    [...] Bruno, I am currently reading The IDA Pro Book. You should not look to it as a book for learning [...]
