I'm a missionary in Japan. The name of my mission agency is WEC International. That's supposedly Worldwide Evangelisation for Christ, but I think I have a better idea about what it stands for...
2007-02-22
The Bible Code
A while ago I wrote:
Now, I admit that there is a cost of converting a (let's say) GBT marked-up Bible (let's focus on Bibles, not books) into proprietary format. This is the cost of writing one (1) XSLT stylesheet. Having done similar stuff, I'd estimate that at half a day's work.
I was close. It took four hours: two hours to write a disassembler, an hour to write the converter, and an hour to fix the bugs. After this bit of software archeology, and the project I did over the summer converting a gigabyte of undocumented legacy files without even having any program which could read them, I'm no longer daunted by binary file formats. Nor should you be. In addition to that article, here are a few more tips about understanding binary files.
- First, if you can create your own files, half the battle is already won. Create a file, run it through od -c, make a change, save it, and notice what changed in the binary. Generate as many subtly different files as you can, and diff the output of od.
- If you can't generate your own, your job is harder but not impossible. You have to proceed directly to stage two: spotting patterns.
- Several patterns recur frequently in binary files: numbers of records, lengths of strings, lengths of records, offsets into records. If you can add a bit of data, which bytes change? Do they look like they might be lengths or offsets? How much longer does your file get if you add a single byte of input?
- Numbers are often packed into four-byte units, but sometimes into two- or one-byte strings. People love storing numbers in big-endian shorts and longs, so break out the "N" and "n" pack templates.
- Speaking of which, Parse::Binary::Iterative is a fantastic Swiss Army Chainsaw for decrypting binary files. Start with a very generic template: Data => "A*", and add fields when you notice changes. To begin with you won't know what those fields mean, so just call them "Unknown1" and "Unknown2". They'll become clear soon enough. Writing a parser first means that you can do the diffs on textual data, which is much easier to handle - even if your guess at the structure is incorrect.
- Use strings and od -c to identify plain text strings, and pull them out. Look at them individually, find their encoding, and see what other properties you notice. Then look around them; can you find the lengths encoded anywhere? Are there any numbers which index into the list of strings?
- Apart from that, it's all about spotting patterns. Do the numbers monotonically increase? Do they repeat after a given number of bytes? Use Perl to lay out the data in rows of a fixed length, and change the length until you have something that looks right.
- Don't assume that everything is there for a reason. Programmers can be lazy, and encode data they don't need. If something doesn't make sense, treat it as unknown and come back to it later when things don't quite work.
- If you have the software to hand, try tweaking some bytes in the file and reading it back in. What changes?
- Isolate, divide and conquer. Start by parsing a file of one record. Then one of two records. Chances are after you get two records right, the whole thing will work. But maybe it doesn't. If it stops working after, say, 105 records, what are the properties of the file? Is it more than a "significant" size (has it gone over 256k, for instance)? Is there anything interesting in the number of records? If you can, try using the application to add a record to the last file which worked, then diff your attempt and the real output. Maybe it's time to revisit some of those bytes you don't understand yet.
- Once you've written the parser, write the generator. I should meddle with Parse::Binary::Iterative so it automatically does this for you, but it's easy enough just to reuse the old pack templates.
- If you can't see any clear text at all, then either it's packed funny (will the data fit into seven bit bytes? Unlikely because, as I mentioned earlier, programmers are lazy) or it's encrypted. Go read about known-plaintext attacks and probabilistic analysis, and be prepared for the long-haul. Hint: Try XOR first. Programmers are lazy. You'll be surprised how often it works.
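Several of the tips above can be sketched in a few lines of code. This is only an illustration, not the code used for the conversion: it is written in Python rather than Perl, with the struct formats ">I" and ">H" standing in for the "N" and "n" pack templates. The file layout (a two-byte record count followed by length-prefixed strings), the field names and all four helper functions are made up for the example.

```python
import struct


def diff_bytes(old: bytes, new: bytes):
    """Return (offset, old_byte, new_byte) for every position that differs.

    Positions past the end of the shorter input are reported as None.
    """
    changes = []
    for i in range(max(len(old), len(new))):
        a = old[i] if i < len(old) else None
        b = new[i] if i < len(new) else None
        if a != b:
            changes.append((i, a, b))
    return changes


def rows(data: bytes, width: int):
    """Lay the data out as hex rows of a fixed width, to eyeball repeats."""
    return [
        " ".join(f"{b:02x}" for b in data[i:i + width])
        for i in range(0, len(data), width)
    ]


def parse(data: bytes):
    """Parse the hypothetical layout used in this example.

    Header: 2-byte big-endian record count.
    Record: 4-byte big-endian length, 2 bytes we do not understand yet
    (kept as 'Unknown1'), then the text itself.
    """
    (count,) = struct.unpack_from(">H", data, 0)
    offset = 2
    records = []
    for _ in range(count):
        length, unknown1 = struct.unpack_from(">IH", data, offset)
        offset += 6
        text = data[offset:offset + length]
        offset += length
        records.append({"Length": length, "Unknown1": unknown1, "Text": text})
    return records


def generate(records):
    """Inverse of parse(): reuse the same formats to write the file back."""
    out = struct.pack(">H", len(records))
    for r in records:
        out += struct.pack(">IH", len(r["Text"]), r["Unknown1"]) + r["Text"]
    return out


def guess_xor_key(data: bytes, crib: bytes):
    """Try every single-byte XOR key; return the keys that reveal the crib."""
    return [k for k in range(256) if crib in bytes(b ^ k for b in data)]


if __name__ == "__main__":
    # Two files that differ by one character of input: which bytes change?
    one = generate([{"Unknown1": 0, "Text": b"In the beginning"}])
    two = generate([{"Unknown1": 0, "Text": b"In the beginning!"}])
    for change in diff_bytes(one, two):
        print(change)
    for row in rows(two, 8):
        print(row)
```

Adding one character changes two things here: a byte near the start goes up by one (a strong hint that it is a length) and the file grows by one byte. Carrying "Unknown1" through both parse() and generate() means the round trip still works before anyone knows what the field is for.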
As with most of programming, it's 90% perspiration and 10% inspiration. The summer job had me working fourteen-hour days for a week - mainly through maniacal obsession and problem-solving drive, but also through neighbours with young children and thin walls!






