
Decoding a 1998 game's BOLT archive, byte by byte.

Before a single car rolled across the board, this project lived inside a stack of .BLT files from a 1998 life-path CD-ROM. Every screen, sprite, palette, and sound was sealed inside a custom container called BOLT, and nothing downstream could move until I could read it. So I started cracking it open, byte by byte, and most of the interesting part was figuring out where I had been lied to.

What BOLT actually is

BOLT is a generic asset archive from Mass Media Games, used across a surprising range of their titles from the early '90s onward, on CD-i, DOS, N64, and Windows. The container layout is shared across platforms, but the compression and byte order are not: this game uses the Windows variant, which is little-endian and has its own custom LZ-style compression scheme. A public C++ extractor exists (heinermann/BOLTextract), and an earlier Python port had been attempted against it. Studying both helped, though not in the way I expected: the real lesson came from understanding why the earlier port failed.

The structure itself is clean once you see it. Find the ASCII magic BOLT (or lowercase bolt) anywhere in the file and treat that position as the origin for everything else. A 16-byte header follows: the magic, a build timestamp packed as single bytes (hour, minute, second, millisecond, month, day, year-since-1900, so 98 means 1998), an entry count where 0 conventionally means 256, and an end offset that the format documentation itself flags as unreliable. After the header comes an array of 16-byte entries.
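As a rough sketch, the header read could look like the following. The names BoltHeader and read_header are hypothetical, and the exact field order and widths (4-byte magic, 7 timestamp bytes, 1 count byte, 4-byte end offset) are an assumption inferred from the description above, chosen because they add up to 16 bytes.

```python
import struct
from typing import NamedTuple

class BoltHeader(NamedTuple):
    origin: int        # file position of the magic; offsets are relative to it
    timestamp: tuple   # (hour, minute, second, millisecond, month, day, year)
    entry_count: int   # a stored 0 is read as 256
    end_offset: int    # flagged as unreliable; kept for reference only

def read_header(data: bytes) -> BoltHeader:
    # The magic may be upper- or lowercase and may sit anywhere in the file.
    hits = [p for p in (data.find(b"BOLT"), data.find(b"bolt")) if p != -1]
    if not hits:
        raise ValueError("no BOLT magic found")
    origin = min(hits)
    # Assumed layout: 4s magic, 7 timestamp bytes, 1 count byte, LE uint32.
    fields = struct.unpack_from("<4s7BBI", data, origin)
    hour, minute, second, ms, month, day, year = fields[1:8]
    count = fields[8] or 256
    stamp = (hour, minute, second, ms, month, day, 1900 + year)
    return BoltHeader(origin, stamp, count, fields[9])
```

Everything read afterwards adds origin to the stored offset, which is why the function returns it alongside the parsed fields.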

A directory tree hiding in 16-byte entries

Each entry is the whole game in miniature. Four single-byte fields (a flags byte, two unknowns, and a file type), then three little-endian uint32 fields: an uncompressed size, a data offset relative to the archive origin, and a file hash. That adds up to 4 + 12 = 16 bytes. The hash is the clever bit, because it doubles as a type tag: a hash of zero means the entry is not a file at all but a directory, and in that case the file_type byte is repurposed to hold the number of children. So the archive is a recursive tree. Walk it depth-first, follow each directory's offset, and you reach every artifact the game ever shipped.

from dataclasses import dataclass

@dataclass
class Entry:
    flags: int
    unk_1: int
    unk_2: int
    file_type: int            # for a directory: child count
    uncompressed_size: int    # LE uint32
    data_offset: int          # LE uint32, relative to bolt_begin
    file_hash: int            # LE uint32; 0 == this entry is a directory

    @property
    def is_directory(self) -> bool:
        return self.file_hash == 0

    @property
    def is_uncompressed(self) -> bool:
        return bool(self.flags & 0x08)   # only flag bit I trust
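A minimal sketch of the depth-first walk, under two assumptions: that a directory's data offset points at an array of child entries, and that the "<4B3I" struct layout matches the field list above. RawEntry and walk are hypothetical names, and RawEntry is a trimmed stand-in for the dataclass so the snippet runs on its own.

```python
import struct
from typing import Iterator, NamedTuple, Tuple

ENTRY = struct.Struct("<4B3I")   # 4 single bytes + 3 LE uint32 = 16 bytes

class RawEntry(NamedTuple):
    flags: int
    unk_1: int
    unk_2: int
    file_type: int
    uncompressed_size: int
    data_offset: int
    file_hash: int

def walk(data: bytes, origin: int, table: int, count: int,
         depth: int = 0) -> Iterator[Tuple[int, RawEntry]]:
    """Yield (depth, entry) depth-first; `table` is relative to `origin`."""
    for i in range(count):
        pos = origin + table + i * ENTRY.size
        entry = RawEntry(*ENTRY.unpack_from(data, pos))
        yield depth, entry
        if entry.file_hash == 0:   # directory: file_type is the child count
            yield from walk(data, origin, entry.data_offset,
                            entry.file_type, depth + 1)
```

For the top level, table would be 16 (the first byte after the header) and count would come from the header. The sketch does no cycle or bounds checking, so a malformed archive could recurse without end or raise struct.error.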

That is the tidy version. The messier version is that following the offsets is exactly where things went wrong, and where they stayed wrong for a good while.

The endianness trap that ate every offset

The reference C++ names its entry fields with a _be suffix, as in big-endian. Read that quickly and you conclude the format stores its numbers big-endian, so you byte-swap every offset and size you read. The earlier Python port did exactly that, unpacking with >I across the board. The catch is that the swap in the original C++ is conditional. It lives in a helper that only fires when a global big-endian flag is set, and that flag is only set behind a --big command-line switch meant for the N64 and CD-i builds. The Windows build that actually shipped on this disc is little-endian from front to back. The _be suffix was misleading historical naming, not a description of the bytes on disk.

A reversed 32-bit offset does not fail loudly. It quietly points somewhere deep inside the file, at bytes that are not an entry header at all, and the extractor wanders off into garbage. No exception, no crash, just nonsense and a long hunt for the reason. The fix was almost insultingly small: read everything with <I instead of >I. Once I stopped trusting the names, the offsets landed on real headers, the tree walked cleanly, and the payloads were ready to decompress.
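To make the failure concrete, here are the same four bytes read both ways. The byte values are made up for illustration, and plausible_offset is a hypothetical sanity check, not something from the reference extractor.

```python
import struct

raw = bytes([0x10, 0x02, 0x00, 0x00])      # four bytes as they sit on disk

little = struct.unpack("<I", raw)[0]       # 0x00000210
big = struct.unpack(">I", raw)[0]          # 0x10020000

def plausible_offset(value: int, origin: int, file_size: int) -> bool:
    """True if a 16-byte entry at origin + value would still fit in the file."""
    return origin + value + 16 <= file_size
```

A range check like this only catches swapped values that land outside the file. The ones that land inside it are the quiet failures described above, so it is a cheap first alarm rather than proof of correct byte order.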

The hardest bug was not in my code. It was in a name I inherited and believed.

An LZ decompressor read off the disassembly

Most payloads are compressed. The Windows decompressor reads opcode-prefixed packets: take a byte, the high nibble is the opcode, and the low nibble plus any trailing bytes parameterize it. Opcode 0x0 is a literal copy of N bytes, 0x4 and 0x5 are run fills, 0x6 is a zero fill, and several opcodes are backreferences that copy from earlier in the output, LZSS-style. The genuinely interesting detail is a self-overlapping copy: the source index is recomputed against the current output length on every iteration, so a backreference can read bytes it is itself still writing. That is how a short backref expands into a long repeating run.
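The self-overlapping copy is easy to get wrong with a slice, so here is a minimal sketch of just that step. copy_backref and its distance and length parameters are hypothetical: how each opcode actually encodes them is not covered here.

```python
def copy_backref(out: bytearray, distance: int, length: int) -> None:
    """Append `length` bytes, each read `distance` bytes behind the write head."""
    if not 0 < distance <= len(out):
        raise ValueError("backreference reaches before the start of the output")
    for _ in range(length):
        # Recompute the source from the current length, so bytes appended
        # earlier in this same loop become valid sources.
        out.append(out[len(out) - distance])

out = bytearray(b"ab")
copy_backref(out, distance=2, length=6)
# out == bytearray(b"abababab")
```

A slice such as out[-distance:] would only ever see the two bytes that existed before the copy began, which is why the loop works one byte at a time.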

Two of those opcodes were wrong in the public reference. Opcodes 0x7 through 0xB are documented as "push the same value twice," but disassembling the game's own executable showed the engine reading two consecutive source bytes instead (a mov dl, [edx], then an inc, then a second load). Until I matched that behavior, text decompressed into something almost readable but subtly corrupted. Afterward, it came out clean.

There was also a structural surprise the reference left as a TODO stub: some compressed entries are not one stream but several concatenated, each ending with the standard 0x00 terminator. So a 0x00 cannot simply stop decompression; it ends a sub-chunk, and you resume from the next byte until you have produced exactly the entry's stated uncompressed size. The header size is the ground truth for when to stop.

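def decompress(buf: bytes, expected_size: int) -> bytes:
    out = bytearray()
    pos = 0
    # The entry's stated size, not a 0x00 byte, decides when to stop.
    while len(out) < expected_size:
        if pos >= len(buf):
            raise ValueError("input ended before expected_size was reached")
        bv = buf[pos]; pos += 1
        if bv == 0:                       # end of a sub-stream
            continue                      # more chunks follow; keep going
        op = bv >> 4                      # high nibble selects the handler
        # ... opcode handlers build `out` ...
    return bytes(out)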

Twenty-four bytes and a green pixel

Decompressing was only half the job; the bytes still needed a shape. Images turned out to use a 24-byte little-endian header carrying a sentinel of 0xFFFF, a bits-per-pixel marker of 0x0008 (so each pixel is one palette index), then width and height, followed by exactly width × height index bytes. A reliable check is that the file length equals 24 + width × height with no slack. This is where the public C++ heuristic quietly disagreed: it expected a 16-byte header from the older Mass Media games, so it labeled these Windows images as unknown and walked past them. The Windows variant simply grew the header to 24 bytes.

The palette held the second surprise. Each palette is 256 RGB triples (768 bytes) behind a short variable-length header of 3 or 6 bytes. Color index 0 is not a real color. Every palette I looked at hard-codes it to (0x00, 0xFF, 0x00), pure green, the engine's chroma-key transparency marker. The artists painted sprites against that green so it could be masked at draw time, the same trick a weather broadcast uses. Treat index 0 as an ordinary color and every sprite renders on a solid green rectangle; treat it as transparent and the art floats free. That green marker is also the most dependable way to detect a palette, since the header bytes vary but index 0 does not.

  • Image: 24-byte LE header, then one palette index per pixel, top-left origin.
  • Palette: variable header, then 256 (R, G, B) triples, 8 bits per channel.
  • Index 0 is chroma-key green: not a color, a "draw nothing" signal.
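The palette rules above can be sketched as two small functions. palette_table and to_rgba are hypothetical names, and the sketch assumes a palette blob is exactly its 3- or 6-byte header plus the 768-byte table, with nothing trailing.

```python
from typing import Optional

KEY = b"\x00\xff\x00"          # index 0: the chroma-key green marker
TABLE_BYTES = 256 * 3          # 256 RGB triples

def palette_table(blob: bytes) -> Optional[bytes]:
    """Return the 768-byte RGB table if blob looks like a palette, else None."""
    for header_len in (3, 6):
        if (len(blob) == header_len + TABLE_BYTES
                and blob[header_len:header_len + 3] == KEY):
            return blob[header_len:]
    return None

def to_rgba(indices: bytes, table: bytes) -> bytes:
    """Expand palette indices to RGBA, giving index 0 an alpha of 0."""
    rgba = bytearray()
    for i in indices:
        rgba += table[3 * i:3 * i + 3]
        rgba.append(0 if i == 0 else 255)
    return bytes(rgba)
```

The indices argument would be the width × height bytes that follow the 24-byte image header. The resulting RGBA buffer can then go to whatever imaging library you prefer, for example Pillow's Image.frombytes("RGBA", (width, height), data).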

How I knew it was right

A reverse-engineering pass can look finished and still be subtly wrong, so I tried not to trust a clean run on its own. The reassuring proof was visual. When the renderer finished, recognizable screens came out in the right colors: blue sky, green trees, white save-slot fields, readable text. A title screen that looks like the title screen is a far better signal than any byte count, and it caught the two-consecutive-bytes opcode bug that an exit code never would have.

The static pass ended up decoding the archive in full: roughly 3,148 images rendered with correct colors, about 1,531 audio clips lifted out of Mass Media's custom 8-bit PCM, and more than 1,500 text strings split between dialog and mini-game items, plus the prize wheels and a couple of game-economy value tables. That set is the foundation the rest of the project leans on. The board art, the spinning cars, the event cards: all of it came out of these containers.

What I carried forward

  • Field names are documentation, and documentation can be wrong. Verify byte order against the build that actually shipped, not against the names someone chose. A single >I versus <I was the whole difference.
  • Reversed integer reads fail silently. When offsets point at garbage with no crash, suspect endianness before anything else.
  • A reference implementation is a hypothesis, not a spec. Two opcodes were wrong in the C++; the disassembly of the real executable was the tiebreaker.
  • A palette index can be a control signal, not a color. Index 0 as transparency turned green rectangles into clean sprites.
  • Validate with your eyes. A correctly colored, recognizable screen told me more than a successful exit code ever could.

Reading old formats is patient work, but it was deeply satisfying to watch it pay off. One careful extractor turned a sealed disc into a full asset library, and that library is what lets the rest of the game exist at all.
