Kaitai Struct: A new way to develop parsers for binary structures (kaitai.io)
230 points by marcodiego on March 17, 2022 | 65 comments



There's also Wuffs, a safe and fast programming language made by Google specifically for decoding and encoding file formats https://github.com/google/wuffs

Paired with C FFI available in most languages, this seems like the nicer solution. It's simpler than generating code for a bunch of high level languages, and more performant


They’re designed to solve different problems, really. Kaitai is often used when trying to represent a binary format from a reverse engineer’s perspective, to allow for sharing this work and using it from whatever other tooling you have. Wuffs is more “production grade” and is intended to provide performant, safe parsing of known formats where one would typically use a language like C or C++.


I agree that they address similar but ultimately different problems. Copy/pasting from https://github.com/google/wuffs/blob/main/doc/related-work.m... gives:

> Kaitai Struct is in a similar space, generating safe parsers for multiple target programming languages from one declarative specification. Again, Wuffs differs in that it is a complete (and performant) end to end implementation, not just for the structured parts of a file format. Repeating a point in the previous paragraph, the difficulty in decoding the GIF format isn't in the regularly-expressible part of the format, it's in the LZW compression. Kaitai's GIF parser returns the compressed LZW data as an opaque blob.


This is "made by a Googler"; that is not the same as "Made by Google".

  This is not an official Google product, it is just code that happens to be owned by Google


That's probably just an artifact of how the author initially went through the open sourcing process. It is code that is written and maintained by a Googler as his day job, that's used in skia (i.e. chromium, android, etc.) today for GIF decoding: https://source.chromium.org/chromium/chromium/src/+/main:thi...


> Wuffs the Library is available as transpiled C code

> However, unlike hand-written C, Wuffs the Language is safe with respect to buffer overflows, integer arithmetic overflows and null pointer dereferences. A key difference between Wuffs and other memory-safe languages is that all such checks are done at compile time, not at run time. If it compiles, it is safe, with respect to those three bug classes.

Sounds a lot like ZZ. [0] It also sounds much like SPARK Ada, except for transpiling to C.

[0] https://github.com/zetzit/zz


How much of a future should I expect for Wuffs?

The linked-to page says: "Version 0.2. The API and ABI aren't stabilized yet. The compiler undoubtedly has bugs."

There are not many recent commits, and mostly by one developer.


Not for managed environments like client-side JS, JVM, .NET, …


This appears to just allow you to parse binary formats to the represented fields. (Not that that's not extremely useful, doing this in managed languages is generally a giant pain in the ass!)

Wuffs is much more powerful: it's essentially a safe C dialect that compiles to C and lets you write an entire codec knowing that there aren't any overflows.


I’m a big fan of Kaitai Struct, to the point where I’ve even contributed a small bit of improvements to its Go support, and I use it in a handful of small projects. It’s indispensable for spelunking blobs of binary data.

I’ve also taken some inspiration with a Go library I wrote, restruct:

https://github.com/go-restruct/restruct

… which is a bit like Go’s JSON encoding/decoding library, but with kaitai-like annotations for binary encoding. Check the PNG example to see some of what can be done with it:

https://github.com/go-restruct/restruct/blob/master/formats/...

The main advantage it has over Kaitai at the moment is that it can serialize, which is still not a feature of mainline Kaitai Struct, though it will hopefully be some day.


I wrote a similar solution based on reflection for a game server I wrote in Go. It's probably too early to be worrying about this, but reflection is apparently slow. I also found it difficult to write reflection code that was robust.

The problem I'm stuck on now is that since my "Unmarshal" function returns interface{}, I have to type assert for every type of packet in the protocol. There's probably thousands in total. I need some magic that can automatically Unmarshal to a concrete type and pass it to an appropriate handler. gRPC's codegen does this for you. Still working out how best to do this in the constraints of Go.


You can type-assert from interface{} into any other interface, so something like this might work: https://go.dev/play/p/bhrXr8-V7Gc


I do find the documentation/packaging rather unclear. For example, the website mentions an IDE, but when you download it, there is no IDE in the downloaded package (that's what I remember). I don't remember the exact details, but the Java documentation was rather unclear; I had to work with it for a bit before understanding how it worked.

Actually could not figure out how to parse a binary file/stream containing heterogeneous binary structures from reading the docs. I suppose it is possible?


It's been some time since I had to parse a binary file, but your Go struct annotations look great. Adding streaming (un)marshalling could be a killer feature for some.


Kudos - this is neat - I especially love the library of pre-existing descriptions, which helps me learn about the tool as well as about an abundance of file formats without wasting time re-engineering them.

This is somewhat akin to ASN.1.

My personal feature wish list:

- support writing as well as reading;

- support generating Rust, Julia and Swift code.

- upload button to let users add to a contrib/ folder of existing format descriptions


Hard to beat the Erlang "destructuring" bit syntax, for example [0] parsing an IPv4 datagram from `Dgram`:

  -define(IP_VERSION, 4).
  -define(IP_MIN_HDR_LEN, 5).

  DgramSize = byte_size(Dgram),
  case Dgram of 
      <<?IP_VERSION:4, HLen:4, SrvcType:8, TotLen:16, 
        ID:16, Flgs:3, FragOff:13,
        TTL:8, Proto:8, HdrChkSum:16,
        SrcIP:32,
        DestIP:32, RestDgram/binary>> when HLen>=5, 4*HLen=<DgramSize ->
          OptsLen = 4*(HLen - ?IP_MIN_HDR_LEN),
          <<Opts:OptsLen/binary,Data/binary>> = RestDgram,
      ...
  end.

[0] https://www.erlang.org/doc/programming_examples/bit_syntax.h...


From the page you linked, it appears that the syntax can only handle cases where the size of a data structure is known ahead of reading the data? What about a more complex example, such as JFIF or ZIP, where the parser has to read a data structure of unknown length until terminating bytes are discovered? Do you know of any more complex Erlang examples?


The page covers this further down -- you can "bind" and then "use" a size [0]:

  <<Sz:8,Payload:Sz/binary-unit:8,Rest/binary>>

This bit syntax binds `Sz` as an 8-bit length value and `Payload` as `Sz` 8-bit bytes, with the remainder in `Rest`.

[0] https://www.erlang.org/doc/programming_examples/bit_syntax.h...


Thanks for the example. Is there a way to perform a positive and negative look-ahead instead of just reading the remaining data to end of input?


You also have Erlang’s pattern matching at your disposal:

    %% Base case: stop when the input is exhausted.
    decode(<<>>) ->
        [];
    decode(<<16#01, Size:8, Payload:Size/binary-unit:8, Rest/binary>>) ->
        [{field1, Payload}|decode(Rest)];
    decode(<<16#02, Flags:16, Rest/binary>>) when Flags =:= 1 ->
        [{field2, true}|decode(Rest)];
    decode(<<16#02, Flags:16, Rest/binary>>) ->
        [{field2, false}|decode(Rest)].

Anything after the `when` keyword is a guard condition, allowing for some manner of positive/negative “look ahead” (contrived example, but maybe you get the point).
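
For readers who don't read Erlang, here is a rough Python sketch of the same tag-dispatched decoder. It is only an illustration under assumptions: the tags `0x01` and `0x02` and the field names come from the contrived example above, while the loop, the error handling, and the big-endian 16-bit `Flags` (Erlang's default for integer segments) are spelled out by hand.

```python
import struct
from typing import Any, List, Tuple


def decode(buf: bytes) -> List[Tuple[str, Any]]:
    """Decode a sequence of tagged records into (name, value) pairs."""
    fields: List[Tuple[str, Any]] = []
    pos = 0
    while pos < len(buf):
        tag = buf[pos]
        if tag == 0x01:
            # Tag 0x01: one size byte, then `size` payload bytes.
            if pos + 2 > len(buf):
                raise ValueError("truncated size field")
            size = buf[pos + 1]
            payload = buf[pos + 2:pos + 2 + size]
            if len(payload) != size:
                raise ValueError("truncated payload")
            fields.append(("field1", payload))
            pos += 2 + size
        elif tag == 0x02:
            # Tag 0x02: a 16-bit big-endian flags word; the test on its
            # value plays the role of the Erlang guard.
            if pos + 3 > len(buf):
                raise ValueError("truncated flags field")
            (flags,) = struct.unpack_from(">H", buf, pos + 1)
            fields.append(("field2", flags == 1))
            pos += 3
        else:
            raise ValueError(f"unknown tag {tag:#04x} at offset {pos}")
    return fields
```

The explicit bounds checks stand in for what Erlang gets for free: a clause simply fails to match when the binary is too short.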


Yes, this is exactly how you do it, with pattern matching & guards!


My OCaml Bitstring was directly influenced by Erlang: https://bitstring.software/examples/


For my Mario Maker 2 level viewer, I used binrw in Rust (compiled to wasm) to parse the level data, it was quite a nice experience, very declarative.

Lib: https://github.com/jam1garner/binrw

Example level (a little slow to load): https://www.smm2-viewer.com/courses/BMV-CN5-4DG


This is fantastically cool, Andrew!

That you reverse engineered Nintendo's API and data model, that you built a lovely and retro visualizer, that you made it a social tool. The technical details. Rust. Wasm.

Everything about this rocks.

Keep it up!


Thanks!

I can't claim credit for the API and reverse engineering of the data model (https://github.com/0Liam/smm2-documentation), standing on the shoulders of giants here. But the original level viewer was a Windows app which I couldn't use (there have since been a handful of web ports of it); it seemed like an opportunity to learn a few technologies (Rust, WASM). There were a lot of interesting problems, I'll do a proper write-up one day.


Erlang got this right: for the narrow case of packets in/mangle/out, described like an RFC bit-field diagram, it was very clean and simple.


The sad part about Kaitai is that it doesn’t do encoding, which I only found out later.


If the source was binary data, shouldn't the Kaitai parsed struct be dumpable using the likes of fwrite()? Or do you mean encoding the strings and bit-fields?


They possibly mean the encoding of application data into a wire protocol for example.


As a code generator, I guess this may be nice. It seems like a DSL like the Nom [0] API is more natural and expressive, though. I imagine you can hit limits to expressiveness in YAML pretty quickly.

[0] https://github.com/Geal/nom


And be sure to visit the "example parsers" hidden in an issue comment thread: https://github.com/Geal/nom/issues/14

I was sad at "[ ] PDF" because I consider that the torture test of any binary parsing framework


As far as .NET implementation goes, it is really bad:

- Very old and currently obsolete project target

- As a result, does not use modern data types such as Span<T>

- No utilisation of ArrayPool<T> which is important for things like serialisers where you expect to deal with buffers a lot

- Appears to be a blind Java port given provided code style

This is not acceptable when working with low-level and binary structures which this standard is focused on. Yes, I know, this is an OSS project and therefore instead of complaining here I should have been working on contributing a PR to fix those issues. However, my main concern is that this standard and set of libraries in the current form work against the performance-sensitive nature of working with binary data.


The JavaScript runtime isn't much better; it's missing support for 64-bit types using BigInt, and the runtime itself could be far more lightweight and efficient. I would submit fixes, but the project just seems unmaintained as a whole.


The C++ output is also pretty bad, as another comment points out. All substructures end up as heap-allocated pointers, with no way to have them inlined as values.


Kaitai is a really great system, with an awesome WebIDE. At work we have just started a project to use it for astrophysics simulations and data from dark matter detectors, and one of my hobby projects is to use it to explore retro game data formats.


Interesting, but now you have to add in the possibility of having bugs in your YAML file. The YAML is probably less readable than the spec for the binary format itself.

Looking at the code-gen for utf8_string [0], it's a case of 'thanks, but no thanks':

> std::unique_ptr<std::vector<std::unique_ptr<utf8_codepoint_t>>> m_codepoints;

This is a solution looking for a problem, but I bet it was fun to write.

[0] https://formats.kaitai.io/utf8_string/cpp_stl_11.html


Kaitai is usually used when dealing with blobs that have poorly defined or reverse engineered layout. It’s not particularly great at what you probably think it should be used for, which is why it seems strange to you.


I like Perl's Data::ParseBinary (https://metacpan.org/pod/Data::ParseBinary)

    my $s = Struct("foo",
        UBInt8("a"),
        UBInt16("b"),
        Struct("bar",
            UBInt8("a"),
            UBInt16("b"),
        )
    );
    my $data = $s->parse("ABBabb");
    # $data is { a => 65, b => 16962, bar => { a => 97, b => 25186 } }
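
For comparison, the same fixed layout can be sketched with Python's standard `struct` module. The function name `parse_foo` is made up for illustration; `UBInt8` and `UBInt16` are taken to mean unsigned big-endian 8-bit and 16-bit integers, which matches the values in the comment above.

```python
import struct


def parse_foo(data: bytes) -> dict:
    """Parse the 6-byte layout a:u8, b:u16, bar.a:u8, bar.b:u16 (big-endian)."""
    # ">" selects big-endian with no padding, B = unsigned 8-bit, H = unsigned 16-bit.
    a, b, bar_a, bar_b = struct.unpack(">BHBH", data)
    return {"a": a, "b": b, "bar": {"a": bar_a, "b": bar_b}}
```

Unlike the Perl version, the nesting here is rebuilt by hand after unpacking; the format string itself is flat.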


How does it handle length-prefixed data?

  length u32
  data   u32[length]
This is a huge problem for traditional parsers since the input modifies the grammar dynamically.
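
To make the question concrete, this is roughly what a hand-written reader for that layout looks like in Python. It is a sketch only: the name `parse_u32_array` is hypothetical and little-endian byte order is an assumption, since the layout above doesn't specify one.

```python
import struct
from typing import List, Tuple


def parse_u32_array(buf: bytes, offset: int = 0) -> Tuple[List[int], int]:
    """Read a u32 count followed by that many u32 values.

    Returns the values and the offset just past the array.
    """
    # The count read from the input decides how much to read next.
    (count,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    needed = 4 * count
    if len(buf) - offset < needed:
        raise ValueError("buffer shorter than declared array length")
    values = list(struct.unpack_from(f"<{count}I", buf, offset))
    return values, offset + needed
```

The format string is built from `count` at runtime, which is exactly the "input modifies the grammar" property being asked about.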



I see, you can pass in a function that computes the length. I wonder if a declarative or formal solution will ever exist.


Tried it a while back, but the YAML-based description is really annoying.

It doesn't support bit-by-bit reading, which I needed, so I had to stop when I realised that.


I am interested in your use case as I worked on a similar tool, handling a number of file formats.

Do you mean mapping meaning from bit fields?


Kuinox could be referring to bit field types that are packed in a way that doesn't correspond to 8-bit alignments. See [1] and [2].

Kaitai Struct does support bit fields, but they have to be aligned on 8-bit boundaries [3]. I can't think of any binary format still in use that would fail to align on 8-bit boundaries.

[1] http://www.catb.org/esr/structure-packing/#_bitfields

[2] https://retrocomputing.stackexchange.com/questions/15512/did...

[3] https://doc.kaitai.io/user_guide.html#_bit_sized_integers


I used it when I was trying to reverse engineer Fortnite replay files. https://fortnitereplaydecompressor.readthedocs.io/en/latest/...

Unreal Engine encodes network packets as a bitstream: when it wants to write a boolean, for example, it writes a single bit. The following integer won't be aligned.
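
A minimal sketch of what reading such a stream involves, in Python. The `BitReader` class is hypothetical, and the bit order (least significant bit of each byte first) is an assumption made for the example, not something taken from the Unreal Engine source.

```python
class BitReader:
    """Read values from a byte buffer one bit at a time, LSB of each byte first."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0  # current position, counted in bits

    def read_bit(self) -> int:
        if self.pos >= len(self.data) * 8:
            raise EOFError("read past end of bitstream")
        byte = self.data[self.pos >> 3]
        bit = (byte >> (self.pos & 7)) & 1
        self.pos += 1
        return bit

    def read_bits(self, count: int) -> int:
        """Read `count` bits as an unsigned integer, first bit read = lowest bit."""
        value = 0
        for i in range(count):
            value |= self.read_bit() << i
        return value
```

For example, with a boolean `1` followed by the 8-bit integer `0xAB`, the stream packs into the two bytes `0x57 0x01`; `read_bit()` then returns `1` and `read_bits(8)` returns `0xAB`, even though the integer straddles a byte boundary.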


Do you mean the Oodle compression mentioned in the link? Because I don't see any such unaligned bitstream unpacking mentioned in the link or in https://fortnitereplaydecompressor.readthedocs.io/en/latest/...


I like GNU poke for futzing with binary data.

http://www.jemarch.net/poke

The poke folks and others recently held an online conference called Binary Tools Summit:

https://binary-tools.net/summit


My team is using Kaitai in production at work to decode proprietary messages from a partner. The web IDE makes writing and debugging the format files pretty easy and I would strongly recommend using it.

Overall I'm very happy with Kaitai. I've found it to be effective and enjoyable to use.


Kind of related: I wanted to mention Synalyze It!, which I found to be a decent binary viewer / editor. You can also write your own grammars.

https://www.synalysis.net/


I used Kaitai for parsing and reverse engineering a binary format that does have some documentation, but it's mostly wrapped in a lot of legal mumbo jumbo. I was a total novice at parsing binary files and it looked very intimidating, but with the Kaitai IDE I was able to pick it up quickly and had some wonderful results. It has some limitations (already mentioned here), but it's still a very powerful tool. There is a (relatively small) community active in the gitter.im lobby that was very welcoming and helpful when I got stuck on a problem.

I can't recommend it enough, even (or especially) if you are a novice. It looks daunting, but I found it surprisingly nice.


Ugh, wish I'd found this a couple of years ago, before hand-writing a Unity asset parser in Node.js for a hobby project (big/little-endian mixes, byte alignment, versioned header format, different compression algos, etc.).


Great library - too bad it only allows reading


If you're working in Python and need to write as well as read, check out Construct[0], which is also a declarative parser builder.

[0]: https://construct.readthedocs.io/en/latest/intro.html


Seems rather well designed actually. Appears that you can even use length-delimited lists and stuff. I like it. I have a project where we have a compact binary encoding and I have to write documentation _and_ serde for it. This works for docs and deserialization so that’s good. I understand why serialization isn’t supported but I feel like there’s probably a clever API that allows inserting your own ser in. We’ll see. I might switch our internal thing this weekend to it.

Would be cool if you could generate a protocol diagram from this.


Can it parse a binary file/stream containing heterogeneous binary structures? Documentation seems not that clear regarding this.

Also, the structure definition seems to need compiling. What if the structure changes at runtime? Can it handle that?

Is it possible for an application to have editable structures and for Kaitai to parse them without needing to recompile the whole application? (I'm speaking here in a Java context.)


I tried to use the spec as a description of disassembled data sections for a toy disassembler I was writing a while back. So I was using the spec but not the tools.

It is great that the spec is YAML, because that's quick to get up and running. It was just unfortunate that the expression language is a custom string syntax, i.e. not YAML.

It may have improved, but at the time the custom language wasn't unambiguously specified either.


Another similar binary parsing language is 010 Editor’s Binary Templates [1]. I prefer it over Kaitai’s language because it reads more like a typical struct definition with control structures interspersed throughout.

[1] https://www.sweetscape.com/010editor/templates.html


I contributed a number of file formats a few years ago (and attempted numerous others) but ran into a number of problems with certain file formats:

1. It's not possible to read from the file until a multiple byte termination sequence is detected. [1]

2. You can't read sections of a file where the termination condition is the presence of a sequence of bytes denoting the next unrelated section of the file (and you don't want to consume/read these bytes) [2]

3. The WebIDE at the time couldn't handle very large file format specifications such as Photoshop (PSD) [3]

4. Files containing compressed or encrypted sections require a compression/encryption algorithm to be hardcoded into Kaitai struct libraries for each programming language it can output to.
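
To illustrate what problems 1 and 2 are asking for, here is how the two behaviours look in hand-written Python. This is not Kaitai syntax or any Kaitai runtime API; the helper `read_until` and its `consume` flag are hypothetical names for the sake of the sketch.

```python
from typing import Tuple


def read_until(buf: bytes, start: int, terminator: bytes,
               consume: bool = False) -> Tuple[bytes, int]:
    """Return the bytes from `start` up to a multi-byte terminator.

    Also returns the position to continue from: past the terminator when
    `consume` is True (problem 1), or at the terminator itself when it
    actually marks the start of the next section (problem 2).
    """
    end = buf.find(terminator, start)
    if end == -1:
        raise ValueError("terminator not found")
    next_pos = end + len(terminator) if consume else end
    return buf[start:end], next_pos
```

With `consume=False` the terminator bytes stay in the stream, so the parser for the next section can read its own marker.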

The WebIDE I particularly liked as it makes it easy to get started and share results. I also liked how Kaitai Struct allows easy definition of constraints (simple ones at least) into the file format specification so that you can say "this section of the file shall have a size not exceeding header.length * 2 bytes".

Some alternative binary file format specification attempts for those interested in seeing alternatives, each with their own set of problems/pros/cons:

1. 010 Editor [4]

2. Synalysis [5]

3. hachoir [6]

4. DFDL [7]

[1] https://github.com/kaitai-io/kaitai_struct/issues/158

[2] https://github.com/kaitai-io/kaitai_struct/issues/156

[3] https://raw.githubusercontent.com/davidhicks/kaitai_struct_f...

[4] https://www.sweetscape.com/010editor/repository/templates/

[5] https://github.com/synalysis/Grammars

[6] https://github.com/vstinner/hachoir/tree/main/hachoir/parser

[7] https://github.com/DFDLSchemas/


Not able to decode your references.

What does this refer to:

> 4. Files containing compressed or encrypted sections require a compression/encryption algorithm to be hardcoded into Kaitai struct libraries for each programming language it can output to.

Text mentions "Kaitai", but references

4. DFDL [7]

or

[4] https://www.sweetscape.com/010editor/repository/templates/

is unrelated.


Sorry for the confusing format. "[X]" is the notation used for references but "X." is just an item in an ordered list (nothing to do with references).

The comment about compressed and encrypted sections of files relates to the limited built-in algorithms that can be found at [1]. Any additional algorithms, for example, AES-XTS, would need to be added as custom processes and a corresponding function added for each language runtime Kaitai supports.

"DFDL [7]" refers to the reference "[7] https://github.com/DFDLSchemas/".

"4. ..." doesn't have any special meaning.

[1] https://raw.githubusercontent.com/kaitai-io/kaitai_struct_co...


This looks really cool! This would have been really useful to me a couple years ago.


It was available a few years ago, and I found it very useful.


This looks like Data.Binary? Or maybe a less insane version of ASN.1 that compiles into parsers.


It was new in 2015; not anymore.


It wasn't new in 2015, Erlang has been doing this much better for years before that.



