I tried several times to build such a thing but gave up every time. It is easy for simple file formats, but if you want something universal it gets quite complex pretty quickly. There are file formats where the endianness is specified in the file and not statically known, formats where the bit width of numbers is not statically known, (binary) strings delimited by some special marker with many different ways of escaping the delimiter within the string, data blocks starting at fixed offsets, at offsets relative to some other block or to the end of the file, offsets computed from different values found in the file, compressed blocks, unions of different block types discriminated by a tag, tags that need to be computed from multiple values, ...
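TIFF is a familiar instance of the first case: the file itself announces its byte order, so the parser has to branch on it before it can read anything else. A minimal Python sketch (header layout per TIFF 6.0; no error handling beyond the marker check):

```python
import struct

def read_tiff_header(data: bytes):
    """TIFF stores its byte order in the first two bytes:
    b'II' means little-endian, b'MM' means big-endian."""
    marker = data[:2]
    if marker == b"II":
        fmt = "<"   # little-endian
    elif marker == b"MM":
        fmt = ">"   # big-endian
    else:
        raise ValueError("not a TIFF file")
    # The magic number (42) and the first IFD offset use that byte order.
    magic, ifd_offset = struct.unpack(fmt + "HI", data[2:8])
    return fmt, magic, ifd_offset
```

Every subsequent read in the file depends on that one runtime value, which is exactly why a purely static description falls short.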
I would really love such a tool, but a language able to describe most of the file formats in the wild is probably going to be (close to) Turing complete, and is unfortunately not something you can hack together over a weekend.
Let me start out by saying I do not claim the current binspector grammar can cover all the cases you've described. Yes, at some point binary formats can get so unwieldy the only way to read them is with some kind of Turing-complete system.
What I have discovered, however, is that there is still a vast array of binary formats that can be well described with the format language as it stands. I have tried to devise workarounds to skip past the parts of formats that cannot be handled well. I hope to double back on those limitations and extend the feature set of the language, but I hope what's in there is enough to get people started.
Of the issues you mentioned, Binspector's language can handle dynamic field endianness, dynamic field size, terminators and delimiters, absolute/relative/dynamic offsets, tag-discriminated unions, and lambda calculations. It doesn't do everything, but it does allow a lot.
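For what it's worth, the tag-discriminated-union case is easy to picture in plain code. Here is a hypothetical chunk format (the tags and payload layouts are invented for illustration, not from any real spec), read in Python:

```python
import struct

def read_chunk(data: bytes, offset: int):
    """One byte of tag discriminates which payload follows.
    Returns (kind, value, offset-past-the-chunk)."""
    tag = data[offset]
    if tag == 0x01:                      # fixed-size big-endian integer
        (value,) = struct.unpack_from(">I", data, offset + 1)
        return ("int", value, offset + 5)
    elif tag == 0x02:                    # length-prefixed ASCII string
        (length,) = struct.unpack_from(">H", data, offset + 1)
        raw = data[offset + 3 : offset + 3 + length]
        return ("str", raw.decode("ascii"), offset + 3 + length)
    raise ValueError(f"unknown tag {tag:#x}")
```

The declarative-grammar version of this is a union whose discriminant is the tag field; the imperative version above is what the analyzer effectively executes.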
You already made it far beyond what I ever achieved. I aimed at an XML-based file format description, and that gets messy pretty quickly once you have to express expressions. After the initial friction of having to write a parser, going with a custom language seems way more promising. I always wanted to peek under the hood of Wireshark and see how they do it - do they have specialized code for every protocol, or do they use an abstract protocol description, too? I never did it. I hope your project matures; there is a lot of space in this niche.
I wrote a tool I called the "data file disassembler" somewhere around 15 years ago that is basically this, along w/ a user interface that allowed you to drag a selection area and define a "section" and its type.
I used this primarily for reverse-engineering proprietary formats both for personal entertainment, but also while doing work to determine when/if patents were being violated while working for a legal firm.
I had a variety of supported formats, including several microcontroller instruction sets, as I spent a lot of time disassembling ROMs. These had full support for labels etc.
There was another tool called General Edit (http://www.quadrivio.com/ge.html) which used to be available for the Mac but is now defunct. It was one of the inspirations behind Binspector and sounds very similar to the tool you mentioned.
Having wanted a binary grammar format for a while, and just purchased the Pro version of "Synalyze It!" as a starting point, it would be interesting to hear a short non-specialist's summary of the relative merits of your approach to expressing grammars compared with an XML based approach like "Synalyze It!" ...
Overall, "Synalyze It!" seems mostly fairly stable and certainly a useful tool well worth the purchase price for the Pro version. However, it is closed source, by a small developer, on a single platform: would much prefer to rely on something with source code available ...
Or, maybe a better question is what grammars are you currently using Synalyze It! for? I have used Binspector for fairly complex file formats and it has held up for my purposes in the past. Would it be worth an hour or two to see if Binspector could pass muster?
The grammars are for various 2D & 3D CAD model formats. Some can be fairly complicated, and the data files can be fairly large. For clarity, I'm bootstrapping something and don't have any immediate plans to open source the grammars.
Certainly, it would be worth looking at Binspector.
Probably it would take me a couple of days to get up to speed though, and that is not time that I have immediately.
There isn't any support for importing from a C header. In order to get it right you would need a decent C parser as well as details about the compile-time environment (e.g., char being (un)signed). At this point it's probably faster to code up the format grammar manually.
Generating import code given a bfft is a thought I had considered. One of the features I would like to implement is to separate the parse tree generator from the analyzer and expose an analyzer API. Applications would be able to use the Binspector core as a library, then, and read file contents directly by providing their own bffts and hooking the API to populate their own internals.
Looks really cool! I'd love to see an open grammar definition format for all kinds of tools, both command-line and GUI.
I've worked on parsing TrueType files, and they have some really "interesting" grammars. For example, they have lookup tables that define offsets at the beginning of the file. Also, a format flag at the beginning of a table might define the structure of the rest of the table.
It seems that once you decide you want to be able to parse all of this, your grammar will turn into a Turing-complete language. Have you considered this? Where do you stop?
Fonts have been notorious for being ill-defined, and this can wreak havoc on applications that do not handle them properly. I would be very interested in seeing some font related bffts surface for the purposes of font validation and analysis.
I have concerns about Turing-completeness, and am interested in maintaining the declarative nature of bffts. The grammar has drifted from that goal, and I am not entirely sure how to bring it back.
I understand your concern for keeping the format declarative, but I'm not entirely sure you can handle TTF files that way, as you'll run into walls very quickly.
For example, to know the number of entries in the `loca` table, you have to parse the `numGlyphs` field from the `maxp` table. The `maxp` table doesn't have to come first...
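For anyone unfamiliar with TrueType, the dependency chain looks roughly like this in Python (table and field offsets per the TrueType spec; a minimal sketch with no error handling): the entry count of `loca` comes from `maxp`, its entry width comes from `head`, and neither table is guaranteed to precede `loca` in the file, so you need the whole table directory first.

```python
import struct

def parse_loca(font: bytes):
    # Offset table: numTables is a uint16 at byte 4; directory entries
    # start at byte 12, 16 bytes each (tag, checksum, offset, length).
    num_tables = struct.unpack_from(">H", font, 4)[0]
    tables = {}
    for i in range(num_tables):
        tag, _, off, length = struct.unpack_from(">4sIII", font, 12 + 16 * i)
        tables[tag] = (off, length)
    # numGlyphs is a uint16 at offset 4 inside 'maxp'.
    num_glyphs = struct.unpack_from(">H", font, tables[b"maxp"][0] + 4)[0]
    # indexToLocFormat (0 = short/16-bit, 1 = long/32-bit) is an int16
    # at offset 50 inside 'head'.
    long_format = struct.unpack_from(">h", font, tables[b"head"][0] + 50)[0]
    fmt, width = (">I", 4) if long_format else (">H", 2)
    base = tables[b"loca"][0]
    entries = [struct.unpack_from(fmt, font, base + i * width)[0]
               for i in range(num_glyphs + 1)]
    # Short-format entries store the real offset divided by two.
    return entries if long_format else [e * 2 for e in entries]
```

Describing that declaratively means a field whose count and width both reference fields in other, arbitrarily-ordered tables.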
That sounds positively diabolical. I'll have to think through this one some more, but off hand I wonder if something can be done with Binspector's slot/signal mechanism to detect when all the necessary pieces are in place. Something to ponder.
I'm not familiar with Docker, but something might have failed in the configure or build phases. What kind of output are you getting from those scripts?
I have given a lot of thought to the question of output. Throughout the parse tree there is an implicit DAG of dependencies - this value affecting that read operation over there, etc. The real trick in making a generic output routine from a bfft is reversing this DAG, so that, e.g., if I add a pixel down here the parse tree can re-stabilize automatically into something valid. This also opens the door to generic editing of binary formats. My understanding is that DAG reversal is an NP-complete problem, but I suspect with file formats we're dealing with a subset of the space and it might not be as difficult as I am imagining.
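A trivial illustration of one edge in that dependency DAG (this is just a toy in Python, not Binspector's API): a length field derived from its payload. Reading goes length-then-payload; writing goes the other way, regenerating the dependent field so the output stays internally consistent no matter how the payload was edited.

```python
import struct

def emit_record(payload: bytes) -> bytes:
    # The length field is recomputed from the payload on every write,
    # so editing the payload "re-stabilizes" the record automatically.
    return struct.pack(">I", len(payload)) + payload

def parse_record(data: bytes) -> bytes:
    # Reading runs the dependency forward: length determines the read.
    (length,) = struct.unpack_from(">I", data, 0)
    return data[4 : 4 + length]
```

Generalizing that to arbitrary grammars, where one edit can ripple through offsets, counts, and checksums across the whole file, is the hard part being described above.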
I would think that for some formats it's trivial whereas for others it's nigh impossible. There's nothing wrong with offering that feature to a subset of grammars though.
similar idea: generating a nicely formatted spec document from the grammar.