Show HN: Binspector – A Binary Format Analysis Tool (binspector.github.io)
45 points by MontagFTB on Oct 16, 2014 | 35 comments



I tried several times to build such a thing but gave up every time. It is easy for simple file formats but if you want something universal it gets quite complex pretty quickly. There are file formats where the endianness is specified in the file and not statically known, there are file formats where the bit width of numbers is not statically known, there are (binary) strings delimited with some special marker and there are many different ways to escape the delimiter within the string, there are data blocks starting at fixed offsets, at offsets relative to some other block or the end of the file, there are offsets computed from different values found in the file, compressed blocks, unions of different block types discriminated by a tag, tags that need to be computed from multiple values, ...

I would really love such a tool, but a language to describe most of the file formats in the wild is probably going to be (close to) Turing complete and is unfortunately nothing you can hack together over a weekend.


Let me start out by saying I do not claim the current binspector grammar can cover all the cases you've described. Yes, at some point binary formats can get so unwieldy the only way to read them is with some kind of Turing-complete system.

What I have discovered, however, is that there is still a vast array of binary formats that can be well described with the format language as it stands. I have tried to devise workarounds to skip past the parts of formats that cannot be handled well. I hope to double back on those limitations and extend the feature set of the language, but I hope what's in there is enough to get people started.

Of the issues you mentioned, Binspector's language can handle dynamic field endianness, dynamic field size, terminators and delimiters, absolute/relative/dynamic offsets, tag-discriminated unions, and lambda calculations. It doesn't do everything, but it does allow a lot.
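
To make a few of these cases concrete, here is a minimal sketch in plain Python (not Binspector's grammar) of a reader for a made-up record format. The layout is an assumption for illustration only: a two-byte byte-order mark borrowed from TIFF's "II"/"MM" convention, a one-byte tag, then a payload whose type depends on the tag. The name `parse_record` is hypothetical.

```python
import struct

def parse_record(data: bytes):
    """Parse one record of a made-up format (hypothetical, for illustration)."""
    # Bytes 0-1: byte-order mark, so endianness is only known at read time.
    mark = data[0:2]
    if mark == b"II":
        order = "<"   # little-endian
    elif mark == b"MM":
        order = ">"   # big-endian
    else:
        raise ValueError("unknown byte-order mark: %r" % mark)

    # Byte 2: tag that discriminates the union that follows.
    tag = data[2]
    body = data[3:]
    if tag == 1:      # 16-bit unsigned integer
        (value,) = struct.unpack_from(order + "H", body)
    elif tag == 2:    # 32-bit unsigned integer
        (value,) = struct.unpack_from(order + "I", body)
    elif tag == 3:    # NUL-terminated byte string
        end = body.index(b"\x00")
        value = body[:end]
    else:
        raise ValueError("unknown tag: %d" % tag)
    return tag, value
```

Even this toy needs a branch per feature, which hints at why a general description language grows quickly.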


You already made it far beyond what I ever achieved. I aimed at an XML-based file format description and that gets messy pretty quickly once you have to express expressions. After the initial friction of having to write a parser, going with a custom language seems way more promising. I always wanted to peek under the hood of Wireshark and see how they do it - do they have specialized code for every protocol or do they use an abstract protocol description, too? I never did it. I hope your project matures, there is a lot of space in this niche.


I wrote a tool I called the "data file disassembler" somewhere around 15 years ago that is basically this, along w/ a user interface that allowed you to drag a selection area and define a "section" and its type.

I used this primarily for reverse-engineering proprietary formats, both for personal entertainment and, while working for a legal firm, to determine when/if patents were being violated.

I had a variety of supported formats, including several microcontroller instruction sets, as I spent a lot of time disassembling ROMs. These had full support for labels etc.

This seems very similar.


On Mac there's a GUI tool, ["Synalyze It!"][1], with which one can apply a grammar to binary files. But it's not free anymore.

[1]: https://www.synalysis.net


There was another tool called General Edit (http://www.quadrivio.com/ge.html) which used to be available for the Mac but is now defunct. It was one of the inspirations behind Binspector and sounds very similar to the tool you mentioned.

edit: grammar


The "Synalyze It!" grammar format is XML, and the developer has some sample grammars for download:

https://www.synalysis.net/formats.xml

Having wanted a binary grammar format for a while, and just purchased the Pro version of "Synalyze It!" as a starting point, it would be interesting to hear a short non-specialist's summary of the relative merits of your approach to expressing grammars compared with an XML based approach like "Synalyze It!" ...

Overall, "Synalyze It!" seems mostly fairly stable and certainly a useful tool well worth the purchase price for the Pro version. However, it is closed source, by a small developer, on a single platform: would much prefer to rely on something with source code available ...


What if you compared two descriptions side by side? What about PNG, for which a grammar exists for both:

https://www.synalysis.net/Grammars/png.grammar

v.

https://raw.githubusercontent.com/binspector/binspector/mast...

Anything I'd say at this point would be biased, but I would be interested in continuing the conversation.


A comparison w.r.t. the PNG grammar sounds like an excellent starting point.

I could see educational value (and more) in attempting a bi-directional translator between the two grammar formats.


Or, maybe a better question is what grammars are you currently using Synalyze It! for? I have used Binspector for fairly complex file formats and it has held up for my purposes in the past. Would it be worth an hour or two to see if Binspector could pass muster?


The grammars are for various 2D & 3D CAD model formats. Some can be fairly complicated, and the data files can be fairly large. For clarity, I'm bootstrapping something and don't have any immediate plans to open source the grammars.

Certainly, it would be worth looking at Binspector. Probably it would take me a couple of days to get up to speed though, and that is not time that I have immediately.


The dev is on github, https://github.com/synalysis - send him a pull request?


Thanks, I did not know he was on github.

But does he have a repository for "Synalyze It!"?

I cannot see one. It looks like he has open sourced some supporting frameworks ...

edit: grammar


Would he still get paid if he gave his code away?


It is also pretty inexpensive, the Pro version is $30. Uses Python and Lua for extension scripts.

I am a fan of https://pypi.python.org/pypi/hachoir-parser


Link to the repo for the lazy: https://github.com/binspector/binspector

Interesting project. Is there support for importing/exporting from a C header? Or generating file import code given a bfft file?


There isn't any support for importing from a C header. In order to get it right you would need a decent C parser as well as details about the compile-time environment (e.g., whether char is signed or unsigned). At this point it's probably faster to code up the format grammar manually.

Generating import code given a bfft is a thought I had considered. One of the features I would like to implement is to separate the parse tree generator from the analyzer and expose an analyzer API. Applications would be able to use the Binspector core as a library, then, and read file contents directly by providing their own bffts and hooking the API to populate their own internals.

edit: spelling.


Very good point. Perhaps you could use libclang?

Regarding the import code, even if you start with just emitting a C header, it would make it easier to use your tool to design file formats.


Good idea leveraging Clang for struct round-tripping. Yet another thing to look into!


Looks really cool! I'd love to see an open grammar definition format for all kinds of tools, both command-line and GUI.

I've worked on parsing TrueType files, and they have some really "interesting" grammars. For example, they have lookup tables that define offsets at the beginning of the file. Also, a format flag at the beginning of a table might define the structure of the rest of the table.

It seems that once you decide you want to be able to parse all of this, your grammar will turn into a Turing-complete language. Have you considered this? Where do you stop?


Fonts have been notorious for being ill-defined, and this can wreak havoc on applications that do not handle them properly. I would be very interested in seeing some font related bffts surface for the purposes of font validation and analysis.

I have concerns about Turing-completeness, and am interested in maintaining the declarative nature of bffts. The grammar has drifted from that goal, and I am not entirely sure how to bring it back.


I have an interest in helping out. I've created opentype.js, an OpenType parser and generator (https://github.com/nodebox/opentype.js).

I understand your concern for keeping the format declarative, but I'm not entirely sure you can handle TTF files that way, as you'll run into walls very quickly.

For example, to know the number of entries in the `loca` table, you have to parse the `numGlyphs` field from the `maxp` table. The `maxp` table doesn't have to come first...
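
A rough sketch of that dependency in Python, assuming the standard sfnt layout (a 12-byte offset table followed by 16-byte table records, all big-endian). The function names are made up for illustration and are not taken from opentype.js. The point is that the table directory has to be read first so `maxp` can be located wherever it sits in the file:

```python
import struct

def read_table_directory(font: bytes) -> dict:
    """Map each table tag to its (offset, length) within the font data."""
    # Offset table: sfntVersion (4 bytes), numTables, then three
    # uint16 search hints that are skipped here. 12 bytes in total.
    _version, num_tables = struct.unpack_from(">4sH", font, 0)
    tables = {}
    for i in range(num_tables):
        # Each 16-byte record: tag, checksum, offset, length.
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font, 12 + 16 * i)
        tables[tag] = (offset, length)
    return tables

def loca_entry_count(font: bytes) -> int:
    """Number of entries in `loca`, which is numGlyphs + 1."""
    tables = read_table_directory(font)
    maxp_offset, _length = tables[b"maxp"]
    # maxp starts with a 4-byte version, then numGlyphs as a uint16.
    (num_glyphs,) = struct.unpack_from(">H", font, maxp_offset + 4)
    return num_glyphs + 1
```

The lookup goes through the directory rather than the file order, so it works whether `maxp` is stored before or after `loca`. A purely sequential, declarative description has no obvious equivalent of that jump.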


That sounds positively diabolical. I'll have to think through this one some more, but off hand I wonder if something can be done with Binspector's slot/signal mechanism to detect when all the necessary pieces are in place. Something to ponder.


Attempting to dockerize this...

Build script here:

https://github.com/ianmiell/shutit/blob/master/library/binsp...

Get this error:

./smoke_test.sh: line 8: ./bin/debug/binspector: No such file or directory

anyone know why?


I'm not familiar with Docker, but something might have failed in the configure or build phases. What kind of output are you getting from those scripts?


Thanks - I'll pastebin output later when I'm off the tube.


I think it's that ./b2 was not run.


That's strange - the build.sh script should run b2 as long as the $BUILDMODE is either not set or is set to 'bjam'.

I write rudimentary bash. If there's a problem in build.sh, debugging the script should be straightforward.



According to your pastebin, cstddef cannot be found. This is a standard C++ header. An environment issue, perhaps?

http://en.cppreference.com/w/cpp/header/cstddef


OK, done:

docker pull imiell/binspector

https://registry.hub.docker.com/u/imiell/binspector/

Bit of a hack required to get it compiled; there's probably a better solution:

https://github.com/ianmiell/shutit/blob/master/library/binsp...


Yup, tried various manual -I includes, CFLAGS, etc. I'll figure it out.


Looks cool, but in my view a key feature is: can it save a file back, using the same grammar?


(This response is a repost from here: http://binspector.github.io/blog/2014/10/13/binspector-a-bin...)

I have given a lot of thought to the question of output. Throughout the parse tree there is an implicit DAG of dependencies - this value affecting that read operation over there, etc. The real trick in making a generic output routine from a bfft is reversing this DAG, so e.g., if I add a pixel down here the parse tree can re-stabilize automatically into something valid. This also opens the door to generic editing of binary formats. My understanding is that DAG reversal is an NP-complete problem, but I suspect with file formats we're dealing with a subset of the space and it might not be as difficult as I am imagining.
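
As a sketch of the easy end of that spectrum (not how Binspector works, and not a solution to the general reversal problem): if the dependencies between fields are written down explicitly, re-stabilizing after an edit is just recomputing the derived fields in topological order. The field names and formulas below are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical fields of a toy format; each derived field lists the
# fields it is computed from.
DEPENDS_ON = {
    "payload_len": {"payload"},
    "checksum": {"payload_len", "payload"},
}

# How to recompute each derived field from the current field values.
RECOMPUTE = {
    "payload_len": lambda f: len(f["payload"]),
    "checksum": lambda f: (f["payload_len"] + sum(f["payload"])) % 256,
}

def restabilize(fields: dict) -> dict:
    """Recompute every derived field after an edit, dependencies first."""
    fields = dict(fields)  # leave the caller's dict untouched
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        if name in RECOMPUTE:
            fields[name] = RECOMPUTE[name](fields)
    return fields
```

The hard part described above is that a grammar only states the read direction; here the write-direction formulas in `RECOMPUTE` were supplied by hand, which is exactly the information a generic tool would have to infer.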


I would think that for some formats it's trivial whereas for others it's nigh impossible. There's nothing wrong with offering that feature to a subset of grammars though.

Similar idea: generating a nicely formatted spec document from the grammar.



