Wuffs: Wrangling Untrusted File Formats Safely

Ono-Sendai · on May 18, 2024

Wuffs is great. I use it in Substrata (https://substrata.info/) for loading PNGs. It is both faster and safer than LibPNG. It's something around 2x faster than LibPNG in my tests (depending on the PNG file), see timings here: https://github.com/google/wuffs/issues/13#issuecomment-17325...

So generally Wuffs is great and you should use it to decode your PNGs. There are some downsides: not all of the obscure bit depths and formats that PNG supports are loaded as-is, some are converted to more standard formats.

Also the Wuffs documentation is a bit hard to understand. It's a little bit of a mission getting PNG decoding working. You can see my code for that here though: https://github.com/glaretechnologies/glare-core/blob/2c7174c...

repsilat · on May 18, 2024

The "mango" lib [1] claims to be even faster for PNGs. Actively maintained but doesn't have as much buzz, I think the devs haven't advertised it as much on places like this.

Also, it has the funniest testimonials.

1: https://github.com/t0rakka/mango

yjftsjthsd-h · on May 18, 2024

Speed isn't the only thing that matters; is mango as safe as wuffs in the face of untrusted input?

edflsafoiewq · on May 18, 2024

Where does the extra speed come from?

nicoburns · on May 18, 2024

My understanding is that libpng is unoptimised and 5-10x faster is possible.

pornel · on May 18, 2024

libpng is reasonably fast, and has SIMD optimizations. Make sure to compile it with a modern CPU target.

The biggest bottleneck in PNG decoding is zlib, which is not part of libpng. There are faster inflate implementations, but nowhere near 5x.

The second slowest thing is unfiltering, but it takes only 10-20% of the decoding time, so even a lightspeed implementation would make little difference.

There is a possibility of a 10x difference when encoding, but that's not due to libpng being slow, but because it's possible to apply worse compression and there are dedicated crappy-but-very-fast encoders.
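For readers unfamiliar with the term, "unfiltering" is the per-scanline step that reverses PNG's prediction filters after inflate has run. A minimal Python sketch of two of the five filter types (Sub and Up) follows; `unfilter_row` is an illustrative name, not the API of libpng, Wuffs, or any other library.

```python
def unfilter_row(filter_type, row, prev_row, bpp):
    """Reverse one PNG scanline filter in place and return the row.

    row      -- bytearray of filtered bytes for this scanline
    prev_row -- already-unfiltered previous scanline (all zeros for the first)
    bpp      -- bytes per complete pixel (e.g. 4 for 8-bit RGBA)
    """
    if filter_type == 0:  # None: bytes are stored as-is
        return row
    if filter_type == 1:  # Sub: stored value is the byte minus the byte one pixel to its left
        for i in range(bpp, len(row)):
            row[i] = (row[i] + row[i - bpp]) & 0xFF
        return row
    if filter_type == 2:  # Up: stored value is the byte minus the byte directly above
        for i in range(len(row)):
            row[i] = (row[i] + prev_row[i]) & 0xFF
        return row
    raise NotImplementedError("Average (3) and Paeth (4) are omitted from this sketch")


# Sub filter, one byte per pixel: each output byte depends on the previous output.
restored = unfilter_row(1, bytearray([10, 5, 5, 5]), bytearray(4), 1)
print(list(restored))  # [10, 15, 20, 25]
```

Note that the Sub loop carries a dependency from each output byte to the next one in the same row.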

dang · on May 17, 2024

Wuffs’ PNG image decoder - https://news.ycombinator.com/item?id=26714831 - April 2021 (138 comments)

yjftsjthsd-h · on May 17, 2024

This is one of my favorite attempts at better programming language safety, because it compiles down to C that can then be shipped like normal C, so you don't get the ecosystem friction like with e.g. Rust.

pcwalton · on May 18, 2024

C has a lot of problems as a compilation target as well, from surprising UB (e.g. signed integer overflow) to debugging problems (e.g. #line is woefully inadequate compared to the ability to emit DWARF DIEs) to the inconvenience of setting up a toolchain for end users. To its credit, Wuffs is one of the better projects that compiles to C, because it targets a very restricted domain. But, in general, don't write programming languages that compile to C.

bobajeff · on May 18, 2024

For many of us, making a compile-to-C language is many times more feasible than using something like LLVM. I'm not saying it's great, mind you, but it's probably the best thing available without a runtime.

For debugging, I believe you can generate your own source maps and use gdb as a backend to talk with your custom debugger.

yjftsjthsd-h · on May 18, 2024

Of course C sucks, but since everything under the sun uses it, there's unique value in being able to make it safer without putting a whole new compiler in the process for users. Remember that time the cryptography library in Python decided to add rust? We could have avoided all that pain with wuffs.

vlovich123 · on May 17, 2024

It’s an interesting idea for sure but it isn’t a general purpose language, so the problem domains it can solve are very very different vs what Rust is trying to do.

tialaramex · on May 17, 2024

Nigel has said that emitting "unsafe" Rust is a reasonable thing for a hypothetical WUFFS 1.0 to be able to do as an alternative to C. As with good "unsafe" Rust written by humans WUFFS would know exactly why what it's doing is fine, it's just that the Rust compiler can't necessarily see that, hence the need to label it "unsafe".

Today C makes most sense given the WUFFS language is still in flux.

[Edited to fix a serious typo]

nequo · on May 17, 2024

What would be the primary benefit of emitting Rust rather than C? Both would be considered safe (assuming Wuffs generates correct code), and Rust could access the C code via FFI. Is there something I’m missing?

tialaramex · on May 17, 2024

I expect that the Rust emitted by a hypothetical future WUFFS transpiler would be much easier to just drop into an existing Rust project than some C via a C FFI.

It's common for C libraries that do get wrapped today (e.g. openssl) to have a two phase wrapping, a -sys crate which turns the C into Rust C FFI and then another crate to turn the Rust C FFI into something actually palatable to ordinary people.

vlovich123 · on May 17, 2024

Nominally it can safely elide bounds checks via unsafe, because it has proved they are actually safe within the constraints of Wuffs, which is what it does for C (plus the language is built for easier translation to vectorized code than something like LLVM is able to do for general purpose languages).

So basically higher performance.

FFI nominally has a runtime and compile time cost - whether that matters for you in particular will depend on your needs, but being able to publish a very simple crate without a build.rs to manage can have an attraction.

jcranmer · on May 18, 2024

The C abstract machine is slightly funkier than unsafe Rust (things like C lacking a way to do signed integer overflow without UB or needing to adhere to strict aliasing in C), so I would expect that lowering to unsafe Rust would be slightly more likely to be correct.

pcwalton · on May 18, 2024

One benefit would be that Rust users could use Wuffs code without having to install a C compiler. Pure-Rust solutions are much more convenient in the Cargo ecosystem than wrangling -sys crates.

edflsafoiewq · on May 18, 2024

You could also use c2rust.

IshKebab · on May 18, 2024

Probably not too much from a "final product" point of view, but using a pure Rust library is a whole lot easier than C from a faff point of view. Especially for cross-compilation.

vlovich123 · on May 17, 2024

I’m responding to this:

> that can then be shipped like normal C, so you don't get the ecosystem friction like with ex. Rust.

Emitting Rust doesn’t help with this.

tomjakubowski · on May 18, 2024

it helps in the other direction: less friction to use from rust

fiddlerwoaroof · on May 18, 2024

But more friction to use from just about every other language.

andrepd · on May 18, 2024

What's the difference vs compiling down to machine code and linking it with your program?

pornel · on May 18, 2024

You reuse optimizer and machine code generator of the C compiler, and you're not tied to a single backend like LLVM.

tedunangst · on May 17, 2024

Related, in the sense of solving the same problem in a different manner: https://rlbox.dev/

refibrillator · on May 18, 2024

Can Wuffs provide stronger safety guarantees than techniques like WasmBoxC?

My understanding is that compiling unsafe C to WASM and back would also guarantee safety with respect to buffer overflows, integer arithmetic overflows and null pointer dereferences.

It’s nice not annotating code to explicitly prove invariants to the compiler like you would in say Wuffs or Rust, but I suppose that’s what limits performance.

klabb3 · on May 18, 2024

Doesn’t wasm have a memory model as well? So unless you sandbox certain parts of it you can still in theory have access across different C functions, within the same wasm module?

What seems nice about wuffs is that it has no side effects and a clear project scope. Deserialization is so riddled with severe issues that it does kind of warrant its own DSL. OTOH, some legacy formats will probably never be ported.

azakai · on May 18, 2024

Yes, Wuffs can do better than WasmBoxC because it does more than sandboxing of the code. It also checks things like integer overflows which can lead to exploits that are technically not memory safety issues, but still potentially dangerous.
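To illustrate the kind of bug described here, the following Python sketch emulates 32-bit unsigned arithmetic on image dimensions. The function names and the 4-bytes-per-pixel layout are assumptions made for illustration; this is not Wuffs code, and the runtime check is only a stand-in for the overflow checking the comment attributes to Wuffs.

```python
U32_MAX = 0xFFFFFFFF


def buffer_size_wrapping(width, height, bytes_per_pixel=4):
    """Emulate unchecked uint32 multiplication, which silently wraps."""
    return (width * height * bytes_per_pixel) & U32_MAX


def buffer_size_checked(width, height, bytes_per_pixel=4):
    """Reject dimensions whose byte count does not fit in a uint32."""
    size = width * height * bytes_per_pixel
    if size > U32_MAX:
        raise OverflowError("image dimensions overflow a 32-bit size")
    return size


# Dimensions taken from a hostile header: 65536 x 16384 pixels.
# The true size is 2**32 bytes, which wraps to 0 in 32-bit arithmetic.
wrapped = buffer_size_wrapping(65536, 16384)
print(wrapped)  # 0
```

A decoder that allocated `wrapped` bytes and then wrote the full image would overrun its buffer. Inside a sandbox that overrun stays contained, but the decoder's own state can still be corrupted.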

But the tradeoff is that you need to rewrite your code for Wuffs, while WasmBoxC can sandbox anything that compiles to wasm and prevent it from corrupting the outside, including existing code in C, C++, Zig, unsafe Rust, etc. etc.

CJefferson · on May 18, 2024

Technically, while WASM promises you put data in and get data out, you can still have memory corruption (as it has a flat memory), so I could make a (for example) gif with some color palette, then later overflow and rewrite the palette.

Not fatal, but perhaps annoying.
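The palette scenario can be sketched with a `bytearray` standing in for a sandbox's flat linear memory. The layout, offsets, and `write_pixels` helper below are hypothetical and chosen only to make the point; no real WASM runtime is involved.

```python
# Hypothetical layout of a sandbox's flat linear memory:
# bytes 0..7 hold a pixel buffer, bytes 8..13 hold a 2-entry RGB palette.
PIXELS_OFFSET, PIXELS_LEN = 0, 8
PALETTE_OFFSET, PALETTE_LEN = 8, 6

memory = bytearray(PIXELS_LEN + PALETTE_LEN)
memory[PALETTE_OFFSET:PALETTE_OFFSET + PALETTE_LEN] = bytes([255, 0, 0, 0, 255, 0])


def write_pixels(data):
    """Buggy writer: trusts len(data) instead of checking it against PIXELS_LEN."""
    for i, b in enumerate(data):
        index = PIXELS_OFFSET + i
        if index >= len(memory):
            # The sandbox boundary: accesses outside linear memory trap.
            raise MemoryError("out-of-bounds access trapped by sandbox")
        memory[index] = b


# 10 bytes into an 8-byte buffer: this stays inside the sandbox, so nothing traps,
# but the first two palette bytes are silently overwritten.
write_pixels(bytes([7] * 10))
palette = bytes(memory[PALETTE_OFFSET:PALETTE_OFFSET + PALETTE_LEN])
print(list(palette))  # [7, 7, 0, 0, 255, 0]
```

Only a write past the end of `memory` is caught in this model; the overflow from the pixel buffer into the palette goes unnoticed.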

who-shot-jr · on May 18, 2024

Could you use this to make sure users uploading files to your website are correct (i.e only jpegs and valid image data)? But in a fast and safe way, or is this overkill?

dividuum · on May 18, 2024

Not sure that’s possible. I’m pretty sure it is not safe to assume „parses in wuffs“ -> „is safe in any other decoder“. I’m using wuffs to check user upload (see my recent response in another thread) but I still generate out linear RGBA and work with that. I still consider the original JPEG data hostile.

pornel · on May 18, 2024

Yes, you could. But be careful to make sure that there's no more data left after the decoder finishes, because it's possible to append a ZIP file (or aCropalypse) at the end of any other valid image file data, and decoders usually stop at the end of the image and don't parse past its end, so won't complain about extra data.
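As a rough illustration of the trailing-data check for PNG specifically, here is a Python sketch that walks the chunk structure and reports how many bytes follow the IEND chunk. It uses only the standard library; `png_trailing_bytes` and `chunk` are illustrative helper names, and the check validates container structure and CRCs only, so it is not a substitute for running a real decoder such as Wuffs.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_trailing_bytes(data):
    """Return the number of bytes after the IEND chunk, walking chunk headers only."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG")
    pos = len(PNG_SIGNATURE)
    while True:
        if pos + 8 > len(data):
            raise ValueError("truncated chunk header")
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4  # header + payload + CRC
        if end > len(data):
            raise ValueError("truncated chunk")
        (crc,) = struct.unpack(">I", data[end - 4:end])
        # The CRC covers the chunk type and payload, not the length field.
        if zlib.crc32(data[pos + 4:end - 4]) != crc:
            raise ValueError("bad chunk CRC")
        pos = end
        if ctype == b"IEND":
            return len(data) - pos


def chunk(ctype, payload=b""):
    """Build one PNG chunk (used here only to construct test input)."""
    body = ctype + payload
    return struct.pack(">I", len(payload)) + body + struct.pack(">I", zlib.crc32(body))


# A structurally minimal stream (signature + IEND), not a decodable image.
clean = PNG_SIGNATURE + chunk(b"IEND")
polyglot = clean + b"PK\x03\x04"  # ZIP local-file-header magic appended

print(png_trailing_bytes(clean))     # 0
print(png_trailing_bytes(polyglot))  # 4
```

An upload handler could reject any file for which this returns a non-zero count, in addition to decoding it.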

yoochan · on May 19, 2024

Cool project, I'm eagerly waiting for a jpegxl decoder to be written in wuffs so that Google won't have any reason not to support the format...

I would be glad to have a try but the language specification and documentation is almost non existent.

zelphirkalt · on May 19, 2024

I have a question about this. Why is wuffs considered to be safe? In this thread I saw a code example from cpp using wuffs, which seems to be a C library accessed from cpp in that example. Why should I trust that C library to be safe or safer than other libs?

newman314 · on May 17, 2024

Does anyone know of a tool that can do this for PDFs instead?

Joel_Mckay · on May 17, 2024

There are PDF readers that do not support the scripting format extensions.

Note this does not prevent unscrupulous companies abusing dominant market positions to voluntarily embed machine and serial hash watermarks.

To be clear: formats like pdf, ps, webp, svg, and tiff are so badly implemented in some ecosystems... they can't _ever_ be assumed safe input formats. Thus, at some point people need to spin up an actual VM to transcode a "web" version, and scrub each stage of the rendering pipeline like a virus or header injection is already present.

"I never play where nice things are, and don't break things" (Eliza Mowry Blven, The Humanitarian Review, Volume 3, March, 1905)

Cheers =3

tialaramex · on May 17, 2024

I worked with TIFF pretty extensively, it's a mess but I don't see why a WUFFS TIFF codec can't be fine. What makes you say you need "an actual VM to transcode" a TIFF ?

Joel_Mckay · on May 17, 2024

The complex formats of tiff and tga specifications make it nearly impossible to span all the edge-cases with unit-tests. A VM can be in a known-state snapshot, process pre/post signature logged/compared with a scripted debugger, and binary input/output stripped of non-compliant metadata/blobs at each stage of the pipeline if the process behaves as expected.

I've yet to find a better method than Honeypots to sustainably mitigate the complex leaky dependency mess on traditional architectures. It has been my experience that "all software is terrible, but some of it is useful".

It may just be my bias, but I see code smell getting worse in recent decades...

Have a nice day, =3

https://www.youtube.com/watch?v=aCbfMkh940Q

tialaramex · on May 17, 2024

So, there's actually no particular reason and if somebody cares to write one then yup, TIFF codec in WUFFS would in fact be safer and faster than your uh, approach.

Joel_Mckay · on May 17, 2024

One does not rely on the persistent competence of the coders, and will tell you when something has gone wrong.

And walking a binary object store to ban problem users is not always necessary... depending what you are doing.

Most other approaches make the same predictable assumptions:

https://en.wikipedia.org/wiki/List_of_cognitive_biases

Despite popular belief, shitty design does not usually get better in another language. Rather, people just feel more confident it isn't shit anymore.

I have yet to see evidence to the contrary. =3

tialaramex · on May 18, 2024

Wait, you believe that somehow one of these approaches doesn't rely on competence from programmers? How do you figure?

Have you been imagining that sandboxes are some sort of fairy dust we just stumbled onto one day, supernatural in nature and not, in fact, just software written by people you're hoping are competent and haven't left any holes?

Joel_Mckay · on May 18, 2024

The point was... one is testing parser/OS integrity via a debugging interface over an expectation of an unchanging emulated environment state... there is nothing particularly special about the approach. Even Qubes OS and RancherVM are not perfect in this regard, friend.

Or put another way, the available attack surface of a bare-minimum fixed environment is much easier to auto-audit, than a pile of daily permuted binaries and self-delusion approach. i.e. if it fails to behave in an expected way, or is modified in any way... the host audit process doesn't have to care why or how it is broken to maintain a service queue as the guest is culled.

Perhaps I am wrong about exchanging 15% of raw performance for reliability, but things can get complicated with licenses and multiple OS specific platforms.

You seem to be getting emotional about this subject, presenting secondary and tertiary straw-man arguments. So I'm going to go eat some Cheese Goldfish crackers... and just agree that your beliefs are interesting.

Have a fantastic weekend... =3

tialaramex · on May 18, 2024

There's nothing special about it, but it doesn't work especially well. This is the strategy that's blown up on Apple twice in recent years and will keep burning them.

If you're Matt Godbolt the benefits of sandboxing outweigh the cost because Matt is interested in general purpose software. But WUFFS isn't for that, as its name says it's interested in doing one particular task well.

In this deliberately limited domain, WUFFS gets to sidestep Rice's theorem altogether and just prove the software meets the semantic requirements [technically you do the proving, WUFFS just checks your work].

I hope you enjoyed your goldfish crackers but I urge you to use the right tool for the job.

Joel_Mckay · on May 18, 2024

"the right tool for the job" is sometimes admitting the breadth of underlying dependencies and ambiguous format specifications are infeasible to fix with your team's time budget.

The design in question currently only processes around 1.8M large image files a day, and does not require additional work/re-implementations to support the dozens of questionable user file-formats. i.e. the plain old ImageMagick lib does most of the heavy lifting at the end.

Would I trust such a solution for something like a native client side web-browser etc... absolutely not... but for the core-bound instance overhead, the resource cost was acceptable for almost a decade of uptime on those system instances.

Use-cases are funny like that, as there is no perfect solution... but rather a tradeoff of what features get the system functional and reliable. Part of that is admitting integration of 3rd party dependencies is a long-term liability, and domain specific languages almost always fade into obscurity.

Cheers, =3

immibis · on May 18, 2024

WUFFS is provably safe - that's the whole schtick. If a WUFFS kernel exists, you can assume it is safe. If it's not proven safe, it doesn't compile. The reason everyone doesn't program in WUFFS is that you have to write a proof that your kernel is safe, which takes a very very very long time.

indolering · on May 18, 2024

What's the formal verification story for WUFFS?

tialaramex · on May 18, 2024

For WUFFS the language, or for WUFFS the library, or for the WUFFS tooling today?

The clever idea is to have you the programmer in effect write a proof that your code has the desired semantic properties as part of the programming activity and so then the WUFFS transpiler is merely checking that the proof is correct.

This leverages your understanding of what you were trying to do.

immibis · on May 18, 2024

Apparently Wuffs only proves safety. Verifying the code does what it's supposed to do is done with unit tests.

Joel_Mckay · on May 18, 2024

If you point out some of the above has run-state in some situations... it is provably nondeterministic... and thus the assertion of correctness is utter nonsense.

Hardly a panacea for fundamentally bad designs that go back decades.

Ever seen a web-server written in postscript? It's worth a look just for the laughs.

Good luck out there =)

lupire · on May 18, 2024

WUFFS is provably safe, or WUFFS programs are provably safe, using WUFFS as an axiom?

trustno2 · on May 18, 2024

pdfs are really really hard. the only viewer that parses them semi-correctly is ... Acrobat Reader.

try to ever read any code for PDFs and see all the horrors.

Google gave up and just bought the code from foxit.

kjksf · on May 18, 2024

Google was never trying to write a PDF reader from scratch so they never "gave up".

They just bought foxit code to save years of development when they wanted to ship a PDF reader in Chrome.

Your comment about "the only viewer that is semi-correct" is also wildly off the mark.

Parsing correctly written PDF files is hard but multiple engines can do it correctly.

Parsing real life PDFs is much harder than correctly implementing the PDF spec because lots of PDFs are just broken. Their generators create invalid PDF files and then PDF readers have to spend heroic efforts to somehow make sense of this brokenness. Adobe does it better than most because... well it would be embarrassing if they didn't. They invented the format, they make money from their tools, they were doing it the longest, they have the largest archive of broken PDFs for testing etc. It's hard to expect that e.g. an open-source project with one or two developers can match that.

I work on SumatraPDF so I know.

trustno2 · on May 20, 2024

OK. In one of my previous jobs, I needed to auto-fill PDF forms on BE (among other things with PDFs), the only thing that worked reliably across PDFs was Acrobat. I did not try SumatraPDF.

edit: it seems Sumatra doesn't support PDF forms? Either AcroForms or XFA forms?

warkdarrior · on May 17, 2024

As soon as someone writes a Javascript interpreter in Wuffs..

jchw · on May 17, 2024

Do you ever need a JS interpreter to parse a PDF? That's horrifying.

I understand PDF has a bunch of limbs, but I always assumed the JS stuff was at least separate from the parsing. (I am familiar with the PDF format at a lower level but I never touched any of the weird features.)

timschmidt · on May 17, 2024

I wrote an SVG that's all javascript, no elements. All the graphics are generated dynamically at runtime by the javascript. It's SVG standards compliant, but only opens correctly in browsers, not in inkscape or other desktop publishing apps.

I work a lot in OpenSCAD, and had a need to design some custom graph paper. So I found the subset of SVG which was similar to OpenSCAD. :)

jszymborski · on May 18, 2024

Frankly, I wouldn't begrudge a website for not correctly parsing an svg I composed entirely with javascript.

It's annoying you can't just "flatten" or "bake" such an svg like yours into one composed entirely of elements (unless one exists?)

csande17 · on May 18, 2024

Often you can open the SVG in a browser and then use the developer tools to copy out the resulting nodes as "flat" SVG source code.

Chrome even includes a --dump-dom flag you can use to do this on the command line, although I haven't tested it with an SVG.

jszymborski · on May 18, 2024

Clever!

ThePowerOfFuet · on May 17, 2024

https://dangerzone.rocks/

jay-barronville · on May 18, 2024

Wuffs is cool, but you can get similar results writing normal C library code, compiling it into a .wasm binary via Clang, and then running the .wasm binary through the `wasm2c` tool of the WebAssembly Binary Toolkit [0]. I personally prefer this method, although Wuffs will usually produce faster code.

[0]: https://github.com/WebAssembly/wabt/tree/44837a7236e85c048de...

krick · on May 18, 2024

It is not obvious to me why this should guarantee safety.

jay-barronville · on May 18, 2024

`wasm2c` fully implements the WebAssembly sandbox execution environment [0][1] and has the passing tests to prove it. To be a bit more specific, the .wasm binary you generate initially already has the WebAssembly semantics baked in (obviously) and `wasm2c` creates a portable C translation of the WebAssembly while also ensuring that the execution environment is sandboxed (e.g., the code traps when attempting out-of-bounds memory accesses).

[0]: https://webassembly.org

[1]: https://github.com/WebAssembly/wabt/issues/2289#issuecomment...

eviks · on May 18, 2024

How much faster (say, for something like an image codec)

jay-barronville · on May 18, 2024

This might not be what you want to hear (and I might get downvoted for it), but it’s what I consider the best answer: Implement something minimal but useful (and realistic) using both methods and benchmark them yourself.

Even if I told you some of the numbers I’ve seen in my experiments and usage, it wouldn’t be wise to trust them or let them taint your opinion.

YoshiRulz · on May 18, 2024

Superior in every sense to that Magicka garbage they released a couple months ago. I'm excited to see its via-Rust codegen.

Alifatisk · on May 18, 2024

while true { ... } endwhile

Please, the end brackets should be enough.