That's really neat, Jason. With all the 'assembly is dead' stuff that gets bandied around, it's nice to see people keeping the art alive and collaborating at this level.
Assembly has never been dead in the realm of multimedia processing, or for that matter, anywhere that you have to do an enormous amount of SIMD-able number crunching.
Find a video (en|de)coder without SIMD and you've found an unusable piece of software.
There's a reason that Intel focuses so much on SIMD instructions and arithmetic units with each new chip release: without SIMD or dedicated hardware, real-time 1080p decoding would be out of reach on almost any modern machine.
Not a tradeoff; that's just bad input checking. Most of ffmpeg's decoders have heavy checks in them--crashes are usually bugs (missing checks). Yes, the checks have a small speed cost, but it's at most 1-2%, probably less.
This is good stuff. It will make writing assembly on Intel at least slightly less painful.
I am still amazed we don't have an x86 assembler that comes anywhere near what DSP people use. I mean, we still have to track register usage manually and reorder instructions by hand to improve performance! The x86 world could learn a lot from the DSP world here (Texas Instruments' tools are a good example).
Seriously, you're downmodding this? Write some C sometime, and watch the compiler output perfectly-optimized assembly for your architecture. You write a high-level solution to the problem, the compiler makes it work efficiently.
If it doesn't, it's a compiler bug, and should be fixed at that level.
I write C signal-processing code on x86. Sure, it's usually sufficient for my needs, but "perfectly-optimized assembly"? Not for those tight loops where you really want it. It's decent, and if you hold the compiler's hand it will vectorize somewhat, but the generated code is still heavy.
It is obvious you don't know what you're talking about and have never seen tightly optimized signal-processing code (as in, say, H.264 weighted prediction or interpolation).