In the video you mentioned that you worked on a bit of RTL for the codec. I am curious about how the codec can be accelerated. You didn't talk much about it, so I am interested in what you have learned. Do you have any info on where most of the CPU time is spent in the codec? What did you try out?
My RTL accelerated the transform stage of the codec. Because Daala has no pixel domain dependencies, the frequency domain coefficients can just be offloaded to hardware, which returns the pixel domain result. There is more information on timing on my website [1].
The transforms no longer take as much CPU time as they used to, thanks to better SIMD-accelerated versions. Much of the time is now spent in the PVQ decoder, which is not optimized for speed at the moment.
One thing I learned after writing large fixed-function transform hardware is that it doesn't take much extra hardware to turn it into a microcoded programmable pipeline, much like a GPU. In fact, this approach is very common inside hardware video decoders, though the firmware is not exposed to the user. One notable exception is Broadcom's VideoCore IV, which is quite an interesting architecture [2].
With the latest mobile processors having 8 or more ARM cores, we can also exploit CPU parallelism in much the same way. I feel it is very important to have the codec perform well on CPU alone, as not everyone will have hardware that can decode it right away. This is something I would like to play with a lot more.
Cool stuff. Thanks a bunch. I agree with you that software optimization takes precedence, but I have also seen optimized crypto code in assembly that was completely undocumented and unreadable. I hope that Daala won't fall into that trap.
All functions in Daala have a C version. The corresponding assembly versions come with tests that make sure their output matches the C reference.
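To illustrate the idea of checking an optimized routine against a C reference, here is a minimal sketch. It is not Daala's code or its test harness: the 4-point Hadamard-style transform and the names hadamard4_ref, hadamard4_fast, and count_mismatches are all made up for illustration, and the "fast" version is plain C standing in for what would be an assembly or SIMD routine.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference version (hypothetical): a 4-point Hadamard-style transform
   written as a plain matrix multiply, for clarity rather than speed. */
static void hadamard4_ref(const int32_t in[4], int32_t out[4]) {
  static const int sign[4][4] = {
    {1,  1,  1,  1},
    {1, -1,  1, -1},
    {1,  1, -1, -1},
    {1, -1, -1,  1}
  };
  for (int i = 0; i < 4; i++) {
    int32_t acc = 0;
    for (int j = 0; j < 4; j++) acc += sign[i][j] * in[j];
    out[i] = acc;
  }
}

/* "Optimized" version (hypothetical): two butterfly stages. In a real
   codec this slot would be filled by an assembly or SIMD implementation. */
static void hadamard4_fast(const int32_t in[4], int32_t out[4]) {
  int32_t a = in[0] + in[1];
  int32_t b = in[0] - in[1];
  int32_t c = in[2] + in[3];
  int32_t d = in[2] - in[3];
  out[0] = a + c;
  out[1] = b + d;
  out[2] = a - c;
  out[3] = b - d;
}

/* Feed the same pseudo-random inputs to both versions and count the
   trials in which any output element differs. */
static int count_mismatches(unsigned seed, int trials) {
  int mismatches = 0;
  assert(trials >= 0);
  srand(seed);
  for (int t = 0; t < trials; t++) {
    int32_t in[4], want[4], got[4];
    /* Small inputs in [-1000, 1000] so the sums cannot overflow. */
    for (int j = 0; j < 4; j++) in[j] = (int32_t)(rand() % 2001) - 1000;
    hadamard4_ref(in, want);
    hadamard4_fast(in, got);
    for (int j = 0; j < 4; j++) {
      if (want[j] != got[j]) {
        mismatches++;
        break;
      }
    }
  }
  return mismatches;
}
```

A test driver would call something like count_mismatches(1u, 1000) and fail if the result is nonzero. The point of the pattern is that the readable C version stays the source of truth, so the optimized version can be as ugly as it needs to be without becoming the only documentation of what the function does.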
arewesmallyet.com seems to have expired and been grabbed by domain squatters. And (probably related) the Firefox installer is now much larger than the 5 MB it was then :)