I have posted this before, but if you care at all about memory and how it applies to writing efficient code, you owe it to yourself to check out this fantastic paper: http://people.redhat.com/drepper/cpumemory.pdf
When working in a higher-level language, it's easy to forget that a lot of the performance disadvantage comes from using data types that are far removed from the hardware: ones that require unboxing, RTTI, and so on. Not only do those conveniences eat up CPU, they also eat up memory, which compounds the problem by forcing the data to live outside the cache, regardless of what optimizations the language implementation applies.
Hence, I really appreciate it when a language offers some way to drop down to byte-level constructs and build space-efficient implementations where they're necessary. Improvements in memory compactness can often save you the effort of dropping all the way down to C.
Indeed. I had a recent situation where I was building a trie for prefix search, and found that the object overhead / pointer cost was so expensive (a whole 8 bytes per instance on 32-bit, which really adds up, and it's even worse on x64) that I dropped down to using parallel integer arrays to store all the data. Even then, I was able to squeeze out more space by switching to a custom bit-packed array class.
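In case the idea is unfamiliar, here's a minimal sketch in C of what a bit-packed array can look like. This isn't my actual class (the names and layout are made up for illustration), just the core technique: fixed-width values stored back-to-back in a flat buffer instead of one machine word, or one boxed object, per element.

    /* Illustrative sketch only: a fixed-width bit-packed array.
     * Values of `width` bits are stored back-to-back in 64-bit words,
     * so e.g. a million 20-bit node indices take ~2.5 MB instead of
     * 8 MB as int64s (or far more as one heap object per node). */
    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    typedef struct {
        uint64_t *words;  /* flat backing storage */
        unsigned  width;  /* bits per element, 1..63 here for simplicity */
        size_t    count;  /* number of elements */
    } packed_array;

    static packed_array pa_new(size_t count, unsigned width) {
        packed_array pa = { calloc((count * width + 63) / 64, sizeof(uint64_t)),
                            width, count };
        return pa;
    }

    static void pa_set(packed_array *pa, size_t i, uint64_t v) {
        size_t bit = i * pa->width, w = bit / 64, off = bit % 64;
        uint64_t mask = (1ULL << pa->width) - 1;
        v &= mask;
        pa->words[w] = (pa->words[w] & ~(mask << off)) | (v << off);
        if (off + pa->width > 64)  /* value straddles two words */
            pa->words[w + 1] = (pa->words[w + 1] & ~(mask >> (64 - off)))
                             | (v >> (64 - off));
    }

    static uint64_t pa_get(const packed_array *pa, size_t i) {
        size_t bit = i * pa->width, w = bit / 64, off = bit % 64;
        uint64_t mask = (1ULL << pa->width) - 1;
        uint64_t v = pa->words[w] >> off;
        if (off + pa->width > 64)
            v |= pa->words[w + 1] << (64 - off);
        return v & mask;
    }

    int main(void) {
        packed_array pa = pa_new(1000000, 20);  /* 1M entries, 20 bits each */
        pa_set(&pa, 42, 123456);
        printf("%llu\n", (unsigned long long)pa_get(&pa, 42));
        free(pa.words);
        return 0;
    }

The win isn't just raw byte count: because everything sits contiguously in a few flat arrays, far more of the structure fits in cache, which was the whole point.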
While I found this an interesting read, I believe someone who works with H.264 encoding with any regularity can easily justify using the relatively cheap hardware encoders available to consumers. I'm not sure if x264 is fully compatible with such a device, though. This is a lot more feasible than waiting for Intel and AMD to significantly change their architecture for what sounds like an edge case.
Hardware encoders are not feasible for ordinary consumers at all; if you want an encoder that produces reasonable compression, you have to go up to the $10k-$50k range (and even there, most of the ones on the market are really not very good!). Low-end hardware encoders are both often slow (usually outperformed by x264 on a cheap quad-core CPU) and extremely bad at compression.
Plus, a CPU can be used for things other than encoding video, while a hardware device is of course useless for anything else, so it's easier to justify spending money on a fast CPU than on a task-specific piece of hardware.
Also, it isn't really an edge case; it will occur in any application which has a working set that is unavoidably larger than the L1 cache. Video encoders are just one of many cases where this occurs.
(Note: added that last point to the blog post after I posted this.)
Thanks for the follow-up. After re-reading the post, I see that this is not /just/ an x264 problem.
Thank you also for informing me that hardware encoders aren't very good. I had actually considered purchasing one of the ~$100 ones, but now I'll steer clear.
I must be doing something wrong, though, because I can't seem to get much better than 2x real-time on my Q6600 w/ 4GB memory when using x264.
The meaning of 2x real-time, of course, depends entirely on the resolution.
x264 now has encoding presets you can use to trade off speed for compression: they go from "ultrafast" to "placebo" (full list in the help). Grab the latest from x264.nl; we've also had a lot of speed improvements lately ;).
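For concreteness, a typical command line looks roughly like this (flag names from memory, so double-check against the built-in help of whatever build you grab):

    x264 --preset veryfast --crf 23 -o output.mkv input.y4m

Slower presets (medium, slow, veryslow, ...) buy compression at the cost of speed, and --crf sets the quality target.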
Do note that at high speeds, it's easy to get bottlenecked: the most common case is the decoding of the input, which is often only single-threaded. If your input is uncompressed, then you can get bottlenecked by reading it off the disk. And if you're running filters on the input (e.g. resizing it), that can also serve as a bottleneck.
I've used Elgato's Turbo.264 dongle, and as far as I'm concerned it did a good job. It cost roughly $150, and it outperformed any software encoding I could manage by a large margin. Of course, if I had the money to build a dedicated computer and the time to figure out the right settings for ffmpeg (or some other software encoder), I might get even better results, but for my needs (a MacBook, and no desire to learn the intricacies of encoding) it's perfect.
I think that quite a few "ordinary consumers" would share my opinion. If they don't mind spending hours or days compiling and tweaking a delicate chain of open source software, they're not "ordinary" - they're power users.
I looked into the same dongle (also for a MacBook), and the general impression I got was that it was maybe better than Apple's QuickTime on slower machines, but that you could easily beat it with open source encoders. The only reason the comparison is difficult is that the software encoders generally didn't take the same shortcuts as the hardware encoder, because their developers didn't feel the quality tradeoffs were worth it.
This was a while ago, but I have no reason to think the software encoders haven't improved faster than the hardware.
edit: rereading your post I see your main problem with the software solution is "spending hours or days compiling and tweaking a delicate chain of open source software". I can't help but note that if that is the case then you're doing it wrong™.
It required changes to processor microarchitecture, though.