Hacker News

One more: Don't parallelize when you're going to blow L1 caches for no good reason.

I wrote a gradient descent solver a while back, before I realized Vowpal Wabbit existed.

Vowpal Wabbit had a dedicated thread to read in records and pass them through a queue to another thread doing the math. I didn't, because it was a quick POC impl and I couldn't be arsed. My impl was faster.
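The reader-thread-plus-queue pipeline described above can be sketched with a standard producer/consumer pair. This is a minimal illustration of the pattern, not Vowpal Wabbit's actual code; `parse_line` and the summing "math" step are hypothetical stand-ins:

```python
import threading
import queue

def parse_line(line):
    # Hypothetical parser: turn a text record into floats.
    return [float(x) for x in line.split()]

def reader(lines, q):
    # Dedicated I/O thread: parse records and hand them off.
    for line in lines:
        q.put(parse_line(line))
    q.put(None)  # sentinel: no more records

def worker(q, results):
    # Math thread: consume records from the queue.
    while True:
        rec = q.get()
        if rec is None:
            break
        results.append(sum(rec))  # stand-in for the gradient step

lines = ["1 2 3", "4 5 6"]
q = queue.Queue(maxsize=1024)
results = []
t1 = threading.Thread(target=reader, args=(lines, q))
t2 = threading.Thread(target=worker, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [6.0, 15.0]
```

The point of the comment is that this handoff is not free: every record crosses a queue (and potentially a core boundary), so the record is parsed into one core's cache and consumed from another's. A single-threaded loop that parses and computes in place keeps each record hot in L1.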

This advice probably goes for most uses of the auto-parallel collection impls that the author was pointing out.




That is a good point. If you have around N things going on, and N cores, and each thing can saturate one core pretty well, it is better to keep them each on one core so they can rip through their own L1 caches rather than ruin each other's.

The loop abstractions don't give you the control to guarantee that, though I would hope the good ones are smart enough to at least keep each thread on one core and on its own chunk of the array.
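The "one thread, one contiguous chunk" layout being hoped for here can be written out by hand. A minimal sketch (the chunking scheme and `chunked_sum` name are illustrative, not any particular library's API):

```python
import threading

def chunked_sum(data, n_threads=4):
    # Split into contiguous chunks, one per thread, so each thread
    # streams through its own region of the array rather than
    # interleaving accesses with the other threads.
    n = len(data)
    step = (n + n_threads - 1) // n_threads
    partials = [0] * n_threads

    def work(i):
        partials[i] = sum(data[i * step:(i + 1) * step])

    threads = [threading.Thread(target=work, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)

print(chunked_sum(list(range(1000))))  # 499500
```

Contiguous chunks also avoid false sharing in the input: two threads never touch the same cache line of `data` (except at chunk boundaries), which a strided or work-stealing split can't promise.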

That would be interesting to explore.


I call this heterogeneous vs. homogeneous threads/processes. Heterogeneous threads are more icache and dcache friendly. They also make your program more modular.

Linux has controls to set the affinity of threads for cores.
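For reference, the C-level interface is `sched_setaffinity(2)` / `pthread_setaffinity_np(3)`; Python exposes the same Linux syscall as `os.sched_setaffinity`. A minimal sketch (Linux-only; picking the lowest allowed CPU is an arbitrary choice for illustration):

```python
import os

# Linux-only: query the CPUs this process may run on, then pin to one.
allowed = os.sched_getaffinity(0)   # e.g. {0, 1, 2, 3}
target = {min(allowed)}             # pick a single core
os.sched_setaffinity(0, target)     # pin the calling process
print(os.sched_getaffinity(0))      # now just the pinned core
```

Pinning each long-lived worker to its own core is what keeps its working set resident in that core's L1/L2 instead of migrating with the scheduler.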


Interesting, but hard to draw any conclusions given that there were probably several other things that were different between the two implementations.



