
So, as someone who is by no means an expert: you are half right. The compiler doesn't have to guess which parts are parallel; it's usually very clear which ops are parallelisable. How you parallelise them is the name of the game.

So, for example, if you follow a pattern of "do a small op to each part of a large block of data, then do another small op to each part of that block, etc.", then at least with CPU SIMD (e.g. AVX) you end up memory bottlenecked.
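To make that concrete, here's a minimal C++ sketch of the memory-bound pattern (function name and ops are made up for illustration): two cheap passes over a big array, each one streaming the whole thing through the cache:

    #include <vector>

    // Each pass touches every element once. For a large array the data is
    // evicted between passes, so pass 2 re-fetches everything from DRAM.
    // With this little arithmetic per element, bandwidth is the bottleneck.
    void two_pass(std::vector<float>& data) {
        for (float& x : data) x = x * 2.0f;   // pass 1: streams the whole array
        for (float& x : data) x = x + 1.0f;   // pass 2: streams it all again
    }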

However, if you can do a bunch of ops on the same small blocks of data before moving on to the next blocks in your overall large block of data, then those small blocks can fit inside the L1 cache (or in the registers directly), and that can run the CPU at its absolute limit.
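Same work as above, restructured so each L1-sized block sees the whole op chain before moving on (kBlock is an illustrative size; the right value depends on your actual cache):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void blocked(std::vector<float>& data) {
        constexpr std::size_t kBlock = 4096;  // 16 KiB of floats; fits a typical 32 KiB L1
        for (std::size_t i = 0; i < data.size(); i += kBlock) {
            const std::size_t end = std::min(i + kBlock, data.size());
            // Run the whole op chain over the block while it is L1-resident;
            // only the first op pays for pulling the block from DRAM.
            for (std::size_t j = i; j < end; ++j) data[j] *= 2.0f;
            for (std::size_t j = i; j < end; ++j) data[j] += 1.0f;
        }
    }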

Hence it becomes a game of scheduling. You already know what you need to optimise, but actually doing so gets really hard really fast. That said, things like MLIR (which is still very new) are making this easier to approach.



> Hence it becomes a game of scheduling. You already know what you need to optimise but actually doing so gets really hard really fast.

This immediately makes me think of Halide [0], which was specifically invented to make this easier by decoupling the algorithm from the schedule.

Kind of sad that it doesn't seem to have caught on much.

[0] https://halide-lang.org/
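For anyone curious, this is roughly the 3x3 blur example from the Halide front page. The algorithm is written once; the schedule (tiling, vectorising, threading) is a separate couple of lines you can swap out without touching it:

    #include "Halide.h"
    using namespace Halide;

    Func blur_3x3(Func input) {
        Func blur_x("blur_x"), blur_y("blur_y");
        Var x("x"), y("y"), xi("xi"), yi("yi");

        // The algorithm: what gets computed.
        blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
        blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

        // The schedule: how it gets computed. Changing these lines changes
        // the loop structure the compiler emits, not the result.
        blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
        blur_x.compute_at(blur_y, x).vectorize(x, 8);

        return blur_y;
    }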


Well, it actually has. MLIR (built by the LLVM team) is basically the next generation of LLVM, and one of the MLIR tutorials is literally "write Halide":

https://mlir.llvm.org/docs/Tutorials/transform/ChH/


Oh, that's cool! I hadn't looked into MLIR in any detail yet; thank you for pointing me in that direction.



