Hacker News new | past | comments | ask | show | jobs | submit | salykova's comments login

excalidraw <3


as we discussed earlier, the code really needs Clang to attain high performance



We were actively chatting with Justine yesterday, seems like the implementation is at least 2x faster than tinyBLAS on her workstation. The whole discussion is in Mozilla AI discord: https://discord.com/invite/NSnjHmT5xY


"off-topic" channel


Hi! I'm the author of the article. It's my really first time optimizing C code and using intrinsics, so I'm definitely not an expert in this area, but Im willing to learn more! Many thanks for your feedback; I truly appreciate comments that provide new perspectives.

Regarding "creating a constant global array and loading from it" - if I recall correctly, I've tested this approach and it was a bit slower than bit mask shifting. But let me re-test this to be 100% sure.

"Comparing a constant vector {0, 1, 2, 3, 4, ...} with broadcasted m and m-8" - good idea, I will try it!


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: