salykova's comments

salykova · 2024-07-04T17:11:28.000000Z

excalidraw <3

salykova · 2024-07-04T16:08:56.000000Z

As we discussed earlier, the code really needs Clang to attain high performance.

SushiHippie · 2024-07-04T16:27:36.000000Z

salykova · 2024-07-04T05:59:09.000000Z

We were actively chatting with Justine yesterday, and it seems the implementation is at least 2x faster than tinyBLAS on her workstation. The whole discussion is in the Mozilla AI Discord: https://discord.com/invite/NSnjHmT5xY

salykova · 2024-07-04T08:20:48.000000Z

"off-topic" channel

salykova · 2024-07-04T05:52:46.000000Z

Hi! I'm the author of the article. It's really my first time optimizing C code and using intrinsics, so I'm definitely not an expert in this area, but I'm willing to learn more! Many thanks for your feedback; I truly appreciate comments that provide new perspectives.

Regarding "creating a constant global array and loading from it" - if I recall correctly, I tested this approach and it was a bit slower than bit mask shifting. But let me re-test it to be 100% sure.
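
For context, here is a minimal sketch of one way to read "a constant global array and loading from it": a sliding window over a table of all-ones lanes followed by zero lanes. The names `mask_table` and `tail_masks_table`, and the split of a 16-row tile into two 8-lane masks, are assumptions for illustration, not code from the article. `memcpy` stands in for the unaligned vector load so the sketch runs on any machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 16 all-ones lanes; the remaining 16 entries are zero-initialized. */
static const int32_t mask_table[32] = {
    -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1
};

/* Hypothetical helper: m is the number of valid rows (0..16).
 * Lane i of lo is all-ones when i < m, lane i of hi when i < m - 8. */
static void tail_masks_table(int m, int32_t lo[8], int32_t hi[8]) {
    assert(m >= 0 && m <= 16);
    /* With AVX2, each copy would be one unaligned load, e.g.
     * _mm256_loadu_si256((const __m256i *)(mask_table + 16 - m)). */
    memcpy(lo, mask_table + (16 - m), 8 * sizeof(int32_t));
    memcpy(hi, mask_table + (24 - m), 8 * sizeof(int32_t));
}
```

With `m = 11`, for example, all eight lanes of `lo` are set and only the first three lanes of `hi` are set. Whether this beats shifting a bit mask depends on the surrounding kernel, since the table adds a memory load per mask.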

"Comparing a constant vector {0, 1, 2, 3, 4, ...} with broadcasted m and m-8" - good idea, I will try it!