Yes, you waste a lot of memory. Memory's cheap. If you need to, you can do the c...

jiggawatts · on Jan 13, 2024

Memory is cheap, but memory bandwidth isn’t.

Languages that can stay in L1 cache for the duration of a computation will run circles around languages that explicitly compute and store every intermediate value in full.

Also, array-based languages can easily hit the limit of system memory capacity, whereas traditional code tends to be streaming and can handle unbounded input lengths.
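
To make the contrast concrete, here is a rough sketch in Python (chosen only for illustration; the function names are made up). The first version stores the full intermediate the way a naive array evaluation would, so its memory use grows with the input. The second keeps a single running total and accepts any iterable, including one that is never held in memory all at once.

```python
from typing import Iterable, List


def sum_squares_materialized(xs: List[float]) -> float:
    # Array style: the whole intermediate list exists in memory at once.
    squared = [x * x for x in xs]
    return sum(squared)


def sum_squares_streaming(xs: Iterable[float]) -> float:
    # Streaming style: one value at a time, constant extra memory.
    total = 0.0
    for x in xs:
        total += x * x
    return total
```

The streaming version can be fed a generator, e.g. `sum_squares_streaming(float(i) for i in range(1, 4))`, so the input length is bounded only by time, not by memory.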

mlochbaum · on Jan 13, 2024

Which is exactly why I said you block the computation to stay at a low cache level. With SIMD loads and stores I don't think this matters quite as much as you suggest, even without blocking. It's pretty much only arithmetic that can saturate L1. I timed the BQN compiler on various files (some old version of itself, repeated). For 18K it runs at 21.4MB/s; for 1.7M, 16.5MB/s; for 17M, 12.0MB/s. So even when the source won't fit in L3 (mine's 8MB) the degradation is under a factor of 2 (21.4/12.0 ≈ 1.8). And of course the compiler makes no consideration of cache; who writes a megabyte of BQN?
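
For anyone unsure what "blocking" means here, a minimal sketch in Python (not BQN, and not how any real array implementation does it; `whole_array`, `blocked`, and the block size of 4096 are all hypothetical). The array-style pipeline is kept as-is, but it is applied to one fixed-size block at a time, so each intermediate is only as large as a block and can stay in a low cache level. Python itself won't show the cache effect; this only shows the structure.

```python
from typing import List


def whole_array(xs: List[int]) -> int:
    # Array style: each step produces a full-size intermediate.
    doubled = [x * 2 for x in xs]
    shifted = [d + 1 for d in doubled]
    squared = [s * s for s in shifted]
    return sum(squared)


def blocked(xs: List[int], block_size: int = 4096) -> int:
    # Same pipeline, run block by block. Intermediates never exceed
    # block_size elements; block_size is an assumed tuning knob.
    total = 0
    for start in range(0, len(xs), block_size):
        total += whole_array(xs[start:start + block_size])
    return total
```

This works directly because the final step is a sum, which combines across blocks trivially; operations that need to see the whole array at once (sorting, say) need more care.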