Your `sum` array is only 64 elements but you're indexing with indices out of bou...

ap4 · on Aug 1, 2024

That's the trick, and it's intended. The point is that the compiler does not notice the out-of-bounds access in its first stages but does notice it in the later stages, and still compiles the code into a correctly working kernel. The result is correct, as checked by the function verify_matrix().

ladberg · on Aug 1, 2024

Sorry I'm a bit lost here, could you explain the reasoning behind this and why it works?

ap4 · on Aug 1, 2024

I want to have 512 threads per block, with each thread computing 128 values at once. That's 65536 values per block. I can't accumulate all of these values in registers, because the GPU has a limit of 65536 registers per block, and the kernel needs some additional registers. But if I find a way to trick the early stages of the compiler into believing it has enough free registers, then sometimes, as in the case of this kernel, the later stages of the compiler are smart enough to give me what I want: 512 threads per block, each calculating 128 values.
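
The register budget described here can be written out as a small compile-time check. This is only a sketch of the arithmetic: the constant and function names are made up for illustration, and the 65536 figure is simply the per-block limit quoted in this comment.

```cpp
// Hypothetical names; the numbers are the ones quoted in the comment above.
constexpr int kThreadsPerBlock      = 512;
constexpr int kValuesPerThread      = 128;
constexpr int kMaxRegistersPerBlock = 65536;  // per-block limit as stated above

// One accumulator register per computed value.
constexpr int accumulators_per_block() {
  return kThreadsPerBlock * kValuesPerThread;
}

// Registers each thread would have left for indices, pointers, and loaded
// operands if every accumulator lived in a register.
constexpr int spare_registers_per_thread() {
  return kMaxRegistersPerBlock / kThreadsPerBlock - kValuesPerThread;
}

static_assert(accumulators_per_block() == kMaxRegistersPerBlock,
              "accumulators alone use the entire per-block register budget");
static_assert(spare_registers_per_thread() == 0,
              "nothing is left for the rest of the kernel");
```

With zero spare registers per thread, the accumulators alone exhaust the budget, which is why the configuration cannot fit as described.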

ladberg · on Aug 1, 2024

I hate to say it but that simply doesn't work: you can't write out of bounds to trick the compiler, it'll just ignore your out-of-bounds work.

You can look at the generated SASS on godbolt: https://cuda.godbolt.org/z/19excTxM3

Note that there are 1024 FFMA instructions in the loop, but you would expect 16*8*BK = 2048. This suggests that half the operations are skipped, which lines up with the out-of-bounds half of the writes being omitted.
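
The instruction count can be cross-checked against the 64-element `sum` array mentioned at the top of the thread. The sketch below uses hypothetical constant names and assumes that the 16*8 factor is the 128 values per thread and that only the in-bounds accumulators survive compilation. That is an interpretation of the SASS, not something the compiler reports.

```cpp
// Hypothetical names; the numbers come from the comments in this thread.
constexpr int kValuesPerThread = 16 * 8;  // 128 accumulators intended per thread
constexpr int kExpectedFfma    = 2048;    // 16*8*BK, as stated above
constexpr int kObservedFfma    = 1024;    // FFMA count reported from the SASS
constexpr int kSumElements     = 64;      // declared size of the `sum` array

// BK implied by the expected count: 2048 / 128.
constexpr int kBK = kExpectedFfma / kValuesPerThread;

// If only in-bounds accumulators are kept, each of the BK inner steps
// issues one FFMA per element of `sum`.
constexpr int kPredictedFfma = kSumElements * kBK;

static_assert(kBK == 16, "BK implied by 16*8*BK = 2048");
static_assert(kPredictedFfma == kObservedFfma,
              "in-bounds accumulators account for every observed FFMA");
```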

After the compute loop when you're calculating the final result and storing it, you can see that the FFMAs referencing out of bounds indices write QNAN instead of any real results.

Is it possible that the NaNs are what's messing with your tests? Those are notoriously hard to deal with correctly, but you should assert that the result doesn't contain any NaNs whatsoever.
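
A minimal sketch of how a tolerance check can miss NaNs and how to guard against it. Both functions are hypothetical illustrations, not the repository's actual verify_matrix(). The mechanism is that every ordered comparison involving NaN evaluates to false, so a `diff > tol` test never fires on a NaN difference.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical naive check: when out[i] is NaN, diff is NaN and
// `diff > tol` is false, so the element is silently accepted.
bool verify_naive(const float* ref, const float* out, std::size_t n,
                  float tol = 1e-2f) {
  for (std::size_t i = 0; i < n; ++i) {
    float diff = std::fabs(ref[i] - out[i]);
    if (diff > tol) return false;
  }
  return true;
}

// Hypothetical NaN-aware check: rejects NaN outputs explicitly and writes
// the tolerance test so that a NaN difference also fails.
bool verify_nan_aware(const float* ref, const float* out, std::size_t n,
                      float tol = 1e-2f) {
  assert(ref != nullptr && out != nullptr);
  for (std::size_t i = 0; i < n; ++i) {
    if (std::isnan(out[i])) return false;
    float diff = std::fabs(ref[i] - out[i]);
    if (!(diff <= tol)) return false;  // true for large diffs and for NaN
  }
  return true;
}
```

Given a reference `{1, 2, 3}` and an output whose middle element is NaN, `verify_naive` accepts the output while `verify_nan_aware` rejects it. Note that compiling with fast-math style flags can optimize away `std::isnan`, so a check like this is best built without them.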

ap4 · on Aug 1, 2024

You are right. The function verify_matrix() from the original SGEMM_CUDA repository did not check for NaNs. I deleted the repository. It was the 13th CUDA kernel I wrote in my life, and the whole endeavor taught me a lot. I appreciate the feedback.

ladberg · on Aug 1, 2024

Glad it was a learning experience for you and I apologize if I came off argumentative at all! I was mainly so incredulous because this is my day job haha, so I have a bit more experience than most in the area.

It definitely sucks to be led astray and have time wasted by a bug inherited from the original repo though, sorry to hear that :/