I'm not sure I fully understand fault-only-first load, but reading the description of vle8ff.v I think I only need to exchange the load inside of the loop?
How does the normal load deal with faults?
I'll update the parent comment, it slowed down the speed from 2/1.7 to 1.57/1.36 Bytes/Cycle.
You'd probably want to have a new __riscv_vsetvlmax_e8m8 at the start of each loop iteration, as otherwise an earlier iteration could cut off the vl (e.g. page unloaded by the OS), and thus the loop continues with the truncated vl.
The normal load should just segfault if any loaded byte is outside of readable memory, same as with a scalar load which is similarly partly outside.
> You'd probably want to have a new __riscv_vsetvlmax_e8m8 at the start of each loop iteration, as otherwise an earlier iteration could cut off the vl (e.g. page unloaded by the OS), and thus the loop continues with the truncated vl.
Oh, yeah, that was a big oversight, unfortunately, this didn't undo the performance regression.
> The normal load should just segfault if any loaded byte is outside of readable memory, same as with a scalar load which is similarly partly outside.
I don't quite understand how that plays out.
The reference memcpy implementation uses `vle8.v` and the reference strlen implementation uses `vle8ff.v`.
I think I understand how it works in strlen, but why does memcpy work without the ff? Does it just skip the instruction, or repeat it? Because in either case, shouldn't `vle8.v` work with strlen as well? There must be another option, but I can't think of any.
Also, does this mean I can get the original performance back, if I make sure to page align my pointers and use `vle8.v`?
The memcpy doesn't use a vlmax, it uses a hand-chosen vl. The load won't fault on any elements not loaded (here, elements past the vl), so the memcpy is fine as it only loads items it'll definitely need, whereas your original code can read elements past the null byte.
And yeah, aligning the pointer manually would work (though then it wouldn't be portable code, as the spec does allow for rvv implementations with VLEN of up to 65536 (8KB per register; 64KB with LMUL=8), which'll be larger than the regular 4KB pages).
Ah, this makes a lot more sense now. I thought the "fault" was about the kernel interrupting when a new page needs to be loaded into physical memory, which would also happen for memcpy.
How does the normal load deal with faults?
I'll update the parent comment, it slowed down the speed from 2/1.7 to 1.57/1.36 Bytes/Cycle.