I'd like to see integer types for C/C++ that can saturate or trap on overflow - or both kinds of type. I'm pretty sure something like this has even been proposed for the C standard but was rejected or postponed.
It seems like a really obvious feature that's easy to support but isn't.
Free Pascal (fpc) supports trapping on overflow; it uses the overflow flag and tests it after integer operations, which is pretty basic. Here's the assembly output of an Int64 add:
# [16] c := a + b;
movq U_$P$TEST_$$_A(%rip),%rdx
movq U_$P$TEST_$$_B(%rip),%rax
addq %rdx,%rax
jno .Lj9
call FPC_OVERFLOW
.Lj9:
movq %rax,U_$P$TEST_$$_C(%rip)
It should be easy to support in C/C++, as FPC uses the same LLVM backend.
That’s bizarre - it jumps on not overflow? And jumps forward? But that’s statically predicted not-taken - it’ll statically mispredict and stall the processor every time!
My gut feeling says this is due to Free Pascal supporting platforms[1] that only have short conditional jumps (-128 to +127 bytes), like i8086[2]. The alternative on those platforms would be an unconditional jump after the conditional one, to jump over the call to the error handler.
On i8086, at least, a non-taken conditional jump is 4 cycles while an unconditional jump is 15, for a total of 19 cycles, while a taken conditional jump is only 16. So it makes sense to almost always take the conditional jump.
It's the way that error handling is done in Free Pascal. I've never worried about the code it generates before, I just care about correct results. Yeah, it could be better.
I don't know the specific AMD64 addressing modes, but you can always detect overflow in signed 2's-complement addition by checking whether both addends have the same sign and the result has the opposite sign. There is a similar check for subtraction.
Yeah, but the point is that addressing modes fuse the addition, the multiplication, and the use of the result, so there's no opportunity to check anything after the operations.
RISC-V's architects weighed the pros and cons of having a flags register, and the pros and cons of having overflow exceptions.
They concluded it is best to have neither flags (conditional branches do their own testing, and no flag dependencies need to be tracked, which simplifies superscalar implementations) nor overflow checks (flow-breaking and costly; if you need the check, the cost of a software check is minimal, by design).
I understand that for overflow exceptions, but I would have expected the cost of a flags register to be zero or near zero - for example, in an integer adder?
It also doesn't seem to me that the cost of a software check is always trivial. It can be for a single operation, but an advantage of an overflow flag is that it lets you check a group of operations as a whole (check/branch just once and abort), which is probably what's practical to do algorithmically. In that scenario, switching to a software check for each op and/or bounds-checking the inputs sounds far from minimal.
Flags aren't that simple. In a superscalar microarchitecture you have a lot more to track re: who set the flag, as the flag is a target of every instruction that can set it.
Minimal... it depends! I remember that the developers behind GMP (the GNU bignum library) weren't happy with the performance!
But that's pretty niche..
Likewise. That's what I was expecting going in rather than Lovecraftian monstrosities like INT_NAN. Plus they're actually something that's supported in some existing ISAs.
Standard (signed) integers in C could be implemented to trap on overflow. Signed integer overflow is undefined behavior, like NULL dereferencing or division by zero, so a trap is perfectly acceptable behavior.
The issue is that unsigned integers must not trap on overflow, and CPUs that can be configured to trap on arithmetic overflow usually do not have separate opcodes for overflowing arithmetic operations and trapping arithmetic operations.
It’s fine for debug builds, but people should be aware that building with the address sanitizer enabled balloons the size of the executable and slows it down tremendously.
One should always measure to see if there is a noticeable slowdown or not.
That is what I meant by "an extremely serious reason to do otherwise".
For many programs the speed is determined by things like the throughput or latency of main-memory accesses, or by I/O operations, or by interaction with a user, in which case the sanitizer options have negligible influence on speed.
When necessary, they should be turned off only for specific well tested functions whose performance is limited by CPU computation in registers or in cache memory, and they should remain on for the rest of the program.
ASAN can easily slow your program down by 100x or worse. Your bias should always be that its impact would not be acceptable in production. If you were willing to accept the cost of ASAN, you should have simply chosen a slower, safer language in the first place.
I don't understand. It sounds like you're describing the existing behavior of the existing types. But I'm asking for new types with new behavior. The behavior of these types could be fully defined, which is what I want. I don't want types that happen-to-trap-because-it's-UB. I want a type that traps because that's its mission. And another type that saturates.
I understand the existing undefined behavior; that's not particularly interesting. What would be interesting is well-defined behavior for new types that doesn't interfere with whatever defined or undefined behavior the existing types have.
> usually do not have separate opcodes for overflowing arithmetic operations and trapping arithmetic operations.
I don't care how good or bad the codegen is. Just emit the code for the right behavior, because I'm using this type that requires this behavior.
Saturating arithmetic is available on some DSPs and may be available to C programs via #pragma on those architectures. IMHO this is a burden to every ALU and instruction that does not need the feature. We should not push application specific features like that down to the basic hardware. We'd also need to provide language support for it.
Saturating arithmetic is probably not all that expensive, since it's been in every ARM SIMD unit (and sometimes also the scalar unit) for many years. You just pay a little bit in latency when using those instructions.
The program terminates - depending on context, I suppose. I don't know what all toolchains/backends do, but some will generate an unaligned load or some other architecturally illegal code (__builtin_trap). Or you could trap to some handler.
x86 has had the `INTO` (interrupt on overflow) instruction (just a one-byte form of `INT 4`) for decades, though it was dropped in 64-bit mode. After an `ADD` or other arithmetic operation, you would execute `INTO` to trap on overflow. However, because compilers love exploiting undefined behavior for speedups, they don't emit that instruction, so it's actually fairly slow on modern processors.
I am not sure if I remember it correctly, but wasn't the problem that you had to explicitly call INTO after every operation? You didn't have a global flag to set or a flag bit in the add instruction, you had to explicitly emit the check. So using it carried a significant performance penalty over leaving the operation unchecked.
> However, because compilers love exploiting undefined behavior for speedups
Except if the behavior had been well defined any attempt to catch overflow would violate the standard - see unsigned integers for reference.
> so it's actually fairly slow on modern processors.
Are you sure that having a potential branch every second operation, that depends directly on the result of the preceding operation, has at any point in time not been horribly slow?
> I am not sure if I remember it correctly, but wasn't the problem that you had to explicitly call INTO after every operation? You didn't have a global flag to set or a flag bit in the add instruction, you had to explicitly emit the check. So using it carried a significant performance penalty over leaving the operation unchecked.
Correct.
> Except if the behavior had been well defined any attempt to catch overflow would violate the standard - see unsigned integers for reference.
Then the compiler could emit them only for signed operations, but it doesn't.
> Are you sure that having a potential branch every second operation, that depends directly on the result of the preceding operation, has at any point in time not been horribly slow?
Yes; it's been shown that `INTO` is slower than `JO raiseOverflowError` because, by the nature of `INTO` being an interrupt, it doesn't get the advantage of branch prediction, but `JO` would.
On an older processor without a pipeline, sure, it'd be just as slow as an equivalent check, but with how massive the pipelines are on today's processors, it's bad.
> Yes; it's been shown that `INTO` is slower than `JO raiseOverflowError` because, by the nature of `INTO` being an interrupt, it doesn't get the advantage of branch prediction, but `JO` would.
A CPU could predict INTO - simply predicting it as not taken would be a good guess. But that would require designing, implementing, verifying, and maintaining that prediction logic, and it might add some area to important parts of the CPU. The vendors haven't seen it as a priority.
(AIUI at least older CPUs couldn’t predict branches in microcode, and INTO is presumably fully microcoded, so it doesn’t get predicted.)