Hacker News
New integer types I’d like to see (foonathan.net)
101 points by ibobev on Sept 30, 2022 | 126 comments


I'm far more partial to the idea that NaN was a mistake than a desired feature. Including it and then making operations that involve it UB is a road to hell and sadness, far worse than abs(INT_MIN).

The idea to separate bit vectors from numbers is a good one though, it always seemed a little weird to me that we're passing around numeric types for FLAG_FOO & FLAG_BAR.


If you are doing bit manipulation that's more complicated than vectors of boolean flags, you are likely going to use both bitwise and arithmetic operations on the same data. If you need separate types for those operations, you will need many explicit casts in your code. That is usually a sign that something is wrong.

In low-level code, integer types are not numbers. They are fixed-length bit fields with a numerical interpretation. One thing I like in Rust over C/C++ is that it makes this clear. You always specify the size and signedness of the integer explicitly, and the standard library provides many useful operations beyond the operators.


Already in 1964 the language PL/I (still named NPL in 1964) had "bit strings" as a distinct type from integer numbers.

Also "character strings" were a distinct type.

I also believe that these types must be clearly distinguished, even if conversion functions must be provided between them, and even if in certain contexts it may be convenient for the conversions to be implicit.


> Already in 1964 the language PL/I (still named NPL in 1964) had "bit strings" as a distinct type from integer numbers.

That’s because lots of hardware back then had support for bit addressing. The machine I love best, the PDP-6/PDP-10, had addressable bytes of widths ranging from 1 to 36 bits, which were extremely handy.

The few C compilers for those machines didn’t support them of course because the PDP-11s didn’t support them, but C on the PDP-10 was only used for porting stuff anyway.

Honestly I’m surprised they haven’t been revived. Packed bitfields are really handy and should be fast.


> I’m surprised they haven’t been revived.

ARM has bit addressing with their bit-banding feature, where individual bits are mapped onto word addresses in an alias region.


That's handy, and I suppose much faster than mask & shift. But it's often handy to be able to stride through an array at, say, three bits at a time.

When you have a very large dataset tricks like this can significantly reduce RAM and especially cache pressure.


I mean, strongly typed flag values are a solved problem. It’s not rocket science, just bothersome to do in C++.
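For anyone who hasn't done it, a minimal sketch of the usual enum-class-plus-operator-overloads pattern (the Flags/Foo/Bar names are made up):

  #include <cstdint>

  enum class Flags : std::uint32_t {
      None = 0,
      Foo  = 1u << 0,
      Bar  = 1u << 1,
  };

  // Bitwise ops stay within Flags; there is no implicit conversion to a numeric type.
  constexpr Flags operator|(Flags a, Flags b) {
      return static_cast<Flags>(static_cast<std::uint32_t>(a) | static_cast<std::uint32_t>(b));
  }
  constexpr Flags operator&(Flags a, Flags b) {
      return static_cast<Flags>(static_cast<std::uint32_t>(a) & static_cast<std::uint32_t>(b));
  }
  constexpr bool any(Flags f) { return static_cast<std::uint32_t>(f) != 0; }

  // Usage: if (any(flags & Flags::Foo)) { ... }

Repeating that boilerplate for every flag type is exactly the bothersome part.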


In the first part (balanced signed integers) I had sort of expected this to eventually say "Oh, this is basically how Rust's niches work, huh" but it doesn't. This is especially weird because Jonathan explicitly calls out the "new languages" like Carbon and whatever Herb will call his thing.

Rust's Option<NonZeroU32> is guaranteed to be the exact same size as u32 (4 bytes) and the standard library could do the same thing with a BalancedI32 type if it wanted to.


I was expecting that too. It seems strictly more expressive, since once you have the niche, you can do anything you like with it.

And INT_MIN is a better place for it than zero imo. If Rust had a `BalancedI32` I would reach for it a lot more than I use the NonZero types. In my code at least, I've found zero is a pretty useful number.


AIUI, the sticky problem is that the Rust compiler today wants to see a single contiguous range of scalar values to enable this trick, and from its perspective e.g. BalancedI8 would have two ranges, 0 to 127 inclusive and then 129 to 255 inclusive; it doesn't see the signed interpretation when thinking about this problem.

That's clearly not impossible to solve, but if I'm correct it means significant compiler development work rather than just a fun weekend chore writing and testing a custom type intended to work only in the standard library.


It’s obviously 129 to 127 inclusive, hoping that the implied unsigned overflow has the correct consequences. ;)


Woah, that actually seems to work. My toy Option<BalancedI8> compiles and works (in a hacked environment).


> This is especially weird because Jonathan explicitly calls out the "new languages" like Carbon and whatever Herb will call his thing.

You overestimate my knowledge of Rust :D I wasn't aware that NonZeroU32 is a thing, thanks for pointing it out.


Q: Who is Herb and what's the "new thing"?


Herb Sutter is the WG21 convener. Versions of the C++ programming language are standards from the ISO/IEC Joint Technical Committee JTC1; its working group WG21 (under subcommittee SC22) is often referred to as just "the committee" in this context. The convener is in effect the head of the committee. For comparison, WG14 is the C language committee.

Herb has made lots of proposals for changing C++ over the past half a decade or so, a few of them have shipped in newer C++ versions, most didn't go anywhere. At CppCon (a big C++ convention) Herb's keynote was about a transpiler which takes a different syntax, including most of Herb's features that didn't make it and some new features, and turns that into C++. This has been referred to as either "cppfront" or "cpp2".


I'd like to see integer types for C/C++ that are able to saturate or trap on overflow - or both, as separate types. I'm pretty sure it's even been proposed as a C standard but was rejected or postponed.

It seems like a really obvious feature that would be easy to support, yet isn't supported.


Free Pascal (fpc) supports trapping on overflow: it uses the overflow flag and tests it after integer operations; it's pretty basic. Here's the assembly output of an Int64 add:

  # [16] c := a + b;
    movq U_$P$TEST_$$_A(%rip),%rdx
    movq U_$P$TEST_$$_B(%rip),%rax
    addq %rdx,%rax
    jno .Lj9
    call FPC_OVERFLOW
  .Lj9:
    movq %rax,U_$P$TEST_$$_C(%rip)
It should be easy to support in C/C++, as FPC uses the same LLVM backend.
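For comparison, a rough sketch of what the same check looks like in C/C++ today with the GCC/Clang overflow builtins (the exact instructions emitted will differ from the FPC output above):

  #include <cstdint>

  std::int64_t checked_add(std::int64_t a, std::int64_t b) {
      std::int64_t c;
      if (__builtin_add_overflow(a, b, &c))  // computes a + b, returns true on overflow
          __builtin_trap();                  // aborts; typically a ud2 on x86-64
      return c;
  }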


That’s bizarre - it jumps on not overflow? And jumps forward? But that’s statically predicted not-taken - it’ll statically mispredict and stall the processor every time!


My gut feeling says this is due to FreePascal supporting platforms[1] that only have short conditional jumps (-128 to +127 bytes), like the i8086[2]. The alternative on those platforms would be an unconditional jump after the conditional one, to jump over the call to the error handler.

On the i8086 at least, a not-taken conditional jump is 4 cycles while an unconditional jump is 15, for a total of 19 cycles, whereas a taken conditional jump is only 16 [3]. So it makes sense to almost always take the conditional jump.

[1]: https://wiki.freepascal.org/Free_Pascal_supported_targets

[2]: https://edge.edx.org/c4x/BITSPilani/EEE231/asset/8086_family... (page 2-45, or 60 in the PDF)

[3]: page 2-58 in[2] or 73 in the PDF


It's the way that error handling is done in Free Pascal. I've never worried about the code it generates before, I just care about correct results. Yeah, it could be better.


That's hardware specific, right? You're not guaranteed to have a solution on every platform.


All platforms can detect overflow in one way or another; for example by checking for an unexpected sign change.


AMD64 addressing modes cannot detect overflow, and this is where in practice 90% of arithmetic is done!


That can't be true. Why would a compiler use memory access to do math? Math operations are register based, none of the quantities are addresses.

How can you use memory addressing modes to do math, like 987*765?


> That can't be true.

Lol. See for yourself if you don't believe me.

https://godbolt.org/z/o14f7cdsc

> Why would a compiler use memory access to do math?

Because there are efficient instructions for it. Have you written an arithmetic code generation pass? It's normal to do it like this.

> Math operations are register based, none of the quantities are addresses.

An address is just a number. Yeah they're in registers or can be immediate. So what?

> How can you use memory addressing modes to do math, like 987*765?

Because AMD64 addressing modes have a base, a scale, and an offset component. That's multiplying and adding.

Let me know how you think you can work out if that lea overflowed!
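To make that concrete, a tiny example of the kind of source that compilers routinely turn into a single lea (exact codegen depends on the compiler and flags):

  int times_nine_plus_four(int x) {
      // Often compiled to something like: lea eax, [rdi + rdi*8 + 4]
      // i.e. base + index*scale + displacement does the multiply and the add in one go.
      return x * 9 + 4;
  }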


Wow, that's nuts. I'd have never expected that abuse.

You can use -ftrapv to check for integer overflows, like this

https://godbolt.org/z/snbdadv8b

>Let me know how you think you can work out if that lea overflowed!

LEA (Load Effective Address) can't overflow... memory is supposed to wrap, the byte after the last byte is byte 0.


Right but have you seen what that’s done to your machine code? That’s why people don’t do it normally.

> LEA (Load Effective Address) can't overflow...

You're mistaken - it's defined to overflow.

https://reverseengineering.stackexchange.com/questions/11442...


I'm not 100% sure, but I think that the behavior of LEA is to wrap, but it does not set the overflow flag


I don't know specific AMD64 addressing modes, but you should always be able to detect overflow in signed 2s complement addition by checking to see if both addends are positive and the result is negative. There is a similar check for subtraction.
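For what it's worth, a sketch of the full check (it has to cover both sign directions, and the addition itself is done in unsigned arithmetic so the check doesn't itself rely on signed overflow):

  #include <cstdint>

  // Overflow occurred iff both operands have the same sign and the
  // wrapped result's sign differs from theirs.
  bool add_overflows(std::int32_t a, std::int32_t b) {
      std::uint32_t wrapped = static_cast<std::uint32_t>(a) + static_cast<std::uint32_t>(b);
      std::int32_t r = static_cast<std::int32_t>(wrapped);  // wraps on every mainstream platform
      return ((a >= 0) == (b >= 0)) && ((r >= 0) != (a >= 0));
  }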


Yeah, but the point is addressing modes fuse addition and multiplication and using the result, so there's no opportunity to check anything after the operations.


The only modern architecture which does not have overflow detection is RISC-V.


Interesting. Is it a deliberate decision or just an omission?


Deliberate.

RISC-V architects weighed the pros and cons of having a flags register, and the pros and cons of having overflow exceptions.

They concluded it is best to have no flags at all (conditional branches do their own testing, and no flag dependencies need to be tracked, which simplifies superscalar implementations) and no overflow checks (flow-breaking and costly; if you need the check, the cost of a software check is minimal, by design).


I understand that for overflow exceptions, but I would have expected the cost of a flags register to be zero or near zero, for example in an integer adder?

It also doesn't look to me like the cost of a software check can always be trivial. It can be for a single operation, but an advantage of an overflow flag is that it lets you check a group of operations as a whole (check/branch just once and abort), which is probably what's practical to do algorithmically. In such a scenario, switching to software checks for each op and/or bounds-checking the inputs sounds far from minimal.


Flags aren't that simple. In a superscalar microarchitecture you have a lot more to track regarding who set the flag, as the flags register is a destination of every instruction that can set it.


Minimal... it depends! I remember that the developers behind GMP (a GNU library for bignums) weren't happy with the performance! But that's pretty niche...


Likewise. That's what I was expecting going in rather than Lovecraftian monstrosities like INT_NAN. Plus they're actually something that's supported in some existing ISAs.


Totally agree. Overflowing silently is almost never what you want. Sometimes it’s useful but that’s the rare case.


Standard (signed) integers in C could be implemented to trap on overflow. Overflow on a signed integer is an undefined operation, like NULL dereferencing or division by zero, so trapping is perfectly acceptable behavior.

The issue is that unsigned integers must not trap on overflow, and CPUs that can be configured to trap on arithmetic overflow usually do not have separate opcodes for wrapping and trapping arithmetic operations.


Most compilers, like gcc or clang, have a compile option to trap on integer overflow, instead of leaving the behavior to be undefined for such cases.

For example with gcc one should always use these compile options, unless there exists an extremely serious reason to do otherwise:

-fsanitize=address,undefined -fsanitize-undefined-trap-on-error

These options provide correct behavior not only on overflows, but also on out-of-bounds accesses and other C undefined operations.
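For example, a signed overflow that would otherwise be silent turns into a hard trap. A minimal example, assuming gcc or clang:

  // overflow.cpp - build with:
  //   g++ -O2 -fsanitize=address,undefined -fsanitize-undefined-trap-on-error overflow.cpp
  #include <climits>
  #include <cstdio>

  int main(int argc, char**) {
      int x = INT_MAX;
      int y = x + argc;          // signed overflow (UB); trapped at runtime with the flags above
      std::printf("%d\n", y);
  }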


It’s fine for debug builds, but people should be aware that building with the address sanitizer enabled balloons the size of the executable and slows it down tremendously.


One should always measure to see if there is a noticeable slowdown or not.

That is what I meant by "an extremely serious reason to do otherwise".

For many programs the speed is determined by things like the throughput or latency of accessing the main memory, or by I/O operations, or by interaction with a user, in which case the sanitize options have negligible influence on speed.

When necessary, they should be turned off only for specific well tested functions whose performance is limited by CPU computation in registers or in cache memory, and they should remain on for the rest of the program.


ASAN can easily slow your program by 100x or worse. Your bias should always be that its impact would not be acceptable in production. If you were willing to accept the cost of ASAN, you should have simply chosen a slower, safer language in the first place.


I don't understand. It sounds like you're describing the existing behavior for the existing types. But I'm asking for new types with new behavior. The rules could be defined behavior for these types, which is what I want. I don't want types that happen-to-trap-because-it's-UB. I want a type that traps because that's its mission. And another type that saturates.

I understand the existing undefined behavior, that's not particularly interesting. What would be interesting is well-defined behavior for new types that does not interfere with however defined or undefined behavior exists for the existing types.

> usually do not have separate opcodes for overflowing arithmetic operations and trapping arithmetic operations.

I don't care how good or bad the codegen is. Just emit the code for the right behavior, because I'm using this type that requires this behavior.


> The issue is that unsigned integers must not trap on overflow

You forgot to add 'in C', in Zig they have the same behaviour as signed integers.


Saturating arithmetic is available on some DSPs and may be available to C programs via #pragma on those architectures. IMHO this is a burden to every ALU and instruction that does not need the feature. We should not push application specific features like that down to the basic hardware. We'd also need to provide language support for it.


Saturating arithmetic is probably not all that expensive, since it's been in every ARM SIMD unit (and sometimes also scalar) for many years. You just pay a little bit in latency when using those instructions.
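Even without native instructions, a saturating add is cheap to write portably. A minimal sketch using the GCC/Clang overflow builtin (the compiler may or may not turn it into a native saturating op):

  #include <cstdint>
  #include <limits>

  // Saturating signed 32-bit add: clamp to INT32_MIN/INT32_MAX instead of wrapping.
  std::int32_t sat_add(std::int32_t a, std::int32_t b) {
      std::int32_t r;
      if (__builtin_add_overflow(a, b, &r))
          return a > 0 ? std::numeric_limits<std::int32_t>::max()
                       : std::numeric_limits<std::int32_t>::min();
      return r;
  }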


+1. Would be great to have integer NAN, INF and -INF supported in h/w.


The k language has implemented this capability (in software) for decades.

https://code.kx.com/q/ref/#datatypes


Yes, true - thanks. Love that in K in particular, and I like K and array languages in general. Another thing they get right. :-)


> trap on overflow

What do you see happening in that case?


The program terminates - depending on context, I suppose. I don't know what all toolchains/backends do, but some will generate an unaligned load or some other architecturally illegal code (__builtin_trap). Or you could trap to some handler.


x86 has had the `INTO` (interrupt on overflow) instruction (just a one byte form of `INT 4`) for decades. After an `ADD` or other arithmetic operation, you would `INTO` to trap it. However, because compilers love exploiting undefined behavior for speedups, they don't emit that instruction, so it's actually fairly slow on modern processors.


I am not sure if I remember it correctly, but wasn't the problem that you had to explicitly call INTO after every operation? You didn't have a global flag to set or a flag bit in the add instruction, you had to explicitly emit the check. So using it carried a significant performance penalty over leaving the operation unchecked.

> However, because compilers love exploiting undefined behavior for speedups

Except if the behavior had been well defined any attempt to catch overflow would violate the standard - see unsigned integers for reference.

> so it's actually fairly slow on modern processors.

Are you sure that having a potential branch every second operation, that depends directly on the result of the preceding operation, has at any point in time not been horribly slow?


> I am not sure if I remember it correctly, but wasn't the problem that you had to explicitly call INTO after every operation? You didn't have a global flag to set or a flag bit in the add instruction, you had to explicitly emit the check. So using it carried a significant performance penalty over leaving the operation unchecked.

Correct.

> Except if the behavior had been well defined any attempt to catch overflow would violate the standard - see unsigned integers for reference.

Then the compiler could emit them only on signed operations, but they don't.

> Are you sure that having a potential branch every second operation, that depends directly on the result of the preceding operation, has at any point in time not been horribly slow?

Yes; it's been shown that `INTO` is slower than `JO raiseOverflowError` because, by the nature of `INTO` being an interrupt, it doesn't get the advantage of branch prediction, but `JO` would.

On an older processor without a pipeline, sure, it'd be just as slow as an equivalent check, but with how massive the pipelines are on today's processors, it's bad.


> Yes; it's been shown that `INTO` is slower than `JO raiseOverflowError` because, by the nature of `INTO` being an interrupt, it doesn't get the advantage of branch prediction, but `JO` would.

A CPU could predict INTO — simply predicting it as not taken would be a good guess. This would require designing, implementing, verifying, and maintaining that support, and it might add some area to important parts of the CPU. The vendors haven’t seen it as a priority.

(AIUI at least older CPUs couldn’t predict branches in microcode, and INTO is presumably fully microcoded, so it doesn’t get predicted.)


Right, well a new type would give them that flexibility because I can opt in to the trap or saturate.


The argument starts here:

> I really don’t like this asymmetry – it leads to annoying edge cases in all sort of integer APIs.

And gets to here:

> Third, you’re getting an unused bit pattern 0b1'0000000, the old INT_MIN, which you can interpret however you like.

This isn't solving the original problem. If you don't like annoying edge cases in all sorts of integer APIs, it's worse.

edit: I can add: the article goes on to suggest that "old INT_MIN" should be called INT_NAN and "let’s just say arithmetic on INT_NAN is undefined behavior". So now what's the result of a + b? Undefined behavior! a * b? Undefined behavior! The general result of any function taking an integer or returning one is undefined behavior!


In C, with int a, b, the expression a + b is already potential UB; it just depends on the values of the variables. It is remarkable to note this... but not new! :)


Arithmetic on INT_MIN is already undefined behaviour.


? No! INT_MIN + 1 is 'implementation-defined behaviour', but that's a very different thing.


Congratulations on inventing slower one's complement.

> I’d like to have . . . a 63 bit unsigned integer. . . . However, it is undefined behavior if it ever stores a negative value.

No. There is no "debug mode" for undefined behavior. Undefined behavior is always undefined behavior.

It also doesn't "automatically come[] with a precondition that it cannot be negative" - only that if it's ever negative, your program might crash for reasons that cannot be found in your code.

If you really want this, use uint64_t and treat any value over 0x7fffffffffffffff as an error.

Curious about the drive-by downvoters: Can you point out where you think I'm wrong?


> Undefined behavior is always undefined behavior.

This is not true; the standard explicitly allows implementations to, and I quote directly from the standard, "behave during translation or program execution in a documented manner characteristic of the environment".

This idea that undefined behavior is always undefined behavior and there's no way to reason about it is purely academic, incorrect, and not in any way justified either by source material or in practice.

As for debugging undefined behavior, UBSan [1] is an excellent tool that is well supported by GCC and clang, and MSVC is working to add support for it as well.

[1] https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html


Of course the standard allows implementations to "behave during translation or program execution in a documented manner characteristic of the environment" when you do something that is undefined behavior; implementations are literally allowed to do anything they want in that case. The key point is that they are not required to do anything, so what's the point of adding more undefined behavior for this case?


The subtle distinction is that when the standard specifies that behavior is undefined, it does not mean that an implementation is forbidden from providing well specified and documented semantics, and hence the statement I replied to "Undefined behavior is always undefined behavior" is false and furthermore a common misconception.

The point of adding more undefined behavior for this case, or any case in general, is to provide optimization opportunities that do not require introducing changes to the syntax (such as additional type checking or analysis). That way a C++ compiler is welcome to provide debugging support and various checks when a program is compiled in debug mode, and then eliminate those checks and make very strong assumptions about the program's runtime behavior when a program is compiled with optimizations enabled.

This is in contrast to OP's claim that "There is no "debug mode" for undefined behavior." There absolutely is a debug mode and it absolutely can catch undefined behavior, and UBSan is an excellent tool for doing precisely what OP claimed is not permissible in C++.


63-bit integers are actually used in OCaml, where the remaining bit is used to figure out whether a value is an integer or a pointer.

https://blog.janestreet.com/what-is-gained-and-lost-with-63-...


Also used in Ruby virtual machines and no doubt many others.


63-bit integers with undefined behavior for negatives do not exist in OCaml.


No, but the thing is that the 64th bit will cause undefined behavior if it is set incorrectly: it will cause an integer to be seen as a pointer instead of a number. That was the reason I thought of it.


torstenvl is correct. UB is UB whether you are in debug mode or release mode. Making it UB for it to store a negative value doesn't make sense. You could put debug checks for it behind an if constexpr or a macro, but please, C++ has enough UB already, don't add more. Disappointing to see that this is downvoted.


> Distinct bit vectors vs integer type

This. I always find bitwise operations an unneeded exercise in mental gymnastics.

On the other hand, I wonder if a lot of the other quirks about numbers could be handled via a library?


> unneeded exercise in mental gymnastics

Great characterization! I’ve always been confused why most languages don’t have better tools for bit vectors. They’re incredibly common and it’s really confusing to have to use the underlying representation of numbers to do anything with them.


It is solved by the standard library.

https://en.cppreference.com/w/cpp/utility/bitset


It is not "solved" at all. std::bitset has terrible properties and an awful interface.

1. No dynamic resize, have to know size at compile time or allocate based on max expectations. And yes, std::vector<bool> sucks too.

2. Despite being only statically sized, several classes of bugs are not prevented at compile-time. For example:

std::bitset<4> fail1{"10001"}; // This produces a value of 0b1000, no warnings or exceptions thrown

std::bitset<10> fail2; fail2.set(11); // Exception at runtime. Why is this not a static_assert?

3. Size is implementation defined. std::bitset<1> can use up to 8 bytes depending on compiler/platform.

4. Debug performance is 20x slower than Release. In many cases you are going from what would be a single assembly instruction to multiple function calls.

5. Limited options for accessing underlying storage efficiently (for serialization, etc). to_ullong() will work up to 64 bits, but beyond that it will throw exceptions.

6. Uses exceptions. This is still a deal breaker for many.

7. Cannot toggle a range of bits at once. It's either one or all.


Uhm, you know that there are other languages than C++?


Sure, the point is that it doesn’t take language machinery to do it. And as the other post demonstrates, trying to make a standard one will mostly just get people pissed off that you made different tradeoffs.

(But actually a ton of other stdlibs do have one too.)


Arbitrary length integers. That is, integers that actually act like integers.


I like this feature of Python, but it's wise to remember that operations we typically think of as O(1) are usually dependent on the length of the input, so allowing arbitrary-length integers can cause issues (such as DoS opportunities) in places some people don't think to look. Specifically, Python recently changed the str -> int conversion logic to error on strings beyond a certain length (e.g. `int('1'*4301)` will error). [1]

[1] https://news.ycombinator.com/item?id=32753235


They’re also much more expensive in memory, because you have to separately allocate anything you don’t know the size of.

Something being O(N) can also be a security issue since it introduces a timing side channel.

I don’t think I’ve ever needed a bignum in my life or even a 64-bit integer (excluding pointers and file sizes). Of course I’ve used them inside black box crypto libraries but they have to be careful with them because of said security issues.


As implemented in lisps they typically don't use more memory than 64 bit longs. That's because fixnums (typically 61 bit signed values) are unboxed, existing in the "pointer" itself; only when numbers fall outside that range do they become bignums that must be heap allocated (and in practice almost all integers are fixnums.)
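Roughly, the trick looks like this (a schematic C++ sketch of one common tagging scheme; the exact number of tag bits and their placement varies between runtimes):

  #include <cstdint>

  using value = std::uintptr_t;

  // Low bit 1 => a small integer stored in the upper 63 bits.
  // Low bit 0 => a heap pointer (aligned allocations have zero low bits).
  inline bool is_fixnum(value v)             { return (v & 1) != 0; }
  inline value make_fixnum(std::intptr_t n)  { return (static_cast<value>(n) << 1) | 1; }
  inline std::intptr_t fixnum_value(value v) {
      return static_cast<std::intptr_t>(v) >> 1;  // arithmetic shift on mainstream compilers
  }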


This is being considered for Go 2: https://github.com/golang/go/issues/19623

Personally I'm wary of it: it's not hard to think of adversarial scenarios where this could lead to OOM. On the other hand, adversarially overflowing an int can cause plenty of havoc too...


I guess this could arguably make sense as a vocabulary type (provided in the standard library so that everybody agrees on it, not for performance reasons).

Right now though you will probably get such a type for free with a library of common maths features you'd likely want to use with it. So the only question is whether there are several different libraries and people would prefer to mix and match.


One can get reasonable performance from them if the language (and ABI) is designed for it. Many implementations of lisp family languages, for example. Most computations will be on fixnums, with a fallback to bignums as required. The fixnums are unboxed.

It's nice to require analysis for maximum performance rather than for correctness, since maximum performance usually doesn't matter but correctness usually does.


I agree, this could be a good solution for a lot of situations. People fret about the performance impact, but I'm confident that clever engineers could come up with a way to treat numbers within the typical range(say, signed 64 bit) just as they are today, but if an operation would go out of bounds do the expensive, arbitrary length operation. Emitting some kind of extra warning when those expensive operations are actually done would also be nice.


Lisp compilers do this (although the integers are 61-bit, not 64, for fixnums held in general value slots), including compile-time-style warnings when generic arithmetic operations cannot be optimized. Declarations can be added at performance hotspots, but in almost all places the performance impact doesn't matter.


As long as you don't want fast code, sure.


I don't really understand it; wouldn't this just trade the checks and corner cases you have to pay attention to for a different set of corner cases? If I have an integer that can approach INT_MIN, I know I have to pay attention to overflows when using it, and this proposal helps with at most some fraction of the problems that can arise. And what's the great benefit, that it gets a bit more "elegant"? If you want to write "safe" software and have the compiler enforce it, go use Rust/Ada/Spark instead.


Ah, so they want to replace undefined behaviour with a different kind of undefined behaviour.


I’m not sure this is a particularly useful critique. I learned a good amount from this post.


Ada solved these problems in 1983.

More recently, you can use Frama-C to constrain allowable sequences of 0's and 1's for C types and formally verify correctness.

In Ada since 1983 you can, e.g., declare your own 8-bit signed symmetric type without the wart -128 like so:

  type Sym_8 is new Integer range -127 .. 127;
Then this fails at compile time:

  My_Signed_Byte : Sym_8 := -128;
  
SPARK can prove all execution paths through your program are free of such constraint violations. This means safe SPARK code can disable runtime checks and run faster than the safest Rust/Zig dev settings, which insert runtime checks for over/under flow.

In Frama-C, say you want a function that returns an absolute value. This function will fail to verify:

  /*@ ensures (x >= 0 ==> \result == x) &&
              (x < 0 ==> \result == -x);
      assigns \nothing;  */
  int abs (int x) {
      if (x >=0)
          return x;
      return -x;
  }
It fails to verify because you might have x==INT_MIN. So this will verify:

  #include <limits.h>
  /*@ requires x > INT_MIN;
      ensures (x >= 0 ==> \result == x) &&
              (x < 0 ==> \result == -x);
      assigns \nothing;  */
  int abs (int x) {
      if (x >=0)
          return x;
      return -x;
  }


For the sake of completeness: Frama-C (+ WP plugin) can verify that the program conforms to the first specification; however, it indeed can't verify that the program does not contain undefined behavior (which is why we write the second specification).


How does this work in presence of user supplied values?

It has to have a check somewhere, no?


Of course. The checks are performed where the values arrive. If a file or network socket provides a number N that is going to be used as an array index, then N must be proved to be within array bounds at compile time before being used as an array index, which means you are forced to catch the bug at compile time and decide how to signal bad values of N.


Ok, but you can't check _runtime_ values at _compile_ time. If you have user input/file system/network supply N (where 0<N<100) you still HAVE to check it somewhere. You still have to have a runtime check when interacting with outside world.


Of course, that's what I thought I said in my previous comment but I guess I wasn't crystal clear. The formal verification stack basically tells you, "There's no guarantee 0<N<100 holds if you get N from an I/O stream. N can be anything." So it's your job to insert a runtime check for N<=0 or N>=100 and decide how best to handle that, e.g. panic and quit, or clamp N to a valid range if it makes sense, or avoid accessing the array if it makes sense. But the system won't let you use a possibly out of range N as an array subscript so you are forced to code a solution to this special case "at compile time", typically by inserting a runtime check. Though if you are reading data from a trusted ROM you can also use a pragma or special assert to say, "This data can be trusted to have this invariant" and avoid the runtime check.


I would much rather see most of this be generalized to a parametrized bounded integer type, e.g. int<lowerbound, upperbound>.


This would be a big improvement over how a lot of languages handle types - namely, it’d let you stop calling "i16" and "i32" types when they're actually storage sizes.

“Safe” languages like Java try to achieve safety by making you write boilerplate casts here - “short = int + int” is illegal without a (short) cast even though that doesn’t change the meaning of the program.

If it was “x = i32<0,3> + i32<0,3>” then it highlights how silly this is and maybe they wouldn’t make you write a (i16<0,6>) cast.


Exactly. In fact, I would go so far as to make this the _only_ integer type. u16, u32, etc. can just be typedefs.

The ability to be precise with integer ranges could prevent many types of arithmetic and OOB errors.


WUFFS types can be refined as well as having a native representation so e.g.

base.u8[0 ..= 99] is a single byte with values from 0 to 99 inclusive

base.u16[0 ..= 99] is a 16-bit type with the same values

base.u16[100 ..= 400] is still 16 bits but values from 100 to 400 inclusive

In WUFFS all arithmetic overflow or array bounds misses are compile time errors. From the point of view of WUFFS if pixels is an array with 200 elements, and idx is a base.u8[0 ..= 240] then pixels[idx] is a type error as it wouldn't make sense to index 240 into an array of 200 elements so that doesn't compile.

As well as the obvious safety implication, this also means WUFFS can in principle go just as fast as if you were inhumanly careful with a conventional low level language and skipped any safety checks, since it knows it already took care of that. In practice it transpiles to C (which doesn't know what safety checks are).


I think you could actually do that in C++ as a library with a bit of metaprogramming. The arithmetic operations will get weird though unless everything is the same type; for example, what is the result type of int<-10, 10> + int<0, 20>? (The "right" answer is probably a megabyte of compiler errors.)
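Something like this minimal sketch would be the starting point (names made up); propagating the bounds through arithmetic is where it gets interesting:

  #include <cstdint>

  template <std::int64_t Lo, std::int64_t Hi>
  struct bounded {
      static_assert(Lo <= Hi, "empty range");
      std::int64_t value;  // invariant: Lo <= value <= Hi
  };

  // One possible answer: addition widens the bounds, so the result type
  // carries the combined range.
  template <std::int64_t L1, std::int64_t H1, std::int64_t L2, std::int64_t H2>
  constexpr bounded<L1 + L2, H1 + H2> operator+(bounded<L1, H1> a, bounded<L2, H2> b) {
      return {a.value + b.value};
  }

  // bounded<-10, 10> x{3}; bounded<0, 20> y{5};
  // decltype(x + y) is bounded<-10, 30>.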


Surely just `int`?


This is probably not what you're looking for because it might not be high performance. But in hardware synthesis/modeling we use the so-called "AC data types", and I just googled that and found it on GitHub.

https://github.com/hlslibs/ac_types

Arbitrary bit length, symmetrical and unsymmetrical, rounding and wraparound behaviour specifiable etc. etc.


> So here’s my wish: a signed integer where INT_MIN == -INT_MAX

Just set INT_MIN to -127 and never use -128.

> Second, you’re getting symmetry back. All the operations mentioned above are now symmetric and can’t overflow. This makes them a lot easier to reason about.

But you seem to want the actual bits to be symmetric with the sign bit just being a literal + or -. This makes it easier for humans to reason with, but it does NOT make it easier for a CPU's adder to add -15 and +63, which in the current standard signed int implementation does not even require the CPU adder to know that it's a signed int.

Unsigned and signed int adding work exactly the same bit-wise, and can use the same hardwired circuits to process at blazing speed. That's the beauty of the current standard implementation.


Completely agree that moving bit operations to library calls and freeing up their syntax characters makes a lot of sense. Also seems impossible!


Do you feel the same about +-/*% ?

I think people say this because they rarely use bitwise operators, and so are scared of them.


I'm not scared of them, but very rarely use them and believe that is generally true for others. Basic arithmetic operators that operate on numbers (as opposed to bitwise) are used much more frequently, so I feel very differently about them.


I don't, no.

LuaJIT uses a bit library, while Lua 5.3 and up have the native bitwise operators. I prefer the former.


From a computational point of view it would actually be nice to see a type that has no zero and is symmetric around it: ... -3/2, -1/2, 1/2, 3/2 ... It would be trivial to implement efficiently in hardware (as opposed to the checks required for INT_NAN or INT_MIN) and it would make sense in some cases (like some kinds of discretizations).

From a general programming viewpoint pervasive bigints, ints with NaN/+Inf/-Inf and bitfields would be interesting too, but I don't know if they are worth the complexity they introduce.


The symmetric signed int with INT_NAN I would like to have too,

but with the uint63_t I don't quite see the point, though I admit it would be neat having an unsigned int that can't overflow.


Zig has the unsigned int with the nice overflow/UB behaviour. Symmetric int + Int_NAN sounds nice but the performance cost..


Interestingly, in Ethereum smart contract languages (e.g. Solidity and Vyper), having separate types for bit vectors vs integers is common. I don't know why it's caught on there but nowhere else. Obviously those languages have a pretty specialized niche and aren't really going to be used outside of that context.


Does this really help? Sometimes being able to treat integers as bit vectors and vice versa is desirable. I think one needs appropriate functions which treat integers or bit vectors as such, but not necessarily a split into types that would require conversions all over the place.


> Instead, let’s just say arithmetic on INT_NAN is undefined behavior; sanitizers can then insert assertions in debug mode.

I mean, if using that value is supposed to be UB anyway, then the current behavior of standard ints conforms to the spec. Doesn’t that make the proposal largely pointless?


If I want to do bitwise operations on an integer, and I want to set/mask the MSB (maybe it's just a bitfield for me), I mask with INT_NAN? That also sounds very weird.


Wow, someone is advocating for one's complement numerics? In 2022? Good grief.

The author needs to go back and study some history as to why we dropped that idea 40+ years ago.


Maybe this is just too prosaic, but I'd like 64-bit unsigned integers in Java. I'd like Unums, which are not integers, especially the low precision variants.


What is the purpose of having a 64-bit unsigned integer? Could you tell me an example of a use case where 9×10^18 is not enough (so you cannot use a signed 64-bit integer), but 18×10^18 will definitely be enough (so you do not need BigInteger)?


If you find yourself wanting any of this, just use a language with algebraic data types and save everyone (especially yourself) some pain.


Any reason for not adding these as a library instead of expecting them from the language? C++ gives you all the tools for that.


Hardware support. It is trivial to write libraries that can emulate these in many languages, but the performance drop is going to be pretty sharp if you are coding your basic arithmetic by hand. Unless you are writing code for an FPGA...


You could use the same hardware support in your library, couldn't you? Compile flags to check your target hardware and enable/disable code paths depending on whether you have the support.


The one about unsigned integers with one bit missing would be trivial to implement as a library in C++ with no significant downside. All you have to do is make a class wrapping a signed int and put debug checks for whether the high bit is set behind an if constexpr in operator= and the copy constructor. In most other languages this would bring a big performance penalty, but this is one thing that C++ is actually very good at.
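Something along these lines (a minimal sketch; it wraps an unsigned 64-bit value, uses a plain assert rather than if constexpr, and only shows construction and assignment):

  #include <cassert>
  #include <cstdint>

  class uint63 {
      static constexpr std::uint64_t top_bit = std::uint64_t{1} << 63;
      std::uint64_t v_;
  public:
      explicit uint63(std::uint64_t v) : v_(v) { assert((v & top_bit) == 0); }
      uint63& operator=(std::uint64_t v) {
          assert((v & top_bit) == 0);  // compiled out in release (NDEBUG) builds
          v_ = v;
          return *this;
      }
      std::uint64_t get() const { return v_; }
  };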


This is incorrect. For one thing, you're going to need to overload all the operators that mutate the integer in place, e.g. operator++, operator +=, operator -=, operator *=, shifts, etc. And those checks can be quite expensive. For example, your code is probably littered with for loops that are implemented using operator++ or operator+=, and that means on every loop iteration you need to check for overflow, which is expensive if the loop body is simple. GCC and Clang already implement -ftrapv, which does something similar (it adds compiler checks at all places where signed integers can overflow and traps if overflow occurs). I've used -ftrapv in debug builds, but for most programs you don't want it in non-debug builds because the overhead is too high.


the "if constexpr" makes sure the checks only happen in debug mode. you'd definitely want to turn the checks off for release builds.


LLVM already supports types like i31 and i7, but without language support it may be difficult to reliably convince the optimizer that's what you mean.


Sorry, bro. Even if we have time machines, we're in the bad timeline.



