C++ proposal: There are exactly 8 bits in a byte (open-std.org)
288 points by Twirrim 3 months ago | 348 comments



Previously, in JF's "Can we acknowledge that every real computer works this way?" series: "Signed Integers are Two’s Complement" <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p09...>


Maybe specifying that floats are always IEEE floats should be next? Though that would obsolete this Linux kernel classic so maybe not.

https://github.com/torvalds/linux/blob/master/include/math-e...


I'm literally giving a talk next week whose first slide is essentially "Why IEEE 754 is not a sufficient description of floating-point semantics", and I'm sitting here trying to figure out what needs to be thrown out of the talk to make it fit the time slot.

One of the most surprising things about floating-point is that very little is actually IEEE 754; most things are merely IEEE 754-ish, and there's a long tail of fiddly things that are different that make it only -ish.


The IEEE 754 standard has been updated several times, often by relaxing previous mandates in order to make various hardware implementations become compliant retroactively (e.g., adding Intel's 80-bit floats as a standard floating point size).

It'll be interesting to see whether the "-ish" bits are still "-ish" with the current standard.


The first 754 standard (1985) was essentially a formalization of the x87 arithmetic; it defines a "double extended" format. It is not mandatory:

> Implementations should support the extended format corresponding to the widest basic format supported.

If it exists, it is required to have at least as many bits as the x87 long double type.¹

The language around extended formats changed in the 2008 standard, but the meaning didn't:

> Language standards or implementations should support an extended precision format that extends the widest basic format that is supported in that radix.

That language is still present in the 2019 standard. So nothing has ever really changed here. Double-extended is recommended, but not required. If it exists, the significand and exponent must be at least as large as those of the Intel 80-bit format, but they may also be larger.
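Incidentally, a quick way to check which extended format (if any) a given implementation's long double actually is, via <cfloat> (a small sketch):

  #include <cfloat>
  #include <cstdio>

  int main() {
      // 53, 64, and 113 significand bits correspond to "long double is just
      // binary64", the x87 80-bit format, and IEEE binary128 respectively.
      std::printf("LDBL_MANT_DIG = %d, LDBL_MAX_EXP = %d\n",
                  LDBL_MANT_DIG, LDBL_MAX_EXP);
  }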

---

¹ At the beginning of the standardization process, Kahan and Intel engineers still hoped that the x87 format would gradually expand in subsequent CPU generations until it became what is now the standard 128b quad format; they didn't understand the inertia of binary compatibility yet. So the text only set out minimum precision and exponent range. By the time the standard was published in 1985, it was understood internally that they would never change the type, but by then other companies had introduced different extended-precision types (e.g. the 96-bit type in Apple's SANE), so it was never pinned down.


The first 754 standard did remove some 8087 features, mainly the "projective" infinity, and it slightly changed the definition of the remainder function, so it was not completely compatible with the 8087.

The Intel 80387 was made compliant with the final standard, and by that time there were competing FPUs also compliant with the final standard, e.g. the Motorola 68881.


I'm interested in your upcoming talk; do you plan to publish a video or a transcript?


I too would like to see it.


> there's a long tail of fiddly things that are different that make it only -ish.

Perhaps a way to fill some time would be gradually revealing parts of a convoluted Venn diagram or mind-map of the fiddly things. (That is, assuming there's any sane categorization.)


can you give a brief example? Very intrigued.


Hi! I'm JF. I half-jokingly threatened to do IEEE float in 2018 https://youtu.be/JhUxIVf1qok?si=QxZN_fIU2Th8vhxv&t=3250

I wouldn't want to lose the Linux humor tho!


That line is actually from a famous Dilbert cartoon.

I found this snapshot of it, though it's not on the real Dilbert site: https://www.reddit.com/r/linux/comments/73in9/computer_holy_...


Whether double floats can silently have 80 bit accumulators is a controversial thing. Numerical analysis people like it. Computer science types seem not to because it's unpredictable. I lean towards, "we should have it, but it should be explicit", but this is not the most considered opinion. I think there's a legitimate reason why Intel included it in x87, and why DSPs include it.


Numerical analysis people do not like it. Having _explicitly controlled_ wider accumulation available is great. Having compilers deciding to do it for you or not in unpredictable ways is anathema.


It isn't harmful, right? Just like getting a little extra accuracy from a fused multiply-add. It just isn't useful if you can't depend on it.


It can be harmful. In GCC, while compiling a 32-bit executable, using an std::map<float, T> can cause infinite loops or crashes in your program.

This is because when you insert a value into the map, the value being inserted is held at 80-bit precision, and that precision is used when comparing it against existing keys during the traversal of the tree.

After the float is stored in the tree, it's rounded to 32 bits.

This can cause the element to be inserted into the wrong position in the tree, which breaks the assumptions of the algorithm, leading to the crash or infinite loop.

Compiling for 64 bits or explicitly disabling x87 float math makes this problem go away.

I have actually had this bug in production and it was very hard to track down.
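A minimal sketch of the failure mode described above (hypothetical values; the mismatch only appears with 32-bit x87 code generation, e.g. -m32 -mfpmath=387 with optimization enabled):

  #include <map>

  float combine(float a, float b) { return a / b + 0.1f; }  // result may stay in an 80-bit register

  int main() {
      std::map<float, int> m;
      float key = combine(10.0f, 3.0f);
      // During insertion the comparator can see the 80-bit register value of
      // `key`, while the key actually stored in the tree node is the value
      // rounded to 32 bits, so later comparisons can disagree with where the
      // node was placed, breaking the tree's ordering invariant.
      m[key] = 1;
      (void)m.find(key);
  }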


10 years ago, a coworker had a really hard time root-causing a bug. I shoulder-debugged it by noticing the bit patterns: it was a miscompile of LLVM itself by GCC, where GCC was using an x87 fldl/fstpl move for a union { double; int64; }. The active member was actually the int64, and GCC chose an FP move based on what was the first member of the union... but the int64 happened to be the representation of an sNaN, so the instructions quietly transformed it to a qNaN as part of the move. The "fix" was to change the order of the union's members in LLVM. The bug is still open, though it's had recent activity: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58416
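A hedged sketch of the shape of that bug (the sNaN bit pattern below is illustrative):

  #include <cstdint>
  #include <cstdio>

  union Value { double d; int64_t i; };

  int main() {
      Value v;
      v.i = 0x7ff4000000000000LL;   // an int64 whose bit pattern is a signaling NaN
      // If the compiler copies the union with an x87 fldl/fstpl pair (an FP
      // load/store), the sNaN is quieted in transit: the quiet bit gets set,
      // silently changing the integer payload.
      Value w = v;
      std::printf("%016llx\n", static_cast<unsigned long long>(w.i));
  }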


It also affected Emacs compilation, and the fix is in trunk now.

Wow, 11 years for such a banal, minimal code trigger. I really don't quite understand how we can have this scale of infrastructure in operation when this kind of infrastructure software bug exists. And this is not just gcc. The whole working house of cards is an achievement by itself, and also a reminder that good enough is all that is needed.

I also highly doubt that even 1 in 1000 developers could successfully debug this issue if it happened in the wild, and far fewer could actually fix it.


If you think that’s bad let me tell you about the time we ran into a bug in memmove.

It required an unaligned memmove in a 32-bit binary on a 64-bit system, but still! memmove!

And this bug existed for years.

This caused our database replicas to crash every week or so for a long time.


What use case do you have that requires indexing a hashmap by a floating point value? Keep in mind, even with a compliant implementation that isn't widening your types behind your back, you still have to deal with NaN.

In fact, Rust has the Eq trait specifically to keep f32/f64s out of hash tables, because NaN breaks them really bad.


std::map is not a hash map. It's a tree map. It supports range queries, upper and lower bound queries. Quite useful for geometric algorithms.


Rust's BTreeMap, which is much closer to what std::map is, also requires Ord (ie types which claim to possess total order) for any key you can put in the map.

However, Ord is an ordinary safe trait. So while we're claiming to be totally ordered, we're allowed to be lying; the resulting type is crap, but it's not unsafe. And, as with sorting, the algorithms inside these container types, unlike in C or C++, actually must not blow up horribly when we were lying (or, as is common in real software, simply clumsy and mistaken).

The infinite loop would be legal (but I haven't seen it) because that's not unsafe, but if we end up with Undefined Behaviour that's a fault in the container type.

This is another place where in theory C++ gives itself license to deliver better performance at the cost of reduced safety but the reality in existing software is that you get no safety but also worse performance. The popular C++ compilers are drifting towards tacit acceptance that Rust made the right choice here and so as a QoI decision they should ship the Rust-style algorithms.


> you still have to deal with NaN.

Detecting and filtering out NaNs is both trivial and reliable as long as nobody instructs the compiler to break basic floating point operations (so no ffast-math). Dealing with a compiler that randomly changes the values of your variables is much harder.
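For example, a small wrapper that rejects NaN keys up front (a sketch; the function name is made up):

  #include <cmath>
  #include <map>

  bool insert_checked(std::map<float, int>& m, float key, int value) {
      if (std::isnan(key))     // NaN is neither less than nor greater than
          return false;        // anything, so never let it become a key
      m[key] = value;
      return true;
  }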


That's purely a problem of Rust being wrong.

Floats have a total order, Rust people just decided to not use it.


Are you mixing up long double with float?


Old Intel CPUs only had long double; 32-bit and 64-bit floats were a compiler hack on top of the 80-bit floating point stack.


dang that's a good war story.


It’s absolutely harmful. It turns computations that would be guaranteed to be exact (e.g. head-tail arithmetic primitives used in computational geometry) into “maybe it’s exact and maybe it’s not, it’s at the compiler’s whim” and suddenly your tests for triangle orientation do not work correctly and your mesh-generation produces inadmissible meshes, so your PDE solver fails.
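For the curious, the primitives being referred to are "error-free transformations" such as the classic two-sum, which computes a head s and tail err with s + err == a + b exactly, provided every operation is rounded to double (a sketch):

  void two_sum(double a, double b, double& s, double& err) {
      s = a + b;
      double bb = s - a;                 // the portion of b that made it into s
      err = (a - (s - bb)) + (b - bb);   // the portion that rounding discarded
      // With silent 80-bit intermediates these subtractions no longer cancel
      // exactly, and err is wrong: precisely the failure described above.
  }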


Thank you, I found this hint very interesting. Is there a source you wouldn't mind pointing me to for those "head, tail" methods?

I am assuming it relates to the kinds of "variable precision floating point with bounds" methods used in CGAL and the like; Googling turns up this survey paper:

https://inria.hal.science/inria-00344355/PDF/p.pdf

Any additional references welcome!


Note: here is a good starting point for the issue itself: http://www.cs.cmu.edu/~quake/triangle.exact.html

References for the actual methods used in Triangle: http://www.cs.cmu.edu/~quake/robust.html


If not done properly, double rounding (rounding to extended precision and then rounding to working precision) can actually introduce a larger approximation error than rounding directly to the working precision. So it can actually make some numerical algorithms perform worse.


I suppose it could be harmful if you write code that depends on it without realizing it, and then something changes so it stops doing that.


I get what you mean and agree, and have seen almost traumatized rants against ffast-math from the very same people.

After digging, I think this is the kind of thing I'm referring to:

https://people.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf

https://news.ycombinator.com/item?id=37028310

I've seen other course notes, I think also from Kahan, extolling 80-bit hardware.

Personally I am starting to think that, if I'm really thinking about precision, I had maybe better just use fixed point, but this again is just a "lean" that could prove wrong over time. Somehow we use floats everywhere and it seems to work pretty well, almost unreasonably so.


Yeah. Kahan was involved in the design of the 8087, so he’s always wanted to _have_ extended precision available. What he (and I, and most other numerical analysts) are opposed to is the fact that (a) language bindings historically had no mechanism to force rounding to float/double when necessary, and (b) compilers commonly spilled x87 intermediate results to the stack as doubles, leading to intermediate rounding that was extremely sensitive to optimization and subroutine calls, making debugging numerical issues harder than it should be.

Modern floating-point is much more reproducible than fixed-point, FWIW, since it has an actual standard that’s widely adopted, and these excess-precision issues do not apply to SSE or ARM FPUs.
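The classic workaround when that control was missing was to force a store to memory (e.g. via volatile), which rounds to the declared type; C99 later required plain assignments and casts to discard excess precision, and SSE/ARM math makes the issue moot. A hedged sketch:

  double sum_rounded(const double* v, int n) {
      double s = 0.0;
      for (int i = 0; i < n; ++i) {
          volatile double t = s + v[i];  // the volatile store forces rounding
          s = t;                         // away any x87 excess precision
      }
      return s;
  }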


I was curious about float16, and TIL that the 2008 revision of the standard includes it as an interchange format:

https://en.wikipedia.org/wiki/IEEE_754-2008_revision


Note that this type (which Rust will/ does in nightly call "f16" and a C-like language would probably name "half") is not the only popular 16-bit floating point type, as some people want to have https://en.wikipedia.org/wiki/Bfloat16_floating-point_format


The IEEE FP16 format is what is useful in graphics applications, e.g. for storing color values.

The Google BF16 format is useful essentially only for machine learning/AI applications, because its low precision is insufficient for anything else. BF16 trades precision for an exponent range equal to FP32's, which makes overflows and underflows less likely.
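For reference, the bit layouts (sign / exponent / significand) are 1/5/10 for IEEE binary16 and 1/8/7 for bfloat16. Since bfloat16 is just the top half of an IEEE binary32, a conversion can be a simple truncation (a sketch that ignores rounding and NaN payloads):

  #include <cstdint>
  #include <cstring>

  uint16_t float_to_bf16(float f) {
      uint32_t bits;
      std::memcpy(&bits, &f, sizeof bits);       // grab the binary32 bit pattern
      return static_cast<uint16_t>(bits >> 16);  // keep sign, exponent, top 7 mantissa bits
  }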


Permalink (press 'y' anywhere on GitHub): https://github.com/torvalds/linux/blob/4d939780b70592e0f4bc6....


That file hasn't been touched in over 19 years. I don't think we have to worry about the non-permalink url breaking any time soon.


Which one? Remember the decimal IEEE 754 floating point formats exist too. Do folks in banking use IEEE decimal formats? I remember we used to have different math libs to link against depending, but this was like 40 years ago.


Binding float to the IEEE 754 binary32 format would not preclude use of decimal formats; they have their own bindings (e.g. _Decimal64 in C23). (I think they're still a TR for C++, but I haven't been keeping track).


Nothing prevents banks (or anyone else) from using a compiler where "float" means binary floating point while some other native or user-defined type supports decimal floating point. In fact, that's probably for the best, since they'll probably have exacting requirements for that type so it makes sense for the application developer to write that type themselves.


I was referring to banks using decimal libraries because they work in base 10 numbers, and I recall a big announcement many years ago when the stock market officially switched from fractional stock pricing to cents "for the benefit of computers and rounding", or some such excuse. It always struck me as strange, since binary fixed and floating point represent those particular quantities exactly, without rounding error. Now with normal dollars and cents calculations, I can see why a decimal library might be preferred.
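The exactness point, concretely: the old binary fractions (halves, quarters, eighths) are exactly representable in a binary double, while a decimal cent is not (a small sketch):

  #include <cstdio>

  int main() {
      double eighth = 1.0 / 8.0;    // 0.125 is a power of two: stored exactly
      double cent   = 1.0 / 100.0;  // 0.01 is not a binary fraction: stored as
                                    // the nearest representable double instead
      std::printf("%.20f\n%.20f\n", eighth, cent);
  }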


Java is big for banks, and `BigInteger` is common for monetary types.


At the very least, division by zero should not be undefined for floats.


Love it


During an internship in 1986 I wrote C code for a machine with 10-bit bytes, the BBN C/70. It was a horrible experience, and the existence of the machine in the first place was due to a cosmic accident of the negative kind.


I wrote code on a DECSYSTEM-20, the C compiler was not officially supported. It had a 36-bit word and a 7-bit byte. Yep, when you packed bytes into a word there were bits left over.

And I was tasked with reading a tape with binary data in 8-bit format. Hilarity ensued.


That is so strange. If it were 9-bit bytes, that would make sense: 8 bits + parity. Then a word is just 32 bits + 4 parity bits.


7 bits matches ASCII, so you can represent the entire ASCII character set, and it means you get to fit one more character per word.

Using RADIX-50 or SIXBIT, you could fit more, but you'd lose ASCII compatibility.


8 bits in a byte exist in the first place because "obviously" a byte is a 7 bit char + parity.

(*) For some value of "obviously".


Hah. Why did they do that?


Which part of it?

8 bit tape? Probably the format the hardware worked in... not actually sure, I haven't used real tapes, but it's plausible.

36 bit per word computer? Sometimes 0..~4Billion isn't enough. 4 more bits would get someone to 64 billion, or +/- 32 billion.

As it turns out, my guess was ALMOST correct

https://en.wikipedia.org/wiki/36-bit_computing

Paraphrasing: legacy keying systems were based on records of up to 10 printed decimal digits of accuracy for input. 35 bits would be required to match the +/- input range, but 36 works better as a machine word and allows operations on 6 x 6-bit (yuck?) characters; some 'smaller' machines used a 36-bit large word with 12- or 18-bit small words. Why the yuck? That's only 64 characters total, so these systems only supported ALWAYS UPPERCASE text, numeric digits, and some other characters.


Somehow this machine found its way onto The Heart of Gold in a highly improbable chain of events.


I programmed the Intellivision CPU, which had a 10-bit "decle". A wacky machine. It wasn't powerful enough for C.


I've worked on a machine with 9-bit bytes (and 81-bit instructions) and others with 6-bit ones - neither has a C compiler


The Nintendo 64 had 9-bit RAM, but C viewed it as 8-bit. The 9th bit was only there for the RSP (GPU).


I think the PDP-10 could have 9-bit bytes, depending on decisions you made in the compiler. I notice it's hard to Google information about this, though. People say lots of confusing, conflicting things. When I google PDP-10 byte size, it says a C++ compiler chose to represent char as 36 bits.


PDP-10 byte size is not fixed. Bytes can be 0 to 36 bits wide. (Sure, 0 is not very useful; still legal.)

I don't think there is a C++ compiler for the PDP-10. One of the C compilers does have a 36-bit char type.


I was summarizing this from a Google search. https://isocpp.org/wiki/faq/intrinsic-types#:~:text=One%20wa....

As I read it, this link may be describing a hypothetical rather than real compiler. But I did not parse that on initial scan of the Google result.


Do you have any links/info on how that 0-bit byte worked? It sounds like just the right thing for a Friday afternoon read ;D


It should be in the description for the byte instructions: LDB, DPB, IBP, and ILDB. http://bitsavers.org/pdf/dec/pdp10/1970_PDP-10_Ref/

Basically, loading a 0-bit byte from memory gets you a 0. Depositing a 0-bit byte will not alter memory, but may do an ineffective read-modify-write cycle. Incrementing a 0-bit byte pointer will leave it unchanged.


10-bit arithmetic is actually not uncommon on FPGAs these days and is used in production in relatively modern applications.

10-bit C, however, ..........


How so? Arithmetic on FPGAs usually uses the minimum size that works, because any size over that will use more resources than needed.

9-bit bytes are pretty common in block RAM though, with the extra bit used either for ECC or for user storage.


10-bit C might be close to non-existent, but I've heard that quite a few DSP are word addressed. In practice this means their "bytes" are 32 bits.

  sizeof(uint32_t) == 1
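In C and C++ terms that's legal because a byte is CHAR_BIT bits, not eight; code that assumes octets often guards itself like this (a sketch):

  #include <climits>

  static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
  // On a word-addressed DSP where CHAR_BIT == 32, sizeof(char) is still 1,
  // sizeof(uint32_t) == 1 as described above, and the optional exact-width
  // type uint8_t simply does not exist, so such code fails to compile rather
  // than misbehaving at runtime.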


C itself was developed on machines that had 18 bit ints.


B was developed on the PDP-7. C was developed on the PDP-11.


D made a great leap forward with the following:

1. bytes are 8 bits

2. shorts are 16 bits

3. ints are 32 bits

4. longs are 64 bits

5. arithmetic is 2's complement

6. IEEE floating point

and a big chunk of wasted time trying to abstract these away and getting it wrong anyway was saved. Millions of people cried out in relief!

Oh, and Unicode was the character set. Not EBCDIC, RADIX-50, etc.


Zig is even better:

1. u8 and i8 are 8 bits.

2. u16 and i16 are 16 bits.

3. u32 and i32 are 32 bits.

4. u64 and i64 are 64 bits.

5. Arithmetic is an explicit choice. '+' overflowing is illegal behavior (it will crash in Debug and ReleaseSafe), '+%' is 2's complement wrapping, and '+|' is saturating arithmetic. Edit: forgot to mention @addWithOverflow(), which provides a tuple of the original type and a u1; there's also std.math.add(), which returns an error on overflow.

6. f16, f32, f64, f80, and f128 are the IEEE floating-point types of the respective bit lengths.

The question of the length of a byte doesn't even matter. If someone wants to compile to machine whose bytes are 12 bits, just use u12 and i12.


Zig allows any uX and iX in the range of 1 - 65,535, as well as u0


u0?? Why?


Sounds like zero-sized types in Rust, where they are used as marker types (e.g. this struct owns this lifetime). They can also be used to turn a HashMap into a HashSet by storing a zero-sized value. In Go, a struct member of [0]func() (an array of functions with exactly 0 elements) is used to make a type uncomparable, since func() cannot be compared.


To avoid corner cases in auto-generated code?


To represent 0 without actually storing it in memory


Same deal with Rust.


I've heard that Rust wraps around by default?


Rust has two possible behaviours: panic or wrap. By default, debug builds panic and release builds wrap. Both behaviours are 100% defined, so the compiler can't do any shenanigans.

There are also helper functions and types for unchecked/checked/wrapping/saturating arithmetic.


This is the way.


LLVM has:

i1 is 1 bit

i2 is 2 bits

i3 is 3 bits

i8388608 is 2^23 bits

(https://llvm.org/docs/LangRef.html#integer-type)

On the other hand, it doesn’t make a distinction between signed and unsigned integers. Users must take care to use special signed versions of operations where needed.


How does 5 work in practice? Surely no one is actually checking if their arithmetic overflows, especially from user-supplied or otherwise external values. Is there any use for the normal +?


You think no one checks if their arithmetic overflows?


I'm sure it's not literally no one, but I bet the percentage of additions that have explicit checks for overflow is, for all practical purposes, indistinguishable from 0.


Lots of secure code checks for overflow

    fillBufferWithData(buffer, data, offset, size)
You want to know that offset + size doesn't wrap past 32 bits (or 64) and end up with nonsense and a security vulnerability.
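A sketch of how that check commonly looks in C/C++ today, using the GCC/Clang __builtin_add_overflow (the function and parameter names here are made up):

  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  bool fill_buffer_with_data(uint8_t* buffer, size_t buffer_len,
                             const uint8_t* data, size_t offset, size_t size) {
      size_t end;
      if (__builtin_add_overflow(offset, size, &end))  // catches the wraparound
          return false;
      if (end > buffer_len)                            // catches out-of-bounds
          return false;
      std::memcpy(buffer + offset, data, size);
      return true;
  }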


Eh I like the nice names. Byte=8, short=16, int=32, long=64 is my preferred scheme when implementing languages. But either is better than C and C++.


It would be "nice" if not for C setting a precedent for these names to have unpredictable sizes. Meaning you have to learn the meaning of every single type for every single language, then remember which language's semantics apply to the code you're reading. (Sure, I can, but why do I have to?)

[ui][0-9]+ (and similar schemes) on the other hand anybody can understand at the first glance.


> D made a great leap forward

> and a big chunk of wasted time trying to abstract these away and getting it wrong anyway was saved. Millions of people cried out in relief!

Nah. It is actually pretty bad. Type names with explicit sizes (u8, i32, etc) are way better in every way.


> Type names with explicit sizes (u8, i32, etc) are way better in every way

Until one realizes that the entire namespace of innn, unnn, fnnn, etc., is reserved.


You are right, they come with a cost.


"1. bytes are 8 bits"

How big is a bit?


This doesn't feel like a serious question, but in case this is still a mystery to you… the name bit is a portmanteau of binary digit, and as indicated by the word "binary", there are only two possible digits that can be used as values for a bit: 0 and 1.


So trinary and quaternary digits are trits and quits?


Yes, trit is commonly used for ternary logic. "quit" I have never heard in such a context.


So shouldn't a two-state datum be a twit ?


A bit is a measure of information theoretical entropy. Specifically, one bit has been defined as the uncertainty of the outcome of a single fair coin flip. A single less than fair coin would have less than one bit of entropy; a coin that always lands heads up has zero bits, n fair coins have n bits of entropy and so on.

https://en.m.wikipedia.org/wiki/Information_theory

https://en.m.wikipedia.org/wiki/Entropy_(information_theory)


That is a bit in information theory. It has nothing to do with the computer/digital engineering term being discussed here.


This comment, I feel sure, would repulse Shannon in the deepest way. A (digital, stored) bit abstractly seeks to encode, and make useful through computation, the properties of information theory.

Your comment must be sarcasm or satire, surely.


I do not know or care what Mr. Shannon would think. What I do know is that the base you choose for the logarithm in the entropy equation has nothing to do with the number of bits you assign to a word on a digital architecture :)


How philosophical do you want to get? Technically, voltage is a continuous signal, but we sample only at clock cycle intervals, and if the sample at some cycle is below a threshold, we call that 0. Above, we call it 1. Our ability to measure whether a signal is above or below a threshold is uncertain, though, so for values where the actual difference is less than our ability to measure, we have to conclude that a bit can actually take three values: 0, 1, and we can't tell but we have no choice but to pick one.

The latter value is clearly less common than 0 and 1, but how much less? I don't know, but we have to conclude that the true size of a bit is probably something more like 1.00000000000000001 bits rather than 1 bit.


> How big is a bit?

A quarter nybble.


A bit is either a 0 or 1. A byte is the smallest addressable piece of memory in your architecture.


Technically the smallest addressable piece of memory is a word.


I don't think the term word has any consistent meaning. Certainly x86 doesn't use the term word to mean smallest addressable unit of memory. The x86 documentation defines a word as 16 bits, but x86 is byte addressable.

ARM is similar, ARM processors define a word as 32-bits, even on 64-bit ARM processors, but they are also byte addressable.

As best as I can tell, it seems like a word is whatever the size of the arithmetic or general purpose register is at the time that the processor was introduced, and even if later a new processor is introduced with larger registers, for backwards compatibility the size of a word remains the same.


Depends on your definition of addressable.

Lots of CISC architectures allow memory accesses in various units even if they call general-purpose-register-sized quantities "word".

Iirc the C standard specifies that all memory can be accessed via char*.


Every ISA I've ever used has used the term "word" to describe a 16- or 32-bit quantity, while having instructions to load and store individual bytes (8 bit quantities). I'm pretty sure you're straight up wrong here.


That's only true on a word-addressed machine; most CPUs are byte-addressed.


The difference between address A and address A+1 is one byte. By definition.

Some hardware may raise an exception if you attempt to retrieve a value at an address that is not a (greater than 1) multiple of a byte, but that has no bearing on the definition of a byte.


Which … if your heap always returns N bit aligned values, for some N … is there a name for that? The smallest heap addressable segment?


If your detector is sensitive enough, it could be just a single electron that's either present or absent.


At least 2 or 3


Depends on your physical media.


That's a bit self-pat-on-the-back-ish, isn't it, Mr. Bright, the author of D language? :)


Of course!

Over the years I've known some engineers who, as a side project, wrote some great software. Nobody was interested in it. They'd come to me and ask why that was. I suggested writing articles about their project and being active on the forums. Otherwise, who would ever know about it?

They said that was unseemly, and wouldn't do it.

They wound up sad and bitter.

The "build it and they will come" is a stupid Hollywood fraud.

BTW, the income I receive from D is $0. It's my gift. You'll also note that I've suggested many times improvements that could be made to C, copying proven ideas in D. Such as this one:

https://www.digitalmars.com/articles/C-biggest-mistake.html

C++ has already adopted many ideas from D.


> https://www.digitalmars.com/articles/C-biggest-mistake.html

To be fair, this one lies on the surface for anyone trying to come up with an improved C. It's one of the first things that gets corrected in nearly all C derivatives.

> C++ has already adopted many ideas from D.

Do you have a list?

Especially for the "adopted from D" bit, rather than it being an evolutionary and logical improvement to the language.


Yeah, this is something Java got right as well. It got "unsigned" wrong, but it got standardizing primitive bits correct

byte = 8 bits

short = 16

int = 32

long = 64

float = 32 bit IEEE

double = 64 bit IEEE


I like the Rust approach more: usize/isize are the native integer types, and with every other numeric type, you have to mention the size explicitly.

On the C++ side, I sometimes use an alias that contains the word "short" for 32-bit integers. When I use them, I'm explicitly assuming that the numbers are small enough to fit in a smaller than usual integer type, and that it's critical enough to performance that the assumption is worth making.


<cstdint> has int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, and uint64_t. I still go back and forth between uint64_t, size_t, and unsigned int, but am defaulting to uint64_t more and more, even if it doesn't matter.


That doesn't really fix it, because of the integral promotion rules.
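A small illustration of what the parent means (assuming the usual 32-bit int):

  #include <cstdint>

  int main() {
      uint8_t a = 200, b = 100;
      auto c = a + b;     // both operands promote to int: c is an int equal to
                          // 300, not a uint8_t that wrapped around to 44
      uint16_t x = 65535;
      auto y = x * x;     // also promotes to int; 65535 * 65535 overflows a
                          // 32-bit int, which is signed-overflow UB
      (void)c; (void)y;
  }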


hindsight has its advantages


> you have to mention the size explicitly

It's unbelievably ugly. Every piece of code working with any kind of integer screams "I am hardware dependent in some way".

E.g. in a structure representing an automobile, the number of wheels has to be some i8 or i16, which looks ridiculous.

Why would you take a language in which you can write functional pipelines over collections of objects, and make it look like assembler.


If you don't care about the size of your number, just use isize or usize.

If you do care, then isn't it better to specify it explicitly than trying to guess it and having different compilers disagreeing on the size?


A type called isize is some kind of size. It looks wrong for something that isn't a size.


Then just define a type alias, which is good practice if you want your types to be more descriptive: https://doc.rust-lang.org/reference/items/type-aliases.html


Nope! Because then you will also define an alias, and Suzy will define an alias, and Bob will define an alias, ...

We should all agree on int and uint; not some isize nonsense, and not bobint or suzyint.


Alas, it's pretty clear that we won't!


Ok, it is obvious that you are looking for something to complain about and don't want to find a solution. That is not a productive attitude in life, but whatever floats your boat. Have a good day.


> looking for something to complain about

You know, that describes pretty much everyone who has anything to do with Rust.

"My ls utility isn't written in Rust, yikes! Let's fix that!"

"The comments under this C++-related HN submission aren't talking about Rust enough, yikes! Let's fix that!"

I'm obviously pointing to a solution: have a standard module that any Rust program can depend on coming from the language, which has a few sanely named types. Rather than every program defining its own.


You insist that we should all agree on something but you don't specify what.


Actually, if you don't care about the size of your small number, use `i32`. If it's a big number, use `i64`.

`isize`/`usize` should only be used for memory-related quantities: that's why they were renamed from `int`/`uint`.


If you use i32, it looks like you care. Without studying the code, I can't be sure that it could be changed to i16 or i64 without breaking something.

Usually, I just want the widest type that is efficient on the machine, and I don't want it to have an inappropriate name. I don't care about the wasted space, because it only matters in large arrays, and often not even then.


> If you use i32, it looks like you care.

In Rust, that's not really the case. `i32` is the go-to integer type.

`isize` on the other hand would look really weird in code — it's an almost unused integer type. I also prefer having integers that don't depend on the machine I'm running them on.


Some 32 bit thing being the go to integer type flies against software engineering and CS.

It's going to get expensive on a machine that has only 64 bit integers, which must be accessed on 8 byte aligned boundaries.


And which machine is that? The only computers that I can think of with only 64-bit integers are the old Cray vector supercomputers, and they used word addressing to begin with.


It will likely be common in another 25 to 30 years, as 32 bit systems fade into the past.

Therefore, declaring that int32 is the go to integer type is myopic.

Forty years ago, a program like this could be run on a 16 bit machine (e.g. MS-DOS box):

  #include <stdio.h>

  int main(int argc, char **argv)
  {
    while (argc-- > 0)
      puts(*argv++);
    return 0;
  }
int was 16 bits. That was fine; you would never pass anywhere near 32000 arguments to a program.

Today, that same program does the same thing on a modern machine with a wider int.

Good thing that some int16 had not been declared the go to integer type.

Rust's integer types are deliberately designed (by people who know better) in order to be appealing to people who know shit all about portability and whose brains cannot handle reasoning about types with a bit of uncertainty.


Sorry, but I fail to see where the problem is. Any general purpose ISA designed in the past 40 years can handle 8/16/32 bit integers just fine regardless of the register size. That includes the 64-bit x86-64 or ARM64 from which you are typing.

There are a few historical architectures that couldn't handle smaller integers, like the first generation Alpha, but:

    a) those are long dead.

    b) hardware engineers learnt from their mistake and no modern general purpose architecture has repeated it (specialized architectures like DSP and GPU are another story though).

    c) worst case scenario, you can simulate it in software.


Except defining your types with arbitrary names is still hardware dependent, it's just now something you have to remember or guess.

Can you remember the name for a 128 bit integer in your preferred language off the top of your head? I can intuit it in Rust or Zig (and many others).

In D it's... oh... it's int128.

https://dlang.org/phobos/std_int128.html

https://github.com/dlang/phobos/blob/master/std/int128.d


In C it will almost certainly be int128_t, when standardized. 128 bit support is currently a compiler extension (found in GCC, Clang and others).

A type that provides a 128 bit integer exactly should have 128 in its name.

That is not the argument at all.

The problem is only having types like that, and stipulating nonsense like that the primary "go to" integer type is int32.


It was actually supposed to be `cent` and `ucent`, but we needed a library type to stand in for it at the moment.


Is it any better calling it an int when it's assumed to be an i32 and 30 of the bits are wasted?


what you call things matters, so yes, it is better.


Yep. Pity about getting chars / string encoding wrong though. (Java chars are 16 bits).

But it’s not alone in that mistake. All the languages invented in that era made the same mistake. (C#, JavaScript, etc).


Java was just unlucky; it standardised its strings at the wrong time (when Unicode was 16-bit code points): Java was announced in May 1995, and the following comment from the Unicode history wiki page makes it clear what happened: "In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. ..."


Java strings are byte[]s if their contents contain only Latin-1 values (the first 256 codepoints of Unicode). This shipped in Java 9.

JEP 254: Compact Strings

https://openjdk.org/jeps/254


What's the right way?


UTF-8

When D was first implemented, circa 2000, it wasn't clear whether UTF-8, UTF-16, or UTF-32 was going to be the winner. So D supported all three.


utf8, for essentially the reasons mentioned in this manifesto: https://utf8everywhere.org/


Yep. Notably supported by go, python3, rust and swift. And probably all new programming languages created from here on.


I would say anyone mentioning a specific encoding / size just wants to see the world burn. Unicode is variable-length on several levels. How many people want to deal with the fact that their text could be non-normalized, or want the ability to cut out individual "char" elements only to get a nonsensical result because the following elements were logically connected to that char? Give developers a decent high-level abstraction and don't force them to deal with the raw bits unless they ask for it.


I think this is what Rust does: if I remember correctly, it provides APIs on String to enumerate the characters accurately, meaning not necessarily byte by byte.


https://pastebin.com/raw/D7p7mRLK

My comment in a pastebin. HN doesn't like unicode.

You need this crate to deal with it in Rust, it's not part of the base libraries:

https://crates.io/crates/unicode-segmentation

The languages that have this kind of feature built-in in the standard library, to my knowledge, are Swift, JavaScript, C# and Java. Swift is the only one, of those four, that treat operating on graphemes as the default. JavaScript requires Intl.Segmenter, C# requires StringInfo, Java requires BreakIterator.

By the way, Python, the language caused so much hurt with their 2.x->3.x transition promising better unicode support in return for this pain couldn't even do this right. There is no concept of graphemes in the standard library. So much for the batteries included bit.

  >>> test = " "
  >>> [char for char in test]
  ['', '\u200d', '', '\u200d', '', '\u200d', '']
  >>> len(test)
  7

In JavaScript REPL (nodejs):

  > let test = " "
  undefined
  > [...new Intl.Segmenter().segment(test)][0].segment;
  ' '
  > [...new Intl.Segmenter().segment(test)].length;
  1

Works as it should.

In python you would need a third party library.

Swift is truly the nicest of programming languages as far as strings are concerned. It just works as it always should have been.

  let test = " "

  for char in test {
      print(char)
  }

  print(test.count)

Output:

  1

  [Execution complete with exit code 0]

I, as a non-Apple user, feel quite the Apple envy whenever I think about swift. It's such a nice language, but there's little ecosystem outside of Apple UIs.

But man, no using third party libraries, or working with a wrapper segmenter class or iterator. Just use the base string literals as is. It. Just. Works.


For context, it looks like you’re talking about iterating by grapheme clusters.

I understand how iterating through a string by grapheme clusters is convenient for some applications. But it’s far from obvious to me that doing so should be the language’s default. Dealing with grapheme clusters requires a Unicode database, which needs to live somewhere and needs to be updated continuously as Unicode grows. (Should rust statically link that library into every app that uses it?)

Generally there are 3 ways to iterate a string: by UTF8 bytes (or ucs2 code points like Java/js/c#), by Unicode codepoint or by grapheme clusters. UTF8 encoding comes up all the time when encoding / decoding strings - like, to json or when sending content over http. Codepoints are, in my opinion, the correct approach when doing collaborative editing or patching strings. And grapheme clusters are useful in frontend user interfaces - like when building a terminal.

Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least. Tell me - why should grapheme clusters be the default way to iterate over a string? I can see the argument in Swift, which is a language built for frontend UI. But in a systems language like rust? That seems like a terrible default to me. UTF8 bytes are by far the most useful representation for strings in systems code, since from the pov of systems code, strings are usually just data.


This was not meant as criticism of Rust in particular (though, while it shouldn't be the default behavior of strings in a systems language, surely at least an official implementation of a wrapper should exist?), but high-level languages with tons of baggage, like Python, should definitely provide the correct way to handle strings. The amount of software I've seen that is unable to properly handle strings, because the language didn't provide the required grapheme handling and the developer was also not aware of the reality of graphemes and Unicode...

You mention terminals; yes, that's one of the areas where graphemes are an absolute must, but pretty much any time you are going to do something to text, like deciding "I am going to put a line break here so that the text doesn't overflow beyond the box, beyond this A4 page I want to print, beyond the browser's window", grapheme handling is involved.

Any time a user is asked to input something too. I've seen most software take the "iterate over characters" approach to real time user input and they break down things like those emojis into individual components whenever you paste something in.

For that matter, backspace doesn't work properly in software you would expect to do better than that. Put the emoji from my pastebin in Microsoft Edge's search/URL bar, then hit backspace and see what happens. While the browser displays the emoji correctly, the input field treats it the way Python segments it in my example: you need to press backspace 7 times to delete it. 7 times! Windows Terminal, on the other hand, has the quirk of showing a lot of extra spaces after the emoji (despite displaying the emoji correctly too) and will also require 11 backspaces to delete it.

Notepad handles it correctly: press backspace once, it's deleted, like any normal character.

> Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least.

This doesn't say anything about grapheme clusters being useless. I've cited examples of popular software doing the wrong thing precisely because, like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.

The dismissiveness over more sane string handling as a standard is not unlike C++ developers pretending that developers are doing the right thing with memory management so we don't need a GC (or rust's ownership paradigm). Nonsense.


Those are good examples! Notably, all of them are in reasonably low level, user-facing code.

Your examples are implementing custom text input boxes (Excel, Edge), line breaks while printing, and implementing a terminal application. I agree that in all of those cases, grapheme cluster segmentation is appropriate. But that doesn't make grapheme cluster based iteration "the correct way to handle strings". There's no "correct"! There are at least 3 different ways to iterate through a string, and different applications have different needs.

Good languages should make all of these options easy for programmers to use when they need them. Writing a custom input box? Use grapheme clusters. Writing a text based CRDT? Treat a string as a list of unicode codepoints. Writing an HTTP library? Treat the headers and HTML body as ASCII / opaque bytes. Etc.

I take the criticism that rust makes grapheme iteration harder than the others. But eh, rust has truly excellent crates for that within arms reach. I don't see any advantage in moving grapheme based segmentation into std. Well, maybe it would make it easier to educate idiot developers about this stuff. But there's no real technical reason. Its situationally useful - but less useful than lots of other 3rd party crates like rand, tokio and serde.

> like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.

It says that in 30+ years of programming, I've never programmed a text input field from scratch. Why would I? That's the job of the operating system. Making my own sounds like a huge waste of time.


While I don't agree with not having unsigned as part of the primitive types, and I look forward to Valhalla fixing that, it was based on the experience that most devs don't get unsigned arithmetic right.

"For me as a language designer, which I don't really count myself as these days, what "simple" really ended up meaning was could I expect J. Random Developer to hold the spec in his head. That definition says that, for instance, Java isn't -- and in fact a lot of these languages end up with a lot of corner cases, things that nobody really understands. Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex. The language part of Java is, I think, pretty simple. The libraries you have to look up."

http://www.gotw.ca/publications/c_family_interview.htm


I mean practically speaking in C++ we have (it just hasn't made it to the standard):

1. char 8 bit

2. short 16 bit

3. int 32 bit

4. long long 64 bit

5. arithmetic is 2s complement

6. IEEE floating point (float is 32, double is 64 bit)

Along with other stuff like little endian, etc.

Some people just mistakenly think they can't rely on such stuff, because it isn't in the standard. But they forget that having an ISO standard comes on top of what most other languages have, which rely solely on the documentation.


I work every day with real-life systems where int can be 32 or 64 bits, long long can be 64 or 128 bits, long double can be 64 or 80 or 128 bits, some systems do not have IEEE 754 floating point (no denormals!) some are big endian and some are little endian. These things are not in the language standard because they are not standard in the real world.

Practically speaking, the language is the way it is, and has succeeded so well for so long, because it meets the requirements of its application.


There are also people who write COBOL for a living. What you say is not relevant at all for 99.99% of C++ code written today. Also, all compilers can be configured to be non-standard compliant in many different ways, the classic example being -fno-exceptions. Nobody says all kinds of using a standardized language must be standard conformant.


> (it just hasn't made it to the standard)

That's the problem


You are aware that D and rust and all the other languages this is being compared to don't even have an ISO standard, right?


Yeah, so their documentation serves as the authority on how you're supposed to write your code for it to be "correct D" or "correct Rust". The compiler implementors write their compilers against the documentation (and vice versa). That documentation is clear on these things.

In C, the ISO standard is the authority on how you're supposed to write your code for it to be "correct C". The compiler implementors write their compilers against the ISO standard. That standard is not clear on these things.


I don't think this is true. The target audience of the ISO standard is the implementers of compilers and other tools around the language. Even the people involved in creating it make that clear by publishing other material like the core guidelines, conference talks, books, online articles, etc., which are targeted to the users of the language.


Core guidelines, conference talks, books, online articles, etc. are not authoritative. If I really want to know if my C code is correct C, I consult the standard. If the standard and an online article disagrees, the article is wrong, definitionally.


Correction: if you want to know if your compiler is correct, you look at the ISO standard. But even as a compiler writer, the ISO standard is not exhaustive. For example the ISO standard doesn't define stuff like include directories, static or dynamic linking, etc.


Some people are still dealing with DSPs.

https://thephd.dev/conformance-should-mean-something-fputc-a...

Me? I just dabble with documenting an unimplemented "50% more bits per byte than the competition!" 12-bit fantasy console of my own invention - replete with inventions such as "UTF-12" - for shits and giggles.


Yes, I'm trying to figure out which are still relevant and whether they target a modern C++, or intend to. I've been asking for a few years and haven't gotten positive answers. The only one that been brought up is TI, I added info in the updated draft: https://isocpp.org/files/papers/D3477R1.html


> and would benefit from C23’s _BigInt

s/_BigInt/_BitInt/


Dang, will fix when I get home! Thanks Nick, and hi!



They can just target C++23 or earlier, right? I have a small collection of SHARCs but I am not going to go crying to the committee if they make C++30 (or whatever) not support CHAR_BIT=32


no doubt you've got your brainfuck compiler hard at work on this ...


TI DSP Assembler is pretty high level, it's "almost C" already.

Writing geophysical | military signal and image processing applications on custom DSP clusters is suprisingly straightforward and doesn't need C++.

It's a RISC architecture optimised for DSP | FFT | Array processing with the basic simplification that char text is for hosts, integers and floats are at least 32 bit and 32 bits (or 64) is the smallest addressable unit.

Fantastic architecture to work with for numerics and deep computational pipelines: once "primed", you push in raw acquisition samples in chunks every clock cycle and extract processed moving-window data chunks every clock cycle.

A single ASM instruction in a cycle can accumulate totals from a vector multiplication and modulo-update indexes on three vectors (two inputs and one output).

Not your mama's brainfuck.


Didn't the PDP-8 have 12-bit bytes?


Is C++ capable of deprecating or simplifying anything?

Honest question, I haven't followed closely. rand() is broken, I'm told unfixably, and last I heard it still wasn't deprecated.

Is this proposal a test? "Can we even drop support for a solution to a problem literally nobody has?"


Signed integers did not have to be 2's complement; there were 3 valid representations: sign-magnitude, 1's complement, and 2's complement. Modern C and C++ dropped this and mandate 2's complement ("as if", but that distinction is moot here; you can do the same for CHAR_BIT). So there is certainly precedent for this sort of thing.


As mentioned by others, we've dropped trigraph and deprecated rand (and offer an alternative). I also have:

* p2809 Trivial infinite loops are not Undefined Behavior

* p1152 Deprecating volatile

* p0907 Signed Integers are Two's Complement

* p2723 Zero-initialize objects of automatic storage duration

* p2186 Removing Garbage Collection Support

So it is possible to change things!


The GC API from C++11 was removed in C++23, understandably so, given that it wasn't designed taking into account the needs of Unreal C++ and C++/CLI, the only two major variants that have GC support.

Exception specifications have been removed, although some want them back for value type exceptions, if that ever happens.

auto_ptr has been removed, given its broken design.

Now, on the simplifying side, not really, as the old ways still need to be understood.


I think you are right. Absolutely.

Don’t break perfection!! Just accumulate more perfection.

What we need is a new C++ symbol that reliably references eight bit bytes, without breaking compatibility, or wasting annnnnny opportunity to expand the kitchen sink once again.

I propose “unsigned byte8” and (2’s complement) “signed byte8”. And “byte8” with undefined sign behavior because we can always use some more spice.

“unsigned decimal byte8” and “signed decimal byte8”, would limit legal values to 0 to 10 and -10 to +10.

For the damn accountants.

“unsigned centimal byte8” and “signed centimal byte8”, would limit legal values to 0 to 100 and -100 to +100.

For the damn accountants who care about the cost of bytes.

Also for a statistically almost valid, good enough for your customer’s alpha, data type for “age” fields in databases.

And “float byte8” obviously.


> For the damn accountants who care about the cost of bytes.

Finally! A language that can calculate my S3 bill


How is rand() broken? It seems to produce random-ish values, which is what it's for. It obviously doesn't produce cryptographically secure random values, but that's expected (and reflects other languages' equivalent functions). For a decently random integer that's quick to compute, rand() works just fine.


RAND_MAX is only guaranteed to be at least 32767. So if you use `rand() % 10000` you'll have a real bias towards 0-2767; even `rand() % 1000` is already not uniform (biased towards 0-767). And that assumes rand() is uniform over 0-RAND_MAX in the first place.
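The usual modern C++ replacement sidesteps both the bias and the weak generator (a sketch):

  #include <random>

  int roll_0_to_9999() {
      static std::mt19937 gen{std::random_device{}()};
      std::uniform_int_distribution<int> dist(0, 9999);  // unbiased over the full range
      return dist(gen);
  }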


> The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call.

It cannot be called safely from a multi-threaded application, for one.


C++ long ago crossed the line where making any change is more work than any benefit it could ever create.


This is such an odd thing to read & compare to how eager my colleagues are to upgrade the compiler to take advantage of new features. There's so much less need to specify types in situations where the information is implicitly available after C++ 20/17. So many boost libraries have been replaced by superior std versions.

And this has happened again and again on this enormous codebase that started before it was even called 'C++'.


It is one of my favourite languages, but I think it has already crossed over the complexity threshold PL/I was known for.


well they managed to get two's complement requirement into C++20. there is always hope.


Well then someone somewhere with some mainframe got so angry they decided to write a manifesto to condemn kids these days and announced a fork of Qt because Qt committed the cardinal sin of adopting C++20. So don’t say “a problem literally nobody has”, someone always has a use case; although at some point it’s okay to make a decision to ignore them.

https://lscs-software.com/LsCs-Manifesto.html

https://news.ycombinator.com/item?id=41614949

Edit: Fixed typo pointed out by child.


> because Qt committed the carnal sin of adopting C++20

I do believe you meant to write "cardinal sin," good sir. Unless Qt has not only become sentient but also corporeal when I wasn't looking and gotten close and personal with the C++ standard...


This person is unhinged.

> It's a desktop on a Linux distro meant to create devices to better/save lives.

If you are creating life critical medical devices you should not be using linux.


> If you are creating life critical medical devices you should not be using linux.

Hmm, what do you mean?

Like, no you should not adopt some buggy or untested distro, instead choose each component carefully and disable all un-needed updates...

But that beats working on an unstable, randomly and capriciously deprecated and broken OS (Windows/Mac over the years), on which you can perform zero practical review, casual or otherwise, legal or otherwise, and which insists upon updating and further breaking itself at regular intervals...

Unless you mean to talk maybe about some microkernel with a very simple graphical UI, which, sure yes, much less complexity...


I mean you should be building life critical medical devices on top of an operating system like QNX or vxworks which are much more stable and simpler.


Regulations are complex, but not every medical device or part of it is "life critical". There are plenty of regulated medical devices floating around running Linux, often based on Yocto. There is some debate in the industry about the particulars of this SOUP (software of unknown provenance) in general, but the mere idea of Linux in a medical device is old news and isn't crackpot or anything.

The goal for this guy seems to be a Linux distro primarily to serve as a reproducible dev environment that must include his own in-progress EDT editor clone, but can include others as long as they're not vim or use Qt. Ironically Qt closed-source targets vxWorks and QNX. Dräger ventilators use it for their frontend.

https://www.logikalsolutions.com/wordpress/information-techn...

Like the general idea of a medical device linux distros (for both dev host and targets) is not a bad one. But the thinking and execution in this case is totally derailed due to outsized and unfocused reactions to details that don't matter (ancient IRS tax computers), QtQuick having some growing pains over a decade ago, personal hatred of vim, conflating a hatred of Agile with CI/CD.


> You can't use non-typesafe junk when lives are on the line.

Their words, not mine. If lives are on the line you probably shouldn’t be using linux in your medical device. And I hope my life never depends on a medical device running linux.


Wow.

https://theminimumyouneedtoknow.com/

https://lscs-software.com/LsCs-Roadmap.html

"Many of us got our first exposure to Qt on OS/2 in or around 1987."

Uh huh.

> someone always has a use case;

No he doesn't. He's just unhinged. The machines this dude bitches about don't even have a modern C++ compiler nor do they support any kind of display system relevant to Qt. They're never going to be a target for Qt. Further irony is this dude proudly proclaims this fork will support nothing but Wayland and Vulkan on Linux.

"the smaller processors like those in sensors, are 1's complement for a reason."

The "reason" is never explained.

"Why? Because nothing is faster when it comes to straight addition and subtraction of financial values in scaled integers. (Possibly packed decimal too, but uncertain on that.)"

Is this a justification for using Unisys mainframes, or is the implication that they are fastest because of 1's complement? (Not that this is even close to being true - as any dinosaurs are decommissioned they're fucking replaced with capable but not TOL commodity Xeon CPU based hardware running emulation; I don't think Unisys makes any non-x86 hardware anymore.) Anyway, may need to refresh that CS education.

There's some rambling about the justification being data conversion, but what serialization protocols mandate 1's complement anyway, and if those exist someone has already implemented 2's complement supporting libraries for the past 50 years since that has been the overwhelming status quo. We somehow manage to deal with endianness and decimal conversions as well.

"Passing 2's complement data to backend systems or front end sensors expecting 1's complement causes catastrophes."

99.999% of systems - MIPS, ARM, x86, Power, etc. - for the last 40 years use 2's complement, so this has been the normal state of the world since forever.

Also the enterpriseist of languages, Java somehow has survived mandating 2's complement.

This is all very unhinged.

I'm not holding my breath to see this ancient Qt fork fully converted to "modified" Barr spec but that will be a hoot.


Yeah, I think many of their arguments are not quite up to snuff. I would be quite interested how 1s complement is faster; it is simpler, and thus the hardware could be faster, if you figure out how to deal with the drawbacks like -0 vs +0 (you could do it in hardware pretty easily...)

Buuuut then the Unisys thing. Like you say, they don't make processors (for the market) and themselves just use Intel now... and even if they make some special secret processors I don't think the IRS is using top secret processors to crunch our taxes. Even in the hundreds-of-millions-of-records realm, with an average of hundreds of items per record, modern CPUs run at billions of ops per second... so I suspect we are talking some tens of seconds, and some modest amount of RAM (for a server).

The one point he does have is interoperability: if a lot of (especially medical) equipment uses 1s complement because it's cheaper (in terms of silicon), using "modern" tools is likely to be a bad fit.

Compatibility is King, and where medical devices are concerned I would be inclined to agree that not changing things is better than "upgrading" - it's all well and good to have two systems until a crisis hits and some doctor plugs the wrong sensor into the wrong device...


> The one point he does have is interoperability: if a lot of (especially medical) equipment uses 1s complement

No it’s completely loony. Note that even the devices he claims to work with for medical devices are off the shelf ARM processors (ie what everybody uses). No commonly used commodity processors for embedded have used 1’s complement in the last 50 years.

> equipment uses 1s complement because it's cheaper (in terms of silicon)

Yeah that makes no sense. If you need an ALU at all, 2s complement requires no more silicon and is simpler to work with. That’s why it was recommended by von Neumann in 1945. 1s complement is only simpler if you don’t have an adder of any kind, which is then not a CPU, certainly not a C/C++ target.

Even the shittiest low end PIC microcontroller from the 70s uses 2s complement.

It is possible that a sensing device with no microprocessor or computation of any kind (ie a bare ADC) may generate values in sign-mag or 1s complement (and it’s usually the former, again how stupid this is) - but this has nothing to do with the C implementation of whatever host connects to it which is certainly 2s. I guarantee you no embedded processor this dude ever worked with in the medical industry used anything other than 2s complement - you would have always needed to do a conversion.

This truly is one of the most absurd issues to get wrapped up on. It might be dementia, sadly.

https://github.com/RolandHughes/ls-cs/blob/master/README.md

Maintaining a fork of a large C++ framework (well of another obscure fork) where the top most selling point is a fixation on avoiding C++20 all because they dropped support for integer representations that have no extant hardware with recent C++ compilers - and any theoretical hardware wouldn’t run this framework anyway, that doesn’t seem well attached to reality.


> it is simpler and thus the hardware could be faster

Is it though? With two's complement, ADD and SUB are the same hardware for unsigned and signed. MUL/IMUL is also the same for the lower half of the result (i.e. 32bit × 32bit = 32bit). So your ALU and ISA are simple and flexible by design.
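
A minimal sketch of what "same hardware" means in practice (my illustration, not the parent's): add the raw bit patterns as unsigned values and the result, reinterpreted as signed, is exactly the two's complement sum, so a single adder serves both signed and unsigned:

  #include <cstdint>
  #include <cstdio>

  int main() {
      std::int8_t a = -5, b = 3;
      auto ua  = static_cast<std::uint8_t>(a);         // same bit pattern as a (0xFB)
      auto ub  = static_cast<std::uint8_t>(b);         // 0x03
      auto sum = static_cast<std::uint8_t>(ua + ub);   // plain unsigned add, wraps mod 256 -> 0xFE
      // Reinterpreting the unsigned result as signed gives -2, identical to signed a + b.
      std::printf("%d %d\n", a + b, static_cast<std::int8_t>(sum));
  }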


For calculations, of course it’s not simpler or faster. At best, you could probably make hardware where it’s close to a wash.

One that lectures on the importance of college you would think would demonstrate the critical thinking skills to ask themselves why the top supercomputers use 2’s complement like everyone else.

The only aspect of 1’s or sign mag that is simpler is in generation. If you have a simple ADC that gives you a magnitude based on a count and a direction, it is trivial to just output that directly. 1’s I guess is not too much harder with XORs (but what’s the point?). 2’s requires some kind of ripple carry logic, the add 1 is one way, there are some other methods you can work out but still more logic than sign-mag. This is pretty much the only place where non 2’s complement has any advantage. Finally for an I2C or SPI sensor like a temp sensor it is more likely you will get none of the above and have some asymmetric scale. Anybody in embedded bloviating on this ought to know.

In his ramblings the mentions of packed decimal (BCD) are a nice touch. C and C++ have never supported that to begin with, so I have no idea why that must also be “considered”.


C++17 removed trigraphs


Which was quite controversial. Imagine that.


One obvious example is auto_ptr. And from what I can see it is quite successful -- in a well maintained C++ codebase using C++ 11 or later, you just don't see auto_ptr in the code.


they do it left and right when it strikes their fancy, otherwise it is unconscionable.

Like making over "auto". Or adding "start_lifetime_as" and declaring most existing code that uses mmap non-conformant.

But then someone asks for a thing that would require to stop pretending that C++ can be parsed top down in a single pass. Immediate rejection!


>haven't followed closely

Don't worry, most people complaining about C++ complexity don't.


Hahaha, you're including Bjarne in that sweeping generalization? C++ has long had a culture problem revolving around arrogance and belittling others; maybe it is growing out of it?

I would point out that for any language, if one has to follow the standards committee closely to be an effective programmer in that language, complexity is likely to be an issue. Fortunately in this case it probably isn't required.

I see garbage collection came in C++11 and has now gone. Would following that debacle make many or most C++ programmers more effective?


Hi! Thanks for the interest on my proposal. I have an updated draft based on feedback I've received so far: https://isocpp.org/files/papers/D3477R1.html


Love the snark in the proposal. Just one gem

> The question isn’t whether there are still architectures where bytes aren’t 8-bits (there are!) but whether these care about modern C++... and whether modern C++ cares about them.


I have mixed feelings about this. On the one hand, it's obviously correct--there is no meaningful use for CHAR_BIT to be anything other than 8.

On the other hand, it seems like some sort of concession to the idea that you are entitled to some sort of just world where things make sense and can be reasoned out given your own personal, deeply oversimplified model of what's going on inside the computer. This approach can take you pretty far, but it's a garden path that goes nowhere--eventually you must admit that you know nothing and the best you can do is a formal argument that conditional on the documentation being correct you have constructed a correct program.

This is a huge intellectual leap, and in my personal experience the further you go without being forced to acknowledge it the harder it will be to make the jump.

That said, there seems to be an increasing popularity of physical electronics projects among the novice set these days... hopefully "read the damn spec sheet" will become the new "read the documentation".


And yet every time I run an autoconf script I watch as it checks the bits in a byte and saves the output in config.h as though anyone planned to act on it.


As with any highly used language you end up running into what I call the COBOL problem. It will work for the vast majority of cases except where there's a system that forces an update and all of a sudden a traffic control system doesn't work or a plane falls out of the sky.

You'd have to have some way of testing all previous code in the compilation (pardon my ignorance if this is somehow obvious) to make sure this macro isn't already used. You also risk forking the language with any kind of breaking changes like this. How difficult it would be to test whether a previous code base uses the CHAR_BIT macro, and whether it can be updated to the new compiler, sounds non-obvious. Which libraries would then be considered breaking? Would interacting with other compiled code (possibly stupid question) that used CHAR_BIT also cause problems? Just off the top of my head.

I agree that it sounds nonintuitive. I'd suggest creating a conversion tool first and demonstrating it was safe to use even in extreme cases and then make the conversion. But that's just my unenlightened opinion.


That's not really the problem here--CHAR_BIT is already 8 everywhere in practice, and all real existing code[1] handles CHAR_BIT being 8.

The question is "does any code need to care about CHAR_BIT > 8 platforms" and the answer of course is no; it's just: should we perform the occult standards ceremony to acknowledge this, or continue to ritually pretend that standards-compliant 16-bit DSPs are a thing?

[1] I'm sure artifacts of 7, 9, 16, 32, etc[2] bit code & platforms exist, but they aren't targeting or implementing anything resembling modern ISO C++ and can continue to exist without anyone's permission.

[2] if we're going for unconventional bitness my money's on 53, which at least has practical uses in 2024


This is both uncontroversial and incredibly spicy. I love it.


I'm totally fine with enforcing that int8_t == char == 8-bits, however I'm not sure about spreading the misconception that a byte is 8-bits. A byte with 8-bits is called an octet.

At the same time, a `byte` is already an "alias" for `char` since C++17 anyway[1].

[1] https://en.cppreference.com/w/cpp/types/byte


My first experience with computers was 45 years ago, and a "byte" back then was defined as an 8-bit quantity. And in the intervening 45 years, I've never come across a different meaning for "byte". I'll ask for a citation for a definition of "byte" that isn't 8-bits.


1979 is quite recent as computer history goes, and many conventions had settled by then. The Wikipedia article discusses the etymology of "byte" and how the definition evolved from loosely "a group of bits less than a word" to "precisely 8 bits". https://en.wikipedia.org/wiki/Byte


That's interesting, because maybe a byte will not be 8 bits 45 years from now.

I'm mostly discussing for the sake of it, because I don't really mind as a C/C++ user. We could just use "octet" and call it a day, but now there is an ambiguity with the past definition and potentially the future definition (in which case I hope the term "byte" will just disappear).


> A byte with 8-bits is called an octet

The networking RFCs since inception have always used octet as well.


Nah, a byte is 8 bits.

This is a normative statement, not a descriptive statement.


I, for one, hate that int8 == signed char.

std::cout << (int8_t)32 << std::endl; //should print 32 dang it


Now you can also enjoy the fact that you can't even compile:

  std::cout << (std::byte)32 << std::endl;
because there is no default operator<< defined.


Very enjoyable. It will be a constant reminder that I need to decide how I want std::byte to print - character or integer ...
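
For what it's worth, a small sketch of the usual workarounds (assuming you just want the numeric value): promote the int8_t before streaming, and convert std::byte explicitly with std::to_integer, which C++17 provides precisely because there is no operator<<:

  #include <cstddef>
  #include <cstdint>
  #include <iostream>

  int main() {
      std::int8_t n{32};
      std::byte   b{32};
      std::cout << +n << '\n';                       // unary + promotes to int, prints 32
      std::cout << static_cast<int>(n) << '\n';      // same thing, spelled out
      std::cout << std::to_integer<int>(b) << '\n';  // explicit std::byte -> int conversion
  }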


Nothing to do with C++, but:

I kinda like the idea of 6-bit byte retro-microcomputer (resp. 24-bit, that would be a word). Because microcomputers typically deal with small number of objects (and prefer arrays to pointers), it would save memory.

VGA was 6-bit per color, you can have a readable alphabet in 6x4 bit matrix, you can stuff basic LISP or Forth language into 6-bit alphabet, and the original System/360 only had 24-bit addresses.

What's there not to love? 12MiB of memory, with independently addressable 6-bits, should be enough for anyone. And if it's not enough, you can naturally extend FAT-12 to FAT-24 for external storage. Or you can use 48-bit pointers, which are pretty much as useful as 64-bit pointers.


Or you can have 8 bit bytes, and 3 byte words. That’s still 24 bits.


There are DSP chips that have C compilers, and do not have 8 bit bytes; smallest addressable unit is 16 (or larger).

Less than a decade ago I worked with something like that: the TeakLite III DSP from CEVA.


I just put static_assert(CHAR_BIT == 8); in one place and move on. Haven't had it fire since back when it was the #if equivalent.
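
Spelled out, that guard is a couple of lines (the macro is spelled CHAR_BIT and lives in <climits>):

  #include <climits>

  static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");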


Not sure about that, seems pretty controversial to me. Are we forgetting about the UNIVACs?


This would be a great setup for a time travelling science fiction where there is some legacy UNIVAC software that needs to be debugged, and John Titor, instead of looking for an IBM 5100, came back to the year 2024 to find a pre-P3477R0 compiler.


Steins;byte


Do UNIVACs care about modern C++ compilers? Do modern C++ compilers care about UNIVACs?

Given that Wikipedia says UNIVAC was discontinued in 1986 I’m pretty sure the answer is no and no!


The UNIVAC 1108 (and descendants) mainframe architecture was not discontinued in 1986. The company that owned it (Sperry) merged with Burroughs in that year to form Unisys. The platform still exists, but now runs as a software emulator under x86-64. The OS is still maintained and had a new release just last year. Around the time of the merger the old school name “UNIVAC” was retired in a rebranding, but the platform survived.

Its OS, OS 2200, does have a C compiler. Not sure if there ever was a C++ compiler; if there once was, it is no longer around. But that C compiler is not being kept up to date with the latest standards, it only officially supports C89/C90 - this is a deeply legacy system, most application software is written in COBOL and the OS itself is mainly written in assembler and a proprietary Pascal-like language called “PLUS”. They might add some features from newer standards if particularly valuable, but formal compliance with C99/C11/C17/C23/etc is not a goal.

The OS does contain components written in C++, most notably the HotSpot JVM. However, from what I understand, the JVM actually runs in x86-64 Linux processes on the host system, outside of the emulated mainframe environment, but the mainframe emulator is integrated with those Linux processes so they can access mainframe files/data/apps.


That would be a resounding no then. Nice. So why are we talking about an obscure and irrelevant-to-the-discussion platform? Internet comments I swear.


I got curious, there is a Wikipedia page describing what languages are currently available,

https://en.wikipedia.org/wiki/Unisys_OS_2200_programming_lan...


Hopefully we are; it's been a long time, but as I remember indexing in strings on them is a disaster.


They still exist. You can still run OS 2200 on a Clearpath Dorado.[1] Although it's actually Intel Xeon processors doing an emulation.

Yes, indexing strings of 6-bit FIELDATA characters was a huge headache. UNIVAC had the unfortunate problem of having to settle on a character code in the early 1960s, before ASCII was standardized. At the time, a military 6-bit character set looked like the next big thing. It was better than IBM's code, which mapped to punch card holes and the letters weren't all in one block.

[1] https://www.unisys.com/siteassets/collateral/info-sheets/inf...


idk. by now most software already assumes 8 bits == 1 byte in subtle ways all over the place, to the point that you kinda have to use a fully custom, or at least fully self-reviewed and patched, stack of C libraries otherwise

so delegating such by-now-very-edge cases to non-standard C seems fine, i.e. IMHO it doesn't change much at all in practice

and C/C++ compilers are full of non-standard extensions anyway, and it's not that CHAR_BIT goes away or that you can't, as a non-standard extension, assume it might not be 8


> most software already assumes 8 bit == byte in subtle ways all over the place

Which is the real reason why 8-bits should be adopted as the standard byte size.

I didn't even realize that the byte was defined as anything other than 8-bits until recently. I have known, for decades, that there were non-8-bit character encodings (including ASCII) and word sizes were all over the map (including some where word size % 8 != 0). Enough thought about that last point should have helped me realize that there were machines where the byte was not 8-bits, yet the rarity of encountering such systems left me with the incorrect notion that a byte was defined as 8-bits.

Now if someone with enough background to figure it out doesn't figure it out, how can someone without that background figure it out? Someone who has only experienced systems with 8-bit bytes. Someone who has only read books that make the explicit assumption of 8-bit bytes (which virtually every book does). Anything they write has the potential of breaking on systems with a different byte size. The idea of writing portable code because the compiler itself is "standards compliant" breaks down. You probably should modify the standard to ensure the code remains portable, by either forcing the compiler for non-8-bit systems to handle the exceptions, or simply admitting that the compiler does not produce portable code for non-8-bit systems.


What will be the benefit?

- CHAR_BIT cannot go away; reams of code references it.

- You still need the constant 8. It's better if it has a name.

- Neither the C nor C++ standard will be simplified if CHAR_BIT is declared to be 8. Only a few passages will change. Just, certain possible implementations will be rendered nonconforming.

- There are specialized platforms with C compilers, such as DSP chips, that are not byte addressable machines. They are in current use; they are not museum pieces.


> We can find vestigial support, for example GCC dropped dsp16xx in 2004, and 1750a in 2002.

Honestly kind of surprised it was relevant as late as 2004. I thought the era of non-8-bit bytes was like the 1970s or earlier.


JF Bastien is a legend for this, haha.

I would be amazed if there's any even remotely relevant code that deals meaningfully with CHAR_BIT != 8 these days.

(... and yes, it's about time.)


Here's a bit of 40 year old code I wrote which originally ran on 36-bit PDP-10 machines, but will work on non-36 bit machines.[1] It's a self-contained piece of code to check passwords for being obvious. This will detect any word in the UNIX dictionary, and most English words, using something that's vaguely like a Bloom filter.

This is so old it predates ANSI C; it's in K&R C. It used to show up on various academic sites over the years, but it's now obsolete enough to have finally scrolled off Google.

I think we can dispense with non 8-bit bytes at this point.

[1] https://animats.com/source/obvious/obvious.c


Huh, that’s clever!


The tms320c28x DSPs have 16 bit char, so e.g. the Opus audio codec codebase works with 16-bit char (or at least it did at one point -- I wouldn't be shocked if it broke from time to time, since I don't think anyone runs regression tests on such a platform).

For some DSP-ish sort of processors I think it doesn't make sense to have addressability at char level, and the gates to support it would be better spent on better 16 and 32 bit multipliers. ::shrugs::

I feel kind of ambivalent about the standards proposal. We already have fixed size types. If you want/need an exact type, that already exists. The non-fixed size types set minimums and allow platforms to set larger sizes for performance reasons.

Having no fast 8-bit level access is a perfectly reasonable decision for a small DSP.

Might it be better instead to migrate many users of char to (u)int8_t?

The proposed alternative of CHAR_BIT congruent to 0 mod 8 also sounds pretty reasonable, in that it captures the existing non-8-bit char platforms and also the justification for non-8-bit char platforms (that if you're not doing much string processing but instead doing all math processing, the additional hardware for efficient 8 bit access is a total waste).
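
A small illustration of that split (variable names are just illustrative): the exact-width types are optional precisely so that platforms like a 16-bit-char DSP can omit them, while the least/fast variants are always there and may legitimately be wider than 8 bits:

  #include <cstdint>

  // Optional: only exists if the platform has an 8-bit type with no padding.
  // A 16-bit-char DSP simply doesn't provide it, so code needing it fails to compile there.
  std::uint8_t       exact    = 0;

  // Always available: at least 8 bits, possibly 16 on such a DSP.
  std::uint_least8_t at_least = 0;
  std::uint_fast8_t  fast     = 0;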


I think it's fine to relegate non-8-bit chars to non-standard C, given that a lot of software already implicitly assumes 8-bit bytes anyway. Non-standard extensions for certain use-cases aren't anything new for C compilers. Also it's a C++ proposal; I'm not sure if you program DSPs with C++ :think:


I added a mention of TI's hardware in my latest draft: https://isocpp.org/files/papers/D3477R1.html


Any thoughts on the fact that some vendors basically don't offer a C compiler now? E.g. MSVC has essentially forced C++ limitations back onto the C language to reduce C++ vs C maintenance costs?


DSP chips are a common exception that people bring up. I think some TI made ones have 64 bit chars.

Edit: I see TFA mentions them but questions how relevant C++ is in that sort of embedded environment.


Yes, but you're already in specialized territory if you're using that


The current proposal says:

> A byte is 8 bits, which is at least large enough to contain the ordinary literal encoding of any element of the basic character set literal character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is bits in a byte.

But instead of the "and is composed" ending, it feels like you'd change the intro to say that "A byte is 8 contiguous bits, which is".

We can also remove the "at least", since that was there to imply a requirement on the number of bits being large enough for UTF-8.

Personally, I'd make a "A byte is 8 contiguous bits." a standalone sentence. Then explain as follow up that "A byte is large enough to contain...".


Hmm, I wonder if any modern languages can work on computers that use trits instead of bits.

https://en.wikipedia.org/wiki/Ternary_computer


Possible, but likely slow. There's nothing in the "C abstract machine" that mandates specific hardware. But, the bitshift is only a fast operation when you have bits. Similarly with bitwise boolean operations.


It'd just be a translation/compiler problem. Most languages don't really have a "bit", instead it's usually a byte with the upper bits ignored.


While we're at it, perhaps we should also presume little-endian byte order. As much as I prefer big-endian, little-endian has won.

As consolation, big-endian will likely live on forever as the network byte order.


As a person who designed and built a hobby CPU with a sixteen-bit byte, I’m not sure how I feel about this proposal.


But how many bytes are there in a word?


If you're on x86, the answer can be simultaneously 16, 32, and 64.


Don’t you mean 2,4, and 8?


Bits, bytes, whatever.


"Word" is an outdated concept we should try to get rid of.


You're right. To be consistent with bytes we should call it a snack.


Henceforth, it follows that a doublesnack is called a lunch. And a quadruplesnack a fourthmeal.


There's only one right answer:

Nybble - 4 bits

Byte - 8 bits

Snyack - 16 bits

Lyunch - 32 bits

Dynner - 64 bits


In the spirit of redefining the kilobyte, we should define byte as having a nice, metric 10 bits. An 8 bit thing is obviously a bibyte. Then power of 2 multiples of them can include kibibibytes, mebibibytes, gibibibytes, and so on for clarity.


ಠ_ಠ


And what about elevensies?

(Ok, I guess there's a difference between bits and hob-bits)


This is incompatible with cultures where lunch is bigger than dinner.


or an f-word


It's very useful on hardware that is not an x86 CPU.


As an abstraction on the size of a CPU register, it really turned out to be more confusing than useful.


On RISC machines, it can be very useful to have the concept of "words," because that indicates things about how the computer loads and stores data, as well as the native instruction size. In DSPs and custom hardware, it can indicate the only available datatype.

The land of x86 goes to great pains to eliminate the concept of a word at a silicon cost.


Fortunately we have `register_t` these days.


Is it 32 or 64 bits on ARM64? Why not both?


ARM64 has a 32-bit word, even though the native pointer size and general register size is 64 bits. To access just the lower 32 bits of a register Xn you refer to it as Wn.


such as...?


Appeasing that attitude is what prevented Microsoft from migrating to LP64. Would have been an easier task if their 32-bit LONG type never existed, they stuck with DWORD, and told the RISC platforms to live with it.


How exactly ? How else do you suggest CPUs do addressing ?

Or are you suggesting to increase the size of a byte until it's the same size as a word, and merge both concepts ?


I'm saying the term "Word", abstracting the number of bytes a CPU can process in a single operation, is an outdated concept. We don't really talk about word-sized values anymore. Instead we're mostly explicit about the size of a value in bits. Even the idea of a CPU having just one relevant word size is a bit outdated.


There are 4 bytes in word:

  const char word[] = {'w', 'o', 'r', 'd'};
  assert(sizeof word == 4);


I've seen 6 8-bit characters/word (Burroughs large systems, they also support 8 6-bit characters/word)


So please do excuse my ignorance, but is there a "logic"-related reason, other than hardware cost limitations à la "8 was cheaper than 10 for the same number of memory addresses", that bytes are 8 bits instead of 10? Genuinely curious; as a high-level dev of twenty years, I don't know why 8 was selected.

To my naive eye, It seems like moving to 10 bits per byte would be both logical and make learning the trade just a little bit easier?


One example from the software side: a common thing to do in data processing is to compute bit offsets (compression, video decoding, etc.). If a byte were 10 bits you would need mod-10 operations everywhere, which is slow and/or complex. In contrast, mod 2^N is a single logic instruction.
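
A tiny sketch of that difference (split8/split10 are hypothetical helpers, just for illustration): with 8-bit bytes a bit offset splits into (byte, bit) with a shift and a mask, while a 10-bit byte would need a genuine divide and modulo:

  #include <cstdint>

  // 8-bit bytes: one shift and one AND.
  void split8(std::uint64_t bit_offset, std::uint64_t& byte, unsigned& bit) {
      byte = bit_offset >> 3;                        // divide by 8
      bit  = static_cast<unsigned>(bit_offset & 7);  // modulo 8
  }

  // Hypothetical 10-bit bytes: real division and modulo.
  void split10(std::uint64_t bit_offset, std::uint64_t& byte, unsigned& bit) {
      byte = bit_offset / 10;
      bit  = static_cast<unsigned>(bit_offset % 10);
  }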


If you're ignoring what's efficient to use then just use a decimal data type and let the hardware figure out how to calculate that for you best. If what's efficient matters then address management, hardware operation implementations, and data packing are all simplest when the group size is a power of the base.


Eight is a nice power of two.


Can you explain how that's helpful? I'm not being obtuse, I just don't follow


One thought is that it's always a whole number of bits (3) to bit-address within a byte. It's a fractional number of bits (log2 10 ≈ 3.3) to bit-address a 10-bit byte. Sorta just works out nicer in general to have powers of 2 when working in base 2.


This is basically the reason.

Another part of it is the fact that it's a lot easier to represent stuff with hex if the bytes line up.

I can represent "255" with "0xFF" which fits nice and neat in 1 byte. However, now if a byte is 10 bits that hex no longer really works. You have 1024 values to represent. The max value would be 0x3FF which just looks funky.

Coming up with an alphanumeric system to represent 2^10 cleanly just ends up weird and unintuitive.


We probably wouldn't have chosen hex in a theoretical world where bytes were 10 bits, right? It would probably be two groups of 5 like 02:21 == 85 (like an ip address) or five groups of two 0x01111 == 85. It just has to be one of its divisors.


  02:21
or instead of digits from 0 to F, the letters would go up to V. 85 would be 0x2L I think (2 * 32 + 21)


Because modern computing has settled on the Boolean (binary) logic (0/1 or true/false) in the chip design, which has given us 8 bit bytes (a power of two). It is the easiest and most reliable to design and implement in the hardware.

On the other hand, if computing settled on a three-valued logic (e.g. 0/1/«something» where «something» has been proposed as -1, «undefined»/«unknown»/«undecided» or a «shade of grey»), we would have had 9 bit bytes (a power of three).

10 was tried numerous times at the dawn of computing and… it was found too unwieldy in the circuit design.


> On the other hand, if computing settled on a three-valued logic (e.g. 0/1/«something» where «something» has been proposed as -1, «undefined»/«unknown/undecided» or a «shade of grey»), we would have had 9 bit bytes (a power of three).

Is this true? 4 ternary bits give you really convenient base 12 which has a lot of desirable properties for things like multiplication and fixed point. Though I have no idea what ternary building blocks would look like so it’s hard to visualize potential hardware.


It is hard to say whether it would have been 9 or 12, now that people have stopped experimenting with alternative hardware designs. 9-bit byte designs certainly did exist (and maybe even the 12-bit designs), too, although they were still based on the Boolean logic.

I have certainly heard an argument that ternary logic would have been a better choice, if it won over, but it is history now, and we are left with the vestiges of the ternary logic in SQL (NULL values which are semantically «no value» / «undefined» values).


Many circuits have ceil(log_2(N_bits)) scaling wrt propagation delay and other dimensions, so you're just leaving efficiency on the table if you aren't using a power of 2 for your bit size.


It's easier to go from a bit number to (byte, bit) if you don't have to divide by 10.


I'm fairly sure it's because the English character set fits nicely into a byte. 7 bits would have worked as well, but 7 is a very odd width for something in a binary computer.


likely mostly as a concession to ASCII in the end. you used a typewriter to write into and receive terminal output from machines back in the day. terminals would use ASCII. there were machines with all sorts of smallest-addressable-sizes, but eight bit bytes align nicely with ASCII. makes strings easier. making strings easier makes programming easier. easier programming makes a machine more popular. once machines started standardizing on eight bit bytes, others followed. when they went to add more data, they kept the byte since code was written for bytes, and made their new registers two bytes. then two of those. then two of those. so we're sitting at 64 bit registers on the backs of all that came before.


I'm not sure why you think being able to store values from -512 to +511 is more logical than -128 to +127?


Buckets of 10 seem more regular to beings with 10 fingers that can be up or down?


Computers are not beings with 10 fingers that can be up or down.

Powers of two are more natural in a binary computer. Then add the fact that 8 is the smallest power of two that allows you to fit the Latin alphabet plus most common symbols as a character encoding.

We're all about building towers of abstractions. It does make sense to aim for designs that are natural for humans when you're closer to the top of the stack. Bytes are fairly low down the stack, so it makes more sense for them to be natural to computers.


I think 8 bits (really 7 bits) was chosen because it holds a value closest to +/- 100. What is regular just depends on how you look at it.


Unless they are Addams who have 10 fingers and 11 toes as it is known abundantly well.


I wish I knew what a 9 bit byte means.

One fun fact I found the other day: ASCII is 7 bits, but when it was used with punch cards there was an 8th bit to make sure you didn't punch the wrong number of holes. https://rabbit.eng.miami.edu/info/ascii.html


A 9-bit byte is found on 36-bit machines in quarter-word mode.

Parity is for paper tape, not punched cards. Paper tape parity was never standardized. Nor was parity for 8-bit ASCII communications. Which is why there were devices with settings for EVEN, ODD, ZERO, and ONE for the 8th bit.

Punched cards have their very own encodings, only of historical interest.


>A 9-bit byte is found on 36-bit machines in quarter-word mode.

I've only programmed in high level programming languages in 8-bit-byte machines. I can't understand what you mean by this sentence.

So in a 36-bit CPU a word is 36 bits. And a byte isn't a word. But what is a word and how does it differ from a byte?

If you asked me what 32-bit/64-bit means in a CPU, I'd say it's how large memory addresses can be. Is that true for 36-bit CPUs or does it mean something else? If it's something else, then that means 64-bit isn't the "word" of a 64-bit CPU, so what would the word be?

This is all very confusing.


A word is the unit of addressing. A 36-bit machine has 36 bits of data stored at address 1, and another 36 bits at address 2, and so forth. This is inconvenient for text processing. You have to do a lot of shifting and masking. There's a bit of hardware help on some machines. UNIVAC hardware allowed accessing one-sixth of a word (6 bits), or one-quarter of a word (8 bits), or one-third of a word (12 bits), or a half of a word (18 bits). You had to select sixth-word mode (old) or quarter-word mode (new) as a machine state.

Such machines are not byte-addressable. They have partial word accesses, instead.

Machines have been built with 4, 8, 12, 16, 24, 32, 36, 48, 56, 60, and 64 bit word lengths.

Many "scientific" computers were built with 36-bit words and a 36-bit arithmetic unit. This started with the IBM 701 (1952), although an FPU came later, and continued through the IBM 7094. The byte-oriented IBM System/360 machines replaced those, and made byte-addressable architecture the standard. UNIVAC followed along with the UNIVAC 1103 (1953), which continued through the 1103A and 1105 vacuum tube machines, the later transistorized machines 1107 and 1108, and well into the 21st century. Unisys will still sell you a 36-bit machine, although it's really an emulator running on Intel Xeon CPUs.

The main argument for 36 bits was that 36-bit floats have four more bits of precision, or one more decimal digit, than 32-bit floats. 1 bit of sign, 8 bits of exponent and 27 bits of mantissa gives you a full 8 decimal digits of precision, while standard 32-bit floats with a 1-bit sign, 7-bit exponent and a 24-bit mantissa only give you 7 full decimal digits. Double precision floating point came years later; it takes 4x as much hardware.
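
Those digit counts come out of a quick back-of-the-envelope estimate, roughly n × log10(2) decimal digits for an n-bit significand (53 bits, today's IEEE double, added here only for comparison):

  #include <cmath>
  #include <cstdio>

  int main() {
      for (int bits : {24, 27, 53}) {
          // Rough decimal-digit capacity of an n-bit significand: floor(n * log10(2)).
          std::printf("%2d-bit significand -> ~%d decimal digits\n",
                      bits, static_cast<int>(bits * std::log10(2.0)));
      }
  }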


I see. I never realized that machines ended up with odd numbers of bits because they couldn't do double precision, so it was easier to make the word larger and do "half" precision instead.

Thanks a lot for your explanation, but does that mean "byte" is any amount of data that can be fetched in a given mode in such machines?

e.g. you have 6-bit, 9-bit, 12-bit, and 18-bit bytes in a 36-bit machine in sixth-word mode, quarter-word mode, third-word mode, and half-word mode, respectively? Which means in full-word mode the "byte" would be 36 bits?


The term "byte" was introduced by IBM at the launch of the IBM System/360 in 1964. [1], which event also introduced the term "throughput". IBM never used it officially in reference to their 36-bit machines. By 1969, IBM had discontinued selling their 36-bit machines. UNIVAC and DEC held onto 36 bits for several more decades, though.

[1] https://www.ibm.com/history/system-360


I don't think so. In the "normal" world, you can't address anything smaller than a byte, and you can only address in increments of a byte. A "word" is usually the size of the integer registers in the CPU. So the 36-bit machine would have a word size of 36 bits, and either six-bit bytes or nine-bit bytes, depending on how it was configured.

At least, if I understood all of this...


One PDP-10 operating system stored five 7-bit characters in one 36-bit word. This was back when memory cost a million dollars a megabyte in the 1970s.


36 bits also gave you 10 decimal digits for fixed point calculations. My mom says that this was important for atomic calculations back in the 1950s - you needed that level of precision on the masses.


Ah, I hope nobody ever uses that additional bit for additional encoding. That could cause all kinds of incompatibilities...


Ignoring this C++ proposal, especially because C and C++ seem like a complete nightmare when it comes to this stuff, I've almost gotten into the habit of treating a "byte" as an abstract concept. Many serial protocols will define their own "byte", and it might be 7, 8, 9, 11, 12, or however many bits long.


  #define SCHAR_MIN -127
  #define SCHAR_MAX 128
Is this two typos or am I missing the joke?


Typo, I fixed it in the new draft: https://isocpp.org/files/papers/D3477R1.html


And then we lose communication with Europa Clipper.


Why? Pls no. We've been told (in school!) that a byte is a byte. It's only sometimes 8 bits long (ok, most of the time these days). Do not destroy the last bits of fun. Is network order little-endian too?


I think there's plenty of fun left in the standard if they remove this :)


Heretic, do not defile the last remnants of true order!


This is entertaining and probably a good idea but the justification is very abstract.

Specifically, has there ever been a C++ compiler on a system where bytes weren't 8 bits? If so, when was it last updated?


There were/are C++ compilers for the PDP-10 (9-bit bytes). Those haven't been maintained AFAICT, but there are C++ compilers for various DSPs where the smallest unit of access is 16 or 32 bits, and those are still being sold.


I know some DSPs have 24-bit "bytes", and there are C compilers available for them.


Don't Unisys' Clearpath mainframes (still commercially available, IIRC) have 36-bit words and 9-bit bytes?

OTOH, I believe C and C++ are not recommended as languages on the platform.


C++ 'programmers' demonstrating their continued brilliance at bullshitting people they're being productive (Had to check if publishing date was April fools. It's not.) They should start a new committee next to formalize what direction electrons flow. If they do it now they'll be able to have it ready to bloat the next C++ standards no one reads or uses.


the fact that this isn't already done after all these years is one of the reasons why I no longer use C/C++. it takes years and years to get anything done, even the tiniest, most obvious drama free changes. contrast with Go, which has had this since version 1, in 2012:

https://pkg.go.dev/builtin@go1#byte


Don't worry, 20 years from now Go will also be struggling to change assumptions baked into the language in 2012.


Incredible things are happening in the C++ community.


I wish the types were all in bytes instead of bits too. u1 is unsigned 1 byte and u8 is 8 bytes.

That's probably not going to fly anymore though


I like the diversity of hardware and strange machines. So this saddens me. But I'm in the minority I think.


Honestly, at first I thought this might be an Onion headline. But then I stopped to think about it.


There are FOUR bits.

Jean-Luc Picard


I'm appalled by the parochialism in these comments. Memory access sizes other than 8 bits being inconvenient doesn't make this a good idea.


Amazing stuff guys. Bravo.


This is an egoistical viewpoint, but if I want 8 bits in a byte I have plenty of choices anyway - Zig, Rust, D, you name it. Should the need for another byte width come up, for either past or future architectures, C and C++ are my only practical choices.

Sure, it is selfish to expect C and C++ do the dirty work, while more modern languages get away skimping on it. On the other hand I think especially C++ is doing itself a disservice trying to become a kind of half-baked Rust.


Bold leadership


But think of ternary computers!


Doesn't matter, ternary computers just have ternary bits, 8 of them ;)


Supposedly, "bit" is short for "binary digit", so we'd need a separate term for "ternary digit", but I don't wanna go there.


The prefix is tri-, not ti- so I don’t think there was any concern of going anywhere.

It’s tricycle and tripod, not ticycle.


The standard term is "trit" because they didn't want to go there.


Ternary computers have 8 tits to a byte.


Should be either 9 or 27 I'd think.


Why can't it be 8? The fact that it's a trit doesn't put any constraint on the byte (tryte?) size. You could actually make it 5 or 6 trits (~9.5 bits) for similar information density. The Setun used 6-trit addressable units.


How many bytes is a devour?


In a char, not in a byte. Byte != char


A common programming error in C is reading input as char rather than int.

https://man7.org/linux/man-pages/man3/fgetc.3.html

fgetc(3) and its companions always return character-by-character input as an int, and the reason is that EOF is represented as -1. An unsigned char is unable to represent EOF. If you stuff the return value into a char, you'll never reliably detect this condition.

However, if you don't receive an EOF, then it should be perfectly fine to cast the value to unsigned char without loss of precision.
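
For reference, the usual idiom looks something like this (just echoing stdin here; read into an int, test for EOF, then narrow):

  #include <cstdio>

  int main() {
      int c;  // int, not char: must hold every unsigned char value plus EOF
      while ((c = std::fgetc(stdin)) != EOF) {
          unsigned char byte = static_cast<unsigned char>(c);  // safe once EOF is ruled out
          std::putchar(byte);
      }
  }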


In C and C++ a byte and a char are the same size by definition. Don't confuse a byte with an octet.



Just mandate that everything must be run on an Intel or ARM chip and be done with it. Stop pretending anything else is viable.


C++ is the second-most-widely-available language (behind C). Many other things are viable. Everything from a Z15 IBM mainframe to almost every embedded chip in existence. ("Viable" meaning "still being produced and used in volume, and still being used in new designs".)

The next novel chip design is going to have a C++ compiler too. No, we don't yet know what its architecture will be.


Oh, but we do know - in order to be compatible with the existing languages, it's going to have to look similar to what we have now. It will have to keep 8-bit bytes instead of going wider because that's what IBM came up with in the 1950s, and it will have to be a stack-oriented machine that looks like a VAX so it can run C programs. Unicode will always be a second-class character set behind ASCII because it has to look like it runs Unix, and we will always use IEEE floating point with all its inaccuracy because using scaled decimal data types just makes too much sense and we can't have that.


RISC V is gaining ground.


formerly or formally?


Obviously



