I'm literally giving a talk next week whose first slide is essentially "Why IEEE 754 is not a sufficient description of floating-point semantics", and I'm sitting here trying to figure out what needs to be thrown out of the talk to make it fit the time slot.
One of the most surprising things about floating-point is that very little is actually IEEE 754; most things are merely IEEE 754-ish, and there's a long tail of fiddly things that are different that make it only -ish.
The IEEE 754 standard has been updated several times, often by relaxing previous mandates so that various hardware implementations become compliant retroactively (e.g., adding Intel's 80-bit floats as a standard floating-point size).
It'll be interesting if the "-ish" bits are still "-ish" with the current standard.
The first 754 standard (1985) was essentially formalization of the x87 arithmetic; it defines a "double extended" format. It is not mandatory:
> Implementations should support the extended format corresponding to the widest basic format supported.
_If_ it exists, it is required to have at least as many bits as the x87 long double type.¹
The language around extended formats changed in the 2008 standard, but the meaning didn't:
> Language standards or implementations should support an extended precision format that extends the widest basic format that is supported in that radix.
That language is still present in the 2019 standard. So nothing has ever really changed here. Double-extended is recommended, but not required. If it exists, the significand and exponent must be at least as large as those of the Intel 80-bit format, but they may also be larger.
---
¹ At the beginning of the standardization process, Kahan and Intel engineers still hoped that the x87 format would gradually expand in subsequent CPU generations until it became what is now the standard 128b quad format; they didn't understand the inertia of binary compatibility yet. So the text only set out minimum precision and exponent range. By the time the standard was published in 1985, it was understood internally that they would never change the type, but by then other companies had introduced different extended-precision types (e.g. the 96-bit type in Apple's SANE), so it was never pinned down.
The first 754 standard did still remove some 8087 features, mainly the "projective" infinity, and it slightly changed the definition of the remainder function, so it was not completely compatible with the 8087.
The Intel 80387 was made compliant with the final standard, and by that time there were competing FPUs also compliant with the final standard, e.g. the Motorola 68881.
> there's a long tail of fiddly things that are different that make it only -ish.
Perhaps a way to fill some time would be gradually revealing parts of a convoluted Venn diagram or mind-map of the fiddly things. (That is, assuming there's any sane categorization.)
Whether double floats can silently have 80 bit accumulators is a controversial thing. Numerical analysis people like it. Computer science types seem not to because it's unpredictable. I lean towards, "we should have it, but it should be explicit", but this is not the most considered opinion. I think there's a legitimate reason why Intel included it in x87, and why DSPs include it.
Numerical analysis people do not like it. Having _explicitly controlled_ wider accumulation available is great. Having compilers deciding to do it for you or not in unpredictable ways is anathema.
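To make "explicitly controlled" concrete, here's a minimal sketch (my own example, not anyone's library code): the wider accumulation is written out in the source, with one explicit rounding at the end, rather than left to the compiler's register allocator.

    #include <cstdio>
    #include <vector>

    // Sum float inputs in a double accumulator on purpose, instead of hoping
    // the compiler keeps intermediates in 80-bit registers for you.
    float sum_with_wide_accumulator(const std::vector<float>& xs) {
        double acc = 0.0;                  // deliberately wider than the element type
        for (float x : xs) acc += x;       // every partial sum carries double precision
        return static_cast<float>(acc);    // one explicit, final rounding to float
    }

    int main() {
        std::vector<float> xs(10000000, 0.1f);
        std::printf("%f\n", sum_with_wide_accumulator(xs));
        return 0;
    }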
It can be harmful. With GCC, when compiling a 32-bit executable, using a std::map<float, T> can cause infinite loops or crashes in your program.
This is because when you insert a value into the map, it has 80 bit precision, and that number of bits is used when comparing the value you are inserting during the traversal of the tree.
After the float is stored in the tree, it's clamped to 32 bits.
This can cause the element to be inserted in the wrong position in the tree, which breaks the invariants of the algorithm, leading to the crash or infinite loop.
Compiling for 64 bits or explicitly disabling x87 float math makes this problem go away.
I have actually had this bug in production and it was very hard to track down.
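For anyone who hasn't been bitten by this, the shape of the trap looks something like the sketch below (hypothetical code, not the poster's; assumes a 32-bit x87 build along the lines of g++ -m32 -mfpmath=387 -O2):

    #include <cstdio>
    #include <map>

    // The comparator can see a freshly computed key at 80-bit register precision
    // while the copy already stored in a tree node has been rounded to 32 bits,
    // so the strict weak ordering the red-black tree relies on becomes inconsistent.
    volatile float scale = 1.0f / 3.0f;    // volatile so nothing is constant-folded

    int main() {
        std::map<float, int> m;
        for (int i = 0; i < 100000; ++i) {
            float key = i * scale;         // may live in an x87 register with excess precision...
            m[key] = i;                    // ...but is stored in the node as a 32-bit float
        }
        std::printf("%zu entries\n", m.size());
        return 0;
    }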
10 years ago, a coworker had a really hard time root-causing a bug. I shoulder-debugged it by noticing the bit patterns: it was a miscompile of LLVM itself by GCC, where GCC was using an x87 fldl/fstpl move for a union { double; int64; }. The active member was actually the int64, and GCC chose an FP move based on what the first member of the union was... but the int64 happened to be the representation of an SNaN, so the instructions quietly transformed it into a qNaN as part of the move. The "fix" was to change the order of the union's members in LLVM. The bug is still open, though it's had recent activity: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58416
It also affected Emacs compilation, and the fix is in trunk now.
Wow, 11 years for such a banal, minimal code trigger. I really don't quite understand how we can have infrastructure at this scale in operation when these kinds of infrastructure software bugs exist. And it's not just GCC. The whole working house of cards is an achievement by itself, and also a reminder that good enough is all that is needed.
I also highly doubt that one in a thousand developers could successfully debug this issue were it happening in the wild, and far fewer could actually fix it.
What use case do you have that requires indexing a hashmap by a floating point value? Keep in mind, even with a compliant implementation that isn't widening your types behind your back, you still have to deal with NaN.
In fact, Rust has the Eq trait specifically to keep f32/f64 out of hash tables, because NaN breaks them really badly.
Rust's BTreeMap, which is much closer to what std::map is, also requires Ord (i.e. types which claim to possess a total order) for any key you put in the map.
However, Ord is an ordinary safe trait. So while we're claiming to be totally ordered, we're allowed to be lying; the resulting type is crap, but it's not unsafe. So, as with sorting, the algorithms inside these container types, unlike in C or C++, actually must not blow up horribly when we were lying (or, as is common in real software, simply clumsy and mistaken).
The infinite loop would be legal (but I haven't seen it) because that's not unsafe, but if we end up with Undefined Behaviour that's a fault in the container type.
This is another place where, in theory, C++ gives itself license to deliver better performance at the cost of reduced safety, but the reality in existing software is that you get no safety and also worse performance. The popular C++ compilers are drifting towards tacit acceptance that Rust made the right choice here, and so, as a QoI decision, they should ship the Rust-style algorithms.
Detecting and filtering out NaNs is both trivial and reliable, as long as nobody instructs the compiler to break basic floating-point operations (so no -ffast-math). Dealing with a compiler that randomly changes the values of your variables is much harder.
It’s absolutely harmful. It turns computations that would be guaranteed to be exact (e.g. head-tail arithmetic primitives used in computational geometry) into “maybe it’s exact and maybe it’s not, it’s at the compiler’s whim” and suddenly your tests for triangle orientation do not work correctly and your mesh-generation produces inadmissible meshes, so your PDE solver fails.
Thank you, I found this hint very interesting. Is there a source you wouldn't mind pointing me to for those "head, tail" methods?
I am assuming it relates to the kinds of "variable precision floating point with bounds" methods used in CGAL and the like; Googling turns up this survey paper:
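Not the survey itself, but for a flavor of what "head-tail" means: the classic error-free addition usually attributed to Knuth (TwoSum), sketched below as my own example. It relies on every operation rounding correctly at one fixed precision, which is exactly the property that silent excess precision can break.

    #include <cstdio>

    // Knuth's TwoSum: hi is the correctly rounded sum, lo is the exact rounding
    // error, so hi + lo == a + b exactly -- assuming strict IEEE-754 double
    // arithmetic with no hidden extended precision.
    void two_sum(double a, double b, double& hi, double& lo) {
        hi = a + b;
        double bb = hi - a;                  // the part of b that made it into hi
        lo = (a - (hi - bb)) + (b - bb);     // what got rounded away
    }

    int main() {
        double hi, lo;
        two_sum(1.0, 1e-30, hi, lo);
        std::printf("hi = %.17g, lo = %.17g\n", hi, lo);   // lo recovers the lost 1e-30
        return 0;
    }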
If not done properly, double rounding (rounding to extended precision and then to working precision) can actually introduce a larger approximation error than rounding to the nearest working-precision value directly. So it can actually make some numerical algorithms perform worse.
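A self-contained toy illustration of that claim (the numbers are mine; it simulates round-to-nearest-even at tiny precisions rather than using real hardware formats):

    #include <cstdint>
    #include <cstdio>

    // Exact value: 1.0100001 in binary = 161 (in units of 2^-7).
    // Rounding straight to 2 significant bits gives 1.1 (192), off by 31.
    // Rounding first to 3 bits (160 = 1.01), then to 2, hits a tie that
    // round-to-even resolves to 1.0 (128), off by 33 -- more than half an
    // ulp of the target precision, which a single rounding never exceeds.

    // Round v to the nearest multiple of 2^drop_bits, ties to even.
    uint64_t rne(uint64_t v, int drop_bits) {
        uint64_t ulp = 1ull << drop_bits, half = ulp >> 1;
        uint64_t down = v & ~(ulp - 1), rem = v - down;
        if (rem > half || (rem == half && (down & ulp)))
            down += ulp;
        return down;
    }

    int main() {
        uint64_t v = 161;                     // 1.0100001 in binary
        uint64_t once = rne(v, 6);            // keep 2 significant bits directly
        uint64_t twice = rne(rne(v, 5), 6);   // via a 3-bit "extended" step
        std::printf("exact %llu, rounded once %llu, rounded twice %llu\n",
                    (unsigned long long)v, (unsigned long long)once,
                    (unsigned long long)twice);
        return 0;
    }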
I've seen other course notes, I think also from Kahan, extolling 80-bit hardware.
Personally I am starting to think that, if I'm really thinking about precision, I had maybe better just use fixed point, but this again is just a "lean" that could prove wrong over time. Somehow we use floats everywhere and it seems to work pretty well, almost unreasonably so.
Yeah. Kahan was involved in the design of the 8087, so he’s always wanted to _have_ extended precision available. What he (and I, and most other numerical analysts) are opposed to is the fact that (a) language bindings historically had no mechanism to force rounding to float/double when necessary, and (b) compilers commonly spilled x87 intermediate results to the stack as doubles, leading to intermediate rounding that was extremely sensitive to optimization and subroutine calls, making debugging numerical issues harder than it should be.
Modern floating-point is much more reproducible than fixed-point, FWIW, since it has an actual standard that’s widely adopted, and these excess-precision issues do not apply to SSE or ARM FPUs.
Note that this type (which Rust will call, and in nightly already does call, "f16", and which a C-like language would probably name "half") is not the only popular 16-bit floating-point type, as some people want to have https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
The IEEE FP16 format is what is useful in graphics applications, e.g. for storing color values.
The Google BF16 format is useful only for machine learning/AI applications, because its precision is insufficient for anything else. BF16 has very few significand bits, but an exponent range equal to FP32's, which makes overflows and underflows less likely.
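For concreteness (my own illustration, not from the parent): BF16 keeps binary32's sign bit and 8-bit exponent and truncates the significand to 7 bits, so the crudest float-to-bf16 conversion is literally taking the top 16 bits of the float; FP16, with its 5-bit exponent and 10-bit significand, needs a genuine conversion.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Truncating conversion; production libraries usually also round to nearest.
    uint16_t float_to_bf16_truncate(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);       // well-defined type pun
        return static_cast<uint16_t>(bits >> 16);  // sign + 8 exponent + top 7 significand bits
    }

    int main() {
        // 1.0f is 0x3f800000 as binary32; its bf16 truncation is the top half, 0x3f80.
        std::printf("%#06x\n", static_cast<unsigned>(float_to_bf16_truncate(1.0f)));
        return 0;
    }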
Which one? Remember the decimal IEEE 754 floating point formats exist too. Do folks in banking use IEEE decimal formats? I remember we used to have different math libs to link against depending, but this was like 40 years ago.
Binding float to the IEEE 754 binary32 format would not preclude use of decimal formats; they have their own bindings (e.g. _Decimal64 in C23). (I think they're still a TR for C++, but I haven't been keeping track).
Nothing prevents banks (or anyone else) from using a compiler where "float" means binary floating point while some other native or user-defined type supports decimal floating point. In fact, that's probably for the best, since they'll probably have exacting requirements for that type so it makes sense for the application developer to write that type themselves.
I was referring to banks using decimal libraries because they work in base 10 numbers, and I recall a big announcement many years ago when the stock market officially switched from fractional stock pricing to cents "for the benefit of computers and rounding", or some such excuse. It always struck me as strange, since binary fixed and floating point represent those particular quantities exactly, without rounding error. Now with normal dollars and cents calculations, I can see why a decimal library might be preferred.
During an internship in 1986 I wrote C code for a machine with 10-bit bytes, the BBN C/70. It was a horrible experience, and the existence of the machine in the first place was due to a cosmic accident of the negative kind.
I wrote code on a DECSYSTEM-20, the C compiler was not officially supported. It had a 36-bit word and a 7-bit byte. Yep, when you packed bytes into a word there were bits left over.
And I was tasked with reading a tape with binary data in 8-bit format. Hilarity ensued.
Paraphrasing: legacy keying systems were based on records of up to 10 printed decimal digits of accuracy for input. 35 bits would be required to match the +/- input, but 36 works better as a machine word and for operations on 6 x 6-bit (yuck?) characters; some 'smaller' machines instead used a 36-bit large word and 12- or 18-bit small words. Why the yuck? That's only 64 characters total, so these systems only supported UPPERCASE ALWAYS, numeric digits, and some other characters.
I think the PDP-10 could have 9-bit bytes, depending on decisions you made in the compiler. I notice it's hard to Google information about this, though; people say lots of confusing, conflicting things. When I google PDP-10 byte size, it says a C++ compiler chose to represent char as 36 bits.
Basically, loading a 0-bit byte from memory gets you a 0. Depositing a 0-bit byte will not alter memory, but may do an ineffective read-modify-write cycle. Incrementing a 0-bit byte pointer will leave it unchanged.
5. Arithmetic is an explicit choice. '+' overflowing is illegal behavior (it will crash in Debug and ReleaseSafe), '+%' is 2's complement wrapping, and '+|' is saturating arithmetic. Edit: forgot to mention @addWithOverflow(), which provides a tuple of the original type and a u1; there's also std.math.add(), which returns an error on overflow.
6. f16, f32, f64, f80, and f128 are the IEEE floating-point types of the respective bit lengths.
The question of the length of a byte doesn't even matter. If someone wants to compile to machine whose bytes are 12 bits, just use u12 and i12.
Sounds like zero-sized types in Rust, where they are used as marker types (e.g. this struct owns this lifetime). They can also be used to turn a HashMap into a HashSet by storing a zero-sized value. In Go, a struct member of [0]func() (an array of functions with exactly 0 elements) is used to make a type uncomparable, as func() cannot be compared.
Rust has two possible behaviours: panic or wrap. By default debug builds with panic, release builds with wrap. Both behaviours are 100% defined, so the compiler can't do any shenanigans.
There are also helper functions and types for unchecked/checked/wrapping/saturating arithmetic.
On the other hand, it doesn’t make a distinction between signed and unsigned integers. Users must take care to use special signed versions of operations where needed.
How does 5 work in practice? Surely no one is actually checking whether their arithmetic overflows, especially with user-supplied or otherwise external values. Is there any use for the normal +?
I'm sure it's not literally no one but I bet the percent of additions that have explicit checks for overflow is for all practical purposes indistinguishable from 0.
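For what it's worth, the check is a one-liner where compilers expose it; a sketch using the GCC/Clang builtin (checked_add is my own wrapper name, not a standard function):

    #include <climits>
    #include <cstdio>

    // __builtin_add_overflow returns true if the mathematically correct sum
    // does not fit in *out; it exists in both GCC and Clang.
    bool checked_add(int a, int b, int* out) {
        return !__builtin_add_overflow(a, b, out);
    }

    int main() {
        int sum;
        if (checked_add(INT_MAX, 1, &sum))
            std::printf("sum = %d\n", sum);
        else
            std::printf("overflow detected\n");
        return 0;
    }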
It would be "nice" if not for C setting a precedent for these names to have unpredictable sizes. Meaning you have to learn the meaning of every single type for every single language, then remember which language's semantics apply to the code you're reading. (Sure, I can, but why do I have to?)
[ui][0-9]+ (and similar schemes) on the other hand anybody can understand at the first glance.
This doesn't feel like a serious question, but in case this is still a mystery to you… the name bit is a portmanteau of binary digit, and as indicated by the word "binary", there are only two possible digits that can be used as values for a bit: 0 and 1.
A bit is a measure of information theoretical entropy. Specifically, one bit has been defined as the uncertainty of the outcome of a single fair coin flip. A single less than fair coin would have less than one bit of entropy; a coin that always lands heads up has zero bits, n fair coins have n bits of entropy and so on.
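To put numbers on that, a small sketch of the binary entropy formula H(p) = -p*log2(p) - (1-p)*log2(1-p) (my own example values):

    #include <cmath>
    #include <cstdio>

    // Entropy, in bits, of a coin that lands heads with probability p.
    double coin_entropy_bits(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;   // a certain outcome carries no information
        return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
    }

    int main() {
        std::printf("fair coin:    %.3f bits\n", coin_entropy_bits(0.5));   // 1.000
        std::printf("90/10 coin:   %.3f bits\n", coin_entropy_bits(0.9));   // ~0.469
        std::printf("always heads: %.3f bits\n", coin_entropy_bits(1.0));   // 0.000
        return 0;
    }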
This comment, I feel sure, would repulse Shannon in the deepest way. A (digital, stored) bit abstractly seeks to encode, and make useful through computation, the properties of information theory.
I do not know or care what Mr. Shannon would think. What I do know is that the base you choose for the logarithm in the entropy equation has nothing to do with the number of bits you assign to a word on a digital architecture :)
How philosophical do you want to get? Technically, voltage is a continuous signal, but we sample only at clock cycle intervals, and if the sample at some cycle is below a threshold, we call that 0. Above, we call it 1. Our ability to measure whether a signal is above or below a threshold is uncertain, though, so for values where the actual difference is less than our ability to measure, we have to conclude that a bit can actually take three values: 0, 1, and we can't tell but we have no choice but to pick one.
The latter value is clearly less common than 0 and 1, but how much less? I don't know, but we have to conclude that the true size of a bit is probably something more like 1.00000000000000001 bits rather than 1 bit.
I don't think the term word has any consistent meaning. Certainly x86 doesn't use the term word to mean smallest addressable unit of memory. The x86 documentation defines a word as 16 bits, but x86 is byte addressable.
ARM is similar, ARM processors define a word as 32-bits, even on 64-bit ARM processors, but they are also byte addressable.
As best as I can tell, it seems like a word is whatever the size of the arithmetic or general purpose register is at the time that the processor was introduced, and even if later a new processor is introduced with larger registers, for backwards compatibility the size of a word remains the same.
Every ISA I've ever used has used the term "word" to describe a 16- or 32-bit quantity, while having instructions to load and store individual bytes (8 bit quantities). I'm pretty sure you're straight up wrong here.
The difference between address A and address A+1 is one byte. By definition.
Some hardware may raise an exception if you attempt to retrieve a value at an address that is not a (greater than 1) multiple of a byte, but that has no bearing on the definition of a byte.
Over the years I've known some engineers who, as a side project, wrote some great software. Nobody was interested in it. They'd come to me and ask why that was. I suggested writing articles about their projects and being active on the forums. Otherwise, who would ever know about them?
They said that was unseemly, and wouldn't do it.
They wound up sad and bitter.
The "build it and they will come" is a stupid Hollywood fraud.
BTW, the income I receive from D is $0. It's my gift. You'll also note that I've suggested many times improvements that could be made to C, copying proven ideas in D. Such as this one:
To be fair, this one lies on the surface for anyone trying to come up with an improved C. It's one of the first things that gets corrected in nearly all C derivatives.
> C++ has already adopted many ideas from D.
Do you have a list?
Especially for the "adopted from D" bit, rather than being an evolutionary and logical improvement to the language.
I like the Rust approach more: usize/isize are the native integer types, and with every other numeric type, you have to mention the size explicitly.
On the C++ side, I sometimes use an alias that contains the word "short" for 32-bit integers. When I use them, I'm explicitly assuming that the numbers are small enough to fit in a smaller than usual integer type, and that it's critical enough to performance that the assumption is worth making.
<cstdint> has int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, and uint64_t. I still go back and forth between uint64_t, size_t, and unsigned int, but am defaulting to uint64_t more and more, even if it doesn't matter.
Ok, it is obvious that you are looking for something to complain about and don't want to find a solution. That is not a productive attitude in life, but whatever floats your boat. Have a good day.
You know, that describes pretty much everyone who has anything to do with Rust.
"My ls utility isn't written in Rust, yikes! Let's fix that!"
"The comments under this C++-related HN submission aren't talking about Rust enough, yikes! Let's fix that!"
I'm obviously pointing to a solution: have a standard module that any Rust program can depend on coming from the language, which has a few sanely named types. Rather than every program defining its own.
If you use i32, it looks like you care. Without studying the code, I can't be sure that it could be changed to i16 or i64 without breaking something.
Usually, I just want the widest type that is efficient on the machine, and I don't want it to have an inappropriate name. I don't care about the wasted space, because it only matters in large arrays, and often not even then.
In Rust, that's not really the case. `i32` is the go-to integer type.
`isize` on the other hand would look really weird in code — it's an almost unused integer type. I also prefer having integers that don't depend on the machine I'm running them on.
And which machine is that? The only computers that I can think of with only 64-bit integers are the old Cray vector supercomputers, and they used word addressing to begin with.
It will likely be common in another 25 to 30 years, as 32 bit systems fade into the past.
Therefore, declaring that int32 is the go to integer type is myopic.
Forty years ago, a program like this could be run on a 16 bit machine (e.g. MS-DOS box):
#include <stdio.h>

int main(int argc, char **argv)
{
    while (argc-- > 0)
        puts(*argv++);
    return 0;
}
int was 16 bits. That was fine; you would never pass anywhere near 32000 arguments to a program.
Today, that same program does the same thing on a modern machine with a wider int.
Good thing that some int16 had not been declared the go to integer type.
Rust's integer types are deliberately designed (by people who know better) in order to be appealing to people who know shit all about portability and whose brains cannot handle reasoning about types with a bit of uncertainty.
Sorry, but I fail to see where the problem is. Any general purpose ISA designed in the past 40 years can handle 8/16/32 bit integers just fine regardless of the register size. That includes the 64-bit x86-64 or ARM64 from which you are typing.
There are a few historical architectures that couldn't handle smaller integers, like the first-generation Alpha, but:
a) those are long dead.
b) hardware engineers learnt from their mistake and no modern general purpose architecture has repeated it (specialized architectures like DSP and GPU are another story though).
c) worst case scenario, you can simulate it in software.
Java was just unlucky; it standardised its strings at the wrong time (when Unicode was 16-bit code points):
Java was announced in May 1995, and the following comment from the Unicode history wiki page makes it clear what happened: "In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. ..."
I would say anyone mentioning a specific encoding / size just wants to see the world burn. Unicode is variable-length at several levels; how many people want to deal with the fact that their text could be non-normalized, or want the ability to cut out individual "char" elements only to get a nonsensical result because the following elements were logically connected to that char? Give developers a decent high-level abstraction and don't force them to deal with the raw bits unless they ask for it.
I think this is what Rust does, if I remember correctly: it provides APIs on strings to enumerate the characters accurately, meaning not necessarily byte by byte.
The languages that have this kind of feature built-in in the standard library, to my knowledge, are Swift, JavaScript, C# and Java. Swift is the only one, of those four, that treat operating on graphemes as the default. JavaScript requires Intl.Segmenter, C# requires StringInfo, Java requires BreakIterator.
By the way, Python, the language that caused so much hurt with its 2.x->3.x transition while promising better Unicode support in return for the pain, couldn't even do this right. There is no concept of graphemes in the standard library. So much for the batteries-included bit.
Swift is truly the nicest of programming languages as far as strings are concerned. It just works as it always should have been.
let test = " "
for char in test {
    print(char)
}
print(test.count)
output :
1
[Execution complete with exit code 0]
I, as a non-Apple user, feel quite the Apple envy whenever I think about swift. It's such a nice language, but there's little ecosystem outside of Apple UIs.
But man, no third-party libraries, no wrapper segmenter class or iterator to work with. Just use the base string literals as is. It. Just. Works.
For context, it looks like you’re talking about iterating by grapheme clusters.
I understand how iterating through a string by grapheme clusters is convenient for some applications. But it's far from obvious to me that doing so should be the language's default. Dealing with grapheme clusters requires a Unicode database, which needs to live somewhere and needs to be updated continuously as Unicode grows. (Should Rust statically link that database into every app that uses it?)
Generally there are 3 ways to iterate over a string: by UTF-8 bytes (or UTF-16 code units in Java/JS/C#), by Unicode code point, or by grapheme cluster. UTF-8 encoding comes up all the time when encoding/decoding strings, like to JSON or when sending content over HTTP. Code points are, in my opinion, the correct approach when doing collaborative editing or patching strings. And grapheme clusters are useful in frontend user interfaces, like when building a terminal.
Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least. Tell me - why should grapheme clusters be the default way to iterate over a string? I can see the argument in Swift, which is a language built for frontend UI. But in a systems language like rust? That seems like a terrible default to me. UTF8 bytes are by far the most useful representation for strings in systems code, since from the pov of systems code, strings are usually just data.
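To make the three levels concrete, a sketch (my own example string; grapheme clustering, the third level, genuinely needs a segmentation library such as ICU, which is rather the point):

    #include <cstdint>
    #include <cstdio>
    #include <string>

    int main() {
        std::string s = "e\xCC\x81";   // 'e' followed by U+0301 COMBINING ACUTE ACCENT

        // Level 1: raw UTF-8 bytes -- three of them.
        for (unsigned char b : s)
            std::printf("byte 0x%02x\n", b);

        // Level 2: Unicode code points -- two of them (minimal decoder, valid input assumed).
        for (std::size_t i = 0; i < s.size();) {
            unsigned char lead = s[i];
            std::size_t len = (lead < 0x80) ? 1 : (lead < 0xE0) ? 2 : (lead < 0xF0) ? 3 : 4;
            std::uint32_t cp = (len == 1) ? lead : (lead & (0x7F >> len));
            for (std::size_t j = 1; j < len; ++j)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
            std::printf("code point U+%04X\n", static_cast<unsigned>(cp));
            i += len;
        }

        // Level 3: grapheme clusters -- just one here, but counting it needs Unicode data.
        return 0;
    }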
This was not meant as criticism of Rust in particular (though, while it shouldn't be the default behavior of strings in a systems language, surely at least an official implementation of a wrapper should exist?), but high-level languages with a ton of baggage, like Python, should definitely provide the correct way to handle strings. The amount of software I've seen that is unable to properly handle strings, because the language didn't provide the required grapheme handling and the developer was also not aware of the reality of graphemes and Unicode...
You mention terminals; yes, that's one of the areas where graphemes are an absolute must, but pretty much any time you do something to text, like deciding "I am going to put a line break here so that the text doesn't overflow beyond the box, beyond this A4 page I want to print, beyond the browser's window", grapheme handling is involved.
Any time a user is asked to input something, too. I've seen most software take the "iterate over characters" approach to real-time user input, and they break things like those emojis down into their individual components whenever you paste something in.
For that matter, backspace doesn't work properly in software you would expect to do better than that. Put the emoji from my pastebin in Microsoft Edge's search/URL bar, then hit backspace and see what happens. While the browser displays the emoji correctly, the input field treats it the way Python segments it in my example: you need to press backspace 7 times to delete it. 7 times! Windows Terminal, on the other hand, has the quirk of showing a lot of extra spaces after the emoji (despite displaying the emoji correctly too) and will also require 11 backspaces to delete it.
Notepad handles it correctly: press backspace once, it's deleted, like any normal character.
> Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least.
This doesn't say anything about grapheme clusters being useless. I've cited examples of popular software doing the wrong thing precisely because, like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.
The dismissiveness over more sane string handling as a standard is not unlike C++ developers pretending that developers are doing the right thing with memory management so we don't need a GC (or rust's ownership paradigm). Nonsense.
Those are good examples! Notably, all of them are in reasonably low level, user-facing code.
Your examples are implementing custom text input boxes (Excel, Edge), line breaks while printing, and implementing a terminal application. I agree that in all of those cases, grapheme cluster segmentation is appropriate. But that doesn't make grapheme cluster based iteration "the correct way to handle strings". There's no "correct"! There are at least 3 different ways to iterate through a string, and different applications have different needs.
Good languages should make all of these options easy for programmers to use when they need them. Writing a custom input box? Use grapheme clusters. Writing a text based CRDT? Treat a string as a list of unicode codepoints. Writing an HTTP library? Treat the headers and HTML body as ASCII / opaque bytes. Etc.
I take the criticism that Rust makes grapheme iteration harder than the others. But eh, Rust has truly excellent crates for that within arm's reach. I don't see any advantage in moving grapheme-based segmentation into std. Well, maybe it would make it easier to educate idiot developers about this stuff. But there's no real technical reason. It's situationally useful - but less useful than lots of other 3rd party crates like rand, tokio and serde.
> like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.
It says that in 30+ years of programming, I've never programmed a text input field from scratch. Why would I? That's the job of the operating system. Making my own sounds like a huge waste of time.
While I don't agree with not having unsigned as part of the primitive types, and I look forward to Valhalla fixing that, it was based on the experience that most devs don't get unsigned arithmetic right.
"For me as a language designer, which I don't really count myself as these days, what "simple" really ended up meaning was could I expect J. Random Developer to hold the spec in his head. That definition says that, for instance, Java isn't -- and in fact a lot of these languages end up with a lot of corner cases, things that nobody really understands. Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex. The language part of Java is, I think, pretty simple. The libraries you have to look up."
I mean practically speaking in C++ we have (it just hasn't made it to the standard):
1. char 8 bit
2. short 16 bit
3. int 32 bit
4. long long 64 bit
5. arithmetic is 2s complement
6. IEEE floating point (float is 32, double is 64 bit)
Along with other stuff like little endian, etc.
Some people just mistakenly think they can't rely on such stuff, because it isn't in the standard. But they forget that having an ISO standard comes on top of what most other languages have, which rely solely on the documentation.
I work every day with real-life systems where int can be 32 or 64 bits, long long can be 64 or 128 bits, long double can be 64 or 80 or 128 bits, some systems do not have IEEE 754 floating point (no denormals!) some are big endian and some are little endian. These things are not in the language standard because they are not standard in the real world.
Practically speaking, the language is the way it is, and has succeeded so well for so long, because it meets the requirements of its application.
There are also people who write COBOL for a living. What you say is not relevant at all for 99.99% of C++ code written today. Also, all compilers can be configured to be non-standard compliant in many different ways, the classic example being -fno-exceptions. Nobody says all kinds of using a standardized language must be standard conformant.
Yeah, so their documentation serves as the authority on how you're supposed to write your code for it to be "correct D" or "correct Rust". The compiler implementors write their compilers against the documentation (and vice versa). That documentation is clear on these things.
In C, the ISO standard is the authority on how you're supposed to write your code for it to be "correct C". The compiler implementors write their compilers against the ISO standard. That standard is not clear on these things.
I don't think this is true. The target audience of the ISO standard is the implementers of compilers and other tools around the language. Even the people involved in creating it make that clear by publishing other material like the core guidelines, conference talks, books, online articles, etc., which are targeted to the users of the language.
Core guidelines, conference talks, books, online articles, etc. are not authoritative. If I really want to know if my C code is correct C, I consult the standard. If the standard and an online article disagrees, the article is wrong, definitionally.
Correction: if you want to know if your compiler is correct, you look at the ISO standard. But even as a compiler writer, the ISO standard is not exhaustive. For example the ISO standard doesn't define stuff like include directories, static or dynamic linking, etc.
Me? I just dabble with documenting an unimplemented "50% more bits per byte than the competition!" 12-bit fantasy console of my own invention - replete with inventions such as "UTF-12" - for shits and giggles.
Yes, I'm trying to figure out which are still relevant and whether they target a modern C++, or intend to. I've been asking for a few years and haven't gotten positive answers. The only one that been brought up is TI, I added info in the updated draft: https://isocpp.org/files/papers/D3477R1.html
They can just target C++23 or earlier, right? I have a small collection of SHARCs but I am not going to go crying to the committee if they make C++30 (or whatever) not support CHAR_BIT=32
TI DSP Assembler is pretty high level, it's "almost C" already.
Writing geophysical | military signal and image processing applications on custom DSP clusters is surprisingly straightforward and doesn't need C++.
It's a RISC architecture optimised for DSP | FFT | array processing, with the basic simplification that char text is for hosts, integers and floats are at least 32 bits, and 32 bits (or 64) is the smallest addressable unit.
Fantastic architecture to work with for numerics and deep computational pipelines: once "primed", you push in raw acquisition samples in chunks every clock cycle and extract processed moving-window data chunks every clock cycle.
A single ASM instruction in a single cycle can accumulate totals from a vector multiplication and modulo-update the indexes on three vectors (two inputs and one output).
Signed integers did not have to be 2's complement; there were 3 valid representations: sign-magnitude, 1's complement, and 2's complement. Modern C and C++ dropped this and mandate 2's complement ("as if", but that distinction is moot here; you could do the same for CHAR_BIT). So there is certainly precedent for this sort of thing.
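For readers who have only ever seen two's complement, here is -5 in 8 bits under each of the three representations (my own illustration):

    #include <cstdio>

    int main() {
        // +5 is 00000101 in all three schemes; only the negatives differ.
        unsigned sign_magnitude  = 0b10000101;   // flip the sign bit
        unsigned ones_complement = 0b11111010;   // flip every bit (note: two zeros, +0 and -0)
        unsigned twos_complement = 0b11111011;   // flip every bit, then add one
        std::printf("%u %u %u\n", sign_magnitude, ones_complement, twos_complement);
        return 0;
    }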
The GC API from C++11 was removed in C++23, understandably so, given that it wasn't designed taking into account the needs of Unreal C++ and C++/CLI, the only two major variants that have GC support.
Exception specifications have been removed, although some want them back for value type exceptions, if that ever happens.
auto_ptr has been removed, given its broken design.
Now, on the simplifying side, not really, as the old ways still need to be understood.
Don’t break perfection!! Just accumulate more perfection.
What we need is a new C++ symbol that reliably references eight bit bytes, without breaking compatibility, or wasting annnnnny opportunity to expand the kitchen sink once again.
I propose “unsigned byte8” and (2’s complement) “signed byte8”. And “byte8” with undefined sign behavior because we can always use some more spice.
“unsigned decimal byte8” and “signed decimal byte8”, would limit legal values to 0 to 10 and -10 to +10.
For the damn accountants.
“unsigned centimal byte8” and “signed centimal byte8”, would limit legal values to 0 to 100 and -100 to +100.
For the damn accountants who care about the cost of bytes.
Also for a statistically almost valid, good enough for your customer’s alpha, data type for “age” fields in databases.
How is rand() broken? It seems to produce random-ish values, which is what it's for. It obviously doesn't produce cryptographically secure random values, but that's expected (and reflects other languages' equivalent functions). For a decently random integer that's quick to compute, rand() works just fine.
RAND_MAX is only guaranteed to be at least 32767. So if you use `rand() % 10000` you'll have a real bias towards 0-2767, and even `rand() % 1000` is already not uniform (biased towards 0-767). And that assumes rand() is uniform over 0-RAND_MAX in the first place.
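A sketch of the usual workarounds (the function names are mine): either reject the incomplete top bucket before taking the modulus, or skip rand() entirely and use <random>.

    #include <cstdio>
    #include <cstdlib>
    #include <random>

    // Rejection sampling: discard draws from the partial bucket at the top of
    // the range so every residue 0..999 is equally likely.
    int rand_0_999_rejection() {
        const int limit = (RAND_MAX / 1000) * 1000;
        int r;
        do { r = std::rand(); } while (r >= limit);
        return r % 1000;
    }

    // The C++11 way: a real engine plus a distribution that handles the bias for you.
    int rand_0_999_cpp() {
        static std::mt19937 gen{std::random_device{}()};
        return std::uniform_int_distribution<int>{0, 999}(gen);
    }

    int main() {
        std::printf("%d %d\n", rand_0_999_rejection(), rand_0_999_cpp());
        return 0;
    }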
This is such an odd thing to read & compare to how eager my colleagues are to upgrade the compiler to take advantage of new features. There's so much less need to specify types in situations where the information is implicitly available after C++ 20/17. So many boost libraries have been replaced by superior std versions.
And this has happened again and again on this enormous codebase that started before it was even called 'C++'.
Well then someone somewhere with some mainframe got so angry they decided to write a manifesto to condemn kids these days and announced a fork of Qt because Qt committed the cardinal sin of adopting C++20. So don’t say “a problem literally nobody has”, someone always has a use case; although at some point it’s okay to make a decision to ignore them.
> because Qt committed the carnal sin of adopting C++20
I do believe you meant to write "cardinal sin," good sir. Unless Qt has not only become sentient but also corporeal when I wasn't looking and gotten close and personal with the C++ standard...
> If you are creating life critical medical devices you should not be using linux.
Hmm, what do you mean?
Like, no, you should not adopt some buggy or untested distro; instead, choose each component carefully and disable all unneeded updates...
But that beats working on an unstable, randomly and capriciously deprecated and broken OS (Windows/Mac over the years), on which you can perform zero practical review, casual or otherwise, legal or otherwise, and which insists upon updating and further breaking itself at regular intervals...
Unless you mean to talk maybe about some microkernel with a very simple graphical UI, which, sure yes, much less complexity...
Regulations are complex, but not every medical device or part of it is "life critical". There are plenty of regulated medical devices floating around running Linux, often based on Yocto. There is some debate in the industry about the particulars of this SOUP (software of unknown provenance) in general, but the mere idea of Linux in a medical device is old news and isn't crackpot or anything.
The goal for this guy seems to be a Linux distro primarily to serve as a reproducible dev environment that must include his own in-progress EDT editor clone, but can include others as long as they're not vim or use Qt.
Ironically, Qt's closed-source offering targets VxWorks and QNX. Dräger ventilators use it for their frontend.
Like, the general idea of a medical-device Linux distro (for both dev hosts and targets) is not a bad one. But the thinking and execution in this case are totally derailed by outsized and unfocused reactions to details that don't matter (ancient IRS tax computers), QtQuick having had some growing pains over a decade ago, a personal hatred of vim, and conflating a hatred of Agile with CI/CD.
> You can't use non-typesafe junk when lives are on the line.
Their words, not mine. If lives are on the line you probably shouldn’t be using linux in your medical device. And I hope my life never depends on a medical device running linux.
"Many of us got our first exposure to Qt on OS/2 in or around 1987."
Uh huh.
> someone always has a use case;
No he doesn't. He's just unhinged. The machines this dude bitches about don't even have a modern C++ compiler nor do they support any kind of display system relevant to Qt. They're never going to be a target for Qt.
Further irony is this dude proudly proclaims this fork will support nothing but Wayland and Vulkan on Linux.
"the smaller processors like those in sensors, are 1's complement for a reason."
The "reason" is never explained.
"Why? Because nothing is faster when it comes to straight addition and subtraction of financial values in scaled integers. (Possibly packed decimal too, but uncertain on that.)"
Is this a justification for using Unisys mainframes, or is the implication that they are fastest because of 1's complement? (Not that this is even close to being true - as the dinosaurs are decommissioned they're replaced with capable but not TOL commodity Xeon-based hardware running emulation; I don't think Unisys makes any non-x86 hardware anymore.) Anyway, someone may need to refresh that CS education.
There's some rambling about the justification being data conversion, but what serialization protocols mandate 1's complement anyway? And if any exist, someone has already implemented 2's-complement-supporting libraries at some point in the past 50 years, since that has been the overwhelming status quo. We somehow manage to deal with endianness and decimal conversions as well.
"Passing 2's complement data to backend systems or front end sensors expecting 1's complement causes catastrophes."
99.999% of every system MIPS, ARM, x86, Power, etc for the last 40 years uses 2's complement, so this has been the normal state of the world since forever.
Also, the most enterprisey of languages, Java, has somehow survived mandating 2's complement.
This is all very unhinged.
I'm not holding my breath to see this ancient Qt fork fully converted to "modified" Barr spec but that will be a hoot.
Yeah, I think many of their arguments are not quite up to snuff. I would be quite interested in how 1's complement is faster; it is simpler, and thus the hardware could be faster, iff you figure out how to deal with drawbacks like -0 vs +0 (you could do it in hardware pretty easily...)
Buuuut then the Unisys thing. Like you say, they don't make processors (for the market) and themselves just use Intel now... and even if they made some special secret processors, I don't think the IRS is using top-secret processors to crunch our taxes. Even in the hundreds-of-millions-of-records realm, with an average of hundreds of items per record, modern CPUs run at billions of ops per second... so I suspect we are talking some tens of seconds of compute, and some modest amount of RAM (for a server).
The one point he does have is interoperability: if a lot of (especially medical) equipment uses 1's complement because it's cheaper (in terms of silicon), using "modern" tools is likely to be a bad fit.
Compatibility is king, and where medical devices are concerned I would be inclined to agree that not changing things is better than "upgrading" - it's all well and good to have two systems until a crisis hits and some doctor plugs the wrong sensor into the wrong device...
> The one point he does have is interoperability: if a lot of (especially medical) equipment uses 1's complement
No, it's completely loony. Note that even the devices he claims to work with for medical devices are off-the-shelf ARM processors (i.e. what everybody uses). No commonly used commodity processor for embedded has used 1's complement in the last 50 years.
> equipment uses 1's complement because it's cheaper (in terms of silicon)
Yeah that makes no sense.
If you need an ALU at all, 2s complement requires no more silicon and is simpler to work with. That’s why it was recommended by von Neumann in 1945.
1s complement is only simpler if you don’t have an adder of any kind, which is then not a CPU, certainly not a C/C++ target.
Even the shittiest low end PIC microcontroller from the 70s uses 2s complement.
It is possible that a sensing device with no microprocessor or computation of any kind (i.e. a bare ADC) may generate values in sign-magnitude or 1's complement (and it's usually the former, which again shows how stupid this is) - but this has nothing to do with the C implementation of whatever host connects to it, which is certainly 2's. I guarantee you no embedded processor this dude ever worked with in the medical industry used anything other than 2's complement - you would have always needed to do a conversion.
This truly is one of the most absurd issues to get wrapped up on. It might be dementia, sadly.
Maintaining a fork of a large C++ framework (well, of another obscure fork) whose topmost selling point is a fixation on avoiding C++20, all because it dropped support for integer representations that have no extant hardware with recent C++ compilers - and any theoretical hardware wouldn't run this framework anyway - that doesn't seem well attached to reality.
> it is simpler and thus the hardware could be faster
Is it though? With two's complement, ADD and SUB are the same hardware for unsigned and signed. MUL/IMUL is also the same for the lower half of the result (i.e. 32-bit × 32-bit = 32-bit). So your ALU and ISA are simple and flexible by design.
For calculations, of course it’s not simpler or faster. At best, you could probably make hardware where it’s close to a wash.
For someone who lectures on the importance of college, you would think they would demonstrate the critical thinking skills to ask themselves why the top supercomputers use 2's complement like everyone else.
The only aspect of 1's complement or sign-magnitude that is simpler is generation. If you have a simple ADC that gives you a magnitude based on a count and a direction, it is trivial to just output that directly. 1's complement I guess is not too much harder with XORs (but what's the point?). 2's complement requires some kind of ripple-carry logic; the "add 1" is one way, and there are other methods you can work out, but still more logic than sign-magnitude. This is pretty much the only place where non-2's-complement has any advantage.
Finally for an I2C or SPI sensor like a temp sensor it is more likely you will get none of the above and have some asymmetric scale. Anybody in embedded bloviating on this ought to know.
In his ramblings, the mentions of packed decimal (BCD) are a nice touch. C and C++ have never supported that to begin with, so I have no idea why it must also be "considered".
One obvious example is auto_ptr. And from what I can see it is quite successful -- in a well maintained C++ codebase using C++ 11 or later, you just don't see auto_ptr in the code.
Hahaha, are you including Bjarne in that sweeping generalization? C++ has long had a culture problem revolving around arrogance and belittling others; maybe it is growing out of it?
I would point out that for any language, if one has to follow the standards committee closely to be an effective programmer in that language, complexity is likely to be an issue. Fortunately in this case it probably isn't required.
I see garbage collection came in C++11 and has now gone. Would following that debacle have made many or most C++ programmers more effective?
> The question isn’t whether there are still architectures where bytes aren’t 8-bits (there are!) but whether these care about modern C++... and whether modern C++ cares about them.
I have mixed feelings about this. On the one hand, it's obviously correct--there is no meaningful use for CHAR_BIT to be anything other than 8.
On the other hand, it seems like some sort of concession to the idea that you are entitled to some sort of just world where things make sense and can be reasoned out given your own personal, deeply oversimplified model of what's going on inside the computer. This approach can take you pretty far, but it's a garden path that goes nowhere--eventually you must admit that you know nothing and the best you can do is a formal argument that conditional on the documentation being correct you have constructed a correct program.
This is a huge intellectual leap, and in my personal experience the further you go without being forced to acknowledge it the harder it will be to make the jump.
That said, there seems to be increasing popularity of physical electronics projects among the novice set these days... hopefully "read the damn spec sheet" will become the new "read the documentation".
And yet every time I run an autoconf script I watch as it checks the bits in a byte and saves the output in config.h as though anyone planned to act on it.
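The code that actually cares tends to pin the assumption down at compile time instead of configuring around it; a minimal sketch of the usual guard:

    #include <climits>

    // If this ever fires, the platform is unusual enough that the surrounding
    // octet-assuming code needs a real port anyway, not a config.h knob.
    static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

    int main() { return 0; }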
As with any highly used language you end up running into what I call the COBOL problem. It will work for the vast majority of cases except where there's a system that forces an update and all of a sudden a traffic control system doesn't work or a plane falls out of the sky.
You'd have to have some way of testing all previous code in the compilation (pardon my ignorance if this is somehow obvious) to make sure this macro isn't already used. You also risk forking the language with any kind of breaking change like this. How difficult it would be to test whether a previous code base uses a CHAR_BIT macro, and whether it can be updated to the new compiler, sounds non-obvious. Which libraries would then be considered breaking? Would interacting with other compiled code (possibly a stupid question) that used CHAR_BIT also cause problems? Just off the top of my head.
I agree that it sounds nonintuitive. I'd suggest creating a conversion tool first and demonstrating it was safe to use even in extreme cases and then make the conversion. But that's just my unenlightened opinion.
That's not really the problem here--CHAR_BIT is already 8 everywhere in practice, and all real existing code[1] handles CHAR_BIT being 8.
The question is "does any code need to care about CHAR_BIT > 8 platforms" and the answer of course is no, its just should we perform the occult standards ceremony to acknowledge this, or continue to ritually pretend to standards compliant 16 bit DSPs are a thing.
[1] I'm sure artifacts of 7, 9, 16, 32, etc[2] bit code & platforms exist, but they aren't targeting or implementing anything resembling modern ISO C++ and can continue to exist without anyone's permission.
[2] if we're going for unconventional bitness my money's on 53, which at least has practical uses in 2024
I'm totally fine with enforcing that int8_t == char == 8 bits; however, I'm not sure about spreading the misconception that a byte is 8 bits. A byte of 8 bits is called an octet.
At the same time, C++17 already added `std::byte`, which is defined in terms of `unsigned char` anyway[1].
My first experience with computers was 45 years ago, and a "byte" back then was defined as an 8-bit quantity. And in the intervening 45 years, I've never come across a different meaning for "byte". I'll ask for a citation for a definition of "byte" that isn't 8-bits.
1979 is quite recent as computer history goes, and many conventions had settled by then. The Wikipedia article discusses the etymology of "byte" and how the definition evolved from loosely "a group of bits less than a word" to "precisely 8 bits". https://en.wikipedia.org/wiki/Byte
That's interesting, because maybe a byte will not be 8 bits 45 years from now.
I'm mostly discussing this for the sake of it, because I don't really mind as a C/C++ user. We could just use "octet" and call it a day, but now there is an ambiguity between the past definition and a potential future definition (in which case I hope the term "byte" will just disappear).
I kinda like the idea of a 6-bit-byte retro-microcomputer (resp. 24-bit, which would be a word). Because microcomputers typically deal with a small number of objects (and prefer arrays to pointers), it would save memory.
VGA was 6-bit per color, you can have a readable alphabet in 6x4 bit matrix, you can stuff basic LISP or Forth language into 6-bit alphabet, and the original System/360 only had 24-bit addresses.
What's there not to love? 12MiB of memory, with independently addressable 6-bits, should be enough for anyone. And if it's not enough, you can naturally extend FAT-12 to FAT-24 for external storage. Or you can use 48-bit pointers, which are pretty much as useful as 64-bit pointers.
This would be a great setup for a time travelling science fiction where there is some legacy UNIVAC software that needs to be debugged, and John Titor, instead of looking for an IBM 5100, came back to the year 2024 to find a pre-P3477R0 compiler.
The UNIVAC 1108 (and descendants) mainframe architecture was not discontinued in 1986. The company that owned it (Sperry) merged with Burroughs in that year to form Unisys. The platform still exists, but now runs as a software emulator under x86-64. The OS is still maintained and had a new release just last year. Around the time of the merger the old school name “UNIVAC” was retired in a rebranding, but the platform survived.
Its OS, OS 2200, does have a C compiler. Not sure if there was ever a C++ compiler; if there once was, it is no longer around. But that C compiler is not being kept up to date with the latest standards; it only officially supports C89/C90 - this is a deeply legacy system, most application software is written in COBOL, and the OS itself is mainly written in assembler and a proprietary Pascal-like language called "PLUS". They might add some features from newer standards if particularly valuable, but formal compliance with C99/C11/C17/C23/etc. is not a goal.
The OS does contain components written in C++, most notably the HotSpot JVM. However, from what I understand, the JVM actually runs in x86-64 Linux processes on the host system, outside of the emulated mainframe environment, but the mainframe emulator is integrated with those Linux processes so they can access mainframe files/data/apps.
They still exist. You can still run OS 2200 on a Clearpath Dorado.[1] Although it's actually Intel Xeon processors doing an emulation.
Yes, indexing strings of 6-bit FIELDATA characters was a huge headache. UNIVAC had the unfortunate problem of having to settle on a character code in the early 1960s, before ASCII was standardized. At the time, a military 6-bit character set looked like the next big thing. It was better than IBM's code, which mapped to punch card holes and the letters weren't all in one block.
idk. By now most software already assumes 8 bits == 1 byte in subtle ways all over the place, to the point that you kinda have to use a fully custom, or at least fully self-reviewed and patched, stack of C libraries
so delegating such by-now-very-edge cases to non-standard C seems fine, i.e. it IMHO doesn't change much at all in practice
and C/C++ compilers are full of non-standard extensions anyway; it's not that CHAR_BIT would go away, and a compiler could still, as a non-standard extension, let it be something other than 8
> most software already assumes 8 bit == byte in subtle ways all over the place
Which is the real reason why 8-bits should be adopted as the standard byte size.
I didn't even realize that the byte was defined as anything other than 8-bits until recently. I have known, for decades, that there were non-8-bit character encodings (including ASCII) and word sizes were all over the map (including some where word size % 8 != 0). Enough thought about that last point should have helped me realize that there were machines where the byte was not 8-bits, yet the rarity of encountering such systems left me with the incorrect notion that a byte was defined as 8-bits.
Now if someone with enough background to figure it out doesn't figure it out, how can someone without that background figure it out? Someone who has only experienced systems with 8-bit bytes. Someone who has only read books that make the explicit assumption of 8-bit bytes (which virtually every book does). Anything they write has the potential to break on systems with a different byte size. The idea of writing portable code because the compiler itself is "standards compliant" breaks down. You probably should modify the standard to keep code portable, either by forcing the compiler for non-8-bit systems to handle the exceptions, or by simply admitting that the compiler does not produce portable code for non-8-bit systems.
- CHAR_BIT cannot go away; reams of code references it.
- You still need the constant 8. It's better if it has a name.
- Neither the C nor C++ standard will be simplified if CHAR_BIT is declared to be 8. Only a few passages will change. Just, certain possible implementations will be rendered nonconforming.
- There are specialized platforms with C compilers, such as DSP chips, that are not byte addressable machines. They are in current use; they are not museum pieces.
Here's a bit of 40 year old code I wrote which originally ran on 36-bit PDP-10 machines, but will work on non-36 bit machines.[1] It's a self-contained piece of code to check passwords for being obvious. This will detect any word in the UNIX dictionary, and most English words, using something that's vaguely like a Bloom filter.
This is so old it predates ANSI C; it's in K&R C. It used to show up on various academic sites. Now it's obsolete enough to have scrolled off Google.
I've seen copies of this on various academic sites over the years, but it seems to have finally scrolled off.
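For anyone who hasn't seen the trick, here is a minimal sketch of the general idea, not the original code: hash every dictionary word into a large bit array a couple of ways, then reject any candidate password whose bits are all set. The table size and hash functions below are arbitrary choices for illustration only.

    /* Toy Bloom-filter-style "obvious password" check. */
    #define TABLE_BITS (1u << 20)                 /* 1 Mbit table */
    static unsigned char table[TABLE_BITS / 8];

    static unsigned long hash(const char *s, unsigned long seed) {
        unsigned long h = seed;
        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h % TABLE_BITS;
    }

    static void set_bit(unsigned long i) { table[i / 8] |= (unsigned char)(1u << (i % 8)); }
    static int  get_bit(unsigned long i) { return (table[i / 8] >> (i % 8)) & 1; }

    /* Call once per dictionary word when building the table. */
    void add_word(const char *w) {
        set_bit(hash(w, 5381));
        set_bit(hash(w, 17));
    }

    /* Returns 1 if the candidate is (probably) a known word; false positives
       are possible, false negatives are not. */
    int looks_obvious(const char *pw) {
        return get_bit(hash(pw, 5381)) && get_bit(hash(pw, 17));
    }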
I think we can dispense with non 8-bit bytes at this point.
The tms320c28x DSPs have 16 bit char, so e.g. the Opus audio codec codebase works with 16-bit char (or at least it did at one point -- I wouldn't be shocked if it broke from time to time, since I don't think anyone runs regression tests on such a platform).
For some DSP-ish sort of processors I think it doesn't make sense to have addressability at char level, and the gates to support it would be better spent on better 16 and 32 bit multipliers. ::shrugs::
I feel kind of ambivalent about the standards proposal. We already have fixed size types. If you want/need an exact type, that already exists. The non-fixed size types set minimums and allow platforms to set larger sizes for performance reasons.
Having no fast 8-bit level access is a perfectly reasonable decision for a small DSP.
Might it be better instead to migrate many users of char to (u)int8_t?
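For concreteness, the distinction at stake, using only standard <stdint.h> names (nothing here is invented):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t       exact = 0;  /* exactly 8 bits; optional, only exists where the platform has them */
        uint_least8_t least = 0;  /* at least 8 bits; always exists */
        uint_fast8_t  fast  = 0;  /* at least 8 bits, whatever is fastest here (may be 16 or 32) */
        printf("%zu %zu %zu\n", sizeof exact, sizeof least, sizeof fast);
        return 0;
    }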
The proposed alternative of requiring CHAR_BIT to be congruent to 0 mod 8 also sounds pretty reasonable, in that it captures both the existing non-8-bit-char platforms and the justification for them: if you're not doing much string processing but are instead doing all math processing, the additional hardware for efficient 8-bit access is a total waste.
I think it's fine to relegate non-8-bit chars to non-standard C, given that a lot of software already implicitly assumes 8-bit bytes anyway. Non-standard extensions for certain use cases aren't anything new for C compilers. Also, it's a C++ proposal, and I'm not sure you program DSPs with C++ :think:
Any thoughts on the fact that some vendors basically don't offer a C compiler now? E.g. MSVC has essentially forced C++ limitations back onto the C language to reduce C++ vs C maintenance costs?
> A byte is 8 bits, which is at least large enough to contain the ordinary literal encoding of any element of the basic literal character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is bits in a byte.
But instead of the "and is composed" ending, it feels like you'd change the intro to say that "A byte is 8 contiguous bits, which is".
We can also remove the "at least", since that was there to imply a requirement on the number of bits being large enough for UTF-8.
Personally, I'd make "A byte is 8 contiguous bits." a standalone sentence, and then explain as a follow-up that "A byte is large enough to contain...".
Possible, but likely slow. There's nothing in the "C abstract machine" that mandates specific hardware. But a bitshift is only a fast operation when the hardware actually has bits; the same goes for the bitwise boolean operations.
In the spirit of redefining the kilobyte, we should define byte as having a nice, metric 10 bits. An 8 bit thing is obviously a bibyte. Then power of 2 multiples of them can include kibibibytes, mebibibytes, gibibibytes, and so on for clarity.
On RISC machines, it can be very useful to have the concept of "words," because that indicates things about how the computer loads and stores data, as well as the native instruction size. In DSPs and custom hardware, it can indicate the only available datatype.
The land of x86 goes to great pains to eliminate the concept of a word at a silicon cost.
ARM64 has a 32-bit word, even though the native pointer size and general register size is 64 bits. To access just the lower 32 bits of a register Xn you refer to it as Wn.
Appeasing that attitude is what prevented Microsoft from migrating to LP64. Would have been an easier task if their 32-bit LONG type never existed, they stuck with DWORD, and told the RISC platforms to live with it.
I'm saying the term "word", as an abstraction for the number of bytes a CPU can process in a single operation, is an outdated concept. We don't really talk about word-sized values anymore; instead we're mostly explicit about the size of a value in bits. Even the idea of a CPU having just one relevant word size is a bit outdated.
So please excuse my ignorance, but is there a "logic"-related reason, other than hardware cost limitations à la "8 was cheaper than 10 for the same number of memory addresses", that bytes are 8 bits instead of 10? Genuinely curious; as a high-level dev of twenty years, I don't know why 8 was selected.
To my naive eye, it seems like moving to 10 bits per byte would be both logical and make learning the trade just a little bit easier?
One example from the software side: a common thing to do in data processing is to compute bit offsets (compression, video decoding, etc.). If a byte were 10 bits, you would need divide-by-10 and modulo-10 operations everywhere, which is slow and/or complex. In contrast, modulo 2^N is a single bitwise-AND instruction.
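A minimal sketch of that difference (the function names are just for illustration): with an 8-bit byte the byte index and the bit-within-byte fall out of a shift and a mask, while a 10-bit byte forces a divide and modulo by a non-power-of-two (compilers strength-reduce these, but it is still more work than a shift):

    #include <stddef.h>

    /* 8-bit bytes: locate bit N with a shift and a mask. */
    static void locate_bit_8(size_t bit, size_t *byte, unsigned *within) {
        *byte   = bit >> 3;    /* bit / 8 */
        *within = bit & 7u;    /* bit % 8 */
    }

    /* Hypothetical 10-bit bytes: divide and modulo by 10 on every access. */
    static void locate_bit_10(size_t bit, size_t *byte, unsigned *within) {
        *byte   = bit / 10;
        *within = (unsigned)(bit % 10);
    }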
If you're ignoring what's efficient, then just use a decimal data type and let the hardware figure out how best to calculate that for you. If efficiency does matter, then address management, hardware operation implementations, and data packing are all simplest when the group size is a power of the base.
One thought is that it takes a whole number of bits (3) to bit-address within an 8-bit byte, but about 3.3 bits (log2 of 10) to bit-address a 10-bit byte. Sorta just works out nicer in general to have powers of 2 when working in base 2.
Another part of it is the fact that it's a lot easier to represent stuff with hex if the bytes line up.
I can represent "255" with "0xFF", which fits nice and neat in 1 byte. However, if a byte is 10 bits, that hex no longer really works: you have 1024 values to represent, and the max value would be 0x3FF, which just looks funky.
Coming up with an alphanumeric system to represent 2^10 cleanly just ends up weird and unintuitive.
We probably wouldn't have chosen hex in a theoretical world where bytes were 10 bits, right? It would probably be two groups of 5 like 02:21 == 85 (like an ip address) or five groups of two 0x01111 == 85. It just has to be one of its divisors.
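Spelling out that arithmetic: 85 in ten bits is 0001010101; as two groups of five that's 00010 10101, i.e. 2 and 21, hence 02:21, and as five groups of two it's 00 01 01 01 01, i.e. 0,1,1,1,1, hence 01111.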
Because modern computing has settled on Boolean (binary) logic (0/1, or true/false) in chip design, which has given us 8-bit bytes (a power of two). It is the easiest and most reliable to design and implement in hardware.
On the other hand, if computing settled on a three-valued logic (e.g. 0/1/«something» where «something» has been proposed as -1, «undefined»/«unknown»/«undecided» or a «shade of grey»), we would have had 9 bit bytes (a power of three).
10 was tried numerous times at the dawn of computing and… it was found too unwieldy in the circuit design.
> On the other hand, if computing settled on a three-valued logic (e.g. 0/1/«something» where «something» has been proposed as -1, «undefined»/«unknown/undecided» or a «shade of grey»), we would have had 9 bit bytes (a power of three).
Is this true? 4 ternary bits give you really convenient base 12 which has a lot of desirable properties for things like multiplication and fixed point. Though I have no idea what ternary building blocks would look like so it’s hard to visualize potential hardware.
It is hard to say whether it would have been 9 or 12, now that people have stopped experimenting with alternative hardware designs. 9-bit byte designs certainly did exist (and maybe even 12-bit ones), although they were still based on Boolean logic.
I have certainly heard the argument that ternary logic would have been the better choice had it won out, but that is history now, and we are left with vestiges of ternary logic in SQL (NULL values, which are semantically «no value» / «undefined»).
Many circuits have ceil(log_2(N_bits)) scaling with respect to propagation delay and other dimensions, so you're just leaving efficiency on the table if you aren't using a power of 2 for your bit size.
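To put rough numbers on that: ceil(log_2(8)) = 3, while ceil(log_2(10)) = 4 = ceil(log_2(16)). So a structure with that scaling (a barrel shifter, say) built for a 10-bit byte is as deep as one built for a 16-bit byte; you pay for 16 bits' worth of stages but only get 10 bits of data.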
I'm fairly sure it's because the English character set fits nicely into a byte. 7 bits would have worked as well, but 7 is a very odd width for something in a binary computer.
likely mostly as a concession to ASCII in the end. you used a typewriter to write into and receive terminal output from machines back in the day. terminals would use ASCII. there were machines with all sorts of smallest-addressable-sizes, but eight bit bytes align nicely with ASCII. makes strings easier. making strings easier makes programming easier. easier programming makes a machine more popular. once machines started standardizing on eight bit bytes, others followed. when they went to add more data, they kept the byte since code was written for bytes, and made their new registers two bytes. then two of those. then two of those. so we're sitting at 64 bit registers on the backs of all that came before.
Computers are not beings with 10 fingers that can be up or down.
Powers of two are more natural in a binary computer. Then add the fact that 8 is the smallest power of two that allows you to fit the Latin alphabet plus most common symbols as a character encoding.
We're all about building towers of abstractions. It does make sense to aim for designs that are natural for humans when you're closer to the top of the stack. Bytes are fairly low down the stack, so it makes more sense for them to be natural to computers.
One fun fact I found the other day: ASCII is 7 bits, but when it was used with punch cards there was an 8th bit to make sure you didn't punch the wrong number of holes. https://rabbit.eng.miami.edu/info/ascii.html
A 9-bit byte is found on 36-bit machines in quarter-word mode.
Parity is for paper tape, not punched cards. Paper tape parity was never standardized. Nor was parity for 8-bit ASCII communications. Which is why there were devices with settings for EVEN, ODD, ZERO, and ONE for the 8th bit.
Punched cards have their very own encodings, only of historical interest.
>A 9-bit byte is found on 36-bit machines in quarter-word mode.
I've only programmed in high level programming languages in 8-bit-byte machines. I can't understand what you mean by this sentence.
So in a 36-bit CPU a word is 36 bits. And a byte isn't a word. But what is a word and how does it differ from a byte?
If you asked me what 32-bit/64-bit means in a CPU, I'd say it's how large memory addresses can be. Is that true for 36-bit CPUs or does it mean something else? If it's something else, then that means 64-bit isn't the "word" of a 64-bit CPU, so what would the word be?
A word is the unit of addressing. A 36-bit machine has 36 bits of data stored at address 1, and another 36 bits at address 2, and so forth. This is inconvenient for text processing. You have to do a lot of shifting and masking. There's a bit of hardware help on some machines. UNIVAC hardware allowed accessing one-sixth of a word (6 bits), or one-quarter of a word (8 bits), or one-third of a word (12 bits), or a half of a word (18 bits). You had to select sixth-word mode (old) or quarter-word mode (new) as a machine state.
Such machines are not byte-addressable. They have partial word accesses, instead.
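For readers who have only seen byte addressing, here is roughly what partial-word access amounts to, sketched in ordinary C with the 36-bit word parked in a uint64_t. The real hardware selected the field in the instruction itself rather than shifting in software, and the field numbering here (from the most significant end) is just one plausible convention:

    #include <stdint.h>

    typedef uint64_t word36;   /* a 36-bit word held in the low bits of a 64-bit integer */

    /* Quarter-word n (0..3): a 9-bit field. */
    static unsigned quarter_word(word36 w, int n) {
        return (unsigned)((w >> (27 - 9 * n)) & 0x1FFu);
    }

    /* Sixth-word n (0..5): a 6-bit FIELDATA character. */
    static unsigned sixth_word(word36 w, int n) {
        return (unsigned)((w >> (30 - 6 * n)) & 0x3Fu);
    }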
Machines have been built with 4, 8, 12, 16, 24, 32, 36, 48, 56, 60, and 64 bit word lengths.
Many "scientific" computers were built with 36-bit words and a 36-bit arithmetic unit.
This started with the IBM 701 (1952), although an FPU came later, and continued through the IBM 7094. The byte-oriented IBM System/360 machines replaced those, and made byte-addressable architecture the standard.
UNIVAC followed along with the UNIVAC 1103 (1953), which continued through the 1103A and 1105 vacuum tube machines, the later transistorized machines 1107 and 1108, and well into the 21st century. Unisys will still sell you a 36-bit machine, although it's really an emulator running on Intel Xeon CPUs.
The main argument for 36 bits was that 36-bit floats have three more bits of precision, or one more decimal digit, than 32-bit floats. 1 bit of sign, 8 bits of exponent and 27 bits of mantissa give you a full 8 decimal digits of precision, while standard 32-bit floats, with a 1-bit sign, 8-bit exponent and 24-bit mantissa (23 bits stored), only give you 7 full decimal digits.
Double precision floating point came years later; it takes 4x as much hardware.
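The digit counts check out if you compare the mantissa range against powers of ten: 2^24 = 16,777,216 exceeds 10^7 but not 10^8, so every 7-digit integer is exact; 2^27 = 134,217,728 exceeds 10^8, so every 8-digit integer is exact.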
I see. I never realized that some machines ended up with such odd word sizes because they couldn't do double precision, so it was easier to make the word larger and do "half" precision instead.
Thanks a lot for your explanation, but does that mean "byte" is any amount of data that can be fetched in a given mode in such machines?
e.g. you have 6-bit, 9-bit, 12-bit, and 18-bit bytes in a 36-bit machine in sixth-word mode, quarter-word mode, third-word mode, and half-word mode, respectively? Which means in full-word mode the "byte" would be 36 bits?
The term "byte" was coined inside IBM (by Werner Buchholz, during the Stretch project in the late 1950s) and came into general use with the launch of the IBM System/360 in 1964 [1], which event also introduced the term "throughput". IBM never used it officially in reference to their 36-bit machines. By 1969, IBM had discontinued selling their 36-bit machines. UNIVAC and DEC held onto 36 bits for several more decades, though.
I don't think so. In the "normal" world, you can't address anything smaller than a byte, and you can only address in increments of a byte. A "word" is usually the size of the integer registers in the CPU. So the 36-bit machine would have a word size of 36 bits, and either six-bit bytes or nine-bit bytes, depending on how it was configured.
36 bits also gave you 10 decimal digits for fixed point calculations. My mom says that this was important for atomic calculations back in the 1950s - you needed that level of precision on the masses.
Ignoring this C++ proposal, especially because C and C++ seem like a complete nightmare when it comes to this stuff, I've almost gotten into the habit of treating a "byte" as an abstract concept. Many serial protocols define a "byte" of their own, and it might be 7, 8, 9, 11, 12, or however many bits long.
Why? Pls no. We've been told (in school!) that a byte is a byte. It's only sometimes 8 bits long (ok, most of the time these days). Do not destroy the last bits of fun.
Is network order little endian too?
There were/are C++ compilers for PDP-10 (9 bit byte). Those haven't been maintained AFAICT, but there are C++ compilers for various DSP's where the smallest unit of access is 16 or 32 bits that are still being sold.
C++ 'programmers' demonstrating their continued brilliance at bullshitting people they're being productive (Had to check if publishing date was April fools. It's not.) They should start a new committee next to formalize what direction electrons flow. If they do it now they'll be able to have it ready to bloat the next C++ standards no one reads or uses.
the fact that this isn't already done after all these years is one of the reasons why I no longer use C/C++. it takes years and years to get anything done, even the tiniest, most obvious, drama-free changes. contrast with Go, which has guaranteed 8-bit bytes (byte is just an alias for uint8) since version 1, in 2012.
This is an egoistical viewpoint, but if I want 8 bits in a byte I have plenty of choices anyway - Zig, Rust, D, you name it.
Should the need for another byte width come up, for either past or future architectures, C and C++ are my only practical choices.
Sure, it is selfish to expect C and C++ to do the dirty work while more modern languages get away with skimping on it. On the other hand, I think C++ especially is doing itself a disservice trying to become a kind of half-baked Rust.
Why can't it be 8? The fact that it's a trit doesn't put any constraint on the byte (tryte?) size. You could actually make it 5 or 6 trits (≈7.9 or ≈9.5 bits) for similar information density. The Setun used 6-trit addressable units.
fgetc(3) and its companions return character-by-character input as an int rather than a char precisely because EOF is represented as -1, and an unsigned char cannot represent EOF. If you stash the return value in the wrong type, you'll never reliably detect this condition.
However, if you don't receive an EOF, then it should be perfectly fine to cast the value to unsigned char without loss of precision.
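For the avoidance of doubt, the usual shape of a correct loop looks something like this (plain portable C):

    #include <stdio.h>

    /* Copy a stream byte by byte. c must be an int so that every valid
       unsigned-char value and EOF remain distinguishable. */
    void copy_bytes(FILE *in, FILE *out) {
        int c;
        while ((c = fgetc(in)) != EOF) {
            unsigned char byte = (unsigned char)c;  /* safe once EOF is ruled out */
            fputc(byte, out);
        }
    }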
C++ is the second-most-widely-available language (behind C). Many other platforms are viable, too: everything from a Z15 IBM mainframe to almost every embedded chip in existence. ("Viable" meaning "still being produced and used in volume, and still being used in new designs".)
The next novel chip design is going to have a C++ compiler too. No, we don't yet know what its architecture will be.
Oh, but we do know - in order to be compatible with existing languages, it's going to have to look similar to what we have now. It will have to keep 8-bit bytes instead of going wider because that's what IBM came up with in the 1950s, and it will have to be a stack-oriented machine that looks like a VAX so it can run C programs. Unicode will always be a second-class character set behind ASCII because it has to look like it runs Unix, and we will always use IEEE floating point with all its inaccuracy, because using scaled decimal data types just makes too much sense and we can't have that.