I'm literally giving a talk next week whose first slide is essentially "Why IEEE 754 is not a sufficient description of floating-point semantics", and I'm sitting here trying to figure out what needs to be thrown out of the talk to make it fit the time slot.
One of the most surprising things about floating-point is that very little is actually IEEE 754; most things are merely IEEE 754-ish, and there's a long tail of fiddly things that are different that make it only -ish.
The IEEE 754 standard has been updated several times, often by relaxing previous mandates so that various hardware implementations become compliant retroactively (e.g., adding Intel's 80-bit floats as a standard floating-point size).
It'll be interesting if the "-ish" bits are still "-ish" with the current standard.
The first 754 standard (1985) was essentially formalization of the x87 arithmetic; it defines a "double extended" format. It is not mandatory:
> Implementations should support the extended format corresponding to the widest basic format supported.
_If_ it exists, it is required to have at least as many bits as the x87 long double type.¹
The language around extended formats changed in the 2008 standard, but the meaning didn't:
> Language standards or implementations should support an extended precision format that extends the widest basic format that is supported in that radix.
That language is still present in the 2019 standard. So nothing has ever really changed here. Double-extended is recommended, but not required. If it exists, the significand and exponent must be at least as large as those of the Intel 80-bit format, but they may also be larger.
---
¹ At the beginning of the standardization process, Kahan and Intel engineers still hoped that the x87 format would gradually expand in subsequent CPU generations until it became what is now the standard 128b quad format; they didn't understand the inertia of binary compatibility yet. So the text only set out minimum precision and exponent range. By the time the standard was published in 1985, it was understood internally that they would never change the type, but by then other companies had introduced different extended-precision types (e.g. the 96-bit type in Apple's SANE), so it was never pinned down.
The first 754 standard did still remove some 8087 features, mainly the "projective" infinity, and it slightly changed the definition of the remainder function, so it was not completely compatible with the 8087.
The Intel 80387 was made compliant with the final standard, and by that time there were competing FPUs also compliant with the final standard, e.g. the Motorola 68881.
> there's a long tail of fiddly things that are different that make it only -ish.
Perhaps a way to fill some time would be gradually revealing parts of a convoluted Venn diagram or mind-map of the fiddly things. (That is, assuming there's any sane categorization.)
Whether double floats can silently have 80 bit accumulators is a controversial thing. Numerical analysis people like it. Computer science types seem not to because it's unpredictable. I lean towards, "we should have it, but it should be explicit", but this is not the most considered opinion. I think there's a legitimate reason why Intel included it in x87, and why DSPs include it.
Numerical analysis people do not like it. Having _explicitly controlled_ wider accumulation available is great. Having compilers deciding to do it for you or not in unpredictable ways is anathema.
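To make "explicitly controlled" concrete, here's a minimal sketch (my own example, not anyone's library code): the wider accumulation is written out in the source, with one explicit rounding at the end, rather than left to the compiler's register allocator.

    #include <cstdio>
    #include <vector>

    // Sum float inputs in a double accumulator on purpose, instead of hoping
    // the compiler keeps intermediates in 80-bit registers for you.
    float sum_with_wide_accumulator(const std::vector<float>& xs) {
        double acc = 0.0;                  // deliberately wider than the element type
        for (float x : xs) acc += x;       // every partial sum carries double precision
        return static_cast<float>(acc);    // one explicit, final rounding to float
    }

    int main() {
        std::vector<float> xs(10000000, 0.1f);
        std::printf("%f\n", sum_with_wide_accumulator(xs));
        return 0;
    }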
It can be harmful. With GCC, when compiling a 32-bit executable, using a std::map<float, T> can cause infinite loops or crashes in your program.
This is because when you insert a value into the map, it has 80 bit precision, and that number of bits is used when comparing the value you are inserting during the traversal of the tree.
After the float is stored in the tree, it's clamped to 32 bits.
This can cause the element to be inserted in the wrong position in the tree, which breaks the invariants of the algorithm, leading to the crash or infinite loop.
Compiling for 64 bits or explicitly disabling x87 float math makes this problem go away.
I have actually had this bug in production and it was very hard to track down.
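For anyone who hasn't been bitten by this, the shape of the trap looks something like the sketch below (hypothetical code, not the poster's; assumes a 32-bit x87 build along the lines of g++ -m32 -mfpmath=387 -O2):

    #include <cstdio>
    #include <map>

    // The comparator can see a freshly computed key at 80-bit register precision
    // while the copy already stored in a tree node has been rounded to 32 bits,
    // so the strict weak ordering the red-black tree relies on becomes inconsistent.
    volatile float scale = 1.0f / 3.0f;    // volatile so nothing is constant-folded

    int main() {
        std::map<float, int> m;
        for (int i = 0; i < 100000; ++i) {
            float key = i * scale;         // may live in an x87 register with excess precision...
            m[key] = i;                    // ...but is stored in the node as a 32-bit float
        }
        std::printf("%zu entries\n", m.size());
        return 0;
    }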
10 years ago, a coworker had a really hard time root-causing a bug. I shoulder-debugged it by noticing the bit patterns: it was a miscompile of LLVM itself by GCC, where GCC was using an x87 fldl/fstpl move for a union { double; int64; }. The active member was actually the int64, and GCC chose an FP move based on what the first member of the union was... but the int64 happened to be the representation of an SNaN, so the instructions quietly transformed it into a qNaN as part of the move. The "fix" was to change the order of the union's members in LLVM. The bug is still open, though it's had recent activity: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58416
It also affected Emacs compilation, and the fix is in trunk now.
Wow, 11 years for such a banal, minimal code trigger. I really don't quite understand how we can have infrastructure at this scale in operation when these kinds of infrastructure software bugs exist. And it's not just GCC. The whole working house of cards is an achievement by itself, and also a reminder that good enough is all that is needed.
I also highly doubt that one in a thousand developers could successfully debug this issue were it happening in the wild, and far fewer could actually fix it.
What use case do you have that requires indexing a hashmap by a floating point value? Keep in mind, even with a compliant implementation that isn't widening your types behind your back, you still have to deal with NaN.
In fact, Rust has the Eq trait specifically to keep f32/f64 out of hash tables, because NaN breaks them really badly.
Rust's BTreeMap, which is much closer to what std::map is, also requires Ord (i.e. types which claim to possess a total order) for any key you put in the map.
However, Ord is an ordinary safe trait. So while we're claiming to be totally ordered, we're allowed to be lying; the resulting type is crap, but it's not unsafe. So, as with sorting, the algorithms inside these container types, unlike in C or C++, actually must not blow up horribly when we were lying (or, as is common in real software, simply clumsy and mistaken).
The infinite loop would be legal (but I haven't seen it) because that's not unsafe, but if we end up with Undefined Behaviour that's a fault in the container type.
This is another place where, in theory, C++ gives itself license to deliver better performance at the cost of reduced safety, but the reality in existing software is that you get no safety and also worse performance. The popular C++ compilers are drifting towards tacit acceptance that Rust made the right choice here, and so, as a QoI decision, they should ship the Rust-style algorithms.
Detecting and filtering out NaNs is both trivial and reliable, as long as nobody instructs the compiler to break basic floating-point operations (so no -ffast-math). Dealing with a compiler that randomly changes the values of your variables is much harder.
It’s absolutely harmful. It turns computations that would be guaranteed to be exact (e.g. head-tail arithmetic primitives used in computational geometry) into “maybe it’s exact and maybe it’s not, it’s at the compiler’s whim” and suddenly your tests for triangle orientation do not work correctly and your mesh-generation produces inadmissible meshes, so your PDE solver fails.
Thank you, I found this hint very interesting. Is there a source you wouldn't mind pointing me to for those "head, tail" methods?
I am assuming it relates to the kinds of "variable precision floating point with bounds" methods used in CGAL and the like; Googling turns up this survey paper:
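Not the survey itself, but for a flavor of what "head-tail" means: the classic error-free addition usually attributed to Knuth (TwoSum), sketched below as my own example. It relies on every operation rounding correctly at one fixed precision, which is exactly the property that silent excess precision can break.

    #include <cstdio>

    // Knuth's TwoSum: hi is the correctly rounded sum, lo is the exact rounding
    // error, so hi + lo == a + b exactly -- assuming strict IEEE-754 double
    // arithmetic with no hidden extended precision.
    void two_sum(double a, double b, double& hi, double& lo) {
        hi = a + b;
        double bb = hi - a;                  // the part of b that made it into hi
        lo = (a - (hi - bb)) + (b - bb);     // what got rounded away
    }

    int main() {
        double hi, lo;
        two_sum(1.0, 1e-30, hi, lo);
        std::printf("hi = %.17g, lo = %.17g\n", hi, lo);   // lo recovers the lost 1e-30
        return 0;
    }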
If not done properly, double rounding (rounding to extended precision and then to working precision) can actually introduce a larger approximation error than rounding to the nearest working-precision value directly. So it can actually make some numerical algorithms perform worse.
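A self-contained toy illustration of that claim (the numbers are mine; it simulates round-to-nearest-even at tiny precisions rather than using real hardware formats):

    #include <cstdint>
    #include <cstdio>

    // Exact value: 1.0100001 in binary = 161 (in units of 2^-7).
    // Rounding straight to 2 significant bits gives 1.1 (192), off by 31.
    // Rounding first to 3 bits (160 = 1.01), then to 2, hits a tie that
    // round-to-even resolves to 1.0 (128), off by 33 -- more than half an
    // ulp of the target precision, which a single rounding never exceeds.

    // Round v to the nearest multiple of 2^drop_bits, ties to even.
    uint64_t rne(uint64_t v, int drop_bits) {
        uint64_t ulp = 1ull << drop_bits, half = ulp >> 1;
        uint64_t down = v & ~(ulp - 1), rem = v - down;
        if (rem > half || (rem == half && (down & ulp)))
            down += ulp;
        return down;
    }

    int main() {
        uint64_t v = 161;                     // 1.0100001 in binary
        uint64_t once = rne(v, 6);            // keep 2 significant bits directly
        uint64_t twice = rne(rne(v, 5), 6);   // via a 3-bit "extended" step
        std::printf("exact %llu, rounded once %llu, rounded twice %llu\n",
                    (unsigned long long)v, (unsigned long long)once,
                    (unsigned long long)twice);
        return 0;
    }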
I've seen other course notes, I think also from Kahan, extolling 80-bit hardware.
Personally I am starting to think that, if I'm really thinking about precision, I had maybe better just use fixed point, but this again is just a "lean" that could prove wrong over time. Somehow we use floats everywhere and it seems to work pretty well, almost unreasonably so.
Yeah. Kahan was involved in the design of the 8087, so he’s always wanted to _have_ extended precision available. What he (and I, and most other numerical analysts) are opposed to is the fact that (a) language bindings historically had no mechanism to force rounding to float/double when necessary, and (b) compilers commonly spilled x87 intermediate results to the stack as doubles, leading to intermediate rounding that was extremely sensitive to optimization and subroutine calls, making debugging numerical issues harder than it should be.
Modern floating-point is much more reproducible than fixed-point, FWIW, since it has an actual standard that’s widely adopted, and these excess-precision issues do not apply to SSE or ARM FPUs.
Note that this type (which Rust will call, and in nightly already does call, "f16", and which a C-like language would probably name "half") is not the only popular 16-bit floating-point type, as some people want to have https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
The IEEE FP16 format is what is useful in graphics applications, e.g. for storing color values.
The Google BF16 format is useful only for machine learning/AI applications, because its precision is insufficient for anything else. BF16 has very few significand bits, but an exponent range equal to FP32's, which makes overflows and underflows less likely.
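For concreteness (my own illustration, not from the parent): BF16 keeps binary32's sign bit and 8-bit exponent and truncates the significand to 7 bits, so the crudest float-to-bf16 conversion is literally taking the top 16 bits of the float; FP16, with its 5-bit exponent and 10-bit significand, needs a genuine conversion.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Truncating conversion; production libraries usually also round to nearest.
    uint16_t float_to_bf16_truncate(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);       // well-defined type pun
        return static_cast<uint16_t>(bits >> 16);  // sign + 8 exponent + top 7 significand bits
    }

    int main() {
        // 1.0f is 0x3f800000 as binary32; its bf16 truncation is the top half, 0x3f80.
        std::printf("%#06x\n", static_cast<unsigned>(float_to_bf16_truncate(1.0f)));
        return 0;
    }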
Which one? Remember the decimal IEEE 754 floating point formats exist too. Do folks in banking use IEEE decimal formats? I remember we used to have different math libs to link against depending, but this was like 40 years ago.
Binding float to the IEEE 754 binary32 format would not preclude use of decimal formats; they have their own bindings (e.g. _Decimal64 in C23). (I think they're still a TR for C++, but I haven't been keeping track).
Nothing prevents banks (or anyone else) from using a compiler where "float" means binary floating point while some other native or user-defined type supports decimal floating point. In fact, that's probably for the best, since they'll probably have exacting requirements for that type so it makes sense for the application developer to write that type themselves.
I was referring to banks using decimal libraries because they work in base 10 numbers, and I recall a big announcement many years ago when the stock market officially switched from fractional stock pricing to cents "for the benefit of computers and rounding", or some such excuse. It always struck me as strange, since binary fixed and floating point represent those particular quantities exactly, without rounding error. Now with normal dollars and cents calculations, I can see why a decimal library might be preferred.
During an internship in 1986 I wrote C code for a machine with 10-bit bytes, the BBN C/70. It was a horrible experience, and the existence of the machine in the first place was due to a cosmic accident of the negative kind.
I wrote code on a DECSYSTEM-20, the C compiler was not officially supported. It had a 36-bit word and a 7-bit byte. Yep, when you packed bytes into a word there were bits left over.
And I was tasked with reading a tape with binary data in 8-bit format. Hilarity ensued.
Paraphrasing: legacy keying systems were based on records of up to 10 printed decimal digits of accuracy for input. 35 bits would be required to match the +/- input, but 36 works better as a machine word and for operations on 6 x 6-bit (yuck?) characters; some 'smaller' machines instead used a 36-bit large word and 12- or 18-bit small words. Why the yuck? That's only 64 characters total, so these systems only supported UPPERCASE ALWAYS, numeric digits, and some other characters.
I think the PDP-10 could have 9-bit bytes, depending on decisions you made in the compiler. I notice it's hard to Google information about this, though; people say lots of confusing, conflicting things. When I google PDP-10 byte size, it says a C++ compiler chose to represent char as 36 bits.
Basically, loading a 0-bit byte from memory gets you a 0. Depositing a 0-bit byte will not alter memory, but may do an ineffective read-modify-write cycle. Incrementing a 0-bit byte pointer will leave it unchanged.
5. Arithmetic is an explicit choice. '+' overflowing is illegal behavior (it will crash in Debug and ReleaseSafe), '+%' is 2's complement wrapping, and '+|' is saturating arithmetic. Edit: forgot to mention @addWithOverflow(), which provides a tuple of the original type and a u1; there's also std.math.add(), which returns an error on overflow.
6. f16, f32, f64, f80, and f128 are the IEEE floating-point types of the respective bit lengths.
The question of the length of a byte doesn't even matter. If someone wants to compile to machine whose bytes are 12 bits, just use u12 and i12.
Sounds like zero-sized types in Rust, where they are used as marker types (e.g. this struct owns this lifetime). They can also be used to turn a HashMap into a HashSet by storing a zero-sized value. In Go, a struct member of [0]func() (an array of functions with exactly 0 elements) is used to make a type uncomparable, as func() cannot be compared.
Rust has two possible behaviours: panic or wrap. By default debug builds with panic, release builds with wrap. Both behaviours are 100% defined, so the compiler can't do any shenanigans.
There are also helper functions and types for unchecked/checked/wrapping/saturating arithmetic.
On the other hand, it doesn’t make a distinction between signed and unsigned integers. Users must take care to use special signed versions of operations where needed.
How does 5 work in practice? Surely no one is actually checking whether their arithmetic overflows, especially with user-supplied or otherwise external values. Is there any use for the normal +?
I'm sure it's not literally no one but I bet the percent of additions that have explicit checks for overflow is for all practical purposes indistinguishable from 0.
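For what it's worth, the check is a one-liner where compilers expose it; a sketch using the GCC/Clang builtin (checked_add is my own wrapper name, not a standard function):

    #include <climits>
    #include <cstdio>

    // __builtin_add_overflow returns true if the mathematically correct sum
    // does not fit in *out; it exists in both GCC and Clang.
    bool checked_add(int a, int b, int* out) {
        return !__builtin_add_overflow(a, b, out);
    }

    int main() {
        int sum;
        if (checked_add(INT_MAX, 1, &sum))
            std::printf("sum = %d\n", sum);
        else
            std::printf("overflow detected\n");
        return 0;
    }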
It would be "nice" if not for C setting a precedent for these names to have unpredictable sizes. Meaning you have to learn the meaning of every single type for every single language, then remember which language's semantics apply to the code you're reading. (Sure, I can, but why do I have to?)
[ui][0-9]+ (and similar schemes) on the other hand anybody can understand at the first glance.
This doesn't feel like a serious question, but in case this is still a mystery to you… the name bit is a portmanteau of binary digit, and as indicated by the word "binary", there are only two possible digits that can be used as values for a bit: 0 and 1.
A bit is a measure of information theoretical entropy. Specifically, one bit has been defined as the uncertainty of the outcome of a single fair coin flip. A single less than fair coin would have less than one bit of entropy; a coin that always lands heads up has zero bits, n fair coins have n bits of entropy and so on.
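To put numbers on that, a small sketch of the binary entropy formula H(p) = -p*log2(p) - (1-p)*log2(1-p) (my own example values):

    #include <cmath>
    #include <cstdio>

    // Entropy, in bits, of a coin that lands heads with probability p.
    double coin_entropy_bits(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;   // a certain outcome carries no information
        return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
    }

    int main() {
        std::printf("fair coin:    %.3f bits\n", coin_entropy_bits(0.5));   // 1.000
        std::printf("90/10 coin:   %.3f bits\n", coin_entropy_bits(0.9));   // ~0.469
        std::printf("always heads: %.3f bits\n", coin_entropy_bits(1.0));   // 0.000
        return 0;
    }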
This comment, I feel sure, would repulse Shannon in the deepest way. A (digital, stored) bit abstractly seeks to encode, and make useful through computation, the properties of information theory.
I do not know or care what Mr. Shannon would think. What I do know is that the base you choose for the logarithm in the entropy equation has nothing to do with the number of bits you assign to a word on a digital architecture :)
How philosophical do you want to get? Technically, voltage is a continuous signal, but we sample only at clock cycle intervals, and if the sample at some cycle is below a threshold, we call that 0. Above, we call it 1. Our ability to measure whether a signal is above or below a threshold is uncertain, though, so for values where the actual difference is less than our ability to measure, we have to conclude that a bit can actually take three values: 0, 1, and we can't tell but we have no choice but to pick one.
The latter value is clearly less common than 0 and 1, but how much less? I don't know, but we have to conclude that the true size of a bit is probably something more like 1.00000000000000001 bits rather than 1 bit.
I don't think the term word has any consistent meaning. Certainly x86 doesn't use the term word to mean smallest addressable unit of memory. The x86 documentation defines a word as 16 bits, but x86 is byte addressable.
ARM is similar, ARM processors define a word as 32-bits, even on 64-bit ARM processors, but they are also byte addressable.
As best as I can tell, it seems like a word is whatever the size of the arithmetic or general purpose register is at the time that the processor was introduced, and even if later a new processor is introduced with larger registers, for backwards compatibility the size of a word remains the same.
Every ISA I've ever used has used the term "word" to describe a 16- or 32-bit quantity, while having instructions to load and store individual bytes (8 bit quantities). I'm pretty sure you're straight up wrong here.
The difference between address A and address A+1 is one byte. By definition.
Some hardware may raise an exception if you attempt to retrieve a value at an address that is not a (greater than 1) multiple of a byte, but that has no bearing on the definition of a byte.
Over the years I've known some engineers who, as a side project, wrote some great software. Nobody was interested in it. They'd come to me and ask why that was. I suggested writing articles about their projects and being active on the forums. Otherwise, who would ever know about them?
They said that was unseemly, and wouldn't do it.
They wound up sad and bitter.
The "build it and they will come" is a stupid Hollywood fraud.
BTW, the income I receive from D is $0. It's my gift. You'll also note that I've suggested many times improvements that could be made to C, copying proven ideas in D. Such as this one:
To be fair, this one lies on the surface for anyone trying to come up with an improved C. It's one of the first things that gets corrected in nearly all C derivatives.
> C++ has already adopted many ideas from D.
Do you have a list?
Especially for the "adopted from D" bit, rather than being an evolutionary and logical improvement to the language.
I like the Rust approach more: usize/isize are the native integer types, and with every other numeric type, you have to mention the size explicitly.
On the C++ side, I sometimes use an alias that contains the word "short" for 32-bit integers. When I use them, I'm explicitly assuming that the numbers are small enough to fit in a smaller than usual integer type, and that it's critical enough to performance that the assumption is worth making.
<cstdint> has int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, and uint64_t. I still go back and forth between uint64_t, size_t, and unsigned int, but am defaulting to uint64_t more and more, even if it doesn't matter.
Ok, it is obvious that you are looking for something to complain about and don't want to find a solution. That is not a productive attitude in life, but whatever floats your boat. Have a good day.
You know, that describes pretty much everyone who has anything to do with Rust.
"My ls utility isn't written in Rust, yikes! Let's fix that!"
"The comments under this C++-related HN submission aren't talking about Rust enough, yikes! Let's fix that!"
I'm obviously pointing to a solution: have a standard module that any Rust program can depend on coming from the language, which has a few sanely named types. Rather than every program defining its own.
If you use i32, it looks like you care. Without studying the code, I can't be sure that it could be changed to i16 or i64 without breaking something.
Usually, I just want the widest type that is efficient on the machine, and I don't want it to have an inappropriate name. I don't care about the wasted space, because it only matters in large arrays, and often not even then.
In Rust, that's not really the case. `i32` is the go-to integer type.
`isize` on the other hand would look really weird in code — it's an almost unused integer type. I also prefer having integers that don't depend on the machine I'm running them on.
And which machine is that? The only computers that I can think of with only 64-bit integers are the old Cray vector supercomputers, and they used word addressing to begin with.
It will likely be common in another 25 to 30 years, as 32 bit systems fade into the past.
Therefore, declaring that int32 is the go to integer type is myopic.
Forty years ago, a program like this could be run on a 16 bit machine (e.g. MS-DOS box):
#include <stdio.h>

int main(int argc, char **argv)
{
    while (argc-- > 0)
        puts(*argv++);
    return 0;
}
int was 16 bits. That was fine; you would never pass anywhere near 32000 arguments to a program.
Today, that same program does the same thing on a modern machine with a wider int.
Good thing that some int16 had not been declared the go to integer type.
Rust's integer types are deliberately designed (by people who know better) in order to be appealing to people who know shit all about portability and whose brains cannot handle reasoning about types with a bit of uncertainty.
Sorry, but I fail to see where the problem is. Any general purpose ISA designed in the past 40 years can handle 8/16/32 bit integers just fine regardless of the register size. That includes the 64-bit x86-64 or ARM64 from which you are typing.
There are a few historical architectures that couldn't handle smaller integers, like the first-generation Alpha, but:
a) those are long dead.
b) hardware engineers learnt from their mistake and no modern general purpose architecture has repeated it (specialized architectures like DSP and GPU are another story though).
c) worst case scenario, you can simulate it in software.
Java was just unlucky; it standardised its strings at the wrong time (when Unicode was 16-bit code points):
Java was announced in May 1995, and the following comment from the Unicode history wiki page makes it clear what happened: "In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. ..."
I would say anyone mentioning a specific encoding / size just wants to see the world burn. Unicode is variable-length at several levels; how many people want to deal with the fact that their text could be non-normalized, or want the ability to cut out individual "char" elements only to get a nonsensical result because the following elements were logically connected to that char? Give developers a decent high-level abstraction and don't force them to deal with the raw bits unless they ask for it.
I think this is what Rust does, if I remember correctly: it provides APIs on strings to enumerate the characters accurately, meaning not necessarily byte by byte.
The languages that have this kind of feature built-in in the standard library, to my knowledge, are Swift, JavaScript, C# and Java. Swift is the only one, of those four, that treat operating on graphemes as the default. JavaScript requires Intl.Segmenter, C# requires StringInfo, Java requires BreakIterator.
By the way, Python, the language that caused so much hurt with its 2.x->3.x transition while promising better Unicode support in return for the pain, couldn't even do this right. There is no concept of graphemes in the standard library. So much for the batteries-included bit.
Swift is truly the nicest of programming languages as far as strings are concerned. It just works as it always should have been.
let test = " "
for char in test {
    print(char)
}
print(test.count)
output :
1
[Execution complete with exit code 0]
I, as a non-Apple user, feel quite the Apple envy whenever I think about swift. It's such a nice language, but there's little ecosystem outside of Apple UIs.
But man, no third-party libraries, no wrapper segmenter class or iterator to work with. Just use the base string literals as is. It. Just. Works.
For context, it looks like you’re talking about iterating by grapheme clusters.
I understand how iterating through a string by grapheme clusters is convenient for some applications. But it's far from obvious to me that doing so should be the language's default. Dealing with grapheme clusters requires a Unicode database, which needs to live somewhere and needs to be updated continuously as Unicode grows. (Should Rust statically link that database into every app that uses it?)
Generally there are 3 ways to iterate over a string: by UTF-8 bytes (or UTF-16 code units in Java/JS/C#), by Unicode code point, or by grapheme cluster. UTF-8 encoding comes up all the time when encoding/decoding strings, like to JSON or when sending content over HTTP. Code points are, in my opinion, the correct approach when doing collaborative editing or patching strings. And grapheme clusters are useful in frontend user interfaces, like when building a terminal.
Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least. Tell me - why should grapheme clusters be the default way to iterate over a string? I can see the argument in Swift, which is a language built for frontend UI. But in a systems language like rust? That seems like a terrible default to me. UTF8 bytes are by far the most useful representation for strings in systems code, since from the pov of systems code, strings are usually just data.
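To make the three levels concrete, a sketch (my own example string; grapheme clustering, the third level, genuinely needs a segmentation library such as ICU, which is rather the point):

    #include <cstdint>
    #include <cstdio>
    #include <string>

    int main() {
        std::string s = "e\xCC\x81";   // 'e' followed by U+0301 COMBINING ACUTE ACCENT

        // Level 1: raw UTF-8 bytes -- three of them.
        for (unsigned char b : s)
            std::printf("byte 0x%02x\n", b);

        // Level 2: Unicode code points -- two of them (minimal decoder, valid input assumed).
        for (std::size_t i = 0; i < s.size();) {
            unsigned char lead = s[i];
            std::size_t len = (lead < 0x80) ? 1 : (lead < 0xE0) ? 2 : (lead < 0xF0) ? 3 : 4;
            std::uint32_t cp = (len == 1) ? lead : (lead & (0x7F >> len));
            for (std::size_t j = 1; j < len; ++j)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
            std::printf("code point U+%04X\n", static_cast<unsigned>(cp));
            i += len;
        }

        // Level 3: grapheme clusters -- just one here, but counting it needs Unicode data.
        return 0;
    }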
This was not meant as criticism of Rust in particular (though, while it shouldn't be the default behavior of strings in a systems language, surely at least an official implementation of a wrapper should exist?), but high-level languages with a ton of baggage, like Python, should definitely provide the correct way to handle strings. The amount of software I've seen that is unable to properly handle strings, because the language didn't provide the required grapheme handling and the developer was also not aware of the reality of graphemes and Unicode...
You mention terminals; yes, that's one of the areas where graphemes are an absolute must, but pretty much any time you do something to text, like deciding "I am going to put a line break here so that the text doesn't overflow beyond the box, beyond this A4 page I want to print, beyond the browser's window", grapheme handling is involved.
Any time a user is asked to input something, too. I've seen most software take the "iterate over characters" approach to real-time user input, and they break things like those emojis down into their individual components whenever you paste something in.
For that matter, backspace doesn't work properly in software you would expect to do better than that. Put the emoji from my pastebin in Microsoft Edge's search/URL bar, then hit backspace and see what happens. While the browser displays the emoji correctly, the input field treats it the way Python segments it in my example: you need to press backspace 7 times to delete it. 7 times! Windows Terminal, on the other hand, has the quirk of showing a lot of extra spaces after the emoji (despite displaying the emoji correctly too) and will also require 11 backspaces to delete it.
Notepad handles it correctly: press backspace once, it's deleted, like any normal character.
> Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least.
This doesn't say anything about grapheme clusters being useless. I've cited examples of popular software doing the wrong thing precisely because, like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.
The dismissiveness over more sane string handling as a standard is not unlike C++ developers pretending that developers are doing the right thing with memory management so we don't need a GC (or rust's ownership paradigm). Nonsense.
Those are good examples! Notably, all of them are in reasonably low level, user-facing code.
Your examples are implementing custom text input boxes (Excel, Edge), line breaks while printing, and implementing a terminal application. I agree that in all of those cases, grapheme cluster segmentation is appropriate. But that doesn't make grapheme cluster based iteration "the correct way to handle strings". There's no "correct"! There are at least 3 different ways to iterate through a string, and different applications have different needs.
Good languages should make all of these options easy for programmers to use when they need them. Writing a custom input box? Use grapheme clusters. Writing a text based CRDT? Treat a string as a list of unicode codepoints. Writing an HTTP library? Treat the headers and HTML body as ASCII / opaque bytes. Etc.
I take the criticism that Rust makes grapheme iteration harder than the others. But eh, Rust has truly excellent crates for that within arm's reach. I don't see any advantage in moving grapheme-based segmentation into std. Well, maybe it would make it easier to educate idiot developers about this stuff. But there's no real technical reason. It's situationally useful - but less useful than lots of other 3rd party crates like rand, tokio and serde.
> like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.
It says that in 30+ years of programming, I've never programmed a text input field from scratch. Why would I? That's the job of the operating system. Making my own sounds like a huge waste of time.
While I don't agree with not having unsigned as part of the primitive types, and I look forward to Valhalla fixing that, it was based on the experience that most devs don't get unsigned arithmetic right.
"For me as a language designer, which I don't really count myself as these days, what "simple" really ended up meaning was could I expect J. Random Developer to hold the spec in his head. That definition says that, for instance, Java isn't -- and in fact a lot of these languages end up with a lot of corner cases, things that nobody really understands. Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex. The language part of Java is, I think, pretty simple. The libraries you have to look up."
I mean practically speaking in C++ we have (it just hasn't made it to the standard):
1. char 8 bit
2. short 16 bit
3. int 32 bit
4. long long 64 bit
5. arithmetic is 2s complement
6. IEEE floating point (float is 32, double is 64 bit)
Along with other stuff like little endian, etc.
Some people just mistakenly think they can't rely on such stuff, because it isn't in the standard. But they forget that having an ISO standard comes on top of what most other languages have, which rely solely on the documentation.
I work every day with real-life systems where int can be 32 or 64 bits, long long can be 64 or 128 bits, long double can be 64 or 80 or 128 bits, some systems do not have IEEE 754 floating point (no denormals!) some are big endian and some are little endian. These things are not in the language standard because they are not standard in the real world.
Practically speaking, the language is the way it is, and has succeeded so well for so long, because it meets the requirements of its application.
There are also people who write COBOL for a living. What you say is not relevant at all for 99.99% of C++ code written today. Also, all compilers can be configured to be non-standard compliant in many different ways, the classic example being -fno-exceptions. Nobody says all kinds of using a standardized language must be standard conformant.
Yeah, so their documentation serves as the authority on how you're supposed to write your code for it to be "correct D" or "correct Rust". The compiler implementors write their compilers against the documentation (and vice versa). That documentation is clear on these things.
In C, the ISO standard is the authority on how you're supposed to write your code for it to be "correct C". The compiler implementors write their compilers against the ISO standard. That standard is not clear on these things.
I don't think this is true. The target audience of the ISO standard is the implementers of compilers and other tools around the language. Even the people involved in creating it make that clear by publishing other material like the core guidelines, conference talks, books, online articles, etc., which are targeted to the users of the language.
Core guidelines, conference talks, books, online articles, etc. are not authoritative. If I really want to know if my C code is correct C, I consult the standard. If the standard and an online article disagrees, the article is wrong, definitionally.
Correction: if you want to know if your compiler is correct, you look at the ISO standard. But even as a compiler writer, the ISO standard is not exhaustive. For example the ISO standard doesn't define stuff like include directories, static or dynamic linking, etc.
Me? I just dabble with documenting an unimplemented "50% more bits per byte than the competition!" 12-bit fantasy console of my own invention - replete with inventions such as "UTF-12" - for shits and giggles.
Yes, I'm trying to figure out which are still relevant and whether they target a modern C++, or intend to. I've been asking for a few years and haven't gotten positive answers. The only one that been brought up is TI, I added info in the updated draft: https://isocpp.org/files/papers/D3477R1.html
They can just target C++23 or earlier, right? I have a small collection of SHARCs but I am not going to go crying to the committee if they make C++30 (or whatever) not support CHAR_BIT=32
TI DSP Assembler is pretty high level, it's "almost C" already.
Writing geophysical | military signal and image processing applications on custom DSP clusters is surprisingly straightforward and doesn't need C++.
It's a RISC architecture optimised for DSP | FFT | array processing, with the basic simplification that char text is for hosts, integers and floats are at least 32 bits, and 32 bits (or 64) is the smallest addressable unit.
Fantastic architecture to work with for numerics and deep computational pipelines: once "primed", you push in raw acquisition samples in chunks every clock cycle and extract processed moving-window data chunks every clock cycle.
A single ASM instruction in a single cycle can accumulate totals from a vector multiplication and modulo-update the indexes on three vectors (two inputs and one output).
Signed integers did not have to be 2's complement; there were 3 valid representations: sign-magnitude, 1's complement, and 2's complement. Modern C and C++ dropped this and mandate 2's complement ("as if", but that distinction is moot here; you could do the same for CHAR_BIT). So there is certainly precedent for this sort of thing.
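For readers who have only ever seen two's complement, here is -5 in 8 bits under each of the three representations (my own illustration):

    #include <cstdio>

    int main() {
        // +5 is 00000101 in all three schemes; only the negatives differ.
        unsigned sign_magnitude  = 0b10000101;   // flip the sign bit
        unsigned ones_complement = 0b11111010;   // flip every bit (note: two zeros, +0 and -0)
        unsigned twos_complement = 0b11111011;   // flip every bit, then add one
        std::printf("%u %u %u\n", sign_magnitude, ones_complement, twos_complement);
        return 0;
    }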
The GC API from C++11 was removed in C++23, understandably so, given that it wasn't designed taking into account the needs of Unreal C++ and C++/CLI, the only two major variants that have GC support.
Exception specifications have been removed, although some want them back for value type exceptions, if that ever happens.
auto_ptr has been removed, given its broken design.
Now, on the simplifying side, not really, as the old ways still need to be understood.
Don’t break perfection!! Just accumulate more perfection.
What we need is a new C++ symbol that reliably references eight bit bytes, without breaking compatibility, or wasting annnnnny opportunity to expand the kitchen sink once again.
I propose “unsigned byte8” and (2’s complement) “signed byte8”. And “byte8” with undefined sign behavior because we can always use some more spice.
“unsigned decimal byte8” and “signed decimal byte8”, would limit legal values to 0 to 10 and -10 to +10.
For the damn accountants.
“unsigned centimal byte8” and “signed centimal byte8”, would limit legal values to 0 to 100 and -100 to +100.
For the damn accountants who care about the cost of bytes.
Also for a statistically almost valid, good enough for your customer’s alpha, data type for “age” fields in databases.
How is rand() broken? It seems to produce random-ish values, which is what it's for. It obviously doesn't produce cryptographically secure random values, but that's expected (and reflects other languages' equivalent functions). For a decently random integer that's quick to compute, rand() works just fine.
RAND_MAX is only guaranteed to be at least 32767. So if you use `rand() % 10000` you'll have a real bias towards 0-2767, and even `rand() % 1000` is already not uniform (biased towards 0-767). And that assumes rand() is uniform over 0-RAND_MAX in the first place.
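A sketch of the usual workarounds (the function names are mine): either reject the incomplete top bucket before taking the modulus, or skip rand() entirely and use <random>.

    #include <cstdio>
    #include <cstdlib>
    #include <random>

    // Rejection sampling: discard draws from the partial bucket at the top of
    // the range so every residue 0..999 is equally likely.
    int rand_0_999_rejection() {
        const int limit = (RAND_MAX / 1000) * 1000;
        int r;
        do { r = std::rand(); } while (r >= limit);
        return r % 1000;
    }

    // The C++11 way: a real engine plus a distribution that handles the bias for you.
    int rand_0_999_cpp() {
        static std::mt19937 gen{std::random_device{}()};
        return std::uniform_int_distribution<int>{0, 999}(gen);
    }

    int main() {
        std::printf("%d %d\n", rand_0_999_rejection(), rand_0_999_cpp());
        return 0;
    }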
This is such an odd thing to read & compare to how eager my colleagues are to upgrade the compiler to take advantage of new features. There's so much less need to specify types in situations where the information is implicitly available after C++ 20/17. So many boost libraries have been replaced by superior std versions.
And this has happened again and again on this enormous codebase that started before it was even called 'C++'.
Well then someone somewhere with some mainframe got so angry they decided to write a manifesto to condemn kids these days and announced a fork of Qt because Qt committed the cardinal sin of adopting C++20. So don’t say “a problem literally nobody has”, someone always has a use case; although at some point it’s okay to make a decision to ignore them.
> because Qt committed the carnal sin of adopting C++20
I do believe you meant to write "cardinal sin," good sir. Unless Qt has not only become sentient but also corporeal when I wasn't looking and gotten close and personal with the C++ standard...
> If you are creating life critical medical devices you should not be using linux.
Hmm, what do you mean?
Like, no, you should not adopt some buggy or untested distro; instead, choose each component carefully and disable all unneeded updates...
But that beats working on an unstable, randomly and capriciously deprecated and broken OS (Windows/Mac over the years), on which you can perform zero practical review, casual or otherwise, legal or otherwise, and which insists upon updating and further breaking itself at regular intervals...
Unless you mean to talk maybe about some microkernel with a very simple graphical UI, which, sure yes, much less complexity...
Regulations are complex, but not every medical device or part of it is "life critical". There are plenty of regulated medical devices floating around running Linux, often based on Yocto. There is some debate in the industry about the particulars of this SOUP (software of unknown provenance) in general, but the mere idea of Linux in a medical device is old news and isn't crackpot or anything.
The goal for this guy seems to be a Linux distro primarily to serve as a reproducible dev environment that must include his own in-progress EDT editor clone, but can include others as long as they're not vim or use Qt.
Ironically, Qt's closed-source offering targets VxWorks and QNX. Dräger ventilators use it for their frontend.
Like, the general idea of a medical-device Linux distro (for both dev hosts and targets) is not a bad one. But the thinking and execution in this case are totally derailed by outsized and unfocused reactions to details that don't matter (ancient IRS tax computers), QtQuick having had some growing pains over a decade ago, a personal hatred of vim, and conflating a hatred of Agile with CI/CD.
> You can't use non-typesafe junk when lives are on the line.
Their words, not mine. If lives are on the line you probably shouldn’t be using linux in your medical device. And I hope my life never depends on a medical device running linux.
"Many of us got our first exposure to Qt on OS/2 in or around 1987."
Uh huh.
> someone always has a use case;
No he doesn't. He's just unhinged. The machines this dude bitches about don't even have a modern C++ compiler nor do they support any kind of display system relevant to Qt. They're never going to be a target for Qt.
Further irony is this dude proudly proclaims this fork will support nothing but Wayland and Vulkan on Linux.
"the smaller processors like those in sensors, are 1's complement for a reason."
The "reason" is never explained.
"Why? Because nothing is faster when it comes to straight addition and subtraction of financial values in scaled integers. (Possibly packed decimal too, but uncertain on that.)"
Is this a justification for using Unisys mainframes, or is the implication that they are fastest because of 1's complement? (Not that this is even close to being true - as the dinosaurs are decommissioned they're replaced with capable but not TOL commodity Xeon-based hardware running emulation; I don't think Unisys makes any non-x86 hardware anymore.) Anyway, someone may need to refresh that CS education.
There's some rambling about the justification being data conversion, but what serialization protocols mandate 1's complement anyway? And if any exist, someone has already implemented 2's-complement-supporting libraries at some point in the past 50 years, since that has been the overwhelming status quo. We somehow manage to deal with endianness and decimal conversions as well.
"Passing 2's complement data to backend systems or front end sensors expecting 1's complement causes catastrophes."
99.999% of every system MIPS, ARM, x86, Power, etc for the last 40 years uses 2's complement, so this has been the normal state of the world since forever.
Also, the most enterprisey of languages, Java, has somehow survived mandating 2's complement.
This is all very unhinged.
I'm not holding my breath to see this ancient Qt fork fully converted to "modified" Barr spec but that will be a hoot.
Yeah, I think many of their arguments are not quite up to snuff. I would be quite interested in how 1's complement is faster; it is simpler, and thus the hardware could be faster, iff you figure out how to deal with drawbacks like -0 vs +0 (you could do it in hardware pretty easily...)
Buuuut then the Unisys thing. Like you say, they don't make processors (for the market) and themselves just use Intel now... and even if they made some special secret processors, I don't think the IRS is using top-secret processors to crunch our taxes. Even in the hundreds-of-millions-of-records realm, with an average of hundreds of items per record, modern CPUs run at billions of ops per second... so I suspect we are talking some tens of seconds of compute, and some modest amount of RAM (for a server).
The one point he does have is interoperability: if a lot of (especially medical) equipment uses 1's complement because it's cheaper (in terms of silicon), using "modern" tools is likely to be a bad fit.
Compatibility is king, and where medical devices are concerned I would be inclined to agree that not changing things is better than "upgrading" - it's all well and good to have two systems until a crisis hits and some doctor plugs the wrong sensor into the wrong device...
> The one point he does have is interoperability: if a lot of (especially medical) equipment uses 1's complement
No, it's completely loony. Note that even the devices he claims to work with for medical devices are off-the-shelf ARM processors (i.e. what everybody uses). No commonly used commodity processor for embedded has used 1's complement in the last 50 years.
> equipment uses 1's complement because it's cheaper (in terms of silicon)
Yeah that makes no sense.
If you need an ALU at all, 2s complement requires no more silicon and is simpler to work with. That’s why it was recommended by von Neumann in 1945.
1s complement is only simpler if you don’t have an adder of any kind, which is then not a CPU, certainly not a C/C++ target.
Even the shittiest low end PIC microcontroller from the 70s uses 2s complement.
It is possible that a sensing device with no microprocessor or computation of any kind (i.e. a bare ADC) may generate values in sign-magnitude or 1's complement (and it's usually the former, which again shows how stupid this is) - but this has nothing to do with the C implementation of whatever host connects to it, which is certainly 2's. I guarantee you no embedded processor this dude ever worked with in the medical industry used anything other than 2's complement - you would have always needed to do a conversion.
This truly is one of the most absurd issues to get wrapped up on. It might be dementia, sadly.
Maintaining a fork of a large C++ framework (well, of another obscure fork) whose topmost selling point is a fixation on avoiding C++20, all because it dropped support for integer representations that have no extant hardware with recent C++ compilers - and any theoretical hardware wouldn't run this framework anyway - that doesn't seem well attached to reality.
> it is simpler and thus the hardware could be faster
Is it though? With two's complement, ADD and SUB are the same hardware for unsigned and signed. MUL/IMUL is also the same for the lower half of the result (i.e. 32-bit × 32-bit = 32-bit). So your ALU and ISA are simple and flexible by design.
For calculations, of course it’s not simpler or faster. At best, you could probably make hardware where it’s close to a wash.
For someone who lectures on the importance of college, you would think they would demonstrate the critical thinking skills to ask themselves why the top supercomputers use 2's complement like everyone else.
The only aspect of 1's complement or sign-magnitude that is simpler is generation. If you have a simple ADC that gives you a magnitude based on a count and a direction, it is trivial to just output that directly. 1's complement I guess is not too much harder with XORs (but what's the point?). 2's complement requires some kind of ripple-carry logic; the "add 1" is one way, and there are other methods you can work out, but still more logic than sign-magnitude. This is pretty much the only place where non-2's-complement has any advantage.
Finally for an I2C or SPI sensor like a temp sensor it is more likely you will get none of the above and have some asymmetric scale. Anybody in embedded bloviating on this ought to know.
In his ramblings, the mentions of packed decimal (BCD) are a nice touch. C and C++ have never supported that to begin with, so I have no idea why it must also be "considered".
One obvious example is auto_ptr. And from what I can see it is quite successful -- in a well maintained C++ codebase using C++ 11 or later, you just don't see auto_ptr in the code.
Hahaha, are you including Bjarne in that sweeping generalization? C++ has long had a culture problem revolving around arrogance and belittling others; maybe it is growing out of it?
I would point out that for any language, if one has to follow the standards committee closely to be an effective programmer in that language, complexity is likely to be an issue. Fortunately in this case it probably isn't required.
I see garbage collection came in C++11 and has now gone. Would following that debacle have made many or most C++ programmers more effective?
> The question isn’t whether there are still architectures where bytes aren’t 8-bits (there are!) but whether these care about modern C++... and whether modern C++ cares about them.
I have mixed feelings about this. On the one hand, it's obviously correct--there is no meaningful use for CHAR_BIT to be anything other than 8.
On the other hand, it seems like some sort of concession to the idea that you are entitled to some sort of just world where things make sense and can be reasoned out given your own personal, deeply oversimplified model of what's going on inside the computer. This approach can take you pretty far, but it's a garden path that goes nowhere--eventually you must admit that you know nothing and the best you can do is a formal argument that conditional on the documentation being correct you have constructed a correct program.
This is a huge intellectual leap, and in my personal experience the further you go without being forced to acknowledge it the harder it will be to make the jump.
That said, there seems to be increasing popularity of physical electronics projects among the novice set these days... hopefully "read the damn spec sheet" will become the new "read the documentation".
And yet every time I run an autoconf script I watch as it checks the bits in a byte and saves the output in config.h as though anyone planned to act on it.
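The code that actually cares tends to pin the assumption down at compile time instead of configuring around it; a minimal sketch of the usual guard:

    #include <climits>

    // If this ever fires, the platform is unusual enough that the surrounding
    // octet-assuming code needs a real port anyway, not a config.h knob.
    static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

    int main() { return 0; }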
As with any highly used language you end up running into what I call the COBOL problem. It will work for the vast majority of cases except where there's a system that forces an update and all of a sudden a traffic control system doesn't work or a plane falls out of the sky.
You'd have to have some way of testing all previous code in the compilation (pardon my ignorance if this is somehow obvious) to make sure this macro isn't already used. You also risk forking the language with any kind of breaking change like this. How difficult it would be to test whether a previous code base uses a CHAR_BIT macro, and whether it can be updated to the new compiler, sounds non-obvious. Which libraries would then be considered breaking? Would interacting with other compiled code (possibly a stupid question) that used CHAR_BIT also cause problems? Just off the top of my head.
I agree that it sounds nonintuitive. I'd suggest creating a conversion tool first and demonstrating it was safe to use even in extreme cases and then make the conversion. But that's just my unenlightened opinion.
That's not really the problem here--CHAR_BIT is already 8 everywhere in practice, and all real existing code[1] handles CHAR_BIT being 8.
The question is "does any code need to care about CHAR_BIT > 8 platforms" and the answer of course is no, its just should we perform the occult standards ceremony to acknowledge this, or continue to ritually pretend to standards compliant 16 bit DSPs are a thing.
[1] I'm sure artifacts of 7, 9, 16, 32, etc[2] bit code & platforms exist, but they aren't targeting or implementing anything resembling modern ISO C++ and can continue to exist without anyone's permission.
[2] if we're going for unconventional bitness my money's on 53, which at least has practical uses in 2024
I'm totally fine with enforcing that int8_t == char == 8 bits; however, I'm not sure about spreading the misconception that a byte is 8 bits. A byte of 8 bits is called an octet.
At the same time, C++17 already added `std::byte`, which is defined in terms of `unsigned char` anyway[1].
My first experience with computers was 45 years ago, and a "byte" back then was defined as an 8-bit quantity. And in the intervening 45 years, I've never come across a different meaning for "byte". I'll ask for a citation for a definition of "byte" that isn't 8-bits.
1979 is quite recent as computer history goes, and many conventions had settled by then. The Wikipedia article discusses the etymology of "byte" and how the definition evolved from loosely "a group of bits less than a word" to "precisely 8 bits". https://en.wikipedia.org/wiki/Byte
That's interesting, because maybe a byte will not be 8 bits 45 years from now.
I'm mostly discussing this for the sake of it, because I don't really mind as a C/C++ user. We could just use "octet" and call it a day, but now there is an ambiguity between the past definition and a potential future definition (in which case I hope the term "byte" will just disappear).
I kinda like the idea of a 6-bit-byte retro-microcomputer (resp. 24-bit, which would be a word). Because microcomputers typically deal with a small number of objects (and prefer arrays to pointers), it would save memory.
VGA was 6-bit per color, you can have a readable alphabet in 6x4 bit matrix, you can stuff basic LISP or Forth language into 6-bit alphabet, and the original System/360 only had 24-bit addresses.
What's there not to love? 12MiB of memory, with independently addressable 6-bits, should be enough for anyone. And if it's not enough, you can naturally extend FAT-12 to FAT-24 for external storage. Or you can use 48-bit pointers, which are pretty much as useful as 64-bit pointers.
This would be a great setup for a time travelling science fiction where there is some legacy UNIVAC software that needs to be debugged, and John Titor, instead of looking for an IBM 5100, came back to the year 2024 to find a pre-P3477R0 compiler.
The UNIVAC 1108 (and descendants) mainframe architecture was not discontinued in 1986. The company that owned it (Sperry) merged with Burroughs in that year to form Unisys. The platform still exists, but now runs as a software emulator under x86-64. The OS is still maintained and had a new release just last year. Around the time of the merger the old school name “UNIVAC” was retired in a rebranding, but the platform survived.
Its OS, OS 2200, does have a C compiler. Not sure if there was ever a C++ compiler; if there once was, it is no longer around. But that C compiler is not being kept up to date with the latest standards; it only officially supports C89/C90 - this is a deeply legacy system, most application software is written in COBOL, and the OS itself is mainly written in assembler and a proprietary Pascal-like language called "PLUS". They might add some features from newer standards if particularly valuable, but formal compliance with C99/C11/C17/C23/etc. is not a goal.
The OS does contain components written in C++, most notably the HotSpot JVM. However, from what I understand, the JVM actually runs in x86-64 Linux processes on the host system, outside of the emulated mainframe environment, but the mainframe emulator is integrated with those Linux processes so they can access mainframe files/data/apps.
They still exist. You can still run OS 2200 on a Clearpath Dorado.[1] Although it's actually Intel Xeon processors doing an emulation.
Yes, indexing strings of 6-bit FIELDATA characters was a huge headache. UNIVAC had the unfortunate problem of having to settle on a character code in the early 1960s, before ASCII was standardized. At the time, a military 6-bit character set looked like the next big thing. It was better than IBM's code, which mapped to punch card holes and the letters weren't all in one block.
idk. By now most software already assumes 8 bits == 1 byte in subtle ways all over the place, to the point that you kinda have to use a fully custom, or at least fully self-reviewed and patched, stack of C libraries
so delegating such by-now-very-edge cases to non-standard C seems fine, i.e. it IMHO doesn't change much at all in practice
and C/C++ compilers are full of non-standard extensions anyway; it's not that CHAR_BIT would go away, and a compiler could still, as a non-standard extension, let it be something other than 8
> most software already assumes 8 bit == byte in subtle ways all over the place
Which is the real reason why 8-bits should be adopted as the standard byte size.
I didn't even realize that the byte was defined as anything other than 8-bits until recently. I have known, for decades, that there were non-8-bit character encodings (including ASCII) and word sizes were all over the map (including some where word size % 8 != 0). Enough thought about that last point should have helped me realize that there were machines where the byte was not 8-bits, yet the rarity of encountering such systems left me with the incorrect notion that a byte was defined as 8-bits.
Now if someone with enough background to figure it out doesn't figure it out, how can someone without that background figure it out? Someone who has only experienced systems with 8-bit bytes. Someone who has only read books that make the explicit assumption of 8-bit bytes (which virtually every book does). Anything they write has the potential to break on systems with a different byte size. The idea of writing portable code because the compiler itself is "standards compliant" breaks down. You probably should modify the standard to keep code portable, either by forcing the compiler for non-8-bit systems to handle the exceptions, or by simply admitting that the compiler does not produce portable code for non-8-bit systems.
- CHAR_BIT cannot go away; reams of code references it.
- You still need the constant 8. It's better if it has a name.
- Neither the C nor C++ standard will be simplified if CHAR_BIT is declared to be 8. Only a few passages will change. Just, certain possible implementations will be rendered nonconforming.
- There are specialized platforms with C compilers, such as DSP chips, that are not byte addressable machines. They are in current use; they are not museum pieces.
Here's a bit of 40 year old code I wrote which originally ran on 36-bit PDP-10 machines, but will work on non-36 bit machines.[1] It's a self-contained piece of code to check passwords for being obvious. This will detect any word in the UNIX dictionary, and most English words, using something that's vaguely like a Bloom filter.
This is so old it predates ANSI C; it's in K&R C. It used to show up on various academic sites. Now it's obsolete enough to have scrolled off Google.
I've seen copies of this on various academic sites over the years, but it seems to have finally scrolled off.
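For anyone who hasn't seen the trick, here is a minimal sketch of the general idea, not the original code: hash every dictionary word into a large bit array a couple of ways, then reject any candidate password whose bits are all set. The table size and hash functions below are arbitrary choices for illustration only.

    /* Toy Bloom-filter-style "obvious password" check. */
    #define TABLE_BITS (1u << 20)                 /* 1 Mbit table */
    static unsigned char table[TABLE_BITS / 8];

    static unsigned long hash(const char *s, unsigned long seed) {
        unsigned long h = seed;
        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h % TABLE_BITS;
    }

    static void set_bit(unsigned long i) { table[i / 8] |= (unsigned char)(1u << (i % 8)); }
    static int  get_bit(unsigned long i) { return (table[i / 8] >> (i % 8)) & 1; }

    /* Call once per dictionary word when building the table. */
    void add_word(const char *w) {
        set_bit(hash(w, 5381));
        set_bit(hash(w, 17));
    }

    /* Returns 1 if the candidate is (probably) a known word; false positives
       are possible, false negatives are not. */
    int looks_obvious(const char *pw) {
        return get_bit(hash(pw, 5381)) && get_bit(hash(pw, 17));
    }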
I think we can dispense with non 8-bit bytes at this point.
The tms320c28x DSPs have 16 bit char, so e.g. the Opus audio codec codebase works with 16-bit char (or at least it did at one point -- I wouldn't be shocked if it broke from time to time, since I don't think anyone runs regression tests on such a platform).
For some DSP-ish sort of processors I think it doesn't make sense to have addressability at char level, and the gates to support it would be better spent on better 16 and 32 bit multipliers. ::shrugs::
I feel kind of ambivalent about the standards proposal. We already have fixed size types. If you want/need an exact type, that already exists. The non-fixed size types set minimums and allow platforms to set larger sizes for performance reasons.
Having no fast 8-bit level access is a perfectly reasonable decision for a small DSP.
Might it be better instead to migrate many users of char to (u)int8_t?
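For concreteness, the distinction at stake, using only standard <stdint.h> names (nothing here is invented):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t       exact = 0;  /* exactly 8 bits; optional, only exists where the platform has them */
        uint_least8_t least = 0;  /* at least 8 bits; always exists */
        uint_fast8_t  fast  = 0;  /* at least 8 bits, whatever is fastest here (may be 16 or 32) */
        printf("%zu %zu %zu\n", sizeof exact, sizeof least, sizeof fast);
        return 0;
    }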
The proposed alternative of requiring CHAR_BIT to be congruent to 0 mod 8 also sounds pretty reasonable, in that it captures both the existing non-8-bit-char platforms and the justification for them: if you're not doing much string processing but are instead doing all math processing, the additional hardware for efficient 8-bit access is a total waste.
I think it's fine to relegate non-8-bit chars to non-standard C, given that a lot of software already implicitly assumes 8-bit bytes anyway. Non-standard extensions for certain use cases aren't anything new for C compilers. Also, it's a C++ proposal, and I'm not sure you program DSPs with C++ :think:
Any thoughts on the fact that some vendors basically don't offer a C compiler now? E.g. MSVC has essentially forced C++ limitations back onto the C language to reduce C++ vs C maintenance costs?
> A byte is 8 bits, which is at least large enough to contain the ordinary literal encoding of any element of the basic literal character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is bits in a byte.
But instead of the "and is composed" ending, it feels like you'd change the intro to say that "A byte is 8 contiguous bits, which is".
We can also remove the "at least", since that was there to imply a requirement on the number of bits being large enough for UTF-8.
Personally, I'd make "A byte is 8 contiguous bits." a standalone sentence, and then explain as a follow-up that "A byte is large enough to contain...".
Possible, but likely slow. There's nothing in the "C abstract machine" that mandates specific hardware. But a bitshift is only a fast operation when the hardware actually has bits; the same goes for the bitwise boolean operations.
In the spirit of redefining the kilobyte, we should define byte as having a nice, metric 10 bits. An 8 bit thing is obviously a bibyte. Then power of 2 multiples of them can include kibibibytes, mebibibytes, gibibibytes, and so on for clarity.
On RISC machines, it can be very useful to have the concept of "words," because that indicates things about how the computer loads and stores data, as well as the native instruction size. In DSPs and custom hardware, it can indicate the only available datatype.
The land of x86 goes to great pains to eliminate the concept of a word at a silicon cost.
ARM64 has a 32-bit word, even though the native pointer size and general register size is 64 bits. To access just the lower 32 bits of a register Xn you refer to it as Wn.
Appeasing that attitude is what prevented Microsoft from migrating to LP64. Would have been an easier task if their 32-bit LONG type never existed, they stuck with DWORD, and told the RISC platforms to live with it.
I'm saying the term "word", as an abstraction for the number of bytes a CPU can process in a single operation, is an outdated concept. We don't really talk about word-sized values anymore; instead we're mostly explicit about the size of a value in bits. Even the idea of a CPU having just one relevant word size is a bit outdated.
So please excuse my ignorance, but is there a "logic"-related reason, other than hardware cost limitations à la "8 was cheaper than 10 for the same number of memory addresses", that bytes are 8 bits instead of 10? Genuinely curious; as a high-level dev of twenty years, I don't know why 8 was selected.
To my naive eye, it seems like moving to 10 bits per byte would be both logical and make learning the trade just a little bit easier?
One example from the software side: a common thing to do in data processing is to compute bit offsets (compression, video decoding, etc.). If a byte were 10 bits, you would need divide-by-10 and modulo-10 operations everywhere, which is slow and/or complex. In contrast, modulo 2^N is a single bitwise-AND instruction.
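A minimal sketch of that difference (the function names are just for illustration): with an 8-bit byte the byte index and the bit-within-byte fall out of a shift and a mask, while a 10-bit byte forces a divide and modulo by a non-power-of-two (compilers strength-reduce these, but it is still more work than a shift):

    #include <stddef.h>

    /* 8-bit bytes: locate bit N with a shift and a mask. */
    static void locate_bit_8(size_t bit, size_t *byte, unsigned *within) {
        *byte   = bit >> 3;    /* bit / 8 */
        *within = bit & 7u;    /* bit % 8 */
    }

    /* Hypothetical 10-bit bytes: divide and modulo by 10 on every access. */
    static void locate_bit_10(size_t bit, size_t *byte, unsigned *within) {
        *byte   = bit / 10;
        *within = (unsigned)(bit % 10);
    }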
If you're ignoring what's efficient, then just use a decimal data type and let the hardware figure out how best to calculate that for you. If efficiency does matter, then address management, hardware operation implementations, and data packing are all simplest when the group size is a power of the base.
One thought is that it takes a whole number of bits (3) to bit-address within an 8-bit byte, but about 3.3 bits (log2 of 10) to bit-address a 10-bit byte. Sorta just works out nicer in general to have powers of 2 when working in base 2.
Another part of it is the fact that it's a lot easier to represent stuff with hex if the bytes line up.
I can represent "255" with "0xFF", which fits nice and neat in 1 byte. However, if a byte is 10 bits, that hex no longer really works: you have 1024 values to represent, and the max value would be 0x3FF, which just looks funky.
Coming up with an alphanumeric system to represent 2^10 cleanly just ends up weird and unintuitive.
We probably wouldn't have chosen hex in a theoretical world where bytes were 10 bits, right? It would probably be two groups of 5 like 02:21 == 85 (like an ip address) or five groups of two 0x01111 == 85. It just has to be one of its divisors.
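Spelling out that arithmetic: 85 in ten bits is 0001010101; as two groups of five that's 00010 10101, i.e. 2 and 21, hence 02:21, and as five groups of two it's 00 01 01 01 01, i.e. 0,1,1,1,1, hence 01111.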
Because modern computing has settled on Boolean (binary) logic (0/1, or true/false) in chip design, which has given us 8-bit bytes (a power of two). It is the easiest and most reliable to design and implement in hardware.
On the other hand, if computing settled on a three-valued logic (e.g. 0/1/«something» where «something» has been proposed as -1, «undefined»/«unknown»/«undecided» or a «shade of grey»), we would have had 9 bit bytes (a power of three).
10 was tried numerous times at the dawn of computing and… it was found too unwieldy in the circuit design.
> On the other hand, if computing settled on a three-valued logic (e.g. 0/1/«something» where «something» has been proposed as -1, «undefined»/«unknown/undecided» or a «shade of grey»), we would have had 9 bit bytes (a power of three).
Is this true? 4 ternary bits give you really convenient base 12 which has a lot of desirable properties for things like multiplication and fixed point. Though I have no idea what ternary building blocks would look like so it’s hard to visualize potential hardware.
It is hard to say whether it would have been 9 or 12, now that people have stopped experimenting with alternative hardware designs. 9-bit byte designs certainly did exist (and maybe even 12-bit ones), although they were still based on Boolean logic.
I have certainly heard the argument that ternary logic would have been the better choice had it won out, but that is history now, and we are left with vestiges of ternary logic in SQL (NULL values, which are semantically «no value» / «undefined»).
Many circuits have ceil(log_2(N_bits)) scaling with respect to propagation delay and other dimensions, so you're just leaving efficiency on the table if you aren't using a power of 2 for your bit size.
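To put rough numbers on that: ceil(log_2(8)) = 3, while ceil(log_2(10)) = 4 = ceil(log_2(16)). So a structure with that scaling (a barrel shifter, say) built for a 10-bit byte is as deep as one built for a 16-bit byte; you pay for 16 bits' worth of stages but only get 10 bits of data.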
I'm fairly sure it's because the English character set fits nicely into a byte. 7 bits would have worked as well, but 7 is a very odd width for something in a binary computer.
likely mostly as a concession to ASCII in the end. you used a typewriter to write into and receive terminal output from machines back in the day. terminals would use ASCII. there were machines with all sorts of smallest-addressable-sizes, but eight bit bytes align nicely with ASCII. makes strings easier. making strings easier makes programming easier. easier programming makes a machine more popular. once machines started standardizing on eight bit bytes, others followed. when they went to add more data, they kept the byte since code was written for bytes, and made their new registers two bytes. then two of those. then two of those. so we're sitting at 64 bit registers on the backs of all that came before.
Computers are not beings with 10 fingers that can be up or down.
Powers of two are more natural in a binary computer. Then add the fact that 8 is the smallest power of two that allows you to fit the Latin alphabet plus most common symbols as a character encoding.
We're all about building towers of abstractions. It does make sense to aim for designs that are natural for humans when you're closer to the top of the stack. Bytes are fairly low down the stack, so it makes more sense for them to be natural to computers.
One fun fact I found the other day: ASCII is 7 bits, but when it was used with punch cards there was an 8th bit to make sure you didn't punch the wrong number of holes. https://rabbit.eng.miami.edu/info/ascii.html
A 9-bit byte is found on 36-bit machines in quarter-word mode.
Parity is for paper tape, not punched cards. Paper tape parity was never standardized. Nor was parity for 8-bit ASCII communications. Which is why there were devices with settings for EVEN, ODD, ZERO, and ONE for the 8th bit.
Punched cards have their very own encodings, only of historical interest.
>A 9-bit byte is found on 36-bit machines in quarter-word mode.
I've only programmed in high level programming languages in 8-bit-byte machines. I can't understand what you mean by this sentence.
So in a 36-bit CPU a word is 36 bits. And a byte isn't a word. But what is a word and how does it differ from a byte?
If you asked me what 32-bit/64-bit means in a CPU, I'd say it's how large memory addresses can be. Is that true for 36-bit CPUs or does it mean something else? If it's something else, then that means 64-bit isn't the "word" of a 64-bit CPU, so what would the word be?
A word is the unit of addressing. A 36-bit machine has 36 bits of data stored at address 1, and another 36 bits at address 2, and so forth. This is inconvenient for text processing. You have to do a lot of shifting and masking. There's a bit of hardware help on some machines. UNIVAC hardware allowed accessing one-sixth of a word (6 bits), or one-quarter of a word (8 bits), or one-third of a word (12 bits), or a half of a word (18 bits). You had to select sixth-word mode (old) or quarter-word mode (new) as a machine state.
Such machines are not byte-addressable. They have partial word accesses, instead.
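For readers who have only seen byte addressing, here is roughly what partial-word access amounts to, sketched in ordinary C with the 36-bit word parked in a uint64_t. The real hardware selected the field in the instruction itself rather than shifting in software, and the field numbering here (from the most significant end) is just one plausible convention:

    #include <stdint.h>

    typedef uint64_t word36;   /* a 36-bit word held in the low bits of a 64-bit integer */

    /* Quarter-word n (0..3): a 9-bit field. */
    static unsigned quarter_word(word36 w, int n) {
        return (unsigned)((w >> (27 - 9 * n)) & 0x1FFu);
    }

    /* Sixth-word n (0..5): a 6-bit FIELDATA character. */
    static unsigned sixth_word(word36 w, int n) {
        return (unsigned)((w >> (30 - 6 * n)) & 0x3Fu);
    }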
Machines have been built with 4, 8, 12, 16, 24, 32, 36, 48, 56, 60, and 64 bit word lengths.
Many "scientific" computers were built with 36-bit words and a 36-bit arithmetic unit.
This started with the IBM 701 (1952), although an FPU came later, and continued through the IBM 7094. The byte-oriented IBM System/360 machines replaced those, and made byte-addressable architecture the standard.
UNIVAC followed along with the UNIVAC 1103 (1953), which continued through the 1103A and 1105 vacuum tube machines, the later transistorized machines 1107 and 1108, and well into the 21st century. Unisys will still sell you a 36-bit machine, although it's really an emulator running on Intel Xeon CPUs.
The main argument for 36 bits was that 36-bit floats have three more bits of precision, or one more decimal digit, than 32-bit floats. 1 bit of sign, 8 bits of exponent and 27 bits of mantissa give you a full 8 decimal digits of precision, while standard 32-bit floats, with a 1-bit sign, 8-bit exponent and 24-bit mantissa (23 bits stored), only give you 7 full decimal digits.
Double precision floating point came years later; it takes 4x as much hardware.
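The digit counts check out if you compare the mantissa range against powers of ten: 2^24 = 16,777,216 exceeds 10^7 but not 10^8, so every 7-digit integer is exact; 2^27 = 134,217,728 exceeds 10^8, so every 8-digit integer is exact.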
I see. I never realized that some machines ended up with such odd word sizes because they couldn't do double precision, so it was easier to make the word larger and do "half" precision instead.
Thanks a lot for your explanation, but does that mean "byte" is any amount of data that can be fetched in a given mode in such machines?
e.g. you have 6-bit, 9-bit, 12-bit, and 18-bit bytes in a 36-bit machine in sixth-word mode, quarter-word mode, third-word mode, and half-word mode, respectively? Which means in full-word mode the "byte" would be 36 bits?
The term "byte" was coined inside IBM (by Werner Buchholz, during the Stretch project in the late 1950s) and came into general use with the launch of the IBM System/360 in 1964 [1], which event also introduced the term "throughput". IBM never used it officially in reference to their 36-bit machines. By 1969, IBM had discontinued selling their 36-bit machines. UNIVAC and DEC held onto 36 bits for several more decades, though.
I don't think so. In the "normal" world, you can't address anything smaller than a byte, and you can only address in increments of a byte. A "word" is usually the size of the integer registers in the CPU. So the 36-bit machine would have a word size of 36 bits, and either six-bit bytes or nine-bit bytes, depending on how it was configured.
36 bits also gave you 10 decimal digits for fixed point calculations. My mom says that this was important for atomic calculations back in the 1950s - you needed that level of precision on the masses.
Ignoring this C++ proposal, especially because C and C++ seem like a complete nightmare when it comes to this stuff, I've almost gotten into the habit of treating a "byte" as an abstract concept. Many serial protocols define a "byte" of their own, and it might be 7, 8, 9, 11, 12, or however many bits long.
Why? Pls no. We've been told (in school!) that a byte is a byte. It's only sometimes 8 bits long (ok, most of the time these days). Do not destroy the last bits of fun.
Is network order little endian too?
There were/are C++ compilers for PDP-10 (9 bit byte). Those haven't been maintained AFAICT, but there are C++ compilers for various DSP's where the smallest unit of access is 16 or 32 bits that are still being sold.
C++ 'programmers' demonstrating their continued brilliance at bullshitting people they're being productive (Had to check if publishing date was April fools. It's not.) They should start a new committee next to formalize what direction electrons flow. If they do it now they'll be able to have it ready to bloat the next C++ standards no one reads or uses.
the fact that this isn't already done after all these years is one of the reasons why I no longer use C/C++. it takes years and years to get anything done, even the tiniest, most obvious, drama-free changes. contrast with Go, which has guaranteed 8-bit bytes (byte is just an alias for uint8) since version 1, in 2012.
This is an egoistical viewpoint, but if I want 8 bits in a byte I have plenty of choices anyway - Zig, Rust, D, you name it.
Should the need for another byte width come up, for either past or future architectures, C and C++ are my only practical choices.
Sure, it is selfish to expect C and C++ to do the dirty work while more modern languages get away with skimping on it. On the other hand, I think C++ especially is doing itself a disservice trying to become a kind of half-baked Rust.
Why can't it be 8? The fact that it's a trit doesn't put any constraint on the byte (tryte?) size. You could actually make it 5 or 6 trits (≈7.9 or ≈9.5 bits) for similar information density. The Setun used 6-trit addressable units.
fgetc(3) and its companions return character-by-character input as an int rather than a char precisely because EOF is represented as -1, and an unsigned char cannot represent EOF. If you stash the return value in the wrong type, you'll never reliably detect this condition.
However, if you don't receive an EOF, then it should be perfectly fine to cast the value to unsigned char without loss of precision.
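For the avoidance of doubt, the usual shape of a correct loop looks something like this (plain portable C):

    #include <stdio.h>

    /* Copy a stream byte by byte. c must be an int so that every valid
       unsigned-char value and EOF remain distinguishable. */
    void copy_bytes(FILE *in, FILE *out) {
        int c;
        while ((c = fgetc(in)) != EOF) {
            unsigned char byte = (unsigned char)c;  /* safe once EOF is ruled out */
            fputc(byte, out);
        }
    }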
C++ is the second-most-widely-available language (behind C). Many other platforms are viable, too: everything from a Z15 IBM mainframe to almost every embedded chip in existence. ("Viable" meaning "still being produced and used in volume, and still being used in new designs".)
The next novel chip design is going to have a C++ compiler too. No, we don't yet know what its architecture will be.
Oh, but we do know - in order to be compatible with existing languages, it's going to have to look similar to what we have now. It will have to keep 8-bit bytes instead of going wider because that's what IBM came up with in the 1950s, and it will have to be a stack-oriented machine that looks like a VAX so it can run C programs. Unicode will always be a second-class character set behind ASCII because it has to look like it runs Unix, and we will always use IEEE floating point with all its inaccuracy, because using scaled decimal data types just makes too much sense and we can't have that.