I'm not a fan of strlcpy(3) (nrk.neocities.org)
196 points by signa11 on July 15, 2024 | 346 comments


I'd say that, in general, any variable-size format that does not state its length before the variable part invites buffer-overflow bugs. Such formats should thus be avoided in any kind of binary interop.

This holds even if the total length of the data in an interaction is not known ahead of time. E.g. an audio stream can be of indeterminate length, not known when the first byte is sent over the network, but each UDP packet has a well-determined length given in the header.

The length field itself can be made variable-size in a rather fool-proof way [1], making it economical to represent both tiny and huge sizes.

(Zip files, WAD files, etc. have that info at the very end, but this is because a file has a well-defined end before you start appending to it; fseek(fp, 0, SEEK_END) can't miss.)

[1]: http://personal.kent.edu/~sbirch/Music_Production/MP-II/MIDI...
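
Roughly, the MIDI-style variable-length quantity from [1] looks like this (a minimal sketch in C, assuming at most 32-bit lengths): seven payload bits per byte, with the high bit marking "more bytes follow".

  #include <stdint.h>
  #include <stddef.h>

  /* Encode `value` as a variable-length quantity, most significant
     group first (as MIDI does it). Small lengths cost a single byte;
     a uint32_t needs at most five. Returns the byte count. */
  size_t vlq_encode(uint32_t value, uint8_t out[5]) {
      uint8_t tmp[5];
      size_t n = 0;
      do {                              /* collect 7-bit groups, low end first */
          tmp[n++] = value & 0x7F;
          value >>= 7;
      } while (value);
      for (size_t i = 0; i < n; i++)    /* reverse; set continuation bits */
          out[i] = tmp[n - 1 - i] | (i + 1 < n ? 0x80 : 0);
      return n;
  }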


This is one of the main reasons I use C++ as a "better C": I use the string library in embedded (or an embedded-optimized version) all the time, along with basic containers and algos, and "simple" classes and inheritance. Yeah, I know there are C string libraries out there that also solve the problem; I even wrote a couple of my own over the years for the "minimal size to functionality" ratio.


The problem with C++ strings for embedded is the heavy dependence on the heap for many operations and the constant copying of ROM literals into RAM objects. If your target platform is "embedded" with 1GiB+ RAM then you can work in systems programming mode and not care. If it's 32KiB, a heap may be too much of a liability.


I think you can probably avoid unnecessary copies by using std::string_view where possible instead of std::string.


Niklaus Wirth has entered the chat.


There are a whole bunch of other things I dislike about Pascal, but in this one he was just undeniably correct. Worrying about 1-3 extra bytes (on the 16 and 32 bit platforms where it potentially mattered) was just not worth all the issues that null-terminated strings brought with them.

My current favorite string implementations are the various compact string crates for Rust. Generally, you want a string to be able to do at least three things:

- Pointer, length, capacity tuple, for heap-allocated strings, 24B on x64.

- String inlined into the 24B buffer.

- Pointer, length tuple where pointer points to rodata.

You can do any of that and the discriminator in 24B, given the healthy assumption that all strings are shorter than 2^63.
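
In C terms, a sketch of the idea might look like this (the layout is made up; real crates differ in the details, e.g. where the tag bits hide):

  #include <stdint.h>

  /* All three variants share 24 bytes on x64. Because lengths are
     assumed < 2^63, a spare high bit (or a byte of the inline buffer)
     can serve as the discriminator. */
  union str24 {
      struct { char *ptr; uint64_t len; uint64_t cap; } heap;      /* owned  */
      struct { char buf[23]; uint8_t len_and_tag; } small;         /* inline */
      struct { const char *ptr; uint64_t len; uint64_t tag; } lit; /* rodata */
  };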

Sadly, switching costs are massive and every programming language is pretty much stuck with the string it started with. Hopefully, whatever comes next can crib from smartstring or compact_str or the like.


There was a paper on this in the late 70s from the Cedar group at PARC. This was back when computer science papers were actual scientific papers, so full of analysis of different algorithms' performance with counted vs delimited strings. Counted strings won hands down on anything but strings so short the length was a large percentage of overall size.

Yet...since nobody reads the literature, we have all continued to suffer.


The late 70s was way too late for this… null terminated strings were already adopted by C and UNIX by then, and the rest is history.


And sadly, we know that the C folks don't read papers; otherwise they wouldn't have come up with Go later.


> Sadly, switching costs are massive and every programming language is pretty much struck with the string they started with.

Haskell is _almost_ flexible enough to be able to use a different string than the one it started with.

The language itself actually is flexible enough, but many of the libraries are not.

The main thing making Haskell flexible enough is that a literal like "foo" can be statically determined to be the right string type that you want to use. (And that happens at compile time, it's not a runtime conversion.)


Yes except when no.

Imagine you are writing performance sensitive code. You want to get a substring from a string, one that is not going to live outside your hot loop. In standard C you can just reference a part of a string with a pointer offset. All standard functions will continue working and you didn't have to make any calls outside of your loop: not to the allocator, not to memcpy, nothing.

With strings that are objects prefixed by a header, you cannot do this. At a minimum you need to allocate a new header, if not copy the whole string. Yes, that's the safer route, but also a lot less performant.

Most crucially, you can build the header string implementation on top of C strings. You cannot do the opposite.

Realistically though, C strings (aka null terminated strings) are just not a great thing because of the null termination. For my money, I would prefer to just use unterminated arrays and a separate size variable, as well as wide character strings for actual display stuff. This way all the interop must include string lengths (or some other way to determine length), and all internal stuff may be just ASCII but must not leave your internal logic and never be shown to the user.
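
The pointer-offset trick, for the record (a minimal sketch; note it only gives you suffixes for free, since the substring shares the original's terminator):

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char *path = "/usr/local/bin/cc";
      const char *base = strrchr(path, '/');  /* last '/' in the string */
      if (base)
          printf("%s\n", base + 1);  /* "cc": no copy, no allocation */
      return 0;
  }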


> Imagine you are writing performance sensitive code. You want to get a substring from a string, one that is not going to live outside your hot loop. In standard C you can just reference a part of a string with a pointer offset.

If you want your substring to terminate in the same place as the original, at a null terminator. But that sadly is almost never the case, and as many C practitioners know, references like this are often unsafe and so APIs that substring tend to copy. That's just what they have to do to pass address sanitizer and static analysis checks.

If you want arbitrary views on a null terminated string, well, it's no longer null terminated and that's just the start of your problems in C.

In languages like Rust and Go, taking a view of a string or array is safe and doesn't copy the underlying data or require an allocation. So if you are writing performance sensitive code where substrings are a major contributor to CPU cycles, best go with those languages (or C++) rather than C.


That’s fair: you won’t be able to use any libc functions that rely on null termination. But a lot of the time you don’t need to either. Think writing the substring to a socket or comparing it to a known constant.
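
For instance (a sketch; the helper names are made up):

  #include <string.h>
  #include <unistd.h>

  /* Compare an unterminated (ptr, len) view against a known constant. */
  static int view_eq(const char *p, size_t n, const char *lit) {
      return n == strlen(lit) && memcmp(p, lit, n) == 0;
  }

  /* Write a view to a socket: write(2) takes a length anyway. */
  static ssize_t view_send(int fd, const char *p, size_t n) {
      return write(fd, p, n);
  }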


In Rust, you would do both of those with a &str, which works fine. It works exactly as in C, with no calls to memcpy or the allocator or anything. And you would also be able to do all the other things that in C use null termination, too.


The solution in Rust is separate String and &str. &str is a reference to somewhere within String, and the length of the referred to region, and borrows from the String it refers to.

Any function that does not need to modify a String takes a &str. Any function that does modify a String typically takes a String, meaning it consumes its input. (Because of UTF-8, in-place modification is generally a pipedream.)

Also, the headers are typically allocated on the stack. Rust is a lot less shy about types that are larger than a pointer living inline wherever they are used, and this is something that seems to work a lot better than the alternative.


Allocating headers and strings separately blows your CPU cache. Hardly a performant way of doing hot loops.


Compared to calling strlen a bunch, which I’m sure is significantly more performant.


You never need to call strlen unless you are getting your inputs from a place that doesn’t give you a string length (such as stdin).


So which is it, then? Does keeping the size separate "blow your CPU cache"¹ or not? You can't argue it does in one case (Rust) but not in your case…

(And note that the representation you're responding to is not really a "header", in the same sense that the trailing null is a "footer". The representation does not require the length be contiguous with the data, but that's what upthread was trying to say in the first place.)

¹(it doesn't…)


So now you are arguing that by default your strings should come with a length? Great!

If you want that, you might as well bake that length into the string type by default (and use a specialised type, perhaps a naked raw pointer into the string) for when you don't want to pass the length.


That's most interfaces…?


Not argv[].


You still need to call strlen on each element?


To get a correct understanding, if you aren't a Rust person, Rust's String is (literally, though this is opaque) Vec<u8> with checks to ensure it's actually not just arbitrary bytes you're storing but UTF-8 text.

Vec<u8> unlike String has a direct equivalent (well, the Rust design is slightly better from decades of extra experience, but same rough shape) in C++ std::vector<std::byte>

The C++ std::string is this very weird thing that results from having standardized std::string in C++ 98, then changing their mind after implementation experience. So while it's serviceable it's pretty poor and nobody should model what they're doing on this. There have been HN links to articles by people like Raymond Chen on what this type looks like now.


In order to access the string contents in the first place you need the pointer. The length is stored right next to it. So they're both going to be in the same cache line, assuming proper alignment. In the rare case in which they straddle a cache line, you just have to load once and then the length remains in cache for the remainder of the loop. (This is true regardless of where the length lives, in fact; as far as CPU cache is concerned it really makes little difference either way.)

(This is assuming SROA didn't break apart the string and put the length in a register, which it often does, making this entire question moot.)


Huh? The headers are either in registers or in stack. The top of stack is always in L1. There is no way in which this is inferior to handing over a pointer to a string and a length separately, other than requiring two additional words of storage in registers/stack.


How is that? Say you are reading 1000 lines of stdin at once to process them. In which registers are your string and substring headers stored?


If you are reading 1000 lines from stdin at once to separate Strings, you are already going to be accessing memory in 1000 places at the same time, and making it 1001 isn't meaningfully worse for your cache. (Implementation would be Vec<String>, which would lay out the 1000 headers contiguously.)

But I genuinely have a hard time understanding for what kind of workload I would ever do that. If you want to read 1000 lines of stdin, and cannot use an iterator and must work on them at the same time, I would likely much rather read them into a single string and then split that into 1000 &str using the .lines() iterator.


I was miffed at: 1000 lines from stdin. It’s the same problem 1000 times, not 1000 problems at once.


Presumably the idea is, for example, sorting? In which case you do have to read the entire input before you can do anything. But the way I'd do that is to read the entire stdin to a single String, then work with &str pointers to it.


If you really care about performance, you should not allocate within hot loops.


Null terminated strings have a footer, so it is the exact same problem, just on the other end of the string. It is inherently impossible to substring an arbitrary string without copying and using the same memory layout for the full string and the substring(s).

Of course, if your string type is a struct containing a size and a pointer, you can easily have multiple substrings pointing into the same byte array.
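
Something like this, say (a sketch; the names are made up):

  #include <stdio.h>

  typedef struct {
      const char *ptr;
      size_t len;
  } str_slice;

  int main(void) {
      const char *buf = "hello, world";
      /* Two substrings sharing buf's storage: no copies, no allocation. */
      str_slice hello = { buf, 5 };
      str_slice world = { buf + 7, 5 };
      printf("%.*s %.*s\n", (int)hello.len, hello.ptr,
                            (int)world.len, world.ptr);
      return 0;
  }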


> Imagine you are writing performance sensitive code. You want to get a substring from a string, one that is not going to live outside your hot loop.

Zig uses slices for this (and everything else except interop): a pointer and a length, with some nicely ergonomic syntax for making another one, like `slice[start..][0..length]`.

When you're building strings then you have an ArrayList, which keeps a capacity around, and maybe a pointer to an Allocator depending on what you want. It's trivial to get the slice out of this when you need it.

Doing anything useful with a string requires knowing where it is (the pointer) and how much of it you have (the length) so keeping them together in one fat pointer is just good sense. It's remarkable how much easier this is to work with than C strings.


Efficient substring in C? Absolutely. Why don't we see real code? https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/pute...


Yes but that's the rare case.

The rare case should be possible, just not the default.

In Rust, you would make custom string handling unsafe for the bottleneck.


Rare for whom? Doing a lot of kernels or embedded development lately?


Kernel or embedded development is rare compared to web dev, app dev, cli tooling, automation, etc.

In fact, it's pretty damn niche.

And Rust is a general language, so it favors the most common case but lets the niche case stay possible.


The 48-bit address space and 128 bits of return value on System V make (pointer, size, 32 bits of capacity-past-the-end) attractive on x64. Specifically: ptr and size as u64, with the extra capacity stored across the high 16 bits of each of them.
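
A sketch of the packing (assuming user-space pointers whose top 16 bits are zero; x86-64 canonical addresses are sign-extended, so kernel pointers would break this):

  #include <stdint.h>

  typedef struct { uint64_t a, b; } packed_str;  /* returned in two registers */

  static packed_str pack(uint64_t ptr, uint64_t size, uint32_t extra_cap) {
      packed_str s;
      s.a = (ptr  & 0xFFFFFFFFFFFFull) | ((uint64_t)(extra_cap >> 16) << 48);
      s.b = (size & 0xFFFFFFFFFFFFull) | ((uint64_t)(extra_cap & 0xFFFF) << 48);
      return s;
  }

  static uint64_t unpack_ptr(packed_str s)  { return s.a & 0xFFFFFFFFFFFFull; }
  static uint64_t unpack_size(packed_str s) { return s.b & 0xFFFFFFFFFFFFull; }
  static uint32_t unpack_cap(packed_str s) {
      return ((uint32_t)(s.a >> 48) << 16) | (uint32_t)(s.b >> 48);
  }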


What do you do when you have a string that's longer than 2^32 that gets truncated to len=0? Instantly freeing the buffer might not be what the user wants, if they intend to immediately reuse it for an another very long string, for example.

I think that's a pretty bad case of premature optimization, especially because the first CPUs with 57 bit support are now hitting mainstream. Just use 3 words, it's not that much extra space.


Realloc/remap down to 4GB in that case sounds OK to me. More than 4GB allocated from a structure which can't do any resizing seems moderately unlikely, but sure, I guess free is also correct in that case.

Two 64 bit values can be returned in registers on the systemv x64 abi, three get passed as a pointer to stack memory. It's an optimisation but I think it's a valid one.

57 bit address space has been coming any year now for maybe a decade, I'll worry about that when it happens.


Yes, his arrays which have length as an immutable part of their type certainly prevent certain kinds of bugs. Too bad about making it impossible to write generic array-handling subroutines, even if you accept the generally inexpressive type system as a given.


Up to


I'm sure you know this, but just another point for people to keep in mind: using a length+contents representation makes it harder to modify the payload, if needed (more bookkeeping). And using a variable-size length makes that even harder, since you might have to shift or re-copy the full payload to make room for the new "length" header.

Of course, once you're done processing and are sending it along (as in serialization, that you mention), it's not an issue.


The point of strlcpy(3) is not to be the world's best and most efficient string copier. It's to provide a drop-in replacement to previous, memory-unsafe string copy routines in constrained environments where you have to have bounds on stuff and might not have an allocator.

If there are bugs with truncation in the resulting buffer, those are the program's bugs, and they existed before strlcpy(3) came into the picture.


> The point of strlcpy(3) is not to be the world's best and most efficient string copier. It's to provide a drop-in replacement to previous, memory-unsafe string copy routines

It's not a drop-in replacement, though. Not even if you ignore the different return type.

strncpy guarantees that the buffer will be completely overwritten (filling with null chars at the end), while strlcpy will happily leave remnants of whatever was there before.

Just dropping in strlcpy wherever strncpy appears can lead to data leaks or inconsistent hashes, for example, depending on how the buffer's contents are used.
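
Concretely (strlcpy is BSD in origin and now POSIX.1-2024; glibc only gained it recently):

  #include <string.h>

  int main(void) {
      char a[8], b[8];
      memset(a, 'X', sizeof a);
      memset(b, 'X', sizeof b);

      strncpy(a, "hi", sizeof a);
      /* a = 'h','i',0,0,0,0,0,0           -- padded with NULs to the end */

      strlcpy(b, "hi", sizeof b);
      /* b = 'h','i',0,'X','X','X','X','X' -- old bytes left in place */
      return 0;
  }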


That behavior should be totally irrelevant to code bases that are trying to handle C strings properly. If you have some reliance on the content of the buffer after the terminator, you've got problems that the string copy routine cannot help you with.


> That behavior should be totally irrelevant to code bases that are trying to handle C strings properly.

"No true Scotsman..."


It is a problem when people foolishly dump structs and fixed size buffers to storage without proper serialization. If you need that level of performance then you own the consequences.


This. I don't understand the objection and I spent 20 years writing C code. The reason to use strlcpy is not to _fix the bug_ but rather _to prevent the bug from turning into a crash, memory corruption, or exploit_. It also forces discipline by carrying around the length. As you say, it's also a drop-in.

A truncation bug is a hell of a lot easier to debug than memory corruption.


I've worked on embedded RTOS projects where we had our own strlcpy implementation - it's fine. Well, I mean, all the str functions suck because C strings, but that's exactly why sticking to a good shared set of idioms and staying organized is so important. And in C, that means manually tagging buffers with their length, no getting around that. Given that, strlcpy is less bug-prone than strncpy, simply due to requiring fewer lines of code to use correctly per invocation.

I think a lot of the confusion in the C string discourse comes from people thinking they should rely on the NULL termination byte for string length. You really shouldn't, and if you have to do it, you need to be extra careful to check all your assertions that it will be properly terminated. Just carry around the length, and bundle it with the pointer in a struct to pass it around when it makes sense. Not the most ergonomic, but it's C, what can ya do.


It's pretty funny that C strings were decided to be NULL terminated in the ancient past for 'convenience', but it turns out you still need to carry the length around anyway.


Not to defend C strings too hard, but it does make some sort of sense, IMO. You have to manage all your buffers manually in C, whether they contain a string or not. If you store a string of length 5 in a 10-byte buffer, you still need to manage the 10-byte buffer. Raw pointers kept things very flexible and lightweight when C was created.

Nowadays, things like C++ string_view's and Rust str slices handle this for you automatically, but those came around much later and require more sophistication at compile time.


Yes, but it's not that much more sophistication, because C already supports structs. (Though I'm not sure if the first versions of C already had structs?)


> It also forces discipline by carrying around the length.

LOL. It does not force anything - you can mishandle source or destination buffer lengths very easily and compiler won't say anything.

I sometimes wonder what kind of disaster will have to happen to make C programmers agree on a standard buffer (i.e. pointer+size) type with mandatory runtime bounds enforcement...


Force is too strong a word. Yes, it's possible someone just passes whatever, or just passes strlen(s) which is an even dumber answer.


> It's to provide a drop-in replacement to previous, memory-unsafe string copy routines

Nitpick: it’s not quite a drop-in. Prototypes of these functions are

  char * strncpy(char *dst, const char *src, size_t num);
  size_t strlcpy(char *dst, const char *src, size_t num);
strncpy(dst, src, num) always returns dst (https://cplusplus.com/reference/cstring/strncpy/), which is quite useless, as the caller knew that already.

strlcpy(dst, src, num) returns the total length of the string it tried to create (https://www.unix.com/man-page/posix/3/strlcpy/). Callers can use that to detect that the string didn’t fit the buffer and reallocate a buffer that’s long enough.
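
For example (a sketch of the detect-and-retry pattern):

  #include <stdlib.h>
  #include <string.h>

  /* Returns dst on success, a freshly malloc'd copy if src didn't fit
     (caller frees), or NULL on allocation failure. */
  char *copy_or_grow(char *dst, size_t cap, const char *src) {
      size_t need = strlcpy(dst, src, cap);
      if (need < cap)
          return dst;                 /* fit, terminator included */
      char *big = malloc(need + 1);   /* need == strlen(src) */
      if (big)
          strlcpy(big, src, need + 1);
      return big;
  }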


That’s exactly why you shouldn’t be using it: it does a very bad job at that, with behavior basically nobody wants.


But doing that if you have a large ancient C code base is a lot of work.

The reason for the existence of strlcpy isn’t that it is perfect, it’s that it’s the best option with good UX for integration into an existing C code base.


It's not, though. That's the point: the interface it provides is not very good. The API surface for "I have a string here and I want you to put it there but only the first n bytes" is well-defined and can be done in a much better way than what strlcpy does.


The API surface is "I have a string that I'm pretty sure will fit in this buffer, but if it doesn't I don't want to cough up control of my bootloader."


Yep, and that space has room for improvement.


I'm not a fan of Unix man page section numbers in parentheses.

strlcpy is a stopgap, whack-a-mole solution for buffer overflows. It is rationalized by the reasoning that it does not make the program less wrong, while (probably) making it more secure.

When truncation matters and you have a fixed size buffer, that buffer should be large enough in order for it to be justifiable to say that someone is misusing the application. Perhaps a tester trying to break it.

Nobody’s surname needs 128+ bytes. No reasonable URL for a firmware update download needs 4096 bytes.

If truncation matters, no, it does not always make sense to accept a gig of data and be ready for more. You can impose a limit. A violation of the limit is an error, treated like a case of bad input.


People have long surnames, especially if you take multi-byte characters into account. And someone with the same mindset as you is the reason why my profile picture in Google's employee directory never did actually show up in Safari: one of the path components ended up being a couple thousand characters long (I wonder if it was literally a base64 encoding of the image itself?)


No, my reasoning does not say that a browser shouldn't handle a long URL (which is a substring of a page it has already accepted and rendered).


I fail to see the difference here?


Out of curiosity, would you mind sharing some details about your surname?

E.g., its length in its native alphabet, or its length as a UTF-8 string?


  $ swift
  Welcome to Apple Swift version 6.0 (swiftlang-6.0.0.5.15 clang-1600.0.22.6).
  Type :help for assistance.
    1> "झा".count
  $R0: Int = 1
    2> "झा".utf8.count
  $R1: Int = 6


> Nobody’s surname needs 128+ bytes.

https://en.wikipedia.org/wiki/Hubert_Blaine_Wolfeschlegelste.... would beg to differ.


A couple jobs ago, I worked on writing an API client for a CRM system that supported 2GB for most of the text fields (name, address line 1, job title, etc...). It also offered up 99 "custom field" text fields, also allowing up to 2GB each.

I'd considered base64-encoding my ripped DVD collection, and using them to store another backup copy for me.


He would be too busy begging numerous agencies to handle his name to beg to differ with you.

Is there a picture of the ID page of that man's passport? Or of a driver's license or similar?

Whatever is on that is his actual name.


I love that the guy's occupation was "typesetter".


I'm not sure his name is long enough for this to be an issue.


The URL uses a shortened version, his full name on the page is:

Adolph Blaine Charles David Earl Frederick Gerald Hubert Irvin John Kenneth Lloyd Martin Nero Oliver Paul Quincy Randolph Sherman Thomas Uncas Victor William Xerxes Yancy Zeus Wolfeschlegel­steinhausen­bergerdorff­welche­vor­altern­waren­gewissenhaft­schafers­wessen­schafe­waren­wohl­gepflege­und­sorgfaltigkeit­beschutzen­vor­angreifen­durch­ihr­raubgierig­feinde­welche­vor­altern­zwolfhundert­tausend­jahres­voran­die­erscheinen­von­der­erste­erdemensch­der­raumschiff­genacht­mit­tungstein­und­sieben­iridium­elektrisch­motors­gebrauch­licht­als­sein­ursprung­von­kraft­gestart­sein­lange­fahrt­hinzwischen­sternartig­raum­auf­der­suchen­nachbarschaft­der­stern­welche­gehabt­bewohnbar­planeten­kreise­drehen­sich­und­wohin­der­neue­rasse­von­verstandig­menschlichkeit­konnte­fortpflanzen­und­sich­erfreuen­an­lebenslanglich­freude­und­ruhe­mit­nicht­ein­furcht­vor­angreifen­vor­anderer­intelligent­geschopfs­von­hinzwischen­sternartig­raum Sr.


He should have added a few numbers in there to really throw off some login systems.


And spaces.


Active Directory would cry.


The exception proves the rule. There is no way to read the description of the translation of the surname without coming to the conclusion that this name was chosen for mischief, outside of the bounds of reasonable human societal expectations. Mischief is fine and all, but there is no authentic "gotcha" compulsory requirement for society to accommodate the chaos resulting from such a mischievous personal preference. (And I say this as someone who has legally changed his name twice.)

tl;dr Make your bed; lie in it.


For conventions of the English language, this is an exception, but not in other languages. Arabic names have up to five components. A 128-byte limit leaves an average of 25 bytes for each component. Consider that Arabic script in UTF-8 consumes 2 bytes per glyph, and graphemes in Arabic are compositions of these glyphs.

For a more extreme example, consider conventions of Japanese. Middle names do not exist in Japanese. In fact, middle names are impossible to input into the 戸籍 (family registry). Forms in Japan are designed around the assumption that each person has exactly two names. Many Europeans would be unable to input their full name in such system. In this example, it'd be unreasonable to suggest most Europeans are acting "outside of the bounds of reasonable human societal expectations".

In general, the most effective solution I've seen for handling names is to have a single name field and treat it opaquely. If you need an inflection, ask for that separately.


> In general, the most effective solution I've seen for handling names is to have a single name field and treat it opaquely. If you need an inflection, ask for that separately.

Yes, in general. Though sometimes you need to know about specific parts of a name as interpreted in a specific cultural context.

Eg in Germany by law you need to register a family name when you get married, and your kids are going to have that family name, whether you want it or not. (The parents don't necessarily have to have that name, eg the parents can opt for double-barrelled names or they can keep their old name. But the kids only get one family name and they all get the same name, and there are restrictions on which name you can pick.)

In contrast, here in Singapore they just give you a blank space in a form where you can put in your new baby's complete name as you please.

(OK, technically, you get multiple blank spaces, because you can eg give your kid a western name like Jay Random Smith and a Chinese name that is a completely separate name and doesn't need to have anything to do with the western name. I think you can also get eg a tamil name, if you want to etc.)


Compare Mr Karl-Theodor Maria Nikolaus Johann Jacob Philipp Franz Joseph Sylvester Buhl-Freiherr von und zu Guttenberg https://en.wikipedia.org/wiki/Karl-Theodor_zu_Guttenberg

And he's not even the guy with the longest name, and his parents did not make up his name to spite some length restrictions.


That surname is like 22 bytes


Adolph Blaine Charles David Earl Frederick Gerald Hubert Irvin John Kenneth Lloyd Martin Nero Oliver Paul Quincy Randolph Sherman Thomas Uncas Victor William Xerxes Yancy Zeus Wolfeschlegel­steinhausen­bergerdorff­welche­vor­altern­waren­gewissenhaft­schafers­wessen­schafe­waren­wohl­gepflege­und­sorgfaltigkeit­beschutzen­vor­angreifen­durch­ihr­raubgierig­feinde­welche­vor­altern­zwolfhundert­tausend­jahres­voran­die­erscheinen­von­der­erste­erdemensch­der­raumschiff­genacht­mit­tungstein­und­sieben­iridium­elektrisch­motors­gebrauch­licht­als­sein­ursprung­von­kraft­gestart­sein­lange­fahrt­hinzwischen­sternartig­raum­auf­der­suchen­nachbarschaft­der­stern­welche­gehabt­bewohnbar­planeten­kreise­drehen­sich­und­wohin­der­neue­rasse­von­verstandig­menschlichkeit­konnte­fortpflanzen­und­sich­erfreuen­an­lebenslanglich­freude­und­ruhe­mit­nicht­ein­furcht­vor­angreifen­vor­anderer­intelligent­geschopfs­von­hinzwischen­sternartig­raum Sr.


That's the shortened version. The full version of the surname is 666 characters.


I guess it kind of goes to the poster's point, since Wikipedia article titles truncate at 255 bytes (since that was the max size of a VARCHAR in MySQL 3).


I apologize, I should've read more of the article. My bad


Surprised nobody has linked this yet:

https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...

Also note that while the counter examples might sound extreme, in some languages each character might need 3 bytes in UTF-8, and 128/3 ~= 43 characters doesn't seem to be that outrageous.


> Nobody’s surname needs 128+ bytes

Oh we're doing that "false things that programmers believe about the world" again! Fun! Let's consider cultures where people can have more than one surname. Ever heard about Pablo Diego José Francisco de Paula Juan Nepomuceno María de los Remedios Cipriano de la Santísima Trinidad Ruiz y Picasso? Yes, that's his full name. Of course, he didn't frequently use it, but if you make a system dealing with people's names, you'd eventually end up having to support something like that.


In this case, his family names (Spanish people have two) were just "Ruiz y Picasso"; the rest is his given name. So you can argue that his family name fits within 128 bytes.

But that brings us to the assumption that people have just one family name which is just one word, which is very much not the case in many cultures around the world.


A programmer who believes surnames are one word has not heard of John Von Neumann.


And then there's the assumption that you'd write it in that way, and in that order. There might be situations where you'd want to use the Hungarian and write "Neumann János" instead.

I.e. the assumption that the family names come last is not necessarily correct even in Europe. Let alone if one deals with the various Chinese languages, Japanese, or Korean. And probably others.


That is also true. I am currently working on an international project in Japan, so when designing our systems and databases I always insist on the terms "given" and "family" names instead of "first" and "last" names, to prevent confusion.

Same with the dates. I always ask my colleagues to please write the years in full, otherwise it can be very difficult to know if it is DMY or YMD (thankfully we don't have to deal with MDY).


> Nobody’s surname needs 128+ bytes.

According to Wikipedia some joker did have a 666-character surname, officially, in the USA. Perhaps the best thing to do would be to truncate the field at some reasonable limit to prevent people from crashing your system with ridiculous values but make sure your system works properly with the truncated names so, for example, it doesn't panic because the truncated name isn't equal to the original name.

With spaces, punctuation and diacritics, comparing even short names for equality is a bit dangerous and probably best avoided. If you expect two text fields to match and they don't, even after normalisation, you could consider flagging the case for human review later but continuing without an error for the time being.


If I'm understanding correctly what you mean by "joker", it sounds like they changed their name to be this long intentionally? That seems like the type of thing where most software probably can get away with just not supporting them; with the possible exception of mandatory government services or something, there's no reason software should need to account for people taking such extreme steps of their own volition.


Because programmers are too lazy to properly handle long names? That's a stretch for denying someone service and you know it.

Like, yes, nobody is forced to accept their name unless they're running a government service, but using it as an excuse is just that, an excuse.


Any system will have some limit on the length of names - if nothing else, the budget for storage.

A non-lazy programmer will determine an appropriate limit, document it, continuously test that the entire system can handle that length correctly, and continuously test that helpful errors are returned when too-long names are input.


What if my legal name is 500 trillion characters long? Should every project design their storage system to accommodate this?

If you look at the 666-character name, it's no more or less ridiculous than 500 trillion characters.


Ask the government whether your legal name can be that long. They'll say no.


Handling arbitrary length input without caring how that could be abused is also lazy.

If you're working in a stack that nicely handles arbitrary lengths, it takes extra consideration and effort to put in limits.


> nobody is forced to accept their name unless they're running a government service

I dunno, would you expect that the government should be allowed to dictate how long a person's name can officially be? If yes, then problem solved, nobody may have names longer than X, and all services will accept X. If no, then there has to be a practical limit on name sizes that government services can accept, and people will be unhappy because it doesn't accept their "official" name.


Alas, that only works if you are dealing only with people under the jurisdiction of the government in question.

There's always them pesky foreigners.


There's also the fact governments aren't static. The past, and future, are foreign countries.


To be clear, I'm not arguing for or against a specific value as the "maximum length" of a name. I'm drawing a distinction here in terms of a potential user's intentional choices and what that means for providing support.

> Because programmers are too lazy to properly handle long names? That's a stretch for denying someone service and you know it.

I don't think someone should be denied service if they happen to have a long name, but I genuinely don't think it's a stretch not to try to handle people going out of their way to subvert expected norms. In this case, the argument is more philosophical than technical because there isn't an obvious heuristic for determining whether a name is intentionally made long or not, but there are places where I do think it's worth it for programmers to consider.

As an aside, I'd argue there's more nuance than "properly handling long names" or "being lazy". There's already an inherent limit on how large a name can fit into memory, and that limit might even fluctuate depending on the number of users being processed in a server at a given time and how much memory a given server has. Is a 1 GB name too long to be worth handling, or is not handling it "laziness"? If you're arguing that any name the government accepts should be accepted by any software, how do you know what limit the government will accept? If you have international customers, is the limit larger? If there's no documented limit, do you just need to hope your software is at least as robust as the government's?

My point isn't that these situations are equivalent to a name that's 666 characters long, but that arguing that not handling 666 characters is lazy already blends implicit technical assumptions (servers have enough memory that handling names with 666 characters isn't an issue) with social assumptions (it's possible for someone to actually have a name that long), and I don't think that "pretend all names can fit into memory fine and just crash or time out or something if there are too many names that are too long according to the parameters of the runtime and the hardware" is the obvious best choice from a fairness perspective.



He allegedly told the utility company that he wouldn't pay his bill unless they spelled his name correctly, which caused them to print it on three lines. Maybe this guy is just a walking software test case?


The bill just needs the correct account number and address.

You don't get to just stipulate new conditions for paying your bill. If it's not in the service contract that your giant name has to be spelled completely on the bill for it to be payable, then that condition doesn't exist.


Are you a lawyer with expertise in Philadelphia commercial law of the 1950s? I'm sure not, so I wouldn't want to take that fight. It was apparently easier for the company to just write the guy's name correctly.

Or maybe Wikipedia is wrong, or the source was bad. You have to pay to read the 1955 article and I can't be bothered right now. Citation 16 below if you're interested.

https://en.wikipedia.org/wiki/Hubert_Blaine_Wolfeschlegelste...


That’s exactly what MICROS~1 did with the 8.3 file names on DOS, to add support for longer file names in systems that did not do so originally.


> Nobody’s surname needs 128+ bytes.

This attitude ensures that the US software industry will never conquer the world.


The US software industry has already conquered the world, and continues to do so. It doesn't mean that it'll always be this way, but if you wanted to associate software with a single country, then the US would be your answer.


Why does it need to conquer the world? I just want to deal with sane, performant software.


And people in other locales want to be able to use their own name without software mangling it.


At some point there is a physical limitation, there's no passport in the world that accepts a 666-character name.

The US only gives you 21 characters on the DS-11 for a surname.


You're assuming one character is stored as one byte, which is only the case for English.


That sounds like a them problem. ;D

I once heard that "decision" comes from a Latin root word meaning "to cut off (the other options; to pay opportunity cost)". I will decide to optimize for my use case.


I just want to deal with sane, performant people.

We'll just both need to learn to live with disappointment.


> nobody's surname needs 128 bytes

I believe that's #6 of the falsehoods programmers believe about names (with examples):

https://shinesolutions.com/2018/01/08/falsehoods-programmers...

Ignoring multi-byte characters, there are still plenty of long names: https://www.ancestry.com/c/ancestry-blog/discovering-the-his...

If you're going to try for a "reasonable" max name length it would probably need to be at least 4kb.


Names are hard, don't force things into your assumptions.

If you must, record fields such as:

Full Legal Name - Freeform input, no string length limitation. If you feel like this is an attack vector, send the data out for human review.

Full Mailing Address - Don't try to break this down, allow multi-line, free form input. This is something you might want to validate with your shipping carrier and/or a human.

A short 'nickname' used as such.


You always need size limitation. You really don't want to allow 10GB strings to be stored in the full name field.

Also, in almost any situation where you need a legal name, you actually want to follow a lot of rules. This idea that people's original names are somehow sacrosanct is a misunderstanding. If you're doing business in a European country for example, you have to write your name in Latin/Cyrillic letters, perhaps with a few symbols like ' or - allowed as well, and typically with a few accents/diacritics specific to each country. You certainly can't register as 田中 in any context that requires a legal name in France, you'd have to write that as Tanaka.

And this is natural because legal records are meant for authorities in some specific country to read, and compare with other legal docs - so they need to at least be readable to those authorities.


A 10GB string does sound pathologically large; however it's an argument about the absurd.

What size is for sure enough? Well I'm not so sure. What if someone has a lot of titles. What if society decides that someone's Legal Name requires a post quantum cryptography key that happens to be 20MB (binary) long?

Also, FYI, at least PostgreSQL doesn't give you a free lunch for any variable length string; length requirements are a column constraint that DEGRADES performance because it has to check.

An external name validator of some sort could check things. Commonly allowed cases could pass by computer check, while actual humans could review edge cases. Someone trying to abuse the name field like that probably needs human review elsewhere anyway.


I would bet you that Postgres insert and query performance is better overall if all names in your table are, say, up to 10KB long than if you have a bunch of bots inserting 20MB-long "names". And 10GB long strings are just not supported by Postgres, at least with default settings.

And the point of adding limitations on length is that you shouldn't even accept the HTTP request if it passes some size, as it will severely degrade performance if you allow someone to upload a 10GB string, even if you separate it into a human review area.

Finally, if the legal requirements change and legal names can legitimately contain cryptographic material, then your system has to change. There is no point in designing a system that tries to work for any possible use case.


Names aren't actually that hard. It's perfectly reasonable to assume that your system is going to be used in your country and culture, and handle the cases which are relevant for that. Edge cases within that context, and expanding to other cultural contexts, can be handled as they come up. But until then, YAGNI.


> Nobody’s surname needs 128+ bytes

128 bytes would only be 42 characters if each character uses 3 bytes, as would be the case in some languages. Which isn't an unreasonable length, especially if the name has a lot of combining characters.


Ok, let's make it 50 Unicode codepoints.


> Nobody’s surname needs 128+ bytes

That is only 32 astral characters... Seems kind of close for comfort. Not to mention combining characters


> I'm not a fan of Unix man page section numbers in parentheses.

Why not? Confusion if function call with section number as argument?


It's visual clutter that only communicates

- I see the entire computing world with Unix blinders on my eyes.

- I can't imagine a strlcpy function being used on a system that isn't Unix and that doesn't have a man page for it in section 3.

- I don't care that C has been internationally standardized since 1989 with a printf function; printf(3) is just another Unix function in section 3 of my man pages.

- If I don't affix (2) or (3), how will people know I'm not talking about something other than a C library function? I don't understand this "context" stuff in writing and speaking.


I once learned what they mean.

But I forgot and now do not.


They describe the section of the manual, where 1 = programs, 2 = syscalls, 3 = userspace functions, 4 = special files, 5 = file formats, 6 = games, 7 = misc/overviews.

So printf(1) is the man page for the /usr/bin/printf command, while printf(3) is the man page for the libc printf() function.

Alternatively, readdir(2) is the man page for the readdir syscall, while readdir(3) is the man page for the libc wrapper, which no longer actually calls readdir(2). See also syslog(2) and syslog(3).

Or, time(1) is the man page for /usr/bin/time to time how long commands take. time(2) is the syscall to return the number of seconds since the epoch. And time(7) gives you an overview of time and timers on the system.


I just discovered that typing man 'printf(1)' or man 'printf(3)' actually works (at least with the man command on Ubuntu, provided by the "man-db" package).

You have to quote or escape the parentheses because they're shell metacharacters, which IMHO makes that syntax more trouble than it's worth.

Other commands that work are "man 3 printf", "man -s 3 printf", and "man printf.3". I think "man 3 printf" is the oldest version.

At least one other version of the man command (NetBSD) doesn't accept "man printf.3" or "man 'printf(3)'".


Unless you're on a UNIX system the number isn't important. If you are on a UNIX system, then it's useful for telling the difference between commands (like write(1), hostname(1), or printf(1)), system calls (like write(2)), library functions (like printf(3)), or config files (like hostname(5)).


In an article about C programming, the context tells you that everything that is given in a typewriter font is a C identifier, unless noted otherwise.


> Nobody’s surname needs 128+ bytes. No reasonable URL for a firmware update download needs 4096 bytes.

...and surely the "seconds" field of a timestamp is always between 0 and 59 inclusive, addresses will include a state and a building number, phone numbers contain only digits and maybe a leading + sign, etc.

Wrong assumptions like this are one of the root causes of (in the best case) bad UI or (worst case) annoying bugs.

128 bytes for a surname is only about 60 unicode characters, less if you include RTL markings and characters outside the BMP.

A URL can contain SHA hashes (think: reproducible builds) and can thus be very long (okay, 4k characters is pushing it quite a bit but I wouldn't rule it out like you did...)


There have to be limits somewhere. Memory and storage space is not infinite. On the other hand, "how many characters can someone have in their name" is infinite. That means that no matter what limit you choose, someone will eventually exceed that limit. And you have to have a limit.

This is not a matter of "wrong assumptions". At the end of the day, all you can do is set a limit such that you're comfortable with the risk that someone will be outside the limit you have set. And risk tolerance, as always, is a matter of opinion and not fact.


This is a poor argument.

Firstly, "how many characters can someone have in their legal name" is decidedly not infinite, because it has to be sufficiently short that some governmental entity was willing to record it.

Secondly, as a reply to a comment (quite reasonably) pointing out that 60 unicode characters may not be enough for a surprisingly large number of people, this makes even less sense. Memory and storage space are not infinite, but 128 bytes per name is still unreasonably low. One could buy a single 12TB hard drive and store the names of every single living human, allocating 1.5KB per person.


The point is, you are always going to exclude some number of people's names when you set constraints. So you have to decide what is right for your use case, just like with any other engineering decision. There's no such thing as "right" or "wrong" here, simply what is best for the context. 128 bytes for a name is going to be unreasonably low in some contexts (e.g. for Arabic names), but not for others (e.g. for American names).


We live in a global society and there are plenty of Arab Americans so Arabic names are American names.


Upthread, I was the one who posited the 128 byte figure. It was for a surname, not full name.

I now posit 50 unicode codepoints for a surname.

Fit or fuck off.


> it has to be sufficiently short that some governmental entity was willing to record it.

How short is that, exactly?


It's not hard to apply an upper bound here.

If you arrive at a government office when they open for the day, and try to spell out your name to the clerk, and are unable to finish before the office closes for the day, that name is too long.

Realistically they will probably tell you to leave well before that.

No one will have a legal name longer than the legal system will allow them to have.


> Wrong assumptions like this

Assumptions, or features? I'm all for inclusive behavior, however I'm also for well tailored solutions. Having support for 8k characters when you are going to usually use maybe 20 isn't smart or correct either. That's why we have utf-8, not utf-32, you can grow the bytes when you need to, but only then.

> A URL can contain SHA hashes

It can, or it can not - again, perhaps a feature, not a bug. The hash can also live in a file named by convention, and downloaded/checked separately. Maybe there are other scenarios where you might need a really long url, but domain + release path + name + major.minor.patch should get you 99% of the way.

What's "reasonable" is relative, always designing for the edge case is good in some cases, but its also OK (and perhaps better) to optimize on occasion.


When I was in college, the only number in my mailing address was the zip code.


THIS. Not for every use case, obviously. But for a huge number of them. And the "FooBaz length 5307 bytes in check_input(), truncating to 4095 bytes..." errors (which are trivial to log, or ignore, as you wish) can reveal many interesting things.


128 bytes is only 4 32-bit characters. Now, I think 4-byte UTF-8 characters are pretty rare, but at least 3-byte ones are certainly common in names, even legal names.

If you allow users to type emojis in their name you will definitely run out as the color/gender selectors take up an additional code point.


You confused bits with bytes. 128 bytes can encode 32 4-byte long characters.


Given how nice the systems programming languages we have these days are, I refrain from letting classic null-terminated C strings enter my program. Even in embedded programming we opt for std::string (over Arduino's String). I am just happy to save our time in exchange for code that is some X percent less optimal.


It is seriously unfortunate that C++ managed to standardize std::string, a not-very-good owning container type, but not (until much, much later) std::string_view, the non-owning slice type you need far more often.

Even if Rust had chosen to make &str literally just mean &[u8] rather than promising it is UTF-8 text, the fact &str existed in Rust 1.0 was a huge win. Every API that doesn't care who owns the textual data can ask you for this non-owning slice type, where in C++ it had to either insist on caring about ownership (std::string) or resort to 1970s hacks and take char * with zero terminated strings.

And then in modern C++ std::string cares about and preserves the stupid C-style zero termination anyway, so you're paying for it even if you never use it.


> And then in modern C++ std::string cares about and preserves the stupid C-style zero termination anyway, so you're paying for it even if you never use it.

I don't think this in itself is a real problem. You pay for the zero at the end, which is not much. The real cost of zero termination is having to scan the whole string to find out the size, which with std::string is only needed when using it with C-style APIs.


and also you have to copy (probably allocating) to get a substring.


Only if you want a substring with separate ownership though - a string_view doesn't have to be NUL-terminated.


If you want to pass the string_view to some API that expects NULL terminated strings, then a copy is necessary (well, maybe in some cases you can cheat by writing a NULL in the string and remembering the original character, and then after the API call restore the character).

This isn't as much a fault of a string_view type of mechanism, but rather of APIs wanting NULL terminated strings. Which are kind of hard to avoid on mainstream systems today, even at the syscall interface. Oh well..
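
The cheat, spelled out (a hypothetical sketch; only safe if the buffer is writable, not shared across threads, and has a byte at `end` to clobber):

  #include <stddef.h>

  void call_with_nul(char *buf, size_t end, void (*api)(const char *)) {
      char saved = buf[end];   /* remember the byte we overwrite */
      buf[end] = '\0';
      api(buf);                /* the C API sees a terminated string */
      buf[end] = saved;        /* restore the original contents */
  }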


Sure, but the thread here was about the forced NUL-terminator in std::string and the costs associated with that. If you want a NUL-terminator (e.g. for use with a C API) then you have to pay the copy (and in the general case, allocation) cost for substrings no matter how your internal strings look (unless you can munge the original string) and std::string is exactly the right abstraction for you.

But yeah, it would be nice if the kernel and POSIX APIs had better support for pointer+size strings.


> And then in modern C++ std::string cares about and preserves the stupid C-style zero termination anyway, so you're paying for it even if you never use it.

Is this required now? I've seen a system where this was only null terminated after calling .c_str()


c_str has to be constant complexity, so I guess the memory needs to be already allocated for that null character. I'd be surprised to see an implementation that doesn't just ensure that \0 is there all the time.


Ah, the system I ran into would've been pre-c++11.

Only saw it trying to debug a heap issue and I was surprised because I thought surely it's a null terminated string already right? They also checked the heap allocation size, so it would only reallocate if the length of string data % 8 was zero.


Facebook / Meta had their own string type which did this; it turns out now you have an exciting bug, because you end up assuming uninitialized values have properties they don't. Reading an uninitialized value is Undefined Behaviour, and so your stdlib conspires with the OS to screw you in some corner cases you'd never even thought about, because that saved them a few CPU cycles.

The bug will be crazy rare, but of course there are a lot of Facebook users, so if one transaction out of a billion goes haywire, and you have 100M users doing 100 transactions on average, the bug happens ten times. Good luck.


Golang's slices view of the world is addictive.


Depending on your usage it is not necessarily less optimal either.

You never need to walk the string to find the \0 byte, e.g. for strlen.

For short strings no heap memory needs to be allocated or deallocated.


It's really too bad though that the short string optimization capacity is neither standardized nor user-controlled.


Exactly this

It seems C is going around in circles while everybody else has moved on

No, speed and "efficiency" are not the be-all, end-all.

Safety is more important than that, except in very specific cases. And even in those cases there are better ways and libraries to deal with the issue without requiring the use of "official" newer C functions that seem to still be half broken.

There's so much fiction regarding memory issues, limited-memory issues, and what to do if we hit them, when in practice "terminate and give up" is the norm (often enforced by the OS).


I look forward to your solution to bridge these libraries to every other person’s slightly different implementation of the same library, which also has to talk to every other interface that cropped up over the last 50 years and takes null-terminated strings anyways.


I haven't moved on. :> C FTW! Bloated langs are cringe.


What you call "bloat" is how other languages handle complexity that C necessitates that the programmer handle. Enjoy.


Have fun playing with a chainsaw with the safety disabled


Have fun playing with your Fischer Price KidSafe™ playset? >:)

At some point we are just outlining in general terms what we want from a language. If C was a toolbox it is a limited number of essential tools, other languages add so many things Alton Brown would faint from the unitasking nature of them.

C programmers love this. I know I do.


The "kids tool" comment is so weird.

I could do pretty much whatever I wanted in (DOS) user space with Pascal.

I can do pretty much whatever I want in a modern OS user space with whatever lang I prefer. "Oh but you might need C bindings" because the OS was built that way! (And with Windows/COM you might prefer C++ bindings - just saying ;) )

> If C was a toolbox it is a limited number of essential tools

C is an old toolbox where 1/3 of the tools are rusty finger-removers, 1/3 are clunky metal crap that barely does anything, and 1/3 kinda work

I'm all for a simplified set of essential tools, but not one where it's sharper on the user handle than it is on the business end

bUt C iS jUsT hIgH lEvEl aSsEmBlY no it is not


If someone ever has to use your stuff over FFI from a high level language, they will curse you for not just using C strings.


Null terminated C strings are still terrible for FFI. Pointer and length is a better solution and it is trivially interoperable with code that uses, say, string_view.


If someone could port the Free Pascal string library to C, it would solve a lot of problems with new C code. It reference counts and does all the management of strings. You never have to allocate or free them, and they can store gigabytes of text. You can delete from the middle of a string too!

They're counted, zero terminated, ASCII or Unicode, and magic as far as I'm concerned.

Oh... And a string copy is an O(1) operation as it only breaks the copy on modification.

Edit: corrected to O(1), thanks mort96


For most strings, it seems to be that using a varint would solve the overhead problem. For short strings the overhead would be no longer than the null byte (which you could discard, except when interacting with existing APIs).
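
A minimal sketch of that idea using LEB128-style encoding (my illustration, not from the thread): lengths under 128 cost exactly one byte, the same as the '\0' terminator they replace:

    #include <stdint.h>
    #include <stddef.h>

    /* 7 bits of length per byte; the high bit means "more bytes follow" */
    size_t varint_encode(unsigned char *out, uint64_t n)
    {
        size_t i = 0;
        while (n >= 0x80) {
            out[i++] = (unsigned char)(n | 0x80);
            n >>= 7;
        }
        out[i++] = (unsigned char)n;
        return i;   /* number of prefix bytes written: 1 for n < 128 */
    }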

But as with _all_ string solutions, it's the POSIX interface, standard library, and other libraries that screw you. If you're programming in C today, it's because you're integrating with a ton of C code, and thus it's impossible to make progress since the scope of improvement is so small.

It's always struck me as weird that Rust treats strings the way it does - the capacity value is not useful for many cases of strings, and it would have cost them one bit to special-case the handling of constant strings without the cap field, which would be better. Most strings are _short_, which makes the overhead worse, proportionally.


It's not like there is any shortage of alternative string libraries for C; sometimes I feel everyone has gone and invented their own.

Antirez's sds is just one example https://github.com/antirez/sds


Pascal strings have an overhead of 2 ints per string (16 bytes on 64-bit systems)

The kind of person who calls a single pass through the string a "horribly inefficient solution" will faint at the idea of burdening every string with 16 more bytes of data.


it's pretty trivial to implement this with a max of 14 bytes of overhead (with small string optimization), but more importantly, it's only 16 bytes on 64-bit systems, which pretty much by definition aren't that memory constrained (since otherwise you would be on 32-bit).


> ...it's only 16 bytes on 64-bit systems, which pretty much by definition aren't that memory constrained (since otherwise you would be on 32-bit).

I'm not sure about that. There are plenty of 64-bit systems with less than 4GB of RAM


Including your laptop if you have a few Electron apps open!


Maybe other people's laptops, but 32GB is table-stakes for any laptop I'll buy.


For some reason Multics got a higher security score from DoD than UNIX, guess why.


This is a purely psychological problem. I’d say most of C is psychological, not technical. If I were a world dictator, one of my first orders would be to lock C developers in a room with only python for a few months. Or ruby, in severe cases. Some of them really need to touch grass.


> If I were a world dictator, one of my first orders would be to lock C developers in a room with only python for a few months. Or ruby, in severe cases.

I would additionally do the exact opposite: lock Python & Ruby developers in a room with only C for a few months.

C is a great language to learn programming, but Python or Ruby are, nowadays, in most cases, better languages to program with. For example, C's sharpness is a notorious source of bugs; yet it forces you to develop rigor and discipline.


But if you only give them a few months that's barely enough time to run a simple Hello World.


As an aside: can you have an O(0) operation that actually does anything?


It doesn't really make sense within the context of complexity analysis as something distinct from constant-time, which is denoted with O(1). A copy of a CoW string is O(1).


This. In complexity analysis you factor out any constants and only look at the term with the highest growth rate, so you end up with 1 < ln n < n < n^a < 2^n (this can be extended indefinitely by replacing the n in the last case with anything to the right of n, but in practice these are the only ones that matter).


Stuff that happens at compile-time is O(0) (well, technically it's amortized over the number of times you run the compiled code, eh? Huh, how does JIT compilation affect Big-O analysis?)


O(0) is essentially meaningless. The only way a task could possibly be O(0) is if it isn't done at all, as even if the task is guaranteed to run in a Planck second [0], that's still constant time and would be O(1).

[0] https://simple.wikipedia.org/wiki/Planck_time


It copies the pointer to the data and increments the reference count. When you modify a string it checks the count and copies it prior to modification if it's not 1.
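
A bare-bones sketch of that mechanism (my illustration: no error handling and, unlike Free Pascal's real implementation, no atomic/thread-safe reference counting):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        size_t refs;   /* how many handles share this buffer */
        size_t len;
        char   data[]; /* string bytes plus trailing '\0' */
    } cow_str;

    cow_str *cow_copy(cow_str *s)        /* "copy" is O(1): bump the count */
    {
        s->refs++;
        return s;
    }

    void cow_make_unique(cow_str **sp)   /* call this before any mutation */
    {
        cow_str *s = *sp;
        if (s->refs == 1)
            return;                      /* sole owner: write in place */
        cow_str *t = malloc(sizeof *t + s->len + 1);
        memcpy(t->data, s->data, s->len + 1);
        t->refs = 1;
        t->len = s->len;
        s->refs--;
        *sp = t;
    }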


Only if you have an operation that actually does something in precisely 0 time.


The Better String Library (aka bstring, not to be confused with COM’s BSTR) is fairly nice:

https://bstring.sourceforge.net/

The string keeps track of the buffer size and how much has been used, allowing allocations to be somewhat amortized. The string buffer itself is zero-terminated for easy interop with code that expects standard C strings.

    struct tagbstring {
        int mlen;              /* allocated buffer length */
        int slen;              /* length of the string actually in use */
        unsigned char * data;  /* zero-terminated buffer */
    };
I used it on a microcontroller where I wanted something small and simple. The main missing feature is the lack of a small-string optimization like some implementations of std::string have. (Before anyone complains about this string type being too inefficient for a microcontroller, I had 1 MB of flash and 192KB of RAM, so I was not super constrained for resources)


Man, I want Ada.Strings.Fixed, Ada.Strings.Bounded, and Ada.Strings.Unbounded in C.


TIL of memccpy() https://www.man7.org/linux/man-pages/man3/memccpy.3.html

To be honest, every time I need to deal with strings in C I feel like I'm banging rocks together, regardless of approach. I try to avoid it at all costs.
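
For anyone else meeting memccpy for the first time, the usual truncating-copy pattern looks roughly like this (dst_size assumed nonzero):

    char *end = memccpy(dst, src, '\0', dst_size);
    if (end == NULL)                /* no '\0' within dst_size bytes: truncated */
        dst[dst_size - 1] = '\0';   /* terminate it ourselves */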


I can never remember the nuances of the 50 various string functions and which shouldn't be used.

What I do remember is that virtually all string problems can be solved with snprintf/asprintf.
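
For instance, the bounded-copy idiom (assuming dst is a char array, so sizeof yields its capacity):

    int needed = snprintf(dst, sizeof dst, "%s", src);
    if (needed < 0 || (size_t)needed >= sizeof dst) {
        /* encoding error, or src didn't fit and was truncated */
    }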


snprintf is a worse strlcpy: not only does it need to call strlen if you pass it a string parameter, it also ends up in ??? territory if the string is long enough because its return type is int.


The printf family of functions also runs a mini-interpreter that has its cost, because its main use is interpolation (the % placeholders). Some compilers can substitute them for more efficient versions (e.g. printf without any % in the format string -> puts). I don't know if they can detect and substitute an snprintf with a "%s%s" format string.


Ah, good old “what if my string exceeds two gigabytes” dilemma.


> and which shouldn't be used.

my rule of thumb is that if its name begins with str then it shouldn't be used.


If you decide to use these functions, beware of the sharp edges. Read the documentation. Read the documentation for the specific version, platform, and compiler you are using. Between different CRTs these things can act differently even though they say they do the same thing.


The standard library string handling stuff is atrocious and it surprises me that wholesale replacement of that stuff isn't more common.


It is a long, time-honored tradition to attempt to improve on flawed standard library functions with equally flawed functions.


If I proposed a new strxyzcpy function that only null-terminated the string when the length was not a prime number and that wiped your hard drive if the destination string, before the copy, contained the sequence 'xyz' in ascii I would be very afraid someone in the C committee would think it would be a nice idea to add it.


Does the length take locale into consideration?


FFS I had some code for a microcontroller break because locale.


Don’t forget than if it’s invoked at the 38th second of a minute, the behavior is undefined.


Oh, are those 0-indexed seconds or 1-indexed?


0-indexed if running on a Unix-like OS, 1-indexed otherwise.


Does Linux count as Unix-like? What if it’s NixOS?


Implementation defined.



A great tweet, and often useful. But I don't think it applies well here.

There is a long history of "this time, the strcpy replacement will be safe" followed by cve after cve after cve. At some point I feel that the response really should be to give up on trying to make c-style strings safe.


The problem is the function signatures of all the improved string functions are broken. You can never write a safe string function that takes two char pointers.

You really want

  int str_try_copy(str_buffer *dest, str_slice *src)
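
A sketch of what those hypothetical types might look like (the fields and the return convention here are my assumptions, not an established API):

    #include <stddef.h>
    #include <string.h>

    typedef struct { const char *ptr; size_t len; } str_slice;
    typedef struct { char *ptr; size_t len, cap; } str_buffer;

    /* returns 0 on success, -1 if src does not fit in dest's capacity */
    int str_try_copy(str_buffer *dest, str_slice *src)
    {
        if (src->len > dest->cap)
            return -1;
        memcpy(dest->ptr, src->ptr, src->len);
        dest->len = src->len;
        return 0;
    }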


You are more than welcome to give up on it. Personally I feel that you can never make code that handles C strings secure. That said, people will still be using them decades from now, and it is possible to give them safer APIs. Discounting everything as "oh it can never be perfect therefore we might as well throw away any efforts to make it better" is not helpful.


Also relevant: https://nullprogram.com/blog/2021/07/30/ (it references this blog post and offers good solutions)


It's not very clear to me why the paragraph "Truncation matters" claims that the strlen variant is necessarily better than the strlcpy variant. The strlcpy variant only scans the source and destination strings once in the fast case (no reallocation needed), while the strlen variant needs to scan the source string at least twice. I guess in the common case you have to enlarge the destination a few times, then once it's big enough you don't have to enlarge it anymore and always hit the fast case, so it makes sense to optimize for that.

It might also be that in some programs with different access patterns that doesn't happen and it makes sense to optimize for the slow case, sure, but the author should acknowledge that variability instead of being adamant about what's better, even to the point of calling a solution they don't understand "schizo". In my experience the pattern of optimizing the fast path makes a lot of sense.

BTW, the strlcpy/"schizo" variant could stand some improvement: realloc() already copies the part of the string within the original size of the buffer, so you can start copying at that point. Also, once you know that the destination is big enough to receive a full copy of the source you can use good old strcpy(). Cargo cult and random "linters"/"static checkers" will tell you shouldn't, but you know that it's a perfectly fine function to call once you've ensured that its prerequisites are satisfied.


C can add a whole alphabet of str?cpy functions, and they will all have issues, because the language lacks the expressive power to build a safe abstraction. It's all ad-hoc juggling of buffers without a reliably tracked size.


The language is expressive enough to have a good string library. It has string.h instead because of historical reasons. When it was introduced the requirements for a string library were very different from today's.


Null terminated strings is a relic that should have been recycled the day after it was created.

Any attempts to add more letters to the strx functions is just polishing the turd.


The types that C knows about are the types that the assembly knows about. Strings, especially unicode strings, aren't something that the raw processor knows about (as far as I know). At the machine level, it is all ad-hoc juggling of buffers without a reliably tracked size, until you impose constraints like only use length-prefixed protocols and structures. Where "only" is difficult for humans to achieve. One slip up and wham.


C with its notion of an object, TBAA, and pointer provenance is already disconnected from what the machine is doing.

The portable assembly myth is a recipe for getting Undefined Behavior. C is built on an abstract machine described in the C spec, not any particular machine code or assembly.

Buffers (objects) in C already have an identity and a semantically important length. C just lacks features to keep track of this explicitly and enforce error handling.

Languages exist to provide a more useful abstraction on top of the machine, not to naively mirror everything even where it is unhelpful and dangerous. For example, BCPL did not have pointer types, only integers, because they were the same thing for the CPU. That was a mess that C (mostly) fixed by creating "fictional" types that didn't exist at the assembly level.


The abstract machine of C is defined with careful understanding of what real CPUs do.


In need of a copy of ISO C printout?


The people who define C's abstract machine are well aware of what real hardware is like. The standard of course doesn't mention real hardware, but what is in there is guided by real hardware behaviour. They add to the spec when a change would aid real implementation.


How does AVX512 guide ISO C?


The committee members have been aware of SIMD for a long time and have been asked about it. So far they have either not agreed, or seen no need because autovectorization has shown much promise without it (both of the above are true, though not always to the same people).

Multi-core is where languages have had to change, because the language model of 1970 wasn't good enough.


It was an example among many others.

How does (FPGA, HiLow, Harvard, CUDA, MPI, ...) guide ISO C?


How should they? In some cases they have decided that isn't where they want C to go; in others the model of 1970 is still good enough; and in others they are being slow (possibly intentionally, to not make a mistake).


So C isn't about being designed close to the hardware after all.


it is designed with careful understanding of real hardware. However that is not close to any particular hardware.


You must be thinking of C++? There is no object in C, just structs, which are just a little bit of organization of contiguous memory. C exists to make writing portable CPU-level software easier than assembler. It was astonishingly successful at this niche; many more people could write printer drivers. While ptr types may not formally exist in assembly, the variety of addressing modes using registers or locations that are also integers has a natural resonance with ptr and array types.

I would say C precisely likes to mirror everything even where it is unhelpful and dangerous. The spirit is captured in the Hole Hawg article: http://www.team.net/mjb/hawg.html

It is the same sort of fun one has with self modifying code (JIT compilers) or setting 1 to have a value of 2 in Python.

ed: https://en.cppreference.com/w/c/language/object is what is being referred to. I'm still pretty sure in the 80s and 90s people thought of and used C as a portable low-level language, which is why things like Python and Linux were written in C.


C has objects in the context of how the Abstract C Machine is defined by ISO C standard, nothing to do with OOP.


Assembly only knows about raw bytes, nothing else.


Depends on the assembly, but even most (all?) RISC instruction sets know about words (and probably half-words too) in addition to bytes.


Pairs, quads and octets of bytes.


There's also vector instructions.


Operating on streams of bytes, defined by registers.


The context was this comment:

> Depends on the assembly, but even most (all?) RISC instruction sets know about words (and probably half-words too) in addition to bytes.

Of course, you could define words and half-words in terms of bytes, too. Just as you can do with vectors.

And many vector instructions operate more on eg streams of words than streams of bytes.


Nope, the context was:

"The types that C knows about are the types that the assembly knows about."


The main issue, which the article covers, is that there's really two different operations you want with copying C strings.

Do you want to copy and truncate, or just copy?

Within that, do you want to manage your own allocation, or do you want that abstracted?

There's too many decision points and tradeoffs to just neatly hide behind a single "one true function" for copying C strings.


Is it bad that for all application uses, I reach for `asprintf`?

As well as reaching for the %ms format for scanf for reading input.

For buffers, I use memcpy and length tracking. Any other approach seems like unnecessary headache. Or maybe modern hardware has spoiled me?
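
For reference, that pattern looks something like this (asprintf is a GNU/BSD extension only recently standardized; name and id here are placeholders):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>

    char *msg = NULL;   /* name and id are assumed to be defined elsewhere */
    if (asprintf(&msg, "user=%s id=%d", name, id) == -1)
        msg = NULL;     /* allocation failed; msg contents are undefined */
    /* ... use msg ... */
    free(msg);

    char *word = NULL;  /* %ms makes scanf allocate the buffer itself */
    if (scanf("%ms", &word) == 1) {
        /* ... use word ... */
        free(word);
    }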


C's string handling is an abomination. Null terminated strings is and always has been a colossal mistake. The C standard and operating systems need to be updated such that null-terminated strings are deprecated and all APIs take a string_view/slice/whatever struct.


I’m always glad to see more people coming around to the fact that memccpy is the actual function they want, not these needlessly inefficient nonstandard garbage functions that everyone flocks to anyways for “security”.


I just wish it also had a src_size. As it is, if dest is bigger than src, and src isn't null-terminated, it'll read past the end of src. You can make a MIN macro and use MIN(dest_size, src_size), but shouldn't have to IMO.
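
That is, something like this wrapper (a sketch; dest_size and src_size as in the comment above):

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* stop at '\0', the end of dest, or the end of src, whichever comes first */
    memccpy(dest, src, '\0', MIN(dest_size, src_size));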


That’s exactly my experience with most of C. You go through a history of half-assed changes and realize that you’re still halfway there in $current_year, wondering how many years it will take to fix any of these obvious flaws. The committee surely has its reasons but on the surface it seems completely non-practicing, to say the least.


strlcpy() is now part of POSIX: https://sortix.org/blog/posix-2024/.


i guess that's what spikes interest. although strlcpy was first introduced in openbsd 2.4, 26 years ago! back then as a drop-in replacement for strcpy.

so yeah, good things need time to adopt, no wonder it's not up-to-date tech, lol.

and because of NIH syndrome we now have lots of strXcpy functions to choose from.


Aside from the NULL-termination requirement there is arguably another big design issue with libc strings. I believe the interfaces that may allocate memory must give you an opportunity to override the allocator. Aside from the SIMD implementation quality and throughput on Arm, that was one of the key reasons to start a new library: https://github.com/ashvardanian/StringZilla/blob/91d0a1a02fa...

Also not a huge fan of locale controls and wchar APIs :)


> In other words, there's nothing "improper", "bad practice", "code-smell" or whatever with using the mem* family of functions for operating on strings, because strings are just an array of (null-terminated) characters afterall.

So refreshing to see a common-sense take in a world of shrill low-level programming alarmists.


Here is my attempt at strings in C (and other stuff).

https://github.com/uecker/noplate

(attention: this is experimental and incomplete for trying ideas and is subject to change.)


I feel like this person went through an unnecessary and false tangent about how people are "afraid" of memcpy due to inexperience and missed the much more important criticism that arbitrary, naive truncation on a byte level doesn't play well with Unicode.


Do the str* functions handle Unicode any better though? I feel like you'd want a different library for that


that would be wcslcpy(3)


Nope. Unicode correctness is much more complicated than switching char to wchar_t. Many glyphs take multiple codepoints or multiple character units in either utf-16 or utf-8. You need something like the break iterator "character" separator in ICU.


Over time, for my needs, I've gravitated back to fixed-size buffers. There are many apps where it really doesn't matter that they handle any string ever without truncation. A string is too long and won't fit? Whoops, just use a shorter one.


Same here. It has become easier and more defensible to do so today with gobs of RAM, even in the embedded space (somewhat).

You no longer need to worry about running out of stack space, even in a deep call-stack with each function having its own little set of buffers for strings.


In cross platform environments, it gets horrible when one does something like:

#define strlcpy strncpy


okay but that's deliberate sabotage


I’m late to the party but this mostly rehashes already voiced concerns with all the existing “updated” strcpy functions. But what I was surprised to learn is that strdup wasn’t part of the C language spec (until now)!


strlcpy is the worst C string routine except for all the others.


TFA misses the entire point of strlcpy, which is to improve security by making your code less prone to common C programmer errors that are known causes of common exploits. The author’s suggested remedies reintroduce the potential for those vulnerabilities.


Real men use memcpy


If you know the length (and that it fits)

mempcpy(), then add the terminator yourself to be sure. memcpy has the slightly annoying flaw of returning a copy of the dest argument (already-known data) rather than the more useful new information: a pointer to the end of the set memory.

  #ifndef _GNU_SOURCE
  #define mempcpy(X, Y, Z) (void *)(((char *)memcpy(X, Y, Z) - 1) + (Z))
  #endif // Not quite the same but close enough to mempcpy; fails for Z = 0
edited: Since X shouldn't be NULL and thus not 0, apply the subtraction first to avoid the corner case (which also shouldn't happen given common HW / OS regions) of X + Z > (~(size_t)0) (SIZE_MAX)


I remembered the return value incorrectly. It's even better in that it returns a pointer to the byte immediately AFTER the last written. So remove the - 1. PS: there's also a version which works with wchar_t types (be it 2, 4 or whatever else)

  #define mempcpy(X, Y, Z) (void *)((char *)memcpy(X, Y, Z) + (Z))
  #define wmempcpy(X, Y, Z) (void *)((char *)memcpy(X, Y, (Z) * sizeof(wchar_t)) + (Z) * sizeof(wchar_t))
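
The end-pointer return is what makes chained appends pleasant; a sketch (name and name_len are placeholders, and everything is assumed to fit in buf):

    char buf[64];
    char *p = buf;
    p = mempcpy(p, "Hello, ", 7);
    p = mempcpy(p, name, name_len);
    *p = '\0';   /* p already points one past the last byte written */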


"I like all my string copy's equally."

CUT TO:

"I'm not a fan of strlcpy(3)"


And that's why I'm not a fan of C for writing any kind of program: Every time you need to touch a string, which is an effortless task in any semi-modern language, you aim a trillion footguns at yourself and all your users.

It's also pretty telling that every article that tries to explain how to safely copy or concat strings in C, like this one, only ever works with ASCII, no attempt whatsoever to handle UTF-8 and keep code points together, let alone grapheme clusters. No wonder almost all C software has problems with non-English strings...


In my opinion, and as a general rule, you can't really ever truncate a user-facing string correctly because it depends on language specifics. Hence my suggestion is not to truncate user-facing strings - in fact you may want to treat them as binary blobs. On the other hand, as a matter of practicality sometimes you may have to, but certainly avoid doing so before the string hits the presentation layer, and know that the result may not be language-correct.

There are many strings that are not user facing, which you expect to be of a certain nature, e.g. ASCII based protocols, and therefore you know what to do with them.

So the multi-byte situation isn't really relevant to strcpy, or std::string, or any other "standard" string function, as it's some other library's duty.

The task of truncating and otherwise formatting UI strings is the preserve of the rendering layer(s).


Yes, truncating a user facing string requires more consideration regardless of programming language. For example do you truncate at the grapheme, word, sentence, new line, paragraph, or something else? How do you indicate truncation to the user, with an ellipsis perhaps? If you use an ellipsis is it appended to the original string or drawn by the GUI toolkit?

Note that the Unicode grapheme cluster, word, sentence, and line break algorithms are locale specific. Now consider how often programmers casually truncate strings, even in high-level languages, without accounting for the locale.


> I'm not a fan of C for writing any kind of program

C is okay — not fan-worthy, but okay — for one specific kind of program: a mostly-portable implementation of a real language in which to write every other kind of program. that’s not nothing: implementing a backend for every CPU-OS pair one wants to support is a pain, and a different skillset from writing, say, a web browser or text editor.

I wonder how much C has held back the development of computing.


C is fine. Yes, it has undefined behaviors and encourages subtle thinkos. But it’s a tool and sometimes it’s the right tool.

It’s the C hackers who insist on using it for inappropriate applications who create the problems.

Anything string-heavy and internationalized is a terrible place for C, even if you have spent 30 years dancing through the C minefield and are sure you’ll get it right this time.


The suckless.org initiative is a great example of this. Boasting about how certain software inherently sucks less because it has minimalist C code, rather than having defense-in-depth against security holes.

Newsflash, a few decades ago the majority of software was the minimalist C that they claim sucks less. There are many reasons we moved on from that world, security firmly among them.


> I wonder how much C has held back the development of computing.

Lisp is older than C. It didn't just hold computing back, it actively walked backwards.


Well, C doesn't really have strings, just pointers. Calling them "strings" is just an abstraction for us. They don't even have to be null-terminated (even though by convention they often are)!

And this is precisely what I want, in some cases. I use C when I need low-level byte-wrangling code that tightly interfaces with the operating system, or when I can't afford to allocate memory at will (like all effortless string-handling languages do under the hood), or when I don't want any runtime and need my code to behave like a shellcode, basically, or...

I guess C is still used because people find it useful. No need to be a fan of it.


> I can't afford to allocate memory at will

This was my mantra for the last 30 years. But now, even the tiniest thing has 1GB, linux and virtual memory. Except for very edge cases, this excuse is rapidly dying


But the waste adds up. There are a few projects at work whose integration tests I can't run on my work machine because it doesn't have enough ram (16GB) so I run them on my home PC instead. And that's even when I close IntelliJ, my web browsers, and kill all the background processes I can.


When I browse my local electronics web shops I see a lot of devices that couldn't boot a Linux kernel, some of which I wonder whether I could drive with a small consumer grade solar panel.


By convention: if it's not NUL-terminated, it's not a C string.


In defence of C, it should be oblivious to different string encodings (and I thought it was).

Coreutils should continue to do their job, and not need to be recompiled against whatever encoding is in fashion.


The standards, as usual for C, are paper-thin, and don't even guarantee ordinal values for the characters specified (except 0-9), despite providing strcmp. But GCC/MSVC do add some options on top of the standard, which include some attempt at interpreting different encodings.

Then again, I was surprised to find support for literal encodings in C23. So perhaps my knowledge was even more outdated than I anticipated.


> It's also pretty telling that every article that tries to explain how to safely copy or concat strings in C, like this one, only ever works with ASCII, no attempt whatsoever to handle UTF-8 and keep code points together, let alone grapheme clusters.

Can you expand on this? Why does it matter to keep code points and grapheme clusters together in the case of truncation? If you’re already truncating the string, then you can just copy as many bytes as possible. Then, later when you interpret that string, you’ll hit a malformed codepoint and ignore it. I guess what you might be getting at is that if you have a codepoint sequence, then you shouldn’t copy the bytes if they can’t all fit in the truncated string?

I feel like this is an edge case and not “the reason almost all C software has problems with non-English strings”. 99% of the time, copying up until the null byte is fine, whether or not the string is UTF-8 or ASCII. The reason most C software doesn’t work with non-English strings is because the developer never added support. The bytes are still there, they just need to be interpreted correctly in the UI portions of the code.


Truncating mid-codepoint produces an invalid utf8 sequence, which some decoders will silently ignore or replace with <?> glyphs, and others will fail loudly.

Truncating mid-grapheme-cluster can change semantics in unexpected ways. For example, the "family" emoji(s) can be encoded as a set of conjoined codepoints for each family member: https://stackoverflow.com/questions/49958287/printing-family.... Depending on where you truncate, you might just get the father, which probably won't cause you any serious issues but it might cause confusion depending on context.

(IMHO, if you're truncating a way where this would matter, you should probably be truncating in screen-space, i.e., way higher up the stack)
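
A common mitigation for the mid-codepoint case, for what it's worth (a sketch; it does nothing for grapheme clusters, which as noted really belong to the rendering layer):

    /* after truncating to n bytes, back up past any UTF-8 continuation
       bytes (10xxxxxx) so the string never ends mid-codepoint */
    while (n > 0 && ((unsigned char)dst[n - 1] & 0xC0) == 0x80)
        n--;
    dst[n] = '\0';   /* assumes dst has room for n + 1 bytes */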


Yep, this all makes sense and is what I was thinking as well. I also think you’re right about truncation being a better idea in the rendering state rather than in the bytes themselves. Either way, truncation of strings being displayed to the user requires extra care no matter what language you’re using.

As far as this being the reason most C code doesn’t handle UTF-8 though, I’m still skeptical.


Honestly, any language where you are manipulating strings is a footgun. Especially if you are manipulating text, and not character data. You can avoid many errors, but almost certainly not all. And as soon as you have someone that thinks they can piecewise localize any text, heaven help you.


With the up and coming browser LLM API, you could just have the LLM translate everything on the frontend. One of the few tasks that LLMs are consistently good at.


I feel this successfully upgrades a gun into a cannon?


A modern artillery gun, where each shell costs upwards of $1K apiece, and lord have mercy on you if it ever goes off inside the barrel.


LLMs are not “consistently good” at translating UI text, because a proper translation requires understanding the application and the application context of each piece of text.


Yes. Context is very important. Otherwise, you end up compressing files into a "postcode file".[0]

[0]: https://news.ycombinator.com/item?id=36231313


Not that C was wise with strings, but slicing strings in general is a bad idea unless you’re doing it strictly at ASCII borders, e.g. to capitalize a Windows disk letter or to split by cr?lf. Modern text is yet another “falsehoods programmers believe in” material.


I mean just use bstrlib why is it difficult?


Because no one can agree on which string library / abstraction to use, if they even use one at all which is rare.


I'd be happy if developers did that, but they don't:

So many C developers seem to think they're "smart enough" to handle strings correctly using only the standard library, not needing "training wheels" like the plebs - but then where do the countless string-related memory safety errors and all the broken Unicode handling come from that plague every C program, no matter how clever the programmers, no matter how tough the review process, no matter how much static analysis is used?

The issue is that many C developers simply can't seem to admit that C string handling, as it was invented many decades ago, was simply fundamentally flawed. So easy to misuse with catastrophic results, even for plain 1-byte-strings. It has nothing to do with how "smart" you are or how careful you are.


Are there really people defending string manipulation in C? C is a slight abstraction on top of assembly. Assembly speaks in addresses and bytes. It's always been bad at trying to interpret human text because Assembly's job lies in the digital realm.

I think it's more a case that programming is English skewed and localization concerns are rare for the programmers these day who still need to work in the realm of C (so, mostly for older legacy software or the embedded realm).


It doesn't matter if anyone is defending it - C developers still use C strings and the broken C standard library's string functions (exclusively) rather than using one of the many good string libraries they have available.

The result is that the same mistakes with string handling get repeated again and again, often with catastrophic results.


Developers do a lot of suboptimal things for suboptimal reasons. Ultimately what's "right" or "wrong" is decided case by case. So I don't see much point in venting against those developers without knowing their circumstances.

I mostly ask who's defending it for holistic purposes. Right or wrong, It would be nice to hear other perspectives on the matter if possible.


Smart is not the issue. I'm smart enough, but I'm lazy and forgetful, and C makes that bite me once in a while.


I mean just use C++ w/ std::string why is it difficult?


Most C programs compile and work correctly when compiled as C++. The exceptions are often places where you shouldn't write code like that anyway, sometimes even outright bugs that C++'s stricter type system caught for you.


std::u8string


No, std::string.


> like this one, only ever works with ASCII, no attempt whatsoever to handle UTF-8

I mean if your viewpoint is to handle heavy UIs then yes maybe C is not your go-to language. However, there are other solutions to the character space problem besides UTF-8. Rather than complaining that your screwdriver can't bang nails, perhaps try using a screw?

No, you won't be able to support a language like Chinese within a 256-character charset, but perhaps it's not important to in all situations. The strength of ASCII is that it's small and simple. If that's too anglo-centric for you, maybe there's a better symbolic tiny charset that is more applicable. I'd support something like https://lojban.io/ becoming more prevalent, but there could be advantages to having a simple(r) symbolic transmission language not burdened with anglo-centric concepts.


Advocating a new global constructed language to accommodate C's shortcomings seems like the wrong direction to be thinking in.


Less a global constructed language, more a "better" encoding. Base64 works really well for arbitrary binary-in-text encoding, for instance.


Unicode is that better encoding. The "small and efficient per locale encoding" that you proposed was the status quo, and was an endless source of mojibake. There is a reason we moved away from that.


I think there is a misunderstanding, which I tried to address but evidentally failed.

UTF-8 is fine for a display encoding. However, not every string encoding need be a display encoding, which the parent post seems to not be considering.

You could also have multiple display encodings, if it makes sense to (a tool only intended for use in a certain part of the world for instance), however that is not what I mean.


The code in the post does indeed have a bug if the goal is to copy UTF-8 data. However the bug is quite subtle and unlikely to occur in practice, and from your comment I suspect you don’t actually know what that problem is (because you seem to be pointing at a different issue which is actually irrelevant). Maybe you or someone else can describe what the real concern is?


> However the bug is quite subtle and unlikely to occur in practice,

Unlikely to occur? Do you think that will put the naysayers at ease?

> , and from your comment I suspect you don’t actually know what that problem is (because you seem to be pointing at a different issue which is actually irrelevant). Maybe you or someone else can describe what the real concern is?

Amazing that people have the patience for this.


¯\_(ツ)_/¯


Partially copying a 2-byte character?


UTF-8 works fine if you truncate a codepoint because the encoding scheme lets you detect this. The problem is more subtle than that (hint: it involves a 1-byte codepoint).


Truncating a UTF-8 codepoint is not fine because most software is not tested with partially broken UTF-8 so international users will likely run into many bugs.

Especially because concatenation is a very common operation so those sliced codepoints will be everywhere, including in the middle of text.


Morally I view “what do I do with my truncated string” to be a separate issue from “how do I truncate the string” as described in the article. Like, yes, you absolutely should not concatenate after doing this operation. But maybe you shouldn’t be showing the user a truncated string either even if it’s all ASCII. The question of “did you make an unparseable UTF-8 string” is answered with “no” and the more complicated but also more interesting question of “did you actually want this” remains unanswered.


This is fair, the article takes truncating a string to fit in a status bar as an example.


Also consider Unicode is not only international characters, but superscripts and other stuff ♥ᵃ

a: there was a list somewhere of which characters Hacker News allows?


If you're alluding to NUL, I don't really see the issue?

Yes, many languages allow strings (UTF-8 or otherwise) to contain null bytes, and C's str*() functions obviously do not, but null-termination vs not is an orthogonal issue to ASCII vs UTF-8.

i.e. Yes it's (depending on context) an issue that C str*() cannot handle strings with embedded null bytes, but that's not a UTF-8-specific issue.


A function that can turn a correctly formatted UTF-8 string into a malformed UTF-8 string is, in my opinion, broken.


One problem here is that the string may not have been a correctly formatted UTF-8 string to begin with. No, not that it can contain any bytes; I mean, it might be expected to satisfy even more than just decoding correctly. Maybe it is supposed to have its grapheme clusters preserved. Maybe the truncation should peel off the last file component because the string holds a path. The operation of "doing a dumb truncation" can be broken if you look at it from plenty of angles, and I don't disagree with you, but I do want to make clear that the issue isn't that memcpy is breaking it; it's that if you need x, y, z, maybe you're reaching for the wrong tool. And conversely there is nothing inherently wrong with using it if you are going to use it in a way that is resilient to that kind of truncation.


What about a function that can turn a correctly spelled english sentence into a malformed english sentence? If you truncate to a fixed length this comes with the territory.


The null code point? That would be pedantic even by my standards.


Look I had to include it or someone is going to do a whole pedantic comment about how C can’t actually represent UTF-8 correctly


You could have just said it rather than going through this smug "I know something you don't know!" song and dance.

Also, by this rationale, NO string is ever safe in C, because pretty much every encoding technically supports codepoint 0 (even though you take your life into your hands should you ever use it). This is not a useful discussion.


I mean strings that don't use that codepoint are fine, and that's most strings as I mentioned above.


Actually, by just alluding to the bug without saying it explicitly, you managed to both be pedantic and not avoid the discussion.

This is not meant as a personal attack; I just want to point out how it looks on a casual reading :)


Well, I'm not perfect ;)


By that metric, C can't represent ASCII correctly either, because there's no particular reason you couldn't have a NUL character somewhere inside a string.


Indeed it can't. Many developers were bitten by this, and still are; plenty of critical bugs and security vulnerabilities rely on this quirk too.


Technically, C can. It's just C strings that are limited.


Sure, in the exact same way that C can handle unicode just fine too. The problem is, as always, C strings.


>hint: it involves a 1-byte codepoint

Which is?


There is also strscpy [1], which behaves like the author's use of memccpy, except the former doesn't require manually passing the null terminator as an argument.

[1] https://manpages.debian.org/testing/linux-manual-4.8/strscpy...


C programmers like their overly concise naming schemes, don’t they… How do you expect a user to guess the subtle differences in behavior between strncpy, strlcpy and strscpy?

Sure screens were smaller before and you didn’t have autocomplete, but reaching out for documentation was also harder, and there was less of it. I guess “real programmers” memorize libc down to every detail and never use dependencies they didn’t write themselves.


In the 90s, as a working C programmer, I would make it a practice to read thru man 2 and man 3 from time to time, as well as to read all the options for gcc. The man pages were pretty good at covering the differences, if not all the implications of the differences.

And to this day I strive to minimize the dependencies of my software - any dependency you have is an essentially unbounded cost on the maintenance side. I used to sign up for the security mailing list for all the dependencies, and would expect to have to follow all the drama and politics for the dependencies in order to have context on when to upgrade and when not to. With Go I don't do that, but with Python I sometimes still do. And I've been reading the main R support email list forever. I still think that a periodic reading of lwn.net is an essential part of being a responsible programmer.

I will also say, reading the article reminds me of the terror of strings in C without a good library and set of conventions, and why Go (and presumably all other modern languages) is so much more relaxing to use. The place I worked had everything length-prefixed, and by the end of the 90s all the network parsing was code-generated from protocol description files, to remove the need for manually writing string copying code.


I was somewhat dismissive, but I agree that this way of thinking about dependencies is the right approach for systems programming. And it is fair to expect that users will read the manual in detail for any tool or library they adopt in the contexts where C is used.

It's just a bit frustrating to deal with so many names that are hard to understand and remember. C-style naming forces you to refer to the docs more often, and the docs are usually more sparse and less accessible than in other ecosystems. Man pages are relatively robust and they were a delight back in the day, but they have not been the gold standard for decades, and the documentation conventions for third-party libraries tend to be quite weak.


A distressing number of software engineers have overly accurate memories and don't notice when things become excessively cryptic or arcane.

However, the implementations being much more open source now means a lot of bad documentation can be overcome with code reading or, if needed, stepping through the code with a debugger. Wrong documentation is still expensive. I have a bitter taste in my mouth from integrating with the OpenTelemetry Go libraries. It seems to be sorted now in 1.27 and 1.28, but in 1.24 and for a few versions after, the docs were wrong, the examples were not transferable, and it took 5x the time it should have.


> The place I worked have everything length prefixed and by the end of the 90s all the network parsing was code generated from protocol description files, to remove the need for manually writing string copying code

Would you expound on this please? I've seen people reserve the first element as the size of the buffer but then, say a char buffer, has a size limit based on the size of char itself. What advice would you give yourself if you were just becoming a senior developer today? I do embedded myself


For protocols, you might be worrying about the bytes and use a 1-, 2-, 4- or 8-byte length, or some complicated variable-length integer like ASN.1 or something homegrown (but actually, with the slower networks, if you had a 2M msg, you'd split it into 64k chunks that fit into your 16-bit length-prefixed protocol and stream it). For the parsed data, probably you'd do size_t or ssize_t and Byte[], for portability. The standard parsing API would pass in (buf, buf_len) (or (pos, buf, buf_len)), and the standard struct would be (size_t name_len, Byte *name) (where Byte was typedef'd to unsigned char, so you wouldn't pick up signed char by mistake).
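
A minimal sketch of that kind of parsing API, assuming a 2-byte big-endian length prefix (the names and the error convention are illustrative, not the original code):

    #include <stddef.h>
    #include <sys/types.h>   /* ssize_t */

    typedef unsigned char Byte;

    /* parse one length-prefixed field starting at pos; returns the new
       position, or -1 if the input is malformed or truncated */
    ssize_t parse_field(size_t pos, const Byte *buf, size_t buf_len,
                        size_t *name_len, const Byte **name)
    {
        if (pos > buf_len || buf_len - pos < 2)
            return -1;                       /* no room for the prefix */
        size_t len = ((size_t)buf[pos] << 8) | buf[pos + 1];
        pos += 2;
        if (buf_len - pos < len)
            return -1;                       /* body runs past the buffer */
        *name_len = len;
        *name = buf + pos;
        return (ssize_t)(pos + len);
    }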


> C programmers like their overly concise naming schemes don’t they

It's a holdover from 1970s-era linkers, many of which required each exported symbol to be unique within the first six or so characters.


This is helpful, thanks. And I assume that no one imagined that you would need more variants of strcpy to begin with, strncpy is a pretty good name too, and then it's hard to break the pattern. Similar situations with lots of other names.

I just wish that C style used fewer single char names and hard-to-decipher acronyms. Smart programmers can't read minds either...


> like their overly concise naming schemes don’t they

One of the common features of "modern" languages that I can't comprehend is the love of single character sigils that completely change the meaning of a line of code or perhaps the entire enclosing function.

> I guess “real programmers” memorize libc down to every detail

Or, we just use 'man' and the massive 'section 3' of that system. A feature that no other language has managed to correctly replicate.


> How do you expect a user to guess the subtle differences

Software developers should never rely on guessing. Always, always read the specification.


You are right, perhaps "guess" is the wrong term, I meant that it is unreasonably hard to identify and remember which is which.


Usually you settle on one or two “go to” functions, if you don’t write your own wrapper function anyway. Given the subtleties of their semantics, fully descriptive names also seem unachievable. But really, familiarity comes with repeated exposure. If you use these every day, you’ll learn rather quickly.


It's a good point. To be fair, any system that lasts as long as C has will have legacy cruft that you need to learn to step around and it's hard to get rid of. Relative to other standards, C has been remarkably disciplined at keeping things clean.


> How do you expect a user to guess the subtle differences in behavior between strncpy, strlcpy and strscpy?

man (strxcopy)


Yes, but that is an internal function of the Linux kernel.

It is available in user programs only if you define it yourself, e.g. by using memccpy.
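
Such a definition might look like this (a sketch mirroring the kernel's semantics of NUL-terminating on truncation; the kernel returns -E2BIG where this returns -1):

    #include <string.h>
    #include <sys/types.h>   /* ssize_t */

    ssize_t strscpy(char *dest, const char *src, size_t count)
    {
        if (count == 0)
            return -1;
        char *end = memccpy(dest, src, '\0', count);
        if (end != NULL)
            return (end - dest) - 1;   /* length copied, excluding '\0' */
        dest[count - 1] = '\0';        /* truncated: terminate ourselves */
        return -1;
    }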


While the article itself is interesting, could they not have picked a less... offensive word in "I'd like to point out just how schizo this entire logic is"? Like strange, or weird, or unusual


"schizo" is not an offensive word.


Meaning #2: https://www.merriam-webster.com/dictionary/schizophrenic

Many words are used with different and derived meanings, usually disambiguated by context. For example, an offensive player is not offensive in the sense you used the word in.


None of those words convey the same meaning though. "Schizo" doesn't mean weird or unusual in this case. Maybe there's a better word but it really conveys a specific meaning, at least in the colloquial use of the word


How is it offensive?


Perhaps in a similar way as 'paki'.



