I'm not a fan of strlcpy(3) (nrk.neocities.org)
196 points by signa11 on July 15, 2024 | 346 comments


I'd say that, in general, any variable-size format that does not state its length before the variable part invites buffer-overflow bugs. Such formats should thus be avoided in any kind of binary interop.

This holds even if the total length of the data in an interaction is not known ahead of time. E.g. an audio stream can be of indeterminate length, not known when the first byte is sent over the network, but each UDP packet has a well-determined length given in the header.

The length field itself can be made variable-size in a rather fool-proof way [1], making it economical to represent both tiny and huge sizes.

(Zip files, WAD files, etc. have that info at the very end, but this is because a file has a well-defined end before you start appending to it; fseek(fp, 0, SEEK_END) can't miss.)

[1]: http://personal.kent.edu/~sbirch/Music_Production/MP-II/MIDI...
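
Roughly, the MIDI-style variable-length quantity from [1] looks like this (a minimal sketch in C, assuming at most 32-bit lengths): seven payload bits per byte, with the high bit marking "more bytes follow".

  #include <stdint.h>
  #include <stddef.h>

  /* Encode `value` as a variable-length quantity, most significant
     group first (as MIDI does it). Small lengths cost a single byte;
     a uint32_t needs at most five. Returns the byte count. */
  size_t vlq_encode(uint32_t value, uint8_t out[5]) {
      uint8_t tmp[5];
      size_t n = 0;
      do {                              /* collect 7-bit groups, low end first */
          tmp[n++] = value & 0x7F;
          value >>= 7;
      } while (value);
      for (size_t i = 0; i < n; i++)    /* reverse; set continuation bits */
          out[i] = tmp[n - 1 - i] | (i + 1 < n ? 0x80 : 0);
      return n;
  }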


This is one of the main reasons I use C++ as a "better C": I use the string library in embedded (or an embedded-optimized version) all the time, along with basic containers and algos, and "simple" classes and inheritance. Yeah, I know there are C string libraries out there that also solve the problem; I even wrote a couple of my own over the years for the "minimal size to functionality" ratio.


The problem with C++ strings for embedded is the heavy dependence on the heap for many operations and the constant copying of ROM literals into RAM objects. If your target platform is "embedded" with 1GiB+ RAM then you can work in systems programming mode and not care. If it's 32KiB, a heap may be too much of a liability.


I think you can probably avoid unnecessary copies by using std::string_view where possible instead of std::string.


Niklaus Wirth has entered the chat.


There are a whole bunch of other things I dislike about Pascal, but in this one he was just undeniably correct. Worrying about 1-3 extra bytes (on the 16 and 32 bit platforms where it potentially mattered) was just not worth all the issues that null-terminated strings brought with them.

My current favorite string implementations are the various compact string crates for Rust. Generally, you want a string to be able to do at least three things:

- Pointer, length, capacity tuple, for heap-allocated strings, 24B on x64.

- String inlined into the 24B buffer.

- Pointer, length tuple where pointer points to rodata.

You can do any of that and the discriminator in 24B, given the healthy assumption that all strings are shorter than 2^63.
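
In C terms, a sketch of the idea might look like this (the layout is made up; real crates differ in the details, e.g. where the tag bits hide):

  #include <stdint.h>

  /* All three variants share 24 bytes on x64. Because lengths are
     assumed < 2^63, a spare high bit (or a byte of the inline buffer)
     can serve as the discriminator. */
  union str24 {
      struct { char *ptr; uint64_t len; uint64_t cap; } heap;      /* owned  */
      struct { char buf[23]; uint8_t len_and_tag; } small;         /* inline */
      struct { const char *ptr; uint64_t len; uint64_t tag; } lit; /* rodata */
  };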

Sadly, switching costs are massive and every programming language is pretty much stuck with the string it started with. Hopefully, whatever comes next can crib from smartstring or compact_str or the like.


There was a paper on this in the late 70s from the Cedar group at PARC. This was back when computer science papers were actual scientific papers, so full of analysis of different algorithms' performance with counted vs delimited strings. Counted strings won hands down on anything but strings so short the length was a large percentage of overall size.

Yet...since nobody reads the literature, we have all continued to suffer.


The late 70s was way too late for this… null terminated strings were already adopted by C and UNIX by then, and the rest is history.


And sadly, we know that the C folks don't read papers; otherwise they wouldn't have come up with Go later.


> Sadly, switching costs are massive and every programming language is pretty much struck with the string they started with.

Haskell is _almost_ flexible enough to be able to use a different string than the one it started with.

The language itself actually is flexible enough, but many of the libraries are not.

The main thing making Haskell flexible enough is that a literal like "foo" can be statically determined to be the right string type that you want to use. (And that happens at compile time, it's not a runtime conversion.)


Yes except when no.

Imagine you are writing performance sensitive code. You want to get a substring from a string, one that is not going to live outside your hot loop. In standard C you can just reference a part of a string with a pointer offset. All standard functions will continue working and you didn't have to make any calls outside of your loop: not to the allocator, not to memcpy, nothing.

With strings that are objects prefixed by a header, you cannot do this. At a minimum you need to allocate a new header, if not copy the whole string. Yes, that's the safer route, but also a lot less performant.

Most crucially, you can build the header string implementation on top of C strings. You cannot do the opposite.

Realistically though, C strings (aka null terminated strings) are just not a great thing because of the null termination. For my money, I would prefer to just use unterminated arrays and a separate size variable, as well as wide character strings for actual display stuff. This way all the interop must include string lengths (or some other way to determine length), and all internal stuff may be just ASCII but must not leave your internal logic and never be shown to the user.
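
The pointer-offset trick, for the record (a minimal sketch; note it only gives you suffixes for free, since the substring shares the original's terminator):

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char *path = "/usr/local/bin/cc";
      const char *base = strrchr(path, '/');  /* last '/' in the string */
      if (base)
          printf("%s\n", base + 1);  /* "cc": no copy, no allocation */
      return 0;
  }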


> Imagine you are writing performance sensitive code. You want to get a substring from a string, one that is not going to live outside your hot loop. In standard C you can just reference a part of a string with a pointer offset.

If you want your substring to terminate in the same place as the original, at a null terminator. But that sadly is almost never the case, and as many C practitioners know, references like this are often unsafe and so APIs that substring tend to copy. That's just what they have to do to pass address sanitizer and static analysis checks.

If you want arbitrary views on a null terminated string, well, it's no longer null terminated and that's just the start of your problems in C.

In languages like Rust and Go, taking a view of a string or array is safe and doesn't copy the underlying data or require an allocation. So if you are writing performance sensitive code where substrings are a major contributor to CPU cycles, best go with those languages (or C++) rather than C.


That’s fair: you won’t be able to use any libc functions that rely on null termination. But a lot of the time you don’t need to either. Think writing the substring to a socket or comparing it to a known constant.
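
For instance (a sketch; the helper names are made up):

  #include <string.h>
  #include <unistd.h>

  /* Compare an unterminated (ptr, len) view against a known constant. */
  static int view_eq(const char *p, size_t n, const char *lit) {
      return n == strlen(lit) && memcmp(p, lit, n) == 0;
  }

  /* Write a view to a socket: write(2) takes a length anyway. */
  static ssize_t view_send(int fd, const char *p, size_t n) {
      return write(fd, p, n);
  }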


In Rust, you would do both of those with a &str, which works fine. It works exactly as in C, with no calls to memcpy or the allocator or anything. And you would also be able to do all the other things that in C use null termination, too.


The solution in Rust is separate String and &str. &str is a reference to somewhere within String, and the length of the referred to region, and borrows from the String it refers to.

Any function that does not need to modify a String takes a &str. Any function that does modify a String typically takes a String, meaning it consumes its input. (Because of UTF-8, in-place modification is generally a pipedream.)

Also, the headers are typically allocated on the stack. Rust is a lot less shy about types that are larger than a pointer living inline wherever they are used, and this is something that seems to work a lot better than the alternative.


Allocating headers and strings separately blows your CPU cache. Hardly a performant way of doing hot loops.


Compared to calling strlen a bunch, which I’m sure is significantly more performant.


You never need to call strlen unless you are getting your inputs from a place that doesn’t give you a string length (such as stdin).


So which is it, then? Does keeping the size separate "blow your CPU cache"¹ or not? You can't argue it does in one case (Rust) but not in your case…

(And note that the representation you're responding to is not really a "header", in the same sense that the trailing null is a "footer". The representation does not require the length be contiguous with the data, but that's what upthread was trying to say in the first place.)

¹(it doesn't…)


So now you are arguing that by default your strings should come with a length? Great!

If you want that, you might as well bake that length into the string type by default (and use a specialised type, perhaps a naked raw pointer into the string) for when you don't want to pass the length.


That's most interfaces…?


Not argv[].


You still need to call strlen on each element?


To get a correct understanding, if you aren't a Rust person, Rust's String is (literally, though this is opaque) Vec<u8> with checks to ensure it's actually not just arbitrary bytes you're storing but UTF-8 text.

Vec<u8> unlike String has a direct equivalent (well, the Rust design is slightly better from decades of extra experience, but same rough shape) in C++ std::vector<std::byte>

The C++ std::string is this very weird thing that results from having standardized std::string in C++ 98, then changing their mind after implementation experience. So while it's serviceable it's pretty poor and nobody should model what they're doing on this. There have been HN links to articles by people like Raymond Chen on what this type looks like now.


In order to access the string contents in the first place you need the pointer. The length is stored right next to it. So they're both going to be in the same cache line, assuming proper alignment. In the rare case in which they straddle a cache line, you just have to load once and then the length remains in cache for the remainder of the loop. (This is true regardless of where the length lives, in fact; as far as CPU cache is concerned it really makes little difference either way.)

(This is assuming SROA didn't break apart the string and put the length in a register, which it often does, making this entire question moot.)


Huh? The headers are either in registers or in stack. The top of stack is always in L1. There is no way in which this is inferior to handing over a pointer to a string and a length separately, other than requiring two additional words of storage in registers/stack.


How is that? Say you are reading 1000 lines of stdin at once to process them. In which registers are your string and substring headers stored?


If you are reading 1000 lines from stdin at once to separate Strings, you are already going to be accessing memory in 1000 places at the same time, and making it 1001 isn't meaningfully worse for your cache. (Implementation would be Vec<String>, which would lay out the 1000 headers contiguously.)

But I genuinely have a hard time understanding for what kind of workload I would ever do that. If you want to read 1000 lines of stdin, and cannot use an iterator and must work on them at the same time, I would likely much rather read them into a single string and then split that into 1000 &str using the .lines() iterator.


I was miffed at: 1000 lines from stdin. It’s the same problem 1000 times, not 1000 problems at once.


Presumably the idea is, for example, sorting? In which case you do have to read the entire input before you can do anything. But the way I'd do that is to read the entire stdin to a single String, then work with &str pointers to it.


If you really care about performance, you should not allocate within hot loops.


Null terminated strings have a footer, so it is the exact same problem, just on the other end of the string. It is inherently impossible to substring an arbitrary string without copying and using the same memory layout for the full string and the substring(s).

Of course, if your string type is a struct containing a size and a pointer, you can easily have multiple substrings pointing into the same byte array.
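
Something like this, say (a sketch; the names are made up):

  #include <stdio.h>

  typedef struct {
      const char *ptr;
      size_t len;
  } str_slice;

  int main(void) {
      const char *buf = "hello, world";
      /* Two substrings sharing buf's storage: no copies, no allocation. */
      str_slice hello = { buf, 5 };
      str_slice world = { buf + 7, 5 };
      printf("%.*s %.*s\n", (int)hello.len, hello.ptr,
                            (int)world.len, world.ptr);
      return 0;
  }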


> Imagine you are writing performance sensitive code. You want to get a substring from a string, one that is not going to live outside your hot loop.

Zig uses slices for this (and everything else except interop): a pointer and a length, with some nicely ergonomic syntax for making another one, like `slice[start..][0..length]`.

When you're building strings then you have an ArrayList, which keeps a capacity around, and maybe a pointer to an Allocator depending on what you want. It's trivial to get the slice out of this when you need it.

Doing anything useful with a string requires knowing where it is (the pointer) and how much of it you have (the length) so keeping them together in one fat pointer is just good sense. It's remarkable how much easier this is to work with than C strings.


Efficient substring in C? Absolutely. Why don't we see real code? https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/pute...


Yes but that's the rare case.

The rare case should be possible, just not the default.

In Rust, you would make custom string handling unsafe for the bottleneck.


Rare for whom? Doing a lot of kernels or embedded development lately?


Kernel or embedded development is rare compared to web dev, app dev, cli tooling, automation, etc.

In fact, it's pretty damn niche.

And Rust is a general language, so it favors the most common case but lets the niche case stay possible.


The 48-bit address space and 128 bits of return value on System V make (pointer, size, 32 bits of capacity-past-the-end) attractive on x64. Specifically: ptr and size as u64, with the extra capacity stored across the high 16 bits of each of them.
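
A sketch of the packing (assuming user-space pointers whose top 16 bits are zero; x86-64 canonical addresses are sign-extended, so kernel pointers would break this):

  #include <stdint.h>

  typedef struct { uint64_t a, b; } packed_str;  /* returned in two registers */

  static packed_str pack(uint64_t ptr, uint64_t size, uint32_t extra_cap) {
      packed_str s;
      s.a = (ptr  & 0xFFFFFFFFFFFFull) | ((uint64_t)(extra_cap >> 16) << 48);
      s.b = (size & 0xFFFFFFFFFFFFull) | ((uint64_t)(extra_cap & 0xFFFF) << 48);
      return s;
  }

  static uint64_t unpack_ptr(packed_str s)  { return s.a & 0xFFFFFFFFFFFFull; }
  static uint64_t unpack_size(packed_str s) { return s.b & 0xFFFFFFFFFFFFull; }
  static uint32_t unpack_cap(packed_str s) {
      return ((uint32_t)(s.a >> 48) << 16) | (uint32_t)(s.b >> 48);
  }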


What do you do when you have a string that's longer than 2^32 that gets truncated to len=0? Instantly freeing the buffer might not be what the user wants, if they intend to immediately reuse it for an another very long string, for example.

I think that's a pretty bad case of premature optimization, especially because the first CPUs with 57 bit support are now hitting mainstream. Just use 3 words, it's not that much extra space.


Realloc/remap down to 4GB in that case sounds OK to me. More than 4GB allocated from a structure which can't do any resizing seems moderately unlikely, but sure, I guess free is also correct in that case.

Two 64 bit values can be returned in registers on the systemv x64 abi, three get passed as a pointer to stack memory. It's an optimisation but I think it's a valid one.

57 bit address space has been coming any year now for maybe a decade, I'll worry about that when it happens.


Yes, his arrays which have length as an immutable part of their type certainly prevent certain kinds of bugs. Too bad about making it impossible to write generic array-handling subroutines, even if you accept the generally inexpressive type system as a given.


Up to


I'm sure you know this, but just another point for people to keep in mind: using a length+contents representation makes it harder to modify the payload, if needed (more bookkeeping). And using a variable-size length makes that even harder, since you might have to shift or re-copy the full payload to make room for the new "length" header.

Of course, once you're done processing and are sending it along (as in serialization, that you mention), it's not an issue.


The point of strlcpy(3) is not to be the world's best and most efficient string copier. It's to provide a drop-in replacement to previous, memory-unsafe string copy routines in constrained environments where you have to have bounds on stuff and might not have an allocator.

If there are bugs with truncation in the resulting buffer, those are the program's bugs, and they existed before strlcpy(3) came into the picture.


> The point of strlcpy(3) is not to be the world's best and most efficient string copier. It's to provide a drop-in replacement to previous, memory-unsafe string copy routines

It's not a drop-in replacement, though. Not even if you ignore the different return type.

strncpy guarantees that the buffer will be completely overwritten (filling with null chars at the end), while strlcpy will happily leave remnants of whatever was there before.

Just dropping in strlcpy wherever strncpy appears can lead to data leaks or inconsistent hashes, for example, depending on how the buffer's contents are used.
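
Concretely (strlcpy is BSD in origin and now POSIX.1-2024; glibc only gained it recently):

  #include <string.h>

  int main(void) {
      char a[8], b[8];
      memset(a, 'X', sizeof a);
      memset(b, 'X', sizeof b);

      strncpy(a, "hi", sizeof a);
      /* a = 'h','i',0,0,0,0,0,0           -- padded with NULs to the end */

      strlcpy(b, "hi", sizeof b);
      /* b = 'h','i',0,'X','X','X','X','X' -- old bytes left in place */
      return 0;
  }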


That behavior should be totally irrelevant to code bases that are trying to handle C strings properly. If you have some reliance on the content of the buffer after the terminator, you've got problems that the string copy routine cannot help you with.


> That behavior should be totally irrelevant to code bases that are trying to handle C strings properly.

"No true Scotsman..."


It is a problem when people foolishly dump structs and fixed size buffers to storage without proper serialization. If you need that level of performance then you own the consequences.


This. I don't understand the objection and I spent 20 years writing C code. The reason to use strlcpy is not to _fix the bug_ but rather _to prevent the bug from turning into a crash, memory corruption, or exploit_. It also forces discipline by carrying around the length. As you say, it's also a drop-in.

A truncation bug is a hell of a lot easier to debug than memory corruption.


I've worked on embedded RTOS projects where we had our own strlcpy implementation - it's fine. Well, I mean, all the str functions suck because C strings, but that's exactly why sticking to a good shared set of idioms and staying organized is so important. And in C, that means manually tagging buffers with their length, no getting around that. Given that, strlcpy is less bug-prone than strncpy, simply due to requiring fewer lines of code to use correctly per invocation.

I think a lot of the confusion in the C string discourse comes from people thinking they should rely on the NULL termination byte for string length. You really shouldn't, and if you have to do it, you need to be extra careful to check all your assertions that it will be properly terminated. Just carry around the length, and bundle it with the pointer in a struct to pass it around when it makes sense. Not the most ergonomic, but it's C, what can ya do.


It's pretty funny that C strings were decided to be NULL terminated in the ancient past for 'convenience', but it turns out you still need to carry the length around anyway.


Not to defend C strings too hard, but it does make some sort of sense, IMO. You have to manage all your buffers manually in C, whether they contain a string or not. If you store a string of length 5 in a 10-byte buffer, you still need to manage the 10-byte buffer. Raw pointers kept things very flexible and lightweight when C was created.

Nowadays, things like C++ string_view's and Rust str slices handle this for you automatically, but those came around much later and require more sophistication at compile time.


Yes, but it's not that much more sophistication, because C already supports structs. (Though I'm not sure if the first versions of C already had structs?)


> It also forces discipline by carrying around the length.

LOL. It does not force anything - you can mishandle source or destination buffer lengths very easily and compiler won't say anything.

I sometimes wonder what kind of disaster will have to happen to make C programmers agree on a standard buffer (i.e. pointer+size) type with mandatory runtime bounds enforcement...


Force is too strong a word. Yes, it's possible someone just passes whatever, or just passes strlen(s) which is an even dumber answer.


> It's to provide a drop-in replacement to previous, memory-unsafe string copy routines

Nitpick: it’s not quite a drop-in. Prototypes of these functions are

  char * strncpy(char *dst, const char *src, size_t num);
  size_t strlcpy(char *dst, const char *src, size_t num);
strncpy(dst, src, num) always returns dst (https://cplusplus.com/reference/cstring/strncpy/), which is quite useless, as the caller knew that already.

strlcpy(dst, src, num) returns the total length of the string it tried to create (https://www.unix.com/man-page/posix/3/strlcpy/). Callers can use that to detect that the string didn’t fit the buffer and reallocate a buffer that’s long enough.
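
For example (a sketch of the detect-and-retry pattern):

  #include <stdlib.h>
  #include <string.h>

  /* Returns dst on success, a freshly malloc'd copy if src didn't fit
     (caller frees), or NULL on allocation failure. */
  char *copy_or_grow(char *dst, size_t cap, const char *src) {
      size_t need = strlcpy(dst, src, cap);
      if (need < cap)
          return dst;                 /* fit, terminator included */
      char *big = malloc(need + 1);   /* need == strlen(src) */
      if (big)
          strlcpy(big, src, need + 1);
      return big;
  }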


That’s exactly why you shouldn’t be using it: it does a very bad job at that, with behavior basically nobody wants.


But doing that if you have a large ancient C code base is a lot of work.

The reason for the existence of strlcpy isn’t that it is perfect, it’s that it’s the best option with good UX for integration into an existing C code base.


It's not, though. That's the point: the interface it provides is not very good. The API surface for "I have a string here and I want you to put it there but only the first n bytes" is well-defined and can be done in a much better way than what strlcpy does.


The API surface is "I have a string that I'm pretty sure will fit in this buffer, but if it doesn't I don't want to cough up control of my bootloader."


Yep, and that space has room for improvement.


I'm not a fan of Unix man page section numbers in parentheses.

strlcpy is a stopgap, whack-a-mole solution for buffer overflows. It is rationalized by the reasoning that it does not make the program less wrong, while (probably) making it more secure.

When truncation matters and you have a fixed size buffer, that buffer should be large enough in order for it to be justifiable to say that someone is misusing the application. Perhaps a tester trying to break it.

Nobody’s surname needs 128+ bytes. No reasonable URL for a firmware update download needs 4096 bytes.

If truncation matters, no, it does not always make sense to accept a gig of data and be ready for more. You can impose a limit. A violation of the limit is an error, treated like a case of bad input.


People have long surnames, especially if you take multi-byte characters into account. And someone with the same mindset as you is the reason why my profile picture in Google's employee directory never did actually show up in Safari: one of the path components ended up being a couple thousand characters long (I wonder if it was literally a base64 encoding of the image itself?)


No, my reasoning does not say that a browser shouldn't handle a long URL (which is a substring of a page it has already accepted and rendered).


I fail to see the difference here?


Out of curiosity, would you mind sharing some details about your surname?

E.g., its length in its native alphabet, or its length as a UTF-8 string?


  $ swift
  Welcome to Apple Swift version 6.0 (swiftlang-6.0.0.5.15 clang-1600.0.22.6).
  Type :help for assistance.
    1> "झा".count
  $R0: Int = 1
    2> "झा".utf8.count
  $R1: Int = 6


> Nobody’s surname needs 128+ bytes.

https://en.wikipedia.org/wiki/Hubert_Blaine_Wolfeschlegelste.... would beg to differ.


A couple jobs ago, I worked on writing an API client for a CRM system that supported 2GB for most of the text fields (name, address line 1, job title, etc...). It also offered up 99 "custom field" text fields, also allowing up to 2GB each.

I'd considered base64-encoding my ripped DVD collection, and using them to store another backup copy for me.


He would be too busy begging numerous agencies to handle his name to beg to differ with you.

Is there a picture of the ID page of that man's passport? Or of a driver's license or similar?

Whatever is on that is his actual name.


I love that the guy's occupation was "typesetter".


I'm not sure his name is long enough for this to be an issue.


The URL uses a shortened version, his full name on the page is:

Adolph Blaine Charles David Earl Frederick Gerald Hubert Irvin John Kenneth Lloyd Martin Nero Oliver Paul Quincy Randolph Sherman Thomas Uncas Victor William Xerxes Yancy Zeus Wolfeschlegel­steinhausen­bergerdorff­welche­vor­altern­waren­gewissenhaft­schafers­wessen­schafe­waren­wohl­gepflege­und­sorgfaltigkeit­beschutzen­vor­angreifen­durch­ihr­raubgierig­feinde­welche­vor­altern­zwolfhundert­tausend­jahres­voran­die­erscheinen­von­der­erste­erdemensch­der­raumschiff­genacht­mit­tungstein­und­sieben­iridium­elektrisch­motors­gebrauch­licht­als­sein­ursprung­von­kraft­gestart­sein­lange­fahrt­hinzwischen­sternartig­raum­auf­der­suchen­nachbarschaft­der­stern­welche­gehabt­bewohnbar­planeten­kreise­drehen­sich­und­wohin­der­neue­rasse­von­verstandig­menschlichkeit­konnte­fortpflanzen­und­sich­erfreuen­an­lebenslanglich­freude­und­ruhe­mit­nicht­ein­furcht­vor­angreifen­vor­anderer­intelligent­geschopfs­von­hinzwischen­sternartig­raum Sr.


He should have added a few numbers in there to really throw off some login systems.


And spaces.


Active Directory would cry.


The exception proves the rule. There is no way to read the description of the translation of the surname without coming to the conclusion that this name was chosen for mischief, outside of the bounds of reasonable human societal expectations. Mischief is fine and all, but there is no authentic "gotcha" compulsory requirement for society to accommodate the chaos resulting from such a mischievous personal preference. (And I say this as someone who has legally changed his name twice.)

tl;dr Make your bed; lie in it.


For conventions of the English language, this is an exception, but not in other languages. Arabic names have up to five components. A 128-byte limit leaves an average of 25 bytes for each component. Consider that Arabic script in UTF-8 consumes 2 bytes per glyph, and graphemes in Arabic are compositions of these glyphs.

For a more extreme example, consider conventions of Japanese. Middle names do not exist in Japanese. In fact, middle names are impossible to input into the 戸籍 (family registry). Forms in Japan are designed around the assumption that each person has exactly two names. Many Europeans would be unable to input their full name in such system. In this example, it'd be unreasonable to suggest most Europeans are acting "outside of the bounds of reasonable human societal expectations".

In general, the most effective solution I've seen for handling names is to have a single name field and treat it opaquely. If you need an inflection, ask for that separately.


> In general, the most effective solution I've seen for handling names is to have a single name field and treat it opaquely. If you need an inflection, ask for that separately.

Yes, in general. Though sometimes you need to know about specific parts of a name as interpreted in a specific cultural context.

Eg in Germany by law you need to register a family name when you get married, and your kids are going to have that family name, whether you want it or not. (The parents don't necessarily have to have that name, eg the parents can opt for double-barrelled names or they can keep their old name. But the kids only get one family name and they all get the same name, and there are restrictions on which name you can pick.)

In contrast, here in Singapore they just give you a blank space in a form where you can put in your new baby's complete name as you please.

(OK, technically, you get multiple blank spaces, because you can eg give your kid a western name like Jay Random Smith and a Chinese name that is a completely separate name and doesn't need to have anything to do with the western name. I think you can also get eg a tamil name, if you want to etc.)


Compare Mr Karl-Theodor Maria Nikolaus Johann Jacob Philipp Franz Joseph Sylvester Buhl-Freiherr von und zu Guttenberg https://en.wikipedia.org/wiki/Karl-Theodor_zu_Guttenberg

And he's not even the guy with the longest name, and his parents did not make up his name to spite some length restrictions.


That surname is like 22 bytes


Adolph Blaine Charles David Earl Frederick Gerald Hubert Irvin John Kenneth Lloyd Martin Nero Oliver Paul Quincy Randolph Sherman Thomas Uncas Victor William Xerxes Yancy Zeus Wolfeschlegel­steinhausen­bergerdorff­welche­vor­altern­waren­gewissenhaft­schafers­wessen­schafe­waren­wohl­gepflege­und­sorgfaltigkeit­beschutzen­vor­angreifen­durch­ihr­raubgierig­feinde­welche­vor­altern­zwolfhundert­tausend­jahres­voran­die­erscheinen­von­der­erste­erdemensch­der­raumschiff­genacht­mit­tungstein­und­sieben­iridium­elektrisch­motors­gebrauch­licht­als­sein­ursprung­von­kraft­gestart­sein­lange­fahrt­hinzwischen­sternartig­raum­auf­der­suchen­nachbarschaft­der­stern­welche­gehabt­bewohnbar­planeten­kreise­drehen­sich­und­wohin­der­neue­rasse­von­verstandig­menschlichkeit­konnte­fortpflanzen­und­sich­erfreuen­an­lebenslanglich­freude­und­ruhe­mit­nicht­ein­furcht­vor­angreifen­vor­anderer­intelligent­geschopfs­von­hinzwischen­sternartig­raum Sr.


That's the shortened version. The full version of the surname is 666 characters.


I guess it kind of goes to the poster's point, since Wikipedia article titles truncate at 255 bytes (since that was the max size of a VARCHAR in MySQL 3).


I apologize, I should've read more of the article. My bad


Surprised nobody has linked this yet:

https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...

Also note that while the counter examples might sound extreme, in some languages each character might need 3 bytes in UTF-8, and 128/3 ~= 43 characters doesn't seem to be that outrageous.


> Nobody’s surname needs 128+ bytes

Oh we're doing that "false things that programmers believe about the world" again! Fun! Let's consider cultures where people can have more than one surname. Ever heard about Pablo Diego José Francisco de Paula Juan Nepomuceno María de los Remedios Cipriano de la Santísima Trinidad Ruiz y Picasso? Yes, that's his full name. Of course, he didn't frequently use it, but if you make a system dealing with people's names, you'd eventually end up having to support something like that.


In this case, his family names (Spanish people have two) were just "Ruiz y Picasso"; the rest is his given name. So you can argue that his family name fits within 128 bytes.

But that brings us to the assumption that people have just one family name which is just one word, which is very much not the case in many cultures around the world.


A programmer who believes surnames are one word has not heard of John Von Neumann.


And then there's the assumption that you'd write it in that way, and in that order. There might be situations where you'd want to use the Hungarian and write "Neumann János" instead.

I.e. the assumption that the family names come last is not necessarily correct even in Europe. Let alone if one deals with the various Chinese languages, Japanese, or Korean. And probably others.


That is also true. I am currently working on an international project in Japan, so when designing our systems and databases I always insist on the terms "given" and "family" names instead of "first" and "last" names, to prevent confusion.

Same with the dates. I always ask my colleagues to please write the years in full, otherwise it can be very difficult to know if it is DMY or YMD (thankfully we don't have to deal with MDY).


> Nobody’s surname needs 128+ bytes.

According to Wikipedia some joker did have a 666-character surname, officially, in the USA. Perhaps the best thing to do would be to truncate the field at some reasonable limit to prevent people from crashing your system with ridiculous values but make sure your system works properly with the truncated names so, for example, it doesn't panic because the truncated name isn't equal to the original name.

With spaces, punctuation and diacritics, comparing even short names for equality is a bit dangerous and probably best avoided. If you expect two text fields to match and they don't, even after normalisation, you could consider flagging the case for human review later but continuing without an error for the time being.


If I'm understanding correctly what you mean by "joker", it sounds like they changed their name to be this long intentionally? That seems like the type of thing where most software probably can get away with just not supporting them; with the possible exception of mandatory government services or something, there's no reason software should need to account for people taking such extreme steps of their own volition.


Because programmers are too lazy to properly handle long names? That's a stretch for denying someone service and you know it.

Like, yes, nobody is forced to accept their name unless they're running a government service, but using it as an excuse is just that, an excuse.


Any system will have some limit on the length of names - if nothing else, the budget for storage.

A non-lazy programmer will determine an appropriate limit, document it, continuously test that the entire system can handle that length correctly, and continuously test that helpful errors are returned when too-long names are input.


What if my legal name is 500 trillion characters long? Should every project design their storage system to accommodate this?

If you look at the 666-character name, it's no more or less ridiculous than 500 trillion characters.


Ask the government whether your legal name can be that long. They'll say no.


Handling arbitrary length input without caring how that could be abused is also lazy.

If you're working in a stack that nicely handles arbitrary lengths, it takes extra consideration and effort to put in limits.


> nobody is forced to accept their name unless they're running a government service

I dunno, would you expect that the government should be allowed to dictate how long a person's name can officially be? If yes, then problem solved, nobody may have names longer than X, and all services will accept X. If no, then there has to be a practical limit on name sizes that government services can accept, and people will be unhappy because it doesn't accept their "official" name.


Alas, that only works if you are dealing only with people under the jurisdiction of the government in question.

There's always them pesky foreigners.


There's also the fact governments aren't static. The past, and future, are foreign countries.


To be clear, I'm not arguing for or against a specific value as the "maximum length" of a name. I'm drawing a distinction here in terms of a potential user's intentional choices and what that means for providing support.

> Because programmers are too lazy to properly handle long names? That's a stretch for denying someone service and you know it.

I don't think someone should be denied service if they happen to have a long name, but I genuinely don't think it's a stretch not to try to handle people going out of their way to subvert expected norms. In this case, the argument is more philosophical than technical because there isn't an obvious heuristic for determining whether a name is intentionally made long or not, but there are places where I do think it's worth it for programmers to consider.

As an aside, I'd argue there's more nuance than "properly handling long names" or "being lazy". There's already an inherent limit on how large a name can fit into memory, and that limit might even fluctuate depending on the number of users being processed in a server at a given time and how much memory a given server has. Is a 1 GB name too long to be worth handling, or is not handling it "laziness"? If you're arguing that any name the government accepts should be accepted by any software, how do you know what limit the government will accept? If you have international customers, is the limit larger? If there's no documented limit, do you just need to hope your software is at least as robust as the government's?

My point isn't that these situations are equivalent to a name that's 666 characters long, but that arguing that not handling 666 characters is lazy already blends implicit technical assumptions (servers have enough memory that handling names with 666 characters isn't an issue) with social assumptions (it's possible for someone to actually have a name that long), and I don't think that "pretend all names can fit into memory fine and just crash or time out or something if there are too many names that are too long according to the parameters of the runtime and the hardware" is the obvious best choice from a fairness perspective.



He allegedly told the utility company that he wouldn't pay his bill unless they spelled his name correctly, which caused them to print it on three lines. Maybe this guy is just a walking software test case?


The bill just needs the correct account number and address.

You don't get to just stipulate new conditions for paying your bill. If it's not in the service contract that your giant name has to be spelled completely on the bill for it to be payable, then that condition doesn't exist.


Are you a lawyer with expertise in Philadelphia commercial law of the 1950s? I'm sure not, so I wouldn't want to take that fight. It was apparently easier for the company to just write the guy's name correctly.

Or maybe Wikipedia is wrong, or the source was bad. You have to pay to read the 1955 article and I can't be bothered right now. Citation 16 below if you're interested.

https://en.wikipedia.org/wiki/Hubert_Blaine_Wolfeschlegelste...


That’s exactly what MICROS~1 did with the 8.3 file names on DOS, to add support for longer file names in systems that did not do so originally.


> Nobody’s surname needs 128+ bytes.

This attitude ensures that the US software industry will never conquer the world.


The US software industry has already conquered the world, and continues to do so. It doesn't mean that it'll always be this way, but if you wanted to associate software with a single country, then the US would be your answer.


Why does it need to conquer the world? I just want to deal with sane, performant software.


And people in other locales want to be able to use their own name without software mangling it.


At some point there is a physical limitation, there's no passport in the world that accepts a 666-character name.

The US only gives you 21 characters on the DS-11 for a surname.


You're assuming one character is stored as one byte, which is only the case for English.


That sounds like a them problem. ;D

I once heard that "decision" comes from a Latin root word meaning "to cut off (the other options; to pay opportunity cost)". I will decide to optimize for my use case.


I just want to deal with sane, performant people.

We'll just both need to learn to live with disappointment.


> nobody's surname needs 128 bytes

I believe that's #6 of the falsehoods programmers believe about names (with examples):

https://shinesolutions.com/2018/01/08/falsehoods-programmers...

Ignoring multi-byte characters, there are still plenty of long names: https://www.ancestry.com/c/ancestry-blog/discovering-the-his...

If you're going to try for a "reasonable" max name length it would probably need to be at least 4kb.


Names are hard, don't force things into your assumptions.

If you must, record fields such as:

Full Legal Name - Freeform input, no string length limitation. If you feel like this is an attack vector, send the data out for human review.

Full Mailing Address - Don't try to break this down, allow multi-line, free form input. This is something you might want to validate with your shipping carrier and/or a human.

A short 'nickname' used as such.


You always need size limitation. You really don't want to allow 10GB strings to be stored in the full name field.

Also, in almost any situation where you need a legal name, you actually want to follow a lot of rules. This idea that people's original names are somehow sacrosanct is a misunderstanding. If you're doing business in a European country for example, you have to write your name in Latin/Cyrillic letters, perhaps with a few symbols like ' or - allowed as well, and typically with a few accents/diacritics specific to each country. You certainly can't register as 田中 in any context that requires a legal name in France, you'd have to write that as Tanaka.

And this is natural because legal records are meant for authorities in some specific country to read, and compare with other legal docs - so they need to at least be readable to those authorities.


A 10GB string does sound pathologically large; however it's an argument about the absurd.

What size is for sure enough? Well I'm not so sure. What if someone has a lot of titles. What if society decides that someone's Legal Name requires a post quantum cryptography key that happens to be 20MB (binary) long?

Also, FYI, at least PostgreSQL doesn't give you a free lunch for any variable length string; length requirements are a column constraint that DEGRADES performance because it has to check.

An external name validator of some sort could check things. Commonly allowed cases could pass by computer check, while actual humans could review edge cases. Someone trying to abuse the name field like that probably needs human review elsewhere anyway.


I would bet you that Postgres insert and query performance is better overall if all names in your table are, say, up to 10KB long than if you have a bunch of bots inserting 20MB-long "names". And 10GB long strings are just not supported by Postgres, at least with default settings.

And the point of adding limitations on length is that you shouldn't even accept the HTTP request if it passes some size, as it will severely degrade performance if you allow someone to upload a 10GB string, even if you separate it into a human review area.

Finally, if the legal requirements change and legal names can legitimately contain cryptographic material, then your system has to change. There is no point in designing a system that tries to work for any possible use case.


Names aren't actually that hard. It's perfectly reasonable to assume that your system is going to be used in your country and culture, and handle the cases which are relevant for that. Edge cases within that context, and expanding to other cultural contexts, can be handled as they come up. But until then, YAGNI.


> Nobody’s surname needs 128+ bytes

128 bytes would only be 42 characters if each character uses 3 bytes, as would be the case in some languages. Which isn't an unreasonable length, especially if the name has a lot of combining characters.


Ok, let's make it 50 Unicode codepoints.


> Nobody’s surname needs 128+ bytes

That is only 32 astral characters... Seems kind of close for comfort. Not to mention combining characters


> I'm not a fan of Unix man page section numbers in parentheses.

Why not? Confusion if function call with section number as argument?


It's visual clutter that only communicates

- I see the entire computing world with Unix blinders on my eyes.

- I can't imagine a strlcpy function being used on a system that isn't Unix and that doesn't have a man page for it in section 3.

- I don't care that C has been internationally standardized since 1989 with a printf function; printf(3) is just another Unix function in section 3 of my man pages.

- If I don't affix (2) or (3), how will people know I'm not talking about something other than a C library function? I don't understand this "context" stuff in writing and speaking.


I once learned what they mean.

But I forgot and now do not.


They describe the section of the manual, where 1 = programs, 2 = syscalls, 3 = userspace functions, 4 = special files, 5 = file formats, 6 = games, 7 = misc/overviews.

So printf(1) is the man page for the /usr/bin/printf command, while printf(3) is the man page for the libc printf() function.

Alternatively, readdir(2) is the man page for the readdir syscall, while readdir(3) is the man page for the libc wrapper, which no longer actually calls readdir(2). See also syslog(2) and syslog(3).

Or, time(1) is the man page for /usr/bin/time to time how long commands take. time(2) is the syscall to return the number of seconds since the epoch. And time(7) gives you an overview of time and timers on the system.


I just discovered that typing man 'printf(1)' or man 'printf(3)' actually works (at least with the man command on Ubuntu, provided by the "man-db" package).

You have to quote or escape the parentheses because they're shell metacharacters, which IMHO makes that syntax more trouble than it's worth.

Other commands that work are "man 3 printf", "man -s 3 printf", and "man printf.3". I think "man 3 printf" is the oldest version.

At least one other version of the man command (NetBSD) doesn't accept "man printf.3" or "man 'printf(3)'".


Unless you're on a UNIX system the number isn't important. If you are on a UNIX system, then it's useful for telling the difference between commands (like write(1), hostname(1), or printf(1)), system calls (like write(2)), library functions (like printf(3)), or config files (like hostname(5)).


In an article about C programming, the context tells you that everything that is given in a typewriter font is a C identifier, unless noted otherwise.


> Nobody’s surname needs 128+ bytes. No reasonable URL for a firmware update download needs 4096 bytes.

...and surely the "seconds" field of a timestamp is always between 0 and 59 inclusive, addresses will include a state and a building number, phone numbers contain only digits and maybe a leading + sign, etc.

Wrong assumptions like this are one of the root causes of (in the best case) bad UI or (worst case) annoying bugs.

128 bytes for a surname is only about 60 unicode characters, less if you include RTL markings and characters outside the BMP.

A URL can contain SHA hashes (think: reproducible builds) and can thus be very long (okay, 4k characters is pushing it quite a bit but I wouldn't rule it out like you did...)


There have to be limits somewhere. Memory and storage space is not infinite. On the other hand, "how many characters can someone have in their name" is infinite. That means that no matter what limit you choose, someone will eventually exceed that limit. And you have to have a limit.

This is not a matter of "wrong assumptions". At the end of the day, all you can do is set a limit such that you're comfortable with the risk that someone will be outside the limit you have set. And risk tolerance, as always, is a matter of opinion and not fact.


This is a poor argument.

Firstly, "how many characters can someone have in their legal name" is decidedly not infinite, because it has to be sufficiently short that some governmental entity was willing to record it.

Secondly, as a reply to a comment (quite reasonably) pointing out that 60 unicode characters may not be enough for a surprisingly large number of people, this makes even less sense. Memory and storage space are not infinite, but 128 bytes per name is still unreasonably low. One could buy a single 12TB hard drive and store the names of every single living human, allocating 1.5KB per person.


The point is, you are always going to exclude some number of people's names when you set constraints. So you have to decide what is right for your use case, just like with any other engineering decision. There's no such thing as "right" or "wrong" here, simply what is best for the context. 128 bytes for a name is going to be unreasonably low in some contexts (e.g. for Arabic names), but not for others (e.g. for American names).


We live in a global society and there are plenty of Arab Americans so Arabic names are American names.


Upthread, I was the one who posited the 128 byte figure. It was for a surname, not full name.

I now posit 50 unicode codepoints for a surname.

Fit or fuck off.


> it has to be sufficiently short that some governmental entity was willing to record it.

How short is that, exactly?


It's not hard to apply an upper bound here.

If you arrive at a government office when they open for the day, and try to spell out your name to the clerk, and are unable to finish before the office closes for the day, that name is too long.

Realistically they will probably tell you to leave well before that.

No one will have a legal name longer than the legal system will allow them to have.


> Wrong assumptions like this

Assumptions, or features? I'm all for inclusive behavior, however I'm also for well tailored solutions. Having support for 8k characters when you are going to usually use maybe 20 isn't smart or correct either. That's why we have utf-8, not utf-32, you can grow the bytes when you need to, but only then.

> A URL can contain SHA hashes

It can, or it can not - again, perhaps a feature, not a bug. The hash can also live in a file named by convention, and downloaded/checked separately. Maybe there are other scenarios where you might need a really long url, but domain + release path + name + major.minor.patch should get you 99% of the way.

What's "reasonable" is relative, always designing for the edge case is good in some cases, but its also OK (and perhaps better) to optimize on occasion.


When I was in college, the only number in my mailing address was the zip code.


THIS. Not for every use case, obviously. But for a huge number of them. And the "FooBaz length 5307 bytes in check_input(), truncating to 4095 bytes..." errors (which are trivial to log, or ignore, as you wish) can reveal many interesting things.


128 bytes is only 4 32-bit characters. Now, I think 4-byte UTF-8 characters are pretty rare, but at least 3-byte ones are certainly common in names, even legal names.

If you allow users to type emojis in their name you will definitely run out as the color/gender selectors take up an additional code point.


You confused bits with bytes. 128 bytes can encode 32 4-byte long characters.


Given how nice the systems programming languages we have these days are, I refrain from letting classic null-terminated C strings enter my program. Even in embedded programming we opt for std::string (over Arduino's String). I am just happy to save our time in exchange for code that is some X percent less optimal.


It is seriously unfortunate that C++ managed to standardize std::string, a not-very-good owning container type, but not (until much, much later) std::string_view, the non-owning slice type you need far more often.

Even if Rust had chosen to make &str literally just mean &[u8] rather than promising it is UTF-8 text, the fact &str existed in Rust 1.0 was a huge win. Every API that doesn't care who owns the textual data can ask you for this non-owning slice type, where in C++ it had to either insist on caring about ownership (std::string) or resort to 1970s hacks and take char * with zero terminated strings.

And then in modern C++ std::string cares about and preserves the stupid C-style zero termination anyway, so you're paying for it even if you never use it.


> And then in modern C++ std::string cares about and preserves the stupid C-style zero termination anyway, so you're paying for it even if you never use it.

I don't think this in itself is a real problem. You pay for the zero at the end, which is not much. The real cost of zero termination is having to scan the whole string to find out the size, which with std::string is only needed when using it with C-style APIs.


and also you have to copy (probably allocating) to get a substring.


Only if you want a substring with separate ownership though - a string_view doesn't have to be NUL-terminated.


If you want to pass the string_view to some API that expects NULL terminated strings, then a copy is necessary (well, maybe in some cases you can cheat by writing a NULL in the string and remembering the original character, and then after the API call restore the character).

This isn't as much a fault of a string_view type of mechanism, but rather of APIs wanting NULL terminated strings. Which are kind of hard to avoid on mainstream systems today, even at the syscall interface. Oh well..
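
The cheat, spelled out (a hypothetical sketch; only safe if the buffer is writable, not shared across threads, and has a byte at `end` to clobber):

  #include <stddef.h>

  void call_with_nul(char *buf, size_t end, void (*api)(const char *)) {
      char saved = buf[end];   /* remember the byte we overwrite */
      buf[end] = '\0';
      api(buf);                /* the C API sees a terminated string */
      buf[end] = saved;        /* restore the original contents */
  }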


Sure, but the thread here was about the forced NUL-terminator in std::string and the costs associated with that. If you want a NUL-terminator (e.g. for use with a C API) then you have to pay the copy (and in the general case, allocation) cost for substrings no matter how your internal strings look (unless you can munge the original string) and std::string is exactly the right abstraction for you.

But yeah, it would be nice if the kernel and POSIX APIs had better support for pointer+size strings.


> And then in modern C++ std::string cares about and preserves the stupid C-style zero termination anyway, so you're paying for it even if you never use it.

Is this required now? I've seen a system where this was only null terminated after calling .c_str()


c_str has to be constant complexity, so I guess the memory needs to be already allocated for that null character. I'd be surprised to see an implementation that doesn't just ensure that \0 is there all the time.


Ah, the system I ran into would've been pre-c++11.

Only saw it trying to debug a heap issue and I was surprised because I thought surely it's a null terminated string already right? They also checked the heap allocation size, so it would only reallocate if the length of string data % 8 was zero.


Facebook / Meta had their own string type which did this; it turns out now you have an exciting bug, because you end up assuming uninitialized values have properties they don't. Reading an uninitialized value is Undefined Behaviour, and so your stdlib conspires with the OS to screw you in some corner cases you'd never even thought about, because that saved them a few CPU cycles.

The bug will be crazy rare, but of course there are a lot of Facebook users, so if one transaction out of a billion goes haywire, and you have 100M users doing 100 transactions on average, the bug happens ten times. Good luck.


Golang's slices view of the world is addictive.


Depending on your usage it is not necessarily less optimal either.

You never need to walk the string to find the \0 byte, e.g. for strlen.

For short strings no heap memory needs to be allocated or deallocated.


It's really too bad though that the short string optimization capacity is neither standardized nor user-controlled.


Exactly this

It seems C is going around in circles while everybody else has moved on

No, speed and "efficiency" are not the be-all, end-all.

Safety is more important than that, except in very specific cases. And even in those cases there are better ways and libraries to deal with the issue without requiring the use of "official" newer C functions that seem to still be half broken.

There's so much fiction regarding memory issues, limited-memory issues, and what to do if we hit them, when in practice "terminate and give up" is the norm (often enforced by the OS).


I look forward to your solution to bridge these libraries to every other person’s slightly different implementation of the same library, which also has to talk to every other interface that cropped up over the last 50 years and takes null-terminated strings anyways.


I haven't moved on. :> C FTW! Bloated langs are cringe.


What you call "bloat" is how other languages handle complexity that C necessitates that the programmer handle. Enjoy.


Have fun playing with a chainsaw with the safety disabled


Have fun playing with your Fischer Price KidSafe™ playset? >:)

At some point we are just outlining in general terms what we want from a language. If C was a toolbox it is a limited number of essential tools, other languages add so many things Alton Brown would faint from the unitasking nature of them.

C programmers love this. I know I do.


The "kids tool" comment is so weird.

I could do pretty much whatever I wanted in (DOS) user space with Pascal.

I can do pretty much whatever I want in a modern OS user space with whatever lang I prefer. "Oh but you might need C bindings" because the OS was built that way! (And with Windows/COM you might prefer C++ bindings - just saying ;) )

> If C was a toolbox it is a limited number of essential tools

C is an old toolbox where 1/3 of the tools are rusty finger-removers, 1/3 are clunky metal crap that barely does anything, and 1/3 kinda work

I'm all for a simplified set of essential tools, but not one where it's sharper on the user handle than it is on the business end

bUt C iS jUsT hIgH lEvEl aSsEmBlY no it is not


If someone ever has to use your stuff over FFI from a high level language, they will curse you for not just using C strings.


Null terminated C strings are still terrible for FFI. Pointer and length is a better solution and it is trivially interoperable with code that uses, say, string_view.


If someone could port the Free Pascal string library to C, it would solve a lot of problems with new C code. It reference counts and does all the management of strings. You never have to allocate or free them, and they can store gigabytes of text. You can delete from the middle of a string too!

They're counted, zero terminated, ASCII or Unicode, and magic as far as I'm concerned.

Oh... And a string copy is an O(1) operation as it only breaks the copy on modification.

Edit: corrected to O(1), thanks mort96


For most strings, it seems to be that using a varint would solve the overhead problem. For short strings the overhead would be no longer than the null byte (which you could discard, except when interacting with existing APIs).
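
A minimal sketch of that idea using LEB128-style encoding (my illustration, not from the thread): lengths under 128 cost exactly one byte, the same as the '\0' terminator they replace:

    #include <stdint.h>
    #include <stddef.h>

    /* 7 bits of length per byte; the high bit means "more bytes follow" */
    size_t varint_encode(unsigned char *out, uint64_t n)
    {
        size_t i = 0;
        while (n >= 0x80) {
            out[i++] = (unsigned char)(n | 0x80);
            n >>= 7;
        }
        out[i++] = (unsigned char)n;
        return i;   /* number of prefix bytes written: 1 for n < 128 */
    }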

But as with _all_ string solutions, it's the POSIX interface, standard library, and other libraries that screw you. If you're programming in C today, it's because you're integrating with a ton of C code, and thus it's impossible to make progress since the scope of improvement is so small.

It's always struck me as weird that Rust treats strings the way it does - the capacity value is not useful for many cases of strings, and it would have cost them one bit to special-case the handling of constant strings without the cap field, which would be better. Most strings are _short_, which makes the overhead worse, proportionally.


It's not like there is any shortage of alternative string libraries for C; sometimes I feel everyone has gone and invented their own.

Antirez's sds is just one example https://github.com/antirez/sds


Pascal strings have an overhead of 2 ints per string (16 bytes on 64-bit systems)

The kind of person who calls a single pass through the string a "horribly inefficient solution" will faint at the idea of burdening every string with 16 more bytes of data.


it's pretty trivial to implement this with a max of 14 bytes of overhead (with small string optimization), but more importantly, it's only 16 bytes on 64-bit systems, which pretty much by definition aren't that memory constrained (since otherwise you would be on 32-bit).


> ...it's only 16 bytes on 64-bit systems, which pretty much by definition aren't that memory constrained (since otherwise you would be on 32-bit).

I'm not sure about that. There are plenty of 64-bit systems with less than 4GB of RAM


Including your laptop if you have a few Electron apps open!


Maybe other people's laptops, but 32GB is table-stakes for any laptop I'll buy.


For some reason Multics got a higher security score from DoD than UNIX, guess why.


This is a purely psychological problem. I’d say most of C is psychological, not technical. If I were a world dictator, one of my first orders would be to lock C developers in a room with only python for a few months. Or ruby, in severe cases. Some of them really need to touch grass.


> If I were a world dictator, one of my first orders would be to lock C developers in a room with only python for a few months. Or ruby, in severe cases.

I would additionally do the exact opposite: lock Python & Ruby developers in a room with only C for a few months.

C is a great language to learn programming, but Python or Ruby are, nowadays, in most cases, better languages to program with. For example, C's sharpness is a notorious source of bugs; yet it forces you to develop rigor and discipline.


But if you only give them a few months that's barely enough time to run a simple Hello World.


As an aside: can you have an O(0) operation that actually does anything?


It doesn't really make sense within the context of complexity analysis as something distinct from constant-time, which is denoted with O(1). A copy of a CoW string is O(1).


This. In complexity analysis you factor out any constants and only look at the term with the highest growth rate, so you end up with 1 < ln n < n < n^a < 2^n (this can be extended indefinitely by replacing the n in the last case with anything to the right of n, but in practice these are the only ones that matter).


Stuff that happens at compile-time is O(0) (well, technically it's amortized over the number of times you run the compiled code, eh? Huh, how does JIT compilation affect Big-O analysis?)


O(0) is essentially meaningless. The only way a task could possibly be O(0) is if it isn't done at all, as even if the task is guaranteed to run in a Planck second [0], that's still constant time and would be O(1).

[0] https://simple.wikipedia.org/wiki/Planck_time


It copies the pointer to the data and increments the reference count. When you modify a string it checks the count and copies it prior to modification if it's not 1.
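
A bare-bones sketch of that mechanism (my illustration: no error handling and, unlike Free Pascal's real implementation, no atomic/thread-safe reference counting):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        size_t refs;   /* how many handles share this buffer */
        size_t len;
        char   data[]; /* string bytes plus trailing '\0' */
    } cow_str;

    cow_str *cow_copy(cow_str *s)        /* "copy" is O(1): bump the count */
    {
        s->refs++;
        return s;
    }

    void cow_make_unique(cow_str **sp)   /* call this before any mutation */
    {
        cow_str *s = *sp;
        if (s->refs == 1)
            return;                      /* sole owner: write in place */
        cow_str *t = malloc(sizeof *t + s->len + 1);
        memcpy(t->data, s->data, s->len + 1);
        t->refs = 1;
        t->len = s->len;
        s->refs--;
        *sp = t;
    }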


Only if you have an operation that actually does something in precisely 0 time.


The Better String Library (aka bstring, not to be confused with COM’s BSTR) is fairly nice:

https://bstring.sourceforge.net/

The string keeps track of the buffer size and how much has been used, allowing allocations to be somewhat amortized. The string buffer itself is zero-terminated for easy interop with code that expects standard C strings.

    struct tagbstring {
        int mlen;              /* allocated buffer length */
        int slen;              /* length of the string actually in use */
        unsigned char * data;  /* zero-terminated buffer */
    };
I used it on a microcontroller where I wanted something small and simple. The main missing feature is the lack of a small-string optimization like some implementations of std::string have. (Before anyone complains about this string type being too inefficient for a microcontroller, I had 1 MB of flash and 192KB of RAM, so I was not super constrained for resources)


Man, I want Ada.Strings.Fixed, Ada.Strings.Bounded, and Ada.Strings.Unbounded in C.


TIL of memccpy() https://www.man7.org/linux/man-pages/man3/memccpy.3.html

To be honest, every time I need to deal with strings in C I feel like I'm banging rocks together, regardless of approach. I try to avoid it at all costs.
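
For anyone else meeting memccpy for the first time, the usual truncating-copy pattern looks roughly like this (dst_size assumed nonzero):

    char *end = memccpy(dst, src, '\0', dst_size);
    if (end == NULL)                /* no '\0' within dst_size bytes: truncated */
        dst[dst_size - 1] = '\0';   /* terminate it ourselves */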


I can never remember the nuances of the 50 various string functions and which shouldn't be used.

What I do remember is that virtually all string problems can be solved with snprintf/asprintf.
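
For instance, the bounded-copy idiom (assuming dst is a char array, so sizeof yields its capacity):

    int needed = snprintf(dst, sizeof dst, "%s", src);
    if (needed < 0 || (size_t)needed >= sizeof dst) {
        /* encoding error, or src didn't fit and was truncated */
    }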


snprintf is a worse strlcpy: not only does it need to call strlen if you pass it a string parameter, it also ends up in ??? territory if the string is long enough because its return type is int.


The printf family of functions also runs a mini-interpreter that has its cost, because its main use is interpolation (the % placeholders). Some compilers can substitute them for more efficient versions (e.g. printf without any % in the format string -> puts). I don't know if they can detect and substitute an snprintf with a "%s%s" format string.


Ah, good old “what if my string exceeds two gigabytes” dilemma.


> and which shouldn't be used.

my rule of thumb is that if its name begins with str then it shouldn't be used.


If you decide to use these functions, beware of the sharp edges. Read the documentation. Read the documentation for the specific version, platform, and compiler you are using. Between different CRTs these things can act differently even though they say they do the same thing.


The standard library string handling stuff is atrocious and it surprises me that wholesale replacement of that stuff isn't more common.


It is a long, time-honored tradition to attempt to improve on flawed standard library functions with equally flawed functions.


If I proposed a new strxyzcpy function that only null-terminated the string when the length was not a prime number and that wiped your hard drive if the destination string, before the copy, contained the sequence 'xyz' in ascii I would be very afraid someone in the C committee would think it would be a nice idea to add it.


Does the length take locale into consideration?


FFS I had some code for a microcontroller break because locale.


Don’t forget than if it’s invoked at the 38th second of a minute, the behavior is undefined.


Oh, are those 0-indexed seconds or 1-indexed?


0-indexed if running on a Unix-like OS, 1-indexed otherwise.


Does Linux count as Unix-like? What if it’s NixOS?


Implementation defined.



A great tweet, and often useful. But I don't think it applies well here.

There is a long history of "this time, the strcpy replacement will be safe" followed by cve after cve after cve. At some point I feel that the response really should be to give up on trying to make c-style strings safe.


The problem is the function signatures of all the improved string functions are broken. You can never write a safe string function that takes two char pointers.

You really want

  int str_try_copy(str_buffer *dest, str_slice *src)
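
A sketch of what those hypothetical types might look like (the fields and the return convention here are my assumptions, not an established API):

    #include <stddef.h>
    #include <string.h>

    typedef struct { const char *ptr; size_t len; } str_slice;
    typedef struct { char *ptr; size_t len, cap; } str_buffer;

    /* returns 0 on success, -1 if src does not fit in dest's capacity */
    int str_try_copy(str_buffer *dest, str_slice *src)
    {
        if (src->len > dest->cap)
            return -1;
        memcpy(dest->ptr, src->ptr, src->len);
        dest->len = src->len;
        return 0;
    }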


You are more than welcome to give up on it. Personally I feel that you can never make code that handles C strings secure. That said, people will still be using them decades from now, and it is possible to give them safer APIs. Discounting everything as "oh it can never be perfect therefore we might as well throw away any efforts to make it better" is not helpful.


Also relevant: https://nullprogram.com/blog/2021/07/30/ (it references this blog post and offers good solutions)


It's not very clear to me why the paragraph "Truncation matters" claims that the strlen variant is necessarily better than the strlcpy variant. The strlcpy variant only scans the source and destination strings once in the fast case (no reallocation needed), while the strlen variant needs to scan the source string at least twice. I guess in the common case you have to enlarge the destination a few times, then once it's big enough you don't have to enlarge it anymore and always hit the fast case, so it makes sense to optimize for that.

It might also be that in some programs with different access patterns that doesn't happen and it makes sense to optimize for the slow case, sure, but the author should acknowledge that variability instead of being adamant about what's better, even to the point of calling a solution they don't understand "schizo". In my experience the pattern of optimizing the fast path makes a lot of sense.

BTW, the strlcpy/"schizo" variant could stand some improvement: realloc() already copies the part of the string within the original size of the buffer, so you can start copying at that point. Also, once you know that the destination is big enough to receive a full copy of the source you can use good old strcpy(). Cargo cult and random "linters"/"static checkers" will tell you shouldn't, but you know that it's a perfectly fine function to call once you've ensured that its prerequisites are satisfied.


C can add a whole alphabet of str?cpy functions, and they will all have issues, because the language lacks the expressive power to build a safe abstraction. It's all ad-hoc juggling of buffers without a reliably tracked size.


The language is expressive enough to have a good string library. It has string.h instead because of historical reasons. When it was introduced the requirements for a string library were very different from today's.


Null terminated strings is a relic that should have been recycled the day after it was created.

Any attempts to add more letters to the strx functions is just polishing the turd.


The types that C knows about are the types that the assembly knows about. Strings, especially unicode strings, aren't something that the raw processor knows about (as far as I know). At the machine level, it is all ad-hoc juggling of buffers without a reliably tracked size, until you impose constraints like only use length-prefixed protocols and structures. Where "only" is difficult for humans to achieve. One slip up and wham.


C with its notion of an object, TBAA, and pointer provenance is already disconnected from what the machine is doing.

The portable assembly myth is a recipe for getting Undefined Behavior. C is built on an abstract machine described in the C spec, not any particular machine code or assembly.

Buffers (objects) in C already have an identity and a semantically important length. C just lacks features to keep track of this explicitly and enforce error handling.

Languages exist to provide a more useful abstraction on top of the machine, not to naively mirror everything even where it is unhelpful and dangerous. For example, BCPL did not have pointer types, only integers, because they were the same thing for the CPU. That was a mess that C (mostly) fixed by creating "fictional" types that didn't exist at the assembly level.


The abstract machine of C is defined with careful understanding of what real CPUs do.


In need of a copy of ISO C printout?


The people who define C's abstract machine are well aware of what real hardware is like. The standard of course doesn't mention real hardware, but what is in there is guided by real hardware behaviour. They add to the spec when a change would aid real implementation.


How does AVX512 guide ISO C?


The committee members have been aware of SIMD for a long time and have been asked about it. So far they have either not agreed, or seen no need because autovectorization has shown much promise without it (both of the above are true, though not always to the same people).

Multi-core is where languages have had to change, because the language model of 1970 wasn't good enough.


It was an example among many others.

How does (FPGA, HiLow, Harvard, CUDA, MPI, ...) guide ISO C?


How should they? In some cases they have decided that isn't where they want C to go; in others the model of 1970 is still good enough; and in others they are being slow (possibly intentionally, to not make a mistake).


So C isn't about being designed close to the hardware after all.


it is designed with careful understanding of real hardware. However that is not close to any particular hardware.


You must be thinking of C++? There is no object in C, just structs, which are just a little bit of organization of contiguous memory. C exists to make writing portable CPU-level software easier than assembler. It was astonishingly successful at this niche; many more people could write printer drivers. While ptr types may not formally exist in assembly, the variety of addressing modes using registers or locations that are also integers has a natural resonance with ptr and array types.

I would say C precisely likes to mirror everything even where it is unhelpful and dangerous. The spirit is captured in the Hole Hawg article: http://www.team.net/mjb/hawg.html

It is the same sort of fun one has with self modifying code (JIT compilers) or setting 1 to have a value of 2 in Python.

ed: https://en.cppreference.com/w/c/language/object is what is being referred to. I'm still pretty sure in the 80s and 90s people thought of and used C as a portable low-level language, which is why things like Python and Linux were written in C.


C has objects in the context of how the Abstract C Machine is defined by ISO C standard, nothing to do with OOP.


Assembly only knows about raw bytes, nothing else.


Depends on the assembly, but even most (all?) RISC instruction sets know about words (and probably half-words too) in addition to bytes.


Pairs, quads and octets of bytes.


There's also vector instructions.


Operating on streams of bytes, defined by registers.


The context was this comment:

> Depends on the assembly, but even most (all?) RISC instruction sets know about words (and probably half-words too) in addition to bytes.

Of course, you could define words and half-words in terms of bytes, too. Just as you can do with vectors.

And many vector instructions operate more on eg streams of words than streams of bytes.


Nope, the context was:

"The types that C knows about are the types that the assembly knows about."


The main issue, which the article covers, is that there's really two different operations you want with copying C strings.

Do you want to copy and truncate, or just copy?

Within that, do you want to manage your own allocation, or do you want that abstracted?

There's too many decision points and tradeoffs to just neatly hide behind a single "one true function" for copying C strings.


Is it bad that for all application uses, I reach for `asprintf`?

As well as reaching for the %ms format for scanf for reading input.

For buffers, I use memcpy and length tracking. Any other approach seems like unnecessary headache. Or maybe modern hardware has spoiled me?
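
For reference, that pattern looks something like this (asprintf is a GNU/BSD extension only recently standardized; name and id here are placeholders):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>

    char *msg = NULL;   /* name and id are assumed to be defined elsewhere */
    if (asprintf(&msg, "user=%s id=%d", name, id) == -1)
        msg = NULL;     /* allocation failed; msg contents are undefined */
    /* ... use msg ... */
    free(msg);

    char *word = NULL;  /* %ms makes scanf allocate the buffer itself */
    if (scanf("%ms", &word) == 1) {
        /* ... use word ... */
        free(word);
    }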


C's string handling is an abomination. Null terminated strings is and always has been a colossal mistake. The C standard and operating systems need to be updated such that null-terminated strings are deprecated and all APIs take a string_view/slice/whatever struct.


I’m always glad to see more people coming around to the fact that memccpy is the actual function they want, not these needlessly inefficient nonstandard garbage functions that everyone flocks to anyways for “security”.


I just wish it also had a src_size. As it is, if dest is bigger than src, and src isn't null-terminated, it'll read past the end of src. You can make a MIN macro and use MIN(dest_size, src_size), but shouldn't have to IMO.
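
That is, something like this wrapper (a sketch; dest_size and src_size as in the comment above):

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* stop at '\0', the end of dest, or the end of src, whichever comes first */
    memccpy(dest, src, '\0', MIN(dest_size, src_size));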


That’s exactly my experience with most of C. You go through a history of half-assed changes and realize that you’re still halfway there in $current_year, wondering how many years it will take to fix any of these obvious flaws. The committee surely has its reasons but on the surface it seems completely non-practicing, to say the least.


strlcpy() is now part of POSIX: https://sortix.org/blog/posix-2024/.


i guess that's what spikes interest. although strlcpy was first introduced in openbsd 2.4, 26 years ago! back then as a drop-in replacement for strcpy.

so yeah, good things need time to adopt, no wonder it's not up-to-date tech, lol.

and because of NIH syndrome we now have lots of strXcpy functions to choose from.


Aside from the NULL-termination requirement there is arguably another big design issue with libc strings. I believe the interfaces that may allocate memory must give you an opportunity to override the allocator. Aside from the SIMD implementation quality and throughput on Arm, that was one of the key reasons to start a new library: https://github.com/ashvardanian/StringZilla/blob/91d0a1a02fa...

Also not a huge fan of locale controls and wchar APIs :)


> In other words, there's nothing "improper", "bad practice", "code-smell" or whatever with using the mem* family of functions for operating on strings, because strings are just an array of (null-terminated) characters afterall.

So refreshing to see a common-sense take in a world of shrill low-level programming alarmists.


Here is my attempt at strings in C (and other stuff).

https://github.com/uecker/noplate

(attention: this is experimental and incomplete for trying ideas and is subject to change.)


I feel like this person went through an unnecessary and false tangent about how people are "afraid" of memcpy due to inexperience and missed the much more important criticism that arbitrary, naive truncation on a byte level doesn't play well with Unicode.


Do the str* functions handle Unicode any better though? I feel like you'd want a different library for that


that would be wcslcpy(3)


Nope. Unicode correctness is much more complicated than switching char to wchar_t. Many glyphs take multiple codepoints or multiple character units in either utf-16 or utf-8. You need something like the break iterator "character" separator in ICU.


Over time, for my needs, I've gravitated back to fixed-size buffers. There are many apps where it really doesn't matter that they handle any string ever without truncation. A string is too long and won't fit? Whoops, just use a shorter one.


Same here. It has become easier and more defensible to do so today with gobs of RAM, even in the embedded space (somewhat).

You no longer need to worry about running out of stack space, even in a deep call-stack with each function having its own little set of buffers for strings.


In cross platform environments, it gets horrible when one does something like:

#define strlcpy strncpy


okay but that's deliberate sabotage


I’m late to the party but this mostly rehashes already voiced concerns with all the existing “updated” strcpy functions. But what I was surprised to learn is that strdup wasn’t part of the C language spec (until now)!


strlcpy is the worst C string routine except for all the others.


TFA misses the entire point of strlcpy, which is to improve security by making your code less prone to common C programmer errors that are known causes of common exploits. The author’s suggested remedies reintroduce the potential for those vulnerabilities.


Real men use memcpy


If you know the length (and that it fits)

mempcpy(), then add the terminator yourself to be sure. memcpy has the slightly annoying flaw of returning a copy of the dest argument (already-known data) rather than the more useful new information: a pointer to the end of the set memory.

  #ifndef _GNU_SOURCE
  #define mempcpy(X, Y, Z) (void *)(((char *)memcpy(X, Y, Z) - 1) + (Z))
  #endif // Not quite the same but close enough to mempcpy; fails for Z = 0
edited: Since X shouldn't be NULL and thus not 0, apply the subtraction first to avoid the corner case (which also shouldn't happen given common HW / OS regions) of X + Z > (~(size_t)0) (SIZE_MAX)


I remembered the return value incorrectly. It's even better in that it returns a pointer to the byte immediately AFTER the last written. So remove the - 1. PS: there's also a version which works with wchar_t types (be it 2, 4 or whatever else)

  #define mempcpy(X, Y, Z) (void *)((char *)memcpy(X, Y, Z) + (Z))
  #define wmempcpy(X, Y, Z) (void *)((char *)memcpy(X, Y, (Z) * sizeof(wchar_t)) + (Z) * sizeof(wchar_t))
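
The end-pointer return is what makes chained appends pleasant; a sketch (name and name_len are placeholders, and everything is assumed to fit in buf):

    char buf[64];
    char *p = buf;
    p = mempcpy(p, "Hello, ", 7);
    p = mempcpy(p, name, name_len);
    *p = '\0';   /* p already points one past the last byte written */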


"I like all my string copy's equally."

CUT TO:

"I'm not a fan of strlcpy(3)"


And that's why I'm not a fan of C for writing any kind of program: Every time you need to touch a string, which is an effortless task in any semi-modern language, you aim a trillion footguns at yourself and all your users.

It's also pretty telling that every article that tries to explain how to safely copy or concat strings in C, like this one, only ever works with ASCII, no attempt whatsoever to handle UTF-8 and keep code points together, let alone grapheme clusters. No wonder almost all C software has problems with non-English strings...


In my opinion, and as a general rule, you can't really ever truncate a user-facing string correctly because it depends on language specifics. Hence my suggestion is not to truncate user-facing strings - in fact you may want to treat them as binary blobs. On the other hand, as a matter of practicality sometimes you may have to, but certainly avoid doing so before the string hits the presentation layer, and know that the result may not be language-correct.

There are many strings that are not user facing, which you expect to be of a certain nature, e.g. ASCII based protocols, and therefore you know what to do with them.

So the multi-byte situation isn't really relevant to strcpy, or std::string, or any other "standard" string function, as it's some other library's duty.

The task of truncating and otherwise formatting UI strings is the preserve of the rendering layer(s).


Yes, truncating a user facing string requires more consideration regardless of programming language. For example do you truncate at the grapheme, word, sentence, new line, paragraph, or something else? How do you indicate truncation to the user, with an ellipsis perhaps? If you use an ellipsis is it appended to the original string or drawn by the GUI toolkit?

Note that the Unicode grapheme cluster, word, sentence, and line break algorithms are locale specific. Now consider how often programmers casually truncate strings, even in high-level languages, without accounting for the locale.


> I'm not a fan of C for writing any kind of program

C is okay — not fan-worthy, but okay — for one specific kind of program: a mostly-portable implementation of a real language in which to write every other kind of program. that’s not nothing: implementing a backend for every CPU-OS pair one wants to support is a pain, and a different skillset from writing, say, a web browser or text editor.

I wonder how much C has held back the development of computing.


C is fine. Yes, it has undefined behaviors and encourages subtle thinkos. But it’s a tool and sometimes it’s the right tool.

It’s the C hackers who insist on using it for inappropriate applications who create the problems.

Anything string-heavy and internationalized is a terrible place for C, even if you have spent 30 years dancing through the C minefield and are sure you’ll get it right this time.


The suckless.org initiative is a great example of this. Boasting about how certain software inherently sucks less because it has minimalist C code, rather than having defense-in-depth against security holes.

Newsflash, a few decades ago the majority of software was the minimalist C that they claim sucks less. There are many reasons we moved on from that world, security firmly among them.


> I wonder how much C has held back the development of computing.

Lisp is older than C. It didn't just hold computing back, it actively walked backwards.


Well, C doesn't really have strings, just pointers. Calling them "strings" is just an abstraction for us. They don't even have to be null-terminated (even though by convention they often are)!

And this is precisely what I want, in some cases. I use C when I need low-level byte-wrangling code that tightly interfaces with the operating system, or when I can't afford to allocate memory at will (like all effortless string-handling languages do under the hood), or when I don't want any runtime and need my code to behave like a shellcode, basically, or...

I guess C is still used because people find it useful. No need to be a fan of it.


> I can't afford to allocate memory at will

This was my mantra for the last 30 years. But now, even the tiniest thing has 1GB, linux and virtual memory. Except for very edge cases, this excuse is rapidly dying


But the waste adds up. There are a few projects at work whose integration tests I can't run on my work machine because it doesn't have enough ram (16GB) so I run them on my home PC instead. And that's even when I close IntelliJ, my web browsers, and kill all the background processes I can.


When I browse my local electronics web shops I see a lot of devices that couldn't boot a Linux kernel, some of which I wonder whether I could drive with a small consumer grade solar panel.


By convention: if it's not NUL-terminated, it's not a C string.


In defence of C, it should be oblivious to different string encodings (and I thought it was).

Coreutils should continue to do their job, and not need to be recompiled against whatever encoding is in fashion.


The standards, as usual for C, are paper-thin, and don't even guarantee ordinal values for the characters specified (except 0-9), despite providing strcmp. But GCC/MSVC do add some options on top of the standard, which include some attempt at interpreting different encodings.

Then again, I was surprised to find support for literal encodings in C23. So perhaps my knowledge was even more outdated than I anticipated.


> It's also pretty telling that every article that tries to explain how to safely copy or concat strings in C, like this one, only ever works with ASCII, no attempt whatsoever to handle UTF-8 and keep code points together, let alone grapheme clusters.

Can you expand on this? Why does it matter to keep code points and grapheme clusters together in the case of truncation? If you’re already truncating the string, then you can just copy as many bytes as possible. Then, later when you interpret that string, you’ll hit a malformed codepoint and ignore it. I guess what you might be getting at is that if you have a codepoint sequence, then you shouldn’t copy the bytes if they can’t all fit in the truncated string?

I feel like this is an edge case and not “the reason almost all C software has problems with non-English strings”. 99% of the time, copying up until the null byte is fine, whether or not the string is UTF-8 or ASCII. The reason most C software doesn’t work with non-English strings is because the developer never added support. The bytes are still there, they just need to be interpreted correctly in the UI portions of the code.


Truncating mid-codepoint produces an invalid utf8 sequence, which some decoders will silently ignore or replace with <?> glyphs, and others will fail loudly.

Truncating mid-grapheme-cluster can change semantics in unexpected ways. For example, the "family" emoji(s) can be encoded as a set of conjoined codepoints for each family member: https://stackoverflow.com/questions/49958287/printing-family.... Depending on where you truncate, you might just get the father, which probably won't cause you any serious issues but it might cause confusion depending on context.

(IMHO, if you're truncating a way where this would matter, you should probably be truncating in screen-space, i.e., way higher up the stack)
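
A common mitigation for the mid-codepoint case, for what it's worth (a sketch; it does nothing for grapheme clusters, which as noted really belong to the rendering layer):

    /* after truncating to n bytes, back up past any UTF-8 continuation
       bytes (10xxxxxx) so the string never ends mid-codepoint */
    while (n > 0 && ((unsigned char)dst[n - 1] & 0xC0) == 0x80)
        n--;
    dst[n] = '\0';   /* assumes dst has room for n + 1 bytes */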


Yep, this all makes sense and is what I was thinking as well. I also think you’re right about truncation being a better idea in the rendering state rather than in the bytes themselves. Either way, truncation of strings being displayed to the user requires extra care no matter what language you’re using.

As far as this being the reason most C code doesn’t handle UTF-8 though, I’m still skeptical.


Honestly, any language where you are manipulating strings is a footgun. Especially if you are manipulating text, and not character data. You can avoid many errors, but almost certainly not all. And as soon as you have someone that thinks they can piecewise localize any text, heaven help you.


With the up and coming browser LLM API, you could just have the LLM translate everything on the frontend. One of the few tasks that LLMs are consistently good at.


I feel this successfully upgrades a gun into a cannon?


A modern artillery gun, where each shell costs upwards of $1K apiece, and lord have mercy on you if it ever goes off inside the barrel.


LLMs are not “consistently good” at translating UI text, because a proper translation requires understanding the application and the application context of each piece of text.


Yes. Context is very important. Otherwise, you end up compressing files into a "postcode file".[0]

[0]: https://news.ycombinator.com/item?id=36231313


Not that C was wise with strings, but slicing strings in general is a bad idea unless you’re doing it strictly at ASCII borders, e.g. to capitalize a Windows disk letter or to split by cr?lf. Modern text is yet another “falsehoods programmers believe in” material.


I mean just use bstrlib why is it difficult?


Because no one can agree on which string library / abstraction to use, if they even use one at all which is rare.


I'd be happy if developers did that, but they don't:

So many C developers seem to think they're "smart enough" to handle strings correctly using only the standard library, not needing "training wheels" like the plebs - but then where do the countless string-related memory safety errors and all the broken Unicode handling come from that plague every C program, no matter how clever the programmers, no matter how tough the review process, no matter how much static analysis is used?

The issue is that many C developers simply can't seem to admit that C string handling, as it was invented many decades ago, was simply fundamentally flawed. So easy to misuse with catastrophic results, even for plain 1-byte-strings. It has nothing to do with how "smart" you are or how careful you are.


Are there really people defending string manipulation in C? C is a slight abstraction on top of assembly. Assembly speaks in addresses and bytes. It's always been bad at trying to interpret human text because Assembly's job lies in the digital realm.

I think it's more a case that programming is English skewed and localization concerns are rare for the programmers these day who still need to work in the realm of C (so, mostly for older legacy software or the embedded realm).


It doesn't matter if anyone is defending it - C developers still use C strings and the broken C standard library's string functions (exclusively) rather than using one of the many good string libraries they have available.

The result is that the same mistakes with string handling get repeated again and again, often with catastrophic results.


Developers do a lot of suboptimal things for suboptimal reasons. Ultimately what's "right" or "wrong" is decided case by case. So I don't see much point in venting against those developers without knowing their circumstances.

I mostly ask who's defending it for holistic purposes. Right or wrong, It would be nice to hear other perspectives on the matter if possible.


Smart is not the issue. I'm smart enough, but I'm lazy and forgetful, and C makes that bite me once in a while.


I mean just use C++ w/ std::string why is it difficult?


Most C programs compile and work correctly when compiled as C++. The exceptions are often places where you shouldn't write code like that anyway, sometimes even outright bugs that C++'s stricter type system caught for you.


std::u8string


No, std::string.


> like this one, only ever works with ASCII, no attempt whatsoever to handle UTF-8

I mean if your viewpoint is to handle heavy UIs then yes maybe C is not your go-to language. However, there are other solutions to the character space problem besides UTF-8. Rather than complaining that your screwdriver can't bang nails, perhaps try using a screw?

No, you won't be able to support a language like Chinese within a 256-character charset, but perhaps it's not important to in all situations. The strength of ASCII is that it's small and simple. If that's too anglo-centric for you, maybe there's a better symbolic tiny charset that is more applicable. I'd support something like https://lojban.io/ becoming more prevalent, but there could be advantages to having a simple(r) symbolic transmission language not burdened with anglo-centric concepts.


Advocating a new global constructed language to accommodate C's shortcomings seems like the wrong direction to be thinking in.


Less a global constructed language, more a "better" encoding. Base64 works really well for arbitrary binary-in-text encoding, for instance.


Unicode is that better encoding. The "small and efficient per locale encoding" that you proposed was the status quo, and was an endless source of mojibake. There is a reason we moved away from that.


I think there is a misunderstanding, which I tried to address but evidentally failed.

UTF-8 is fine for a display encoding. However, not every string encoding need be a display encoding, which the parent post seems to not be considering.

You could also have multiple display encodings, if it makes sense to (a tool only intended for use in a certain part of the world for instance), however that is not what I mean.


The code in the post does indeed have a bug if the goal is to copy UTF-8 data. However the bug is quite subtle and unlikely to occur in practice, and from your comment I suspect you don’t actually know what that problem is (because you seem to be pointing at a different issue which is actually irrelevant). Maybe you or someone else can describe what the real concern is?


> However the bug is quite subtle and unlikely to occur in practice,

Unlikely to occur? Do you think that will put the naysayers at ease?

> , and from your comment I suspect you don’t actually know what that problem is (because you seem to be pointing at a different issue which is actually irrelevant). Maybe you or someone else can describe what the real concern is?

Amazing that people have the patience for this.


¯\_(ツ)_/¯


Partially copying a 2-byte character?


UTF-8 works fine if you truncate a codepoint because the encoding scheme lets you detect this. The problem is more subtle than that (hint: it involves a 1-byte codepoint).


Truncating a UTF-8 codepoint is not fine because most software is not tested with partially broken UTF-8 so international users will likely run into many bugs.

Especially because concatenation is a very common operation so those sliced codepoints will be everywhere, including in the middle of text.


Morally I view “what do I do with my truncated string” to be a separate issue from “how do I truncate the string” as described in the article. Like, yes, you absolutely should not concatenate after doing this operation. But maybe you shouldn’t be showing the user a truncated string either even if it’s all ASCII. The question of “did you make an unparseable UTF-8 string” is answered with “no” and the more complicated but also more interesting question of “did you actually want this” remains unanswered.


This is fair, the article takes truncating a string to fit in a status bar as an example.


Also consider Unicode is not only international characters, but superscripts and other stuff ♥ᵃ

a: there was a list somewhere of which characters Hacker News allows?


If you're alluding to NUL, I don't really see the issue?

Yes, many languages allow strings (UTF-8 or otherwise) to contain null bytes, and C's str*() functions obviously do not, but null-termination vs not is an orthogonal issue to ASCII vs UTF-8.

i.e. Yes it's (depending on context) an issue that C str*() cannot handle strings with embedded null bytes, but that's not a UTF-8-specific issue.


A function that can turn a correctly formatted UTF-8 string into a malformed UTF-8 string is, in my opinion, broken.


One problem here is that the string may not have been a correctly formatted UTF-8 string to begin with. No, not that it can contain any bytes; I mean, it might be expected to satisfy even more than just decoding correctly. Maybe it is supposed to have its grapheme clusters preserved. Maybe the truncation should peel off the last file component because the string holds a path. The operation of "doing a dumb truncation" can be broken if you look at it from plenty of angles, and I don't disagree with you, but I do want to make clear that the issue isn't that memcpy is breaking it; it's that if you need x, y, z, maybe you're reaching for the wrong tool. And conversely there is nothing inherently wrong with using it if you are going to use it in a way that is resilient to that kind of truncation.


What about a function that can turn a correctly spelled english sentence into a malformed english sentence? If you truncate to a fixed length this comes with the territory.


The null code point? That would be pedantic even by my standards.


Look I had to include it or someone is going to do a whole pedantic comment about how C can’t actually represent UTF-8 correctly


You could have just said it rather than going through this smug "I know something you don't know!" song and dance.

Also, by this rationale, NO string is ever safe in C, because pretty much every encoding technically supports codepoint 0 (even though you take your life into your hands should you ever use it). This is not a useful discussion.


I mean strings that don't use that codepoint are fine, and that's most strings as I mentioned above.


Actually, by just alluding to the bug without saying it explicitly, you managed to both be pedantic and not avoid the discussion.

This is not meant as a personal attack; I just want to point out how it looks on a casual reading :)


Well, I'm not perfect ;)


By that metric, C can't represent ASCII correctly either, because there's no particular reason you couldn't have a NUL character somewhere inside a string.


Indeed it can't. Many developers were bitten by this, and still are; plenty of critical bugs and security vulnerabilities rely on this quirk too.


Technically, C can. It's just C strings that are limited.


Sure, in the exact same way that C can handle unicode just fine too. The problem is, as always, C strings.


>hint: it involves a 1-byte codepoint

Which is?


There is also strscpy [1], which behaves like the author's use of memccpy, except the former doesn't require manually passing the null terminator as an argument.

[1] https://manpages.debian.org/testing/linux-manual-4.8/strscpy...


C programmers like their overly concise naming schemes, don’t they… How do you expect a user to guess the subtle differences in behavior between strncpy, strlcpy and strscpy?

Sure screens were smaller before and you didn’t have autocomplete, but reaching out for documentation was also harder, and there was less of it. I guess “real programmers” memorize libc down to every detail and never use dependencies they didn’t write themselves.


In the 90s, as a working C programmer, I would make it a practice to read thru man 2 and man 3 from time to time, as well as to read all the options for gcc. The man pages were pretty good at covering the differences, if not all the implications of the differences.

And to this day I strive to minimize the dependencies of my software - any dependency you have is an essentially unbounded cost on the maintenance side. I used to sign up for the security mailing list for all the dependencies, and would expect to have to follow all the drama and politics for the dependencies in order to have context on when to upgrade and when not to. With Go I don't do that, but with Python I sometimes still do. And I've been reading the main R support email list forever. I still think that a periodic reading of lwn.net is an essential part of being a responsible programmer.

I will also say, reading the article reminds me of the terror of strings in C without a good library and set of conventions, and why Go (and presumably all other modern languages) is so much more relaxing to use. The place I worked had everything length-prefixed, and by the end of the 90s all the network parsing was code-generated from protocol description files, to remove the need for manually writing string copying code.


I was somewhat dismissive, but I agree that this way of thinking about dependencies is the right approach for systems programming. And it is fair to expect that users will read the manual in detail for any tool or library they adopt in the contexts where C is used.

It's just a bit frustrating to deal with so many names that are hard to understand and remember. C-style naming forces you to refer to the docs more often, and the docs are usually more sparse and less accessible than in other ecosystems. Man pages are relatively robust and they were a delight back in the day, but they have not been the gold standard for decades, and the documentation conventions for third-party libraries tend to be quite weak.


A distressing number of software engineers have overly accurate memories and don't notice when things become excessively cryptic or arcane.

However, the implementations being much more open source now means a lot of bad documentation can be overcome with code reading or, if needed, stepping through the code with a debugger. Wrong documentation is still expensive. I have a bitter taste in my mouth from integrating with the OpenTelemetry Go libraries. It seems to be sorted now in 1.27 and 1.28, but in 1.24 and for a few versions after, the docs were wrong, the examples were not transferable, and it took 5x the time it should have.


> The place I worked have everything length prefixed and by the end of the 90s all the network parsing was code generated from protocol description files, to remove the need for manually writing string copying code

Would you expound on this please? I've seen people reserve the first element as the size of the buffer but then, say a char buffer, has a size limit based on the size of char itself. What advice would you give yourself if you were just becoming a senior developer today? I do embedded myself


For protocols, you might be worrying about the bytes and use a 1-, 2-, 4- or 8-byte length, or some complicated variable-length integer like ASN.1 or something homegrown (but actually, with the slower networks, if you had a 2M msg, you'd split it into 64k chunks that fit into your 16-bit length-prefixed protocol and stream it). For the parsed data, probably you'd do size_t or ssize_t and Byte[], for portability. The standard parsing API would pass in (buf, buf_len) (or (pos, buf, buf_len)), and the standard struct would be (size_t name_len, Byte *name) (where Byte was typedef'd to unsigned char, so you wouldn't pick up signed char by mistake).
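
A minimal sketch of that kind of parsing API, assuming a 2-byte big-endian length prefix (the names and the error convention are illustrative, not the original code):

    #include <stddef.h>
    #include <sys/types.h>   /* ssize_t */

    typedef unsigned char Byte;

    /* parse one length-prefixed field starting at pos; returns the new
       position, or -1 if the input is malformed or truncated */
    ssize_t parse_field(size_t pos, const Byte *buf, size_t buf_len,
                        size_t *name_len, const Byte **name)
    {
        if (pos > buf_len || buf_len - pos < 2)
            return -1;                       /* no room for the prefix */
        size_t len = ((size_t)buf[pos] << 8) | buf[pos + 1];
        pos += 2;
        if (buf_len - pos < len)
            return -1;                       /* body runs past the buffer */
        *name_len = len;
        *name = buf + pos;
        return (ssize_t)(pos + len);
    }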


> C programmers like their overly concise naming schemes don’t they

It's a holdover from 1970s-era linkers, many of which required each exported symbol to be unique within the first six or so characters.


This is helpful, thanks. And I assume that no one imagined that you would need more variants of strcpy to begin with, strncpy is a pretty good name too, and then it's hard to break the pattern. Similar situations with lots of other names.

I just wish that C style used fewer single char names and hard-to-decipher acronyms. Smart programmers can't read minds either...


> like their overly concise naming schemes don’t they

One of the common features of "modern" languages that I can't comprehend is the love of single character sigils that completely change the meaning of a line of code or perhaps the entire enclosing function.

> I guess “real programmers” memorize libc down to every detail

Or, we just use 'man' and the massive 'section 3' of that system. A feature that no other language has managed to correctly replicate.


> How do you expect a user to guess the subtle differences

Software developers should never rely on guessing. Always, always read the specification.


You are right, perhaps "guess" is the wrong term, I meant that it is unreasonably hard to identify and remember which is which.


Usually you settle on one or two “go to” functions, if you don’t write your own wrapper function anyway. Given the subtleties of their semantics, fully descriptive names also seem unachievable. But really, familiarity comes with repeated exposure. If you use these every day, you’ll learn rather quickly.


It's a good point. To be fair, any system that lasts as long as C has will have legacy cruft that you need to learn to step around and it's hard to get rid of. Relative to other standards, C has been remarkably disciplined at keeping things clean.


> How do you expect a user to guess the subtle differences in behavior between strncpy, strlcpy and strscpy?

man (strxcopy)


Yes, but that is an internal function of the Linux kernel.

It is available in user programs only if you define it yourself, e.g. by using memccpy.
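
Such a definition might look like this (a sketch mirroring the kernel's semantics of NUL-terminating on truncation; the kernel returns -E2BIG where this returns -1):

    #include <string.h>
    #include <sys/types.h>   /* ssize_t */

    ssize_t strscpy(char *dest, const char *src, size_t count)
    {
        if (count == 0)
            return -1;
        char *end = memccpy(dest, src, '\0', count);
        if (end != NULL)
            return (end - dest) - 1;   /* length copied, excluding '\0' */
        dest[count - 1] = '\0';        /* truncated: terminate ourselves */
        return -1;
    }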


While the article itself is interesting, could they not have picked a less... offensive word in "I'd like to point out just how schizo this entire logic is"? Like strange, or weird, or unusual


"schizo" is not an offensive word.


Meaning #2: https://www.merriam-webster.com/dictionary/schizophrenic

Many words are used with different and derived meanings, usually disambiguated by context. For example, an offensive player is not offensive in the sense you used the word in.


None of those words convey the same meaning though. "Schizo" doesn't mean weird or unusual in this case. Maybe there's a better word but it really conveys a specific meaning, at least in the colloquial use of the word


How is it offensive?


Perhaps in a similar way as 'paki'.



