strcpy: A niche function you don't need (nullprogram.com)
150 points by grep_it on July 30, 2021 | 150 comments



My favourite C string function is snprintf:

• It takes a buffer size and truncates the output to the buffer size if it's too large.

• The buffer size includes the null terminator, so the simplest pattern of snprintf(buf, sizeof(buf), …) is correct.

• It always null-terminates the output for you, even if truncated.

• By providing NULL as the buffer argument, it will tell you the buffer size you need if you want to dynamically allocate.

And of course, it can safely copy strings:

  snprintf(dst_buf, sizeof(dst_buf), "%s", src_str);
Including non-null-terminated ones:

  snprintf(dst_buf, sizeof(dst_buf), "%.*s", (int)src_str_len, src_str_data);
And it's standard and portable, unlike e.g. strlcpy. It's one of the best C99 additions.
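
The measure-then-allocate pattern from the last bullet looks roughly like this (dir and name are placeholders; just a sketch):

    int len = snprintf(NULL, 0, "%s/%s", dir, name);   /* measure */
    if (len >= 0) {
        char *path = malloc((size_t)len + 1);
        if (path != NULL)
            snprintf(path, (size_t)len + 1, "%s/%s", dir, name);
    }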


My favourite C string function is asprintf. It's trivially safe, since it allocates an appropriately sized buffer, and it promotes correct programming by allowing developers to rely on "if I have a string then of course it was allocated by malloc and needs to be freed appropriately".

It's not part of the C standard, but it's trivial to ship around an implementation with your code.
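
For reference, a minimal sketch of such a shim, built on C99 vsnprintf (hypothetical name my_asprintf, untested):

    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* minimal asprintf-style shim on top of vsnprintf */
    int my_asprintf(char **strp, const char *fmt, ...)
    {
        va_list ap, ap2;
        va_start(ap, fmt);
        va_copy(ap2, ap);

        int len = vsnprintf(NULL, 0, fmt, ap);    /* measure the result */
        va_end(ap);
        if (len < 0 || (*strp = malloc((size_t)len + 1)) == NULL) {
            va_end(ap2);
            *strp = NULL;                         /* defined state on failure */
            return -1;
        }
        vsnprintf(*strp, (size_t)len + 1, fmt, ap2);
        va_end(ap2);
        return len;
    }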


Trivially safe? It can fail. As with snprintf, it's easy to forget to check the result, but at least with snprintf the result is always a safe string. With asprintf, not checking the result or doing it incorrectly will lead to undefined behaviour. From the man page:

> If memory allocation wasn't possible, or some other error occurs, these functions will return -1, and the contents of strp are undefined.


This is a dumb bug in the implementation. It should set dest to null on failure.


To be fair, dereferencing a null pointer is also UB in general rather than a guaranteed abort. (Some platforms may provide stronger guarantees, of course; many prevent pages from being mapped at 0x0 as a security mitigation.)


It's worse than this, as even on platforms where actually reading from 0x0 is a guaranteed abort, dereferencing a NULL pointer in C is still UB, meaning the compiler can assume it won't happen and optimize the program accordingly.

To take a rather convoluted example, if you dereference the pointer and then call a function that does a NULL check before writing to the pointer at some offset, it's possible that the compiler will inline the function, then elide the NULL check (since you've dereferenced it, the compiler assumes it's not NULL), then remove your dereference if it had no side effects, so now the write goes through without any check. Granted, it would have to be a write at a massive offset to actually hit an allocated page, but I'm sure there are similar scenarios that are more realistic.
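
A contrived sketch of that shape (whether a given compiler actually performs this chain of optimizations is another matter):

    struct msg { int header; int payload[100000]; };

    static void store(struct msg *m, int v)
    {
        if (m == NULL)              /* the check that can be elided...   */
            return;
        m->payload[99999] = v;      /* ...letting this write go through  */
    }

    void handle(struct msg *m)
    {
        int h = m->header;          /* dereference: compiler may assume m != NULL */
        (void)h;                    /* the load itself is dead and can be removed */
        store(m, 42);               /* after inlining, no NULL check remains      */
    }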


Well, yes, you need to check return codes. I file that under "trivial"; it's true of almost every function.


Although I agree with your general idea that asprintf() is the smarter choice, I think it's a stretch to call it trivially safe.

I think most people would call its API safe but not trivially safe, where trivial means "when I see that function called in other people's code, I don't need to pay attention to that call because it can't do anything crazy like cause undefined behaviour."

After all, if you include "as long as you check" in your definition of trivial, it is also trivial to check the parameters to strcpy() if you are using it right. And yet here we are, in a discussion about how that's a risk because it isn't used right.

If asprintf() terminated the program when it failed instead of leading on to undefined behaviour when unchecked, I'd call that trivially safe in a more pragmatic way. If you're going to ship a function for portability anyway, that's what I'd recommend. And in fact, that's what is used in software like GCC, called xasprintf() there. It returns the pointer or exits the program on allocation failure.


> If asprintf() terminated the program when it failed instead of leading on to undefined behaviour when unchecked, I'd call that trivially safe in a more pragmatic way.

Ugh. Maybe in some contexts that's ok, but some of us write code which handles memory allocation failures with a bit more finesse than abort().


In most contexts aborting on malloc failure is OK, and preferable to trying to handle it gracefully, which has caused lots of problems, including security problems. On Linux you need to be running in a non-default configuration for malloc to be fallible in the first place (other than in trivial circumstances like attempting to request exabytes of memory in a single call).


Yeah, that's one of the things I don't like about Linux.


And some of us includes me.

But at the point where you're able to handle all memory allocation failures usefully, you're almost certainly doing something non-trivial to recover.

For example aborting some requested transaction, pruning items from your program's caches, delaying a subtask to release its temporary memory, or compressing some structures. At that point there's nothing "trivially" safe about a memory allocation.

Probably there are bugs in those recovery actions too. Even the obviously most simple recovery action of propagating an error return to abort a request: If that's a network request, just returning the error "sorry out of memory" is potentially going to fail. So you need recovery from the recovery path failing.


It's also the slowest of all the string functions. Additionally, providing NULL as the buffer argument does all the work of formatting the string you're going to copy (when multiple conversion specifiers are used) and then discards it, so you have to do the format operation twice, doubling the cost of any string operation.

Oh and it's not compatible with UTF-8.


What makes it incompatible with UTF-8?


Normally UTF-8 works fine with C strings, but it's possible that a multi-byte UTF-8 encoded codepoint is only halfway copied when the output hits the buffer capacity, leaving the null terminator in the middle of that codepoint.
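
If you need to repair that after a truncating copy, something along these lines works (a rough sketch that assumes the source was valid UTF-8):

    size_t end = strlen(buf);
    size_t i = end;
    while (i > 0 && (buf[i - 1] & 0xC0) == 0x80)    /* skip trailing continuation bytes */
        i--;
    if (i > 0 && (buf[i - 1] & 0x80)) {             /* a multi-byte lead byte           */
        unsigned char lead = (unsigned char)buf[i - 1];
        size_t expect = (lead >= 0xF0) ? 4 : (lead >= 0xE0) ? 3 : 2;
        if (end - (i - 1) < expect)                 /* sequence got cut off: drop it    */
            buf[i - 1] = '\0';
    }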


Why is snprintf slow? I am surprised that it would be slow especially when compared to methods like asprintf that allocate the buffer.


I forget the exact reasoning now, but I remember it being about 10x slower than memcpy or strncpy. I think the main reason was because of the need to parse the format string.


Even then, printf and scanf are typically faster (and not even by a little bit, by a lot) than C++ iostreams formatted output, even though iostreams gets all the formatting information at compile-time, while printf has to parse the format string.

On the other hand, if people start to use snprintf in that particular form as a safe way of string copying, compilers could pattern-match this and substitute a direct implementation.


The modern C++ way of formatting strings is with std::format, or the external fmt library. It's faster than printf (and certainly streams!) while having convenient format strings (unlike streams) and optional compile time format parsing, combined with c++ type safety.


Everyone complains about streams, yet since Turbo C++ 1.0 for MS-DOS they have served me well for the kinds of applications I delivered into production with C++.


Isn't the biggest iostream bottleneck the forced sync with cstdio? You can turn that off at program start.


With that enabled it's comically slow, but even without the stdio syncing (and even operating on stringstreams) it's still much slower. For simple one-off formatting it's hard to notice, but once you write a parser for some engineering file formats (most of which are still ASCII) and have to read a few million floats here or there it becomes a really stark difference. For parsing floats specifically there are some implementations around which are optimized more, though some of the "fast and clever" ones are less accurate.


Oh I see, I thought you were comparing it to the printf style methods but compared to methods that do not take a format string that makes sense.


snprintf works most of the time but it can fail, and people almost never check the return value. For example it will always fail if you attempt to print a string >= 2GB. If that happens the output buffer may remain uninitialized (depends on the implementation) and you're at risk for a Heartbleed-like scenario.
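
The check itself is cheap; the problem is that nobody writes it (a sketch; buf and src are placeholders):

    int n = snprintf(buf, sizeof buf, "%s", src);
    if (n < 0) {
        buf[0] = '\0';   /* formatting failed; don't trust the buffer contents */
    }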


I'm curious, what type of application would require printing strings over 2GB in size?


(Intentionally crafted) large metadata in a video file (which can be very large anyway, so 2GB extra doesn't stand out). A string field in a (machine-generated) JSON/XML dataset. And like tomohawk says, lack of a mechanism that prevents accruing large blobs (for example not limiting the size of an incoming HTTP header in a web server).


The one that plans to print much shorter strings, but accepts arbitrary strings.


Binary file contents are often strings, right? If you're using snprintf() to copy strings, as the thread was suggesting, there are valid reasons to try to copy such strings (though they would be few and far between).


A bug elsewhere could lead to this


Note that snprintf returns the number of bytes that would have been written if the dest buffer were large enough, not the number of bytes actually written. I've seen a few projects misunderstand that and write code like this:

    for (i = 0; i < ...; i++) {
        offset += snprintf(
            dest + offset,
            sizeof(dest) - offset,
            "%s...",
            str[i]);
    }
That will cause a buffer overflow: if iteration n gets truncated because the dest buffer fills up, iteration n+1 will write past the end of the buffer.
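
One way to keep that pattern safe is to clamp before accumulating (a sketch, with a hypothetical bound n in place of the elided condition):

    size_t offset = 0;
    for (size_t i = 0; i < n; i++) {
        int written = snprintf(dest + offset, sizeof(dest) - offset,
                               "%s...", str[i]);
        if (written < 0)
            break;                              /* formatting error          */
        if ((size_t)written >= sizeof(dest) - offset) {
            offset = sizeof(dest) - 1;          /* truncated: stop appending */
            break;
        }
        offset += (size_t)written;
    }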


snprintf underlies most of the logging modules I've written (logging to memory / file / network / console...). I've been thinking about writing custom formatting routines, but there's surprisingly little need for them.

You probably know this, but sizeof is not a function. I prefer the easier to type

    snprintf(buf, sizeof buf, ...);


For anyone curious, this is something that's been discussed on HN at length as a result of a lkml thread

https://lkml.org/lkml/2012/7/11/103

https://news.ycombinator.com/item?id=9629461


Linus disproves his point. First he claims that sizeof behaves like a function, then in the next breath, realizing the flaw in his logic, proceeds to describe and excuse the counterpoint: sizeof(*p)->member.

This is classic Linus--too emotionally invested in a preference. Except in this case it's particularly pointless and unjustified.

sizeof is an operator. Period. The point of not using parentheses is to continually drive that point home. It's praxis. Of course, it's not unreasonable to prefer using parentheses. And there's a middle ground: most C styles nestle function identifiers and opening parentheses in a function invocation, whereas they require a space between operators and binary operands. So if you prefer using parentheses, whether all the time or just in a particular circumstance, you can do:

  sizeof (*p)->member
or

  sizeof ((*p)->member)
That's not entirely consistent. sizeof is a unary operator, and style guides tend to prefer nestling unary operators while spacing binary operators. But nobody is trying to be pedantic here. The issue is readability, minimizing typos, and dealing with the fact that the sizeof operator, while defined as and behaving exactly like a unary operator, doesn't look like one.

Also, it's worth pointing out that not only does the C standard itself literally define sizeof as a unary operator, all the code examples in the standard put a space between sizeof and its operand. It's a stylistic convention, but hardly arbitrary.

By contrast, there are other constructs, like _Generic, where the code examples do NOT use spacing. _Generic is a specialized construct altogether, but syntactically it behaves somewhat like a macro, and it's customary to style macro invocations like function invocations.


That was a great comment, thanks a lot for the clarification!

Another point that I find missing from Linus' post is that sizeof is also special in that its argument is never evaluated: sizeof launch_the_missiles() will never launch the missiles.

Yet another specialty is that sizeof applied to an array does not trigger array decay: if you have char buf[1024];, then sizeof buf is 1024, not a pointer size like 4 or 8.

To me, it's not like a function at all. It's more like an assembler macro. I usually agree with Linus and learn a lot from his posts, but here I think he's unsuccessfully trying to find arguments for his (IMO, misleading) stylistic preferences.
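
Both properties in one tiny example (sketch):

    #include <stdio.h>

    int launch_the_missiles(void);   /* never defined; never called */

    int main(void)
    {
        char buf[1024];
        printf("%zu\n", sizeof buf);                    /* 1024, not a pointer size */
        printf("%zu\n", sizeof launch_the_missiles());  /* sizeof(int); the operand
                                                           is never evaluated       */
        return 0;
    }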


My reading of his argument is that sizeof is a (compile-time) function that maps expressions or types to their storage size.

From that lens, C has a design mistake. I'm inclined to agree.


The + and - operators do exactly that with pointer operands, but nobody is claiming that they're behaving like functions.

In C++ most operators could end up invoking an actual, user-defined function at runtime. But, again, nobody is claiming that they therefore behave like functions.

I get the logic--if you squint really hard you can analogize sizeof to a function. It just doesn't work, as evidenced by his own example. C isn't LISP. It's C. It has a unique grammar with distinct classes of constructs. sizeof is a unary operator, parses exactly like every other unary operator (tokenization conflict with identifiers notwithstanding), which is quite different from how function calls are parsed, particularly wrt parentheses. sizeof has some unique characteristics, but every operator has unique characteristics; that's why they're operators, as opposed to some functional languages that try to subsume everything into function-like syntax.


It's too bad that many of the string handling functions in the C standard library are ticking time bombs. I like the approach taken in e.g. git which converts problematic function calls into compile errors https://github.com/git/git/blob/master/banned.h


I would rather use C++, however if I really must write C, then a string handling library like the SDS from Redis project is a must.

https://github.com/antirez/sds

Naturally turning on all security features of the C compiler being used, warnings, warnings as errors, static analysers, FORTIFY like libraries, the whole package.

And like almost everyone else, I end up having my occasional segfault regardless of the amount of precautions taken, because I am not elite enough.


I like it too, but then you're going down the uncanny valley of:

- The project is new, in which case you can easily/safely ban functions, but then why are you starting a new C project in 2021?

- The project already exists, and now you need to refactor out all the compile-time errors in order to move forward (time-consuming).

Keep in mind the first is a real question that should be answered. If your goal is to avoid undefined behavior/potential security headaches, then C should be entered into after careful consideration of cost/benefits. There are better alternatives for some projects but not others YMMV.


> The project is new, in which case you can safely ban functions, but then why are you starting a new C project in 2021?

Interoperability? Tons of chips have a single C compiler forked from an old version of GCC, without even full C99 support, and that's it. Good luck getting anything else running on that hardware.


Yep, for the 0.0001% of developers who target arcane architectures C is still pretty much it.


That's why you should evaluate the cost/benefit, like I said.

> If your goal is to avoid undefined behavior/potential security headaches, then C should be entered into after careful consideration of cost/benefits. There are better alternatives for some projects but not others YMMV.


One reason would be because you heard about Azure Sphere, their security marketing, and the 7 properties of highly secure platforms, so you order a devkit.

Then when you open the box you discover that despite all the security talk, the only way to officially program the Azure Sphere is with a C SDK.


> - The project already exists, and now you need to refactor out all the compile-time errors in order to move forward (time-consuming).

Use a static analysis tool with proper, non-brittle tracking of new vs old findings (including proper move and copy file tracking) and forbid new findings, e.g. by preventing offending code from being merged to master with tools like Gitlab, Bitbucket, what have you. Works very well in practice. Bonus points if the tool can also tell you when you modified a function and didn't clean up the old findings in there. Then you can really enforce incremental improvement.

Disclaimer: I work on Teamscale (https://teamscale.com), which does all that.


Both cases are real. If you start a new project in C, it means you have a very good reason - and hopefully a strategy of dealing with strings and other problematic issues.

If you need to deal with an older codebase, the all-or-nothing approach might not be appropriate - incremental improvements might be a better option. Yes, you will have more problems to deal with initially, but with time the situation will get better.


> If you start a new project in C, it means you have a very good reason

That does not follow.

If you do it, you ought to have a good reason, but very probably don't. Any reason you thought you had, if you had any at all, argues for C++ instead, because it works anywhere C does, and enables you to choose to use modern methods that are not inherently error-prone.

Whether you do choose to use modern methods, in any non-C language you end up using, is a whole other matter. But at least you can.


There are still platforms with C compilers but no (or very bad) C++ compilers. It's much easier to write a decent C compiler than a C++ one.


The set of such platforms that one might start a new project for is very limited. Maybe, flight computer in a Boeing 737 variant? Betting that has no C compiler. 8051?

The BSDs have supported C++ user-space programs for long enough that "just working" should be well within range.

A new project involving a BSD or Linux kernel subsystem, or PostgreSQL or SQLite component, is a plausible interpretation, where C++ is actually forbidden for wholly non-technical reasons.


Yeah. If you want your code to “just work” on stuff like the BSDs, then C (or an interpreter written in C) is still your best bet.


We use C because significant parts of our platform are GStreamer plugins… so C. Though if I were going to greenfield GStreamer C would still be a pretty reasonable option in 2021 IMO.


> Though if I were going to greenfield GStreamer C would still be a pretty reasonable option in 2021 IMO.

GStreamer is routinely exposed to hostile content in non-sandboxed contexts, such as your traditional Linux desktop; it even works as a browser plugin! Among projects you might want to write in C, GStreamer strikes me as one of the least reasonable. If GStreamer were more widely deployed on phones, it'd be prime attack surface for e.g. regimes attempting to surveil journalists.


"Yeah, but it's not that bad, because like, you can always choose not install the gstreamer-plugins-ugly package."


You can have reasonable C FFI from several other languages. Notably in this context Rust.

Of course if your code is just a thin layer of glue it's not worth it, you'd spend all your time getting into and out of the FFI. But if you write a non-trivial amount of code that actually does something because it isn't just a glue layer, C might actually not be the obvious choice after all.

In the particular case of Rust, somebody else apparently already did this work (not tested, and thus not vouched for): https://crates.io/crates/gstreamer


>We use C because significant parts of our platform are GStreamer plugins… so C.

Ehh..., we have written and deployed gstreamer plugins for a proprietary project at $dayjob which was written in Rust. In the case of gstreamer, they have a plethora of examples for their Rust bindings (I think it's also officially supported).


GStreamer plugins do not in any way dictate C. C++ works fine. As does Zig; and even Rust, with some annoyance.


> why are you starting a new C project in 2021?

Why not?


It's unsafe.


And what is safe? C has well-defined, tested, and documented specifications which lay out exactly how to use C safely. Languages like Rust are safer, but nothing is just safe. If you are an inexperienced developer it is still very easy to write insecure code in Rust.

Sure, moving to safer languages is good. But it is impossible to rule out the use of a language so established that every major operating system is written in it. It is practically impossible to not use C - safe languages are normally bootstrapped by it and eventually your “safe” code will run in an environment programmed in C.


I agree that nothing is fully safe and any developer will always introduce security vulnerabilities eventually, but history has proven that C is unsafe: generations of very competent developers wrote unsafe C despite documentation on how to write C safely.


It's pretty fun though. Despite all the problems, I just don't enjoy programming in other languages as much as I enjoy writing C. I've tried C++, Rust, Go, Zig, Javascript, Ruby, Python, Lisp, Lua... I just don't feel as free as I do while writing C.


The ban can be rolled out gradually by not including banned.h everywhere.


But then you're undermining your own hard compile-time check, wherein specific files or areas of the project could fall through the cracks and continue to use supposedly "banned" functions.

I'd argue doing it globally, but as a warning, would be a better alternative to doing it file by file.


A warning would be nice, is there a way to implement that without changing the compiler though?


Something along these lines should work, at least for both GCC and Clang:

    extern void banned() __attribute__((deprecated("This function is problematic")));

    #define strcpy(...) strcpy(__VA_ARGS__); banned;
Which would output something along the lines of:

    test.c:11:3: warning: 'banned' is deprecated: This function is problematic [-Wdeprecated-declarations]
        strcpy(x, y);


MSVC does this, or at least did the last time I had to compile C code for windows. You should be able to do it with the preprocessor if you wanted to.


Honestly not trolling - I disagree with the "ticking time bombs" comment. If you feel that way the devs should be using Rust.

C is a sharp knife with no handle; this is its purpose as a language and tool.

cc: theo@openbsd.org


> the devs should be using Rust.

Yes, they should. Or several other things depending on what exactly they need.

> C is a sharp knife with no handle; this is its purpose as a language and tool.

Help me out here HN resident survivalists, carpenters, maybe circus knife throwers. What is the "purpose" of a "sharp knife with no handle" exactly? How often have you thought, "Man, it'd be so much easier to gather firewood, carve decorations or score a bullseye if only the blade would sink into my own flesh while I was using it because it doesn't have a handle" ?

Historically the argument was, "We're using C because alternatives like Java or Python or whatever aren't fast enough or capable enough". OK. But, somewhere in the last few years it moved to, "We're using C because alternatives aren't dangerous enough" and that's crazy.


BYOH - Bring Your Own Handle

The purpose is not to assume what the developer wants, but to provide access to all the resources (and manipulation) they need.

If you need guardrails there are 10s of languages designed for specific purposes.


However at some point you realize that there is an optimal and general-enough handle shape that everyone ends up individually re-creating.

You should use a knife that comes with such a handle but gives you the option to take it off when you want to impress your friends with your knife-juggling skills and feel you have a few fingers too many.


Before C was born, developers were using system programming languages with handles just fine.

C lacks a handle because its designers didn't care to provide one; it is, after all, based on a language whose main purpose was to bootstrap CPL.


FWIW, Zig is as sharp as C but with a handle.


Why should you pick Zig and not Rust, D or Go?


Zig is probably nicer if you want to write code that is to be called from C.


These are just simple API design issues in the standard library, nothing to do with the language itself.


When I used C for a serious project, I always used `snprintf(dst, sizeof(dst), "%s", src);` to copy a string. It might be a little bit slow, but it freed me from all the headaches of identifying different string functions of C and remembering their subtle differences. It also is useful for other purposes, e.g. prefixing a string.


I do this as well. At some point I googled around and didn't get a straight answer. So I've been using snprintf() for all my string manipulation ever since. My productivity is more important than a sliver of performance.


Don't use sizeof. If you pass in a pointer you're going to get the size of the pointer, not the length of the string buffer.


But no suggestion for an alternative?


There are a lot of alternatives, none of which I particularly like best for all cases, so I won't suggest something I don't think is very good. sizeof is particularly bad, however, for any kind of strings.


sizeof would give you the actual size if you pass in a statically declared array.


C2X will be adding memccpy() (note two c's in the middle, not memcpy!). Overview and justification at https://developers.redhat.com/blog/2019/08/12/efficient-stri...
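
For those who haven't seen it, a bounded copy with truncation detection looks roughly like this (a sketch, with a made-up helper name):

    #include <string.h>

    /* returns 0 on success, -1 if src had to be truncated */
    int copy_str(char *dst, size_t dstsize, const char *src)
    {
        if (dstsize == 0)
            return -1;
        /* copy at most dstsize bytes, stopping after the '\0' if one fits */
        char *end = memccpy(dst, src, '\0', dstsize);
        if (end == NULL) {                /* no '\0' within dstsize bytes */
            dst[dstsize - 1] = '\0';
            return -1;
        }
        return 0;
    }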


Great naming choice. Won't be confusing at all.


It's been available in FreeBSD since mid 90s and yet nobody ever uses it, not even FreeBSD developers. I like memccpy because it's easy to check for all possible errors known to me, but everybody else seems to prefer strlcpy because it came from THE security-oriented BSD.


Not sure I agree with the recommendation against strlcpy. While it is technically true that if you can't replace strcpy with memcpy you're using strcpy wrong, it's also true that most uses of strcpy are wrong, which I think is a better point to make. The stated purpose of strcpy is to copy a string, and if you're copying a string your best bet is strlcpy. The article is worded in such a way that you'd walk away thinking "I should always use memcpy."


I'm not a strlcpy fan, but I'll never understand recommending against string functions because they're "nonstandard". They're tiny and portable almost by definition. Vendor them in to your project.


If your project is small and self-contained, sure. If you have a large codebase, you have to at least rename to mylib_strlcpy. Then the other projects do that as well. Then you have 572 almost-clones of strlcpy.

Consider this before copying code if you are part of a large codebase. Note that sufficiently popular OSS typically is.


Even better is to not null terminate strings and use pointer plus length everywhere.


> Even better is to not null terminate strings and use pointer plus length everywhere.

Yes. Except that we aren't programming in a void. Particularly if you are writing C to begin with, you will have to interface with decades of existing code. Some of which has interfaces that crept into standards.

You eventually have to pass a string to some function that does not have a length argument and expects a null-terminated string, be it to a library function or the operating system itself (e.g. the `open` system call). You will still need to keep that null-terminator around.


The only reason why that happens is that C refuses to standardise support for anything else.

Standardise support for pointer+length strings and the most active parts of the ecosystem will start using it. It will take a long time to get widespread but the sooner you start the sooner it will happen.

Sure, you will sometimes, often even, have to convert back to traditional strings. That's no big deal; there should be helper functions. In D you just add .toStringZ to any D string and you get a C string, which makes interacting with C code easy.

Of course none of this will happen because C is dead from an evolutionary point of view. Hopefully new CS students will not have to deal with any of this bullshit in a few decades.


If they were actually open to sorting out security issues, even if they never accept fat pointers into the language as additional types like in Checked C, they could at the very least add something like SDS for arrays and strings.

New CS students will always have to deal with this bullshit in all the decades to come, because the industry will keep relying on UNIX clones for its computing infrastructure until we switch to something else like quantum computers.


You can generate null terminated strings at the point of interfacing with those legacy functions. Yes there is overhead, but you're already probably prematurely optimizing all the wrong things anyhow since you're using C. :-P



A struct of string length and then a C-string is a cumbersome solution (but we are talking about C, everything is cumbersome) but it should work for all use-cases.
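
Something along these lines, roughly (hypothetical names; it keeps a terminator around purely for interop):

    #include <stdlib.h>
    #include <string.h>

    struct lstring {
        size_t len;     /* byte count, excluding the terminator            */
        char   data[];  /* flexible array member, kept NUL-terminated for
                           easy interop with legacy APIs                   */
    };

    struct lstring *lstring_new(const char *src, size_t len)
    {
        struct lstring *s = malloc(sizeof *s + len + 1);
        if (s == NULL)
            return NULL;
        s->len = len;
        memcpy(s->data, src, len);
        s->data[len] = '\0';
        return s;
    }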


For example the SDS library used by Redis.


Right, when I was younger, I was convinced that NUL termination was a reasonable strategy. Learning C in the 1990s it made plenty of sense, even though I was also learning about buffer overflows and underflows.

One of the last things that finally changed my mind was the observation that the length shouldn't live with the text, but with the structure describing the text. Some of you might be laughing now, because that was obvious to you, but I genuinely had gone years without considering that. I'd been imagining a hack like the length of the string lives in a few bytes "before" the text.

Once I was envisioning the mutable string as [length, pointer] itself, that seemed obviously better and I was onboard with abolishing NUL termination in software.


A lot of what C provides is not supposed to be used by application code. The string interface is the bare minimum one should use, but any reasonable application should create its own higher level interface. The problem of the C community is that they never managed to create a reasonable string library, or if one was created it is seldom used. Future standards should introduce a higher level string library to fix this problem.


> I'd been imagining a hack like the length of the string lives in a few bytes "before" the text.

That's normal, usually called a "Pascal string".

As I recall, the C standard makes no assumption of whether strings are null-terminated or not.


> As I recall, the C standard makes no assumption of whether strings are null-terminated or not.

I'm not sure what you mean by assumptions made by the C standard, but it definitely says strings are null-terminated:

> A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.

and

> A string literal need not be a string [...], because a null character may be embedded in it by a \0 escape sequence.

(the second one is noting that if a string literal contains "\0", then it's not a string but contains a string with more stuff after it).


You're right. I was remembering something badly. I found some interesting things while looking into this, though:

C's behavior of defining literal strings as null-terminated character arrays is already described in the 1978 K&R. The word "string" is used in the text of the section, but not in its title, "character arrays". Null termination is mentioned here, as the book works through an example of successively reading lines from standard input:

    `getline` puts the character \0 (the 'null character', whose value is zero)
    at the end of the array it is creating, to mark the end of the string of charac-
    ters.  This convention is also used by the C compiler: when a string constant
    like
           "hello\n"
    is written in a C program, the compiler creates an array of characters con-
    taining the characters of the string, and terminates it with a \0 so that func-
    tions such as `printf` can detect the end:
    
           | h | e | l | l | o | \n | \0 |
    
    the `%s` format specification in `printf` expects a string represented in this form.
There is a hint, immediately prior to this, as to why null termination might have been chosen:

    The length of the array `s` is not specified in `getline` since it is determined in `main`.
(Where `getline` is a function defined in the example, and `s` is a parameter to that function.)

https://en.wikipedia.org/wiki/Comparison_of_Pascal_and_C#Str... has an intriguing comment that seems likely to be related:

> In Pascal a string literal of length n is compatible with the type `packed array [1..n] of char`.

> Pascal has no support for variable-length arrays, and so any set of routines to perform string operations is dependent on a particular string size.

I suspect that I was remembering someone writing that how to represent strings was a live issue at the time of the creation of C, rather than, as I wrote above, being a live issue within C for some period after its creation.

----

On an unrelated note, it's interesting to see that the web convention of fixed-width type for code literals and variable-width type for natural text was already in force in K&R 1978.


Like almost everything in software development it's a trade-off, length + pointer means that your data structures become bigger and that often you will use more registers. That used to matter more than it does today.


I don’t believe it mattered that much even at the time. Having to calculate the length of the string by iterating over it at each string operation was and is much more wasteful and slow. It is simply a stupid decision.


It might sound obvious to you now, but most functional languages conceptually store strings as nil-terminated lists ...


The problem is memory safety and functional languages won’t read into another object’s memory even with a logical bug.

And a null-terminated linked list is different from a C-string.


Unfortunately C is locked into null-terminated strings, given that all the printf-style functions work on the assumption there'll be a null terminator. C++ has std::string_view which is pointer + length, but you've still got the same problem if you need to call older printf-style functions.


Why do you have to use printf? You could have a string library with its own formatting routines. There's also the option of using both a length AND a null terminator.


> There's also the option of using both a length AND a null terminator.

I first encountered that idea in this classic Joel on Software post, which rather put me off the idea of using them in production:

> Notice in this case you’ve got a string that is null terminated (the compiler did that) as well as a Pascal string. I used to call these fucked strings because it’s easier than calling them null terminated pascal strings but this is a rated-G channel so you will have to use the longer name.

https://www.joelonsoftware.com/2001/12/11/back-to-basics/


Joel's article is a bit long in the tooth. As of C++11, std::string is required to use both a length and a null terminator. With the short string optimization, it could even have that exact layout.

> Lazy programmers would do this, and have slow programs

    char* str = "*Hello!";
    str[0] = strlen(str) - 1;
Modern compilers understand strlen, and will replace the function call with a constant where possible. That code's not slow anymore: https://godbolt.org/z/Kjh8b44Kf


That's good to know, thanks.

But Joel was talking about C rather than C++, where your comments about std::string wouldn't apply, right?


If you are programming C, you will already roll your own data structures. So why not roll your own formatting routines?


Nope, printf can print strings without NULL-terminator: printf("%.*s", <int>length, <char*>string);


And the first argument to printf (the format), what kind of string is that? And what kind of string does sprintf() produce?


In practice, it is almost always a compile-time-known string. gcc will warn you if it isn't, especially since allowing the use of untrusted input for the format can lead to vulnerabilities:

https://en.wikipedia.org/wiki/Uncontrolled_format_string


Anywhere that this format is a variable, you probably already screwed up. C allows that, but if I see it that's getting flagged in my review.

So long as the format string is a literal you needn't care how it works.

Now, one of the places where C makes this nastier than it needed to be is that C built-in types are silly, and so any non-trivial program is using better fundamental types like uint32_t (or the more succinct u32), for which the built-in formatter offers no syntax. So you end up writing format strings like "There are %" PRIu32 " dogs\n", using macros to bring in the appropriate conversion specifier. Blergh.
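
For the record, spelled out (sketch):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t dogs = 3;
        printf("There are %" PRIu32 " dogs\n", dogs);  /* PRIu32 expands to the right
                                                          conversion for uint32_t    */
        return 0;
    }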


sprintf tells you how many characters it has written, so there's no reason you can't use it for non-null-terminated strings.


williamvds's point is that the first argument to printf is still itself a null-terminated string, so it's basically turtles all the way down if you're using the C standard library.


Their comment talked about two things, and the sibling comment addressed the other one.


Those advocating for length+pointer should really write it up, including showing all the time and memory complexity for all operations and pointing out its cons as well.

There are numerous safe-string libraries for C. I don't think anyone uses them much.

To me, null-terminated strings look like admirable restraint from Thompson, Kernighan and Ritchie.


What are the pros of null-terminated strings? You have to recalculate something you already know at each string operation, potentially failing. It is slow and unsafe.

Languages that can have a proper string type usually use a length plus non-null-terminated data, while I think C++ uses a length plus a C string for longer texts, but for short strings it can "hack" the text itself into the pointer storage.


I thought we'd all decided the Chad way was the best way to do strings in C.

https://github.com/skullchap/chadstr

Maybe not though. Issue #6 is unresolved.


In another universe, Niklaus Wirth designed Pascal with slightly more flexible stacks, and we're all using Pascal strings, which wouldn't have any of these issues.


What's the standard practice these days in C to move strings with lengths around? I've been out of C for at least a couple of years now, but I can't imagine it's changed in that time.


If you're asking what to do about copying strings, then it's either memcpy(), or rarely str[n]cpy(). strcpy when I can assume the source length is safe but don't know the size of the underlying buffer. strncpy when I want to check the return value and maybe issue an error.

For passing references around, I use whatever works. Plain `const char *` argument is certainly a frequent choice for simple name or filepath arguments. That can even mean doing the occasional strlen() when making a copy of that string. It doesn't bother me at all; overall zero-terminated strings are very easy to use. Can't understand why people never stop bitching about it.

When the string is not just an opaque ID, but needs to be examined more closely, it's usually more of a "slicey" or a buffer-processing problem - then I'll add an `int len` to the list of arguments, or to the members in a struct.
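
In struct form that's basically just (a sketch; nothing owns anything here):

    struct slice {
        const char *data;   /* not owned, not necessarily NUL-terminated */
        int         len;
    };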

Very rarely I'll create a String class, but usually I don't bother. It feels to me like going against the grain of the language. I don't want to create my own host of string processing functions that take this String as argument, when it's usually simpler to operate directly on the data.

Something that I almost never need is the "growable" string class with memory management like std::string. I have no idea right now why I would need such a thing. I tend to write my programs to work on fixed buffers. At most I'll create dynamically sized strings, but a generic string that can grow after creation isn't a frequent use case.


I agree with everything you’ve stated. Just curious what others think. Thanks for sharing.


I'd assume memcpy.

But a more serious answer is to use C++ strings instead, lol. Writing "C-like C++" is probably more beneficial.

I do realize that a lot of people prefer to write in pure C (ex: Linux kernel team), but more and more people are realizing the benefits of C-like C++ code.


Yeah, I would assume as much, too. When I last worked in C, there was no de facto way to do this since so many functions just worked on null termination.


Related: Designing a Better strcpy, from last month[0]

[0] https://news.ycombinator.com/item?id=27537900


I found the article kind of misleading: using memcpy for C strings is generally a bad idea, unless the string length is bound to the string buffer as in std::string. Otherwise it will make code review very difficult.

In our team C strings are prefixed with sz_, e.g. char sz_name[13], and we always use a safe subset (or a safe replacement) of the strxxx functions with these sz_ prefixed variables. Using memxxx with sz_ variables is explicitly forbidden, since it may break the NULL-terminating contract.

The sz_ prefix convention is in no way like the Hungarian naming nonsense. Suppose that you have "char sz_name[13]" in a structure of configuration parameters; sz_ tells the person changing the field to keep it NULL-terminated, and if they don't, it's their fault. On the other side, users of this field can safely use printf("%s", sz_name) without the risk of crashing the program.

For safe replacements of strcpy, I recommend: https://news.ycombinator.com/item?id=27537900


MSVC is a very curious choice of strcpy_s implementation given that MSVC is openly not compliant with C11, much less Annex K.

OpenWatcom, which implements a late draft of the standard, behaves as expected on their test case:

  Runtime-constraint violation: strcpy_s, s1max > RSIZE_MAX.
  ABNORMAL TERMINATION
The reason not to use the _s versions isn't that they are bad, it's that basically nobody has implemented them (hence me having to use Open Watcom to demonstrate this example)

[edit]

Just noticed that in the page they linked to at the top when they mention strcpy_s, it notes that the MSVC implementation predates even the original draft of the standard and lacks RSIZE_MAX.


MSVC is compliant with C11 and C17, minus optional annexes, they are after all optional and not required for ISO compliance certification.


Thanks for the correction. C support in MSVC was terrible just a few years ago, but googling I see they added better C support starting in 2019


If this is the case, why aren’t the stdlib functions defined this way? In all of the history of the longest-lived production language family besides FORTRAN, this blog post is the first voice to point out that memcpy should be the same operation as strcpy for null-terminated strings? What is going on here?

(Rust crowd snickers as they unwrap<‘jk> &mut *foo_buf)


You could implement strcpy(a, b) as memcpy(a, b, strlen(b) + 1), but it would be slower, since it makes two separate passes over b (one to calculate the size, one to copy).

The post seems to be arguing that in most cases where you would call strcpy, you should know the size already, because otherwise you wouldn't know whether the source string was short enough to fit in the destination buffer.


Curious as well, but when I look at the glibc source for strncpy, it is calling memcpy: https://code.woboq.org/userspace/glibc/string/strncpy.c.html

It all seems to depend on the compiler vendor.


Know what's even worse than strcpy?

strlcpy.

Every use of it I have seen was subtly incorrect. To use it correctly takes more code than anybody wants to write, or (AFAICT) ever does.
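
For reference, the check that almost every call site skips (a sketch; assumes a platform strlcpy and that dst is an actual array):

    if (strlcpy(dst, src, sizeof dst) >= sizeof dst) {
        /* src did not fit: dst is truncated (but still NUL-terminated);
           silently ignoring this is exactly the usual subtle bug */
    }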

strcat is worse than both, though.


Not just strcpy either. Pretty much all str* functions are bad and should be replaced with their mem* equivalents when available.

I don't even know why the str* functions exist. They're just worse versions of mem* functions.


They exist to support working with zero-terminated strings, which worked just fine back then, and IMHO are useful for identifiers even today (easy to use + some modest memory savings).

Yeah, some stuff like strcat, strtok, or maybe even strcpy is taking it a little bit too far and is too inviting of errors. 30-50 years ago there was often a practice of not minding security aspects of processing external data. But even today, you could use those safely if you know the data.

strtok especially is one arcane function that was probably used a whole lot more back then. Today, hardly anybody does fixed-character delimited fields. strtok was probably useful to "parse" /etc/fstab and formats like that with as little own code as possible :-)

Some functions I use from time to time: strcmp(), strncmp(), strcpy(), strncpy(), strstr(), strchr(), strrchr().

One can rewrite versions of them to work on different types of strings, but they're readily available (if you allow libc dependency). There might be a speed advantage to home-grown solutions, too - although that's not really the point, and if performance matters the bottleneck shouldn't be on such pedestrian string processing anyway.


strtok_r() if you want to be thread safe.


Still a weird function, hard to find a use for it :-)


Should have been deprecated around 1990 and removed by 2005.


[flagged]


Are you doing okay? Seems like an unnecessarily acidic response to this post, which definitely seems like it's intended for C beginners.


(not parent, guessing he was being snarky) While you make a good point, I don’t generally find beginner content on hacker news.

I personally evaluated this trending post as something worth my consideration and analysis. Perhaps my news PID needs a tune-up!


One-liner summary: Author suggests using just memcpy() typically, strncpy() rarely/maybe, and even more rarely, or never, strcpy().


I think though it is recommending against strncpy as well…

> linters and code reviewers commonly recommend alternatives such as strncpy (difficult to use correctly; mismatched semantics) […] Besides their individual shortcomings, these answers are incorrect. strcpy and friends are, at best, incredibly niche, and the correct replacement is memcpy.


strncpy() does a specific thing really well. If it's the 1970s where you are and you're writing Unix filesystem code or some 1970s data processing application your program likely has a use for strncpy() and it matches the semantics you need exactly.

But it isn't intended as a solution for buffer overflow bugs in your program, and so if you try to abuse it to solve that problem you likely introduce more problems.

Imagine your car's air conditioning doesn't work properly. On sunny days it's really much too hot in the car. So, you buy a sunroof. Says "Sun" right in the name, surely that will help right? No. That's not what a sunroof is for. The sunroof works fine as a sunroof but that is not what you needed.
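
The kind of 1970s fixed-width-field job strncpy actually does well looks something like this (a sketch of a V7-style directory entry):

    #include <stdint.h>
    #include <string.h>

    struct old_dirent {
        uint16_t d_ino;
        char     d_name[14];   /* fixed-width field, not a C string */
    };

    void set_name(struct old_dirent *e, const char *name)
    {
        /* copies at most 14 bytes and zero-pads the rest;
           deliberately does NOT guarantee a trailing NUL */
        strncpy(e->d_name, name, sizeof e->d_name);
    }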


I don’t think that’s an accurate summary. “Use memcpy() instead. Use strncpy() in limited circumstances.”


Doesn't memcpy have the same issue as strncpy? The destination will not be null-terminated if the source is too long.

Many projects just implement a safe_strncpy wrapper that always terminates the destination. Example: https://github.com/brgl/busybox/blob/master/libbb/safe_strnc...


God the C runtime library is so bad. So is the C++ STL.

I think it’s a travesty that these languages defined an API but didn’t provide an implementation. Hindsight is 20/20, but what a nightmare!

It is far more rational to provide an implementation using standard language features. It’s not like strcpy needs to make a syscall!


Not sure why you brought the STL into it. Copying a std::string is as easy as assigning the variable, or calling the assign function with a char* and a length.


I’m asserting that the CRT API design is bad. And following up that C++’s isn’t any better.

std::string does better facilitate copying strings. But string manipulation with std::string is really really bad.

I’m also saying that the concept of defining an API but not providing an implementation is insane. That is, imho, extremely inappropriate at the language level. The differences in behavior between C and C++ STL implementations on different platforms are infamous. And almost entirely unnecessary.


However, that would require an implementation of thousands of SLOC for a weird class type with lots of subtle semantics, and generally inefficient behaviour (small heap allocations).



