
Even better is to not null terminate strings and use pointer plus length everywhere.



> Even better is to not null terminate strings and use pointer plus length everywhere.

Yes. Except that we aren't programming in a void. Particularly if you are writing C to begin with, you will have to interface with decades of existing code. Some of which has interfaces that crept into standards.

You eventually have to pass a string to some function that does not have a length argument and expects a null-terminated string, be it to a library function or the operating system itself (e.g. the `open` system call). You will still need to keep that null-terminator around.


The only reason why that happens is that C refuses to standardise support for anything else.

Standardise support for pointer+length strings and the most active parts of the ecosystem will start using it. It will take a long time to get widespread but the sooner you start the sooner it will happen.

Sure, you will have to revert to traditional strings. Sometimes often. That's no big deal, there should be helper functions. In D you just add .toStringz to any D string and you get a C string, which makes interacting with C code easy.

Of course none of this will happen because C is dead from an evolutionary point of view. Hopefully new CS students will not have to deal with any of this bullshit in a few decades.


If they were actually open to sorting out security issues, then even if they never accepted fat pointers into the language as additional types, as in Checked C, they could at the very least standardise something like SDS for arrays and strings.

New CS students will always have to deal with this bullshit in all the decades to come, because the industry will keep relying on UNIX clones for its computing infrastructure until we switch to something else like quantum computers.


You can generate null terminated strings at the point of interfacing with those legacy functions. Yes there is overhead, but you're already probably prematurely optimizing all the wrong things anyhow since you're using C. :-P
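A minimal sketch of that conversion at the boundary, assuming a hypothetical `struct str` (pointer plus length) and a hypothetical helper `str_to_cstr`; neither is part of any standard or existing library:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pointer+length string type, for illustration only. */
struct str {
    const char *ptr;
    size_t len;
};

/* Returns a freshly malloc'd NUL-terminated copy of s, or NULL if the
 * allocation fails or the text contains an embedded '\0' (which a
 * C string cannot represent). The caller frees the result. */
char *str_to_cstr(struct str s)
{
    assert(s.ptr != NULL || s.len == 0);

    if (s.len > 0 && memchr(s.ptr, '\0', s.len) != NULL)
        return NULL;

    char *out = malloc(s.len + 1);
    if (out == NULL)
        return NULL;

    if (s.len > 0)
        memcpy(out, s.ptr, s.len);
    out[s.len] = '\0';
    return out;
}
```

The overhead is one allocation and one copy per call into legacy code, e.g. converting right before `open` and freeing right after.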



A struct of string length and then a C-string is a cumbersome solution (but we are talking about C, everything is cumbersome) but it should work for all use-cases.


For example the SDS library used by Redis.
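Roughly, the idea is a length header stored immediately before a NUL-terminated buffer, with the pointer to the text handed out to callers. The sketch below is not the SDS API; `lstr_hdr`, `lstr_new`, `lstr_len` and `lstr_free` are hypothetical names, and the real library has more fields and several header sizes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header placed directly in front of the text. */
struct lstr_hdr {
    size_t len;
    char buf[];   /* flexible array member: text plus trailing '\0' */
};

/* Allocates header + text + terminator and returns a pointer to the
 * text, so the result can be passed to code expecting a C string. */
char *lstr_new(const char *init, size_t len)
{
    assert(init != NULL || len == 0);

    struct lstr_hdr *h = malloc(sizeof *h + len + 1);
    if (h == NULL)
        return NULL;
    h->len = len;
    if (len > 0)
        memcpy(h->buf, init, len);
    h->buf[len] = '\0';
    return h->buf;
}

/* O(1) length: step back from the text to the header. */
size_t lstr_len(const char *s)
{
    const struct lstr_hdr *h =
        (const struct lstr_hdr *)(s - offsetof(struct lstr_hdr, buf));
    return h->len;
}

void lstr_free(char *s)
{
    if (s != NULL)
        free(s - offsetof(struct lstr_hdr, buf));
}
```

The catch is that only pointers returned by `lstr_new` may be passed to `lstr_len` or `lstr_free`; an ordinary `char *` has no header in front of it.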


Right, when I was younger, I was convinced that NUL termination was a reasonable strategy. Learning C in the 1990s it made plenty of sense, even though I was also learning about buffer overflows and underflows.

One of the last things that finally changed my mind was the observation that the length shouldn't live with the text, but with the structure describing the text. Some of you might be laughing now, because that was obvious to you, but I genuinely had gone years without considering that. I'd been imagining a hack like the length of the string lives in a few bytes "before" the text.

Once I was envisioning the mutable string as [length, pointer] itself, that seemed obviously better and I was onboard with abolishing NUL termination in software.
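To make the [length, pointer] idea concrete, here is a small sketch of a non-owning, read-only view (a simpler cousin of the mutable string described above). `struct strview`, `sv_from_cstr`, `sv_sub` and `sv_eq` are hypothetical names chosen for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The length lives in the structure describing the text, not in the text. */
struct strview {
    const char *ptr;
    size_t len;
};

struct strview sv_from_cstr(const char *s)
{
    assert(s != NULL);
    return (struct strview){ s, strlen(s) };
}

/* Sub-view [start, start + count), clamped to the source.
 * Nothing is copied and the underlying text is never written to. */
struct strview sv_sub(struct strview s, size_t start, size_t count)
{
    if (start > s.len)
        start = s.len;
    if (count > s.len - start)
        count = s.len - start;
    return (struct strview){ s.ptr + start, count };
}

/* Lengths are compared first, so unequal lengths never touch the bytes. */
bool sv_eq(struct strview a, struct strview b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.ptr, b.ptr, a.len) == 0);
}
```

Taking a substring needs no allocation and no terminator to be written into the middle of someone else's buffer, which is something a NUL-terminated string cannot offer.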


A lot of what C provides is not supposed to be used by application code. The string interface is the bare minimum one should use, but any reasonable application should create its own higher level interface. The problem with the C community is that it never managed to create a reasonable string library, or if one was created it is seldom used. Future standards should introduce a higher level string library to fix this problem.


> I'd been imagining a hack like the length of the string lives in a few bytes "before" the text.

That's normal, usually called a "Pascal string".

As I recall, the C standard makes no assumption of whether strings are null-terminated or not.


> As I recall, the C standard makes no assumption of whether strings are null-terminated or not.

I'm not sure what you mean by assumptions made by the C standard, but it definitely says strings are null-terminated:

> A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.

and

> A string literal need not be a string [...], because a null character may be embedded in it by a \0 escape sequence.

(the second one is noting that if a string literal contains "\0", then it's not a string but contains a string with more stuff after it).
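A small illustration of that second quote, using a literal chosen for this example: the array holds everything, but the string, as the standard defines it, ends at the first null character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Eight elements: 'a','b','c','\0','d','e','f' plus the terminator
 * the compiler appends to every string literal. */
static const char lit[] = "abc\0def";

size_t lit_array_size(void) { return sizeof lit; }   /* the whole array */
size_t lit_string_len(void) { return strlen(lit); }  /* stops at first '\0' */

/* The bytes after the embedded '\0' are still there; they form a
 * second string inside the same array. */
const char *lit_tail(void) { return lit + strlen(lit) + 1; }

void check_literal(void)
{
    assert(lit_array_size() == 8);
    assert(lit_string_len() == 3);
    assert(strcmp(lit_tail(), "def") == 0);
}
```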


You're right. I was remembering something badly. I found some interesting things while looking into this, though:

C's behavior of defining literal strings as null-terminated character arrays is already described in the 1978 K&R. The word "string" is used in the text of the section, but not in its title, "character arrays". Null termination is mentioned here, as the book works through an example of successively reading lines from standard input:

    `getline` puts the character \0 (the 'null character', whose value is zero)
    at the end of the array it is creating, to mark the end of the string of
    characters.  This convention is also used by the C compiler: when a string
    constant like
           "hello\n"
    is written in a C program, the compiler creates an array of characters
    containing the characters of the string, and terminates it with a \0 so that
    functions such as `printf` can detect the end:
    
           | h | e | l | l | o | \n | \0 |
    
    the `%s` format specification in `printf` expects a string represented in this form.
There is a hint, immediately prior to this, as to why null termination might have been chosen:

    The length of the array `s` is not specified in `getline` since it is determined in `main`.
(Where `getline` is a function defined in the example, and `s` is a parameter to that function.)

https://en.wikipedia.org/wiki/Comparison_of_Pascal_and_C#Str... has an intriguing comment that seems likely to be related:

> In Pascal a string literal of length n is compatible with the type `packed array [1..n] of char`.

> Pascal has no support for variable-length arrays, and so any set of routines to perform string operations is dependent on a particular string size.

I suspect that I was remembering someone writing that how to represent strings was a live issue at the time of the creation of C, rather than, as I wrote above, being a live issue within C for some period after its creation.

----

On an unrelated note, it's interesting to see that the web convention of fixed-width type for code literals and variable-width type for natural text was already in force in K&R 1978.


Like almost everything in software development it's a trade-off, length + pointer means that your data structures become bigger and that often you will use more registers. That used to matter more than it does today.


I don’t believe it mattered that much even at the time. Having to calculate the length of the string by iterating over it at each string operation was and is much more wasteful and slow. It is simply a stupid decision.
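To illustrate the cost: `strcat` has to walk the destination to find its end on every call, so building a string from n pieces rescans everything written so far. With a stored length, an append only copies the new bytes. A rough sketch, where `struct sbuf` and `sbuf_append` are hypothetical names and the doubling growth policy is just one reasonable assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer that tracks its own length. */
struct sbuf {
    char *data;
    size_t len;
    size_t cap;
};

/* Appends n bytes from src. No scan for a terminator is needed because
 * the write position is already known from b->len. */
bool sbuf_append(struct sbuf *b, const char *src, size_t n)
{
    assert(b != NULL && (src != NULL || n == 0));

    if (n > SIZE_MAX - b->len)
        return false;                       /* length would overflow */

    if (b->len + n > b->cap) {
        size_t cap = b->cap ? b->cap : 16;
        while (cap < b->len + n) {
            if (cap > SIZE_MAX / 2) {       /* doubling would overflow */
                cap = b->len + n;
                break;
            }
            cap *= 2;
        }
        char *p = realloc(b->data, cap);
        if (p == NULL)
            return false;
        b->data = p;
        b->cap = cap;
    }

    if (n > 0)
        memcpy(b->data + b->len, src, n);
    b->len += n;
    return true;
}
```

Start from a zero-initialised `struct sbuf` and `free` the `data` member when done.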


It might sound obvious to you now, but most functional languages conceptually store strings as nil-terminated lists ...


The problem is memory safety and functional languages won’t read into another object’s memory even with a logical bug.

And a null-terminated linked list is different from a C-string.


Unfortunately C is locked into null-terminated strings, given that all the printf-style functions work on the assumption there'll be a null terminator. C++ has std::string_view which is pointer + length, but you've still got the same problem if you need to call older printf-style functions.


Why do you have to use printf? You could have a string library that comes with its own formatting routines. There's also the option of using both a length AND a null terminator.


> There's also the option of using both a length AND a null terminator.

I first encountered that idea in this classic Joel on Software post, which rather put me off the idea of using them in production:

> Notice in this case you’ve got a string that is null terminated (the compiler did that) as well as a Pascal string. I used to call these fucked strings because it’s easier than calling them null terminated pascal strings but this is a rated-G channel so you will have use the longer name.

https://www.joelonsoftware.com/2001/12/11/back-to-basics/


Joel's article is a bit long in the tooth. As of C++11, std::string is required to use both a length and a null terminator. With the short string optimization, it could even have that exact layout.

> Lazy programmers would do this, and have slow programs

    char* str = "*Hello!";
    str[0] = strlen(str) - 1;
Modern compilers understand strlen, and will replace the function call with a constant where possible. That code's not slow anymore: https://godbolt.org/z/Kjh8b44Kf


That's good to know, thanks.

But Joel was talking about C rather than C++, where your comments about std::string wouldn't apply, right?


If you are programming C, you will already roll your own data structures. So why not roll your own formatting routines?


Nope, printf can print strings without a null terminator: printf("%.*s", <int>length, <char*>string);
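A short sketch of the `%.*s` approach, written against `snprintf` so the result is easy to inspect. `format_slice` is a hypothetical helper name. Two caveats: the precision argument must be an `int`, and output still stops early if a null character appears within the first `len` bytes.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Writes at most len bytes of text, wrapped in brackets, into out.
 * text does not need to be NUL-terminated as long as len bytes are
 * readable. Returns what snprintf returns. */
int format_slice(char *out, size_t out_size, const char *text, size_t len)
{
    assert(len <= (size_t)INT_MAX);   /* precision is passed as an int */
    return snprintf(out, out_size, "[%.*s]", (int)len, text);
}
```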


And the first argument to printf (the format), what kind of string is that? And what kind of string does sprintf() produce?


In practice, it is almost always a compile-time-known string. gcc will warn you if it isn't, especially since allowing the use of untrusted input for the format can lead to vulnerabilities:

https://en.wikipedia.org/wiki/Uncontrolled_format_string


Anywhere that this format is a variable, you probably already screwed up. C allows that, but if I see it that's getting flagged in my review.

So long as the format string is a literal you needn't care how it works.

Now, one of the places where C makes this nastier than it needed to be is that C built-in types are silly, and so any non-trivial program is using better fundamental types like uint32_t (or the more succinct u32), for which the built-in formatter offers no syntax. So you end up writing format strings like "There are %" PRIu32 " dogs\n" using macros to bring in the appropriate specifier for your literal. Blergh.
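For reference, a compilable version of that pattern. `format_dogs` is a hypothetical function name; the thing to notice is that the macro from `<inttypes.h>` expands to a string literal holding only the conversion specifier, so the `%` has to be written out and the pieces are joined by ordinary literal concatenation.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Formats a uint32_t portably. PRIu32 supplies the specifier, not
 * the leading '%'. */
int format_dogs(char *out, size_t out_size, uint32_t dogs)
{
    assert(out != NULL || out_size == 0);
    return snprintf(out, out_size, "There are %" PRIu32 " dogs\n", dogs);
}
```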


sprintf tells you how many characters it has written, so there's no reason you can't use it for non-null-terminated strings.
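A sketch of using that return value as the stored length, with `snprintf` rather than `sprintf` so the write is bounded. `struct fbuf` and `fbuf_format_int` are hypothetical names. Note that the function still writes a terminator, so the buffer needs room for one even if the caller never looks at it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical fixed-size buffer that records its length explicitly. */
struct fbuf {
    char data[64];
    size_t len;
};

/* Formats value into b and keeps the length snprintf reports.
 * Returns false on an encoding error or if the output did not fit. */
bool fbuf_format_int(struct fbuf *b, int value)
{
    assert(b != NULL);

    int n = snprintf(b->data, sizeof b->data, "value=%d", value);
    if (n < 0 || (size_t)n >= sizeof b->data)
        return false;

    b->len = (size_t)n;   /* count excludes the terminator */
    return true;
}
```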


williamvds's point is that the first argument to printf is still itself a null-terminated string, so it's basically turtles all the way down if you're using the C standard library.


Their comment talked about two things, and the sibling comment addressed the other one.


Those advocating for length+pointer should really write it up, including showing all the time and memory complexity for all operations and pointing out its cons as well.

There are numerous safe-string libraries for C. I don't think anyone uses them much.

To me, null-terminated strings look like admirable restraint from Thompson, Kernighan and Ritchie.


What are the pros of null-terminated strings? You have to recalculate something you already know at each string operation, potentially failing. It is slow and unsafe.

Languages that can have a real string type usually store a length and skip the null terminator, while I think C++ stores a length plus a C-string for longer texts, but for short strings it can “hack” the text itself into the space the pointer would occupy.


I thought we'd all decided the Chad way was the best way to do strings in C.

https://github.com/skullchap/chadstr

Maybe not though. Issue #6 is unresolved.


In another universe, Niklaus Wirth designed Pascal with slightly more flexible stacks, and we're all using Pascal strings, which wouldn't have any of these issues.



