It's hard to overstate what a huge win this is. D has had 23 years of experience with it, and the virtual elimination of array overflow bugs is just win, win, win.
I will never understand why C keeps adding extensions consisting of marginal features, and ignores this foundational fix. I guess they still aren't tired of buffer overflow bugs always being the #1 security vulnerability of shipped C code (and C++, too!).
Those are called "length prefixed strings", or more simply "Pascal strings". The difficulty is one has to reallocate and copy to represent a substring.
Changing pointers to include length would require an ABI break on pretty much all platforms. You’d either have to recompile the world or have some sort of bridging thing that converts calls from C-with-fat-pointers to standard C. And even recompiling the world wouldn’t be enough, since lots of C code relies on being able to do things like cast pointers to integers, manipulate them as numbers, then cast them back. That’s UB by the standard but platforms and compilers can and often do define that behavior to be something useful.
You could say, well, forget binary compatibility and forget nasty code that bit-twiddles pointers. But then why are you even using C? Those are the things that set it apart.
Clang is trying to solve this with annotations that allow the programmer to construct fat pointers, either as structures or just implicitly by having the length in a variable somewhere, and enforcing those bounds in the compiler. Seems promising. https://clang.llvm.org/docs/BoundsSafetyImplPlans.html
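Roughly, the annotations look like this (a sketch based on Clang's experimental -fbounds-safety docs; not standard C, and the exact spelling may still change):

    #include <stddef.h>

    struct packet {
        size_t len;
        char *__counted_by(len) buf;   /* bounds live in the sibling field */
    };

    void clear(char *__counted_by(n) p, size_t n) {
        for (size_t i = 0; i < n; i++)
            p[i] = 0;                  /* in bounds: fine */
        /* p[n] = 0;                      out of bounds: trapped at run time */
    }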
Yeah, but not having standard buffer and slice types along with safe APIs that require their use is unforgivable.
I'm also of the opinion that backwards compatibility with null-terminated strings is actually terrible, because you want people to eventually go, "oh, this code uses gross null-terminated strings, let's fix that."
Well, I, for one, do like the idea of C (in contrast to D or C++) still being sort of the lowest-level high-level programming language - one that's just a notch above the assembler.
There is nothing particularly low level about null terminated strings. It is just a convention that a bunch of standard library functions follow. Then there are a dozen variants of each of those standard library functions, most of which end up making you pass in some form of a length parameter anyway, because the convention is so terrible.
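Compare the standard "fixes", which all smuggle an explicit length back in anyway:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *src = "a string that is too long";
        char dst[8];

        /* strcpy(dst, src);                  the classic convention: trusts
                                              the terminator, overflows dst  */

        strncpy(dst, src, sizeof dst - 1);    /* takes a length, but may
                                                 leave dst unterminated...   */
        dst[sizeof dst - 1] = '\0';           /* ...so terminate by hand     */

        snprintf(dst, sizeof dst, "%s", src); /* takes a length, truncates,
                                                 always terminates           */
        puts(dst);
        return 0;
    }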
At this point, I think we might be in a better world if C simply did not offer any string API in the first place.
Expressing strings as null terminated or in a data structure that includes the size and data has no relationship to it being a higher or lower level language. They still need to determine where a string starts and ends in memory. The means of doing so is different, the assembly language representation will be slightly different, but the language isn't hiding anything behind an abstraction. Contrast that to, say, Pascal where the length of the string is hidden from the developer and can only be accessed through function calls. (It probably should be, but that is beside the point.)
That has never been true, unless you are writing programs against a PDP-11. C compilers can change even the O complexity of your algorithms, it's nowhere close to assembly.
Nothing. You can get just as down and dirty in D as you can in C. You just don't have to suffer under the preprocessor as it blasts your kingdom. And you've got a fully functional inline assembler, and modules.
> Current compilers warn you if the format string doesn’t match its arguments. But this only works on functions that have the same signature as printf so it doesn’t work on my implementation.
GCC has the format attribute that lets you have printf type checking on your own variadic functions:
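Something like this (a sketch; the attribute says argument 1 is the format string and the variadic arguments start at argument 2):

    #include <stdarg.h>
    #include <stdio.h>

    __attribute__((format(printf, 1, 2)))
    void log_msg(const char *fmt, ...) {
        va_list ap;
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
    }

    /* log_msg("%s", 42);  now warns: format '%s' expects a 'char *' */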
It was a huge win, at least for me. I implemented it because I was really sick of mismatches. (Although I was careful to use the right formats, when refactoring I'd change a type and then the printf's would go awry. Having the compiler flag them made for quick fixing.)
Format strings are a technique that I think largely should be left behind anyway. String interpolation is in my experience usually shorter, easier to read and will always be checked by the compiler.
Two people have already mentioned things like storing the length inline or including a null-terminator to be backwards-compatible. What's described there is basically the same as std::string_view or &str, and to me one of the biggest reasons to use these structures is that your particular view of the string doesn't interfere with someone else's. You can slice your string in the middle and just look at it piecewise without bothering anyone else.
Choosing between these trade-offs just depends on what you're doing. I'd definitely choose this pattern if I were to write a parser for instance.
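A minimal sketch of such a view in C (the names are mine, modeled on std::string_view / &str):

    #include <stddef.h>

    /* A borrowed view: pointer + length, no ownership, no copying. */
    struct str_view {
        const char *ptr;
        size_t      len;
    };

    /* Slicing just narrows the window; the parent string is untouched. */
    struct str_view sv_slice(struct str_view s, size_t start, size_t end) {
        if (end > s.len)  end = s.len;
        if (start > end)  start = end;
        return (struct str_view){ s.ptr + start, end - start };
    }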
The problem with string views is that they are borrowing the parent string, so you'd need to hold a strong reference to the parent string. This is easy to do in a garbage collected language, because you don't have to do anything. But it's a lot more complicated if you need to do this with reference counting. Do you make every single string view update the reference counter? Do you make a special lighter string view that doesn't keep a counted reference, and is subject to memory safety issues?
Yep, you're right. One way to make this less of a problem is to make this distinction at the type level, having both an owned_string and a string_view for example. You can even make owned_string store its length inline.
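For example (a sketch; the inline length uses a C99 flexible array member so the header and the bytes are a single allocation):

    #include <stdlib.h>
    #include <string.h>

    struct owned_string {
        size_t len;
        char   data[];   /* the bytes follow the length inline */
    };

    struct owned_string *owned_from(const char *src, size_t len) {
        struct owned_string *s = malloc(sizeof *s + len);
        if (!s) return NULL;
        s->len = len;
        memcpy(s->data, src, len);
        return s;
    }

A string_view borrowing from it is then just { s->data + offset, n }.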
Typically you need 4 pointers to represent a string view that holds a strong reference:
* One for the start of the source string, with an inline strong count
* One for the end of the source string so you know how much to deallocate (only really applicable to Rust)
* One for the start of the view
* One for the end of the view
32 bytes for each string view is quite a lot. Depending on context you could use 32-bit lengths instead of end pointers if you're OK with <4GB strings, saving 8 bytes.
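A hypothetical C rendering of that layout (names invented for illustration):

    #include <stddef.h>
    #include <stdatomic.h>

    struct rc_str {                /* the shared source allocation */
        _Atomic size_t refs;       /* inline strong count          */
        char data[];
    };

    struct rc_view {
        struct rc_str *src;        /* start of source (and its count) */
        char          *src_end;    /* end of source, for deallocation */
        char          *begin;      /* start of the view               */
        char          *end;        /* end of the view                 */
    };                             /* 4 pointers = 32 bytes on 64-bit */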
There's basically no distinction between a string view and an array slice. It's borrowing an array, and the view is nothing but a reference to the parent, start position, and length.
But views are also implemented as a plain pointer and a length, and that's where the memory safety issues from borrowing begin.
I understand the concern, but can’t you just maintain an actual reference field to parent_str in a string view? Unless I missed some no-extra-fields constraint itt, then sorry for the noise.
Let's say you took a document as a string, and split it up into words using a lot of string views. Every string view created would affect the reference count of the parent string. Then every time you work with the string views, saving temporary instances, passing them to a function, assigning them, whatever, you're affecting the parent string's reference count.
And reference counts are often atomic integer operations, so it might not be a regular memory increment; instead it would be an interlocked increment. And if there are multiple threads, the CPU cores will be competing over who gets to keep the reference counter in their L1 cache line. (There is a way around this where you can give each thread its own reference counter.)
I've done something similar, but unlike the author, I always reserved one extra byte and I always null terminated the string. This was so I could use existing string output functions.
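Something in this spirit (a sketch, not the poster's actual code):

    #include <stdlib.h>
    #include <string.h>

    struct str { char *ptr; size_t len; };

    /* Length-tracked string that reserves one extra byte for a
       terminator, so printf("%s"), puts(), etc. still work on it. */
    struct str str_dup(const char *src, size_t len) {
        struct str s = { malloc(len + 1), 0 };
        if (!s.ptr) return s;
        memcpy(s.ptr, src, len);
        s.ptr[len] = '\0';
        s.len = len;
        return s;
    }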
I don't include the null-terminator because I use this type in my own environment where I never use null-terminated strings so there is no need for it.
Sorry, but there is a significant misunderstanding:
There is no such thing as a string in C. What you call a string is a pointer to char (typically an 8-bit integer) - nothing more, nothing less.
The \0 termination is just a convention/convenience to avoid passing the bounds of the memory segment, i.e. a marker for when to stop processing.
Once you go down the route proposed by many of the comments here - why not enhance it to deal with UTF8...
Or rather implement a proper "array" type?
What about the lack of multidimensional arrays instead of the pointer to pointer to ... approach? Idiosyncrasies such as "int a[2][3];" decaying to "int (*)[3]" rather than "int **"?
C was never intended to shield you from mistakes, but rather replace a macro assembler.
ANSI C addressed some of the issues in the original K&R C, but that is about it.
If your use case would benefit from all of these protections, there are plenty of higher level language alternatives...
The compiler will not treat that as a mere pointer to char when allocating space for it in the binary. It will see that the rhs is surrounded by double quote characters, allocate 3 bytes for it instead of 2, and put a NUL byte after the bytes for 'H' and 'i'.
Null-terminated strings are absolutely a part of the language. Certainly you can make and store strings in a different way if you'd like, but the language itself defines what a string and a string literal are.
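Easy to check:

    #include <stdio.h>

    int main(void) {
        /* A string literal is an array of char including the NUL,
           so "Hi" occupies 3 bytes: 'H', 'i', '\0'. */
        printf("%zu\n", sizeof "Hi");   /* prints 3 */
        return 0;
    }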
Good attempt at a topic that annoys many programmers.
I see a problem with the separation between str and str_buf, though: you create new strings with the latter, but most functions take the former as arguments. Do you convert them every time? Isn't your code littered with str_from_buf()?
Put another way, it's like the mess with const that you mention in your article. If str is the type you use for a const read-only string, and str_buf for a non-const mutable string, you would like to pass a non-const even to those functions that "only" require a const. (I say "only" because being const is a weaker requirement than being mutable; the fact that it's more wordy is another thing that C's syntax makes confusing, but that is an entirely different topic!)
It would be nice if the compiler could be instructed to automatically cast str_buf into str and not vice versa, just like it does for non-const to const.
The only way out I can think of would be to get rid of the two types and only use the one with the cap field, with the convention that if cap is zero, then the string is read-only. The drawback is that certain mistakes are only detected at run time and not enforced by the compiler. For example, a function that takes a string s and replaces every substring s1 with s2 could have the following prototype in the two-type system:
replace(str_buf s, str s1, str s2);
And it would be immediate to recognize that you cannot pass a read-only string as the first argument. With a one-type system you lose this ability.
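i.e. something like this (a sketch; field names assumed from the article's description):

    #include <assert.h>
    #include <stddef.h>

    /* One type for everything; cap == 0 means read-only. */
    struct str { char *ptr; size_t len; size_t cap; };

    void replace(struct str s, struct str s1, struct str s2) {
        /* Writability is now only a run-time check: */
        assert(s.cap != 0 && "replace() needs a mutable string");
        (void)s1; (void)s2;
        /* ... */
    }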
Oh well, I guess if a perfect solution existed, it would have been adopted by the C committee, wouldn't it? /s
No, the article addresses this: since the memory layout of the first two struct members is the same in both structs, you can use a pointer to str_buf anywhere a function calls for a pointer to str, after casting it.
> you can use a pointer to str_buf anywhere a function calls for a pointer to str
Yes, you could, but I see no function mentioned in TFA that wants a pointer to str, only functions that want a str: print_str(), print_fmt(), com_write(). At the same time, the functions that return strings return a struct, never a pointer: str_new(), str_from_range(), str_from_buf(), fmt_buf_new(), and the pseudo-function STR().
To use the memory layout trick you should go through reference + cast + dereference:
*((struct str *)&...)
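Spelled out, each call site looks something like this (field names assumed from TFA; strictly speaking the cast leans on the two structs sharing a common layout prefix):

    #include <stddef.h>

    struct str     { const char *ptr; size_t len; };
    struct str_buf { char *ptr; size_t len; size_t cap; };

    void print_str(struct str s);

    void demo(struct str_buf buf) {
        print_str(*(struct str *)&buf);  /* reference + cast + dereference */
    }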
My question still holds: is the code littered by such conversion artifacts?
Never had a string-related bug in any programming language in 4 years. I sincerely don't know what people are talking about when they claim strings are buggy. What kind of tasks do these happen in?
It's just that the "traditional" implementations of the operations on C strings (strcat etc.) are considered unsafe - which they are, strictly speaking. (But, to be fair, I haven't ever had problems using them, either.)
I have been using null terminated strings since the mid 1970s - before using C - and have never had any problems with them. I have never seen an explanation from someone who has that makes any sense.
In my experience, the standard library is inconsistent with its 0 terminator handling.
fgets treats the length you pass as the capacity of the buffer: it reads at most length-1 characters and always writes the 0 terminator.
scanf however treats the length as the number of characters to read, meaning that you need a capacity of n+1 to make sure the 0 terminator is stored properly as well.
It's quite easy to mess up placing the 0 terminator yourself too. It's an overall unnecessary burden that could've been fixed quite easily.
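Concretely:

    #include <stdio.h>

    int main(void) {
        char buf[16];

        /* fgets: the size argument is the buffer CAPACITY.
           Reads at most 15 chars here and always stores the '\0'. */
        if (fgets(buf, sizeof buf, stdin))
            puts(buf);

        /* scanf: the field width is the number of CHARACTERS to read.
           "%15s" can store 15 chars plus a '\0', so it needs 16 bytes. */
        if (scanf("%15s", buf) == 1)
            puts(buf);

        return 0;
    }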
I'm sorry that not every programmer in the world has achieved your level of supreme perfection. We shouldn't design our languages or stdlib with the assumption that everyone will have read every line of documentation about them, and (even if they have) remember everything every time they sit down in front of their text editor. That's unrealistic.
I don't actually believe you that you've been programming for 50 years and never misused a string or a string API in C or a language with similar string handling. But even if I did believe you, it wouldn't matter. Many people make mistakes, and those mistakes have cost people a lot of time, money, and stress. If you've not read about any of these instances, then I suggest you've been living under a rock and are incredibly out of touch.
C was designed in the 1970s with the goal of giving you minimal overhead, bar using a macro-assembler.
The way C handles "strings" provides exactly that.
The OP clearly stated that he did not mind the overhead (in terms of executable size, memory consumption, execution speed) in his particular use case.
https://www.digitalmars.com/articles/C-biggest-mistake.html