It's hard to overstate what a huge win this is. D has had 23 years of experience with it, and the virtual elimination of array overflow bugs is just win, win, win.
I will never understand why C keeps adding extensions consisting of marginal features, and ignores this foundational fix. I guess they still aren't tired of buffer overflow bugs always being the #1 security vulnerability of shipped C code (and C++, too!).
Those are called "length prefixed strings", or more simply "Pascal strings". The difficulty is one has to reallocate and copy to represent a substring.
Changing pointers to include length would require an ABI break on pretty much all platforms. You’d either have to recompile the world or have some sort of bridging thing that converts calls from C-with-fat-pointers to standard C. And even recompiling the world wouldn’t be enough, since lots of C code relies on being able to do things like cast pointers to integers, manipulate them as numbers, then cast them back. That’s UB by the standard but platforms and compilers can and often do define that behavior to be something useful.
You could say, well, forget binary compatibility and forget nasty code that bit-twiddles pointers. But then why are you even using C? Those are the things that set it apart.
Clang is trying to solve this with annotations that allow the programmer to construct fat pointers, either as structures or just implicitly by having the length in a variable somewhere, and enforcing those bounds in the compiler. Seems promising. https://clang.llvm.org/docs/BoundsSafetyImplPlans.html
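Roughly, the annotations look like this (a sketch based on Clang's experimental -fbounds-safety docs; not standard C, and the exact spelling may still change):

    #include <stddef.h>

    struct packet {
        size_t len;
        char *__counted_by(len) buf;   /* bounds live in the sibling field */
    };

    void clear(char *__counted_by(n) p, size_t n) {
        for (size_t i = 0; i < n; i++)
            p[i] = 0;                  /* in bounds: fine */
        /* p[n] = 0;                      out of bounds: trapped at run time */
    }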
Yeah, but not having standard buffer and slice types along with safe APIs that require their use is unforgivable.
I'm also of the opinion that backwards compatibility with null-terminated strings is actually terrible, because you want people to eventually go, "oh, this code uses gross null-terminated strings, let's fix that."
Well, I, for one, do like the idea of C (in contrast to D or C++) still being sort of the lowest-level high-level programming language - one that's just a notch above the assembler.
There is nothing particularly low level about null terminated strings. It is just a convention that a bunch of standard library functions follow. Then there are a dozen variants of each of those standard library functions, most of which end up making you pass in some form of a length parameter anyway, because the convention is so terrible.
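Compare the standard "fixes", which all smuggle an explicit length back in anyway:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *src = "a string that is too long";
        char dst[8];

        /* strcpy(dst, src);                  the classic convention: trusts
                                              the terminator, overflows dst  */

        strncpy(dst, src, sizeof dst - 1);    /* takes a length, but may
                                                 leave dst unterminated...   */
        dst[sizeof dst - 1] = '\0';           /* ...so terminate by hand     */

        snprintf(dst, sizeof dst, "%s", src); /* takes a length, truncates,
                                                 always terminates           */
        puts(dst);
        return 0;
    }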
At this point, I think we might be in a better world if C simply did not offer any string API in the first place.
Expressing strings as null terminated or in a data structure that includes the size and data has no relationship to it being a higher or lower level language. They still need to determine where a string starts and ends in memory. The means of doing so is different, the assembly language representation will be slightly different, but the language isn't hiding anything behind an abstraction. Contrast that to, say, Pascal where the length of the string is hidden from the developer and can only be accessed through function calls. (It probably should be, but that is beside the point.)
That has never been true, unless you are writing programs against a PDP-11. C compilers can change even the O complexity of your algorithms, it's nowhere close to assembly.
Nothing. You can get just as down and dirty in D as you can in C. You just don't have to suffer under the preprocessor as it blasts your kingdom. And you've got a fully functional inline assembler, and modules.
> Current compilers warn you if the format string doesn’t match its arguments. But this only works on functions that have the same signature as printf so it doesn’t work on my implementation.
GCC has the format attribute that lets you have printf type checking on your own variadic functions:
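Something like this (a sketch; the attribute says argument 1 is the format string and the variadic arguments start at argument 2):

    #include <stdarg.h>
    #include <stdio.h>

    __attribute__((format(printf, 1, 2)))
    void log_msg(const char *fmt, ...) {
        va_list ap;
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
    }

    /* log_msg("%s", 42);  now warns: format '%s' expects a 'char *' */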
It was a huge win, at least for me. I implemented it because I was really sick of mismatches. (Although I was careful to use the right formats, when refactoring I'd change a type and then the printf's would go awry. Having the compiler flag them made for quick fixing.)
Format strings are a technique that I think largely should be left behind anyway. String interpolation is in my experience usually shorter, easier to read and will always be checked by the compiler.
Two people have already mentioned things like storing the length inline or including a null-terminator to be backwards-compatible. What's described there is basically the same as std::string_view or &str, and to me one of the biggest reasons to use these structures is that your particular view of the string doesn't interfere with someone else's. You can slice your string in the middle and just look at it piecewise without bothering anyone else.
Choosing between these trade-offs just depends on what you're doing. I'd definitely choose this pattern if I were to write a parser for instance.
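A minimal sketch of such a view in C (the names are mine, modeled on std::string_view / &str):

    #include <stddef.h>

    /* A borrowed view: pointer + length, no ownership, no copying. */
    struct str_view {
        const char *ptr;
        size_t      len;
    };

    /* Slicing just narrows the window; the parent string is untouched. */
    struct str_view sv_slice(struct str_view s, size_t start, size_t end) {
        if (end > s.len)  end = s.len;
        if (start > end)  start = end;
        return (struct str_view){ s.ptr + start, end - start };
    }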
The problem with string views is that they are borrowing the parent string, so you'd need to hold a strong reference to the parent string. This is easy to do in a garbage collected language, because you don't have to do anything. But it's a lot more complicated if you need to do this with reference counting. Do you make every single string view update the reference counter? Do you make a special lighter string view that doesn't keep a counted reference, and is subject to memory safety issues?
Yep, you're right. One way to make this less of a problem is to make this distinction at the type level, having both an owned_string and a string_view for example. You can even make owned_string store its length inline.
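For example (a sketch; the inline length uses a C99 flexible array member so the header and the bytes are a single allocation):

    #include <stdlib.h>
    #include <string.h>

    struct owned_string {
        size_t len;
        char   data[];   /* the bytes follow the length inline */
    };

    struct owned_string *owned_from(const char *src, size_t len) {
        struct owned_string *s = malloc(sizeof *s + len);
        if (!s) return NULL;
        s->len = len;
        memcpy(s->data, src, len);
        return s;
    }

A string_view borrowing from it is then just { s->data + offset, n }.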
Typically you need 4 pointers to represent a string view that holds a strong reference:
* One for the start of the source string, with an inline strong count
* One for the end of the source string so you know how much to deallocate (only really applicable to Rust)
* One for the start of the view
* One for the end of the view
32 bytes for each string view is quite a lot. Depending on context you could use 32-bit lengths instead of end pointers if you're OK with <4GB strings, saving 8 bytes.
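A hypothetical C rendering of that layout (names invented for illustration):

    #include <stddef.h>
    #include <stdatomic.h>

    struct rc_str {                /* the shared source allocation */
        _Atomic size_t refs;       /* inline strong count          */
        char data[];
    };

    struct rc_view {
        struct rc_str *src;        /* start of source (and its count) */
        char          *src_end;    /* end of source, for deallocation */
        char          *begin;      /* start of the view               */
        char          *end;        /* end of the view                 */
    };                             /* 4 pointers = 32 bytes on 64-bit */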
There's basically no distinction between a string view and an array slice. It's borrowing an array, and the view is nothing but a reference to the parent, start position, and length.
But views are also implemented as a plain pointer and a length, and that's where the memory safety issues from borrowing begin.
I understand the concern, but can’t you just maintain an actual reference field to parent_str in a string view? Unless I missed some no-extra-fields constraint itt, then sorry for the noise.
Let's say you took a document as a string, and split it up into words using a lot of string views. Every string view created would affect the reference count of the parent string. Then every time you work with the string views, saving temporary instances, passing them to a function, assigning them, whatever, you're affecting the parent string's reference count.
And reference counts are often atomic integer operations, so it might not be a regular memory increment; instead it would be an interlocked increment. And if there are multiple threads, the CPU cores will be competing over who gets to keep the reference counter in their L1 cache line. (There is a way around this where you can give each thread its own reference counter.)
I've done something similar, but unlike the author, I always reserved one extra byte and I always null terminated the string. This was so I could use existing string output functions.
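Something in this spirit (a sketch, not the poster's actual code):

    #include <stdlib.h>
    #include <string.h>

    struct str { char *ptr; size_t len; };

    /* Length-tracked string that reserves one extra byte for a
       terminator, so printf("%s"), puts(), etc. still work on it. */
    struct str str_dup(const char *src, size_t len) {
        struct str s = { malloc(len + 1), 0 };
        if (!s.ptr) return s;
        memcpy(s.ptr, src, len);
        s.ptr[len] = '\0';
        s.len = len;
        return s;
    }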
I don't include the null-terminator because I use this type in my own environment where I never use null-terminated strings so there is no need for it.
Sorry, but there is a significant misunderstanding:
There is no such thing as a string in C. What you call a string is a pointer to char (typically an 8-bit integer) - nothing more, nothing less.
The \0 termination is just a convention/convenience to avoid passing the bounds of the memory segment, i.e. a marker for when to stop processing.
Once you go down the route proposed by many of the comments here - why not enhance it to deal with UTF8...
Or rather implement a proper "array" type?
What about the lack of multidimensional arrays instead of the pointer to pointer to ... approach? Idiosyncrasies such as "int a[2][3];" decaying to "int (*)[3]" rather than "int **"?
C was never intended to shield you from mistakes, but rather replace a macro assembler.
ANSI C addressed some of the issues in the original K&R C, but that is about it.
If your use case would benefit from all of these protections, there are plenty of higher level language alternatives...
The compiler will not treat that as a mere pointer to char when allocating space for it in the binary. It will see that the rhs is surrounded by double quote characters, allocate 3 bytes for it instead of 2, and put a NUL byte after the bytes for 'H' and 'i'.
Null-terminated strings are absolutely a part of the language. Certainly you can make and store strings in a different way if you'd like, but the language itself defines what a string and a string literal are.
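Easy to check:

    #include <stdio.h>

    int main(void) {
        /* A string literal is an array of char including the NUL,
           so "Hi" occupies 3 bytes: 'H', 'i', '\0'. */
        printf("%zu\n", sizeof "Hi");   /* prints 3 */
        return 0;
    }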
Good attempt at a topic that annoys many programmers.
I see a problem with the separation between str and str_buf, though: you create new strings with the latter, but most functions take the former as arguments. Do you convert them every time? Isn't your code littered with str_from_buf()?
Put another way, it's like the mess with const that you mention in your article. If str is the type you use for a const read-only string, and str_buf for a non-const mutable string, you would like to pass a non-const even to those functions that "only" require a const. (I say "only" because being const is a weaker requirement than being mutable; the fact that it's more wordy is another thing that C's syntax makes confusing, but that is an entirely different topic!)
It would be nice if the compiler could be instructed to automatically cast str_buf into str and not vice versa, just like it does for non-const to const.
The only way out I can think of would be to get rid of the two types and only use the one with the cap field, with the convention that if cap is zero, then the string is read-only. The drawback is that certain mistakes are only detected at run time and not enforced by the compiler. For example, a function that takes a string s and replaces every substring s1 with s2 could have the following prototype in the two-type system:
replace(str_buf s, str s1, str s2);
And it would be immediate to recognize that you cannot pass a read-only string as the first argument. With a one-type system you lose this ability.
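i.e. something like this (a sketch; field names assumed from the article's description):

    #include <assert.h>
    #include <stddef.h>

    /* One type for everything; cap == 0 means read-only. */
    struct str { char *ptr; size_t len; size_t cap; };

    void replace(struct str s, struct str s1, struct str s2) {
        /* Writability is now only a run-time check: */
        assert(s.cap != 0 && "replace() needs a mutable string");
        (void)s1; (void)s2;
        /* ... */
    }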
Oh well, I guess if a perfect solution existed, it would have been adopted by the C committee, wouldn't it? /s
No, the article addresses this: since the memory layout of the first two struct members is the same in both structs, you can use a pointer to str_buf anywhere a function calls for a pointer to str, after casting it.
> you can use a pointer to str_buf anywhere a function calls for a pointer to str
Yes, you could, but I see no function mentioned in TFA that wants a pointer to str, only functions that want a str: print_str(), print_fmt(), com_write(). At the same time, the functions that return strings return a struct, never a pointer: str_new(), str_from_range(), str_from_buf(), fmt_buf_new(), and the pseudo-function STR().
To use the memory layout trick you should go through reference + cast + dereference:
*((struct str *)&...)
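Spelled out, each call site looks something like this (field names assumed from TFA; strictly speaking the cast leans on the two structs sharing a common layout prefix):

    #include <stddef.h>

    struct str     { const char *ptr; size_t len; };
    struct str_buf { char *ptr; size_t len; size_t cap; };

    void print_str(struct str s);

    void demo(struct str_buf buf) {
        print_str(*(struct str *)&buf);  /* reference + cast + dereference */
    }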
My question still holds: is the code littered by such conversion artifacts?
Never had a string-related bug in any programming language in 4 years. I sincerely don't know what people are talking about when they claim strings are buggy. What kind of tasks do these happen in?
It's just that the "traditional" implementations of the operations on C strings (strcat etc.) are considered unsafe - which they are, strictly speaking. (But, to be fair, I haven't ever had problems using them, either.)
I have been using null terminated strings since the mid 1970s - before using C - and have never had any problems with them. I have never seen an explanation from someone who has that makes any sense.
In my experience, the standard library is inconsistent with its 0 terminator handling.
fgets treats the length you pass as the capacity of the buffer: it reads at most length-1 characters and always writes the 0 terminator.
scanf however treats the length as the number of characters to read, meaning that you need a capacity of n+1 to make sure the 0 terminator is stored properly as well.
It's quite easy to mess up placing the 0 terminator yourself too. It's an overall unnecessary burden that could've been fixed quite easily.
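Concretely:

    #include <stdio.h>

    int main(void) {
        char buf[16];

        /* fgets: the size argument is the buffer CAPACITY.
           Reads at most 15 chars here and always stores the '\0'. */
        if (fgets(buf, sizeof buf, stdin))
            puts(buf);

        /* scanf: the field width is the number of CHARACTERS to read.
           "%15s" can store 15 chars plus a '\0', so it needs 16 bytes. */
        if (scanf("%15s", buf) == 1)
            puts(buf);

        return 0;
    }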
I'm sorry that not every programmer in the world has achieved your level of supreme perfection. We shouldn't design our languages or stdlib with the assumption that everyone will have read every line of documentation about them, and (even if they have) remember everything every time they sit down in front of their text editor. That's unrealistic.
I don't actually believe you that you've been programming for 50 years and never misused a string or a string API in C or a language with similar string handling. But even if I did believe you, it wouldn't matter. Many people make mistakes, and those mistakes have cost people a lot of time, money, and stress. If you've not read about any of these instances, then I suggest you've been living under a rock and are incredibly out of touch.
C was designed in the 1970s with the goal of giving you minimal overhead, bar using a macro-assembler.
The way C handles "strings" provides exactly that.
The OP clearly stated that he did not mind the overhead (in terms of executable size, memory consumption, execution speed) in his particular use case.
https://www.digitalmars.com/articles/C-biggest-mistake.html