I think it's interesting that the famous tech companies are all developing programming languages that are less managed than java/c#, but more predictable (read: less undefined behavior) than c/c++. Facebook seems interested in D, Apple has Swift, Microsoft has some sort of secret project they are working on, Google has Go, and Mozilla has Rust. Even c++ seems to be attempting to modernize with the new additions to its spec. And now we see a desire for c itself to change. I wonder if our industry is at a turning point where managed languages aren't quite cutting it, but no one is comfortable going back to the 'good old days'.
On a personal note, I like the idea of friendly C so much that I finally made an HN account. One of my favorite things to do is to take things apart and understand them. I was mortified when I learned the real meaning of undefined behavior in c/c++. It seems like the only way to be sure you understand a C program is to check the generated machine code. Even worse is that when I try to talk to other developers about undefined behavior, I tend to get the impression that they don't actually understand what undefined behavior means. I can't think of a way to verify what they think it means without insulting their intelligence, but hopefully the existence of something like friendly C will make it an easier discussion to have.
I wonder if the managed languages "aren't cutting it" because they're not fashionable and seen as stinky old dinosaurs to the generation of developers growing up now?
I mean, some people openly scoff at languages that aren't Python or ECMAScript, and it appears to be very/extremely fashionable/obligatory on HN to aggressively bash C++, even when it appears that many haven't written or maintained anything in it, and casually dismiss it as "complicated" and daft as it has no compulsory garbage collection. "Huh! It allows you to address memory! HOW STUPID. Why would a language even exist anymore that lets you do that?!"
When reading some of the concepts behind C++ and the reasons for decisions, some of them make excellent sense. The 'newer' languages are simpler, primarily because they haven't had decades of existence for people to demand the features that people have demanded in C++.
Where I work we use C and C++ for most things, but that's just the industry we're in. It appears to have become the norm to teach Java at universities now (judging from the people I have bumped into), so C++ and C are no longer the starting points for development for most graduates, perhaps?
I think it's also a question of using the tools that fit the problem.
If the fashionable choice is to use C++ for example, a lot more can go wrong if the programmer isn't doing things right.
Now if you have the same situation with a higher level language, the worst that can happen is infinite loops.
Most projects don't need the extra performance C++ would provide.
So most of the time it's a better choice to use a higher level, more predictable language like Node.js or Go.
Since ECMAScript engines like V8 have been optimised a lot in recent years too, the performance difference is not that high either.
This is not only about extra computational performance, but also about predictability, I'd love to program on a hard real time system with Go instead of C++, but given it uses GC, it is not going to happen. I have mostly worked on real time embedded systems, and I only encountered C and C++, mostly because there is no other sane choice available.
> Microsoft has some sort of secret project they are working on
Microsoft has a multitude of secret programming language projects, and the nice thing about them is that when they migrate into C#, people actually will use them.
I've been a C developer most of my professional life, and I find any discussion about the 'right way to write C' to be worth the effort, usually, because it means either more C coders will be produced with always-interesting results, or the kinds of developers who shouldn't be C coders in the first place will wander off and invent something new and shiny to play with instead. Both conditions are valid.
C is such a thorny language precisely because of the trouble you can get in, and as such it does require a degree of competence that your average IDE-wielding neophytes are not willing to conjure up in all their free time. Digging into the details is the only way to succeed with C.
For me the breakthrough came during my SIL-4 professional period (safety-integrity level 4, life-critical systems..) experience, wherein I learned the value of testing, and always testing, and strict requirements with standardized rules. I also learned the value of code-coverage tools and why you use them to increase the 'safe-ness' of your code. And yes, your issue with undefined behaviour means, of course, that you should never rely on your compiler/IDE to tell you what is going on; you should know by inspection, always!
In the end though, C still has a lot of uses and applications left to be written - it's not going away any time soon - and so as long as a group of people are willing to agree to what they are doing with the C code base, good stuff gets built.
You've made an interesting observation regarding the trend taking place in large tech companies.
>but no one is comfortable going back to the 'good old days'.
Except the open source community. Both recent and on-going projects are being done in C. I'd like to know why but I'm thinking the Linux kernel being in C has a lot to do with it.
I really like these suggestions since they can be summed up in one sentence: they are what C programmers who write code with UB would already expect any reasonably sane platform would do. I think it's definitely a very positive change in attitude from the "undefined behaviour, therefore anything can happen" that resulted in compilers' optimisations becoming very surprising and unpredictable.
Rather, we are trying rescue the predictable little language that we all know is hiding within the C standard.
Well said. I think the practice of UB-exploiting optimisation was completely against the spirit of the language, and that the majority of optimisation benefits happen in the compiler backend (instruction selection, register allocation, etc.) At least as an Asm programmer, I can attest that IS/RA can make a huge difference in speed/size.
The other nice point about this friendly C dialect is that it still allows for much optimisation, but with a significant difference: instead of basing it on assumptions of UB defined by the standard, it can still be done based on proof; e.g. code that can be proved to be unneeded can be eliminated, instead of code that may invoke UB. I think this sort of optimisation is what most C programmers intuitively agree with.
> instead of basing it on assumptions of UB defined by the standard, it can still be done based on proof; e.g. code that can be proved to be unneeded can be eliminated, instead of code that may invoke UB. I think this sort of optimisation is what most C programmers intuitively agree with.
The main motivation for relying on some of the UB assumptions is that compilers weren't able to prove things in a lot of cases where programmers expected them to, because C (especially in the face of arbitrary pointers, and esp. with incremental compilation) is quite hard to prove things about. So compilers started inferring things about variables from what programmers do with them. For example, if a pointer is used in a memcpy(), then the programmer is signalling to us that they know this is a non-null pointer, since you can't pass NULL as an address to memcpy(). So when compiled with non-debugging, high-optimization flags, the compiler trusts the programmer, and assumes this is a known-to-be-non-null pointer.
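A minimal sketch of the kind of inference being described (the function and names here are made up for illustration): because memcpy's arguments may not be null, a compiler at high optimization may treat a later null check on the same pointer as dead code.

    #include <stdio.h>
    #include <string.h>

    void report(int *p) {
        int copy;
        memcpy(&copy, p, sizeof copy);   /* passing p here signals "p is non-null" */
        if (p == NULL)                   /* so this branch may be folded away */
            puts("p is null");
        else
            printf("p points to %d\n", copy);
    }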
It would be interesting to see some benchmarks digging into whether turning off that kind of inference has significant performance impact on real-world code. Without adding more source-level annotations to C (like being able to flag pointers as non-NULL), it will reduce some optimization opportunities, since the compiler won't be able to infer as many things, even things that are definitely true. But it might not be enough cases to matter.
* Changes that replace undefined behaviours with undefined values. This makes it easier to catch certain types of coding errors at the cost of certain kinds of optimizations.
I'm comfortable with the first kind, although you can already achieve something very similar to that with most compilers (as far as I know) by building with optimizations disabled. Also, stuff like missing return values generates a warning in any compiler worth using; if you ignore that kind of warning you can only blame yourself.
The 2nd kind bothers me more, because it makes otherwise invalid C code valid in this dialect. I'm worried this makes things even more difficult to explain to beginners (and not so beginners, I still have to check the aliasing rules from time to time to make sure the code I'm writing is valid).
Even if you're very optimistic, this friendly C is not going to replace daddy anytime soon. There'll be plenty of C code out there, plenty of C toolchains, plenty of C environments where the definition of friendliness is having a dump of the registers and stack on the UART in case of an error. Plenty of environments where memcpy is actually memcpy, not memmove.
For that reason I'd be much more in favour of advocating the use of more modern alternatives to C (and there are a bunch of those) rather than risking blurring the lines some more about what is and isn't undefined behaviour in C.
Precisely. The idea here is that any C program that does not specifically require one's complement or sign-magnitude representation will work just as well compiled as friendly C as it does compiled as C.
Also, just because testing has found no exploitable bugs in a C program compiled with GCC 4.8 does not mean that GCC 4.9 will not introduce a new optimization that wreaks havoc with undefined behavior that was there all along. No such surprises with friendly C. It's the trusty C compiler that you had in 1995, and it always will be.
Security-critical programs would never rely on undefined behavior!, you might say. Let us take the example of ntpd, one of the building blocks of the Internet, considered critical enough to be part of two Google bounty programs. Here is a list of currently harmless undefined behaviors in it that a compiler could use as an excuse to produce vulnerable binary code tomorrow: http://bugs.ntp.org/buglist.cgi?emailreporter1=1&emailtype1=...
Friendly C is not a new language. It is C as most developers understand it, and with bad surprises prohibited as much as possible.
Disclaimer: Julien, who reported these ntpd undefined behaviors, is my colleague, and I am a co-author of the “friendly C” proposal.
I get your point (and it's a fair one) but what happens when it goes the other way? A bit of code written (and working) in friendly C finds its way into a regular good old C codebase. Suddenly it contains a critical bug that may go unnoticed.
Or some coder is used to friendly C, uses aliasing, memcpy and wrapping arithmetic without thinking twice. Then he's asked to work on some project building with a regular C toolchain. Gun is cocked and pointed at the foot.
Those problems only occur because friendly C is a dialect of C; you don't get those drawbacks if you tell devs to use Rust or whatnot, because the difference between Rust and C is obvious, unlike the difference between friendly C and C.
This exact problem exists today if someone takes code from the Linux kernel that is compiled with things like -fno-strict-aliasing. Undefined Behaviour makes programs unpredictable in the first place, given that you can't reasonably ensure you're not triggering it. I would expect a lot more people to use friendly-c (or simple-c or yes-this-is-how-your-cpu-works-c) than standard c.
As he notes, people who need the undefined behaviour-induced optimisations are in the minority.
Such code would also fail if passed to tools like lint, and would result in warnings or worse when loaded in IDEs. The incredible C toolchain built over the decades is one of C's biggest advantages; friendly-c cannot afford to lose even 10% of it.
On the security-critical programs I've worked on, nobody would even think of randomly upgrading the compiler just because a new version happens to be available.
But take OpenSSL as an example of a critical security program. Nothing in the usage practice suggests not upgrading the compiler, or even going to the next version of libc.
Many open-source programs can be security-critical (any network-facing daemon, for instance), and there's nothing which ties them to a specific compiler.
I usually hear the latter type of system described as "safety-critical" or "life-critical". You're right that it's an entirely different world, though.
Can someone give a rationalization of why a "friendly" dialect of C should return unspecified values from reading uninitialized storage? Is the idea that all implementations will choose "0" for that unspecified value and allow programmers to be lazy?
I'd much rather my "friendly" implementation immediately trap. Code built in Visual C++'s debug mode is pretty reliable (and useful) in this regard.
EDIT: It occurs to me that this is probably a performance issue. Without pretty amazing (non-computable in the general case?) static analysis, it would be impossible to tell whether code initializes storage in all possible executions, and using VM tricks to detect all uninitialized access at runtime is likely prohibitively expensive for non-debug code.
Because with some current compiler optimizations, instead of returning an unspecified value, the compiler can assume that the code is unreachable (undefined behavior) and simply eliminate a bunch of normal, correct code that precedes the read of uninitialized storage.
Guaranteed immediate trapping might be difficult on some platforms, but a specification that allows returning an unspecified value or trapping immediately can be implemented anywhere.
The scope of an unspecified value's effect is limited to that particular value, whereas undefined behavior's effects are unlimited: http://en.wikipedia.org/wiki/Nasal_demons
However, in practice, continuing with an unspecified value can have catastrophic consequences. And, also, undefined behavior can have nice consequences, like terminating with a diagnostic message, which is not a conforming way to implement unspecified behavior.
The change affects compiler optimization. The proposal changes many results from "undefined behavior" (compiler can do anything, including throwing away code before the statement executes) to "unspecified value", which means the statement might result in unknown values, but it can't remove the code.
Read the linked articles in the post for a better treatment of what undefined behavior means and how compilers deal with it.
How do you propose to implement trapping on an uninitialized integer for example? You'd need significant hardware support, on the order of adding another bit to every memory location.
Sometime ago I came up with a simpler proposal: emit a warning if UB exploitation makes a line of code unreachable. That refers to actual lines in the source, not lines after macroexpansion and inlining. Most "gotcha" examples with UB that I've seen so far contain unreachable lines in the source, while most legitimate examples of UB-based optimization contain unreachable lines only after macroexpansion and inlining.
Such a warning would be useful in any case, because in legitimate cases it would tell the programmer that some lines can be safely deleted, which is always good to know.
Is it true that undesirable UB exploitation often happens after macroexpansion and inlining, and doesn't make any actual source lines unreachable? Are there any simple examples of that?
The kernel uses -fno-strict-aliasing because they can't do everything they need to do while adhering strictly to the standard; it has nothing to do with it being too hard (the biggest example probably being treating pointers as integers and masking them, which is basically illegal in standard C).
IMO, this idea would make sense if it was targeted at regular software development in C (And making it easier to not shoot yourself in the foot). It's not as useful to the OS/hardware people though because they're already not writing standards-compliant code nor using the standard library. There's only so much it can really do in that case without making writing OS or hardware code more annoying to do than it already is.
> the short answer is that an ‘undef‘ “variable” can arbitrarily change its value over its “live range”. This is true because the variable doesn’t actually have a live range. Instead, the value is logically read from arbitrary registers that happen to be around when needed, so the value is not necessarily consistent over time.
An undefined value is not only unspecified, it also need not be the same unspecified value across multiple uses. This can result in surprising behavior when you're reading code, where you tend to assume that even if the value of a variable is unknown, different uses of it will be consistent with respect to each other.
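A contrived illustration (mine, not from the linked post): with x never initialized, the two reads below need not observe the same value, so a compiler may emit code in which neither, either, or both messages are printed.

    #include <stdio.h>

    void demo(void) {
        int x;                     /* deliberately left uninitialized */
        if (x == 0)                /* this read of x... */
            puts("x is zero");
        if (x != 0)                /* ...need not agree with this one */
            puts("x is not zero");
    }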
No, those are things that _usually_ happen in most C implementations. But reading from uninitialized memory provokes "undefined behavior" according to the standard. Which might mean returning an unspecified value, might mean whole chunks of code being skipped, or might mean demons flying out of your nose.
The first one seems like an obvious choice of behavior, and one that some C programs doubtless depend on. But when you throw advanced optimizations into the mix, compilers may do unexpected things (e.g., assuming that a certain thing can't happen because it's not allowed by the standard, and then manipulating the code under that assumption). If you read other articles from Regehr, he describes some examples.
citing "undefined behavior", compiler writers will gleefully do all kinds of goofy shit and, if confronted by developers who actually program for a living, stick their fingers in their ears and go "nyah nyah nyah undefined behavior we can do whatever we want go learn c nyah nyah nyah".
"Undefined behavior" is a really great trick for promoting pet optimization projects and stonewalling practical feature requests by use of language lawyering.
How was it goofy? Personally what the compiler did made perfect sense to me. If you assume integers can't overflow, then 'b' must be larger than 'a'. Thus why would the compiler bother performing the statement 'c=(b>a)' when it's 'obvious' that it's just going to be 'c=1'?
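Roughly the pattern being discussed, reduced to a sketch (the actual code in the bug report may differ):

    int compare(int a) {
        int b = a + 1;       /* undefined behaviour when a == INT_MAX */
        int c = (b > a);     /* so gcc at -O2 may fold this to c = 1 unconditionally */
        return c;
    }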
That said, looking at that page the guy who made this bug was being more than extremely annoying, the person responding to the bug was fairly civil all things considered.
You're complaining about UB, but it's a necessary evil in C. Integer overflow was perhaps a bad choice by the standards makers, but the fact still stands that even if GCC did the 'right' thing, there's no guarantee that clang or any other compiler will do the same thing. The code would still be broken, it just might be harder to figure that out. If you want integer overflow and wrapping then use the compiler flag for it and write non-standard code.
IMO, the bigger problem is that people write their code, compile it with gcc, and then assume it's standards compliant because it 'works' with gcc.
"Principle of Least Surprise" is that an integer will wrap, because that's what the hardware does in pretty much all cases. Any clever optimizations or undefined behavior should be happening due to explicit flags--which is exactly what the guy in that bug report wanted.
The thing is that, for like the last half-century, we've expected integers to overflow and wrap around... that's just how they work. Ignoring that kind of expectation is asinine.
There is an explicit flag, he's compiling with '-O2'. As he noted, without -O2 the output is correct. gcc does exactly what you're saying it should do in this instance, so I don't see what you're unhappy about.
Consulting the original bug report, the optimization is hardly clear in performance benefits. Note also that, again, optimization somehow breaking 50 years of numerical reasoning is probably not a good 'default' behavior (even in O2! especially without clear benchmarks proving its utility!).
Two's complement didn't predominate until the late 1970s, early 1980s. Before that time ones' complement predominated.
And there are plenty of processors today which only use sign-magnitude. In particular, floating point-only CPUs. Compilers must emulate two's complement for unsigned arithmetic, and so signed arithmetic is significantly faster.
The C standard is what it is for good reason. It's not anachronistic. Rather, now there are a million little tyrants who can't be bothered to read and understand the fscking standard (despite it being effectively free, and despite it being 1/10th the size of the C++ standard) and who are convinced that the C standard is _obviously_ wrong.
Which isn't a comment on this friendly-C proposal. But the vast majority of people have no idea what the differences are between well-defined, undefined, implementation defined, or unspecified behavior, and why those distinctions exist.
I think the point is that (integral) numbers stored in hardware naturally wrap, and that this behaviour is not restricted to two's complement. For that matter it's not even restricted to binary - the mechanical adding machines based on odometer-like gear wheels, operating in decimal, would wrap around much the same way, from the largest value back to the smallest... and these were around for several centuries before computers: http://en.wikipedia.org/wiki/Pascal%27s_calculator
> citing "undefined behavior", compiler writers will gleefully do all kinds of goofy shit and, if confronted by developers who actually program for a living,
Do you imagine that compilers are created by lawyers or something?
But undefined behaviour isn't disallowed by the standard, and these things are not forbidden from happening! If they were, the standard wouldn't even bother to mention any of it, and certainly wouldn't bother to suggest that one option is for things to behave "during translation or program execution in some documented manner characteristic of the platform". (See the C11 draft standard, 3.4.3.2; wording is basically the same in C99 I think.)
It seemed obvious to me from the moment I first heard about this stuff that undefined behaviour is there to avoid binding implementations' hands too tightly. It's a way of allowing as wide a range of implementations as practical to be standard-compliant, by not forcing the compiler to patch over every last difference between systems or provide missing functionality. But is it a way to let gcc do whatever it likes, having proven your program invalid on a technicality? Well... I'm less sure about that one.
The suggestions that performance improvements brought by compiler optimizations are meaningless bother me, though.
First, because hardware isn't getting faster that quickly anymore: Moore's Law hasn't meant for a long time that CPUs actually double their per-thread performance every 18 months, so that "1/10th as effective" from your first link, which refers to compilers hypothetically doubling performance every 18 years, starts to get more and more attractive.
Second, because while in C/C++ code the programmer can often avoid useless machine code, newer languages such as Rust and Swift tend to do more stuff implicitly (safety checks, reference counting) which could often be eliminated by a Sufficiently Smart Compiler - increasing the need for good optimizations. (I think that this somewhat mirrors C++'s early history compared to C, but I was too young then to have any personal experience.) Of course those languages also tend to have no undefined behavior, so it's a bit different...
Third, because I don't think undefined behavior is as evil and hard to avoid as people think it is. I think there are a few weird points (you can cast pointers into malloced buffers to any type as long as you're consistent, but there's no way to do that for a static buffer without fully static layout), and it would be nice to have more control over things like aliasing - both to loosen rules and to tighten them (i.e. more flexible restrict-like functionality). But despite being a fun topic, it doesn't seem to come up that often in practice from what I've seen, so the performance gain is close to free. And when you're, say, fighting for 60fps in a CPU limited scenario, it's hard to turn down even a small free performance gain.
Fourth, because as someone who reads assembly frequently, idiotic looking assembly bothers me aesthetically even if it often doesn't have much performance impact. I speak in particular of reloading struct fields over and over when the data is already in a register and any person looking at the code would know it would be illogical for it to change in memory since it was loaded - but the compiler isn't smart enough to prove it can't alias... Sure, in individual cases it's easy to cache it in a local variable to stop this from happening, but in the large it's hard to avoid. Strict aliasing improves the situation somewhat, which is one reason I like it.
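Roughly the pattern being described, as a sketch (hoisting the field into a local, or a restrict-qualified parameter, is the usual workaround):

    struct buf { int n; int *data; };

    void copy_out(struct buf *src, int *dst) {
        /* the store through dst might alias src->n, so src->n is reloaded
           from memory on every iteration */
        for (int i = 0; i < src->n; i++)
            dst[i] = src->data[i];
    }

    void copy_out_hoisted(struct buf *src, int *dst) {
        int n = src->n;            /* hoisted once; stays in a register */
        for (int i = 0; i < n; i++)
            dst[i] = src->data[i];
    }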
If I'm understanding correctly, I think the difference between "undefined behavior" and "returns an unspecified value" is consistency. If you had something like:
    int i = [UNDEFINED BEHAVIOR];
    if (i > 0)
        COND1;
    if (i > 0)
        COND2;
then most people would expect that either COND1 and COND2 both execute, or they both don't execute. But I believe that a compiler is theoretically free to produce code that executes one but not the other since the value of i is undefined. In other words, the compiler doesn't have to act as though i has one specific value after that assignment. It can assume any value it likes independently each time i is used. It can assume that i is positive at the first check and then assume it's negative the next time, even though it might be provably true that no code between those two checks can possibly change the value of i. The change to "unspecified value" would mean that the compiler can give i whatever value it wants to, but it has to be a single defined value, and subsequent code must not behave as though i had multiple changing values.
But I'm not really a C programmer, so feel free to correct me if I'm interpreting this wrong.
The classic example of undefined behavior is "nasal demons". Which is to say, that when you hit undefined behavior, the compiler would be completely within its rights to make demons fly out of your nose.
"Undefined behavior" means the compiler can do anything. In your example, the compiler could execute one statement but not the other. Or it could execute both of them fifteen times. Or it could reformat your hard drive, or start a war. All would be legal according to the spec. (Feasibility of implementing such things is, of course, another question.) The more mundane consequences are more likely to be what you actually see, but the point is that you can't really reason about it in the abstract. You have to know exactly how your particular compiler handles it, and it can go well beyond just executing your code in a funny way.
For one real-world example that approaches "nasal demons", early versions of gcc would start a game of nethack if they encountered a #pragma statement in your program.
Modern compilers have a tendency to just remove code that can be proven to result in undefined behavior. This can make it very difficult to track down a problem because you'll be staring at the source code, assuming the code was run, unable to fathom how it could possibly have ended up in a particular state.
If the compiler were required to return an "unspecified value", you'll at least know that your code DID run (and generate an incorrect result).
Most responses are hyperbole. Undefined behavior likely doesn't include making up new code; it's not certain where you will end up, but ending up someplace in existing code is a certainty.
Even excluding obvious fantasy like nasal demons, and real but unusual cases like starting nethack, there are completely mundane things that are reasonably common in real compilers but that don't go "someplace in existing code", like aborting on integer overflow, or crashing when you access a bad pointer.
For 8 and 3, I believe you're right (that's what I was thinking too: undefined behavior), but I think they're making some type of distinction between UB and an undefined value/result. In C right now, there isn't any guarantee your code will actually keep going, since it's UB; with an 'undefined value' I presume it means you still get some type of number and your code keeps going with that bad number (which may or may not have bad consequences). I'm not really sure this is better than just UB, but most implementations just give you whatever random value was there anyway, so it wouldn't really change much besides guaranteeing your code will run with a bad value instead of the compiler blowing up on the UB.
As for 4, I was also curious on that. I think they're just trying to specify something more specific than UB, like a SIGSEGV (or whatever signal thing Windows does). Compilers basically do that already though, and any architecture that doesn't do some type of trap on an invalid pointer probably doesn't have the infrastructure to do so. That said, the biggest issue is not what the trap is, but how the trap is caught or handled. It's not really that helpful of a standard if you know it's going to "trap", but you have no standard way of handling such a trap.
"Undefined behavior" affects the entire output of the compilation - it allows the compiler to change anything, anywhere (dropping overflow checks on the assumption that overflow can't happen is perfectly legal, for example).
An "unspecified value"/"undefined value" seems to mean that the associated operation can return anything it wants, but can't affect adjacent code.
8. Currently the behaviour is undefined, not unspecified.
Undefined behaviour means anything can happen from that point on (bad), while unspecified just gives you an unspecified value.
3. Again unspecified vs. undefined.
1. From the C11 standard: "The value of a pointer becomes indeterminate when the object it points to (or just past) reaches the end of its lifetime." When you use that pointer you get undefined behaviour.
You misinterpreted my statement. "From that point on" means from the point where you write it in your code.
I have read that article before; the use of the phrase "time travel" is nonsense. The author could just as well say the code preceding the UB can get mangled, and avoid the confusion.
Plenty of architectures do not trap at null-pointer dereferencing (they don't have traps). Some (like AVR) are not arcane, they are one of the best excuses for still writing C nowadays.
I have strong feelings that the C standard (and, by extension, C compilers) should directly support non-portable code. It means many behaviors are not "undefined"- instead they are "machine dependent". Thus overflow is not undefined- it is _defined_ to depend on the underlying architecture in a specific way.
C is a more useful language if you can make machine specific code this way.
I'm surprised that some of the pointer math issues come up. Why would the compiler assume that a pointer's value is invalid just because the referenced object is out of scope? That's crazy..
Weird results from uninitialized variables can sometimes be OK. I would kind of accept strange things to happen when an uninitialized bool (which is really an int) is not exactly 1 or 0.
Perhaps a better way to deal with the memcpy issue is this: make memcpy() safe (handles 0 size, allows overlapping regions), but create a fast_memcpy() for speed.
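One way to read that suggestion, as a sketch (fast_memcpy is the name suggested above; friendly_memcpy is made up here):

    #include <string.h>

    /* the friendly default: tolerates overlapping regions (and n == 0) */
    void *friendly_memcpy(void *dst, const void *src, size_t n) {
        return memmove(dst, src, n);
    }

    /* fast_memcpy() would then keep today's memcpy() contract: no overlap allowed */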
...with one option for undefined behaviour being for it to behave "in some documented manner characteristic of the platform" (C11 para 3.4.3.2). Which is not the same thing as being implementation defined, of course, but is at least a step above simply deciding your program is invalid and generating bogus code because of it.
So compilers could do this stuff already. I have no idea why they don't. C isn't really the language for people who want platform independence.
Back in 1990, it didn't take long to figure out that (Borland) Turbo Pascal was much less insane than C. Unfortunately, it only ran on MS-DOS & Windows, whereas C was everywhere.
Employers demanded C programmers, so I became a C programmer. (now I'm a Java programmer, for the same reason, and think it's also a compromised language in many ways)
For anybody who is willing to run a few percent slower so that array bounds get checked, there is now an open source FreePascal environment available, so as not to be dependent on the scraps of Borland that Embarcadero is providing at some cost. Of course, nobody is going to hire you to use Pascal. (or any other freaky language that gets the job at hand done better than the current mainstream Java and C# languages)
Pascal, in general, was an amazing language. Simple to understand, compiled to native code - honestly the only complaint you could now (with FreePascal) make about the language is that it is a little "wordy."
Much the same as D - which was my initial thought when I read "friendly C." We already have a friendly C.
When I first learned C, after having used Turbo Pascal for years (back in the 80s) it was frustrating. When I finally understood it well a little later, I felt like I had taken off the shackles of Pascal...
Granted, original, 1970, flavor Pascal on the CDC Cyber was very limited. I'm curious, though, what is it that Turbo Pascal 5.0 and above would prevent you from doing?
It's been a while, but if I really have to, I could cast a pointer to a long, increment it by sizeof in a loop, and cast it back to a pointer in Pascal to do pointer arithmetic in a tight loop.
I can use procedural types anywhere a function pointer in C would be used.
I can make complex constants (tables) just like a pre-initialized array of structs in C.
There is an extension to return in the middle of a function/procedure rather than toggling flags to pinball-style drop to the end of the routine.
I can set compiler pragmas to allow/force me to check for error codes after every memory and I/O call (at the risk of accidentally stumbling onward and causing secondary damage), rather than just failing with a line number.
I can nest functions inside of other functions, and reference the enclosing variables, just like in GNU C (oh, wait, that's a non-standard GNU extension), even if I can't return such a function as a "use it later", honest to God closure.
What are the shackles? (I'm probably in for a face palm, oh, that, moment when I get the answer, but I'm drawing a blank at the moment)
I think it would be supporting libraries. Looking at RedMonk's statistics[1] Pascal doesn't even feature (I was able to find ~600 on GitHub). By today's standards that's a pretty big shackle, you'd have to write a lot of functionality yourself (although that's ignoring that you can use C libraries with a little effort).
I love the language, it just had no staying power. Maybe people just prefer curly braces.
I mean, how many XML interpreters do we really need? At least for Java, it seems like there are too many "Enterprise!!!" libraries that seem to exist to extend the lameness of the language by requiring you to write a DSL in XML.
I suppose for C there really are a lot of low level libraries that actually do things, though, such as read/write sound and graphics files, or implement useful abstract data types. Things that might seem "built in" in a 21st century language/environment, but not in C or Pascal.
I don't really see what's to be accomplished by most of the points of this. A program that invokes undefined behaviour isn't just invalid; it's almost certainly _wrong_. Shifting common mistakes from undefined behaviour to unspecified behaviour just makes such programs less likely to blow up spectacularly. That doesn't make them correct; it makes it harder to notice that they're incorrect.
Granted, not everything listed stops at unspecified behaviour. I'm not convinced that that's a good thing, though. Even something like giving signed integer overflow unsigned semantics is pretty effectively useless. Sure, you can reasonably rely on twos-complement representation, but that doesn't change the fact that you can't represent the number six billion in a thirty-two bit integer, and it doesn't make 32-bit code that happens to depend on the arithmetic properties of the number six billion correct just because the result of multiplying two billion by three is well-defined.
Then there's portability. Unaligned access is a good example of this. Sure, you can access an "int" aligned however you like on x86. It'll be slow, but it'll work. On MIPS, though? Well, the compiler could generate code to automate the scatter-gather process of accessing unaligned memory locations. This is C, though. It's supposed to be close to the metal; it's supposed to force-- I mean, let you do everything yourself, without hiding complexity from the programmer. How far should the language semantics be stretched to compensate for programmers' implicit, flawed mental model of the machine, and at what point do we realize that we already have much better tools for that level of abstraction?
I have a feeling you haven't read through all of the linked posts and papers.
The problem they're trying to address is that C compilers take advantage of undefined behavior for optimizations. Such optimizations can cause very strange, unintuitive behavior that is very difficult to discover. The linked posts, papers, and even this thread provide many great examples.
You're right that the programs are wrong. The goal of this proposal is to make them wrong in reasonable ways.
>The problem they're trying to address is that C compilers take advantage of undefined behavior for optimizations
That's not a problem. That's a good thing.
>Such optimizations can cause very strange, unintuitive behavior that is very difficult to discover
That's a problem -- or at least the "difficult to discover" part is. "Strange" and "unintuitive" is helpful; it's a nice, big red flag. How does migrating from undefined behaviour to producing unspecified values make bugs easier to discover? I can see how they would make results more consistent, and the bugs easier to hunt down, but that's only useful once you know the bugs are there (inconsistency is another useful red flag here), and there are already good tools like valgrind and ubsan for tracking down the source of the bug.
>The goal of this proposal is to make them wrong in reasonable ways
That isn't the purview of C, though. It's a noble goal, don't get me wrong, but stepping on the optimizer's toes and reinforcing plainly bad programming practices -- I know that's not an intended effect, but it will happen -- isn't the way to do it. A better way, just as an example, would be to give the programmer a proper mechanism for encoding preconditions and other interprocedure-analysable constraints. This doesn't reinforce bad practices, it could actually help the optimizer if done right, and would perhaps encourage programmers to reason a little more rigorously about their code -- an ounce of prevention and all that.
The behavior is undefined on a system with 32 bit integers because of signed arithmetic overflow (despite the fact that all the explicit types involved are unsigned, a uint8_t gets promoted to a signed integer before the left-shift operation).
Right now it will work on every compiler I've tried, but it would be perfectly valid (by the ANSI specification) for a compiler to assume that the result of that function can never have the highest bit set. In friendly C, the result is well defined.
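A sketch of the kind of function being described (my reconstruction, not the poster's actual code): on a platform with 32-bit int, the uint8_t operand is promoted to signed int before the shift, so shifting a byte of 0x80 or more into the top bit overflows a signed value.

    #include <stdint.h>

    uint32_t pack_high_byte(uint8_t c) {
        /* c << 24 is computed in (signed) int; for c >= 0x80 the result does
           not fit in a 32-bit int, which is undefined behaviour, even though
           every named type here is unsigned */
        return (uint32_t)(c << 24);
    }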
No, that one's a consequence of C's insane type system. The solution here isn't to change the semantics of signed integer arithmetic. The solution is to change integer promotion to use unsigned arithmetic like it should have done in the first place.
Not taking a position on this, and it's been a long while, but I seem to remember that the discussion in X3J11 on the issue of integer promotions, which mostly occurred before I joined, were long and heated.
Integer promotion will not help, because it may not go to a sufficiently wide unsigned type to cover the shift. (In practice it will, but unsigned int could be just 16 bits).
Promotion of unsigned chars to unsigned int would have problems of its own, mostly because unsigned arithmetic (modulo power of two arithmetic) is inappropriate for most uses, and error-prone: it has a large, silent discontinuity right next to zero.
Alas, in fact, unsigned chars can promote to unsigned int: on rare platforms like DSP's where you have sizeof (int) == 1. Sigh.
From the examples I'm familiar with, shifting from undefined to unspecified actually makes invalid programs _more_ likely to blow up spectacularly, because they're likely to go on and try to use the unspecified value rather than having the code path that uses it quietly excised or transformed.
You aren't aware just how many security-critical C programs invoke potentially undefined behaviour today.
Every time you write '+' in a C program between two signed values, do you do an overflow check beforehand? Do you know what it takes to write that overflow check? Or do you do a global dataflow analysis of the program to prove that it can't overflow?
Saying that 95% (at a conservative guess, IMO) of C programs out there are wrong makes for a strong argument for this proposal, not against it. Arguing for a technical correctness that is observably almost impossible to achieve in the wild is what's really pointless.
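For reference, a correct pre-check for signed addition looks something like the following sketch; the point a couple of paragraphs up is that almost nobody writes this before every '+'.

    #include <limits.h>

    /* returns nonzero if a + b would overflow int */
    int add_would_overflow(int a, int b) {
        if (b > 0 && a > INT_MAX - b) return 1;
        if (b < 0 && a < INT_MIN - b) return 1;
        return 0;
    }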
> You aren't aware just how many security-critical C programs invoke potentially undefined behaviour today.
It doesn't take much experience with C to appreciate something of the magnitude of difficulty of keeping a million lines of C UB-free. I'm not in the security field, but if the code there is anything like the code I've worked with, I'd be surprised to find even a "security-critical" C program of any appreciable complexity that doesn't invoke undefined behaviour.
>Every time you write '+' in a C program between two signed values, do you do an overflow check beforehand?
Personally? I use signed scalar arithmetic very rarely. The bulk of the arithmetic I do is with unsigned integers (bit vectors, really) and arbitrary-precision integers. In fact, essentially the only time I do use signed scalar arithmetic is when I can determine statically that it won't overflow. Signed integer overflow is a bitch, UB optimizations or no.
>Or do you do a global dataflow analysis of the program to prove that it can't overflow?
That's the basic idea behind seL4, for example.
>Arguing for a technical correctness that is observably almost impossible to achieve in the wild is what's really pointless.
Sure, but only because everyone writes their security-critical programs in a language that couldn't more effectively impede security if it were designed to do so. Could a "friendly" dialect help? Maybe, but the proposed changes are far from sufficient to make formal verification feasible, and frankly, I don't know how seriously I can take "security-critical" without formal verification.
> Every time you write '+' in a C program between two signed values, do you do an overflow check beforehand? Do you know what it takes to write that overflow check?
But UB isn't the problem here. You can make two's complement wraparound the defined behavior, and programs which fail to check their arithmetic are very likely to be wrong.
This is already the case with unsigned overflow: it's defined to wrap around. Yet many many programs are vulnerable because they do something like malloc(nitems * size) and carry on thinking they must've gotten the wanted amount of memory.
Most of the time, you really have to do the right checks whether your arithmetic is signed or not. Whether wraparound is defined or not.
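The usual guard for that malloc(nitems * size) pattern is something like the following sketch; defined wraparound doesn't remove the need for it.

    #include <stdint.h>
    #include <stdlib.h>

    /* reject the request if nitems * size would wrap around size_t */
    void *checked_alloc(size_t nitems, size_t size) {
        if (size != 0 && nitems > SIZE_MAX / size)
            return NULL;
        return malloc(nitems * size);
    }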
"Reading from an invalid pointer either traps or produces an unspecified value."
That still leaves room for obscure behavior:
if( p[i] == 0) { foo();}
if( p[i] != 0) { bar();}
Calling foo might change the memory p points at (p might point into the stack or it might point to memory in which foo() temporarily allocates stuff, or the runtime might choose to run parts of free() asynchronously in a separate thread), so one might see cases where both foo and bar get called. And yes, optimization passes in the compiler might or might not remove this problem.
Apart from truly performance-killing runtime checks I do not see a way to fix this issue. That probably is the reason it isn't in the list.
(Feel free to replace p[i] by a pointer dereference. I did not do that above because I fear HN might set stuff in italics instead of showing asterisks)
Short of languages like Clojure and Haskell that are heavily opinionated on immutability and side effects, I can't think of any languages that meaningfully protect against that kind of obscure behaviour. Indeed, it seems "else" was meant for exactly this kind of code.
I would think also something like "if I write a piece of code, the compiler should compile it", perhaps "or else tell me with a warning that it isn't going to compile it".
> 1. The value of a pointer to an object whose lifetime has ended remains the same as it was when the object was alive.
This does not help anyone; making this behavior defined is stupid, because it prevents debugging tools from identifying uses of these pointers as early as possible. In practice, existing C compilers do behave like this anyway: though any use of the pointer (not merely dereferencing use) is undefined behavior, in practice, copying the value around does work.
> 2. Signed integer overflow results in two’s complement wrapping behavior at the bitwidth of the promoted type.
This seems like a reasonable request since only museum machines do not use two's complement. However, by making this programming error defined, you interfere with the ability to diagnose it. C becomes friendly in the sense that assembly language is friendly: things that are not necessarily correct have a defined behavior. The problem is that then people write code which depends on this. Then when they do want overflow trapping, they will have to deal with reams of false positives.
The solution is to have a declarative mechanism in the language whereby you can say "in this block of code, please trap overflows at run time (or even compile time if possible); in this other block, give me two's comp wraparound semantics".
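Standard C has no such block-scoped mechanism today; the closest approximations I know of are compiler-specific, e.g. GCC/Clang's checked-arithmetic builtins (or the -ftrapv flag for a whole translation unit). A sketch using the builtin:

    #include <stdio.h>
    #include <stdlib.h>

    int add_or_trap(int a, int b) {
        int sum;
        if (__builtin_add_overflow(a, b, &sum)) {   /* GCC/Clang builtin */
            fprintf(stderr, "signed overflow trapped\n");
            abort();
        }
        return sum;                                 /* wrap-free result */
    }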
> 3. Shift by negative or shift-past-bitwidth produces an unspecified result.
This is just word semantics. Undefined behavior, unspecified: it spells nonportable. Unspecified behavior may seem better because it must not fail. But, by the same token, it won't be diagnosed either.
A friendly C should remove all gratuitous undefined behaviors, like ambiguous evaluation orders. And diagnose as many of the remaining ones which are possible: especially those which are errors.
Not all undefined behaviors are errors. Undefined behavior is required so that implementations can extend the language locally (in a conforming way).
One interpretation of ISO C is that calling a nonstandard function is undefined behavior. The standard doesn't describe what happens, no diagnostic is required, and the range of possibilities is very broad. If you put "extern int foo()" into a program and call it, you may get a diagnostic like "unresolved symbol foo". Or a run-time crash (because there is an external foo in the platform, but it's actually a character string!) Or you may get the expected behavior.
> 4. Reading from an invalid pointer either traps or produces an unspecified value. In particular, all but the most arcane hardware platforms can produce a trap when dereferencing a null pointer, and the compiler should preserve this behavior.
The claim here is false. Firstly, even common platforms like Linux do not actually trap null pointers. They trap accesses to an unmapped page at address zero. That page is often as small as 4096 bytes. So a null dereference like ptr[i] or ptr->memb where the displacement goes beyond the page may not actually be trapped.
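A sketch of that situation (the sizes here are illustrative): with only a 4096-byte page unmapped at address zero, the first field of a null pointer traps, but a field at a large enough offset can land in mapped memory and silently read garbage.

    struct big {
        char pad[8192];      /* pushes the next field past a 4096-byte guard page */
        int  field;
    };

    int read_field(struct big *p) {
        return p->field;     /* with p == NULL this reads address 8192, which
                                may be mapped, so no trap is guaranteed */
    }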
Reading from invalid pointers already has the de facto behavior of reading an unspecified value or else trapping. The standard makes it formally undefined, though, and this only helps: it allows advanced debugging tools to diagnose invalid pointers. We can run our program under Valgrind, for instance, while the execution model of that program remains conforming to C. We cannot valgrind the program if invalid pointers dereference to an unspecified value, and programs depend on that; we then have reams of false positives and have to deal with generating tedious suppressions.
> 5. Division-related overflows either produce an unspecified result or else a machine-specific trap occurs.
Same problem again, and this is already the actual behavior: possibilities like "demons fly out of your nose" do not happen in practice.
The friendly thing is to diagnose this, always.
Carrying on with a garbage result is anything but friendly.
> It is permissible to compute out-of-bounds pointer values including performing pointer arithmetic on the null pointer.
Arithmetic on null works on numerous compilers already, which use it to implement the offsetof macro.
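The traditional definition relies on exactly that (a sketch of the classic form, not any particular library's actual macro):

    #include <stddef.h>

    #define my_offsetof(type, member) ((size_t)&(((type *)0)->member))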
> memcpy() is implemented by memmove().
This is reasonable. The danger in memcpy not supporting overlapped copies is not worth the microoptimization. Any program whose performance is tied to that of memcpy is badly designed anyway. For instance if a TCP stack were to double in performance due to using a faster memcpy, we would strongly suspect that it does too much copying.
> The compiler is granted no additional optimization power when it is able to infer that a pointer is invalid.
That's not really how it works. The compiler assumes that your pointers are valid and proceeds accordingly. For instance, aliasing rules tell it that an "int *" pointer cannot be aimed at an object of type "double", so when that pointer is used to write a value, objects of type double can be assumed to be unaffected.
C compilers do not look for rule violations as an excuse to optimize more deeply, they generally look for opportunities based on the rules having been followed.
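A sketch of that aliasing assumption in action: because a store through an int pointer is assumed not to modify a double object, the compiler may return the previously loaded value without re-reading *dp.

    double read_back(int *ip, double *dp) {
        double before = *dp;
        *ip = 42;            /* assumed, by type-based aliasing rules, not to touch *dp */
        return *dp;          /* so this may be folded to 'before' */
    }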
> When a non-void function returns without returning a value, an unspecified result is returned to the caller.
This just brings us back to K&R C before there was an ANSI standard. If functions can fall off the end without returning a value, and this is not undefined, then again, the language implementation is robbed of the power to diagnose it (while remaining conforming). Come on, C++ has fixed this problem, just look at how it's done! For this kind of undefined behavior which is erroneous, it is better to require diagnosis, rather than to sweep it under the carpet by turning it into unspecified behavior. Again, silently carrying on with an unspecified value is not friendly. Even if the behavior is not classified as "undefined", the value is nonportable garbage.
It would be better to specify it a zero than leave it unspecified: falling out of a function that returns a pointer causes it to return null, out of a function that returns a number causes it to return 0 or 0.0 in that type, out of a function that returns a struct, a zero-initialized struct, and so on.
Predictable and portable results are more friendly than nonportable, unspecified, garbage results.
I believe you have confused undefined behavior and implementation defined behavior. Undefined behavior means that the code is not legal, and if the compiler encounters it, it is allowed to eliminate it, and all consequential code. (The linked posts and papers have lots of examples of this.)
Implementation defined behavior means that the code is legal, but the compiler has freedom to decide what to do. It, however, is not allowed to eliminate it.
After 20 years of comp.lang.c participation, it's unlikely that I'm confusing UB and IB.
Undefined behavior doesn't state anything about legality; only that the ISO C standard (whichever version applies) doesn't provide a requirement on what the behavior should be.
Firstly, that doesn't mean there doesn't exist any requirement; implementations are not only written to ISO C requirements and none other. ISO C requirements are only one ingredient.
Secondly, compilers which play games like what you describe are not being earnestly implemented. If a compiler detects undefined behavior it should either diagnose it or provide a documented extension. Any other response is irresponsible.
The unpredictable actual behaviors should arise only when the situation was ignored. In fact it may be outright nonconforming.
The standard says:
"Possible undefined behavior ranges from ignoring the situation completely with unpredictable
results, to behaving during translation or program execution in a documented manner characteristic of the
environment (with or without the issuance of a diagnostic message), to terminating a translation or
execution (with the issuance of a diagnostic message)."
A possible interpretation of the above statement is that the unpredictable results occur only if the undefined behavior is ignored (i.e. not detected).
If some weird optimizations are based on the presence of undefined behavior, they are in essence extensions, and must be documented. This is because the situation is being exploited rather than ignored, and the program isn't being terminated with or without a diagnostic. That leaves "behaving in a documented manner".
But optimizing based on explicitly detecting undefined behavior is not a legitimate extension. It is simply insane, because the undefined behavior is not actually being defined. There is no extension there, per se. Optimization requires abstract semantics, but in this situation, there aren't any; the implementation is taking C that has no (ISO standard) meaning, it is not giving it any meaning, and yet, it is trying to make the meaningless program go faster. Doing all this without issuing a diagnostic is criminal.
I don't think the GCC people are really doing this; people only think they are. Rather, they are writing optimizations which assume that behavior is not undefined, which is completely different. The potential there is to be over-zealous: to forget that GCC is expected to be consistent from release to release: that it preserves its set of documented extensions, and even some of its undocumented behaviors. Not every behavior in GCC that is not documented is necessarily a fluke. Maybe it was intentional, but failed to be documented properly.
Compiler developers must cooperate with their community of users. If 90% of the users are relying on some undocumented feature, the compiler people must be prepared to make compromises. Firstly, revert any change which breaks it, and then, having learned about it, try to avoid breaking it. Secondly, explore and discuss this behavior to see how reliable it really is (or under what conditions). See whether it can be bullet-proofed and documented. Failing that, see if it can be detected and diagnosed. If such a behavior can be detected and diagnosed, then it can be staged through planned obsolescence: for a few compiler releases, there is a diagnostic, but it keeps working. Then it stops working, and the diagnostic can change to a different one, which merely flags the error.
> But optimizing based on explicitly detecting undefined behavior is not a legitimate extension. It is simply insane, because the undefined behavior is not actually being defined.
The authors of this proposal agree, and are trying to avoid such situations by just eliminating undefined behavior. Their blog posts and academic papers (linked from the blog post in this submission) have many examples where such insanity has happened.
It is friendly, if you're not an idiot. Not becoming an idiot is the best solution (practice, lots).
Edit: perhaps I made my point badly but don't assume that you'll ever be good enough to not be an idiot. Try to converge on it if possible though. I'm still an idiot with C and I've been doing it since 1997.
> Not becoming an idiot is the best solution [...] [...] but don't assume that you'll ever be good enough to not be an idiot. Try to converge on it if possible through. I'm still an idiot with C and I've been doing it since 1997
We have wildly differing definitions of "best solution". I like solutions to be actually achievable, for example.
The problem is, by that metric, everyone is an "idiot." There isn't a C programmer alive who hasn't been bitten by most, if not all, of these types of memory bugs.
For nearly all of the last 10 years, I have been working on the specific topic of how to better predict what a C program can do (and in particular, correctly predict that a C program will behave well). This is a shorter time than 17 years, but it is more focused than using C as a tool for actually achieving something else, where you aren't thinking about the specifics of C most of the time, hopefully.
Despite this, a fortnight ago I was caught off-guard by GCC 4.9's choice of compiling the program in http://pastebin.com/raw.php?i=fRbGfQ6p to an executable that displays “p is non-null” followed by “p:(nil)”.
There has to be a way out of this. Seriously. Most C programmers neither asked for nor deserve this.
If you don't mind me taking a guess, is it because calling memcpy with NULL is undefined-behavior, thus GCC assumes p must be non-null and thus the if (p) is redundant and x must equal 1?
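A guess at the general shape of the program behind the pastebin link (the actual code is not reproduced in this thread and likely differs in detail): even a zero-length memcpy from a null pointer is undefined behaviour, so GCC 4.9 may conclude p must be non-null and keep the first printf, while the second still prints (nil).

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        int *p = NULL;
        int x = 0;
        memcpy(&x, p, 0);               /* UB: even a zero-length copy from NULL */
        if (p)
            printf("p is non-null\n");  /* the compiler may keep this branch... */
        printf("p:%p\n", (void *)p);    /* ...while this still prints (nil) */
        return 0;
    }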
"Best" by what quality metric? "Not becoming an idiot" is difficult and time-consuming. Furthermore, there's no reliable way to determine if you have in fact succeeded in ceasing to be an idiot because part of the license that compilers have when encountering undefined behavior is to mask it. So even if your code compiles without warnings and passes all tests that is no guarantee that you have ceased to be an idiot. So requiring a compiler to do something more reasonable in the face of undefined behavior than reformat your hard drive (or at least to make it an option) is not an unreasonable position.