Curl is C (2017) (haxx.se)
85 points by taf2 on Feb 18, 2021 | 79 comments



I think something that is often missing in such posts is the appalling attitude towards undefined behavior in C compilers.

Yes, okay. You've been careful. You've checked you're not accidentally overflowing a buffer, traced things with valgrind, all that good stuff.

None of that stops the compiler from doing random bullshit such as this in the search for better optimization:

https://gcc.godbolt.org/#%7B%22version%22%3A3%2C%22filterAsm...

Personally, as a programmer I dread that sort of thing. I want code to work, not to try to figure out how a compiler could interpret my code in the most twisted way possible, like some sort of evil genie.

I think that's one area where modern compilers are going wrong -- combing the spec for loopholes in an attempt to optimize better ends up compiling code into something nobody in their right mind would have intended.

Edit: For extra fun, try adding a std::cout to NeverCalled() -- it gets even more interesting.


This same discussion happens in every thread about C safety. I wrote about it in my essay, "With undefined behavior, anything is possible"[1].

Basically, by saying the compiler is doing things wrong even though it is following the standard, you're putting yourself in the "semi-portable C" camp. That's fine, there are lots of people in that camp. It still represents important C projects such as the Linux kernel, which is written in the GCC dialect of C, with flags to control various things that would otherwise be undefined behavior. (It now compiles in Clang as well, but I'd ascribe that to a significant effort in making Clang successfully compile the GCC dialect; in either case it's most decidedly not standard C)

Here's the thing, though. Even though a good case can be made for the semi-portable C camp, it is losing. Regehr's proposal for a "Friendly C" dialect failed, and, a few years on, it certainly doesn't look like it will be taken up again. A lot of people who care about undefined behavior are simply moving to Rust. Or if you do want the advantages of C (extreme portability, excellent support from tools, etc), then you use tools such as static analyzers, sanitizers, and fuzzers, with the goal of getting all UB (as defined by the standard) out of your program.

So, bottom line, if you're hoping for consensus on "reasonable" behavior from a C compiler, so that you can count on your code working (even if it contains UB as defined by the standard), it's not going to happen.

[1]: https://raphlinus.github.io/programming/rust/2018/08/17/unde...


Or... the standard just has bugs which could be fixed. Bugs meaning: being out of line with the history of C and large amounts of C code in the wild.

The more people beat the standard drum, the worse things will get until the standard itself is fixed.

Other languages that don't have a standard don't have this problem (but they do have other problems).


Another characteristic of this topic is that people from different camps keep talking past each other, and the discussion goes in circles.

What you propose is basically identical to Regehr's "Proposal for a Friendly Dialect of C". What's different in 2021 that will make it succeed now when it failed in 2014? If anything, there's less interest now, as there are viable alternatives and increased investment in tools that work with standard C.

Small bugs are being fixed. One of the most surprising is that shift-left of a negative number was UB (a fact that would be shocking to anyone in the semi-portable camp, who would reasonably expect it to compile to SAL on x86). Fortunately, this (ETA: hopefully) will be fixed in C2x.

ETA: As of N2596, it's still not fixed in the C2x draft. There certainly have been proposals to fix it, and I thought it had been agreed upon, but if so, it hasn't found its way into the draft yet. In the meantime, good luck shifting those negative integers!
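
For anyone who needs the shift today, here is a minimal sketch of one common workaround: do the shift on the unsigned type and convert back. The helper name `shl_i32` is made up for illustration. Converting the out-of-range result back to `int32_t` is implementation-defined rather than undefined; GCC documents it as reduction modulo 2^N, and I'm assuming a two's complement target that behaves the same way.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: left-shift a signed 32-bit value without the
   undefined behavior of `x << n` when x is negative. */
static int32_t shl_i32(int32_t x, unsigned n)
{
    assert(n < 32);  /* shifting by >= the width is UB even for unsigned */

    /* Unsigned shifts are fully defined: high bits are simply discarded. */
    uint32_t bits = (uint32_t)x << n;

    /* Implementation-defined conversion (not UB); wraps on GCC/Clang. */
    return (int32_t)bits;
}
```

With those assumptions, `shl_i32(-1, 4)` gives -16, which is what the semi-portable camp would expect from SAL.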


"What's different in 2021 that will make it succeed now when it failed in 2014? If anything, there's less interest now, as there are viable alternatives and increased investment in tools that work with standard C."

You may be right, but many things could conceivably change.

Viable alternatives like Rust may result in C losing ground and feeling pressure to keep up.

Some huge corporation could announce a focus on security and that they will try to minimize C usage.

More people could adopt the goal of a "friendly dialect of C", and be more determined or successful.

Compiler developers could somehow step over the line -- make a compiler that makes such wild optimizations that it results in a backlash.

Or it could just be improved very gradually, with little bits of the "friendly dialect" being adopted one by one rather than all at once.

I'm not saying by any means I'm confident in anything of the sort, but things change a lot over time. Back when I was getting started, Perl was everywhere. Today pretty much nobody does big Perl projects anymore. It had a big miss with Perl 6, and that was bad enough that it got overtaken by competition. While C is much bigger and more resilient I think it's not impossible by any means for it to feel pressure to adapt.

I definitely expect a lot of resistance to change, but the world changes nonetheless.


I didn't like Regehr's proposal because I don't want a friendly dialect of C. I mostly just want C the way it worked up until, say, GCC 4.x.

I don't know specifically how to fix the standard, although I've been thinking about it. A simple idea would be like the Linux mandate "don't break userspace." The language-lawyering has to stop, and more rules are unlikely to help.


"C the way it worked until GCC 4.x" is basically a worse version of friendly C.

You can't on the one hand say "no language lawyering" and on the other hand call certain kinds of optimizations bugs. Compiler developers need to be able to tell something will be considered a bug before users complain--you can always compile on -O0 if you don't want compiler optimizations, and many people consider performance regressions bugs in their own right, so they're going to try to eke out every bit of performance they can on higher optimization levels.


> "semi-portable C" camp

Do you mean the "every CPU is an x86" camp?


No, that's the unportable camp, which doesn't have much following these days. Semi-portable basically means that you use #ifdef and similar techniques to control the code that gets generated for the particular CPU and platform. To give one example, a lock-free ring buffer may have memory barriers that are #ifdef'ed out on x86 because you can rely on total store order. Similarly, `volatile` on MSVC can be expected to provide memory ordering guarantees (see https://docs.microsoft.com/en-us/cpp/cpp/volatile-cpp?view=m...).
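
A rough sketch of what that ring buffer pattern can look like, assuming GCC/Clang syntax and a single producer and single consumer. All names here (`RING_BARRIER`, `struct ring`, `ring_push`, `ring_pop`) are invented for the example. Note that this is deliberately not standard C: the plain shared accesses count as a data race under C11, and the code leans on what the compiler and CPU are known to do, which is exactly the semi-portable style.

```c
#include <assert.h>
#include <stddef.h>

/* Semi-portable barrier selection (GCC/Clang extensions assumed). */
#if defined(__x86_64__) || defined(__i386__)
/* x86 is total store order: only stop the compiler from reordering. */
#define RING_BARRIER() __asm__ __volatile__("" ::: "memory")
#else
/* Elsewhere, fall back to a full hardware barrier. */
#define RING_BARRIER() __sync_synchronize()
#endif

#define RING_SIZE 8u  /* power of two, so index wrap is a mask */

struct ring {
    int slots[RING_SIZE];
    volatile unsigned head;  /* written only by the producer */
    volatile unsigned tail;  /* written only by the consumer */
};

/* Producer side: returns 0 if the ring is full. */
static int ring_push(struct ring *r, int value)
{
    unsigned head = r->head;
    if (head - r->tail == RING_SIZE)
        return 0;
    RING_BARRIER();  /* see the freed slot before overwriting it */
    r->slots[head & (RING_SIZE - 1)] = value;
    RING_BARRIER();  /* publish the slot before the new head */
    r->head = head + 1;
    return 1;
}

/* Consumer side: returns 0 if the ring is empty. */
static int ring_pop(struct ring *r, int *out)
{
    assert(out != NULL);
    unsigned tail = r->tail;
    if (r->head == tail)
        return 0;
    RING_BARRIER();  /* read the slot only after seeing the new head */
    *out = r->slots[tail & (RING_SIZE - 1)];
    RING_BARRIER();  /* finish reading before handing the slot back */
    r->tail = tail + 1;
    return 1;
}
```

A strictly portable version would use C11 `<stdatomic.h>` acquire/release operations instead of the `#ifdef`.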


The concept is closer to the idea that C is portable assembler, and that by writing very specific constructions in C you can reliably control the exact assembly that is emitted.


> undefined behavior in C compilers

My unpopular opinion is that UB fear is greatly exaggerated these days, especially by folks who want to capture developers for their languages. Somehow, they bring it up every time they can. It's like every time you want to jump into your car, I jump out from behind a bush just to tell you "remember car accidents!".

I have written C for a living for the past ~20 years (mostly for embedded/kernel/drivers). I can't recall my last UB problem (if any). It's true. Though, I did find myself reading the ASM output just to discover that some instruction was omitted or did weird things (compiler bug).

Don't forget that the world runs on C.


Unless you are better than every other C programmer on the planet, I am quite confident I could easily find UB in your code as long as you worked with a team larger than a couple of people. If not in the stuff you wrote (though that's unlikely), then in other people's code that called yours without respecting invariants.


The world also runs on fossil fuels, politics and processed foods. ;)


We can lower the fossil fuel consumption by writing more efficient software, in a more efficient programming language like C. It gets less processed as well. Don't have an idea for politics.


Just for the record, you share your unpopular opinion with me.


100% agree.


The code you're compiling is flat-out wrong though. This is exactly a nasal demons situation. Since `Do` is static, the compiler can guarantee that the only possible place it could be set in this file is in `NeverCalled`. Since calling `Do` is only ever valid after that line in `NeverCalled`, it can safely assume that you'd only ever call `Do` after it's been set to `EraseAll`. Anything else is 100% undefined behaviour. You should have a check on the value of `Do` or have a default value.
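
To make the suggested check concrete, here is a small sketch. It is not the code behind the godbolt link; it only reuses the names mentioned in this thread (`Do`, `NeverCalled`, `EraseAll`), and `run_do` and the `erase_calls` counter are invented stand-ins so the example has no destructive side effects.

```c
#include <stddef.h>

typedef int (*Function)(void);

static int erase_calls;  /* stand-in side effect for this sketch */
static Function Do;      /* static storage: starts out as a null pointer */

static int EraseAll(void) { return ++erase_calls; }

void NeverCalled(void) { Do = EraseAll; }

/* Only call through Do once it has been set; report failure otherwise. */
int run_do(void)
{
    if (Do == NULL)
        return -1;       /* defined behavior: nothing gets called */
    return Do();
}
```

With the check in place the null case is ordinary defined behavior, so the compiler has nothing to "refine": `run_do()` returns -1 until `NeverCalled()` has actually run.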


This boils down to "just don't do undefined behavior". It's 100% correct and misses the point entirely.[1]

1: https://nibblestew.blogspot.com/2020/04/your-statement-is-10...


I don't think OP is arguing that this is a violation of the language spec; they're arguing that the spec (as implemented) is stupid.


Right, and that's one of the things that makes C a bad language for security sensitive issues -- not just that it'll let you use memory after free(), but also that compilers have a liking for doing creative interpretations.


What's wrong with a compiler error? "You are calling a function (Do) that is only defined by calling a function (NeverCalled) that we can't guarantee you ever call"


Such optimizations seem unreasonable in pathologically constructed examples such as this one, overlooking that in many more practical examples such optimizations can truly make a difference in well behaved programs.

A compiler being able to prove that certain code paths cannot be taken under certain conditions is quite a powerful tool that allows said compiler a great deal of practical optimizations. It is normally quite hard to prove that it will not happen, but relying on it being u.b. if it does happen can simplify this task, for in well-behaved programs it will not happen indeed.

Consider a more practical example: we have an integer variable in a language where integer overflow is u.b. Consider that the compiler can prove that if the integer be positive at the start of a function, then for a certain test to succeed after arithmetic, overflow must have occurred; as such the compiler can statically assume that the test will always fail so long as the integer be positive, and can greatly simplify the computational effort to compute the test.

If the program be well-written, the programmer has indeed ensured that no integer overflow can occur.

We can even up the ante and say that this code path is the only place where a certain memory location is modified and that the compiler can prove that, and can thus prove that so long as the initial integer be positive, that value will not be modified.

The code reads the value at some point after the potential modification, but since the compiler knows that so long as the integer be positive that it cannot be modified, it can forgo reading it a second time, saving memory lookup.

Of course, this is all under the assumption that the programmer indeed did design the code so that no integer overflow could ever occur by whatever means. If it do occur, then there are our nasal dæmons: the compiler has removed a codepath entirely, and a value that was updated seems to not be updated at all.

Nevertheless, the assumption that compilers are free to make, that no u.b. will ever occur in practice, allows them to indeed prove properties about control flow that can lead to substantial optimizations that one is typically not even consciously aware of in well-behaved code.
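
A tiny sketch of the kind of test described above, with function names of my own choosing. The first function only "succeeds" if the addition overflows, so a compiler that assumes no u.b. may fold it to a constant; the second asks the same question without ever performing the overflowing arithmetic.

```c
#include <limits.h>

/* Looks like an overflow test, but `x + 1` overflowing is u.b., so an
   optimizer is allowed to assume it never happens and return 0 always. */
int next_wraps_naive(int x)
{
    return x + 1 < x;
}

/* The same question, asked without doing the overflowing addition. */
int next_wraps_checked(int x)
{
    return x == INT_MAX;
}
```

In a well-behaved program the first function is never called with `INT_MAX`, and then folding it away is a pure win; the trouble only starts when the programmer was counting on the wraparound.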


That optimization is not random. The only possible targets for that function pointer are null and the function EraseAll. Since you're a careful programmer, you are therefore smart and you never call a null function pointer. Therefore the compiler can disregard the possibility that the function pointer is null, so the only possible valid target is EraseAll.


"Since you're a careful programmer, you are therefore smart and you never call a null function pointer."

I don't think that's remotely justifiable to assume in modern times. There's 150K lines of C code in the curl repository, with commits from 862 authors, 555 of whom only ever made a single commit.

There's no doubt that some of those will be less than perfect, or have an off day, or just not be familiar enough with the code, or that somebody will mess up a merge and remove a line too many.


This is the crux of the problem. When I put on my security hat, I absolutely agree with you: we should give safety rails to programmers, because decades of experience have shown that programmers are incapable of writing correct code without them.

But when you go around and start talking to most C programmers, they don't want those safety rails, they instead want maximum performance. I won't name and shame anyone specific here, but comments along the lines of "we write in C because it keeps all the idiot programmers out of our project" are not unknown. (And then some of those people turn around and complain when the compiler didn't compile their broken code in the way they expected it to, ugh.)

To me, the biggest problem with C isn't that it has undefined behavior, nor that it lacks the safety rails. It's that it gives you undefined behavior while simultaneously avoiding giving you any tools to enforce avoidance of it (and in a couple of cases, it downright encourages you to commit it. Try checking for signed integer overflow without triggering it!). Contrast this with Rust. Rust doesn't have any less undefined behavior--it actually has even more cases that are undefined--but at the same time, it makes it harder to actually go trip over those cases. In the function pointer example, your basic function pointer is required to be non-null, and it's annoying to construct a null function pointer, so you instead wrap it in Option. But once it's in Option, there's no (easy) way to call it without first making sure that it's non-null, so you end up with code that can never actually call a null function pointer.
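
For the signed overflow challenge, one way to do it in plain C is to compare against the limits before adding. This is only a sketch, and `checked_add` is a hypothetical name; GCC and Clang also offer `__builtin_add_overflow` as a non-standard extension that does the same job.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: stores a + b in *out and returns true, or returns
   false (leaving *out untouched) if the sum would overflow int. */
bool checked_add(int a, int b, int *out)
{
    assert(out != NULL);

    /* Both range checks use subtractions that cannot overflow themselves. */
    if (b > 0 && a > INT_MAX - b)
        return false;    /* sum would exceed INT_MAX */
    if (b < 0 && a < INT_MIN - b)
        return false;    /* sum would fall below INT_MIN */

    *out = a + b;        /* now known to be in range */
    return true;
}
```

It works, but it is easy to get the signs wrong, and nothing in the language nudges you to write it this way instead of `a + b < a`.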


hahaha, I love this example

By the way, I think that the behavior of the compiler is quite reasonable here. It could emit a warning, though.


I don't think this should have a warning by default (or even in -Wall -Wextra). The code is not unreasonable - spamming warnings for such code will only get people to ignore warnings.

It is something that a static analyzer should warn about, though, since only having one implementation of Do indicates that there might be an opportunity to remove the abstraction in the code.


Of course it should not print a warning by default, but it would be nice if at least -Wextra said something.


There's also UBSAN though ;)

https://gcc.godbolt.org/z/3jxfeq

(of course a compile-time error would be better than a runtime error)


Do you have a specific instance where curl/libcurl relies on UB ?


Probably not, but that is not the point. Many UBs are impossible to find automatically, so we'll never know if there are UBs. And if there are, they could be creatively implemented by compilers to produce a much more damaging bug.


> we'll never know if there are UBs

Well... there is also pragmatism. I think it's safe to assume that in a library that's 24 years old and is probably linked against by 90%+ of all the applications that touch HTTP we would have encountered the problems one way or another.

It's fine to look at UB as being the big bad wolf if you're starting a new project, but for well established ones I find it unreasonable to project your fears on their developers.


I don't disagree that curl is relatively safe, but even though the library is 24 years old, there is newer code and more code is (presumably) being written. Also, new compilers may trigger old UB in unusual ways (although this is not very likely).


This is a mental model that produces security vulnerabilities.

Even after weathering real world usage for 20 years, you can absolutely be pwned by an adversary intelligently crafting hostile inputs or using fuzzers that are good at very quickly covering exotic corners of the input space. With surprisingly high probability. Remember that UB is largely a dynamic data dependent condition.


I'm not sure which part of my reply you disagree with: the fact that new projects should be written in memory safe languages if they are security sensitive, or that it is not always feasible to do so for existing ones.

For the latter case, the amount of friction you introduce for the devs, the amount of new bugs you write in because you're moving to a new paradigm are not worth (in my humble opinion) the memory safety this rewrite would bring, especially for something like libcurl which was fuzzed by 20+ years of monkeys with keyboards. Why you're feeling entitled to know better than the people that spent years of their lives doing this one thing is puzzling and leads me to perceive your answer as being knee-jerk dogma.


I was disagreeing with "safe to assume". And I brought arguments :)

Also, let's remember that Daniel changed his mind after writing this in 2017, and a blog post linked elsewhere in the comments here showed that most curl vulns were indeed from C.


OK, I see, then I'll retract my "safe to assume" assertion in favour of the way I expanded my position in the second answer: friction is too high for the benefits in projects that already have a chunky code base.


Out of interest, what did you intend for the code you linked to do?

You're calling an uninitialised function pointer. It could point to anything.

EDIT: Actually, I was wrong. The function pointer should be null.


It makes no difference to initialize the pointer to nullptr, actually. Also, neither GCC nor Clang produces a single warning with -Wall here.

Ideally, I think, the pointer should be null and one would get some sort of error to the effect 'you didn't initialize that pointer, dummy'.

It would be somewhat expected if it was initialized randomly and just jumped to some completely random part of the code. That wouldn't be great, but at least it would crash most of the time.

But what clang is doing here is engaging in bizarre evil genie deduction:

1. Undefined behavior isn't a thing 2. So Do() can't be UB 3. The only way for it not be UB, is for NeverCalled() to get called somehow, though it never is. 4. Perhaps something unseen does that from outside this compilation unit. 5. So we're going to make that assumption out of thin air, and not say a single peep about it.


> 1. Undefined behavior isn't a thing 2. So Do() can't be UB 3. The only way for it not be UB, is for NeverCalled() to get called somehow, though it never is. 4. Perhaps something unseen does that from outside this compilation unit. 5. So we're going to make that assumption out of thin air, and not say a single peep about it.

That's not how the chain works. This is the actual chain of reasoning:

1. What can Do point to?

1a. It's a global variable whose address is never leaked to anybody outside of the current translation unit. Therefore, all of the assignments I see to this variable constitute the complete set of possible values.

1b. The assignments are the global initialization to nullptr and the assignment to EraseAll.

1c. Therefore, the only possible values of Do are {nullptr, EraseAll}.

2. When I call Do, what can happen?

2a. If Do is EraseAll, I emit a call to EraseAll directly.

2b. If Do is nullptr, it is undefined behavior, and therefore I can replace this with any other behavior.

3. Okay, let's replace a call to a function pointer with a direct call to EraseAll, is that legal?

3a. If Do is EraseAll, yes, I'm calling the same function.

3b. If Do is nullptr, yes, it's undefined behavior, and calling a function is a valid refinement to undefined behavior.

4. It's safe to do so, therefore I'm going to call EraseAll.

It's not that undefined behavior doesn't exist, it's that any behavior is a valid refinement to undefined behavior. It would equally be a valid refinement to replace a null function call with a crash. Indeed, comment out NeverCalled, and the code is replaced with a SIGILL crash (it executes the ud2 instruction instead of calling an undefined pointer).
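
The chain above can be sketched in C as follows. This is my own reconstruction using the names from this thread, not the code behind the godbolt link; `call_as_written`, `call_as_refined` and the `erase_calls` counter are invented for the illustration.

```c
typedef int (*Function)(void);

static int erase_calls;  /* harmless stand-in for whatever EraseAll does */
static Function Do;      /* step 1: address never escapes this file */

static int EraseAll(void) { return ++erase_calls; }

/* The only assignment to Do, so its value set is {null, EraseAll}. */
void NeverCalled(void) { Do = EraseAll; }

/* Step 2: the call as the programmer wrote it. */
int call_as_written(void) { return Do(); }

/* Steps 3-4: the refinement. Identical when Do is EraseAll, and when Do
   is null the original had no defined behavior to preserve. */
int call_as_refined(void) { return EraseAll(); }
```

Whenever `Do` has been set, the two functions are indistinguishable, which is why the optimizer considers the substitution legal.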


I think for clang that's the purpose of -fsanitize=undefined and -fsanitize=nullability. If some undefined behavior isn't caught by the sanitizer, clang doesn't use it for optimization.


Calling a NULL pointer is still undefined behavior.


Sure, but "everyone knows" what happens when you call a null pointer, and it isn't system("rm -rf /"). This is the point GP was making.


> I think that's one area where modern compilers are going wrong

And I think people should stop approaching software development as if building the next shiny web CRUD application was serious software development.


Could (2017) be added to the title? Since then, Daniel has rethought this post and is now writing a Rust backend for curl: https://daniel.haxx.se/blog/2020/10/09/rust-in-curl-with-hyp...


this is a frontend, not a rewrite


You mean a backend?


Thank you for the link. In which Daniel Stenberg says: "A rewrite of curl to another language is not considered".


I'm reading it as saying it's not considered because it's a massive task and difficult to get it right, not because it wouldn't be worth doing.


In this case, that’s pretty much the definition of “not worth doing”.


C is great.

I never understood some people's desires to rewrite old stuff into new languages, just because it's a new language. X exists... and HN is then full of "X in python", "X in python3", "X in rust", "X in ruby", "X in go", "X in perl6" ... why? Especially with "new" languages, that lose popularity after a few years (remember all the ruby (-on rails,...) hype?) or languages that change with versions, and running old code in a new interpreter doesn't work (fscking python, and a bunch of python2.x code online that won't run on modern systems anymore, without a lot of work.... i have 20 year old perl books, and all the examples still work, even older C books, and everything works... but not python, "print foo" breaks in python3, not to mention integer division that you don't even notice it failed, until something goes horribly wrong).

curl is written in C and JustWorks(TM), it doesn't need to be written in anything else.

</rant>


The stability and ubiquity of C are probably its biggest upside characteristics and, as you say, there is a lot to like there. But you have to admit the language is showing its age and has some huge flaws.

To call manual memory management a foot-gun would be a ridiculous understatement. Once you've used a language with modules and namespaces, going back to C is so painful. The C standard library is super minimal and has lots of... let's call them... quirks. Basically lots of things that are easy and safe in any of the languages you mentioned are difficult and dangerous in C.


C is difficult to learn and the largest community to get involved in / learn from is the kernel dev community, which is notoriously hostile to newbs. I suspect if the kernel community was more friendly (less hostile), and mentorship was more common in the C community, you would see fewer people searching for an alternative to C.


Not my experience at all. C was super easy for me to learn. However, having said that, I did learn Z80 assembler programming before learning C. That might have made it easier.


C is pretty easy to learn, and the largest community is probably the embedded/maker one (with some c++.... micropython is slowly crawling in tho).


Do you have links to good resources for learning C and the communities that would be more welcoming?


micropython targets huge microcontrollers.

I guess for makers that's not a huge deal though.


C is very easy to learn! It's difficult to use in some cases (e.g. string handling), but it's the right tool for a lot of domains.


Do you have links for learning C that you think are good?


Well, the way I learned C was by taking this class many years ago: https://see.stanford.edu/Course/CS107 (not this specific instance of the class, but same instructor so I assume the curriculum is the same...).

There are surely more efficient ways...


What would you consider the more efficient ways to be?


I'm sure there are good books... what are your goals and where are you starting from?

C syntax is very simple and there aren't a huge number of concepts you need to learn (memory layout, pointers, the preprocessor, casting).

There are tons of projects written in easily-understandable C. cpython is one.


I already know higher level languages like Python, Java, and a handful of others, but never really dove into C much beside the Kernighan Ritchie book years ago. I was able to do some stuff with an arduino with it, but my goal was actually always kernel development, or maybe rockets at SpaceX.


If you want to get into kernel development (something I have no experience with except reading headers for various userspace interfaces), perhaps starting with an educational OS is good. I did take OS in undergrad and that was probably where I got most familiar with C.

This might be a useful place to start: https://web.stanford.edu/class/cs140/projects/pintos/pintos....


I find it quite easy to understand: C is not great at all.

Rarely is something so dated, created without the benefit of more modern insights, very good.

Writing in C not only introduces the possibility of many bugs, but is also a menial chore of boilerplate.


Best part of the article for me:

> C is not the primary reason for our past vulnerabilities

> There. The simple fact is that most of our past vulnerabilities happened because of logical mistakes in the code. Logical mistakes that aren’t really language bound and they would not be fixed simply by changing language.

Moving to Rust for new projects seems like a good idea, but I just had an experience moving from Clojure back to Java for a new project recently that helped me realize something: it's not about the language.

When I moved back to Java, it wasn't about the language anymore. It was about building something cool. There's always this subtext to any programming I did in Clojure that was like "you're doing this cool thing in a cool language and that makes it better", but in java it was "hey let's make something really cool".

That's what this guy is saying the most, I think. It's not about the language. It's about building something rock-solid, and rewrites aren't usually the answer when that's what it's about.



This fails to consider the likelihood of new logic bugs being introduced in any rewrite that didn't exist in the previous code. Unfortunately, I don't think there are good numbers available to estimate this. But judging by the huge array of topics listed with bugs there (TFTP, NTLM, IMAP, Krb5, ...) I'm gonna say it's not possible to make a statement either way on whether the balance would be overall positive or negative.

FWIW we have the same problem with a large 1996-started C codebase with a lot of logic complexity (routing protocols.) Our best guess - amplified by the fact that we don't do a lot of complex string/memory ops, but rather complicated algorithms - is that a rewrite would introduce worse bugs. But I'm not sure on that either.

(Would be a nice MSc or PhD thesis topic...)



I would also add, some people suggesting to rewrite curl are younger than curl


Obligatory response with an analysis of memory vulnerabilities found in Curl: https://timmmm.github.io/curl-vulnerabilities-rust/

Discussion: https://news.ycombinator.com/item?id=25805576


Money quote from that article:

"Results

There are 95 bugs. By my count Rust would have prevented 53 of these."

(Unfortunate that the parent comment is so low in the comments, as it's a convincing proven refutation of the main security argument of the "Curl is C" posting)


It's pretty clear at this point that HN collectively tends toward the "C hating" camp. It's a sad state of affairs.


[flagged]


> As much as I respect Daniel for his work on curl, this claim is simply bullshit.

He was talking about "our" in the context of curl, so clearly talking about curl itself. Claiming that the claim is bullshit based on a blog post by Microsoft about a completely unrelated sample set is... not great methodology and certainly not worthy of using the "BS" stamp.


He is talking about curl, not all software. But the claim is still BS, as debunked 4 years ago by simias (https://news.ycombinator.com/item?id=13966967): 7 out of 11 are pretty clearly caused by C being an unsafe language.

When I look at https://curl.se/docs/security.html today, it does not seem to be 70%, but there are still quite a few new entries called "double-free", "buffer overflow", ...


I can't speak for the buffer overflows, since those can happen in various ways, but I'd sort double frees under logic errors, which he admits curl has.


I think he was talking in the context of curl.


I do have a problem with people using "number of CVEs" as a metric for anything...


The plain view from Stenberg is a relief. C - and also C++ - are reliable tools for system and application programming. I think Rust is interesting, and I see how C++ develops even further, covering a bewilderingly broad range of low- and high-level programming features.

Jumping on the next buzzword language or promising tool? No thanks. Languages like Java, C#, JavaScript or even Go are not themselves secure, just tools with different features and their own sorts of issues. You will face stack overflows, off-by-ones, missed checks, a lot of bad exception handling and a mad garbage collector. This doesn't even cover most of our own logic programming mistakes - the most important category. Often people don't know how their tool actually works. What I've learned is that good results are only achieved with much steady work. And well-designed APIs prevent problems.

Regarding tools, one big improvement is the address-sanitizer in GCC and LLVM. I'm so thankful for this :)



