Wouldn't it be a lot easier to just use a C compiler that produces memory-safe code?
I'm sure someone else has already thought of this, but in case not... All you need to do is represent a pointer by three addresses - the actual pointer, a low bound, and a high bound. Then *p = 0 compiles to code that checks that the pointer is in bounds before storing zero there.
I believe such a compiler would conform to the C standard. Of course, programs that assume that a pointer is 64-bits in size and such won't work. But well-written "application level" programs (eg, a text editor) that have no need for such assumptions should work fine. There would be a performance degradation, of course, but it should be tolerable.
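A minimal sketch of what I mean (the names are mine, not from any existing compiler): every pointer becomes a triple, and every store gets a check.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical lowering of "char *p" into a pointer plus its bounds. */
    struct fat_ptr {
        char *ptr;   /* the actual pointer */
        char *low;   /* lowest address the pointer may touch */
        char *high;  /* one past the highest legal address */
    };

    /* What "*p = 0" might compile to for a one-byte store. */
    static void checked_store_byte(struct fat_ptr p, char value) {
        if (p.ptr < p.low || p.ptr >= p.high) {
            fprintf(stderr, "bounds violation\n");
            abort();
        }
        *p.ptr = value;
    }

    int main(void) {
        char buf[8];
        struct fat_ptr p = { buf, buf, buf + sizeof buf };
        checked_store_byte(p, 0);   /* in bounds: stores normally */
        p.ptr = buf + 8;
        checked_store_byte(p, 0);   /* out of bounds: aborts */
        return 0;
    }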
That's essentially what ASAN is, with some black magic for performance and scope reasons. The problem is that ensuring that your code will detect or catch memory unsafety isn't enough, because the language itself isn't designed to incorporate the implications of that. If you're writing a system messenger for example, you can't just crash unless you want to turn all memory unsafety into a zero-click denial of service.
Programs that would crash when using the memory-safe compiler aren't standards conforming. If you're worried that programs crashing due to bugs can be used for a denial-of-service attack... Well, yes, that is a thing.
Low-level OS and device-handling code may need to do something that won't be seen as memory safe, but I expect that for such cases you'd need to do something similarly unsafe (eg, call an assembly-language routine) in any "memory safe" language.
I'm not familiar with how ASAN is implemented, but since it doesn't change the number of bytes in a pointer variable, I expect that it either doesn't catch all out-of-bounds accesses or has a much higher (worst case) performance impact than what I outlined.
I brought up ASAN because it's a real thing that already exists and gets run regularly. The broad details of how ASAN is implemented are best summarized in the original paper [1]; a rough sketch of its shadow-memory scheme is below. The practical upshot is that there are essentially no false negatives in anything remotely approaching real-world use. A malicious attacker could get around it, but any "better algorithm" would still run into the underlying issue that C doesn't have a way to actually handle detected unsafety, and no amount of compiler magic will resolve that.
You have to change the code. Whether that's by using another language or through annotations like Checked C is an interesting (but separate) discussion in its own right.
As for the point that programs with memory unsafety aren't standards conforming: correct, but irrelevant. Every nontrivial C program ever written is nonconformant. It's not a matter of "just write better code" at this point.
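For concreteness, the core mechanism from the paper is shadow memory rather than per-pointer bounds: every 8 bytes of application memory map to one shadow byte recording how much of that word is addressable. Roughly like the following - the offset is platform-specific, and this is a sketch of the mapping and check, not ASAN's actual code (the real runtime maps the shadow region itself at startup):

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only; the real offset varies by platform and build. */
    #define SHADOW_OFFSET 0x7fff8000ULL

    /* Each shadow byte covers 8 application bytes:
       0     -> all 8 bytes addressable
       1..7  -> only the first k bytes addressable
       < 0   -> nothing addressable (redzone, freed memory, ...) */
    static inline int8_t *shadow_for(const void *addr) {
        return (int8_t *)(((uintptr_t)addr >> 3) + SHADOW_OFFSET);
    }

    /* The check instrumented before an n-byte access (assumes the shadow
       region has been mapped by the runtime). */
    static inline int is_poisoned(const void *addr, size_t n) {
        int8_t k = *shadow_for(addr);
        if (k == 0)
            return 0;                                   /* whole word OK */
        int last = (int)((uintptr_t)addr & 7) + (int)n - 1;
        return last >= k;                               /* also catches k < 0 */
    }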
From the linked ASAN paper: "...at the relatively low cost of 73% slowdown and 3.4x increased memory usage..."
That's too big a performance hit for production use - much bigger than you would get with the approach I outlined.
I don't agree that every nontrivial C program is nonconformant, at least if you're talking about nonconformance due to invalid memory references. Referencing invalid memory locations is not the sort of thing that good programmers tolerate. (Of course, such references may occur when there are bugs - that's the reason for the run-time check - but not when a well-written program is operating as intended.)
I usually find it safe to assume that compiler folks are conscious of optimization opportunities and make pretty intelligent tradeoffs on that spectrum. This is one such case. There's a long history of bounds checking compilers. The first that I know of is bcc in the 80s, which had a 10x slowdown! Austin et al. [1] came along a few years later way back in '94 and improved things to a mere 2-5x slowdown. That's pretty much where things stood for the next two decades because pointer accesses are everywhere in C and register pressure is nothing to sneeze at. Moreover, changing pointer sizes breaks your ability to link external things that weren't compiled with the same flags, like the system libc. ABI compatibility is make-or-break for a C compiler. You can get around that by breaking up the metadata from the actual pointer (e.g. softbound), but the performance cost is still ~3-4x [2].
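Very roughly, the disjoint-metadata idea looks like this: the pointer stays a plain machine word (so the ABI survives), and the bounds live in a table on the side, keyed by where the pointer is stored. This is a concept sketch only, not SoftBound's actual implementation - real versions use a compiler pass plus a trie or hash table rather than a linear scan:

    #include <stddef.h>

    struct bounds { const char *base; const char *limit; };

    /* Toy side table: maps "address where a pointer is stored" to its bounds. */
    #define TABLE_SIZE 1024
    static struct { const void *ptr_location; struct bounds b; } table[TABLE_SIZE];
    static size_t table_used;

    /* Instrumentation records bounds whenever a pointer is stored... */
    static void metadata_set(const void *ptr_location, struct bounds b) {
        if (table_used < TABLE_SIZE) {
            table[table_used].ptr_location = ptr_location;
            table[table_used].b = b;
            table_used++;
        }
    }

    /* ...and fetches them whenever a pointer is loaded. */
    static struct bounds metadata_get(const void *ptr_location) {
        for (size_t i = 0; i < table_used; i++)
            if (table[i].ptr_location == ptr_location)
                return table[i].b;
        return (struct bounds){ 0, 0 };   /* unknown: empty bounds */
    }

    /* The check inserted before an n-byte dereference of p. */
    static int in_bounds(const void *p, size_t n, struct bounds b) {
        return (const char *)p >= b.base && (const char *)p + n <= b.limit;
    }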
ASAN was notable because
1) It was very efficient. That initial 73% was utterly fantastic at the time.
2) It was production-usable (i.e. it worked on big codebases).
3) With hardware support, the performance hit is often under 10%. HWAsan on modern platforms is low-cost enough to run it all the time.
And no, I'm saying that pretty much every nontrivial C program has UB, not that they're specifically memory unsafe.
With all due respect, why do you assume that your “thought about it 3 mins straight” idea would perform better than approaches that people who have worked on these topics their whole lives have spent decades refining?
Don’t get me wrong, I often fall into this as well, but I think programmers really should get a bit of an ego check sometimes, because (not you specifically) it often affects discussions in other fields too, fields we don’t know jackshit about.
I do this pretty often, and it's often a very valuable exercise, even though I'm almost always wrong. Interrogating the apparent contradiction between my beliefs and existing reality is a highly fruitful learning experience. There are several serious failure modes, though:
1. I can get my ego so wrapped up in my own idea that, even once I have the necessary information to see that it's wrong, I still don't abandon it. In fact, this always happens to some extent; when I change my mind it's always embarrassing in retrospect how blind I was. But the phenomenon can be more or less extreme.
2. In a context where posturing to appear smart and competent is demanded, such as marketing, advocating totally stupid ideas puts me at a disadvantage, even if I recant later. Maybe especially then, because it reminds people who might have forgotten.
3. People who know even less than I do about a subject may be misled by my wrong ideas.
4. This approach is most productive when people who know more than I do about a subject are kind enough to take the time to explain why my ideas are wrong. This happens surprisingly often, both because people are often kind and because the people who know the most about a subject are generally very interested in it, which means they like to talk about it. Still, attention of experts is a valuable, limited resource.
5. People who know more than I do about a subject can get angry and defensive when I question something they said about it, particularly if they're mediocre and insecure. The really top people never act this way, in my experience; if they pay attention at all, either they can explain immediately why I'm wrong, as AlotOfReading did here (though I may not understand!) or they go "Hmm, now that's interesting," before figuring out why I'm wrong. (Or, occasionally, not.) But people with a good working understanding of a field may know I'm wrong without knowing why. And there are always enormously more of those in any field than really top people.
So, I try to do as much of the process as possible in my own notebook rather than on permanently archived public message boards. The worst is when groups #3 and #5 start arguing with each other, producing lots of heat but no light.
My theory about why the angry and defensive people in group #5 are never the top people is that they stopped learning when they reached a minimal level of competence, because their ego became so attached to their image of competence that they stopped being able to recognize when they were wrong about things, so they are limited by whatever mistaken beliefs they still had when they reached that level. But maybe I'm just projecting from my own past experience :)
Yes, I know. But this thread is about detecting invalid memory references in production, to prevent security exploits. ASAN seems too slow to solve that problem.
Based on recent experience, you'd really want your media decoders compiled with a safe compiler, and if it crashes, don't show the media and move on. Performance is an issue, but given the choice between RCE and DoS, DoS is preferable.
It would be nice if everything were memory safe, but just making media decoding memory safe would already help a lot.
I absolutely agree that it's a step in the right direction. My point is that we can't get all the way to where we want to be simply by incremental improvements in compilers. At some point we have to change the code itself because it's impossible to fully retrofit safety onto C.
There are similar approaches, e.g. Checked C, which work surprisingly well. However, I'm not sure that this approach would be expressive enough to handle the edge cases of C craziness and pointer arithmetic. There's more to memory unsafety than writing to unallocated memory; even forcing a write to slightly wrong memory (e.g. setting `is_admin = true`) can be catastrophic.
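To make that concrete, here's the classic intra-object case. Assuming bounds are tracked per allocation, this write stays inside the struct, so the check passes, yet it can still flip the flag:

    #include <string.h>

    struct user {
        char name[8];
        int  is_admin;
    };

    void set_name(struct user *u, const char *src, size_t len) {
        /* For a small overrun (len slightly > 8) this stays inside the
           struct, so a per-allocation bounds check never fires - but it
           can overwrite is_admin all the same. */
        memcpy(u->name, src, len);
    }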
I think it handles all standards-conforming uses of pointer arithmetic. Even systems-level stuff like coercing an address used for memory-mapped IO may work. For example,
    struct dev { int a, b; } *p;
    p = (struct dev *) 0x12345678;
should be able to set up p with bounds that allow access only to the a and b fields - eg, producing an error with
    int *q = (int *) p;
    q[2] = 0;   /* index 2 is past the two ints that p's bounds cover */
Of course, it doesn't fix logic errors, such as setting a flag to true that shouldn't be set to true.
Yes, such approaches can be compliant. There are even a few C interpreters - very popular back in the day for debugging C programs, when you didn't have full OS debugging support for breakpoints and the like. Such an approach would be quite suitable for encapsulating untrusted code. There is definitely some major overhead, but I don't see why you couldn't use a JIT.
Good point. There's also the problem of pointers to no-longer-existing local variables. (Though I think it's rare for people to take addresses of local variables in a context where the compiler can't determine that they won't be referenced after they no longer exist.)
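The classic shape of that problem, for illustration: the bounds would still describe the old stack slot, but the object is gone by the time anyone dereferences the result.

    /* Returning the address of a local: any later dereference is
       undefined behavior, and a purely spatial base/limit check on the
       old stack slot would not flag it. */
    int *f(void) {
        int x = 42;
        return &x;
    }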