This article makes some valid points but is overall rather misleading, I think. Almost all of the reasons given why C is "not a low-level language" also apply to x86/x64 assembly. Register renaming, cache hierarchies, out-of-order and speculative execution etc. are not visible at the assembly / machine code level either, on Intel or on other mainstream CPU architectures like ARM or PowerPC. If C is not a low-level language, then a low-level language does not exist for modern CPUs, and since all other languages ultimately compile down to the same instruction sets, they all suffer from some of the same limitations.
It's really backwards compatibility of instruction sets / architectures that imposes most of these limitations. Processors that get around them to some degree, like GPUs, do so by abandoning some amount of backwards compatibility and/or general-purpose functionality, and that is in part why they haven't displaced general-purpose CPUs for general-purpose use.
I also had the initial impression that the article is misleading, but later on the author made the point that the C compiler is doing significant work to reorder / parallelize / optimize the code. I agree that x86/x64 is not a low-level language either, but even if it was, with the description the author provided, I'd agree with his point of C not being low-level.
Regarding cutting off backwards compatibility to improve the design, Intel's Itanium (affectionately called "Itanic") was a very progressive approach to shift the optimization work from the CPU (and the compiler) to just the compiler. I'm not sure what the reasons for its failing were, though.
I'm not an expert, but I don't think the optimization phase should really be considered here: the same kind of pattern matching used by (e.g.) LLVM to find optimizable sequences of statements could also be used by any assembler. NASM for instance offers some level of optimization, so I think optimization should only be considered when it's part of the language specification itself, like in Scheme.
IMO C is close to low-level because it's relatively easy to imagine the resulting unoptimized assembly given some piece of code (which is why some people joke about C being a macro assembler).
Maybe this old debate should get a slight update... and this could be the starting point: Is modern x86 assembly still "low-level"? :)
> the same kind of pattern matching used by (e.g.) LLVM to find optimizable sequence of statements also could be used by any assembler.
Some of the simpler optimizations, sure. But modern backends do many incredibly sophisticated optimizations that are way beyond any kind of simple pattern-matching-and-substitution model.
Even fundamental "optimizations" like register allocation use quite sophisticated algorithms. Optimal register allocation is NP-complete, so compilers use heuristics on top of graph coloring algorithms to do their best.
Most other optimizations rely on type analysis, data flow analysis, liveness analysis, etc.
The k-graph coloring problem for register allocation is NP-complete in general, but on SSA form it is actually linear. The tougher problem isn't the coloring but rather where to optimally place spills and fills around loops and calls.
Yeah, finding the number of "registers" needed isn't really NP-complete the way that finding the number of colors for an arbitrary graph is. It's just the maximum number of simultaneously live values, which, while not exactly trivial, is not that hard to figure out.
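To make the graph-coloring framing above concrete, here is a toy sketch (my own illustration, not how GCC or LLVM actually allocate registers): nodes are virtual registers, an edge means two values are live at the same time, and a greedy pass hands out a small number of physical registers, spilling when it runs out.

```c
#include <stdio.h>

/* Toy greedy coloring of an interference graph.
 * Nodes are virtual registers; an edge means the two values are live
 * simultaneously and therefore cannot share a physical register.
 * Real allocators (Chaitin-Briggs, linear scan, SSA-based) add
 * simplification, spill-cost heuristics and coalescing on top of this. */

#define N_VREGS 6
#define N_PHYS  3   /* pretend the machine has 3 registers */

static int interferes[N_VREGS][N_VREGS] = {
    /* v0 v1 v2 v3 v4 v5 */
    {  0, 1, 1, 0, 0, 0 },  /* v0 */
    {  1, 0, 1, 1, 0, 0 },  /* v1 */
    {  1, 1, 0, 1, 0, 0 },  /* v2 */
    {  0, 1, 1, 0, 1, 0 },  /* v3 */
    {  0, 0, 0, 1, 0, 1 },  /* v4 */
    {  0, 0, 0, 0, 1, 0 },  /* v5 */
};

int main(void) {
    int color[N_VREGS];
    for (int v = 0; v < N_VREGS; v++) {
        int used[N_PHYS] = {0};
        /* mark colors already taken by interfering, already-colored neighbours */
        for (int u = 0; u < v; u++)
            if (interferes[v][u] && color[u] >= 0)
                used[color[u]] = 1;
        color[v] = -1;                      /* -1 means "spill to memory" */
        for (int c = 0; c < N_PHYS; c++)
            if (!used[c]) { color[v] = c; break; }
        if (color[v] >= 0)
            printf("v%d -> r%d\n", v, color[v]);
        else
            printf("v%d -> spilled\n", v);
    }
    return 0;
}
```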
It would be interesting to revisit a world where languages/compilers were built explicitly with common memory access semantics/out of order op/etc in mind.
One of the things that really excites me about Rust is that its single-mutable-reference enforcement means you could use `restrict` 100% of the time if you wanted, which is a non-trivial performance boost. I think it's not enabled today, but from previous discussions it sounds like that's just a matter of plumbing the right things through to LLVM.
Every time I've seen that rolled out in a C/C++ codebase someone invariably forgets about pointer aliasing and you spend a week tracking down some non-deterministic behavior.
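For anyone who hasn't used it, `restrict` is a promise to the compiler that a pointer is the only way the pointed-to memory is accessed in that scope. A minimal sketch of where it helps, and of the kind of aliasing call that silently breaks the promise (function names are made up for illustration):

```c
#include <stddef.h>

/* With restrict the compiler may assume dst and src never overlap, so it
 * is free to vectorize and reorder the loads and stores. Without it, it
 * must assume dst[i] might alias src[i+1] and emit more conservative code. */
void scale(float *restrict dst, const float *restrict src, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

void example(float *buf, size_t n) {
    scale(buf, buf + 1, n - 1, 2.0f);  /* undefined behaviour: dst and src
                                          overlap, violating restrict --
                                          exactly the kind of bug that takes
                                          a week to track down */
}
```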
My point was that since optimization is also possible in assembly, and since assembly is considered low-level, then optimization per se shouldn't be used as something that characterizes high-level languages exclusively. But it is true that some abstractions used by high-level languages enable quite complex optimization techniques, so there is a clear correlation between the level of a language and the ability of its compiler to analyze and optimize programs.
C cannot read the overflow bit after an ADD, because of its abstractions ... so I would say modern ASM is still lower in some respects because it has fewer constraints and, more importantly, it is less expressive, which is the whole idea of this hierarchy.
The compiler should offer a macro for that. Then the question is whether to take the specification or the implementation at which point it's an absurd question to begin with. You could compare -O0 binaries, bypassing the optimization question, too.
High/low is not fine-grained enough. IIRC, Prolog for example would be dubbed a fifth-generation language, after assembler, goto-hell macro compilers, structured functional programming, and DSLs. Now Coq and the like seem to be of yet a higher order (pun intended, sorry).
> C cannot read the overflow bit after an ADD ... The compiler should offer a macro for that.
If you use GCC then __builtin_add_overflow() is what you are looking for:
"The compiler will attempt to use hardware instructions to implement these built-in functions where possible, like conditional jump on overflow after addition, conditional jump on carry etc."
There was an article on Hacker News recently that covered some of the reasons for Itanium's failure to realize its theoretical benefits. I'm not finding it now, but IIRC, the argument made was that predicting likely-parallelizable code is actually a lot harder to do at compile time, and that, like so many ultra-optimized systems, the real world works much differently and a messier, more random approach ultimately yields far better performance.
Itanium suffered performance wise initially because they had trouble with compilers, but that's not the whole story. You also have to consider that AMD launched AMD64, which was backwards compatible, at about the same time. Later on the Itanium compilers got better, but on release it became a choice of "sluggish, incompatible and expensive Itanium with potential to perform well in the future" versus "backwards compatible, currently faster and cheaper x86_64." It didn't gain any real momentum to start because of this, which ultimately doomed it even when a lot of the issues were resolved later on.
> which ultimately doomed it even when a lot of the issues were resolved later on.
Was there ever a point in the Itanium's history where there were Itaniums that ran mainstream software with better performance than equivalently priced x64 processors?
As far as I know (which isn't very far, admittedly) they only really managed to reach parity with some performance gains over x86 in a few niches, but it's also a bit chicken-and-egg. It never had enough attention to really get the optimization and porting efforts it would have seen if it had been successful.
> the argument made was that predicting likely-parallelizable code is actually a lot harder to do at compile time, and that, like so many ultra-optimized systems, the real world works much differently and a messier, more random approach ultimately yields far better performance.
I am not an expert on computer history, but my feelings on the matter are as follows:
It's hard for certain domains, like handling millions of web requests. For most computational stuff where you're just blowing through regularly-shaped numerical computation (like for example ML, or signal processing), it's not that hard, but arguably the compilers of the time were still not quite up to it (there's a lot of neat stuff that's getting worked into the LLVM pluggable architecture these days). Of course ML wasn't really a thing back then, and Intel didn't seem interested in putting Itaniums into cell towers.
One way to think of the OoO and branch prediction processing that current x86 (and ARM) chips do is that they are doing on-the-fly re-JITting of the code. There is a lot of silicon dedicated to doing the right thing and avoiding highly costly branch mispredicts, etc. During Itanium's heyday, there was a premium on performance over efficiency. Now everyone wants power efficiency (since that is now often a cost bottleneck). Besides which, for other reasons Itanium wasn't as power efficient as (ideally) the chosen architecture could have been.
>the argument made was that predicting likely-parallelizable code is actually a lot harder to do at compile time
So don't do it at compile time? That's really a very weak argument against the Itanium ISA, and honestly more of an argument against the AOT compilation model. Take a runtime with a great JIT, like the JVM or V8, and teach it to emit instructions for the Itanium ISA. (As an added advantage these runtimes are extremely portable and can be run, with fewer optimizations, on other ISAs without issue.)
The problem, as always, is that nobody with money to spend ever wants to part with their existing software. (Likely written in C.) In 2001 Clang/LLVM didn't even exist, and I'm not familiar with any C compilers of the era that had so much as a rudimentary JIT.
There's not that much overlap between the kind of optimizations that JITs do and the optimizations that modern CPUs do. The promise of JITs outperforming AoT compiled code has never really materialized. The performance advantages of OoO execution, speculative execution, etc. are very real and all modern high performance CPUs do them. Attempts to shift some of that work onto the compiler like Itanium and Cell have largely been failures.
Arguably the "sufficiently advanced compiler" (cue joke) has arrived (sadly, post Itanium and Cell) in the form of a popularized LLVM[0], so it's improper to claim failure based on two aged data points.
The flaws of OoO and SpecEx are evident in the overhead required to secure a system (Spectre, Meltdown) in a nondeterministic computational environment, and there is certainly a power cost to effectively JITting your code on every clock cycle.
As the definition of performance is changing due to the topping out of Moore's law and parallelism shifting from Amdahl to Gustafson, I think there is a real opportunity for non-OoO, non-SpecEx designs in the future.
OoO and speculative execution are largely improving performance based on dynamic context that in most real world cases is not available at compile time. They are able to do so much more efficiently than software JITting can due to being implemented in hardware. There is still no sufficiently advanced compiler to make getting rid of them a good strategy for many workloads.
Most of what OoO and speculative execution are doing for performance on modern CPUs is hiding L2 and L3 cache latency. On a modern system running common workloads it's pretty unpredictable when you're going to miss L1 as it's dependent on complex dynamic factors. Cell tried replacing automatically managed caches with explicitly managed on chip memory and that proved very difficult to work with for many problems. There's been little investment in technologies to better use software managed caches since then because no other significant CPU design has tried it. It's not a problem LLVM attempts to address to my knowledge.
Other perf problems are fundamental to the way we structure code. C++ performance advantages come in part from very aggressive inlining but OoO is important when inlining is not practical which is still a lot of the time.
My point is that the dominant software programming paradigm is migrating away from highly dynamic to highly regular. A good example is Machine learning, where for any given pipeline, your matrix sizes are generally going to stay the same. A good compiler can distribute the computation quite well without much trouble, and this code will almost certainly not need SpecEx/OOO (which is why we put them on GPUs and TPUs). Or imagine a billion tiny cores each running a fairly regularly-shaped lambda.
Sure some things like nginx gateways and basic REST routers will have to handle highly dynamic demands with shared tenancy, but the trends seem to me to be away from that. As you say, this is all dependent on the structure of code; and I think our code is moving towards one where the perf advantages won't depend on OoO and specex for many more cases than now.
This might be true for some domains but it's far from true for the performance sensitive domains I'm familiar with - games / VR / realtime rendering. The trend is if anything the opposite there as expectations around scene and simulation complexity are ever increasing.
> The promise of JITs outperforming AoT compiled code has never really materialized.
Well JITs do actually outperform AoT compiled code today. Java is faster than C in many workloads. Especially large scale server workloads with huge heaps.
Java can allocate/deallocate memory faster than C, and it can compact the heap in the process which improves locality.
I haven't seen this convincingly demonstrated. Can you point to good examples? The few times I've seen concrete claims they are usually comparing Java code with C code that no performance oriented C programmer would actually write. In certain cases Java can allocate memory faster than generic malloc but in practice in many of those cases a competent C or C++ programmer would be using the stack or a custom bump allocator.
In practice it's quite hard to do really meaningful real world performance comparisons because real world code tends to be quite complex and expensive to port to another language in a way that is idiomatic. My general observation is that where performance really matters to the bottom line or where there is a real culture of high performance code C and C++ still dominate however. This is certainly true in the fields I have most experience in and where there are many very performance oriented programmers: games, graphics and VR.
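To spell out the "custom bump allocator" point from the previous paragraph: in hot C/C++ paths the per-allocation cost can be reduced to little more than a pointer bump, which is the baseline any "Java allocates faster than C" claim has to beat. A minimal sketch (type and function names are mine, no per-object freeing, everything is released at once):

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned char *base;
    size_t         used;
    size_t         cap;
} arena_t;

static arena_t arena_create(size_t cap) {
    arena_t a = { malloc(cap), 0, cap };
    return a;
}

/* Allocation is a bounds check plus a pointer bump; the whole arena is
 * released at once with free(), which is why this pattern is so common
 * in frame-based workloads like games. */
static void *arena_alloc(arena_t *a, size_t size) {
    size_t align   = _Alignof(max_align_t);
    size_t aligned = (size + align - 1) & ~(align - 1);
    if (a->base == NULL || a->used + aligned > a->cap)
        return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

int main(void) {
    arena_t frame = arena_create(1 << 20);       /* 1 MiB per-frame arena */
    double *v = arena_alloc(&frame, 64 * sizeof *v);
    (void)v;
    free(frame.base);                            /* release everything at once */
    return 0;
}
```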
What are some languages that have a better memory model and work faster with a JIT rather than an AOT compiler?
For that matter, does Java code execute faster or slower with an AOT compiler than with HotSpot? I did a quick Google search but couldn't find an answer, except for JEP 295 saying that AOT is sometimes slower and sometimes faster :(
the jvm lacks structs and more specifically arrays of structs as a way to allocate memory. this causes extreme bloat due to object overhead as well as a ton of indirections when using large collections. the indirections destroy any semblance of locality you may have thought you had which is the absolute worst thing you can do from a performance perspective on modern processors. what people end up doing instead is making parallel arrays of primitives where there is an array for each field. this is also not ideal for locality but it's better than the alternative since there isn't a data dependency between the loads (they can all be done in parallel).
i am not that familiar with the C# runtime and i know C# has user definable value types, but i'm not sure what their limitations are.
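To make the layout issue from the previous paragraph concrete, here's roughly what the two options look like in C (a toy sketch, names made up): the array-of-structs form the JVM can't express, and the parallel-arrays-of-primitives workaround described above.

```c
#include <stddef.h>

/* Array of structs: one contiguous allocation, each particle's fields are
 * adjacent in memory, no per-element object header or indirection. */
struct particle { float x, y, z, mass; };
static struct particle particles_aos[1024];

/* Parallel arrays of primitives: the workaround when a language only gives
 * you arrays of object references. Each field is contiguous across elements,
 * but a logical record is now spread over four separate arrays. */
static float px[1024], py[1024], pz[1024], pmass[1024];

float total_mass_aos(size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += particles_aos[i].mass;   /* strides over x, y, z as well */
    return sum;
}

float total_mass_parallel(size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += pmass[i];                /* dense, cache-friendly stream */
    return sum;
}
```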
In a nutshell: Too much pointer chasing. C# actually does much better than Java here, with its features for working with user defined value types, but it could still improve by a lot.
In addition to what others have mentioned, there's also the inability to map structures on to an area of memory. The result is that you end up using streams and other methods to accomplish the same thing, and they result in a lot of function/method overhead for reading/writing simple values to/from memory.
Garbage Collection is a very consequential design decision. To free that last unused object is going to take O(total writable address space) memory bandwidth.
> Seems to me like if, in practice, JIT provided better performance then by now people would be rewriting their C/C++ code in Java and C# for speed.
It's a little bit faster, not faster by enough to matter. If you're going to rewrite C/C++ code for speed you'd go to Fortran or assembler, and even then you're unlikely to get enough of a speedup to be worth a rewrite.
New projects do use Java or C# rather than C/C++ though.
X is not faster than well written Y, for all X and Y; that's not a particularly useful comparison though. I've seen a project pick Java over C/C++ because, based on their previous experience, the memory leaks typical of C/C++ codebases were a worse performance problem than any Java overhead.
Well written Java is sure to be slower than well written C/C++.
Happy? ;)
But yes, the point you make is valid: it is much harder to write C/C++ well, because of the burden of memory management.
So if you lack the time or skilled people, it might make sense to choose Java for performance reasons.
Especially after the Midori and Singularity projects, and how they affected the design of C# 7.x low-level features and the UWP AOT compiler (shared with Visual C++).
Also Unity is porting engine code from C++ to C# thanks to their new native code compiler for their C# subset, HPC#.
The discussion was about JITs vs AoT compiled native code. Unity is not using a JIT runtime for their new Burst compiler but using LLVM to do AoT native compilation and getting rid of garbage collection. If you get rid of JIT and garbage collection then yes, a subset of C# can be competitive in performance with C++ for some uses.
JIT vs AOT is an implementation detail, nothing to do with a programming language as such, unless we are speaking about dynamic languages, traditionally very hard to AOT.
In fact C# always supported AOT compilation, just that Microsoft never bothered to actually optimize the generated code, as NGEN usage scenario is fast startup with dynamic linking for desktop applications.
While on Midori, Singularity, Windows 8.x Store, and now .NET Native, C# is always AOT compiled to native code, using static linking in some cases.
As for GC, C# always offered a few ways to avoid allocations, it is a matter for developers to actually learn to use the tools at their disposal.
With C# 7.x language features and the new Span related classes, it is even easier to avoid triggering the GC in high performance paths.
> I agree with Ousterhout's critics who say that the split into scripting languages and systems languages is arbitrary, Objective-C for example combines that approach into a single language, though one that is very much a hybrid itself. The "Objective" part is very similar to a scripting language, despite the fact that it is compiled ahead of time, in both performance and ease/speed of development, the C part does the heavy lifting of a systems language. Alas, Apple has worked continuously and fairly successfully at destroying both of these aspects and turning the language into a bad caricature of Java. However, although the split is arbitrary, the competing and diverging requirements are real, see Erlang's split into a functional language in the small and an object-oriented language in the large.
I still strongly think Apple is taking the wrong approach with Swift by not building on the ObjC hybrid model more.
Your article is correct that Java/C# performance is unpredictable. But, per the OP, C/C++ performance is also unpredictable, because C/C++ doesn't reflect what a modern processor actually does; there are cases where e.g. removing a field from a datastructure makes your performance multiple orders of magnitude worse because some cache lines now alias.
It is not that Java or C# are able to beat C and C++ on micro-benchmarks, rather they are fast enough for most tasks that need to be implemented, while providing more productivity.
The few cases where raw performance down to the byte level and the millisecond matter are pretty niche.
I've seen a project pick Java over C/C++ because of the memory leaks they saw in the latter in practice. You can call that a correctness issue rather than a performance issue if you like, but the practical impact was the same as a performance problem.
One of the big problems with predicting what can be MIMDed is that almost all the languages we use except for Haskell allow for dependency on who knows what. Without a very strict refusal of state it's hard as fuck to figure out what is independent of what at compile time.
Not that it can't be done, so much as that getting programmers to accept it can't be done.
Part of it was the sort of giant blind spot that "C is a low-level language" creates, which made even Intel forget how many man-years of optimization were actually in C compilers, and the attendant hubris of "oh, we can whip up something that'll beat it between now and when we start shipping the chips".
I think people often overestimate how smart the compiler is. DJB had a slide deck about this: http://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf Compilers are really useful... but, if it really has to be fast, you still need to have a human in the loop.
I was also under the impression that there hasn't been much improvement in compiling C/C++ in a long time. It would be interesting to compare the performance of gcc from 15 years ago versus gcc today, on a real world piece of code. I suspect you wouldn't see much difference (aside from the changes in C dialect over time), and some added features in the new version. Has anyone run this experiment?
I think you can make a good case that the failure of the Itanium (and other similar attempts to un-hide some of this stuff like the IBM/Sony Cell used in the PlayStation 3) was precisely because they tried to shift optimization work from the CPU to the compiler / programmer.
Funny, the compiler people I've worked with complain that Itanium tried to do too much in hardware, like the hw support for loop unrolling, which made superpipelining optimizations in existing compilers much more complicated.
Loop unrolling is one of those optimizations that actually highlights the need for dynamic CPU optimizations like out of order execution and speculative execution. It's very difficult to statically make a good decision about the optimal amount of loop unrolling to do, especially if you want to generate code that will continue to perform well on future CPUs using the same ISA. Even when targeting a specific CPU model it's difficult however since you don't know statically how many iterations of the loop you're expecting, what's currently in cache, what other code might be running immediately before or after the loop, what's running at the same time on other threads, etc.
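For readers who haven't seen it spelled out, this is what unrolling by a factor of 4 looks like at the source level (a toy sketch); the hard part described above is that the compiler has to pick the factor without knowing trip counts, cache state, or the exact CPU the code will run on.

```c
#include <stddef.h>

/* Manually unrolled by 4: more independent work per iteration and fewer
 * branches, at the cost of code size and a remainder loop. A compiler has
 * to guess whether 4, 8, or 16 is right for hardware it can't see. */
void scale4(float *a, float k, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
    for (; i < n; i++)               /* remainder loop */
        a[i] *= k;
}
```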
The compiler wasn't smart enough. There is information available only at runtime that can be used to get out of order execution and superscalar execution, and the compiler doesn't see any of it. More effort on standard architectures paid off.
Itanic failed because Intel/HP initially still used a front side bus memory architecture, which could not support the bandwidth necessary for peak performance on anything but matrix multiplication and other computations where most of the work is done in-cache. Then Opteron came around, with its faster memory, and Intel was suddenly thrown back into reality.
Itanic was also in-order (at least as far as dispatch), meaning anytime an instruction was stalled, so were all instructions in the same bundle or after it.
One "non-low-level" idea on Itanic, which never really panned out in practice, was for the assembler to automatically insert stop bits ;; marking assembly code "sequence points", instead of the programmer having to do it manually. But in practice, everyone did it manually, because they'd rather know how well their bundles were being used, and whether they could move instructions around in order to get the full 3 instructions / bundle (6 instructions / clock).
And explicit stop bits did not provide any advantage to future hardware by marking explicit parallelism, because at every generation everyone was concerned about obtaining maximum performance on the current machine, which involved shuffling instructions into 6-instruction double-bundles, often at the expense of parallelism on future implementations (which never went beyond two bundles / clock).
Their comment makes sense if you interpret "the C compiler" to mean "mainstream C compilers that 99% people use in production" or even just "gcc and clang".
We're talking about concrete things in the real world here, not philosophizing about the language spec.
And yet small-device C and GCC/Clang C are only kissing cousins, and we (well, people who aren't me: you all) still rhetorically lump them together as a single language.
> Register renaming, cache hierarchies, out of order and speculative execution etc are not visible
That's different from C.
In the history of x86, most new optimizations have preserved the semantics of code. For instance, register renaming isn't blind; it identifies and resolves hazards.
In C, increasing optimization has broken existing programs.
C is like a really shitty machine architecture that doesn't detect errors. For instance, overflow doesn't wrap around and set a nice flag, or throw an exception; it's just "undefined". It's easy to make a program which appears to work, but is relying on behavior outside of the documentation, which will change.
Computer architectures were crappy like that in the beginning. The mainframe vendors smartened up because they couldn't sell a more expensive, faster machine to a customer if the customer's code relied on undocumented behaviors that no longer work on the new machine.
Then, early microprocessors in the 1970's and 80's repeated the pattern: poor exception handling and undocumented opcodes that did curious things (no "illegal instruction" trap).
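To make the overflow point above concrete, here's the classic example of a program that "appears to work" but relies on undefined behavior, plus the GCC/Clang builtin mentioned elsewhere in this thread that expresses the check reliably (a toy sketch):

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Signed overflow is undefined, so the compiler may assume x + 1 never
 * wraps and is allowed to fold this test to "false" at higher optimization
 * levels -- the check can appear to work at -O0 and vanish at -O2. */
static int looks_like_it_checks(int x) {
    return x + 1 < x;
}

/* The reliable way: __builtin_add_overflow (GCC/Clang) reports whether the
 * true result fits, typically via the hardware overflow flag after the ADD. */
static bool really_checks(int x, int *out) {
    return __builtin_add_overflow(x, 1, out);
}

int main(void) {
    int r;
    printf("naive check:   %d\n", looks_like_it_checks(INT_MAX)); /* 1 or 0 */
    printf("builtin check: %d\n", really_checks(INT_MAX, &r));    /* always 1 */
    return 0;
}
```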
>x86 assembly is a high level language. It's analogous to JVM bytecode.
If you take this position, then having the distinction between "low level" and "high level" languages becomes pointless, and we have no way to distinguish between languages like x86 assembly and C and languages like Python and Haskell. This is why we use the terms "low level" and "high level": some of these languages have a lower level of abstraction than others. The fact that it's not giving you a great idea of exactly what's happening in the transistors is irrelevant: "low" and "high" are relative terms, not absolute.
> The fact that it's not giving you a great idea of exactly what's happening in the transistors is irrelevant
The author's point is that what's happening in the transistors is relevant — not controlling it is what led to Spectre and heavy performance losses if you aren't smart about cache usage. Thinking of C as a "low-level language" makes it easier for people to overlook that fact.
Then we need a finer-grained taxonomy instead of the current model that is essentially: machine code; C/Forth; literally everything else.
I mean, that should be blindingly obvious anyway looking at the actual history of programming languages and CPUs, but here we are in 2018 insisting that we must have exactly three categories with exactly the definitions of: assembly; C/Forth; everything else.
I think the author of the article is willing to accept that, and it's part of his point. The point is not to label languages, but to illuminate how we ended up with our current combination of software and hardware design. He points out that we are in a local-maximum with respect to processor performance because most of the optimizations we've made in processor design the past few decades break the C abstract machine. That requires the compiler and processor to go through awkward contortions to present the fiction of that abstract machine while still getting good performance. The final section, "Imagining a Non-C Processor", is where he explores this idea.
Take a look at maxas [0], and the Maxwell/Pascal microarchitecture, if you want a modern example of a low-level language (IIRC they are still manufactured on small mainstream process nodes).
Hmm, what does "virtual machine" mean if the "virtual machine" is implemented in hardware?
And, is nothing but actual binary machine code a "low level language"? I guess it's the lowest, I don't _think_ you can go lower than that... but someone's probably gonna tell me I'm wrong.
It's a pretty terrible analogy in nearly every obvious technical way. What an x86 cpu does with its instructions is almost but not entirely unlike what a JVM does.
None of the "registers" in x86 assembly are real. Few of the instructions are implemented in hardware; most are implemented in software microcode.
Advanced hardware will re-order and pipeline instructions based upon data dependencies.
Sure it's not doing the exact same things the JVM does with bytecode but the point is that x86 assembly is not the language of the hardware. It's a language that the hardware+firmware knows how to interpret and optimize at runtime similar to what your JVM does with java bytecode.
It isn't at all similar. JVM bytecode is a pretty high-level IR for tokenized Java. The JVM's main unit of optimization is a method, not instructions. Its key component is a compiler, it's even called that.
An x86 cpu, as the article points out, spends inordinate resources looking for ILP. It's not a compiler in any reasonable sense of the word, while a JVM is.
The point I'm trying to make with the analogy is that the x86 instruction set is not representative of what the hardware is actually doing.
It is not "low level" because it is an abstraction or virtual platform that the processor exposes and then interprets using its own internal resources and programming interface (microcode). The x86 interface does not map closely to the actual hardware, just as the article states. It exposes a flat memory model with sequential execution and only a handful of registers.
Much the same way that the JVM exposes a virtual machine that doesn't directly map to any of the platforms that it runs on. It's an abstraction that is interpreted or compiled at runtime.
I don't understand why you think the two are so different just because the JVM is higher level.
Right, and the point I'm trying to make is this is a pretty lousy analogy. For one thing, an x86 CPU is not nearly as VM-y as you make it out - renamed registers are very much real registers, big piles of the most common instructions execute in 1 or 2 uOps. For another, the VM you've picked as an example is singularly uncpu-like. C also exposes an abstract machine, would you use that as an analogy? Probably not.
'An abstraction that is interpreted or compiled at runtime' is so broad it's exactly what I said up top - it's analogous in the way everything is analogous to everything else. It's the sort of thing that might be true if you squint but offers somewhere between zero and negative insight.
I don't know, I think you're getting too caught up on the specifics of what they're doing.
CPUs are adopting JIT-like tendencies in order to increase performance: instruction reordering, register renaming, branch prediction, etc.
> if you squint but offers somewhere between zero and negative insight.
The insight I take from this is that we should look at moving those features out of the hardware and into the software level. Let us take advantage of them in our compilers and virtual machines.
The JVM can beat C in many scenarios because it can make optimizations based upon runtime information that a static compiler will never have available.
Imagine what we could do if we weren't chained to the x86 abstractions.
OoO, renaming, branch prediction, microcode have existed for a long time. If anything, more modern CPUs (x86 included) are RISCer than the older generations which had extensive microcode expansion for each instruction.
Even ignoring the fact that the JVM is typed, memory-safe, and has built-in GC (all things that were tried architecturally in the past and abandoned), there is still a large difference between the scope and variety of non-local optimizations performed by any non-toy VM and the local, strictly real-time, constrained-to-a-small-window set of reorderings done by an OoO engine. Even tracing, which is used by some JITs, has been largely abandoned in the CPU world.
Transmeta and Denver-like dynamic translation is closer to the behaviour of a software JIT and it is certainly considered drastically different from mainstream OoO.
"x86 is an architecture hobbled by its legacy ISA, the CPUs are immensely complex VM-like dynamic beasts that hide the real CPU to get performance out of it" is one of those tropes that's inaccurate enough to have a small cottage industry of online pieces explaining the wrong bits. You can probably find highly rated SO answers or HN comments about it.
The long and the short of it is, an x86 cpu is not really VM-like and a JVM is decidedly unCPU like. The analogy only works if you generalize it so much it becomes a uselessly mushy tautology or you ignore basic aspects of how each of these things work.
But, to be fair, C is not that low level. In fact, when I first learned it, it was considered a high-level language because CPUs we used it with didn't have functions with parameters, only subroutine jumps.
C reaches into the realm of low-level languages because it allows you to arbitrarily read from and write to the "state" of the context you live in, but it also allows you to express constructs that have no counterpart even on the most complex CPU architectures (even if they have things that disagree fundamentally with C's point of view).
You can write to null in C, your operating system rejects it. You can write past your allocated memory, your OS rejects it. It's not like just because it's written in C it gets to read the kernel memory - it's just that you can try.
All memory access in userspace is mediated by the MMU, so nothing in C gets "direct" memory access - but it does allow you to screw up your own memory space pretty well... I'm not sure that's C's fault though
That's my whole point. C doesn't try to stop you from doing that. C tries to do exactly what you ask it to, and if the OS doesn't allow it it just crashes.
To me that is about as low level as you can get without bypassing the OS.
Try using atomic instructions or initializing various tables during system boot. You will find that there are things C cannot do, unless you write assembly and call to it from C.
But is there really any programming language that lets you write to arbitrary memory? This is a problem of privileges rather than how low level your language is: it's not like you can write assembly that will give you arbitrary access to kernel memory.
Correct - it's an argument for broadening the title.
Calling out a specific language as not being something leads one to ask, "Well what is?". In this case, there is no qualifying alternative, so the title might as well be, "There is no low-level language for CPUs".
> Calling out a specific language as not being something leads one to ask, "Well what is?".
The article does basically answer this. The last section is about what it would mean to design a chip such that a low level language were possible to design for it.
It actually can't exist at all for current modern CPUs (x64/ARM/PowerPC) since they don't expose a programming interface for many of the things discussed in the article (speculative / out of order execution, register renaming, full cache control).
Yeah but it can for all sorts of microcontroller-y chips; embedded CPUs; DSPs; theoretical future CPUs that aren't trying to both act like a PDP11 and go faster every year...
I played a little bit with GPGPU on the Raspberry Pi.
I'd imagine it's relatively primitive compared to whatever shaders are compiled to on modern GPUs, but it was humbling to have to manage things like separate, per-core, disjoint register files which can only be read 4 cycles after a write. The cores are heterogeneous, so there is special hardware for exchanging register reads between cores if necessary.
> Register renaming, cache hierarchies, out of order and speculative execution etc are not visible at the assembly / machine code level
Cache hierarchies are directly accessible with the CLFLUSH, INVD, and WBINVD x86 instructions; we may also count PREFETCHx, but they call it "a hint". FENCE instructions touch even the multicore part of the system.
Many low-level CPU concepts leak into higher layers. A prime example is false sharing, which can manifest even in Java or C# programs.
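A minimal sketch of false sharing in C (struct and function names are mine; compile with -pthread): two logically independent counters that land in the same 64-byte cache line force the cores to bounce that line back and forth, while padding them apart avoids it.

```c
#include <pthread.h>
#include <stdio.h>

/* Two logically independent counters. Without the padding they typically
 * share a 64-byte cache line, so two threads incrementing "their own"
 * counter still bounce the line between cores (false sharing).
 * volatile keeps the stores going to memory instead of a register. */
struct counters {
    volatile long a;
    char pad[64 - sizeof(long)];    /* remove this to observe the slowdown */
    volatile long b;
};

static _Alignas(64) struct counters c;

static void *bump_a(void *arg) { (void)arg; for (long i = 0; i < 100000000L; i++) c.a++; return NULL; }
static void *bump_b(void *arg) { (void)arg; for (long i = 0; i < 100000000L; i++) c.b++; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", c.a, c.b);
    return 0;
}
```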
x86 is also there on modern architectures to supply an abstraction that is increasingly divergent from the actual die hardware (but compatible with expectations of e.g. a C compiler that already has an output target for previous x86 hardware).
Modern Intel CPUs basically emulate x86; there are many layers of abstraction between individual opcodes and transistor switching.
I really really liked this article, and reading the comments here is blowing my mind. Did we read the same thing?
I think it's a strong insight that chip designers and compiler vendors have spent person-millennia maintaining the illusion that we are targeting a PDP-11-like platform even while the platform has grown less and less like that. And, it turns out, with things like Spectre and the performance cost of cache misses, that abstraction layer is quite leaky in potentially disastrous ways.
But, at the same time, they have done such a good job of maintaining that illusion that we forget it isn't actually reality.
I like the title of the article because many programmers today do still think C is a close mapping to how chips work. If you happen to be one of the enlightened minority who know that hasn't been true for a while, that's great, but I don't think it's good to criticize the title based on that.
As someone who does large-scale computational work for a living, stuff like this is close to my heart. I often run into serious memory and runtime constraints due to poorly written codes that have a rather dumb understanding of the real underlying machine implementation that modern processors actually have, rather than of this imaginary PDP-11 that we've been brought up to believe in.
I wonder how much I could save (and how many more sims I could run) if my codes were rewritten in a language whose abstract machine translates much more cleanly and simply to what the computer actually does in 2018.
Actually, the author's argument about PDP-11 is interesting because C would have never been considered a low level language back then, for any platform.
Wiki definition, also what I was taught in my first CS class:
"A low-level programming language is a programming language that provides little or no abstraction from a computer's instruction set architecture—commands or functions in the language map closely to processor instructions. Generally this refers to either machine code or assembly language."
The term is evolving to match the time, as shown by the author's interpretation already being higher level than the original intention despite the goal of preventing exactly that.
> Actually, the author's argument about PDP-11 is interesting because C would have never been considered a low level language back then, for any platform.
Sure, agreed, but I don't think it's super interesting that words evolve in meaning over time.
What I find strange about the comments here is that some people think the article's title is bad even though my experience is that many people today do think "C is a low level language" is a reasonable thing to say.
I guess I'm arguing that it is a reasonable thing to say today in the right context. When I talk to web developers who don't understand anything about computer architecture for example it's much easier for me to tell them I'm a low level developer rather than explain that I design digital hardware (FPGA) and write drivers and firmware to interact with it. But I do agree that it's important developers know that C doesn't correspond to the CPU in the way they might mistakenly think it does
Heh. Yeah. Was programming ARM assembly in 1988. C compiler was far too high level. Look how it's always saving these registers to memory! Meanwhile I'm dropping into FIQ mode just for the banked R8-R14.
Now, sure, people say C is low level, and compared to Java it sure is. But it isn't low-level.
I also really liked the article and found it thought provoking. This is all way above my pay grade but I like to think that there is a more optimal cpu design and language pairing that we will eventually reach and it's fun to imagine what that might look like.
Obviously, it would be very hard to shift the incumbent model in reality. We just have to look at the lack of prosperity for the Itanium and Cell processors to see how hard it is to achieve success. But imagine if new computer languages had been created just for these processors. Commercially this would make little sense, but it might be possible to create languages that fully used these processors yet retained simplicity for developers. Or maybe it isn't possible to beat the clarity of sequential instructions for human developers, or maybe out-of-order processing is the optimal algorithm. There are other changes coming too, such as various replacements for DRAM that integrate more closely with the CPU (such as 3D chips) [1,2], which, by reducing the latency of main memory, could actually bring us back closer to the C model of the computer... or just change computing entirely.
It would be an interesting exercise to ask the Clang folks to relax backward compatibility, thus designing a new language, and see if it makes their compiler go faster.
That could be deployed as a new language, or adding features to existing ones, like value types in Java, or even compiler switches that relax some C rules for faster speed. Imagine -fpointers_cant_be_cast_to_ints or -freorder_struct_fields.
It seems like the article is mostly useful for inspiring research; that is, most of us aren't the target audience.
I'm wondering what will happen as GPUs become more general-purpose. What's next after machine learning?
Would it be possible to make a machine where all code runs on a GPU? How would GPUs have to change to make that possible, and would it result in losing what makes them so useful? What would the OS and programming language look like?
> It seems like the article is mostly useful for inspiring research; that is, most of us aren't the target audience.
As a group of professionals, it is highly beneficial for us to be interested in these things. People who design languages and compilers do it largely on what is perceived as being demanded, and us as programmers are the ones that create the demand for new languages.
To put in other words, if programmers aren't aware of what's going wrong with our current languages, they cannot express their need for new languages. So, there's less incentive for researchers to produce new ways of programming computers. It is much more tempting to "please the masses" in a way that causes this local-maximum problem. It's much more interesting to research problems that translate into mainstream use than academic things that nobody actually uses.
I still think a research project (either academic or in industry) with a hardware component would be the best way to explore radical new processor architectures that are further away from C.
- Hardware designers are conservative. They aren't likely to implement a different hardware architecture because "programmers demand it" (really?), unless there's existing research showing how it can be done and a compelling reason why customers will buy the chips.
- As a hobbyist language designer, I'm still going to target something that exists and is popular: x86, JavaScript, wasm, C, or something like that. A low-level language targeting a platform that doesn't exist isn't all that appealing. But, someone might get some good papers out of doing the research.
I think it is going the opposite direction, where cpus get more powerful graphics cards integrated more and more into them. This allows for matrix math to become a bit more standard in day to day programming.
However, in the opposite direction, where a GPU becomes more like a CPU: if streams could do some level of limited branching without slowing the whole thing down, it would open the door to threading frameworks and design patterns where you write a loop in code, every thread gets its own copy of memory, and on the threading front, it just kind of works for a lot of generic code.
Then if gpus added some sort of piped-like summation-like instruction, in the cases in a loop where variables need to be shared, they can still be added, subbed, mul, div, or mod, easily and quickly, allowing for what looks and acts like normal code today, but is actually threaded. That would kind of bring code back to where it is today.
Who knows? It's kind of fun to speculate about though.
> cpus get more powerful graphics cards integrated more and more into them.
Actually the wheel of reincarnation seems to have stopped, at least for the time being. It seems that there is a fundamental, hard-to-reconcile disconnect between a latency-optimised engine like a CPU and a throughput engine like a GPU. Hybrid CPUs like Larrabee or extensions like AVX512 do not seem to be enough.
Short term, probably the best we are going to see is separate CPU and GPU cores on the same die (or more likely just the same package), but even that is likely suboptimal.
Or, conversely, could the iPhone 11 have 20 cores that are optimised not for latency but for power-efficient execution, such as how much you can do before the device runs out of battery? These would still be ARM, and backward-compatible with mainstream languages like Swift, so will have a low barrier to entry.
Maybe apps will be allowed to run longer in the background, if there are always extra CPU cores available that don't consume much power.
I think the bigger takeaway for me is that what constitutes low level has evolved over time. I still think C is low level because you have to manage your own memory and can play tricks with pointers and memory that other languages protect you from. Meaning that low level takes off a lot of the training wheels. The compiler still does what it can and optimizes things, but you have a far greater ability to shoot yourself in the foot than in some other languages. The arguments in these comments about what a "real" low-level language is seem mostly pedantic.
> I think the bigger take away for me is that what constitutes low level has evolved over time.
Yeah, this is a good insight. The height of the overall stack has grown. The lowest low level is lower than it was in the 60s and the highest high level is higher. So we need more terms to cover that wider continuum.
I agree that this article hits pretty hard against a lot of assumptions about how our machines are working.
I also feel like the author is trying to say something about how imperative scalar (meaning 'operates on one datum at a time') languages are causing more trouble than they're worth. Sophie Wilson said something similar in her talk about the future of microprocessors [1]. This implies that declarative and functional semantics would be more amenable to parallelization, as the author mentions in the article, as well as allowing the compiler more freedom to deduce a suitable 'reordering' of operations that would better fit the memory access heuristics the machine is using.
C leads you to believe that memory access has uniform cost regardless of address. What is the perf cost of:
*foo
Depending on what foo points to, and which memory you have previously read, the cost can vary by close to two orders of magnitude on many chips.
C does give you the ability to control those costs, by controlling how you lay out your data in memory and by controlling imperatively in which order you access it. But the language doesn't show you those costs in any way.
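A rough sketch of how you'd make that cost visible (buffer size and timing approach are arbitrary choices of mine): walk the same buffer once sequentially and once in a shuffled order, and the same `*foo`-style dereference ends up costing wildly different amounts.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M ints, well past typical L3 sizes */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    int *data = malloc(N * sizeof *data);
    size_t *order = malloc(N * sizeof *order);
    for (size_t i = 0; i < N; i++) { data[i] = (int)i; order[i] = i; }
    /* Fisher-Yates shuffle to build a random traversal order. */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }

    long sum = 0;
    double t0 = now();
    for (size_t i = 0; i < N; i++) sum += data[i];          /* sequential */
    double t1 = now();
    for (size_t i = 0; i < N; i++) sum += data[order[i]];   /* random */
    double t2 = now();

    /* Both loops perform the same number of dereferences, but the random
     * walk misses cache on nearly every access. */
    printf("sequential %.3fs, random %.3fs (sum %ld)\n", t1 - t0, t2 - t1, sum);
    free(data); free(order);
    return 0;
}
```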
> But the language doesn't show you those costs in any way.
I think it's pretty clear. Access memory sequentially, and you can expect to hit the cache. Access more memory than the cache size in a random order, and you can expect to pay memory access latencies (100s of CPU cycles).
I doubt you would be willing to manage the cache yourself in every line of code. That would be a lot of code. Some programmers might want to tune cache eviction behaviour by changing it in a few controlled points in time. But not in a way that couldn't be exposed to assembler/C. (I don't know if that's even realistic from a hardware architect's point of view).
Except memory is virtual. Memory location 0x1000 might be forward, or backwards compared to 0xFFF, depending on the state of the Translation-lookaside buffer (TLB).
Hint: Virtual address 0x0804800 doesn't belong at physical address 0x0804800. The OS can put that memory anywhere, and the CPU will "translate" the memory during runtime.
This means that in an obscure case, going forward a linked list (ie: node = node->next) may involve FIVE memory lookups on x86-64
* Page Map Lookup
* Page Directory Pointer lookup
* Page Directory lookup
* Page Table lookup
* Finally, the physical location of "node->next".
An even more obscure case (looking at maybe address 0xFFFC, unaligned) may require two lookups, for a total of 10-memory lookups (the page-directory walk for page 0xFFFC, and then the page-directory walk for 0x1000).
There is a LOT of hardware involved in just a simple "node = node->next" in a linked list. Its not even CPU-dependent. Its OS configurable too. x86 supports 4kb pages (typical in Linux / Windows), 2MB Large Pages and 1GB Huge Pages.
> Except memory is virtual. Memory location 0x1000 might be forward, or backwards compared to 0xFFF, depending on the state of the Translation-lookaside buffer (TLB).
Does that matter though? I would assume that the "prefetcher" (or whatever it's called) can make its predictions in terms of virtual memory.
Regarding the linked list, it has been common wisdom for a long time that one should prefer sequential memory instead of linked lists. A nice benefit is that this simplifies the code as well :-). Growing and reallocating sequential buffers might not be possible in very dynamic/real-time and decentralized architectures - like a kernel - though.
Yeah. Meltdown means that the OS (as a security measure) wipes away the TLB whenever you make a system call. So all of those cached TLB entries disappear every syscall due to Kernel Page-Table Isolation.
And then the CPU core has to start from scratch, rebuilding the TLB-cache again.
So last year (when we didn't know about Meltdown), a system-call was fast and efficient. This year, on Intel and ARM systems (vulnerable to Meltdown), system-calls are now forced to wipe TLB. (But not on AMD-systems, which happened to be immune to the problem)
Both AMD and Intel implement the x86 instruction set, and now the performance characteristics are different between the two boxes for something as simple as "blah = blah->next".
-------------
The important bit is that C is still quite "high level", and indeed, even Assembly language "lies" to the programmer through virtual memory. The OS (specifically page-tables) can interject and have some magic going on even as assembly-code looks up specific addresses.
The simple pointer-dereference *blah is actually incredibly complicated. There's no real way to know its performance characteristics from a C-level alone. It depends on the machine, the OS, the configuration of the OS (ie: MMap, Swap, Huge Page support, Meltdown...) and more.
And all that's without even getting into things like swap, or memory-mapped files. Or, to really go all-out, mmap() an NFS file - and you could find your mere pointer dereference waiting for the network.
How about a different example entirely? Memory nowadays is either CPU (and the GPU can access it) or GPU (and the CPU has a window into it). It's terribly important to use the right one: if a chunk of memory is mostly used by the CPU, it needs to be CPU memory, and if it's mostly used by the GPU, it needs to be GPU memory. But there's no good way to specify that.
You might argue that programming a modern computer is more like programming a tightly-bound, nonuniform multi-processor system. And I'd agree. But C doesn't do much to help program such a thing.
Still, what you normally do and can assume is that you're using just "CPU" memory with pretty predictable (I think) latencies and a transparent caching layer.
When different things are mapped into the address space, that's an abstraction the programmer (or the user) consciously made. It should be possible to figure out the performance characteristics there.
Of course, many programs work on various machines with their own performance characteristics. You should still be able to optimize for any one of them by querying the hardware and selecting an appropriate implementation, if you want to put in the work.
I don't think assembler/C is such a big problem here. But then again, I'm not a low level guy (in this sense) for now.
It's worth noting that chips that were designed for high-performance computing (e.g. the Cell) from the outset generally don't have silicon devoted to things like out of order execution, register renaming, etc. In this case, the bulk of the optimization logic does shift to the programmer (aided by the compiler).
The reason is that in these domains (e.g. game consoles, supercomputing), you know ahead of time the precise hardware characteristics of your target, you can assume it won't change, and can thus optimize specifically for that ahead of time.
This isn't true for "mass-market" software that needs to run across multiple devices, with many variants of a given architecture.
> The reason is that in these domains (e.g. game consoles, supercomputing), you know ahead of time the precise hardware characteristics of your target, you can assume it won't change, and can thus optimize specifically for that ahead of time.
Cell was a failure in large part because this proved to be less true / less relevant than its designers thought.
Source: many late nights / weekends trying to get PS3 launch titles performing well enough to ship.
I did work in graduate school trying to make the Cell easier to program - basically, providing OpenMP-like abstractions that would take advantage of the SPEs. I've always been really curious: how much did your games take advantage of the SPEs? When did you send code to the SPEs versus using the GPU? Were you using libraries that helped managing the SPEs, or did you do all of it manually?
OpenMP is a bad approach for the types of problems commonly encountered in games and graphics programming in my experience. Matt Pharr's excellent series of articles on the history of ISPC gives some good explanations of what programming models actually work well for graphics particularly: http://pharr.org/matt/blog/2018/04/30/ispc-all.html
At the time I was doing most of my SPE work (helping to optimize launch titles at EA prior to the launch of the PS3) most titles weren't taking much advantage of them at all. We were a central team helping move some code that seemed like it would most benefit over, I was particularly involved in moving animation code to the SPEs. There weren't really any options for libraries to help at that point, other than things we were building internally, so it was almost all manual work.
Later on in the PS3 lifecycle people moved more and more code to the SPEs. To my knowledge most of that work was largely manual still. For a while I was project lead on EA's internal job/task management library which had had a big focus on supporting use of the SPEs but my involvement in it was mostly during the early part of the Xbox One / PS4 generation. The Frostbite graphics team in particular did a lot of interesting work shifting GPU work over to the SPEs (I think some of it they've talked about publicly) but I wasn't directly involved in that.
I completely believe you on OpenMP being bad for games and graphics programming; we were targeting the HPC community which had a heavy interest in Cell as well. But all the while, I knew a bunch of programmers out in the world were shipping Cell code, and I was always curious what their patterns were. Thanks for the answers!
The point of dynamic optimizations (such as ooo) is not only to hide implementation details (such as register file size) but very much to take advantage of dynamic opportunities that simply cannot be known statically.
The optimal schedule can be very different depending on whether some load hit L1 vs L2 or even was forwarded from the store buffer.
There are some classes of very regular algorithm where you could probably predict everything (and handle the memory hierarchy) statically, such as GEMM, but it's not very common.
Yeah, this is a very important point that many people seem to be missing, including the authors of the original article it seems to me. It was certainly a big problem for performance of games on Cell in my experience.
The points made in the article are certainly valid, but C is low-level in an abstract sense: it is approximately the intersection of all mainstream languages.
I.e. if a feature exists in C, it probably exists in every language most programmers are familiar with. (I worded this statement carefully to exclude exotic languages like Haskell or Erlang).
Thus C, while not low-level relative to actual hardware, is low-level relative to programmers' mental model of programming. If this is what we mean, it's still true and useful to think of C as a low-level language.
That said, it's important to keep the distinction in mind -- statements like "C maps to machine operations in a straightforward way" have been categorically wrong for decades.
> if a feature exists in C, it probably exists in every language most programmers are familiar with.
I don't think that's true.
Off the top of my head, C has: array-to-pointer decay, padding, bit fields, static types, stack allocated arrays, integers of various sizes, untagged enums, goto, labels, pointer arithmetic, setjmp/longjmp, static variables, void pointers, the C preprocessor.
Those features are all absent in many other languages and are totally foreign to users that only know those languages. A large part of C is exposing a model that memory is a freely-interpretable giant array of bytes. Most other languages today are memory safe and go out of their way to not expose that model.
Which languages have pointer arithmetic, longjmp, goto anywhere within a function, address-of operator, memcpy that is equivalent to assignment even for lexical variables, untagged unions, switch with fall-through in absence of explicit break and a textual/token-wise preprocessor?
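For illustration, a minimal sketch (not from the comment) exercising a few of the features just listed: pointer arithmetic, an untagged union, switch fall-through, and goto. Most memory-safe languages expose none of these directly.

#include <stdio.h>

/* Untagged union: the same bytes can be read back as either member. */
union Raw {
    float f;
    unsigned int bits;
};

int main(void)
{
    int data[4] = {10, 20, 30, 40};
    int *p = data;
    p += 2;                                       /* pointer arithmetic */
    printf("*p = %d\n", *p);                      /* prints 30 */

    union Raw r;
    r.f = 1.0f;
    printf("bits of 1.0f = 0x%08x\n", r.bits);    /* reinterpret the bytes */

    switch (*p) {
    case 30:
        puts("thirty");                           /* no break: falls through */
    case 40:
        puts("forty");
        break;
    default:
        break;
    }

    int i = 0;
again:                                            /* label + goto */
    i++;
    if (i < 3)
        goto again;
    printf("i = %d\n", i);                        /* prints 3 */
    return 0;
}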
Well, there is a big jump between knowing about historic programming languages and familiarity. I know a few things about SNOBOL, but I couldn't sit down and start coding in it without some ramp-up period.
> Thus C, while not low-level relative to actual hardware, is low-level relative to programmers' mental model of programming
Programmers' mental model of programming is not a homogeneous set. I'm pretty comfortable in LabView, for example; a language that is extremely parallel (the entire program is composed of a graph of producer / consumer nodes and sequential operation, if desired, must be explicitly requested).
Going by their definition, I don't think there are any low level languages, at least on modern architectures. Even x86 assembly abstracts out a lot of what is going on within the CPU.
That doesn't mean the definition is useless -- rather than "C isn't a low-level language, as opposed to something else which is", the point might be "there exist no low-level languages according to most people's understanding of that term". Which is still an interesting and useful fact.
It also hides the fact that C is just a couple notches above the absolute minimum most people would even consider - writing assembly code by hand - and is, effectively, the lowest level most programmers will ever venture to.
True, but one of the points the article makes is that in practice there's a vast gulf (person-years of C compiler development) between the C code one writes and the resulting assembly output (and this ignores the fact that x86 assembly is itself an abstraction that co-evolved with C-like languages and is basically emulated on modern massively parallel CPU architectures).
In that regard, a case can be made that when you're writing in C, you're writing exactly as close to the bare metal as if you're writing in, say, Go or Haskell.
> In that regard, a case can be made that when you're writing in C, you're writing exactly as close to the bare metal as if you're writing in, say, Go or Haskell.
No, you really can't. This is childish black and white thinking. The computational model of C is built on an interface exposed by the hardware. Go and Haskell build many additional abstractions on top of that same model.
This article could have had a fruitful discussion about what the author is trying to say, but by choosing such a clickbait title, he managed to turn it into a discussion on semantics that wants to deny useful distinctions, because in some context (not the context in which it's actually used), it doesn't fit.
This kind of linguistic wankery really pisses me off, because it's useless and rests on a misunderstanding of how people actually use language (which is to say, in context and often in relative terms).
But I believe that's the article's very point: the context and relative terms people often use to talk about C are incorrect. The amount of transformation between a C program and the corresponding assembly instructions is significant, but people continue to believe it is not, which results in all sorts of incorrect assumptions when reasoning about a C program (such as which line of code or statement executes "first").
Haskell, Go, et al. are understood to have complex runtime machinery atop the x86 instruction set. It's an erroneous belief that C does not (and one that I've seen developers get bitten by repeatedly as they try to manage threaded C code).
I think it just leads to quibbling over the boundary of low level, as is happening here.
I think it's just important to know that the definition changes over time relative to the state of the art. C was once considered high level. In the future, if programming languages evolve to a more natural language state, then sending serial instructions to the computer in a strange code will seem very low level to such programmers.
Quibbling and language-lawyering aside, this is clearing up a real, fundamental misunderstanding that a lot of people have.
Anecdotally I have encountered loads of programmers who actually believe that there is a straightforward correspondence between C code and what the machine is actually doing, which is wrong.
So regardless of how you want to define "low-level", understanding this point is useful.
If you think of the hardware as a black box, it does correspond to the instruction set architecture presented. Generally you don't think about the implementation details of an op-amp when you wire it into a circuit. The issue is that CPUs have become so complex that the lines are blurred, since the external interface is so removed from what's going on inside with microcode, out-of-order execution, and caches. So while, pedantically, there are no longer low-level languages outside of microcode, that renders the term useless, and I'd prefer the natural language evolution that's occurring.
> actually believe that there is a straightforward correspondence between C code and what the machine is actually doing, which is wrong.
That really depends on your definition of "the machine". If "the machine" is "hardware", then sure. But if software is considered a piece of logic unto itself, which, when run against a sound model of an architecture, results in a series of logical steps, it's different: there is a very straightforward correspondence between the model machine and assembly/C. Whether there is a 1:1 correspondence between the model machine and whatever hardware happens to implement it is not that relevant.
So if low-level is defined as the lowest level at which a hardware abstraction behaves according to its logical function rather than its silicon, then C tells you exactly what the machine should be doing, even if the hardware does it through different means.
> there is a very straightforward correspondence between the model-machine and assembly/C
I believe one of the points the article is making is that, no, there is not. Pin down a C developer's actual understanding of how C is converted into x86 or x86-64 assembly, and you find that nobody actually has that abstraction riding around in their heads, because the abstraction they do have riding around in their heads would make for some unacceptably non-performant code. Even if we disregard the fact that x86 is emulated on modern hardware, the C -> (Clang + LLVM / gcc) -> assembly path is deeply complicated.
Assembly has been a fiction on most computers for a long time now. From the other side of the instruction decoder, they are more like VLIW machines than evolved 8080's.
I haven't done any shader programming, but can we even say that about those languages? It might just be my own inexperience talking, but for anything more complicated than matrix multiplication the innards of a GPU seem just as opaque as a CPU.
OpenGL started out as being a pretty high level language and 1.0 certainly doesn't map closely to modern hardware at all. But APIs and hardware have moved closer together over time, and stuff like CUDA and Vulkan use models that are a pretty close match to the hardware they run on. When writing CUDA you can reasonably figure the number of cycles an operation will take, and benchmarking will agree, unlike CPUs that have become so non-deterministic that they are much harder to reason about.
That said, I wouldn't look to those as examples for how to design a good "low-level" CPU language, as CPUs and GPUs solve very different problems.
I enjoyed reading this, mostly because it made me angry, then curious, then thoughtful all in one go.
Partly because I really like the PDP-11 architecture and its 'separated at birth' twin, the 68K; they greatly influenced how I think about computation. I also believe that one of the reasons the ATmega series of 8-bit micros were so popular was that they were more amenable to a C code generator than either the 8051 or PIC architectures were.
That said, computer languages are similar to spoken languages in that a concept you want to convey can be made more easily or less easily understood by the target by the nature of the vocabulary and structure available to you.
Many useful systems abstractions (queues, processes, memory maps, and schedulers) are pretty easy to express in C; complex string manipulation, not so much.
What endeared C to its early users was that it was a 'low constraint' language: much like Perl, it historically had a fairly loose policy about rules in order to allow for a wider variety of expression. I don't know if that makes it 'low', but it certainly helped it be versatile.
> A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model.
Sounds like a GPU?
> Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.
It seems like ATI & NVIDIA are doing okay, even with C & C++ kernels. GLSL and HLSL are both C-like. What is problematic?
C-like code that runs on GPUs is not even close to normal C, even though the syntax is similar. The way you layout your memory, schedule your threads, and add memory barriers is completely different. You are never going to take a piece of large C code written for a CPU and just run it directly on a GPU.
Huh, that’s weird, I run a C++ compiler directly on my GPU code. The only difference between CPU and GPU code at the function level is whether I tag it with a __global__ macro or not, and lots of functions compile and run for both CPU and GPU.
Memory layout, thread scheduling, and barriers are not features of the C language and have nothing to do with whether your C is “normal”. Those are part of the programming model of the device you’re using, and apply to all languages on that device. Normal C on an Arduino looks different than normal C on an Intel CPU which looks different than normal C on an NVIDIA GeForce.
You can look at C++ AMP too, it runs with all GPUs that support DX11 on Windows, and is a part of the Windows SDK. It's implemented by AMD ROCm on Linux, which also implements HIP/CUDA.
Normal C/C++ can run fine on modern GPU architectures.
If you meant something that sits on your desktop and runs Linux, then yeah it’s uncommon but not unheard of to run it on a SIMD system. The trend is absolutely definitely going toward SIMD being used in general purpose computing. Even if you don’t want to count any of my examples, you will see the “normal” PC become more GPU-like in the future than it is today.
I've only done a bit of GPU kernel writing, but I always found it very... unergonomic. It's like they mashed C to work in a context it wasn't meant for. Which is understandable since you want to encourage adoption, but I'd guess it's part of the motivation behind creating SPIR-V and allowing people to target other languages to the GPU.
NVIDIA has always allowed multiple languages on CUDA via PTX, with the offerings for C, C++ and Fortran coming from them, while some third parties added Haskell, .NET and Java support as well.
Yet another reason why many weren't so keen on being stuck with OpenCL and C99.
I think that would be a fair point if the article was about whether or not we should call C a low-level language, but the article is actually about whether C maps cleanly onto what the machine actually does and what a machine might look like if we didn't have that expectation.
> The root cause of the Spectre and Meltdown vulnerabilities was that processor architects were trying to build not just fast processors, but fast processors that expose the same abstract machine as a PDP-11. [...]
This strikes me as a flavor of the VLIW+compilers-could-statically-do-more-of-the-work argument, though TFA does not mention VLIW architectures.
C or not, making compilers do more of the work is not trivial, it is not even simple, not even hard -- it's insanely difficult, at least for VLIW architectures, and it's insanely difficult whether we're using C or, say, Haskell. The only concession to make is that a Haskell compiler would have a lot more freedom than a C compiler, and a much more integrated view of the code to generate, but still, it'd be insanely hard to do all of the scheduling in the compiler. Moreover, the moment you share a CPU and its caches is the moment that static scheduling no longer works, and there is a lot of economic pressure to share resources.
There are reasons that this make-the-compilers-insanely-smart approach has failed.
It might be more likely to be successful now than 15 years ago, and it might be more successful if applied to Rust or Haskell or some such than C, but, honestly?, I just don't believe this will work anytime soon, and it's all academic anyways as long as the CPU architects keep churning out CPUs with hidden caches and speculative execution.
If you want this to be feasible, the first step is to make a CPU where you can turn off speculative execution and where there is no sharing between hardware threads. This could be an extension of existing CPUs.
A much more interesting approach might be to build asynchrony right into the CPUs and their ISAs. Suppose LOADs and STOREs were asynchronous, with an AWAIT-type instruction by which to implement micro event loops... then compilers could effectively do CPS conversion and automatically make your code locally async. This is feasible because CPS conversion is well-understood, but this is a far cry from the VLIW approach. Indeed, this is a lot simpler than the VLIW approach.
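A rough sketch of how that could surface in a C-like language (the load_handle, async_load, and await_load names are invented for illustration; no real ISA or intrinsic set is implied). The stubs below just perform the load synchronously so the snippet compiles; on the imagined hardware they would be single instructions whose latency overlaps with other work.

/* Hypothetical asynchronous-load intrinsics: async_load() starts a load and
   returns a handle immediately; await_load() waits until the value arrives. */
typedef struct { int value; } load_handle;

static load_handle async_load(const int *addr) { load_handle h = { *addr }; return h; }
static int await_load(load_handle h) { return h.value; }

int sum_indirect(const int *index, const int *table, int n)
{
    int sum = 0;
    for (int i = 0; i + 1 < n; i += 2) {   /* odd trailing element omitted for brevity */
        /* Issue two independent loads before waiting on either, instead of
           the C abstract machine's strict "load, then use" ordering. */
        load_handle a = async_load(&index[i]);
        load_handle b = async_load(&index[i + 1]);
        sum += table[await_load(a)];
        sum += table[await_load(b)];
    }
    return sum;
}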
TFA mentions CMT and UltraSPARC, and that's certainly a design direction, but note that it's one that makes C less of a problem anyways -- so maybe C isn't the problem...
Still, IMO TFA is right that C is a large part of the problem. Evented programs and libraries written in languages that insist on immutable data structures would help a great deal. Sharing even less across HW/SW threads (not even immutable data) would still be needed in order to eliminate the need for cache coherency, but just having immutable data would help reduce cache snooping overhead in actual programs. But the CPUs will continue to be von Neumann designs at heart.
The meta point from the article is that this is as much a hardware problem as it is a language or developer one. An arms race was waged to create CPUs that are very effective at running sequential programs, to the point that what they present to the program is very much a facade, and they hide an increasingly large amount of internal implementation detail. By David's postulation, even the native assembly language for the CPU is not low level.
To drive this juxtaposition home, I'd point to PALcode on Alpha processors in which C (and others) can very much be a low level language. Very few commercial processors let you code at the microcode level.
The overarching premise is then brought home by GPU programming, which shows that you don't necessarily need to be writing at the ucode level if the ecosystem was built around how the modern hardware functioned.
The author, David Chisnall, is a co-author on a related paper from PLDI 2016: "Into the Depths of C: Elaborating the De Facto Standards", https://news.ycombinator.com/item?id=11805377
He was also one of the earliest non-Apple contributors to Clang, was on the FreeBSD core team, and wrote the modern GNU Objective-C runtime implementation. His work on Objective-C in particular is prolific.
Exactly. While CPU designers will certainly make sure they can run C code fast, it turns out that, for the last 40 years at least, the C model (sequential, procedural, mostly flat address space) has been the most efficient to implement in hardware.
It is instructive to consider GPUs and their compilers. The death of OpenGL in favour of Vulkan has come about because OpenGL is unable to express low level constructs which are essential to achieving performance. GPU drivers are actually compilers that recompile shaders into efficient machine code.
Thus the fundamental limitation is that the processor has only a C ABI. If there were a vectorisation- and parallelism-friendly ABI, then it would be possible to write high level language compilers for that. It should be possible for such an ABI to coexist with the traditional ASM/C ABI, with a mode switch for different processes.
It is correct that C is not really a low level language, but the points about how C limits the processor don't make much sense.
It uses UltraSPARC T1 and above processors as an example for a "better" processor "not made for C", but this argument makes no sense at all. The "unique" approach in the UltraSPARC T1 was to aim for many simple cores rather than few large cores.
This is simply about prioritizing silicon. Huge cores, many cores, small/cheap/simple/efficient die. Pick two. I'm sure Sun would have loved to cram huge caches in there, as it would benefit everything, but budgets, deadlines and target prices must be met.
Furthermore, the UltraSPARC T1 was designed to support existing C and Java applications (this was Sun, remember?), despite the claim that this was a processor "not designed for traditional C".
There are very few hardware features that one can add to a conventional CPU (which even includes things like the Mill architecture) that would not benefit C as well, and I cannot possibly imagine a feature that would benefit other languages that would be harmful to C. The example of loop count inference for use of ARM SVE being hard in C is particularly bad. It is certainly no harder in the common use of a for loop than it is to deduce the length of an array on which a map function is applied.
I cannot imagine a single compromise done on a CPU as a result of conventional programming/C. That is, short of replacing the CPU with an entirely different device type, such as a GPU or FPGA.
The point is specifically about parallel vs sequential programs. Legacy C code is sequential, and the C model makes parallel programming very difficult.
I met a guy back in college, a PhD who went to work at Intel, who told me the same thing. In theory, the future of general purpose computing was tons of small cores. In practice, Intel's customers just wanted existing C code to keep running exponentially faster.
> Legacy C code is sequential, and the C model makes parallel programming very difficult.
Neither of these statements is true, unless "Legacy" refers to the early days of UNIX.
Tasks that parallelize poorly do not benefit from many small cores. This is usually a result of either dealing with a problem that does not parallelize, or just an implementation that does not parallelize (because of a poor design). Neither of these attributes is related to language choice.
An example of something that does not parallelize at all would be an AES-256-CBC encryption implementation. It doesn't matter what your tool is: Erlang, Haskell, Go, Rust, even VHDL. It cannot be parallelized or pipelined. INFLATE has a similar issue.
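The serial dependency is easy to see if you sketch the CBC chaining in C (a toy sketch; toy_block_encrypt below is NOT real AES, it just stands in for the block cipher so the snippet is self-contained): every ciphertext block feeds into the next block's input, so block i cannot start until block i-1 has finished.

#include <stddef.h>
#include <string.h>

#define BLOCK 16  /* AES block size in bytes */

/* Placeholder block cipher so the sketch compiles; the dependency
   argument only needs *some* block_encrypt here. */
static void toy_block_encrypt(unsigned char out[BLOCK],
                              const unsigned char in[BLOCK],
                              const unsigned char key[BLOCK])
{
    for (int j = 0; j < BLOCK; j++)
        out[j] = (unsigned char)(in[j] ^ key[j] ^ 0x5a);
}

static void cbc_encrypt(unsigned char *ct, const unsigned char *pt,
                        size_t nblocks, const unsigned char iv[BLOCK],
                        const unsigned char key[BLOCK])
{
    unsigned char chain[BLOCK];
    memcpy(chain, iv, BLOCK);
    for (size_t i = 0; i < nblocks; i++) {
        unsigned char x[BLOCK];
        for (int j = 0; j < BLOCK; j++)
            x[j] = pt[i * BLOCK + j] ^ chain[j];     /* P_i XOR C_{i-1} */
        toy_block_encrypt(&ct[i * BLOCK], x, key);   /* C_i = E_k(P_i XOR C_{i-1}) */
        memcpy(chain, &ct[i * BLOCK], BLOCK);        /* C_i feeds block i+1 */
    }
}

Each iteration reads chain, which was written by the previous iteration, so extra cores don't help the encryption side (decryption is a different story, since all the previous ciphertext blocks are already available).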
For such algorithms, the only way to increase throughput is to increase single-threaded performance. Increasing the number of cores increases total capacity, but cannot increase throughput. For other tasks, the synchronization costs of parallelization are too high. I work for a high performance network equipment manufacturer (100Gb/s+), and we are certainly limited by sequential performance. We have custom hardware in order to load balance data to different CPU sockets, as software based load distribution would be several orders of magnitude too slow. The CPUs just can't access memory fast enough, and many slower cores wouldn't help as they'd both be slower and incur overheads.
Go and Erlang of course provide built-in language support for easy parallelism, while in C you need to pull in pthreads or a CSP library yourself, but the C model doesn't make parallel programming "very difficult", nor is C any more sequential by nature than Rust. It is also incorrect to assume that you can parallelize your way to performance. In reality, "tons of small cores" is mostly just good at increasing total capacity, not throughput.
I admit it's not fair to blame C in particular. The comparison is between how we write and execute software and how we could write and execute software, and the language absolutely comes into play, in addition to how the language is conventionally used. "Legacy" code in this context is code that was written in the past and is not going to be updated or rewritten.
I disagree that tasks performed by a computer either don't parallelize or the cost of synchronization is too high. At a fine-grained level, our compilers vectorize (i.e. parallelize) our code -- with limits imposed by C's "fake low-levelness" as described in the article -- and then our processors exploit all the parallelism they can find in the instructions. At a coarser level, even if calculating a SHA (say) isn't parallelizable, running a git command computes many SHAs. The reasons why independent computations are not done on separate processors -- even automatically -- come down to programming language features (how easy is it to express or discover the independence, one way or another) and real or perceived performance overhead. Hardware can be designed so that synchronization overhead doesn't kill the benefits of parallelization. GPUs are a case in point.
The world is going in the direction of N cores. We'll probably get something like a mash-up of a GPU and a modern CPU, eventually. If C had been overtaken by a less imperative, more data-flow-oriented language, such that everyone could recompile their code and take advantage of more cores, maybe these processors would have come sooner.
> "Legacy" code in this context is code that was written in the past and is not going to be updated or rewritten.
In that case, I would not say Legacy code is sequential. For the past few decades, SMP has been the target where sensible/possible.
> At a fine-grained level, our compilers vectorize (i.e. parallelize) our code.
Vectorization is a hardware optimization designed for a very specific use-case: performing an instruction f N times on a buffer of N inputs, by replacing N instantiations of f with a single fN instance.
If this is parallelization, then an Intel Skylake processor is already a massively parallel unit, with each core already executing massively in parallel by having the micro-op scheduler distribute work across available execution ports and units.
In reality, vectorization has very little to do with parallelization. Vectorization is much faster than parallelization (in many cases, parallelization would be slower than purely sequential execution), and in a world where all the silicon budget goes to parallelization, vector instructions would likely be killed in the process. You can't both have absurd core counts and fat cores. If you did, it would just be adding cores to a Skylake processor.
(GPUs have reduced feature sets compared to Skylake processors not because they don't want the features, but because they don't have room; they specialize to save space.)
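To make the distinction concrete (a minimal sketch, not from the thread): the loop below is a textbook auto-vectorization target. At -O3, GCC and Clang will typically compile the body to SIMD instructions (SSE/AVX on x86) that process several floats per instruction, with no threads or synchronization involved at all.

#include <stddef.h>

/* The same operation applied to every element, with no loop-carried
   dependency: exactly the "f applied N times over a buffer" pattern. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* typically becomes vector mul/add */
}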
> At a coarser level, even if calculating a SHA (say) isn't parallelizable, running a git command computes many SHAs.
And this is exactly why Git starts worker processes on all cores whenever it needs to do heavy lifting.
This has been the approach for the past few decades, which is why I twitch a bit at your use of "legacy" as "sequential": If a task can be parallelized to use multiple cores (which is not a language issue), and your task is even remotely computationally expensive, then the developer parallelizes the problem to use all available resources.
However, if the task is simple and fast already, parallelization is unnecessary. Unused cores are not wasted cores on a multi-tasking machine. Quite the contrary. Parallelization has an overhead, and that overhead is taking cycles from other tasks. If your target execution time is already met on slow processors in sequential operation, then remaining sequential is probably the best choice, even on massively parallel processors.
Git has many commands in both those buckets. Clone/fetch/push/gc are examples of "heavy tasks" which utilize all available resources. show-ref is obviously sequential. If a Git command that is currently sequential ends up taking noticeable time, and is a parallelizable problem (as in, computing a thousand independent SHAs), then the task would be parallelized very fast.
Unless something revolutionary happens in programming language development, it will always be an active decision to parallelize. Even Haskell requires explicit parallelization markers, despite being about as magical as programming can get (magical referring to "not even remotely describing CPU execution flow").
> Hardware can be designed so that synchronization overhead doesn't kill the benefits of parallelization. GPUs are a case in point.
I do not believe that this is true at all. That is, GPUs do not combat synchronization overhead in the slightest, lack the features that a CPU uses for efficient synchronization (they cannot yield to other tasks or sleep, only spin), and run at much lower clocks, which amplifies the inefficiencies.
After reading some papers on GPU synchronization primitives (this one in particular: https://arxiv.org/pdf/1110.4623.pdf), it would appear that GPU synchronization is not only no better than CPU synchronization, but a total mess. At the time the paper was written, the normal approach to synchronization was hacks like terminating the kernel entirely to force global synchronization (extremely slow!) or just using spinlocks, which are far less efficient than what we do on CPUs. Even the methods proposed by that paper are in reality just spinlocks (the XF barrier is just a spinning volatile access, as GPUs cannot sleep or yield).
All this effectively makes a GPU much worse at synchronizing than a CPU. So why are GPUs fast? Because the kinds of tasks GPUs were designed for do not involve synchronization. This is the best-case parallel programming scenario, and the scenario where GPUs shine.
I'd also argue that if GPUs had a trick up their sleeve in the way of synchronizing cores, Intel would have added it to x86 CPUs in a heartbeat, at which point synchronization libraries and language constructs would be updated to use it where available. They don't hesitate with new instruction sets, and the GPU paradigm is not actually all that different from a CPU.
> The world is going in the direction of N cores. We'll probably get something like a mash-up of a GPU and a modern CPU.
It's the only option, due to physics. If physics didn't matter, I don't think anyone would mind having a single 100GHz core.
However, it won't be a "mash-up of a GPU and a modern CPU", simply because a GPU is not fundamentally different from a CPU. A GPU mostly just has a different budgeting of silicon and a more graphics-oriented choice of execution units than a CPU, but the overall concept is the same.
> If C had been overtaken by a less imperative, more data-flow-oriented language, such that everyone could recompile their code and take advantage of more cores, maybe these processors would have come sooner.
A language that could automatically parallelize a task based on data-flow analysis (without incurring a massive overhead) would be cool. I don't know of any, though. It seems like a natural fit for something like Haskell or Prolog, but neither can do it.
However, tasks that would benefit from parallelization are already easy to tune to a different amount of parallelism, and parallelizing what parallelizes poorly is not useful on any architecture.
Parallelization hasn't really been a problem for at least the last two decades, and I certainly can't see it as the limiting factor for making massively parallel CPUs. However, massively parallel CPUs are not magical, and many problems cannot benefit from them at all. It will almost always be trading individual task throughput for total task capacity.
The thing is, most of the time you're reasoning at some logical level that will not be the "reality". The problem is that C programmers think that C === reality === performance. C has better (lower) constant factors, but is by no means better all the time.
The sophistication of the compiler does not mean the language is high level.
The meaning of a high level language is to do with abstraction away from the hardware. C programmers often wince at languages that are highly abstracted away from the hardware. But those are what are "high level" languages. Especially languages that remove more and more of the mechanical bookkeeping of computation. Such as garbage collection (aka automatic memory management). Strong typing or automatic typing. Dynamic arrays and other collection structures. Unlimited length integers and possibly even big-decimal numbers of unlimited precision in principle. Symbols. Pattern matching. Lambda functions. Closures. Immutable data. Object programming. Functional programming. And more.
By comparison C looks pretty low level.
Now I'm not knocking C. If there were a perfect language, everyone would already be using it. Consider the Functional vs Object debate. (Or vi vs emacs, tabs vs spaces, etc) But all these languages have a place, or they would not have a widespread following. They all must be doing something right for some type of problem.
C is a low level language. And there is NOTHING wrong with that! It can be something to be proud of!
TLDR: C was close-to-the-metal on the PDP-11 but since then hardware has become more complex while exposing the same abstraction to the C programmer. That means that hardware features such as speculative execution and L1/L2 caching are invisible to the programmer. This was the cause of Spectre and Meltdown and it forces a lot of complexity into the compiler. GPUs achieve high performance in part because their programming model goes beyond C. Processors would be able to evolve if they weren't hamstrung by having to support C.
Yeah, but that's just reinventing the mistakes of VLIW all over again. Yes, CPUs have complicated behavior in a way that can't be captured by scalar imperative languages in a concise way. No, that doesn't mean that you can fix this with new abstractions.
The reason C won wasn't that it forced CPUs to adhere to its particular execution metaphor[1], but that it happened upon a metaphor that could be easily expressed and supported by CPUs as they evolved over decades of progress.
[1] Basically: byte-addressable memory in a single linear space, a high performance grows-down stack in that same memory space, two's complement arithmetic, and "unsurprising" cache coherence behavior. No, the last three aren't technically part of the language spec, but they're part of the model nonetheless and had successful architectures really diverged there I doubt C-like runtimes would have "won".
It is very important to emphasize that GPUs only "achieve high performance" in workloads tailored very specifically to their extremely limited architecture.
CPUs, on the other hand, are designed to be much more generic, with decent performance for any task.
I wouldn't say this is precisely true. Look at Dolphin's "ubershaders" (https://dolphin-emu.org/blog/2017/07/30/ubershaders/): they're essentially a Turing machine, running on your GPU, used to emulate a GPU of another architecture... and yet this is still (much!) faster than doing the same on the CPU.
And there's nothing special about emulating a GPU on a GPU; you could emulate a CPU architecture just as easily, at a much higher level than you get from an FPGA, and so perhaps faster than you'd be able to get from today's FPGAs. And, if you're mapping GPU shader units 1:1 to VM schedulers, you'd also get a far higher degree of core parallelism than even a Xeon Phi-like architecture would give you. (The big limitation is that you'd be very limited in I/O bandwidth out to main memory; but each shader unit would be able to reserve its own small amount of VRAM texture space—i.e. NUMA memory—to work with.)
I'm still waiting for someone to port Erlang's BEAM VM to run on a GPU; it'd be a perfect fit. :)
I think it's more accurate to say that CPUs are high-performance for different tasks than GPUs. Simple code, very wide data workloads are atrociously slow on CPUs, and single-threaded heavily branching workloads are atrociously slow on GPUs. That doesn't mean that one is more limited than the other.
I would argue that the total set of "good performance" workloads is smaller on a GPU than it is on a CPU.
However, "good performance" on a CPU is much, much worse than "good performance" on a GPU. CPUs just achieve their mediocre performance on a larger set of use cases. GPUs are specialized devices that are very good at specialized activities.
The whole point of the article is that this is largely a myth imposed by the memory model of C and C-like languages. Single-threaded, heavily branching workloads would be atrociously slow on modern CPUs if not for branch prediction (which caused Spectre). A modern processor can have up to 180 instructions in flight for a single thread.
The processor does an OK job of keeping those slots filled, but a language and compiler could do a much better job. C doesn't convey any information about data dependencies, and instead just pretends all instructions are sequential. Even code which contains tight loops of sequential commands can be optimized, because you have an entire program and operating system running around that sequential code.
I was with you until the last sentence: "Processors would be able to evolve if they weren't hamstrung by having to support C."
I don't think it's fair or correct to say that C is the real issue. Recently there have been languages like Erlang and support for more functional models that make concurrent code a lot easier to write. The first real consumer multicore processors were only released a bit over 10 years ago, with Intel's Core 2 Duos. Of course SMP systems existed before that, Sun had them for years, but they were relatively niche. Still, Java, C++, and C# are all languages that produce much easier to maintain code if they are single threaded. Recent darlings like JS and Python are single threaded out of the box.
The large majority of languages in use today are not designed to be concurrent as a first principle. True multicore systems have been around for decades, software and mindshare is now starting to catch up and use tools that make concurrency easy.
My Symbolics computers (running Symbolics Ivory processors) run Lisp really well, as well as C - they have a C compiler.
I have operational computers of a variety of architectures at home, including the oldest generations (6502, 680x0), Sparc, Symbolics, DEC Alpha, MIPS 32- and 64-bit, etc., and even an extremely rare (and unfortunately not-running) Multiflow, the granddaddy of VLIW.
My favorite part of the original article was the final section. I wish we had a modern CPU renaissance akin to what was going on in the 80s and 90s, but the market dominance of x64 and ARM seems to be squelching things, with optimizations to those architectures rather than novel new ones (with possibly novel new compiler technologies). 64-bit ARM was a nice little improvement, though.
I don’t understand. CPUs do not support C, they support a specific instruction set. What stops them from having instructions for cache management, pipelining, speculative execution hints, etc?
They do not support C officially, but every CPU designer knows that 99%+ of the code that matters is written in C. Therefore they design chips targeting this translation from C. What the authors want is a better lower level interface that would allow for modern processor features without the legacy of the features available to the PDP11.
I think a better way of stating this is "99% of the code that matters is written in C, or in a language designed with a similar target architecture as C in mind". Certainly a lot of code that matters is written in C++, Objective-C, and Java, but the same points hold true for all of those.
Many of these features are very difficult to use in a useful way in a static context (e.g. at compile time) because the performance gains mostly come from taking advantage of dynamic context. Speculative execution and out of order execution for example are mostly useful because you don't know at compile time exactly what data your code is processing or what CPU it is running on, what function / context you are being called from, what is in cache and what isn't, etc.
The SPUs on the PlayStation 3 were an experiment in user managed caches and that proved to be a difficult thing to make effective use of even in games where you know more context than a lot of code can assume.
The Itanium processor did exactly this. Other than being a commercial flop, it was found to be quite difficult to actually get the compiler to generate good management instructions, and x86 was often able to beat out an Itanium core at the same clock speed.
But if that's the argument, then not even assembly is sufficient, as control over speculative branching and prefetch is only accessible via microcode in the CPU.
I think the argument is improperly framed. This is a discussion over public and private interface. The CPU is treated as a black box with a public interface (the x86+ instruction set). Precisely how those instructions are implemented (on chip microcode) is a private matter for the chip design team, which if correctly implemented, does not matter to the user, as the results should be correct and consistent. Obviously, a poor implementation can lead to Spectre or Meltdown. But for the most part the specific transistors & diodes used to sum a set of integers, or transfer a word from L2 to L3 cache, etc. shouldn't matter to us. If the compilers are relying on side effects to alter behavior of the internal implementation based on performance evidence, then that is a boundary violation.
C is low level. It remains "universal assembly language".
The argument implied in the article is that choosing a different public interface (breaking "C compatibility" and the imposed limitations) could bring a serious performance improvement.
While precisely how those instructions are implemented (on chip microcode) is a private matter for the chip design team, we do care how much resources it takes to implement these instructions, since if we can enable a more efficient implementation then we can get better price/performance.
Is there any evidence in favor of the argument that breaking C compatibility will increase execution speed without shifting the optimisation burden to programmers or compilers?
For example, have research CPUs been built that optimise for Erlang rather than C that provide better efficiency for the same amount of programmer effort than an X86 CPU running C?
You make good points - if we're just talking about semantics, then yes C is the closest portable language to x86 or arm and is low-level in that sense. But on the other hand, semantics is not always the only important thing: performance is sometimes important also, and there these low-level details matter. The architecture does its best to hide them from the user, but the abstraction is very leaky.
For example, when writing high-performance CPU-bound code it's usually important to keep in mind how wide cache lines are, but C doesn't expose this to the programmer in a natural way.
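For example (a sketch assuming the 64-byte cache line typical of current x86 parts; C itself offers no portable way to query the line size): per-thread counters are commonly padded out to a full line so that two cores never contend on the same line, which C11's alignas at least lets you spell out.

#include <stdalign.h>

#define CACHE_LINE 64   /* assumption: typical x86 line size, not exposed by C */

/* One counter per thread. Without the alignment, adjacent counters share a
   cache line, and updates from different cores cause false sharing. */
struct padded_counter {
    alignas(CACHE_LINE) unsigned long value;
};

struct padded_counter counters[8];   /* e.g. one per hardware thread */

With the alignment, sizeof(struct padded_counter) rounds up to 64, so each element of the array lives on its own line.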
A modern Intel processor has up to 180 instructions in flight at a time (in stark contrast to a sequential C abstract machine, which expects each operation to complete before the next one begins). A typical heuristic for C code is that there is a branch, on average, every seven instructions. If you wish to keep such a pipeline full from a single thread, then you must guess the targets of the next 25 branches.
The Clang compiler, including the relevant parts of LLVM, is around 2 million lines of code. Even just counting the analysis and transform passes required to make C run quickly adds up to almost 200,000 lines (excluding comments and blank lines).
I hate the idea of "low-level". There is not really such a thing. You should be using a language suitable for the domain you're working in.
Sadly, too many programming languages try to be the end-all, be-all. C is a language that is great for working in the systems domain.
Ideally, we would have small minimalist languages for various problem domains. In reality, maintaining and building high quality compilers is a lot of work. Moreover, a lot of development will just pile together whatever works.
That aside, you could build a computer transistor by transistor, but it's probably more helpful to think at the level of logic gates or even larger units. Heck, even a transistor is just a piece of silicon/germanium that behaves in a certain way.
So there are levels of abstraction, but is an abstraction low-level? I think the term probably came about to refer to the lower layers of abstraction that make up whatever system you're using. So unless you're using something that nothing can be built upon, everything, even what people would call high level, can be low-level.
Heck, people call JS a high level language, but there are compilers that compile to JS. This makes JS a lower-level system that something else is built upon. This again shows why I would say that "low-level" is often thrown around with a connotation that is not exactly true.
I find this article insightful, but missing the points it tries to deliver.
What the article is very good at delivering is that current CPUs' ISAs export a model that doesn't exist in reality. Yes, we might call it the PDP-11, although I miss that architecture dearly.
C was never meant to be a low level language. It was a way to map loosely to assembler and provide some higher level abstractions (functions, structures, unions) to write code that was more readable and structured than assembler. And yes, it is far from perfect. And yes, today it is called a low level language with good reason.
But this article is all about exposing the insanity that modern CPUs have become, insanity that is the sacrifice to the altar of backward compatibility -- all CPU architectures that tried the path of not being compatible with older CPUs have died.
I am pretty sure that once we have an assembler that maps closely to the microcode, or to the actual architecture of the internals of a modern, parallel, NUMA machine, we will still need a C-like language that introduces higher level features to help us write the non-architecture-dependent parts. And it will most probably be C.
The article itself has 4 definitions or "attributes" for low-level languages that can be considered contradictory:
* "A programming language is low level when its programs require attention to the irrelevant."
* Low-level languages are "close to the metal," whereas high-level languages are closer to how humans think.
* One of the common attributes ascribed to low-level languages is that they're fast.
* One of the key attributes of a low-level language is that programmers can easily understand how the language's abstract machine maps to the underlying physical machine.
So basically the entire article's premise (the title) hinges on the last bullet, which can be contested. All the other mentioned attributes can be applied to Java, C, C#, and C++. So failing the last bullet point doesn't apply to just C.
I think the author's point is that despite being perceived as low-level, C doesn't really differ from, say, Java on the last bullet.
In other words, a programmer who sits down and uses C and not Java might think, "I am being forced to pay attention to irrelevant things and think in unnatural ways, but that's because I am writing fast code using operations that map to operations done by the physical machine. In a higher-level language like Java, more of these details are out of my control because they are abstracted away by the language and handled by the compiler."
I think the article does a great job dismantling this point of view, and telling the story that C is not so different from Java, aside from being unsafe and ill-specified.
Maybe true, but I think the Java example is not that good. Java is still not that different from C. Java is more like a descendant of C and C++, and to be honest both languages force you to pay attention to lots of irrelevant "low-level" detail; fictionally low-level, since it's not actually the machine but the language itself (which is stuck in the PDP-11 mental model...).
Compare that to something genuinely different, like Erlang, Haskell, or Lisp.
High-level and low-level are relative, to be sure, but Java is definitely considered higher-level than C -- it was designed to target a virtual machine, for example, while C was designed to target real machines -- so I think it illustrates the article's point perfectly.
One reason for that is for many applications latency is much more critical than bandwidth. For PCs that’s input-to-screen latency, for servers that’s request-to-response.
It’s possible to make multicore processors with simpler cores, design OS and language ecosystem around it, etc. Such tradeoffs will surely improve bandwidth, but will harm latency.
Another reason is most IO devices are inherently serial. Ethernet only has 4 pairs, and wifi adapters are usually connected by a single USB or PCIe lane. If a system has limited single threaded (i.e. serial, PDP-11-like) performance, it's going to be hard to produce or consume these gigabits per second of data.
Great article if you're willing to read past headlines. I would have liked to see a mention of small processors that are still hugely popular (microcontrollers, etc.) where C is still a good fit.
The article does not properly distinguish between C as a language and what the C compiler does with the C program. The logic of the article references what the compiler does.
The reasonable way to measure languages is to look at the abstractions present in the language. C has fewer abstractions than the other languages that we are familiar with. That is the reasonable definition of the level of a language.
That's exactly the author's point. The C that programmers write is remarkably far from what the compiler generates for modern hardware.
How do you propose measuring the number of abstractions? JavaScript has remarkably few built-in abstractions, but it's in no way "low-level" from a hardware perspective.
I wonder if it's easier for a compiler/CPU to optimize "async" code? I often find myself with an array in JavaScript where I call the same function on each item; it would be nice if such cases were made parallel, which I think is possible to do in C++. Is that ever going to happen in JavaScript?
Language evolves. C is certainly lower level than C# or JavaScript, so even if it no longer fits the definition created decades ago, I don't see a problem with the term evolving to match modern times. People say assembly language when they mean assembly language (which others have argued isn't low level any more anyway), so using low level to describe a language closer to the hardware seems valid to me. It's interesting that the author argues C could be considered low level on the PDP-11, because by the old definition used back then it definitely wouldn't be. That tells me the author's definition of low level is already an evolution of the original definition, so there's no reason the term can't evolve some more.
Wiki definition:
"A low-level programming language is a programming language that provides little or no abstraction from a computer's instruction set architecture—commands or functions in the language map closely to processor instructions. Generally this refers to either machine code or assembly language."
Except what I'm saying is that PDP-11 era C wasn't low level. It incidentally took advantage of some low level features, but that wasn't by design, and wouldn't have changed its classification as a high level language at the time anyway.
Not closely enough that it would have been considered low level at the time, if you look at the instruction set C isn't even close to one-to-one with the machine instructions. See https://en.m.wikipedia.org/wiki/PDP-11_architecture#Myth_of_... for why people might mistakenly think otherwise
"processors wishing to keep their execution units busy running C code"
What? This is nonsense; the processor is not running C code!
The processor can only run machine code, regardless of the language used to write the source code.
Eh, the past thirty years have seen CPUs designed to run C. That's the whole point of RISC: the idea of 'let's just pare down the CPU to what actually gets compiled, and now we have fewer gates in the critical path and we can run our chips faster'.
Hmmm, I think that CPU instruction sets inherit mainly from the 8086, and https://en.wikipedia.org/wiki/Intel_8086 mentions a few languages that had influence on the 8086 design, but doesn't mention C at all.
Makes me wonder if x86 could be extended to expose the underlying parallelism. How much faster would my Prolog and Haskell programs run if all branches were executed simultaneously and only the successful path down my search tree returned?
The article is actually about how closely C maps to what is actually run on the hardware and whether hardware would look significantly different today if people didn't expect C to map closely.
Ok, w/o dipping into machine code, show me a low level language. Any snippet of C-code is transparent in that you know roughly how it is going to be translated into machine code.
”Any snippet of C-code is transparent in that you know roughly how it is going to be translated into machine code.”
…for a definition of ‘roughly’ that has become significantly less precise over the past decades.
For example, there was a time where you could be reasonably sure every multiplication in your source code mapped to a multiplication instruction, but that time has long been gone. Constant folding, replacement of multiplications by shifts and loop hoisting aren’t exactly novel techniques.
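To make that concrete (a small illustrative sketch, not from the comment): a modern compiler will typically fold the constant expression below at compile time and strength-reduce the multiplication by 8 into a shift, so neither the 3600 * 24 computation nor a multiply instruction shows up in the generated code.

unsigned seconds_per_day(void)
{
    return 3600u * 24u;   /* constant-folded to 86400 at compile time */
}

unsigned times_eight(unsigned x)
{
    return x * 8u;        /* typically becomes x << 3, not a mul */
}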
Precisely. As someone who's tried to duel a compiler for performance, -O3 has very little resemblance to anything I've ever written, and outperforms what I've written significantly.
You still have a good idea of what can get inlined, where loop unrolling can occur, constant folding, etc. If you have any assembly experience, you still have a good idea of what machine code can be generated.
But this will never be true in any low level language. You can spend a lot of time optimizing any single line of code manually, but the compiler will almost always beat you, and it will optimize any code you write beyond recognition.
// https://codereview.stackexchange.com/questions/38182
// https://codereview.stackexchange.com/a/38184
// Count the number of 1 and 0 bits in a 32-bit unsigned int with bitwise operations.
#include <stdio.h>

int CountOnesFromInteger(unsigned int);

int main(void)
{
    unsigned int inputValue;
    short unsigned int onesOfValue;

    printf("Please enter a value (between 0 and 4,294,967,295): ");
    scanf("%u", &inputValue);
    onesOfValue = CountOnesFromInteger(inputValue);
    printf("\nThe number has \"%d\" 1's and \"%d\" 0's", onesOfValue, 32 - onesOfValue);
}

// Kernighan's trick: value &= value - 1 clears the lowest set bit, so the
// loop runs once per 1 bit. Notice that an optimizing compiler can recognize
// this idiom and emit a single popcnt instruction.
int CountOnesFromInteger(unsigned int value)
{
    int count;
    for (count = 0; value != 0; count++, value &= value - 1)
        ;
    return count;
}
What you see there is the insane amount of complexity to create a high level feature of C: functions, plus compiler optimizations.
What I don't understand about this argument is that you are calling C a high-level language because of compiler optimizations. I can write code in assembler, or in LLVM SSA, and still use software to optimize it beyond recognition.
HLSL isn't a low-level language; indeed, HL stands for high level. It's not much different from the CPU situation: the runtime compiles it down to a standard bytecode, and the driver translates that to the GPU's proprietary native code.
> 403 Error - Access Forbidden: "We have temporarily restricted your access to the Digital Library. Your activity appears to be coming from some type of automated process. The restriction will be removed automatically once this activity stops."
It's still on after 9 hours from the OP. I suppose someone in ACM management is obsessed with their intellectual property being stolen, including their public articles. Good luck to them.
A call is not a function; functions have parameters and return values, and you’ll need at least one local variable. And you need to save the values of all the registers if you don’t want the called function overwriting them.
You’ll need to manually push that all onto the stack with each call. There’s no compiler to do all that work for you.
If I have just one parameter in eax/rax and return value in eax/rax, why would I need a stack frame? It's not like any registers except eax/rax (and flags) are modified anyways. And that function doesn't call anything else that could modify eax/rax.
A call is a function as long as caller and callee agree on the calling convention. Non-exported functions don't necessarily need to adhere to platform ABI.
Generally you only push registers onto the stack when you need to modify more variables than fit in your calling convention's scratch ("thrashable") register set, when you need to preserve register contents across a call to another function, or to pass function call parameters on the stack.
I do embedded systems & low level drivers as my dayjob. Not a stranger to writing assembler routines.
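For what it's worth, a trivial leaf function illustrates the point (a sketch assuming the standard SysV x86-64 ABI rather than the custom eax-in/eax-out convention described above): at -O2, compilers emit no stack frame at all for something like this.

/* A leaf function: no calls, one argument, one return value.
   Typical x86-64 output at -O2 (SysV ABI: argument in edi, result in eax)
   is roughly:
       mov  eax, edi
       imul eax, edi
       ret
   No push/pop, no frame pointer, nothing spilled to the stack. */
int square(int x)
{
    return x * x;
}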
Ah yes, David Chisnall. Another Cambridge wannabe without a hope of tenure track who thinks he is cleverer than he really is and makes a bunch of trite points over and over hoping to get some attention -- not realising they've been made for over 20 years. Have an original thought David, and stop feeling smug. You're not.
It's all relative. Lower level than what? Higher level than what? C is lower level than a huge number of other languages so I would feel comfortable calling it 'low level'.
Relative to dozens of years of "portable assembly" and "C makes you understand how a computer works" and "C is efficient because it maps to almost 1:1 with CPU operations" and a jillion of related claims.
Depending on your hardware that is still the case. There are plenty of embedded systems where these claims still hold. It's not really C that has changed (though the language has evolved a little bit), it's the hardware that changed and the implementation of the language.
That's the main thrust I got from the article. It does depend on the hardware, and for mainstream desktop and server hardware it no longer maps well to what the machine is doing.
The author isn't blaming C. C has stayed largely the same. The author is saying that Intel and AMD have - unlike PIC, AVR, and such - hidden the machine from C so thoroughly that it's no longer a low-level language for that platform.
The title is not a well-formed statement. It all depends on what you are used to. I.e. if I write Java, C is low level. If I write assembler, C is high level.
This doesn't make any sense. This would mean that my C code compiled for a Cortex-M0 is low level, but for my x86 laptop is not. Or even more stupid, that the same assembly code running in an old 386 is low level, but for an i7 isn't.
Low level is about how close to talking to the CPU you are, not about how close to the silicon you are. The CPU is a black box and the programmer communicates with it. What that box does inside doesn't matter.
As far as I understand C, most of what are here called "quirks" have actually been enablers for much of the portability and performance of modern platforms. Therefore I don't like "undefined behavior" and the like being criticised as such a "hindrance". I hence doubt the author's familiarity with C goes much beyond the basics, which kind of makes the case for why the author also had to namedrop Spectre and Meltdown, which were caused by the fact that later optimizations were unsound, i.e. the Tomasulo algorithm.
The problems with the article somewhat remind me of the problems with LCTHW, and the author of LCTHW themselves admitted they were unable to figure out what the deal was. https://zedshaw.com/2015/01/04/admitting-defeat-on-kr-in-lct... Sorry to re-repost this article again. I just perceive variants of the same "smells" in both.
> David Chisnall is a researcher at the University of Cambridge, where he works on programming language design and implementation. He spent several years consulting in between finishing his Ph.D. and arriving at Cambridge, during which time he also wrote books on Xen and the Objective-C and Go programming languages, as well as numerous articles. He also contributes to the LLVM, Clang, FreeBSD, GNUstep, and Étoilé open-source projects, and he dances the Argentine tango.
If tango experience isn't enough to make his opinion credible, I imagine being an LLVM and Clang contributor are pretty good qualifications.
I don't understand how someone who ended up working on a C compiler feels intimidated by the standard(s) such software needs to adhere to. And if such people work on a C compiler, aren't we logically in for a ride of WTFs?