
I’ve been thinking about this lately. Can you actually be faster than C? Like, in the sense that you can transpile any bit of Python or Lisp or Haskell or Rust or JS into C, but the opposite isn’t necessarily true, because not all of those languages support all the features C exposes (goto, no bounds checking, pointer arithmetic, etc.). So while any algorithm for, say, parsing JSON can be expressed just as efficiently in C, a clever and hacky C-specific algorithm cannot necessarily be expressed in those higher-level languages.

In other words, is “faster than C” even a good metric if all it means is that “something implemented inefficiently in C will be faster than the same thing implemented efficiently in not-C”?



> Can you actually be faster than C?

Sure, in any language that provides more semantic information than C does. For example, D enables a "pointer to immutable data" type, while C does not. This can improve optimization.

On a pragmatic note, C makes it easy to use 0-terminated strings, and clumsy to use length-terminated strings. The natural result is people use 0-terminated strings.

0-terminated strings are inefficient because of the constant need to strlen them. When I look to optimize C, I find plenty of paydirt in all the strlen instances. D makes it easy to use length-terminated strings, and so people naturally prefer them.


> For example, D enables a "pointer to immutable data" type, while C does not.

Wait, isn’t that what

  const int *x;
does? I.e. a pointer to a constant.


In C, the value pointed to by `x` cannot be changed via a write through `x`, but if there is another pointer to that value, it can be changed through that.

In D, `immutable(int)* x` cannot be changed by any reference to the value.


Isn't that what 'restrict' is for? It's a new type of footgun of course because the compiler doesn't detect if the value is actually accessed through another pointer, but by simply ignoring that possibility via 'restrict', the compiler should have the same optimization opportunities, no?


> but by simply ignoring that possibility via 'restrict',

If you deceive the compiler like that, it is entitled to hand you a broken executable :-/

Besides, I forgot to mention that D immutable data is inherently thread safe, no synchronization required. And you can have as many live references to it as you like.

Immutable data is part of D's support for functional programming.


Well yeah, sure. I wasn't picking on D, just pointing out that C offers an escape hatch, no matter how dangerous that might be ;)


restrict is so rarely used that LLVM has had open code generation bugs around it for years. Rust would like to use its non-aliasing guarantees to generate better code, but LLVM is unable to reliably generate correct code.

So even if you manage to reason correctly around `restrict` in C, you can't count on the compiler to translate your code correctly.

GCC also had bugs around restrict, but I don't know about their current status.


I haven't found the compiler ever optimizing anything differently because of constness. The various escape hatches mean it is really hard for the optimizing compiler to make certain assumptions, especially if anything is connected to something with external linkage. A function with a const struct arg whose address you never take? The compiler emits a memcpy!


You might be right but if it's just a technicality and in practice nobody uses restrict the way you said, would anyone care about the theoretical superiority of C? I'm sure some would, but they wouldn't be the majority.


restrict is used so rarely in C code that the Rust team hit a lot of bugs when they tried to give LLVM all the aliasing info the Rust compiler has.


You’d think that C could do any appropriate optimizations as well as D, since the C compiler (in theory) should know whether there can exist (in any given program) any writeable pointers to the same value or if there only can exist pointers to const.


Only inside the same compilation unit.


Doing it across the whole compilation requires LTCG, but that does exist as well. It just makes incremental builds a pain with large binaries.


Pointer arithmetic would surely make all values mutable. Or even simple array access.


Yes, but the compiler should know if there is any pointer arithmetic present in the code, and whether any such pointer arithmetic could, possibly, result in a pointer pointing to the const data.


Only if it points to writable memory. (Not all program data resides in RAM.)


    const int world = 42;
    const int const * const hello = &world;
Apparently, you can be very const and `gcc -Wall -Wextra -std=c99` won't raise any complaints.


Firstly, isn’t that a syntax error? There’s a stray “const” in there. You probably meant

  const int * const hello = &world;
Secondly, what should the compiler complain about? You have a const int, and then a const pointer to const int, pointing to that first const int. What’s the problem?

Thirdly, the latest C version supported by GCC is “-std=c17”.


Nah const is fun

  int main()
  {
    const int world = 42;
    const const const int const const * const const hello = &world;

    return 0;
  }
Is a valid program

https://onlinegdb.com/SJcwJrRzd


Sure, but those extra “const” do not mean anything, and are apparently ignored by the compiler.


> In D, `immutable(int)* x` cannot be changed by any reference to the value.

even if I use a magnetic needle to change the bits in my RAM ? :)


There's an easy answer to that: using a magnetic needle to change the bits in your RAM violates the assumptions of the compiler, leading to undefined behaviour. What exactly happens in that circumstance is just that: undefined.


To put that another way: hardware failures are beyond the scope of the compiler.


Maybe your compiler.


I think the point of the parent is that in principle you can always implement those optimizations by hand in C although it might of course be impractical.


If you go down this road, then you can always drop down to assembler to be even faster than C.

I don't think this is a reasonable argument. Every Turing-complete language that gives you direct access to the metal, so to speak, provides you with the opportunity to implement these optimizations by hand.

I think it's much better to look at average C code here. And there C has a tremendous advantage in compiler support: C compilers have decades of optimization put into them. It will take a while for other languages to catch up to that.


The three D compilers use the GCC backend, the LLVM backend, and the Digital Mars backend, respectively. All the decades of optimization in them are taken advantage of by D. If you write C-style code in D, you'll get the same code generated as if you wrote it in C.


But isn't that a consequence of D "mimicking" C to some extent?

Disclaimer: my compiler background is the Java compiler and its bytecode generation, but I would expect both gcc and clang/LLVM to have lots of optimization work hardcoded around how C programs are written. If you deviate from that, you potentially get less well optimized code.


> But isn't that a consequence of D "mimicking" C to some extent?

It's true that multi-language backends will inevitably cater to the semantics of the dominant language.

For a related example, I discovered that gdb doesn't pay much attention to the DWARF symbolic debug spec. It pays attention to the particular way gcc generates symbolic debug info. Other patterns, while compatible with the spec, don't work with gdb. Similar problems have happened with linkers.

It's a matter of "do as I do, not as I say".

Since I am the author of DMD's backend, I've been putting in optimizations that do cater to D, rather than just C and C++. For example, in the last few days I started putting in optimizations to take advantage of D's new "bottom" type:

https://github.com/dlang/dmd/pull/12245


> in principle you can always implement those optimizations by hand in C

Some things, like integral promotion rules, you can't get around. You may know that the promotion can be skipped in certain cases, but the compiler cannot know it.


Leading to eye-rolling problems like these: https://github.com/biojppm/rapidyaml/issues/40


Because C is not machine code or even assembly language, it must be compiled. Not only might the compiler fall short of fully improving the human-written code, it might not exploit the machine's full capabilities if it does not understand the complete instruction set available, or other things like memory and the different levels of cache.

The compiler might not even provide access to these capabilities to a programmer who knows about them and wants to use them explicitly. In such cases, the programmer might have to use assembly, or Fortran, C++, or even something much higher-level than C, with a compiler that provides access or knows how to "intuit" when those capabilities are useful.


Compiler engineer here. In practice, compilers for higher-level languages often have a lot of difficulty getting anywhere close to the efficiency of comparable C code. If you take Python, for example, you have to do a lot of inlining to eliminate various abstractions. Inlining is actually non-trivial. Yes, inlining, by itself, is an easy program transformation, but knowing where to inline to get the best performance is very hard. If you inline too much, you increase code size, and you lose performance. You have to know precisely which inlining decisions will pay for themselves, and your large codebase might have tens of thousands of call sites and a call hierarchy 40 functions deep. Python also adds the fun little problem that you can redefine any function at run time, which makes it hard for the compiler to know which function you're actually going to be calling. To complicate things further, inlining decisions affect other optimizations. For example, if you inline foo into bar, do you then also inline into bar the functions that foo is calling? Do you unroll the loops from foo into bar? Etc.

Also, there's an aspect that I feel is constantly overlooked, and this is the way that objects are laid out in memory. In Python, JavaScript, Ruby, you have a lot of pointers to objects. You get a lot of pointer-chasing as a result. This is BAD for caches. Each object you touch, each pointer you dereference means pulling in a new cache line. In C, you can design very compact and flat data structures. You can use 32-bit floats, 8-bit, or even one-bit integers if you want. You can have a struct within a struct, with an array inside of it, all without any pointer-chasing.

Modern CPUs are constrained by memory bandwidth, and it's very hard for any programming language to beat C on achievable memory efficiency. What's worse is that we have little to no academic compiler literature on automatically optimizing for memory efficient data layouts. It's an overlooked and understudied problem. You would have to prove that integer values lie within certain ranges (hard) and also do object inlining (collapsing objects into parents) which AFAIK is also hard and not done by any mainstream compiler.

So, yeah, keep thinking that a sufficiently smart compiler will do everything for you. You will assuredly be very disappointed. Until we have strong AI, the sufficiently smart compiler is basically unobtainium. If you want efficiency, the best route is generally to have fewer layers of abstraction, or to rely only on compiler optimizations you know for certain will happen.


The sufficiently smart compiler is effectively unobtainium, but that presents a challenge for C as well. C has its own challenges with its memory model and language semantics.

C never was the lowest level of abstraction; there are other abstraction models out there and more still to be invented, no doubt. C's model did well (though it struggled mightily against Fortran for the longest time) at aligning with processor models of the '80s and '90s. It helps that C's ubiquity meant that, to a degree, the tail was wagging the dog: processor designs were often measured against how they executed code compiled with a C compiler. But past performance is a poor predictor of future success; who is to say that as processor designs continue to evolve, C's abstraction model won't become increasingly leaky against another? Absent a sufficiently smart compiler, it's entirely possible that C compiler writers will find themselves at a disadvantage.

And that assumes they're competing with a traditional compiler. It's possible, though unlikely, there will be competition from other execution models. As you said, memory is often the core performance bottleneck these days, and as terribly inefficient as bytecode interpreters are, they tend to have smaller code sizes. Efficient interpreters are hand-tuned specifically to make the overhead of bytecode interpretation as low as possible. Intrinsically they are performing a translation step at runtime that a compiler did before runtime, but one can at least theorize a model where the interpreter is effectively a specialized decompression algorithm that feeds machine code to the CPU (that's really not that far afield from what happens in hardware in modern CPUs and in mobile runtimes). Higher levels of abstraction might allow for more efficient decompression... It's crazy, but not inconceivable.


Well said.

I've had a lot of conversations with javascript engineers over the years who've argued to me that well tuned JS will be nearly as fast as the equivalent C code. I've written plenty of little toy benchmarks over the years, and in my experience they're partly right. Well written JS code in V8 can certainly run fast - sometimes around half the speed of C code. But a massive performance gap opens up when you use nontrivial data structures. Nested fields, non-uniform arrays, trees, and so on will all cripple javascript's performance when compared to C's equivalent of simple embedded nested structs. If you couple clean C data structures with allocation-free hot paths from arenas, the performance of C will easily hit 10-20x the performance of the equivalent JS code.

From memory, my plain-text-based operational transform code does ~800k transforms/second in JavaScript. The equivalent C code does 20M/second. The C implementation is about twice as many lines of code as the JS version, though.

(The code in question: https://github.com/ottypes/text-unicode/blob/master/lib/type... / https://github.com/ottypes/libot )


More like "well tuned JS will be nearly as fast as poorly tuned C code"...


True. But given the choice, I’d usually rather have well-written JavaScript than badly written C. And there are a lot more decent JS programmers out there than decent C programmers.


Fair. I'd rather neither, and take Python, or even (God forbid) PHP. LOL


C doesn't tell you anything about your cache efficiency, which is to first order the only thing that affects your program's performance.

You're right that flat datastructures are important, but C is far from the only language that can offer those.

I don't think C can ever be the answer; "comparable C code", i.e. code that was produced with the same amount of developer effort, is almost always undefined behaviour that happens to work most of the time on the developer's compiler. C doesn't have the kind of polymorphism you need to write code that's compositional, testable, but also has good cache locality, at least not for code that also has to interact back and forth with the outside world. There is at least some work in the literature on making efficient use of cache in a Haskell/ML-like evaluation-by-reduction model, and I think that's where the long-term way forward lies (and FWIW, in my experience, compilers for ML-family languages are already more than smart enough).


I'm not arguing that C is the ultimate programming language for maximum performance or anything like that. Simply that it gives you a lot more freedom to tune your implementation where it matters for performance.

For HPC, you'd likely be better with something designed to run on GPUs, with built-in primitives for matrix operations, and a compiler that can tune the way those operations get implemented for you.

However, when it comes to cache locality, if the C programmer knows what they are doing, I think you'll have a very hard time beating them with Haskell or ML. AFAIK those languages have the same pointer-chasing issues I've described earlier. If you need to design a complex data structure, for example a scene graph in a 3D game engine, it will be much easier to make it compact in C.


> However, when it comes to cache locality, if the C programmer knows what they are doing, I think you'll have a very hard time beating them with Haskell or ML.

A programmer who knows what they're doing is a rare thing indeed. On actual business problems with real-world programmers Haskell consistently beats C, IME.

> AFAIK those languages have the same pointer-chasing issues I've described earlier.

There's a pragma to flatten strict record fields, or a compiler flag to do that by default (at least in GHC).


Thank you for the good explanation! I'm in embedded, and my experience is that this stuff is almost totally ignored in universities. People have to go through a steep and (for us) expensive learning curve until they get a feeling for the whole system and what makes it fast. And then I wonder: is that asking too much? Can we expect knowledge of a gazillion layers smeared on top of each other from everybody who just wants to deliver value from the topmost layer?


> Inlining is actually non-trivial.

OTOH, JIT runtimes have more input data than a C compiler. They can implement a runtime equivalent of C++ profile-guided optimization: measure what actually happens at runtime, assume the input data is going to stay roughly the same, and regenerate machine code with this new information into something more efficient.

Pretty sure modern Java does that sometimes.

> In Python, JavaScript, Ruby, you have a lot of pointers to objects.

In C# you can make data structures which don’t use pointers much, or at all. The language is strongly typed, has value types, and makes it easy to place structures inside other structures. There’s a native stack just like in C, and even a stackalloc keyword, which is the equivalent of alloca() in C.


> In C# you can make data structures which [...]

Yeah, but that completely validates my point. C# is not Python or JS. It's a (remote) cousin of C which tries to take some of the valuable performance tools from C and bring them to a managed runtime. Because it's strongly typed, it's a lot easier for the compiler to optimize, and because you have all these tools to design compact objects without pointers, you can do that job so the compiler doesn't have to.

And again, an experienced C# programmer can probably write code that runs circles performance-wise around code written by an experienced JS/Python developer in most cases.


> It's a (remote) cousin of C which tries to take some of the valuable performance tools from C and bring those to a managed runtime.

That’s correct. But at the same time, the language is way higher level than C or C++.

> experienced C# programmer can probably write code that runs circles performance-wise around code written by an experienced JS/Python developer in most cases.

In some cases, an experienced C# programmer can even write code which approaches or outperforms C. My Linux video player library https://github.com/Const-me/Vrmac/tree/master/VrmacVideo#per... uses CPU on par with VLC, and 20% less RAM.


> JIT runtimes have more input data than a C compiler

They have more data, but they are at a disadvantage in being time-pressured. They can't apply costly analyses to that data because they need to compile fast, so they typically limit themselves to local analysis, which can miss a lot of inlining opportunities.


> They can't apply costly analysis to these data because they need to compile fast.

True in general, but they use quite a few tricks to minimize the consequences.

They use an interpreter, or a very unoptimized but fast JIT tier, the first time a function is called, and replace it with a faster version once it's clear the function is called a lot.

Unlike C compilers, they don’t need to do that analysis globally for the whole program. They only need to do that for the hot paths, that’s often a small portion of the code.

They can offload that analysis and code generation to another CPU core, and replace the implementation once that background thread has produced a faster version.


> They only need to do that for the hot paths, that’s often a small portion of the code.

That's often correct, but unfortunately codebases today can be very, very huge. It can take a lot of effort to optimize even just the hottest 10% of the code if the product is several hundred MB of compressed bytecode. There are also applications with no obvious hotspots, only flat profiles: e.g. database systems, where most of the time is spent transferring data between the various layers of the system. If requests pass through most of the layers, can be routed to different areas depending on the query type chosen at runtime, and clients are allowed to send queries of various types targeting different submodules of the server, there will be no hotspots. In these cases warmup can take an enormous amount of time.

Even for a server this can be a problem, because after restarting you get an immediate performance hit.

Also keep in mind many software products are not long-living backend processes that can warmup for hours or even minutes. Client apps need good responsiveness, and before JIT even realizes which code to compile, it is already too late.


I think what you wrote largely applies to Java and especially JavaScript, much less to C#. Value types, real generics, and native stack allow even the faster version of the .NET JIT to produce native code that’s not too horrible performance wise.

Good enough for desktop or embedded use cases, even on slow CPUs. I have 3 such devices on my desk, Raspberry Pi 4, a dev.board with Rockchip RK3288, and a tablet with Atom Z3735G, .NET is reasonably fast on all of them, without noticeable warmup issues at startup.


I was talking about JITs in general. Sure, you can AOT-compile .NET code quite efficiently, but then this is a different story.

Here is a nice analysis of how various JITs warmup in practice:

https://tratt.net/laurie/blog/entries/why_arent_more_users_m...

TL;DR: often they don't!


> unfortunately codebases today can be very, very huge

why is that?


Actually, action for action, managed languages tend to be "faster" than C, in the sense that writing the equivalent program would be slower in C. However, the additional freedom leads to increased program complexity, and that complexity eats enough of the performance budget to end up slower than "idiomatically" written C, where the developer has distilled the solution to its bare essence.

Javascript has the fastest general-purpose hashmap/dictionary implementation of any programming language, but at the same time you are forced to use it all the time, which slows the language down overall. Writing exactly the same code in C would be even slower, since C hashmaps aren't as optimized. However, C code is rarely written like that, so it's usually faster.


I don't know about LLVM but GCC has had something like this for years as a standard optimization feature. You create a binary with special profiling code which writes to a file. After running the program a few times, you recompile with optimization that uses this file. I forgot the flags though.

Personally, I'm more interested in executable optimization. It decompiles an executable, performs whole-program optimization, and re-compiles it. I'd love to tinker around with optimizing other people's binaries (e.g. games) for my machine. There is something like that for LLVM but it's very experimental.


It's called profile-guided optimization.

And it fails for C when the source data changes significantly, compared to the inputs used by developers when they were building the binary.


Yes indeed.


> You would have to prove that integer values lie within certain ranges (hard)

Don't JavaScript JITs rely heavily on this to reduce JavaScript's floating-point arithmetic to integer arithmetic?

Not to say it's easy, but hasn't a lot of work been done on this?


Sometimes you can prove that values will remain integer when iterating over array indices for example. However, in the general case, when doing integer additions or multiplications, they need to insert dynamic integer overflow check instructions everywhere. When the integers would overflow they have to be converted to floating-point values.

Think about it for a second. Suppose I have a loop where I'm multiplying values:

  function foo(x, y, n) {
    for (var i = 0; i < n; ++i) {
      x = x * y;
    }
    return x;
  }

In order to know that the multiplication won't overflow, you have to be able to prove that both x and y, coming into that function, will be integers. You also have to have information about the ranges of x, y, and i. If you know that x >= 0 and y >= 0 but you don't know at compile time how many times the loop will execute, you are basically screwed. You could unroll the loop to reduce the number of dynamic checks, but that increases your code size. So you mostly have to check that the multiplication doesn't overflow on every iteration. You may also have to do dynamic type checks if you can't prove that x, y, and i will always be integers.


FORTRAN is faster for many tasks, and is probably more popular in high performance computing.

Also tasks that can be moved to the GPU go a lot faster. You can interact with those programs in C, but not natively. But some languages, like Julia, can easily move calculations to/from the GPU. And also can transparently take advantage of parallelism.

Julia is in the process of growing rapidly for high performance computing. I don't know if it has officially passed C there. But if not yet, it will.


To add onto your point: from what I've heard the split at HPC conferences is about 40% C to 60% Fortran.


In my limited experience, almost every language specific presentation at an HPC conference is about C++. New projects and initiatives like DPC++, SYCL/oneAPI, Ginkgo, etc are C++ based. And if you look at the open source libraries of the national laboratories, e.g. LLNL, C++ libraries are more common than C libraries, which in turn are more common than the Fortran libraries.

If anything I'd say that C++ is the default for HPC conferences.


That sounds right.

I wonder what the current split is. I'm not in HPC computing, but I see a lot of posts about how quickly Julia is being adopted.


Python users who use numpy and scipy are actually using a lot of Fortran code.


C - and especially idiomatic C actually withholds a __lot__ of information from the compiler. People like to think C is just a portable assembler, but by god it ain't.

The reason C programs often are fast, or run faster, is that the language forces you to approach almost all abstraction head-on. This goes both ways, however: linked lists are much faster to write in C than a safe array type, so programs can end up stuck with bad data structures for too long.

Since the lingua franca of the operating system is still C or C-like, there are real optimization opportunities that go missing because of the C interface. It's hard to devise many alternatives, but if you are calling into libc for mathematics in a hot loop, it may be worth avoiding it and rolling your own, since the compiler can't really inline libc calls.


>C - and especially idiomatic C actually withholds a __lot__ of information from the compiler. People like to think C is just a portable assembler, but by god it ain't.

Well, it might not be, but the fact that it "withholds a lot of information from the compiler" is an argument in favor of it being a portable assembler, not the opposite.


> Like, in the sense that you can transpile any bit of Python or Lisp or Haskell or Rust or JS into C but the opposite isn’t necessarily true because not all those language support all the features exposed in C (such as goto, no bounds checking, pointer arithmetic, etc.), any algorithm for say parsing JSON can be expressed equally as efficiently in C, while a clever and hacky C-specific algorithm cannot necessarily be expressed in those higher level languages.

This isn't true, though. Just off the top of my head: C doesn't expose the carry flag directly, C doesn't let you write signed arithmetic with two's-complement overflow, and C doesn't let you share a value between separate compilation units without allowing it to be mutated...

"faster" or "slower" is a pretty terrible metric anyway, given that most projects do not have infinite developer time available. But it's very hard to benchmark "equivalent developer effort" between two different languages.


C loses some optimization potential because of aliasing: it can't be guaranteed that some pointer operation won't change a variable somewhere in memory, and that limits certain reordering and loop optimizations.

Languages without arbitrary pointers don't have this issue and can safely assume they know all the assignments made to values in memory, allowing those optimizations.


The restrict keyword helps with this.


That, or switching to FORTRAN. For numeric / HPC it is much easier to get a decently fast implementation there.


Provided the user knows what they are doing and never makes the mistake of lying to the compiler.


The C language has provided the restrict keyword to solve these aliasing issues since the C99 standard. That is, 20 years ago.


I think this generally doesn't have much of an impact on performance.


Just a personal anecdote, but it regularly makes a substantial (like 2x+) difference for me in tight linear algebra loops. It seems to be required if you want to coax clang to emit vfmad instructions for common C linear algebra idioms. Narrow use case, I know, but it can definitely make a big difference in some domains.


I'll take your word for it. I guess I live in a world where the code I see is rarely tight algebra loops, but I can see that in certain domains you would definitely care.


It certainly does... it's the main reason fortran was often faster than C, isn't it? Aliasing prevents automatic vectorization, among other things.

Whether or not the compiler will be smart enough to autovectorize is a different question.


The developer must also be smart enough not to make the mistake of lying to the compiler by actually aliasing restrict-qualified pointers: that is UB, and nasal demons beware.

C compilers don't validate correct use of restrict.


Not all C compilers handle even correct uses of `restrict` properly. For about three years, Rust has been unable to use the LLVM equivalent of `restrict` when downleveling its references, because it keeps finding LLVM miscompilation bugs around it.

These LLVM bugs don't get noticed with C because C programs rarely use `restrict` as pervasively as a downleveled Rust program ends up doing.


I wonder if these same LLVM issues affect LLVM fortran attempts?


I feel like they would have to, if they were to produce a remotely competitive compiler. That's why I'm hoping NVidia's Flang[1] efforts will lead to this aspect of LLVM being cleaned up.

[1]: https://github.com/flang-compiler/flang


Be careful when you say that everything can be converted to C. While that's true in a naive fashion, it gets rid of a lot of additional information.

For example, Rust can sometimes beat C because it's (1) often more friendly to auto-vectorization and (2) it has additional aliasing information.


C code that makes heavy use of callbacks in inner loops is at a disadvantage compared to languages with more powerful inlining facilities. Compare qsort() to std::sort().


Easily. C++ has ways of being faster that C can't really match - namely, templates. They can be hellish to write & debug, but generating type-specific functions is fantastic for optimization & performance. It's also a lot easier to be faster in C++ on key things than it is in C, specifically small-size optimizations. Yes you can do an SSO string or function pointer in C, but it's hard & painful to do so, so it's rarely if ever done. But it's trivial to do in C++, and since the standard library does it for both strings & functions, it's also commonly done.

Similarly in languages like Java or C#, having first-class exceptions means fewer branches & error checking on the hot path over something like C. They are on the whole slower than C for other reasons, but it's not because C is "the best" or "the fastest" at everything. And of course you can't really do de-virtualization optimizations in C.


This seems oriented around just the code bases you have seen personally rather than fundamental C limitations, and so it misleads. To clarify: C can do template-like macros for type specialization (this is common in some code bases), including in easier-to-debug ways [1] { I've personally used a variant of the [1] techniques since the mid 90s }. C can do setjmp & longjmp exceptions, and you can even write TRY/CATCH-like exception macros. Function pointers are in stdlib interfaces (e.g. qsort) and generally pretty common in C code. I believe gcc has been able to inline through C function pointers ("de-virtualize", though you qualify with "really" there..) in the same translation unit for about two decades now, with the right compiler flags/annotations.

It is true that C cannot do full template meta-programming or certain state save/restore optimizations faster than setjmp/longjmp. Hard/painful/trivial are all a bit more subjective and relative to exposure/practice/training/etc.

Personally, I think Nim [2] is a much friendlier way to go than either C++ or Python without the pitfalls of either, but the ecosystem is admittedly much smaller. Also, I've yet to hit something where re-writing the C++/Rust in Nim did not speed things up from "at least a little" to "quite a bit". { This is more an artifact of "performance incautious programming" in the C++/Rust. Too many people fall into the trap that, since a language has a reputation for speed, they don't need to think. This is probably largely why @mehrdadn's original article had the title it did. ;-) }

[1] https://github.com/glouw/ctl/

[2] https://nim-lang.org


> C can do template-like macros for type specialization

Of course, C++ has those same macro capabilities. But macros are quite limited, and typically the "template-like" ones rely on non-standard preprocessor support like typeof for swap. But then you still lack the ability to specialize swap for different types (e.g., you can't replicate std::swap's behavior on std::vector in C).

> C can do setjmp & longjmp exceptions and you can even do a TRY/CATCH like exception macros.

You can, but that's now another parameter to pass along down the call stack, and as LLVM notes https://llvm.org/docs/ExceptionHandling.html#setjmp-longjmp-... it still negatively impacts the non-exception performance path.


While C macros are limited, you underestimate their range. You can absolutely create specialized swaps without typeof (not part of the C preprocessor, incidentally), and the CTL which I linked to even has one. You just tell the macro the name of the type. And you can just invoke swap_ref or something if you want a more indirect one. There's no overloading. So, yeah, you need to know what you are operating upon, or abstract qualities thereof.

Sure, the use of these things in C is (usually) a bit more verbose/burdensome than C++. The misleading statements were performance-oriented, not syntactic sugar-oriented, and I already granted exception optimization. (Some folks, like Go/Rust authors, would tell you exceptions are bad anyway.)
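For illustration, a macro-generated, type-specialized swap along the lines described (names are illustrative, not taken from the CTL):

```c
/* Generate a type-specialized swap function; the caller supplies the
   type name explicitly, as with CTL-style macros. No overloading:
   you invoke swap_int, swap_double, etc. by name. */
#define DEFINE_SWAP(T)                  \
    static void swap_##T(T *a, T *b) {  \
        T tmp = *a;                     \
        *a = *b;                        \
        *b = tmp;                       \
    }

DEFINE_SWAP(int)    /* emits swap_int    */
DEFINE_SWAP(double) /* emits swap_double */
```

The generated functions are ordinary statics, so the compiler inlines and optimizes them like any other C code.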


> [..] Like, in the sense that you can transpile any bit of Python or Lisp or Haskell or Rust or JS into C but the opposite isn’t necessarily true because not all those language support all the features exposed in C [..]

How would you write this (admittedly contrived) Rust function in C without invoking UB:

    pub fn foo(a: &mut i32, b: &mut i32) {
        let (new_a, new_b) = a.checked_mul(*b)
            .map(|new_a| (new_a, b.saturating_sub(*a)))
            .unwrap_or((10, 20));

        *a = new_a;
        *b = new_b;
    }
For those unfamiliar with Rust, it multiplies a by b, and if it didn't overflow:

* a = a * b

* b = b - a, saturating at the minimum value (as in, it won't wrap, it just stops there)

If it did overflow:

* a = 10

* b = 20

And finally, does it compile better: https://godbolt.org/z/66n8W9


Something like this? https://godbolt.org/z/6nod5e. It even produces almost identical assembly.

The equivalent to checked_mul is __builtin_mul_overflow, which is a compiler builtin: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.... Similarly, saturating_sub seems like it can be implemented with __builtin_sub_overflow.
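A hedged sketch of what that C version might look like with those builtins (non-standard, gcc/clang/icc only; intended to mirror the Rust semantics above):

```c
#include <limits.h>

/* Mirrors the Rust foo(): a = a*b and b = (b-a) saturated, unless the
   multiplication overflows, in which case a = 10, b = 20.
   __builtin_*_overflow store the result and return whether it wrapped. */
void foo(int *a, int *b) {
    int new_a, new_b;
    if (!__builtin_mul_overflow(*a, *b, &new_a)) {
        if (__builtin_sub_overflow(*b, *a, &new_b))
            new_b = (*a > 0) ? INT_MIN : INT_MAX;  /* saturate */
        *a = new_a;
        *b = new_b;
    } else {
        *a = 10;
        *b = 20;
    }
}
```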


Would those work on any compiler, or are they compiler-specific? If those are non-standard compiler-specific extensions, not even a library, is it truly a part of C++?

While I'll grant that Rust only has one fully functional compiler at this time, those functions have been part of Rust's corelib since 1.0. Any Rust compiler would have to support them.


They're not part of the standard, but gcc, clang, and icc (Intel) support them. Alternatively, you could emit the correct instructions in an asm block on supported architectures (x86: 'jo'/'jno' for jumps, 'cmovo'/'cmovno' for conditional moves [1]) or reimplement the operations in software [2]

[1] https://www.felixcloutier.com/x86/jcc and https://www.felixcloutier.com/x86/cmovcc

[2] https://wiki.sei.cmu.edu/confluence/display/c/INT32-C.+Ensur....


Rust already does better than C for some benchmarks. See: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


As the paper showed, even Python can be better than C at carefully selected tasks.


No, the paper showed that use of a particular ecosystem default in C++ led to algorithmically worse performance than use of an ecosystem default in Python.

You can always beat C if C happens to implement a bad enough algorithm and your n is large enough. That's no guarantee that the performance ∆ will continue to hold once the two languages implement the same algorithm.


I've thought about this very question for years. My answer has been, usually, "no" for all cases that do not allow the developer to write straight assembly.

That is, until things like C++'s stackless coroutines came about, which are a construct intrinsic to the compiler and not functionality directly exposed by C.

Further, any machine code language is going to allow you specific instruction access that a compiler might not otherwise utilize (rare, but it happens). In such cases you can gain 'manual' speedups over what C could allow you to do. I would hope that is the obvious exception, however.

But you are asking a very good question not a lot of developers are willing to think much about.


IME "faster than C" without going into a lot of details what manual optimizations have gone into a specific piece of the C code doesn't mean much. Writing C code doesn't automatically give you high performance, but it gives you quite a bit of wiggle room to manually optimize the code.

In my mind, different languages have different "optimization sound barriers", where further manual optimization becomes impossible, or at least runs into diminishing returns. For instance Javascript can be made nearly as fast as "vanilla" natively compiled C code, but that means writing asm.js by hand, which is entirely different from "idiomatic Javascript". It's not worth it writing asm.js by hand, when C compiled to asm.js (or WASM nowadays) is much more maintainable.

Same with garbage-collected languages in general. You can write fast code with very little overhead from the garbage collector, but this means you need to know exactly how the garbage collector of that language works, and the resulting code will usually look very different and will be harder to maintain than the idiomatic style of that language.

In standard C the optimization sound barrier is a bit further away than in many other high-level languages but it's not all that special, on the other hand C with compiler- and CPU-architecture-specific extensions is pretty good (e.g. extensions can push the sound barrier quite a bit further away).


Well, generally optimized numerical libraries like BLAS are faster than what you write yourself. But a problem with calling them from C is that you don't get loop fusion: each call has to go through the data before the next one starts. That isn't true with Haskell's lazy evaluation, so it can be faster in some cases.

Of course, C++'s Eigen does loop fusion for you with template magic, so to go really fast you probably want that.


> Can you actually be faster than C?

Sometimes. I know 2 reasons.

1. In some higher-level languages, some problems can be efficiently solved with code generation. For instance, you can take some of the input data, generate code from that, then run the generated code processing the rest of the input data. Examples of the mechanism include Lisp macros, or .NET shenanigans like System.Reflection.Emit or System.Linq.Expressions.

It’s hard to generate good machine code. It’s possible to do in C, but you’re going to need to embed a C compiler for that. They are extremely complex, and were not designed for the use case e.g. relatively slow (designed to run offline on fast developer’s computers, or on even faster build servers).

If your problem is a good fit for that approach, C# can actually be faster than C.

Parsing JSON is a good example. When your goal is not just parsing the syntax, but also populating structures with the data from the JSON, good C# JSON parsers will runtime-generate serialization code by reflecting on the structure types being populated. For an isolated program, technically you can generate equivalent C code offline. But if you’re making a general-purpose JSON serialization library, that’s borderline impossible to achieve in C; you need reflection and code generation.

2. Modern computers are heterogeneous; they have 2 chips computing stuff, CPU and GPU. Even when these pieces are on the same chip and use the same RAM, like Intel’s UHD graphics or AMD’s APUs, the GPU can still be faster for some problems. Not only because of more GFLOPS (that’s not necessarily true for integrated graphics), but also because GPUs have a better strategy for dealing with RAM latency: they switch threads instead of waiting for the data to arrive. CPU cores only run 1-2 hardware threads each, i.e. they are limited in that regard.

That’s how for some problems HLSL or OpenCL can actually be faster than C.


> They are extremely complex, and were not designed for the use case e.g. relatively slow (designed to run offline on fast developer’s computers, or on even faster build servers).

in 2021, bundling clang along with your program is actually reasonable - if you are compiling small functions without two tons of headers, it's measured in milliseconds.


I never tried to, but I think integration of runtime-generated native code is going to cause overhead.

Where do you place these functions once compiled? Into a separate DLL each?

In .NET it’s quite easy to generate code that calls manually written functions, or accesses data provided by manually written stuff. The JIT runtime doesn’t treat generated code as something special, e.g. it may inline calls across runtime-generated and manually written pieces of the program. With clang you’re going to need a layer of indirection to integrate, with function pointers and such.


LLVM can just put the compiled code at some place in your RAM, and then you can just execute it.


Processors need addresses of things. Look at the following code: https://godbolt.org/z/PYEqKn. Note that the function uses another symbol, “func.counter”.

Shared libraries include relocation tables https://en.wikipedia.org/wiki/Relocation_%28computing%29 with all the code locations which need patching. That’s how OSes can load them into arbitrary locations in memory and the code will still work.



Interesting, I didn’t know llvm is that flexible.

Still, LLVM is a huge dependency to redistribute. And it probably has many points of failure. For instance, I expect you’re going to need to watch for the INCLUDE and PATH environment variables when using that thing.

In .NET everything is in the standard library. The generated code is even platform-independent, for no additional overhead. Here are a couple of non-trivial examples: https://github.com/Const-me/ComLightInterop/blob/master/ComL... https://github.com/Const-me/ComLightInterop/blob/master/ComL...


> Still, LLVM is a huge dependency to redistribute.

But .NET isn't?


> But .NET isn't?

Indeed.

.NET runtime takes about 30 MB, or 50 MB if you're on Windows and need desktop components. Links there: https://dotnet.microsoft.com/download/dotnet/5.0

clang+llvm package takes 300-400 MB, that's almost an order of magnitude difference: https://github.com/llvm/llvm-project/releases/tag/llvmorg-11...


> clang+llvm package takes 300-400 MB, that's almost an order of magnitude difference

I ship a statically compiled llvm + clang with my software and it does not add 300-400 MB to the binary at all, only something like 20-30 MB.


If you’re willing to compile, ship, and support custom builds of the .NET runtime, it’s going to be less than 20-30 MB.

coreclr.dll (the runtime) is 4.9 MB, clrjit.dll (JIT compiler) is 1.3 MB. The rest is mostly standard libraries, the largest piece of that is System.Private.CoreLib.dll at 9 MB. The numbers are for Win64 version of .NET 5.0.2.

Another thing: for .NET apps the runtime is required anyway, so the overhead for runtime code generation is zero.

For C++ apps, llvm + clang (or an equivalent) is only needed on developer’s computers, not something you’d normally ship.


> For C++ apps, llvm + clang (or an equivalent) is only needed on developer’s computers, not something you’d normally ship.

unless you want to JIT-compile C++ code at runtime which is the original point.


I'll give Terra[0] as an example for something relatively high-level that uses LLVM as a JIT. It can also be used as an AoT compiler with fancy macro stuff in place of the C preprocessor.

[0]: http://terralang.org/


> Can you actually be faster than C? Like, in the sense that you can transpile any bit of Python or Lisp or Haskell or Rust or JS into C but the opposite isn’t necessarily true because not all those language support all the features exposed in C (such as goto, no bounds checking, pointer arithmetic, etc.),

I'm not sure why you think that. Especially with Rust, one of the reasons it can be theoretically faster is that it allows for even more undefined behavior and has a variety of datatypes that simply exist to inform the compiler of potential optimizations.


Sure, theoretically.

For instance, C function calls always do some things (like saving registers and setting up a stack frame) that may not always be necessary.

I sure wouldn't want to TRY to beat a C compiler, but it seems obvious to me that it is possible.


> you can transpile any bit of Python or Lisp or Haskell or Rust or JS into C but the opposite isn’t necessarily true because not all those language support all the features exposed in C (such as goto

Turing-completeness tells you that it is in fact necessarily true that you can convert any bit of C into a corresponding bit of Python, Lisp, or Haskell. One obvious approach would be to emit code that implements a C runtime.

For goto in specific, you don't even need to do that. You don't need a goto keyword to implement goto functionality.
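As a sketch of that last point: any goto-based control flow can be mechanically rewritten as a loop over an explicit state variable. Shown here in C itself for concreteness (the target language would use its own loop construct the same way):

```c
/* goto version:                     goto-free equivalent below:
 *   start: if (n <= 1) goto done;   a loop over an explicit state
 *          n /= 2; ++count;         variable expresses the same
 *          goto start;              control flow without the
 *   done:  return count;            goto keyword.               */
int halvings(int n) {
    int count = 0;
    enum { START, DONE } state = START;
    while (state != DONE) {
        if (n <= 1) state = DONE;
        else { n /= 2; ++count; }
    }
    return count;
}
```

Whether the translated form compiles down to the same couple of jump instructions is the performance question raised in the reply below.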


Let’s be careful here. All these languages are Turing complete. Heck, isn’t CSS Turing complete now? But insofar as goto in C produces a small number of CPU instructions (1? 2?), can all those other higher-level languages do the same? Or will something like JavaScript need a callback-based solution that does several pointer lookups, memory allocation, etc.?


> One obvious approach would be to emit code that implements a C runtime.

how could you negate the python interpreter startup time, though?


Are we talking about compiling or performance? You negate the Python interpreter startup time by changing your implementation of the Python binary, but that isn't relevant to the problem of producing compiler output in the Python language.


C also lacks a few features which can provide better performance or provide hints to the compiler:

computed goto, non-aliasing guarantees, actual "const" - just a few off the top of my head.

Generic code is generally faster in C++ unless you manually reimplement it in C.

Rust can be faster than C in a few cases because of stronger aliasing guarantees, but a few compiler bugs prevent it.

Doing some good-performance stuff is also more difficult in C. Strings are null-terminated, etc.
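On computed goto specifically: it exists only as a GCC/Clang extension (`&&label` and `goto *ptr`), commonly used for dispatch in bytecode interpreters. A minimal sketch (a toy interpreter invented for illustration, not standard C):

```c
/* GCC/Clang labels-as-values extension: each opcode handler jumps
   directly to the next handler, avoiding a central switch and giving
   the branch predictor one indirect branch per handler. */
int run(const unsigned char *code) {
    static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
    int acc = 0;
    const unsigned char *pc = code;
    goto *dispatch[*pc++];
op_inc:  ++acc; goto *dispatch[*pc++];
op_dec:  --acc; goto *dispatch[*pc++];
op_halt: return acc;
}
```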


Sure, managed languages can beat C performance. IIRC, the JVM can reorder branches on the fly to avoid jumps.


Reordering or hinting branches has virtually zero impact on branch prediction performance on modern architectures with advanced hardware branch prediction.

What's much more important for good performance is memory layout and good use of CPU caches. And in this area managed languages struggle a lot. For instance, every object in Java has 16 bytes of overhead for an object header (on the 64-bit OpenJDK JVM). Or you can't organize objects in a flat array. Or there are some guarantees about memory zeroing which often lead to needless memory writes. Or you have to live with GC, which often wastes a lot of additional memory and regularly brings unused-but-reachable memory into caches. Project Valhalla is going to improve some of these limitations, hopefully some day, but don't expect the level of C, C++, or Rust performance.


Per Jim Keller, the current source of speed in CPUs is better branch prediction and data-fetch prediction. It is possible C++ (or some other language) may have more meta-information to drive these better, in which case it will be faster than C/assembler.

https://www.youtube.com/watch?v=G4hL5Om4IJ4


If you use the normal toolchains for C vs another language, the other language can still be faster because it can dynamically generate bytecode based on runtime information. In the normal C toolchain, this is not possible.

However, if you used an alternative toolchain that could also generate bytecode from C at runtime, then I would bet that C would stay on top or be equal.


You can literally emit runtime assembly in C in any C toolchain, what are you on about?


How do you dynamically generate machine code without linking something like LLVM?


I believe that's actually what folk are doing these days, literally linking in LLVM and using it to compile C into machine code then executing it during runtime.


That's what I mean by "alternative toolchain," though; some of the code that you are executing at runtime isn't being generated by the normal GCC/Clang toolchain. It's being generated by your custom setup which links in LLVM.

You'd still need to compile your code with a regular toolchain, but you also need additional tools to compile, optimize, and debug code at runtime. That's what I meant by toolchain, even if it is not the conventional (static) definition.


Same way you would with LLVM: you can map executable memory without LLVM and execute it.


I think about programming languages like I do cars. While Rust may be a Ferrari, some kid who doesn't know how to drive a stick and has only had his license for about a month is going to have a rough time beating you driving from New York to Texas even if you're driving a Corolla. I consider myself to be an extremely good software engineer, but lower-level programming languages scare me. Recently I've been writing a lot of code in C++ for my Arduino project, but aside from that I can't stand it. You have to be such a good C programmer to make things faster than you probably could knock out using Python + numpy or something like that, but to be fair, under the hood numpy uses tons of optimized C code.


I mainly write embedded software in C and C++, but I tend to use python for non-embedded stuff. The funny thing is that I've attempted to speed up multiple python scripts by replacing lists of floats with numpy arrays, and each time it's ended up slightly slower. I suspect numpy only really starts to pay off when you're doing the same operations to 1000+ elements. For modest data sets, or algorithms that don't map particularly well to numpy's operations, the built in data types do better.

I also recently gave numba a go, and it was significantly slower than vanilla python. I was surprised because the decorated function seemed exactly like the type of code that numba would be good at (iterating through a list of coordinates doing square roots and some quirky floating-point modulo math.)


Just checking, even though you likely know. The numpy arrays really only pay off when you vectorise your calculations. Don't expect any speed up if you're still using list comprehensions.

A vector operation is where, instead of combining two lists U and V elementwise with W = [x+y for x,y in zip(U,V)], you operate directly on the arrays: W = U + V.

This allows the inner loop to be completely in native code. Sometimes some curious things happen where it's easier to do more vector operations resulting in more looping just because each of those operations is a vector operation and so is faster than the loop. For example, increment a number every time you see a particular element (time series breaking out subsequences) might look like np.cumsum(U == value) which is two loops in practice, but much faster than the iterative approach.


Interesting for me to see someone being scared of low-level langs. For me it's the opposite: high-level langs scare me. They always make me feel that I don't know what is actually going on.


I don't know how cars work, but I still drive.



