Compiler engineer here. In practice, compilers for higher-level languages often have a lot of difficulty getting anywhere close to the efficiency of comparable C code. If you take Python, for example, you have to do a lot of inlining to eliminate various abstractions. Inlining is actually non-trivial. Yes, inlining, by itself, is an easy program transformation, but knowing where to inline to get the best performance is very hard. If you inline too much, you increase code size, and you lose performance. You have to know precisely which inlining decisions will pay for themselves, and your large codebase might have tens of thousands of call sites and a call hierarchy 40 functions deep. Python also adds the fun little problem that you can redefine any function at run time, which makes it hard for the compiler to know which function you're actually going to be calling. To complicate things further, inlining decisions affect other optimizations. For example, if you inline foo into bar, do you then also inline into bar the functions that foo is calling? Do you unroll the loops from foo into bar? Etc.
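
A tiny illustration of why those decisions cascade (hypothetical C code, just a sketch): once the inliner commits to inlining foo into bar, a constant trip count becomes visible, and the loop it pulled in becomes a new unrolling candidate.

    /* Before inlining, the compiler sees an opaque call with an unknown n. */
    static float foo(const float *xs, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += xs[i];
        return sum;
    }

    float bar(const float xs[4]) {
        /* Once foo is inlined here, n is known to be 4, so the loop can be
           fully unrolled and vectorized. But that only happens if the inliner
           judged this particular call site to be worth the code growth. */
        return foo(xs, 4);
    }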

Also, there's an aspect that I feel is constantly overlooked, and this is the way that objects are laid out in memory. In Python, JavaScript, Ruby, you have a lot of pointers to objects. You get a lot of pointer-chasing as a result. This is BAD for caches. Each object you touch, each pointer you dereference means pulling in a new cache line. In C, you can design very compact and flat data structures. You can use 32-bit floats, 8-bit, or even one-bit integers if you want. You can have a struct within a struct, with an array inside of it, all without any pointer-chasing.
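
To make that concrete, here's an illustrative C sketch (made-up types, not anyone's real code): everything below sits in one contiguous block, so iterating over it is sequential cache-line traffic rather than a pointer chase per field.

    #include <stdint.h>

    struct Vec3 { float x, y, z; };        /* 32-bit floats */

    struct Particle {
        struct Vec3 pos;                   /* a struct within a struct */
        struct Vec3 vel;
        float       history[4];            /* an array inside of it */
        uint8_t     material;              /* an 8-bit integer */
        uint8_t     alive : 1;             /* even a one-bit field */
    };

    struct Particle particles[4096];       /* one flat allocation, no pointers */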

Modern CPUs are constrained by memory bandwidth, and it's very hard for any programming language to beat C on achievable memory efficiency. What's worse is that we have little to no academic compiler literature on automatically optimizing for memory efficient data layouts. It's an overlooked and understudied problem. You would have to prove that integer values lie within certain ranges (hard) and also do object inlining (collapsing objects into parents) which AFAIK is also hard and not done by any mainstream compiler.
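
For what it's worth, "object inlining" here means a transformation like the following (a hypothetical before/after written out by hand in C, precisely because no mainstream compiler will do it for you):

    #include <stdint.h>

    struct Vec3 { float x, y, z; };

    /* What a pointer-heavy runtime effectively gives you: each child is a
       separate heap object, i.e. an extra cache line per dereference. */
    struct BodyBoxed {
        struct Vec3 *position;
        struct Vec3 *velocity;
    };

    /* What the compiler would have to synthesize automatically: the children
       collapsed into the parent, plus a narrowed field that is only legal if
       the compiler can prove the value always fits the smaller range. */
    struct BodyInlined {
        struct Vec3 position;
        struct Vec3 velocity;
        int8_t      team;
    };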

So, yeah, keep thinking that a sufficiently smart compiler will do everything for you. You will assuredly be very disappointed. Until we have strong AI, the sufficiently smart compiler is basically unobtainium. If you want efficiency, the best route is generally to have fewer layers of abstraction, or to rely only on compiler optimizations you know for certain will happen.



The sufficiently smart compiler is effectively unobtainium, but that cuts both ways: C has its own challenges with its memory model and language semantics.

C never was the lowest level of abstraction; there are other abstraction models out there and more still to be invented, no doubt. C's model did well (though it struggled mightily against Fortran for the longest time) at aligning with processor models of the 80's and 90's. It helps that C's ubiquity meant that, to a degree, the tail was wagging the dog: processor designs were often measured against how they executed code compiled with a C compiler. But past performance is a poor predictor of future success; who is to say that as processor designs continue to evolve, C's abstraction model won't become increasingly leaky compared to another? Absent a sufficiently smart compiler, it's entirely possible that C compiler writers will find themselves at a disadvantage.

And that assumes they're competing with a traditional compiler. It's possible, though unlikely, that there will be competition from other execution models. As you said, memory is often the core performance bottleneck these days, and as terribly inefficient as bytecode interpreters are, they tend to have smaller code sizes. Efficient interpreters are hand-tuned specifically to make the overhead of bytecode interpretation as small as possible. Now, intrinsically they are performing a translation step at runtime that a compiler did before runtime, but one can at least theorize about a model where the interpreter is effectively a specialized decompression algorithm that feeds machine code to the CPU (that's really not that far afield from what happens in hardware in modern CPUs and in mobile runtimes). Higher levels of abstraction might allow for more efficient decompression... It's crazy, but not inconceivable.
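
To make the "interpreter as decompressor" idea concrete, here is a minimal sketch in C (invented toy opcodes, not any real VM): the bytecode is far denser than the machine code it stands for, and the dispatch loop is the step that expands it into handlers already sitting hot in the instruction cache.

    /* A toy stack machine: each one-byte opcode "decompresses" into the
       machine code of its handler. */
    enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

    long run(const unsigned char *code) {
        long stack[64], *sp = stack;
        for (;;) {
            switch (*code++) {
            case OP_PUSH: *sp++ = *code++;       break;
            case OP_ADD:  sp--; sp[-1] += sp[0]; break;
            case OP_MUL:  sp--; sp[-1] *= sp[0]; break;
            case OP_HALT: return sp[-1];
            }
        }
    }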



Well said.

I've had a lot of conversations with javascript engineers over the years who've argued to me that well tuned JS will be nearly as fast as the equivalent C code. I've written plenty of little toy benchmarks over the years, and in my experience they're partly right. Well written JS code in V8 can certainly run fast - sometimes around half the speed of C code. But a massive performance gap opens up when you use nontrivial data structures. Nested fields, non-uniform arrays, trees, and so on will all cripple javascript's performance when compared to C's equivalent of simple embedded nested structs. If you couple clean C data structures with allocation-free hot paths from arenas, the performance of C will easily hit 10-20x the performance of the equivalent JS code.

From memory my plain text based operational transform code does ~800k transforms / second in javascript. The equivalent C code does 20M/second. The C implementation is about twice as many lines of code as the JS version though.

(The code in question: https://github.com/ottypes/text-unicode/blob/master/lib/type... / https://github.com/ottypes/libot )
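
(For anyone curious, the "allocation-free hot path from an arena" pattern mentioned above looks roughly like this in C; an illustrative sketch, not the linked libot code.)

    #include <stddef.h>

    /* A bump-pointer arena: one big block up front, a pointer bump per
       allocation, and the whole hot path is freed with a single reset. */
    struct arena { char *base, *cur, *end; };

    static void *arena_alloc(struct arena *a, size_t size) {
        size = (size + 15) & ~(size_t)15;              /* keep allocations aligned */
        if (a->end - a->cur < (ptrdiff_t)size) return NULL;
        void *p = a->cur;
        a->cur += size;
        return p;
    }

    static void arena_reset(struct arena *a) { a->cur = a->base; }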


More like "well tuned JS will be nearly as fast as poorly tuned C code"...


True. But if I’m given the choice, I’d usually rather have well-written javascript than badly written C. And there are a lot more decent JS programmers out there than decent C programmers.


Fair. I'd rather neither, and take Python, or even (God forbid) PHP. LOL


C doesn't tell you anything about your cache efficiency, which is to first order the only thing that affects your program's performance.

You're right that flat data structures are important, but C is far from the only language that can offer them.

I don't think C can ever be the answer; "comparable C code", i.e. code that was produced with the same amount of developer effort, is almost always undefined behaviour that happens to work most of the time on the developer's compiler. C doesn't have the kind of polymorphism that you need to write code that's compositional and testable but also has good cache locality, at least not for code that also has to interact back and forth with the outside world. There is at least some work in the literature on making efficient use of cache in a Haskell/ML-like evaluation-by-reduction model, and I think that's where the long-term way forward lies (and FWIW, in my experience, compilers for ML-family languages are already more than smart enough).


I'm not arguing that C is the ultimate programming language for maximum performance or anything like that. Simply that it gives you a lot more freedom to tune your implementation where it matters for performance.

For HPC, you'd likely be better with something designed to run on GPUs, with built-in primitives for matrix operations, and a compiler that can tune the way those operations get implemented for you.

However, when it comes to cache locality, if the C programmer knows what they are doing, I think you'll have a very hard time beating them with Haskell or ML. AFAIK those languages have the same pointer-chasing issues I've described earlier. If you need to design a complex data structure, for example a scene graph in a 3D game engine, it will be much easier to make it compact in C.
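
For example, a common C trick for something like a scene graph (a sketch of the general idea, not any real engine): keep all nodes in one array, embed the transform directly, and replace child pointers with small indices, so traversal stays within a few contiguous cache lines.

    #include <stdint.h>

    struct SceneNode {
        float    transform[16];    /* 4x4 matrix stored inline */
        uint16_t first_child;      /* index into the same array, not a pointer */
        uint16_t next_sibling;     /* 0xFFFF means "none" */
        uint16_t mesh_id;
        uint16_t flags;
    };

    struct SceneNode nodes[4096];  /* the whole graph in one allocation */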


> However, when it comes to cache locality, if the C programmer knows what they are doing, I think you'll have a very hard time beating them with Haskell or ML.

A programmer who knows what they're doing is a rare thing indeed. On actual business problems with real-world programmers Haskell consistently beats C, IME.

> AFAIK those languages have the same pointer-chasing issues I've described earlier.

There's a pragma to flatten strict record fields ({-# UNPACK #-}), or a compiler flag to do that by default (-funbox-strict-fields), at least in GHC.


Thank you for the good explanation! I'm in embedded, and my experience is that this stuff is almost totally ignored in universities. People have to go through a steep and (for us) expensive learning curve until they get a feel for the whole system and what makes it fast. And then I think, is it asking for too much? Can we expect knowledge about a gazillion layers smeared onto each other from everybody who just wants to deliver value from the topmost layer?


> Inlining is actually non-trivial.

OTOH, JIT runtimes have more input data than a C compiler. They can implement a runtime equivalent of C++ profile-guided optimization: measure what actually happens at runtime, assume the input data is going to stay roughly the same, and regenerate the machine code into something more efficient using this new information.

Pretty sure modern Java does that sometimes.

> In Python, JavaScript, Ruby, you have a lot of pointers to objects.

In C# you can make data structures which don’t use pointers much, or at all. The language is strongly typed, has value types, and makes it easy to place structures inside other structures. There’s a native stack just like in C, and even a stackalloc keyword, which is the equivalent of alloca() in C.


> In C# you can make data structures which [...]

Yeah, but that completely validates my point. C# is not Python or JS. It's a (remote) cousin of C which tries to take some of the valuable performance tools from C and bring them to a managed runtime. Because it's strongly typed, it's a lot easier for the compiler to optimize, and because you have all these tools to design compact objects without pointers, you can do that job so the compiler doesn't have to.

And again, an experienced C# programmer can probably write code that runs circles performance-wise around code written by an experienced JS/Python developer in most cases.


> It's a (remote) cousin of C which tries to take some of the valuable performance tools from C and bring those to a managed runtime.

That’s correct. But at the same time, the language is way higher level than C or C++.

> experienced C# programmer can probably write code that runs circles performance-wise around code written by an experienced JS/Python developer in most cases.

In some cases, an experienced C# programmer can even write code which approaches or outperforms C. My Linux video player library https://github.com/Const-me/Vrmac/tree/master/VrmacVideo#per... uses CPU on par with VLC, and 20% less RAM.


> JIT runtimes have more input data than a C compiler

They have more data, but they are at a disadvantage by being time-pressured. They can't apply costly analysis to this data because they need to compile fast. Therefore they typically limit themselves to local analysis, which may miss a lot of opportunities for inlining.


> They can't apply costly analysis to these data because they need to compile fast.

True in general, but they use quite a few tricks to minimize the consequences.

They use an interpreter, or a very unoptimized but fast version of the JIT compiler, the first time a function is called. They replace it with a faster version once it’s clear the function is called a lot.

Unlike C compilers, they don’t need to do that analysis globally for the whole program. They only need to do it for the hot paths, which are often a small portion of the code.

They can offload that analysis and code generation to another CPU core, and replace the implementation once that background thread has produced a faster version.


> They only need to do that for the hot paths, that’s often a small portion of the code.

That's often correct, but unfortunately codebases today can be very, very large. It can take a lot of effort to optimize even just 10% of the hottest code if the product is several hundred MB of compressed bytecode. There are also applications with no obvious hotspots, but flat profiles - e.g. database systems, where most of the time is spent transferring data between various layers of the system. If a request gets passed through most of the layers, and can be routed into different areas depending on the query type chosen at runtime, and the clients are allowed to send queries of various types targeting various different submodules of the server, there will be no hotspots. In these cases warmup can take an enormous amount of time.

Even for a server this can be a problem, because after restarting you get an immediate performance hit.

Also keep in mind that many software products are not long-lived backend processes that can warm up for hours or even minutes. Client apps need good responsiveness, and before the JIT even realizes which code to compile, it is already too late.


I think what you wrote largely applies to Java and especially JavaScript, much less to C#. Value types, real generics, and native stack allow even the faster version of the .NET JIT to produce native code that’s not too horrible performance wise.

Good enough for desktop or embedded use cases, even on slow CPUs. I have 3 such devices on my desk: a Raspberry Pi 4, a dev board with a Rockchip RK3288, and a tablet with an Atom Z3735G. .NET is reasonably fast on all of them, without noticeable warmup issues at startup.


I was talking about JITs in general. Sure, you can AOT-compile .NET code quite efficiently, but then this is a different story.

Here is a nice analysis of how various JITs warmup in practice:

https://tratt.net/laurie/blog/entries/why_arent_more_users_m...

TL;DR: often they don't!


> unfortunately codebases today can be very, very huge

why is that?


Actually, action for action, managed languages tend to be "faster" than C, in the sense that the literal equivalent program written in C would run slower. However, the additional freedom leads to increased program complexity, and that complexity eats into the performance budget enough to end up slower than "idiomatically" written C, where the developer has distilled the solution to its bare essence.

Javascript has the fastest general-purpose hashmap/dictionary implementation of all programming languages, but at the same time you are forced to use them all the time, which overall slows the language down. Writing exactly the same code in C would be even slower, since C hashmaps aren't as optimized. However, C code is rarely written like that, so it's usually faster.


I don't know about LLVM but GCC has had something like this for years as a standard optimization feature. You create a binary with special profiling code which writes to a file. After running the program a few times, you recompile with optimization that uses this file. I forgot the flags though.

Personally, I'm more interested in executable optimization. It decompiles an executable, performs whole-program optimization, and re-compiles it. I'd love to tinker around with optimizing other people's binaries (e.g. games) for my machine. There is something like that for LLVM but it's very experimental.


It's called profile-guided optimization.

And it fails for C when the source data changes significantly, compared to the inputs used by developers when they were building the binary.


Yes indeed.


> You would have to prove that integer values lie within certain ranges (hard)

Don't JavaScript JITs rely heavily on this to reduce JavaScript's floating-point arithmetic to integer arithmetic?

Not to say it's easy, but hasn't a lot of work been done on this?


Sometimes you can prove that values will remain integers, when iterating over array indices for example. However, in the general case, when doing integer additions or multiplications, they need to insert dynamic integer overflow check instructions everywhere. When the integers would overflow, they have to be converted to floating-point values.

Think about this for a second. Suppose I have a loop where I'm multiplying values:

    function foo(x, y, n) {
        for (var i = 0; i < n; ++i) {
            x = x * y;
        }
        return x;
    }

In order to know that the multiplication won't overflow, you have to be able to prove that both x and y, coming into that function, will be integers. You also have to have information about the ranges of x, y and i. If you know that x >= 0 and y >= 0 but you don't know how many times the loop will execute at compile time, you are basically screwed. You could unroll the loop to reduce the number of dynamic checks you need, but then that increases your code size. So you mostly have to check that the multiplication doesn't overflow on every iteration. You may also have to do dynamic type checks if you can't prove that x, y and i will always be integers.
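
For concreteness, here's roughly what that per-iteration guard amounts to, sketched in C with GCC/Clang's __builtin_mul_overflow (a conceptual sketch of what an engine has to emit, not literal JIT output):

    #include <stdint.h>

    /* The JIT speculates that x and y stay 32-bit integers; every iteration
       carries a check, and a failed check means falling back to doubles,
       which is conceptually what "deoptimization" does here. */
    double foo_speculated(int32_t x, int32_t y, int32_t n) {
        for (int32_t i = 0; i < n; ++i) {
            int32_t product;
            if (__builtin_mul_overflow(x, y, &product)) {
                double xd = (double)x * (double)y;   /* redo this step as doubles */
                for (++i; i < n; ++i) xd *= (double)y;
                return xd;
            }
            x = product;
        }
        return (double)x;
    }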



