"We find that in regards to performance, C++ wins out by
a large margin. However, it also required the most extensive
tuning efforts, many of which were done at a level of sophistication that would not be available to the average programmer.
Scala concise notation and powerful language features allowed for the best optimization of code complexity.
The Java version was probably the simplest to implement,
but the hardest to analyze for performance. Specifically the
effects around garbage collection were complicated and very
hard to tune. Since Scala runs on the JVM, it has the same
issues.
Go offers interesting language features, which also allow
for a concise and standardized notation. The compilers for this language are still immature, which reflects in both performance and binary sizes."
Untuned C++ will probably be slower than even tuned Python. I can tune any language, but C++ (although less so than something like C) requires more tuning to get good performance. The performance is what we care about here.
Could you give an example of where this would be the case? I write both Python and C++ code all day and I can't think of a single situation where I'd expect better performance from Python than a similar C++ implementation.
Here's one that bites every C++ newbie at least once (please forgive minor syntactic mistakes, it's been a while):
std::vector<LargeStruct> things = GetABunchOfThings(); // Why does this take so long?
// Oh well, time to sort
std::sort(things.begin(), things.end(), LargeStructComparator());
...and then your program takes forever and your profiler tells you that 99.9% of your time is spent in LargeStruct's copy constructor. And then you refactor to use pointers and teach the comparator to dereference them, whatever, it's fine, but the naive Python version will just work.
Of course, after C++0x becomes widespread this problem will go away. Yay move constructors.
Basically correct, except std::sort uses assignment, not copy construction.
It's also likely that the optimizer will have GetABunchOfThings construct its result once in the caller's "things". Lambdas are C++0x, so maybe move constructors (another C++0x feature) will be employed instead.
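For reference, here is a minimal pre-C++0x sketch of the pointer refactor described above; the LargeStruct / LargeStructComparator / GetABunchOfThings names are just the hypothetical ones from that snippet:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Hypothetical heavy value type and comparator matching the snippet above.
    struct LargeStruct { int key; char payload[1024]; };
    struct LargeStructComparator {
        bool operator()(const LargeStruct& a, const LargeStruct& b) const {
            return a.key < b.key;
        }
    };
    std::vector<LargeStruct> GetABunchOfThings() {
        return std::vector<LargeStruct>(10000);  // pretend this is the expensive part
    }

    // Compare through pointers so std::sort shuffles pointer-sized values
    // instead of whole LargeStructs.
    struct DerefComparator {
        bool operator()(const LargeStruct* a, const LargeStruct* b) const {
            return LargeStructComparator()(*a, *b);
        }
    };

    void SortWithoutCopying() {
        std::vector<LargeStruct> things = GetABunchOfThings();

        std::vector<const LargeStruct*> order;
        order.reserve(things.size());
        for (std::size_t i = 0; i < things.size(); ++i)
            order.push_back(&things[i]);

        // Only the pointers move; the structs stay put.
        std::sort(order.begin(), order.end(), DerefComparator());
        // Walk 'order' to visit the structs in sorted order.
    }

With C++0x move semantics, as noted above, the original value-based code becomes cheap as well.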
Python might be a bad example, as the interpreter really isn't terribly fast. But with Javascript, I wouldn't be surprised at all if the same naive implementation of a typical web-ish task (lots of strings, parsing, lots of heap churn) ran faster on V8 than it did compiled from C/C++.
By 'untuned' I'm also including using the wrong algorithm. For example, using the usual exponential recursive definition of the Fibonacci sequence in C++, as opposed to a loop (or even an O(n^2) recursive) version in Python: the Python version will be faster in this case. It is probably true, however, that any decent programmer will churn out C++ that is faster than Python for CPU-intensive tasks most of the time.
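To make the "wrong algorithm" point concrete, here is a small sketch (kept in C++ only, since the gap is algorithmic rather than a language issue): the naive recursion makes on the order of phi^n calls, so a plain O(n) loop, in Python or anything else, overtakes it long before constant factors matter.

    // Exponential-time "textbook" recursion: roughly phi^n calls.
    unsigned long long FibNaive(unsigned n) {
        return n < 2 ? n : FibNaive(n - 1) + FibNaive(n - 2);
    }

    // Linear-time loop: what any language's version should be doing instead.
    unsigned long long FibLoop(unsigned n) {
        unsigned long long a = 0, b = 1;
        for (unsigned i = 0; i < n; ++i) {
            unsigned long long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }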
The use of the algorithm is mostly orthogonal to the language being used. It's like saying 'walking from Amsterdam to Rotterdam can be faster than going by bike, if you take a detour to Beijing when using a bike'.
Personally, I often do an initial write of a program first in Python and rewrite in C/C++ if necessary. Usually, the first unoptimized C++ implementation is 10-100x faster than an optimized Python implementation.
> The use of the algorithm is mostly orthogonal to the language being used.
I don't find that to be true. I use more sophisticated algorithms in "higher level" languages because I can implement them in about the same time it takes me to implement less sophisticated algorithms in "lower level" languages.
> Personally, I often do an initial write of a program first in Python and rewrite in C/C++ if necessary. Usually, the first unoptimized C++ implementation is 10-100x faster than an optimized Python implementation.
Yes, but the C/C++ rewrite gets to take advantage of what you learned in the Python version, AND you only do the C rewrite when you need the speed and are willing to pay the time penalty to get it. (There's some extra cost to the C version, otherwise you'd always do C and never do Python.)
> I use more sophisticated algorithms in "higher level" languages because I can implement them in about the same time it takes me to implement less sophisticated algorithms in "lower level" languages.
These are factors not related directly to the language, but level of understanding, time, and money. An experienced C++ programmer can probably implement the 'better' algorithm equally quick in C++.
> There's some extra cost to the C version, otherwise you'd always do C and never do Python.
Not necessarily. In an existing project in a non-C language, the C implementation will require you to write and maintain a binding as well. That could be a good reason to use Python/Ruby/Prolog/whatever by default, unless it's not fast enough.
> These are factors not related directly to the language, but level of understanding, time, and money. An experienced C++ programmer can probably implement the 'better' algorithm equally quick in C++.
There are programmer performance factors directly related to language differences.
The mainstream example is the efficiency of DSLs. Try writing a complex regexp in a standard Perl-derived regexp DSL and then in plain C/C++.
The less-known example is writing complex business logic in Clojure vs. Java. The very fact that the Clojure version lets you examine much more business logic on one screen without scrolling puts much less pressure on the programmer's concentration and memory.
Thus the very ability to write complex business logic without lots of bugs under tight time constraints depends on the language.
> An experienced C++ programmer can probably implement the 'better' algorithm equally quick in C++.
While there are programmers who can implement a given algorithm faster in C++ than in Python, I suspect that they're the exception (among folks who know both).
> In an existing project in a non-C language, the C implementation will require you to write and maintain a binding as well.
If there were no C penalty, other languages wouldn't have been used as often for said existing projects to begin with.
My view is that each language has both strengths and weaknesses, so different languages are better suited to different tasks depending on factors like available personnel, existing software, and ease of implementation (to reach an acceptable level of performance).
In the average case for large datasets. Quicksort is O(n^2) worst-case (choosing the worst pivots possible), and Bubblesort is O(n) in the best case (e.g. an already-sorted array).
The C++ Bubblesort will have better worst-case and best-case performance, but worse average-case performance, for large inputs.
But this is all irrelevant, because in Python you'll be using Timsort[1], which I'm pretty sure is implemented in C anyway. In C++, you'll be using std::sort unless you have a template allergy. If you use the GNU stuff, that will be Introsort[2]; if MSVC, Quicksort.
Python doesn't even use quicksort. It uses timsort (http://en.wikipedia.org/wiki/Timsort), which is a hybrid merge/insertion sort. It's also used to sort arrays in two different versions of Java. In other words, it's one of the fastest-known sorting algorithms.
Downvoted because people think Quicksort in Python always wins vs. a bubble sort in C++? Wish I had time to find a case where that's not true. I'll point out that in the worst case, Quicksort is O(n^2).
When turning map into hash_map gives you a 30.4% increase in performance, you can't call this an 'extensive' tuning effort... The 'extensive' tuning in the paper (structure peeling, ...) accounts for minor improvements (< 10% total).
The initial code was supposed to be written in a straightforward way. That said, I was playing around with the code, and using tcmalloc with the original C++ code gives a better speedup (something close to 40% with the provided example).
We find that in regards to performance, C++ wins out by a large margin. However, it also required the most extensive tuning efforts, many of which were done at a level of sophistication that would not be available to the average programmer.
I am a fan of C++, so I liked the first part (that C++ wins, which is not really surprising). However, it was not clear whether the sophistication they talk about refers to Google's internal data structures or to the list of optimizations in Section VI-D. Many of these optimizations are not big surprises to a C++ programmer (where possible use hash_map, use vector instead of list, initialize data structures outside of loops and try to reuse them, and so on).
Overall, I do greatly appreciate the effort and their sharing the results.
The problem with possible optimizations and tuning is that they require constant effort. Projects often have serious deadlines, and you simply can't constantly think about memory or allocation footprint and still do a good job in the amount of time you have, unless you're well above average.
Most of the optimizations that got them big wins in C++ (using hash maps instead of tree-based maps, dynamic arrays instead of linked lists, .empty() rather than .size() > 0 on a linked list) are "library knowledge" issues more than memory / allocation ones.
In fact, looking at the list of optimizations on page 9, the only one that involved a low-level understanding of the machine and resulted in a significant improvement was (unless I missed something) the use of InlinedVector (and this is only barely such a case - lots of C++ codebases have a similar class).
Exactly. Using hash maps and vectors is a really elementary choice. Can you be considered to understand enough about "systems programming" (see other comments for a definition) without knowing that hash_map is implemented using a hash table and map using a tree? And that, therefore, adding elements to the map would incur rebalancing costs which hash_map wouldn't, as soon as you can estimate the number of elements to be processed? I consider it more a "meta" than a "language" decision.
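To illustrate, a minimal sketch of the map-to-hash_map swap being discussed, using the pre-C++0x ext/hash_map GNU/SGI extension; the Count example and its names are mine, not the paper's:

    #include <cstddef>
    #include <map>
    #include <vector>
    #include <ext/hash_map>  // GNU/SGI extension; C++0x's std::unordered_map plays the same role

    // Tree-based map: ordered iteration, but inserts may rebalance and every
    // lookup costs O(log n).
    std::map<int, int> counts_tree;

    // Hash-based map: amortized O(1) insert/lookup and no rebalancing, at the
    // cost of unordered iteration.
    __gnu_cxx::hash_map<int, int> counts_hash;

    void Count(const std::vector<int>& keys) {
        for (std::size_t i = 0; i < keys.size(); ++i) {
            ++counts_tree[keys[i]];   // O(log n) per key
            ++counts_hash[keys[i]];   // ~O(1) per key
        }
    }

The interface is the same for this kind of usage, which is part of why the swap is such a cheap change to make.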
Rather than library knowledge, I submit it's basic algorithmic and data structure knowledge. Perhaps one could argue it's knowing how that algorithmic and data structure knowledge maps to actual libraries.
By the way, std::list::size() is defined to be O(1) (see http://www.kuzbass.ru:8086/docs/isocpp/lib-containers.html#l...). While it's possible, of course, to define a linked-list where the size operation is O(n), that's not an implementation of std::list. Hence, there was no difference when std::list::empty() was used - although I agree it's better to use it, since it's more meaningful.
In a confusing bit of ISO trivia, std::list implementations can do either - the standards draw a distinction between "should" and "shall". If an implementation "shall" do something, it's a requirement, and if it "should" do something, it's the preferred choice among several. The size() function only "should" be constant time. In fact, in GCC (which I assume is the Google compiler of choice), std::list::size() is O(n) (http://gcc.gnu.org/onlinedocs/libstdc++/latest-doxygen/a0142...).
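A small sketch of why the distinction matters, assuming the O(n) libstdc++ size() just described:

    #include <list>

    // With pre-C++0x libstdc++, std::list::size() walks every node: O(n).
    // Calling it once per iteration turns this loop into O(n^2) overall.
    void DrainSlow(std::list<int>& work) {
        while (work.size() > 0) {   // O(n) emptiness check
            work.pop_front();
        }
    }

    // empty() is required to be O(1) for every standard container, so the
    // same loop stays O(n).
    void DrainFast(std::list<int>& work) {
        while (!work.empty()) {     // O(1) emptiness check
            work.pop_front();
        }
    }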
Summary: the author implemented a sophisticated compiler algorithm (loop header recognition) in a straightforward, canonical way in C++, Java, Scala, and Go, measured performance/memory usage, and then (the most interesting part of the paper) had colleagues who were experts in each respective language write highly tuned, optimized versions and reported on what it took to optimize in each language.
So C++ wins the performance test by large margin, Scala follows (3.6x), then Java 64bit (5.8x), Go (7x) and finally Java 32bit (12.6x).
Scala without any optimization is still better than Java with all the ninja-skills black magic applied. Scala optimized is pretty good - "just" 2.5x slower than C++
Actually it's odd. They say that in the Java tuning they made some simple optimizations that got it on par with the original C++ version, but then they refused to do anything further, noting instead that the same C++ optimizations would have applied to Java as well.
Does that mean the Java version would have been just as fast? It seems in general, though, that Java is just as fast as C++ unless you are an ultimate C++ expert.
Regarding the Java version: "Note that Jeremy deliberately refused to optimize the code further, many of the C++ optimizations would apply to the Java version as well"
From a performance perspective, the VM is what dominates and Java operates closer to the VM assumptions about what it's going to run.
If you read the paper, you'll note that the Scala author significantly changed the structure of the algorithm to conform with the way Scala does things (recursion, etc). So it's sort of an apples-to-oranges comparison as far as Scala's concerned, too bad they couldn't write ugly Scala code that would give a better comparison.
I only spent an hour straightening out his use of Go. Had I known he was going to publish it externally as "Go Pro" I would have tried to turn it into a real Go program, and perhaps even optimize it somewhat. Overall I think the Java and C++ code got the most attention from language experts. Ah well.
Speaking of Scala Pro: "It should be noted that this version performs algorithmic improvements as well, and is therefore not directly comparable to the other Pro versions."
"It should be noted"?! What an understatement! What's the point of going to the trouble of writing a paper if you're not even going to make your comparisons fair? Crazy.
"... the benchmark also used a HashMap<Type, Integer> when it should have just stored a primitive int in Type; it also used a LinkedList where it should have used an ArrayDeque."
Using a "older Pentium IV workstation". Older than what? Older than Grandpa? Older than dirt? Older than the dinosaurs? Why?
Pentium IV - really? Let's see - we've had Core, Core 2, Nehalem and Sandy Bridge since then, not to mention 3 in-between process shrink versions of the same architecture.
Perhaps we could pass the hat around so that Google could afford a Sandy Bridge workstation and discover what these interesting results look like on an architecture that dates from some point in this decade - or at least, at some point in the previous decade.
This stuff actually, really, truly makes a difference. And not necessarily in favor of or against any particular one of these languages...
Imagine: there is a paper discussing the techniques of weight lifting. Your complaint is that there are newer models of weights than those they lifted.
Newer processors don't just increase the performance of all programs equally. They have things like improved branch prediction, cache prefetching, better pipelining, and different cache sizes which can make a lot of performance optimizations that you get from C / C++ less relevant.
Except from the "most optimized C++ code" you have all sources of all benchmarks. Please try to run them on any newer x86/64 based processor and show us that the article conclusions don't hold. I'm not holding my breath though.
The parent commenter uses "improved branch prediction, cache prefetching, better pipelining, and different cache sizes" as "mumbo-jumbo that can mean something different." I'm in the business, so I can tell you that most of these improvements give you just some "overall speedup", so that you can happily buy today's processor running at 3 GHz and be glad it's faster than an almost decade-old P4 running at the same 3 GHz. Add to that that you now have a multi-core CPU, that you have to "clear" the paths to the cores in order to prevent them from slowing one another down, and that you also have to compensate for the bigger delay introduced by more modern RAM technologies, which trade latency for the ability to feed more cores.
Then measure the algorithms, which run on one core anyway, on the P4 and on the latest Core iX. Your slow languages won't be faster than your fast ones just because the quoted changes were introduced to the processors in between.
Please note that I did not disagree with your conclusion - I agree that it probably won't make a difference. If it makes you feel better, I'll say it's a very high value of probably. But I'm in the business of performing systems experiments. Removing as many variables as possible is just good experimental design. If you want to know what the performance will be like on modern machines, then it's best to run on modern machines.
Dear lord, thank you for this bit of common sense.
It's actually quite hard to know what a given piece of code will do on a given microarchitecture even if, on average, it runs everything X% faster - you may find you're the bit of code that bites the big one and runs X% slower on the new microarchitecture (e.g. you were depending on branch mispredicts being cheaper than they are), or suddenly your code runs way faster than competing codes (e.g. you're the superstar running 2X% faster because a sudden increase in ILP exposes that you've got a main loop full of independent operations).
Thought experiment: you are Intel, and your new processor somehow makes writing highly optimized C code unnecessary, since the speedup of some higher-level languages (say, Java or Scala) is bigger than the speedup of the highly optimized C code (in which the runtimes of Java and Scala are implemented). Wouldn't you make the biggest announcement in computing history? "With this new CPU, your sloppily written code is faster than highly optimized code. Every programming assumption valid up to now isn't anymore!"
...oh, I think I see what you're saying. If you have hardware that can predict jump-to-pointer and not just jump-to-literal, that can make quite a difference in how well more dynamic languages / paradigms perform compared to more static ones. I was thinking more along the lines of languages giving programmers (the appearance of) the ability to tweak their code to take advantage of these things, and "don't bother, the compiler is smarter than you are".
Branch prediction and processor caches help all languages, and low level languages like C and C++ even more than dynamic languages. What I'm saying is that smart compilers are not in any way a substitute for branch prediction and processor caches, on the contrary: smart compilers are designed to take the best advantage of these features. And on the point why C and C++ are still relevant because of this: it's basically impossible to fully take advantage of e.g. processor caches in languages like Java and Ruby because you cannot fully influence how your data structures are laid out in memory. You can in C and C++. Sure, in theory smart compilers could do that for you, but in practice they don't. And the problem is so difficult that we probably get that in compilers working about as well as a human C/C++ programmer around the same time that we get strong AI.
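As a concrete (and entirely hypothetical) illustration of that layout control, here is an array-of-structs vs. structure-of-arrays sketch; a scan over only the hot field touches far fewer cache lines in the second version, and that choice is simply not expressible for ordinary Java or Ruby objects:

    #include <cstddef>
    #include <vector>

    // Array-of-structs: each Particle is contiguous, but a pass that only
    // needs 'x' still drags the cold fields through the cache.
    struct Particle {
        double x, y, z;
        double mass;
        int    flags;
    };

    double SumX_AoS(const std::vector<Particle>& ps) {
        double sum = 0.0;
        for (std::size_t i = 0; i < ps.size(); ++i) sum += ps[i].x;
        return sum;
    }

    // Structure-of-arrays: the hot field lives in one dense array, so the
    // same scan streams through memory with far fewer cache misses.
    struct Particles {
        std::vector<double> x, y, z;
        std::vector<double> mass;
        std::vector<int>    flags;
    };

    double SumX_SoA(const Particles& ps) {
        double sum = 0.0;
        for (std::size_t i = 0; i < ps.x.size(); ++i) sum += ps.x[i];
        return sum;
    }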
I've programmed computers at a low level, and lifted weights for the past 21 years (and in fact, read some actual papers about weight lifting, and failed to make the aforementioned complaint). The Eleikos of today feel basically like the Yorks of yesteryear.
On the other hand, the processors of 2000 are quite a bit different to the processors of 2011, and anyone who thinks that they can prove that this is irrelevant to the experiment in question by pure reasoning (rather than experimentation) is a moron, plain and simple.
Having tracked individual embodiments of algorithms over architectures from the P4 to Sandy Bridge, I can assure you that tradeoffs between particular approaches to a given problem vary wildly based on changes in microarchitecture.
Overall, there's a steady improvement and perhaps these languages are all gaining exactly the same (despite a substantial improvement in effective ILP from P4 to Core 2)... but I don't know that for sure and neither do you, despite your sophomoric self-assurance.
If only they could do the experiments on a processor from the last 5-6 years...naturally someone else could do the work for them, but why bother?
The most instructive part of this is the set of changelog listings along with the resulting performance improvements. I'm really enjoying the C++ tunings. I wish they had shown the results with the Google hashes, though.
So, the C++ performance numbers were with the "public" version? It wasn't clear whether they left the internal version out completely (besides the changelog) or simply didn't release the code. The published C++ numbers are already way better than any other language's in terms of speed and memory use. It would be surprising if there were a substantially faster version.
The final list of changes is supposed to be for the internal version, and it roughly seems to show a 20% improvement over and above the basic std::map -> ext/hash_map replacement. I'm not sure if I should add the percentages or multiply them along to get the final runtime, though that doesn't make much difference, really :)
You're right that there may not be much improvement, but I've seen the google dense_hash_map do better sometimes than the basic SGI hash_map.
Edit: the code repo has the unoptimized versions. Oh well, at least I can get the satisfaction of seeing the 30% myself :)
Isaac, do you think you could add this problem to the shootout? I believe (and am therefore almost certainly wrong) that there isn't any benchmark there that derives almost purely from compiler theory.
link to the code: http://code.google.com/p/multi-language-bench/
This is a great resource to learn the different languages.
(I had never seen C++ template programming used in practice before)
Challenge to the Pythonistas: rewrite the Python version so it uses the C++ code under the hood, using scipy.weave.
I was surprised by the fact that the Python version has 626 lines of code, compared to 297 for the Scala Pro version. As Python is supposed to be more expressive, I was expecting something like 150-ish LOC for Python.
Actually, the Python version has about 328 lines of code, while the Scala version has about 216 lines of code. You should visit the Google project page and use something like cloc to analyze the files:
I confess that I do not know a lot about Scala, but it looks like some of the functional aspects of the language allow for some savings in lines of code.
If I had the time right now, I doubt it would be too difficult to shrink the Python version a bit. Just taking a glance at it, I don't see a single list comprehension throughout their code. I am sure there are some other language features that could have been better leveraged as well.
Just out of curiosity, where did you get the 626 LOC and the 297 LOC from? I tried looking but I can't find them anywhere. Though that could just be a product of my lack of sleep right now.
The 297 LOC for Scala Pro is from the paper and also includes 13 lines of Apache license boilerplate at the top and 14 lines of comments. The author used "wc -l" to count the LOC, which isn't very scientific anyway. The 626 LOC for Python is also from "wc -l". I looked at the code again and found that it contains a lot of detailed comments and even test cases, so it's not fair to say the Python version has 626 LOC.
I suspect the author gives it out as a reference implementation.
C++ code, when sophisticated template metaprogramming techniques are used, can be as fast as tuned C code, and can beat any language in terms of speed with the exception of hand-tuned assembly. However, techniques like expression templates, automatic loop unrolling, template specialization, and static polymorphism require a VERY high degree of sophistication from the programmer. In some sense, the programmer is forced to be a compiler. There are some individuals that, despite this demand, can be highly productive. The others wallow in complicated syntax, horrible type errors, and difficult-to-trace run-time errors. Unless speed is absolutely essential, C++ should be avoided at all costs.
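For anyone curious what that looks like, here is a small sketch of one of the techniques named above, static polymorphism via the curiously recurring template pattern (CRTP); the Shape/Square/Circle names are illustrative only:

    #include <cstddef>
    #include <vector>

    // The "virtual" call resolves at compile time and can be inlined (no
    // vtable lookup), at the price of template-heavy code and the error
    // messages complained about above.
    template <typename Derived>
    struct Shape {
        double area() const {
            return static_cast<const Derived*>(this)->area_impl();
        }
    };

    struct Square : Shape<Square> {
        double side;
        explicit Square(double s) : side(s) {}
        double area_impl() const { return side * side; }
    };

    struct Circle : Shape<Circle> {
        double radius;
        explicit Circle(double r) : radius(r) {}
        double area_impl() const { return 3.14159265358979 * radius * radius; }
    };

    // Works on a homogeneous, contiguous vector; the concrete type is known
    // at compile time, so each call is a plain, inlinable function call.
    template <typename T>
    double TotalArea(const std::vector<T>& shapes) {
        double sum = 0.0;
        for (std::size_t i = 0; i < shapes.size(); ++i) sum += shapes[i].area();
        return sum;
    }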
This is cool but I think it's important to remember that they're comparing algorithmic benchmarks but modern software has many different bounds besides for raw, algorithmic processing. For example, I think I/O and database interactions are the bottleneck for most web apps. I really liked the way this article broke it down: http://pl.atyp.us/wordpress/?p=2947
Given the languages they are using, I think it is fair to say that they are interested in systems-programming kinds of tasks, which for Google means the stack below what you normally use for web programming (i.e. I would expect Google to have their own web servers, or at least ones heavily customized from existing servers, etc...). For systems programming, "language speed" matters much more (one definition I like for systems programming, seen on Lambda the Ultimate: "programming where both algorithm complexity's constant and memory representation matter").
Of course, in most cases you're right. Given an infinite pool of very good programmers and infinite time, C will almost always be faster than most other languages. But in reality, you have a very limited pool of maybe-not-so-good programmers, and the project needed to be done yesterday :)
Keep in mind that since C++ is usually used for performance critical code, its compilers are often highly tuned and produce good code (see Intel's ICC for number crunching code).
A less popular language - or one that is usually used for less-critical code - will suffer.
C++ wasn't always so fast. See Kernighan and Pike, "The Practice of Programming" for a simple example where Java surprisingly beats C++.
True, but this is one of the reasons people choose to code in C++. The quality of tools & common libraries is as much a factor in language selection as the syntax and semantics of the language itself.
Looks like C++ is still one of the performance kings among current programming languages. Can't say I enjoy using it though.
It also looks like Go and Scala have a long way to go on compiler optimization. Not surprising, since they're young languages compared to C++ and Java, and it's a decent amount of work to optimize the higher-order constructs they provide.
Languages that come from an era when total memory was measured in kilobytes, CPUs in megahertz, and pointer arithmetic was as common as if statements will always be performance kings.
It'll take a pretty significant shift to change this. Something like 100+ core CPUs, each running slowly, might be such a shift.
Being surprised that C++ is faster than newer languages which provide all kinds of abstraction is like being surprised that assembly is faster than C++.
You're mostly correct, but the code they wrote DID use fancy abstractions like hash tables, templates, etc. C++ delivered on abstraction WITHOUT a performance penalty, which is what the C++ designers have claimed it could do since forever. Nice to see it verified again. I have no doubt that the C++0x syntactic sugar enhancements would have also delivered the same high performance.
Indeed. I write a lot of Objective-C code, and still drop into C++ for the CPU heavy tasks. Swapping out NS* containers for std::* alone gives a fairly high boost in performance.
All the C++ hate is blown way out of proportion. C++ is a great tool when you need performance. Tuning C++ isn't that difficult either. Contrary to what a commenter said above, steering clear of C++ at all costs is not what you should be doing. You should know when and how to use C++. It's useful more often than many people here seem to think. Especially if you develop for mobile devices.
I've been pushed back into C++ out of necessity for some DSP work and I'm actually enjoying it quite a bit. It sucks for text processing but for other things it's really not so bad. As you say, with computing shifting back to underpowered devices C++ is having another day in the sun.
Is it hard to use fsc? Fsc should fix those compile times, especially for minimal changes. In the paper it reduced compile time to 25-33% of scalac's compile time. (13.9s to 3.8s and 11.3s to 3.5s)
I was talking about incremental compiles (one loop around sbt ~test-compile), on a laptop, with a web browser open. Colleagues have benchmarked SSD compiles at 5x+ faster.
The Java version spends 75% of its time in GC? That smells fishy to me.
Edit: I just got to Section VI of the paper, which does some GC tuning. In my own playing around, I also found about a 20% speedup by changing the HavlakLoopFinder.UnionFindMode class to a static class.
This paper served as a great introduction to Go programming for me. I've been trying to work through my own basic program in Go for the past few weeks and their overview of the language features used cleared up a lot of loose ends for me.
Having the same algorithm implemented in several other languages helps, too. It enables you to look at the implementation in the language you're most comfortable with, then look at one of the other implementations and think "Oh, so that's how that's done in this language."
I wonder what JVM they used, as there are significant performance differences between HotSpot and JRockit. I also wonder what C++ compiler they used; again, the Microsoft compiler will yield different results compared to GCC.
One of the interesting details is that Go Pro required 786 lines of code as opposed to 850 lines of C++ code and 297 lines of Scala Pro code. Regular Go code was 902 lines. It certainly doesn't look like Go scores any expressiveness points over C++.
"Go Pro" is unfortunately a very misleading name. I just spent an hour cleaning up his use of Go. Nobody made any attempt to actually turn this code into a well-written Go program. I did not realize that he was going to publish it externally.
Would you consider touching up the Go version again to make it more idiomatic and well written? I'd be interested in seeing good Go examples; I'm sure many others here would as well.
I'll keep that in mind, as well as the fact that it will get slower as it starts doing useful work and optimizations, thus losing its "fast compilation" edge (which is just about its only advantage right now).
It's also interesting to note that it looks like he measured LOC with "wc -l", so all includes, comments (including license), whitespace, debug code, and the main are included.
LOC isn't a great metric, and the authors of this paper seem to make it worse by not measuring it very well.
It may just be because I've been writing a ton of C++ lately, but to me the C++ data declarations are the most readable of the bunch. The stuff with iterators is much uglier, of course, but so far I like the syntax of C++ a lot better than that of Go.
...and then if you typed a period (.), it would show the members of std::string. I'd say it works better than IntelliSense (it's very precise, and you use the same parser for code completion and final compilation).
I'm too dependent on the rest of the integration with XCode to venture back into emacs land again but I'm looking forward to a rev of XCode that includes this.
Who said that when you release a paper it needs to look like it came from a textbook? I couldn't get past the first page because I felt like I was back in high school.
Submissions must be in English and at most 12 pages total length in the standard ACM SIGPLAN two-column conference format (10pt).
Nearly all academic papers in computer science and engineering look like this. Not to say they've converged on an optimal format, but it wasn't exactly a creative decision on the part of these authors.
"We find that in regards to performance, C++ wins out by a large margin. However, it also required the most extensive tuning efforts, many of which were done at a level of sophistication that would not be available to the average programmer.
Scala concise notation and powerful language features allowed for the best optimization of code complexity. The Java version was probably the simplest to implement, but the hardest to analyze for performance. Specifically the effects around garbage collection were complicated and very hard to tune. Since Scala runs on the JVM, it has the same issues.
Go offers interesting language features, which also allow for a concise and standardized notation. The compilers for this language are still immature, which reflects in both performance and binary sizes."