"We find that in regards to performance, C++ wins out by
a large margin. However, it also required the most extensive
tuning efforts, many of which were done at a level of sophistication that would not be available to the average programmer.
Scala concise notation and powerful language features allowed for the best optimization of code complexity.
The Java version was probably the simplest to implement,
but the hardest to analyze for performance. Specifically the
effects around garbage collection were complicated and very
hard to tune. Since Scala runs on the JVM, it has the same
issues.
Go offers interesting language features, which also allow
for a concise and standardized notation. The compilers for this language are still immature, which reflects in both performance and binary sizes."
Untuned C++ will probably be slower than even tuned Python. I can tune any language, but C++ (although less so than something like C) requires more tuning to get good performance. The performance is what we care about here.
Could you give an example of where this would be the case? I write both Python and C++ code all day and I can't think of a single situation where I'd expect better performance from Python than a similar C++ implementation.
Here's one that bites every C++ newbie at least once (please forgive minor syntactic mistakes, it's been a while):
std::vector<LargeStruct> things = GetABunchOfThings(); // Why does this take so long?
// Oh well, time to sort
std::sort(things.begin(), things.end(), LargeStructComparator());
...and then your program takes forever and your profiler tells you that 99.9% of your time is spent in LargeStruct's copy constructor. And then you refactor to use pointers and teach the comparator to dereference them, whatever, it's fine, but the naive Python version will just work.
Of course, after C++0x becomes widespread this problem will go away. Yay move constructors.
Basically correct, except std::sort uses assignment, not copy construction.
It's also likely that the optimizer will have GetABunchOfThings construct its result once in the caller's "things". Lambdas are C++0x, so maybe move constructors (another C++0x feature) will be employed instead.
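For reference, here is a minimal pre-C++0x sketch of the pointer refactor described above; the LargeStruct / LargeStructComparator / GetABunchOfThings names are just the hypothetical ones from that snippet:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Hypothetical heavy value type and comparator matching the snippet above.
    struct LargeStruct { int key; char payload[1024]; };
    struct LargeStructComparator {
        bool operator()(const LargeStruct& a, const LargeStruct& b) const {
            return a.key < b.key;
        }
    };
    std::vector<LargeStruct> GetABunchOfThings() {
        return std::vector<LargeStruct>(10000);  // pretend this is the expensive part
    }

    // Compare through pointers so std::sort shuffles pointer-sized values
    // instead of whole LargeStructs.
    struct DerefComparator {
        bool operator()(const LargeStruct* a, const LargeStruct* b) const {
            return LargeStructComparator()(*a, *b);
        }
    };

    void SortWithoutCopying() {
        std::vector<LargeStruct> things = GetABunchOfThings();

        std::vector<const LargeStruct*> order;
        order.reserve(things.size());
        for (std::size_t i = 0; i < things.size(); ++i)
            order.push_back(&things[i]);

        // Only the pointers move; the structs stay put.
        std::sort(order.begin(), order.end(), DerefComparator());
        // Walk 'order' to visit the structs in sorted order.
    }

With C++0x move semantics, as noted above, the original value-based code becomes cheap as well.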
Python might be a bad example, as the interpreter really isn't terribly fast. But with Javascript, I wouldn't be surprised at all if the same naive implementation of a typical web-ish task (lots of strings, parsing, lots of heap churn) ran faster on V8 than it did compiled from C/C++.
By 'untuned' I'm also including using the wrong algorithm. For example, using the usual exponential recursive definition of the Fibonacci sequence in C++, as opposed to a loop (or even an O(n^2) recursive) version in Python: the Python version will be faster in this case. It is probably true, however, that any decent programmer will churn out C++ that is faster than Python for CPU-intensive tasks most of the time.
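To make the "wrong algorithm" point concrete, here is a small sketch (kept in C++ only, since the gap is algorithmic rather than a language issue): the naive recursion makes on the order of phi^n calls, so a plain O(n) loop, in Python or anything else, overtakes it long before constant factors matter.

    // Exponential-time "textbook" recursion: roughly phi^n calls.
    unsigned long long FibNaive(unsigned n) {
        return n < 2 ? n : FibNaive(n - 1) + FibNaive(n - 2);
    }

    // Linear-time loop: what any language's version should be doing instead.
    unsigned long long FibLoop(unsigned n) {
        unsigned long long a = 0, b = 1;
        for (unsigned i = 0; i < n; ++i) {
            unsigned long long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }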
The use of the algorithm is mostly orthogonal to the language being used. It's like saying 'walking from Amsterdam to Rotterdam can be faster than going by bike, if you take a detour to Beijing when using a bike'.
Personally, I often do an initial write of a program first in Python and rewrite in C/C++ if necessary. Usually, the first unoptimized C++ implementation is 10-100x faster than an optimized Python implementation.
> The use of the algorithm is mostly orthogonal to the language being used.
I don't find that to be true. I use more sophisticated algorithms in "higher level" languages because I can implement them in about the same time it takes me to implement less sophisticated algorithms in "lower level" languages.
> Personally, I often do an initial write of a program first in Python and rewrite in C/C++ if necessary. Usually, the first unoptimized C++ implementation is 10-100x faster than an optimized Python implementation.
Yes, but the C/C++ rewrite gets to take advantage of what you learned in the Python version, AND you only do the C rewrite when you need the speed and are willing to pay the time penalty to get it. (There's some extra cost to the C version, otherwise you'd always do C and never do Python.)
> I use more sophisticated algorithms in "higher level" languages because I can implement them in about the same time it takes me to implement less sophisticated algorithms in "lower level" languages.
These are factors not related directly to the language, but level of understanding, time, and money. An experienced C++ programmer can probably implement the 'better' algorithm equally quick in C++.
> There's some extra cost to the C version, otherwise you'd always do C and never do Python.
Not necessarily. In an existing project in a non-C language, the C implementation will require you to write and maintain a binding as well. That could be a good reason to use Python/Ruby/Prolog/whatever by default, unless it's not fast enough.
> These are factors not related directly to the language, but level of understanding, time, and money. An experienced C++ programmer can probably implement the 'better' algorithm equally quick in C++.
There are programmer performance factors directly related to language differences.
The mainstream example is the efficiency of DSLs. Try writing a complex regexp in a standard Perl-derived regexp DSL and then in plain C/C++.
The less-known example is writing complex business logic in Clojure vs. Java. The very fact that the Clojure version lets you examine much more business logic on one screen without scrolling puts much less pressure on the programmer's concentration and memory.
Thus the very ability to write complex business logic without lots of bugs under tight time constraints depends on the language.
> An experienced C++ programmer can probably implement the 'better' algorithm equally quick in C++.
While there are programmers who can implement a given algorithm faster in C++ than in Python, I suspect that they're the exception (among folks who know both).
> In an existing project in a non-C language, the C implementation will require you to write and maintain a binding as well.
If there were no C penalty, other languages wouldn't have been used as often for said existing projects to begin with.
My view is that each language has both strengths and weaknesses, so different languages are better suited to different tasks depending on factors like available personnel, existing software, and ease of implementation (to reach an acceptable level of performance).
In the average case for large datasets. Quicksort is O(n^2) worst-case (choosing the worst pivots possible), and Bubblesort is O(n) in the best case (e.g. an already-sorted array).
The C++ Bubblesort will have better worst-case and best-case performance, but worse average-case performance, for large inputs.
But this is all irrelevant, because in Python you'll be using Timsort[1], which I'm pretty sure is implemented in C anyway. In C++, you'll be using std::sort unless you have a template allergy. If you use the GNU stuff, that will be Introsort[2]; if MSVC, Quicksort.
Python doesn't even use quicksort. It uses timsort (http://en.wikipedia.org/wiki/Timsort), which is a hybrid merge/insertion sort. It's also used to sort arrays in two different versions of Java. In other words, it's one of the fastest-known sorting algorithms.
Downvoted because people think Quicksort in Python always wins vs. a bubble sort in C++? Wish I had time to find a case where that's not true. I'll point out that in the worst case, Quicksort is O(n^2).
When turning map into hash_map gives you a 30.4% increase in performance, you can't call this an 'extensive' tuning effort... The 'extensive' tuning in the paper (structure peeling, ...) accounts for minor improvements (< 10% total).
The initial code was supposed to be written in a straightforward way. That said, I was playing around with the code, and using tcmalloc with the original C++ code gives a better speedup (something close to 40% with the provided example).
We find that in regards to performance, C++ wins out by a large margin. However, it also required the most extensive tuning efforts, many of which were done at a level of sophistication that would not be available to the average programmer.
I am a fan of C++, so I liked the first part (that C++ wins, which is not really surprising). However, it was not clear whether the sophistication they talk about refers to Google's internal data structures or to the list of optimizations in Section VI-D. Many of these optimizations are not big surprises to a C++ programmer (where possible use hash_map, use vector instead of list, initialize data structures outside of loops and try to reuse them, and so on).
Overall, I do greatly appreciate the effort and their sharing the results.
The problem with possible optimizations and tuning is that they require constant effort. Projects often have serious deadlines, and you simply can't constantly think about memory or allocation footprint and still do a good job in the amount of time you have, unless you're well above average.
Most of the optimizations that got them big wins in C++ (using hash maps instead of tree-based maps, dynamic arrays instead of linked lists, .empty() rather than .size() > 0 on a linked list) are "library knowledge" issues more than memory / allocation ones.
In fact, looking at the list of optimizations on page 9, the only one that involved a low-level understanding of the machine and resulted in a significant improvement was (unless I missed something) the use of InlinedVector (and this is only barely such a case - lots of C++ codebases have a similar class).
Exactly. Using hash maps and vectors is a really elementary choice. Can you be considered to understand enough about "systems programming" (see other comments for a definition) without knowing that hash_map is implemented using a hash table and map using a tree? And that, therefore, adding elements to the map would incur rebalancing costs which hash_map wouldn't, as soon as you can estimate the number of elements to be processed? I consider it more a "meta" than a "language" decision.
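To illustrate, a minimal sketch of the map-to-hash_map swap being discussed, using the pre-C++0x ext/hash_map GNU/SGI extension; the Count example and its names are mine, not the paper's:

    #include <cstddef>
    #include <map>
    #include <vector>
    #include <ext/hash_map>  // GNU/SGI extension; C++0x's std::unordered_map plays the same role

    // Tree-based map: ordered iteration, but inserts may rebalance and every
    // lookup costs O(log n).
    std::map<int, int> counts_tree;

    // Hash-based map: amortized O(1) insert/lookup and no rebalancing, at the
    // cost of unordered iteration.
    __gnu_cxx::hash_map<int, int> counts_hash;

    void Count(const std::vector<int>& keys) {
        for (std::size_t i = 0; i < keys.size(); ++i) {
            ++counts_tree[keys[i]];   // O(log n) per key
            ++counts_hash[keys[i]];   // ~O(1) per key
        }
    }

The interface is the same for this kind of usage, which is part of why the swap is such a cheap change to make.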
Rather than library knowledge, I submit it's basic algorithmic and data structure knowledge. Perhaps one could argue it's knowing how that algorithmic and data structure knowledge maps to actual libraries.
By the way, std::list::size() is defined to be O(1) (see http://www.kuzbass.ru:8086/docs/isocpp/lib-containers.html#l...). While it's possible, of course, to define a linked-list where the size operation is O(n), that's not an implementation of std::list. Hence, there was no difference when std::list::empty() was used - although I agree it's better to use it, since it's more meaningful.
In a confusing bit of ISO trivia, std::list implementations can do either - the standards draw a distinction between "should" and "shall". If an implementation "shall" do something, it's a requirement, and if it "should" do something, it's the preferred choice among several. The size() function only "should" be constant time. In fact, in GCC (which I assume is the Google compiler of choice), std::list::size() is O(n) (http://gcc.gnu.org/onlinedocs/libstdc++/latest-doxygen/a0142...).
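A small sketch of why the distinction matters, assuming the O(n) libstdc++ size() just described:

    #include <list>

    // With pre-C++0x libstdc++, std::list::size() walks every node: O(n).
    // Calling it once per iteration turns this loop into O(n^2) overall.
    void DrainSlow(std::list<int>& work) {
        while (work.size() > 0) {   // O(n) emptiness check
            work.pop_front();
        }
    }

    // empty() is required to be O(1) for every standard container, so the
    // same loop stays O(n).
    void DrainFast(std::list<int>& work) {
        while (!work.empty()) {     // O(1) emptiness check
            work.pop_front();
        }
    }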
Summary: the author implemented a sophisticated compiler algorithm (loop header recognition) in a straightforward, canonical way in C++, Java, Scala, and Go, measured performance/memory usage, and then (the most interesting part of the paper) had colleagues who were experts in each respective language write highly tuned, optimized versions and reported on what it took to optimize in each language.
So C++ wins the performance test by large margin, Scala follows (3.6x), then Java 64bit (5.8x), Go (7x) and finally Java 32bit (12.6x).
Scala without any optimization is still better than Java with all the ninja-skills black magic applied. Scala optimized is pretty good - "just" 2.5x slower than C++
Actually it's odd. They say that in the Java tuning they made some simple optimizations that got it on par with the original C++ version, but then they refused to do anything further, noting instead that the same C++ optimizations would have applied to Java as well.
Does that mean the Java version would have been just as fast? It seems in general, though, that Java is just as fast as C++ unless you are an ultimate C++ expert.
Regarding the Java version: "Note that Jeremy deliberately refused to optimize the code further, many of the C++ optimizations would apply to the Java version as well"
From a performance perspective, the VM is what dominates and Java operates closer to the VM assumptions about what it's going to run.
If you read the paper, you'll note that the Scala author significantly changed the structure of the algorithm to conform with the way Scala does things (recursion, etc). So it's sort of an apples-to-oranges comparison as far as Scala's concerned, too bad they couldn't write ugly Scala code that would give a better comparison.
I only spent an hour straightening out his use of Go. Had I known he was going to publish it externally as "Go Pro" I would have tried to turn it into a real Go program, and perhaps even optimize it somewhat. Overall I think the Java and C++ code got the most attention from language experts. Ah well.
Speaking of Scala Pro: "It should be noted that this version performs algorithmic improvements as well, and is therefore not directly comparable to the other Pro versions."
"It should be noted"?! What an understatement! What's the point of going to the trouble of writing a paper if you're not even going to make your comparisons fair? Crazy.
"... the benchmark also used a HashMap<Type, Integer> when it should have just stored a primitive int in Type; it also used a LinkedList where it should have used an ArrayDeque."
Using a "older Pentium IV workstation". Older than what? Older than Grandpa? Older than dirt? Older than the dinosaurs? Why?
Pentium IV - really? Let's see - we've had Core, Core 2, Nehalem and Sandy Bridge since then, not to mention 3 in-between process shrink versions of the same architecture.
Perhaps we could pass the hat around so that Google could afford a Sandy Bridge workstation and discover what these interesting results look like on an architecture that dates from some point in this decade - or at least, at some point in the previous decade.
This stuff actually, really, truly makes a difference. And not necessarily in favor of or against any particular one of these languages...
Imagine: there is a paper discussing the techniques of weight lifting. Your complaint is that there are newer models of weights than those they lifted.
Newer processors don't just increase the performance of all programs equally. They have things like improved branch prediction, cache prefetching, better pipelining, and different cache sizes which can make a lot of performance optimizations that you get from C / C++ less relevant.
Except from the "most optimized C++ code" you have all sources of all benchmarks. Please try to run them on any newer x86/64 based processor and show us that the article conclusions don't hold. I'm not holding my breath though.
The parent commenter uses "improved branch prediction, cache prefetching, better pipelining, and different cache sizes" as "mumbo-jumbo that can mean something different." I'm in the business, so I can tell you that most of these improvements give you just some "overall speedup", so that you can happily buy today's processor running at 3 GHz and be glad it's faster than an almost decade-old P4 running at the same 3 GHz. Add to that that you now have a multi-core CPU, that you have to "clear" the paths to the cores in order to prevent them from slowing one another down, and that you also have to compensate for the bigger delay introduced by more modern RAM technologies, which trade latency for the ability to feed more cores.
Then measure the algorithms, which run on one core anyway, on the P4 and on the latest Core iX. Your slow languages won't be faster than your fast ones just because the quoted changes were introduced to the processors in between.
Please note that I did not disagree with your conclusion - I agree that it probably won't make a difference. If it makes you feel better, I'll say it's a very high value of probably. But I'm in the business of performing systems experiments. Removing as many variables as possible is just good experimental design. If you want to know what the performance will be like on modern machines, then it's best to run on modern machines.
Dear lord, thank you for this bit of common sense.
It's actually quite hard to know what a given piece of code will do on a given microarchitecture even if, on average, it runs everything X% faster - you may find you're the bit of code that bites the big one and runs X% slower on the new microarchitecture (e.g. you were depending on branch mispredicts being cheaper than they are), or suddenly your code runs way faster than competing codes (e.g. you're the superstar running 2X% faster because a sudden increase in ILP exposes that you've got a main loop full of independent operations).
Thought experiment: you are Intel, and your new processor somehow makes writing highly optimized C code unnecessary, since the speedup of some higher-level languages (say, Java or Scala) is bigger than the speedup of the highly optimized C code (in which the runtimes of Java and Scala are implemented). Wouldn't you make the biggest announcement in computing history? "With this new CPU, your sloppily written code is faster than highly optimized code. Every programming assumption valid up to now isn't anymore!"
...oh, I think I see what you're saying. If you have hardware that can predict jump-to-pointer and not just jump-to-literal, that can make quite a difference in how well more dynamic languages / paradigms perform compared to more static ones. I was thinking more along the lines of languages giving programmers (the appearance of) the ability to tweak their code to take advantage of these things, and "don't bother, the compiler is smarter than you are".
Branch prediction and processor caches help all languages, and low level languages like C and C++ even more than dynamic languages. What I'm saying is that smart compilers are not in any way a substitute for branch prediction and processor caches, on the contrary: smart compilers are designed to take the best advantage of these features. And on the point why C and C++ are still relevant because of this: it's basically impossible to fully take advantage of e.g. processor caches in languages like Java and Ruby because you cannot fully influence how your data structures are laid out in memory. You can in C and C++. Sure, in theory smart compilers could do that for you, but in practice they don't. And the problem is so difficult that we probably get that in compilers working about as well as a human C/C++ programmer around the same time that we get strong AI.
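As a concrete (and entirely hypothetical) illustration of that layout control, here is an array-of-structs vs. structure-of-arrays sketch; a scan over only the hot field touches far fewer cache lines in the second version, and that choice is simply not expressible for ordinary Java or Ruby objects:

    #include <cstddef>
    #include <vector>

    // Array-of-structs: each Particle is contiguous, but a pass that only
    // needs 'x' still drags the cold fields through the cache.
    struct Particle {
        double x, y, z;
        double mass;
        int    flags;
    };

    double SumX_AoS(const std::vector<Particle>& ps) {
        double sum = 0.0;
        for (std::size_t i = 0; i < ps.size(); ++i) sum += ps[i].x;
        return sum;
    }

    // Structure-of-arrays: the hot field lives in one dense array, so the
    // same scan streams through memory with far fewer cache misses.
    struct Particles {
        std::vector<double> x, y, z;
        std::vector<double> mass;
        std::vector<int>    flags;
    };

    double SumX_SoA(const Particles& ps) {
        double sum = 0.0;
        for (std::size_t i = 0; i < ps.x.size(); ++i) sum += ps.x[i];
        return sum;
    }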
I've programmed computers at a low level, and lifted weights for the past 21 years (and in fact, read some actual papers about weight lifting, and failed to make the aforementioned complaint). The Eleikos of today feel basically like the Yorks of yesteryear.
On the other hand, the processors of 2000 are quite a bit different to the processors of 2011, and anyone who thinks that they can prove that this is irrelevant to the experiment in question by pure reasoning (rather than experimentation) is a moron, plain and simple.
Having tracked individual embodiments of algorithms over architectures from the P4 to Sandy Bridge, I can assure you that tradeoffs between particular approaches to a given problem vary wildly based on changes in microarchitecture.
Overall, there's a steady improvement and perhaps these languages are all gaining exactly the same (despite a substantial improvement in effective ILP from P4 to Core 2)... but I don't know that for sure and neither do you, despite your sophomoric self-assurance.
If only they could do the experiments on a processor from the last 5-6 years...naturally someone else could do the work for them, but why bother?
The most instructive part of this is the set of changelog listings along with the resulting performance improvements. I'm really enjoying the C++ tunings. I wish they had shown the results with the Google hashes, though.
So, the C++ performance numbers were with the "public" version? It wasn't clear whether they left the internal version out completely (besides the changelog) or simply didn't release the code. The published C++ numbers are already way better than any other language's in terms of speed and memory use. It would be surprising if there were a substantially faster version.
The final list of changes is supposed to be for the internal version, and it roughly seems to show a 20% improvement over and above the basic std::map -> ext/hash_map replacement. I'm not sure if I should add the percentages or multiply them along to get the final runtime, though that doesn't make much difference, really :)
You're right that there may not be much improvement, but I've seen the google dense_hash_map do better sometimes than the basic SGI hash_map.
Edit: the code repo has the unoptimized versions. Oh well, at least I can get the satisfaction of seeing the 30% myself :)
Isaac, do you think you could add this problem to the shootout? I believe (and am therefore almost certainly wrong) that there isn't any benchmark there that derives almost purely from compiler theory.
link to the code: http://code.google.com/p/multi-language-bench/
This is a great resource to learn the different languages.
(I had never seen C++ template programming used in practice before)
Challenge to the Pythonistas: rewrite the Python version so it uses the C++ code under the hood, using scipy.weave.
I was surprised by the fact that the Python version has 626 lines of code, compared to 297 for the Scala Pro version. As Python is supposed to be more expressive, I was expecting something like 150-ish LOC for Python.
Actually, the Python version has about 328 lines of code, while the Scala version has about 216 lines of code. You should visit the Google project page and use something like cloc to analyze the files:
I confess that I do not know a lot about Scala, but it looks like some of the functional aspects of the language allow for some savings in lines of code.
If I had the time right now, I doubt it would be too difficult to shrink the Python version a bit. Just taking a glance at it, I don't see a single list comprehension throughout their code. I am sure there are some other language features that could have been better leveraged as well.
Just out of curiosity, where did you get the 626 LOC and the 297 LOC from? I tried looking but I can't find them anywhere. Though that could just be a product of my lack of sleep right now.
The 297 LOC for Scala Pro is from the paper and also includes 13 lines of Apache license boilerplate at the top and 14 lines of comments. The author used "wc -l" to count the LOC, which isn't very scientific anyway. The 626 LOC for Python is also from "wc -l". I looked at the code again and found that it contains a lot of detailed comments and even test cases, so it's not fair to say the Python version has 626 LOC.
I suspect the author gives it out as a reference implementation.
C++ code, when sophisticated template metaprogramming techniques are used, can be as fast as tuned C code, and can beat any language in terms of speed with the exception of hand-tuned assembly. However, techniques like expression templates, automatic loop unrolling, template specialization, and static polymorphism require a VERY high degree of sophistication from the programmer. In some sense, the programmer is forced to be a compiler. There are some individuals that, despite this demand, can be highly productive. The others wallow in complicated syntax, horrible type errors, and difficult-to-trace run-time errors. Unless speed is absolutely essential, C++ should be avoided at all costs.
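For anyone curious what that looks like, here is a small sketch of one of the techniques named above, static polymorphism via the curiously recurring template pattern (CRTP); the Shape/Square/Circle names are illustrative only:

    #include <cstddef>
    #include <vector>

    // The "virtual" call resolves at compile time and can be inlined (no
    // vtable lookup), at the price of template-heavy code and the error
    // messages complained about above.
    template <typename Derived>
    struct Shape {
        double area() const {
            return static_cast<const Derived*>(this)->area_impl();
        }
    };

    struct Square : Shape<Square> {
        double side;
        explicit Square(double s) : side(s) {}
        double area_impl() const { return side * side; }
    };

    struct Circle : Shape<Circle> {
        double radius;
        explicit Circle(double r) : radius(r) {}
        double area_impl() const { return 3.14159265358979 * radius * radius; }
    };

    // Works on a homogeneous, contiguous vector; the concrete type is known
    // at compile time, so each call is a plain, inlinable function call.
    template <typename T>
    double TotalArea(const std::vector<T>& shapes) {
        double sum = 0.0;
        for (std::size_t i = 0; i < shapes.size(); ++i) sum += shapes[i].area();
        return sum;
    }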
This is cool but I think it's important to remember that they're comparing algorithmic benchmarks but modern software has many different bounds besides for raw, algorithmic processing. For example, I think I/O and database interactions are the bottleneck for most web apps. I really liked the way this article broke it down: http://pl.atyp.us/wordpress/?p=2947
Given the languages they are using, I think it is fair to say that they are interested in systems-programming kinds of tasks, which for Google means the stack below what you normally use for web programming (i.e. I would expect Google to have their own web servers, or at least ones heavily customized from existing servers, etc...). For systems programming, "language speed" matters much more (one definition I like for systems programming, seen on Lambda the Ultimate: "programming where both algorithm complexity's constant and memory representation matter").
Of course, in most cases you're right. Given an infinite pool of very good programmers and infinite time, C will almost always be faster than most other languages. But in reality, you have a very limited pool of maybe-not-so-good programmers, and the project needed to be done yesterday :)
Keep in mind that since C++ is usually used for performance critical code, its compilers are often highly tuned and produce good code (see Intel's ICC for number crunching code).
A less popular language - or one that is usually used for less-critical code - will suffer.
C++ wasn't always so fast. See Kernighan and Pike, "The Practice of Programming" for a simple example where Java surprisingly beats C++.
True, but this is one of the reasons people choose to code in C++. The quality of tools & common libraries is as much a factor in language selection as the syntax and semantics of the language itself.
Looks like C++ is still one of the performance kings among current programming languages. Can't say I enjoy using it though.
It also looks like Go and Scala have a long way to go on compiler optimization. Not surprising, since they're young languages compared to C++ and Java, and it's a decent amount of work to optimize the higher-order constructs they provide.
Languages that come from an era when total memory was measured in kilobytes, CPUs in megahertz, and pointer arithmetic was as common as if statements will always be performance kings.
It'll take a pretty significant shift to change this. Something like 100+ core CPUs, each running slowly, might be such a shift.
Being surprised that C++ is faster than newer languages which provide all kinds of abstraction is like being surprised that assembly is faster than C++.
You're mostly correct, but the code they wrote DID use fancy abstractions like hash tables, templates, etc. C++ delivered on abstraction WITHOUT a performance penalty, which is what the C++ designers have claimed it could do since forever. Nice to see it verified again. I have no doubt that the C++0x syntactic sugar enhancements would have also delivered the same high performance.
Indeed. I write a lot of Objective-C code, and still drop into C++ for the CPU heavy tasks. Swapping out NS* containers for std::* alone gives a fairly high boost in performance.
All the C++ hate is blown way out of proportion. C++ is a great tool when you need performance. Tuning C++ isn't that difficult either. Contrary to what a commenter said above, steering clear of C++ at all costs is not what you should be doing. You should know when and how to use C++. It's useful more often than many people here seem to think. Especially if you develop for mobile devices.
I've been pushed back into C++ out of necessity for some DSP work and I'm actually enjoying it quite a bit. It sucks for text processing but for other things it's really not so bad. As you say, with computing shifting back to underpowered devices C++ is having another day in the sun.
Is it hard to use fsc? Fsc should fix those compile times, especially for minimal changes. In the paper it reduced compile time to 25-33% of scalac's compile time. (13.9s to 3.8s and 11.3s to 3.5s)
I was talking about incremental compiles (one loop around sbt ~test-compile), on a laptop, with a web browser open. Colleagues have benchmarked SSD compiles at 5x+ faster.
The Java version spends 75% of its time in GC? That smells fishy to me.
Edit: I just got to Section VI of the paper, which does some GC tuning. In my own playing around, I also found about a 20% speedup by changing the HavlakLoopFinder.UnionFindMode class to a static class.
This paper served as a great introduction to Go programming for me. I've been trying to work through my own basic program in Go for the past few weeks and their overview of the language features used cleared up a lot of loose ends for me.
Having the same algorithm implemented in several other languages helps, too. It enables you to look at the implementation in the language you're most comfortable with, then look at one of the other implementations and think "Oh, so that's how that's done in this language."
I wonder what JVM they used, as there are significant performance differences between HotSpot and JRockit. I also wonder what C++ compiler they used; again, the Microsoft compiler will yield different results compared to GCC.
One of the interesting details is that Go Pro required 786 lines of code as opposed to 850 lines of C++ code and 297 lines of Scala Pro code. Regular Go code was 902 lines. It certainly doesn't look like Go scores any expressiveness points over C++.
"Go Pro" is unfortunately a very misleading name. I just spent an hour cleaning up his use of Go. Nobody made any attempt to actually turn this code into a well-written Go program. I did not realize that he was going to publish it externally.
Would you consider touching up the Go version again to make it more idiomatic and well written? I'd be interested in seeing good Go examples; I'm sure many others here would as well.
I'll keep that in mind, as well as the fact that it will get slower as it starts doing useful work and optimizations, thus losing its "fast compilation" edge (which is just about its only advantage right now).
It's also interesting to note that it looks like he measured LOC with "wc -l", so all includes, comments (including license), whitespace, debug code, and the main are included.
LOC isn't a great metric, and the authors of this paper seem to make it worse by not measuring it very well.
It may just be because I've been writing a ton of C++ lately, but to me the C++ data declarations are the most readable of the bunch. The stuff with iterators is much uglier, of course, but so far I like the syntax of C++ a lot better than that of Go.
...and then if you typed a period (.), it would show the members of std::string. I'd say it works better than IntelliSense (it's very precise, and you use the same parser for code completion and final compilation).
I'm too dependent on the rest of the integration with XCode to venture back into emacs land again but I'm looking forward to a rev of XCode that includes this.
Who said that when you release a paper it needs to look like it came from a textbook? I couldn't get past the first page because I felt like I was back in high school.
Submissions must be in English and at most 12 pages total length in the standard ACM SIGPLAN two-column conference format (10pt).
Nearly all academic papers in computer science and engineering look like this. Not to say they've converged on an optimal format, but it wasn't exactly a creative decision on the part of these authors.
"We find that in regards to performance, C++ wins out by a large margin. However, it also required the most extensive tuning efforts, many of which were done at a level of sophistication that would not be available to the average programmer.
Scala concise notation and powerful language features allowed for the best optimization of code complexity. The Java version was probably the simplest to implement, but the hardest to analyze for performance. Specifically the effects around garbage collection were complicated and very hard to tune. Since Scala runs on the JVM, it has the same issues.
Go offers interesting language features, which also allow for a concise and standardized notation. The compilers for this language are still immature, which reflects in both performance and binary sizes."