There are a few things wrong with this benchmark. In Lush, he's measuring the time it takes to sum 3 million pre-computed random numbers. In Clojure, he's measuring the time it takes to generate 3 million random numbers, and then sum them.
It also looks like he's using two different data structures. In Clojure, it's a lazily generated linked list of objects; in Lush, it appears as if he's using a vector pre-allocated to the right size.
He also doesn't mention which versions of Clojure and Lush he's using, or how much memory Lush uses. He complains that a JVM with a 130MB heap is too small, but the 30-million-element array in Lush would have been almost twice that size if it were populated with 64-bit doubles.
For reference, it takes me 571ms to run the 300 thousand number example that took him 861ms. If I factor out the random-number generation from the Clojure benchmark, it cuts the time down to 120ms.
If I then use a pre-generated fast array and the specialized areduce function (to match his specialized idx-sum function), I get 16ms for 300k, 22ms for 3m and 230ms for 30m.
So performance is on par with Lush, so long as you compare like for like.
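The generation-inside-the-timing flaw is easy to reproduce in any language. Here is a minimal sketch of the two measurement strategies (Python used purely for illustration, since the code under discussion is Clojure and Lush):

```python
import random
import time

N = 3_000_000

# Flawed version: the timing includes generating the random numbers,
# because the generator is only consumed inside the timed region.
start = time.perf_counter()
total = sum(random.random() for _ in range(N))
flawed = time.perf_counter() - start

# Like-for-like version: pre-compute the numbers, time only the sum.
nums = [random.random() for _ in range(N)]
start = time.perf_counter()
total = sum(nums)
fair = time.perf_counter() - start

print(f"generate+sum: {flawed:.3f}s, sum only: {fair:.3f}s")
```

On most machines the second number is a fraction of the first, which is the whole complaint: the Lush benchmark measured only the second, while the Clojure one measured both.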
There are various tools you can use, and using a screwdriver instead of a hammer won't help you much if what you have is nails.
From my point of view, you may get back to me when Lush has data structures for concurrent programming and does (nearly) pauseless GC on multicore machines with heaps of 12GB composed of complex data structures, not just vectors of numbers. This is what I get with Clojure and for my applications it works great. Oh, we also do numerics, quite a bit in fact. But whenever I need actual raw-metal double-adding performance, I use an appropriate tool, e.g. an extension/library. For certain string operations we even went down to x86-64 assembly, with great results.
What you describe is a JVM limitation, not a Clojure limitation — and you're right in only one thing: Clojure is not the right tool for you.
Digression here, but I like the "(nearly)". This bit always amuses me about garbage collection wonks. The pauseless bit is a real time requirement. Saying your GC has great latencies or is "(nearly) pauseless" is tantamount to telling a real time engineer your system only fails some of the time. It makes you look dumb.
GC is great. GC makes a ton of things simpler. GC as implemented in popular environments still sucks for real time use.
The 'pauseless' bit is not just a real-time requirement. A website that stalls for 2 seconds once every 10 minutes is unacceptable to me. That application thus requires a garbage collector that is nearly pauseless: the pauses are small enough that your users won't notice them. So I don't think it makes anyone look dumb at all to speak of a 'nearly pauseless' garbage collector.
I can relate to this. Nothing is more disappointing than putting together a nice piece of code and then seeing it run too slow to be practical.
A while ago, I wrote some physics simulation in Python. The code turned out to be really neat and quite idiomatic Python. It also ran very slowly: I was seeing only 10-20 frames per second when I was shooting for something closer to 120. All I was doing was simulating a cube suspended on a plane with four dampened springs, and I was only simulating one of those where I wanted lots of 'em.
The problem was that my code was allocating and freeing way too many objects. The proper solution would have been to go from a neat, idiomatic approach where the code resembles the mathematical equations it simulates to some kind of structure-of-arrays-style code with NumPy. Needless to say, there isn't much fun in using Python that way, so I might as well write the damn thing in C.
Later I went on to rewrite that code in C, running it on my GPU with OpenCL. I did write it in a structure-of-arrays style, so maybe the Python experience taught me a valuable lesson. Now it runs fast enough for my purposes.
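The structure-of-arrays rewrite described above can be sketched even without NumPy, using the standard-library `array` module. The spring constants and the integrator below are hypothetical stand-ins for the simulation, not the original code:

```python
from array import array

n = 4  # four dampened springs

# Array-of-structures style: one object per spring. Nice to read,
# but every field access chases pointers and boxes floats.
springs_aos = [{"x": 0.0, "v": 0.0, "k": 50.0, "c": 0.8} for _ in range(n)]

# Structure-of-arrays style: one flat, typed array per field.
x = array("d", [0.0] * n)  # displacement
v = array("d", [0.0] * n)  # velocity
k = array("d", [50.0] * n)  # spring constant
c = array("d", [0.8] * n)  # damping coefficient

def step_soa(dt: float) -> None:
    # Dampened spring with unit mass: a = -k*x - c*v,
    # integrated with semi-implicit Euler.
    for i in range(n):
        a = -k[i] * x[i] - c[i] * v[i]
        v[i] += a * dt
        x[i] += v[i] * dt

x[0] = 1.0  # displace one spring and integrate a few steps
for _ in range(100):
    step_soa(0.001)
print(x[0], v[0])
```

In pure Python this layout mostly buys you unboxed storage; the real payoff comes when the per-field arrays can be handed to NumPy, OpenCL, or C wholesale, which is exactly the rewrite described above.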
With a decent implementation you can write performant code in a dynamic language.
An example in Common Lisp using the SBCL compiler on an Intel Core2 Duo @ 2.66GHz w/ 6G of RAM:
CL-USER> (defparameter *nums* (make-array 3000000 :element-type 'float :initial-element 0.0))
*NUMS*
CL-USER> (dotimes (i (length *nums*)) (setf (aref *nums* i) (random 30.0)))
NIL
CL-USER> (time (reduce #'+ *nums*))
Evaluation took:
0.375 seconds of real time
0.379975 seconds of total run time (0.379975 user, 0.000000 system)
[ Run times consist of 0.197 seconds GC time, and 0.183 seconds non-GC time. ]
101.33% CPU
1,000,511,208 processor cycles
96,010,096 bytes consed
4.4951772e7
CL-USER>
C isn't always the answer (though it often is a pretty good default).
I think it's naïve to speak of "performant code" in general. Sure, I used SBCL (and CMUCL, and AllegroCL) to get blazing double-type single-cpu performance. But then I wanted to make use of multiple cores, which is when I ran into a brick wall. Things are better with SBCL now than they used to be, but people still run into bugs related to multithreading, not to mention that CL has little support for concurrent programming.
The definition of "performance" is different for every application. For some, it's multiplying long vectors of doubles. For others, it's making use of 16 cores for complex data structures operations. For others it's about servicing tens of thousands of mostly idle TCP connections (tried to do that with AllegroCL, failed). For others yet, it's about achieving consistent response times, which means concurrent and nearly pauseless GC.
In other words, choose the right tool for the job.
However I think the OP was comparing dynamic vs static languages. Ey made a prototype physics simulation in Python whose source code was nice to read but performed terribly. Ey then decided to write it in C (not a bad choice mind you).
It seemed to me that the OP found the readability of Python (and dynamic languages in general) a useful trait to have while lamenting that their implementations were too slow to be practical for eir purposes.
Therefore I proposed an example of a dynamic language with the property of being easy to read while being fast in implementation.
I've used Qwt for decent results. It may be very heavy if you're not already using Qt, though, and the way its 2D plot is implemented makes it unusable if you need to update your data many times per second (I eventually just put my data in a texture and displayed it via OpenGL, doing the scaling manually). Another solution is to call gnuplot, matlab, or whatever through the command line.
Too many developers are paying way too much attention to the code being "nice" or it being in a "nice" language, and not enough attention to "does it work" and "is it fast".
The problem is that when working with prototype code, you want to be able to understand what's going on. Code in structure-of-arrays style is harder to read. It would be nice if the readable version ran with performance at least somewhere close to what you need in the end.
We definitely need "nice" languages, because they give a huge productivity boost over writing stuff in C or another low-level language.
However, many nice languages tend to be implemented with an interpreter, a crappy garbage collector and a giant lock that prevents any real parallelism with threads. This makes them unwieldy for applications where latency or throughput is important.
Recently the trend has been towards languages implemented with a compiler, using smart static typing disciplines and an LLVM-based backend. This is a very positive and welcome change.
Nice code is maintainable code. High-level languages exist for the sole purpose of making code "nice."
You pay enough attention to "is it fast" and you'll find yourself writing everything in assembler. There's a time and place for that, sure, but when you have a high level language that can theoretically optimize idiomatic code into something faster, it should opt to.
> You pay enough attention to "is it fast" and you'll find yourself writing everything in assembler
Not everything. Not even then. Just the good bits maybe. Sometimes none of it. It might make me prefer C# over Java for example, because C# can lay down structs linearly in memory so I can throw them to the graphics card directly. Or maybe I'd do all the hard work in C and SWIG it so I could use Ruby.
But what I certainly wouldn't do is attempt to write a physics simulation in python, "idiomatic" or not.
"But what I certainly wouldn't do is attempt to write a physics simulation in python"
Depends on whether performance is a hard requirement or not - I've written engineering simulations of parts of power stations (including one nuclear plant) in Lisp and they weren't particularly fast but nobody bothered because they weren't used interactively.
Eventually I did write Lisp to generate C++ code for the same models when we did want better performance and it worked really well - but I only did that as an optional step when someone needed the extra performance.
But that's the thing - I would argue that NumPy code is not "idiomatic" Python which uses built-in operations/structures, like list comprehensions, tuples, and dictionaries. I had a similar experience during extensive NumPy coding where I thought, why am I not just writing this in Fortran.
Depends on what are you working on and what's the goal. I'm frequently on call for maintenance of a distributed system. During review I often return code that is not nice enough. Where "nice" means "will I understand this at 3am, seconds after I get woken up". I couldn't care less if it takes twice as long to execute. (please don't confuse it with not caring about complexity though - I do care if it's N^2 and doesn't need to be)
This doesn't mean it's not fun to wrap some long function into a reduce of a list comprehension. That goes into my fun projects, not stuff that's supposed to run 24/7 and worked on by others.
The title is link bait to the point of nearly trolling. He didn't do his research; he should have jumped on the Google group or Stack Overflow or something before writing this. Someone named Mike in the comments on the OP comes up with a fix that uses Java primitives and that is very fast, in the range of his Lush examples.
The first example lets lazy evaluation of the random numbers leak into the timing; toss a (dorun tmp) in before the timing to make it a little more sensible.
The type annotation ^doubles is useless - it's telling Clojure to expect the definition to get an array of doubles, but then binds it to the same old linked lists.
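Clojure's lazy sequences behave much like Python generators in this respect: nothing is realized until the sequence is consumed, which is exactly how generation cost leaks into the timed region. A tiny illustration, where the counting `rand_like` function is a made-up stand-in for the RNG:

```python
calls = 0

def rand_like():
    # Stand-in for a random-number generator that counts its invocations.
    global calls
    calls += 1
    return 0.5

tmp = (rand_like() for _ in range(5))  # lazy: like an unrealized Clojure seq
assert calls == 0   # nothing has been generated yet

total = sum(tmp)    # consuming the sequence forces the generation
assert calls == 5
print(total)        # 2.5
```

Forcing the sequence up front (Clojure's `dorun`, or a list comprehension in Python) moves that generation cost out of whatever you time afterwards.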
Based on the work of a previous commenter, I posted this code in the blog comments:
(defn add-rands []
  (let [ds (double-array 30000000)]
    (dotimes [i 30000000] (aset ds i (Math/random)))
    (time (areduce ds i res 0.0 (+ res (aget ds i))))))
This adds 30,000,000 numbers in 73ms on my machine. His Lush code added 30,000,000 in 180ms. I estimate my computer is twice as fast as his, putting them on par. Hopefully he will run the code so we can see all the run times on the same machine. (I never could get his Lush code running.)
Of course, the Clojure code here is fairly involved to do some basic stuff, but if you did things like this often it would not be hard to add some nice syntactic sugar over it.
To me, this reads like someone who used C to add a large list of numbers recursively and concluded C was trash because they ran out of stack.
There's no denying there is a specific problem. People are certainly entitled to decide that solving that problem is too much hassle for them. But how seriously should we take their opinions when they do that?
I'd say it depends on the quality of the effort they did invest. In this case, the OP admits that they used the wrong data structure in Clojure and the right one in Lush, which makes the entire exercise meaningless.
This is not a correct C program; it's a C program for a compiler with tail-call optimization (or with a stack large enough for the recursion to work). There's no TCO in the C standard.
Because it relies on a particular implementation of the C compiler, not on the C standard. Take the -Os flag away, or use a compiler that doesn't do TCO, and the program has a bug.
See also:
* using memcpy() for overlapping regions (relies on the particular implementation of memcpy(); was exposed by a change in glibc - http://lwn.net/Articles/414467/);
It is ansi C standard code. There is an implementation limitation (stack size) that this program runs into. However that doesn't violate the C standard in any way, because the standard doesn't specify available stack space.
So the program is compatible with certain C implementations in certain configurations. Given the C standard allows a C implementation to pick any stack size they like (including zero) we could claim that about all valid C programs that make any use of the stack.
They're not incorrect, but that's mostly because the C standard also allows you to assume an infinite stack, so recursive programs that ask for absurd stack depths don't violate the standard with or without TCO. But if in practice the program only runs successfully when compiled on a compiler with TCO, it's not very portable, in that it relies on non-guaranteed compiler features for it to actually work.
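The same standard-versus-implementation gap shows up in any runtime without guaranteed tail-call elimination. CPython, for example, promises no TCO, so a sum written as a tail call works only up to the interpreter's stack limit. A small sketch:

```python
import sys

def sum_rec(xs, acc=0.0):
    # Tail-recursive in form, but CPython performs no tail-call
    # optimization, so every call still consumes a stack frame.
    # (The slicing is also O(n) per call; this is an illustration,
    # not an efficient sum.)
    if not xs:
        return acc
    return sum_rec(xs[1:], acc + xs[0])

print(sum_rec([1.0] * 100))  # fine at shallow depth

try:
    sum_rec([1.0] * (sys.getrecursionlimit() * 2))
except RecursionError:
    print("stack exhausted")  # the 'valid program, implementation limit' case
```

Whether this counts as a bug in the program or a limit of the implementation is precisely the distinction the comments above are arguing over.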
> In this case, the OP admits that they used the wrong data structure in Clojure and the right one in Lush, which makes the entire exercise meaningless.
This right here is the crux of the flawed hypothesis of the entire article.
Languages where the upgrade path to fastness isn't horrible are okay with me also. In Common Lisp, for example, you can start adding type declarations in speed-critical areas. Clojure has something similar (I'm not as familiar with it, so I'm not sure how it compares to CL's declarations).
Here is another place where Go hits the sweet spot between performance and expressiveness/simplicity.
Also by giving you control over how memory is laid out you can be much more efficient both in space and time, and when you need to go even faster you can optimize things quite tightly.
Guy Steele said it better though. He wants not the fastest language, but the fastest way to get to the solution. It may mean that the language isn't as fast to run, but faster to program.
Learn to be calmer and more patient. Maybe you need to think differently when you use another language and another VM. Maybe this language or VM really isn't strong at what you want to do, but is strong at other tasks.
Also, you might stop thinking in absolutes. There are no fast languages. Every good and well-known language has its strengths and weaknesses, and all optimizations are a trade-off. If you get stronger at one thing you MUST get worse at something else. So if you have a fixed set of problems, you really just need to find out which language covers your problem space best. And if you find it, that doesn't mean this language is better than the others.
If you want to benchmark two languages you need to do this:
Find N coders skilled in language X and N coders equally skilled in language Y. Have all of them code several small but realistic systems of various kinds, then benchmark the results and compare. Better yet, benchmark real, live systems of comparable purpose built using different languages. Anything else leaves you open to having your results dominated by differences in level of coding skill, quirks specific to your synthetic benchmark, and effects that are only prominent for tiny, overly simplistic programs.
Scala vs. Clojure is not something to discuss for this use case, I guess. As https://news.ycombinator.com/item?id=3293261 points out in that other thread, you won't write (recognizable) Scala (or Java. Or Clojure) code anymore if you want the best performance.
A language can't be fast; a language is a notation. Only implementations can be fast or, perhaps, generate fast code.
For example, look at GHC versus some toy Haskell whipped up in Common Lisp. Now, with those two implementations to hand, is Haskell 'fast' or 'slow'?
> A language can't be fast; a language is a notation. Only implementations can be fast or, perhaps, generate fast code.
Theoretically, maybe. But it is nearly impossible to come up with an implementation of Ruby that is as performant as a compiled C program (and retains the characteristics of Ruby).
So, it is definitely true that some languages are easier to compile to fast machine or byte code than others. As a consequence, people will call Ruby slow, and C fast.
> But it is nearly impossible to come up with an implementation for Ruby, that is as performant as a compiled C program
PyPy is getting there.
JIT, Runtime type detection and path optimization go a long way. And it's much easier to write correct Ruby than correct C.
When I went to college, the HP-41 was HP's top-of-the-line calculator and all the rich kids had them. I had a BASIC-programmable CASIO PB-700. In the end, I was able to finish tests in less time than it took the HP guys to program their calculators to spit the answer. I think it was the first time (1986 or so) I realized a more expressive programming language could be a decisive advantage over your competition.
"And it's much easier to write correct Ruby than correct C."
No, it isn't. C may be spartan, but its failure modes are predictable and manageable. A ruby program can be incorrect for a host of reasons completely unrelated to the program (interpreter bugs, garbage collection problems, etc.) and the language as a whole provides features (dynamic typing, method injection, exceptions) that make it harder to write correct programs.
Ruby may make it easier to write programs, but it definitely doesn't make it easier to write correct programs.
Eh, Ruby's failure modes are predictable too. I've never tracked a bug down to a "garbage collection problem", whatever that is. But I certainly have come across problems with C's lack of type safety, lack of memory safety, lack of GC, etc.
This is true, and I'm often peeved by exactly the same conflation. However, in cases like Clojure, where there's (as far as I know) really only a single implementation out there to be used, it's easy to blur the lines. Python is getting less so thanks to PyPy, though my perception of that is that thus far it's more on an "awareness" level than it actually being in widespread production use (though hopefully as it matures we'll see more of the latter).
The languages mentioned in this blog post present an interesting sort of in-between case though, since at some level you could almost regard them as being a collection of implementations of the same language (Lisp), rather than as actually distinct languages. I realize that's stretching some definitions a bit, but I think there's kind of a continuous spectrum there between "different language" and "different implementation of the same language". I mean, GCC implements extensions to C that aren't in the C standard and aren't (to my knowledge) supported in, say, MSVC; some refer to this as "GNU C", though I'd guess most non-language-lawyers looking at it would probably still call it "C" anyway.
I agree with your main point but think Clojure is not a good example as there are 3 "official" implementations: the JVM one, the CLR one and ClojureScript.
putStrLn =<< (show . sum . take 3000000 . randomRs (0, 100::Int)) `fmap` newStdGen
::Int is necessary to nail down exactly which numeric type we're dealing with, as nothing else in the expression nails down an exact type.
Personally I prefer loading Control.Applicative and using <$> where the `fmap` is, but I thought I'd keep the dependencies low. This demonstrates both how adding IO to an expression can be a bit of a pain, and that Haskell has come up with tools to deal with this without turning it into the naive "do" block with three or four lines a newbie would turn out.
I wouldn't expect miracles on the random number generation speed. Also in this context you may not be getting the best optimizations that you would get if you really cared. It so happens that Real World Haskell uses a virtually identical example in its optimization chapter, where by the end they get down to what is essentially the optimal non-SIMD assembler, if you're really interested in the dirty details: http://book.realworldhaskell.org/read/profiling-and-optimiza...
More practical example: CPython vs. PyPy. Also applies to CPython and CRuby vs. their JVM and CLR counterparts like Jython, IronPython, JRuby, etc.
However, a language will impose some limitations on the implementation, be it a compiler or an interpreter. Dynamic typing is perhaps the most concrete example of this.
The idea of a 'fast language' is still nonsense, given how much faster the best interpreters of dynamic languages are compared to the code generated by the worst compilers of static languages.
The title says only fast languages are interesting, then rules out using Java because it's not lispy enough :/
Either you want speed, or you want your favorite syntax and high level stuff. You can't have both.
FWIW My favorite syntax happens to be c/java/assembly type stuff. I'd hate to be one of the developers who hates those, must put you at a big disadvantage - as shown in the original post.
Why not? Apparently Lush gets it right: C-like speed on numerics with a Lisp-ish syntax this guy likes. Yes, C-style stacks are going to beat Lisp-style closures; but there's no reason why a "fancy" language can't do fast matrix operations, which is 99.9% of numerical code anyway.
Even Scheme has (rather un-Lisp-ish) vectors in addition to the standard linked lists.
I once made the joke that a Java web app usually catches up with a Django one in about six months. The Java app is faster, but the Django one starts serving users three months before the Java one enters live beta.
And if you are serious about number crunching, my friends tell me you should use FORTRAN, not Java.