It's true that it's arbitrary; the problem is, there's no non-arbitrary alternative, so it's a weak charge to make.
Unless you're going to allow every language to converge on simply blasting some bytes into RAM and executing optimized machine code, and then watch all the languages cluster at the exact same spot, give or take how long it takes to load the machine code, you have to draw some lines about what's a "real" implementation and what isn't. If I were running it, I'd be even more strict and insist that the solutions must be "idiomatic"... and what's idiomatic and what isn't would also be arbitrary, because there is no escape from the "arbitrary". Yet the end result is useful. Not totally determinative, but it's not anywhere near "useless" either.
Combining your comment and DarkShikari's (http://news.ycombinator.com/item?id=2404278), it seems like a good shootout should simply compare the simplest, most idiomatic solution possible in each language.
The problem with a comparison of idiomatic solutions is that they rely in different degrees on what's in their libraries. IIRC the shootout tries to avoid this by comparing similar solutions using the same algorithm that are actually implemented with the interpreter/compiler in question. IMHO that's the more sensible way to do this.
Perhaps that concern is eliminated with good test programs? As an extreme example, if I wanted to compare how different languages do array slicing, it's meaningless to force Python to implement array slicing without the [:] operator. But then this whole test program is meaningless anyway.
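To make the slicing example concrete, here is a minimal sketch of what "forcing Python to slice without [:]" might look like next to the idiomatic form. `manual_slice` is a hypothetical helper invented for illustration; it is not taken from any shootout submission.

```python
def manual_slice(items, start, stop):
    """Hypothetical helper: copy items[start:stop] without slice syntax."""
    result = []
    for i in range(start, stop):
        result.append(items[i])
    return result


data = list(range(10))

# Idiomatic: the slice operator does the copy inside the interpreter.
idiomatic = data[2:7]

# Forced: the same copy spelled out element by element in Python code.
forced = manual_slice(data, 2, 7)
```

Both produce the same list, so a benchmark of the second form would mostly be measuring loop overhead that no Python programmer would actually pay.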
You're right that there's a lot of nuance here, and perhaps there's no better solution. It's not even clear to me if there's a meaningful question to be answered. "This ocean's warmer than that one? What part? In summer or winter? Did you measure near a blue whale's fart?"
No, I think I'll just sit here and complain ineffectually, but thanks for the suggestion!
I can work on only one thing at a time, but I can complain about many things at once. I think both functions are useful.
I wasn't even aware of this particular problem. Now this thread's taught me (and others) to utterly ignore the shootout. That's useful even if nobody builds a replacement.
I don't see how your one-line response is addressing the original article's criticisms, especially in the final paragraph.
I'm also disappointed that you respond to a rebuttal by switching your argument, without acknowledging whether the rebuttal is correct or not. I'm going to stop responding to this thread.
Do you think that final paragraph is somehow The Truth?
>> "It's also not possible to send any messages once your ticket has been marked as closed, meaning to dispute a decision you basically need to pray the maintainer reopens it for some reason."
Truth: There's a public discussion forum!
I'm disappointed that Alex Gaynor never mentioned that the first problem with his program was a bug in CPython, but instead kept that to himself for his blog.
I'm disappointed that Alex Gaynor never mentions in his blog that the next version of his program didn't work on x64 - it hung and timed-out after 1 hour.
Joseph La Fata contributed a Python pi-digits program the same week - his program worked first time on x86 and x64, on PyPy and CPython and Python 3 and only used ctypes to get to GMP.
2 days ago Joseph La Fata contributed a Python spectral-norm program - his program worked first time on x86 and x64 on PyPy and CPython and Python 3.
Do you see the difference yet?
What do you think the blog entry "My experience with Alex Gaynor" would be like?
I hate rhetorical questions. Just say what you want to say. Most people won't react to them like me, but they won't expend the effort to parse them and understand your point either.
I'm not going to fact check every story I read. When I actually do something that requires choosing a platform I may look at the shootout. Or I may just try a few little programs myself.
You may think I'm being unfair, or moronic. But I suspect most people are like me. Even you, when you don't notice. PR feeds on the interested-but-uninvolved.
I find this thread tragic. People could have seen the shootout's side, but now most of them will remain uninformed because the herd passed through while you were asking rhetorical questions. It's the shootout's loss.
I like your standard, but I think the real criticism being made is not that it's arbitrary but that it's arbitrary in different ways for different languages. This starts to look like intentional dishonesty. Another standard closer to the status quo would be "anything portable". Why does one language require portability and not another? What is the actual problem with the example given? There doesn't appear to be one.
> Why does one language require portability and not another?
Because more than one language implementation was measured for that language but only one language implementation was measured for the other languages.
If only one language implementation was shown for Ruby it would be Ruby 1.9 - not JRuby
If only one language implementation was shown for Lua it would be Lua 5.1.4 - not LuaJIT
If only one language implementation was shown for Python it would be Python 3 - not PyPy
PyPy and LuaJIT are being treated more favourably than other language implementations.
In my experience language shootouts in general are often terrible by design -- with these being just a prime example.
A lot of the problem is in the questions they're designed to answer versus the questions people use them to answer.
For example, if I'm comparing Python and C, I typically want to know "how much slower would my program be in Python?", not "how much slower is my program in Python if I spent so much time hyper-optimizing it that I might as well have written it in C?"
But the test cases usually try to answer the latter, not the former.
It might be more reasonable than it seems at first glance. It's true that it's good to know how fast typical code runs, but there's another important question: when I run into performance problems and need to optimize a bottleneck, how fast can I make it before I have to resort to non-portable code or C extensions that complicate my build process?
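The two questions (how fast is typical code, and how far can tuning take it) can be asked of the same task side by side. Below is a rough sketch using the standard `timeit` module; the task (sum of squares) and the function names `sum_squares_plain`, `sum_squares_builtin` and `best_time` are assumptions made up for illustration, not anything the shootout uses.

```python
import timeit


def sum_squares_plain(n):
    """The version you might write first: an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_builtin(n):
    """A lightly tuned version that leans on the built-in sum()."""
    return sum(i * i for i in range(n))


def best_time(fn, n, repeats=3, number=10):
    """Return the fastest total time in seconds across `repeats` runs,
    where each run calls fn(n) `number` times."""
    return min(timeit.repeat(lambda: fn(n), repeat=repeats, number=number))


if __name__ == "__main__":
    n = 100_000
    for fn in (sum_squares_plain, sum_squares_builtin):
        print(f"{fn.__name__}: {best_time(fn, n):.4f}s")
```

Which version wins, and by how much, depends on the implementation it runs on, which is rather the point of the thread.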
Anybody writing, for example, Python code to solve these sorts of problems in the real world would instantly reach for numpy. Which, while not part of the core language distribution, is pretty close to being a standard library for most Python programmers. I'm sure several of the other languages have similar libraries that are being ignored in these benchmarks. Without taking things like that into account, these results don't say too many useful things about real world performance.
Languages/implementations also vary in how much overhead switching to C costs you, especially in loops, say, which e.g. the pidigits benchmark does sort of measure.
Because there's always a "for what" in there. As an analogy, consider MMA. On the surface, it seems to answer the question "what's the best martial art?". But really, it only answers the question of what is the best martial art for fighting a single opponent in an octagon-shaped ring in front of an audience, aiming for a submission. And the answer is of course Brazilian Jiu-Jitsu, exactly the answer the founders of UFC wanted...
Former cop Rory Miller writes about this in his book: the police experimented with BJJ and found it useless. Why? Because in BJJ you pin your opponent on his back because it makes a better show for the audience, but as a cop you always pin your opponent on his front so you can handcuff him!
Ok, sure, but I'd rather see a rough attempt at getting some numbers than just throwing your hands up in the air and saying "gee, that's a hard problem".
Also, I think that we all know enough about programming and languages and their many uses that we can talk directly about it, rather than about an analogy.
Why does somebody saying "I wouldn't use this." have to provide an alternative? If such a statement is backed by reasons I find it interesting to read. They're saving me trouble trying it out, just like any other review.
"So you can either complain that they're not good, or you can try and improve them."
Yes, that sentence is literally correct. But it sounds like it's saying one option is not useful. And I still haven't heard a single reason why reviews of benchmarks are bad.
Responding to a criticism with "those who can't do criticize" is super, super boring. It's been done to death. You're just tarring all criticism with an overly broad brush. If it's bad criticism why is it worth responding to? And if it's plausible criticism why aren't you focusing on the actual details?
Maybe there's a space for "if you spent an average amount of time optimizing" :) For example my PyPy optimizations took maybe 4 hours total, with 0 time spent looking at assembly.
The question marks seem to suggest that you're poking a hole in grandparent's argument, but it seems like you're both in agreement that the shootout is misused. What am I missing?
I don't see why the same program has to run on PyPy, CPython and Python 3. Isn't the idea to do different implementations for each language? I know technically they expose the same (or a similar) language, but they're dissimilar under the hood.
I'd imagine the implementation varies across the other languages by more than just syntax.
The point that worries me the most is the amount of micro-optimisation being applied to these "benchmark" programs, making the results more or less pointless for real-world use.
There's nothing to say that all of CPython, Python 3, and PyPy must be measured.
For the moment, they all are being measured, so it's interesting to see that programs written for CPython might perform badly with PyPy, and programs written for PyPy might perform badly with CPython.
I think in general, programs written in "Python" will perform better in PyPy than CPython, but the current submissions are hyper-optimised for the implementation details of CPython.
It's a shame that it all seems to be up to one person and what he will allow considering how often the shootout is referenced in discussions. I wonder if Mike Pall's experience was similar when he posted the Lua/LuaJIT versions.
Same experience here with LuaJIT. My submissions using low-level types (byte arrays and such) were put into 'interesting alternative', too.
Almost all other languages can use byte arrays, when they are the appropriate data structure for the job. The C submissions make heavy use of GCC extensions, Haskell gets to use mutable (OMG!) byte arrays and Free Pascal has about as much in common with Wirth's Pascal as the name.
But Python and Lua are not allowed to do that? Apparently not all languages are treated equally. Dismissing submissions by resorting to a flawed definition of 'standard' and then suppressing further debate is really lame.
I contributed almost all of the Lua programs to the shootout, but I do not feel particularly encouraged to continue contributing any programs.
Lua still has no regex-dna because the standard string functions aren't proper regexes (no choice operator) and LPeg isn't available as a Debian package (which is the real problem).
'course, it'd be hard for LuaJIT to do much better in the shootout than it does now. It's beating C#.
The Shootout's C implementations do occasionally turn into a festival of SSE intrinsics and the associated unrolling. Some benchmarks, anyway.
Still, not using SSE when it's available is dumb and perhaps all the other languages need to start playing, too. A little trickier in Python, of course....
Unfortunately, SSE - especially in integer-land, so I suppose we're talking SSE2 and beyond - often requires more specialist care and feeding than any automatic method (compiler or run-time) can provide.
Some cases aren't hard to pick up (e.g. bulk operations on big arrays) but others require trickery of the kind that compilers usually don't (or couldn't) have.
This isn't made easier by the notoriously non-orthogonal nature of the SSE integer operations and the rather limited number of ways that you can get in and out of SSE-land (to, say, affect a conditional or get something into a GPR).
I once wrote a function which uses MMX to convert a (possibly very large) string to lowercase. It was considerably faster than the library's strlwr() function. http://codepad.org/BeDqS1Ws
Great stuff. I think there are some potential tricks here to reduce the number of comparisons - maybe do a parallel subtract by k to pull down 'A' to -128 (smallest possible byte) then do your comparison against (ord('Z')-k). Or maybe push up 'Z' to +127..?
That way you can get a single comparison and can replace a pcmpgtb and pand with a single subtract. Then switch it to SSE2, unroll and you're good to go.
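For anyone who wants to check the arithmetic behind the single-comparison idea, here is a scalar sketch in plain Python that simulates signed-byte math one byte at a time. It is not SSE code and not the codepad function; the constant `K` and the helpers `to_signed`, `lower_byte` and `lower_bytes` are hypothetical names chosen for illustration, and it assumes k = ord('A') + 128 so that 'A' lands on -128.

```python
# Assumed shift: subtracting K maps 'A' to -128, the smallest signed byte.
K = ord("A") + 128

# After the shift, 'Z' sits at -103, so "is uppercase" becomes the single
# signed test `shifted < LIMIT` with LIMIT = (ord('Z') + 1 - K) = -102.
LIMIT = ord("Z") + 1 - K


def to_signed(b):
    """Reinterpret an unsigned byte value (0..255) as a signed byte."""
    return b - 256 if b >= 128 else b


def lower_byte(b):
    """Lowercase one ASCII byte using one wrapping subtract and one compare."""
    shifted = to_signed((b - K) & 0xFF)  # wrapping byte subtract
    is_upper = shifted < LIMIT           # the single signed comparison
    return b | 0x20 if is_upper else b   # set bit 5 to get the lowercase letter


def lower_bytes(data):
    """Apply lower_byte to every byte of a bytes object."""
    return bytes(lower_byte(b) for b in data)
```

Bytes below 'A' wrap around to large positive values and bytes above 'Z' land at -102 or higher, so only the 26 uppercase letters pass the test. A vector version would do the same subtract and compare on 16 bytes at once.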
Alternately, http://www.azillionmonkeys.com/qed/asmexample.html in section 11 ("Converting Uppercase") contains a brainsmashing version of this entirely in SWAR ("SIMD Within a Register"), which could be adapted with a certain amount of pain (largely due to the absence of a double-quadword bitshift in SSE2, which is retardlepated).
Maybe you should have to submit and have it run against a secret program that covers many areas of an implementation, so it's unknown precisely what needs to be optimized to improve on the test. Have a limit on how often you can retest it too, so the only real way to be the fastest is to optimize lots of stuff, which is a net win for language users.
Why not at the least accept the first submission for the PyPy entries? It shouldn't matter that it doesn't run on CPython -- if it's idiomatic Python and it runs on PyPy it looks valid.
At the same time, I can see where the guy is coming from with ctypes.