TL;DR - Nuitka (which has been garnering some attention on HN today - https://news.ycombinator.com/item?id=8771925) is not as fast as CPython, let alone PyPy, for the chosen scenario (enumerating fat graphs and computing their graph homology).
A year and a half after this paper was published (Apr 2013), Nuitka's main strength seems to be the convenience of packaging code into an executable, which a number of readers have reported success with. If runtime and memory usage matter more, you are better off sticking with the interpreters for now.
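For readers who only care about the packaging aspect, this is roughly what that looks like; a minimal sketch, assuming a recent Nuitka and a hypothetical my_script.py (exact flags vary between Nuitka versions):

    # Hypothetical sketch: invoke Nuitka's standalone mode from Python.
    # Equivalent to running "python -m nuitka --standalone my_script.py"
    # on the command line; the result is a folder containing a native
    # executable plus the libraries it needs.
    import subprocess
    import sys

    subprocess.run(
        [sys.executable, "-m", "nuitka", "--standalone", "my_script.py"],
        check=True,
    )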
This should be re-done with up-to-date versions of all the programs. The authors mention having opened bug reports and hitting existing known bugs. All contestants should have time to react to those, shouldn't they?
While that would be interesting, I don't see it as the responsibility of the author of a conference poster that has already been presented. It would make more sense for the authors of the various Python implementations to take this up as a benchmark.
Not necessarily. A test based on a somewhat arbitrary computation that implementations have not specifically tuned for can provide useful information. With enough such tests, one can get a broad idea of the performance of these systems.
The worst case is when people hyperoptimize for a particular benchmark, which tells you nothing. You see this especially in the Benchmarks Game.
This comparison is interesting in that it measures the performance of plain or type-annotated Python code, but to get the full performance benefit of Cython you would replace lists with C arrays, objects with C structs, etc.
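To make the distinction concrete, here is a minimal sketch (mine, not from the paper) in Cython's pure-Python mode; it assumes the cython package is installed, and a cythonize compile step for the actual speedup:

    import cython

    # Annotated scalars only: roughly what "type-annotated Python" buys you.
    # The counter and accumulator become C variables after compilation, but
    # `values` is still a Python list, so each values[i] is a boxed-object
    # lookup with dynamic dispatch.
    def dot_annotated(values) -> cython.double:
        total: cython.double = 0.0
        i: cython.Py_ssize_t
        for i in range(len(values)):
            total += values[i] * values[i]
        return total

    # Going further: accept a typed memoryview (backed by a C array or a
    # numpy buffer) instead of a list, so the compiled inner loop works on
    # raw doubles with no Python object handling at all.
    def dot_typed(values: cython.double[:]) -> cython.double:
        total: cython.double = 0.0
        i: cython.Py_ssize_t
        for i in range(values.shape[0]):
            total += values[i] * values[i]
        return total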
Did you see that those graphs are logarithmic? PyPy looks between 3x and 4x faster on problems that run long enough for PyPy to have a chance to JIT properly.
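A quick way to see the warm-up effect for yourself; a small sketch (not from the paper) that times the same workload repeatedly in one process:

    import time

    def workload(n=200000):
        # Deliberately simple pure-Python loop: exactly the kind of code a
        # tracing JIT speeds up once the loop has run enough iterations.
        total = 0
        for i in range(n):
            total += (i * i) % 7
        return total

    # Under PyPy the first run(s) are slower while the JIT traces and
    # compiles the hot loop; later runs settle to steady-state speed,
    # which is what the longer-running benchmarks reflect.
    for run in range(5):
        t0 = time.time()
        workload()
        print("run %d: %.4f s" % (run, time.time() - t0))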
I didn't find this paper to be very good. While it touches lightly on the relatively complex mathematical object being computed, it says little about what the computation actually involves, beyond some very high-level keywords ("comprehensions", "object graphs").
What algorithms were used? What data structures? Was the code idiomatic? Was there any effort to reduce things like allocation?
Was homological computation the only test case? Even numerical benchmarks typically come in a suite (a good sprinkling of linear algebraic computations, tight straight line floating point programs, differential equation solvers, various numerical simulators, ...), because one LAPACK function will not give you the full picture.
This paper did not give me a very good understanding of how performant non-numeric math—which in and of itself is an extremely broad and general term—is on each implementation.
It's a conference paper, from EuroSciPy 2013, distributed through a preprint service. There's no expectation it will be a high quality paper. Instead, it's an appropriate quality for where and how it was published.
Checking it now, it gives a reproducible way to download the specific packages used, and the benchmark framework. The actual code benchmarked is fatghol, from https://code.google.com/p/fatghol/ . There's also a link to a preprint describing the construction algorithm, at http://arxiv.org/pdf/1202.1820v2.pdf .
What you propose is an unrealistic expectation, and only possible for people with lots of money and time.
Instead, in real life what happens is people do A, and publish A, then do B (building on A), and publish B, then do C (building on B) and publish C. There's a trail of work backing up the final publication. It makes no sense for publication Q to revisit all of A-P, nor for the author to wait until Z before finally publishing everything. I also think knowledge transfer would be lower since someone interested in this paper's conclusions about the available documentation for the different Pythons (EuroSciPy is not a graph theory specialist conference) would almost certainly not be interested in the algorithm generation details.
You do realize that LINPACK is the "gold standard" benchmark used to rank the top 500 supercomputers, right? And all it does is solve Ax = b. In any case, performance suites like SPEC MPI still need to evaluate the individual benchmarks before assembling them into a suite. Even if you require a suite for something to be meaningful to you, this could be seen as a first step toward building such a suite.
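For the curious, the core of a LINPACK-style measurement really is that small; a rough sketch (my illustration, using NumPy rather than the actual HPL code):

    import time
    import numpy as np

    def linpack_like(n=2000):
        # Solve a dense system A x = b via LU factorization and report a
        # rough GFLOP/s figure using the standard 2/3*n^3 + 2*n^2 count.
        rng = np.random.default_rng(0)
        A = rng.standard_normal((n, n))
        b = rng.standard_normal(n)
        t0 = time.perf_counter()
        x = np.linalg.solve(A, b)
        elapsed = time.perf_counter() - t0
        # Sanity-check the residual, as the real benchmark does.
        assert np.linalg.norm(A @ x - b) / np.linalg.norm(b) < 1e-8
        flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
        return flops / elapsed / 1e9

    print("%.1f GFLOP/s" % linpack_like())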
It appears to me, therefore, that you are being needlessly harsh and critical.
I don't know about EuroSciPy 2013. I guess from your comment that such a conference does not require very high quality submissions.
It is typically not good style to simply say "here is a repository which contains the benchmark code". That is necessary, but not sufficient. (Although, I will say many papers do not include any link to where code can be found, so this was a distinct advantage of this paper.)
There's no need to regurgitate all previous work, but a bit more than a reference is extremely beneficial to legibility and allows for emphasis on particular aspects of what will be measured.
My problem with the paper is my answer to the following question, "What can I conclude from the paper?" What I gathered was approximately the following:
1. The author has a library for computing homologies. The abstract method for their computation is referenced in a (peer reviewed? published?) paper. The library is linked to, though no particular version is mentioned. (Can we really call it reproducible then?)
2. The author has given a very brief overview of the stages of the FatGHoL program, two of which are relevant to the benchmark. The author does not discuss the structure of the objects as implemented, so I must view it as a black box, unless I read source code.
3. The author, in a few sentences, summarizes (but does not delve into) the few very high-level data structures used.
4. The author spends the rest of the paper showing CPU time and memory graphs.
5. The author makes conclusions from the data, with sometimes plausible explanations.
There is no outline as to what is actually being tested, except this black box library. There are no code samples as you would typically see in a survey or conference paper. As a reader, I've at best concluded, "a subset of some version of FatGHoL has the following time and space measurements for a few input parameters." Was this the conclusion the author wanted me to have?
But note the abstract says the tests "[are] an opportunity for every Python runtime to prove its strength in optimization." Is this true? The author has not even remotely convinced me that the code being run exercises relevant optimization capabilities.
I don't think adding one or two more pages discussing these things would have cost the author excessive time or money.
The unfortunate bit is that, elsewhere on HN and Reddit, people are now linking to this paper as almost the definitive resource for comparing the performance of Nuitka against other implementations.
Lastly, I do realize LINPACK is among the benchmarks used for supercomputers (even though LAPACK would arguably be more appropriate, and is sometimes used). I am very well aware of the details of the benchmark, having written an equivalent version of it myself before.
Quoting from the web site: "The annual EuroSciPy Conferences allows participants from academic, commercial, and governmental organizations to: showcase their latest Scientific Python projects, learn from skilled users and developers, and collaborate on code development." It isn't a conference which requires rigorous submissions.
You say "very high quality". I used "rigorous" because quality has many dimensions. I believe people go to EuroSciPy in part to learn which other tools exist, and to learn from the experience of others. This paper appears to have that audience in mind. It's partially an experience paper, and discusses things like available documentation and the stage of development of the tools (eg, Falcon is in early development, and crashed on the test code).
If someone came to the conference interested in performance (which is most of the audience) but not in NumPy (which is a smaller number), then this is a high-quality paper for this type of conference, guiding them on which Python implementations to prioritize, even if the benchmark per se were ignored.
You quoted where the abstract said "an opportunity for every Python runtime to prove its strength in optimization". I can see how that might be interpreted as a very broad benchmark. But it earlier mentioned "Python library FatGHol ... moduli space of Riemann surfaces" and later says "This paper compares the results and experiences from running FatGHol with different Python runtimes", so I think you're reading too much into that quote.
My code is also non-numeric scientific code. It's extremely unlikely that I would understand the algorithm in that code, or that its mix of instructions would match my code's. I would skip the extra details as irrelevant to my interests. Whereas the other points, like how Nuitka's claim that it "create[s] the most efficient native code from this. This means to be fast with the basic Python object handling." has at least one real-world counter-example, and how PyPy can use a lot of memory, do affect how I weigh the available options.
Do you seriously think that one or two more pages would have had a significant effect on the comments on HN or Reddit? For that matter, I see eight comments total on HN about the paper, including mine and your three. I don't see people (on HN) regarding it as a 'definitive resource', only as a resource. I don't read Reddit so can't say what's going on there, but surely complaining here about Reddit doesn't help.
Also, the paper was 4 1/2 pages long. You want the author to spend about 30% longer to write the paper, which I think is excessive.