This is, sadly, the complete opposite of what we try to advocate with PyPy, which is that you should not sacrifice your abstractions in favor of performance, and you should not need to rewrite in C, barring a few corner cases like writing kernels.
The point of the OP was "patterns for fast python", i.e. performance. It's absurd to claim that you "advocate abstractions over performance" in the context of achieving performance, so your comparison doesn't make sense to me. Just because you don't advocate performance doesn't mean that these aren't the ways to achieve performance. In addition, it sounds like you mostly agree with regard to rewriting in C: use it as a last resort. It sounds to me like you are just trying to disagree.
I don't see where in fijal's comment you got the impression that he doesn't advocate performance. I very much would hope (and believe) otherwise, that he cares very much about it. It also does not seem that he agrees at all with regard to writing in C, since his "last resort" sounds much more last than yours (again, a thing I'd expect from him ;). His point sounds like it is "if you care about performance, these are not really ways to get it, because they sacrifice more important things". If your issue with that is "yeah but I still want performance" I would bet that fijal would point you to PyPy :).
>you should not sacrifice your abstractions in favor of performance
I only point out that a list of optimization techniques that advocates nothing will be different than a list of optimization techniques that advocates certain abstractions at the expense of performance. And there's nothing sad about that.
I repeat, I see nothing there that says "don't optimize for performance".
It says "if you want performance don't do X because X is not a reasonable sacrifice". Then he linked to some better ones.
If you watch the linked talk, "write simple code when given the choice between simple and complex code" seems like fijal's main suggestion (and in my limited experience with PyPy has worked for me). The other main suggestion that I assume is underlying is "if you want performance try PyPy". Neither of those require sacrificing abstraction. Some of Guido's suggestions do.
>The point of the OP was "patterns for fast python", i.e. performance. It's absurd to claim that you "advocate abstractions over performance" in the context of achieving performance, so your comparison doesn't make sense to me. Just because you don't advocate performance doesn't mean that these aren't the ways to achieve performance.
He never said he doesn't advocate performance. He said performance should not be achieved by sacrificing abstractions.
>In addition, it sounds you mostly agree with regard to rewriting in C... use it as a last resort.
In the CPython world, if not necessarily in Guido's advice, it's one of the first and most common things you hear about performance: "just write parts of it in C".
If that had been the advice of the early JVM guys ("just write parts of it in JNI"), we would never have gotten a fast JVM.
Agreed. But Guido didn't advocate performance, or anything, so fijal is trying to disagree with him about a topic he didn't state an opinion on. There's no need to make up a disagreement. Imagine if fijal instead said:
"This is a fine list of techniques if you are willing to sacrifice your abstractions. At PyPy, we're not willing to do that, and we still achieve sufficiently high performance by..."
The value of those tips is unfortunately ephemeral, since we keep improving PyPy in a lot of places. I'll try to put it up in the form of a blog post, to have a point of reference. No definite deadline yet though :)
Now we need a "PiPy" (I suggest it simply be python implementation of the algorithm to calculate decimal places in pi) and PiPi (not idea what it would do). (Sorry about fluff comment, couldn't resist. We digress.)
Python code optimization is actually tricky ground where analogies from other languages (like C/C++) don't necessarily apply.
One of the more surprising results (esp. for non-pythonists) is the fact that string formatting:
s = "%s" % some_integer
is faster than "casting" to string:
s = str(some_integer)
That's mainly because looking up the 'str' name requires finding an element in the globals (and then builtins) hash table; this turns out to be more expensive than parsing the format string and building the result of the % operator.
>>> timeit.timeit("for i in range(100): s = '%s' % (i,)", number=100000)
2.868873119354248
>>> timeit.timeit("for i in range(100): s = '%s' % i", number=100000)
2.615748882293701
>>> timeit.timeit("for i in range(100): s = str(i)", number=100000)
2.4016571044921875
>>> timeit.timeit("for i in range(100): s = i.__str__()", number=100000)
1.8993198871612549
python2.7:
>>> timeit.timeit("for i in xrange(100): s = '%s' % (i,)", number=100000)
1.9474480152130127
>>> timeit.timeit("for i in xrange(100): s = '%s' % i", number=100000)
1.6135330200195312
>>> timeit.timeit("for i in xrange(100): s = str(i)", number=100000)
2.009705066680908
>>> timeit.timeit("for i in xrange(100): s = i.__str__()", number=100000)
1.539802074432373
Apparently there's also a function call overhead that the % operator doesn't incur, because

    for i in xrange(10000000):
        s = "%s" % i

is also faster than

    lstr = str
    for i in xrange(10000000):
        s = lstr(i)
[Edit] After thinking about it a second longer, I wonder whether there is some lookup for lstr as well even though it's local. But storing the function in lstr is faster than using str so I'm not sure how this is actually implemented. I'm sure someone here will know more.
You need to make sure you do all your benchmarks inside a function. In a function, local variables translate to indexes into a local variable array; outside of a function, they remain global lookups.
Your first reply that "there's always a lookup" isn't right. Some variable accesses (specifically: local variable accesses) do simply map to array accesses.
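Here's a hedged re-run of that comparison with everything inside functions, so that lstr really is a local (the function names are made up and the absolute numbers will vary by machine and interpreter):

    import timeit

    def bench_str():
        for i in range(100):
            s = str(i)              # global name lookup (globals, then builtins) every iteration

    def bench_local_str():
        lstr = str                  # looked up once, then bound to a local
        for i in range(100):
            s = lstr(i)             # local access: an index into the locals array

    print(timeit.timeit(bench_str, number=100000))
    print(timeit.timeit(bench_local_str, number=100000))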
Yes, that's one of the many optimizations PyPy does; however, it's only a very limited one. PyPy goes much, much deeper than this, with techniques like function inlining and escape analysis.
CPython compiles to bytecode, turning all local variable references into offsets into a local variable array. So, getting the "str" function requires a global name lookup, while getting your local "lstr" only requires an array lookup with a constant embedded in the bytecode.
You can have a lot of fun playing with the dis module to look at the bytecode that Python generates for your functions:
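For example, a minimal sketch (with made-up functions) of what to look for:

    import dis

    def use_global():
        return str(42)              # compiles to a LOAD_GLOBAL of the name "str"

    def use_local():
        lstr = str                  # one LOAD_GLOBAL, then STORE_FAST
        return lstr(42)             # later access is LOAD_FAST, an array index

    dis.dis(use_global)
    dis.dis(use_local)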
Cython is definitely worth learning if you're writing in Python and want to make something that runs fast -- numeric or otherwise. It's a Python-to-C compiler that has two killer features:
1. For the most part, you can just take existing Python code and have it magically transformed into not-too-horrible C. A few optional type annotations will help with the speed, but the compatibility is great right from the get-go.
2. With a little care, you can often get the inner loops of your Python code to be just as fast as hand-written C.
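To give a flavour of those optional annotations, here is a minimal sketch in Cython's "pure Python" mode (the function is made up, and it assumes the cython package is installed): the same file runs unchanged under CPython, and compiling it lets the loop below use plain C integers.

    import cython

    def arithmetic_sum(n: cython.long) -> cython.long:
        total: cython.long = 0      # becomes a C long when compiled by Cython
        i: cython.long
        for i in range(n):
            total += i
        return total

    print(arithmetic_sum(1000000))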
The tutorial is a quick read, and gets you up to speed without much effort:
This is one advantage of C++ (and C, and others) which I didn't really realise until fairly recently. Because I have a good quality compiler, I do not have to worry about little functions.
If I want to make a class with a single int member, and a bunch of member functions, I can trust that will, in most cases, compile away to be just as efficient as a raw int and inline code. It is very liberating to not have to worry about the efficiency of creating another function.
It's missing one tip I like:
Make sure your functions are calling native (C-implemented) CPython functions for heavy operations, especially when using libraries.
I often use Python libs whose functions are written in Python itself and create bottlenecks. Rewriting those functions to call "native" CPython functions generally makes an extremely large difference.
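A hedged illustration of what "calling native CPython functions" can mean in practice (the function names are made up):

    def join_slow(words):
        out = ""
        for w in words:             # interpreted loop with repeated string copying
            out += w + " "
        return out.strip()

    def join_fast(words):
        return " ".join(words)      # the loop and the copying happen inside C code

    words = ["fast", "python", "patterns"] * 1000
    assert join_slow(words) == join_fast(words)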
Be aware that a lot of Python tips for improving performance are related to the specific Python implementation you're using.
We should draw a clear distinction between Python tips and CPython tips.
In PyPy or other implementations, some tips just do not make sense and are useless (e.g. local-scope vs. outer-scope lookups in loops).
Don't assume that threads will be useful: in CPython they are simply a concurrency abstraction and don't actually run Python code in parallel. To make use of a multi-core system you need to use processes.
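A minimal sketch of that advice using the standard multiprocessing module (the work function is made up):

    from multiprocessing import Pool

    def cpu_bound(n):
        return sum(i * i for i in range(n))      # pure-Python CPU work

    if __name__ == "__main__":                   # required on platforms that spawn workers
        with Pool() as pool:                     # defaults to one worker per core
            results = pool.map(cpu_bound, [10**6] * 8)
        print(sum(results))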
I'd still like it if they could get rid of the GIL...
GIL or no GIL, threads run concurrently in Python. The difference shows up in performance:

- IO-bound tasks: the GIL is released (see the sketch after this list)
- CPU-bound tasks on an N-core single host. To exploit multiple CPUs:
  a) no GIL (hypothetical): N times speed-up, optimistically (which presumes weak data dependency, i.e., multiprocessing could be used to the same effect)
  b) same as a) but with multiple processes (shared memory or a communication-based approach); code complexity is about the same on average (except on Windows)
  c) C extensions (existing or new): speed-ups of 10*N or more on numerical code; Cython makes it easy to write new extensions. Currently, due to the dynamic nature of Python, GIL or no GIL, C extensions might be necessary to exploit the hardware fully (though C extensions might not be an option in some projects)
- tasks that scale to multiple hosts: different processes anyway, i.e., the GIL is not a problem

Benefits of the GIL:
- C extensions (and the interpreter itself) are simpler to write correctly. Multithreaded programming is not trivial; we need all the help we can get.
- no performance penalty for single-threaded code
- it encourages a synchronization-through-communication concurrency model (built into Go)

Disadvantages:
- some applications have no localized performance bottleneck, so writing a small C extension won't capture the potential benefit of running on multiple cores in parallel. If performance is that critical, Python might not be the right tool in such cases.
- there are pathological cases where performance suffers greatly due to the GIL (though other approaches have their own pathological cases)
- a (misinformed) perception that Python can't benefit from multicore
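For the IO-bound case above, a minimal sketch (the URL list is just a placeholder) of why plain threads already help there: each thread spends most of its time blocked on the network with the GIL released.

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    URLS = ["https://example.com/"] * 8          # placeholder URLs

    def fetch(url):
        with urlopen(url) as resp:               # the GIL is released while waiting on IO
            return len(resp.read())

    with ThreadPoolExecutor(max_workers=8) as pool:
        print(list(pool.map(fetch, URLS)))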
OK, sure, they run concurrently, but they can't be executing in the interpreter at the same time. The result is that if you use Python threads to do Python-level work, you won't utilise more than one core most of the time.
There are a myriad of ways around it, but it is another thing you need to take into account when writing python programs or deciding to use python for a project.
So if someone writes code that makes heavy use of nonlibrary python code in multiple threads, you're automatically going to blame them as writing in the style of some other language? That's pretty No True Scotsman.
I'm not arguing for or against the GIL here but it's something to keep in mind. "Threads run concurrently" is misleading. And it's not a matter of being pythonic, the GIL isn't even a python feature.
They aren't entirely a non-parallel abstraction. C extension modules can release the GIL[1] and continue calculating. A notable example of this is Numpy/Scipy[2].
True, there are options, I guess I was just referring to straight python threads and interpreter calls, not what you can do with extensions.
For heavy math workloads, which was what I was thinking of, you can use the techniques you linked to there to great effect. It is something that you need to put a little more thought into than you might in a non-locked situation like pthreads in C though.
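A hedged sketch of that: many NumPy routines release the GIL while they crunch, so ordinary threads can keep several cores busy on a workload like this (array sizes and thread count are arbitrary):

    import threading
    import numpy as np

    a = np.random.rand(1500, 1500)
    b = np.random.rand(1500, 1500)
    results = [None] * 4

    def worker(idx):
        results[idx] = a @ b                     # the heavy multiply runs in C with the GIL released

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results[0].shape)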
Programmers should not be forced to sacrifice the flexibility of function calls, objects or getters/setters in order to speed up a program. The performance hit from those things should either be minuscule or optimised away by the compiler as appropriate.
This is too black and white. Don't fret, you can use function calls all you want in python... the situations in which Guido is talking about being conscious of your stack frame are pretty rare in typical practice.
Also, getters/setters are totally inane in python (and probably most other languages).
In fact, in general Guido's advice reads as a good warning to folks showing up from other languages whose first reaction might be to create a com/mycubiclefarm/exceptions/abstract/ directory and start writing SeriousBaseClassesForMyExceptions.
Sounds like you're looking for a Sufficiently Smart Compiler[0]. James Hague has a good piece on why this might not be so desirable[1].
One of the reasons I'm fond of Python is that, while there is a tradeoff between flexibility and performance, it gives you the means to sacrifice flexibility to aid you in improving performance -- once you've identified what (if any) actual performance bottlenecks you face.
In general, function calls cannot be inlined by the Python compiler because (almost) any function name may be re-bound to a different object at run time.
A very smart compiler could probably attempt to prove that no such modification can occur at run time throughout the whole program; but this is much harder than simply deciding whether inlining a given call is worth it or not.
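A tiny example of why: rebinding the name later changes what every existing call site means, so a compiler that had pasted the old body in would silently give the wrong answer (the names are made up):

    def scale(x):
        return x * 2

    def use():
        return scale(10)          # `scale` must be looked up at call time

    print(use())                  # 20

    scale = lambda x: x * 3       # rebinding the module-level name
    print(use())                  # 30 -- an inlined copy of the old body would still give 20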
And more to the point, PyPy can do Python inlining with its tracing JIT. The method, in both cases, is similar: find some type assumptions that are useful and usually true, generate code under those assumptions, and fall back on more generic code if they're ever false.
Actually, my experience with PyPy, while generally positive, has exhibited many of the characteristics that article talks about in terms of downsides to "sufficiently-smart compilers." It's almost always much faster than cpython, but how much faster is highly variable, and not especially predictably so; seemingly-insignificant changes can have large and unexpected performance implications, and it's difficult as a developer to have the proper mental model to figure out what's actually going on.
In CPython land, Python is slower, but performs predictably, and if you want to speed it up you write C, which is much faster, and also performs predictably, though it takes some developer effort. In PyPy, you get some of the speed of C without the effort, but without the predictability either.
I'm not sure that counts in the same way. You would have the branch prediction issue even if you program in assembly language and have precise control over the instructions the computer executes.
>Sounds like you're looking for a Sufficiently Smart Compiler[0]. James Hague has a good piece on why this might not be so desirable[1].
So, sounds like he's looking for something like v8.
If v8 can be 10-30 times faster than Python, for an equally or even more dynamic language, I don't see why Python should need the manual tuning Guido describes in order to get some speed.
If you are writing very performance critical code, you shouldn't write in Python anyway. You chose Python because ease of development is more important than raw performance.
The overhead of function calls and property access in Python is part of the price for the increased flexibility of the language.
Van Rossum's tips are for the corner case where you want to improve performance but don't want to bother rewriting in C, e.g. if you have a small bottleneck in a larger program.
There are few applications that require 100% of the code to be high performance. Most performance problems are of the "bottleneck" kind, so writing your entire app in a language like Python for that ease of development is smart. It gives you options to optimize that 1-2% of your code that is critical without incurring a development penalty on the other 98-99%.
GVR's tips can be seen as an escalation path for optimizing. Don't jump into C before you've actually optimized your Python, because you might not need to.
In 10 years of writing Python, I've yet to hit a point where I've needed to do that. I'm kind of looking forward to it, actually.
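As a hedged sketch of that escalation path: profile first, and only then decide whether the hotspot is worth C (the functions here are made up):

    import cProfile
    import pstats

    def hot():
        return sum(i * i for i in range(200000))

    def program():
        return [hot() for _ in range(20)]

    profiler = cProfile.Profile()
    profiler.enable()
    program()
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)   # show the top offenders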
I don't know how much experience you have with Python... but you don't need getters/setters in order to gain flexibility. You can always redefine how assignment/access works for any attribute later.
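A minimal sketch of that (class and attribute names made up): callers keep writing plain attribute syntax whether balance starts as a simple attribute or later becomes a property with validation.

    class Account:
        def __init__(self, balance):
            self._balance = balance

        @property
        def balance(self):                   # read access: acct.balance
            return self._balance

        @balance.setter
        def balance(self, value):            # write access: acct.balance = x
            if value < 0:
                raise ValueError("balance cannot be negative")
            self._balance = value

    acct = Account(100)
    acct.balance = 50                        # no getter/setter calls in client code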
Even in Java you don't need the trivial getters/setters anymore. Everyone uses an IDE with refactoring capabilities, so the change can be done instantly when necessary. Why uglify your code in advance?
Of course it depends on who's using your code. In Java, if you have an API that other people are using, then you must use getters and setters. In Python there is no such restriction.
Not much, so I will ask a question. What's more pythonic - avoiding getters/setters, or having assignment do more than simple assignment without the caller knowing about it?
Or does the issue simply not come up in practice, because you rarely need to redefine how assignment/accessing works?
For what it's worth, the only thing you sacrifice in Python by omitting a trivial getter/setter pair is the docstring: unlike Java, the syntax doesn't change, and, for Python classes, there is no such thing as .NET-style "binary compatibility".
"The universal speed-up is rewriting small bits of code in C" - pretty much sums up the subject of Python performance. Even in Cython you have a Python dialect that's translated to C.
Surprisingly enough, I happen to have a talk where I discuss this precise topic, for people with 30 minutes to kill: http://www.youtube.com/watch?v=ZHF5Aius_Qs&feature=youtu...