This is, sadly, the complete opposite of what we try to advocate with PyPy, which is that you should not sacrifice your abstractions in favor of performance, and you should not need to rewrite in C, barring a few corner cases like writing kernels.
The point of the OP was "patterns for fast python", i.e. performance. It's absurd to claim that you "advocate abstractions over performance" in the context of achieving performance, so your comparison doesn't make sense to me. Just because you don't advocate performance doesn't mean that these aren't the ways to achieve performance. In addition, it sounds like you mostly agree with regard to rewriting in C: use it as a last resort. It sounds to me like you are just trying to disagree.
I don't see where in fijal's comment you got the impression that he doesn't advocate performance. I very much would hope (and believe) otherwise, that he cares very much about it. It also does not seem that he agrees at all with regard to writing in C, since his "last resort" sounds much more last than yours (again, a thing I'd expect from him ;). His point sounds like it is "if you care about performance, these are not really ways to get it, because they sacrifice more important things". If your issue with that is "yeah but I still want performance" I would bet that fijal would point you to PyPy :).
>you should not sacrifice your abstractions in favor of performance
I only point out that a list of optimization techniques that advocates nothing will be different than a list of optimization techniques that advocates certain abstractions at the expense of performance. And there's nothing sad about that.
I repeat, I see nothing there that says "don't optimize for performance".
It says "if you want performance don't do X because X is not a reasonable sacrifice". Then he linked to some better ones.
If you watch the linked talk, "write simple code when given the choice between simple and complex code" seems like fijal's main suggestion (and in my limited experience with PyPy has worked for me). The other main suggestion that I assume is underlying is "if you want performance try PyPy". Neither of those require sacrificing abstraction. Some of Guido's suggestions do.
>The point of the OP was "patterns for fast python", i.e. performance. It's absurd to claim that you "advocate abstractions over performance" in the context of achieving performance, so your comparison doesn't make sense to me. Just because you don't advocate performance doesn't mean that these aren't the ways to achieve performance.
He never said he doesn't advocate performance. He said performance should not be achieved by sacrificing abstractions.
>In addition, it sounds you mostly agree with regard to rewriting in C... use it as a last resort.
In the CPython world, if not necessarily in Guido's advice, it's one of the first and most common things you hear about performance: "just write parts of it in C".
If that had been the advice of the early JVM guys ("just write parts of it in JNI"), we would never have gotten a fast JVM.
Agreed. But Guido didn't advocate performance, or anything, so fijal is trying to disagree with him about a topic he didn't state an opinion on. There's no need to make up a disagreement. Imagine if fijal instead said:
"This is a fine list of techniques if you are willing to sacrifice your abstractions. At PyPy, we're not willing to do that, and we still achieve sufficiently high performance by..."
The value of those tips is unfortunately ephemeral, since we keep improving PyPy in a lot of places. I'll try to put it up in the form of a blog post, to have a point of reference. No definite deadline yet though :)
Now we need a "PiPy" (I suggest it simply be python implementation of the algorithm to calculate decimal places in pi) and PiPi (not idea what it would do). (Sorry about fluff comment, couldn't resist. We digress.)
Python code optimization is actually tricky ground where analogies from other languages (like C/C++) don't necessarily apply.
One of the more surprising results (esp. for non-pythonists) is the fact that string formatting:
s = "%s" % some_integer
is faster than "casting" to string:
s = str(some_integer)
That's mainly because looking up the 'str' name requires finding an element in the globals (and then builtins) hash table; this turns out to be more expensive than parsing the format string and building the result of the % operator.
>>> timeit.timeit("for i in range(100): s = '%s' % (i,)", number=100000)
2.868873119354248
>>> timeit.timeit("for i in range(100): s = '%s' % i", number=100000)
2.615748882293701
>>> timeit.timeit("for i in range(100): s = str(i)", number=100000)
2.4016571044921875
>>> timeit.timeit("for i in range(100): s = i.__str__()", number=100000)
1.8993198871612549
python2.7:
>>> timeit.timeit("for i in xrange(100): s = '%s' % (i,)", number=100000)
1.9474480152130127
>>> timeit.timeit("for i in xrange(100): s = '%s' % i", number=100000)
1.6135330200195312
>>> timeit.timeit("for i in xrange(100): s = str(i)", number=100000)
2.009705066680908
>>> timeit.timeit("for i in xrange(100): s = i.__str__()", number=100000)
1.539802074432373
Apparently there's also a function call overhead that the % operator doesn't incur, because

    for i in xrange(10000000):
        s = "%s" % i

is also faster than

    lstr = str
    for i in xrange(10000000):
        s = lstr(i)
[Edit] After thinking about it a second longer, I wonder whether there is some lookup for lstr as well even though it's local. But storing the function in lstr is faster than using str so I'm not sure how this is actually implemented. I'm sure someone here will know more.
You need to make sure you do all your benchmarks inside a function. In a function, local variables translate to indexes into a local variable array; outside of a function, they remain global lookups.
Your first reply that "there's always a lookup" isn't right. Some variable accesses (specifically: local variable accesses) do simply map to array accesses.
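Here's a hedged re-run of that comparison with everything inside functions, so that lstr really is a local (the function names are made up and the absolute numbers will vary by machine and interpreter):

    import timeit

    def bench_str():
        for i in range(100):
            s = str(i)              # global name lookup (globals, then builtins) every iteration

    def bench_local_str():
        lstr = str                  # looked up once, then bound to a local
        for i in range(100):
            s = lstr(i)             # local access: an index into the locals array

    print(timeit.timeit(bench_str, number=100000))
    print(timeit.timeit(bench_local_str, number=100000))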
Yes, that's one of the many optimizations PyPy does; however, it's only a very limited one. PyPy goes much, much deeper than this, with techniques like function inlining and escape analysis.
CPython compiles to bytecode, turning all local variable references into offsets into a local variable array. So, getting the "str" function requires a global name lookup, while getting your local "lstr" only requires an array lookup with a constant embedded in the bytecode.
You can have a lot of fun playing with the dis module to look at the bytecode that Python generates for your functions:
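For example, a minimal sketch (with made-up functions) of what to look for:

    import dis

    def use_global():
        return str(42)              # compiles to a LOAD_GLOBAL of the name "str"

    def use_local():
        lstr = str                  # one LOAD_GLOBAL, then STORE_FAST
        return lstr(42)             # later access is LOAD_FAST, an array index

    dis.dis(use_global)
    dis.dis(use_local)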
Cython is definitely worth learning if you're writing in Python and want to make something that runs fast -- numeric or otherwise. It's a Python-to-C compiler that has two killer features:
1. For the most part, you can just take existing Python code and have it magically transformed into not-too-horrible C. A few optional type annotations will help with the speed, but the compatibility is great right from the get-go.
2. With a little care, you can often get the inner loops of your Python code to be just as fast as hand-written C.
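To give a flavour of those optional annotations, here is a minimal sketch in Cython's "pure Python" mode (the function is made up, and it assumes the cython package is installed): the same file runs unchanged under CPython, and compiling it lets the loop below use plain C integers.

    import cython

    def arithmetic_sum(n: cython.long) -> cython.long:
        total: cython.long = 0      # becomes a C long when compiled by Cython
        i: cython.long
        for i in range(n):
            total += i
        return total

    print(arithmetic_sum(1000000))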
The tutorial is a quick read, and gets you up to speed without much effort:
This is one advantage of C++ (and C, and others) which I didn't really realise until fairly recently. Because I have a good quality compiler, I do not have to worry about little functions.
If I want to make a class with a single int member, and a bunch of member functions, I can trust that will, in most cases, compile away to be just as efficient as a raw int and inline code. It is very liberating to not have to worry about the efficiency of creating another function.
It's missing one tip I like:
Make sure your functions are calling native (C-implemented) CPython functions for heavy operations, especially when using libraries.
I often use Python libs whose functions are written in Python itself and create bottlenecks. Rewriting those functions to call "native" CPython functions generally makes an extremely large difference.
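A hedged illustration of what "calling native CPython functions" can mean in practice (the function names are made up):

    def join_slow(words):
        out = ""
        for w in words:             # interpreted loop with repeated string copying
            out += w + " "
        return out.strip()

    def join_fast(words):
        return " ".join(words)      # the loop and the copying happen inside C code

    words = ["fast", "python", "patterns"] * 1000
    assert join_slow(words) == join_fast(words)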
Be aware that a lot of Python tips for improving performance are related to the specific Python implementation you're using.
We should draw a clear distinction between Python tips and CPython tips.
In PyPy or other implementations, some tips just do not make sense and are useless (e.g. local-scope vs. outer-scope lookups in loops).
Don't assume that threads will be useful: in CPython they are simply a concurrency abstraction and don't actually run Python code in parallel. To make use of a multi-core system you need to use processes.
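A minimal sketch of that advice using the standard multiprocessing module (the work function is made up):

    from multiprocessing import Pool

    def cpu_bound(n):
        return sum(i * i for i in range(n))      # pure-Python CPU work

    if __name__ == "__main__":                   # required on platforms that spawn workers
        with Pool() as pool:                     # defaults to one worker per core
            results = pool.map(cpu_bound, [10**6] * 8)
        print(sum(results))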
I'd still like it if they could get rid of the GIL...
GIL or no GIL, threads run concurrently in Python. The difference shows up in performance:

- IO-bound tasks: the GIL is released (see the sketch after this list)
- CPU-bound tasks on an N-core single host. To exploit multiple CPUs:
  a) no GIL (hypothetical): N times speed-up, optimistically (which presumes weak data dependency, i.e., multiprocessing could be used to the same effect)
  b) same as a) but with multiple processes (shared memory or a communication-based approach); code complexity is about the same on average (except on Windows)
  c) C extensions (existing or new): speed-ups of 10*N or more on numerical code; Cython makes it easy to write new extensions. Currently, due to the dynamic nature of Python, GIL or no GIL, C extensions might be necessary to exploit the hardware fully (though C extensions might not be an option in some projects)
- tasks that scale to multiple hosts: different processes anyway, i.e., the GIL is not a problem

Benefits of the GIL:
- C extensions (and the interpreter itself) are simpler to write correctly. Multithreaded programming is not trivial; we need all the help we can get.
- no performance penalty for single-threaded code
- it encourages a synchronization-through-communication concurrency model (built into Go)

Disadvantages:
- some applications have no localized performance bottleneck, so writing a small C extension won't capture the potential benefit of running on multiple cores in parallel. If performance is that critical, Python might not be the right tool in such cases.
- there are pathological cases where performance suffers greatly due to the GIL (though other approaches have their own pathological cases)
- a (misinformed) perception that Python can't benefit from multicore
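For the IO-bound case above, a minimal sketch (the URL list is just a placeholder) of why plain threads already help there: each thread spends most of its time blocked on the network with the GIL released.

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    URLS = ["https://example.com/"] * 8          # placeholder URLs

    def fetch(url):
        with urlopen(url) as resp:               # the GIL is released while waiting on IO
            return len(resp.read())

    with ThreadPoolExecutor(max_workers=8) as pool:
        print(list(pool.map(fetch, URLS)))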
OK, sure, they run concurrently, but they can't be executing in the interpreter at the same time. The result is that if you use Python threads to do Python-level work, you won't utilise more than one core most of the time.
There are a myriad of ways around it, but it is another thing you need to take into account when writing python programs or deciding to use python for a project.
So if someone writes code that makes heavy use of nonlibrary python code in multiple threads, you're automatically going to blame them as writing in the style of some other language? That's pretty No True Scotsman.
I'm not arguing for or against the GIL here but it's something to keep in mind. "Threads run concurrently" is misleading. And it's not a matter of being pythonic, the GIL isn't even a python feature.
They aren't entirely a non-parallel abstraction. C extension modules can release the GIL[1] and continue calculating. A notable example of this is Numpy/Scipy[2].
True, there are options, I guess I was just referring to straight python threads and interpreter calls, not what you can do with extensions.
For heavy math workloads, which was what I was thinking of, you can use the techniques you linked to there to great effect. It is something that you need to put a little more thought into than you might in a non-locked situation like pthreads in C though.
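A hedged sketch of that: many NumPy routines release the GIL while they crunch, so ordinary threads can keep several cores busy on a workload like this (array sizes and thread count are arbitrary):

    import threading
    import numpy as np

    a = np.random.rand(1500, 1500)
    b = np.random.rand(1500, 1500)
    results = [None] * 4

    def worker(idx):
        results[idx] = a @ b                     # the heavy multiply runs in C with the GIL released

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results[0].shape)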
Programmers should not be forced to sacrifice the flexibility of function calls, objects or getters/setters in order to speed up a program. The performance hit from those things should either be minuscule or optimised away by the compiler as appropriate.
This is too black and white. Don't fret, you can use function calls all you want in python... the situations in which Guido is talking about being conscious of your stack frame are pretty rare in typical practice.
Also, getters/setters are totally inane in python (and probably most other languages).
In fact, in general Guido's advice reads as a good warning to folks showing up from other languages whose first reaction might be to create a com/mycubiclefarm/exceptions/abstract/ directory and start writing SeriousBaseClassesForMyExceptions.
Sounds like you're looking for a Sufficiently Smart Compiler[0]. James Hague has a good piece on why this might not be so desirable[1].
One of the reasons I'm fond of Python is that, while there is a tradeoff between flexibility and performance, it gives you the means to sacrifice flexibility to aid you in improving performance -- once you've identified what (if any) actual performance bottlenecks you face.
In general, function calls cannot be inlined by the Python compiler because (almost) any function name may be re-bound to a different object at run time.
A very smart compiler could probably attempt to prove that no such modification can occur at run time throughout the whole program; but this is much harder than simply deciding whether inlining a given call is worth it or not.
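A tiny example of why: rebinding the name later changes what every existing call site means, so a compiler that had pasted the old body in would silently give the wrong answer (the names are made up):

    def scale(x):
        return x * 2

    def use():
        return scale(10)          # `scale` must be looked up at call time

    print(use())                  # 20

    scale = lambda x: x * 3       # rebinding the module-level name
    print(use())                  # 30 -- an inlined copy of the old body would still give 20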
And more to the point, PyPy can do Python inlining with its tracing JIT. The method, in both cases, is similar: find some type assumptions that are useful and usually true, generate code under those assumptions, and fall back on more generic code if they're ever false.
Actually, my experience with PyPy, while generally positive, has exhibited many of the characteristics that article talks about in terms of downsides to "sufficiently-smart compilers." It's almost always much faster than cpython, but how much faster is highly variable, and not especially predictably so; seemingly-insignificant changes can have large and unexpected performance implications, and it's difficult as a developer to have the proper mental model to figure out what's actually going on.
In CPython land, Python is slower, but performs predictably, and if you want to speed it up you write C, which is much faster, and also performs predictably, though it takes some developer effort. In PyPy, you get some of the speed of C without the effort, but without the predictability either.
I'm not sure that counts in the same way. You would have the branch prediction issue even if you program in assembly language and have precise control over the instructions the computer executes.
>Sounds like you're looking for a Sufficiently Smart Compiler[0]. James Hague has a good piece on why this might not be so desirable[1].
So, sounds like he's looking for something like v8.
If v8 can be 10-30 times faster than Python, for an equally or even more dynamic language, I don't see why Python should need the manual tuning Guido describes in order to get some speed.
If you are writing very performance critical code, you shouldn't write in Python anyway. You chose Python because ease of development is more important than raw performance.
The overhead of function calls and property access in Python is part of the price for the increased flexibility of the language.
Van Rossum's tips are for the corner case where you want to improve performance but don't want to bother rewriting in C, e.g. if you have a small bottleneck in a larger program.
There are few applications that require 100% of the code to be high performance. Most performance problems are of the "bottleneck" kind, so writing your entire app in a language like Python for that ease of development is smart. It gives you options to optimize that 1-2% of your code that is critical without incurring a development penalty on the other 98-99%.
GVR's tips can be seen as an escalation path for optimizing. Don't jump into C before you've actually optimized your Python, because you might not need to.
In 10 years of writing Python, I've yet to hit a point where I've needed to do that. I'm kind of looking forward to it, actually.
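As a hedged sketch of that escalation path: profile first, and only then decide whether the hotspot is worth C (the functions here are made up):

    import cProfile
    import pstats

    def hot():
        return sum(i * i for i in range(200000))

    def program():
        return [hot() for _ in range(20)]

    profiler = cProfile.Profile()
    profiler.enable()
    program()
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)   # show the top offenders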
I don't know how much experience you have with Python... but you don't need getters/setters in order to gain flexibility. You can always redefine how assignment/access works for any attribute later.
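A minimal sketch of that (class and attribute names made up): callers keep writing plain attribute syntax whether balance starts as a simple attribute or later becomes a property with validation.

    class Account:
        def __init__(self, balance):
            self._balance = balance

        @property
        def balance(self):                   # read access: acct.balance
            return self._balance

        @balance.setter
        def balance(self, value):            # write access: acct.balance = x
            if value < 0:
                raise ValueError("balance cannot be negative")
            self._balance = value

    acct = Account(100)
    acct.balance = 50                        # no getter/setter calls in client code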
Even in Java you don't need the trivial getters/setters anymore. Everyone uses an IDE with refactoring capabilities, so the change can be done instantly when necessary. Why uglify your code in advance?
Of course it depends on who's using your code. In Java, if you have an API that other people are using, then you must use getters and setters. In Python there is no such restriction.
Not much, so I will ask a question. What's more pythonic - avoiding getters/setters, or having assignment do more than simple assignment without the caller knowing about it?
Or does the issue simply not come up in practice, because you rarely need to redefine how assignment/accessing works?
For what it's worth, the only thing you sacrifice in Python by omitting a trivial getter/setter pair is the docstring: unlike Java, the syntax doesn't change, and, for Python classes, there is no such thing as .NET-style "binary compatibility".
"The universal speed-up is rewriting small bits of code in C" - pretty much sums up the subject of Python performance. Even in Cython you have a Python dialect that's translated to C.
Surprisingly enough, I happen to have a talk where I discuss this precise topic, for people with 30 minutes to kill: http://www.youtube.com/watch?v=ZHF5Aius_Qs&feature=youtu...