He's missing Cython, which is another good option when you're looking for speed.
My personal favourite optimisation, from needing to shave a few milliseconds off our API response times, was discovering that *args and **kwargs are measurably slower than explicitly declared arguments, and switching to explicit argument declaration and passing in the relevant parts of the code.
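A quick micro-benchmark sketch of the effect (absolute numbers will vary by interpreter and version; the function names here are just illustrative):

```python
import timeit

def explicit(a, b, c):          # arguments bound directly to locals
    return a + b + c

def starred(*args, **kwargs):   # CPython must pack a fresh tuple (and dict) per call
    return args[0] + args[1] + args[2]

print(timeit.timeit("explicit(1, 2, 3)", globals=globals()))
print(timeit.timeit("starred(1, 2, 3)", globals=globals()))
```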
We also did a few other neat things:
- Rolled our own UUID-like generator in pure Python (I was surprised this helped, but the profiler doesn't lie)
- Switched to working directly with WebOb Request and Response objects rather than using a framework
- Used a background thread with a single-slot queue so the response was returned to the user before we emitted the event log message, while still guaranteeing the message was emitted before moving on to the next request (see the sketch after this list)
- Heavy optimisation of memcache / redis reads and writes
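A minimal sketch of that single-slot-queue pattern (the names and the print stand-in are mine, not from the original system):

```python
import queue
import threading

# The queue holds at most one pending message: put() blocks while the
# previous message is still waiting, so a request can return its response
# immediately, but the next request cannot start until the prior event
# log message has been handed off.
log_queue = queue.Queue(maxsize=1)

def log_worker():
    while True:
        message = log_queue.get()
        print("event:", message)  # stand-in for the real log emitter
        log_queue.task_done()

threading.Thread(target=log_worker, daemon=True).start()

def handle_request(n):
    response = "response %d" % n
    # ... response is written back to the client here ...
    log_queue.put("served request %d" % n)  # blocks if the slot is full
    return response

for i in range(3):
    handle_request(i)
log_queue.join()  # flush the last message before exiting
```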
The crosstown_traffic API in hendrix does exactly this.
The order of tactics is wrong. In terms of energy expended, one should try PyPy first! It is amazingly compatible with CPython and can now even be embedded directly in CPython programs via jitpy, https://github.com/fijal/jitpy (which supports numpy arrays).
Dump your virtualenv, create a new one with pypy, reinstall libraries and test your app. Takes less than 20 minutes, even for complex applications.
I would say it is a close second. If I have a slow system, I will move to a faster runtime before modifying any code. Going from CPython to PyPy, where possible, will almost always gain you enough of a performance increase while you refactor the slow parts.
It depends on what you're doing. I think your strategy can often be high risk, since you are making a change that affects every part of your program at once - changing the runtime on a mature Python project is not a trivial choice! Changing the relevant code to tidy up data structures, or moving a few function calls to Cython or C, will only touch the obvious places, so it has far less impact on the program.
Also, profiling PyPy is less straightforward than profiling CPython code, since the JIT's hot-spot compilation changes the runtime characteristics of the program as it warms up. This means you need to run tests many times over to make sure the code has warmed up, which makes further optimization slightly more difficult. It's not a problem for people with experience optimizing Python code, but for people who actually hope to learn something from OP's blog post, it might be a sticking point.
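A crude, illustrative way to see the warm-up effect: time the same hot function over repeated runs. Under PyPy the per-run times typically drop as the JIT traces the loop; under CPython they stay roughly flat.

```python
import time

def hot_loop():
    total = 0
    for i in range(1000000):
        total += i * i
    return total

for run in range(5):
    start = time.perf_counter()
    hot_loop()
    print("run %d: %.4fs" % (run, time.perf_counter() - start))
```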
In my experience using Python for stats and script-type work (as opposed to writing servers and daemons), PyPy just isn't that useful. All the Python code is doing is gluing numpy and Cython code together, and PyPy isn't likely to be able to warm up in time to beat it - and it won't beat it anyway, since the program is spending most of its time in C.
Obviously, if pypy is an ideal choice for you, use it. But I don't think your experience should really be put forward as a general approach.
I have found PyPy wins once wall time is over about 1.5 seconds, making it my go-to Python environment for 1GB+ ETLs.
Everything you say is true, but with the systems I use/write I am able to test for correctness pretty quickly. The nice thing about switching from CPython to PyPy is that everything gets faster. I have also found that using PyPy has removed a lot of the cases where I would want to drop down to native code.
Changing platforms can make one's designs simpler and more robust. When it comes to structured storage, I'll start with sqlite, then when it starts to get slow I'll switch to PostgreSQL. It takes almost no work to port from one to the other.
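A minimal sketch of that kind of swap-friendly storage layer (the function names and the psycopg2 connection string are my assumptions, not the commenter's code):

```python
import sqlite3

# Confine connection setup to one function, so porting from SQLite to
# PostgreSQL (here assumed via psycopg2, which speaks the same DB-API 2.0)
# only touches this one spot.
def get_connection(use_postgres=False):
    if use_postgres:
        import psycopg2  # assumed swap-in; same DB-API 2.0 interface
        return psycopg2.connect("dbname=app user=app")
    return sqlite3.connect("app.db")

conn = get_connection()
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, name TEXT)")
conn.commit()
```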
You really should give PyPy another shot. It supports more of numpy every day and the startup time is excellent. Maybe give jitpy a try if you are not likely to move off of CPython.
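From memory of the jitpy README, usage looks roughly like the sketch below; treat the exact import paths and decorator signature as assumptions and check the repo before relying on them.

```python
# Roughly the usage shown in the jitpy README (from memory; the setup path
# and jittify signature are assumptions - verify against
# https://github.com/fijal/jitpy before use).
from jitpy import setup
setup('/path/to/pypy-home')  # point at an installed PyPy
from jitpy.wrapper import jittify

@jittify([int, float], float)  # argument types, return type
def sum_repeated(count, value):
    s = 0.0
    for _ in range(count):
        s += value
    return s

print(sum_repeated(1000000, 0.5))  # runs inside the embedded PyPy
```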
Serious question: if you have some code that really has to be fast, is it viable to keep it in Python, or should you ultimately end up rewriting it in a compiled language?
For example, I am writing code that implements networks that evolve over time for AI research. Prototyping it in Python makes it easy to test things out, but I expect that I will have to rewrite it in C++ or maybe something more fun, like Haskell[1].
1. Mostly for the sheer joy of trolling my colleagues with a learning agent monad.
In my experience, the reasons you wrote something in Python in the first place (implementing features faster) remain valid later on when you want to add functionality.
By using PyPy or Cython, rewriting small parts in C/C++, and/or using libraries that are written in C (such as numpy), you can normally make the hotspots in your code fast enough while keeping the advantages of Python.
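For example, a hotspot written as a pure-Python loop versus the same computation pushed into numpy's C internals (an illustrative sketch, not the parent's code):

```python
import numpy as np

def py_sum_squares(xs):
    # pure-Python loop: every iteration goes through the interpreter
    total = 0.0
    for x in xs:
        total += x * x
    return total

arr = np.arange(1000000, dtype=np.float64)
print(py_sum_squares(arr))      # slow: a million interpreted iterations
print(float(np.dot(arr, arr)))  # fast: same sum of squares inside numpy's C code
```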
In my experience it is worthwhile to first try and improve the performance in Python. It's easier to play around with different ideas in Python, and things like numpy quite often enable you to get "fast enough". Only if that is not enough should you consider writing a C extension.
Python makes it easy to do lots of things -- including very inefficient ones. Profiling your existing stuff and trying to optimize in pure Python often gets you pretty far.
I've never had to do this, but the common advice I hear is that the right algorithm in a high-level language can be faster than the wrong algorithm in a low-level language. So even if you do end up going the route of writing it in something lower level, it is likely worth optimizing your code at the higher level first, where it is less expensive to try different things.
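A toy illustration of that point (my example): choosing the right data structure in Python beats a brute-force scan by a factor that no amount of porting to C would recover.

```python
import timeit

data = list(range(50000))
as_set = set(data)

# O(n) linear scan vs O(1) hash lookup for the same membership test
print(timeit.timeit("49999 in data", globals=globals(), number=100))
print(timeit.timeit("49999 in as_set", globals=globals(), number=100))
```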
Same question to you: if you have some code that really has to be fast, is it viable to keep it in C++, or should you ultimately end up rewriting it for GPU?
Is this really a suitable counter-question? GPUs lend themselves to specific kinds of programming problems, and there's the added latency of GPU communication on top. By contrast, the cost of switching between on-CPU languages is programmer time, and it may result in significant runtime advantages.
There are also disadvantages: increased programmer time for implementing new features. And risks: you might optimize the wrong part (i.e. you invest in a rewrite, but it does not solve the performance problem).
That's why you must quantify the advantages and disadvantages, including risk mitigation, and only then will you see whether a given course of action is viable.
You should probably first consider whether there are critical bits of code that can be rewritten to use MMX/SSE instructions, since your data is already in the right cache, without needing to move it anywhere else.
True, but in this particular case, the cost to time going (temporarily) backwards would only be connecting to a potentially-suboptimal disque node, and that would be remedied after the subjective-machine-time caught up with the previous subjective time.
Nope and nope. Jython is made to run on HotSpot, which is a JIT compiler, and Jython should be comparable in speed to CPython and faster in some cases (it used to be slower, but that was 3-4 years ago; they optimized it a lot, and the stuff added in Java 7/8 helped too).
There is a huge difference between an interpreter with a JIT compiler and an interpreter running on another interpreter that has one. These are not equivalent at all.
In the JRuby case, JRuby compiles Ruby to JVM bytecode.
Jython may do the same: rather than creating CPython bytecode, it may produce JVM bytecode, which can then be optimized by the JVM. However, I don't know anything about Jython's performance, and it may not be employing the same tactics as JRuby.
With modern Java features like invokedynamic in Java 7 and lambdas in Java 8, and with other interesting languages like Scala, Groovy, etc. being written for the JVM, I'm sure things have come a long way since the time Jython 2.4 was being developed on Java 5/6, and I'm sure the JVM has many more optimizations that dynamic languages can benefit from.
> other interesting languages like Scala, Groovy, etc being written for the JVM
I think Clojure is one of the most interesting, so I'm keen that you include it in your minimal list of examples. I don't think Clojure actually uses the Java 7 invokedynamic bytecode, though.
Mostly orthogonal. You can have an interpreter without a virtual machine (Basic and Python don't have one, just a runtime), and similarly a virtual machine without an interpreter.
An interpreter executes scripting instructions directly.
A virtual machine implements a faux (virtualized) CPU, with its own instruction set etc., and executes its "assembly code" (with or without JITing).
(Things get complicated in that you can also have combinations of those concepts).
Python in CPython is executed after being compiled to the lower-level Python bytecode. Is this not sufficient to consider CPython to involve a virtual machine?
Almost all interpreters compile an AST down to an "assembly code" that then gets executed. CPython can even execute that bytecode (from .pyc files) without seeing the source at all, just like Java.
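You can inspect that bytecode directly:

```python
import dis

def add(a, b):
    return a + b

# dis shows the instructions CPython's stack-based virtual machine
# actually executes for this function
dis.dis(add)
```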