I doubt spaCy will ever be faster on PyPy (the neural network library Thinc is currently 50% slower). It'd still be really great to get it running, so people who benefit from PyPy for other parts of their stack don't have to manage two Python environments.
Just keep rewriting C as Python. I still remember the day I switched from Numpy and CPython to array.array on PyPy for a 60x boost in benchmarks. (Only a 20x speedup on actual running code; this was for geometry generation in a networked video game server.)
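For anyone curious what that kind of swap looks like, here's a minimal sketch (the function and data are made up, not the actual game server code):

    # Illustrative only: plain array.array buffers instead of NumPy arrays.
    # PyPy's JIT compiles the pure-python loop, whereas many small NumPy
    # calls would each pay the cost of crossing into C.
    from array import array

    def translate(points, dx, dy):
        # points is a flat [x0, y0, x1, y1, ...] buffer of doubles
        out = array('d', points)
        for i in range(0, len(out), 2):
            out[i] += dx
            out[i + 1] += dy
        return out

    verts = array('d', [0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
    moved = translate(verts, 2.0, 3.0)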
This is great. Can't wait until Python 3.5 support is out of beta.
Just out of curiosity, I'd love to hear from others who've used PyPy for their web apps. Are there any issues to look out for? I remember that a few years ago, packages like psycopg2 were not compatible, which made the migration somewhat difficult. Would love to hear real-world experiences here.
We have made a lot of progress over the past few years, especially with compatibility. An example of a web stack is crossbar.io, which recommends PyPy for production and even got its stack running with PyPy on a Raspberry Pi [0]
I use PyPy almost exclusively. I like the availability of their precompiled binaries, and other people's benchmarking criticisms aside, it's orders of magnitude faster than standard Python in all of my actual applications.
psycopg2cffi fixed the psycopg2 issue some time ago.
I've run into compatibility issues with packages that reference math-centric libs (matplotlib?) but aside from that I've been quite happy with it.
Coincidentally I tried PyPy today for my shell, which is around 14K lines of completely unoptimized Python code [1]. I have never used PyPy before, despite being a long-time Python programmer.
I didn't expect PyPy to speed it up, just based on my impressions of the kind of workloads PyPy excels at.
In my first test case (parsing a 976-line shell script), PyPy took 2.0 seconds and CPython took 1.0. And that 2x-slower number held up for a couple of other tests.
I will probably try running the benchmark in a loop to see if PyPy's JIT warms up (does it do that?). But I wasn't really expecting to use PyPy -- I just wanted to see how it does, because there aren't many ways to speed up my code without rewriting a good portion of it.
My impression is that JITs don't work well in general for certain workloads, not just PyPy. IIRC LuaJIT is actually slower than Lua for string-processing workloads. It makes a lot of sense in, say, machine learning applications, which are all floating-point calculations. But string processing is probably dominated by allocations and locality, and the JIT doesn't do very much there, whether it's LuaJIT or PyPy.
I probably overstated it... I don't think JITs, including LuaJIT, do much for string manipulation, but that would only mean LuaJIT is on par with Lua, not slower.
Off the top of my head I think of Lua/Python as 10-100x slower than C, but LuaJIT might get you within 2x for some numerical workloads. But it will not get you within 2x for string workloads -- it's still in the 10x or worse category.
Someone reported a slower benchmark here, but the maintainer argued it was invalid. Still, he says all the time is spent within C functions, which is what I was getting at.
I think the bottom line is that the semantics of Lua strings are completely different from how you manage strings in C, so there's a limit to how much you can optimize the program, even if you have a lot of time to do it (which the JIT does not).
So if I have an existing Django application in production, assuming no package compatibility issues, how long might it take to get it running with PyPy?
I have a production Django application (the https://www.pastery.net pastebin). Just for fun, I'm going to change the container's image from python:3 to pypy:3 and document what happens here.
EDIT:
* Failed building psycopg2. Apparently I need to use psycopg2cffi instead. Retrying...
* manage.py now can't import Django. Hmm.
* Yeah, I have no idea why it can't import Django. I took most of the settings out of settings.py, thinking it somehow caused an ImportError, but it still fails. I think I have to give up at this point, as I have no more guesses.
* Turns out manage.py was trying to launch `/usr/bin/env python` and it needed `/usr/bin/env pypy3`. Seems odd to include a `python` binary on the image and not just alias it to `pypy3`, but it is what it is. Continuing...
Welp, everything seems to be running just fine. Here's the diff of all the changes I had to make:
Basically, install psycopg2cffi instead of psycopg2, use pypy3 for the interpreter instead of python, and add two lines to settings.py. All in all, pretty damn short!
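For reference, the two lines added to settings.py were presumably the psycopg2cffi compatibility shim (that's the documented way to have it stand in for psycopg2):

    # in settings.py, or any module imported before Django's DB backend:
    from psycopg2cffi import compat
    compat.register()  # makes `import psycopg2` resolve to psycopg2cffi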
Thanks! I'll be doing this soon too, then. Most of my apps don't rely on too many obscure packages and/or really large scientific computing packages, so I doubt I'll run into many problems either.
My only experience was a couple of years back, but I remember that there was a "warm-up" period where resource utilization was substantially worse than before the PyPy conversion; then performance would improve.
Again, I haven't followed recent developments, so this may be less of an issue now.
Edit: I should add - we loved it! PyPy was like a magic bullet that solved our performance problems.
I've yet to get PyPy to work well with uwsgi, but I haven't tried it with the past 3-4 releases. However, since uwsgi embeds the runtime in its worker processes, that is sort of expected.
(feedback on this would be great, since someone else might have tried it recently)
3.5 was the first Python 3 worth using, according to Raymond Hettinger... but I think it was 3.6 that made dicts ordered by default and also introduced f-strings. 3.6 is the one to have ;-)
3.6's order-preserving dict was first implemented by PyPy (so it's already had it for years, even in pypy2), and f-strings were backported to pypy3 too.
dicts being ordered is still an implementation detail. The only place they are officially ordered is where they come from the catch-all keyword arguments in your function's signature.
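That kwargs guarantee comes from PEP 468 (Python 3.6), and is easy to see:

    def tag(**attrs):
        # PEP 468: **attrs preserves the order the keyword arguments were
        # passed in, on any conforming 3.6+ implementation -- this part is
        # a language guarantee, not an implementation detail
        return list(attrs)

    print(tag(href='/x', cls='btn', id='go'))  # ['href', 'cls', 'id']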
The original title emphasises that NumPy and Pandas are now functional on PyPy.
The PyPy JIT cannot look inside C code, and crossing the python-c interface is slow, but give it a chance and you may be pleasantly surprised how fast your pure python code can run.
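To make that concrete, here's a toy contrast (illustrative only; actual timings depend heavily on array size and workload):

    import numpy as np

    def norm_numpy(a):
        # each NumPy call crosses the python-c boundary,
        # which the JIT cannot look inside or optimize across
        return np.sqrt((a * a).sum())

    def norm_pure(xs):
        # plain python loop: slow on CPython, but PyPy's JIT
        # can compile it down to a tight machine-code loop
        total = 0.0
        for x in xs:
            total += x * x
        return total ** 0.5

    a = np.arange(1000.0)
    print(norm_numpy(a), norm_pure(a))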
It depends: we are heavily invested in making both NumPy and Pandas faster, so the next releases will improve that. That said, you are quite likely to find your program slower (but it really depends on what you do). It's a good time to experiment, but definitely not to fully switch just yet.
If you have for loops, there are opportunities for the JIT to speed them up. It very much depends how close those loops are to pure Python; if they are, then PyPy should be able to help. Many code constructs do look like large for loops. The JIT will trace them (which requires memory to record the code paths) and, after about 1000 times through, will convert the traced Python into assembler (with checks to bail out if assumptions are violated). So the question you ask is very specific to your workflow, and is best answered by trying it out
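Here's a hedged sketch of how you might observe that warm-up yourself (function names and sizes are made up; on CPython every run should take about the same time):

    import time

    def dot(xs, ys):
        total = 0.0
        for x, y in zip(xs, ys):
            total += x * y
        return total

    xs = [float(i) for i in range(10000)]
    ys = list(xs)

    for run in range(5):
        t0 = time.time()
        for _ in range(500):
            dot(xs, ys)
        # on PyPy the first run includes tracing and compilation; later
        # runs should be much faster once the loop runs as machine code
        print('run %d: %.3fs' % (run, time.time() - t0))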
I wonder if this will run apistar: https://github.com/encode/apistar which is currently the fastest Python web framework out there (on Python 3.6, generally).
In my experience, the answer is nearly always "yes" for pure Python code. Now that the last few PyPy releases support the C API via emulation, the answer is probably still yes for any Python package.
EDIT: apparently it does work [1], though it's not officially supported until/unless it can be tested on Travis CI.
As I do with every PyPy release, I would like to point out that the PyPy official benchmarks for comparison against CPython[0] continue to misleadingly compare their latest and greatest with CPython 2.7.2 (released in 2011), as opposed to the modern CPython 2.7.13 or 3.5.3 versions for which they target API compatibility.
We should indeed compare PyPy3.5 vs CPython 3.5.3, but having a benchmark suite that works on both continues to be a problem.
Regarding 2.7.13 - you might find it surprising, but it's actually SLOWER than 2.7.2; there have been no speed improvements and quite a few speed decreases, so we decided to keep the faster one.
EDIT: part of the problem is that comparing PyPy 2 vs CPython 3 is apples vs oranges, but PyPy3 is not ready yet (unicode improvements I'm working on right now are missing)
> Regarding 2.7.13 - you might find it surprising, but it's actually SLOWER than 2.7.2; there have been no speed improvements and quite a few speed decreases, so we decided to keep the faster one.
I don't find it that surprising, but do find it disappointing that you would run benchmarks against the current version, but not post them online for perusal, nor provide any sort of explanation for the use of the older version in head-to-head comparisons. For me, at least, it produces the impression that PyPy has something to hide, and I doubt I'm the only one.
This is the usual case of budget - if I had budget to have anyone improve the website, improve the buildbot, improve the benchmark comparison, trust me I would do it. Right now there are no volunteers and the benchmark side is sort of lingering on.
I remember a FOSDEM conference where someone from PyPy said they were working with a very low budget; I bet they have no time to spend on fixing a very superficial bad impression. You don't come to PyPy just to try it, you come to PyPy to boost the performance of existing projects.
It seems like the benchmarks should be reasonably reproducible, right? (In the sense that if they're not, then that's a bigger complaint about their validity.)
Can you or I just run the benchmarks against 2.7.13 to see how it goes?
Was surprised to see Cython support on this list. Can somebody elaborate on the relationship between the two? I had always viewed them as alternatives.
Cython is a way of writing compiled code that links against Python in a way that's easier than using the Python.h API directly.
There are two things you can do with this. The first is you can write your performance-sensitive code directly in Cython, in which case, yes, that's a direct competitor to PyPy. (So is writing your performance-sensitive code directly in C and using Python.h to expose it to Python as a native-code module.)
The second is that you can write bindings to existing C (or C-ABI-compatible, really) code in Cython, instead of using C and Python.h to write those bindings. In that case, it's not quite that you care about the performance of your C code, but that it already exists, and you just need to call into it somehow. Having PyPy be able to use these existing codebases is valuable.
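For concreteness, a tiny module of the first kind might look like this (a toy fib.pyx, not taken from any real project):

    # fib.pyx -- Python-like syntax with C type declarations; Cython
    # compiles this to a C extension that both CPython and (with this
    # release) PyPy can load
    def fib(int n):
        cdef int i
        cdef long a = 0, b = 1
        for i in range(n):
            a, b = b, a + b
        return a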
Cython is just a nice way of writing Python C extensions without the boilerplate, which hooks into setuptools and has a more pythonic syntax than C/C++.
They are not so much alternatives as complementary things. I guess this release allows Cython-compiled modules to work directly with PyPy, which is nice.
The aim would be to have Cython generating C code that compiles into extensions compatible with PyPy. I'm guessing this is most easily achieved by doing a bit of work on the Cython end, and a little bit of work on the PyPy end.
Is PyPy ideally plug'n'play, i.e. is it supposed to be able to seamlessly replace the CPython interpreter? (It may not factually be completely compatible, but is it aimed to be?)
No. There's a documented list of things that are different [0]. That said, you likely don't have many of these issues in your code, if any. The ideal is for folks who write portable pure Python without relying on implementation-specific details to be able to run on PyPy and CPython without any hacks.
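Probably the best-known item on that list is garbage collection: PyPy has no reference counting, so code that relies on prompt finalization behaves differently. For example:

    def first_line(path):
        # fine on CPython (refcounting closes the file immediately);
        # on PyPy the file stays open until the GC gets around to it
        return open(path).readline()

    def first_line_portable(path):
        # works identically everywhere
        with open(path) as f:
            return f.readline()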
Basically yes, with caveats. The big gotcha is C libraries exposed to Python. CFFI helps alleviate this issue, but it's not a 100% identical replacement for the C bindings that CPython gives you. See their website for all the details.
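For the curious, CFFI looks like this (a minimal example in the style of the cffi docs, calling printf from the standard C library):

    from cffi import FFI

    ffi = FFI()
    ffi.cdef("int printf(const char *format, ...);")  # declare the C signature
    C = ffi.dlopen(None)                               # load the standard C library
    C.printf(b"hello from C, via cffi\n")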
Thanks @fijal and team for all the effort! This is awesome.
The last update on the PyPy+Pandas wiki[0] is from this August, and it mentions that there are still 15 outstanding failing tests. Does this release mean that 5.9 is now at 100% parity? What does the same metric look like for PyPy+NumPy, and where can that one be tracked if it's not at 100% yet?
I am looking forward to migrating some pipelines over to 5.9 soon.
As you point out, there are still a few failing tests. The number of failing tests on both is very low, but not zero.
The PyPy team has had good collaboration with Pandas and NumPy, but there are some deeper issues with these packages depending on refcount semantics in edge cases, which IMO are rarely seen in real-world scenarios. For NumPy this means using an `out` keyword argument can be tricky, and for Pandas it means some false positives in determining when a dataframe is being held by another reference, making it read-only.
Of course if your workflow depends on those features, they are critical. We are working on full compatibility and also on increasing speed.
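For anyone unfamiliar with the NumPy pattern being flagged, `out=` asks NumPy to write the result into an existing array instead of allocating a new one:

    import numpy as np

    a = np.arange(3.0)
    b = np.ones(3)
    c = np.empty(3)
    np.add(a, b, out=c)  # in-place write into c -- the edge case mentioned above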
Tests failing:
* https://github.com/explosion/spaCy
Confirmed working:
* https://github.com/explosion/thinc
* https://github.com/explosion/preshed
* https://github.com/explosion/cymem
* https://github.com/explosion/murmurhash