I recently took a computer vision course taught in both Python and C++; students chose the language for their assignments. I signed up twice and did both. The C++ versions of the exact same tasks were not only significantly faster, they also used far less memory, and in a few tasks I could trivially introduce C++ threads for jaw-dropping speed improvements, while Python's options for that kind of optimization are dismally convoluted. After taking the class twice, it really looks like there are serious gains to be had from an optimization pass via C++/Rust/Go for many systems that could benefit from multi-core implementations.
Python is a glue language. It's one thing to use it to string together a bunch of algorithms written in C, but somebody got the genius idea of using it as a general-purpose language, and now we have backend frameworks that are two orders of magnitude slower than they should be.
A popular approach to Python web serving is to launch a number of "workers" (e.g. via gunicorn) that hang around waiting to serve requests.
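For readers who haven't seen this model, it's usually driven by a config file like the following. This is a hedged sketch of a `gunicorn.conf.py`, not taken from the poster's setup; the worker count and bind address are illustrative. `bind`, `workers`, and `preload_app` are real gunicorn settings:

```python
# gunicorn.conf.py -- a typical prefork configuration (illustrative numbers)
bind = "0.0.0.0:8000"
workers = 4          # one OS process per worker, each a full CPython interpreter
preload_app = True   # import the app in the master before forking,
                     # so workers can share those pages copy-on-write
```

With `preload_app = False` (the default), every worker re-imports the application and its dependencies independently, which is one way you end up with ~250MB of non-shared memory per worker.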
Each of these workers, in code I was recently running (here), idled using ~250MB of non-shared memory, with about 40 workers needed to handle some fairly basic load. :(
I rewrote the code in Go. No workers needed (just goroutines), and the whole thing idles at about 20MB of memory, completely replacing all those Python workers. o_O
This doesn't seem to be all that unusual for Python either.
In a forking model that shouldn't be the case; I'd guess the workers are loading and initializing things post-fork that likely could have been done pre-fork?
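The pre-fork idea can be sketched directly with `os.fork`. This is a minimal illustration, not anyone's production code: it assumes a POSIX system (`os.fork` doesn't exist on Windows), and `BIG_TABLE`/`lookup` are made-up names standing in for models, caches, or config loaded at startup:

```python
import os

# Heavy, read-only data loaded ONCE in the parent, before forking.
# Under copy-on-write fork semantics the children share these pages
# until something writes to them.
BIG_TABLE = list(range(1_000_000))

def lookup(i):
    # Workers only read the pre-loaded data.
    return BIG_TABLE[i]

pids = []
for i in range(4):
    pid = os.fork()
    if pid == 0:
        # Child "worker": use the shared data, then exit cleanly.
        os._exit(0 if lookup(i) == i else 1)
    pids.append(pid)

failures = 0
for pid in pids:
    _, status = os.waitpid(pid, 0)
    if os.WEXITSTATUS(status) != 0:
        failures += 1
```

If each worker instead builds `BIG_TABLE` after the fork, you pay the full memory cost once per worker rather than once per machine. (How well the sharing holds up in CPython specifically is another question, as the sibling comment about refcounting notes.)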
That said, Python devs are some of the worst engineers I encounter, so it’s not surprising things are being implemented incorrectly.
Last I heard, forking wasn't a very effective memory-sharing technique on CPython because of the way it does reference counting: if you load things in before you fork, then when the children start doing work they update the refcounts on all those pre-loaded objects and scribble all over that memory, forcing most of the pages to be copied anyway.
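For what it's worth, CPython ships a partial mitigation: `gc.freeze()` (added in 3.7, reportedly motivated by Instagram's prefork deployment) moves all currently tracked objects into a "permanent generation" that the cyclic collector never touches, so GC passes in the children stop dirtying those pages. It doesn't stop ordinary refcount updates from copying pages, though newer CPythons also make some core objects immortal (PEP 683), which helps at the margins. A minimal sketch, again assuming a POSIX system with `os.fork`:

```python
import gc
import os

# Parent: load the long-lived, shared objects first.
cache = [str(i) for i in range(100_000)]

# Per the gc docs: disable collection, then freeze, so the collector
# won't rewrite GC headers in the shared pages after forking.
gc.disable()
gc.freeze()
frozen = gc.get_freeze_count()  # objects now in the permanent generation

pid = os.fork()
if pid == 0:
    # Child: reads of the frozen cache won't trigger GC traffic on it.
    # (Refcount writes on objects the child actually touches still
    # copy those pages -- freeze only removes the collector's share.)
    os._exit(0 if len(cache) == 100_000 else 1)
_, status = os.waitpid(pid, 0)
```

Children that allocate heavily can call `gc.enable()` for their own new objects; the frozen parent objects stay exempt.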
I worked on a facial recognition system written in C++ that used SIMD optimizations to achieve 25M facial compares per core per second. On that same system I wrote an optimized ffmpeg player library that consumed only one core while doing a few hundred frames per second of 4K video.