
I think it is surprising to a lot of people who do take it as read that async will be faster.

As I describe in the first line of my article, I don't think that people who think async is faster have unreasonable expectations. It seems very intuitive to assume that greater concurrency would mean greater performance - at least on some measure.

> When you're dealing with external REST APIs that take multiple seconds to respond, then the async version is substantially "faster" because your process can get some other useful work done while it's waiting.

I'm afraid I also don't think you have this right conceptually. An async implementation that does multiple ("embarrassingly parallel") tasks in the same process - whether that is DB IO waiting or microservice IO waiting - is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things. In fact in practice an async version is normally lower throughput, higher latency and more fragile. This is really what I'm getting at when I say async is not faster.

Fundamentally, you do not waste "3 billion cpu cycles" waiting 1000ms for an external service. Making alternative use of the otherwise idle CPU is the purpose (and IMO the proper domain) of operating systems.




> Fundamentally, you do not waste "3 billion cpu cycles" waiting 1000ms for an external service. Making alternative use of the otherwise idle CPU is the purpose (and IMO the proper domain) of operating systems.

Sure, the operating system can find other things to do with the CPU cycles when a program is blocked on IO, but that doesn't help the program you're currently trying to run.

> An async implementation that does multiple ("embarrassingly parallel") tasks in the same process - whether that is DB IO waiting or microservice IO waiting - is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things. In fact in practice an async version is normally lower throughput, higher latency and more fragile. This is really what I'm getting at when I say async is not faster.

You're right. "Arbitrary programs will run faster" is not the promise of Python async.

Python async does help a program work faster in the situation that phodge just described (waiting for web requests, or waiting for a slow hardware device), since the program can do other things while waiting for the blocked IO (unlike a Python program that does not use async and could only proceed linearly through its instructions). That's the problem that Python asyncio purports to solve. It is still subject to the Global Interpreter Lock, meaning it's still bound to one thread. (Python's multiprocessing library is needed to overcome the GIL and break out a program into multiple processes, at the cost that cross-process communication now becomes expensive.)
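
To make that concrete, a minimal stdlib-only sketch (asyncio.sleep is an assumption standing in for a real network call; a real app would use an async HTTP client):

    import asyncio

    async def fetch(i):
        # stand-in for a slow network call (assumed ~1s of pure IO wait)
        await asyncio.sleep(1.0)
        return i

    async def main():
        # all ten "requests" wait concurrently: ~1s total, not ~10s
        print(await asyncio.gather(*(fetch(i) for i in range(10))))

    asyncio.run(main())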


> unlike a Python program that does not use async and could only proceed linearly through its instructions

This isn't how it works. While Python is blocked in I/O calls, it releases the GIL so other threads can proceed. (If the GIL were never released then I'm sure they wouldn't have put threading in the Python standard library.)

> Python's multiprocessing library is needed to overcome the GIL

This is technically true, in that if you are running up against the GIL then the only way to overcome it is to use multiprocessing. But blocking IO isn't one of those situations, so you can just use threads.

The comparison here is not async vs just doing one thing. It's async vs threads. I believe that's what the performance comparison in the article is about, and if threads were as broken as you say then obviously they wouldn't have performed better than asyncio.
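
For what the threaded side of that comparison looks like, a minimal sketch (the URL is a placeholder; the point is only that blocking calls release the GIL):

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    URLS = ["http://example.com/"] * 10  # placeholder workload

    def fetch(url):
        # the GIL is released while this call blocks on the network,
        # so the other worker threads keep making progress
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    with ThreadPoolExecutor(max_workers=10) as pool:
        pages = list(pool.map(fetch, URLS))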

--------

As an aside, many C-based extensions also release the GIL when performing CPU-bound computations, e.g. numpy and scipy. So the GIL doesn't even prevent you from using multithreading in CPU-heavy applications, so long as the operations are relatively large (e.g. a few calls to multiply huge matrices together would parallelise well, but many calls to multiply tiny matrices together would heavily contend the GIL).
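
A rough sketch of that large-vs-small distinction (assumes numpy is installed; note BLAS libraries may also thread internally, which can confound the timings):

    import time
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    big = [np.random.rand(2000, 2000) for _ in range(4)]
    small = [np.random.rand(8, 8) for _ in range(100_000)]

    def matmul(m):
        # numpy releases the GIL inside the underlying BLAS call
        return m @ m

    for name, work in (("big", big), ("small", small)):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=4) as pool:
            list(pool.map(matmul, work))
        # "big" should scale across threads; "small" mostly contends the GIL
        print(name, time.perf_counter() - start)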


> > Python's multiprocessing library is needed to overcome the GIL

> No it's not, just use threads.

I just wanted to expand on this a little to describe some of the downsides to threads in Python.

Multi-threaded logic can be (and often is) slower than single-threaded logic, because threading introduces the overhead of lock contention and context switching. David Beazley did a talk illustrating this in 2010:

https://www.youtube.com/watch?v=Obt-vMVdM8s

He also did a great talk about coroutines in 2015 where he explores threading and coroutines a bit more:

https://www.youtube.com/watch?v=MCs5OvhV9S4&t=525s

In workloads that are often "blocked", like network calls or other I/O-bound workloads, threads can provide similar benefits to coroutines, but with overhead. Coroutines seek to provide the same benefit with less overhead (no lock contention, fewer context switches by the kernel).

These probably aren't the right guidelines for everyone, but I generally use them when thinking about concurrency (and pseudo-concurrency) in Python:

- Coroutines where I can.

- Multi-processing where I need real concurrency.

- Never threads.


Ah ha! Now we have finally reached the beginning of the conversation :-)

The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO. The point of the article is to dispel that belief. There seems to be some debate about whether the article is really representative, and I'm very curious about that. But then the parent comment to mine took us on an unproductive detour based on the misconception that Python threads don't work at all. Now your comment has brought up that original belief again, but you haven't referenced the article at all.


I didn't reference the article because I provided more detailed references which explore the difference between threads and coroutines in Python in much greater depth.

The point of my comment is to say that neither threads nor coroutines will make Python _faster_ in and of themselves. Quite the opposite, in fact: threading adds overhead, so unless the benefit is greater than the overhead (e.g. lock contention and context switching) your code will actually be net slower.

I can't recommend the videos I shared enough; David Beazley is a great presenter. One of the few people who can do talks centered around live coding that keep me engaged throughout.

> The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO. The point of the article is to dispel that belief.

The disconnect here is that the article isn't claiming that asyncio is not faster than threads. In fact the article only claims that asyncio is not a silver bullet guaranteed to increase the performance of any Python logic. The misconception it is trying to clear up, in its own words, is:

> Sadly async is not go-faster-stripes for the Python interpreter.

What I, and many others are questioning is:

A) Is this actually as widespread a belief as the article claims it to be? None of the results are surprising to me (or, apparently, to some others).

B) Is the article accurate in its analysis and conclusion?

As an example, take this paragraph:

> Why is this? In async Python, the multi-threading is co-operative, which simply means that threads are not interrupted by a central governor (such as the kernel) but instead have to voluntarily yield their execution time to others. In asyncio, the execution is yielded upon three language keywords: await, async for and async with.

This is a really confusing paragraph because it seems to mix terminology. A short list of problems in this quote alone:

- Async Python != multi-threading.

- Multi-threading is not co-operatively scheduled; threads are indeed interrupted by the kernel (context switches between threads in Python do actually happen).

- Asyncio is co-operatively scheduled and pieces of logic have to yield to allow other logic to proceed. This is a key difference between asyncio (coroutines) and multi-threading (threads); see the sketch after this list.

- Asynchronous Python can be implemented using coroutines, multi-threading, or multi-processing; it's a common noun, but the quote uses it as a proper noun, leaving us guessing what the author intended to refer to.
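
A minimal sketch of that cooperative-vs-preemptive difference (pure stdlib; the busy loop is a stand-in for any code that never awaits):

    import asyncio

    async def hog():
        # never awaits: under cooperative scheduling nothing else
        # runs until this finishes (an OS thread would be preempted)
        total = 0
        for i in range(10_000_000):
            total += i
        return total

    async def heartbeat():
        for _ in range(5):
            print("tick")
            await asyncio.sleep(0.1)

    async def main():
        # the heartbeat stalls while hog() runs
        await asyncio.gather(hog(), heartbeat())

    asyncio.run(main())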

Additionally, there are concepts and interactions missing from the article, such as the GIL's scheduling behavior. In the second video I shared, David Beazley actually shows how the GIL gives compute-intensive tasks higher priority, which is the opposite of typical scheduling priorities (e.g. kernel scheduling) and leads to adverse latency behavior.

So looking at the article as a whole, I don't think the underlying intent of the article is wrong, but the reasoning and analysis presented is at best misguided. Asyncio is not a performance silver bullet; it's not even real parallelism. Multi-processing and use of C extensions are the bigger bang for the buck when it comes to performance. But none of this is surprising, and is expected if you really think about the underlying interactions.

To rephrase what you think I thought:

> The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO.

Is actually more like:

> Asyncio is more efficient than multi-threading in Python. It is also comparatively more variable than multi-processing, particularly when dealing with workloads that saturate a single event loop. Neither multi-threading nor asyncio is actually parallel in Python; for that you have to use multi-processing to escape the GIL (or some C extension which you trust to safely execute outside of GIL control).

---

Regarding your aside example, it's true some C extensions can escape the GIL, but oftentimes it's with caveats and careful consideration of where/when you can escape the GIL successfully. Take, for example, this scipy cookbook entry on parallelization:

https://scipy-cookbook.readthedocs.io/items/ParallelProgramm...

It's not often the case that using a C extension will give you truly parallel multi-threading without significant and careful code refactoring.


For single processes you’re right, but this article (and a lot of the activity around asyncio in Python) is about backend webdev, where you’re already running multiple app servers. In this context, asyncio is almost always slower.


> But blocking IO isn't one of those situations, so you can just use threads.

Threads and async are not mutually exclusive. If your system resources aren't heavily loaded, it doesn't matter; just choose the library you find most appropriate. But threads require more system overhead, and eventually adding more threads will reduce performance. So if it's critical to thoroughly maximize system resources, and your system cannot handle more threads, you need async (and threads).
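
For what "async (and threads)" can look like in practice, a minimal sketch using asyncio's thread-pool escape hatch (asyncio.to_thread needs Python 3.9+; earlier versions use loop.run_in_executor):

    import asyncio
    import time

    def blocking_call():
        # stand-in for a library with no async support
        time.sleep(1.0)
        return "done"

    async def main():
        # the blocking work runs on pool threads while the event
        # loop stays free to service other coroutines
        print(await asyncio.gather(
            asyncio.to_thread(blocking_call),
            asyncio.to_thread(blocking_call),
        ))

    asyncio.run(main())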


> But threads require more system overhead, and eventually adding more threads will reduce performance.

Absolutely false. OS threads are orders of magnitude lighter than any Python coroutine implementation.


> OS threads are orders of magnitude lighter than any Python coroutine implementation.

But python threads, which have extra weight on top of a cross-platform abstraction layer on top of the underlying OS threads, are not lighter than python coroutines.

You aren't choosing between Python threads and unadorned OS threads when writing Python code.


You're absolutely right.

I'm pointing out that this is a Python problem, not a threads problem, a fact which people don't understand.


Everyone has been discussing relative performance of different techniques within Python; there is neither a basis to suggest from that that people don't understand that aspects of that are Python specific, nor a reason to think that that is even particularly relevant to the discussion.


Okay, then let's do a bakeoff! You outfit a Python webserver that only uses threads, and I'll outfit an identical webserver that also implements async. The server that handles the most requests/sec wins. I get to pick the workload.


FWIW, I have a real world Python3 application that does the following:

- receives an HTTP POST multipart/form-data that contains three file parts. The first part is JSON.

- parses the form.

- parses the JSON.

- depending upon the JSON accepts/rejects the POST.

- for accepted POSTs, writes the three parts as three separate files to S3.

It runs behind nginx + uwsgi, using the Falcon framework. For parsing the form I use streaming-form-data which is cython accelerated. (Falcon is also cython accelerated.)

I tested various deployment options. cpython, pypy, threads, gevent. Concurrency was more important than latency (within reason). I ended up with the best performance (measured as highest RPS while remaining within tolerable latency) using cpython+gevent.

It's been a while since I benchmarked and I'm typing this up from memory, so I don't have any numbers to add to this comment.
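
For anyone unfamiliar with the gevent approach: outside of uwsgi (which has its own gevent mode), it's typically wired in by monkey-patching the stdlib before anything else, roughly like this sketch (placeholder URL):

    # gevent makes blocking stdlib IO cooperatively yield to other greenlets
    from gevent import monkey
    monkey.patch_all()  # must run before importing anything that uses sockets

    import gevent
    import urllib.request

    def fetch(url):
        # reads like blocking code, but yields to the gevent hub while waiting
        return urllib.request.urlopen(url).read()

    jobs = [gevent.spawn(fetch, "http://example.com/") for _ in range(10)]
    gevent.joinall(jobs, timeout=30)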


Each Linux thread has at least an 8MB virtual memory overhead. I just tested it, and was able to create one million coroutines in a few seconds and with a few hundred megabytes of overhead in Python. If I created just one thousand threads, it would take possibly 8 gigs of memory.


Virtual memory is not memory. You're effectively just bumping an offset, there's no actual allocations involved.

> ...it would take possibly 8 gigs of memory.

No. Nothing is 'taken' when virtual memory is requested.


But have you tried creating one thousand OS threads and measuring the actual memory usage? If I recall correctly, I read an article explaining that threads on Linux don't actually claim their 8MB each quite so literally. I need to recheck that later.


You're right, I've read the same. Using Python 3.8, creating 12,000 threads with `time.sleep` as the target clocks in at 200MB resident memory.
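
Roughly the kind of harness that produces that number (a Unix-only sketch; ru_maxrss is kilobytes on Linux, bytes on macOS, and you may need to raise OS thread limits first):

    import resource
    import threading
    import time

    threads = [threading.Thread(target=time.sleep, args=(60,), daemon=True)
               for _ in range(12_000)]
    for t in threads:
        t.start()

    # peak resident set size of this process, threads included
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss, "kB max RSS")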


People seem to keep misunderstanding the GIL. It's the Global Interpreter Lock, and it's effectively the lock around all Python objects and structures. This is necessary because Python objects have no thread ownership model, and the development team does not want per-object locks.

During any operation that does not need to modify Python objects, it is safe to unlock the GIL. Yielding control to the OS to wait on I/O is one such example, but doing heavy computation work in C (e.g. numpy) can be another.


To clarify that the CPython devs aren't being arbitrary here: There have been attempts at per-object or other fine-grained locking, and they appear to be less performant than a GIL, particularly for the single-threaded case.

Single-threaded performance is a major issue as that's most Python code.


Yes. I expect generic fine-grained locking, especially per-object locks, to be less performant for multi-threaded code too, as locks aren't cheap, and even with the GIL, lock overhead could still be worse than a good scheduler.

Any solution which wants to consider per-object locking has to consider removing refcounting, or locking the refcount bits separately, as locking/unlocking objects to twiddle their refcounts is going to be ridiculously expensive.

Ultimately, the Python ownership and object model is not conducive to proper threading, as most objects are global state and can be mutated by any thread.


Instead of disagreeing with some of your vague assertions, I'll just make my own points for people who want to consider using async.

Workers (which usually live in a new process) are not efficient. Processes are extremely expensive and subjectively harder for exception handling. Threads are lighter weight... and even better are async implementations, which use a much more scalable FSM to handle this.

Offloading work to things not subject to the GIL is the reason async Python got so much traction. It works really well.


This is often a point of confusion for people when looking at Erlang, Elixir or Go code. Concurrency beyond leveraging the available CPUs doesn't really add any advantage.

On the web, when the bulk of your application's time is spent waiting on APIs, database queries, external caches or disk I/O, it creates a dramatic increase in the capacity of your server if you can do that waiting with minimal RAM overhead.

It's one of the big reasons I've always wanted to see TechEmpower create a test version that continues to increase concurrency beyond 512 (as high as maybe 10k). I think it would be interesting.


> On the web, when the bulk of your application's time is spent waiting on APIs, database queries, external caches or disk I/O, it creates a dramatic increase in the capacity of your server if you can do that waiting with minimal RAM overhead.

Python doesn't block on I/O.


Of course it does.


It releases the GIL.

Edit: sorry I can do better.

If you're using async/await to not block on I/O while handling a request, you still have to wait for that I/O to finish before you return a response. Async adds overhead because you schedule the coroutine and then resume execution.

The OS is better at scheduling these things because it can do it in kernel space in C. Async/await pushes that scheduling into user space, sometimes in interpreted code. Sometimes you need that, but very often you don't. This is in conflict with "async the world", which effectively bakes that overhead into everything. This explains the lower throughput, higher latency, and higher memory usage.

So effectively this means "run more processes/threads". If you can only have 1 process/thread and cannot afford to block, then yes async is your only option. But again that case is pretty rare.


From my understanding the primary use of concurrency in Erlang/Elixir is for isolation and operational consistency. Do you believe that not to be the case?


The primary use of concurrency in Erlang is modelling a world that is concurrent.

If you go back to the origins of Erlang, the intent was to build a language that would make it easier to write software for telecom (voice) switches; what comes out of that is one process for each line, waiting for someone to pick up the line and dial or for an incoming call to make the line ring (and then connecting the call if the line is answered). Having this run as an isolated process allows for better system stability --- if someone crashes the process attached to their line, the switch doesn't lose any of the state for the other lines.

It turns out that a 1980s design for operational excellence works really well for (some) applications today. Because the processes are isolated, it's not very tricky to run them in parallel. If you've got a lot of concurrent event streams (like users connected via XMPP or HTTP), assigning each a process makes it easy to write programs for them, and because Erlang processes are significantly lighter weight than OS processes or threads, you can have millions of connections to a machine, each with its own process.

You can absolutely manage millions of connections in other languages, but I think Erlang's approach to concurrency makes it simpler to write programs to address that case.


That's a big topic. The shortest way I can summarize it though:

Immutable data, per-process heap isolation and lack of shared state, combined with supervision trees (made possible by extremely low-overhead concurrency) and preemptive scheduling to prevent any one process from taking over the CPU... create that operational consistency.

It's a combination of factors that have gone into the language design that make it all possible though. Very big and interesting topic.

But it does create a significant capacity increase. Here's a simple example with websockets.

https://dockyard.com/blog/2016/08/09/phoenix-channels-vs-rai...


This is true for compiled languages such as the ones you mention, but generally does not apply to Python, which, as an interpreted language, tends to add CPU overhead for even the smallest tasks.


A CPU can do billions of operations every second. When you have 200ms for every request, that overhead is not that large; you're still blocked by I/O.


For local services like databases, real-world benchmarks disagree.


You should add that you mean just databases. I've just looked at your profile, and as I understand it, that's your focus.

I built a service that was making a lot of requests. Enough that at some point we ran out of the 65k connection limit for basic Linux polling (we needed to switch to epoll). Some time after that we ran out of other resources, and switching from threads to threads+greenlets really solved our problem.


>... is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things.

This is very true, especially when actual work is involved.

Remember, the kernel uses the exact same mechanism to have a process wait on a synchronous read/write as it does for a process issuing epoll_wait. Furthermore, isolating tasks into their own processes (or, sigh, threads) allows the kernel scheduler to make much better decisions, such as scheduling fairness and QoS to keep the system responsive under load surges.

Now, async might be more efficient if you serve extreme numbers of concurrent requests from a single thread and your request processing is so simple that the scheduling cost becomes a significant portion of the processing time.

... but if your request processing happens in Python, that's not the case. Your own scheduler implementation (your event loop) will likely also end up eating some resources (remember, you're not bypassing anything, just duplicating functionality), and is very unlikely to be as smart or as fair as that of the kernel. It's probably also entirely unable to do parallel processing.

And this is all before we get into the details of how you easily end up fighting against the scheduler...


Yeah except nodejs will beat flask in this same exact benchmark. Explain that.


CPython doesn't have a JIT, while node.js does. If you want to compare apples to apples, try looking at Flask running on PyPy.


Edit: after reading the article, I guess it's safe to say that everything below is false :)

---

I'd guess the C++ event loop is more important than the JIT?

Maybe a better comparison is quart (with e.g. uvicorn):

https://pgjones.gitlab.io/quart/

https://www.uvicorn.org/

Or Sanic / uvloop?

https://sanicframework.org/

https://github.com/MagicStack/uvloop


Plain sanic runs much faster than the uvicorn-ASGI-sanic stack used in the benchmark, and the ASGI API in the middle is probably degrading other async frameworks' performance too. But then this benchmark also has other major issues, like using HTTP/1.0 without keep-alive in its Nginx proxy_pass config (keep-alive again has a huge effect on performance, and would be enabled on real performance-critical servers). https://sanic.readthedocs.io/en/latest/sanic/nginx.html


Interesting, thank you. I wasn't aware nginx was so conservative by default.

https://nginx.org/en/docs/http/ngx_http_proxy_module.html#pr...


You're not completely off. There might be issues with async/await overhead that would be solved by a JIT, but also, if you're using asyncio, the first _sensible_ choice to make would be to swap out the default event loop for one explicitly designed to be performant, such as uvloop's, because asyncio.SelectorEventLoop is designed to be straightforward, not fast.

There's also the major issue of backpressure handling, but that's a whole other story, and not unique to Python.

My major issue with the post I replied to is that there are a bunch of confounding issues that make the comparison given meaningless.


The database is the bottleneck. JIT or even C++ shouldn't even be a factor here. Something is wrong with the python implementation of async/await.


If I/O-bound tasks are the problem, that would tend to indicate an issue with the I/O event loop, not with Python and its async/await implementation. If the default asyncio.SelectorEventLoop is too slow for you, you can subclass asyncio.AbstractEventLoop and implement your own, such as building one on top of libuv. And somebody's already done that: https://github.com/MagicStack/uvloop
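
Swapping it in is tiny, per uvloop's documented usage (pip install uvloop; the no-op coroutine below is a placeholder):

    import asyncio
    import uvloop

    # replace the default pure-Python event loop with libuv's
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

    async def main():
        await asyncio.sleep(0)  # placeholder for unchanged application code

    asyncio.run(main())  # now runs on uvloop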

Moreover, even if there's _still_ a discrepancy, unless you're profiling things, the discussion is moot. This isn't to say that there aren't problems (there almost certainly are), but that you should get as close as possible to an apples-to-apples comparison first.


When I talk about async await I'm talking about everything that encompasses supporting that syntax. This includes the I/O event loop.

So really we're in agreement. You're talking about reimplementing python specific things to make it more performant, and that is exactly another way of saying that the problem is python specific.


No, we're not in agreement. You're confounding a bunch of independent things, and that is what I object to.

It's neither fair nor correct to mush together CPython's async/await implementation with the implementation of asyncio.SelectorEventLoop. They are two different things and entirely independent of one another.

Moreover, it's neither fair nor correct to compare asyncio.SelectorEventLoop with the event loop of node.js, because the former is written in pure Python (with performance only tangentially in mind) whereas the latter is written in C (libuv). That's why I pointed you to uvloop, which is an implementation of asyncio.AbstractEventLoop built on top of libuv. If you want to even start with a comparison, you need to eliminate that confounding variable.

Finally, the implementation matters. node.js uses a JIT, while CPython does not, giving them _much_ different performance characteristics. If you want to eliminate that confounding variable, you need to use a Python implementation with a JIT, such as PyPy.

Do those two things, and then you'll be able to do a fair comparison between Python and node.js.


Except the problem here is that those tests were bottlenecked by IO. Whether you're testing C++, pypy, libuv, or whatever, it doesn't matter.

All that matters is the concurrency model, because the application he's running is barely doing anything except IO, and anything outside of IO becomes negligible: after enough requests, those sync worker processes will all be spending the majority of their time blocked on an IO request.

The basic essence of the original claim is that sync is not necessarily better than async for all cases of high-IO tasks. I bring up node as a counterexample because that async model IS faster for THIS same case. And bringing up node is 100% relevant because IO is the bottleneck, so it doesn't really matter how much faster node is executing, as IO should be taking most of the time.

Clearly and logically the async concurrency model is better for these types of tasks, so IF tests indicate otherwise for PYTHON then there's something up with python specifically.

You're right, we are in disagreement. I didn't realize you completely failed to understand what's going on and felt the need to do an apples-to-apples comparison when such a comparison is not needed at all.


No, I understand. I just think that your comparison with _node.js_ when there are a bunch of confounding variables is nonsense. Get rid of those and then we can look at why "nodejs will beat flask in this same exact benchmark".


> I just think that your comparison with _node.js_ when there are a bunch of confounding variables is nonsense

And I'm saying all those confounding variables you're talking about are negligible and irrelevant.

Why? Because the benchmark test in the article is a test where every single task is 99% bound by IO.

What each task does is make a database call AND NOTHING ELSE. Therefore you can safely say that for either a python or Node request, less than 1% of a single task will be spent on processing while 99% is spent on IO.

You're talking about scales on the order of 0.01% vs. 0.0001%. Sure maybe node is 100x faster, but it's STILL NEGLIGIBLE compared to IO.

It is _NOT_ nonsense.

You do not need an apples-to-apples comparison to come to the conclusion that the problem is specific to the python implementation. There ARE NO confounding variables.


> And I'm saying all those confounding variables you're talking about are negligible and irrelevant.

No, you're asserting something without actual evidence, and the article itself doesn't actually state that either: it contains no breakdown of where the time is spent. You're assuming the issue lies in one place (Python's async/await implementation) when there are a bunch of possible contributing factors _which have not been ruled out_.

Unless you've actually profiled the thing and shown where the time is used, all your assertions are nonsense.

Show me actual numbers. Prove there are no confounding variables. You made an assertion that demands evidence and provided none.


>Unless you've actually profiled the thing and shown where the time is used, all your assertions are nonsense.

It's data science that is causing this data-driven attitude to invade people's minds. Do you not realize that logic and assumptions play a big role in drawing conclusions WITHOUT data? In fact, if you're a developer you know about a way to DERIVE performance WITHOUT a single data point or benchmark or profile. You know about this method, you just haven't been able to see the connections, and your model of how this world works (data-driven conclusions only) is highly flawed.

I can look at two algorithms and I can derive with logic alone which one is O(N) and which one is O(N^2). There is ZERO need to run a benchmark. The entire theory of complexity is a mathematical theory used to assist us in arriving AT PERFORMANCE conclusions WITHOUT EVIDENCE/BENCHMARKS.

Another thing you have to realize is the importance of assumptions. Things like: 1 + 1 = 2 will always remain true, and a profile or benchmark run on a specific task is an accurate observation of THAT task. These are both reasonable assumptions to make about the universe. They are also the same assumptions YOU are making every time you ask for EVIDENCE and benchmarks.

What you aren't seeing is this: The assumptions I AM making ARE EXACTLY THE SAME: reasonable.

>you're asserting something without actual evidence, and the article itself doesn't actually state that either: it contains no breakdown of where the time is spent

Let's take it from the top, shall we?

I am making the assumption that tasks done in parallel ARE faster than tasks done sequentially.

The author specifically stated he made a server where each request fetches a row from the database. And he is saying that his benchmark consisted of thousands of concurrent requests.

I am also making the assumption that for thousands of requests and thousands of database requests, MOST of the time is spent on IO. It's similar to deriving O(N) from a for loop. I observe the type of test the author is running and I make a logical conclusion about WHAT SHOULD be happening. Now you may ask: why is it reasonable to assume that IO specifically takes up most of the time of a single request? Because all of web development is predicated on this assumption. It's the entire reason why we use inefficient languages like python, node or Java to run our web apps instead of C++: because the database is the bottleneck. It doesn't matter if you use python or ruby or C++, the server will always be waiting on the db. It's also a reasonable assumption given my experience working with python and node and databases. Databases are the bottleneck.

Given this highly reasonable assumption, and in the same vein as using complexity theory to derive performance speed, it is highly reasonable for me to say that the problem IS PYTHON SPECIFIC. No evidence NEEDED. 1 + 1 = 2. I don't need to put that into my calculator 100 times to get 100 data points for some type of data driven conclusion. It's assumed and it's a highly reasonable assumption. So reasonable that only an idiot would try to verify 1 + 1 = 2 using statistics and experiments.

Look, you want data and no assumptions? First you need to get rid of the assumption that a profiler and benchmark are accurate and truthful. Profile the profiler itself. But then you're making another assumption: that the profiler that profiled the profiler is accurate. So you need to get me data on that as well. You see where this is going?

There is ZERO way to make any conclusion about anything without making an assumption. And even with an assumption, the scientific method HAS NO way of proving anything to be true. Science functions on the assumption that probability theory is an accurate description of events that happen in the real world, AND even under this assumption there is no way to sample all possible EVENTS for a given experiment, so we can only verify causality and correlations to a certain degree.

The truth is blurry and humans navigate through the world using assumptions, logic and data. To intelligently navigate the world you need to know when to make assumptions and when to use logic and when data driven tests are most appropriate. Don't be an idiot and think that everything on the face of the earth needs to be verified with statistics, data and A/B tests. That type of thinking is pure garbage and it is the same misguided logic that is driving your argument with me.


Buddy, you can make all the "logical arguments" you want, but if you can't back them up with evidence, you're just making guesses.


Nodejs is faster than Python as a general rule, anyway. As I understand it, Nodejs compiles Javascript; Python interprets Python code.

I do a lot of Django and Nodejs and Django is great to sketch an app out, but I've noticed rewriting endpoints in Nodejs directly accessing postgres gets much better performance.

Just my 2c


CPython, the reference implementation, interprets Python. PyPy interprets and JIT compiles Python, and more exotic things like Cython and Grumpy statically compile Python (often through an intermediate language like C or Go).

Node.js, using V8, interprets and JIT compiles JavaScript.

Although note that, while Node.js is fast relative to Python, it's still pretty slow. If you're writing web-stuff, I'd recommend Go instead for casually written, good performance.


The comparison of Django against no-ORM is a bit weird, given that rewriting your endpoint in python without Django or an ORM would also have produced better results, I suppose.


Right, but this test focused on concurrent IO. The bottleneck is not the interpreter but the concurrency model. It doesn't matter if you coded it in C++; the JIT shouldn't even be a factor here because the bottleneck is IO, and therefore ONLY the concurrency model should be a factor. You should only see differences in speed based on which model is used. All else is negligible.

So you have two implementations of async that are both bottlenecked by IO. One is implemented in node. The other in python.

The node implementation behaves as expected in accordance with theory, meaning that for thousands of IO-bound tasks it performs faster than a fixed number of sync worker threads (say, 5 threads).

This makes sense, right? Given thousands of IO-bound tasks, eventually all 5 threads must be doing IO and therefore blocked on every task, while the single-threaded async model is always context switching whenever it encounters an IO task, so it is never blocked and is always doing something...

Meanwhile the python async implementation doesn't perform in accordance with theory. 5 async workers are slower than 5 sync workers on IO-bound tasks. 5 sync workers should eventually be entirely blocked by IO, and the 5 async workers should never be blocked at all... Why is the python implementation slower? The answer is obvious:

It's python specific. It's python that is the problem.


JIT compiler.


The bottleneck is IO. The concurrency model should be the limiting factor here.

NodeJS is faster than flask because of the concurrency model and NOT because of the JIT.

The python async implementation being slower than the python sync implementation means one thing: Something is up with python.

The poster implies that, given the concurrency model, the outcome of these tests is expected.

The reality is, these results are NOT expected. Something is going on specifically with the python implementation.


You mean express.js?


NodeJS primitives are enough to produce the same functionality as flask without the need for an extra framework.


Async IO was in large part a response to "how can my webserver handle xx thousand connections per second" (or, in the case of Erlang, "how do you handle millions of phone calls at once"). Starting 15 threads to do IO works great, but once you wait for hundreds of things at once the overhead from context switching becomes a problem, and at some point the OS scheduler itself becomes a problem.


Not really. At least on Linux, the scheduler is O(1). There is no difference between one process waiting for 10k connections, or 10k processes waiting for 1 each. And there is hardly a context switch either, if all these 10k processes use the same memory map (as threads do).

I've tested this extensively on Linux. There is no more CPU used for threads vs epoll.

On the other hand, if you don't get the epoll implementation exactly right, you may end up with many spurious calls. E.g. simply reading slow data from a socket in golang on Linux incurs considerable overhead: a first read that is short, another read that returns EWOULDBLOCK, and then a syscall to re-arm the epoll. With OS threads, that is just a single call, where the next call blocks and eventually returns new data.

Edit: one thing I haven't considered when testing is garbage collection. I'm absolutely convinced that up to 10k connections, threads or async doesn't matter, in C or Rust. But it may be much harder to do GC over 10k stacks than over 8.
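
For reference, the readiness-based pattern under discussion looks roughly like this with Python's stdlib selectors module (epoll on Linux); a minimal, error-handling-free echo server sketch on a placeholder address:

    import selectors
    import socket

    sel = selectors.DefaultSelector()  # epoll on Linux
    server = socket.create_server(("127.0.0.1", 8080))  # placeholder address
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ)

    while True:
        for key, _ in sel.select():        # one wait call covers all sockets
            if key.fileobj is server:
                conn, _ = server.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
            else:
                data = key.fileobj.recv(4096)  # ready, so this won't block
                if data:
                    key.fileobj.send(data)     # sketch: assumes a full send
                else:
                    sel.unregister(key.fileobj)
                    key.fileobj.close()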


I recently read a blog post with benchmarks showing that, for well-written C in their use case, async IO only becomes faster than using threads from around 10k parallel connections. (Though the difference was negligible.)

This seems to also be a major motivation behind io_uring.


I don't think this is true? At least, I've never seen the issue of OS threads be that context switching is slow.

The issue is memory usage, which OS threads take a lot of.

Would userland scheduling be more CPU efficient? Sure, probably in many cases. But I don't think that's the problem with handling many thousands of concurrent requests today.


> is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things

Co-routines are not necessarily faster than threads, but they yield a performance improvement if one has to spin up thousands of them: they have less creation overhead and consume less RAM.


> Co-routines are not necessarily faster than threads, but they yield a performance improvement if one has to spin up thousands of them: they have less creation overhead and consume less RAM.

This hardly matters when spinning up a few thousand threads. Only memory that's actually used is committed, one 4k page at a time. What is 10MB these days? And that is main memory, while it's much more interesting what fits in cache. At that point it doesn't matter if your data is in heap objects or on a stack.

Add to that the fact that Python stacks are mostly on the heap, the real stack growing only due to nested calls in extensions. It's rare for a stack in Python to exceed 4k.


Languages that do green threads don't do them for memory savings, but to save on context switches when a thread is blocked and cannot run. System threads are scheduled by the OS, green threads by the language runtime, which saves a context switch.


Green threads are scheduled by the language runtime and by the OS. If the OS switches from one thread to another in the same process, there is no context switch, really, apart from the syscall itself which was happening anyway (the recv that blocks and causes the switch). At least not on Linux, where I've measured the difference.


This is not what is happening with flask/uwsgi. There is a fixed number of threads and processes with flask. The threads are only parallel for IO, and the processes are parallel always.


Which is fine until you run out of uwsgi workers because a downstream gets really slow at some point. The point of async python isn't to speed things up; it's so you don't have to try to guess the right number of uwsgi workers you'll need in your worst-case scenario and run with those all the time.


Yep, and this test being shown is actually saying that about 5 sync workers acting on thousands of requests are faster than python async workers.

Theoretically it makes no sense. A task manager executing tasks in parallel with IO instead of blocking on IO should be faster... So the problem must be in the implementation.


> I think it is surprising to a lot of people who do take it as read that async will be faster.

Literally the first thing any concurrency course starts with, in the very first lesson, is that scheduling and context-switching overhead are not negligible. Is it so hard to expect our professionals to know the basic principles of what they are dealing with?


> I think it is surprising to a lot of people who do take it as read that async will be faster.

This is because when they are first shown it, the examples are faster, effectively at least, because they get given jobs done in less wallclock time due to reduced blocking.

They learn that, but often don't get told (or work out themselves) that in many cases the difference is so small as to be unmeasurable, or that in other circumstances there can be negative effects (the framework overheads others have already mentioned, more things waiting in RAM with a part-processed working set, which could lead to thrashing in a low-memory situation, greater concurrent load on other services such as a database and the IO system it depends upon, etc.).

As a slightly off-the-topic-of-async example, back when multi-core processing was first becoming cheap enough that it was not just affordable but the default option, I had great trouble trying to explain to a colleague why two IO-intensive database processes he was running were so much slower than when I'd shown him the same process (I'd run them sequentially). He was absolutely fixated on the idea that his four cores should make concurrency the faster option; I couldn't get through to him that in this case the flapping heads on the drives of the time were the bottleneck, and the CPU would be practically idle no matter how many cores it had while the bottleneck was elsewhere.

Some people learn the simple message (async can handle some loads much more efficiently) as an absolute (async is more efficient) and don't consider at all that the situation may be far more nuanced.


> An async implementation that does multiple ("embarrassingly parallel") tasks in the same process

You mean concurrent tasks in the same process?


> I don't think that people who think async is faster have unreasonable expectations

I do.

And I don't think I'm alone nor being unreasonable.



