
I think the async/await pattern solves one problem really well: UI latency.

My UI background is web testing and C++ game UIs. There are roughly 3 patterns in multithreaded games: thread per responsibility (old/bad), barriers that block all threads and give each game system all the threads in turn, or something smart. Few games do something smart, and thread per responsibility is not ideal, so we will ignore it for now.

In games, pausing everything and letting the physics system have all the threads for a few milliseconds is often "fast enough". Then the graphics system gets all the threads, and so on, so that eventually every system gets all the threads even though each one rarely needs them all. Sometimes two systems are both close to single threaded and have no data contention, so they might share the threads, but this is almost always a decision made manually by experts.

This means that once the UI (the buttons, the text, cursors, status bars, etc.) gets its turn, there won't be any race conditions (good), but if it needs to request something from disk, that pause will happen on a thread in the UI system (bad, and analogous to web sites making web API calls), so UI latency can be a real problem. If the IO is small, or some resource system has preloaded it, then there isn't a detectable slowdown, but there are still plenty of silly periods of waiting. There is also a lot of time when some single threaded part of the game isn't using N-1 hardware threads and all that IO could have been asynchronous. But game UIs are often a frame behind the rest of the game simulation, and there is often detectable latency in the UI, like the mouse feeling like it drags.

Allowing IO to run in the background while active events are processed can reduce latency, and this is the default in web browsers. IO latency in web pages is worse than in games, and the other computation tends to be smaller than in games, so the event loop is close to ideal. A function is waiting? Throw it on the stack and grab something else to do! This means that all the work that can be done while waiting on IO gets done, and when done well this makes a UI snappy.
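Roughly, as a toy C++ sketch (nothing like a browser's real implementation, just the shape of the idea; the "slow read" is a stand-in for any disk or network request):

  #include <chrono>
  #include <functional>
  #include <future>
  #include <iostream>
  #include <queue>
  #include <string>
  #include <thread>

  // Stand-in for a slow disk or network read.
  std::string slow_read() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return "file contents";
  }

  int main() {
    std::queue<std::function<void()>> events;  // the event queue

    // Kick the IO off in the background instead of blocking the loop.
    std::future<std::string> io = std::async(std::launch::async, slow_read);

    // Meanwhile some UI-ish events are already queued up.
    for (int i = 0; i < 3; ++i)
      events.push([i] { std::cout << "handled event " << i << "\n"; });

    // The loop does whatever is ready; it never sits idle on the IO.
    while (!events.empty() || io.valid()) {
      if (!events.empty()) {
        events.front()();
        events.pop();
      } else if (io.wait_for(std::chrono::seconds(0)) ==
                 std::future_status::ready) {
        std::cout << "IO done: " << io.get() << "\n";  // get() invalidates io
      }
      // (A real loop would sleep instead of spinning while nothing is ready.)
    }
  }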

If that were available sensibly in games, it could allow a game designed appropriately to spread IO across multiple frames and be snappy without stutters. With games using the strategy I described above, latency in the game simulation or IO can cause the UI to feel sluggish, and vice versa. In games, caching UI details and trying to pump the frame rate is "good enough". If the UI is a frame behind but we have 200 frames per second, that isn't really a problem. But when it chugs and the mouse stops responding because the player built the whole game world out of dynamite and set it off, the game will not process the mouse until that 30 minutes of physics work is done.

There are better scheduling schemes for games. I am a big fan of doing "something smart", but that usually means scheduling heterogeneous work with nuanced dependencies, and I have written libraries just for that because it isn't actually that hard. But if you don't have the raw compute demands of a game, scheduling IO alongside your UI computation is often "fast enough" and is an easy enough mental model for JS devs to grok, leaving them the freedom to speed things up with their own solutions like caching schemes and reworking their UIs.




Isn't async/await "scheduling heterogeneous work with nuanced dependencies"? Or is that what you were implying?

Although my real guess is ECS, which is more like the "everyone gets every thread for a time" approach.


TL;DR: I hadn't meant it that way, but in web pages it really is enough. Web pages generally don't have computation time to worry about, mostly just IO. This simplifies scheduling, because whatever is coordinating the event loop in the browser (or other UI) can just background any amount of independent IO tasks. If there is computation contending over shared mutable state, something with internal knowledge needs to be involved, and that isn't current event loops, but in the long run it could be.

Sorry for the novel.

I meant those nuanced dependencies as a way of managing shared mutable state and complex computations that really do take serious CPU time. Let's make a simple example from a complex game design. This example is ridiculous, but it conveys the real nature of the problems with CPU and IO. Consider these pieces of work that might exist in a hypothetical game where NPCs and a player are moving around a simulated environment with some physics calculations, and where the physics simulation is the source of truth for the locations of items in the game. Here are the parts of that game:

Physics broad phase: Trivially parallelizable; depends on the previous frame. Produces "islands" of physics objects. Imagine two piles of stuff miles apart: they can't interact except with items in the same pile, so each island is just a pile of math to do. Perhaps in this game this takes 20 ms of CPU time; across the minimum target machine with 4 cores that is 5 ms apiece.

Physics narrow phase: Each physics island is single threaded but threadsafe from the others; depends on the broad phase to produce the islands. Each island takes an unknown and unequal amount of time, likely between 0 and 2 ms of pure math.

Graphics scene/render: Might have parallelizable scene graph culling; converts game state into a series of commands independent of any specific GPU API. Depends on all physics completing, because that is what it is drawing. Likely 1 or 2 ms per island.

Graphics draw calls: Single threaded; sends render results to the GPU using DirectX/OpenGL/Vulkan/Metal. This converts the independent render commands into API-specific commands. Likely less than 1 ms of actual CPU work, but a larger wait on the GPU because it is IO.

NPC AI: NPCs are independent but lightweight, so threading makes no sense if there are fewer than hundreds. Depends on physics to know what the NPCs are responding to. Wants to add forces to the physics sim next frame. For this game let's say there are many (I don't know, maybe this is Dynasty Warriors or something), so let's say 1~3 ms.

User input: Single threaded; wants to add forces to the physics sim next frame based on user commands. Can't run at the same time as NPC AI because both want to mutate the physics state. Less than 1 ms.

We are ignoring: sound, network, environment, disk IO, OS events (window resize, etc.), UI beyond buttons and text positioning, and a few other things.

A first attempt at a real game would likely be coded to give all the threads to each piece of work, one at a time, in some hand-picked order, at least until this was demonstrated to be slow:

Physics Broad -> Physics Narrow -> Graphics render -> Graphics GPU -> NPC AI -> User input -> Wait/Next frame
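In code, that first attempt is essentially a chain of parallel loops with a barrier between phases. A rough sketch (every work function is a stand-in):

  #include <algorithm>
  #include <thread>
  #include <vector>

  // Run fn(0..count-1) across all hardware threads, then join.
  // The join is the barrier between phases.
  template <class Fn>
  void parallel_for(int count, Fn fn) {
    int workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
      pool.emplace_back([=] {
        for (int i = w; i < count; i += workers) fn(i);
      });
    for (auto& t : pool) t.join();  // everyone waits here
  }

  void frame(int island_count, int npc_count) {
    parallel_for(island_count, [](int i) { /* broad phase chunk i */ });
    parallel_for(island_count, [](int i) { /* narrow phase, island i */ });
    parallel_for(island_count, [](int i) { /* render island i */ });
    /* single threaded: GPU submission, including the ~5 ms wait */
    parallel_for(npc_count, [](int i) { /* NPC AI for NPC i */ });
    /* single threaded: user input */
  }

  int main() { frame(/*islands=*/8, /*npcs=*/500); }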

But that is likely slow, and I picked our hypothetical math to be slow and marginal. Sending stuff to the GPU is a high-latency activity; it might take 5 ms to respond, and if this is a 60 FPS game, then that is about 1/3 of our 16.7 ms frame budget. If we simply add up our hypothetical times, the total is frequently more than 16 ms, making the game slower than 60 FPS. Even an ideal frame with just a little physics is right at 15 to 16 ms. So a practical game studio might do other work while waiting on the GPU to respond:

Physics Broad -> Physics Narrow -> Graphics render ->

At the same time: { Graphics GPU calls (uses one thread) | NPC AI (uses all but one thread) -> User input } ->

Wait/Next frame

Most of the time something like this is "fast enough". In this example, that 5 ms wait on the GPU now runs alongside all that NPC AI, so we only need to add the larger of the two. If this takes only a few days of engineer time and keeps the game under 16 ms on most machines, then maybe the team makes a business decision to raise the minimum specs just a bit (going from 4 to 6 cores would reduce physics time by another ms) and now they can ship the game. There are still needless waits, and from a pure GigaFLOPS perspective much weaker computers could perhaps do the work, but there is so much waiting that it isn't practical. Still, this compromise gets all target machines to just about 60 FPS.
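The overlap itself is simple: put the GPU submission on its own thread and join it after the AI. A hedged sketch, with sleeps standing in for the hypothetical work:

  #include <algorithm>
  #include <chrono>
  #include <thread>
  #include <vector>

  // Stand-ins for the hypothetical frame's work.
  void submit_draw_calls() {  // ~1 ms of CPU, then ~5 ms waiting on the GPU
    std::this_thread::sleep_for(std::chrono::milliseconds(6));
  }
  void update_npc(int) {      // cheap per-NPC AI
    std::this_thread::sleep_for(std::chrono::microseconds(10));
  }

  void tail_of_frame(int npc_count) {
    // One thread eats the high-latency GPU submission...
    std::thread gpu(submit_draw_calls);

    // ...while the remaining threads run NPC AI at the same time.
    int workers = std::max(2u, std::thread::hardware_concurrency()) - 1;
    std::vector<std::thread> ai;
    for (int w = 0; w < workers; ++w)
      ai.emplace_back([=] {
        for (int i = w; i < npc_count; i += workers) update_npc(i);
      });

    for (auto& t : ai) t.join();
    gpu.join();
    // The frame now pays max(GPU wait, NPC AI) instead of their sum.
  }

  int main() { tail_of_frame(500); }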

Alternatively, if the game is smart enough to create new threads of work for each physics island (actually super complex and not a great idea in a real game, but this is all hypothetical, and there are similar wins to be had in real games) and manage dependencies carefully based on the simulation state, then something more detailed might be possible:

1. Physics broad phase: create a known number of physics islands.

2. Start a paused GPU thread waiting on that known number of physics islands to finish rendering. This will start step 5 as soon as the last step 4c completes.

3. Add the player's input work to the appropriate group of NPCs.

4. Each physics island gets a thread that does the following: a. physics narrow phase for this island, b. partial render for just this island, c. set a threadsafe flag that this island is done, d. process NPC AI for NPCs near this physics island, e. if this is the island with the player, process their input.

5. The GPU thread waits for all physics island threads to get to step 4c, then starts sending commands to the GPU! And step 4d gets to keep running.

6. When all threads from steps 4 and 5 complete, pause all game threads to hit the desired frame rate (save battery life for mobile gamers!), or advance to the next frame if past the frame budget or the framerate is uncapped.
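A hedged sketch of steps 2 through 6, with stand-in work functions, an atomic counter for the "threadsafe flag", and a condition variable for the paused GPU thread:

  #include <atomic>
  #include <condition_variable>
  #include <mutex>
  #include <thread>
  #include <vector>

  // All of these are stand-ins for the hypothetical game's real code.
  void narrow_phase(int) {}    // 4a
  void render_island(int) {}   // 4b
  void npc_ai_near(int) {}     // 4d
  void submit_draw_calls() {}  // step 5

  std::atomic<int> islands_rendered{0};  // reset each frame in a real engine
  std::mutex m;
  std::condition_variable all_rendered;

  void island_thread(int island, int island_count) {
    narrow_phase(island);
    render_island(island);
    // 4c: the last island to finish rendering wakes the GPU thread.
    if (islands_rendered.fetch_add(1) + 1 == island_count) {
      std::lock_guard<std::mutex> lk(m);  // avoids a missed wakeup
      all_rendered.notify_one();
    }
    npc_ai_near(island);  // 4d keeps running while the GPU thread submits
  }

  void gpu_thread_fn(int island_count) {
    std::unique_lock<std::mutex> lk(m);  // step 2: the paused GPU thread
    all_rendered.wait(lk,
        [&] { return islands_rendered.load() == island_count; });
    submit_draw_calls();                 // step 5
  }

  int main() {
    const int island_count = 4;  // known after the broad phase (step 1)
    std::thread gpu(gpu_thread_fn, island_count);
    std::vector<std::thread> islands;
    for (int i = 0; i < island_count; ++i)
      islands.emplace_back(island_thread, i, island_count);
    for (auto& t : islands) t.join();
    gpu.join();  // step 6: everyone done; wait/advance to the next frame
  }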

This moves all the waits to the end of each thread's frame. That means a bunch of nice things. The last thread still working can likely turbo boost, a feature of most CPUs where one core clocks up if it is the only one busy. If the NPCs ever took longer than the GPU, they still might complete earlier because they get started earlier. If there are more islands than hardware threads, this likely results in better utilization because there are no early pauses.

This would likely take a ton of engineering time. It might move the frame time down a few more ms and maybe let them lower the minimum requirements, perhaps even letting the game run on an older console if market conditions support that. Conceptually it might be something that could be done with async/await, but I don't think that is how most game engines are designed. I also think it makes dependencies implicit and scattered through the code, but that could likely be avoided with careful design.

I am a big fan of libraries that let you provide work units, or functors, and say which depend on each other. They all get to read/write the global state, but with the dependencies declared there won't be race conditions. Such libraries locate the threading logic in one place. Then, if there is some particularly contentious state that many things need to touch, it can be wrapped in a mutex.
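A rough sketch of the shape of such a library (the API is invented, and it runs serially for brevity; the real win is running independent tasks on worker threads):

  #include <functional>
  #include <string>
  #include <unordered_map>
  #include <vector>

  struct TaskGraph {
    struct Task {
      std::function<void()> fn;
      std::vector<std::string> deps;
    };
    std::unordered_map<std::string, Task> tasks;

    void add(const std::string& name, std::function<void()> fn,
             std::vector<std::string> deps = {}) {
      tasks[name] = Task{std::move(fn), std::move(deps)};
    }

    // Run everything in dependency order (assumes every dependency is
    // registered and there are no cycles).
    void run() {
      std::unordered_map<std::string, bool> visited;
      std::function<void(const std::string&)> visit =
          [&](const std::string& n) {
            if (visited[n]) return;
            visited[n] = true;
            for (const auto& d : tasks.at(n).deps) visit(d);  // deps first
            tasks.at(n).fn();
          };
      for (const auto& kv : tasks) visit(kv.first);
    }
  };

  int main() {
    TaskGraph g;
    g.add("broad",  [] { /* physics broad phase */ });
    g.add("narrow", [] { /* narrow phase */ }, {"broad"});
    g.add("render", [] { /* build draw commands */ }, {"narrow"});
    g.add("npc_ai", [] { /* NPC AI */ }, {"narrow"});
    g.run();  // the declared dependencies guarantee the ordering
  }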

I suppose this might just be the iterative vs recursive discussion applied to threading strategies. It just happens that most event loops are single threaded; there is no reason they need to be in the long run. In the long run I could see making that fastest scenario happen in either paradigm, even though the code would look completely different.


Disclaimer: I work in gamedev. I think what people do in gamedev with tasks/jobs (different people call it different things) and colorless async with functions that may yield at any time are different. Yielding on IO means you cannot meet a deadline (frame time); not on current hardware, which has no IO deadlines. Which means, to me, that there is no way we can share library code between the async web and the realtime part of a game. Of course, games have background best-effort computations that can call web-like code, and it is fine that those run for an unknown amount of time.


Doesn't it mean you can meet the deadline, but you cannot guarantee that your new textures will be loaded/the TLS handshake with the login server will be completed/etc. before the deadline happens?


Texture loading and TLS cannot meet a deadline for sure, because we rely on APIs that do not support deadlines. They can only be best-effort/background code.

The difference, I believe, is between updating each UI widget and doing something in the case of a still-missing texture, versus yielding on the texture in some place in the UI code and never touching the rest of the UI that frame.
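In code, the first approach looks something like this (all invented names, just the control flow: each widget degrades gracefully instead of stalling the rest of the UI):

  #include <optional>
  #include <vector>

  struct Texture { int id = 0; };
  struct Widget { int asset = 0; };

  std::optional<Texture> try_get_texture(int /*asset*/) {
    return std::nullopt;  // stand-in: pretend the load hasn't finished yet
  }
  void draw_texture(const Texture&) {}
  void draw_placeholder() {}

  void draw_ui(const std::vector<Widget>& widgets) {
    for (const Widget& w : widgets) {
      if (auto tex = try_get_texture(w.asset))
        draw_texture(*tex);  // the texture arrived on some earlier frame
      else
        draw_placeholder();  // still loading: draw something, don't wait
    }
  }

  int main() { draw_ui(std::vector<Widget>(3)); }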


I've always felt this is fine, as long as there are API calls to preload. Then on one screen you start preloading the next screens while the user is navigating your menus, to hide all this latency as much as possible.
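A minimal sketch of that idea, with invented names and std::async standing in for whatever background loader the engine provides:

  #include <future>
  #include <map>
  #include <string>
  #include <vector>

  std::string load_screen_assets(const std::string& name) {
    return "assets for " + name;  // stand-in for the slow disk/network work
  }

  std::map<std::string, std::future<std::string>> preloads;

  void on_enter_screen(const std::vector<std::string>& reachable) {
    for (const auto& next : reachable)
      if (!preloads.count(next))  // start the IO before it is needed
        preloads[next] =
            std::async(std::launch::async, load_screen_assets, next);
  }

  std::string show_screen(const std::string& name) {
    auto it = preloads.find(name);
    if (it != preloads.end()) {
      std::string assets = it->second.get();  // usually already done
      preloads.erase(it);                     // a future can only be read once
      return assets;
    }
    return load_screen_assets(name);          // cold path: nothing preloaded
  }

  int main() {
    on_enter_screen({"options", "level_select"});  // user is on the main menu
    show_screen("options");                        // likely instant by now
  }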


It doesn't sound to me like the engines you dealt with use ECS, which is usually driven by a job system (your work units and functors), but correct me if I'm wrong.

The good job systems I've dealt with have their dependencies in the functors. So you "wait" on a job to finish, which is really a while loop that plucks and executes other jobs until the dependency job has finished. This kind of job system is nice to deal with, as it is generally low overhead, which means all threads (processes, really) are generally saturated with work at all times.
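Something like this, as a hedged sketch (Job and the queue are invented names; a real system would use lock-free per-worker queues with stealing):

  #include <atomic>
  #include <deque>
  #include <functional>
  #include <mutex>

  struct Job {
    std::function<void()> fn;
    std::atomic<bool> finished{false};
  };

  std::deque<Job*> job_queue;
  std::mutex queue_mutex;

  Job* try_pluck() {
    std::lock_guard<std::mutex> lk(queue_mutex);
    if (job_queue.empty()) return nullptr;
    Job* j = job_queue.front();
    job_queue.pop_front();
    return j;
  }

  // "Waiting" on a dependency drains other jobs instead of blocking.
  void wait_for(Job& dependency) {
    while (!dependency.finished.load()) {
      if (Job* other = try_pluck()) {  // do useful work instead of sleeping
        other->fn();
        other->finished.store(true);
      }
      // else: a real system would back off or steal from other workers
    }
  }

  int main() {
    Job a{[] { /* some work */ }};
    Job b{[] { /* other work */ }};
    {
      std::lock_guard<std::mutex> lk(queue_mutex);
      job_queue.push_back(&a);
      job_queue.push_back(&b);
    }
    wait_for(a);  // this thread ends up executing a itself while "waiting"
  }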

I don't really remember any global state with contention, because that's generally very, very slow, but maybe there were bits of our gameplay code I'm not aware of.


The ECS concerns don't really relate to threading concerns.

I have worked with and without ECS, both with and without good threading models. ECS writes do create possible issues if write locks need to be acquired, but that isn't usually such a big deal.


In the "you're still going to have to wait for something" sense, sure. But the reason ECS exists is because the industry had to change our architecture when we moved to many core CPUs to take advantage.

I'm battling to understand what you want then, sorry. The systems that you say you would like to use (discrete jobs with dependencies) are the kind of systems the industry has been using since the advent of data-oriented architecture, which includes ECS. That is, a job worker process per core plucking off work and doing it.

In the engines I've dealt with, we don't usually have write locks, instead preferring copies of "last frame data" and "next frame data". And all our "read locks" are waits for jobs. Our game code is generally single threaded, but the main loop pretty much just kicks off and waits for jobs.
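A sketch of that double buffering, with invented types (reads hit the previous frame's immutable snapshot, writes go to the copy nobody reads yet, so no locks are needed for disjoint index ranges):

  #include <cstddef>
  #include <utility>
  #include <vector>

  struct NpcState { float x = 0, y = 0; };
  struct FrameData { std::vector<NpcState> npcs; };

  FrameData last_frame{std::vector<NpcState>(100)};
  FrameData next_frame{std::vector<NpcState>(100)};

  // Any number of jobs may run this concurrently for disjoint ranges.
  void update_npc_range(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
      const NpcState& prev = last_frame.npcs[i];  // read: old snapshot
      next_frame.npcs[i] = NpcState{prev.x + 1.0f, prev.y};  // stand-in "AI"
    }
  }

  void end_of_frame() {
    // Runs once all jobs have been waited on: publish the new frame.
    std::swap(last_frame, next_frame);
  }

  int main() {
    update_npc_range(0, 50);    // in a real engine these two calls would be
    update_npc_range(50, 100);  // jobs running on different worker threads
    end_of_frame();
  }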

I guess what is a good threading model to you?

(As a side note I've worked on projects that use ECS on a single core and they still confer benefits there even though that's not what they were invented for)





