
I think the async/await pattern solves one problem really well: UI latency.

My UI background is web testing and C++ game UIs. There are roughly 3 patterns in multithreaded games: thread per responsibility (old/bad), barriers that block all threads and give each game system all the threads in turn, or something smart. Few games do something smart, and thread per responsibility is not ideal, so we will ignore it for now.

In games, pausing everything and letting the physics system have all the threads for a few milliseconds is often "fast enough". Then the graphics system gets all the threads, and so on, so that eventually every system gets all the threads even though each one rarely needs them all. Sometimes two systems are both close to single threaded and have no data contention, so they might share the threads, but this is almost always a decision made manually by experts.

This means that once the UI (the buttons, the text, cursors, status bars, etc.) gets its turn, there won't be any race conditions (good), but if it needs to request something from disk, that pause will happen on a thread in the UI system (bad, and analogous to web sites making web API calls), so UI latency can be a real problem. If the IO is small, or some resource system has preloaded it, then there isn't a detectable slowdown, but there are still plenty of silly periods of waiting. There is also a lot of time when some single threaded part of the game isn't using N-1 hardware threads and all that IO could have been asynchronous. But game UIs are often a frame behind the rest of the game simulation, and there is often detectable latency in the UI, like the mouse feeling like it drags.

Allowing IO to run in the background while active events are processed can reduce latency, and this is the default in web browsers. IO latency in web pages is worse than in games, and the other computation tends to be smaller than in games, so the event loop is close to ideal. A function is waiting? Throw it on the stack and grab something else to do! This means that all the work that can be done while waiting on IO gets done, and when done well this makes a UI snappy.
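Roughly, as a toy C++ sketch (nothing like a browser's real implementation, just the shape of the idea; the "slow read" is a stand-in for any disk or network request):

  #include <chrono>
  #include <functional>
  #include <future>
  #include <iostream>
  #include <queue>
  #include <string>
  #include <thread>

  // Stand-in for a slow disk or network read.
  std::string slow_read() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return "file contents";
  }

  int main() {
    std::queue<std::function<void()>> events;  // the event queue

    // Kick the IO off in the background instead of blocking the loop.
    std::future<std::string> io = std::async(std::launch::async, slow_read);

    // Meanwhile some UI-ish events are already queued up.
    for (int i = 0; i < 3; ++i)
      events.push([i] { std::cout << "handled event " << i << "\n"; });

    // The loop does whatever is ready; it never sits idle on the IO.
    while (!events.empty() || io.valid()) {
      if (!events.empty()) {
        events.front()();
        events.pop();
      } else if (io.wait_for(std::chrono::seconds(0)) ==
                 std::future_status::ready) {
        std::cout << "IO done: " << io.get() << "\n";  // get() invalidates io
      }
      // (A real loop would sleep instead of spinning while nothing is ready.)
    }
  }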

If that were available sensibly in games, it could allow a game designed appropriately to spread IO across multiple frames and be snappy without stutters. With games using the strategy I described above, latency in the game simulation or IO can cause the UI to feel sluggish, and vice versa. In games, caching UI details and trying to pump the frame rate is "good enough". If the UI is a frame behind but we have 200 frames per second, that isn't really a problem. But when it chugs and the mouse stops responding because the player built the whole game world out of dynamite and set it off, the game will not process the mouse until that 30 minutes of physics work is done.

There are better scheduling schemes for games. I am a big fan of doing "something smart", but that usually means scheduling heterogeneous work with nuanced dependencies, and I have written libraries just for that because it isn't actually that hard. But if you don't have the raw compute demands of a game, scheduling IO alongside your UI computation is often "fast enough" and is an easy enough mental model for JS devs to grok, leaving them the freedom to speed things up with their own solutions like caching schemes and reworking their UIs.




Isn't async/await "scheduling heterogeneous work with nuanced dependencies"? Or is that what you were implying?

Although my real guess is ECS, which is more like the "everyone gets every thread for a time" approach.


TL;DR: I hadn't meant it that way, but in web pages it really is enough. Web pages generally don't have computation time to worry about, mostly just IO. This simplifies scheduling, because whatever is coordinating the event loop in the browser (or other UI) can just background any amount of independent IO tasks. If there is computation contending over shared mutable state, something with internal knowledge needs to be involved, and that isn't current event loops, but in the long run it could be.

Sorry for the novel.

I meant those nuanced dependencies as a way of managing shared mutable state and complex computations that really do take serious CPU time. Let's make a simple example from a complex game design. This example is ridiculous, but it conveys the real nature of the problems with CPU and IO. Consider these pieces of work that might exist in a hypothetical game where NPCs and a player are moving around a simulated environment with some physics calculations, and where the physics simulation is the source of truth for the locations of items in the game. Here are the parts of that game:

Physics broad phase: Trivially parallelizable; depends on the previous frame. Produces "islands" of physics objects. Imagine two piles of stuff miles apart: they can't interact except with items in the same pile, so each island is just a pile of math to do. Perhaps in this game this takes 20 ms of CPU time; across the minimum target machine with 4 cores that is 5 ms apiece.

Physics narrow phase: Each physics island is single threaded but threadsafe from the others; depends on the broad phase to produce the islands. Each island takes an unknown and unequal amount of time, likely between 0 and 2 ms of pure math.

Graphics scene/render: Might have parallelizable scene graph culling; converts game state into a series of commands independent of any specific GPU API. Depends on all physics completing, because that is what it is drawing. Likely 1 or 2 ms per island.

Graphics draw calls: Single threaded; sends render results to the GPU using DirectX/OpenGL/Vulkan/Metal. This converts the independent render commands into API-specific commands. Likely less than 1 ms of actual CPU work, but a larger wait on the GPU because it is IO.

NPC AI: NPCs are independent but lightweight, so threading makes no sense if there are fewer than hundreds. Depends on physics to know what the NPCs are responding to. Wants to add forces to the physics sim next frame. For this game let's say there are many (I don't know, maybe this is Dynasty Warriors or something), so let's say 1~3 ms.

User input: Single threaded; wants to add forces to the physics sim next frame based on user commands. Can't run at the same time as NPC AI because both want to mutate the physics state. Less than 1 ms.

We are ignoring: sound, network, environment, disk IO, OS events (window resize, etc.), UI beyond buttons and text positioning, and a few other things.

A first attempt at a real game would likely be coded to give all the threads to each piece of work, one at a time, in some hand-picked order, at least until this was demonstrated to be slow:

Physics Broad -> Physics Narrow -> Graphics render -> Graphics GPU -> NPC AI -> User input -> Wait/Next frame
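In code, that first attempt is essentially a chain of parallel loops with a barrier between phases. A rough sketch (every work function is a stand-in):

  #include <algorithm>
  #include <thread>
  #include <vector>

  // Run fn(0..count-1) across all hardware threads, then join.
  // The join is the barrier between phases.
  template <class Fn>
  void parallel_for(int count, Fn fn) {
    int workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
      pool.emplace_back([=] {
        for (int i = w; i < count; i += workers) fn(i);
      });
    for (auto& t : pool) t.join();  // everyone waits here
  }

  void frame(int island_count, int npc_count) {
    parallel_for(island_count, [](int i) { /* broad phase chunk i */ });
    parallel_for(island_count, [](int i) { /* narrow phase, island i */ });
    parallel_for(island_count, [](int i) { /* render island i */ });
    /* single threaded: GPU submission, including the ~5 ms wait */
    parallel_for(npc_count, [](int i) { /* NPC AI for NPC i */ });
    /* single threaded: user input */
  }

  int main() { frame(/*islands=*/8, /*npcs=*/500); }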

But that is likely slow, and I picked our hypothetical math to be slow and marginal. Sending stuff to the GPU is a high-latency activity; it might take 5 ms to respond, and if this is a 60 FPS game, then that is about 1/3 of our 16.7 ms frame budget. If we simply add up our hypothetical times, the total is frequently more than 16 ms, making the game slower than 60 FPS. Even an ideal frame with just a little physics is right at 15 to 16 ms. So a practical game studio might do other work while waiting on the GPU to respond:

Physics Broad -> Physics Narrow -> Graphics render ->

At the same time: { Graphics GPU calls (uses one thread) | NPC AI (uses all but one thread) -> User input } ->

Wait/Next frame

Most of the time something like this is "fast enough". In this example, that 5 ms wait on the GPU now runs alongside all that NPC AI, so we only need to add the larger of the two. If this takes only a few days of engineer time and keeps the game under 16 ms on most machines, then maybe the team makes a business decision to raise the minimum specs just a bit (going from 4 to 6 cores would reduce physics time by another ms) and now they can ship the game. There are still needless waits, and from a pure GigaFLOPS perspective much weaker computers could perhaps do the work, but there is so much waiting that it isn't practical. Still, this compromise gets all target machines to just about 60 FPS.
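The overlap itself is simple: put the GPU submission on its own thread and join it after the AI. A hedged sketch, with sleeps standing in for the hypothetical work:

  #include <algorithm>
  #include <chrono>
  #include <thread>
  #include <vector>

  // Stand-ins for the hypothetical frame's work.
  void submit_draw_calls() {  // ~1 ms of CPU, then ~5 ms waiting on the GPU
    std::this_thread::sleep_for(std::chrono::milliseconds(6));
  }
  void update_npc(int) {      // cheap per-NPC AI
    std::this_thread::sleep_for(std::chrono::microseconds(10));
  }

  void tail_of_frame(int npc_count) {
    // One thread eats the high-latency GPU submission...
    std::thread gpu(submit_draw_calls);

    // ...while the remaining threads run NPC AI at the same time.
    int workers = std::max(2u, std::thread::hardware_concurrency()) - 1;
    std::vector<std::thread> ai;
    for (int w = 0; w < workers; ++w)
      ai.emplace_back([=] {
        for (int i = w; i < npc_count; i += workers) update_npc(i);
      });

    for (auto& t : ai) t.join();
    gpu.join();
    // The frame now pays max(GPU wait, NPC AI) instead of their sum.
  }

  int main() { tail_of_frame(500); }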

Alternatively, if the game is smart enough to create new threads of work for each physics island (actually super complex and not a great idea in a real game, but this is all hypothetical, and there are similar wins to be had in real games) and manage dependencies carefully based on the simulation state, then something more detailed might be possible:

1. Physics broad phase: create a known number of physics islands.

2. Start a paused GPU thread waiting on that known number of physics islands to finish rendering. This will start step 5 as soon as the last step 4c completes.

3. Add the player's input work to the appropriate group of NPCs.

4. Each physics island gets a thread that does the following: a. physics narrow phase for this island, b. partial render for just this island, c. set a threadsafe flag that this island is done, d. process NPC AI for NPCs near this physics island, e. if this is the island with the player, process their input.

5. The GPU thread waits for all physics island threads to get to step 4c, then starts sending commands to the GPU! And step 4d gets to keep running.

6. When all threads from steps 4 and 5 complete, pause all game threads to hit the desired frame rate (save battery life for mobile gamers!), or advance to the next frame if past the frame budget or the framerate is uncapped.
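A hedged sketch of steps 2 through 6, with stand-in work functions, an atomic counter for the "threadsafe flag", and a condition variable for the paused GPU thread:

  #include <atomic>
  #include <condition_variable>
  #include <mutex>
  #include <thread>
  #include <vector>

  // All of these are stand-ins for the hypothetical game's real code.
  void narrow_phase(int) {}    // 4a
  void render_island(int) {}   // 4b
  void npc_ai_near(int) {}     // 4d
  void submit_draw_calls() {}  // step 5

  std::atomic<int> islands_rendered{0};  // reset each frame in a real engine
  std::mutex m;
  std::condition_variable all_rendered;

  void island_thread(int island, int island_count) {
    narrow_phase(island);
    render_island(island);
    // 4c: the last island to finish rendering wakes the GPU thread.
    if (islands_rendered.fetch_add(1) + 1 == island_count) {
      std::lock_guard<std::mutex> lk(m);  // avoids a missed wakeup
      all_rendered.notify_one();
    }
    npc_ai_near(island);  // 4d keeps running while the GPU thread submits
  }

  void gpu_thread_fn(int island_count) {
    std::unique_lock<std::mutex> lk(m);  // step 2: the paused GPU thread
    all_rendered.wait(lk,
        [&] { return islands_rendered.load() == island_count; });
    submit_draw_calls();                 // step 5
  }

  int main() {
    const int island_count = 4;  // known after the broad phase (step 1)
    std::thread gpu(gpu_thread_fn, island_count);
    std::vector<std::thread> islands;
    for (int i = 0; i < island_count; ++i)
      islands.emplace_back(island_thread, i, island_count);
    for (auto& t : islands) t.join();
    gpu.join();  // step 6: everyone done; wait/advance to the next frame
  }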

This moves all the waits to the end of each thread's frame. That means a bunch of nice things. The last thread still working can likely turbo boost, a feature of most CPUs where one core clocks up if it is the only one busy. If the NPCs ever took longer than the GPU, they still might complete earlier because they get started earlier. If there are more islands than hardware threads, this likely results in better utilization because there are no early pauses.

This would likely take a ton of engineering time. It might move the frame time down a few more ms and maybe let them lower the minimum requirements, perhaps even letting the game run on an older console if market conditions support that. Conceptually it might be something that could be done with async/await, but I don't think that is how most game engines are designed. I also think it makes dependencies implicit and scattered through the code, but that could likely be avoided with careful design.

I am a big fan of libraries that let you provide work units, or functors, and say which depend on each other. They all get to read/write the global state, but with the dependencies declared there won't be race conditions. Such libraries locate the threading logic in one place. Then, if there is some particularly contentious state that many things need to touch, it can be wrapped in a mutex.
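A rough sketch of the shape of such a library (the API is invented, and it runs serially for brevity; the real win is running independent tasks on worker threads):

  #include <functional>
  #include <string>
  #include <unordered_map>
  #include <vector>

  struct TaskGraph {
    struct Task {
      std::function<void()> fn;
      std::vector<std::string> deps;
    };
    std::unordered_map<std::string, Task> tasks;

    void add(const std::string& name, std::function<void()> fn,
             std::vector<std::string> deps = {}) {
      tasks[name] = Task{std::move(fn), std::move(deps)};
    }

    // Run everything in dependency order (assumes every dependency is
    // registered and there are no cycles).
    void run() {
      std::unordered_map<std::string, bool> visited;
      std::function<void(const std::string&)> visit =
          [&](const std::string& n) {
            if (visited[n]) return;
            visited[n] = true;
            for (const auto& d : tasks.at(n).deps) visit(d);  // deps first
            tasks.at(n).fn();
          };
      for (const auto& kv : tasks) visit(kv.first);
    }
  };

  int main() {
    TaskGraph g;
    g.add("broad",  [] { /* physics broad phase */ });
    g.add("narrow", [] { /* narrow phase */ }, {"broad"});
    g.add("render", [] { /* build draw commands */ }, {"narrow"});
    g.add("npc_ai", [] { /* NPC AI */ }, {"narrow"});
    g.run();  // the declared dependencies guarantee the ordering
  }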

I suppose this might just be the iterative vs recursive discussion applied to threading strategies. It just happens that most event loops are single threaded; there is no reason they need to be in the long run. In the long run I could see making that fastest scenario happen in either paradigm, even though the code would look completely different.


Disclaimer: I work in gamedev. I think what people do in gamedev with tasks/jobs (different people call it different things) and colorless async with functions that may yield at any time are different. Yielding on IO means you cannot meet a deadline (frame time); not on current hardware, which has no IO deadlines. Which means, to me, that there is no way we can share library code between the async web and the realtime part of a game. Of course, games have background best-effort computations that can call web-like code, and it is fine that those run for an unknown amount of time.


Doesn't it mean you can meet the deadline, but you cannot guarantee that your new textures will be loaded/the TLS handshake with the login server will be completed/etc. before the deadline happens?


Texture loading and TLS cannot meet a deadline for sure, because we rely on APIs that do not support deadlines. They can only be best-effort/background code.

The difference, I believe, is between updating each UI widget and doing something in the case of a still-missing texture, versus yielding on the texture in some place in the UI code and never touching the rest of the UI that frame.
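In code, the first approach looks something like this (all invented names, just the control flow: each widget degrades gracefully instead of stalling the rest of the UI):

  #include <optional>
  #include <vector>

  struct Texture { int id = 0; };
  struct Widget { int asset = 0; };

  std::optional<Texture> try_get_texture(int /*asset*/) {
    return std::nullopt;  // stand-in: pretend the load hasn't finished yet
  }
  void draw_texture(const Texture&) {}
  void draw_placeholder() {}

  void draw_ui(const std::vector<Widget>& widgets) {
    for (const Widget& w : widgets) {
      if (auto tex = try_get_texture(w.asset))
        draw_texture(*tex);  // the texture arrived on some earlier frame
      else
        draw_placeholder();  // still loading: draw something, don't wait
    }
  }

  int main() { draw_ui(std::vector<Widget>(3)); }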


I've always felt this is fine, as long as there are API calls to preload. Then on one screen you start preloading the next screens while the user is navigating your menus, to hide all this latency as much as possible.
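A minimal sketch of that idea, with invented names and std::async standing in for whatever background loader the engine provides:

  #include <future>
  #include <map>
  #include <string>
  #include <vector>

  std::string load_screen_assets(const std::string& name) {
    return "assets for " + name;  // stand-in for the slow disk/network work
  }

  std::map<std::string, std::future<std::string>> preloads;

  void on_enter_screen(const std::vector<std::string>& reachable) {
    for (const auto& next : reachable)
      if (!preloads.count(next))  // start the IO before it is needed
        preloads[next] =
            std::async(std::launch::async, load_screen_assets, next);
  }

  std::string show_screen(const std::string& name) {
    auto it = preloads.find(name);
    if (it != preloads.end()) {
      std::string assets = it->second.get();  // usually already done
      preloads.erase(it);                     // a future can only be read once
      return assets;
    }
    return load_screen_assets(name);          // cold path: nothing preloaded
  }

  int main() {
    on_enter_screen({"options", "level_select"});  // user is on the main menu
    show_screen("options");                        // likely instant by now
  }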


It doesn't sound to me like the engines you dealt with use ECS, which is usually driven by a job system (your work units and functors), but correct me if I'm wrong.

The good job systems I've dealt with have their dependencies in the functors. So you "wait" on a job to finish, which is really a while loop that plucks and executes other jobs until the dependency job has finished. This kind of job system is nice to deal with, as it is generally low overhead, which means all threads (processes, really) are generally saturated with work at all times.
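Something like this, as a hedged sketch (Job and the queue are invented names; a real system would use lock-free per-worker queues with stealing):

  #include <atomic>
  #include <deque>
  #include <functional>
  #include <mutex>

  struct Job {
    std::function<void()> fn;
    std::atomic<bool> finished{false};
  };

  std::deque<Job*> job_queue;
  std::mutex queue_mutex;

  Job* try_pluck() {
    std::lock_guard<std::mutex> lk(queue_mutex);
    if (job_queue.empty()) return nullptr;
    Job* j = job_queue.front();
    job_queue.pop_front();
    return j;
  }

  // "Waiting" on a dependency drains other jobs instead of blocking.
  void wait_for(Job& dependency) {
    while (!dependency.finished.load()) {
      if (Job* other = try_pluck()) {  // do useful work instead of sleeping
        other->fn();
        other->finished.store(true);
      }
      // else: a real system would back off or steal from other workers
    }
  }

  int main() {
    Job a{[] { /* some work */ }};
    Job b{[] { /* other work */ }};
    {
      std::lock_guard<std::mutex> lk(queue_mutex);
      job_queue.push_back(&a);
      job_queue.push_back(&b);
    }
    wait_for(a);  // this thread ends up executing a itself while "waiting"
  }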

I don't really remember any global state with contention, because that's generally very, very slow, but maybe there were bits of our gameplay code I'm not aware of.


The ECS concerns don't really relate to threading concerns.

I have worked with and without ECS, both with and without good threading models. ECS writes do create possible issues if write locks need to be acquired, but that isn't usually such a big deal.


In the "you're still going to have to wait for something" sense, sure. But the reason ECS exists is because the industry had to change our architecture when we moved to many core CPUs to take advantage.

I'm battling to understand what you want then, sorry. The systems that you say you would like to use (discrete jobs with dependencies) are the kind of systems the industry has been using since the advent of data-oriented architecture, which includes ECS. That is, a job worker process per core plucking off work and doing it.

In the engines I've dealt with, we don't usually have write locks, instead preferring copies of "last frame data" and "next frame data". And all our "read locks" are waits for jobs. Our game code is generally single threaded, but the main loop pretty much just kicks off and waits for jobs.
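A sketch of that double buffering, with invented types (reads hit the previous frame's immutable snapshot, writes go to the copy nobody reads yet, so no locks are needed for disjoint index ranges):

  #include <cstddef>
  #include <utility>
  #include <vector>

  struct NpcState { float x = 0, y = 0; };
  struct FrameData { std::vector<NpcState> npcs; };

  FrameData last_frame{std::vector<NpcState>(100)};
  FrameData next_frame{std::vector<NpcState>(100)};

  // Any number of jobs may run this concurrently for disjoint ranges.
  void update_npc_range(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
      const NpcState& prev = last_frame.npcs[i];  // read: old snapshot
      next_frame.npcs[i] = NpcState{prev.x + 1.0f, prev.y};  // stand-in "AI"
    }
  }

  void end_of_frame() {
    // Runs once all jobs have been waited on: publish the new frame.
    std::swap(last_frame, next_frame);
  }

  int main() {
    update_npc_range(0, 50);    // in a real engine these two calls would be
    update_npc_range(50, 100);  // jobs running on different worker threads
    end_of_frame();
  }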

I guess what is a good threading model to you?

(As a side note I've worked on projects that use ECS on a single core and they still confer benefits there even though that's not what they were invented for)





