Chrome is all-in on strict process isolation as their Spectre defense. Given the web platform that they need to support, they don't really have another choice -- precise timers and threads, as well as many other ad hoc timing mechanisms, were already part of the platform before Spectre hit. Using a separate process for everything is pretty costly, though, to the point that they have to make some compromises which are open to attack: https://www.spookjs.com/
In Cloudflare Workers, process isolation for every worker isn't practical -- if we had to do that then we'd only be able to offer real "edge compute" to a small number of big enterprises with deep pockets, rather than at prices affordable to everyone. Our fundamental difficulty is that we need a single machine to be able to support thousands of running applications where each app may only get a small trickle of traffic, so that we can deploy every application to every one of our edge locations. It's like if you always had 10,000 tabs open in Chrome.
Instead our strategy is to stack a lot of different mitigations that make attacks slower, until the point where they are too slow to be remotely useful[0]. One part of the strategy is a novel defense we designed with researchers at TU Graz called Dynamic Process Isolation[1], in which we use hardware performance counters to detect execution patterns indicative of an attack and move those workers into private processes. For that strategy to work, though, we first need to slow down attacks enough to give ourselves a chance to detect the patterns -- and that requires disallowing precise timers or anything that could be used as a precise timer, such as multiple threads with shared memory. Luckily, we were thinking about timing attacks from the very start of the project (even before Spectre was known), so we were able to avoid ever putting these into the platform in the first place.
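To make the "anything that could be used as a precise timer" point concrete, here is a minimal sketch of why a second thread with shared memory counts as a clock. It is only an illustration, not anything from the Workers runtime: the names CounterClock and measure are made up for this example, time.sleep stands in for whatever operation is being timed, and under CPython's GIL the resolution is far too coarse for a real attack. The principle is the same one that applies to JavaScript threads sharing memory: one thread increments a counter in a tight loop, and another reads it before and after an operation, without ever calling a clock API.

```python
import threading
import time


class CounterClock:
    """Hypothetical 'clock' built from a thread that increments shared memory."""

    def __init__(self) -> None:
        self.ticks = 0
        self._running = False
        self._thread = threading.Thread(target=self._spin, daemon=True)

    def _spin(self) -> None:
        # Tight loop: the counter's value is roughly proportional to elapsed time.
        while self._running:
            self.ticks += 1

    def start(self) -> None:
        self._running = True
        self._thread.start()

    def stop(self) -> None:
        self._running = False
        self._thread.join()


def measure(clock: CounterClock, operation) -> int:
    """Return how many ticks elapsed while `operation` ran."""
    before = clock.ticks
    operation()
    return clock.ticks - before


clock = CounterClock()
clock.start()
# time.sleep is only a stand-in for the operation being timed; the
# measurement itself reads nothing but the shared counter.
short_ticks = measure(clock, lambda: time.sleep(0.01))
long_ticks = measure(clock, lambda: time.sleep(0.1))
clock.stop()

print(f"short operation: {short_ticks} ticks, long operation: {long_ticks} ticks")
```

The longer operation shows up as a larger tick count even though no timer API was used for the measurement, which is why removing precise timers alone is not enough if shared-memory threads remain available.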
In general I think threads are not as important on servers because distributing work across machines is more powerful anyway. We're trying to create a platform where that is really easy.
Thanks for the detailed explanation, makes complete sense.
I suppose that means projects such as python-wasm [0] that are porting other language runtimes to WASM will have to go the Asyncify route for CF Workers, with all the overhead that entails?
From the looks of it they don't think asyncifying Python will be possible in the near to medium term [1].
If this becomes a big enough problem, we might be able to build a work-around into the Cloudflare Workers Runtime itself so that it can support synchronous waits. It'll take some hacks but I think it's doable, without actually supporting threads.
I'd probably use fibers, like node-fibers used to do. The problem is V8 stopped supporting fibers a few versions ago. We'd have to hack support back in. That's something we probably could get done but I'm not exactly excited about it. :)
> we'd only be able to offer real "edge compute" to a small number of big enterprises with deep pockets, rather than at prices affordable to everyone
What a nice way to formulate the tradeoffs between cost and security. When Workers came out with huge headlines about performance and cost, I was very disappointed to find out there was no special technical wizardry behind that, just conscious trade-offs in disregard of customer data security. Workers simply skip all the sandboxing steps other providers take to implement a multi-tenant application runtime in a secure manner. So far the mitigations in place seem to make the possibility of Cloudflare customers getting bitten by a V8 vulnerability unlikely instead of impossible.
Sure, the platform has interesting ideas and I'm looking forward to trying it out as a full-stack serverless platform. I just cannot foresee running anything serious on it before they come up with a more convincing security story.
Other providers run attacker-provided native code directly on hardware, deeply relying on bug-free silicon for their whole security model to work. I honestly think that's far more precarious than what Workers is doing.
[0] https://blog.cloudflare.com/mitigating-spectre-and-other-sec...
[1] https://blog.cloudflare.com/spectre-research-with-tu-graz/