Our container platform is in production. It has GPUs. Here's an early look (cloudflare.com)
199 points by jgrahamc 50 days ago | 74 comments



> To add GPU support, the Google team introduced nvproxy which works using the same principles as described above for syscalls: it intercepts ioctls destined to the GPU and proxies a subset to the GPU kernel module.
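
For intuition, here's a minimal sketch of that allowlist-and-forward idea (in Python, purely illustrative; gVisor's nvproxy is written in Go and maintains a much larger, driver-version-aware list, and the request number below is made up):

    import fcntl

    # Hypothetical allowlist of ioctl request numbers the proxy will forward;
    # anything not on it is rejected before it can reach the host kernel module.
    ALLOWED_REQUESTS = {0xC020462A}  # illustrative value only

    def proxy_ioctl(real_dev_fd, request, arg):
        if request not in ALLOWED_REQUESTS:
            raise PermissionError(f"ioctl {hex(request)} not permitted by proxy")
        # Allowed requests still hit the real GPU kernel module, which is why
        # the attack surface is reduced rather than eliminated.
        return fcntl.ioctl(real_dev_fd, request, arg)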

This does still expose the host's kernel to a potentially malicious workload, right?

If so, could this be mitigated by (continuously) running a QEMU VM with GPUs passed through via VFIO, and running whatever Workers need within that VM?

The Debian ROCm Team faces a similar challenge: we want to do CI [1] for our stack and all our dependent packages, but cannot rule out potentially hostile workloads. We spawn QEMU VMs per test (instead of the model described above), but that's because our tests must also be run against the relevant distribution's kernel and firmware.

Incidentally, I've been monitoring the Firecracker VFIO GitHub issue linked in the article. Upstream does not have a use case for this and thus no resources dedicated to implementing it, but there's a community meeting [2] coming up in October to discuss the future of this feature request.

[1]: https://ci.rocm.debian.net

[2]: https://github.com/firecracker-microvm/firecracker/issues/11...


I’ve been looking at distributed CI and for now I’m just going to be running workloads queued by the owner of the agent. That doesn’t eliminate hostile workloads but it does present a similar surface area to simply running the builds locally.

I’ve been thinking about QEMU or Firecracker instead of just containers for a more robust solution. I have some time before anyone would ask me about GPU workloads, but do you think Firecracker is on track to get there, or would I be better off learning QEMU?


Amazon/AWS has no use case for VFIO in Firecracker. They're open to the community adding support and have a community meeting soon, but I wouldn't get my hopes up.

QEMU can work -- I say can, because it doesn't work with all GPUs. And with consumer GPUs, VFIO is generally not an officially supported use case. We got it working, but with lots of trial and error, and there are still some problematic corner cases.
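
The basic invocation is roughly the sketch below (illustrative only, not our exact setup: the PCI address, image name, and sizing are placeholders, and it assumes the IOMMU is enabled and the GPU has already been unbound from its host driver and bound to vfio-pci):

    import subprocess

    GPU_PCI_ADDR = "0000:01:00.0"  # placeholder address of the passed-through GPU

    subprocess.run([
        "qemu-system-x86_64",
        "-enable-kvm",
        "-machine", "q35",
        "-cpu", "host",
        "-smp", "8",
        "-m", "16G",
        "-device", f"vfio-pci,host={GPU_PCI_ADDR}",
        "-drive", "file=guest.qcow2,format=qcow2,if=virtio",
        "-nographic",
    ], check=True)

Most of the fiddliness tends to be in the host-side prep (IOMMU groups, driver binding) rather than in the QEMU flags themselves.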


What would you say is the sort of time horizon for turnkey operation of one commonly available video card, half a dozen, and OEM cards in high end laptops (eg, MacBook Pro)? Years? Decades? Heat death?


I don't think I fully understand your question. If by turnkey operation you mean virtualization, enterprise GPUs already officially support it now, and it already works with consumer GPUs, at least the discrete ones.


If the calls first pass through a memory safe language, as gVisor does, isn’t the attack surface greatly reduced?

It does seem, however, that Firecracker + GPU support (or https://github.com/cloud-hypervisor/cloud-hypervisor) is the most promising route.

It’s surprising that AWS doesn’t have a need for GPU-backed Lambda to motivate them to bring GPUs to Firecracker.


> If the calls first pass through a memory safe language, as gVisor does, isn’t the attack surface greatly reduced?

The runtime may be memory safe, but I'm thinking of the GPU workloads which nvproxy seems to pass on to the device via the host's kernel. Say I find a security issue in the GPU's driver, and manage to exploit it with some malicious CUDA workload.


Would having a VM in between help in that case? It seems like protecting against malicious GPU workloads requires the GPU itself to offer virtualization to avoid this exploit.

This is helpful in explaining why AWS hasn't been excited to ship this use case in firecracker.


It would probably not stop all theoretically possible attacks, but it would stop many of them.

Say you find a bug in the GPU driver that lets you execute arbitrary code as root. That still all happens within the VM. To attack the host, you'd still need to break out of the VM, and if the VM is unprivileged (which I assume it is), you'd next need to gain privileges on the host.

There are other channels -- perhaps you can get the GPU to do something funky at the PCI level, perhaps you can get the GPU to crash the host -- but VM isolation does add a solid layer of protection.


I'm not familiar with this case's specifics, but AWS also has an approach of virtualizing actual hardware interfaces (like NVMe/PCIe) to the host through dedicated hardware/firmware. I wouldn't be surprised if their solution was to map (partitions of) physical devices as a “hardware” device on the host and pass it directly through to the Firecracker instances, especially if they can isolate multiple Firecracker/Lambda instances of a customer on a single physical device.


This is really cool and I can't wait to read all about it. Unfortunately, I've missed a month of blog posts because Cloudflare changed their blog's RSS URL without notice. If you change blogging platforms and can't implement a 301, please leave a post letting subscribers know where to find the new feed. RSS isn't dead!


We did? That's nuts if we did. What URL were you using?

EDIT: It looks like some people may have been using ghost.blog.cloudflare.com/rss because we used to use Ghost but the actual URL was/is blog.cloudflare.com/rss. We're setting up a redirect for anyone who was using the ghost. URL.


Hacker News is my favorite C-suite-level support forum for Cloudflare and Stripe.


Yes, it was the Ghost URL. Thank you for correcting it! I read just about every post, so I have a lot of catching up to do.


Sorry about the interruption! We migrated away from Ghost and not sure how you ended up with that URL but we're adding a redirect. Have a good catch up :-)



Yes, that should be the URL and I don't think that's changed. Just wondering what URL the parent was hitting.


So I just discovered that Cloudflare now owns the trademark for Sun's "The Network is the Computer".

"Cloudflare serves the entire world — region: earth. Rather than asking developers to provision resources in specific regions, data centers and availability zones, we think “The Network is the Computer”. "

https://blog.cloudflare.com/the-network-is-the-computer/


Did they also get the old DEC t-shirt trademark: "The Network Is The Network and The Computer Is The Computer. We regret the confusion."

IBM mocked Sun with: "When they put the dot into dot-com, they forgot how they were going to connect the dots," after sassily rolling out Eclipse just to cast a dark shadow on Java. Badoom psssh!

https://www.itbusiness.ca/news/ibm-brings-on-demand-computin...


> The global scheduler is built on Cloudflare Workers, Durable Objects, and KV, and decides which Cloudflare location to schedule the container to run in. Each location then runs its own scheduler, which decides which metals within that location to schedule the container to run on.

So they just use the term "location" instead of "region".


I love that they built all this infra for running JS to avoid building a container runtime, and ended up building a container platform using all the hypervisors. On a more serious note, I do not understand why they can't fix the 500 MB upload limit. I hit that with the R2 registry and ended up switching away instead of changing all the docker push tooling. Not super excited about using more weird tools rather than fixing the platform.


Agreed; after 7 hopeful years this feels like a declaration of defeat for V8 isolate-based FaaS.


Why does it take 4 minutes (after being optimized from 8 minutes!) to move a 30 GB (compressed) docker image ...? The read slowness of docker registries continues to surprise me ...


Perhaps they are throttling this transfer to 1Gbps so as to not slam their network or disk I/O? It does seem quite slow.


I like the dig at "first generation" clouds.

There really is a wide gulf between the services provided by the older cloud providers (AWS, Azure) and the newer ones (fly.io, Cloudflare, etc.).

AWS/Azure provide very leaky abstractions (VMs, VPCs) on top of very old and badly designed protocols/systems (IP, Windows, Linux). That's fine for people who want to spend all their time janitoring VMs, operating systems, and networks, but for developers who just want to write code that provides a service, it's much better to be able to say to the cloud provider "Here's my code, you make sure it's running somewhere" and let the cloud provider deal with the headaches. Even the older providers' PaaS services have too many knobs to deal with (I don't want to think about putting a load balancer in front of ECS or whatever).


This undersells the fact that there’s a lot more to infrastructure management than “janitoring”. You and many others may want to just say “here’s my code, ship it”, but there’s also a massive market of people that _need_ the customization and deep control over things like load balancers, because they’re pumping petabytes of data through it and using a cloud-managed LB is leaving money and performance on the table. Or there are companies that _need_ the strong isolation between regions for legal and security reasons, even if it comes with added complexity.

A lot of developers get frustrated at AWS or Azure because they want to deploy their hobby app on it and realize it’s too difficult dealing with stuff like IAM - it’s like trying to dig a small hole in your garden and someone suggests you go buy a Caterpillar Excavator, when all you needed was a hand trowel. The reason this persists is because AWS doesn’t target the hobby developer - it targets the massive enterprise that does need the customization and power it provides, despite the complexity. There are, thankfully, other companies that have come in to serve up cloud hand trowels.

There is no “one size fits all” cloud. There probably never will be. They’re all going to coexist for the foreseeable future.


HN is now clearly swarmed by grandiose novice techs.

10 years ago, no such superficial assessment would have appeared on the front page.

This set of words bears little substance and few engineering facts.

> AWS/Azure provide very leaky abstractions (VMs, VPCs) on top of very old and badly designed protocols/systems (IP, Windows, Linux) .

AWS and Azure can't be lumped together; between them they span two generations.

AWS is gen 1.

Azure and GCP are gen 2.

Gen 1 is built on VMs, ECS, EBS, and S3, for the web2 era.

Gen 2 is built on cluster computing, which was enabled by VMs.

The "leaky abstraction" of that era was the mandated abstraction at the time.

And GPUs today are roughly where CPUs were in the '70s.

For example, you don't have any form of abstracted runtime on the GPU; it's like running DOS.

It's leakier than the '00s VM.


Amazon has had serverless functions for a long time now. I built an iOS app with a backend in AWS and it was as you say, “here’s my code, you make sure it’s running somewhere.” I uploaded my Typescript code, set up an API gateway to call the lambda function and… that’s it. No load balancer, no ECS management. It’s been running for years and I haven’t had to do anything other than pay the bill every month.
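
For reference, the whole "backend" there is roughly a single handler behind an API Gateway proxy integration. A minimal sketch in Python (the parent used TypeScript); the query parameter and response shape are just illustrative:

    import json

    def handler(event, context):
        # API Gateway (proxy integration) delivers the HTTP request as `event`;
        # whatever this returns becomes the HTTP response.
        name = (event.get("queryStringParameters") or {}).get("name", "world")
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"message": f"hello, {name}"}),
        }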


> ”Remote Browser Isolation provides Chromium browsers that run on Cloudflare, in containers, rather than on the end user’s own computer. Only the rendered output is sent to the end user.”

It turns out we don’t need React Server Components after all. In the future we will just run the entire browser on the server.


What's old is new again.


Does it say anywhere what GPUs they have available?

I really need NVIDIA RTX 4000, 5000, A4000, or A6000 GPUs for their ray tracing capabilities.

Sadly I've been very limited in the cloud providers I can find that support them.


The short answer here is that NVIDIA doesn't like Cloud Service Partners using RTX cards, as they are "professional" cards (they are also significantly cheaper than the corresponding data center cards). IIRC, A40, L40, and L40S have ray tracing, and might be more available on CSPs. Otherwise, the GPU marketplaces that aren't "true" CSPs will likely have RTX cards.

Paperspace (now DO), Vultr, Coreweave, Crusoe, should all have something with ray tracing.


Incredibly helpful, thank you!

We did try on the T4 and A10G but the raytracing failed even though those cards claim to support it.

We ended up on Paperspace for the time being, but they deprecated their support for Windows so I've been looking for alternatives. Will check out the providers you mentioned. Thanks again.


Oof, the short answer is Windows + virtualization + GPUs is hard for anyone to support. Two main difficult problems:

- Licensing Windows is really, really annoying (and expensive), and BYOL is something people seem oddly reticent to do
- Installing (and updating) NVIDIA GPU drivers on Windows is something that requires GUI access (at least the first time)

Paperspace was going to be my answer, but I guess DO didn't like those problems either! NVIDIA has an RTX Cloud, though I admit I'm struggling to find mention of it on their website, maybe something like: https://www.nvidia.com/en-us/data-center/free-trial-virtual-...


I doubt they'll commit to specific models before containers go GA in 2025 but they'll likely be NVIDIA:

https://www.cloudflare.com/en-gb/press-releases/2023/cloudfl...


You can as of very recently get A6000s on Hetzner, which is a pretty good deal (but not serverless, so you need a consistent load).


Super helpful, thank you!


So this will be similar to Google App Engine (now Cloud Run)? If that’s the case I would love to give it a try, but then I'd need a SQL server nearby and other open source services as well.


I like using Workers for smallish http services. The uptime, pricing, and latency are fantastic. I would never use them for anything complex as the vendor lock in is quite strong and the dev experience still needs to improve.

Containers on the edge with low cold starts, scalability, the same reliability as Workers, etc would be super cool. In part to avoid the lock in but also to be able to use other languages like Go (which Workers don't support natively).


Workers use the web worker API so theoretically there’s less lock in. I’ve also found wrangler pretty good, what problems have you run into?


You can't really run the Worker code without modifications somewhere else afaik (unless you're using something like Hono with an adapter). And for most use cases, you're not going to be using Workers without KV, DO, etc.

I've hit a bunch of issues and limitations with Wrangler and Workers locally over the years.

Eg:

https://github.com/cloudflare/workers-sdk/issues/2964

https://github.com/cloudflare/workerd/issues/1897


This is why I switched to deno deploy. Much of the same benefits and much more portable stack.


It's great you can run Deno anywhere you want. Their KV service is phenomenal too.

Personally I don't want to keep using JS in the server anymore. As more time passes I feel like TS is a hack compared to the elegance of something like Go.


Surprised they didn't go straight to cloud-hypervisor, though I haven't actually tested it with a GPU yet; it's on my todo list.

OCI layers can use zstd compression, and I wonder if they are defeating layer sharing by splitting into 500 MB chunks. Lambda splits your image into chunks and shares at the block layer (I believe even the same chunk across different (users'?) containers on a single host). Especially for 15 GB images, I'd think lazy pulling with nydus/stargz or the like would be beneficial.

I'd like to test out snapshotting, though my testing already boots a guest and runs a container in ~170ms, and I'm not actually sure how you write the guest init to signal it is ready for snapshotting and then wait properly (maybe you just sleep 1000?) so it resumes from the snapshot in a good state. I know fly has written about their use of snapshotting but I don't think it went into that detail.

Cool stuff overall though; not having to worry about locations and the yucky networking that goes with them seems nice.


This seems like a pretty big deal.

I want to like CloudFlare over DO/AWS. I like their DevX focus too -- I could see issues if devs can't get into the abstractions though.

Any red flags folks would raise regarding CF? I know they are widely used but not sure where the gotchas are.


Is Cloudflare the one that goes from free to "call for pricing" ($100K+) at the drop of a hat?


I think they have some incredibly low pricing for what most small companies need. I also think they've done a very good job of carving out pieces that more sophisticated setups need into the enterprise tier, which does constitute a big jump.

One that bit me was https://developers.cloudflare.com/cloudflare-for-platforms/c... - we found an alternative solution that didn't require upgrading to their enterprise plan (yet), but it was a pretty compelling reason to upgrade and if I was doing it again I'd probably choose upgrading over implementing our solution. On balance I'm not sure we actually saved money in the end, considering opportunity cost


One data point, but one of our toy services has been pushing 30TB/mo to 60TB/mo for over a year now, and we haven't got the call: https://news.ycombinator.com/item?id=39521228


Can you share what kind of toy is shuffling around so much data?





I'll get that fixed. Thanks for reporting it.


Their solution isn’t GA yet.

For headless browsers, the latency benefits of “container anywhere” seem high. For things like AI inference, running on the edge seems way less beneficial than running in the cheapest location possible, which would be larger regional data centers.


One would hope that “larger regional data centers” are not that far from The Edge. But the problem isn’t physics or the speed of light, it’s operational.

The operational excellence required to have every successful Internet company manage deployments to a dozen regions just isn’t there. Most of us struggle with three, my last gig tried to do two, which isn’t economical because you always try to handle one region going dark which means you need at least 200% capacity, where 3 data centers only need 150 + ??%, and 4 need 133 + ??%. It has all of the consistency problems of n > 1 and few if any of the advantages.
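
To spell out that arithmetic (a quick sketch; the "??%" is whatever extra headroom you add on top):

    # Surviving one region going dark means the remaining N-1 regions must
    # absorb the full load, so each is sized at 1/(N-1) of total demand and
    # aggregate provisioning is N/(N-1) of steady-state capacity.
    for regions in (2, 3, 4, 6, 12):
        print(f"{regions} regions -> {regions / (regions - 1):.0%}")
    # 2 -> 200%, 3 -> 150%, 4 -> 133%, 6 -> 120%, 12 -> 109%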

We need more help from the CDNs of the world to run compute heavy operations at the edge. And if they choose to send them 10-20ms away to a beefier data center I think that’s probably fine. Just don’t make us have to have the sort of operational discipline that requires.


Given how slow AI inference is (and for training it doesn't matter at all), the advantage of it being a few milliseconds closer to the user is greatly diminished. The latency to egress to a regional data center is inconsequential.

Good point about at the very least not exposing placement to customers. That is a definite win.


We’re gonna have so much local inference available.


Won't apply to everyone (most?), but some compliance assurances your customers may require can't be fulfilled by Cloudflare. And personally, I would hope their laissez-faire attitude towards protecting hate speech would damage their business, but I suspect most people not targeted by such just don't give a damn.


> but some compliance assurances your customers may require can't be fulfilled by Cloudflare.

Such as? See: https://www.cloudflare.com/trust-hub/compliance-resources/


What’s hate speech to you is free speech to someone else.


No, hate speech is hate speech. If it’s free speech to you, then you’re exactly in the group of people I’d rather not share infrastructure, or anything else, with.


That's not a real distinction, pseudo-koan aside.


Looks like CloudFlare will soon be using "All other clouds are behind ours." slogan.


"We're the silver lining."

"We'll keep you on the edge of your seat."

"Nice parade you got there. It sure would be a shame if somebody were to rain on it."


"That's how dynamic pricing works, baby" (CF taking in learnings from the Oasis/Ticketmaster heist)


What I am always missing in these posts: How do they limit network bandwidth? Since these are all multi-tenant services, how do they make sure a container or isolated browser is not taking all the network bandwidth of a host?


You can probably do this through cgroups plus tc for the shaping, and the proc/sys filesystems for counters. If you think about it, whatever lets you limit the bandwidth also lets you measure it.
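
The measuring half, at least, is straightforward if you know the host-side veth peer of the container's interface (the interface name below is a placeholder):

    import time
    from pathlib import Path

    def tx_bytes(iface: str) -> int:
        # Linux exposes per-interface counters under /sys/class/net.
        return int(Path(f"/sys/class/net/{iface}/statistics/tx_bytes").read_text())

    def egress_rate(iface: str, interval: float = 1.0) -> float:
        """Approximate egress in bytes/sec for one container's host-side veth."""
        before = tx_bytes(iface)
        time.sleep(interval)
        return (tx_bytes(iface) - before) / interval

    print(egress_rate("veth1234abcd"))  # placeholder interface name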


Lots of cool stuff in this blog post. Impressive work on many fronts!

If I understand correctly, you will be running actual third party compute workloads/containers in hundreds of internet exchange locations.

Is that in line with what the people running these locations have in mind? Can you scale this? Aren't these locations often very power/cooling-constrained?


Edgegap has been doing this for 5 years.


I can't find any GPU capabilities on their website and it seems to be 100% focused on game backend hosting, not general workers. Do you have more information?


They do have GPUs, but pricing is not on the website. They also support non-gaming use cases; they list stuff like virtual rehearsal studios, IoT, and other non-gaming workloads among their customers.


Why is Cloudflare trying to create a walled-garden internet within the internet?


> Cloudflare serves the entire world — region: earth

Is that true for China though?


> We rely on it in Production

They really have a great engineering team



