We improved the performance of a userspace TCP stack in Go (coder.com)
226 points by infomaniac 7 months ago | 129 comments



Really cool to see others hacking on netstack, bit of a shame it's tied up in the gVisor monorepo (and all the Bazel idiosyncrasies) but it's a very neat piece of kit.

I've actually been hacking on a similar FOSS project lately, with a focus on building what I'm calling a layer 3 service mesh for the edge. More or less came out of my learned hatred for managing mTLS at scale and my dislike for shoving everything through a L7 proxy (insane protocol complexity, weird bugs, and you still have the issue of authenticating you are actually talking to the proxy you expect).

Last week I got the first release of the userspace router shipped, worth taking a look if you want to play around with a completely userspace and unprivileged WireGuard compatible VPN server.

https://github.com/noisysockets/nsh/blob/main/docs/router.md


If you want to use netstack without Bazel, just use the go branch:

https://github.com/google/gvisor/tree/go

go get gvisor.dev/gvisor/pkg/tcpip@go

The go branch is auto generated with all of the generated code checked in.


I did this once for an experimental project and found it really difficult to keep the version of gVisor I was using up to date, since it seems like the API is extremely volatile. Anyone else had this experience? If so, is there some way around it that I don't know? Or did I just try it at a bad point in the development timeline?


That's just how Google operates in my experience... Avoid Google libraries unless absolutely necessary, and if you do adopt Google libraries, be prepared to either be forever multiple years out of date or spend significant resources on keeping it up to date.


Plus those libraries are often very hard to read and understand. Maybe I'm just dense though.


The API is indeed prone to change without notice, but it isn't anything terribly unmanageable.

> really difficult to keep the version of gVisor I was using up to date

For our project, we update gvisor whenever Tailscale does.


It could be that you happened to find a period of rapid change, but it is also possible that you ran into the issue that raggi mentioned in the sibling comment.


hey Ian, long time. Is there any chance y'all could swap out main so that main contains the generated code version?

I don't know the status on those export tools these days as I left the company years ago, but perhaps they could sync with a different branch.

This would help various folks quite a bit, as for example tsnet users often fall into the trap of trying to do `go get -u`, which then pulls a non-functional gvisor version.


I don't work on gVisor anymore. That said, I think it would be a tough sell. It would be a pretty big breaking change. Also, there is already a problem with people trying to send patches against the go branch and making it the default would make that much worse.

I think the solution is an automatically exported repository at a different path. Kind of (or maybe exactly) like what Tailscale/bradfitz used to maintain.


I met one of the founders of Coder.com, he's a really cool dude. It's a pity that it is a product aimed more at enterprises than individual developers, else it would have far more developer mindshare.

Unlike, say, GitHub Codespaces, running something like this on your own infra means your incentives and Coder.com's are aligned, i.e. both of you want to reduce your cloud costs (as opposed to, say, GitHub running on Azure gives them an opportunity and incentive to mark up on Azure cloud costs).


It seems like a great product. I'm wondering why they don't offer more "startup-oriented" plans. It's like either Self Hosted or "Talk to sales". Is it maybe to not compete against Github codespaces?


Founder of Coder here. Many small companies (or teams at big ones) use Coder for free with <=150 devs, just using our open-source version.

We’ve tried to align our pricing with the value of the product. In small teams the productivity gains seem to be much lower, so we target Enterprise!


Speaking of ... https://coder.com/docs -- what's next is empty.


"Asking for elevated permissions inside secure clusters at regulated financial enterprises or top secret government networks is at best a big delay and at worst a nonstarter."

But exfiltrating data with a userspace VPN is totally fine?

I'm also wondering why not use TLS.


Every connection you make to a remote service "exfiltrates data". Modern TLS is just as opaque to middleboxes as WireGuard is, unless you add security telemetry directly to endpoints --- and then you don't care about the network anyways, so just monitor the endpoint.

The reason you'd use WireGuard rather than TLS is that it allows you to talk directly to multiple services, using multiple protocols (most notably, things like Postgres and Redis) without having to build custom serverside "gateways" for each of those protocols.


Adding your own network stack to bypass limitations like this works exactly until the point where someone notices that your userspace stack needs to fulfill the same requirements that the host stack does.

And then you're suddenly in a whole world of pain because all of this is driven by a stack of byzantine certifications (half of which, as usual, are bogus, but that doesn't help you), and your network stack has none of them.

(Written from first-hand experience.)


Any tips I'm a beginner


I think the point was more that doing this as a way to avoid the red tape of getting permission to open a new connection is odd?


I understand the impulse, but I think it misconstrues the "red tape" this method avoids. It's sidestepping a quirky OS limitation, which dates back to an era of "privileged ports" and multi-user machines. It's not really sidestepping any sort of modern policy boundary. For instance: you could do the exact same thing with WebSockets (and people do).


I was thinking websockets; though, I thought those largely hit the same criticisms? That is, tons of things moved to them specifically to avoid any firewall rules about what they were allowed to send over a network.

I'll fully grant that that seems to be the norm for everything browser related. Policies got difficult to install new software, just point your browser to this url and call it a day.


> I was thinking websockets; though, I thought those largely hit the same criticisms? That is, tons of things moved to them specifically to avoid any firewall rules about what they were allowed to send over a network.

Arguably, this basic phenomenon has been going on for 20+ years. A lot of people by 2005-2007 or so had come to believe (and probably correctly) that a lot of the impetus for adopting SOAP based web-services over the preceding few years was simply because everything ran over ports 80 and 443 which were already open in the firewall. So deploying a remote service this way was more tractable than submitting a request to allow access to yet another port in the firewall, and dealing with the inevitable bureaucratic nightmare of getting that approved.


You can also sidestep that "quirky OS limitation" by just setting the first unprivileged port to 0 (ip_unprivileged_port_start), no need for a new stack.

https://www.kernel.org/doc/Documentation/networking/ip-sysct...
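
For readers who want to see what that sysctl looks like from a program's point of view, here is a minimal Go sketch. It assumes a Linux system that exposes the setting at /proc/sys/net/ipv4/ip_unprivileged_port_start; the helper names parsePortStart and needsPrivilege are made up for illustration. Reading the value needs no privilege, but changing it does, so this only inspects the current setting.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// sysctlPath is where Linux exposes net.ipv4.ip_unprivileged_port_start.
const sysctlPath = "/proc/sys/net/ipv4/ip_unprivileged_port_start"

// parsePortStart parses the sysctl file contents (for example "1024\n").
func parsePortStart(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// needsPrivilege reports whether binding the given port requires elevated
// privileges, given the first unprivileged port number.
func needsPrivilege(port, start int) bool {
	return port < start
}

func main() {
	raw, err := os.ReadFile(sysctlPath)
	if err != nil {
		fmt.Println("could not read sysctl (non-Linux or older kernel?):", err)
		return
	}
	start, err := parsePortStart(string(raw))
	if err != nil {
		fmt.Println("unexpected sysctl contents:", err)
		return
	}
	fmt.Printf("first unprivileged port: %d; port 80 needs privilege: %v\n",
		start, needsPrivilege(80, start))
}
```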


You can do that. Random programs cannot. Our CLI, which also does user-mode WireGuard and TCP/IP, doesn't even want to run under sudo. You're seeing the point, now: you want to build interesting network features that work the same way everywhere without demanding that your users be system administrators. Hence: user-mode TCP/IP.


I'm still not seeing the point sadly.

If you're running on a system you don't administrate that has ports under 1024 set as privileged, there's no way(with or without your cli) to have a userspace program receive TCP or UDP packets coming into the kernel from external devices for these ports(unless I'm completely mistaken).

What can you accomplish with "user-mode TCP/IP" that you can't from userspace with system calls?


You are completely mistaken. You can do literally anything you want with TCP/IP provided you can talk UDP on any port, by running a user-mode TCP/IP stack over WireGuard on that port.


I don't think you understand, and based on your reply it doesn't sound like I'm mistaken.

With this CLI I am able to listen for external packets to port 80 from userspace without any elevated permissions and intercept traffic that's going to an application that's bound to that port on the OS?

Edit: I think I understand what you're trying to do, but if I do then traffic is going from the kernel UDP stack to the userland TCP stack, back to the UDP kernel stack. Not sure how that avoids sending the packet to the kernel. If it's to get around the port restrictions, why can you not just use unprivileged ports?


I'm not sure I can make this any simpler for you or easier to get your head around. If it helps: the idea here is giving an invocation of your program its own IP address. It can then do whatever it likes with TCP/IP for that address; its own routing, arbitrary protocols, whatever. The Go standard library makes it extremely easy to integrate. To the OS, it's all just ordinary socket code.
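
To make the "ordinary socket code" point concrete, here is a hedged Go sketch. It does not use netstack or WireGuard at all: an in-memory net.Pipe stands in for whatever connection a user-mode stack would hand back, on the assumption that such a stack exposes values satisfying net.Conn and net.Listener. The names pipeListener, fetch, and the host example.internal are hypothetical. The point is only that net/http runs unchanged over a transport the kernel never sees.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"sync"
)

// pipeAddr is a placeholder address for the in-memory transport.
type pipeAddr struct{}

func (pipeAddr) Network() string { return "pipe" }
func (pipeAddr) String() string  { return "pipe" }

// pipeListener is a hypothetical in-memory net.Listener standing in for
// the listener a user-mode TCP/IP stack would provide.
type pipeListener struct {
	conns chan net.Conn
	done  chan struct{}
	once  sync.Once
}

func newPipeListener() *pipeListener {
	return &pipeListener{conns: make(chan net.Conn), done: make(chan struct{})}
}

func (l *pipeListener) Accept() (net.Conn, error) {
	select {
	case c := <-l.conns:
		return c, nil
	case <-l.done:
		return nil, net.ErrClosed
	}
}

func (l *pipeListener) Close() error {
	l.once.Do(func() { close(l.done) })
	return nil
}

func (l *pipeListener) Addr() net.Addr { return pipeAddr{} }

// Dial creates a connected pair and hands the server side to Accept.
// The network and addr arguments are ignored by this stand-in.
func (l *pipeListener) Dial(ctx context.Context, network, addr string) (net.Conn, error) {
	client, server := net.Pipe()
	select {
	case l.conns <- server:
		return client, nil
	case <-l.done:
	case <-ctx.Done():
	}
	client.Close()
	server.Close()
	return nil, net.ErrClosed
}

// fetch serves one HTTP handler on the in-memory listener and requests it
// with a stock http.Client whose dialer is swapped out.
func fetch(path string) (string, error) {
	ln := newPipeListener()
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello over a non-kernel transport")
	})
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln)
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{DialContext: ln.Dial}}
	resp, err := client.Get("http://example.internal" + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetch("/hello")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(body)
}
```

Swapping the dialer and listener is the whole integration surface here; the handler and client code are the same as they would be over kernel sockets.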


> the idea here is giving an invocation of your program its own IP address

I understand that, I just don't understand any case where that's desirable...

We have 2^16 ports available to applications (and a special `0` port that can be used to request any port) on a single IP (which is usually shared between multiple machines). I have never heard of a case where 2^16 ports is not enough ports for the number of applications that need to be listening.

> To the OS, it's all just ordinary socket code.

Which is what I don't understand. Why not just use ordinary socket code without all of these additional LoC in between that open you up to more bugs(security and functionality).
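
As a small illustration of the special port `0` mentioned above, here is a Go sketch using only the standard library. TCP port numbers are 16-bit values, so whatever the kernel assigns falls in the range 1 to 65535. The helper name listenAnyPort is made up for this example, and it binds loopback only.

```go
package main

import (
	"fmt"
	"net"
)

// listenAnyPort asks the OS for any free TCP port on loopback by binding
// port 0, and returns the listener plus the port that was assigned.
func listenAnyPort() (net.Listener, int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, err
	}
	port := ln.Addr().(*net.TCPAddr).Port
	return ln, port, nil
}

func main() {
	ln, port, err := listenAnyPort()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()
	fmt.Println("kernel assigned port", port)
}
```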


I think it is a valid question on why not use unprivileged ports, though? Or am I also missing something?


"Unprivileged ports" is just a case in point for everything you would need privileges to do, from binding arbitrary ports to adding arbitrary addresses and running local servers on them, etc, etc, etc. The point is: turning "complicated" network features that require privilege into simple unprivileged socket code.


Right, per my other thread below, my understanding was the "privileged" ports were mainly ones that were allowed for off machine communication by standard policy/convention long time ago. As such, using higher number ports should be just as easy in the code as using lower ports, outside of the discovery that was implied by following the other conventions. But, introducing new network addresses seems to have already side stepped the discovery affordances?

I'll offer the same caveat here, btw, I am not trying to torpedo the idea of trying this. I'm genuinely curious why you would need to do this. Not necessarily why you would want to.


We did it because we run WireGuard gateways to the public cloud we operate, and our CLI wants to talk to things on customer networks (like remote Docker server instances to build new versions of apps). Our options were:

* Do user-mode WireGuard (and thus TCP/IP) and talk "natively" to the infrastructure deployed on our platform.

* Write case-by-case application gateways for each of those pieces of infrastructure tunneled somehow through HTTP.


Number one reason is if you can't. Number two is if you don't want to.

You can't if your organization prevents you from doing so, for example.

You don't want to if you follow strict rules which are not enforced by the OS, again for example.


I'm curious why you couldn't?

And if you don't want to, that feels misguided?

Granted, my old recollection was largely that the "privileged" ports were that way because they were blessed by the routing tables, at the time. The entire point was that the lower ports were expected to be connectable to external machines. Not shocking if I am out of date there.


Both reasons have nothing to do with technical capability and everything with organisational policies.


Right... but that just seems to imply that this is getting around policies by the letter?

I should hasten to add that I am not offering this as reason this shouldn't be done.


I know this is subtle but if you take a step back, user-mode TCP/IP moots the policies we're talking about; it doesn't subvert them. There are no security or policy implications to e.g. binding a low-numbered port on an IP address unrelated to your physical computer that is allocated directly to a running instance of your program. There are (archaic) implications to binding that port on an actual interface of your machine, because that binding (archaically) stood in for an assertion of identity/authority back in the 1990s.


Sorta? If it renders them moot, why not attack the policies? For that matter, if there are no implications to the number of the port, I'm again forced to ask why not just use the higher numbers? Wouldn't that have let you use the "simplicity of standard socket code" with no extra effort?

(This is also a new use of "moot" to me? You seem to be offering it as a synonym of obsolete? But a "moot" debate is one that is closer to "overcome by events" than one that is not relevant. Right?)


You get that these are just people shipping a program that random people are going to run on random computers, right? If your go-to-market involves "reversing longstanding Unix network policy rules", you have problems.

Respectfully, if at this point the situation hasn't been made clear to you, I don't think there's much more to productively discuss.


Amusingly, I would do an appeal to a word you already used. :D I was treating this as a bit of a moot debate. Discussing it as much for interests in the general discussion.

My impression from something said elsewhere was that this was largely for internal tools. I'm not sure why I got that impression, though.

I think it is fair, btw, that I would be pushing for both paths, at this point? If a long standing network policy rule has become obsolete from advance, it is worth considering dropping it? Is that not something people are looking at?

(I will also note that I will not be at all offended if you drop out from lack of interest here. Apologies if you feel I was wasting your time!)


Doesn't this program need to bind a UDP port on your machine? If policies prevent you binding ports I don't see how you'd be able to use this software...


I have no idea what you're talking about. We're talking about essentially fictitious IP addresses on synthetic machines. I think everyone's talking past each other here.


I don't really think people are talking past each other, I think we're all trying to understand the point of this, and the article (and marketing on the website) doesn't make it clear at all.

Some potential options for "what is it for" come up, and others bring up reasons why they don't make sense.

It seems this is a solution to a very specific problem that nobody seems to have, which is why when people are trying to figure out what problem it solves they're coming up with 10 better solutions.


When they're talking about classified defense networks, the actual restrictions they mean is least privilege and separation of duties. Devs are not admins. They don't get root privilege on their machines. They can't create virtual network interfaces and they also can't change kernel settings. But if you put a full TCP/IP stack in userspace, well, they can run that and do whatever they want with it.

To answer the upstream question about why arbitrary outbound connections are allowed, they're not. This is connecting to a cloud development environment, and I would have to assume this service can be self-hosted, because on a classified network, the "cloud" isn't the cloud as Hacker News readers know it. Amazon et al. run private data centers on US military installations that only the military and the IC can access and they're airgapped from the Internet. If you're on a workstation that can access this environment, that's all it can access. The only place you can exfiltrate data to is other military-controlled servers.


If you're talking about developer machines, isn't the best(and easiest) solution to just run a VM that you administer so you can create virtual networks?

If you're talking about production machines, a userspace application wouldn't be able to sniff privileged ports without elevated permissions, so I fail to see how this application would let you get around that limitation.


No, because VMs are expensive and require some base level of system administration to operate, booting them usually requires privilege, and if the only problem you're trying to solve is reliably running (e.g.) Postgres and Redis protocol between your CLI and a server somewhere, it's extreme overkill.


VMs are free and can be run by any semi-competent developer wanting to host a test server on their development machine.

Postgres and Redis can use non-privileged ports, so I don't understand why this would matter.


"Just use a VM" instead of running a Unix command that drives a UDP socket is... a take.


imo a bit milder of a take than "just maintain a second TCP stack instead of hosting on a non-privileged port".

Also are we just ignoring that you pretended VMs were expensive to run? Most of your responses sound devoid of a lot of fundamental computer knowledge(networking and otherwise).


Getting that change onto the system sounds like "at best a big delay and at worst a nonstarter".


You can't control what information flows through an outbound connection, not even in trivial cases. Even if you straight go ahead and say "I allow you to make this connection, but I'm not even allowing you to send any data", you have timing sidechannels to deal with. In any more reasonable case, an almost infinite number of things can be used to exfiltrate any data you want, even if you think you have not only full application-level inspection, but even application-level rewrite.

Pretty much the only thing you can do is somewhat filter out known-bad, not directly motivated outbound traffic, such as malware payloads with very clear signatures. This only works if it's "not directly motivated", because as soon as there's a person who wants to do it, they can skirt around it again.


fwiw, you technically don't need a privileged container to use tun, you just need suitable permissions on the kernel tun interfaces.


Yeah, the optimisations are cool of course, but (maybe due to being unfamiliar with the tool?!) I didn't understand why they can't just `listen(2)`.


It’s answered in the opening paragraph although I’ll admit I’m still unclear.

> We are committed to keeping your data safe through end-to-end encryption and to making Coder easy to run across a wide variety of systems from client laptops and desktops to VMs, containers, and bare metal. If we used the TCP implementation in the OS, we’d need a way for the TCP packets to get from the operating system back into Coder for encryption. This is called a TUN device in unix-style operating systems and creating one requires elevated permissions, limiting who can run Coder and where. Asking for elevated permissions inside secure clusters at regulated financial enterprises or top secret government networks is at best a big delay and at worst a nonstarter.

The specific part that’s unclear is why encryption needs to be applied at the TCP layer and at that point if they need it at the transport layer why they’re not using something like QUIC which has a much more mature user-space implementation.


I think the key insight behind this approach (and I'm biased here having written something similar) is that the difference between QUIC and (wireguard + network stack) is A LOT less than you might think.


I'm confused on why they would need a TUN device for a client or server application, so why they would need this solution in the first place(even with their explanation).

As I understand the only reason you'd use a TUN interface is if you want to send/receive raw IP packets. Their marketing doesn't make it very clear what their product does, but I can't see a reason it would need to send/receive raw IP packets rather than TCP/UDP packets over a specific port...


Agree. Very unclear why they won't simply use a secure socket or why a user space tunnel will be needed.

I surmise that the reason might be that a user space tunnel might be faster (like maybe they can do UDP over TCP or something to gain speed improvements).

Good post nevertheless.


Or TLS. It seems to be a remote cloud desktop type of product, so why not use TLS like every other one?


The quote - is this yet another issue caused by abysmal FFI overhead in Go?


https://www.reddit.com/r/golang/comments/12nt2le/when_dealin...

If your C doesn't fight the scheduler it isn't that bad.


Great find! Specifically:

> On a goroutine not locked to an OS thread (the default), don't take more than 1 microsecond in a single C call. If you need to take longer in C, lock the goroutine to an OS thread (runtime.LockOSThread), but then don't do things in Go that would park that goroutine (time.Sleep, blocking channel read/write, etc).
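
A rough Go sketch of the pattern that advice describes, for anyone who has not used it. To keep it runnable without a C toolchain, the long call is a pure-Go placeholder (slowForeignCall, a made-up name) rather than a real cgo call; runPinned is likewise hypothetical. The goroutine locks its OS thread, does the long work without any parking operations, and only hands back the result after unlocking.

```go
package main

import (
	"fmt"
	"runtime"
)

// slowForeignCall is a pure-Go placeholder for a long-running C call made
// through cgo. It is hypothetical and only burns some CPU.
func slowForeignCall(n int) int {
	sum := 0
	for i := 1; i <= n; i++ {
		sum += i
	}
	return sum
}

// runPinned runs the slow call on a goroutine locked to its OS thread.
// While locked, the goroutine does nothing that would park it; the result
// is sent only after the thread has been unlocked.
func runPinned(n int) int {
	out := make(chan int, 1)
	go func() {
		runtime.LockOSThread()
		res := slowForeignCall(n)
		runtime.UnlockOSThread()
		out <- res // buffered channel, so this send does not block
	}()
	return <-out
}

func main() {
	fmt.Println(runPinned(1000)) // sum of 1..1000
}
```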


There's nothing related to FFI calls in this quote.


I don't know anything about Coder, but Gvisor proliferation is annoying. It's a boon for cloud providers, helping them find another way to get a large multiple performance decrease per dollar spent in exchange for questionable security benefits. And I'm seeing it everywhere now.


I don't understand - what do you suggest as an alternative to Gvisor?

> large multiple performance decrease per dollar spent

Gvisor helps you offer multi-tenant products which can be actually much cheaper to operate and offer to customers, especially when their usage is lower than a single VM would require. Also, a lot of applications won't see big performance hits from running under Gvisor depending on their resource requirements and perf bottlenecks.


> I don't understand - what do you suggest as an alternative to Gvisor?

Their performance documents you linked claim vs runc: 20-40x syscall overhead, half of redis' QPS, and a 20% increase in runtime in a sample TensorFlow script. Also google "CloudRun slow" and "Digital Ocean Apps slow", both are Gvisor.

Literally anything else.


A decent while ago, I was the original author of that performance guide. I tried to lay out the set of performance trade-offs in an objective and realistic way. It is shocking to me that you’re spending so much time commenting on a few figures from there, ostensibly w/o reading it.

System call overhead does matter, but it’s not the ultimate measure of anything. If it were, gVisor with the KVM platform would be faster than native containers (looking at the runsc-kvm data point which you’ve ignored for an unknown reason). But it is obviously more complex than that alone. For example, let’s click down and ask — how is it even possible to be faster? The default docker seccomp profile itself installs an eBPF filter that slows system calls by 20x! (And this path does not apply within the guest context.) On that basis, should you start shouting that everyone should stop using Docker because of the system call overhead? I would hope not, because looking at any one figure in isolation is dumb — consider the overall application and architecture. Containers themselves have a cost (higher context switch time due to cgroup accounting, costs to devirtualize namespaces in many system calls, etc.) but it’s obviously worth it in most cases.

The redis case is called out as a worst case — the application itself does very little beyond dispatching I/O, so almost everything manifests as overhead. But if you’re doing something that has 20% overhead, you need hard security boundaries, and fine-grained multi-tenancy can lower costs by 80% it might make perfect sense. If something doesn’t work for you because your trade-offs are different, just don’t use it!


> it is shocking to me that you’re spending so much time commenting on a few figures from there

You give me too much credit! They were copy pastes to the same responder who responded to me in a few places in the thread. I did that to avoid spending too much time responding!

> because looking at any one figure in isolation is dumb

So the self-reported performance figures are bad, there are hundreds of web pages and support pages reporting slow performance and slow startup times from their first hand experience, there are Google hosted documentation pages about how to improve app performance for cloudrun (probably the largest user and creators of Gvisor, can I assume they know how to run it?) including gems like "delete temporary files" and a blog post recommending "using global variables" (I'm not joking). And the accusation is "dumb" cherry-picking? Huh?

Also, if I'm not wrong, CloudRun is GCP's main (only? besides managed K8s) PaaS container runtime. Presenting it as a general container runtime with ultra fast scaling when people online are reporting 30 second startup times for basic python/node apps, is a joke. These tradeoffs should also be highlighted somewhere in these sales pages, but they're not.

This is the last I'm responding to this thread. Also my apologies to the Coder folks for going off topic like this.


I don’t think copy/pasting the same response everywhere is better.

IIRC, CloudRun has multiple modes of operation (fully-managed and in a K8s cluster) and different sandboxes for the fully-managed environment (VM-based and gVisor-based). Like everything, performance depends a lot on the specifics. For example, the network depends a lot more on the network path (e.g. are you using a VPC connector?) than it does on the specific sandbox or network stack (i.e. if you want to push 40gbps, spin up a dedicated GCE instance). Similarly, the lack of a persistent disk is a design choice for multiple reasons: if you need a lot of non-tmpfs disk or persistent state, CloudRun might not be the right place for the service.

It sounds like you personally had a bad experience or hit a sharp edge, which sucks and I empathize — but I think you can just be concrete about that rather than projecting with system call times (I’d be happy to give you the reason gen1 sandbox would be slow for a typical heavy python app doing 20,000 stats on startup — and it’s real but not really system calls or anything you’re pointing at,… either way you could just turn on gen2 or use other products, e.g. GCE containers, GKE autopilot, etc.).

I’m not sure what’s wrong with advice re: optimizing for a serverless platform (like global variables). I don’t really think it would be sensible to recompute/rebuild application state on any serverless platform on any provider.


Are you referring to gVisor the container runtime, or gVisor/netstack, the TCP/IP stack? I see more uptick in netstack. I don't see proliferation of gVisor itself. "Security" is much more salient to gVisor than it is to netstack.


On the issue of abysmal performance on cloud-compute/PaaS, I'm talking about the container runtime (most PaaS is gVisor or Firecracker, no?): cloudrun, DO, modal, etc.

But given this article is about improving gvisor's userland tcp performance significantly, it seems like the netstack stuff causes major performance losses too.

I saw a github link in another top article today https://github.com/misprit7/computerraria where the Readme's Pitch section feels very relevant to gvisor.


I don’t believe many PAAS run gVisor; a surprising number just run multitenant docker.

The netstack stuff here has nothing to do with the rest of gVisor.


> The netstack stuff here has nothing to do with the rest of gVisor.

How so? Besides being part of it, it is at least similar in the group of "bloated slow userland implementation of things the kernel handles well"


A TCP/IP stack is not an "implementation of syscalls". The things most netstack users do with netstack have nothing to do with wanting to move the kernel into userland and everything to do with the fact that the kernel features they want to access are either privileged or (in a lot of IP routing cases) not available at all. Netstack (like any user-mode IP stack) allows programs to do things they couldn't otherwise do at all.

The gVisor/perf thing is a tendentious argument. You can have whatever opinion you like about whether running a platform under gVisor supervision is a good idea. But the post we're commenting on is obviously not about gVisor; it's about a library inside of gVisor that is probably a lot more popular than gVisor itself.


> The gVisor/perf thing is a tendentious argument

Interesting to dismiss it as such. The gvisor netstack is a (big) part of gvisor and this article is discussing how the performance of that component was, and could well still be, garbage.

These tools bring marginal capability and performance gains, shoved down people's throats by manufacturing security paranoia. Oh, and it all happens to cost you like 10x time, but look at the shiny capabilities, trust me it couldn't be done before! A netsec and infra peddler's wet dream.


> The gvisor netstack ... this article is discussing how the performance of that component was ... garbage.

The article and a related GitHub discussion (linked from TFA) point out that the default congestion algorithm (reno) wasn't good for long-distance (over Internet) workloads. The gvisor team never noticed it because they test/tune for in-datacenter usecases.

> These tools bring marginal capability and performance gains

I get your point (ex: app sandbox in Android ruins battery & perf, website sandbox on chrome wastes memory, etc). While 0-days continue to sell for millions, opsec are right to be skeptical about a very critical component (kernel) that runs on 50%+ of all servers & personal devices.
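
On the point above about reno behaving differently in a datacenter than over the Internet, a back-of-the-envelope Go sketch may help. It uses the well-known Mathis et al. approximation for a loss-based Reno-style sender, rate ≈ (MSS / RTT) * sqrt(3/2) / sqrt(p). This is a textbook model, not something taken from the article, and the MSS, loss rate, and round-trip times below are assumed values chosen only for illustration; renoThroughput is a made-up helper name.

```go
package main

import (
	"fmt"
	"math"
)

// renoThroughput estimates steady-state throughput in bytes per second
// for a Reno-style sender using the Mathis et al. approximation:
//
//	rate ≈ (MSS / RTT) * sqrt(3/2) / sqrt(lossRate)
func renoThroughput(mssBytes, rttSeconds, lossRate float64) float64 {
	return mssBytes / rttSeconds * math.Sqrt(1.5) / math.Sqrt(lossRate)
}

func main() {
	const mss = 1460.0  // assumed MSS in bytes
	const loss = 0.0001 // assumed packet loss rate (0.01%)

	// Assumed round-trip times: in-datacenter vs. long-distance path.
	for _, rtt := range []float64{0.001, 0.100} {
		mbps := renoThroughput(mss, rtt, loss) * 8 / 1e6
		fmt.Printf("RTT %.0f ms: ~%.0f Mbit/s\n", rtt*1000, mbps)
	}
}
```

Under this model the achievable rate scales inversely with RTT, so at the same loss rate a 100 ms path gets about one hundredth of what a 1 ms path gets. That is consistent with a default that looks fine inside a datacenter and poor over long distances.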


Linux kernel LPEs do not routinely sell for millions. There's a market value on a specific subset of vulnerabilities that root flagship Google phones.


None of this has anything to do with security paranoia.


In the context of coder, the userspace TCP overhead should be negligible. Based on https://gvisor.dev/docs/architecture_guide/performance/ and assuming runc is mostly just using the regular kernel networking stack (I think it does, since it mostly just does syscall filtering?) it should be at most a 30% direct TCP performance hit. But in a real application you typically only spend a negligible amount of total time in the TCP stack - the client code, total e2e latency, and server code corresponding to a particular packet will take much more time.

You'll note their node/ruby benchmarks showed a substantially bigger performance hit. That's because the other gvisor sandboxing functionality (general syscall + file I/O) has more of an impact on performance, but also because these are network-processing bound applications (rare) that were still reaching high QPS in absolute terms for their respective runtimes (do you know many real-world node apps doing 350qps-800qps per instance?).

Because coder is not likely to be bottlenecked by CPU availability for networking, the resource overhead should be inconsequential, and what's really important is the impact on user latency. But that's something likely on the order of 1ms for a roundtrip that is already spending probably 30-50ms at best in transit between client and server (given that coder's server would be running in a datacenter with clients at home or the office), plus the actual application logic overhead which is at best 10ms. And that's very similar to a lot of gvisor netstack use cases which is why it's not as big of a deal as you think it is.

TLDR: For the stuff you'd actually care about (roundtrip latency) in the coder usecase the perf hit of using gvisor netstack should be like 2% at most, and most likely much less. Either way it's small enough to be imperceivable to the actual human using the client.
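
To make the arithmetic behind that estimate concrete, here is a minimal Go sketch using only the rough figures quoted above (about 1ms of added netstack latency, 30-50ms in transit, about 10ms of application logic). The function name `overheadPercent` is made up for illustration, and the inputs are the comment's guesses, not measurements.

```go
package main

import "fmt"

// overheadPercent returns the added latency as a percentage of the
// baseline round trip (transit time plus application time).
func overheadPercent(addedMs, transitMs, appMs float64) float64 {
	return addedMs / (transitMs + appMs) * 100
}

func main() {
	// Assumed figures from the comment: ~1ms userspace TCP cost,
	// 30-50ms client<->server transit, ~10ms application logic.
	for _, transit := range []float64{30, 50} {
		fmt.Printf("transit=%.0fms overhead=%.1f%%\n",
			transit, overheadPercent(1, transit, 10))
	}
}
```

With these inputs the overhead lands between roughly 1.7% and 2.5% of the round trip, which is the same ballpark as the "like 2%" figure, and it shrinks further as application time grows.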


TCP overhead is part of the story. There's 20-40x overhead in syscalls, 20% running a tensorflow project end to end, 50% fewer RPS in redis, etc.


We are still talking about people using runsc/runc. That's not what `coder` is doing. All they did was poach a (popular) networking library from the gVisor codebase. None of this benchmarking has anything to do with their product.


I've already accepted this whole thread is a digression, but I keep getting pulled in. Calling out my dislike for Gvisor on a thread lauding a 5x TCP performance improvement they found in it felt on topic to me at the time.


Ok. I'm only triggered by two things:

1. An argument that a tool using netstack is in any way tainted with gVisor's runtime costs.

2. An argument that shared-kernel multitenant is tenable and thus gVisor addresses no meaningful security concerns.


Not gonna lie am also getting 200% triggered whenever he states gVisor Syscall costs lol


Are they even using runc/runsc?


At coder, no since "gVisor is a container runtime that reimplements the entire Linux ABI (syscalls) in Go, but we only need the networking for our purposes"

but gvisor was using full runsc for the networking benchmarks I linked, and IIUC runc's networking should be sufficiently similar to unsandboxed networking that I believe runsc<->runc network performance difference should approximate gvisor netstack<->vanilla kernel networking.


Google is my former employer and this statement isn't referring to stuff I heard while employed there.

But after I left, I heard that a lot of the poor performance of Cloud Run is just plain old oversubscribed shared core e2 stuff.


There are still products from cloud providers that don't use gvisor. Basics like EC2 or GCE. Sounds like you chose the wrong cloud product.


Can you elaborate on your concern? Is the issue that you don't trust gVisor to keep the cloud provider secure?


Providers managed secure shared environments for decades before ultra inefficient wrappers and runtimes like gVisor existed.


No. The providers that did so soundly used virtualization to accomplish this, and a big part of the appeal of K8s is having a much more lightweight unit of scheduling than full virtualization. gVisor is a middle ground between full virtualization and shared-kernel multitenant (which has an abysmal security track record).


Virtualization, lxc, containers (and K8s), etc were solutions to "secure shared environments". And they have an order of magnitude lower performance hit than gvisor does (Google 'cloudrun python startup times' if you're curious on the real impact of this stuff).

Have we proven they're not secure and safe? Have we broken out of containers yet? Heroku was running LXC for years before docker, did they run into major security woes (actually curious)?

If "secured shared environments" is a more specific term meaning "multi user unix environment", I didn't intend to say that.

Though you already mentioned my whole thread is a bit off topic to this post (and I sorta agree) but then baited me with this comment after. I'm happy to drop it and wait for a Gvisor container runtime thread.


Containers are not compute environments, their runtimes are, and gvisor (runsc) is one implementation of that. Docker engine (~runc) is another. It has similar performance characteristics to gvisor afaict looking online (the minimum cold start times I'm seeing are 500ms which I've beat in gvisor), yet implements fewer security features.

If by virtualization you mean VMs, gvisor can be more performant than those based on my experience. For example, AWS claims a p0 coldstart time of ~500ms using Firecracker but I know firsthand that applications sandboxed by gvisor can be made to cold start in significantly less time (like less than half): https://catalog.workshops.aws/java-on-aws-lambda/en-US/03-sn..., and you should be able to confirm this yourself by using products that leverage Gvisor under the hood or with your own testing. I actually worked on this (using gvisor, but working on adjacent tech) for years...

> Have we broken out of containers yet?

Sure, how about https://scout.docker.com/vulnerabilities/id/CVE-2024-21626 where runc (Docker) exposed the host filesystem to containerized applications? Precisely the kind of exploit gvisor is designed to prevent.

I'll note that a lot of people are thinking about how to reduce sandbox overhead in multitenant PaaS and it's one of the things I want to eventually address in my own startup. But I think blindly hating on gvisor because of a nebulous dislike of overhead really is misplaced without considering its alternatives.


The charts you linked in the performance guide show a 30x syscall overhead in runsc vs runc (careful, quite a few of the charts are logarithmic). That's insane! They also go on to claim a 20% tensorflow workload difference.

> I want to eventually address in my own startup.

You worked on CloudRun and their performance is dogshit. Seriously, google it, there's like 100 stack overflow questions on the subject. It's a common enough query that Google even suggests follow up questions like: "Why is cloud run so slow?".

Now your answer might be "avoid syscalls", "don't do anything on the file system (oh by the way your file system is memory mapped hehe)", "interpreters can be slow to load their code, sorry", "look at these charts it's not as bad as you say", "tcp overhead is only 30%", etc but your next set of customers won't have the same vendor lock in you enjoyed at Google.

Then do the same query for "Digital Ocean Apps slow", also gvisor. And bam you'll have a long list of customers ready to use your better version! Perhaps Google and Digital Ocean will enlist your expertise (again).


Yes, we have proven that shared-kernel multitenant is unsafe. The best example (though there are many) is the `waitid` LPE; nobody's container lockdown configuration was blocking `waitid`, which is what you'd have had to do to prevent container code from compromising the kernel. The list of Linux LPEs is long, and syzkaller crashes longer still.


https://news.ycombinator.com/item?id=40591147

So the PaaS providers mentioned in that comment should be assumed to be compromised?


If they are using multitenant Docker / containerd containers with no additional sandboxing, then yes, it's only a matter of time and attacker interest before a cross-tenant compromise occurs.


There isn't realistic sandboxing you can do with shared-kernel multitenant general-workload runtimes. You can do shared-kernel with a language runtime, like V8 isolates. You can do it with WASM. But you can't do native binary Unix execution and count on sandboxing to fix the security issues, because there's a track record of local LPEs in benign system calls.


OpenVZ, Virtuozzo and friends definitely weren't secure the way gVisor or Firecracker are. You can still do that and some providers do, doesn't make it a good idea.


It's great to see this, I know the team went on a long journey through this and the blog makes it almost look shorter and simpler than it was. I'm hoping one day we can all integrate the support for GSO that's been landing in gvisor too, but so far we've (tailscale) not had a chance to look deeply into that yet. It was really effective for our tun and UDP interfaces though.


At Coder we’re fans and users of Tailscale, so very happy to have these changes be consumed upstream as well!


> one day we can all integrate the support for GSO that's been landing in gvisor

Google engs recently rewrote the GSO bit, but unlike Tailscale, it is only for TCP, though.

Besides, gvisor has had "software" & "hardware" GSO support for as long as I can remember.


The obvious question is: How does it compare to the in-Kernel TCP stack?


It's less mature, which shows up in lots of places, such as sometimes having less than ideal defaults (as in buffer sizes shown here), and bugs if you start using more fancy features (which improve over time of course).

This is approximately the case for any alternative IP stack you might pick though, a mature IP stack is a huge undertaking with all the many flavors of enhancements to IP and particularly TCP over the years, the high variance in platform behaviors and configurations and so on.

In general you should only take on a dependency of a lesser-used IP stack if you're willing to retain or train IP experts in house over the long haul, because as is demonstrated here, taking on such a dependency means eventually you'll find a business need for that expertise. If that's way outside of your budget or wheelhouse, it might be worth skipping.


gVisor's netstack is still much slower than the kernel's (and likely always will be). The goal of this userspace netstack is not to compete with the kernel on performance, but offer an alternative that is more portable and secure.


How is it more portable or secure than an API that's been stable for decades, and getting constant security fixes?

I see an explanation in their blog about avoiding TUN devices since they require elevated permissions, but why would you need a TUN device to send data to/from an application? I can't understand what their product does from the marketing material but it doesn't look like it would require constructing raw IP packets instead of TCP/UDP packets and letting the OS wrap them in the other layers.


You can have multiple layers of security boundary on most of the customer-exposed surface area, and avoid more risky surface areas in the kernel.

Portable is a bit of a weird word here because for many of us with gray beards the word means architectures, kernels and systems, but I think in this context it tends to more mean "can run just as easily on my macbook as in a cloud container", but in practice the software isn't that portable, as Go isn't that portable - at least not in the context of vs. a niche C "portable network stack" that would build roughly anywhere that there's a working C toolchain, which is almost everywhere.

Constant security fixes for the kernel are a real pain in deployments unless you follow upstream kernels closely. If your business is in shipping Linux runtimes with a high packing density, you really need to find ways to minimize the exposed Linux surface area, or organize to be able to ship kernel upstream updates at an extremely high frequency (relative to normal infrastructure upgrade rates for kernels / mandatory reboots) (and I would not consider kexec safe in this kind of context, at all).

An alternative approach might be firecracker / microvms and so on, but those have their own tradeoffs too. The core point is that you want more than one layer between the host machines and the user code that wants to interact with Linux features.


> You can have multiple layers of security boundary on most of the customer-exposed surface area, and avoid more risky surface areas in the kernel.

I fail to see what "risky surface areas" in the kernel you're avoiding. You have more packets going through the kernel network stack(since you're wrapping a TCP connection in a UDP connection that goes through the kernel) than just using the TCP stack in the kernel. Are you saying that the TCP stack in the kernel cannot be trusted, but a userspace kernel you maintain can(that's a bit ridiculous...)

> can run just as easily on my macbook as in a cloud container

Any POSIX C code that listens on non-privileged ports will run on machines with the correct glibc version(and you can statically compile the glibc or not need it like go does). This includes linux and macOS(and if you're using a library that's on multiple OSes you get even more support without having to implement TCP in userspace).

> Constant security fixes for the kernel are a real pain in deployments unless you follow upstream kernels closely.

I don't think you understand. You're still at the mercy of the kernel for security patches to the UDP stack, you're just now also having to maintain a TCP stack in parallel.

> An alternative approach might be firecracker / microvms and so on

Wouldn't an alternative approach just be to use cross-platform libraries and non-privileged ports?

> The core point is that you want more than one layer between the host machines and the user code that wants to interact with Linux features.

You just said the opposite... how can more things requiring security fixes be a bad thing, while you arbitrarily want more layers between you and the most security tested code for networking available to you.


> Are you saying that the TCP stack in the kernel cannot be trusted, but a userspace kernel you maintain can(that's a bit ridiculous...

Yes that’s exactly right. It’s not ridiculous. Netstack is written in a GC’d language which alone eliminates several categories of vulnerabilities that exist in the kernel. But more important than that is that it’s in USERSPACE. So even if you do compromise gVisor netstack the best you have is the capabilities that any other normal process has. Compare that to the kernel vulnerabilities where you potentially have cracked root.

> You're still at the mercy of the kernel for security patches to the UDP stack, you're just now also having to maintain a TCP stack in parallel.

The TCP stack is at least an order of magnitude more complex than UDP and has a correspondingly much higher number of bugs filed against it. Only relying on UDP is a security win.


> I fail to see what "risky surface areas" in the kernel you're avoiding. You have more packets going through the kernel network stack(since you're wrapping a TCP connection in a UDP connection that goes through the kernel) than just using the TCP stack in the kernel. Are you saying that the TCP stack in the kernel cannot be trusted, but a userspace kernel you maintain can(that's a bit ridiculous...)

There's a constant stream of bugs in kernel network and IO interfaces, many of which require direct local interaction for exploitation, and aren't remotely attackable. Don't assume, spend a few hours and have a read through some.

> Any POSIX C code that listens on non-privileged ports will run on machines with the correct glibc version(and you can statically compile the glibc or not need it like go does). This includes linux and macOS(and if you're using a library that's on multiple OSes you get even more support without having to implement TCP in userspace).

That doesn't get anywhere near the use case here which is: run third party user supplied code unmodified.

> I don't think you understand. You're still at the mercy of the kernel for security patches to the UDP stack, you're just now also having to maintain a TCP stack in parallel.

The surface is not "UDP" and "TCP", this view is a huge distortion. As I suggested above, have a read through some of the relevant bugs over the last two years, and consider their implications in the relevant use case: running unmodified third party user code on a system.

> Wouldn't an alternative approach just be to use cross-platform libraries and non-privileged ports?

No, again, that doesn't meet the use case: run unmodified third party user code on the system.

> You just said the opposite... how can more things requiring security fixes be a bad thing, while you arbitrarily want more layers between you and the most security tested code for networking available to you.

Your characterization of Linux further suggests the exercise above would be a great experience.


for some definition of portable which is deeply tied to the go runtime


help me understand something.

> we’d need a way for the TCP packets to get from the operating system back into Coder for encryption.

yes, this is commonly done via OpenSSL for example.

> This is called a TUN device in unix-style operating systems and creating one requires elevated permissions

waitasec, wut? sure you could use a TUN device I guess, but assuming some kind of multi-tenant separation is an underlying assumption they didn't mention in their intro, couldn't you also use cgroup'd containers? sorry if I'm not fluent in the terminology.

i'm struggling to understand the constraints that push them towards gVisor. simply needing to do encryption doesn't seem like justification. i'm sure they have very good reasons, but needing to satisfy a financial regulator seems orthogonal at best. i would just like to understand those reasons.


Doesn’t creating a raw socket need elevated permissions?


They're not creating raw sockets†. The neat thing about WireGuard is that it runs over vanilla UDP, and presents to the "client" a full TCP/IP interface. We normally plug that interface directly into the kernel, but you don't have to; you can just write a userspace program that speaks WireGuard directly, and through it give a TCP/IP stack interface directly to your program.

† I don't think? I didn't see them say that, and we do the same thing and we don't create raw sockets.


So it tunnels TCP/IP over Wireguard UDP?


Correct (I mean, that's fundamentally what WireGuard is: a UDP TCP/IP tunnel, with strong modern encryption).



is this part of the open source releases? I looked at the coder.com github, but couldn't find it. I haven't written a compatible TCP, but a different reliable transport in go userspace. fairness aside, i wonder why we dont see this more often. would love to take a look


They upstreamed their gVisor changes: https://github.com/google/gvisor/pull/10287


If you’re tunneling a better connection configuration isn’t the tunnel what defines the latency?


I have a problem right now which is that it’s slow to copy large files from one side of the earth to the other. Is this the basis of a solution to that maybe?


No. Profile first. Make sure you've tried tweaking params like batch sizes.


What do you think are the current problems contributing to your slow transfers?


Window and buffer size is a problem on high latency links.


Why do you suspect a user space implementation of TCP would improve those issues beyond existing kernel implementations?


not enough detail here to provide a good answer, but I can tell you explicitly that if you're using SMB you're likely not going to get good performance here even if your network stack has tons of space to overcome bdp and congestion challenges.


it's a solution looking for a problem


It's an engineering challenge and they do solve a problem, it's just not your problem :) It's a nice read anyways.


gVisor definitely solves a problem for me: https://news.ycombinator.com/item?id=39900329


tl;dr Increased TCP receive buffer size, implemented HyStart instead of traditional TCP slow start in gVisor's netstack, changed an in-process packet queue from drop-when-full to block-when-full.
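
For anyone unfamiliar with the last item, here is a minimal Go sketch of the difference between a drop-when-full and a block-when-full queue, using a buffered channel as a stand-in. This is not the gVisor or Coder code; `trySend`, `dropDemo`, and `blockDemo` are hypothetical names for illustration only.

```go
package main

import "fmt"

// trySend enqueues pkt only if the queue has room and reports whether
// it was accepted. A full queue means the packet is dropped.
func trySend(q chan<- int, pkt int) bool {
	select {
	case q <- pkt:
		return true
	default:
		return false
	}
}

// dropDemo offers n packets to a queue that nobody is draining and
// returns how many were accepted.
func dropDemo(capacity, n int) int {
	q := make(chan int, capacity)
	accepted := 0
	for i := 0; i < n; i++ {
		if trySend(q, i) {
			accepted++
		}
	}
	return accepted
}

// blockDemo sends n packets with a plain blocking send, so the producer
// waits whenever the queue is full, and returns how many were received.
func blockDemo(capacity, n int) int {
	q := make(chan int, capacity)
	go func() {
		for i := 0; i < n; i++ {
			q <- i // blocks until the consumer makes room
		}
		close(q)
	}()
	received := 0
	for range q {
		received++
	}
	return received
}

func main() {
	fmt.Println("drop-when-full accepted:", dropDemo(4, 10))
	fmt.Println("block-when-full received:", blockDemo(4, 10))
}
```

In the drop case a burst larger than the queue loses packets, which TCP sees as loss and responds to by slowing down. In the block case the producer is simply paused until there is room, so nothing is lost inside the process.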



