What they don't teach you about sockets (macoy.me)
412 points by zdw on July 26, 2022 | 113 comments



What really helped me understand and troubleshoot network communications at the socket level was the book by W. Richard Stevens[0]. I think this is because he starts with an introduction to TCP, and builds from there. Knowing TCP states and transitions is important to reading packet dumps, and in general debugging networking.

Also, no matter what platform you're using, there is probably a socket implementation somewhere at the core. You're best off understanding how sockets work, then understanding how the platform you're working with uses sockets[1].

Once I read the W. Richard Stevens book, I was able to read and understand RFCs for protocols like HTTP to know how things should work. Then you're better prepared to figure out if a behavior is due to your code, or an implementation of the protocol in your device, or the device on the other end of the network connection, or some intermediary device.

[0] - http://www.kohala.com/start/unpv12e.html

[1] - https://docs.microsoft.com/en-us/windows/win32/winsock/porti...


I had the privilege of working for Rich as a junior developer and then later with him as a co-author. He was my mentor and friend. Seeing messages like this over 30 years later really makes my day and reminds me how much I miss him.


I eventually decided to do theoretical CS for my research, but all of these books showed me as a student what attention to detail really meant.

In fact, the typesetting of these books also made me look into groff and play around with it.

I vaguely remember that there was a phrack prophile on W. Richard Stevens just around the time he died. Such a great loss.

Thank you for your inspiring books!!


Yes indeed, the typesetting of these books is really awesome. The template is also being used by other authors, notably by Jon Snader in his VPNs Illustrated book, and if it's not obvious enough, the name of the book pays homage to the original TCP/IP Illustrated series [1]. Coincidentally, Jon also wrote 44 tips for TCP/IP network socket programming before the VPN book [2].

[1] VPNs Illustrated: Tunnels, VPNs, and IPsec:

https://www.oreilly.com/library/view/vpns-illustrated-tunnel...

[2] Effective TCP/IP Programming: 44 Tips to Improve Your Network Programs:

https://www.informit.com/store/effective-tcp-ip-programming-...


When I had to write a TCP stack in the 90s, you & Rich had a book ready for me! Thanks for that. :)

He wrote lots of the great, practical, 90s references. Shame he left us so young.


Not to cause offense, but can you indicate his cause of death? This idle question has been in my mind since 2002 or so when I used his books in graduate school. Was surprised to realize he was already dead back then.


According to Wikipedia[1], he died in 1999 at the age of 48, which is so young that I also would like to know what happened. He was apparently healthy and working and just fine a week before his death. But there's a bizarre resistance among some people regarding discussing cause of death. If a President suddenly died, we shouldn't ask about cause of death "out of respect"? I'm a big believer in privacy, but it's silly to not talk about cause of death after someone has passed away. In fact, we should know causes in order to improve all our lives (to be more cautious of heart disease, or suicide, or motorcycle accidents, etc.).

[1] https://en.wikipedia.org/wiki/W._Richard_Stevens


Sometimes though (and I'm not saying it was in this case) the cause of death is perceived as stigmatising, and so is withheld to protect the person's name, or protect the family.

This happened, for example, with Isaac Asimov who died in 1992. He contracted AIDS from a blood transfusion [1]. In the 80s and 90s this was sufficiently stigmatising to warrant being kept quiet.

Today things like drug overdosing and suicide are sometimes considered shameful, and details are suppressed. Or worse; imagine having to answer that your relative was shot by police while shooting kids in an elementary school...

In truth people die of embarrassing things all the time, so in some societies it is impolite to ask.

Again, to be clear I am NOT suggesting that there was anything shameful about this death, I'm making a more general point about the nature of reporting cause of death.

[1] https://en.wikipedia.org/wiki/Isaac_Asimov


Thank you for the post, and I'd like to highlight that the only important driver here is Shame.

We all experience it (some much much much more than others), and yet we never discuss it, by its very nature. It can be debilitating and will naturally lead oneself to dark places.

Shame kills.


His TCP/IP Illustrated book was formative in my career. I wish I had salvaged that company copy when that business shut down - it probably just got thrown out :(


There were many authors in that era who had practically compulsory books. Stevens is one of the few to have two. His Unix Programming book was referred to as the Bible.


I still refer back to APUE often and push my junior folks to read it.


The Linux Programming Interface, by Michael Kerrisk, is an excellent modern update.


We had to debug an issue with ARP during a deploy of docker containers a year or two ago. A nearly complete understanding of the lower layers is not that much knowledge and quite useful at times.


I agree with your main point - a decent working knowledge down-stack really helps with network stuffs.

"nearly complete" strikes me as a whole lot of hubris though - I've spent almost 20 years at a career that can be summed up as "weird stuff with packets on linux", and the only thing that I'm certain of is: every time I think I'm starting to get a handle on this networking thing, the universe shows me that my assessment of how much there is to know was off by at least an order of magnitude.


I once interviewed someone with “Internet guru” on his résumé. I advised him there were only a handful of people I’d consider for that title.


Perhaps it can be "nearly complete" in a Pareto sense. If you know the structure and semantics of IP, UDP, and TCP packets, including their most common few options, and DNS and HTTP, then you know <<10% of what there is to know about networks, but >90% of what you will see in a random pcap file, and >90% of what you need to know to solve a randomly selected networking problem.


i have no idea how networks work, and i have been doing link splitting, bonding, packet scheduling and dabbling in bgp in 2004. The knowledge i've gained since helps immensely when debugging simple networks created by kube-proxy, but i am just petrified by the amount of the iceberg that remains underneath. And to this day i'm slightly perplexed when a udp socket operation in php throws a `connection reset` exception.


"Nearly complete"? No matter how deep your knowledge is, you're only scratching the surface. To pick a small example, there's something like 20 TCP congestion avoidance algorithms alone (many are available in the mainline Linux kernel, and they can be picked depending on the task and network at hand), and I believe it took something like a decade of research, trial and error to solve the bufferbloat problem.

https://en.wikipedia.org/wiki/TCP_congestion_control

https://lwn.net/Articles/616241/
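
As a small illustration of "picked depending on the task", here is a hedged sketch of asking the kernel which congestion control algorithm a TCP socket is using. It assumes a platform where Python exposes `socket.TCP_CONGESTION` (Linux does; many others do not), and the helper name `get_congestion_control` is made up for this example. Where the option is missing or refused, it simply returns None.

```python
import socket

def get_congestion_control(sock):
    """Return the congestion control algorithm name for a TCP socket,
    or None if this platform does not expose TCP_CONGESTION."""
    opt = getattr(socket, "TCP_CONGESTION", None)  # Linux-specific option
    if opt is None:
        return None
    try:
        # Ask for up to 16 bytes; the kernel returns a NUL-padded name.
        raw = sock.getsockopt(socket.IPPROTO_TCP, opt, 16)
    except OSError:
        return None  # option known to Python but refused by the kernel
    return raw.split(b"\x00", 1)[0].decode("ascii")

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(get_congestion_control(s))  # output depends on the system
    s.close()
```

Setting the algorithm would use `setsockopt` with the same option and a name such as `b"bbr"`, assuming the kernel has that module available.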


Only CUBIC and BBR seem to have any meaningful usage, however: https://www.comp.nus.edu.sg/~ayush/images/sigmetrics2020-gor...


I tried out BBR on a system that had a ton of packet loss. It made a tremendous difference!


> No matter how deep your knowledge is, you're only scratching the surface.

I understand this is just emphasis, but no, it's not magic, it's not innate ability, it's just software man! If you have dug deep enough, and understood it, that's it. Key phrase is IMO 'understood', but that's universal.


I think the point is that it may be impossible for a single human to have "nearly complete understanding" of how the networking stack works. But maybe what was meant was nearly complete understanding of the fundamentals. That's certainly achievable. But networking in the kernel is a beast of a thing with specialists in small parts of it. And I don't think there's a single human who knows nearly all that those specialists know combined.


for those unfamiliar with C, if anyone can recommend sufficient reference material to understand the C examples presented in the book, that would be helpful


Some languages, like Python[0], have a low-level socket library that maps pretty closely to the C API used in the book.

[0] https://docs.python.org/3/library/socket.html
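
To show how close that mapping is, here is a minimal sketch of a loopback echo exchange where each Python call is annotated with the C call it corresponds to. The function names `echo_once` and `run_echo_demo` are invented for this example, and it assumes loopback networking is available.

```python
import socket
import threading

def echo_once(listener):
    """Accept one connection and echo bytes back until the peer sends EOF."""
    conn, _addr = listener.accept()            # C: accept()
    with conn:
        while True:
            data = conn.recv(4096)             # C: recv()
            if not data:                       # b"" means the peer closed its side
                break
            conn.sendall(data)                 # C: send() in a loop

def run_echo_demo(payload):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # C: socket()
    listener.bind(("127.0.0.1", 0))            # C: bind(); port 0 = OS picks one
    listener.listen(1)                         # C: listen()
    port = listener.getsockname()[1]           # C: getsockname()
    server = threading.Thread(target=echo_once, args=(listener,))
    server.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))        # C: connect()
    chunks = []
    with client:
        client.sendall(payload)
        client.shutdown(socket.SHUT_WR)        # C: shutdown(fd, SHUT_WR)
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    server.join()
    listener.close()                           # C: close()
    return b"".join(chunks)

if __name__ == "__main__":
    print(run_echo_demo(b"hello, sockets"))
```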


You’re opening a can of worms but Kernighan and Ritchie wrote the canonical book on C and you can't go wrong with that.


It's worth stressing how different the K&R book is from other books that introduce a programming language. Usually those sorts of books are fairly long and you end up with a decent but incomplete understanding.

With K&R, it's quite short, it's an easy going read, and yet when you get to the end you've covered the whole of C (even the standard library!).


Let's be controversial. K&R is lovely, I learned from it, everything good everyone says about it is true and K & R are both beautiful, sexy, clever, kind and wondrous men.

If you are learning programming (and who among us isn't?), the code examples stink to high heaven. No really. Don't program like that! K&R didn't! If you're learning programming and C it will teach you a bunch of really bad habits /because/ it is a book for experienced programmers with fully developed taste and the examples were designed to illustrate a particular facet of the language as pithily as possible. It's great for that.

I realize any criticism of the sacred texts will result in inspiring revulsion and possibly bile among those who think of such things as above criticism and any commentary is about the shortcomings of the commentator not the holy book but there it is... It's 2022 and I'm old enough to take it. ;-)

Alistair Moffat "Programming, Problem Solving and Abstraction" is one I found that seems better to learn C programming. Maybe other people have better recommendations than that..?

Separate to that "The C Puzzle Book" is superb when you're learning to really thoroughly understand pointers, arrays and so on. Koenig's "C Traps and Pitfalls" isn't bad for some of the dark corners.


I didn't give K&R that praise because it's a sacred text and I'd feel guilty or embarrassed to say otherwise. I said all that because that's how it worked for me: I bought the book, read the whole thing (in far less time and with far less effort than I expected), and at the end of it all I understood C.

I'm exaggerating, but only a little. I did read some more about C afterwards and I wouldn't claim to be a C master. But I never came across any new concepts once the book was complete, or had to fundamentally rework my existing understanding of any concepts. (Not like C++... Oh, there are templates that take templates as template parameters? Or, oh, rvalue references aren't any different to normal references once overload resolution has completed? etc.)

Yes I agree it caters for people that already have a fairly good understanding of coding.


The most confusing part should be “inheritance” of sockaddr structs, afair (the rest should read as if it was javascript). If this is it, then be sure that no magic happens there, structure quirks matter much less than the values they are filled with.
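
A rough sketch of what that "inheritance" amounts to: every sockaddr variant starts with the same address-family field, and generic code reads only that field to decide how to interpret the rest. The layout below assumes the Linux `struct sockaddr_in` (BSD-derived systems put a one-byte length first), and the helper names are made up for illustration.

```python
import socket
import struct

def pack_sockaddr_in(ip, port):
    """Build the 16 bytes of a Linux-style struct sockaddr_in (assumed layout)."""
    return (
        struct.pack("=H", socket.AF_INET)  # sin_family: shared "base" field, host order
        + struct.pack("!H", port)          # sin_port: network (big-endian) order
        + socket.inet_aton(ip)             # sin_addr: 4 bytes, network order
        + bytes(8)                         # sin_zero: padding up to sizeof(sockaddr)
    )

def family_of(sockaddr):
    """Read only the leading family field, the way generic code treats a sockaddr*."""
    (family,) = struct.unpack_from("=H", sockaddr, 0)
    return family

if __name__ == "__main__":
    raw = pack_sockaddr_in("127.0.0.1", 8080)
    print(len(raw), family_of(raw) == socket.AF_INET)
```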


curious whether the book helps w/ understanding routing and firewalls?


Yes. To really understand routing and firewalls, I'd suggest implementing a firewall for yourself using Linux or OpenBSD. I call out OpenBSD because when I went through that process years ago, OpenBSD was lightweight from resource and complexity perspectives, so I could focus on network configuration. There might be better options with Linux these days, it's been a long time since I looked.


like at the kernel level? even if not, seems like a such a great exercise.

i guess i'm also interested in routing... any tips on creating a network of computers to experiment with? I guess ideally, you'd have real hardware to assemble. what's the next best option? VMs? jails w/ vnet?


It depends a bit on what you want to learn exactly. E.g. do you want to just learn about basic routing, BGP, MPLS, certain vendor specific stuff (like Cisco) etc. If you mostly just want to try out what you can achieve on Linux you can get quite far with doing things via network namespaces (+whatever software you want to use, e.g. BIRD or Quagga) these days. A big pro of doing things via namespaces is that you can set them up & tear them down a lot quicker compared to VMs or physical hardware & scripting those steps is a lot easier. VMs & physical hardware provide more ways to do things. At least in the past there were various limitations on what you could emulate with Cisco VMs for example (especially on some more advanced Nexus features).


Back then, I put OpenBSD on a spare P133 with two network adapters to do something like this: https://www.openbsd.org/faq/pf/example1.html

VM and/or containers make sense today. I've got an old Mac Pro tower I found on Craigslist in the basement. 64G of RAM for it was less than $150.

Since Cisco offers virtual versions of their routers, you can look at what people are doing to build home labs for practicing for Cisco certifications.

The homelab subreddit has a lot of interesting examples of peoples' setups as well.


Very yes on both, especially with firewalls.


One of the big issues with TCP is a lot of communication isn't well-suited to the stream model and is message-oriented, so yes, like the author says, you have to go implement ACKs. And then you want multiplexing, which streams also fail at. Before you know it, you built a worse version of HTTP2.
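
For anyone who hasn't hit this yet, the first step down that road usually looks something like the sketch below: length-prefixed framing, so that message boundaries survive the byte stream. The 4-byte big-endian header is an arbitrary choice for this example, not any standard, and the function names are made up.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian payload length

def send_msg(sock, payload):
    """Send one message as <length><payload>."""
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_exact(sock, n):
    """Read exactly n bytes; a single recv() may return fewer than asked for."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed in the middle of a message")
        buf += chunk
    return bytes(buf)

def recv_msg(sock):
    """Receive one whole message, however the stream happened to be chunked."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)

if __name__ == "__main__":
    a, b = socket.socketpair()
    send_msg(a, b"first")
    send_msg(a, b"second")
    print(recv_msg(b), recv_msg(b))
    a.close()
    b.close()
```

Note that this only restores boundaries. ACKs, retries and multiplexing are all still on you, which is the point of the comment above.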


And then you realize that sometimes it has become even worse due to tcp head of line blocking and you move to udp and build a worse version of http3


I think Head of Line blocking is a more concrete problem but it seems that it's often encountered when running away from a different one:

If you have different conversations going on between a single client and the server, you can make several connections to do it. You'll pay a little bit more at the start of the conversation of course, so be frugal with this design, and think about ways to 'hide' the delay during bootstrapping. But know that with funky enough network conditions, it's possible for one connection to error out but the other to continue to work.

The problem for me always comes back to information architecture, and the pervasive lack of it, or at least lack of investment in it. If you really have two separate conversations with the server, losing one channel shouldn't make the whole application freak out. But we all know people take the path of least resistance, and soon you have 2.5 conversations on one channel and 1.5 on the other.

The advice here is some of the same advice in more sophisticated treatises on RESTful APIs (it's all distributed computing, it's all the same problem set arranged in different orders). REST calls are generally supposed to be stateless. The client making what looks like a non sequitur call to the backend should Just Work. If you manage that, then having one channel inform the client of the existence of a resource and fetching that resource out of band isn't really coupling of the two channels. That subtlety is lost on some people who defend their actual coupling by pointing out that other people have done it too, when no, they really haven't. And if anything, multiplexing lets them get away with this bad behavior for much longer, allowing the bad patterns to become idiomatic.


sctp has been available on Linux (and other bsd-socket userlands, with sctp over udp) for some time now. HTTP3 fixes the one sctp limitation, which is encryption of the whole session (and not just individual streams). At least that's what HN told me last time I said 'but sctp!' :-)


Lots of things are still limited to tcp/udp only, such as traffic analyzers, home router port forwarding and cloud firewall rules.

If you want your networking software to be widely used, you probably don't want to limit yourself by using sctp.


Yes but sctp can run over udp, that's the only way I ever used it. And QUIC has bitten the bullet of using udp over the Internet, so it was possible. Just feels like a missed opportunity, but now that QUIC is here, there's not much more to say. There's lots of fancy things with sctp that I always found magic and it defines the base of your architecture. Having a multi-homing, multi-path transport API alleviates many headaches.


Naive question, but doesn't QUIC solve those problems?


HTTP2 solves these problems. HTTP3 (QUIC) solves the head of line blocking problem of TCP. That is, a packet getting corrupted or lost causes TCP to hold back every later packet from the application until that packet can be successfully redelivered. So you may be multiplexed in your messages, but you end up with a slowdown and backlog of data that has been successfully received and could have been interpreted but for the lost packet.
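
A toy model of that difference, with no real networking involved: two logical streams A and B are interleaved on the wire and one packet is lost. With a single ordered byte stream (TCP-like), nothing after the gap is handed to the application. With independent per-stream ordering (QUIC-like), only the stream that lost a packet stalls. All names here are invented for the illustration.

```python
def deliverable(arrived):
    """Given the set of sequence numbers that arrived, return the in-order
    prefix (0, 1, 2, ...) that can be handed to the application."""
    out = []
    seq = 0
    while seq in arrived:
        out.append(seq)
        seq += 1
    return out

def single_stream(wire, lost_index):
    """TCP-like: one sequence space shared by everything on the connection."""
    arrived = {i for i in range(len(wire)) if i != lost_index}
    return [wire[i] for i in deliverable(arrived)]

def per_stream(wire, lost_index):
    """QUIC-like: each logical stream has its own sequence space."""
    arrived = {}
    for i, (stream, seq) in enumerate(wire):
        arrived.setdefault(stream, set())
        if i != lost_index:
            arrived[stream].add(seq)
    return {stream: deliverable(seqs) for stream, seqs in arrived.items()}

if __name__ == "__main__":
    # (stream, per-stream sequence number), in the order sent on the wire
    wire = [("A", 0), ("B", 0), ("A", 1), ("B", 1), ("A", 2), ("B", 2)]
    print(single_stream(wire, lost_index=2))  # everything after the lost A1 waits
    print(per_stream(wire, lost_index=2))     # B is unaffected by A's loss
```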


SCTP solved them. We're just rarely allowed to use it because the internet is broken.


Yes, it's a bummer that we have to encapsulate it in UDP, but when using it like that, it works quite well in my experience.


QUIC is HTTP/3 more or less and solves the same problems, yes.


Somewhat. Http/3 is a protocol that runs on top of QUIC. Every http/3 request uses a dedicated QUIC stream. However it is possible to use QUIC for other use cases that have nothing to do with HTTP. QUIC is more of a multiplexed TCP with mandatory encryption.


I'd also recommend MQTT if you want message based networking with automatic retries, etc.


If you insist on passing messages, use a well-designed message queue service and don't rebuild the wheel (but just a little shittier).


> use a well-designed message queue service

There is a faq about ZeroMQ written in the Monty Pythonesque fashion of "other than that, what does ZeroMQ do for us?"

I can't quite find it in my bookmarks, but it went by the "so it gives you sockets", "but also message batching" ... "but other than batching, what does it give us?".

Also, the whole problem with using just TCP is that often it needs kernel level tuning - like you need to fix INET MSL on some BSD boxes to avoid port exhaustion or tweak DSACK when you have hanging connections in the network stack (like S3 connections hanging when SACK/DSACK is on).

A standard library is likely to have bugs too, but hopefully someone else has found it before you run into it.


I'm pretty hesitant to use ZeroMQ for anything anymore. I was digging into what using ZeroMQ might look like in Rust and ran across a pretty interesting issue that eventually made me decide to drop it for AMQP - https://github.com/jean-airoldie/libzmq-rs/issues/125#issuec...


I think the situation is more subtle than the poster admits.

No, ZeroMQ and successors do not tell you about socket state. You can't detect disconnection or reconnection. But then if a TCP connection fails in some way that does not lead to disconnection (packets getting dropped, remote machine powers down), it can't possibly tell you about that either, but you still need to deal with it. So in any case, you need some sort of application-level error detection and recovery; you need heartbeats, and serial numbers in messages, and a protocol for explicitly restarting a connection and performing the initial handshake. And once you have that, explicit connection events from ZeroMQ are much less important.
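
As a sketch of the application-level bookkeeping this implies (heartbeats plus serial numbers), something like the following is usually enough to notice both a silent peer and a gap in the message sequence. The class name, return values and timeout are all assumptions for the example, and time is passed in explicitly to keep it deterministic.

```python
class PeerMonitor:
    """Track liveness and message ordering for one peer (illustrative only)."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence before the peer is presumed dead
        self.last_seen = None       # timestamp of the most recent message or heartbeat
        self.expected_seq = 0       # next serial number we expect

    def on_message(self, seq, now):
        """Record a message; report whether it was in order, a repeat, or after a gap."""
        self.last_seen = now
        if seq == self.expected_seq:
            self.expected_seq += 1
            return "ok"
        if seq < self.expected_seq:
            return "duplicate"
        # Something between expected_seq and seq never arrived: the caller
        # should trigger whatever resync or restart handshake the protocol defines.
        self.expected_seq = seq + 1
        return "gap"

    def is_alive(self, now):
        """True if we heard from the peer within the timeout window."""
        return self.last_seen is not None and now - self.last_seen <= self.timeout

if __name__ == "__main__":
    mon = PeerMonitor(timeout=5.0)
    for seq, now in [(0, 1.0), (1, 2.0), (3, 3.0)]:
        print(seq, mon.on_message(seq, now))
    print(mon.is_alive(now=10.0))
```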

Admittedly, given that this is a TCP transport, reporting reconnections would still be useful, because TCP won't ever drop messages from the interior of a sequence itself (if it delivers 15, it has delivered 1 - 14 already), so you shouldn't need the serial numbers.

And if it's really not possible to detect authentication failures, then that seems rubbish. And it seems that is indeed the case: https://github.com/zeromq/libzmq/issues/3505


A message queue service, or an RPC stack, adds a tremendous amount of overhead to a system. This is part of why computers are 200x faster than they were 20 years ago, but the performance feels the same.

HTTP works reasonably well on TCP, but a lot of what we want to do is better suited to a reliable UDP protocol. Unfortunately, routers often balk at UDP packets, so TCP it is.


> How you handle a file no longer existing vs. a socket disconnection are not likely to be very similar. I'm sure I'll get counter arguments to this,

That's my cue! I think the "everything is a file" is somewhat misunderstood. I might even rephrase it as "everything is a file descriptor" first, but then if you need to give a name to it, or ACL that thing, that's what the file-system is for: that all the objects share a common hierarchy, and different things that need names can be put in say, the same directory. I.e., that there is one hierarchy, for pipes, devices, normal files, etc.

I'd argue that the stuff that "is" a file (named or anonymous) or a file descriptor is actually rather small, and most operations are going to require knowing what kind of file you have.

E.g., on Linux, read()/write() behave differently depending on what the descriptor refers to: on an eventfd, for instance, read() requires a buffer of at least 8 bytes.

Heck, "close" might really be the only thing that makes sense, generically. (I'd also argue that a (fallible¹) "link" call should be in that category, but alas, it isn't on Linux, and potentially unlink for similar reasons — i.e., give and remove names from a file. I think this is a design mistake personally, but Linux is what it is. What I think is a bigger mistake is having separate/isolated namespaces for separate types of objects. POSIX, and hence Linux, makes this error in a few spots.)

But if you're just writing a TCP server, yeah, that probably doesn't matter.

> and that you should write your applications to treat these the same.

But I wouldn't argue that. A socket likely is a socket for some very specific purpose (whatever the app is, most likely), and that specific purpose is going to mean it will be treated as that.

In OOP terms, just b/c something might have a base class doesn't mean we always treat it as only an instance of the base class. Sometimes the stuff on the derived class is very important to the function of the program, and that's fine.

¹e.g., in the case of an anonymous normal file (an unlinked temp file, e.g., O_TMPFILE), attempting to link it on a different file-system would have to fail.


In the early 90s I was working on AppleTalk-PC interoperability SW, which included "ATP", the AppleTalk Transaction Protocol, which is kind of a guaranteed delivery protocol. It even had "at least once" and "exactly once" options to make sure state wouldn't get messed up in client-server applications: https://developer.apple.com/library/archive/documentation/ma...

I was kind of sad to see that things like this fell by the wayside in favor of streaming protocols but of course it's way more efficient to not handle each transaction individually.

The problem of users implementing TCP++ on top of TCP is a real problem, I think.


I just want to point out that "exactly-once" is not physically possible. The documentation claims "exactly-once", and proceeds to explain that after a while, it will timeout. Which makes it "at-most-once", as expected.

As far as I know; in communication; you can only have at-least-once (1..N) or at-most-once (0..1).


According to the docs, the transaction release timeout is specified by the requester. The requester also specifies the duration to wait between retries and the maximum number of retries. So this does indeed appear to be exactly-once as long as the sender ensures the transaction release timeout is greater than the total time (retry timeout * max retries) plus any network delay.
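
A hedged sketch of the mechanism being argued about, not of ATP itself: the responder remembers each transaction ID and its response until a release timeout expires, so a retried request is answered from the cache instead of being executed again. The timeout condition from the comment above is written out as a helper. All names and numbers are made up for illustration.

```python
class Responder:
    """Execute each transaction once while its ID is still remembered."""

    def __init__(self, handler, release_timeout):
        self.handler = handler
        self.release_timeout = release_timeout
        self.cache = {}  # transaction id -> (response, expiry time)

    def handle(self, tid, request, now):
        # Forget transactions whose release timer has expired.
        self.cache = {t: v for t, v in self.cache.items() if v[1] > now}
        if tid in self.cache:
            return self.cache[tid][0]        # retry: replay the old response
        response = self.handler(request)     # first sighting: actually execute
        self.cache[tid] = (response, now + self.release_timeout)
        return response

def release_timeout_is_safe(release_timeout, retry_interval, max_retries, max_network_delay):
    """The condition from the comment: the responder must remember the
    transaction for longer than the requester could still be retrying."""
    return release_timeout > retry_interval * max_retries + max_network_delay

if __name__ == "__main__":
    calls = []
    r = Responder(lambda req: calls.append(req) or len(calls), release_timeout=30.0)
    r.handle(tid=7, request="debit", now=0.0)
    r.handle(tid=7, request="debit", now=4.0)   # a retry inside the window
    print(len(calls))
```

If the condition does not hold, a late retry arrives after the ID has been forgotten and the request runs a second time, which is the failure the parent comment is worried about.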


> As far as I know; in communication; you can only have at-least-once (1..N) or at-most-once (0..1).

If you hash-chain all of your messages, you can trivially and reliably have exactly-once semantics.

Just like a blockchain.

Each message includes its predecessor hash, so if you get a message that you don't have the predecessor for, you can just request it. ACKs become "this is the latest message hash for which I have all of the predecessors" and can be done lazily, i.e. every nth message, or after not receiving a message in a while.
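
A minimal sketch of the receiving side of such a scheme, assuming SHA-256 and an all-zero genesis hash (both arbitrary choices for this example, as are the names). Messages are applied only once their predecessor has been applied, duplicates are ignored, and the current head doubles as the lazy ACK value. Requesting missing predecessors and any cryptographic authentication are left out.

```python
import hashlib

GENESIS = bytes(32)  # assumed starting hash for an empty chain

def make_message(prev_hash, payload):
    """Build a message whose hash covers its predecessor's hash and its payload."""
    digest = hashlib.sha256(prev_hash + payload).digest()
    return {"prev": prev_hash, "payload": payload, "hash": digest}

class ChainReceiver:
    def __init__(self):
        self.head = GENESIS     # latest hash for which all predecessors are applied
        self.applied = []       # payloads applied so far, in chain order
        self.seen = set()       # hashes already applied, for duplicate detection
        self.pending = {}       # prev_hash -> message waiting for its predecessor

    def receive(self, msg):
        """Accept a message in any order; return the current head (the ACK value)."""
        expected = hashlib.sha256(msg["prev"] + msg["payload"]).digest()
        if expected != msg["hash"]:
            raise ValueError("hash mismatch: corrupt or forged message")
        if msg["hash"] not in self.seen:
            self.pending[msg["prev"]] = msg
        # Apply every buffered message that now connects to the head.
        while self.head in self.pending:
            nxt = self.pending.pop(self.head)
            self.applied.append(nxt["payload"])
            self.seen.add(nxt["hash"])
            self.head = nxt["hash"]
        return self.head

if __name__ == "__main__":
    m1 = make_message(GENESIS, b"a")
    m2 = make_message(m1["hash"], b"b")
    m3 = make_message(m2["hash"], b"c")
    rx = ChainReceiver()
    for m in (m1, m3, m1, m2):   # out of order, with one duplicate
        rx.receive(m)
    print(rx.applied)
```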

Source: Me, built a system that worked like this in 2014 (over UDP). Worked great, it had some similarities to how QUIC is designed (including the crypto).


Was there a driver or library for ATP or were you implementing that on PCs?


No, we were the ones writing the PC drivers, for DOS, Windows and OS/2. Back then the lower layers were more or less standardized but there was no AppleTalk stack so that's what we developed as a product.


> I will also ask: how often are you writing applications that want to accept either files or sockets?

In Go where there is an io.Reader/io.Writer abstraction where blocking interactions are OK and you're absolutely intended to handle all of the errors, it's really no problem at all to use a socket where you'd use a file.

Unfortunately, this only works because the abstraction handles it well. You can't really do custom things with file descriptors, so the amount of useful things you can do by treating sockets as files is quite limited (though it certainly exists.)

(I was going to say "TLS for example" though come to think of it this isn't even strictly true under Linux; https://docs.kernel.org/networking/tls.html )

Still, having TCP sockets and files at the same abstraction layer in general isn't all that bad; you should consider that writes to the filesystem could fail and do take time when blocking. When done right, this makes it much easier for apps to become transparent to the network and other media when it should be possible.


"Unfortunately, this only works because the abstraction handles it well. You can't really do custom things with file descriptors, so the amount of useful things you can do by treating sockets as files is quite limited"

In general, most code has a clear initialization step where it sets up everything it wants to set up, then you can pass it to something that only expects an io.Reader/io.Writer and it can operate. I have a couple of places where one way of getting something does an HTTP request, but another way opens an S3 file for writing, and yet another way opens a local disk file. Each of them has their own completely peculiar associated configuration and errors to deal with, but once I've got the stream cleanly I pass off an io.WriteCloser to the target "payload" function.

If you're doing super intense socket stuff you may need to grow the interface, or even just plain code against sockets directly. But most application-level stuff, even complicated stuff like "Maybe I'm submitting a form and maybe I'm writing to S3 and maybe I'm writing to disk and maybe I'm writing to a multiplexed socket and maybe I'm doing more than one of these things at once" can be cleanly broken into a "initialize the io.Reader/io.Writer, however complicated it may be" phase and a "use the io.Reader/io.Writer in another function that doesn't have to worry about the other details" phase. It is also highly advantageous to be able to pass a memory buffer to the latter function to test it without having to also try to figure out how to fake up a socket or a file or whatever.
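
The same two-phase shape can be sketched in Python, with the caveat from the next paragraph that the wider ecosystem does not always cooperate. The "payload" function below only assumes a binary file-like object with `write()` and `flush()`; whether that is a memory buffer, a socket wrapped with `makefile()`, or a disk file is decided by the caller during setup. The function name `write_payload` is made up for this example.

```python
import io
import socket
import tempfile

def write_payload(out, records):
    """Phase two: write newline-delimited records to any binary file-like object.
    Returns the number of bytes written."""
    written = 0
    for record in records:
        written += out.write(record + b"\n")
    out.flush()
    return written

if __name__ == "__main__":
    records = [b"alpha", b"beta"]

    # Phase one, variant 1: an in-memory buffer (handy for tests).
    buf = io.BytesIO()
    write_payload(buf, records)
    print(buf.getvalue())

    # Phase one, variant 2: a socket, wrapped as a buffered binary writer.
    a, b = socket.socketpair()
    with a.makefile("wb") as writer:
        write_payload(writer, records)
    print(b.recv(64))
    a.close()
    b.close()

    # Phase one, variant 3: a temporary file on disk.
    with tempfile.TemporaryFile() as f:
        write_payload(f, records)
        f.seek(0)
        print(f.read())
```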

People don't write applications that accept either files or sockets because in most languages there is one impediment or another to the process; a system that almost makes file-like objects share an interface but in practice not really, libraries that force you to pass them strings rather than file-like objects, etc. While it isn't attributable to Go qua Go, by getting it right in the standard library early Go legitimately is really good at this sort of thing, more because the standard library set the tone for the rest of the ecosystem than because of any unique language features. I hear Rust is good too, which I can easily believe. Every time I try to use Python to do this, I'm just saddened; it ought to work, it ought to be easy, but something always goes wrong.


The way I see it, Go did two things that made it work well:

- Have a very simple, fairly well-defined interface for arbitrary read/write streams. This interface needs to have decent characteristics for performance, some kind of error handling, and a way to deal with blocking. Go's interface satisfies all three.

- Have a good story for async in general. It's not really helpful to have an answer for how to deal with I/O blocking if the answer is really crappy and nobody actually wants to use it. A lot of older async I/O solutions felt very much in this camp.

I think that Rust does a pretty decent job, though I'm a little bearish on their approach to async. (Not that I have any better ideas; to the contrary, I'm pretty convinced that Rust async is necessarily a mess and there's not much that can be done to make it significantly better.)

But I think you can actually do a decent job of this even in traditional C++, in the right conditions. In fact, I used QIODevice much the way one would use io.Reader/io.Writer in Go. It was a little more cumbersome, but the theory is much the same. So I think it's not necessarily that Go did something especially well, I think the truth is actually sadder: I think most programming languages and their standard libraries just have a terrible story around I/O and asynchronicity. I do think that the state of the art has gotten better here and that future programming languages are unlikely to not attempt to solve this problem. So at least there's that.

The truth is that input and output is unreliable, limited and latent by the nature of it. You can ignore it for disk because it's relatively fast. But at the end of the day, the bytes you're writing to disk need to go through the user/kernel boundary, possibly a couple times, to the filesystem, most likely asynchronously out of the CPU to the I/O controller, to the disk which likely buffers it, and then finally from the disk's buffers to its platters or cells or what have you. That's a lot of stuff going on.

I think it's fair to say that "input and output" in this context means "anything that goes out of the processor." For example, storing data in a register would certainly not be I/O. Memory/RAM is generally excluded from I/O, because it's treated as a critical extension of the CPU and sometimes packaged with it anyway; it's fair for your application (and operating system) to crash violently if someone unplugs a stick of RAM.

But that reality is not extended almost anywhere else. USB flash drives can be yanked out of the port at any time, and that's just how it goes; all buffers are going to get dropped and the state of the flash drive is just whatever it was when it was yanked, roughly. USB flash drives are not a special case. Hell, you can obviously hotplug ordinary SSDs and HDDs, too, even if you wouldn't typically do so in a home computer.

So is disk I/O seriously that different from network I/O? It's "unreliable" (relative to registers or RAM). It's "slow" (relative to registers or RAM). It has "latency" (relative to registers or RAM). The difference seems to be the degree of unreliability and the degree of latency, but still. Should you treat a `write` call to disk differently than a `write` call to the network? I argue not very much.

I don't really know 100% why the situation is bad with Python, but I can only say that I don't really think it should've been. Of course, hindsight is 20:20. It's probably a lot more complicated than I think.


"I think the truth is actually sadder: I think most programming languages and their standard libraries just have a terrible story around I/O and asynchronicity."

Whenever I post this sort of claim, I try to make it clear that it's not really "Go triumphalism", because I agree with you. It ought to be just as easy in a lot of other languages too. It's not a matter of features, or missing features, or features at all.

C and C++ both have a number of abstractions I've seen on this idea, but they aren't compatible and not universal, so using them is a pain because you pretty much have to adapt everything yourself into whatever you are using. (C is particularly problematic; two libraries or even just two adaptors to some particular IO facility can be API compatible but still not work together properly if they have different ideas about memory ownership. C++ can still get into that problem though my perception is there's a better understanding of the problems, if nothing else. Rust has a huge advantage on that front.) Go had enough leadership that everybody has the same abstraction out of the box, and you end up very encouraged by the community to conform to it unless you have good reasons. Good reasons exist; I've got some things that wrap io.Readers but just can't be io.Readers on their own because the abstraction doesn't fit. But they are the exceptions, and I don't see them often.


> (I was going to say "TLS for example" though come to think of it this isn't even strictly true under Linux; https://docs.kernel.org/networking/tls.html )

TLS isn't that much harder to handle than regular TCP sockets--you're going to need a socket interface that lets you get reader and writer streams, and the extra roundtrips you need for negotiation are handled in the constructor for the TLS socket.

It is more difficult if you want to support more advanced features of TLS, or especially if you want to support something like STARTTLS (negotiate TLS on an already-open socket). But this is already kind of true for sockets in general: the reader/writer abstraction breaks down relatively quickly if you need to do anything smarter than occasionally-flushed streams.


I think my original post was fairly unclear of what I meant.

See, what I meant was like this. File descriptors are an OS abstraction; the "backends" you get are defined in the kernel. You can't really do custom behavior with FDs; for example, in Go, the Go TLS library can open a connection and return an io.Writer, and when you write, it will be symmetrically encrypted, transparent to you, as if the code spoke TLS. But when you're dealing with raw file descriptors, and the read and write syscalls, there's no way to make 'custom' read and write handlers, like you can with programming language abstractions.
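To make the "custom write handler" idea concrete, here is a minimal sketch in Python rather than Go. The class name CompressingWriter and the function save_greeting are made up for illustration, and zlib compression is only a stand-in for what a TLS layer would do; the point is that the caller keeps calling write() while the bytes are transformed underneath.

```python
import io
import zlib


class CompressingWriter:
    """Hypothetical wrapper: anything with write(bytes) goes in, and the
    bytes are compressed on the way through. Compression stands in for
    encryption here; this is not a TLS implementation."""

    def __init__(self, raw):
        self._raw = raw
        self._compressor = zlib.compressobj()

    def write(self, data: bytes) -> int:
        # The caller sees an ordinary write; the transformation is invisible.
        self._raw.write(self._compressor.compress(data))
        return len(data)

    def close(self) -> None:
        # Emit whatever the compressor is still buffering.
        self._raw.write(self._compressor.flush())


def save_greeting(writer) -> None:
    # This function cannot tell whether it is writing to a plain buffer
    # or to a transforming layer; it only relies on write().
    writer.write(b"hello, ")
    writer.write(b"world")
```

With a bare file descriptor there is no equivalent hook: write(2) goes to whatever the kernel put behind the fd.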

(I do acknowledge that you could in fact do some of this with pipes, though I have seldom seen it used this way outside of shell programming. It's kinda missing the point, anyway, since pipes are just another type of fd. You can do pipes on top of Go's abstraction too, but it would be very cumbersome in comparison.)

But as a kind of quirk, Linux actually _does_ support TLS sockets. It will not do the handshake, so you still have to do that in userspace. But if you use the Linux kernel's TLS socket support, it will in fact give you an FD that you can read from and write to directly with transparency, as if it was any other file or socket; you don't have to handle the symmetric cryptography bits or use a separate interface. I think this is rather neat, although I'm not sure how practically useful it is. (Presumably it would be useful for hardware offloading, at least.)


I played around with making a little websocket layer for browser games a while back. I used a Windows tool called "clumsy" to simulate crappy network conditions and was surprised at how poor the results were with just WebSockets, despite all the overhead of being a protocol on top of a protocol. The result is that you need to build a protocol on top of the protocol on top of the protocol if you actually want your messages delivered...


I built a javascript data synchronization library specifically for games

https://github.com/siriusastrebe/jsynchronous

A core part of the library is the ability to re-send data that's been lost due to connection interruption. Absolutely crucial for ensuring data is properly synchronized.


That's because WebSockets are more or less just sockets for web apps. You'll want to use a protocol that deals with messages and their at-least-once delivery, such as MQTT (which can run on top of WebSocket if you need it).


Interesting read. I’m quite curious about where all the initial misperceptions about sockets come from.

I can highly recommend Beej’s guide to network programming: https://beej.us/guide/bgnet/

That together with Linux/BSD man pages should be everything needed, some great documentation there.


I definitely used to think TCP was more “high-level” than it actually is. Yes it does much more than UDP but still, its job is to get a sequence of bytes from A to B. You can tune it for higher throughput or more sensitive flow control but anything concerning message passing, request/response, … is beyond the scope of TCP.


Sure, but from a "high level" or "sockets" perspective, especially as a beginner it shouldn't be something you need to care about. A bit simplified, the basic stuff you need to know is:

1) UDP uses packets/messages (datagrams) which may or may not reach their destination. If a datagram reaches its destination, the data is intact. Normally connectionless.

2) TCP is a stream protocol. There is no packet/message boundary unless you create it yourself (my tip is to do a simple binary TLV (type length value) protocol using say a fixed 4 byte header). Requires a connection to be set up first.

3) Network byte order - really important to read about.

4) Nagle's algorithm (TCP_NODELAY) and SO_KEEPALIVE - those are a couple of things to read about.

5) Start with the simple select() approach to handle the socket activity.

You can then go ahead and get more advanced by doing nonblocking I/O or do blocking I/O with each client in its own thread, figuring out pros and cons for your use case. You can add SSL/TLS on top of your TCP connection etc.

EDIT: The SO_KEEPALIVE part is perhaps least important thing to start reading about. I'm a bit biased due to NAT traversal problems as I wrote a secure remote access solution for a major company several years back, utilising STUN/TURN servers, public key authentication (basically certificate pinning), TLS etc.
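A rough sketch of points 2) and 3) in Python, assuming a header of a 2-byte type plus a 2-byte length (so values are capped at 65535 bytes). That layout and the function names are just one possible choice, not a standard:

```python
import socket
import struct

# Fixed 4-byte header: 2-byte type + 2-byte length. "!" selects network
# byte order (big-endian), so both ends agree regardless of CPU.
HEADER = struct.Struct("!HH")


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; recv() alone may return fewer."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_message(sock: socket.socket, msg_type: int, value: bytes) -> None:
    """Frame one message as type, length, value and send all of it."""
    sock.sendall(HEADER.pack(msg_type, len(value)) + value)


def recv_message(sock: socket.socket):
    """Read one framed message and return (type, value)."""
    msg_type, length = HEADER.unpack(recv_exact(sock, HEADER.size))
    return msg_type, recv_exact(sock, length)
```

The receiver never guesses where a message ends: it reads the 4 header bytes, then exactly as many bytes as the header announces.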


Yes and even at 2) some subtleties start. You can set up a connection, send a chunk of bytes, and close it. If you reach a clean connection close, you can be sure that all your bytes have reached the other side. As soon as you start sending multiple logical messages over a persistent connection and an error occurs, you need to write application logic to figure out where to pick up again after you reconnect. Even if you want to know which parts of your stream have already reached the other side, you need to add logic for that. This “multiple transactions over a persistent connection” may sound really straightforward but it’s not built into TCP itself.


I'm pleased to hear that anyone is still teaching anyone anything about sockets.

The bit about Windows not treating sockets as files made me pause, since Windows does treat so many things as files. After thinking about it some, I suppose it's kernel32 that treats kernel32 things as files. Winsock has a separate history.


Windows has a long and unfortunate history of encouraging programs to extend system behavior by injecting libraries into other programs’ address spaces. You can look up Winsock LSPs and Win32 hooks [0] for examples. This means that programs cannot rely on the public APIs actually interacting with the kernel in the way one would naively imagine: the implementation may be partially replaced with a call into user code from another vendor in the same process. Eww!

So, as I recall, a normal socket is a kernel object, but a third party LSP-provided socket might not be. This also means that any API, e.g. select, that may interact with more than one socket at once has problems. [1]

[0] https://docs.microsoft.com/en-us/windows/win32/api/winuser/n... — see the remarks section.

[1] https://docs.microsoft.com/en-us/windows/win32/api/Winsock2/... — again, see the remarks.


It's complicated. Sockets are mostly interchangeable with filehandles but there are many exceptions. For example, ReadFile() works with sockets, whereas DuplicateHandle() silently corrupts them.

However, there's another problem: overlapped vs non-overlapped handles. socket() always creates overlapped sockets, while WSASocket() can create both types. Overlapped handles can't be read synchronously with standard APIs, which in turn means you can't read() a fd created from an overlapped handle.

Naturally, in their infinite wisdom, Windows designers decided there's no need to provide an official API to introspect handles, so there's no documented way to tell them apart (there are unofficial ways, though). BTW, it's a proof of poor communication between teams in Microsoft, because their C runtime (especially its fd emulation) would greatly benefit from such an API.

It's frustrating. I'm sure that if Windows was an open-source project, that mess would be fixed immediately.


You also have to remember that the WinSock API had to be implemented on cooperatively multitasked Windows 3.x. So some of the weirdness is due to that - async socket IO had to work in that environment.

There’s some story I heard at Build or another conference about them implementing sockets on Windows NT. I think maybe it was Larry Osterman who had to implement them. His boss told him to do it without another ***ing driver, and he tried but couldn’t do it. So the driver implementing sockets was AFD.SYS

I think that’s well behaved with respect to WriteFile, ReadFile etc. One thing they did get right is using sockets with IO completion ports. That was a great design.


I have two technical questions after reading this article.

1) The author writes: <<I expected that I would call e.g. isconnected() on the sockets after accept() tells me something happened to it.>> I cannot find any man page for this function. Google also fails me. Is this a hypothetical function or a non-POSIX call?

2) The author writes: <<On Linux, I also needed to disable signaling on the recv() so that I could handle the connection error inline rather than need to register a signal handler. I opted to add the MSG_NOSIGNAL to both send() and recv() and handle potential disconnect errors at each call.>>

I checked the Linux man page for recv(). The flag MSG_NOSIGNAL does not exist (at least in my version). However, send() supports it. Do I misunderstand?


Author here! Isconnected is a hypothetical.


The following sentence in the article jumped out at me: “The difference is that you are not dealing with the unreliability of UDP like TCP is.” This reads to me like TCP is built on top of UDP, which at one time I thought to be the case, but it’s not. UDP and TCP are both transport layer, built on the internet layer, which is unreliable.


Many of the problems with TCP are solved by QUIC: https://www.chromium.org/quic/

For example, it handles reconnect for you (it's based on UDP, which is connectionless), it even survives an IP address change, and it supports multiple streams over one connection, among other things. There are several high quality libraries which you could use in your code.


This was a good write up but was kind of short of details. It would have been really awesome with some code examples. I love socket programming. Love, like I love going to the dentist.

Having a state machine to handle basic I/O between client/server kind of blows when it comes to Plinko-machine switch statements of arbitrary state enums. Languages like Go or Python allow you to have a way of communicating with the client from the server in a more direct client->server oriented way. Write/Read input, do thing, write, read, do, repeat until you finish ops.

Go is my favorite for this as I can spawn a goroutine to read/write to a socket. Rust is my second favorite for this but it's a bit trickier. Python's twisted framework has something like this with Protocols. I wish C++ had a standard socket implementation (std::network?)

Anyway, this gave me a smile today so thanks.


Here's a fun one: TCP sockets can connect to themselves if you bind to an ephemeral port:

https://sgros.blogspot.com/2013/08/tcp-client-self-connect.h...


Perhaps there should exist a flag for send() that would make it so that it doesn't return until all data in the send() call has been ACKed by the receiving side (with a user configurable timeout).

Of course, it's still not bulletproof. The other side can receive the packets, stuff them in its receive buffer, send an ACK for those packets, and then fail before draining the receive buffer due to an OS crash or hardware failure. But computers and operating systems tend to be much more reliable than networks, so it would still provide a much stronger guarantee of delivery or failure.


There's a difference between ACK of the remote network stack (yeah, we got your packets, they're waiting in line) and ACK of the application (yeah, app X processed your requests composed of 1 or more packets)

Compare with the classic OS optimization for spinning rust hard drives - write system calls will return immediately, but actual write requests to the hardware will occur sometime later. It's assumed most of the time your computer doesn't lose power, but that does happen sometimes, hence journaling.


Much stronger but of limited use still, for the reasons you listed.

You very rarely care that the remote TCP/IP stack has acked the message; you care that the message has been received by the program and processed. You're better off implementing your own acks in those cases, allowing you to report back any errors in that ack as well. Or you don't really care, and can just fire and forget those messages.

And that also allows you to implement a system where you can pipeline messages - waiting for remote acks when allowing just 1 message in flight limits your throughput severely.


UDP packets aren't ACKed by default---it's "best effort" (or "fire and forget," depending upon your view point).


> I found it very confusing that I had to attempt to recv() from a socket and fail in order to even tell that the connection was no longer active. I expected that I would call e.g. isconnected() on the sockets after accept() tells me something happened to it. It does make sense to me now that it's better to have recv() fail and tell me about the disconnect. Otherwise, I might mistakenly assume that if I call isconnected() I am then guaranteed to have a good recv(). By keeping the disconnect tied to recv() failing, I know I need to handle potential disconnects at any recv() invocation. The same goes for send().

After thinking about it for a second, it makes sense since any check of isconnected() followed by a send() or recv() is subject to a race condition of the socket failing after the call to isconnected() but before the subsequent calls. You can never know without trying to send or receive, and even those may only fail "in the middle" of the transmission.
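In Python terms, the "just try it and handle the failure" approach might look like the sketch below. The helper name read_chunk is made up. An orderly close by the peer shows up as recv() returning an empty bytes object, and an abrupt one as an exception.

```python
import socket


def read_chunk(sock, bufsize=4096):
    """Return received bytes, or None once the connection is gone.

    There is no separate "is it connected?" check: the recv() call itself
    is the check, so there is no window for the race described above."""
    try:
        data = sock.recv(bufsize)
    except ConnectionError:
        # Reset or aborted connection (ConnectionResetError and friends).
        return None
    if data == b"":
        # Peer closed its side cleanly.
        return None
    return data
```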


I think that's exactly what the second half of the quoted paragraph is trying to say.


If I could make a meta point, nothing about software is magic (except the fact it works at all). Computers do exactly as they are told and nothing more.

If you’re using some underlying technology you should really know how it works at some level so you can understand what assumptions you can make and which you can’t. TCP doesn’t mean there is never data loss.


Except when literal bugs short out your system, or actual cosmic rays bit flip your data, or any plethora of other things that can cause your "correctly written" code to not function as expected.

That's when the magicians come out and do their thing


This article may as well have been about how writing to file streams does not block until a read occurs. Maybe the documentation (on sockets?!) could be more clear but at some point more words don't help with conceptual understanding.


I very often see "reconnect loops" in various codebases and I wonder are they necessary? Wouldn't the same effect be achieved by for example increasing timeouts or some other connection parameter?


They’re a bit of a feature of the connection-oriented nature of TCP as the other reply mentions. If the server process crashes and restarts for example, the client will be told that its previous connection is not valid anymore. Basically TCP lets client and server assume that all bytes put into the socket after connect()/accept() will end up at the other side in that same order. Each time there is an error that violates that assumption, the connection needs to be explicitly “reset”.


For TCP the state required to maintain the socket in the kernel is invalidated on error and needs to be reset. The only way to do this is to explicitly perform the connection setup again. An extended timeout only delays this process since the remote side will have invalidated its state as well.

UDP packets require no connection but you still might see some sort of re-synchronization code to reset application state which could be called "reconnect".
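For the TCP case, a typical reconnect loop is not much more than the following Python sketch. The function name, the attempt count, and the backoff numbers are arbitrary assumptions. Each attempt creates a brand new socket, since the old one's state cannot be revived.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, first_delay=0.1, max_delay=5.0):
    """Try to establish a fresh TCP connection, backing off between attempts."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            # create_connection() builds a new socket every time.
            return socket.create_connection((host, port), timeout=5.0)
        except OSError:
            if attempt == attempts:
                raise  # give up and let the caller decide what to do
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

Whatever application state rode on the old connection (logins, subscriptions, half-sent messages) still has to be re-established by the caller after this returns.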


> How you handle a file no longer existing vs. a socket disconnection are not likely to be very similar.

Why not? It seems that fopen(3) and fread(3) provide the perfect abstraction for that. The semantics to remove(3) a file that is open are very clear, and they represent exactly what you want to happen when a connection is lost.

I never understood the need for "sockets" as a separate type of file. Why can't they just be exposed as regular files?


"End-to-End Arguments in System Design" (https://web.mit.edu/Saltzer/www/publications/endtoend/endtoe...) is a must read when it comes to application reliability and network programming


Regarding his original expectations of TCP. Even if true, is there much difference between dropped data and data being delivered hours late? I imagine at an app level you would suspect any message that got sent 12 hours ago but kept in a queue.

I imagine if that scenario is Ok you would explicitly use a queue system.


The first bit of real code I ever wrote was an SO_LINGER bug fix for a game that couldn’t restart if users had disconnected due to loss of network.

Then I had to explain it to several other people who had the same problems. Seems a lot of copy and paste went on among that community.


If that was truly the first bit of real code you wrote, it's pretty darned impressive. Comparable to learning to swim by jumping in at the deep end of a pool that's full of sharks, razor blades, and coliform bacteria.


I was proud of it, but really homework assignments tend to be toy problems. As complex as they can get (which really isn’t that complex) they don’t do anything.

It was a known problem, with snippets floating about. Understanding what it did and why it should work took a bit more. Then again, the first technical book I read for myself instead of for class was the Comer TCP/IP books, which hardly anybody did at the time and definitely nobody does now.


Ah man, sockets. Real sockets. Yesterday's kiosk thread jogged my memory about all that but almost all of the network communication on my old kiosk installations was socket based. It was a pleasant way to work.


The title is confusing. I did learn it that way…


It seems like the OP is mostly talking about 'blocking' sockets. Such sockets return when they're ready or there's an error. So send returns when it has passed off its data to the network buffer (or, if the buffer is full, it will wait until it can pass off SOME data). You might think that sounds excellent, but from memory send may not send all of the bytes you pass to it. So if you want to send out all of a given buffer with blocking sockets, you really need to write a loop that implements send_all, keeping a count of the bytes sent and quitting on error.
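Such a loop could be sketched in Python as follows (the name send_all is taken from the paragraph above; Python's own socket.sendall() already does the same job, so this is only to show the mechanics):

```python
import socket


def send_all(sock, data: bytes) -> int:
    """Keep calling send() until every byte has been handed to the kernel."""
    view = memoryview(data)  # slicing a memoryview avoids copying the tail
    total = 0
    while total < len(view):
        sent = sock.send(view[total:])  # may accept only part of the buffer
        if sent == 0:
            raise ConnectionError("socket connection broken")
        total += sent
    return total
```

Note that a successful return only means the bytes reached the local send buffer, not that the peer has read them.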

Blocking sockets are kind of shitty, not gonna lie. The counterpart to send is recv. Say you send an HTTP request to a web server and you want to get a response. With a blocking socket it's quite possible that your application will simply wait forever. I am pretty sure the default 'timeout' for blocking sockets is 'None', so it just waits for success or failure. So a shitty web server can make your entire client hang.

So how to solve this?

Well, you might try setting a 'timeout' for blocking operations, but this would also screw you. Any thread that calls that blocking operation is going to hang that entire time. Maybe that is fine for you -- should you design your program to be multi-threaded and pass off sockets so they can wait like that -- and that is one such solution.

Another solution -- and this is the one the OP uses -- is to use the 'select' call to check to see if a socket operation will 'block.' I believe it works on 'reading' and 'writing.' But wait a minute. Now you've got to implement some kind of performant loop that periodically does checks on your sockets. This may sound simple but it's actually the subject of whole research projects to try to build the most performant loops possible. Now we're really talking about event loops here and how to build them.
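One pass of such a loop, sketched in Python with the standard select module. The helper name poll_readable and the 4096-byte read size are arbitrary; a real loop would call this repeatedly and also watch for writability.

```python
import select
import socket


def poll_readable(socks, timeout=0.1):
    """Ask select() which sockets have data, then read only from those.

    Returns {socket: bytes}; an empty bytes value means that peer closed."""
    results = {}
    readable, _, _ = select.select(socks, [], [], timeout)
    for s in readable:
        # select() reported this socket as readable, so recv() returns
        # immediately instead of blocking the whole thread.
        results[s] = s.recv(4096)
    return results
```

Sockets with nothing to read are simply skipped, so one slow peer no longer stalls the others.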

So how to solve this... today... for real-world uses?

Most people are just going to want to use asynchronous I/O. If you've never worked with async code before: it's a way of doing event-based programming where you can suspend execution of a function if an event isn't ready. This allows other functions to 'run.' Note that this is a way to do 'concurrency' -- or switch between multiple tasks. A good async library may or may not also be 'parallel' -- the ability to execute functions simultaneously (like on multiple cores.)

If we go back to the idea of the loop and using 'select' on our socket descriptors, this is really like a poor person's async event loop. It can easily be implemented in a single thread, on a single core. But again -- for modern applications -- you're going to want to stay away from using the socket functions directly and go for async I/O instead.

One last caveat to mention:

Network code needs to be FAST. Not all software that we write needs to run as fast as possible. That's just a fact and indeed many warn against 'premature optimization.' I would say this advice doesn't apply well to network code. It's simply not acceptable to write crappy algorithms that add tens of milliseconds, or even nanoseconds, to packet delivery time if you can avoid it. It can actually add up to cost a lot of money and make certain applications impossible.

The thing is though -- profiling async code can be hard -- profiling network code even harder. A network is unreliable, and to measure the run-time of code you only care about how the code performs when it's successful. So you're going to want to find tools that let you throw away erroneous results and measure how long 'coroutines' actually run for.

Async network code may use non-blocking sockets, select, and poll under the hood. But async libraries are designed to be as efficient as possible. So if you have access to one, it's probably what you want to use!
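For a feel of what that looks like, here is a small sketch using Python's asyncio streams. The coroutine names are invented, and a socketpair stands in for a real network connection so the example is self-contained. Every await is a point where the coroutine can be suspended while others run on the same thread.

```python
import asyncio
import socket


async def echo_one_line(reader, writer):
    line = await reader.readline()  # suspends this coroutine, not the thread
    writer.write(line)
    await writer.drain()


async def request(reader, writer, line, timeout=1.0):
    writer.write(line)
    await writer.drain()
    try:
        # A slow peer costs us one timeout, not a hung program.
        return await asyncio.wait_for(reader.readline(), timeout)
    except asyncio.TimeoutError:
        return None


async def demo(line=b"ping\n"):
    a, b = socket.socketpair()  # stand-in for a client/server connection
    client_r, client_w = await asyncio.open_connection(sock=a)
    server_r, server_w = await asyncio.open_connection(sock=b)
    try:
        # Both coroutines make progress concurrently on a single thread.
        reply, _ = await asyncio.gather(
            request(client_r, client_w, line),
            echo_one_line(server_r, server_w),
        )
        return reply
    finally:
        client_w.close()
        server_w.close()
```

Running asyncio.run(demo()) drives both sides from one event loop; the select/poll bookkeeping is still there, it is just hidden inside the library.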


select(…)?

Why not EPOLLRDHUP?


> If I send() a message, I have no guarantees that the other machine will recv() it if it is suddenly disconnected from the network. Again, this may be obvious to an experienced network programmer, but to an absolute beginner like me it was not.

Uhh, that's 100% obvious. That's why it's not taught.

Of course, I've also goofed on things that are obvious to other people. But, com'on, TCP isn't magic.


It's not 100% obvious. The mental model where send() blocks until recv() on the other side confirms it is coherent: the receiver sends ACKs with bumped ack numbers to acknowledge the bytes it's received, and could delay those ACKs until the application has taken the bytes out of the socket buffer. It doesn't work that way, of course, and shouldn't, but it could.


> Uhh, that's 100% obvious. That's why it's not taught.

Is it? You probably feel that way from knowing how TCP works. But it would be quite straightforward to make it true with a slightly modified version of TCP (that acknowledges all packets, rather than every second one) by having send() block until it receives back the ACK from the receiver. (Yeah this would kill transmission rates, but it would function!) And furthermore, while it would be a terrible idea, the ACK could even be delayed until the end application makes the recv() call for the packet.

To somebody not familiar with the details (and why this would be a terrible tradeoff), something along those lines would be entirely plausible.


Especially since TCP is often introduced like "TCP provides reliable, ordered, and error-checked delivery of a stream of octets (bytes) between applications running on hosts communicating via an IP network" or similar formulations (this one comes from Wikipedia). "Reliable" does seem to imply that all data sent is also received. It's of course impossible to guarantee that.

Actually I think it's even a mistake to call TCP reliable. It's best-effort, and allows you to detect disconnects. That's all it does because that's all it can do.


> Is it? You probably feel that way from knowing how TCP works

No, I knew that from yanking out a network cable long before I even knew what TCP was.


It could be obvious, but no need to be condescending about it.

There are many statements that fit into the category of "obvious once stated, but not obvious if you didn't consider the distinction to begin with".


If it were so obvious we wouldn’t have so many concurrency bugs that appear time and again in new programs. If it’s not network flush it’s file system flush.



