No new code should be using select(). Using it assumes the execution environment keeps all file descriptor values within FD_SETSIZE.
I've personally had to debug and fix crashes due to this assumption at multiple jobs.
Back at Hostway decades ago we had Apache+SSL instances randomly crashing, which turned out to be due to something in a .so opening /dev/random and using select() on the returned file descriptor. This only happened when the machine had enough virtual hosts that the fd returned by that spurious open of /dev/random exceeded FD_SETSIZE.
At Bizanga we had random crashes in their "carrier grade mail server" software, which would link in proprietary modules from vendors like Symantec for realtime virus scanning before even accepting messages for delivery. This stuff ran in-process for performance reasons, just in separate threads. Being "carrier grade" meant thousands of concurrent connections -> thousands of open files. Those vendor-provided plugins would phone home behind the scenes for virus db updates. It turned out they were using select() on the socket opened for the update; if there were enough active sessions at the time, the socket's fd value would exceed FD_SETSIZE and boom: segfault.
People tend to forget that it's the value of the fds you pass to select() which can't exceed FD_SETSIZE, not just the quantity of file descriptors. It could be just one file descriptor, but if it's a high enough value exceeding FD_SETSIZE, it will segfault. The value is used as an index into a fixed size array.
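A minimal sketch of the failure mode, assuming the usual fixed-size fd_set (illustrative only, not code from any of those jobs):

    #include <sys/select.h>

    void watch(int fd)
    {
        fd_set rfds;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);    /* if fd >= FD_SETSIZE this writes past the end
                                 of rfds: undefined behaviour, typically a
                                 corrupted stack and a segfault */
        select(fd + 1, &rfds, NULL, NULL, NULL);
    }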
I ended up making an execution wrapper @ Bizanga to launch everything with >FD_SETSIZE file descriptors already opened once I saw the vendor modules were linking select(). We found numerous crashes this way, without waiting for them to occur in production.
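The wrapper idea, roughly (a hypothetical sketch, not the original Bizanga tool; it burns descriptors past FD_SETSIZE before exec()ing the target so hidden select() users fail immediately in testing):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/select.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct rlimit rl;
        int fd;

        if (argc < 2) {
            fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
            return 1;
        }
        /* make sure the fd limit allows going past FD_SETSIZE */
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur < FD_SETSIZE + 64) {
            rl.rlim_cur = FD_SETSIZE + 64;
            setrlimit(RLIMIT_NOFILE, &rl);
        }
        /* burn descriptors; they are inherited across exec (no O_CLOEXEC) */
        do {
            fd = open("/dev/null", O_RDONLY);
            if (fd < 0) {
                perror("open /dev/null");
                return 1;
            }
        } while (fd < FD_SETSIZE);
        execvp(argv[1], argv + 1);
        perror("execvp");
        return 1;
    }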
> No new code should be using select(). Using it assumes the execution environment keeps all file descriptor values within FD_SETSIZE.
Perhaps it might be time to start treating select() like gets() was treated: the compiler should emit a warning by default whenever it's used, and after perhaps a decade, the API should no longer be available for newly compiled code. While not as bad as gets(), which according to Rusty's classification of API quality (https://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html and https://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html) gets a -10 ("It's impossible to get right"), select() is around a -5 ("Do it right and it will sometimes break at runtime"), or perhaps even a -7 ("The obvious use is wrong").
This makes me wonder if we could realistically start enforcing SystemCallFilter on new services, which would immediately crash an app if it tries to use select(). I suspect there's a lot of them that would fail at the moment... But maybe in a few years.
Add a switch to enable the previous behaviour. If the software is actually maintained, then that switch will be added to the build config. If it isn't, it shouldn't be used in production going forward. Whoever adds the option to the build process takes responsibility if the bugs the deprecation was meant to avoid do in fact affect this code.
Of which the only one left is Windows, where you probably have a completely different event loop anyway. The Windows select by the way does not have the limitations on the file descriptor that the article talks about, because Windows file descriptors are not small consecutive integers.
Actually that’s not totally accurate. The native IO model in the Win32 subsystem, I/O completion ports, is indeed completion based. But the NT kernel does have a scalable readiness model too, limited to sockets, that people have used to build an epoll(). See: https://github.com/piscisaureus/wepoll. This is used by e.g. deno and node.
You might be surprised at how little standards like POSIX actually cover. There are huge gaps in the APIs implemented by e.g. QNX. (I haven't checked if this specifically is one of them but I would be surprised if it isn't.)
…… and at least until 2011, which was when I last checked, select() was the reliable way to do a usleep() on Windows without having to change the system time granularity, set up invisible windows with their own message pump, or resort to a variety of other weird and unreliable methods.
select() on windows is kinda fine. It uses an actual array in fd_set and you can increase the array length by defining a constant or passing a custom structure.
Maybe I am out of date. (I probably am because I haven't written any code that uses select() for two decades.)
Can somebody please identify the POSIX method by which one can wait on a set of descriptors, that does not involve spinning in a polling loop and gobbling up a CPU thread? The last time I had a need for such a thing, select() was the preferred way to do that.
As others mentioned, the POSIX alternative is poll() (and ppoll() as the alternative to pselect() ).
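For the simple cases it looks roughly like this (a sketch; unlike select(), the fd values can be anything the kernel hands out):

    #include <poll.h>
    #include <stdio.h>

    /* wait until one of two descriptors becomes readable, or timeout_ms passes */
    int wait_readable(int fd_a, int fd_b, int timeout_ms)
    {
        struct pollfd fds[2] = {
            { .fd = fd_a, .events = POLLIN },
            { .fd = fd_b, .events = POLLIN },
        };
        int n = poll(fds, 2, timeout_ms);
        if (n <= 0)
            return n;                     /* 0 = timeout, -1 = error (see errno) */
        if (fds[0].revents & POLLIN)
            printf("fd %d readable\n", fd_a);
        if (fds[1].revents & POLLIN)
            printf("fd %d readable\n", fd_b);
        return n;
    }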
If you want both portability and performance (as in, O(1) instead of the O(N) of poll() and select()), there's no portable standard. In that case you're better off with a wrapper library like libuv or libev.
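With libuv, for instance, watching a single descriptor for readability looks roughly like this (a sketch of the uv_poll API from memory; error handling omitted, check the docs before relying on it):

    #include <stdio.h>
    #include <uv.h>

    static void on_readable(uv_poll_t *handle, int status, int events)
    {
        if (status == 0 && (events & UV_READABLE))
            printf("fd is readable\n");
    }

    int watch_fd(int fd)
    {
        uv_loop_t *loop = uv_default_loop();   /* backed by epoll/kqueue/etc. */
        static uv_poll_t handle;               /* must outlive the loop */

        uv_poll_init(loop, &handle, fd);
        uv_poll_start(&handle, UV_READABLE, on_readable);
        return uv_run(loop, UV_RUN_DEFAULT);
    }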
While this is a laudable goal and good for uncovering unexpected bugs, it also means writing for the lowest common baseline. And in this case, the lowest common baseline is unfortunately not very high. poll is what you want, but has its own gnarly gremlins. Better mechanisms are not standardized.
The article mentions one (poll) and doesn't make it clear that it's POSIX. The rest is Linux-specific.
Aside from being a lot more recent (POSIX 2001), poll was also broken on macOS until 10.9. You can always use kqueue instead, but that remains a portability annoyance: Linux still doesn't have kqueue.
> No new code should be using select(). Using it assumes the execution environment keeps all file descriptor values within FD_SETSIZE.
I'm with you here. I'm also guilty of using it still for some CLI tools or other small programs. Just to be clear you don't need to assume all fd values will stay within FD_SETSIZE. Just that the fds you will use with select will stay in that range. Even if there will be lots of fds (more than FD_SETSIZE) for other resources, it works for example to select on stdin if you're not doing funny business reopening it.
There's not much of an excuse to keep using it, though, other than laziness.
Laziness is a pretty good reason to continue to use select. It's a convenient interface for simple tasks. The fire-a-set-of-fds-and-get-a-result model beats the (e)poll/kqueue style of fd-registering API in terms of usability. It's sad there is no alternative with the same simple interface and without the FD_SETSIZE shortcoming.
Edit: I'm biased by Python, which has `select.poll` with an API similar to epoll, and it's really hard to use it instead of select.
> People tend to forget that it's the value of the fds you pass to select() which can't exceed FD_SETSIZE
Is that true? I have a memory of working on systems where you can pass larger arrays in if you want, you just have to allocate your own bitvectors instead of using the convenient struct. A quick look at the linux kernel makes me think that this is true for Linux; the kernel doesn't appear to reference FD_SETSIZE in its implementation.
Some platforms (but not Linux) let you define FD_SETSIZE before including C headers to override the default. The fd_set structure will be made correspondingly larger. On these platforms it should work fine, though there may still be an undocumented kernel limit.
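On those platforms the override is just a matter of (a sketch; 4096 is an arbitrary choice):

    #define FD_SETSIZE 4096      /* must come before any header that pulls in
                                    <sys/types.h> / <sys/select.h> */
    #include <sys/select.h>

    /* fd_set is now 4096 bits wide on platforms that honour the override */
    fd_set big_set;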
Some really old platforms with select() don't even define fd_set or FD_SETSIZE, and you're on your own :-)
Linux added the poll() syscall in 2.1.23, and Linux select() had a kernel-side limit of 256 fds up to 2.1.26, when it raised the kernel-side limit to 5120/10752 depending on architecture. It was 2.5 years until the kernel-side limit was removed, by which point everyone would have had poll() in Glibc.
So if you are going to the effort of writing portable code that uses select() with the trick of defining your own larger arrays, and poll() where it's available, don't bother with the trick on Linux: you'll hit a kernel-side limit anyway on the very old systems that don't have poll().
On FreeBSD's select(2) man page, the NOTES section has the much friendlier
> The default size of FD_SETSIZE is currently 1024. In order to accommodate programs which might potentially use a larger number of open files with select(), it is possible to increase this size by having the program define FD_SETSIZE before the inclusion of any header which includes <sys/types.h>.
Although, if you link to a library which uses select and doesn't check that fds it uses are valid with the FD_SETSIZE it was compiled with, you're still in for a bad time, as suggested in the article. And still, probably use something better than select.
Yep, and if you pass an fd >= 1024, FD_SET will corrupt the stack and your program will crash. I've seen that in production; it was solved by forcing glibc to use epoll.
From what I recall (without looking it up, and relying on wetware memory that's 20 years old), there is a set of macros that are used to initialize, add, and remove descriptors to a file descriptor set. FD_SETSIZE defines the maximum number of descriptors that the set can hold. I could be wrong, but I seem to recall that the implementation uses both an array and a bit-mask. The array holds the descriptors and the bit-mask identifies which array elements are "active".
The argument is just an array of unsigned longs, each bit of which represents one file descriptor — there's no magic in there. The add/remove macros just do the typical shift-and-mask style bit twiddling. You can pass a bigger or smaller array than the one in fd_set if you want; TTBOMK the kernel doesn't care. It just looks at the numfds argument to know how large an array you passed in.
It's correct but outdated: according to https://man7.org/linux/man-pages/man2/poll.2.html#VERSIONS "The poll() system call was introduced in Linux 2.1.23. On older kernels that lack this system call, the glibc poll() wrapper function provides emulation using select(2)."
So as long as you use a Linux kernel made this century, you will be fine.
At this point I believe ppoll has been added to every mainstream Unix except macOS. It's also been accepted for the next POSIX revision.
For many years pselect on macOS didn't actually fix the race condition, notwithstanding its UNIX03 certification. pselect was just a naive wrapper around select that didn't atomically unmask signals. So I ended up having to create a kqueue-based implementation for use on macOS and stragglers like OpenBSD which at that time lacked pselect entirely. See https://github.com/wahern/cqueues/blob/87f14f9/src/cqueues.c... I can't remember if pselect has been fixed on macOS.
The same approach could be used to emulate ppoll on macOS. Though epoll, kqueue, and Solaris ports are so similar that you can easily wrap all of them behind a thin shim. See https://github.com/wahern/cqueues/blob/87f14f9/src/lib/kpoll... (Caution: it's an incomplete refactor of a similar shim I used in some non-open source code. The original code is even simpler and has seen extensive use. The refactored shim should work but is untested.)
AIX is really the only mainstream Unix lacking an API w/ equivalent semantics. AIX has a better-than-poll interface, but it doesn't work anything like the others. Notably, epoll, kqueue, and Solaris ports control descriptors are all pollable in their own right, permitting you to nest different [readiness polling-based] event loop frameworks.
There is a way to simulate it portably, but it's a little complicated. You can use the self-pipe trick portably on POSIX systems I believe. (There is also the sigsetjmp/siglongjmp trick on most platforms, but I remember it breaking on some version of Cygwin.)
The self-pipe trick is where you create a pipe(), add the read side to select(), poll() or other syscall, and have your signal handlers write a byte to the write side. When reading the pipe to flush this state, remember to read until it's empty because there may have been multiple signals. Reading 2 bytes and looping if you get 2 will get this done usually in one syscall.
It's best in principle to make the pipe non-blocking so the writes can never block if the pipe fills. This is basically the portable equivalent to Linux's eventfd. (If you're worried about out-of-memory preventing non-blocking write of a single byte, you can close the pipe instead, but this causes several complications because it's not safe to close the same fd twice, and you may have signals interrupt other signal handlers, so race hazards around closing the fd twice must be protected against by blocking signals inside the signal handler and using more flags.)
You can optimise self-pipe by setting a write-enable flag just before select() etc and clearing it afterwards, with the signal handlers only writing a byte (or closing the pipe) when the flag is set, and setting a separate flag to say that the signal occurred, which is checked between setting the write-enable flag and before the select() etc syscall. Check the signal-happened flag after select() etc and if it's set, read the pipe to flush even if it doesn't show as readable in the fd set, because the signal could have happened later. A further optimisation is for the signal handler to clear write-enable when writing the pipe, so subsequent signals don't do it again. Don't assume this means there can only be one byte in the pipe, because nested signals introduce a race hazard to writing and clearing write-enable atomically.
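A bare-bones sketch of the basic trick, without the write-enable optimisation (the signal and fd choices here are just placeholders):

    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static int sig_pipe[2];                  /* [0] = read end, [1] = write end */

    static void on_signal(int signo)
    {
        int saved = errno;
        char b = (char)signo;
        (void)write(sig_pipe[1], &b, 1);     /* async-signal-safe; may fail if
                                                the pipe is full, which is fine */
        errno = saved;
    }

    static void setup_self_pipe(void)
    {
        struct sigaction sa;

        pipe(sig_pipe);
        fcntl(sig_pipe[0], F_SETFL, O_NONBLOCK);
        fcntl(sig_pipe[1], F_SETFL, O_NONBLOCK);

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_signal;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGTERM, &sa, NULL);
    }

    static void event_loop(int work_fd)
    {
        struct pollfd fds[2] = {
            { .fd = sig_pipe[0], .events = POLLIN },
            { .fd = work_fd,     .events = POLLIN },
        };

        for (;;) {
            if (poll(fds, 2, -1) < 0)
                continue;                    /* EINTR: loop and re-check */
            if (fds[0].revents & POLLIN) {
                char buf[64];
                while (read(sig_pipe[0], buf, sizeof buf) > 0)
                    ;                        /* drain: several signals may queue */
                /* react to "a signal happened" here */
            }
            if (fds[1].revents & POLLIN) {
                /* service work_fd here */
            }
        }
    }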
The sigsetjmp/siglongjmp trick (setjmp/longjmp on platforms without the "sig" versions) is to sigsetjmp() before select(), poll() or other syscall, then set a jump-enable flag (or just a jmp_buf pointer) telling your signal handlers to siglongjmp(), then clear the flag after the select() etc. syscall returns. This is more complicated than self-pipe in a number of ways (and a bit slower), because you have to be careful to prevent nested signal handlers or deal with them appropriately, and it doesn't work reliably on all platforms. So I would stick with the self-pipe trick.
It’s faster than poll, and if you are accept()ing a lot, faster than epoll.
Multiple threads can do better, but only by working together. SO_REUSEPORT+select is faster if you can get away with it. io_uring looks promising though.
> Using it assumes the execution environment keeps all file descriptor values within FD_SETSIZE.
That’s what the rlimit is for. If you call a program that uses select with 1025 open FD you get what you deserve.
>> Using it assumes the execution environment keeps all file descriptor values within FD_SETSIZE.
>
>That’s what the rlimit is for. If you call a program that uses select with 1025 open FD you get what you deserve.
I thought that FD_SETSIZE was a compile-time constant that limited what values for file descriptors could be passed to select.
IOW, it doesn't matter if you have only two files open; if one of them has an fd with a value of 1026, then select will do the wrong thing.
To be more precise on one bit, you can't get fd=1026 without having had at least 1027 open at some point in the past. You can subsequently close a bunch of them in the middle.
A compatibility hack for some applications reliant on badly-done select() might be to set RLIMIT_NOFILE to FD_SETSIZE.
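Roughly (a sketch; this clamps only the soft limit, so the program never gets handed an fd it can't represent in an fd_set):

    #include <sys/resource.h>
    #include <sys/select.h>

    /* run before the select()-reliant code starts opening descriptors */
    int clamp_fd_limit(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
        if (rl.rlim_cur > FD_SETSIZE)
            rl.rlim_cur = FD_SETSIZE;        /* hard limit left untouched */
        return setrlimit(RLIMIT_NOFILE, &rl);
    }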
> Yeah but fds are guaranteed to be opened lowest available on first, so thats ok, you cant get fd=1026 without having 1026 open.
This is exactly correct.
> The Linux syscall does not have a fixed FD_SETSIZE so glibc could increase the size.
The application writer can define FD_SETSIZE before including <sys/socket.h> and set it to whatever they want on BSD (including OSX). glibc could adopt this trick, but my feeling is you probably don't want to stack-allocate this stuff anyway.
Dealing with a large number of very-active connections concurrently efficiently requires more effort than just this though.
This might be a stupid question, but why couldn't select be extended to accept fds >= 1024? I don't see "not segfaulting" breaking existing behavior, and it would solve basically all issues with legacy programs. Yes, it might lead to select being used a bit longer than it otherwise would be, but that would probably be less of a pain than fighting segfaults caused by a `select()` somewhere in a linked library.
You certainly can: the kernel interface doesn't care how big the set is. glibc makes it difficult, but it's easy enough to do on other OSes.
The reason it's not generally done is because the select API has been effectively replaced with better things like kqueue and epoll and poll and ppoll. If you actually want to do something like select and you expect to have a lot of FDs, using something newer than select is a better choice.
Edit: the other thing is that every time you call select, you've got to pass that N-bit set, 1024 bits with the normal value of FD_SETSIZE. I've run on systems with 4M fds (limit set to 8M); that's a huge bitmap to pass and iterate through for every select.
> The reason it's not generally done is because the select API has been effectively replaced with better things like kqueue and epoll and poll and ppoll. If you actually want to do something like select and you expect to have a lot of FDs, using something newer than select is a better choice.
The performance difference between select() and poll() isn't measurable for small sets of descriptors, and for epoll and friends the fixed setup costs are substantially higher. It's therefore not correct to suggest select() has been replaced with better interfaces: poll() is a more complex interface, and epoll (or io_uring or ..) is a much more expensive one.
It's really debatable whether poll() is more complex than select(); for example, select() requires you to rebuild the fd sets from scratch for every call.
select(2) does support fds >= 1024. At least, it does on Linux and many other systems. The issue is that <sys/select.h> defines the type fd_set, a struct containing a fixed-size bit field as an array of integers. The maximum descriptor number (i.e. the bit field size) is reported by the constant FD_SETSIZE. Thus, conceptually fd_set is defined something like this (a sketch; the exact member name and integer type vary between libcs):
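    #define FD_SETSIZE 1024                /* typical default */

    typedef struct {
        /* one bit per file descriptor, FD_SETSIZE bits in total */
        unsigned long fds_bits[FD_SETSIZE / (8 * sizeof(unsigned long))];
    } fd_set;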
<sys/select.h> also provides a set of macros for initializing the structure (FD_ZERO), and testing (FD_ISSET) and setting (FD_SET, FD_CLR) descriptor numbers. But these macros have no way of reporting if a specified descriptor overflows the bit array. In all the implementations I'm familiar with, including Linux/glibc and Linux/musl, they don't even check, and will happily overwrite the array if given a descriptor greater than or equal to FD_SETSIZE.
But the maximum descriptor number in the bit field is reported to the kernel as an argument to select. And most kernels, including Linux, obey this parameter[1], and thus impose no intrinsic limit. If userspace provides a validly allocated and initialized buffer all will work just fine.
But how to allocate and use a properly sized buffer? Some systems, like OpenBSD, allow you to define FD_SETSIZE (or a similar constant) before including <sys/select.h>. On other systems, like modern Linux/glibc or Linux/musl, you have to get clever and do something along these lines (a sketch; the kernel only cares about the nfds argument, so any sufficiently large, properly initialized bit array works):
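    #include <stdlib.h>
    #include <sys/select.h>

    /* A heap-allocated, zeroed bit array sized for fds 0..maxfd, passed to
     * select() in place of an fd_set.  Bits are set by hand because the
     * FD_SET macro may check (or overrun) fds >= FD_SETSIZE. */
    typedef struct {
        unsigned long *bits;
        int            maxfd;
    } big_fd_set;

    static big_fd_set big_fd_alloc(int maxfd)
    {
        size_t nlongs = ((size_t)maxfd + 8 * sizeof(long)) / (8 * sizeof(long));
        return (big_fd_set){ calloc(nlongs, sizeof(long)), maxfd };
    }

    static void big_fd_set_bit(big_fd_set *s, int fd)
    {
        s->bits[fd / (8 * sizeof(long))] |= 1UL << (fd % (8 * sizeof(long)));
    }

    /* usage: select(set.maxfd + 1, (fd_set *)set.bits, NULL, NULL, NULL); */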
but then be careful not to rely on FD_ZERO, which won't know to zero out the high bits.
[1] The same is true of (struct sockaddr_un).sun_path. On most systems you can pass Unix domain socket paths longer than 108 or whatever the local constant is, and the kernel will happily use the full path, though there may be another internal limit, like 256 or 1024. It can do this because syscalls like bind(2) and connect(2) take a separate addrlen parameter. One possibly shocking consequence to many people is that types like `struct sockaddr_storage` are not sufficient to represent every possible socket address. And unless you check the output of the addrlen pointer parameter to syscalls like accept(2), you may have received a truncated address without knowing it.
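For example (a sketch of the addrlen check with accept(2)):

    #include <sys/socket.h>

    /* accept a connection and detect a truncated peer address,
     * e.g. an AF_UNIX path longer than sockaddr_storage can hold */
    int accept_checked(int listen_fd)
    {
        struct sockaddr_storage ss;
        socklen_t len = sizeof ss;
        int c = accept(listen_fd, (struct sockaddr *)&ss, &len);

        if (c >= 0 && len > sizeof ss) {
            /* the address was truncated; ss does not hold the full path */
        }
        return c;
    }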
Many years ago I wrote an article about this very issue (because I ran into it while debugging a service outage at 5 AM, which turned out to be something using select() when it shouldn't have): http://beesbuzz.biz/code/5739-The-problem-with-select-vs-pol...
Ah, relief. This helped me hugely to understand that most resources should actually be thought of as file descriptors as opposed to files. I took the popular literature too literally.
This issue has been driving me crazy and I spent a while a couple of times trying to figure out why the limit was 1024 and if we could bump the default. I got sidetracked and never finished it either time. I also separately knew about the select() issue but hadn't linked the two.
This sounds like a great change in that at least software can opt-in without having to even edit their systemd unit.
All parent processes above you need to first increase their ulimit. Check the blog link in my other post here for details. Changing the daemon's ulimit with systemd is actually super easy with one line: LimitNOFILE=<your-limit>
Yep, I know about that, but I was more specifically interested in the case of not having to make such changes and having it just work (tm). Resource-wise, having a 50,000 or 500,000 default ulimit wouldn't matter. But as we find out here, it matters for compatibility.
But with the systemd change he mentioned (which for Ubuntu exists in 20.04 but not 18.04) - that the hard limit is now 512k - applications can choose to adjust the ulimit from inside the application without running as root. So all modern software can just, if it knows it's OK, make the call during startup to increase the limit to whatever it thinks is needed. Then we don't have to add an override.conf to systemd services every time we hit a scaling limit, or even think about it.
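Something like this at startup, for instance (a sketch of the opt-in; nothing here is systemd API, just plain setrlimit):

    #include <sys/resource.h>

    /* for programs that know they never hand large fds to select() */
    int raise_fd_limit(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
        rl.rlim_cur = rl.rlim_max;           /* soft -> hard, no root needed */
        return setrlimit(RLIMIT_NOFILE, &rl);
    }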
Poettering makes some really valid points in this post; you shouldn't be using select. And the resource limits are weird. But I just want to point out that open fds are a resource hog, and you should also strive to close fds sooner rather than later. Most applications will stay well below 1024 fds, and if they consume more than that, it's probably a bug.
Servers, sure; each connection can easily consume 2-3 fds, that’s 300-500 simultaneous connections - that’s nothing.
Systemd and other service supervisors, sure, with pidfds, timerfds, sockets, proc file sockets and whatever else.
But for your average image editing software, email clients, CLIs, etc., I'd think it's probably better to keep the limit? That way you are saving yourself from badly written software draining too many resources.
My favorite fd limit isn’t this one — it’s the default limit of 256 on Solaris for fopen(), etc. Unless you compile differently you only get 256 in a 32-bit app. Backwards compatibility to the grave...
I have no idea how a man page for a Linux syscall gets defined or updated. What would it take to make the man page clearer on how bad select is, as per the article?
This is definitely a Lennart Poettering post: it advocates for a change that will break existing code, and then says that said code "should probably be considered just buggy".
He's arguing that it already is buggy, and I agree. The fact that select doesn't work on more than 1024 fds is terrible. It's incredibly easy to break things by changing the limits. Bumping FD limits is extremely common. I'm honestly surprised I've never run into this first hand, but perhaps I wouldn't have noticed even if I had.
Sure, Lennart is perhaps the Antichrist of Linux, but I'm pretty positive he's the Antichrist we need.
The long-lived fear of breaking existing code is certainly valid, but at the same time, why not try and see what will break before we write something off. It's often not much. It's practically the only way for us to shed the crufty bits, and boy are there a lot of those.
> The fact that select doesn't work on more than 1024 fds is terrible.
Pedantic clarification, but it's worse than that. It doesn't work on any fd with a value greater than FD_SETSIZE - 1 (1023 with the default FD_SETSIZE of 1024). You could be working with a single fd with a value of 1024 and it would not work.
On the bright side you can extend this limit pretty easily, but it's certainly a good reason to avoid select whenever possible.
I'm not a Linux person, but select requiring help to work beyond FD 1023 isn't new. It also happens on FreeBSD and MacOS, so either you recompile your software (including glibc on linux, apparently) with a bigger FD_SETSIZE, or you use one of the better than select interfaces that doesn't care. If you tuned your FD limit higher, you're already on the hook for knowing about it.
Keeping the default soft limit of 1024 means that programs that don't fiddle with the soft limit can have a working select, and programs that do fiddle with the soft limit should know better. Auto tuning the hard limit to something appropriately sized seems reasonable, but fiddling with the soft limit may end up with this select problem, so it's nice that they didn't fiddle with it.
Also, I'm deeply surprised to be defending Mr. Poettering, but in this case, he's right.
To be clear - he's not advocating for a change to Linux or systemd. He's reiterating an already commonly accepted best practice going forward.
...which unfortunately due to the Linux ecosystem, where people apparently like to inject code into running processes (https://skarnet.org/software/nsss/nsswitch.html) (can't imagine this is brittle at all), can cause your program to break if someone injects a module that doesn't ALSO follow this best practice into your program.
See also: PAM, probably a much more enticing target.
I wonder if something like the eBPF compiler could be repurposed here, to generate lightweight "binaries" for a configuration that can be linked without needing to dynamically link their own dependencies at runtime.
I really hope Poettering decides to fix PAM too. It's well overdue.
IMO PAM should be replaced with a daemon. This way all the code that deals with authentication can be isolated into its own process, and my screensaver doesn't need root access to check my password against /etc/shadow.