So, on my system (Fedora 34, glibc 2.33, GCC 11, Ryzen 5 3600) a trivial program takes only about 1 ms according to `time`:
$ cat trivial.c
#include <stdio.h>
int main(int nargs, char ** args)
{
    FILE * f = fopen("/dev/null","w");
    for (int i = 0; i < nargs; i++)
    {
        fprintf(f,"%s\n",args[i]);
    }
    return 0;
}
$ make trivial # just default CFLAGS
$ time ./trivial
real 0m0.001s
user 0m0.000s
sys 0m0.001s
But if I add `strace -tt`, it does indeed take 9 ms, even if I redirect strace's output:
$ time strace -tt ./trivial 2> /dev/null
real 0m0.009s
user 0m0.005s
sys 0m0.004s
Author here. Strace seems to add more overhead than I realized at the time, but not quite as much overhead as you suggest. For instance, you're measuring the time it takes to run strace itself, while my blog post uses the times from strace's output (i.e. excluding strace's own startup/shutdown costs). The overhead of `strace -tt` output seems much lower than `time strace -tt` on my testbed, but it's still not as good as `perf trace`.
(Also note that this was written in 2015. These days spectre/meltdown CPU mitigations make the kernel <-> userspace switch more expensive, which likely has an even bigger effect on strace. PTRACE_SYSCALL requires a thread-blocking context switch chain like (proc -> kern -> strace -> kern -> proc) for every syscall.)
I've reproduced my test conditions with `perf trace` below. My relative improvements seem to hold, and I stand by my suggestion that a no-libc solution would easily top the leaderboard. My level0-glibc solution placed #6 (3729/4048 = 92% of the #1 score): https://web.archive.org/web/20140210000314/https://stripe-ct...
It's been enough years and I don't have the original test machine. I cobbled together a similar testbed using a faster machine from that era and the same OS base (Ubuntu 14.04 LTS). Here are a handful of `perf trace` runs comparing: glibc dynamic/static empty binaries, my actual level0 solution as both glibc dynamic/static, and my level0 solution under lib43. I also show timings when using /dev/null as a no-op input for each level0 implementation.
Yes, good point, I neglected strace's startup overhead, though it still seems to slow down the program by a factor of 5 compared to `perf trace` (2 ms vs. 0.4 ms on my machine). Indeed, mitigations probably hurt here. Also fun: try `time strace strace -tt ./prog` to watch the wall time explode.
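For reference, the two measurement styles being compared are roughly the following (illustrative invocations, not a reproduction of the numbers above):
$ strace -tt -o trace.txt ./trivial   # per-syscall timestamps end up in trace.txt, excluding strace's own startup
$ perf trace -s ./trivial             # traces syscalls with far less overhead and prints a summary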
I guess glibc's inability to be truly static must make the biggest difference. Compiling my program with -static, I get a 2-3x speedup, but even then callgrind suggests half of the time is spent in _dl_relocate_object, since I guess even a statically linked glibc needs to dlopen some things, which your minimal libc presumably doesn't require at all :).
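For anyone who wants to reproduce that, the static build and callgrind run were presumably along these lines (illustrative; callgrind writes its data to a callgrind.out.<pid> file):
$ gcc -static trivial.c -o trivial-static
$ valgrind --tool=callgrind ./trivial-static
$ callgrind_annotate callgrind.out.<pid>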
So, looks like total syscall overhead is on the order of 0.1 ms? Granted, it's a significant part of the runtime, but I think all the relocations and such are driving the libc overhead (as shown in callgrind output: https://i.imgur.com/Yligh7S.png )
And great, now I know a new way to use perf! (I initially tried profiling with perf but it doesn't get enough samples to be useful!).
Yes, but if there are almost no syscalls, then strace is doing a lot less.
Profiling my trivial program with callgrind shows (unsurprisingly) that the majority of the time is in dynamic library relocation and whatever __GI__tunables_init does: https://i.imgur.com/Yligh7S.png
A couple of years ago I did a "Terminal Video Series" segment that also highlights the runtime overhead of 12 other languages; C, of course, is actually one of the best ones: https://2ton.com.au/videos/tvs_part1/
I was actually surprised at how expensive that is for C; I'd expected maybe a dozen or so syscalls to set up the environment the provided runtime wants, but that's a lot of syscalls for not very much value. Both C++ and Rust are more expensive, but they're setting up lots of stuff I might use in real programs, even if it's wasted for printing "hello" - but what on Earth can C need all those calls for when the C runtime environment is so impoverished anyway?
Go was surprising to me in the opposite direction, they're setting up a pretty nice hosted environment, and they're not using many instructions or system calls to do that compared to languages that would say they're much more focused on performance. I'd have expected to find it closer to Perl or something, and it's really much lighter.
glibc isn't particularly optimized for startup time. It also defaults to dynamic linkage, which adds several syscalls, and even when statically linked it may dynamically load NSS plugins. (musl gets around the latter by hardcoding support for nscd, a service that can cache NSS results.)
C11 adds several things that benefit from early initialization (like everything around threads), but GCC had some level of support for most of them before that.
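Since musl came up: a low-effort way to experiment with it (assuming your distro packages the musl-gcc wrapper, e.g. musl-tools on Debian/Ubuntu) is a fully static musl build of the same trivial program, which sidesteps both the dynamic loader and most of glibc's startup work:
$ musl-gcc -static trivial.c -o trivial-musl
$ time ./trivial-musl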
> I happened to run strace -tt against my solution (which provides microsecond-accurate timing information for syscalls)
Weirdly, I always found strace's timing results at under one millisecond generally unreliable; it just seemed to add too much overhead itself.
Also making system calls from your own code varies between prohibited and badly supported on most OSes. Some see calls that didn't pass through the system libc as a security issue and will intercept them, while Linux may just silently corrupt your process memory if you try something fancy as the Go team had to find out.
josefx> Also making system calls from your own code varies between prohibited and badly supported on most OSes. Some see calls that didn't pass through the system libc as a security issue and will intercept them, while Linux may just silently corrupt your process memory if you try something fancy as the Go team had to find out.
On Linux, making syscalls directly is fine. Good point about other platforms, but many people only care about Linux, for better or worse. And the author's last paragraph (quoted below) suggests using an alternate/static-linking-friendly libc, not making direct syscalls yourself. Presumably on platforms where those alternate libcs aren't available, you continue using the standard libc.
ryanhileman> If you're running into process startup time issues in a real world scenario and ever actually need to do this, it might be worth your time to profile and try one of the alternative libc implementations (like musl libc or diet libc).
IMHO, the Go vDSO problem [1] wasn't due to making direct syscalls but to calling a userspace library without meeting its assumptions. I'd describe Linux's vDSO as a userspace library for making certain syscalls with less overhead. (If you don't care about the overhead, you can call them as you'd call any other syscall instead.) It assumed the standard ABI, in which there's a guard page as well as typically a generous amount of stack to begin with. Go called into it from a thread that used Go's non-standard ABI, with less stack space available (Go's own functions check the stack size on entry and copy the stack if necessary) and no guard page. On some Linux builds (with -fstack-check, apparently used by Gentoo Hardened), it actually used enough stack to overflow. Without a guard page, that caused memory corruption.
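To make the guard-page part concrete, here's a toy sketch (nothing to do with how Go or the vDSO actually manage stacks, just an illustration): put a PROT_NONE page below a small region, and an overflow faults immediately instead of silently scribbling over neighbouring memory.
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    /* Reserve two pages: the lower one becomes the guard, the upper one a tiny "stack". */
    char *mem = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return 1;
    /* Make the lower page inaccessible: any access to it now raises SIGSEGV. */
    if (mprotect(mem, page, PROT_NONE) != 0)
        return 1;
    printf("usable region: [%p, %p)\n", (void *)(mem + page), (void *)(mem + 2 * page));
    mem[page] = 'x';         /* fine: first byte of the usable page */
    /* mem[page - 1] = 'x';     would crash cleanly (that's the guard page)
                                instead of silently corrupting whatever else
                                happened to be mapped below */
    return 0;
}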
Umm, what’s the story with Go on Linux? My understanding is that Linux explicitly supports making syscalls from your own code, it’s just that on platforms where the syscall story is a giant mess (32-bit x86) the prescription is to either use the slow historical interface (INT 80h) or to jump to the vDSO (which will do SYSENTER or SYSCALL or whatever—SYSENTER in particular has the lovely property of not saving the return address, so a common stub is pretty much architecturally required[1]).
If I’m guessing correctly about what you’re referring to, the Go people did something in between, got broken by a Linux kernel release, complained to the kernel people, and the kernel got a patch to unbreak them. The story on macOS or OpenBSD, where the kernel developers will cheerfully tell you to take a hike if you make a syscall yourself, seems much worse to me.
(And yes, I’d say there is a meaningful difference between a couple of instructions in the vDSO and Glibc’s endless wrappers.)
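To make that comparison concrete, here is a minimal sketch of a raw syscall on x86-64 Linux with no libc wrapper involved (illustrative only; the register assignments follow the x86-64 syscall ABI, and the write syscall number is hard-coded as 1 for that architecture):
#include <stddef.h>

/* write(2) via the raw `syscall` instruction on x86-64 Linux:
   rax = syscall number, rdi/rsi/rdx = arguments, rcx and r11 get clobbered,
   and the return value comes back in rax. */
static long raw_write(int fd, const void *buf, size_t len)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(1L),                     /* __NR_write on x86-64 */
                        "D"((long)fd), "S"(buf), "d"(len)
                      : "rcx", "r11", "memory");
    return ret;
}

int main(void)
{
    const char msg[] = "hello from a raw syscall\n";
    raw_write(1, msg, sizeof msg - 1);               /* fd 1 = stdout */
    return 0;
}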
This gets rather over my head, but my best guess is that it’s this bug, which has to do with the Go runtime underestimating how much stack space vDSO code may need (blog post at [0], fix at [1]).
> I built 32 kernels, one for each bit of the SHA-1 prefix, which only took 29 minutes.
Oh, that is evil, thank you. I even encountered a link to this article a couple of months ago[1], but wasn’t hooked enough to go and read it.
Though the conclusion sounds somewhat dubious to me: the kernel doesn’t document or control stack usage limits for the vDSO, they happen to blow way up on a system built with obscure (if arguably well-motivated) compiler options, a language runtime that tries to minimize stack usage crashes as a result, and somehow the fault is with the runtime in question for not going via libc (which happens to virtually always run with a large stack and a guard page, thus turning this insidious concurrent memory corruption bug into a mere extremely unlikely crash)?
More like we’re collectively garbage at accounting for our stack usage. To be fair to the kernel developers, I would also never guess, looking at this implementation of clock_gettime() [2], that you could compile it in such a way that it ends up requiring 4K of stack space on pain of memory corruption in concurrent programs only (it originating in the kernel source tree has little to do with the bug, it’s just weirdly-compiled userspace C code executing on a small unguarded stack).
... And a closer look at one of the Go bugs involved[1] also reveals the kernel build system will now also unconditionally disable the offending stack probing flag[2], as it leaked even into the kernel-space configuration where it would do more straightforward damage.
By the way, that hashing trick somehow reminds me of edge-notched cards[1]. (Not old enough to have used them, old enough to have seen and played with them.)
While this is true, the Go vs. vDSO stack-clash corruption highlights that a C ABI that doesn't specify stack usage is underdefined. A platform's libc can be written such that it's aware of these limitations.
What's kinda neat about musl libc is that the interpreter/runtime linker and libc.so are the same executable, so if you just dynamically link to libc, it doesn't need to open the ld.so cache or locate and load libc.so.
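You can check that on any musl-linked binary; something like the following (the binary name is a placeholder, and paths vary by architecture and distro) prints the requested program interpreter, which on a typical x86_64 musl system is /lib/ld-musl-x86_64.so.1, i.e. the same object that provides libc:
$ readelf -l ./some-musl-binary | grep -i interpreter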
More importantly, this uses the devkitPro/libnx homebrew toolchain, while the OP project uses the official SDK (the "complicated legal reasons" behind the absence of code probably being a signed NDA).
I never quite understood why assembly wasn't part of programming language shootout competitions. C was always treated as the speed of light for computing, which didn't quite make sense to me.
In my experience there's usually not much speed advantage to be had from assembly unless you have a specific thesis about how you're going to do better than the code that a good compiler would generate; e.g. the article author skipping C runtime setup overhead, or Mike Pall dissecting the ARM implementation of one of the bytecodes in LuaJIT (recommended!):
Unless you can think of something you can do in assembly that a compiler just wouldn't be able to think of, their ability to generate good machine code is really quite impressive:
My colleagues and I have developed a technique that addresses a lot of the issues Mike discusses in that link. In this blog article explaining the technique, I included an example that nearly matches Mike's hand-coded assembly from your link: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
In my experience, the weakest point of modern optimizing compilers is the register allocator. As Mike mentioned in his link, in functions with loops that contain complex control flow, the register allocation just falls apart. Too often you see the register allocator forcing a spill on the fast path that was caused by code in a fallback path. You can sprinkle __builtin_expect(cond, 0) on all your guards until the cows come home but the register allocator still sees a big complicated connected graph and optimizes the whole thing together.
The technique described in my link above works around this by basically forcing a specific register allocation at each "operation" boundary, putting fallback paths in separate functions entirely (as much as possible), so that they will be considered totally distinct to the optimizer. This isn't a perfect solution, but it helps a lot.
My overall view is that you can rarely beat the compiler in straight-line code. It's mainly in loops with complex control flow where the compiler's code quality suffers.
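For readers who haven't seen the pattern, here's a heavily simplified sketch of the tail-call dispatch shape described above (not the code from the linked post; it assumes a compiler that supports __attribute__((musttail)) on return statements, e.g. recent clang). Every handler and the dispatcher share one signature, the hot state rides in the argument registers, and fallback paths would live in separate functions.
#include <stdint.h>
#include <stdio.h>

/* Every handler and the dispatcher share this exact signature, so the hot
   state (ip, acc) stays pinned in the argument registers across tail calls. */
typedef void op_fn(const uint8_t *ip, uint64_t acc);

static void dispatch(const uint8_t *ip, uint64_t acc);

static void op_halt(const uint8_t *ip, uint64_t acc)
{
    (void)ip;
    printf("result: %llu\n", (unsigned long long)acc);
}

static void op_add(const uint8_t *ip, uint64_t acc)
{
    acc += ip[1];                                    /* one immediate operand */
    __attribute__((musttail)) return dispatch(ip + 2, acc);
}

static op_fn *const table[] = { op_halt, op_add };

static void dispatch(const uint8_t *ip, uint64_t acc)
{
    /* Guaranteed tail call: no stack growth, and the register allocator only
       ever sees one small operation at a time instead of one giant loop. */
    __attribute__((musttail)) return table[ip[0]](ip, acc);
}

int main(void)
{
    const uint8_t prog[] = { 1, 40, 1, 2, 0 };       /* add 40, add 2, halt */
    dispatch(prog, 0);
    return 0;
}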
That technique is beautiful! And the results are as good as I would expect, once you’ve given the compiler the information it needs to do a good job. I’ll keep this one in mind next time I need to clearly delineate the fast path from the fallback path and I’m using something with access to the musttail attribute.
Normally it's better to focus on high-level, time-complexity-improving optimizations, but there are still a whole lot of things where you really need assembly micro-optimizations, since they usually offer a 10x or 100x speedup. It's just that those things usually only concern core libraries, stuff like crc32. The rest of the time we're just writing code that glues all the assembly-optimized subroutines together.
I would just add that on kookier platforms there's sometimes more low-hanging fruit to be picked up with asm. I can still (barely) beat a compiled SHA-256 on arm64 by using the hardware SHA instructions that not everyone implements, stuff like that.
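As a tiny illustration of the "not everyone implements it" caveat, a hand-written SHA path is typically gated on a runtime check like this sketch (Linux-specific; the HWCAP bit value is the arm64 one and is only defined here so the snippet compiles elsewhere), falling back to portable C otherwise:
#include <stdio.h>
#include <sys/auxv.h>          /* getauxval, AT_HWCAP (Linux-specific) */

#ifndef HWCAP_SHA2
#define HWCAP_SHA2 (1 << 6)    /* bit from the arm64 <asm/hwcap.h> */
#endif

int main(void)
{
    unsigned long hwcap = getauxval(AT_HWCAP);
    if (hwcap & HWCAP_SHA2)
        puts("SHA-2 instructions present: take the asm/intrinsics path");
    else
        puts("no SHA-2 instructions: take the portable C fallback");
    return 0;
}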
Since about 1/3 of the programming language shootout competitions seem to boil down to "which language can call gmp the fastest", assembly might not add much. TFA is for a benchmark where the input size is too small.