Looking at performance counter data is good, but I would have liked to see a real validation of the hypothesis that bounds checking is to blame for the extra branches and instructions. That is, modify the Rust compiler to not emit bounds checks (or maybe there is even a flag for this?) and look at performance and counters. I would imagine that this would bring the data for Rust to pretty much the same as C. But other compiler (micro-)optimizations might be at play as well.
Also, from the paper's Conclusions: "The cost of [Rust's] safety and security features are only 2% - 10% of throughput on modern out-of-order CPUs." 10% network throughput is a lot.
This, imo, is absolutely correct (it is a dark idea to have a "let's be unsafe for more performance" flag), but maybe an experimental build of the Rust compiler could have this as a configuration option? Possibly the toolchain could warn every step of the way if such a 'tainted' module is ever linked, etc.
It just seems like this sort of question is going to recur, and being able to persistently track the overhead of checking (it would allow you to monitor specific performance improvements) is much nicer than having someone do a one-off experiment.
If it is implemented, it will be used. And people will put it in their own builds.
We already have one “secret” escape flag feature, and people do use it, as much as we don’t talk about it and tell people not to use it when they find it.
Maybe put a tainted flag in it that causes the linker or runtime to fail? Then don't open-source or release the modifications that let the linker/runtime skip that failure check, and refuse to let anyone check a "fix" that bypasses it into an official build...
This seems like an incredibly important cost. Surely it's worth doing a bit of ugly magic to be able to keep track of it persistently.
Thanks to both of you for the insightful discussion. A flag would be helpful for testing, but it's true that if it's there, it will be used. Still, this can be tracked as part of a CI system by keeping around a patch for disabling bounds checks and regularly building and benchmarking a patched version. Less nice, but should get the job done.
I’m not 100% sure if there’s a source exactly, but we don’t like safety and correctness to depend on what flags you pass or do not pass. We don’t offer a fast-math flag either for similar reasons.
The odd one out is overflow, and that's only because overflow is well defined in Rust (a "program error") rather than UB. It gets checked in debug builds but currently not in release builds, though the spec allows checking there too.
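To make that split concrete, here's a minimal sketch (the `black_box` trick is just my way of keeping the optimizer from rejecting the overflow at compile time; the `overflow-checks` setting in a Cargo profile is the real knob for turning checks on in release builds too):

```rust
use std::hint::black_box;

fn main() {
    // black_box keeps the value opaque, so the overflow below is a
    // runtime event rather than a compile-time error.
    let x: u8 = black_box(u8::MAX);

    // With overflow checks on (the debug default, or `overflow-checks =
    // true` in a release profile), this panics with "attempt to add with
    // overflow". With checks off, it wraps to 0. Both outcomes are
    // defined behaviour -- a "program error", never UB.
    let y = x + 1;
    println!("{y}");
}
```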
What do you think of Julia's macro-based approach?
That is, there are `@inbounds` and `@fastmath` macros that turn off bounds checking/enable fast-math flags in the following expression.
`@fastmath` works simply by swapping functions (eg `+`) with versions (eg, `Base.FastMath.add_fast`) that have the appropriate llvm flags.
When testing Julia libraries, all `@inbounds` are ignored (ie, it'll emit bounds checks anyway).
I assume it's already possible for a user to similarly implement `inbounds!` and `fastmath!` macros in Rust that replace `[]` with `.get_unchecked()`, etc. (I haven't checked if there are already crates.) But it sounds like it should be easy enough for folks to use this approach in performance-sensitive regions (in particular, loops that may need these flags to vectorize).
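Something along those lines seems doable today. A minimal sketch, assuming a made-up `inbounds!` macro that keeps the checked path while debug assertions are on (mirroring how Julia's test mode ignores `@inbounds`):

```rust
/// Hypothetical analogue of Julia's `@inbounds`: index without a bounds
/// check in release builds, but keep checked indexing (and its panics)
/// whenever debug assertions are enabled, e.g. under `cargo test`.
macro_rules! inbounds {
    ($slice:expr, $idx:expr) => {{
        let (s, i) = (&$slice, $idx);
        if cfg!(debug_assertions) {
            s[i] // checked indexing while testing
        } else {
            // SAFETY: the caller promises `i < s.len()`.
            unsafe { *s.get_unchecked(i) }
        }
    }};
}

fn sum(xs: &[u64]) -> u64 {
    let mut total = 0;
    for i in 0..xs.len() {
        total += inbounds!(xs, i);
    }
    total
}

fn main() {
    println!("{}", sum(&[1, 2, 3])); // 6
}
```

That said, for a loop this simple, `xs.iter().sum::<u64>()` already lets the optimizer drop the check with no unsafe at all, which is usually the better first move.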
I guess my thought is that much of Rust's correctness comes from the compiler being able to assert, at compile time, that some type (and thus some memory address) will only ever be used in a correct way.
For example, if we were dynamically linking a Rust crate into a Rust binary, is it necessary to check bounds in both, or can some of that be deferred because we can assume the binary being linked against has already done the bounds checks?
I know it's a bit contrived since ideally we'd just compile statically, but I think it's still potentially valid. If both pieces of software have the guarantees then ideally you can factor out some of the overhead.
Not really: indexing out of bounds without this check would invoke undefined behaviour. A compile-time flag would not be able to distinguish the cases where a bounds check is required for the program to be correct from the cases where the index is provably within bounds and the check is unnecessary.
Who wants a compile-time flag that makes valid programs have undefined behaviour? Nobody, especially when you consider that UB in any language really does mean undefined: in the best case the program crashes, in the worst it deletes all your files.
What's wanted is a way to tell the compiler "no, in this specific case which I have determined to be a bottleneck in my program, I want to omit bounds checking because due to XYZ it's impossible for the index to ever be out of bounds" and that's exactly what this method provides.
They can just profile to find out which functions in their program are consuming the most CPU, check whether those functions contain any bounds checks, and, if so, write the single line of code required to tell the compiler "trust me, it is impossible for this index to ever be out of bounds, a bounds check is not necessary".
If they are right, and bounds checks are the issue, doing this should recover the performance difference.
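In Rust, that single line is typically a switch from `[]` to `slice::get_unchecked` inside an `unsafe` block, with a comment recording why the index can't escape the bounds. A minimal sketch with a made-up hot function:

```rust
/// Hypothetical hot function: sum every `stride`-th element of `xs`.
fn strided_sum(xs: &[u64], stride: usize) -> u64 {
    assert!(stride > 0); // hoist the invariant out of the loop
    let mut total = 0;
    let mut i = 0;
    while i < xs.len() {
        // SAFETY: the loop condition guarantees `i < xs.len()`, so this
        // index can never be out of bounds; `get_unchecked` just tells
        // the compiler not to verify that again.
        total += unsafe { *xs.get_unchecked(i) };
        i += stride;
    }
    total
}

fn main() {
    println!("{}", strided_sum(&[1, 2, 3, 4, 5, 6], 2)); // 1 + 3 + 5 = 9
}
```

Worth noting: for patterns this regular, the optimizer often proves the index in bounds and elides the check on its own, so it pays to confirm in a profiler that the unsafe version actually changes anything.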
Yeah, I wonder what world they live in. If I could sell a limb to get 10% more audio plug-ins in my DAW, you can be sure my bedtime book would be "Life Pro-Tips for Quadruple Amputees".
2-10% for an already fast user space driver is nothing.
State of the art for a lot of these use cases is still the kernel driver, which is ~7 times slower. Sure, all that stuff is moving to XDP/eBPF/AF_XDP, but that is still ~20-30% slower than a user-space driver.
Also, these 2-10% only show up when underclocking the CPU while running the unrealistic benchmark of forwarding packets bidirectionally on only one core (trivial to parallelize).
In the end it's about 6-12 extra cycles spent in the driver. That's not a lot if you have a non-trivial application on top of it.
Fortunately for your body, this problem is easily solvable with hardware. Modern DAWs scale well with multithreading, for regular use cases at least.
I don't know your use case, but generally, if you have so much VST processing on a single track that it loads a core of a modern CPU, you're either doing something really creative, like sculpting a sound, or some heavy-handed audio restoration. Both are candidates for freezing/rendering to a stem. YMMV, of course.