"sqr is a place where one might be tempted to use a macro in C, but writing such a macro in a way that x doesn't get evaluated twice is tricky."
Slightly off-topic, but it's important to point out that in modern C, implementing such a function as a macro is always a mistake. You'd do it as an inline function.
Macros in modern C should only be used for code generation (for DRY). If the language supports doing it without a macro, then do it without a macro.
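To make the double-evaluation pitfall concrete, here's a minimal sketch in Rust (the article's language); the `sqr!` macro, `sqr_fn`, and the call counter are all made up for illustration:

```rust
// A naive macro expands its argument textually, so side effects run twice.
macro_rules! sqr {
    ($x:expr) => {
        $x * $x
    };
}

// A plain function evaluates its argument exactly once, and the compiler
// is free to inline it, so there's no performance reason to prefer the macro.
fn sqr_fn(x: i32) -> i32 {
    x * x
}

fn main() {
    let mut calls = 0;
    let mut next = || {
        calls += 1;
        calls
    };
    let m = sqr!(next()); // expands to next() * next(): evaluates 1 * 2
    let f = sqr_fn(next()); // next() runs once: 3 * 3
    assert_eq!((m, f, calls), (2, 9, 3));
}
```

The same trap exists in any textual or token-substitution macro system; hygiene helps with shadowing (more on that below in the thread) but not with repeated evaluation.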
...or link with LTCG / LTO, which automatically considers all functions for inlining across compilation units, whether or not they are marked inline. For C code this is preferable IMHO because it lets you keep declaration and implementation code strictly separate; inline functions muddy that line because they must live in headers.
What about ThinLTO? The idea of having an overnight full build that creates some kind of cache, which incremental builds throughout the day can use to do fast whole-binary LTO, seems like the best of both worlds.
Basically just wondering if anyone has first hand experience using this kind of thing on a large project.
ThinLTO was written by people responsible for peak optimization of very large programs at Google. I honestly can’t call it “fast” but it’s a lot faster than GCC LTO. You normally only peak-optimize your release builds so it’s not like developers are sitting there counting the seconds.
> You normally only peak-optimize your release builds so it’s not like developers are sitting there counting the seconds.
Sometimes you do want to debug a release build because the optimizations have gone wrong, though. In that case, it's helpful if release builds build quickly, and especially if they rebuild quickly as you turn various sub-optimization flags on and off.
Yes, and sometimes you want to validate a performance change and you need the release build to run in the benchmark fixture, and then it's irritating that the build takes forever but what can be done? ThinLTO's real benefit is that it uses so much less memory than legacy LTO and can be applied to larger programs (like Chrome).
That's true; there are many situations where you simply can't replace a macro with a function for that reason. MIN/MAX is a common example, unless you want to have one specialized version for every type.
Unfortunately there are always tradeoffs using macros for something like that, you either end up evaluating the parameters more than once or you have to introduce variables in the macro that may shadow existing variables. Some compilers have extensions to help with that: https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html
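For what it's worth, the pattern that GNU statement expressions bolt onto C is a core feature of Rust, where every block is an expression; a sketch (`max_of` is a made-up name):

```rust
// A block evaluates to its final expression, so temporaries can be
// introduced without a macro and with no risk of double evaluation.
fn max_of(a: i32, b: i32) -> i32 {
    let big = {
        let x = a;
        let y = b;
        if x > y { x } else { y }
    };
    big
}

fn main() {
    assert_eq!(max_of(3, 7), 7);
}
```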
First, you can't do that for the macros like MIN/MAX mentioned in the comment you just replied to, and sqr mentioned earlier.
Second, you still risk shadowing existing variables, even if you make an effort to use "unlikely" names for the locals such as "___my_macro_secret_variable_1", if the macro might be used in one of its own arguments. For example MACRO(MACRO(x)).
If that sounds unlikely, consider MAX(a,MAX(b,c)), which is likely to happen eventually, if your codebase uses such macros or if they are part of a library.
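As a point of comparison, Rust's macro_rules! macros are hygienic, so the shadowing problem described above (including the MACRO(MACRO(x)) case) can't occur; a sketch with a made-up `max2!` macro:

```rust
// Each expansion gets its own `x` and `y`; hygiene means they cannot
// capture or shadow identifiers at the call site.
macro_rules! max2 {
    ($a:expr, $b:expr) => {{
        let x = $a;
        let y = $b;
        if x > y { x } else { y }
    }};
}

fn main() {
    let x = 10;
    // The macro's internal `x` does not collide with the caller's `x`...
    assert_eq!(max2!(x, 5), 10);
    // ...and nested expansion works, unlike a C macro with fixed local names.
    assert_eq!(max2!(1, max2!(2, 3)), 3);
}
```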
When you decide to define new variables, you have already decided to write more than just an expression (which by definition can't contain a statement), so "you can't do that for..." is a moot point. And since it's not an expression, it's not going to appear as its own argument, unless you're using the more advanced trick of substituting a whole block, which is not what the do { ... } while(0) idiom is for.
>Note that introducing variable declarations (as we do in maxint) can cause variable shadowing [...] this example using maxint will not [produce correct results]:
That’s a GNU C extension. I don’t write non-portable code like that. (Okay, I didn’t notice the link in your previous post, so I was basically replying to the wrong thing.)
Inline was added in C99, which MSVC still doesn't fully support. If this has to be taken into account when you choose what standard to use for your codebase, that's a quarter of a century for a feature to trickle down to the consumer.
The Microsoft C compiler actually has pretty good C99 support since VS2015 (e.g. inline is definitely supported).
AFAIK the only non-supported C99 features in the C compiler are VLAs (optional since C11 anyway), and type-generic macros (those would be good to have though).
Of course it would be nice if Microsoft gave the C compiler a bit more love, especially since it's much less work to keep a C compiler up to date than a C++ compiler, but at least we got "most of C99".
The quarter of a century thing does not apply to "inline".
Although inline was added in C99, it was already an extremely widely supported extension in mainstream compilers, even since the C89 days, when we just called it "ANSI C".
MSVC has supported inline for a long time, long before it started supporting other C99 features.
That doesn't matter if the features you're able to use are gated on the standard you choose. If the standard you pick is based on what your target platforms 'support': no inline for you.
It seems weird to decide that you won't use features because they are in a public standard that your target platform doesn't support, even though your target platform fully supports those features themselves.
I could understand the concern if it was about portability to other target platforms, or keeping the option of doing so. But in that case, the public standard your current target supports is irrelevant.
It happens. Imagine you decide on the MS-compatible bits of C99, then the team naturally picks up new people and loses the ones who made the decision. Eventually, people will know from the build system that the standard is C99, but not the reason behind the decision.
So they add a feature not supported by MSVC and don't learn that it doesn't work until someone else tries to build on Windows.
If you choose to use features based on whether they work or not, you don't need to choose a standard at all. But that loses you all of the guarantees a standard provides.
One of the main quality-of-life improvements of Rust over C and C++, IMO, is the crate system, and how you don't have to worry about code from other files not being inlined properly (there's LTO now, but it's always more limited than compile-time inlining and optimization).
No need to move things to headers (no headers at all, actually), or to worry about forward declarations, exposing internal bits of the API, etc. Just write the code normally and naturally and let the compiler figure it out. No need to consider subtle performance tradeoffs when deciding where to write the code; just put it where it makes sense.
Everybody should build everything with a reasonable setting of -march. Debian Linux still builds everything for K8, a microarchitecture that lacked SSE3, SSE4, AVX, AES, etc. Even building with -march=sandybridge seems pretty conservative and gives significant speed improvements in many common cases.
I'm not a Rust guy, so can someone explain to me why SIMD intrinsics are "unsafe"? They don't seem unsafe in the way that, I dunno, raw memory accesses are unsafe.
1. Using instructions that don't exist on a given CPU will cause the program to crash.
2. `core::arch` exists so that stable Rust programs can make use of SIMD instructions for performance (the `regex` crate in particular). The intrinsics are very low level, and there are a lot of them, so whether any given one is safe wasn't considered individually: these functions exist for safe SIMD abstractions to build on. That's advantageous, since Rust can't really make breaking changes, while a library can easily release a new version without breaking its users (who can keep using the old version).
> Using instructions that don't exist on a given CPU will cause the program to crash.
But surely this also applies to a Rust program that only uses safe code but is compiled with compiler switches that allow it to use SIMD instructions?
Sure would be nice to have a way to mark a block as "compile for arches x, y" and have the binary check on entry which you have and select a different code path. This way you could do it only for your hot sections and not need to have the entire binary duplicated.
AFAIK the main way to manage this right now is throwing the optimized stuff into a plugin and then loading it at runtime— but that ends up having huge implications on your project structure, source layout, etc.
By default Rust builds binaries that will run on any processor for the given target. If you want it to generate instructions that are processor-specific--thereby reducing the portability of the binary--you must pass an explicit compiler flag. Clang exhibits the same behavior with its -march and -mcpu flags, and I would assume GCC does as well.
It's important to note whose point of view "unsafe" refers to. It's not unsafe for you or for your setup.
It's unsafe from the COMPILER's point of view, i.e. it can't PROVE it's safe!
Static analysis can only decide a subset of the possible "safe" interactions, and only at compile time. Rust can't decide whether a use of SIMD is safe AT COMPILE TIME, because at RUNTIME the CPU might not have it!
Why can't it just run dynamic feature detection to see if it needs to load e.g. a SISD shim version of the function which is exactly the same except the intrinsic is replaced with a generic implementation?
Or at least the other way around: if a function is declared with dynamic feature detection and a default path exists why can't it be declared safe by the compiler?
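This is roughly the pattern stable Rust already supports with `is_x86_feature_detected!` and `#[target_feature]`; the safety obligation is discharged manually by the runtime check, it just isn't (yet) understood by the compiler as making the call safe. A sketch (the `sum*` function names are made up; the fast path uses auto-vectorization rather than explicit intrinsics to keep it short):

```rust
// Compiled with AVX2 enabled, so the compiler may auto-vectorize the loop.
// Calling this on a CPU without AVX2 is undefined behavior, hence `unsafe`.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[i32]) -> i32 {
    xs.iter().sum()
}

// Portable scalar fallback, used when AVX2 is absent (or on other arches).
fn sum_fallback(xs: &[i32]) -> i32 {
    xs.iter().sum()
}

fn sum(xs: &[i32]) -> i32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // We just verified AVX2 exists at runtime, so this call is sound.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_fallback(xs)
}

fn main() {
    assert_eq!(sum(&[1, 2, 3, 4]), 10);
}
```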
The thing is that this isn't built into the language yet. Letting crates sort out a problem before committing it to the language is part of Rust's philosophy.
For example, Rust changed its hash implementation after a certain crate proved the approach worthy.
I get that; obviously if you use an intrinsic that reads from memory it's a raw load. However, surely there's a great big pile of them that are really just different ways of doing things to a (slightly bigger than usual) Plain Old Data structure?
I think it's partly architecture compatibility. The SIMD functions exposed by Rust at the moment are all CPU-architecture-specific, but Rust code is generally expected to work seamlessly cross-platform.
I believe there are plans to expose higher-level functionality as a safe interface at a later date (the API design work just hasn't been done yet). For now you can get this functionality as a 3rd-party crate https://github.com/AdamNiederer/faster
In addition to other answers: SIMD in rust is currently exposed via a very low-level API. But in the future, someone will probably build a safe API on top of it.
OK, thanks for the clarifications. I think it's kind of daft to lump in "might not run on your particular processor" with "unsafe" - these seem like very different concepts - but it's at least consistent.
Inline assembly is possibly one of the most unsafe features available in any HLL. What happens when your inline assembly isn't valid on the target system? Undefined behavior. What happens when your inline assembly does operations that are outside the model of the language? Undefined behavior.
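To make that concrete in Rust terms: with `asm!` the compiler trusts whatever operand constraints you declare and has no idea what the instructions actually do. A minimal sketch (x86-64 only, with a portable fallback so it runs anywhere; `add_one` is a made-up name):

```rust
#[cfg(target_arch = "x86_64")]
fn add_one(x: u64) -> u64 {
    let mut y = x;
    // The compiler cannot check that this really just adds 1, that it
    // clobbers nothing else, or that the instruction is valid on the
    // machine it eventually runs on: everything rests on the constraints
    // we wrote by hand.
    unsafe {
        std::arch::asm!("add {0}, 1", inout(reg) y);
    }
    y
}

#[cfg(not(target_arch = "x86_64"))]
fn add_one(x: u64) -> u64 {
    x + 1
}

fn main() {
    assert_eq!(add_one(41), 42);
}
```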
Surely if you control the whole toolchain, you can stick flags into the binaries/libraries and fail at link/load time.
The fact that the default C ABI prefers to work without a condom here doesn't mean that Rust has to! This is one of the most avoidable forms of undefined behavior I'm aware of.
Eventually the program counter will advance to the unsafe inline assembly. At this point there are no guarantees. There is no enforced ABI. The inline assembly can do anything. This is not a compile or link time problem. This is a runtime problem.
I just meant that a binary or library could mark itself (say) "If you execute me, I will run AVX2" upon using this kind of inline asm. Static linking to create a bigger library or program transfers the mark. Eventually you will either dynamically link or load a library or attempt to run it on the actual computer where it's meant to run; at this point, you fail gracefully.
I know this isn't a 100% panacea, as some programs will have multiple bits of inline asm for different system and some sort of better or worse way of determining at run-time what to run. However, I've written plenty of programs where it pretty much says on the tin: "you must have AVX2 to run this".
All up this seems like a pretty trivial problem compared to the problem of 'lack-of-safety' in general, and I'm not thrilled to see SIMD intrinsics being put in the "OMG So Dangerous" box as if they are a bunch of wild pointer ops (except, of course, in the case where they actually are - e.g. scatter/gather)...
For your `sqr` function, what is the benefit of writing `x * x` over using `x.powi(2)` [0]? You didn't mention it in the article, but did you find a performance improvement from doing this?
As far as CPU instructions go, multiplication is _significantly_ faster (i.e. an order of magnitude) than exponentiation.
That said, I can't speak to whether the Rust compiler wouldn't just optimize that away -- it seems like unrolling exponentiation into multiplication for small constant powers would be a very safe and easy thing to do.
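A quick sketch of the two spellings: `powi` lowers to LLVM's `powi` intrinsic, which for a small constant exponent typically expands to the same repeated multiplication rather than a call into the general `pow` routine, so both are exact here (I haven't benchmarked the two, so treat the performance equivalence as a guess):

```rust
fn sqr(x: f64) -> f64 {
    x * x
}

fn main() {
    let x = 3.0_f64;
    // powi(2) is repeated multiplication, not a libm pow() call,
    // so both forms give the exact same result for this input.
    assert_eq!(sqr(x), 9.0);
    assert_eq!(x.powi(2), 9.0);
}
```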