Let the Compiler Do the Work (cliffle.com)
213 points by milliams on Jan 7, 2020 | 60 comments



"sqr is a place where one might be tempted to use a macro in C, but writing such a macro in a way that x doesn't get evaluated twice is tricky."

Slightly off-topic, but it's important to point out that in modern C, implementing such a function as a macro is always a mistake. You'd do it as an inline function.

Macros in modern C should only be used for code generation (for DRY). If the language supports doing it without a macro, then do it without a macro.
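A minimal sketch of why (the names and the side-effect counter below are mine, just to make the evaluations observable): a naive sqr macro evaluates its argument twice, while the inline function evaluates it exactly once.

```c
#include <assert.h>

/* Naive macro: the argument expression is pasted in twice. */
#define SQR_MACRO(x) ((x) * (x))

/* Modern C: the argument is evaluated exactly once. */
static inline int sqr(int x) { return x * x; }

/* Side-effect counter to observe how often the argument runs. */
static int calls;
static int next3(void) { ++calls; return 3; }

static int fn_evals(void)    { calls = 0; (void)sqr(next3());       return calls; }
static int macro_evals(void) { calls = 0; (void)SQR_MACRO(next3()); return calls; }
```

fn_evals() returns 1 while macro_evals() returns 2; with an argument like x++ the macro version is even undefined behavior.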


...or link with LTCG / LTO, which automatically considers all functions for inlining across compilation units, whether or not they are marked as inline. For C code this is preferable IMHO because it lets you keep declaration and implementation code strictly separate; inline functions muddy that line because they must live in headers.


While I vastly prefer this workflow-wise, for large projects it can increase compile time drastically: https://lwn.net/Articles/744507/

For projects that will never have more than ~1 million LOC it's probably fine. Less than ~100K, definitely preferable.


What about ThinLTO? The idea of having an overnight full build that creates some kind of cache which incremental builds throughout the day can use to do fast whole-binary LTO seems like the best of both worlds.

Basically just wondering if anyone has first hand experience using this kind of thing on a large project.


ThinLTO was written by people responsible for peak optimization of very large programs at Google. I honestly can’t call it “fast” but it’s a lot faster than GCC LTO. You normally only peak-optimize your release builds so it’s not like developers are sitting there counting the seconds.


> You normally only peak-optimize your release builds so it’s not like developers are sitting there counting the seconds.

Sometimes you do want to debug a release build because the optimizations have gone wrong, though. In that case, it's helpful if release builds build quickly, and especially if they rebuild quickly as you turn various sub-optimization flags on and off.


Yes, and sometimes you want to validate a performance change and you need the release build to run in the benchmark fixture, and then it's irritating that the build takes forever but what can be done? ThinLTO's real benefit is that it uses so much less memory than legacy LTO and can be applied to larger programs (like Chrome).


Those large projects benefit the most from LTO as well though.


The point of doing it as a macro is to be generic over types. In C++ you would use a template function, but there's no real option in C.


Since C11 there's _Generic, which makes this possible. Using macros in that case is strongly advised (otherwise it gets very verbose).
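A sketch of what that looks like (the helper function names are mine); note that the controlling expression of _Generic is unevaluated, so the argument still runs only once:

```c
#include <assert.h>

static inline int    sqr_i(int x)    { return x * x; }
static inline float  sqr_f(float x)  { return x * x; }
static inline double sqr_d(double x) { return x * x; }

/* C11 type-generic front end: selects the right function at
   compile time based on the argument's type, then calls it. */
#define sqr(x) _Generic((x), \
    int:    sqr_i,           \
    float:  sqr_f,           \
    double: sqr_d)(x)
```

sqr(3) dispatches to sqr_i, sqr(1.5) to sqr_d, and so on.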


That's true, there are many situations where you simply can't replace a macro with a function because of this fact. MIN/MAX is a common example, unless you want to have one specialized version for every type.

Unfortunately there are always tradeoffs using macros for something like that, you either end up evaluating the parameters more than once or you have to introduce variables in the macro that may shadow existing variables. Some compilers have extensions to help with that: https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html
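For instance, the textbook MIN (the counter below is mine, just to make the evaluations visible) evaluates both arguments for the comparison and then the winner a second time:

```c
#include <assert.h>

/* Classic expression macro: correct for plain values, but each
   argument appears twice in the expansion. */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Counter to observe evaluations. */
static int evals;
static int observed(int v) { ++evals; return v; }

static int min_evals(void) {
    evals = 0;
    (void)MIN(observed(1), observed(2));
    return evals;  /* 3: both for the comparison, then the winner again */
}
```

With an argument like i++ or an expensive function call, that extra evaluation is a real bug, which is what the statement-expression workaround tries to fix.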


> you have to introduce variables in the macro that may shadow existing variables.

You certainly don’t have to introduce the risk of shadowing existing variables. You just use the standard

  #define MACRO(x) \
      do { \
          /* Define whatever variables you like */ \
          /* Do things... */ \
      } while (0)


First, you can't do that for the macros like MIN/MAX mentioned in the comment you just replied to, and sqr mentioned earlier.

Second, you still risk shadowing existing variables, even if you make an effort to use "unlikely" names for the locals such as "___my_macro_secret_variable_1", if the macro might be used in one of its own arguments. For example MACRO(MACRO(x)).

If that sounds unlikely, consider MAX(a,MAX(b,c)), which is likely to happen eventually, if your codebase uses such macros or if they are part of a library.


When you decide to define new variables, you have already decided to write more than just an expression (which by definition can't contain a statement), so "you can't do that for..." is a moot point. And since it's not an expression, it's not going to appear as its own argument, unless you're using a more advanced whole-block-substitution trick, which is not what the do while(0) idiom is for.


I can't make sense of what you're saying. See the page I linked in my previous post for an example of what we're talking about: https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html

    #define maxint(a,b) \
      ({int _a = (a), _b = (b); _a > _b ? _a : _b; })
> Note that introducing variable declarations (as we do in maxint) can cause variable shadowing [...] this example using maxint will not [produce correct results]:

    int _a = 1, _b = 2, c;
    c = maxint (_a, _b);


That’s a GNU C extension. I don’t write non-portable code like that. (Okay, I didn’t notice the link in your previous post, so I was basically replying to the wrong thing.)


You don't need macros for generic code either :), Rust has generics.

    use std::ops::Mul;

    fn sqr<T: Copy + Mul>(x: T) -> T::Output {
        x * x
    }


Same advice in LISP.

Don't use a macro where a function will do.


Never use a macro to do an inline function’s job.


Inline was added in C99, which MSVC still doesn't support entirely. If this has to be taken into account when you choose what standard to use for your codebase, that's a quarter of a century for features to trickle down to the consumer.

I hope I get to use C2x* before I retire.

*postmodern C?


The Microsoft C compiler actually has pretty good C99 support since VS2015 (e.g. inline is definitely supported).

AFAIK the only non-supported C99 features in the C compiler are VLAs (optional since C11 anyway), and type-generic macros (those would be good to have though).

Of course it would be nice if Microsoft gave the C compiler a bit more love, especially since it's much less work to keep a C compiler up to date than a C++ compiler, but at least we got "most of C99".


The quarter of a century thing does not apply to "inline".

Although inline was added in C99, it was already an extremely widely supported extension in mainstream compilers, even since the C89 days, when we just called it "ANSI C".

MSVC has supported inline for a long time, long before it started supporting other C99 features.


That doesn't matter if the features you're able to use are gated on the standard you choose. If the standard you choose is based on what your target platforms 'support': no inline for you.


It seems weird to decide that you won't use features because they are in a public standard that your target platform doesn't support, even though your target platform fully supports those features themselves.

I could understand the concern if it was about portability to other target platforms, or keeping the option of doing so. But in that case, the public standard your current target supports is irrelevant.


It happens. Imagine you decide on the MS-compatible bits of C99, then the team naturally picks up new people and loses the ones who made the decision. Eventually, people will know from the build system that the standard is C99, but not the reason behind the decision.

So they add a feature not supported by MSVC and don't learn that it doesn't work until someone else tries to build on Windows.

If you choose to use features based on whether they work or not, you don't need to choose a standard at all. But that loses you all of the guarantees a standard provides.


One of the main quality-of-life improvements of Rust over C and C++ IMO is the crate system, and how you don't have to worry about code from other files not being inlined properly (there's LTO now, but it's always more limited than compile-time inlining and optimization).

No need to move things to headers (no headers at all, actually), worry about forward declarations and exposing internal bits of the API etc... Just write the code normally and naturally and let the compiler figure it out. No need to consider subtle performance tradeoffs when deciding where to write the code, just put it where it makes sense.


> there's LTO now but it's always more limited than compile-time inlining and optimization

In what way do you think it is more limited?


My prof in Introduction to Programming always said: "Generally, it's a good idea to avoid decisions."


Page isn't loading for me, so here's a snapshot: https://archive.is/i5uvl


Clang also does some crazy-good SIMD optimizations if you set -march=native. But I think the Rust compiler is based on LLVM as well, right?


Everybody should build everything with a reasonable setting of -march. Debian Linux still builds everything for K8, a microarchitecture that lacked SSE3, SSE4, AVX, AES, etc. Even building with -march=sandybridge seems pretty conservative and gives significant speed improvements in many common cases.


It is, yes.


I'm not a Rust guy, so can someone explain to me why SIMD intrinsics are "unsafe"? They don't seem unsafe in the way that, I dunno, raw memory accesses are unsafe.

Non-portable, of course, I get.


There are multiple reasons.

1. Using instructions that don't exist on a given CPU will cause the program to crash.

2. `core::arch` exists so that stable Rust programs can use SIMD instructions for performance (the `regex` crate in particular). The intrinsics are very low level and there are a lot of them, so whether any given one is safe wasn't individually considered: those functions exist for safe SIMD abstractions to build on. That's advantageous, because Rust itself can't really make breaking changes, while a library can easily release a new version without breaking its users (who will keep using the old version).


> Using instructions that don't exist on a given CPU will cause the program to crash.

But surely this also applies to a Rust program that only uses safe code but is compiled with compiler switches that allow it to use SIMD instructions?


Sure would be nice to have a way to mark a block as "compile for arches x, y" and have the binary check at startup which one you have and select a different code path. That way you could do it only for your hot sections and not need to duplicate the entire binary.

AFAIK the main way to manage this right now is throwing the optimized stuff into a plugin and then loading it at runtime, but that ends up having huge implications on your project structure, source layout, etc.
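For what it's worth, GCC (and newer Clang) on x86 with glibc can already do roughly this per function via the target_clones attribute: the compiler emits one clone per listed target plus a default, and the dynamic linker resolves to the best one at load time. A sketch (the function is mine, and this is platform-specific, not portable C):

```c
#include <assert.h>

/* One source function, several machine-code clones; the loader
   picks the right one for the running CPU via an ifunc resolver. */
__attribute__((target_clones("avx2", "default")))
int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}
```

The call site is an ordinary function call; no plugin or source-layout changes needed.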



Well, that is awesome. Thanks for the link!


It will only use the SIMD intrinsics that work on the platform requested, in that case.


SIMD intrinsics are unsafe, because reaching a SIMD intrinsic not supported by the host CPU is undefined behavior (in practice usually SIGILL).


So SSE2 intrinsics should be safe for x86_64 builds?


Yes.


So most Rust builds are inefficient given they don't take advantage of modern instructions?


By default Rust builds binaries that will run on any processor for the given target. If you want it to generate instructions that are processor-specific--thereby reducing the portability of the binary--you must pass an explicit compiler flag. Clang exhibits the same behavior with its -march and -mcpu flags, and I would assume GCC does as well.


> why SIMD intrinsics are "unsafe"?

It's important to note the POV of "unsafe". It's not unsafe for you or for your setup.

It's unsafe from the COMPILER's point of view, i.e. it can't PROVE it's safe!

Static analysis can only decide a subset of the possible "safe" interactions, at COMPILE time. Rust can't decide whether a use of SIMD is safe AT COMPILE TIME because at RUNTIME the CPU might not have it!


Why can't it just run dynamic feature detection to see if it needs to load e.g. a SISD shim version of the function which is exactly the same except the intrinsic is replaced with a generic implementation?

Or at least the other way around: if a function is declared with dynamic feature detection and a default path exists why can't it be declared safe by the compiler?


This is possible (for example):

https://github.com/jackmott/simdeez

The thing is that it's not yet built into the language. Letting the crates sort out some problems before committing them to the language is part of the philosophy of Rust.

For example, Rust changed its hash implementation after a certain crate proved it worthy.



I get that; obviously if you use an intrinsic that reads from memory it's a raw load. However, surely there's a great big pile of them that are really just different ways of doing things to a (slightly bigger than usual) Plain Old Data structure?


I think it's partly architecture compatibility. The SIMD functions exposed by Rust at the moment are all CPU-architecture-specific, but Rust code is generally expected to work seamlessly cross-platform.

I believe there are plans to expose higher-level functionality as a safe interface at a later date (the API design work just hasn't been done yet). For now you can get this functionality as a 3rd-party crate https://github.com/AdamNiederer/faster


In addition to other answers: SIMD in rust is currently exposed via a very low-level API. But in the future, someone will probably build a safe API on top of it.


OK, thanks for the clarifications. I think it's kind of daft to lump in "might not run on your particular processor" with "unsafe" - these seem like very different concepts - but it's at least consistent.


Inline assembly is possibly one of the most unsafe features available in any HLL. What happens when your inline assembly isn't valid on the target system? Undefined behavior. What happens when your inline assembly does operations that are outside the model of the language? Undefined behavior.


Surely if you control the whole toolchain, you can stick flags into the binaries/libraries and fail at link/load time.

The fact that default C ABI prefers to work without a condom here doesn't mean that Rust has to! This is one of the most avoidable forms of undefined behavior I'm aware of.


Eventually the program counter will advance to the unsafe inline assembly. At this point there are no guarantees. There is no enforced ABI. The inline assembly can do anything. This is not a compile or link time problem. This is a runtime problem.


I just meant that a binary or library could mark itself (say) "If you execute me, I will run AVX2" upon using this kind of inline asm. Static linking to create a bigger library or program transfers the mark. Eventually you will either dynamically link or load a library or attempt to run it on the actual computer where it's meant to run; at this point, you fail gracefully.

I know this isn't a 100% panacea, as some programs will have multiple bits of inline asm for different system and some sort of better or worse way of determining at run-time what to run. However, I've written plenty of programs where it pretty much says on the tin: "you must have AVX2 to run this".

All up this seems like a pretty trivial problem compared to the problem of 'lack-of-safety' in general, and I'm not thrilled to see SIMD intrinsics being put in the "OMG So Dangerous" box as if they are a bunch of wild pointer ops (except, of course, in the case where they actually are - e.g. scatter/gather)...


For your `sqr` function, what is the benefit of writing `x * x` over using `x.powi(2)` [0]? You didn't mention it in the article, but did you find a performance improvement from doing this?

[0]: https://doc.rust-lang.org/std/primitive.f64.html#method.powi


As far as CPU instructions go, multiplication is _significantly_ faster (i.e. an order of magnitude) than exponentiation.

That said, I can't speak to whether the Rust compiler would just optimize that away -- it seems like unrolling exponentiation into multiplication for small constant powers would be a very safe and easy thing to do.


It certainly does do this: https://rust.godbolt.org/z/ZaDAKa

Not just small constant powers either. I tried .powi(1000000) and it compiled into a sequence of 25 vmulss instructions.


Interesting, thank you!

I suspect then that defining it as x*x is just because it's easier to type than x.powi(2).



