The background here is that "ctpop < 2" or "ctpop == 1" (depending on zero behavior) is LLVM's canonical representation for a "power of two" check. It is used on the premise that the backend will expand it back into a cheap bitwise check and not use an actual ctpop operation. However, due to complex interactions in the backend, this does not actually happen in this case (https://github.com/llvm/llvm-project/issues/94829).
I have come across a related compiler issue with popcnt before. I was trying to prototype a "population count" routine in C before rewriting it in 64-bit ARM assembly.
ARMv8.0 64-bit doesn't have a "popcnt" instruction in its scalar set.
I was curious to how well Clang would map my C statements into the ARM assembly I had been thinking of using, so I tested it on Compiler Explorer.
I noticed that in the output my routine was oh so slightly different, using an inverted constant. I then rewrote my routine with another algorithm ... but got the same compiler output.
Apparently Clang can recognise that "Oh, this is a population count routine" and replace it with a `llvm.ctpop.i32` intrinsic.
I have found no way to use compiler flags to suppress specifically this optimisation without turning off optimisations completely.
Closest you can realistically get in current compilers is adding some no-op asm statement "modifying" some subexpression as an "optimization barrier". e.g.
__asm__("" : "=r"(x) : "0"(x));
to make the optimizer unable to preserve known properties of "x" from before to after this line. Though this requires knowing the assembly register constraint to use ("r" here for a GPR), and can inhibit some other optimizations. (do note that this still allows dead code elimination and common subexpression elimination; "asm volatile" prevents these too)
This is called Idiom Recognition, and it was already needed in some compilers in the 1960s. So, unless you're now writing C from a retirement home probably all the non-toy compilers you've used had this feature.
Yes, but even though there were some advanced compilers back then, particularly for Fortran, there were also really dumb compilers for C and Pascal in the 1980s that didn't do any of that, and just did local register allocation statement by statement (storing to the stack after every statement), which is probably more like what they were used to.
So you need to do a binary search but you don't want to use division instructions. The funny thing is that DEC's assembler for the PDP-10 did a very similar thing to set up the binary search of its symbol table, which was its critical path. The JFFO instruction (Jump on Find First One) was added to the PDP-10 instruction set for this purpose.
Binary search is easy because division by two is just a shift. It doesn’t need any division or even multiplication instructions. I can guess why they might have used JFFO but it seems needlessly complicated.
The trick is quickly finding the largest power of two that is less than the size of the table. It was done with two instructions: the second was the JFFO, the first was an FADD with a purposely denormalized operand. The symbol table was kept sorted on every insert.
That was my guess :-) But binary search doesn’t need explicit powers of two. I wonder if maybe the PDP10 is much faster at loops with explicit counters instead of comparing the search upper and lower bounds.
No, the loop would have been slower. That is why they found the starting point with that bizarre two-instruction sequence. For any table size between (2^^n)+1 and 2^^(n+1), inclusive, it returned 2^^n as the starting point. Then of course it had to loop after that.
Great title. I feel like in some sense we've lost our way with compiler design. Languages do not help us write code that require minimal optimizations to run fast enough on most workloads. Instead, we overoptimize with -O[23] everywhere which slows down our development process and makes all of our code inscrutable. Better would be a faster -O0 experience and good profilers/observability so that we only optimize the hot spots (which could mean isolating specific subcomponents and compiling only those with optimizations on).
Of course it's much easier to just globally turn on release mode and optimize everything whether it matters or not. This both slows down the development process considerably and leads to bizarre issues like this.
This is a strong argument for moving to x86_64-v2 (or at least having an entire package ecosystem build for it) because popcnt shows up all over the place.
Most of the issue here comes from the mistaken belief that computers do mathematical (and more specifically arithmatic) operations as first class citizens.
A smart compiler would have used the test (x & (x - 1)) == 0 to verify that x (assumed non-zero) is a power of two. Don't know why Clang didn't do that here.
An "unrolled" popcnt is not that badly performing; I haven't seen numbers for a current server class CPU. Results for older CPUs for different implementation can be found easily, for example at https://news.ycombinator.com/item?id=11277891
The use case in the article here is for a single 8-byte (one size_t) popcount at a time. The comparison in your source is for when you want to compute many popcounts in parallel. The smallest size it compares is 32 bytes, where the scalar popcount instruction beats even the fastest other variants by a factor of almost 2x. The AVX2 lookup variant only starts beating it starting at 256 or 512 bytes, depending on the exact CPU. And even then, that variant is not equivalent to (a parallel version of) the scalar code shown in this article.
So, good that the parallel version exists, but it's really apples and oranges compared to this article's problem.
It is impressive but keep in mind that compilers follow relative familiar patterns with this kind of thing in that they specifically either emit the same pattern from an intrinsic or emit the same pattern from manually spotting a loop in the input source code i.e. it's not going in blind
It uses SWAR technique for counting bits because target platform wasn’t specified in compiler flags properly. popcnt and family are specific family extensions and without specifying the target flags the compiler generates lowest common denominator among available instructions
Edit: add -msse4 and you will see popcnt appear because popcnt is the SSE4 extension
I think the issue that the article is not explaining very well, is that what it's actually trying to do is check if a number is a power of two, and one part of the compiler says "use popcnt!" and then later another part says "oh sorry, we'll have to emulate that with a ton of instructions". Whereas it could have just used a shift and and.
> I do think it is a bug in this case, native popcnt could maybe be defended, but should not be used if compiling for architecture that does not support it.
It's a little impossible to show the difference between using popcnt and the expanded emulation code on a platform that doesn't support popcnt.
It doesn't change the fact that the actual program itself was being complied on a platform that doesn't support popcnt, so no amount of compiler arguments would fix that.
Looks like the author is into game development, and STL is often avoided in that context. I think one of the reasons is performance - STL has made some design decisions which prioritise general use over performance.
Often cited example is `std::unordered_map`, whose spec necessitates an implementation using separate chaining (or similar). Depending on the use case, in high performance applications, a different implementation can be much faster.
Although I don't understand why some people hate all of STL so much. Like sure, use your custom hash map, makes sense. But is anyone really replacing basics like `std::find` or `std::sort`?
Thanks for the clarification. I think STL is the greatest thing since sliced bread.
The time I have wasted in the 90’s debugging malloc/realloc/free rat’s nests for dynamic arrays, I could have put to better use. Thank god for vector/deque etc.
I work isolated. I didn’t learn about STL until years later,
Probably also because Game Developers had to deal with the MS STL implementation which for many years both the compiler and standard library were behind the times and not that optimal.
Case in point when using a std::deque with MS STL it's not really efficient because it has a very small buffer size, so it pretty much degrades into a linked list if the element is big enough. See: https://devblogs.microsoft.com/oldnewthing/20230810-00/?p=10...
It has some pretty bad containers, specifically unordered_map and map, and set.
There are some good talks on why from ccp con and other varius developer cons.
They are still usable mind you, just it's also true that unordered_map is slower then JS or Python hashmaps in some situations
because of some really questionable design choices (exposing bucket nodes, overly restrictive iterator invalidation requirements, etc..)
I've watched a few talks about the STL containers. I don't think they're savage enough. In particular the C++ growable array type std::vector is often cited as good. Is it terrible? No. But it's still not actually as good as it should be, we can easily do better if we provide the proper API.
Growable arrays are a very popular idea. Your preferred language almost certainly has them, but they might be built in to the language as an optimisation, if "normal" arrays in your language can grow and that's fast enough, chances are you're got the growable array type just as the only array, which is fine (but won't fly on tiny embedded systems).
This type has amortized O(1) aka constant time growth, which is one of those tricks which is amazing if you've never seen it before, then it just seeps into your understanding of the world and so it's easy to imagine why C++ doesn't do any better.
As a necessary optimisation beyond the amortized growth we must be able to tell a growable array hints about its future size. The only hint provided in the C++ STL is reserve, which tells the array its maximum final size. Rust calls this hint Vec::reserve_exact, if our array isn't this big, we should grow immediately, makes sense.
Here's where the problem kicks in: Suppose I don't know the maximum final size, but I do know I need to put say, at least 50 more things in the container? Algorithmically the answer is actually yes, the amortized growth can be preserved if we either grow sooner or skip growth steps and otherwise the hint can be ignored, this is Rust's Vec::reserve
If we try to use the C++ reserve here (aka Vec::reserve_exact) even though this isn't the final size, we destroy our amortized growth performance, our container may have lousy performance. If we just don't bother because we can't express our hint, our container has worse performance than necessary because we grow more often or less optimally.
resize(size() + N), resize(size() - N) should do the trick too, unless default constructing your elements is costly or not defined.
But I wonder what use case there is where you can't just push_back() one by one? The only potential issue here is that there might be a reallocation happening while append this range of N elements, but can't imagine what might break.
It's a perf leak, not a correctness issue, as with std::unordered_map. So yes, you can just push_back() each item, and as with std::unordered_map the worst case is linearly worse than the Right Thing™.
This is a Quality of Implementation issue in the library standard itself.
Ok - if the N elements you want to add would cause more than 1 reallocation by constant size factor, yes I can see that. (also I just saw that resize() has the same allocation behaviour as reserve() has, so scratch my suggestion above).
Never had a problem myself with that though, my main use case for std::vector is with reserve() right at the beginning and never provoking a reallocation (so I often end up coding up my own container to assert reallocation can't happen accidentally).
Sometimes (infrequently) I use the script-style push_back()-only pattern. But I feel very uneasy about uncontrolled reallocations in larger systems, and anyway they constrain the implementation space because of pointer and iterator invaliation.
You almost certainly don't need a "drop-in replacement" which is part of the problem. If you really just need a hash table, Abseil, Boost and Folly all provide several good options to suit your needs, and if you have a favourite utility library that's not one of those three it probably has at least one hash table type.
A "drop-in replacement" would commit to all the weird things about the standard library type which result in poor performance, some of them might be important to you, and you'll probably insist oh, it's not a "drop-in" replacement if I can't have those, and across the entire C++ population that means there is no high performance drop-in replacement, the type guarantees any number of weird properties and you could depend on all those, so, too bad.
The most likely reliance you'd have is reference stability, maybe you give out references to items in the table and they're long lived so you mustn't invalidate them when/ if there's a resize, let alone on unrelated changes to a table which keeps the same size. Abseil, Boost and Folly all provide types which lose some performance to get this back if it's all you need to be "drop-in".
The most unlikely reliances are two fold, firstly maybe you insist on messing with the bucket API, which makes no sense in open addressed tables (ie all the ones with great performance). In this case you're wedded to std::unordered_map because the details of how the type is structured are welded into your code. Sucks to be you. Secondly though, you might have decided you care about the order, the std::unordered_map says it doesn't promise an order, but on a single platform none of the implementations actually kick you if you assume the order is preserved, so you can merrily write a unit test which checks that after adding fake users Alice, Bob and Carol to your user hash table, when asked for the users back in no particular order you get exactly: Carol, Alice, Bob. Is that a good unit test? No. But the replacements will make it go red, they're not "drop-in" replacements.
> Abseil, Boost and Folly all provide types which lose some performance to get this back if it's all you need to be "drop-in".
Indeed, Google's initial migration from std::unordered_map to swisstables moved everyone to absl::node_hash_map since it was safer, even though flat_hash_map generally offers better performance.
> after adding fake users Alice, Bob and Carol to your user hash table, when asked for the users back in no particular order you get exactly: Carol, Alice, Bob.
Ow :(
This reminded me of Matt Kulukundis at cppcon talking about migrating away from std::unordered_map, wherein he brought up this point exactly: https://youtu.be/JZE3_0qvrMg?t=740
Matt then goes on to describe how Abseil adds a nondeterministic iteration offset to each hash table to prevent anyone from relying on this moving forward.
I was thinking of Matt's talk, although also of how many fucking times I have fixed unit tests which do this. And not just in C++.
Also: Unit test people, it's great that it's easy to check that "foo == 5" is true, and I can see why you thought it should be equally easy to check "foo == Carol, Alice, Bob" but could you perhaps nudge the developer to check they didn't mean "foo contains all of Carol, Alice, Bob" instead when they do that? Maybe have them pick "in order" versus "any order" with no default ?
There are a few more like this, case sensitivity is one, my newest colleague is going to write a test for "dnsName = example.com" even though actually dnsName might sometimes be "EXAMPLE.COM" or "Example.Com" or even "eXAMplE.cOM" and it'd be nice if they were nudged to spend that extra moment to decide whether that's OK or not before a test goes red spuriously.
Heh that last part is interesting, Java had the same issue with people's programs relying on (not guaranteed by the spec) deterministic order in HashMap/HashSet iteration and breaking after a JDK upgrade. They similarly added randomization to the iterators on the immutable collections returned by Map.of() and Set.of(), even though they could otherwise be deterministic since those collections never resize.
One of the Abseil containers if you really need an associative container to begin with, but honestly often you can play quite a few tricks with just plain arrays if you get a little creative, with better performance profiles...
I love how all the responses are talking about these arcane footguns that require watching obscure talks to be aware of.
Then there's the other segment of C++ programmers talking about how modern C++ is so much better now to the point that Rust and other languages are unnecessary.
People exaggerate this a lot, but mainly it's just the discontiguous containers (and regexes) that often have better alternatives. They're not bad for what they're trying to accomplish -- they just provide guarantees you often don't need, which results in you paying costs you shouldn't need to.
Worth noting that this issue is not unique to the C++ standard library either. Java is saddled with an overly-general HashMap implementation with awful worst-case performance characteristics as well, and older versions of Rust's HashMap had performance problems as well (though these have mostly been addressed).
ConcurrentHashMap is more performant in multi-threaded scenarios (it takes a performance hit versus the regular HashMap when single-threaded). Apart from that, it is fairly similar, and doesn't address the ridiculous rate of cache misses at all
STL has been both historically and recently slow in many ways. Many game devs are making their own engine, especially historically, so they just implement their own version of STL in the way they want. Or they use another engine which has made its own replacement. Even Andreas of SerenityOS made his own entire STL (and also his own Objective-C like class system).
The C++ STL is probably one of the most commonly avoided standard libraries of any language. I can’t think of another that is so often replaced by bespoke code
STL stands for Standard Template Library, which is an important piece of Generic Programming - it provides algorithms which are generic over the container types, so there's one library implementation of say sorting, and you can (but shouldn't) sort a linked list if you want to. When you see people asking why language X doesn't have "generics" this is the feature they're asking about - it's so fundamental now but in the 1990s it wasn't even obvious that you might want to write software this way. Generic Programming had been tried before, but the STL is an important idea and WG21 (the C++ Standards Committee) were persuaded to include it in the requirements for the new ISO standard.
However, in C++ because the STL is a big part of their Standard Library, and because Microsoft insists on naming their entire implementation of the Standard Library STL, it's usual to say STL to mean the entire stdlib, even though many parts of it aren't about generic programming at all.
So the STL is a crucial idea and inventing that was a tremendous accomplishment - but the C++ stdlib, which may be called "STL" and is in this article, is not a very good realisation of that idea. Indeed I'd argue the C++ programming language isn't a good realisation of that idea either - it was possible to write the STL for C++ and so that's what was done, but knowing this is a good idea you'd design the language differently. And people have.
Thank you for an educational comment, the historical context was interesting to learn about.
The last paragraph hints that there are more modern languages than C++ which implement generic programming in a better way, truer to its original conception?
Yes, D. D has great support for compile-time function execution, and generics. You can generate code at compile time and use it with generics. It's what C++ should have been.
I'm convinced, at this point, that Zig is that language, with comptime. Caveats: the system is not mature, Zig isn't 1.0 yet, there are some arbitrary limitations, the comptime type system is somewhat all-or-nothing, TL;DR it isn't done. All of this is immaturity, rather than an essential limitation of the design.
The breakthrough with C++ templates, was to provide a macro system which creates an AST for a static language, at compile time, using (surprise!) a template of the desired code. Contrast with the C preprocessor, which works on the token stream (there's more to it if you're curious, but in a way that makes it worse, not better).
Of course, a macro system operating on the AST is an older idea than C, this is one of the essential features of Lisp. But Lisps have always been dynamic languages. In some other timeline, there was an s-expression language for writing static programs, which used Lisp-style macros, and thereby avoided the problems with C++ templates. But we don't live in that timeline, it's a road not taken.
This template system enabled C++ programmers to write code which was generic over not just types, but features of types, and much else, it's all rather complex.
The problem, in essence, is that it was thought that template-expansion could be a decidable process implemented using a sub-Turing-complete language. So it was built using a second syntax, which was intended to be restrictive but flexible. Uh-oh!
C++ templates are Turing complete, as it turns out. So what C++ has is a second language. You could write your entire program in that language, and execute it at compile time. For everything you can do in C++, there's a second way to do it in templates, with its own syntax and semantics. Invariably, this language is much more verbose, and just... weird, in big and small ways. The sum of this is a system which is difficult to understand, and which generates notoriously terrible errors. This isn't intended as a complete summary of C++ template's shortcomings and weird features. That would take longer.
Zig just lets you run Zig at compile time. It's much, much better. Instead of trying and failing to make the problem decidable, Zig brute-forces the question by providing a branch quota: this is similar to gas in the EVM, or what Piccolo, a Rust implementation of Lua, calls "fuel". Comptime comes with a complexity limit, after which it quits, and which can be boosted for operations where the quota is insufficient.
So there aren't two languages to learn, just the one: Zig. That glosses over some things, like loop and switch constructs have inline equivalents for comptime, but the general premise holds.
There are other things which C++ makes easy which are less easy in Zig, like runtime polymorphism. But comptime is just a massive improvement over templates.
As far as I can tell, D's templates and CTFE are as powerful as Zig's comptime (not sure if they're exactly equivalently powerful, but if they're not, the difference is not very large in practice or I'm pretty sure I would've noticed).
> As far as I can tell, D's templates and CTFE are as powerful as Zig's comptime
You might very well be right. I didn't realize that D's templating was much different from C++, I don't know much about the language. I'll also note that the sibling comment you refer to wasn't posted when I wrote my reply.
I'll note that a careful reading of my post, and the one it was responding to, will reveal that I didn't claim that Zig was the only, or first, language to improve on C++ templates. Only that it's an example of a more modern language which has done so. But I'm also largely ignorant of the ins and outs of D.
It seems that you, however, are not. Here's an example of Zig building a data structure at comptime:
I admit I am no expert in D, I've written just a couple of hobby projects in it... but this sort of thing is actually much easier in D because it has associative arrays.
About comptime, it seems to me it's trivial to do that in D.
I started a little implementation of StringMap based losely on the Zig example, but I just did enough to show how D's templates and associative arrays could be used to implement that:
Zig also has a variety of generic data structures in the standard library for this purpose, HashMap and ArrayHashMap and variants. The main limitation in the current implementation of comptime is that it can't handle heap allocations, so these can't be used at comptime yet. D has a garbage collector, which makes that sort of thing easier. It's also about fifteen years older, which also helps.
That limitation won't be there forever, though, this is just the immaturity I referenced in my first post.
But fair enough, you've convinced me: D is also a language which has improved on C++ templates using compile-time execution.
Oh that's a great response, much appreciated. I'm glad I asked the question.
From what I understand, a disadvantage of this templating capability in C++ is that it makes the language "unpredictable" in terms of static analysis, so that, for example, intellisense in code editors cannot be implemented without having a full C++ compiler. ..But then again, maybe that's true of any language with macros, where the code can dynamically modify itself.
That's an interesting point too, about how C++ templating is a second language that runs at compile time, making the code more complicated to understand. Whereas in Lisps, macros are written in the same language, working with the same AST in a unified way. It's conceptually simpler, a more elegant way to achieve the same result.
I'd heard of comptime in Zig, but didn't know that there was a "complexity limit". Fascinating. It makes sense how it's related to the concept of "gas" in the Ethereum virtual machine, and "fuel" in Piccolo. (I remember seeing the latter language recently.)
..Whew, many tempting rabbit holes there. :) I'd been interested in Zig, reading about it here and there, but your comment convinced me to get more serious about studying it. Thanks!
We have the libc - what makes the STL such a crucial idea and tremendous accomplishment? Hell, they seem to share many of the same deficiencies even - caring about things they shouldn't, then doing a terrible job of it.
It's hard to even tell what you think you're commenting on here, so I'll be generous and assume you're comparing the libc qsort to the STL's sort algorithm.
The libc qsort is type erased and only works on arrays. So to sort Widgets we note down how big a Widget is and how many there are, then we forget these are Widgets and we provide qsort with just a type erased pointer to the array [of Widgets], the size and count, and we provide a function pointer (we'll need to ensure it's suited for specifically sorting Widgets) to a custom function we wrote which receives a pair of type erased pointers [to Widgets] to decide which comes before the other.
A lot of performance optimisations are left on the table when we do this, and of course because it's type erased it also loses type safety. We can't reliably apply our algorithm to any other collection, and we can't easily re-use the comparison functions either.
The STL's generics mean you preserve type safety and you can use this for any (conforming) collection type, and you can re-use comparisons because those are generic too. Along the way you pick up the performance optimisations that were lost with type erasure.
Better still, all of this is optional. You can perform the type erasure explicitly if, for example, you have a thousand different types and want to emit code to sort them all even if it's a little slower, rather than emit a thousand different sort implementations, if the code size is crucial for example.
libc doesn't really have a concept of containers or iterators. The only container it knows is an array.
The one well-known "generic" algorithm in libc is qsort. But qsort only works with pointers, and so can only sort continuous containers. You can't sort a linked list with qsort. In contrast, the STL has a more general concept of iterators that know how to fetch their next element (and some iterators can do more than just that). This allows the STL to have a std::sort algorithm that works on arrays, linked lists, trees, and whatever other data structure you want to come up with.
So the crucial idea is that the STL provides a large set of algorithms that work on all containers. No matter what kind of weird container you have, as long as you can provide an iterator for it, you can use most of the ~100 different standard algorithms on it. You can sort, find specific elements, partition, find the max element, filter elements, etc.
And the idea also goes the other way around. The STL also provides common containers that you'll need, such as linked lists, growable arrays, associative containers, strings. Of course, all the STL algorithms work with the STL containers.
Then, if you want to write your own algorithm, you only need to write it once in terms of iterators and then you can use it with all the different containers. No need to write the same routine twice.
The last big accomplishment is that all this can be done with almost no runtime overhead. All the types are resolved at compile-time, and generic algorithms can be specialized if you have a more efficient implementation for a specific container. This means that if you wrote a manual implementation of an algorithm that specifically matches one container, it would not be much faster or consume fewer resources than just using the generic STL algorithm. At the time the STL was originally incorporated, I think this was unique and no other language was actually capable of doing this.
they're right, but it's also almost 30 years old, and incorporates some unpleasant political compromises, as well as some failed experiments. you can do better, but unless you understand why the stl is one of the greatest accomplishments in computer science, you won't
It always depended on the specific C++ bubble you're in. The stdlib was also controversial back then (even in the late 90's when I started to use C++).
On top of what sibling comments mention, another common problem is that the STL has many different implementations with different/inconsistent performance profiles. If you want containers with guaranteed performance characteristics, you're better off using your own or a third party.
Yeah the bulk of the time it doesn’t matter. The only places it really matters are if you have stronger perf requirements, or dislike stupid amounts of UB.