> because without FDO (or PGO) the compiler has no idea how likely each branch is to be taken
So, you can hit '\0' at most once per string, because the function returns when you do, but you can hit the other characters many times; that seems like information a compiler has access to even without PGO.
PGO does help, of course, and on my machine gives me 2.80s, which is better than the code at the end of the `Rearranging blocks` section :)
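To make the point concrete, here is a minimal sketch of the kind of branch being discussed. The function name and the 's'-counting loop are made up for illustration, not taken from the post; `__builtin_expect` is the GCC/Clang builtin that encodes the same static hint a compiler can often infer on its own:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classifier loop (not the code from the post): count the
 * 's' characters in a NUL-terminated string.  The '\0' test can succeed
 * at most once per call, since the function returns when it does, while
 * the other test can succeed many times.  __builtin_expect encodes
 * exactly that hint statically, with no PGO required. */
static size_t count_s(const char *p) {
    size_t n = 0;
    for (;;) {
        char c = *p++;
        if (__builtin_expect(c == '\0', 0))  /* hit at most once */
            return n;
        if (c == 's')                        /* may be hit many times */
            n++;
    }
}
```

Even without the hint, compilers typically assume a loop-exit branch is not taken, which matches the "at most once per string" observation above.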
Side note: There's a part two to this post (linked at the bottom) where I make the C code as fast as I possibly can, and it beats all the assembly in this post.
I never said writing assembly is (necessarily) a good idea; I just find optimizing it, and deciphering compiler output, an interesting challenge and a good learning opportunity.
Imagine a scenario where most of the strings being processed contain a single null character, with no other characters. In that case checking for the null character first would be optimal.
Does the compiler know that this isn't true? No, it doesn't. The author of the article is making an assumption about the contents of the data that might seem reasonable but isn't necessarily true.
But because in the single-null case the loop body is executed only once, the gains of an arrangement that prefers nulls are pretty slim compared to long-string cases, where the loop body is executed many times. For example, if your dataset contains 99 null-only strings and one 100-character string, it might still be optimal in aggregate to use the arrangement that favors long strings.
Of course, there are still cases where non-empty strings are extremely rare, and in those cases optimizing for the null check makes sense.
> I assume that their test input (which isn't described in the post, and is also not in their GitHub repo)
It's described under `Benchmarking setup`, and is in the repository here: https://github.com/414owen/blog-code/blob/master/01-six-time...