55 GiB/s FizzBuzz (codegolf.stackexchange.com)
1260 points by luu on Oct 28, 2021 | 255 comments



The amount of low-level CPU architecture knowledge needed to write such a program is mind-boggling. Just goes to show how much room for improvement a lot of programs have.


FizzBuzz has many properties that make it very suitable for these kinds of optimizations, which might not be applicable to general-purpose code:
+ extremely small working set (a few registers' worth of state)
+ extremely predictable branching behavior
+ no I/O

These properties, however, don't diminish the achievement of leveraging AVX2 (or any vectorization) for a problem that doesn't immediately jump out as SIMD.


> no I/O

The problem, as described, is writing out bytes, which is probably one of the more expensive parts of this. In fact, if you read the winning solution's description, I/O is the primary problem here.

> doesn't immediately jump out as SIMD.

IDK that I agree with this assessment. Very naively, I see no reason you wouldn't take the 512-bit SIMD registers and split them into 16 32-bit lanes. From there, it's a relatively simple matter of using 2 registers for the divisors, pulling out the results, and transforming them into the text to print. In other words, you'd be chunking this up into 16 iterations per loop (vs. 1 with the naive assembly).

This is the sort of thing that jumps out as easily vectorizable.
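To make that concrete, here's a rough sketch of the lane-per-number idea (entirely my own toy code, not anything from the linked answer): test 16 numbers at once for divisibility by 3 and 5 using the multiply-by-modular-inverse trick instead of actual division. Compile with -mavx512f; the variable-length text output, which is the genuinely hard part, is left scalar.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        // lane i holds the number base + i
        uint32_t base = 1;
        __m512i idx = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                        8, 9, 10, 11, 12, 13, 14, 15);
        __m512i n = _mm512_add_epi32(_mm512_set1_epi32((int)base), idx);

        // n % 3 == 0  <=>  n * 0xAAAAAAAB (mod 2^32) <= 0x55555555
        __mmask16 fizz = _mm512_cmple_epu32_mask(
            _mm512_mullo_epi32(n, _mm512_set1_epi32((int)0xAAAAAAABu)),
            _mm512_set1_epi32(0x55555555));
        // n % 5 == 0  <=>  n * 0xCCCCCCCD (mod 2^32) <= 0x33333333
        __mmask16 buzz = _mm512_cmple_epu32_mask(
            _mm512_mullo_epi32(n, _mm512_set1_epi32((int)0xCCCCCCCDu)),
            _mm512_set1_epi32(0x33333333));

        // the hard part -- turning this into variable-length text -- stays scalar
        for (int i = 0; i < 16; i++) {
            if ((fizz >> i) & 1) printf("Fizz");
            if ((buzz >> i) & 1) printf("Buzz");
            if (!(((fizz | buzz) >> i) & 1)) printf("%u", base + (uint32_t)i);
            printf("\n");
        }
        return 0;
    }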

Now, the fastest answer very obviously does not take this approach, because I'm certain they realized the same thing: the difficult part here isn't the actual division but pumping out the correct text at the highest speed possible. If you read through it, most of the code is dedicated to converting binary numbers into ASCII :D


Maybe I should be more clear; no data needs to be fetched from disk/network, and the "writes" don't need to go past memory.

As for the second point, you might have a different definition of "naive" and "relatively simple", as my brain has rotted too much from only thinking about SIMD for numerical computation. While determining divisibility may be relatively clear, it wasn't clear how the printing would be easily vectorizable, as the output per number is variable in length.


> transforming them into the text to print

I think this nails it. Vectorizing the math problem is "easy" (send batches of numbers to cores, do the division) but then you have to re-order it for printing (not to mention actually print it), so paradoxically, it probably makes more sense for the program to be single-threaded.


You don't even need to do division. There are only 3 patterns that repeat for every set of 10 numbers. You just need to track which one you're on.


Yes, the patterns repeat every 15 integers, so you only need to do one division operation: getting the index modulo 15.

I was bored once at work and figured out a way to compute this without doing much arithmetic. It only requires 1 modulo operation.

    #include <stdio.h>

    int main(void)
    {
        for (int i = 1; i <= 100; i++)
        {
            // make a bit vector with a 1 in the (i mod 15)th bit
            unsigned int i_as_bitvector = 1u << (i % 15);
            // Check it against the valid positions for FizzBuzz:
            // Octal 011111 is binary 1001001001001 (bits 0,3,6,9,12 -> multiples of 3)
            // Octal  02041 is binary   10000100001 (bits 0,5,10     -> multiples of 5)
            printf("%d  ", i);
            if (i_as_bitvector & 011111) printf("Fizz");
            if (i_as_bitvector &  02041) printf("Buzz");
            printf("\n");
        }
        return 0;
    }

I also have a version which has NO magic constants anywhere in the program, except 3 and 5. I'll post it if someone is interested.


Also, after printing 1 to 9, you have a single pattern of 30 numbers that repeats exactly 3 times from 10 to 99, then 30 times from 100 to 999, then 300 times from 1000 to 9999, and so on. That lets you extract lots of code from the loop and run it roughly once every 10^n numbers.


Why would you think in sets of ten, when there should actually just be one pattern in 15? Then it just becomes a challenge to arrange your code to work on these blocks of 15.

We could probably code 16 versions of the block of 15 code that repeat and are nicely aligned for SIMD.


In my mind a SIMD version would work by predefining a string buffer 15 elements long with Fizz in positions 0,3,6,... and Buzz in positions 0,5,10. These are constants that don't ever change. Then vectorized code would only write to positions that change, the numbers: 1,2,4,7,8,11,13,14. Most of the time these positions would have fixed width too (lengths of large numbers don't change often) so you can use constant write addresses. So 8 SIMD threads could handle a block, and write everything blockwise.

Same idea could be used to parallelize for a GPU.
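Here's a scalar sketch of that template idea (my own toy code, not from the linked answer): the Fizz/Buzz/FizzBuzz entries are written once, and each block of 15 only rewrites the eight number slots. It skips 1-14 and keeps one string per slot rather than one flat byte buffer, so the constant-write-address part is only approximated.

    #include <stdio.h>
    #include <string.h>

    // slots within a block of 15 that hold plain numbers
    // (slot 0 is the multiple of 15 itself, i.e. FizzBuzz)
    static const int num_pos[8] = {1, 2, 4, 7, 8, 11, 13, 14};

    int main(void) {
        char block[15][24];
        // build the template once: the Fizz/Buzz/FizzBuzz lines never change
        for (int p = 0; p < 15; p++) {
            block[p][0] = '\0';
            if (p % 3 == 0) strcat(block[p], "Fizz");
            if (p % 5 == 0) strcat(block[p], "Buzz");
        }
        // a few example blocks: 15..29, 30..44, ..., 90..104
        for (long base = 15; base < 105; base += 15) {
            // only the number slots are rewritten per block
            for (int i = 0; i < 8; i++)
                snprintf(block[num_pos[i]], sizeof block[0], "%ld", base + num_pos[i]);
            for (int p = 0; p < 15; p++)
                puts(block[p]);
        }
        return 0;
    }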


Because the digits increment in groups of ten.


I was thinking blocks of 10 because I can itoa(the number of tens) and then concat with the subset of 0, 1, 2, 3, 4, etc. that I care about. I guess with blocks of 15 you just need to do two itoas, and worry about two types of block vs 3.


> There are only 3 patterns that repeat for every set of 10 numbers

Ah, yep, good point!


The real question is whether you can use something like this for formatting the numbers.


Using AVX-512 is not suitable for programs that take a very short time to run. There is a penalty in the ms range to "warm up" the CPU (it is more of a cool-down, actually).

As the OP stated, the limiting factor is memory access. That's why he kept saying 64 bytes every four cycles.

But the OP likely didn't use it because most CPUs lack support for AVX-512.

The new Intel CPUs introduce many changes to the frontend. This will likely improve the speed.

It might also be possible to try to make the CPU operate at a higher clock speed.

Code looks a bit long. Not sure if the unrolling actually helps.

EDIT: Just looked at the Agner Fog microarchitecture doc. Ice Lake and Tiger Lake can do 64 bytes/cycle.

In theory, it can run 4x faster (on bare metal, maybe).


I think you're missing the point. The issue is that with the advent of higher-level languages, starting with Java and JavaScript and on to Python and so on, most people have forgotten, or have never learned, the skills to optimize code.

I'll argue that, as a percent, the number of people who can write proper multi-threaded code has only diminished over the years.

And we see the result: despite the massive increase in computing power, software in general has become slower and more bloated.


The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.

If you have "months to write fizzbuzz" levels of resources available, sure, you can microoptimize everything. Except in practice you can't, because the amount of effort needed for this kind of microoptimization is quadratic or worse in the complexity of the problem you're actually solving.

For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster, because they'll have had a lot more time available to spend prototyping and picking good algorithms rather than debugging undefined behaviour because they used the wrong combination of integer bit lengths.


> The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.

I disagree. I believe the majority of code is slow because it's written without any consideration at all for performance. I like Casey Muratori's philosophy of non-pessimization, where true optimization (measuring, working hot spots) is rare and rarely necessary, but massive speedups compared to the general state of the art are achievable by simply not writing code using patterns that are inherently slow. This isn't deep algorithmic stuff; it's just avoiding copies and/or pointer chasing.

Edit: https://www.youtube.com/watch?v=pgoetgxecw8 <- Casey's most recent intro to non-pessimization

> For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster

The Advent of Code runs every year and I'm not sure about C (not something I track) but there are plenty of Rust submissions and while the Rust submissions take longer to come in, it's not THAT much longer. From memory it's like 2x on a timescale of minutes. The Python programs are not faster.


Not disagreeing or anything, but an Advent of Code sized example will not be anywhere close in complexity to your typical CRUD application. The thing is, we possibly could not even write most of these applications in a low-level language; we should not forget that while Rust indeed has good abstractions, low-level languages simply leak memory-related information to higher levels as well, making it necessary to deal with it at a high level. And in these cases, time to market, speed of refactors and maintainability are much more important than raw speed (which, again, may not even be that significant at all; e.g., often most of the work is done on the db side).


> an Advent of Code sized example will not be anywhere close in complexity to your typical CRUD application

Advent of Code sized problems are the best-case scenario for Python. The difference in implementation time goes down as program size increases, because you spend a larger fraction of the time figuring out what you're implementing.

The reason it's not done is mainly social and not technical. Ecosystems matter and this is particularly the case in graphical desktop environments where massive amounts of time and effort are required to re-implement features matching users' expectations (e.g. Flutter).

If we're talking a server side http app server then the library requirements are significantly lower and the libraries are generally there in almost any language. To keep the language comparison the same, it's not significantly slower to implement a CRUD backend in Rust than it is in Python. Depending on your specific setup and how much pre-prep is involved it can be faster. I'm not aware of any framework in a dynamic language that can produce production grade endpoint handlers that are as terse as a Rocket app heavily leveraging request guards. The guards handle validation, conversion, session context and db connection based on the endpoint parameter types so you just write the happy path which is 1-2 lines of code for CRUD.

> speed of refactors and maintainability

Dynamic languages are significantly worse for maintainability and refactoring. I say this as someone who generally gets paid to be a frontend developer and spent years writing both Python and Clojure professionally. Despite my arguments, I do not write server side Rust for work because team coherence is more important for doing my actual job (solving a customer's problem) than programming language and I'm the only person in my company who writes Rust. I've been doing this long enough that I accept the worse solution as the cost of doing business. My personal projects don't have this constraint so perf is orders of magnitude better in exchange for a lot more knowledge and a bit more design work.


> Advent of Code size problems is the best case scenario for Python. The difference in implementation time goes down as program size increases because you spend a larger fraction of the time figuring out what you're implementing.

The difference in code size between a good and a bad language goes up as program size increases, because a large codebase makes it harder to keep it all in your head and forces you to add more layers and duplication.


>And in these cases, time to market, speed of refactors and maintainability are much more important than raw speed (which, again, may not even be that significant at all; e.g., often most of the work is done on the db side)

About the raw speed. Maybe you want to take a look at these nice benchmarks? https://www.techempower.com/benchmarks/

You don't have to ditch Java or C# for your app and use C or Rust instead. But you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.

Also, if performance, throughput, and stability weren't a thing, we'd have to wonder why Twitter switched from RoR to Java. Couldn't they just buy more servers instead?


> you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.

In practice you only have so much attention and there are better places to spend it.

> Also, if performance, throughput, and stability weren't a thing, we'd have to wonder why Twitter switched from RoR to Java.

If raw speed matters we have to wonder why Twitter started with RoR, and only switched (mostly to Scala, not Java, AIUI) once they had tens of millions of users.

In a tiny handful of outlier cases where a given piece of code is run literally billions of times a day it eventually becomes worth doing performance microoptimisations. But only after you've confirmed that there's absolutely massive demand for this particular code and already done your iterative refactors and figured out what the code should do.


> But you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.

Writing Java in a more performant way often means ditching OOP and using primitive types, primitive arrays everywhere instead, and also avoiding allocations up to the point where all objects are reused. Such code is just as slow to write and as error-prone as C, and way worse than C++ where at least you have RAII, generics, STL algorithms and zero-cost abstractions.


You only have to do that in hot loops (in most problems), which you can easily profile with the very good observability tools that Java provides. The rest can be maintainable, good-quality OOP code.

I would even go as far as to say that OOP's encapsulation pretty much exists just for this: the "ugly" implementation can reside inside a class and you only have to care about its public API from the outside.


> I believe the majority of code is slow because it's written without any consideration at all for performance. I like Casey Muratori's philosophy of non-pessimization, where true optimization (measuring, working hot spots) is rare and rarely necessary, but massive speedups compared to the general state of the art are achievable by simply not writing code using patterns that are inherently slow. This isn't deep algorithmic stuff; it's just avoiding copies and/or pointer chasing.

I don't/can't watch videos; I would be interested to see a written argument, but generally my view is that copies and pointer chasing are irrelevant because code is so far away from optimal already. You're right that code is mainly written without any consideration to performance, but that manifests itself first in poor choices of algorithm and/or datastructure. I don't think there's anything "deep" about, say, using a hashmap rather than searching linearly through a list; indeed I'd consider it easier to understand than avoiding copying.

> The Advent of Code runs every year and I'm not sure about C (not something I track) but there are plenty of Rust submissions and while the Rust submissions take longer to come in, it's not THAT much longer. From memory it's like 2x on a timescale of minutes. The Python programs are not faster.

Advent of Code is still extremely tiny programs, and Rust is a much much better language than C for thinking in.


True. Most business apps I've seen seemed like people did serious research into how to write the worst code from a performance point of view. Gazillions of wrappers upon wrappers, tons of indirection, data being copied and recopied thousands of times, creating FactoryFactoryFactories just for the sake of it.

If people paid more attention to how the data flows instead of to Uncle Bob and the GoF, they would end up not only with a cleaner software architecture but also with more performance.


> I believe the majority of code is slow because it's written without any consideration at all to performance.

Both can be correct. I use the standard data structures for many things even if some prefix tree might be better for the operations I perform on a set. I don't think about that in most cases, since computing power is cheaper than the additional work. If performance becomes a problem, I can optimize later.

Advent of Code isn't a realistic performance metric. Working software is for most software projects.


> For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster, because they'll have had a lot more time available to spend prototyping and picking good algorithms rather than debugging undefined behaviour because they used the wrong combination of integer bit lengths.

I think this is more nuanced than that. Competitive programming actually gives fairly good insight here, because skilled programmers are placed under time constraints and obviously care about performance.

What you see, usually, is that programs that don't need maximum performance are often written in Python, which allows people to quickly write code that needs less debugging. But double the allotted time and the C programmer finally has their code finished and free of bugs, and they're beating the pants off of your Python code right out of the gate. You try and come up with clever ways to avoid copies and drive down the Big-O, but the C programmer comes up with the same thing just a bit behind you and leads most of the time because they get a free 10x speedup just because of the language they're using.

It's not surprising that the highest levels of competitive programming are dominated by C (and to a greater extent, C++) programmers, with Python use extremely rare, almost always reserved for rare, complicated algorithms that happen to bring a significant big-O improvement that justify the use of Python to write them.


The overwhelming majority of code isn't slow because it used the wrong algorithm; it's slow because the Product team can't demo or show off an increase in efficiency. It's because they aren't measured by their ability to reduce or control the growth of the AWS bill. And they probably wouldn't have the knowledge to ascertain a reasonable bill. So they push deadlines, which pushes devs to take ugly but fast-to-write solutions using high-level languages.

Also, performance improvements pay for themselves over months and years while new features usually pay for themselves faster (at least they do in forecasts).

And finally, touching working code to make a performance improvement necessarily has some risk and that risk might be worth more than a year of cost savings from the improvement.


>The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.

Take a look at the Computer Language Benchmarks Game, where all implementations use the same algorithm. An optimized C implementation can be 300 times faster than Python.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


That's exactly the kind of missing the point that I'm talking about: tiny programs where all implementations use the same algorithm, where people have spent months microoptimising, and which are utterly unreflective of realistic business problems.


> debugging undefined behaviour because they used the wrong combination of integer bit lengths.

It's a function of debugging time which is a function of the programmer's skills. I almost certainly fall into the Python subset though :/


There's a law that says that as resources become easier to consume, people consume more of them. Like when increasing the capacity of a road doesn't reduce the time spent in traffic. Wouldn't that apply to software? I feel like it's a bit unfair to call software "slower and bloated" when you wouldn't call a road "bloated" because it has a new lane. People can do more things now.


More like I start manufacturing 3-lane-wide cars because the material I'm using is cheaper than cutting and engineering it down to normal size; that's how I see the modern-software analogy in this context.


I don't think that's fair, we do way more things now with computers than before.


It feels like the frustration comes because I see the analogy this way:

The roads have been given extra lanes, and the total possible throughput of the road has increased. As a result, car manufacturers have decided that they can use a cheaper manufacturing process which makes slower cars. But when anyone complains, they point to the additional lanes and say ‘but the number of cars that get from point A to point B has increased, even if each trip is slower.’

The end user isn’t getting a choice when software developers make slower code each year. They have to buy more expensive hardware to keep up with the slowdowns of software developers.


I get it. But what exactly do you want to happen here?

In terms of your analogy, do you want more expensive cars that less people have?

Software developers could spend more time optimizing their code. But that means they are spending less time elsewhere: fixing bugs, adding features, developing other products.


It should be noted that many times features come in that people do not want.

“The new format bar” for Slack and “huddles” (perhaps less antagonizingly) are examples of things that Slack has done seemingly to prove they’re still working on the program.

Despite the features not being needed in the slightest.


I actually disagree, at least about huddles. Huddles are a direct response to the low-friction nature of Discord voice channels (and arguably Teamspeak/Mumble/etc before Discord, although nothing else married text and voice channels quite as well before Discord.) It's almost silly how small of a difference there is between starting a huddle and a channel-wide "call" - the only difference is that it doesn't start out by "ringing" for everyone in the channel. But that tiny difference completely shifts the feeling from "this is a formal communication" to "we're all just hanging out and you can join if you want but no pressure."

IMO huddles happened because the pandemic moved so many more people remote that the need for a water-cooler replacement became much more obvious.


I mean, we’re programmers here and it’s the easiest damn thing to learn. We could just write the software we like. Scratch our own itch.

I, for one, don’t care that much for performance so my personal forks of projects are all feature oriented.

But if you care about performance this much then writing something high performance is the best way to find others like you.


That would imply that we do fewer or only as many things as before with computers, which I think is false. Maybe you don't agree with what everyone does, but that doesn't mean that it's worthless.


Jevons paradox.


I think the kind of optimization here is beyond that which people used to do. AVX didn't exist when it was common to write assembly.


Yes, this is extreme and way beyond my C capabilities at any point in time. I was making a general point which I do believe is valid.


Sometimes bloaty design decisions transcend the question of programming language. Electron probably has (I didn't check) a faster JavaScript engine than Qt, but I'd expect that similarly sophisticated implementations of the same program in Qt or Electron show an advantage in performance to the former. But people use Electron because it's familiar and it's got a lot of active development, and more cynically, Google likes it that way because it means that people keep coding for Chrome (cf. Microsoft's VS ecosystem, Apple's walled garden, etc).

The problem with modern software is that vendors encourage new developers to learn using heavyweight toolkits because lock-in is profitable in the long run. One day Electron will work with Go, but it will still be a plague.


>FizzBuzz has many properties that make it very suitable for these kinds of optimizations that might not be applicable to general purpose code: + extremely small working set (a few registers worth of state) + extremely predictable branching behavior + no I/O

The working set depends on how you organize the data. I/O can be done in larger chunks most of the time. You have to think about the data you need and when you need it.

As for branching, that most of the time depends on the programmer rather than on the problem being solved. The situations where you need very complicated branching to solve a problem are rare. And I've seen O(n log n) algorithms beat clever O(n) algorithms because they could be better optimized for the CPU.


> for a problem that doesn't immediately jump out as SIMD.

It's a single transformation applied to the elements of a large array. Parallelization is obvious, and even if the exact SIMDization is not obvious, one is certainly motivated to formulate it. And a scatter-like approach does spring to mind right away, I believe.


FizzBuzz is a constant.

It should never branch.


Unrolling FizzBuzz is generally not a feasible solution.


Sure it is: https://codegolf.stackexchange.com/questions/215216/high-thr...

It executes just one (well predicted) branch every 30 numbers written, and incrementing/printing the line number is branchless too.

It's not as fast as the subject of the post (it does about 40 GB/s), but it's only a few hours of work.


The parent comment asserted that FizzBuzz should never branch. This would require it to not have any backwards control flow, and thus require full unrolling of an infinite sequence. Hence the infeasibility.


FizzBuzz is limited to [1…100].

Output as an ASCII string, it is just over 400 bytes including \n’s.

If you’re counting GB/s, you’ve failed to understand the specification.

One of the sophisticated engineering aspects of FizzBuzz is that all optimizations are premature.

It is a boring problem. People being people invent different problems to provide a chance to be clever.


I interpreted that as no branches, only loops.


It is possible to instrument the code to see how much overhead those branches add.


It pays off to understand the CPU architecture even if you are not writing assembly: https://blog.cloudflare.com/branch-predictor/

Once upon a time most software was highly optimized, with hot code paths written in assembly. If you look at DOS source code or the DOOM source code, you will see lots of optimization.

When CPUs got more powerful, people got lazy and thought they could spend the improvements on conveniences.

Now we are at the point that we run "desktop" apps written in Javascript on top of embedded browsers.


Hard times create strong programmers. Strong programmers create good times. Good times create lazy programmers. And, lazy programmers create hard times.


Programming is a metaphor of life.


> If you look at DOS source code or the DOOM source code, you will see lots of optimization

If you look at VS Code's code, you'll see even more of these. What really changed was that more people wanted to hack things and write non-PhD code, and web tech was (sadly) popular enough to take on this role. It's not that programmers are lazy; it's that desktop frameworks utterly failed in their non-competitive niches. MSVC libraries replaced each other almost faster than GitHub's new web frameworks, to the point that barely anyone can remember their chronology. Gtk+ got stuck for years dealing with C legacy, added non-C styling engines too late, and then got "owned" by a particularly deteriorating DE. Qt always had a not-so-clear position on community-licensed versions, and the C++ impedance mismatch didn't help either. UIKit/AppKit got pretty useful and advanced in the last ~ten years, but are iOS/macOS only. Minor frameworks like FOX, wx, FLTK, etc. never got enough traction or new ideas and were just shadows of their bigger brothers. Meanwhile, with Electron, Bootstrap and a little JS, one can make a page in a few minutes that could take a few hours on a conventional desktop toolkit.

I mean, you are correct that programming went from hardcore to relaxed mode, but there is more history to it in ui sense.


Business code must be optimized for maintenance first.

Because it will live for many years. It will have to survive in the hands of multiple caretakers. It will have to evolve to work with the underlying foundations changing (operating system, compiler/runtime, desktop -> web -> mobile).

That's different from most video games (one single release, then little patches) and from advent calendar code.


Hot code paths are still written in assembly, it's just on a different level. DOOM is an engine; nowadays people use off the shelf engines so that they can focus on creativity instead of low level details.

I mean I could probably hyperfocus on storing my application state in a file on disk, but why should I bother when there's off the shelf SQL databases right there? Which have been optimized endlessly, I might add. I don't get paid to write low level ASM, I get paid to build applications.

Edit: And to add, my current thing I get paid for is several orders of magnitude faster than the one it replaces. Not because I spend more time optimizing, but because I write it in sensible Go instead of 10K LOC PHP files that concatenate XML strings that get converted to JSON to be rendered in Dojo-flavored HTML concatenated together in JS because of reasons.


Not lazy, sensible. The market has spoken and it wants bloated electron apps with rapid development, rather than performant C/assembly apps that hardly ever change.


Were all Electron app users demanding something that eats up their RAM, crashes, and runs slowly?

The market demand seems more like:

Jack: "Our bellowed CEO wants us to deliver our wonderful app to unwashed masses still using a desktop. Our enlighted marketing team made a study which realized that for whatever weird reason, corporations and businesses still make heavy use of those boxes which come with a mouse and keyboard attached."

Joe: "Sure thing boss, we will have to hire some programmers and testers and will take about a year or so."

Jack: "I forgot to tell you that the marketing study already took a year and a very large budget because we hired the best of the best to do it. One year we don't have, money we don't have. But what about those people who wrote our web app? We still pay them. Can't they deliver?"

Joe: "We will have our glorious desktop app in two weeks, boss, I just had an idea."

Jack: "Do that and I will personally remind about you to our bellowed CEO when he will want to do some promotions."


“The market”?

The power dynamics of companies/customers are often not as dynamic as all that.

If Slack is Electron and I work at a company that uses Slack, I must use it.

The competition in that space is all Electron; you can't choose.

It’s like saying that “the market chose non-ECC ram”. No, Intel chose for you and you don’t get much choice except to suck it up (or pay well above the odds.)

It takes a lot to avoid using a product. I mean people still use Oracle products!


That does not actually contradict the point. We got stuck in a suboptimal local maximum due to all the early design decisions of browsers and JavaScript. The original inventors did not expect anyone writing a web version of Google Drive.

The market surely pushes against bloated Electron apps, yet the convenience of having the same app on the web as well as "native", and the amount of man-years that went into making HTML+JS the richest multi-platform UI framework on the market, are more important.


The market has spoken. It wants factories to dump waste into nearby rivers because it allows them to make cheaper products that are more competitive.

It's not really a good argument is it?


There is no market for things that just work at v1.0 and then continue to work flawlessly. You won't sell version n again and again, and there is no support contract for something that does a single thing and does it well.



Yes, thank you for mentioning it.


These days, compilers have gotten so damn smart with their optimizations that using assembly will actually make your code slower unless you really know what you're doing like the original post here.

A few years ago, I was doing Project Euler problems, and one of them had a solution that took about 45 minutes for my C program to solve. I decided that it was an incredibly simple problem, so rewrote it in Assembly.

My assembly solution took an hour to run.

I disassembled my C solution, and couldn't figure out what it was actually doing, though my assembly knowledge is pretty weak.


People might not care to learn it, but isn't there a lot more to know nowadays re: the obscurities going on deep down at the lowest levels of modern architectures?


I don't think you should be an expert in CPU architectures, but having an understanding of memory layout and how instructions are dispatched is not very hard or time-consuming.

I see many people having trouble understanding the difference between value and reference, what pointers are, why it's better for structures to contain variables of the same type and have a certain size, why it's better to call a function once for a large chunk of data instead of calling it many times for small chunks, and why iteration or tail-call recursion is to be preferred over simple recursion.

The view is that most LOB apps won't care about performance because they are waiting for I/O, so we should not care about performance but about coding speed, OOP, Uncle Bob's principles, GoF patterns, REST, and the trendy architecture of the day. While I'm sure that coding speed matters and a sound architecture matters, I also think that throughput matters, that minimizing delays matters, and that some problems are better dealt with by thinking about the data and how the CPU likes to access and work with it, instead of just firing up more Kubernetes pods and hoping that scaling will get rid of performance issues. By not thinking about the data and the CPU, we didn't get rid of the complexity; we just moved it into the infrastructure and into code that has to deal with a more complex infrastructure.


We are also at a point where people whine that $5 for an app on your phone is "vastly overcharging". You get what you pay for.


At some point I believe we will start to bump up against the limits we saw in the first three computer revolutions (tubes, transistors, integrated circuits). This will cause the pendulum to swing from commodification to optimization. What I mean is, you won't be able to snap your fingers and scale up or out, thus optimization will begin again in earnest. Clearly this isn't always the case: GPU shaders, crypto ASICs, video processing... there are plenty of places where optimization is crucial for high performance loop tuning. But optimization hasn't been required across the board like there was just before the three big innovations I described hit.


This tends to happen at a smaller scale with gaming consoles. Towards the end of a generation, the games coming out usually perform/look a lot better than the ones at the beginning of the generation. I'm assuming due to a lot more careful optimizations due to not being able to change the hardware.

I've always been curious about how far we could really push modern computers if somebody wanted to spend the time going to the lengths in the original post when creating practical software. Of course it's usually not worth the tradeoff, but it's interesting to think about.


Protip:

while you made a great comment, ND people like me, and even most NT people, have difficulty ingesting walls of text like what you just wrote.

Please, therefore, break up your text into paragraphs every 3 sentences. It does wonders for readability for just about everyone. :)


…it’s literally four sentences.


You have multiple paragraphs longer than mine in your history.


This is what we call shifting the goalposts and tu quoque fallacious attacks.

Just because I have erred in the past does not mean I cannot suggest to you to improve your outputs.

Or, I have had learnings previously, and I am simply trying to pass them along to you.

Here is another protip: dinae be so defensive. I was not attacking you personally, but your reply clearly was an attempt to attack me.


And imagine not coming up with this solution in your next MANGA interview!


It takes a lot to correctly explain exactly why this set of design choices is fastest, but writing it just takes quite a simple model of the CPU internals, knowledge of the instruction set, a focus on constantly measuring performance, and an agility to iterate quickly with different approaches.


And, reading more closely, beating the competition took low-level OS knowledge and an understanding of the peculiarities of the benchmark in question.

The benchmark was about getting the output to a pipe as fast as possible, and there's this great pipe speed hack:

  // To produce output without losing speed, the program  therefore needs
  // to avoid copies, or at least do them in parallel with calculating
  // the next block of output. This can be accomplished with the
  // `vmsplice` system call, which tells the kernel to place  a reference
  // to a buffer into a pipe (as opposed to copying the data into the
  // pipe); the program at the other end of this pipe will then be able
  // to read the output directly out of this program's memory, with no
  // need to copy the data into kernelspace and then back into
  // userspace.
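
For anyone curious, here is a bare-bones sketch of what the vmsplice call looks like on its own (my own minimal example, not the program's actual I/O loop; error handling and the careful buffer-reuse discipline the author describes are omitted, and stdout must actually be a pipe):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void) {
        static char buf[1 << 16];                 // pretend this holds FizzBuzz output
        memset(buf, 'x', sizeof buf);

        struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
        while (iov.iov_len > 0) {
            // hand the kernel a reference to the buffer instead of copying it into the pipe
            ssize_t n = vmsplice(STDOUT_FILENO, &iov, 1, 0);
            if (n < 0) return 1;
            iov.iov_base = (char *)iov.iov_base + n;
            iov.iov_len -= (size_t)n;
        }
        return 0;
    }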


Also goes to show how much it would cost.


And maintain and evolve and debug and work on different machines, etc.


If you have talented staff then you'd be surprised how far you can get just by giving someone who already does that particular kind of application as a hobby an unlimited supply of coffee.

Obviously finding talented staff is very hard, but once you have your tribe you can go a very long way. E.g., I look at apps made by some people I work with (fast, lightweight, etc.) and then compare them with the crap pumped out by startups with literal billions in capital. I think it's a question of confidence more than competence.


It shows that abstractions are leaky as f :)


No, this simply shows that abstraction slows performance, which is usually a worthwhile tradeoff.

Leaky abstractions are a different problem altogether.


The problem with that tradeoff is that you only compare the performance with the 'top layer of abstraction' that is already implemented and not with the baseline.

Death by a thousand cuts.


Not usually. There's a stack of a thousand abstractions between me writing this and you reading it, but it still works fine.


Yep. I suspect most people here could write a working fizzbuzz on a whiteboard in language of choice in under 5 mins during a job interview.

Sure your Python/JavaScript/Haskell/Rust version builds on a bunch of abstractions, but it’ll run on just about anything, and …

“I've spent months working on this program”

That’s not what your boss wants to hear.


You can pump out a Modern C++ version in 5 minutes too that will run loops (haha) around the higher level languages. The readability won't even be very different...


True. I bet the Rust guys would come close too.

But realistically? For anything except code golf and nerd fights, the actual client requirement is probably better met by a WordPress widget written in php/html, because what they asked for is something that'll print the fizz buzz all the way up to the person's age when they log into the company website... Nobody is even going to notice if it takes a whole second to fizz buzz all the way to 95 :-)

(Now I'm wondering if that guy's raw hyper optimised x86 assembly can get transpiled to WASM... Because nerd fights are fun.)


> Now I'm wondering if that guy's raw hyper optimised x86 assembly can get transpiled to WASM

Not really. WASM is significantly simpler and more abstract than x86 assembly, and has a JIT compile step that probably wouldn't get anywhere near as optimized. You could probably hand-write a WASM version that would JIT compile to something roughly similar and still get very good performance, but it would probably be more comparable to the other compiled versions at best, rather than the x86 ASM one.


> probably wouldn't get anywhere near as optimized.

Yeah, I'm not doing this for performance, I'm doing out for nerd fight points and the lulz :-)

The Most Highly Optimised Fizz Buzz EVER!

In your browser!

As a service.

Join the waiting list now! Email: [_____________] [submit]


>But realistically? For anything except code golf and nerd fights, the actual client requirement is probably better met by a WordPress widget written in php/html, because what they asked for is something that'll print the fizz buzz all the way up to the person's age when they log into the company website... Nobody is even going to notice if it takes a whole second to fizz buzz all the way to 95 :-)

What if 100 million people log from different corners of the world? Would the WordPress widget still cut it?


In that scenario, the fizzbuzz widget is gonna be just a tiny tiny part of your WordPress scaling problem…


Not only can you write a modern C++ version in 5 minutes, but by carefully choosing your subset of C++ and coding style, you can do so while being as readable and maintainable as in higher-level languages. C++ isn't hard because it's low level; it is hard because it is a Frankenstein monster, and I don't think there is a living being who masters all the complexities of C++. I would love to see a C++ Next version with lots of crap cut from it.


C++ next version with lots of crap cut down is called Rust. ;)


Judging by its age and how much crap it's accumulated already, I think that in 10 years' time Rust won't be in a much better situation.

Similarly, judging by C++'s current trajectory, in 10 years it will have a simplified subset (enforced with something similar to --pedantic) which is easier to get right than Rust is today. Also, it will have a static analysis borrow checker based on clang-tidy.


Just put everything inside an electron container.

Unless you're talking about microcontroller programming, RAM is basically free.


It could be, but in most cases it's not due to RAM commonly being unupgradable in laptops.

A larger amount of RAM might still be cheap to install in the first place, but that choice is not always directly up to the consumer.


Correct, but Slack is based on Electron and is widely successful.

End users are used to tolerating a basic chat application eating an indefinite amount of ram.

From a business pov it doesn't make sense to spend time optimizing since most users don't seem to mind.


Cache, however, is not.


Thanks for my daily dose of software engineer imposter syndrome.


This is one where I’m fully comfortable with feeling like an impostor. I’ve gotten this far (~20 years) without more than a cursory glance at machine code, I’ll be pleased if I retire at the same level of relevant expertise.

Edit: don’t get me wrong! I admire the talent that goes into this and similar efforts, and find performance-chasing particularly inspiring in any context. This is just an area of that which I don’t anticipate ever wanting to tread myself.


The arrival of tools like Compiler Explorer means examining the assembly code is trivial. It's quite easy to fall down a rabbit hole of code tweaks to see what effect they have on codegen. Even examining the differences between gcc and clang can be quite revealing.


Don't worry. It just seems like everyone else is so talented because no one writes articles about the 2 hrs they spent just trying to get their project to just build without errors. Or if they do, they don't get voted to the top of HN.


That stuff almost seems like toil these days


I don’t think this took 2 hours. 2 months maybe?


He mentions it took him months in the response.


I was very grateful to see this disclaimer:

> I've spent months working on this program


Also:

> I already have a master's thesis. This was harder.


If you want to get better at this specific thing, work or study in this field.

Person who wrote this probably won't know much about, say, compatibilities of latest version of WebPack and VueX; or about cryptographic protocols.

You cannot know everything.


> You cannot know everything.

........right?


right?


This is called "specialization" and it's the only reason we've gotten so far as a species. Not everyone can or should know this level in any subject.


Just keep in mind that this person claims to have taken a few months to write this. I wouldn't worry about it. It's interesting nevertheless.

The last time I have seen this low level of code was in a college level course about assembly language.


As a comment on the linked post notes, this is impressive enough that it could probably be the basis of a degree thesis. An astonishing exhibit of technical skill, very much a masterpiece in the original sense of the term.


Imposter syndrome you say? I scrolled all the way to the bottom of page hoping to see a solution written in a language I actually understand and not pretend to understand.


It never goes away, does it


It's a feeling, a fear to be specific, so stems from a different part of the brain. No, I don't think fears go away completely. One learns to manage them.


"future-proofed where possible to be able to run faster if the relevant processor bottleneck – L2 cache write speed – is ever removed)."

Being able to write a function limited mostly by the L2 cache, and to realize that, is rad.

And btw, this is an interesting example of how hand-optimized assembly can be much, much faster than any other solution. Can you get as fast as this solution with mostly C/C++? It uses interesting tricks to avoid memcpy (calling it slow, rofl).


You could probably get very close to this solution with C or C++ plus AVX intrinsics. Some might consider that "cheating", since intrinsics occupy kind of a grey area between a higher-level language and asm.


I suspect that by the time you've disabled all the compiler "optimisations" that would likely knock a few orders of magnitude of performance off the directly translated algorithm, you may as well have written the assembly to start with…

And you probably can’t fine tune your C/C++ to get this performance without knowing exactly what processor instructions you are trying to trick the compiler into generating anyway.


> without knowing exactly what processor instructions you are trying to trick the compiler into generating

In fact, in this case you know exactly what processor instructions the compiler is going to generate. You are using AVX intrinsics after all.

And no, compiler optimizations work well with these intrinsics.


I mean, did you see the very complicated extremely optimized C and C++ code lower on the page? Despite that, they "only" got to 10% of the performance of the ASM code.


The speed of this program is partly because it's written in assembly, but mostly because it's written by someone who is quite clever and clearly put a large amount of time into this problem. None of the other solutions spend much time trying to fit their data into the CPU cache, nor do they drop to using splicing for zero copies, and not one is doing anything nearly as clever as this program to generate its numbers. All of this would be possible to mostly translate to C++ with AVX intrinsics, but the real accelerator here is not the choice of language; it's the person behind the code.


Now that I have seen the power of madvise + huge pages, everything looks like a nail. The author reckons 30% from less page-table juggling. There are techniques here that apply outside assembly.
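For reference, the huge-pages part is just a hint on an ordinary anonymous mapping. A minimal sketch (sizes and names are my own choices, and the kernel is free to ignore the hint):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64 << 20;                                   // 64 MiB working buffer
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        // ask for 2 MiB transparent huge pages to cut TLB misses and page-table walks
        madvise(buf, len, MADV_HUGEPAGE);
        /* ... fill buf with output, vmsplice it onward, etc. ... */
        munmap(buf, len);
        return 0;
    }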


It's not ASM that makes the code fast; it's the way he laid out the data and code. C/C++ should be able to approach 90% of the speed of this.


Most other implementations do not use vmsplice, which is a huge win for this specific problem.

Of course all the AVX and cache optimizations are also exceedingly clever.


    ///// Third phase of output
    //
    // This is the heart of this program. It aims to be able to produce a
    // sustained output rate of 64 bytes of FizzBuzz per four clock cycles
    // in its main loop (with frequent breaks to do I/O, and rare breaks
    // to do more expensive calculations).
    //
    // The third phase operates primarily using a bytecode interpreter; it
    // generates a program in "FizzBuzz bytecode", for which each byte of
    // bytecode generates one byte of output. The bytecode language is
    // designed so that it can be interpreted using SIMD instructions; 32
    // bytes of bytecode can be loaded from memory, interpreted, and have
    // its output stored back into memory using just four machine
    // instructions.


  // [The eighth byte of LINENO_MID] changes in meaning over the course of
  // the program. It does indeed represent the billions digit most of
  // the time; but when the line number is getting close to a multiple
  // of 10 billion, the billions and hundred-millions digits will always
  // be the same as each other (either both 9s or both 0s). When this
  // happens, the format changes: the hundred-millions digit of
  // LINENO_MID represents *both* the hundred-millions and billions
  // digits of the line number, and the top byte then represents the
  // ten-billions digit. Because incrementing a number causes a row of
  // consecutive 9s to either stay untouched, or all roll over to 0s at
  // once, this effectively lets us do maths on more than 8 digits,
  // meaning that the normal arithmetic code within the main loop can
  // handle the ten-billions digit in addition to the digits below.


You could probably get some blazing performance out of an FPGA. I made an FPGA version of FizzBuzz a few years ago, but it was optimized for pointless VGA animation rather than performance.

https://www.righto.com/2018/04/fizzbuzz-hard-way-generating-...


Would you be willing to go back and try to optimize this for pure output speed?

I had the exact same thought, but wondered if an FPGA's gate connections are too distant for FizzBuzz to beat 55 GiB/s.


With an FPGA, the speed limit will 100% be how you want to output the data.

If you require it to come out of IO pins, then the limit will be the total output bandwidth of all the serdes units on the IO pins. Typically that's much more than 55GiB/s.


Just had a quick look at the cheap FPGA dev boards, and on my basic search the clock speeds seem to be in the low 1-10 GHz range.

But this task could be scaled up with parallel FPGA boards in sync


It isn't clock speed that matters, but the number of IO pins and the rate the serdes units can be clocked.

Eg. Stratix V GX devices offer up to 66 integrated transceivers with 14.1-Gbps data rate.

That's ~1 Tbps right away from a single $500 device.


When I saw this I did wonder how much faster I could do it in hardware, but similarly I expect the bottleneck would be outputting it in a fashion that would be considered fair to be compared to software implementations (eg outputting readable text on a screen).

Regardless, I very much enjoyed your DVD screensaver-esque output.


At that point we are writing firmware for the display


GPU memory is an order of magnitude higher bandwidth than RAM, so that would seem to me to be the way to go to beat this. The output wouldn’t be accessible from CPU without a big slowdown though.


Ah, but the M1Max is unified memory, isn't it?

And am I doing the math right that a quintillion fizzbuzz will take about 7 months?


I would be doubtful since most FPGAs have a much lower clock speed than CPUs. Maybe unless you take advantage of parallelism.


I love it that humans are so diverse - there's always someone prepared to dedicate their life to the most obscure things.

It's so impressive and hilarious to me that he actually spent months on this. Well done!!


The amount of completely unnecessary effort unleashed on this problem in this post is amazing. I want to meet the author and shake his hand! It has everything from Linux kernel trivia in terms of output speed to intel AVX2 code. So unnecessary and so awesome!

I've done stuff like this before, and I imagine the satisfaction of completing it! A tip of my hat to you, sir!


> 64 bytes of FizzBuzz per 4 clock cycles

That's borderline incredulous, given a single AVX2 instruction can last multiple clock-cycles. The reciprocal throughput also doesn't go below ~0.3 to my, admittedly shallow, knowledge. A remarkable piece of engineering!


That's 16 bytes per clock cycle, i.e. one avx register per clock cycle. As most intel CPUs can only do one store per clock cycle, that's also the theoretical limit with avx. I wonder if using non temporal stores would help make the code be cache size agnostic.

Note that the instruction latency is not important as long as you can pipeline the computation fully (which appears to be the case here!).

Edit: to go faster you would need to be able to use the bandwidth of more than one CPU core. I wonder if you could precompute where the output will cross a page boundary, to be able to have distinct cores work on distinct pages... Hmm, I need a notepad.

Edit2: it is much simpler than that; you do not need to fill a page to vmsplice it. So in principle the code should parallelize quite trivially. Have each thread grab, say, 1M numbers at a time, for example by atomically incrementing a counter, serialize them to a bunch of pages, then grab the next batch. A simple queue will hold the ready batches, which can be spliced as needed either by a service thread or opportunistically by the thread that has just finished the batch next in sequence.


Except the program counting the data generation rate is running on only a single CPU...


That shouldn't be a problem, pv is basically only counting the number of pages being written. As long as the page size is much larger than the number of threads it should keep up just fine.


The page size can't be more than half the L2 cache (else writeback to DRAM would be triggered, dramatically lowering performance). That means you have to use a relatively small page size, and lock contention in the kernel for those pages then limits throughput if you use many cores, even if the data is never read by pv.

See the comments in the code for info...


Note: of course SSE is 16 bytes, AVX is 32 bytes! So there is still some margin to be exploited here.


> incredulous

FYI "incredible" might be a better word here. "Incredulous" would mean that it finds something incredible.


Alternatively "unbelievable"?


Inconceivable ? Sorry I could not resist.


As an aside, it would be fun to see a programming challenge website focused on performance and optimizations, rather than code golf (shortest program) or edge case correctness (interview type sites). I took a course like this in uni where we learned low-level optimization and got to compete with other classmates to see who had the fastest program - a fun experience that I don't really see much of online.


Benchmarking reliably and fairly for minimal cost is really hard in a cloud environment.


The course had a machine solely for benchmarking. It would only compile and benchmark one program at a time and the students had to queue up.

It worked well for the course and the results were very consistent between runs, but that kind of setup doesn't scale well.

There may be a (slow) option though, and that would be benchmarking the code using a simulated processor. The runtime would be measured relative to the simulated processor as opposed to the computer hosting the simulator.


True, but consider that compiling hello world takes me about 5 minutes in gem5 when using a semi-realistic model of the computer.


Only getting 7GiB/s on a Ryzen 7 1800X w/DDR4 3000. Zen 1 executes AVX2 instructions at half speed, but that doesn't account for all of the difference. I guess I need a new computer. To run FizzBuzz.


I ran this on a server with a 5950X (same CPU as this test was run on), but with 2666MHz memory (128GB) instead of 3600MHz memory (32GB) and I only got 41GB/s.


Maybe because of taskset? Same cpu, no taskset, I'm getting 38-43GiB/s. With taskset, ~55GiB/s


next up: custom FizzBuzz ASIC with 128 fizz and 128 buzz cores for maximum throughput.


Substandard design. You should have a suitable ratio of cores to keep them load balanced. I suggest 160 fizz cores and 96 buzz cores.

EDIT: And a fizzbuzz co-processor, obviously.


> fizzbuzz co-processor, obviously

One of the functional modules on the M1 Pro Max chip, but only the Max one, unfortunately.

Take that, Intel!


Or if your ASIC generalizes the problem in a slightly different direction, you end up reinventing TWINKLE and TWIRL: https://en.wikipedia.org/wiki/TWINKLE


Thanks for sharing this, it's always amazing to see how physical properties can be used in weird ways to bypass the normal "rules" of computation.


You could probably burn all multiples of 3 and 5 up to some really big number directly into hardware, and then do something like the big CAM tables Internet core routers use to map the numbers to ASCII bit strings. Then speed up I/O by not having a general-purpose display, but something more like an old-school digital clock that can only show digits and the words "fizz" and "buzz".


You'll get docked points from the typography nerd for displaying "fizz" instead of "ﬁzz"


That's definitely not going to be useful. By doing this 15 numbers at a time you completely avoid needing to do any division anyway: print FizzBuzz, i+1, i+2, Fizz, i+4, Buzz, Fizz, i+7, i+8, Fizz, Buzz, i+11, Fizz, i+13, i+14, then increment i by 15 (see the sketch below).

Also, even with a very slow CPU you'd already run faster than can be perceived on a digital clock
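
A minimal sketch of that no-division unrolling (standard FizzBuzz output; the only assumption is that the loop counter stays a multiple of 15):

  /* One full period of 15 lines is unrolled, so no division or modulo is needed.
     i is always a multiple of 15; each block prints lines i+1 .. i+15. */
  #include <stdio.h>

  int main(void)
  {
      for (unsigned long long i = 0; i < 990; i += 15)
          printf("%llu\n%llu\nFizz\n%llu\nBuzz\nFizz\n%llu\n%llu\nFizz\nBuzz\n"
                 "%llu\nFizz\n%llu\n%llu\nFizzBuzz\n",
                 i + 1, i + 2, i + 4, i + 7, i + 8, i + 11, i + 13, i + 14);
      return 0;
  }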


I think optimizing binary tree inversion is a higher priority, right now.


Not for fizzbuzz optimisation it isn't! Get your priorities straight mate! Who is trying to get hired when we could be outputting more fizzes and/or buzzes! :)


FizzBuzzCoin?


Ah yes, the new junior developer technical interview


That would be easier.


But does it support arbitrarily large integers? ;)

Ed: no, but does pretty well:

> The program outputs a quintillion lines of FizzBuzz and then exits (going further runs into problems related to the sizes of registers). This would take tens of years to accomplish, so hopefully counts as "a very high astronomical number" (although it astonishes me that it's a small enough timespan that it might be theoretically possible to reach a number as large as a quintillion without the computer breaking).


Now I'm kinda curious to see how much faster you could go on an M1 Max with the GPU generating the data. Once his solution gets to the point of being a bytecode interpreter, it's trivially parallelizable and the M1 has _fantastic_ memory bandwidth. Does anyone know if the implementation of pv or /dev/null actually requires loading the data into CPU cache?


Pv never touches the data. It splices the input into the output, /dev/null in this case, so once written, a page is never touched again.

Splice is Linux-specific though, so you would need to run Linux on the M1.


If that is the case, does it actually matter what CPU core pv runs on? I feel like _something_ must ultimately zero the page out before it can get re-mapped to the process, but I'm not sure which core that happens on, or whether there's some other hardware mechanism that allows the OS to zero a page without actually utilizing memory bandwidth.


The page is reused by the generator process without being zeroed.


Couldn't we load it onto an NVIDIA RTX A6000? It is much, much faster than the M1 Max. It has a much greater memory bandwidth too.


Unfortunately, the memory bandwidth that matters here is not bandwidth to GPU memory, but bandwidth to main system memory (unless anyone knows how to splice a pointer to GPU memory onto a unix pipe). That's specifically why the M1 came to mind, as a powerful UMA machine that can run Linux. Perhaps a modern gaming console could hit the same performance, but I don't know if they can run Linux.


Wouldn’t benefit from UMA


PCIe 5 speed is 64 GB/s, so theoretically if you perfectly pipelined the result you could achieve the same performance on a next-generation discrete GPU.


Wow. There is programming and then there is programming. Whenever I see something like this I feel like what I do for a living is duplo blocks in comparison.


I am thankful every day for the work of those who came before me. Their long hours toiling over hardware, assembly, compilers, protocol stacks, libraries, frameworks, etc let us focus on the bigger picture, how to write same-y CRUD apps for businesses :)


I've had the opportunity to tinker with ASM, the Z80 architecture, low-level programming and other similar stuff (I'm less than 1/1000th as able as the referenced author).

I find this kind of programming very beautiful and rewarding, in that you really know that you are programming the hardware. Unfortunately it's not an easy path to a good-paying job (unless you are exceptional like the gentleman). So I ended up building fintech web apps.


You can get a well paying job off of these skills, but the job title will read something like "performance engineer" or "security researcher" rather than "assembly programmer".


To me it's almost as impressive that pv does not become the bottleneck.


I was wondering the same thing! pv probably never touches its input and might itself be using splice to never read the bytes in userspace, and just accumulate the byte counts.

Edit: yes it has a --no-splice parameter.


"The Grid. A digital frontier. I tried to picture clusters of information as they moved through the computer. What did they look like? Ships, motorcycles? Were the circuits like freeways? I kept dreaming of a world I thought I'd never see. And then, one day I got in...

It turns out The Grid is just a guy sitting in a chair, shouting about "Fizz!" and "Buzz!" as fast as he can.

It wasn't really what I had in mind.

(The image of this poor program, stuck shouting "fizz!" and "buzz!" for subjectively centuries at a time struck me...)


I kinda hope as a joke in Tron 3 they have a guy muttering in the corner "1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz" as someone walks by.


Still better than throwing away a significant percentage of the world's total power output just to compute some useless hashes, if you ask me..


The amount of optimizations remind me of story of Mel: http://www.catb.org/jargon/html/story-of-mel.html


    > // Most FizzBuzz routines produce output with `write` or a similar
    > // system call, but these have the disadvantage that they need to copy
    > // the data being output from userspace into kernelspace. It turns out
    > // that when running full speed (as seen in the third phase), FizzBuzz
    > // actually runs faster than `memcpy` does, so `write` and friends are
    > // unusable when aiming for performance - this program runs five times
    > // faster than an equivalent that uses `write`-like system calls.
Why can't `write` use a reference like vmsplice?


I'm not an assembly programmer, but my intuition is that this is safer. You can only rely on "zero-copy" behavior when you have total control of the program and know that that memory region is going to stay put and remain uncorrupted. Therefore, most external code will make a copy because they can't insist on this (and because for most people, making a copy is pretty fast).


I think that some Unix derivatives do, or have done in the past. If the argument to write is page aligned and a multiple of the page size, they play VM tricks to implement zero-copy writes.

The issue is that the caller of write is allowed to do anything it wants with the buffer after write returns, so the write implementation needs to unmap the stolen pages and perform copy-on-write if the caller ever touches them again, so in practice the optimization is very fragile. For this reason Linus has always refused to implement this optimization.

Vmsplice gets away with it because it is part of the caller's contract that it can't ever touch the pages until the kernel is done with them. Unfortunately there is no general way to know when it is safe to reuse them; it is very application specific (for example there might be an explicit ack from the consumer that it has received the data).
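
A minimal, deliberately simplified sketch of that contract (not the author's code): two buffers alternate, and the pipe is sized to exactly one buffer, so by the time vmsplice() of one buffer returns, the other buffer should have drained and can be rewritten. Knowing when reuse is truly safe is, as noted, application specific; the real program is more careful about this.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <string.h>
  #include <sys/uio.h>
  #include <unistd.h>

  enum { BUF_SIZE = 1 << 16 };
  static char bufs[2][BUF_SIZE];

  int main(void)
  {
      fcntl(STDOUT_FILENO, F_SETPIPE_SZ, BUF_SIZE);   /* stdout must be a pipe */
      for (int b = 0; ; b ^= 1) {
          memset(bufs[b], 'x', BUF_SIZE);             /* stand-in for real output */
          struct iovec iov = { .iov_base = bufs[b], .iov_len = BUF_SIZE };
          if (vmsplice(STDOUT_FILENO, &iov, 1, 0) < 0)
              return 1;                               /* not a pipe, or write error */
      }
  }

Run it as `./a.out | pv > /dev/null` so stdout actually is a pipe.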


Did anyone else get recommended and see the FizzBuzz video on YouTube ( https://youtu.be/QPZ0pIK_wsc ) just before this article or did I just witness an incredible coincidence?



I haven't been on YouTube today, but I literally JUST got out of the shower and I was thinking about FizzBuzz while there.

But yeah, frequency illusion.


Literal 10x programmer moment


Yeah, the fact that it's a whole order of magnitude faster than all the other solutions, even the assembly-based ones, is insane.


“This is faster than a number of standard functions. In particular, it's faster than memcpy, which presents interesting challenges when it comes to I/O”

Wow.


Sorry for the super layman question, but can someone explain why is this so amazing?


A typical desktop computer processor runs at 3GHz, or 3 billion cycles per second. It has 64-bit registers, i.e. 8 bytes each. If it moves one register of results per clock cycle, that's up to 24 billion bytes per second, 24GB/s, if it does nothing else and never stalls or trips up[1]. Less than half the speed this answer is moving data. In a typical desktop computer from a few years back, the connection between CPU and main memory tops out at 25GB/sec throughput[2], often less. There's delay when a user program requests that the Linux kernel do things such as allocate memory or open input/output. Printing text to a console is really quite slow. That means lots of stalls and trips waiting for IO-related things, and context switching as the same processor runs both the FizzBuzz printer and the pv counter and the Linux kernel and the pipes connecting them together, and has to keep task-switching between them.

From the opening question on that page "A naive implementation written in C gets you about 170MiB/s on an average machine" and that C code is typical, there's nothing drastically bad about it. C already has a reputation as a fast, low level language. This answer is running over 320x faster than the straightforward C answer on the same hardware. If you asked people "how much faster can it possibly get without moving to specialist hardware", I doubt they'd guess that high.

It's difficult to get higher speed by using more CPU cores because calculating FizzBuzz is much faster than printing it, so they'd just fill up a queue and then stall waiting. To get faster, people are leaning on what exactly is happening internally: adjusting buffer sizes, how to generate the text "FizzBuzz" and get it into the right place to be printed at the right time, which system calls can be done with less overhead, lining up more work which can happen during the pauses, and unrolling loops to give the CPU a pattern of instructions it can execute quicker; this is getting answers into the GB/sec range.

This answer is like the Bugatti Veyron of answers; so finely tuned it's surely getting close to the limits of what the hardware can do. Making use of AVX2 registers which can hold 32 bytes instead of 8, and can be vectorised so one instruction processes many pieces of data, which means less overhead of instructions for the CPU to decode and process per piece of data, and deeper magic, a lot of skill and knowledge required to put all of it together.

And of course, someone spending months trying to send "Fizz" and "Buzz" into a counter at tens of billions of bytes per second for no good reason, is amazing in a different way.

[1] A nonsense back of an envelope estimate, but good enough for context.

[2] https://www.softwareok.com/?seite=faq-Beginners&faq=25


Because it's super CRAZY fast.


Somehow this reminds me of Aphyr's Technical Interview series[1], and the thought of some assembler dark wizard actually coding this in an interview is highly amusing.

[1] https://aphyr.com/posts/340-reversing-the-technical-intervie...


> Save it as fizzbuzz.S (that's a capital S as the extension)

What's this about?


Lowercase just runs it through the assembler, upper case also runs the C preprocessor


Clang at least does not seem to care about the extension.


How does one debug the L2 cache? i.e. how does one inspect the data in the L2 cache and verify that loading some data from RAM actually cached X number of bytes, that the correct data has been loaded into the cache lines, and that the next instruction will use the data from there?


My naive guess would be using hardware perf counters and checking cache misses


callgrind/kcachegrind i think?


I’ve printed his entry and there are some neat tricks I learned, but ‘stage 3’, where, according to his comments, a bytecode interpreter is used, is still beyond my understanding. Also I'm not sure why this helps; a bytecode interpreter sounds like a lot of overhead to me…


The bytecode is specifically designed so that it maps directly to the output.

  // The third phase operates primarily using a bytecode interpreter; it
  // generates a program in "FizzBuzz bytecode", for which each byte of
  // bytecode generates one byte of output. The bytecode language is
  // designed so that it can be interpreted using SIMD instructions; 32
  // bytes of bytecode can be loaded from memory, interpreted, and have
  // its output stored back into memory using just four machine
  // instructions.

  // The bytecode format is very simple (in order to allow it to be
  // interpreted in just a couple of machine instructions):
  // - A negative byte represents a literal character (e.g. to produce
  //   a literal 'F', you use the bytecode -'F', i.e. -70 = 0xba)
  // - A byte 0..7 represents the hundreds..billions digit of the line
  //   number respectively, and asserts that the hundreds digit of the
  //   line number is even
  // - A byte 8..15 represents the hundreds..billions digit of the line
  //   number respectively, and asserts that the hundreds digit of the
  //   line number is odd

  // %ymm2 holds the bytecode for outputting the hundreds and more
  // significant digits of a line number. The most significant digits of
  // this can be obtained by converting LINENO_TOP from high-decimal to
  // the corresponding bytecode, which is accomplished by subtracting
  // from 198 (i.e. 256 - 10 - '0'). The constant parts of LINENO_TOP
  // are 198 minus the bytecode for outputting the hundreds to billions
  // digit of a number; this makes it possible for a single endian
  // shuffle to deal with all 16 of the mid and high digits at once.


How do people have so much time on their hands :(


Maybe they spend less time writing comments on HN. Don't hit me, I am guilty of this too.


Most single people who have a full-time job should still have enough free time to devote 1000+ hours a year to personal projects - and even with a family, you should be able to find 400+ hours if your employer has any respect for WLB and/or you know how to set boundaries. That's two college degrees' worth of time per decade.


I have trouble with procrastination, but since watching Arnie's take on this I cannot get it out of my head and I have actually found it helpful (yet here I am on HN again :()

https://www.youtube.com/watch?v=u_ktRTWMX3M


Well, it's easy when your goal is clear-cut and 100% ego-directed ("win Mr. Universe"). Not so much when it's related to much larger concerns and uncertainties and pessimism (e.g. fighting global warming).


That could be an interesting talk.

I'll watch it later.


Everyone has 24 hours per day.


Not everyone has the same ability to utilize those hours. People have children and/or pets and/or loved ones to tend, different cognitive and physical limitations, different economic challenges, different social expectations, different living conditions which may impact all of the above.

Speaking for myself: between caring for a puppy, cognitive impairments from ADHD, sustained impact from burnout, and financial responsibilities for family… my ability to carve out free time for a very long list of things I’d like to do extracurricularly is limited to a few hours a week at most. Achieving a “months of work” project in my areas of interest and applicable talents is easily going to take 10x as long. And that also means being much more circumspect about where I choose to focus those interests.


Maybe I'd feel differently if the comment ended with :) instead of :(.

But I mean, obviously, yes. Different people can do different things for a zillion different reason. I'm reacting negatively because it sounds like a criticism against the author just for spending a lot of time working on a hobby project just for the sake of it.

See also: "why are you having fun when cancer still isn't cured yet?"


Wow! I took it entirely differently (and still do). I read it as being disappointed that they don’t feel as capable to pursue and invest significant time/effort into their own ideas. Essentially I saw the :( as an admission of envy, not as a judgment.

Admittedly, that could very easily be me projecting. I have tons of envy of people pursuing their ideas without the limitations I have to take seriously in my own life. It doesn’t make me sad to see them succeed or use their time the way they do, it makes me sad to have a brain full of ideas and limited ability to reify them.


It's envy. The author is my hero.


Not if they’re traveling eastwards around the world.


/s

Here! You forgot the sarcasm mark :)


It would be nice to compare its performance with a safe variant written in Rust, a variant written in Java, and one in Python, to see what % of the performance we lose for high abstractions, convenience and safety.


It would only be fair to write the Java version in JVM bytecode, in this style:

https://news.ycombinator.com/item?id=29036245


Only fair then if we track the time it took to write all variants.


Why not? That would be interesting too. For sure nothing is free, and it's good to know all the trade-offs.


I feel stupid after reading this, is this normal?


It would naturally be foreign to most, but the neat part is that it's also pretty accessible. Look up vectorization techniques and L2 cache optimization and you're most of the way to understanding. Most people who grok something like this after their first glance are specialized in the same or similar domain and probably wouldn't have an expert understanding of some things you take for granted.


Right there with you. There is a lot to learn.


From the comments: "@chx: I already have a master's thesis. This was harder."


I personally need an assembly renaissance, in terms of books and new content, that is.

This is mind blowing to say the least!


Interesting how he uses the C preprocessor as an assembly preprocessor; cpp as a general-purpose macro processor. Maybe this is more portable than using assembler built-in macro features, which vary from assembler to assembler.


> Interesting how he uses the C preprocessor as an Assembly preprocessor

It's actually a fairly common pattern, both ffmpeg and the Linux kernel do this for example.


  #include <sys/syscall.h>
is generally the easiest way to make a syscall without hardcoding numbers into your assembly program.
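
For example (a hypothetical minimal file, not the FizzBuzz program; the capital .S extension is what makes gcc run cpp over it before assembling):

  /* hello.S -- build with: gcc -nostdlib -static -o hello hello.S (x86-64 Linux) */
  #include <sys/syscall.h>          /* provides __NR_write, __NR_exit */

      .globl _start
      .text
  _start:
      mov $__NR_write, %eax         /* syscall number taken from the header */
      mov $1, %edi                  /* fd 1 = stdout */
      lea msg(%rip), %rsi
      mov $9, %edx                  /* length of "FizzBuzz\n" */
      syscall
      mov $__NR_exit, %eax
      xor %edi, %edi
      syscall

      .section .rodata
  msg: .ascii "FizzBuzz\n"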


I'm imagining the author of this being called in for an interview and writing this on the whiteboard, followed by the CTO offering him his job.


I'd be impressed by anyone who could write this in an hour on a whiteboard even with the code provided to them.


This post humbled me more than I needed!


My naive implementation (as shown in the linked question) in Rust was mind-bogglingly slow. C could do around 150MiB/s, while Rust could only do 10MiB/s. A simple `objdump -d` shows Rust is generating a lot more code. I'm not sure how much of that is relevant though.

At that point my coffee time ran out. I wish I had more time to figure out why. :-(


It is very common for folks to forget to pass --release, were you doing that?


What's cool here is if you can make FizzBuzz go 55GB/s then you have the skill needed to make a whole bunch of other stuff go that speed too. My personal favorites are that Intel architecture lets us do population count, polynomial division, and eight tap filters at that speed too, which is wonderful.


I guess it's faster than memcpy because both are memory bound, but this only has to write to memory whereas memcpy has to both read from and write to memory?

Edit: actually it sounds like it’s more cache efficient because it can always use the same block of cache memory?


I believe so. A memcpy needs to touch large blocks of memory, far more than would fit in a cache, so most memcpy implementations use non-temporal stores to indicate that the memory is not going to be used again. They write directly to main memory, avoiding using up valuable cache space with a cache line that is written to once and never used again. However, this program can operate entirely within L2, so it can go faster.
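
Roughly, the difference looks like this (illustrative intrinsics sketch, not code from the answer; both stores assume a 32-byte-aligned destination):

  /* A normal AVX2 store keeps the line in cache (good here, since the same
     L2-resident buffer is rewritten over and over), while a streaming store
     bypasses the cache (what large memcpy implementations tend to use). */
  #include <immintrin.h>
  #include <stdint.h>

  void store_cached(uint8_t *dst, __m256i v)      /* dst must be 32-byte aligned */
  {
      _mm256_store_si256((__m256i *)dst, v);      /* goes through L1/L2 */
  }

  void store_streaming(uint8_t *dst, __m256i v)   /* dst must be 32-byte aligned */
  {
      _mm256_stream_si256((__m256i *)dst, v);     /* non-temporal: bypasses cache */
      /* a later _mm_sfence() is needed before another agent reads dst */
  }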


Wow the competition for a slot in the corporation sure is fierce these days


Now we just need someone to write a machine proof of correctness.



The next level would be designing an ASIC to do FizzBuzz. :)


I wonder if it slows down with really big numbers.


Good work!


    This is the heart of this program. It aims to be able to produce a
    sustained output rate of 64 bytes of FizzBuzz per four clock cycles
    in its main loop (with frequent breaks to do I/O, and rare breaks
    to do more expensive calculations).
    
    The third phase operates primarily using a bytecode interpreter; it
    generates a program in "FizzBuzz bytecode", for which each byte of
    bytecode generates one byte of output. The bytecode language is
    designed so that it can be interpreted using SIMD instructions; 32
    bytes of bytecode can be loaded from memory, interpreted, and have
    its output stored back into memory using just four machine
    instructions. This makes it possible to speed up the FizzBuzz
    calculations by hardcoding some of the calculations into the
    bytecode (this is similar to how JIT compilers can create a version
    of the program with some variables hardcoded, and throw it away on
    the rare occasions that those variables' values change).

    The bytecode format is very simple (in order to allow it to be
    interpreted in just a couple of machine instructions):
    - A negative byte represents a literal character (e.g. to produce
      a literal 'F', you use the bytecode -'F', i.e. -70 = 0xba)
    - A byte 0..7 represents the hundreds..billions digit of the line
      number respectively, and asserts that the hundreds digit of the
      line number is even
    - A byte 8..15 represents the hundreds..billions digit of the line
      number respectively, and asserts that the hundreds digit of the
      line number is odd

    In other words, the bytecode program only ever needs to read from
    LINENO_MID; the information stored in LINENO_LOW and LINENO_TOP
    therefore has to be hardcoded into it. The program therefore needs
    to be able to generate 600 lines of output (as the smallest number
    that's divisible by 100 to be able to hardcode the two low digits,
    200 to be able to get the assertions about the hundreds digits
    correct, and 3 and 5 to get the Fizzes and Buzzes in the right
    place).

    The bytecode interpreter consists of four instructions:
    1. Load the bytecode from memory into %ymm2;
    2. Use it as a shuffle mask to shuffle LINENO_MID_TEMP;
    3. Subtract the bytecode from the shuffle result;
    4. Output the result of the subtraction.

    #define INTERPRET_BYTECODE(bc_offset, buf_offset) \
      vmovdqu %ymm2, [%rdx + bc_offset]; \
      vpshufb %ymm0, LINENO_MID_TEMP, %ymm2; \
      vpsubb %ymm0, %ymm0, %ymm2; \
      vmovdqa [OUTPUT_PTR + buf_offset], %ymm0
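
A scalar C model of what those four instructions do per bytecode byte (a hypothetical illustration; the exact bias of the digit table is an assumption here, derived in the real program from LINENO_MID/LINENO_TOP):

  #include <stdint.h>

  uint8_t interpret_byte(int8_t bc, const uint8_t digit_table[16])
  {
      /* vpshufb: a mask byte with its high bit set selects zero; otherwise the
         low 4 bits index into the 16-byte table of (biased) line-number digits. */
      uint8_t shuffled = (bc < 0) ? 0 : digit_table[bc & 15];
      /* vpsubb: output = shuffle result - bytecode. For a literal such as -'F'
         this gives 0 - (-'F') = 'F'; for a digit selector 0..15 the table is
         assumed to be biased so the subtraction lands on the ASCII digit '0'..'9'. */
      return (uint8_t)(shuffled - (uint8_t)bc);
  }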


I'm not sure why there is coding done here.

A ticket for this work has been filed here: https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...

I'm going to have to ask that the contributors attend the grooming sessions and talk about the business value of this performance enhancement.


I like how it has a

  FizzBuzzOutputGenerationContextVisitorFactory.java
file.



