The amount of low level CPU architecture knowledge to write such a program is mind boggling. Just goes to show how much room for improvement a lot of programs have.
FizzBuzz has many properties that make it very suitable for these kinds of optimizations that might not be applicable to general purpose code:
+ extremely small working set (a few registers worth of state)
+ extremely predictable branching behavior
+ no I/O
These properties, however, don't diminish the achievement of leveraging AVX2 (or any vectorization) for a problem that doesn't immediately jump out as SIMD.
The problem is mostly about writing out bytes, which is probably the most expensive part of this. In fact, if you read the winning solution's description, I/O is the primary problem here.
> doesn't immediately jump out as SIMD.
IDK that I agree with this assessment. Very naively, I see no reason you wouldn't take the 512-bit SIMD registers and split them into 16 32-bit lanes. From there, it's a relatively simple matter of using 2 registers for the divisors, pulling out the results, and transforming them into the text to print. In other words, you'd be chunking this up into 16 iterations per loop (vs. 1 with the naive assembly).
This is the sort of thing that jumps out as easily vectorizable.
Now, the fastest answer very obviously does not take this approach, because I'm certain they realized the same thing: the difficult part here isn't the actual division, but instead pumping out the correct text at the highest speed possible. If you read through it, most of the code is dedicated to converting binary numbers into ASCII :D
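To make the lane-splitting idea concrete, here's a rough sketch (my own, not the contest code) in C with AVX2 intrinsics, so 8 32-bit lanes rather than AVX-512's 16. Divisibility is tested with the usual multiply-by-modular-inverse trick for odd d: n % d == 0 iff (n * inv_d mod 2^32) <= (2^32 - 1) / d. Compile with -mavx2.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    const __m256i inv3 = _mm256_set1_epi32((int)0xAAAAAAABu), lim3 = _mm256_set1_epi32(0x55555555);
    const __m256i inv5 = _mm256_set1_epi32((int)0xCCCCCCCDu), lim5 = _mm256_set1_epi32(0x33333333);
    for (int base = 1; base <= 100; base += 8) {
        // n = base, base+1, ..., base+7 in the 8 lanes
        __m256i n  = _mm256_add_epi32(_mm256_set1_epi32(base),
                                      _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
        __m256i m3 = _mm256_mullo_epi32(n, inv3);
        __m256i m5 = _mm256_mullo_epi32(n, inv5);
        // a lane is divisible if the product is <= the limit (unsigned compare via min)
        __m256i d3 = _mm256_cmpeq_epi32(_mm256_min_epu32(m3, lim3), m3);
        __m256i d5 = _mm256_cmpeq_epi32(_mm256_min_epu32(m5, lim5), m5);
        int fizz = _mm256_movemask_ps(_mm256_castsi256_ps(d3));
        int buzz = _mm256_movemask_ps(_mm256_castsi256_ps(d5));
        // the scalar tail -- turning the masks into text -- is exactly the part
        // the winning solution spends most of its effort on
        for (int i = 0; i < 8 && base + i <= 100; i++) {
            if ((fizz >> i) & 1) printf("Fizz");
            if ((buzz >> i) & 1) printf("Buzz");
            if (!(((fizz | buzz) >> i) & 1)) printf("%d", base + i);
            printf("\n");
        }
    }
    return 0;
}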
Maybe I should be more clear; no data needs to be fetched from disk/network, and the "writes" don't need to go past memory.
As for the second point, you might have a different definition of "naive" and "relatively simple", as my brain has rotted too much from only thinking about SIMD for numerical computation. While determining divisibility is relatively clear, it wasn't clear how the printing would be easily vectorizable, as the output-per-number is variable in length.
I think this nails it. Vectorizing the math problem is "easy" (send batches of numbers to cores, do the division) but then you have to re-order it for printing (not to mention actually print it), so paradoxically, it probably makes more sense for the program to be single-threaded.
Yes, the pattern repeats every 15 integers.
So you only need to do one division operation to get the index modulo 15.
I was bored once at work and figured out a way to compute this without doing much arithmetic. It only requires 1 modulo operation.
#include <stdio.h>

int main(void)
{
    for (int i = 1; i < 100; i++)
    {
        // make a bit vector with a 1 in bit (i mod 15)
        unsigned int i_as_bitvector = 1 << (i % 15);
        // Check it against the valid positions for FizzBuzz
        // Octal 011111 is binary 1001001001001 (bits 0,3,6,9,12: multiples of 3)
        // Octal 02041 is binary 10000100001 (bits 0,5,10: multiples of 5)
        printf("%d ", i);
        if (i_as_bitvector & 011111) printf("Fizz");
        if (i_as_bitvector & 02041) printf("Buzz");
        printf("\n");
    }
    return 0;
}
I also have a version which has NO magic constants anywhere in the program, except 3 and 5. I'll post it if someone is interested.
Also, after printing 1 to 9, you have a single pattern of 30 numbers that repeats exactly 3 times from 10 to 99, then 30 times from 100 to 999, then 300 times from 1000 to 9999, and so on. That lets you extract lots of code from the loop and run it roughly once every 10^n numbers.
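For anyone who wants to sanity-check that claim, here's a quick throwaway C snippet (mine, not from the thread) that prints the Fizz/Buzz "shape" of each 30-number block between 10 and 99; all three rows come out identical, because lcm(3, 5, 10) = 30.

#include <stdio.h>

// 'X' = FizzBuzz, 'F' = Fizz, 'B' = Buzz, '.' = plain number
static char shape(long n)
{
    if (n % 15 == 0) return 'X';
    if (n % 3 == 0)  return 'F';
    if (n % 5 == 0)  return 'B';
    return '.';
}

int main(void)
{
    for (long start = 10; start < 100; start += 30) {
        for (long n = start; n < start + 30; n++)
            putchar(shape(n));
        putchar('\n');   // each of the three rows prints the same string
    }
    return 0;
}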
Why would you think in sets of ten, when there should actually just be one pattern in 15? Then it just becomes a challenge to arrange your code to work on these blocks of 15.
We could probably code 16 versions of the block of 15 code that repeat and are nicely aligned for SIMD.
In my mind a SIMD version would work by predefining a string buffer 15 elements long with Fizz in positions 0,3,6,... and Buzz in positions 0,5,10. These are constants that don't ever change. Then vectorized code would only write to positions that change, the numbers: 1,2,4,7,8,11,13,14. Most of the time these positions would have fixed width too (lengths of large numbers don't change often) so you can use constant write addresses. So 8 SIMD threads could handle a block, and write everything blockwise.
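A scalar sketch of that template idea (my own simplification, not SIMD and not the contest code): build the 15-line block once, with Fizz/Buzz in their fixed positions and fixed-width, zero-padded placeholders for the 8 numbers, then each block only patches the digit fields in place. Real code would switch templates whenever the numbers grow a digit instead of zero-padding.

#include <stdio.h>
#include <string.h>

#define W 8  /* fixed digit width for this sketch */

int main(void)
{
    char tmpl[256];
    int  off[8], pos[8], ns = 0;
    char *p = tmpl;
    // build the template once: k = 1..15 is the position within a block
    for (int k = 1; k <= 15; k++) {
        if (k == 15)          p += sprintf(p, "FizzBuzz\n");
        else if (k % 3 == 0)  p += sprintf(p, "Fizz\n");
        else if (k % 5 == 0)  p += sprintf(p, "Buzz\n");
        else {
            off[ns] = (int)(p - tmpl);   // where this number's digits live
            pos[ns] = k;                 // its position within the block of 15
            ns++;
            p += sprintf(p, "%0*d\n", W, 0);
        }
    }
    size_t len = (size_t)(p - tmpl);

    for (long i = 0; i + 15 <= 99999999; i += 15) {   // stay within W digits
        for (int s = 0; s < ns; s++) {
            char digits[W + 1];
            sprintf(digits, "%0*ld", W, i + pos[s]);
            memcpy(tmpl + off[s], digits, W);          // patch the digits only
        }
        fwrite(tmpl, 1, len, stdout);
    }
    return 0;
}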
I was thinking blocks of 10 because I can itoa(the number of tens) and then concat with the subset of 0, 1, 2, 3, 4, etc. that I care about. I guess with blocks of 15 you just need to do two itoas, and worry about two types of block vs 3.
Using AVX-512 is not suitable for programs that take very little time to run. There is a penalty in the ms range to "warm up" the CPU (it is more of a cool-down, actually: the CPU drops its clock speed while running AVX-512 code).
As OP stated, the limiting factor is memory access. That's why he kept saying 64 bytes every four cycles.
But OP likely didn't use it because most CPUs lack support for AVX-512.
The new Intel CPUs introduced many changes to the frontend. This will likely improve the speed.
It might also be possible to get the CPU to operate at a higher clock speed.
Code looks a bit long. Not sure if the unrolling actually helps.
EDIT: Just look at Agner Fog's microarchitecture doc. Ice Lake and Tiger Lake can do 64 bytes/cycle.
In theory, it can run 4x faster (on bare metal, maybe).
I think you're missing the point. The issue is that with the advent of higher-level languages, starting from Java and JavaScript and on to Python and so on, most people have forgotten or have never learnt the skills to optimize code.
I'd argue that, as a percentage, the share of people who can write proper multi-threaded code has only diminished over the years.
And we see the result, despite the massive increase in computing power, software in general has become slower and bloated.
The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.
If you have "months to write fizzbuzz" levels of resources available, sure, you can microoptimize everything. Except in practice you can't, because the amount of effort needed for this kind of microoptimization is quadratic or worse in the complexity of the problem you're actually solving.
For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster, because they'll have had a lot more time available to spend prototyping and picking good algorithms rather than debugging undefined behaviour because they used the wrong combination of integer bit lengths.
> The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.
I disagree. I believe the majority of code is slow because it's written without any consideration at all to performance. I like Casey Muratori's philosophy of non-pessimization, where true optimization (measuring, working on hot spots) is rare and rarely necessary, but massive speedups compared to the general state of the art are achievable by simply not writing code using patterns that are inherently slow. This isn't deep algorithmic stuff; it's just avoiding copies and/or pointer chasing.
> For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster
The Advent of Code runs every year and I'm not sure about C (not something I track) but there are plenty of Rust submissions and while the Rust submissions take longer to come in, it's not THAT much longer. From memory it's like 2x on a timescale of minutes. The Python programs are not faster.
Not disagreeing or anything, but an Advent of Code sized example will not be anywhere close in complexity to your typical CRUD application. The thing is, we possibly could not even write most of these applications in a low-level language; we should not forget that while Rust indeed has good abstractions, low-level languages simply leak memory-related information up to the higher levels as well, making it necessary to deal with it at a high level. And in these cases, time to market, speed of refactors and maintainability are much more important than raw speed (which, again, may not even be that significant at all; e.g., often most of the work is done on the db side).
> an Advent of Code sized example will not be anywhere close in complexity to your typical CRUD application
Advent of Code sized problems are the best-case scenario for Python. The difference in implementation time goes down as program size increases, because you spend a larger fraction of the time figuring out what you're implementing.
The reason it's not done is mainly social and not technical. Ecosystems matter and this is particularly the case in graphical desktop environments where massive amounts of time and effort are required to re-implement features matching users' expectations (e.g. Flutter).
If we're talking a server side http app server then the library requirements are significantly lower and the libraries are generally there in almost any language. To keep the language comparison the same, it's not significantly slower to implement a CRUD backend in Rust than it is in Python. Depending on your specific setup and how much pre-prep is involved it can be faster. I'm not aware of any framework in a dynamic language that can produce production grade endpoint handlers that are as terse as a Rocket app heavily leveraging request guards. The guards handle validation, conversion, session context and db connection based on the endpoint parameter types so you just write the happy path which is 1-2 lines of code for CRUD.
> speed of refactors and maintainability
Dynamic languages are significantly worse for maintainability and refactoring. I say this as someone who generally gets paid to be a frontend developer and spent years writing both Python and Clojure professionally. Despite my arguments, I do not write server side Rust for work because team coherence is more important for doing my actual job (solving a customer's problem) than programming language and I'm the only person in my company who writes Rust. I've been doing this long enough that I accept the worse solution as the cost of doing business. My personal projects don't have this constraint so perf is orders of magnitude better in exchange for a lot more knowledge and a bit more design work.
> Advent of Code size problems is the best case scenario for Python. The difference in implementation time goes down as program size increases because you spend a larger fraction of the time figuring out what you're implementing.
The difference in code size between a good and a bad language goes up as program size increases, because a large codebase makes it harder to keep it all in your head and forces you to add more layers and duplication.
>And in these cases, time to market, speed of refactors and maintainability are much more important than raw speed (which, again, may not even be that significant at all; e.g., often most of the work is done on the db side)
You don't have to ditch Java or C# for your app and use C or Rust instead. But you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.
Also, if performance, throughput and stability weren't a thing, we have to wonder why Twitter switched from RoR to Java. Couldn't they just have bought more servers instead?
> you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.
In practice you only have so much attention and there are better places to spend it.
> Also, if performance, throughput, stability wouldn't be a thing, we have to wonder why Twitter switched from RoR to Java?
If raw speed matters we have to wonder why Twitter started with RoR, and only switched (mostly to Scala, not Java, AIUI) once they had tens of millions of users.
In a tiny handful of outlier cases where a given piece of code is run literally billions of times a day it eventually becomes worth doing performance microoptimisations. But only after you've confirmed that there's absolutely massive demand for this particular code and already done your iterative refactors and figured out what the code should do.
> But you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.
Writing Java in a more performant way often means ditching OOP and using primitive types, primitive arrays everywhere instead, and also avoiding allocations up to the point where all objects are reused. Such code is just as slow to write and as error-prone as C, and way worse than C++ where at least you have RAII, generics, STL algorithms and zero-cost abstractions.
You only have to do that in hot loops (in most problems), that you can easily profile with the very good observability tools that java provides. The rest can be maintainable, good quality OOP code.
I would even go as far as to say that OOP’s encapsulation pretty much exists just for this: the “ugly” implementation can reside inside a class and you only have to care about its public API from the outside.
> I believe the majority of code is slow because it's written without any consideration at all to performance. I like Casey Muratori's philosophy of non-pessimization, where true optimization (measuring, working on hot spots) is rare and rarely necessary, but massive speedups compared to the general state of the art are achievable by simply not writing code using patterns that are inherently slow. This isn't deep algorithmic stuff; it's just avoiding copies and/or pointer chasing.
I don't/can't watch videos; I would be interested to see a written argument, but generally my view is that copies and pointer chasing are irrelevant because code is so far away from optimal already. You're right that code is mainly written without any consideration to performance, but that manifests itself first in poor choices of algorithm and/or datastructure. I don't think there's anything "deep" about, say, using a hashmap rather than searching linearly through a list; indeed I'd consider it easier to understand than avoiding copying.
> The Advent of Code runs every year and I'm not sure about C (not something I track) but there are plenty of Rust submissions and while the Rust submissions take longer to come in, it's not THAT much longer. From memory it's like 2x on a timescale of minutes. The Python programs are not faster.
Advent of Code is still extremely tiny programs, and Rust is a much much better language than C for thinking in.
True. Most business apps I've seen seemed like people did serious research into how to write the worst code from a performance point of view. Gazillions of wrappers upon wrappers, tons of indirection, data being copied and recopied thousands of times, creating FactoryFactoryFactories just for the sake of it.
If people paid more attention to how the data flows instead of to Uncle Bob and the GoF, they would end up not only with a cleaner software architecture but also with more performance.
> I believe the majority of code is slow because it's written without any consideration at all to performance.
Both can be correct. I use standard data structures for many things even if some prefix tree might be better for the operations I perform on a set. I don't think about that in most cases, since computing power is cheaper than the additional work. If performance becomes a problem, I can optimize later.
Advent of Code isn't a realistic performance metric. Working software is for most software projects.
> For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster, because they'll have had a lot more time available to spend prototyping and picking good algorithms rather than debugging undefined behaviour because they used the wrong combination of integer bit lengths.
I think this is more nuanced than that. Competitive programming actually gives fairly good insight here, because skilled programmers are placed under time constraints and obviously care about performance.
What you see, usually, is that programs that don't need maximum performance are often written in Python, which allows people to quickly write code that needs less debugging. But double the allotted time and the C programmer finally has their code finished and free of bugs, and they're beating the pants off of your Python code right out of the gate. You try and come up with clever ways to avoid copies and drive down the Big-O, but the C programmer comes up with the same thing just a bit behind you and leads most of the time because they get a free 10x speedup just because of the language they're using.
It's not surprising that the highest levels of competitive programming are dominated by C (and to a greater extent, C++) programmers, with Python use extremely rare, almost always reserved for rare, complicated algorithms that happen to bring a significant big-O improvement that justify the use of Python to write them.
The overwhelming majority of code isn't slow because it used the wrong algorithm, it's because the Product team can't demo or show off an increase in efficiency. It's because they aren't measured by their ability to reduce or control the growth of the AWS bill. And they probably wouldn't have the knowledge to ascertain a reasonable bill. So they push deadlines which pushes devs to take ugly but fast to write solutions using high level languages.
Also, performance improvements pay for themselves over months and years while new features usually pay for themselves faster (at least they do in forecasts).
And finally, touching working code to make a performance improvement necessarily has some risk and that risk might be worth more than a year of cost savings from the improvement.
>The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.
Try to look at The Computer Language Benchmarks Game where all implementations are using the same algorithm. An optimized C implementation can be 300 times faster than Python.
That's exactly the kind of missing the point that I'm talking about: tiny programs where all implementations are using the same algorithm and people have spent months microoptimising and are utterly unreflective of realistic business problems.
There's a law that says that as resources become easier to consume, people consume more of them. Like when increasing the capacity of a road doesn't reduce the time spent in traffic. Wouldn't that apply to software? I feel like it's a bit unfair to call software "slower and bloated" when you wouldn't call a road "bloated" because it has a new lane. People can do more things now.
It's more like I start manufacturing cars three lanes wide because the material I'm using is cheaper than cutting and engineering it down to a normal size. That's how I see the modern-software analogy in this context.
It feels like the frustration comes because I see the analogy this way:
The roads have been given extra lanes, and the total possible throughput of the road has increased. As a result, car manufacturers have decided that they can use a cheaper manufacturing process which makes slower cars. But when anyone complains, they point to the additional lanes and say ‘but the number of cars that get from point A to point B has increased, even if each trip is slower.’
The end user isn’t getting a choice when software developers make slower code each year. They have to buy more expensive hardware to keep up with the slowdowns of software developers.
I get it. But what exactly do you want to happen here?
In terms of your analogy, do you want more expensive cars that fewer people can have?
Software developers could spend more time optimizing their code. But that means they are spending less time elsewhere: fixing bugs, adding features, developing other products.
It should be noted that many times features come in that people do not want.
“The new format bar” for Slack and “huddles” (perhaps less antagonizingly) are examples of things that Slack has done seemingly to prove they're still working on the program.
Despite the features not being needed in the slightest.
I actually disagree, at least about huddles. Huddles are a direct response to the low-friction nature of Discord voice channels (and arguably Teamspeak/Mumble/etc before Discord, although nothing else married text and voice channels quite as well before Discord.) It's almost silly how small of a difference there is between starting a huddle and a channel-wide "call" - the only difference is that it doesn't start out by "ringing" for everyone in the channel. But that tiny difference completely shifts the feeling from "this is a formal communication" to "we're all just hanging out and you can join if you want but no pressure."
IMO huddles happened because the pandemic moved so many more people remote that the need for a water-cooler replacement became much more obvious.
That would imply that we do fewer things, or only as many as before, with computers, which I think is false. Maybe you don't agree with what everyone does, but that doesn't mean that it's worthless.
Sometimes bloaty design decisions transcend the question of programming language. Electron probably has (I didn't check) a faster JavaScript engine than Qt, but I'd expect that similarly sophisticated implementations of the same program in Qt or Electron show an advantage in performance to the former. But people use Electron because it's familiar and it's got a lot of active development, and more cynically, Google likes it that way because it means that people keep coding for Chrome (cf. Microsoft's VS ecosystem, Apple's walled garden, etc).
The problem with modern software is that vendors encourage new developers to learn using heavyweight toolkits because lock-in is profitable in the long run. One day Electron will work with Go, but it will still be a plague.
>FizzBuzz has many properties that make it very suitable for these kinds of optimizations that might not be applicable to general purpose code: + extremely small working set (a few registers worth of state) + extremely predictable branching behavior + no I/O
The working set depends on how you organize the data.
I/O can be done in larger chunks most of the time. You have to think about the data you need and when you need it.
As for branching, that most of the time depends on the programmer rather than on the problem being solved. The situations where you need very complicated branching to solve a problem are rare. And I've seen O(n log n) algorithms beating clever O(n) algorithms because they could be better optimized for the CPU.
> for a problem that doesn't immediately jump out as SIMD.
It's a single transformation applied to elements of a large array. Parallelization is obvious, and even if the exact SIMDization is not obvious, one is certainly motivated to formulate it. And a scatter-like approach does spring to mind right away, I believe.
The parent comment asserted that FizzBuzz should never branch. This would require it to not have any backwards control flow, and thus require full unrolling of an infinite sequence. Hence the infeasibility.
Once upon a time most software was highly optimized with hot code paths written in assembly. If you look at DOS source code, DOOM source code you will see lots of optimization.
When CPUs got more powerful, people got lazy and thought they could spend the improvements on conveniences.
Now we are at the point that we run "desktop" apps written in Javascript on top of embedded browsers.
Hard times create strong programmers. Strong programmers create good times. Good times create lazy programmers. And, lazy programmers create hard times.
> If you look at DOS source code, DOOM source code you will see lots of optimization
If you look at the VSCode code, you’ll see even more of these. What really changed was that more people wanted to hack things and write non-PhD code, and web tech was (sadly) popular enough to take this role. It's not that programmers are lazy; it's that desktop frameworks utterly failed in their non-competitive niches. MSVC libraries replaced each other almost faster than github's new web frameworks, to the point that barely anyone can remember their chronology. Gtk+ got stuck for years dealing with C legacy, added non-C styling engines too late, and then got “owned” by a particularly deteriorating DE. Qt always had a not-so-clear position on community-licensed versions, and the C++ impedance mismatch didn't help either. UIKit/AppKit got pretty useful and advanced in the last ~ten years, but are ios/osx only. Minor frameworks like fox, wx, fltk, etc. never got enough traction or new ideas and were just shadows of their bigger brothers. Meanwhile, with electron, bootstrap and a little js one can make a page in a few minutes that could take a few hours with a conventional desktop toolkit.
I mean, you are correct that programming went from hardcore to relaxed mode, but there is more history to it on the UI side.
Business code must be optimized for maintenance first.
Because it will live for many years. It will have to survive in the hands of multiple caretakers. It will have to evolve to work with the underlying foundations changing (operating system, compiler/runtime, desktop -> web -> mobile).
That's different from most video games (one single release, then little patches) and from advent calendar code.
Hot code paths are still written in assembly, it's just on a different level. DOOM is an engine; nowadays people use off the shelf engines so that they can focus on creativity instead of low level details.
I mean I could probably hyperfocus on storing my application state in a file on disk, but why should I bother when there's off the shelf SQL databases right there? Which have been optimized endlessly, I might add. I don't get paid to write low level ASM, I get paid to build applications.
Edit: And to add, my current thing I get paid for is several orders of magnitude faster than the one it replaces. Not because I spend more time optimizing, but because I write it in sensible Go instead of 10K LOC PHP files that concatenate XML strings that get converted to JSON to be rendered in Dojo-flavored HTML concatenated together in JS because of reasons.
Not lazy, sensible. The market has spoken and it wants bloated electron apps with rapid development, rather than performant C/assembly apps that hardly ever change.
Were all Electron app users demanding something that eats up their RAM, crashes and runs slowly?
The market demand seems more like:
Jack: "Our bellowed CEO wants us to deliver our wonderful app to unwashed masses still using a desktop. Our enlighted marketing team made a study which realized that for whatever weird reason, corporations and businesses still make heavy use of those boxes which come with a mouse and keyboard attached."
Joe: "Sure thing boss, we will have to hire some programmers and testers and will take about a year or so."
Jack: "I forgot to tell you that the marketing study already took a year and a very large budget because we hired the best of the best to do it. One year we don't have, money we don't have. But what about those people who wrote our web app? We still pay them. Can't they deliver?"
Joe: "We will have our glorious desktop app in two weeks, boss, I just had an idea."
Jack: "Do that and I will personally remind about you to our bellowed CEO when he will want to do some promotions."
The power dynamics of companies/customers are often not as dynamic as all that.
If slack is electron and I work at a company that uses slack: I must use it.
The competition in that space is all electron, you can’t choose.
It’s like saying that “the market chose non-ECC ram”. No, Intel chose for you and you don’t get much choice except to suck it up (or pay well above the odds.)
It takes a lot to avoid using a product. I mean people still use Oracle products!
That does not actually contradict the point. We got stuck in a suboptimal local maximum due to all the early design decisions of browsers and JavaScript. The original inventors did not expect anyone writing something like Google Drive for the web.
The market surely pushes against bloated electron apps, yet the convenience of having the same app on the web as well as "native", and the amount of man-years that went into making HTML+JS the richest multi-platform UI framework on the market, matter more.
There is no market for things that just work at v1.0 and then continue to work flawlessly. You won't sell version n again and again, and there is no support contract for something that does a single thing and does it well.
These days, compilers have gotten so damn smart with their optimizations that using assembly will actually make your code slower unless you really know what you're doing like the original post here.
A few years ago, I was doing Project Euler problems, and one of them had a solution that took about 45 minutes for my C program to solve. I decided that it was an incredibly simple problem, so rewrote it in Assembly.
My assembly solution took an hour to run.
I disassembled my C solution, and couldn't figure out what it was actually doing, though my assembly knowledge is pretty weak.
People might not care to learn it, but isn't there a lot more to know nowadays re: the obscurities going on deep down at the lowest levels of modern architectures?
I don't think you should be an expert in CPU architectures, but having an understanding of memory layout and how instructions are dispatched is not very hard or time-consuming.
I see many people having trouble understanding the difference between value and reference, what pointers are, why it's better for structures to contain variables of the same type and have a certain size, why it's better to call a function once on a large chunk of data instead of calling it many times on small chunks, and why iteration or tail-call recursion is to be preferred over naive recursion.
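As a small illustration of the struct-layout point (a generic example, not tied to anything in the article): on a typical x86-64 ABI, the same four fields take 24 bytes or 16 bytes depending purely on ordering, because the compiler has to insert padding to keep each member aligned.

#include <stdio.h>

struct mixed {        // char, double, char, int
    char   a;         // 1 byte + 7 bytes padding (double needs 8-byte alignment)
    double b;         // 8 bytes
    char   c;         // 1 byte + 3 bytes padding
    int    d;         // 4 bytes -> typically 24 bytes total
};

struct reordered {    // same fields, largest alignment first
    double b;         // 8 bytes
    int    d;         // 4 bytes
    char   a, c;      // 2 bytes + 2 bytes tail padding -> 16 bytes total
};

int main(void)
{
    printf("mixed: %zu bytes, reordered: %zu bytes\n",
           sizeof(struct mixed), sizeof(struct reordered));
    return 0;
}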
The prevailing view is that most LOB apps won't care about performance because they are waiting for I/O, so we should not care about performance but about coding speed, OOP, Uncle Bob's principles, GoF patterns, REST and the trendy architecture of the day. While I'm sure that coding speed matters and a sound architecture matters, I also think that throughput matters, that minimizing delays matters, and that some problems are better dealt with by thinking about the data and how the CPU likes to access and work with it, instead of just firing up more Kubernetes pods hoping that scaling will get rid of performance issues. By not thinking about the data and the CPU we didn't get rid of the complexity, we just moved it into the infrastructure, and into code that has to deal with a more complex infrastructure.
At some point I believe we will start to bump up against the limits we saw in the first three computer revolutions (tubes, transistors, integrated circuits). This will cause the pendulum to swing from commodification to optimization. What I mean is, you won't be able to snap your fingers and scale up or out, thus optimization will begin again in earnest. Clearly this isn't always the case: GPU shaders, crypto ASICs, video processing... there are plenty of places where optimization is crucial for high performance loop tuning. But optimization hasn't been required across the board like there was just before the three big innovations I described hit.
This tends to happen at a smaller scale with gaming consoles. Towards the end of a generation, the games coming out usually perform/look a lot better than the ones at the beginning of the generation. I'm assuming due to a lot more careful optimizations due to not being able to change the hardware.
I've always been curious about how far we could really push modern computers if somebody wanted to spend the time going to the lengths in the original post when creating practical software. Of course it's usually not worth the tradeoff, but it's interesting to think about.
It takes a lot to correctly explain exactly why this set of design choices is the fastest, but writing it just takes a fairly simple model of the CPU internals, knowledge of the instruction set, a focus on constantly measuring performance, and the agility to iterate quickly on different approaches.
And, reading more closely, beating the competition also took low-level OS knowledge and an understanding of the peculiarities of the benchmark in question.
The benchmark was about getting the output to a pipe as fast as possible, and there's this great pipe speed hack:
// To produce output without losing speed, the program therefore needs
// to avoid copies, or at least do them in parallel with calculating
// the next block of output. This can be accomplished with the
// `vmsplice` system call, which tells the kernel to place a reference
// to a buffer into a pipe (as opposed to copying the data into the
// pipe); the program at the other end of this pipe will then be able
// to read the output directly out of this program's memory, with no
// need to copy the data into kernelspace and then back into
// userspace.
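For anyone curious what that looks like in practice, here's a minimal C sketch of the vmsplice idea (my own toy, not the contest code): hand the kernel a reference to a userspace buffer instead of write()-copying it. Run it as ./a.out | cat, since vmsplice only works when stdout is a pipe, and note that a real program must not reuse the buffer until the reader has consumed it.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    // Fill a buffer with ordinary FizzBuzz output...
    static char buf[1 << 16];   // static: must stay valid until the reader is done with it
    size_t len = 0;
    for (int i = 1; len + 32 < sizeof buf; i++) {
        if (i % 15 == 0)     len += sprintf(buf + len, "FizzBuzz\n");
        else if (i % 3 == 0) len += sprintf(buf + len, "Fizz\n");
        else if (i % 5 == 0) len += sprintf(buf + len, "Buzz\n");
        else                 len += sprintf(buf + len, "%d\n", i);
    }
    // ...then splice a reference to these pages into the pipe instead of copying them.
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    if (vmsplice(STDOUT_FILENO, &iov, 1, 0) < 0) {
        perror("vmsplice (is stdout a pipe?)");
        return 1;
    }
    return 0;
}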
If you have talented staff then you'd be surprised how far you can get just by giving someone who already does that particular application as a hobby an unlimited supply of coffee.
Obviously finding talented staff is very hard, but once you have your tribe you can go a very long way i.e. I look at apps made by some people I work with (fast, lightweight etc.) then compare with crap pumped out by startups with literal billions in capital. I think it's a question of confidence more than competence.
The problem with that tradeoff is that you only compare the performance with the 'top layer of abstraction' that is already implemented and not with the baseline.
You can pump out a Modern C++ version in 5 minutes too that will run loops (haha) around the higher level languages. The readability won't even be very different...
But realistically? For anything except code golf and nerd fights, the actual client requirement is probably better met by a WordPress widget written in php/html, because what they asked for is something that'll print the fizz buzz all the way up to the person's age when they log into the company website... Nobody is even going to notice if it takes a whole second to fizz buzz all the way to 95 :-)
(Now I'm wondering if that guy's raw hyper optimised x86 assembly can get transpiled to WASM... Because nerd fights are fun.)
> Now I'm wondering if that guy's raw hyper optimised x86 assembly can get transpiled to WASM
Not really. WASM is significantly simpler and more abstract than x86 assembly, and has a JIT compile step that probably wouldn't get anywhere near as optimized. You could probably hand-write a WASM version that would JIT compile to something roughly similar and still get very good performance, but it would probably be more comparable to the other compiled versions at best, rather than the x86 ASM one.
>But realistically? For anything except code golf and nerd fights, the actual client requirement is probably better met by a WordPress widget written in php/html, because what they asked for is something that'll print the fizz buzz all the way up to the person's age when they log into the company website... Nobody is even going to notice if it takes a whole second to fizz buzz all the way to 95 :-)
What if 100 million people log from different corners of the world? Would the WordPress widget still cut it?
Not only can you write a modern C++ version in 5 minutes, but by carefully choosing your subset of C++ and coding style you can do so while being as readable and maintainable as in higher-level languages. C++ isn't hard because it's low level, it is hard because it is a Frankenstein monster, and I don't think there is a living being that masters all the complexities of C++. I would love to see a C++ Next version with lots of crap cut out of it.
Judging by its age and how much crap it's accumulated already, I think in 10 years' time Rust won't be in a much better situation.
Similarly, judging by C++'s current trajectory, in 10 years it will have a simplified subset (enforced with something similar to --pedantic) which is easier to get right than Rust is today. Also, it will have a static analysis borrow checker based on clang-tidy.
This is one where I’m fully comfortable with feeling like an impostor. I’ve gotten this far (~20 years) without more than a cursory glance at machine code, I’ll be pleased if I retire at the same level of relevant expertise.
Edit: don’t get me wrong! I admire the talent that goes into this and similar efforts, and find performance-chasing particularly inspiring in any context. This is just an area of that which I don’t anticipate ever wanting to tread myself.
The arrival of tools like Compiler Explorer means examining the assembly code is trivial. It's quite easy to fall down a rabbit hole of code tweaks to see what effect they have on codegen. Even examining the differences between gcc and clang can be quite revealing.
Don't worry. It just seems like everyone else is so talented because no one writes articles about the 2 hrs they spent just trying to get their project to just build without errors. Or if they do, they don't get voted to the top of HN.
As a comment on the linked post notes, this is impressive enough that it could probably be the basis of a degree thesis. An astonishing exhibit of technical skill, very much a masterpiece in the original sense of the term.
Imposter syndrome you say? I scrolled all the way to the bottom of the page hoping to see a solution written in a language I actually understand, not just pretend to understand.
It's a feeling, a fear to be specific, so stems from a different part of the brain. No, I don't think fears go away completely. One learns to manage them.
"future-proofed where possible to be able to run faster if the relevant processor bottleneck – L2 cache write speed – is ever removed)."
Being able to write a function limited mostly by the L2 cache write speed, and being able to realize that, is rad.
And btw this is an interesting example of how hand-optimized assembly can be much, much faster than any other solution. Can you get as fast as this solution with mostly C/C++? It uses interesting tricks to avoid memcpy (calling it slow, rofl).
You could probably get very close to this solution with C or C++, plus AVX intrinsics. Some might consider that "cheating" since intrinsics occupy kind of a grey area between a higher level language and asm.
I suspect by the time you’ve disabled all the compiler “optimisations” that would otherwise knock a few orders of magnitude of performance off the directly translated algorithm, you may as well have written the assembler to start with…
And you probably can’t fine tune your C/C++ to get this performance without knowing exactly what processor instructions you are trying to trick the compiler into generating anyway.
I mean, did you see the very complicated extremely optimized C and C++ code lower on the page? Despite that, they "only" got to 10% of the performance of the ASM code.
The speed of this program is partly that it's written in assembly, but mostly because it's written by someone who is quite clever and clearly put a large amount of time into this problem. None of the other solutions spend much time trying to fit their data into the CPU cache, none go as far as using splicing for zero copies, and not one is doing anything nearly as clever as this program to generate its numbers. All of this would be possible to mostly translate to C++ with AVX intrinsics, but the real accelerator here is not the choice of language, it's the person behind the code.
Now that I have seen the power of madvise + huge pages, everything looks like a nail. Author reckons 30% from less page table juggling. There are techniques here that apply outside assembly.
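For reference, the userspace side of that trick is tiny. A sketch (generic Linux usage, not the author's code) that asks for transparent huge pages on an anonymous buffer looks like this; THP has to be enabled in "madvise" or "always" mode for it to have any effect.

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

// Allocate an anonymous buffer and ask the kernel to back it with
// transparent huge pages, cutting down on TLB misses / page-table walks.
void *alloc_huge(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, len, MADV_HUGEPAGE);   // advisory: the kernel may or may not comply
    return p;
}

int main(void)
{
    void *buf = alloc_huge(64u << 20);   // 64 MiB
    if (buf)
        munmap(buf, 64u << 20);
    return buf ? 0 : 1;
}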
///// Third phase of output
//
// This is the heart of this program. It aims to be able to produce a
// sustained output rate of 64 bytes of FizzBuzz per four clock cycles
// in its main loop (with frequent breaks to do I/O, and rare breaks
// to do more expensive calculations).
//
// The third phase operates primarily using a bytecode interpreter; it
// generates a program in "FizzBuzz bytecode", for which each byte of
// bytecode generates one byte of output. The bytecode language is
// designed so that it can be interpreted using SIMD instructions; 32
// bytes of bytecode can be loaded from memory, interpreted, and have
// its output stored back into memory using just four machine
// instructions.
// [The eighth byte of LINENO_MID] changes in meaning over the course of
// the program. It does indeed represent the billions digit most of
// the time; but when the line number is getting close to a multiple
// of 10 billion, the billions and hundred-millions digits will always
// be the same as each other (either both 9s or both 0s). When this
// happens, the format changes: the hundred-millions digit of
// LINENO_MID represents *both* the hundred-millions and billions
// digits of the line number, and the top byte then represents the
// ten-billions digit. Because incrementing a number causes a row of
// consecutive 9s to either stay untouched, or all roll over to 0s at
// once, this effectively lets us do maths on more than 8 digits,
// meaning that the normal arithmetic code within the main loop can
// handle the ten-billions digit in addition to the digits below.
You could probably get some blazing performance out of an FPGA. I made an FPGA version of FizzBuzz a few years ago, but it was optimized for pointless VGA animation rather than performance.
With an FPGA, the speed limit will 100% be how you want to output the data.
If you require it to come out of IO pins, then the limit will be the total output bandwidth of all the serdes units on the IO pins. Typically that's much more than 55GiB/s.
When I saw this I did wonder how much faster I could do it in hardware, but similarly I expect the bottleneck would be outputting it in a fashion that would be considered fair to be compared to software implementations (eg outputting readable text on a screen).
Regardless, I very much enjoyed your DVD screensaver-esque output.
GPU memory is an order of magnitude higher bandwidth than RAM, so that would seem to me to be the way to go to beat this. The output wouldn’t be accessible from CPU without a big slowdown though.
The amount of completely unnecessary effort unleashed on this problem in this post is amazing. I want to meet the author and shake his hand! It has everything from Linux kernel trivia for output speed to Intel AVX2 code. So unnecessary and so awesome!
I've done stuff like this before, and I imagine the satisfaction of completing it! A tip of my hat to you, sir!
That's borderline unbelievable, given that a single AVX2 instruction can take multiple clock cycles. The reciprocal throughput also doesn't go below ~0.3, to my admittedly shallow knowledge. A remarkable piece of engineering!
That's 16 bytes per clock cycle, i.e. one 32-byte AVX register every two clock cycles. As most Intel CPUs can only do one store per clock cycle, one AVX register per cycle is also the theoretical store limit. I wonder if using non-temporal stores would help make the code cache-size agnostic.
Note that the instruction latency is not important as long as you can pipeline the computation fully (which appears to be the case here!).
Edit: to go faster you would need to be able to use the bandwidth of more than one CPU core. I wonder if you could precompute where the output will cross a page boundary, to be able to have distinct cores work on distinct pages... Hmm, I need a notepad.
Edit 2: it is much simpler than that; you do not need to fill a page to vmsplice it. So in principle the code should parallelize quite trivially. Have each thread grab, say, 1M numbers at a time, for example by atomically incrementing a counter, serialize them to a bunch of pages, then grab the next batch. A simple queue would hold the ready batches, which can be spliced as needed either by a service thread or opportunistically by the thread that has just finished the batch that is next in sequence.
That shouldn't be a problem, pv is basically only counting the number of pages being written. As long as the page size is much larger than the number of threads it should keep up just fine.
The page size can't be more than half the L2 cache (else writeback to DRAM would be triggered, dramatically lowering performance). That means you have to use a relatively small page size, and lock contention in the kernel for those pages then limits throughput if you use many cores, even if the data is never read by pv.
As an aside, it would be fun to see a programming challenge website focused on performance and optimizations, rather than code golf (shortest program) or edge case correctness (interview type sites). I took a course like this in uni where we learned low-level optimization and got to compete with other classmates to see who had the fastest program - a fun experience that I don't really see much of online.
The course had a machine solely for benchmarking. It would only compile and benchmark one program at a time and the students had to queue up.
It worked well for the course and the results were very consistent between runs, but that kind of setup doesn't scale well.
There may be a (slow) option though, and that would be benchmarking the code on a simulated processor. The runtime would be measured relative to the simulated processor as opposed to the computer hosting the simulator.
Only getting 7GiB/s on a Ryzen 7 1800X w/DDR4 3000. Zen 1 executes AVX2 instructions at half speed, but that doesn't account for all of the difference. I guess I need a new computer. To run FizzBuzz.
I ran this on a server with a 5950X (same CPU as this test was run on), but with 2666MHz memory (128GB) instead of 3600MHz memory (32GB) and I only got 41GB/s.
Or if your ASIC generalizes the problem in a slightly different direction, you end up reinventing TWINKLE and TWIRL: https://en.wikipedia.org/wiki/TWINKLE
Could probably store all multiples of 3 and 5 up to some really big number burned directly to hardware and then do something like a big CAM table the way Internet core routers do mapping the numbers to ASCII bit strings. Then speedup IO by not having a general purpose display, but something more like an old-school digital clock that can only show digits and the words "fizz" and "buzz."
That's definitely not going to be useful. By doing this 15 numbers at a time you completely avoid needing to do any division anyway: print FizzBuzz, i+1, i+2, Fizz, i+4, Buzz, Fizz, i+7, i+8, Fizz, Buzz, i+11, Fizz, i+13, i+14, then increment i by 15.
Also, even with a very slow CPU you'd already run faster than can be perceived on a digital clock
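In plain C the unrolled version of that is only a few lines and already has no division or modulo anywhere in the loop (a throwaway sketch, not the contest code, with the block shifted to cover i+1..i+15 so the output matches canonical FizzBuzz):

#include <stdio.h>

int main(void)
{
    // one block covers i+1 .. i+15; the Fizz/Buzz positions never move
    for (long i = 0; i < 1000000000L; i += 15)
        printf("%ld\n%ld\nFizz\n%ld\nBuzz\nFizz\n%ld\n%ld\nFizz\nBuzz\n"
               "%ld\nFizz\n%ld\n%ld\nFizzBuzz\n",
               i + 1, i + 2, i + 4, i + 7, i + 8, i + 11, i + 13, i + 14);
    return 0;
}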
Not for fizzbuzz optimisation it isn't! Get your priorities straight mate! Who is trying to get hired when we could be outputting more fizzes and/or buzzes! :)
But does it support arbitrarily large integers? ;)
Ed: no, but does pretty well:
> The program outputs a quintillion lines of FizzBuzz and then exits (going further runs into problems related to the sizes of registers). This would take tens of years to accomplish, so hopefully counts as "a very high astronomical number" (although it astonishes me that it's a small enough timespan that it might be theoretically possible to reach a number as large as a quintillion without the computer breaking).
Now I'm kinda curious to see how much faster you could go on an M1 Max with the GPU generating the data. Once his solution gets to the point of being a bytecode interpreter, it's trivially parallelizable, and the M1 has _fantastic_ memory bandwidth. Does anyone know if the implementation of pv or /dev/null actually requires loading the data into the CPU cache?
If that is the case, does it actually matter what CPU core pv runs on? I feel like _something_ must ultimately zero the page out before it can get re-mapped to the process, but I'm not sure which core that happens on, or whether there's some other hardware mechanism that allows the OS to zero a page without actually utilizing memory bandwidth.
Unfortunately, the memory bandwidth that matters here is not bandwidth to GPU memory, but bandwidth to main system memory (unless anyone knows how to splice a pointer to GPU memory onto a unix pipe). That's specifically why the M1 came to mind, as a powerful UMA machine that can run Linux. Perhaps a modern gaming console could hit the same performance, but I don't know if they can run Linux.
PCIe 5 speed is 64 GB/s, so theoretically if you perfectly pipelined the result you could achieve the same performance on a next-generation discrete GPU.
Wow. There is programming and then there is programming. Whenever I see something like this I feel like what I do for a living is duplo blocks in comparison.
I am thankful every day for the work of those who came before me. Their long hours toiling over hardware, assembly, compilers, protocol stacks, libraries, frameworks, etc let us focus on the bigger picture, how to write same-y CRUD apps for businesses :)
I've had the opportunity to tinker with ASM, the Z80 architecture, low-level programming and other similar stuff (I'm less than 1/1000th as able as the referenced author).
I find this kind of programming very beautiful and rewarding, in that you really know that you are programming the hardware. Unfortunately it's not an easy path to a good paying job (unless you are exceptional like this gentleman). So I ended up building fintech web apps.
You can get a well paying job off of these skills, but the job title will read something like "performance engineer" or "security researcher" rather than "assembly programmer".
I was wondering the same thing! pv probably never touches its input and might itself be using splice to never read the bytes into userspace and just accumulate the byte counts.
"The Grid. A digital frontier. I tried to picture clusters of information as they moved through the computer. What did they look like? Ships, motorcycles? Were the circuits like freeways? I kept dreaming of a world I thought I'd never see. And then, one day I got in...
It turns out The Grid is just a guy sitting in a chair, shouting about "Fizz!" and "Buzz!" as fast as he can.
It wasn't really what I had in mind.
(The image of this poor program, stuck shouting "fizz!" and "buzz!" for subjectively centuries at a time struck me...)
> // Most FizzBuzz routines produce output with `write` or a similar
> // system call, but these have the disadvantage that they need to copy
> // the data being output from userspace into kernelspace. It turns out
> // that when running full speed (as seen in the third phase), FizzBuzz
> // actually runs faster than `memcpy` does, so `write` and friends are
> // unusable when aiming for performance - this program runs five times
> // faster than an equivalent that uses `write`-like system calls.
I'm not an assembly programmer, but my intuition is that this is safer. You can only rely on "zero-copy" behavior when you have total control of the program and know that that memory region is going to stay put and remain uncorrupted. Therefore, most external code will make a copy because they can't insist on this (and because for most people, making a copy is pretty fast).
I think that some Unix derivatives do, or have done in the past. If the argument to write is page aligned and a multiple of the page size, they play VM tricks to implement zero-copy writes.
The issue is that the caller of write is allowed to do anything it wants with the buffer after write returns, so the write implementation needs to unmap the stolen pages and perform copy-on-write if the caller ever touches them again; in practice the optimization is very fragile. For this reason Linus has always refused to implement it in Linux.
vmsplice gets away with it because it is part of the caller's contract that it can't touch the pages until the kernel is done with them. Unfortunately there is no general way to know when it is safe to reuse them; it is very application specific (for example there might be an explicit ack from the consumer that it has received the data).
Did anyone else get recommended and see the FizzBuzz video on YouTube ( https://youtu.be/QPZ0pIK_wsc ) just before this article or did I just witness an incredible coincidence?
“ This is faster than a number of standard functions. In particular, it's faster than memcpy, which presents interesting challenges when it comes to I/O”
A typical desktop computer processor runs at 3GHz, or 3 billion cycles per second. It has 64-bit registers, which is 8 bytes each. If it moves one register of results per clock cycle that's up to 24 billion bytes per second, 24GB/s, if it does nothing else and never stalls or trips up[1]. Less than half the speed this answer is moving data. In a typical desktop computer from a few years back the connection between CPU and main memory tops out at 25GB/sec throughput[2], often less. There's delay when a user program requests that the Linux kernel does things such as allocate memory or open input/output. Printing text to a console is really quite slow. That means lots of stalls and trips waiting for IO related things and context switching as the same processor runs both the FizzBuzz printer and the pv counter and the Linux kernel and the pipes connecting them together, and has to keep task-switching between them.
From the opening question on that page "A naive implementation written in C gets you about 170MiB/s on an average machine" and that C code is typical, there's nothing drastically bad about it. C already has a reputation as a fast, low level language. This answer is running over 320x faster than the straightforward C answer on the same hardware. If you asked people "how much faster can it possibly get without moving to specialist hardware", I doubt they'd guess that high.
It's difficult to get higher speed by using more CPU cores because calculating FizzBuzz is much faster than printing it so they'd just fill up a queue and then stall waiting. To get faster, people are leaning on what exactly is happening internally, adjusting buffer sizes and how to generate the text "FizzBuzz" and get it into the right place to be printed at the right time, and which system calls can be done with less overhead, and lining up more work which can happen during the pauses, and unrolling loops to give the CPU a pattern of instructions it can execute quicker, this is getting answers into the GB/sec ranges.
This answer is like the Bugatti Veyron of answers; so finely tuned it's surely getting close to the limits of what the hardware can do. Making use of AVX2 registers which can hold 32 bytes instead of 8, and can be vectorised so one instruction processes many pieces of data, which means less overhead of instructions for the CPU to decode and process per piece of data, and deeper magic, a lot of skill and knowledge required to put all of it together.
And of course, someone spending months trying to send "Fizz" and "Buzz" into a counter at tens of billions of bytes per second for no good reason, is amazing in a different way.
[1] A nonsense back of an envelope estimate, but good enough for context.
Somehow this reminds me of Aphyr's Technical Interview series[1], and the thought of some assembler dark wizard actually coding this in an interview is highly amusing.
How does one debug the L2 cache? i.e. how does one inspect the data in the L2 cache and verify that loading some data from RAM actually cached X number of bytes, and the correct data has been loaded in the cache lanes, and the next instruction will use the data from there?
I’ve printed his entry and there are some neat tricks I learned, but ‘stage 3’, where, according to his comments, a bytecode interpreter is used, is still beyond my understanding. Also not sure why this helps, a bytecode interpreter sounds like a lot of overhead to me…
The bytecode is specifically designed so that it maps directly to the output.
// The third phase operates primarily using a bytecode interpreter; it
// generates a program in "FizzBuzz bytecode", for which each byte of
// bytecode generates one byte of output. The bytecode language is
// designed so that it can be interpreted using SIMD instructions; 32
// bytes of bytecode can be loaded from memory, interpreted, and have
// its output stored back into memory using just four machine
// instructions.
// The bytecode format is very simple (in order to allow it to be
// interpreted in just a couple of machine instructions):
// - A negative byte represents a literal character (e.g. to produce
// a literal 'F', you use the bytecode -'F', i.e. -70 = 0xba)
// - A byte 0..7 represents the hundreds..billions digit of the line
// number respectively, and asserts that the hundreds digit of the
// line number is even
// - A byte 8..15 represents the hundreds..billions digit of the line
// number respectively, and asserts that the hundreds digit of the
// line number is odd
// %ymm2 holds the bytecode for outputting the hundreds and more
// significant digits of a line number. The most significant digits of
// this can be obtained by converting LINENO_TOP from high-decimal to
// the corresponding bytecode, which is accomplished by subtracting
// from 198 (i.e. 256 - 10 - '0'). The constant parts of LINENO_TOP
// are 198 minus the bytecode for outputting the hundreds to billions
// digit of a number; this makes it possible for a single endian
// shuffle to deal with all 16 of the mid and high digits at once.
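A tiny scalar check of the "negative byte = literal character" rule above, written by me purely as an illustration (not part of the entry): the bytecode for fixed text like "Fizz\n" is just each ASCII byte negated, and decoding it the way the interpreter does (the shuffle yields 0 for any byte with its high bit set, then the bytecode is subtracted) gives the character back.

/* Toy round-trip of the "negative byte = literal character" rule (my own
 * illustration, not the author's code).  -'F' is -70, i.e. 0xba, and
 * 0 - (-'F') recovers 'F'. */
#include <stdio.h>

int main(void) {
    const char *text = "Fizz\n";
    for (const char *p = text; *p; p++) {
        signed char bc = (signed char)(-*p);          /* encode: negate the ASCII byte */
        unsigned char out = (unsigned char)(0 - bc);  /* decode: shuffle gives 0, then subtract */
        printf("char 0x%02x -> bytecode 0x%02x -> output 0x%02x\n",
               (unsigned char)*p, (unsigned char)bc, out);
    }
    return 0;
}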
Most single people who have a full-time job should still have enough free time to devote 1000+ hours a year to personal projects - and even with a family, you should be able to find 400+ hours if your employer has any respect for WLB and/or you know how to set boundaries. That's roughly two college degrees' worth of time per decade.
I have trouble with procrastination, but since watching Arnie's take on this I cannot get it out of my head and I have actually found it helpful (yet here I am on HN again :()
Well, it's easy when your goal is clear-cut and 100% ego-directed ("win Mr. Universe"). Not so much when it's related to much larger concerns and uncertainties and pessimism (e.g. fighting global warming).
Not everyone has the same ability to utilize those hours. People have children and/or pets and/or loved ones to tend, different cognitive and physical limitations, different economic challenges, different social expectations, different living conditions which may impact all of the above.
Speaking for myself: between caring for a puppy, cognitive impairments from ADHD, sustained impact from burnout, and financial responsibilities for family… my ability to carve out free time for a very long list of things I’d like to do extracurricularly is limited to a few hours a week at most. Achieving a “months of work” project in my areas of interest and applicable talents is easily going to take 10x as long. And that also means being much more circumspect about where I choose to focus those interests.
Maybe I'd feel differently if the comment ended with :) instead of :(.
But I mean, obviously, yes. Different people can do different things for a zillion different reasons. I'm reacting negatively because it sounds like a criticism of the author just for spending a lot of time on a hobby project for its own sake.
See also: "why are you having fun when cancer still isn't cured yet?"
Wow! I took it entirely differently (and still do). I read it as being disappointed that they don’t feel as capable to pursue and invest significant time/effort into their own ideas. Essentially I saw the :( as an admission of envy, not as a judgment.
Admittedly, that could very easily be me projecting. I have tons of envy of people pursuing their ideas without the limitations I have to take seriously in my own life. It doesn’t make me sad to see them succeed or use their time the way they do, it makes me sad to have a brain full of ideas and limited ability to reify them.
It would be nice to compare its performance with a safe variant written in Rust, one written in Java, and one in Python, to see what % of the performance we lose for higher abstraction, convenience and safety.
It would naturally be foreign to most, but the neat part is that it's also pretty accessible: look up vectorization techniques and L2 cache optimization and you're most of the way to understanding it. Most people who grok something like this at first glance are specialized in the same or a similar domain, and probably wouldn't have an expert understanding of some things you take for granted.
Interesting how he uses the C preprocessor as an assembly preprocessor; cpp as a general-purpose macro processor. Maybe this is more portable than relying on built-in macro features that vary from assembler to assembler.
My naive Rust implementation (as shown in the linked question) was mind-bogglingly slow. C could do around 150MiB/s, while Rust could only do 10MiB/s. A simple `objdump -d` shows Rust generating a lot more code, though I'm not sure how much of that is relevant.
At that point my coffee time ran out. I wish I had more time to figure out why. :-(
What's cool here is that if you can make FizzBuzz go 55GB/s, then you have the skills needed to make a whole bunch of other stuff go that speed too. My personal favorites are that the Intel architecture lets us do population count, polynomial division, and eight-tap filters at that speed as well, which is wonderful.
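For anyone curious what two of those look like from C, here's a quick sketch (mine, not from the FizzBuzz entry, and a guess at the instructions meant: POPCNT, and PCLMULQDQ as the building block for GF(2) polynomial arithmetic such as CRCs; the eight-tap filter would be the pmaddubsw/dpbusd family, omitted here).

/* Quick sketch of hardware population count and carry-less multiply.
 * Build with: gcc -O2 -mpopcnt -mpclmul */
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>

int main(void) {
    uint64_t x = 0x0123456789abcdefULL;
    printf("popcount = %d\n", (int)_mm_popcnt_u64(x));     /* POPCNT */

    /* PCLMULQDQ: multiply two polynomials over GF(2), with no carries. */
    __m128i a = _mm_set_epi64x(0, 0x87);   /* x^7 + x^2 + x + 1 */
    __m128i b = _mm_set_epi64x(0, 0x1b);   /* x^4 + x^3 + x + 1 */
    __m128i prod = _mm_clmulepi64_si128(a, b, 0x00);
    printf("clmul low 64 bits = 0x%016llx\n",
           (unsigned long long)_mm_cvtsi128_si64(prod));
    return 0;
}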
I guess it's faster than memcpy because both are memory bound, but this only has to write to memory, whereas memcpy has to both read from and write to memory?
Edit: actually it sounds like it’s more cache efficient because it can always use the same block of cache memory?
I believe so. A memcpy often needs to touch large blocks of memory, far more than would fit in a cache, so most memcpy implementations use non-temporal stores for large copies to indicate that the memory is not going to be used again soon. They write directly to main memory, avoiding using up valuable cache space on a cache line that is written once and never touched again. This program, however, can operate entirely within L2, so it can go faster.
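A small sketch of that distinction with AVX2 intrinsics (mine, not lifted from any real memcpy; the destination is assumed to be 32-byte aligned):

/* Regular stores go through the cache; non-temporal "streaming" stores hint
 * that the lines won't be reused and bypass it.  Build with: gcc -O2 -mavx2 */
#include <immintrin.h>
#include <stddef.h>
#include <stdlib.h>   /* aligned_alloc, free */

void fill_cached(unsigned char *dst, size_t n, __m256i v) {
    for (size_t i = 0; i + 32 <= n; i += 32)
        _mm256_store_si256((__m256i *)(dst + i), v);   /* goes through the cache */
}

void fill_streaming(unsigned char *dst, size_t n, __m256i v) {
    for (size_t i = 0; i + 32 <= n; i += 32)
        _mm256_stream_si256((__m256i *)(dst + i), v);  /* non-temporal: bypasses the cache */
    _mm_sfence();   /* order the streaming stores before later writes are observed */
}

int main(void) {
    size_t n = 1 << 20;                        /* 1 MiB, a multiple of 32 */
    unsigned char *dst = aligned_alloc(32, n);
    if (!dst) return 1;
    __m256i v = _mm256_set1_epi8('x');
    fill_cached(dst, n, v);
    fill_streaming(dst, n, v);
    free(dst);
    return 0;
}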
This is the heart of this program. It aims to be able to produce a
sustained output rate of 64 bytes of FizzBuzz per four clock cycles
in its main loop (with frequent breaks to do I/O, and rare breaks
to do more expensive calculations).
The third phase operates primarily using a bytecode interpreter; it
generates a program in "FizzBuzz bytecode", for which each byte of
bytecode generates one byte of output. The bytecode language is
designed so that it can be interpreted using SIMD instructions; 32
bytes of bytecode can be loaded from memory, interpreted, and have
its output stored back into memory using just four machine
instructions. This makes it possible to speed up the FizzBuzz
calculations by hardcoding some of the calculations into the
bytecode (this is similar to how JIT compilers can create a version
of the program with some variables hardcoded, and throw it away on
the rare occasions that those variables' values change).
The bytecode format is very simple (in order to allow it to be
interpreted in just a couple of machine instructions):
- A negative byte represents a literal character (e.g. to produce
a literal 'F', you use the bytecode -'F', i.e. -70 = 0xba)
- A byte 0..7 represents the hundreds..billions digit of the line
number respectively, and asserts that the hundreds digit of the
line number is even
- A byte 8..15 represents the hundreds..billions digit of the line
number respectively, and asserts that the hundreds digit of the
line number is odd
In other words, the bytecode program only ever needs to read from
LINENO_MID; the information stored in LINENO_LOW and LINENO_TOP
therefore has to be hardcoded into it. The program therefore needs
to be able to generate 600 lines of output (as the smallest number
that's divisible by 100 to be able to hardcode the two low digits,
200 to be able to get the assertions about the hundreds digits
correct, and 3 and 5 to get the Fizzes and Buzzes in the right
place).
The bytecode interpreter consists of four instructions:
1. Load the bytecode from memory into %ymm2;
2. Use it as a shuffle mask to shuffle LINENO_MID_TEMP;
3. Subtract the bytecode from the shuffle result;
4. Output the result of the subtraction.
#define INTERPRET_BYTECODE(bc_offset, buf_offset) \
vmovdqu %ymm2, [%rdx + bc_offset]; \
vpshufb %ymm0, LINENO_MID_TEMP, %ymm2; \
vpsubb %ymm0, %ymm0, %ymm2; \
vmovdqa [OUTPUT_PTR + buf_offset], %ymm0
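To make the semantics concrete, here is a scalar model of what that macro does to a single byte. This is my own paraphrase, not the author's code; digits stands in for the shuffled LINENO_MID_TEMP lane, which the program arranges so that "selected value minus selector byte" comes out as the right ASCII digit.

/* Scalar model of one byte of the four-instruction interpreter (my own
 * paraphrase, not the author's code). */
#include <stdint.h>
#include <stdio.h>

uint8_t interpret_byte(int8_t bc, const uint8_t digits[16]) {
    /* vpshufb: a mask byte with its high bit set selects zero; otherwise
       the low four bits index into the 16-byte source lane. */
    uint8_t shuffled = (bc < 0) ? 0 : digits[bc & 0x0f];
    /* vpsubb: output = shuffle result - bytecode (mod 256).
       For a literal byte bc = -'F', this gives 0 - (-'F') = 'F'. */
    return (uint8_t)(shuffled - (uint8_t)bc);
}

int main(void) {
    uint8_t digits[16] = {0};                  /* placeholder lane contents */
    putchar(interpret_byte((int8_t)-'F', digits));   /* prints F */
    putchar('\n');
    return 0;
}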