The amount of low level CPU architecture knowledge to write such a program is mind boggling. Just goes to show how much room for improvement a lot of programs have.
FizzBuzz has many properties that make it very suitable for these kinds of optimizations that might not be applicable to general purpose code:
+ extremely small working set (a few registers worth of state)
+ extremely predictable branching behavior
+ no I/O
These properties, however, don't diminish the achievement of leveraging AVX2 (or any vectorization) for a problem that doesn't immediately jump out as SIMD.
The problem is mostly about writing out bytes, which is probably the most expensive part of this. In fact, if you read the winning solution's description, I/O is the primary problem here.
> doesn't immediately jump out as SIMD.
IDK that I agree with this assessment. Very naively, I see no reason you wouldn't take the 512-bit SIMD registers and split them into 16 32-bit lanes. From there, it's a relatively simple matter of using 2 registers for the divisors, pulling out the results, and transforming them into the text to print. In other words, you'd be chunking this up into 16 iterations per loop (vs. 1 with the naive assembly).
This is the sort of thing that jumps out as easily vectorizable.
Now, the fastest answer very obviously does not take this approach, because I'm certain they realized the same thing: the difficult part here isn't the actual division, but instead pumping out the correct text at the highest speed possible. If you read through it, most of the code is dedicated to converting binary numbers into ASCII :D
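To make the lane-splitting idea concrete, here's a rough sketch (my own, not the contest code) in C with AVX2 intrinsics, so 8 32-bit lanes rather than AVX-512's 16. Divisibility is tested with the usual multiply-by-modular-inverse trick for odd d: n % d == 0 iff (n * inv_d mod 2^32) <= (2^32 - 1) / d. Compile with -mavx2.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    const __m256i inv3 = _mm256_set1_epi32((int)0xAAAAAAABu), lim3 = _mm256_set1_epi32(0x55555555);
    const __m256i inv5 = _mm256_set1_epi32((int)0xCCCCCCCDu), lim5 = _mm256_set1_epi32(0x33333333);
    for (int base = 1; base <= 100; base += 8) {
        // n = base, base+1, ..., base+7 in the 8 lanes
        __m256i n  = _mm256_add_epi32(_mm256_set1_epi32(base),
                                      _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
        __m256i m3 = _mm256_mullo_epi32(n, inv3);
        __m256i m5 = _mm256_mullo_epi32(n, inv5);
        // a lane is divisible if the product is <= the limit (unsigned compare via min)
        __m256i d3 = _mm256_cmpeq_epi32(_mm256_min_epu32(m3, lim3), m3);
        __m256i d5 = _mm256_cmpeq_epi32(_mm256_min_epu32(m5, lim5), m5);
        int fizz = _mm256_movemask_ps(_mm256_castsi256_ps(d3));
        int buzz = _mm256_movemask_ps(_mm256_castsi256_ps(d5));
        // the scalar tail -- turning the masks into text -- is exactly the part
        // the winning solution spends most of its effort on
        for (int i = 0; i < 8 && base + i <= 100; i++) {
            if ((fizz >> i) & 1) printf("Fizz");
            if ((buzz >> i) & 1) printf("Buzz");
            if (!(((fizz | buzz) >> i) & 1)) printf("%d", base + i);
            printf("\n");
        }
    }
    return 0;
}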
Maybe I should be more clear; no data needs to be fetched from disk/network, and the "writes" don't need to go past memory.
As for the second point, you might have a different definition of "naive" and "relatively simple", as my brain has rotted too much from only thinking about SIMD for numerical computation. While determining divisibility is relatively clear, it wasn't clear how the printing would be easily vectorizable, as the output-per-number is variable in length.
I think this nails it. Vectorizing the math problem is "easy" (send batches of numbers to cores, do the division) but then you have to re-order it for printing (not to mention actually print it), so paradoxically, it probably makes more sense for the program to be single-threaded.
Yes, the pattern repeats every 15 integers.
So you only need to do one division operation to get the index modulo 15.
I was bored once at work and figured out a way to compute this without doing much arithmetic. It only requires 1 modulo operation.
#include <stdio.h>

int main(void)
{
    for (int i = 1; i < 100; i++)
    {
        // make a bit vector with a 1 in bit (i mod 15)
        unsigned int i_as_bitvector = 1 << (i % 15);
        // Check it against the valid positions for FizzBuzz
        // Octal 011111 is binary 1001001001001 (bits 0,3,6,9,12: multiples of 3)
        // Octal 02041 is binary 10000100001 (bits 0,5,10: multiples of 5)
        printf("%d ", i);
        if (i_as_bitvector & 011111) printf("Fizz");
        if (i_as_bitvector & 02041) printf("Buzz");
        printf("\n");
    }
    return 0;
}
I also have a version which has NO magic constants anywhere in the program, except 3 and 5. I'll post it if someone is interested.
Also, after printing 1 to 9, you have a single pattern of 30 numbers that repeats exactly 3 times from 10 to 99, then 30 times from 100 to 999, then 300 times from 1000 to 9999, and so on. That lets you extract lots of code from the loop and run it roughly once every 10^n numbers.
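For anyone who wants to sanity-check that claim, here's a quick throwaway C snippet (mine, not from the thread) that prints the Fizz/Buzz "shape" of each 30-number block between 10 and 99; all three rows come out identical, because lcm(3, 5, 10) = 30.

#include <stdio.h>

// 'X' = FizzBuzz, 'F' = Fizz, 'B' = Buzz, '.' = plain number
static char shape(long n)
{
    if (n % 15 == 0) return 'X';
    if (n % 3 == 0)  return 'F';
    if (n % 5 == 0)  return 'B';
    return '.';
}

int main(void)
{
    for (long start = 10; start < 100; start += 30) {
        for (long n = start; n < start + 30; n++)
            putchar(shape(n));
        putchar('\n');   // each of the three rows prints the same string
    }
    return 0;
}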
Why would you think in sets of ten, when there should actually just be one pattern in 15? Then it just becomes a challenge to arrange your code to work on these blocks of 15.
We could probably code 16 versions of the block of 15 code that repeat and are nicely aligned for SIMD.
In my mind a SIMD version would work by predefining a string buffer 15 elements long with Fizz in positions 0,3,6,... and Buzz in positions 0,5,10. These are constants that don't ever change. Then vectorized code would only write to positions that change, the numbers: 1,2,4,7,8,11,13,14. Most of the time these positions would have fixed width too (lengths of large numbers don't change often) so you can use constant write addresses. So 8 SIMD threads could handle a block, and write everything blockwise.
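A scalar sketch of that template idea (my own simplification, not SIMD and not the contest code): build the 15-line block once, with Fizz/Buzz in their fixed positions and fixed-width, zero-padded placeholders for the 8 numbers, then each block only patches the digit fields in place. Real code would switch templates whenever the numbers grow a digit instead of zero-padding.

#include <stdio.h>
#include <string.h>

#define W 8  /* fixed digit width for this sketch */

int main(void)
{
    char tmpl[256];
    int  off[8], pos[8], ns = 0;
    char *p = tmpl;
    // build the template once: k = 1..15 is the position within a block
    for (int k = 1; k <= 15; k++) {
        if (k == 15)          p += sprintf(p, "FizzBuzz\n");
        else if (k % 3 == 0)  p += sprintf(p, "Fizz\n");
        else if (k % 5 == 0)  p += sprintf(p, "Buzz\n");
        else {
            off[ns] = (int)(p - tmpl);   // where this number's digits live
            pos[ns] = k;                 // its position within the block of 15
            ns++;
            p += sprintf(p, "%0*d\n", W, 0);
        }
    }
    size_t len = (size_t)(p - tmpl);

    for (long i = 0; i + 15 <= 99999999; i += 15) {   // stay within W digits
        for (int s = 0; s < ns; s++) {
            char digits[W + 1];
            sprintf(digits, "%0*ld", W, i + pos[s]);
            memcpy(tmpl + off[s], digits, W);          // patch the digits only
        }
        fwrite(tmpl, 1, len, stdout);
    }
    return 0;
}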
I was thinking blocks of 10 because I can itoa(the number of tens) and then concat with the subset of 0, 1, 2, 3, 4, etc. that I care about. I guess with blocks of 15 you just need to do two itoas, and worry about two types of block vs 3.
Using AVX-512 is not suitable for programs that take very little time to run. There is a penalty in the ms range to "warm up" the CPU (it is more of a cool-down, actually: the CPU drops its clock speed while running AVX-512 code).
As OP stated, the limiting factor is memory access. That's why he kept saying 64 bytes every four cycles.
But OP likely didn't use it because most CPUs lack support for AVX-512.
The new Intel CPUs introduced many changes to the frontend. This will likely improve the speed.
It might also be possible to get the CPU to operate at a higher clock speed.
Code looks a bit long. Not sure if the unrolling actually helps.
EDIT: Just look at Agner Fog's microarchitecture doc. Ice Lake and Tiger Lake can do 64 bytes/cycle.
In theory, it can run 4x faster (on bare metal, maybe).
I think you're missing the point. The issue is that with the advent of higher-level languages, starting from Java and JavaScript and on to Python and so on, most people have forgotten or have never learnt the skills to optimize code.
I'd argue that, as a percentage, the share of people who can write proper multi-threaded code has only diminished over the years.
And we see the result, despite the massive increase in computing power, software in general has become slower and bloated.
The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.
If you have "months to write fizzbuzz" levels of resources available, sure, you can microoptimize everything. Except in practice you can't, because the amount of effort needed for this kind of microoptimization is quadratic or worse in the complexity of the problem you're actually solving.
For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster, because they'll have had a lot more time available to spend prototyping and picking good algorithms rather than debugging undefined behaviour because they used the wrong combination of integer bit lengths.
> The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.
I disagree. I believe the majority of code is slow because it's written without any consideration at all to performance. I like Casey Muratori's philosophy of non-pessimization, where true optimization (measuring, working on hot spots) is rare and rarely necessary, but massive speedups compared to the general state of the art are achievable by simply not writing code using patterns that are inherently slow. This isn't deep algorithmic stuff; it's just avoiding copies and/or pointer chasing.
> For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster
The Advent of Code runs every year and I'm not sure about C (not something I track) but there are plenty of Rust submissions and while the Rust submissions take longer to come in, it's not THAT much longer. From memory it's like 2x on a timescale of minutes. The Python programs are not faster.
Not disagreeing or anything, but an Advent of Code sized example will not be anywhere close in complexity to your typical CRUD application. The thing is, we possibly could not even write most of these applications in a low-level language; we should not forget that while Rust indeed has good abstractions, low-level languages simply leak memory-related information up to the higher levels as well, making it necessary to deal with it at a high level. And in these cases, time to market, speed of refactors and maintainability are much more important than raw speed (which, again, may not even be that significant at all; e.g., often most of the work is done on the db side).
> an Advent of Code sized example will not be anywhere close in complexity to your typical CRUD application
Advent of Code sized problems are the best-case scenario for Python. The difference in implementation time goes down as program size increases, because you spend a larger fraction of the time figuring out what you're implementing.
The reason it's not done is mainly social and not technical. Ecosystems matter and this is particularly the case in graphical desktop environments where massive amounts of time and effort are required to re-implement features matching users' expectations (e.g. Flutter).
If we're talking a server side http app server then the library requirements are significantly lower and the libraries are generally there in almost any language. To keep the language comparison the same, it's not significantly slower to implement a CRUD backend in Rust than it is in Python. Depending on your specific setup and how much pre-prep is involved it can be faster. I'm not aware of any framework in a dynamic language that can produce production grade endpoint handlers that are as terse as a Rocket app heavily leveraging request guards. The guards handle validation, conversion, session context and db connection based on the endpoint parameter types so you just write the happy path which is 1-2 lines of code for CRUD.
> speed of refactors and maintainability
Dynamic languages are significantly worse for maintainability and refactoring. I say this as someone who generally gets paid to be a frontend developer and spent years writing both Python and Clojure professionally. Despite my arguments, I do not write server side Rust for work because team coherence is more important for doing my actual job (solving a customer's problem) than programming language and I'm the only person in my company who writes Rust. I've been doing this long enough that I accept the worse solution as the cost of doing business. My personal projects don't have this constraint so perf is orders of magnitude better in exchange for a lot more knowledge and a bit more design work.
> Advent of Code size problems is the best case scenario for Python. The difference in implementation time goes down as program size increases because you spend a larger fraction of the time figuring out what you're implementing.
The difference in code size between a good and a bad language goes up as program size increases, because a large codebase makes it harder to keep it all in your head and forces you to add more layers and duplication.
>And in these cases, time to market, speed of refactors and maintainability are much more important than raw speed (which, again, may not even be that significant at all; e.g., often most of the work is done on the db side)
You don't have to ditch Java or C# for your app and use C or Rust instead. But you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.
Also, if performance, throughput and stability weren't a thing, we have to wonder why Twitter switched from RoR to Java. Couldn't they just have bought more servers instead?
> you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.
In practice you only have so much attention and there are better places to spend it.
> Also, if performance, throughput, stability wouldn't be a thing, we have to wonder why Twitter switched from RoR to Java?
If raw speed matters we have to wonder why Twitter started with RoR, and only switched (mostly to Scala, not Java, AIUI) once they had tens of millions of users.
In a tiny handful of outlier cases where a given piece of code is run literally billions of times a day it eventually becomes worth doing performance microoptimisations. But only after you've confirmed that there's absolutely massive demand for this particular code and already done your iterative refactors and figured out what the code should do.
> But you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.
Writing Java in a more performant way often means ditching OOP and using primitive types, primitive arrays everywhere instead, and also avoiding allocations up to the point where all objects are reused. Such code is just as slow to write and as error-prone as C, and way worse than C++ where at least you have RAII, generics, STL algorithms and zero-cost abstractions.
You only have to do that in hot loops (in most problems), that you can easily profile with the very good observability tools that java provides. The rest can be maintainable, good quality OOP code.
I would even go as far as to say that OOP’s encapsulation pretty much exists just for this: the “ugly” implementation can reside inside a class and you only have to care about its public API from the outside.
> I believe the majority of code is slow because it's written without any consideration at all to performance. I like Casey Muratori's philosophy of non-pessimization, where true optimization (measuring, working on hot spots) is rare and rarely necessary, but massive speedups compared to the general state of the art are achievable by simply not writing code using patterns that are inherently slow. This isn't deep algorithmic stuff; it's just avoiding copies and/or pointer chasing.
I don't/can't watch videos; I would be interested to see a written argument, but generally my view is that copies and pointer chasing are irrelevant because code is so far away from optimal already. You're right that code is mainly written without any consideration to performance, but that manifests itself first in poor choices of algorithm and/or datastructure. I don't think there's anything "deep" about, say, using a hashmap rather than searching linearly through a list; indeed I'd consider it easier to understand than avoiding copying.
> The Advent of Code runs every year and I'm not sure about C (not something I track) but there are plenty of Rust submissions and while the Rust submissions take longer to come in, it's not THAT much longer. From memory it's like 2x on a timescale of minutes. The Python programs are not faster.
Advent of Code is still extremely tiny programs, and Rust is a much much better language than C for thinking in.
True. Most business apps I've seen seemed like people did serious research into how to write the worst code from a performance point of view. Gazillions of wrappers upon wrappers, tons of indirection, data being copied and recopied thousands of times, creating FactoryFactoryFactories just for the sake of it.
If people paid more attention to how the data flows instead of to Uncle Bob and the GoF, they would end up not only with a cleaner software architecture but also with more performance.
> I believe the majority of code is slow because it's written without any consideration at all to performance.
Both can be correct. I use standard data structures for many things even if some prefix tree might be better for the operations I perform on a set. I don't think about that in most cases, since computing power is cheaper than the additional work. If performance becomes a problem, I can optimize later.
Advent of Code isn't a realistic performance metric. Working software is for most software projects.
> For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster, because they'll have had a lot more time available to spend prototyping and picking good algorithms rather than debugging undefined behaviour because they used the wrong combination of integer bit lengths.
I think this is more nuanced than that. Competitive programming actually gives fairly good insight here, because skilled programmers are placed under time constraints and obviously care about performance.
What you see, usually, is that programs that don't need maximum performance are often written in Python, which allows people to quickly write code that needs less debugging. But double the allotted time and the C programmer finally has their code finished and free of bugs, and they're beating the pants off of your Python code right out of the gate. You try and come up with clever ways to avoid copies and drive down the Big-O, but the C programmer comes up with the same thing just a bit behind you and leads most of the time because they get a free 10x speedup just because of the language they're using.
It's not surprising that the highest levels of competitive programming are dominated by C (and to a greater extent, C++) programmers, with Python use extremely rare, almost always reserved for rare, complicated algorithms that happen to bring a significant big-O improvement that justify the use of Python to write them.
The overwhelming majority of code isn't slow because it used the wrong algorithm, it's because the Product team can't demo or show off an increase in efficiency. It's because they aren't measured by their ability to reduce or control the growth of the AWS bill. And they probably wouldn't have the knowledge to ascertain a reasonable bill. So they push deadlines which pushes devs to take ugly but fast to write solutions using high level languages.
Also, performance improvements pay for themselves over months and years while new features usually pay for themselves faster (at least they do in forecasts).
And finally, touching working code to make a performance improvement necessarily has some risk and that risk might be worth more than a year of cost savings from the improvement.
>The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.
Try to look at The Computer Language Benchmarks Game where all implementations are using the same algorithm. An optimized C implementation can be 300 times faster than Python.
That's exactly the kind of missing the point that I'm talking about: tiny programs where all implementations are using the same algorithm and people have spent months microoptimising and are utterly unreflective of realistic business problems.
There's a law that says that as resources become easier to consume, people consume more of them. Like when increasing the capacity of a road doesn't reduce the time spent in traffic. Wouldn't that apply to software? I feel like it's a bit unfair to call software "slower and bloated" when you wouldn't call a road "bloated" because it has a new lane. People can do more things now.
It's more like I start manufacturing cars three lanes wide because the material I'm using is cheaper than cutting and engineering it down to a normal size. That's how I see the modern-software analogy in this context.
It feels like the frustration comes because I see the analogy this way:
The roads have been given extra lanes, and the total possible throughput of the road has increased. As a result, car manufacturers have decided that they can use a cheaper manufacturing process which makes slower cars. But when anyone complains, they point to the additional lanes and say ‘but the number of cars that get from point A to point B has increased, even if each trip is slower.’
The end user isn’t getting a choice when software developers make slower code each year. They have to buy more expensive hardware to keep up with the slowdowns of software developers.
I get it. But what exactly do you want to happen here?
In terms of your analogy, do you want more expensive cars that fewer people can have?
Software developers could spend more time optimizing their code. But that means they are spending less time elsewhere: fixing bugs, adding features, developing other products.
It should be noted that many times features come in that people do not want.
“The new format bar” for Slack and “huddles” (perhaps less antagonizingly) are examples of things that Slack has done seemingly to prove they're still working on the program.
Despite the features not being needed in the slightest.
I actually disagree, at least about huddles. Huddles are a direct response to the low-friction nature of Discord voice channels (and arguably Teamspeak/Mumble/etc before Discord, although nothing else married text and voice channels quite as well before Discord.) It's almost silly how small of a difference there is between starting a huddle and a channel-wide "call" - the only difference is that it doesn't start out by "ringing" for everyone in the channel. But that tiny difference completely shifts the feeling from "this is a formal communication" to "we're all just hanging out and you can join if you want but no pressure."
IMO huddles happened because the pandemic moved so many more people remote that the need for a water-cooler replacement became much more obvious.
That would imply that we do fewer things, or only as many as before, with computers, which I think is false. Maybe you don't agree with what everyone does, but that doesn't mean that it's worthless.
Sometimes bloaty design decisions transcend the question of programming language. Electron probably has (I didn't check) a faster JavaScript engine than Qt, but I'd expect that similarly sophisticated implementations of the same program in Qt or Electron show an advantage in performance to the former. But people use Electron because it's familiar and it's got a lot of active development, and more cynically, Google likes it that way because it means that people keep coding for Chrome (cf. Microsoft's VS ecosystem, Apple's walled garden, etc).
The problem with modern software is that vendors encourage new developers to learn using heavyweight toolkits because lock-in is profitable in the long run. One day Electron will work with Go, but it will still be a plague.
>FizzBuzz has many properties that make it very suitable for these kinds of optimizations that might not be applicable to general purpose code: + extremely small working set (a few registers worth of state) + extremely predictable branching behavior + no I/O
The working set depends on how you organize the data.
I/O can be done in larger chunks most of the time. You have to think about the data you need and when you need it.
As for branching, that most of the time depends on the programmer rather than on the problem being solved. The situations where you need very complicated branching to solve a problem are rare. And I've seen O(n log n) algorithms beating clever O(n) algorithms because they could be better optimized for the CPU.
> for a problem that doesn't immediately jump out as SIMD.
It's a single transformation applied to elements of a large array. Parallelization is obvious, and even if the exact SIMDization is not obvious, one is certainly motivated to formulate it. And a scatter-like approach does spring to mind right away, I believe.
The parent comment asserted that FizzBuzz should never branch. This would require it to not have any backwards control flow, and thus require full unrolling of an infinite sequence. Hence the infeasibility.
Once upon a time most software was highly optimized with hot code paths written in assembly. If you look at DOS source code, DOOM source code you will see lots of optimization.
When CPUs got more powerful, people got lazy and thought they could spend the improvements on conveniences.
Now we are at the point that we run "desktop" apps written in Javascript on top of embedded browsers.
Hard times create strong programmers. Strong programmers create good times. Good times create lazy programmers. And, lazy programmers create hard times.
> If you look at DOS source code, DOOM source code you will see lots of optimization
If you look at the VSCode code, you’ll see even more of these. What really changed was that more people wanted to hack things and write non-PhD code, and web tech was (sadly) popular enough to take this role. It's not that programmers are lazy; it's that desktop frameworks utterly failed in their non-competitive niches. MSVC libraries replaced each other almost faster than github's new web frameworks, to the point that barely anyone can remember their chronology. Gtk+ got stuck for years dealing with C legacy, added non-C styling engines too late, and then got “owned” by a particularly deteriorating DE. Qt always had a not-so-clear position on community-licensed versions, and the C++ impedance mismatch didn't help either. UIKit/AppKit got pretty useful and advanced in the last ~ten years, but are ios/osx only. Minor frameworks like fox, wx, fltk, etc. never got enough traction or new ideas and were just shadows of their bigger brothers. Meanwhile, with electron, bootstrap and a little js one can make a page in a few minutes that could take a few hours with a conventional desktop toolkit.
I mean, you are correct that programming went from hardcore to relaxed mode, but there is more history to it on the UI side.
Business code must be optimized for maintenance first.
Because it will live for many years. It will have to survive in the hands of multiple caretakers. It will have to evolve to work with the underlying foundations changing (operating system, compiler/runtime, desktop -> web -> mobile).
That's different from most video games (one single release, then little patches) and from advent calendar code.
Hot code paths are still written in assembly, it's just on a different level. DOOM is an engine; nowadays people use off the shelf engines so that they can focus on creativity instead of low level details.
I mean I could probably hyperfocus on storing my application state in a file on disk, but why should I bother when there's off the shelf SQL databases right there? Which have been optimized endlessly, I might add. I don't get paid to write low level ASM, I get paid to build applications.
Edit: And to add, my current thing I get paid for is several orders of magnitude faster than the one it replaces. Not because I spend more time optimizing, but because I write it in sensible Go instead of 10K LOC PHP files that concatenate XML strings that get converted to JSON to be rendered in Dojo-flavored HTML concatenated together in JS because of reasons.
Not lazy, sensible. The market has spoken and it wants bloated electron apps with rapid development, rather than performant C/assembly apps that hardly ever change.
Were all Electron app users demanding something that eats up their RAM, crashes and runs slowly?
The market demand seems more like:
Jack: "Our bellowed CEO wants us to deliver our wonderful app to unwashed masses still using a desktop. Our enlighted marketing team made a study which realized that for whatever weird reason, corporations and businesses still make heavy use of those boxes which come with a mouse and keyboard attached."
Joe: "Sure thing boss, we will have to hire some programmers and testers and will take about a year or so."
Jack: "I forgot to tell you that the marketing study already took a year and a very large budget because we hired the best of the best to do it. One year we don't have, money we don't have. But what about those people who wrote our web app? We still pay them. Can't they deliver?"
Joe: "We will have our glorious desktop app in two weeks, boss, I just had an idea."
Jack: "Do that and I will personally remind about you to our bellowed CEO when he will want to do some promotions."
The power dynamics of companies/customers are often not as dynamic as all that.
If slack is electron and I work at a company that uses slack: I must use it.
The competition in that space is all electron, you can’t choose.
It’s like saying that “the market chose non-ECC ram”. No, Intel chose for you and you don’t get much choice except to suck it up (or pay well above the odds.)
It takes a lot to avoid using a product. I mean people still use Oracle products!
That does not actually contradict the point. We got stuck in a suboptimal local maximum due to all the early design decisions of browsers and JavaScript. The original inventors did not expect anyone writing something like Google Drive for the web.
The market surely pushes against bloated electron apps, yet the convenience of having the same app on the web as well as "native", and the amount of man-years that went into making HTML+JS the richest multi-platform UI framework on the market, matter more.
There is no market for things that just work at v1.0 and then continue to work flawlessly. You won't sell version n again and again, and there is no support contract for something that does a single thing and does it well.
These days, compilers have gotten so damn smart with their optimizations that using assembly will actually make your code slower unless you really know what you're doing like the original post here.
A few years ago, I was doing Project Euler problems, and one of them had a solution that took about 45 minutes for my C program to solve. I decided that it was an incredibly simple problem, so rewrote it in Assembly.
My assembly solution took an hour to run.
I disassembled my C solution, and couldn't figure out what it was actually doing, though my assembly knowledge is pretty weak.
People might not care to learn it, but isn't there a lot more to know nowadays re: the obscurities going on deep down at the lowest levels of modern architectures?
I don't think you should be an expert in CPU architectures, but having an understanding of memory layout and how instructions are dispatched is not very hard or time-consuming.
I see many people having trouble understanding the difference between value and reference, what pointers are, why it's better for structures to contain variables of the same type and have a certain size, why it's better to call a function once on a large chunk of data instead of calling it many times on small chunks, and why iteration or tail-call recursion is to be preferred over naive recursion.
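As a small illustration of the struct-layout point (a generic example, not tied to anything in the article): on a typical x86-64 ABI, the same four fields take 24 bytes or 16 bytes depending purely on ordering, because the compiler has to insert padding to keep each member aligned.

#include <stdio.h>

struct mixed {        // char, double, char, int
    char   a;         // 1 byte + 7 bytes padding (double needs 8-byte alignment)
    double b;         // 8 bytes
    char   c;         // 1 byte + 3 bytes padding
    int    d;         // 4 bytes -> typically 24 bytes total
};

struct reordered {    // same fields, largest alignment first
    double b;         // 8 bytes
    int    d;         // 4 bytes
    char   a, c;      // 2 bytes + 2 bytes tail padding -> 16 bytes total
};

int main(void)
{
    printf("mixed: %zu bytes, reordered: %zu bytes\n",
           sizeof(struct mixed), sizeof(struct reordered));
    return 0;
}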
The prevailing view is that most LOB apps won't care about performance because they are waiting for I/O, so we should not care about performance but about coding speed, OOP, Uncle Bob's principles, GoF patterns, REST and the trendy architecture of the day. While I'm sure that coding speed matters and a sound architecture matters, I also think that throughput matters, that minimizing delays matters, and that some problems are better dealt with by thinking about the data and how the CPU likes to access and work with it, instead of just firing up more Kubernetes pods hoping that scaling will get rid of performance issues. By not thinking about the data and the CPU we didn't get rid of the complexity, we just moved it into the infrastructure, and into code that has to deal with a more complex infrastructure.
At some point I believe we will start to bump up against the limits we saw in the first three computer revolutions (tubes, transistors, integrated circuits). This will cause the pendulum to swing from commodification to optimization. What I mean is, you won't be able to snap your fingers and scale up or out, thus optimization will begin again in earnest. Clearly this isn't always the case: GPU shaders, crypto ASICs, video processing... there are plenty of places where optimization is crucial for high performance loop tuning. But optimization hasn't been required across the board like there was just before the three big innovations I described hit.
This tends to happen at a smaller scale with gaming consoles. Towards the end of a generation, the games coming out usually perform/look a lot better than the ones at the beginning of the generation. I'm assuming due to a lot more careful optimizations due to not being able to change the hardware.
I've always been curious about how far we could really push modern computers if somebody wanted to spend the time going to the lengths in the original post when creating practical software. Of course it's usually not worth the tradeoff, but it's interesting to think about.
It takes a lot to correctly explain exactly why this set of design choices is the fastest, but writing it just takes a fairly simple model of the CPU internals, knowledge of the instruction set, a focus on constantly measuring performance, and the agility to iterate quickly on different approaches.
And, reading more closely, beating the competition also took low-level OS knowledge and an understanding of the peculiarities of the benchmark in question.
The benchmark was about getting the output to a pipe as fast as possible, and there's this great pipe speed hack:
// To produce output without losing speed, the program therefore needs
// to avoid copies, or at least do them in parallel with calculating
// the next block of output. This can be accomplished with the
// `vmsplice` system call, which tells the kernel to place a reference
// to a buffer into a pipe (as opposed to copying the data into the
// pipe); the program at the other end of this pipe will then be able
// to read the output directly out of this program's memory, with no
// need to copy the data into kernelspace and then back into
// userspace.
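For anyone curious what that looks like in practice, here's a minimal C sketch of the vmsplice idea (my own toy, not the contest code): hand the kernel a reference to a userspace buffer instead of write()-copying it. Run it as ./a.out | cat, since vmsplice only works when stdout is a pipe, and note that a real program must not reuse the buffer until the reader has consumed it.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    // Fill a buffer with ordinary FizzBuzz output...
    static char buf[1 << 16];   // static: must stay valid until the reader is done with it
    size_t len = 0;
    for (int i = 1; len + 32 < sizeof buf; i++) {
        if (i % 15 == 0)     len += sprintf(buf + len, "FizzBuzz\n");
        else if (i % 3 == 0) len += sprintf(buf + len, "Fizz\n");
        else if (i % 5 == 0) len += sprintf(buf + len, "Buzz\n");
        else                 len += sprintf(buf + len, "%d\n", i);
    }
    // ...then splice a reference to these pages into the pipe instead of copying them.
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    if (vmsplice(STDOUT_FILENO, &iov, 1, 0) < 0) {
        perror("vmsplice (is stdout a pipe?)");
        return 1;
    }
    return 0;
}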
If you have talented staff then you'd be surprised how far you can get just by giving someone who already does that particular application as a hobby an unlimited supply of coffee.
Obviously finding talented staff is very hard, but once you have your tribe you can go a very long way i.e. I look at apps made by some people I work with (fast, lightweight etc.) then compare with crap pumped out by startups with literal billions in capital. I think it's a question of confidence more than competence.
The problem with that tradeoff is that you only compare the performance with the 'top layer of abstraction' that is already implemented and not with the baseline.
You can pump out a Modern C++ version in 5 minutes too that will run loops (haha) around the higher level languages. The readability won't even be very different...
But realistically? For anything except code golf and nerd fights, the actual client requirement is probably better met by a WordPress widget written in php/html, because what they asked for is something that'll print the fizz buzz all the way up to the person's age when they log into the company website... Nobody is even going to notice if it takes a whole second to fizz buzz all the way to 95 :-)
(Now I'm wondering if that guy's raw hyper optimised x86 assembly can get transpiled to WASM... Because nerd fights are fun.)
> Now I'm wondering if that guy's raw hyper optimised x86 assembly can get transpiled to WASM
Not really. WASM is significantly simpler and more abstract than x86 assembly, and has a JIT compile step that probably wouldn't get anywhere near as optimized. You could probably hand-write a WASM version that would JIT compile to something roughly similar and still get very good performance, but it would probably be more comparable to the other compiled versions at best, rather than the x86 ASM one.
>But realistically? For anything except code golf and nerd fights, the actual client requirement is probably better met by a WordPress widget written in php/html, because what they asked for is something that'll print the fizz buzz all the way up to the person's age when they log into the company website... Nobody is even going to notice if it takes a whole second to fizz buzz all the way to 95 :-)
What if 100 million people log from different corners of the world? Would the WordPress widget still cut it?
Not only can you write a modern C++ version in 5 minutes, but by carefully choosing your subset of C++ and coding style you can do so while being as readable and maintainable as in higher-level languages. C++ isn't hard because it's low level, it is hard because it is a Frankenstein monster, and I don't think there is a living being that masters all the complexities of C++. I would love to see a C++ Next version with lots of crap cut out of it.
Judging by its age and how much crap it's accumulated already, I think in 10 years' time Rust won't be in a much better situation.
Similarly, judging by C++'s current trajectory, in 10 years it will have a simplified subset (enforced with something similar to --pedantic) which is easier to get right than Rust is today. Also, it will have a static analysis borrow checker based on clang-tidy.
This is one where I’m fully comfortable with feeling like an impostor. I’ve gotten this far (~20 years) without more than a cursory glance at machine code, I’ll be pleased if I retire at the same level of relevant expertise.
Edit: don’t get me wrong! I admire the talent that goes into this and similar efforts, and find performance-chasing particularly inspiring in any context. This is just an area of that which I don’t anticipate ever wanting to tread myself.
The arrival of tools like Compiler Explorer means examining the assembly code is trivial. It's quite easy to fall down a rabbit hole of code tweaks to see what effect they have on codegen. Even examining the differences between gcc and clang can be quite revealing.
Don't worry. It just seems like everyone else is so talented because no one writes articles about the 2 hrs they spent just trying to get their project to just build without errors. Or if they do, they don't get voted to the top of HN.
As a comment on the linked post notes, this is impressive enough that it could probably be the basis of a degree thesis. An astonishing exhibit of technical skill, very much a masterpiece in the original sense of the term.
Imposter syndrome you say? I scrolled all the way to the bottom of the page hoping to see a solution written in a language I actually understand, not just pretend to understand.
It's a feeling, a fear to be specific, so stems from a different part of the brain. No, I don't think fears go away completely. One learns to manage them.
"future-proofed where possible to be able to run faster if the relevant processor bottleneck – L2 cache write speed – is ever removed)."
Being able to write a function limited mostly by the L2 cache write speed, and being able to realize that, is rad.
And btw this is an interesting example of how hand-optimized assembly can be much, much faster than any other solution. Can you get as fast as this solution with mostly C/C++? It uses interesting tricks to avoid memcpy (calling it slow, rofl).
You could probably get very close to this solution with C or C++, plus AVX intrinsics. Some might consider that "cheating" since intrinsics occupy kind of a grey area between a higher level language and asm.
I suspect by the time you’ve disabled all the compiler “optimisations” that would otherwise knock a few orders of magnitude of performance off the directly translated algorithm, you may as well have written the assembler to start with…
And you probably can’t fine tune your C/C++ to get this performance without knowing exactly what processor instructions you are trying to trick the compiler into generating anyway.
I mean, did you see the very complicated extremely optimized C and C++ code lower on the page? Despite that, they "only" got to 10% of the performance of the ASM code.
The speed of this program is partly that it's written in assembly, but mostly because it's written by someone who is quite clever and clearly put a large amount of time into this problem. None of the other solutions spend much time trying to fit their data into the CPU cache, none go as far as using splicing for zero copies, and not one is doing anything nearly as clever as this program to generate its numbers. All of this would be possible to mostly translate to C++ with AVX intrinsics, but the real accelerator here is not the choice of language, it's the person behind the code.
Now that I have seen the power of madvise + huge pages, everything looks like a nail. Author reckons 30% from less page table juggling. There are techniques here that apply outside assembly.
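For reference, the userspace side of that trick is tiny. A sketch (generic Linux usage, not the author's code) that asks for transparent huge pages on an anonymous buffer looks like this; THP has to be enabled in "madvise" or "always" mode for it to have any effect.

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

// Allocate an anonymous buffer and ask the kernel to back it with
// transparent huge pages, cutting down on TLB misses / page-table walks.
void *alloc_huge(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, len, MADV_HUGEPAGE);   // advisory: the kernel may or may not comply
    return p;
}

int main(void)
{
    void *buf = alloc_huge(64u << 20);   // 64 MiB
    if (buf)
        munmap(buf, 64u << 20);
    return buf ? 0 : 1;
}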
///// Third phase of output
//
// This is the heart of this program. It aims to be able to produce a
// sustained output rate of 64 bytes of FizzBuzz per four clock cycles
// in its main loop (with frequent breaks to do I/O, and rare breaks
// to do more expensive calculations).
//
// The third phase operates primarily using a bytecode interpreter; it
// generates a program in "FizzBuzz bytecode", for which each byte of
// bytecode generates one byte of output. The bytecode language is
// designed so that it can be interpreted using SIMD instructions; 32
// bytes of bytecode can be loaded from memory, interpreted, and have
// its output stored back into memory using just four machine
// instructions.
// [The eighth byte of LINENO_MID] changes in meaning over the course of
// the program. It does indeed represent the billions digit most of
// the time; but when the line number is getting close to a multiple
// of 10 billion, the billions and hundred-millions digits will always
// be the same as each other (either both 9s or both 0s). When this
// happens, the format changes: the hundred-millions digit of
// LINENO_MID represents *both* the hundred-millions and billions
// digits of the line number, and the top byte then represents the
// ten-billions digit. Because incrementing a number causes a row of
// consecutive 9s to either stay untouched, or all roll over to 0s at
// once, this effectively lets us do maths on more than 8 digits,
// meaning that the normal arithmetic code within the main loop can
// handle the ten-billions digit in addition to the digits below.
You could probably get some blazing performance out of an FPGA. I made an FPGA version of FizzBuzz a few years ago, but it was optimized for pointless VGA animation rather than performance.
With an FPGA, the speed limit will 100% be how you want to output the data.
If you require it to come out of IO pins, then the limit will be the total output bandwidth of all the serdes units on the IO pins. Typically that's much more than 55GiB/s.
When I saw this I did wonder how much faster I could do it in hardware, but similarly I expect the bottleneck would be outputting it in a fashion that would be considered fair to be compared to software implementations (eg outputting readable text on a screen).
Regardless, I very much enjoyed your DVD screensaver-esque output.
GPU memory is an order of magnitude higher bandwidth than RAM, so that would seem to me to be the way to go to beat this. The output wouldn’t be accessible from CPU without a big slowdown though.
The amount of completely unnecessary effort unleashed on this problem in this post is amazing. I want to meet the author and shake his hand! It has everything from Linux kernel trivia for output speed to Intel AVX2 code. So unnecessary and so awesome!
I've done stuff like this before, and I imagine the satisfaction of completing it! A tip of my hat to you, sir!
That's borderline unbelievable, given that a single AVX2 instruction can take multiple clock cycles. The reciprocal throughput also doesn't go below ~0.3, to my admittedly shallow knowledge. A remarkable piece of engineering!
That's 16 bytes per clock cycle, i.e. one 32-byte AVX register every two clock cycles. As most Intel CPUs can only do one store per clock cycle, one AVX register per cycle is also the theoretical store limit. I wonder if using non-temporal stores would help make the code cache-size agnostic.
Note that the instruction latency is not important as long as you can pipeline the computation fully (which appears to be the case here!).
Edit: to go faster you would need to be able to use the bandwidth of more than one CPU core. I wonder if you could precompute where the output will cross a page boundary, to be able to have distinct cores work on distinct pages... Hmm, I need a notepad.
Edit 2: it is much simpler than that; you do not need to fill a page to vmsplice it. So in principle the code should parallelize quite trivially. Have each thread grab, say, 1M numbers at a time, for example by atomically incrementing a counter, serialize them to a bunch of pages, then grab the next batch. A simple queue would hold the ready batches, which can be spliced as needed either by a service thread or opportunistically by the thread that has just finished the batch that is next in sequence.
That shouldn't be a problem, pv is basically only counting the number of pages being written. As long as the page size is much larger than the number of threads it should keep up just fine.
The page size can't be more than half the L2 cache (else writeback to DRAM would be triggered, dramatically lowering performance). That means you have to use a relatively small page size, and lock contention in the kernel for those pages then limits throughput if you use many cores, even if the data is never read by pv.
As an aside, it would be fun to see a programming challenge website focused on performance and optimizations, rather than code golf (shortest program) or edge case correctness (interview type sites). I took a course like this in uni where we learned low-level optimization and got to compete with other classmates to see who had the fastest program - a fun experience that I don't really see much of online.
The course had a machine solely for benchmarking. It would only compile and benchmark one program at a time and the students had to queue up.
It worked well for the course and the results were very consistent between runs, but that kind of setup doesn't scale well.
There may be a (slow) option though, and that would be benchmarking the code on a simulated processor. The runtime would be measured relative to the simulated processor as opposed to the computer hosting the simulator.
Only getting 7GiB/s on a Ryzen 7 1800X w/DDR4 3000. Zen 1 executes AVX2 instructions at half speed, but that doesn't account for all of the difference. I guess I need a new computer. To run FizzBuzz.
I ran this on a server with a 5950X (same CPU as this test was run on), but with 2666MHz memory (128GB) instead of 3600MHz memory (32GB) and I only got 41GB/s.
Or if your ASIC generalizes the problem in a slightly different direction, you end up reinventing TWINKLE and TWIRL: https://en.wikipedia.org/wiki/TWINKLE
Could probably store all multiples of 3 and 5 up to some really big number burned directly to hardware and then do something like a big CAM table the way Internet core routers do mapping the numbers to ASCII bit strings. Then speedup IO by not having a general purpose display, but something more like an old-school digital clock that can only show digits and the words "fizz" and "buzz."
That's definitely not going to be useful. By doing this 15 numbers at a time you completely avoid needing to do any division anyway: print FizzBuzz, i+1, i+2, Fizz, i+4, Buzz, Fizz, i+7, i+8, Fizz, Buzz, i+11, Fizz, i+13, i+14, then increment i by 15.
Also, even with a very slow CPU you'd already run faster than can be perceived on a digital clock
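In plain C the unrolled version of that is only a few lines and already has no division or modulo anywhere in the loop (a throwaway sketch, not the contest code, with the block shifted to cover i+1..i+15 so the output matches canonical FizzBuzz):

#include <stdio.h>

int main(void)
{
    // one block covers i+1 .. i+15; the Fizz/Buzz positions never move
    for (long i = 0; i < 1000000000L; i += 15)
        printf("%ld\n%ld\nFizz\n%ld\nBuzz\nFizz\n%ld\n%ld\nFizz\nBuzz\n"
               "%ld\nFizz\n%ld\n%ld\nFizzBuzz\n",
               i + 1, i + 2, i + 4, i + 7, i + 8, i + 11, i + 13, i + 14);
    return 0;
}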
Not for fizzbuzz optimisation it isn't! Get your priorities straight mate! Who is trying to get hired when we could be outputting more fizzes and/or buzzes! :)
But does it support arbitrarily large integers? ;)
Ed: no, but does pretty well:
> The program outputs a quintillion lines of FizzBuzz and then exits (going further runs into problems related to the sizes of registers). This would take tens of years to accomplish, so hopefully counts as "a very high astronomical number" (although it astonishes me that it's a small enough timespan that it might be theoretically possible to reach a number as large as a quintillion without the computer breaking).
Now I'm kinda curious to see how much faster you could go on an M1 Max with the GPU generating the data. Once his solution gets to the point of being a bytecode interpreter, it's trivially parallelizable, and the M1 has _fantastic_ memory bandwidth. Does anyone know if the implementation of pv or /dev/null actually requires loading the data into the CPU cache?
If that is the case, does it actually matter what CPU core pv runs on? I feel like _something_ must ultimately zero the page out before it can get re-mapped to the process, but I'm not sure which core that happens on, or whether there's some other hardware mechanism that allows the OS to zero a page without actually utilizing memory bandwidth.
Unfortunately, the memory bandwidth that matters here is not bandwidth to GPU memory, but bandwidth to main system memory (unless anyone knows how to splice a pointer to GPU memory onto a unix pipe). That's specifically why the M1 came to mind, as a powerful UMA machine that can run Linux. Perhaps a modern gaming console could hit the same performance, but I don't know if they can run Linux.
PCIe 5 speed is 64 GB/s, so theoretically if you perfectly pipelined the result you could achieve the same performance on a next-generation discrete GPU.
Wow. There is programming and then there is programming. Whenever I see something like this I feel like what I do for a living is duplo blocks in comparison.
I am thankful every day for the work of those who came before me. Their long hours toiling over hardware, assembly, compilers, protocol stacks, libraries, frameworks, etc let us focus on the bigger picture, how to write same-y CRUD apps for businesses :)
I've had the opportunity to tinker with ASM, the Z80 architecture, low-level programming and other similar stuff (I'm less than 1/1000th as able as the referenced author).
I find this kind of programming very beautiful and rewarding, in that you really know that you are programming the hardware. Unfortunately it's not an easy path to a good paying job (unless you are exceptional like this gentleman). So I ended up building fintech web apps.
You can get a well paying job off of these skills, but the job title will read something like "performance engineer" or "security researcher" rather than "assembly programmer".
I was wondering the same thing! pv probably never touches its input and might itself be using splice to never read the bytes into userspace and just accumulate the byte counts.
"The Grid. A digital frontier. I tried to picture clusters of information as they moved through the computer. What did they look like? Ships, motorcycles? Were the circuits like freeways? I kept dreaming of a world I thought I'd never see. And then, one day I got in...
It turns out The Grid is just a guy sitting in a chair, shouting about "Fizz!" and "Buzz!" as fast as he can.
It wasn't really what I had in mind.
(The image of this poor program, stuck shouting "fizz!" and "buzz!" for subjectively centuries at a time struck me...)
> // Most FizzBuzz routines produce output with `write` or a similar
> // system call, but these have the disadvantage that they need to copy
> // the data being output from userspace into kernelspace. It turns out
> // that when running full speed (as seen in the third phase), FizzBuzz
> // actually runs faster than `memcpy` does, so `write` and friends are
> // unusable when aiming for performance - this program runs five times
> // faster than an equivalent that uses `write`-like system calls.
I'm not an assembly programmer, but my intuition is that this is safer. You can only rely on "zero-copy" behavior when you have total control of the program and know that that memory region is going to stay put and remain uncorrupted. Therefore, most external code will make a copy because they can't insist on this (and because for most people, making a copy is pretty fast).
I think that some Unix derivatives do, or have done in the past. If the argument to write is page aligned and a multiple of the page size, they play VM tricks to implement zero-copy writes.
The issue is that the caller of write is allowed to do anything it wants with the buffer after write returns, so the write implementation needs to unmap the stolen pages and perform copy-on-write if the caller ever touches them again; in practice the optimization is very fragile. For this reason Linus has always refused to implement it in Linux.
vmsplice gets away with it because it is part of the caller's contract that it can't touch the pages until the kernel is done with them. Unfortunately there is no general way to know when it is safe to reuse them; it is very application specific (for example there might be an explicit ack from the consumer that it has received the data).
Did anyone else get recommended and see the FizzBuzz video on YouTube ( https://youtu.be/QPZ0pIK_wsc ) just before this article or did I just witness an incredible coincidence?
“ This is faster than a number of standard functions. In particular, it's faster than memcpy, which presents interesting challenges when it comes to I/O”
A typical desktop computer processor runs at 3GHz, or 3 billion cycles per second. It has 64-bit registers, which is 8 bytes each. If it moves one register of results per clock cycle that's up to 24 billion bytes per second, 24GB/s, if it does nothing else and never stalls or trips up[1]. Less than half the speed this answer is moving data. In a typical desktop computer from a few years back the connection between CPU and main memory tops out at 25GB/sec throughput[2], often less. There's delay when a user program requests that the Linux kernel does things such as allocate memory or open input/output. Printing text to a console is really quite slow. That means lots of stalls and trips waiting for IO related things and context switching as the same processor runs both the FizzBuzz printer and the pv counter and the Linux kernel and the pipes connecting them together, and has to keep task-switching between them.
From the opening question on that page "A naive implementation written in C gets you about 170MiB/s on an average machine" and that C code is typical, there's nothing drastically bad about it. C already has a reputation as a fast, low level language. This answer is running over 320x faster than the straightforward C answer on the same hardware. If you asked people "how much faster can it possibly get without moving to specialist hardware", I doubt they'd guess that high.
It's difficult to get higher speed by using more CPU cores because calculating FizzBuzz is much faster than printing it so they'd just fill up a queue and then stall waiting. To get faster, people are leaning on what exactly is happening internally, adjusting buffer sizes and how to generate the text "FizzBuzz" and get it into the right place to be printed at the right time, and which system calls can be done with less overhead, and lining up more work which can happen during the pauses, and unrolling loops to give the CPU a pattern of instructions it can execute quicker, this is getting answers into the GB/sec ranges.
This answer is like the Bugatti Veyron of answers; so finely tuned it's surely getting close to the limits of what the hardware can do. Making use of AVX2 registers which can hold 32 bytes instead of 8, and can be vectorised so one instruction processes many pieces of data, which means less overhead of instructions for the CPU to decode and process per piece of data, and deeper magic, a lot of skill and knowledge required to put all of it together.
And of course, someone spending months trying to send "Fizz" and "Buzz" into a counter at tens of billions of bytes per second for no good reason, is amazing in a different way.
[1] A nonsense back of an envelope estimate, but good enough for context.
Somehow this reminds me of Aphyr's Technical Interview series[1], and the thought of some assembler dark wizard actually coding this in an interview is highly amusing.
How does one debug the L2 cache? i.e. how does one inspect the data in the L2 cache and verify that loading some data from RAM actually cached X number of bytes, and the correct data has been loaded in the cache lanes, and the next instruction will use the data from there?
I’ve printed his entry and there are some neat tricks I learned, but ‘stage 3’, where, according to his comments, a bytecode interpreter is used, is still beyond my understanding. Also not sure why this helps, a bytecode interpreter sounds like a lot of overhead to me…
The bytecode is specifically designed so that it maps directly to the output.
// The third phase operates primarily using a bytecode interpreter; it
// generates a program in "FizzBuzz bytecode", for which each byte of
// bytecode generates one byte of output. The bytecode language is
// designed so that it can be interpreted using SIMD instructions; 32
// bytes of bytecode can be loaded from memory, interpreted, and have
// its output stored back into memory using just four machine
// instructions.
// The bytecode format is very simple (in order to allow it to be
// interpreted in just a couple of machine instructions):
// - A negative byte represents a literal character (e.g. to produce
// a literal 'F', you use the bytecode -'F', i.e. -70 = 0xba)
// - A byte 0..7 represents the hundreds..billions digit of the line
// number respectively, and asserts that the hundreds digit of the
// line number is even
// - A byte 8..15 represents the hundreds..billions digit of the line
// number respectively, and asserts that the hundreds digit of the
// line number is odd
// %ymm2 holds the bytecode for outputting the hundreds and more
// significant digits of a line number. The most significant digits of
// this can be obtained by converting LINENO_TOP from high-decimal to
// the corresponding bytecode, which is accomplished by subtracting
// from 198 (i.e. 256 - 10 - '0'). The constant parts of LINENO_TOP
// are 198 minus the bytecode for outputting the hundreds to billions
// digit of a number; this makes it possible for a single endian
// shuffle to deal with all 16 of the mid and high digits at once.
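A tiny scalar check of the "negative byte = literal character" rule above, written by me purely as an illustration (not part of the entry): the bytecode for fixed text like "Fizz\n" is just each ASCII byte negated, and decoding it the way the interpreter does (the shuffle yields 0 for any byte with its high bit set, then the bytecode is subtracted) gives the character back.

/* Toy round-trip of the "negative byte = literal character" rule (my own
 * illustration, not the author's code).  -'F' is -70, i.e. 0xba, and
 * 0 - (-'F') recovers 'F'. */
#include <stdio.h>

int main(void) {
    const char *text = "Fizz\n";
    for (const char *p = text; *p; p++) {
        signed char bc = (signed char)(-*p);          /* encode: negate the ASCII byte */
        unsigned char out = (unsigned char)(0 - bc);  /* decode: shuffle gives 0, then subtract */
        printf("char 0x%02x -> bytecode 0x%02x -> output 0x%02x\n",
               (unsigned char)*p, (unsigned char)bc, out);
    }
    return 0;
}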
Most single people who have a full-time job should still have enough free time to devote 1000+ hours a year to personal projects - and even with a family, you should be able to find 400+ hours if your employer has any respect for WLB and/or you know how to set boundaries. That's roughly two college degrees' worth of time per decade.
I have trouble with procrastination, but since watching Arnie's take on this I cannot get it out of my head and I have actually found it helpful (yet here I am on HN again :()
Well, it's easy when your goal is clear-cut and 100% ego-directed ("win Mr. Universe"). Not so much when it's related to much larger concerns and uncertainties and pessimism (e.g. fighting global warming).
Not everyone has the same ability to utilize those hours. People have children and/or pets and/or loved ones to tend, different cognitive and physical limitations, different economic challenges, different social expectations, different living conditions which may impact all of the above.
Speaking for myself: between caring for a puppy, cognitive impairments from ADHD, sustained impact from burnout, and financial responsibilities for family… my ability to carve out free time for a very long list of things I’d like to do extracurricularly is limited to a few hours a week at most. Achieving a “months of work” project in my areas of interest and applicable talents is easily going to take 10x as long. And that also means being much more circumspect about where I choose to focus those interests.
Maybe I'd feel differently if the comment ended with :) instead of :(.
But I mean, obviously, yes. Different people can do different things for a zillion different reasons. I'm reacting negatively because it sounds like a criticism of the author just for spending a lot of time on a hobby project for its own sake.
See also: "why are you having fun when cancer still isn't cured yet?"
Wow! I took it entirely differently (and still do). I read it as being disappointed that they don’t feel as capable to pursue and invest significant time/effort into their own ideas. Essentially I saw the :( as an admission of envy, not as a judgment.
Admittedly, that could very easily be me projecting. I have tons of envy of people pursuing their ideas without the limitations I have to take seriously in my own life. It doesn’t make me sad to see them succeed or use their time the way they do, it makes me sad to have a brain full of ideas and limited ability to reify them.
It would be nice to compare its performance with a safe variant written in Rust, one written in Java, and one in Python, to see what % of the performance we lose for higher abstraction, convenience and safety.
It would naturally be foreign to most, but the neat part is that it's also pretty accessible: look up vectorization techniques and L2 cache optimization and you're most of the way to understanding it. Most people who grok something like this at first glance are specialized in the same or a similar domain, and probably wouldn't have an expert understanding of some things you take for granted.
Interesting how he uses the C preprocessor as an assembly preprocessor; cpp as a general-purpose macro processor. Maybe this is more portable than relying on built-in macro features that vary from assembler to assembler.
My naive Rust implementation (as shown in the linked question) was mind-bogglingly slow. C could do around 150MiB/s, while Rust could only do 10MiB/s. A simple `objdump -d` shows Rust generating a lot more code, though I'm not sure how much of that is relevant.
At that point my coffee time ran out. I wish I had more time to figure out why. :-(
What's cool here is that if you can make FizzBuzz go 55GB/s, then you have the skills needed to make a whole bunch of other stuff go that speed too. My personal favorites are that the Intel architecture lets us do population count, polynomial division, and eight-tap filters at that speed as well, which is wonderful.
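For anyone curious what two of those look like from C, here's a quick sketch (mine, not from the FizzBuzz entry, and a guess at the instructions meant: POPCNT, and PCLMULQDQ as the building block for GF(2) polynomial arithmetic such as CRCs; the eight-tap filter would be the pmaddubsw/dpbusd family, omitted here).

/* Quick sketch of hardware population count and carry-less multiply.
 * Build with: gcc -O2 -mpopcnt -mpclmul */
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>

int main(void) {
    uint64_t x = 0x0123456789abcdefULL;
    printf("popcount = %d\n", (int)_mm_popcnt_u64(x));     /* POPCNT */

    /* PCLMULQDQ: multiply two polynomials over GF(2), with no carries. */
    __m128i a = _mm_set_epi64x(0, 0x87);   /* x^7 + x^2 + x + 1 */
    __m128i b = _mm_set_epi64x(0, 0x1b);   /* x^4 + x^3 + x + 1 */
    __m128i prod = _mm_clmulepi64_si128(a, b, 0x00);
    printf("clmul low 64 bits = 0x%016llx\n",
           (unsigned long long)_mm_cvtsi128_si64(prod));
    return 0;
}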
I guess it's faster than memcpy because both are memory bound, but this only has to write to memory, whereas memcpy has to both read from and write to memory?
Edit: actually it sounds like it’s more cache efficient because it can always use the same block of cache memory?
I believe so. A memcpy often needs to touch large blocks of memory, far more than would fit in a cache, so most memcpy implementations use non-temporal stores for large copies to indicate that the memory is not going to be used again soon. They write directly to main memory, avoiding using up valuable cache space on a cache line that is written once and never touched again. This program, however, can operate entirely within L2, so it can go faster.
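A small sketch of that distinction with AVX2 intrinsics (mine, not lifted from any real memcpy; the destination is assumed to be 32-byte aligned):

/* Regular stores go through the cache; non-temporal "streaming" stores hint
 * that the lines won't be reused and bypass it.  Build with: gcc -O2 -mavx2 */
#include <immintrin.h>
#include <stddef.h>
#include <stdlib.h>   /* aligned_alloc, free */

void fill_cached(unsigned char *dst, size_t n, __m256i v) {
    for (size_t i = 0; i + 32 <= n; i += 32)
        _mm256_store_si256((__m256i *)(dst + i), v);   /* goes through the cache */
}

void fill_streaming(unsigned char *dst, size_t n, __m256i v) {
    for (size_t i = 0; i + 32 <= n; i += 32)
        _mm256_stream_si256((__m256i *)(dst + i), v);  /* non-temporal: bypasses the cache */
    _mm_sfence();   /* order the streaming stores before later writes are observed */
}

int main(void) {
    size_t n = 1 << 20;                        /* 1 MiB, a multiple of 32 */
    unsigned char *dst = aligned_alloc(32, n);
    if (!dst) return 1;
    __m256i v = _mm256_set1_epi8('x');
    fill_cached(dst, n, v);
    fill_streaming(dst, n, v);
    free(dst);
    return 0;
}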
This is the heart of this program. It aims to be able to produce a
sustained output rate of 64 bytes of FizzBuzz per four clock cycles
in its main loop (with frequent breaks to do I/O, and rare breaks
to do more expensive calculations).
The third phase operates primarily using a bytecode interpreter; it
generates a program in "FizzBuzz bytecode", for which each byte of
bytecode generates one byte of output. The bytecode language is
designed so that it can be interpreted using SIMD instructions; 32
bytes of bytecode can be loaded from memory, interpreted, and have
its output stored back into memory using just four machine
instructions. This makes it possible to speed up the FizzBuzz
calculations by hardcoding some of the calculations into the
bytecode (this is similar to how JIT compilers can create a version
of the program with some variables hardcoded, and throw it away on
the rare occasions that those variables' values change).
The bytecode format is very simple (in order to allow it to be
interpreted in just a couple of machine instructions):
- A negative byte represents a literal character (e.g. to produce
a literal 'F', you use the bytecode -'F', i.e. -70 = 0xba)
- A byte 0..7 represents the hundreds..billions digit of the line
number respectively, and asserts that the hundreds digit of the
line number is even
- A byte 8..15 represents the hundreds..billions digit of the line
number respectively, and asserts that the hundreds digit of the
line number is odd
In other words, the bytecode program only ever needs to read from
LINENO_MID; the information stored in LINENO_LOW and LINENO_TOP
therefore has to be hardcoded into it. The program therefore needs
to be able to generate 600 lines of output (as the smallest number
that's divisible by 100 to be able to hardcode the two low digits,
200 to be able to get the assertions about the hundreds digits
correct, and 3 and 5 to get the Fizzes and Buzzes in the right
place).
The bytecode interpreter consists of four instructions:
1. Load the bytecode from memory into %ymm2;
2. Use it as a shuffle mask to shuffle LINENO_MID_TEMP;
3. Subtract the bytecode from the shuffle result;
4. Output the result of the subtraction.
#define INTERPRET_BYTECODE(bc_offset, buf_offset) \
vmovdqu %ymm2, [%rdx + bc_offset]; \
vpshufb %ymm0, LINENO_MID_TEMP, %ymm2; \
vpsubb %ymm0, %ymm0, %ymm2; \
vmovdqa [OUTPUT_PTR + buf_offset], %ymm0
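To make the semantics concrete, here is a scalar model of what that macro does to a single byte. This is my own paraphrase, not the author's code; digits stands in for the shuffled LINENO_MID_TEMP lane, which the program arranges so that "selected value minus selector byte" comes out as the right ASCII digit.

/* Scalar model of one byte of the four-instruction interpreter (my own
 * paraphrase, not the author's code). */
#include <stdint.h>
#include <stdio.h>

uint8_t interpret_byte(int8_t bc, const uint8_t digits[16]) {
    /* vpshufb: a mask byte with its high bit set selects zero; otherwise
       the low four bits index into the 16-byte source lane. */
    uint8_t shuffled = (bc < 0) ? 0 : digits[bc & 0x0f];
    /* vpsubb: output = shuffle result - bytecode (mod 256).
       For a literal byte bc = -'F', this gives 0 - (-'F') = 'F'. */
    return (uint8_t)(shuffled - (uint8_t)bc);
}

int main(void) {
    uint8_t digits[16] = {0};                  /* placeholder lane contents */
    putchar(interpret_byte((int8_t)-'F', digits));   /* prints F */
    putchar('\n');
    return 0;
}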