1260 points by luu 1583 days ago | 255 comments on HN
| Neutral Community · v3.7· 2026-02-28 10:50:01
Summary Technical Content (Non-UDHR) Neutral
This Stack Exchange Code Golf page presents a technical programming challenge on optimizing FizzBuzz throughput. The content contains no observable editorial engagement with UDHR provisions. Structurally, the platform exhibits mild positive signals for Articles 19 (freedom of expression) and 27 (participation in cultural life) through its open discussion features and inclusive international challenge design.
The amount of low level CPU architecture knowledge to write such a program is mind boggling. Just goes to show how much room for improvement a lot of programs have.
The amount of completely unnecessary effort unleashed on this problem in this post is amazing. I want to meet the author and shake his hand! It has everything from Linux kernel trivia in terms of output speed to Intel AVX2 code. So unnecessary and so awesome!
I've done stuff like this before, and I imagine the satisfaction of completing it! A tip of my hat to you, sir!
> // Most FizzBuzz routines produce output with `write` or a similar
> // system call, but these have the disadvantage that they need to copy
> // the data being output from userspace into kernelspace. It turns out
> // that when running full speed (as seen in the third phase), FizzBuzz
> // actually runs faster than `memcpy` does, so `write` and friends are
> // unusable when aiming for performance - this program runs five times
> // faster than an equivalent that uses `write`-like system calls.
Did anyone else get recommended and see the FizzBuzz video on YouTube ( https://youtu.be/QPZ0pIK_wsc ) just before this article or did I just witness an incredible coincidence?
"The Grid. A digital frontier. I tried to picture clusters of information as they moved through the computer. What did they look like? Ships, motorcycles? Were the circuits like freeways? I kept dreaming of a world I thought I'd never see. And then, one day I got in...
It turns out The Grid is just a guy sitting in a chair, shouting about "Fizz!" and "Buzz!" as fast as he can.
It wasn't really what I had in mind.
(The image of this poor program, stuck shouting "fizz!" and "buzz!" for subjectively centuries at a time struck me...)
"future-proofed where possible to be able to run faster if the relevant processor bottleneck – L2 cache write speed – is ever removed)."
Being able to write a function limited mostly by the L2 cache, and being able to realize that this is the limit, is rad.
And btw this is an interesting example of how hand-optimized assembly can be much, much faster than any other solution. Can you get as fast as this solution with mostly C/C++? It uses interesting tricks to avoid `memcpy` (calling it slow, rofl).
You could probably get some blazing performance out of an FPGA. I made an FPGA version of FizzBuzz a few years ago, but it was optimized for pointless VGA animation rather than performance.
Now I'm kinda curious to see how much faster you could go on an M1 Max with the GPU generating the data. Once his solution gets to the point of being a bytecode interpreter, it's trivially parallelizable and the M1 has _fantastic_ memory bandwidth. Does anyone know if the implementation of pv or /dev/null actually requires loading the data into CPU cache?
That's borderline incredible, given a single AVX2 instruction can last multiple clock cycles. The reciprocal throughput also doesn't go below ~0.3 to my, admittedly shallow, knowledge. A remarkable piece of engineering!
Wow. There is programming and then there is programming. Whenever I see something like this I feel like what I do for a living is duplo blocks in comparison.
Only getting 7GiB/s on a Ryzen 7 1800X w/DDR4 3000. Zen 1 executes AVX2 instructions at half speed, but that doesn't account for all of the difference. I guess I need a new computer. To run FizzBuzz.
As an aside, it would be fun to see a programming challenge website focused on performance and optimizations, rather than code golf (shortest program) or edge case correctness (interview type sites). I took a course like this in uni where we learned low-level optimization and got to compete with other classmates to see who had the fastest program - a fun experience that I don't really see much of online.
///// Third phase of output
//
// This is the heart of this program. It aims to be able to produce a
// sustained output rate of 64 bytes of FizzBuzz per four clock cycles
// in its main loop (with frequent breaks to do I/O, and rare breaks
// to do more expensive calculations).
//
// The third phase operates primarily using a bytecode interpreter; it
// generates a program in "FizzBuzz bytecode", for which each byte of
// bytecode generates one byte of output. The bytecode language is
// designed so that it can be interpreted using SIMD instructions; 32
// bytes of bytecode can be loaded from memory, interpreted, and have
// its output stored back into memory using just four machine
// instructions.
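A toy model may help picture the "one bytecode byte per output byte" idea in the quoted comment. The sketch below is an illustration in Python under loud assumptions, not the program's real bytecode: the opcode layout (`DIGIT_BASE`, literal bytes below 128) and the names `compile_block`, `run` and `reference` are invented here, and it interprets one byte at a time instead of 32 bytes at once with SIMD.

```python
NEWLINE = 10
# Hypothetical opcode layout: DIGIT_BASE + k emits decimal digit k of the
# current line number (k = 0 is the ones digit); anything lower is a literal.
DIGIT_BASE = 128


def compile_block(width):
    """Toy bytecode for 15 lines whose plain numbers all have `width` digits.

    Assumes the first line number of the block is congruent to 1 mod 15.
    """
    code = bytearray()
    for i in range(1, 16):
        if i % 15 == 0:
            code += b"FizzBuzz"
        elif i % 3 == 0:
            code += b"Fizz"
        elif i % 5 == 0:
            code += b"Buzz"
        else:
            # One digit opcode per output digit, most significant first.
            code += bytes(DIGIT_BASE + k for k in range(width - 1, -1, -1))
        code.append(NEWLINE)
    return bytes(code)


def run(code, first_line, blocks):
    """Interpret the toy bytecode; each opcode yields exactly one output byte."""
    out = bytearray()
    n = first_line
    for _ in range(blocks):
        for op in code:
            if op < DIGIT_BASE:
                out.append(op)          # literal byte
                if op == NEWLINE:
                    n += 1              # a newline ends the current line
            else:
                digit = (n // 10 ** (op - DIGIT_BASE)) % 10
                out.append(ord("0") + digit)
    return bytes(out)


def reference(first_line, last_line):
    """Plain FizzBuzz, used only to check the toy interpreter."""
    lines = []
    for i in range(first_line, last_line + 1):
        if i % 15 == 0:
            lines.append("FizzBuzz")
        elif i % 3 == 0:
            lines.append("Fizz")
        elif i % 5 == 0:
            lines.append("Buzz")
        else:
            lines.append(str(i))
    return "".join(line + "\n" for line in lines).encode()
```

In this toy, the same bytecode can be replayed for every 15-line block as long as the digit count stays fixed; for instance `run(compile_block(2), 16, 5)` covers lines 16 through 90, after which a three-digit program would be needed.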
But does it support arbitrarily large integers? ;)
Ed: no, but does pretty well:
> The program outputs a quintillion lines of FizzBuzz and then exits (going further runs into problems related to the sizes of registers). This would take tens of years to accomplish, so hopefully counts as "a very high astronomical number" (although it astonishes me that it's a small enough timespan that it might be theoretically possible to reach a number as large as a quintillion without the computer breaking).
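The "tens of years" estimate in the quote can be sanity-checked with a little arithmetic. The sketch below counts the exact number of output bytes for the first `n` lines and converts that to a runtime for a given output rate. The function names are invented for illustration, and the rate is a free parameter: any figure plugged in (such as the throughput numbers commenters report in this thread) is an assumption about a particular machine, not a property of the program.

```python
def fizzbuzz_line(i):
    """Reference line for number i, including the trailing newline."""
    if i % 15 == 0:
        return "FizzBuzz\n"
    if i % 3 == 0:
        return "Fizz\n"
    if i % 5 == 0:
        return "Buzz\n"
    return f"{i}\n"


def count_plain(lo, hi):
    """Count integers in [lo, hi] divisible by neither 3 nor 5."""
    def upto(m):
        return m - m // 3 - m // 5 + m // 15
    return upto(hi) - upto(lo - 1)


def total_bytes(n):
    """Exact size in bytes of FizzBuzz output for lines 1..n."""
    fizzbuzz = n // 15
    fizz = n // 3 - fizzbuzz
    buzz = n // 5 - fizzbuzz
    # "FizzBuzz\n" is 9 bytes; "Fizz\n" and "Buzz\n" are 5 bytes each.
    total = fizzbuzz * 9 + (fizz + buzz) * 5
    digits = 1
    while 10 ** (digits - 1) <= n:
        lo = 10 ** (digits - 1)
        hi = min(n, 10 ** digits - 1)
        # Each plain number costs its digit count plus one newline byte.
        total += count_plain(lo, hi) * (digits + 1)
        digits += 1
    return total


def years_to_finish(rate_bytes_per_second, n=10 ** 18):
    """Runtime in years for n lines at a constant (assumed) output rate."""
    seconds = total_bytes(n) / rate_bytes_per_second
    return seconds / (365.25 * 24 * 3600)
```

Calling `years_to_finish(rate)` with a measured rate in bytes per second gives the estimate for that machine; the result scales inversely with the rate.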
FizzBuzz has many properties that make it very suitable for these kinds of optimizations that might not be applicable to general purpose code:
+ extremely small working set (a few registers worth of state)
+ extremely predictable branching behavior
+ no I/O
These properties however don't diminish the achievement of leveraging AVX-2 (or any vectorization) for a problem that doesn't immediately jump out as SIMD.
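The "extremely predictable branching" point can be made concrete: whether a line is a number, Fizz, Buzz or FizzBuzz depends only on the line number modulo 15, so the decision can come from a 15-entry table instead of a chain of divisibility tests. The following is a minimal sketch of that observation in Python (the names `WORDS` and `line` are made up here); it says nothing about how the assembly solution actually encodes the pattern.

```python
# One entry per residue mod 15; None means "print the number itself".
WORDS = [
    "FizzBuzz" if r % 15 == 0 else
    "Fizz" if r % 3 == 0 else
    "Buzz" if r % 5 == 0 else
    None
    for r in range(15)
]


def line(n):
    """Return the FizzBuzz text for n using only a table lookup."""
    word = WORDS[n % 15]
    return word if word is not None else str(n)
```

Because the table never changes, the only per-line state that remains is the number itself, which is what keeps the working set down to a few registers' worth of data.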
Or if your ASIC generalizes the problem in a slightly different direction, you end up reinventing TWINKLE and TWIRL: https://en.wikipedia.org/wiki/TWINKLE
I'm not an assembly programmer, but my intuition is that this is safer. You can only rely on "zero-copy" behavior when you have total control of the program and know that that memory region is going to stay put and remain uncorrupted. Therefore, most external code will make a copy because they can't insist on this (and because for most people, making a copy is pretty fast).
Could probably store all multiples of 3 and 5 up to some really big number burned directly to hardware and then do something like a big CAM table the way Internet core routers do mapping the numbers to ASCII bit strings. Then speedup IO by not having a general purpose display, but something more like an old-school digital clock that can only show digits and the words "fizz" and "buzz."
You could probably get very close to this solution with C or C++, plus AVX intrinsics. Some might consider that "cheating" since intrinsics occupy kind of a grey area between a higher level language and asm.
When I saw this I did wonder how much faster I could do it in hardware, but similarly I expect the bottleneck would be outputting it in a fashion that would be considered fair to be compared to software implementations (e.g. outputting readable text on a screen).
Regardless, I very much enjoyed your DVD screensaver-esque output.
This is one where I’m fully comfortable with feeling like an impostor. I’ve gotten this far (~20 years) without more than a cursory glance at machine code, I’ll be pleased if I retire at the same level of relevant expertise.
Edit: don’t get me wrong! I admire the talent that goes into this and similar efforts, and find performance-chasing particularly inspiring in any context. This is just an area of that which I don’t anticipate ever wanting to tread myself.
I was wondering the same thing! pv probably never touches its input and might itself be using splice to never read the bytes in user space and just accumulate the byte counts.
I am thankful every day for the work of those who came before me. Their long hours toiling over hardware, assembly, compilers, protocol stacks, libraries, frameworks, etc let us focus on the bigger picture, how to write same-y CRUD apps for businesses :)
That's 16 bytes per clock cycle, i.e. one AVX register per clock cycle. As most Intel CPUs can only do one store per clock cycle, that's also the theoretical limit with AVX. I wonder if using non-temporal stores would help make the code cache-size agnostic.
Note that the instruction latency is not important as long as you can pipeline the computation fully (which appears to be the case here!).
Edit: to go faster you would need to be able to use the bandwidth of more than one CPU. I wonder if you could precompute where the output will cross a page to be able to have distinct cores work on distinct pages... Hmm, I need a notepad.
Edit2: it is much simpler than that, you do not need to fill a page to vmsplice it. So in principle the code should parallelize quite trivially. Have each thread grab, say, 1M numbers at a time, for example by atomically incrementing a counter, serialize them to a bunch of pages, then grab the next batch. A simple queue will hold the ready batches that can be spliced as needed, either by a service thread or opportunistically by the thread that has just finished the batch next in sequence.
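The batching scheme proposed in this comment can be sketched as follows. This is only a model of the ordering logic in Python: a lock-protected counter stands in for the atomic increment, a dictionary stands in for the queue of ready batches, and appending to a list stands in for splicing pages into the pipe. The names `render` and `parallel_fizzbuzz` are hypothetical, and Python threads will not deliver the speedup the comment is after.

```python
import threading


def render(first, last):
    """Serialize lines first..last of FizzBuzz into one bytes object."""
    parts = []
    for i in range(first, last + 1):
        if i % 15 == 0:
            parts.append("FizzBuzz\n")
        elif i % 3 == 0:
            parts.append("Fizz\n")
        elif i % 5 == 0:
            parts.append("Buzz\n")
        else:
            parts.append(f"{i}\n")
    return "".join(parts).encode()


def parallel_fizzbuzz(total_lines, batch_size=1000, workers=4):
    n_batches = -(-total_lines // batch_size)  # ceiling division
    lock = threading.Lock()
    next_batch = 0      # shared counter (stand-in for an atomic increment)
    next_to_emit = 0    # index of the batch that must be output next
    ready = {}          # finished batches waiting for their turn
    emitted = []        # stand-in for data already handed to the pipe

    def worker():
        nonlocal next_batch, next_to_emit
        while True:
            with lock:
                index = next_batch
                next_batch += 1
            if index >= n_batches:
                return
            first = index * batch_size + 1
            last = min(first + batch_size - 1, total_lines)
            data = render(first, last)
            with lock:
                ready[index] = data
                # Opportunistically flush every batch that is now in sequence.
                while next_to_emit in ready:
                    emitted.append(ready.pop(next_to_emit))
                    next_to_emit += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b"".join(emitted)
```

The flush loop runs in whichever thread completes the batch that is next in sequence, which corresponds to the "opportunistic" variant mentioned in the comment rather than a dedicated service thread.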
Don't worry. It just seems like everyone else is so talented because no one writes articles about the 2 hrs they spent just trying to get their project to just build without errors. Or if they do, they don't get voted to the top of HN.
I ran this on a server with a 5950X (same CPU as this test was run on), but with 2666MHz memory (128GB) instead of 3600MHz memory (32GB) and I only got 41GB/s.
I've had the opportunity to tinker with ASM, the Z80 architecture, low-level programming and other similar stuff (I'm less than 1/1000th as able as the referenced author).
I find this kind of programming very beautiful and rewarding in that you really know that you are programming the hardware. Unfortunately it's not an easy path to a good-paying job (unless you are exceptional like the gentleman). So I ended up building fintech web apps.
GPU memory is an order of magnitude higher bandwidth than RAM, so that would seem to me to be the way to go to beat this. The output wouldn’t be accessible from CPU without a big slowdown though.
And imagine not coming up with this solution in your next MANGA interview!
Editorial Channel
What the content says
ND
Preamble
No editorial engagement with fundamental human rights or human dignity principles.
FW Ratio: 67%
Observable Facts
Page presents a technical programming challenge focused on computational optimization metrics.
Content makes no reference to human rights, dignity, or fundamental freedoms.
Inferences
Absence of engagement with foundational human rights concepts indicates this is technical content outside UDHR scope.
ND
Article 1: Freedom, Equality, Brotherhood
No commentary on equal dignity or inherent rights.
FW Ratio: 100%
Observable Facts
Challenge is open to all participants regardless of background.
ND
Article 2: Non-Discrimination
No discussion of discrimination or non-discrimination.
FW Ratio: 100%
Observable Facts
Challenge explicitly states 'All languages are allowed' with no restrictions based on protected characteristics.
Leaderboard includes participants with diverse names suggesting global participation.
ND
Article 3: Life, Liberty, Security
Not applicable to technical content.
ND
Article 4: No Slavery
Not applicable.
ND
Article 5: No Torture
Not applicable.
ND
Article 6: Legal Personhood
Not applicable.
ND
Article 7: Equality Before Law
Not applicable.
ND
Article 8: Right to Remedy
No discussion of privacy rights.
FW Ratio: 100%
Observable Facts
Page loads with standard analytics and session tracking elements typical of Stack Exchange platform.
ND
Article 9: No Arbitrary Detention
Not applicable.
ND
Article 10: Fair Hearing
Not applicable.
ND
Article 11: Presumption of Innocence
Not applicable.
ND
Article 12: Privacy
Not applicable.
ND
Article 13: Freedom of Movement
Not applicable.
ND
Article 14: Asylum
Not applicable.
ND
Article 15: Nationality
Not applicable.
ND
Article 16: Marriage & Family
Not applicable.
ND
Article 17: Property
Not applicable.
ND
Article 18: Freedom of Thought
Not applicable.
ND
Article 19: Freedom of Expression
Medium Practice
No editorial commentary on freedom of opinion or expression.
FW Ratio: 75%
Observable Facts
Challenge states 'All languages are allowed' encouraging open participation in technical discourse.
Comments section displays active discussion and exchange of solutions without apparent editorial filtering.
Platform permits users to post code, discuss approaches, and share ideas without censorship barriers.
Inferences
The structural openness to diverse technical ideas and discussion supports underlying principles of free expression.
ND
Article 20: Assembly & Association
Not applicable.
ND
Article 21: Political Participation
Not applicable.
ND
Article 22: Social Security
Not applicable.
ND
Article 23: Work & Equal Pay
Not applicable.
FW Ratio: 100%
Observable Facts
Competition is voluntary and conducted without monetary compensation.
ND
Article 24: Rest & Leisure
Not applicable.
ND
Article 25: Standard of Living
Not applicable.
ND
Article 26: Education
Not applicable.
ND
Article 27: Cultural Participation
Medium Practice
No explicit discussion of participation in cultural life.
FW Ratio: 75%
Observable Facts
Challenge enables participation in competitive programming culture, a recognized technical subculture.
Leaderboard displays author names and performance metrics, enabling public recognition of contributions.
Open design invites global programmers to share in technical culture and contribute solutions.
Inferences
The structure of the challenge enables meaningful participation in a shared technical and programming culture.
ND
Article 28: Social & International Order
Not applicable.
ND
Article 29: Duties to Community
Not applicable.
ND
Article 30: No Destruction of Rights
Not applicable.
Structural Channel
What the site does
ND
Preamble
Page structure is standard web platform display; no structural signals regarding human rights foundations.
ND
Article 1: Freedom, Equality, Brotherhood
Platform structure treats all challenge participants equally; no specific structural signal unique to Article 1.
ND
Article 2: Non-Discrimination
Challenge rules apply uniformly to all participants; no observable discriminatory barriers.
ND
Article 3: Life, Liberty, Security
No structural engagement with right to life, liberty, or security.
ND
Article 4: No Slavery
No structural signals regarding slavery or servitude.
ND
Article 5: No Torture
No structural signals regarding torture or cruel treatment.
ND
Article 6: Legal Personhood
No structural signals regarding right to recognition as person.
ND
Article 7: Equality Before Law
No structural signals regarding equality before law.
ND
Article 8: Right to Remedy
Page contains standard web platform tracking and session management; privacy neither prioritized nor egregiously violated.
ND
Article 9: No Arbitrary Detention
No structural signals regarding freedom from arbitrary arrest.
ND
Article 10: Fair Hearing
No structural signals regarding fair and public hearing.
ND
Article 11: Presumption of Innocence
No structural signals regarding criminal liability.
ND
Article 12: Privacy
No structural signals regarding interference with privacy.
ND
Article 13: Freedom of Movement
No structural signals regarding freedom of movement.
ND
Article 14: Asylum
No structural signals regarding asylum.
ND
Article 15: Nationality
No structural signals regarding nationality.
ND
Article 16: Marriage & Family
No structural signals regarding marriage or family.
ND
Article 17: Property
No structural signals regarding property rights.
ND
Article 18: Freedom of Thought
No structural signals regarding freedom of conscience or belief.
ND
Article 19: Freedom of Expression
Medium Practice
Platform structure enables uncensored posting of code and ideas. Challenge rules explicitly welcome diverse solutions and approaches.
ND
Article 20: Assembly & Association
No structural signals regarding freedom of assembly.
ND
Article 21: Political Participation
No structural signals regarding political participation.
ND
Article 22: Social Security
No structural signals regarding social security.
ND
Article 23: Work & Equal Pay
Challenge is voluntary, unpaid participation; no labor rights signals.
ND
Article 24: Rest & Leisure
No structural signals regarding rest or leisure.
ND
Article 25: Standard of Living
No structural signals regarding health or standard of living.
ND
Article 26: Education
No structural signals regarding education.
ND
Article 27: Cultural Participation
Medium Practice
Platform enables programmers to participate in technical culture and share knowledge and achievements. Challenge recognizes authors publicly in leaderboard.
ND
Article 28: Social & International Order
No structural signals regarding social or international order.
ND
Article 29: Duties to Community
No structural signals regarding duties to community.
ND
Article 30: No Destruction of Rights
No structural signals regarding restriction of rights.
Supplementary Signals
How this content communicates, beyond directional lean.
build 73de264+3rh4 · deployed 2026-02-28 13:33 UTC · evaluated 2026-02-28 13:36:03 UTC