All the naysayers here clearly have no idea. Your large matrix multiplication implementation is quite impressive! I have set up a benchmark loop and let GPT-5.1-Codex-Max experiment for a bit (not 5.2/Opus/Gemini, because they are broken in Copilot), but it seems to be missing something crucial. With a bit of encouragement, it has implemented:

    - padding from 2000 to 2048 for easier power-of-two splitting
    - two-level Winograd matrix multiplication with tiled matmul for last level
    - unrolled AVX2 kernel for 64x64 submatrices (roughly sketched below)
    - 64 byte aligned memory
    - restrict keyword for pointers
    - better compiler flags (clang -Ofast -march=native -funroll-loops -std=c++17)

But yours is still easily 25% faster. Would you be willing to write a bit about how you set up your evaluation and which tricks Claude used to solve it?
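
For anyone curious, the kernel item boils down to roughly the following (my own sketch, not the code Codex-Max generated; the row-major 64x64 tile layout, the function names, and the use of FMA are my assumptions):

    #include <immintrin.h>
    #include <cstdlib>

    // Rough sketch: C += A * B for one 64x64 tile, row-major, using AVX2/FMA.
    // With fixed trip counts and -funroll-loops the compiler unrolls this fully.
    void matmul_tile_64(const float* __restrict A,
                        const float* __restrict B,
                        float* __restrict C) {
        for (int i = 0; i < 64; ++i) {
            for (int j = 0; j < 64; j += 8) {
                __m256 acc = _mm256_load_ps(&C[i * 64 + j]);   // aligned load
                for (int k = 0; k < 64; ++k) {
                    __m256 a = _mm256_broadcast_ss(&A[i * 64 + k]);
                    __m256 b = _mm256_load_ps(&B[k * 64 + j]);
                    acc = _mm256_fmadd_ps(a, b, acc);          // acc += a * b
                }
                _mm256_store_ps(&C[i * 64 + j], acc);
            }
        }
    }

    // 64-byte aligned tiles so the aligned loads/stores above are valid
    // and each row starts on a cache-line boundary.
    float* alloc_tile() {
        return static_cast<float*>(std::aligned_alloc(64, 64 * 64 * sizeof(float)));
    }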

Thank you. Yeah, I'm doing all those things, which do get you close to the top. The rest of what I'm doing is mostly micro-optimizations, such as finding a way to avoid the AVX→SSE transition penalty (a 1-2% improvement).

But I don't want to spoil the fun. The agents are really good at searching the web now, so posting the tricks here is basically breaking the challenge.

For example, ChatGPT was able to find Matt's blog post regarding Task 1, and that's what gave me the largest jump: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa...

Interestingly, it seems that Matt's post is not in the training data of any of the major LLMs.
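
On the AVX→SSE point specifically, the textbook mitigation is no secret (this is a generic illustration, not necessarily how I did it): zero the upper halves of the YMM registers before any legacy SSE code runs.

    #include <immintrin.h>

    // Generic illustration: after a block of 256-bit AVX code, issue vzeroupper
    // so subsequent legacy-SSE instructions do not trigger the costly
    // AVX->SSE transition. Compilers usually do this at function boundaries,
    // but hand-written kernels sometimes need it explicitly.
    void add8(float* dst, const float* a, const float* b) {
        __m256 v = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
        _mm256_storeu_ps(dst, v);
        _mm256_zeroupper();  // emits vzeroupper
    }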


In 2021, only the top 8 % of games sold 10k copies or more, so if you were among them, you were quite successful.

Source: https://app.sensortower.com/vgi/insights/article/video-game-...

In addition, a large fraction of those 8 % were probably games by AAA studios, so your chances as an indie dev are even lower.


If the MCP tools come first in the conversation, it should be technically possible to cache the activations, so you do not have to recompute them each time.
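
A toy sketch of that idea (purely illustrative; the types and names are made up): if the serialized tool definitions form a stable prefix of the prompt, the activations for that prefix can be computed once and reused on later requests by keying a cache on the prefix text.

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical illustration of prefix caching: activations (e.g. the KV
    // cache) for a fixed prompt prefix, such as the MCP tool definitions,
    // are computed once and looked up by the prefix text afterwards.
    struct PrefixActivations { std::vector<float> keys, values; };

    std::unordered_map<std::string, PrefixActivations> cache;

    const PrefixActivations& activations_for(const std::string& prefix) {
        auto it = cache.find(prefix);
        if (it == cache.end()) {
            // The expensive forward pass over the prefix tokens would go here.
            it = cache.emplace(prefix, PrefixActivations{}).first;
        }
        return it->second;
    }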


Generative models can combine different concepts from the training data. For example, the training data might contain a single image of a new missile launcher at a military parade. The model can then generate an image of that missile launcher hiding in a bush, because it has internalized the general concept of things hiding in bushes and can apply it to objects it has never seen in that context.


Use it or lose it. With the invention of the calculator, students lost the ability to do arithmetic. Now, with LLMs, they lose the ability to think.

This is not conjecture by the way. As a TA, I have observed that half of the undergraduate students lost the ability to write any code at all without the assistance of LLMs. Almost all use ChatGPT for most exercises.

Thankfully, cheating technology is advancing at a similarly rapid pace. Glasses with integrated cameras, WiFi and heads-up display, smartwatches with polarized displays that are only readable with corresponding glasses, and invisibly small wireless ear-canal earpieces to name just a few pieces of tech that we could have only dreamed about back then. In the end, the students stay dumb, but the graduation rate barely suffers.

I wonder whether pre-2022 degrees will become the academic equivalent to low-background radiation steel: https://en.wikipedia.org/wiki/Low-background_steel


I have the skills to write efficient CUDA kernels, but $2/hr is 10% of my salary, so no way I'm renting any H100s. The electricity price for my computer is already painful enough as is. I am sure there are many eastern European developers who are more skilled and get paid even less. This is a huge waste of resources all due to NVIDIA's artificial market segmentation. Or maybe I am just cranky because I want more VRAM for cheap.


This has 128GB of unified memory. A similarly configured Mac Studio costs almost twice as much, and I'm not sure the GPU is in the same league (software-support-wise it isn't, but that's fixable).

A real shame it's not running mainline Linux - I don't like their distro based on Ubuntu LTS.


$4,799 for an M2 Ultra with 128GB of RAM, so not quite twice as much. I'm not sure what the benchmark comparison would be. $5,799 if you want an extra 16 GPU cores (60 vs 76).


We'll need to look into benchmarks when the numbers come out. Software support is also important, and a Mac will not help you that much if you are targeting CUDA.

I have to agree the desktop experience of the Mac is great, on par with the best Linuxes out there.


A lot of models are optimized for Metal already, especially Llama, DeepSeek, and Qwen. You are still taking a hit, but there wasn't an alternative for getting that much VRAM for less than $5k before this NVIDIA project came out. I will definitely look at it closely if it isn't just vaporware.


They can't walk it back now without some major backlash.

The one thing I wonder about is noise. That box is awfully small for the amount of compute it packs, and high-end Mac Studios are 50% heatsink. There isn’t much space in this box for a silent fan.


Not to belittle Justine's achievements, but the role of the most important software developer probably goes to the maintainer of some hugely important infrastructure project that we barely know about.

https://xkcd.com/2347/

If Justine hadn't optimized struct padding, binaries would be a bit larger, but software would keep working. However, when a trivial library like left-pad disappeared, it triggered global chaos of such monumental proportions that it warrants its own Wikipedia article: https://en.wikipedia.org/wiki/Npm_left-pad_incident

Or there might be some unsung hero responsible for fixing a year 2038 bug in a bunch of ICBMs who prevented worldwide nuclear annihilation (or who caused it, if you have a more pessimistic view of the future).
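
On the struct padding aside above: the saving comes from ordering members by alignment so the compiler inserts less padding. A toy illustration (not Justine's actual code), assuming a typical 64-bit ABI:

    #include <cstdint>

    struct Padded {     // 1 + 7 (pad) + 8 + 4 + 4 (tail pad) = 24 bytes
        std::uint8_t  flag;
        std::uint64_t id;
        std::uint32_t count;
    };

    struct Reordered {  // 8 + 4 + 1 + 3 (tail pad) = 16 bytes, same fields
        std::uint64_t id;
        std::uint32_t count;
        std::uint8_t  flag;
    };

    static_assert(sizeof(Padded) == 24 && sizeof(Reordered) == 16,
                  "sizes assume a typical 64-bit ABI");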


She's created a compatibility layer enabling portability of a huge amount of software between operating systems, which will give a huge number of other developers a path into those operating systems and the hardware they run on.


Promoting HN posts can earn a lot of money I'd wager. The audience tends to be wealthy.

Edit: A quick search suggests that one HN upvote costs about 9 cents. I wonder why we don't see even more bots.


Note that the 'Jump Flood Algorithm' is O(N log N), where N is the number of pixels. There is a better O(N) algorithm which can be parallelized over the rows/columns of the image:

https://news.ycombinator.com/item?id=36809404

Unfortunately, it requires random access writes (compute shaders) if you want to run it on the GPU. But if CPU is fine, here are a few implementations:

JavaScript: https://parmanoir.com/distance/

C: https://github.com/983/df

C++: https://github.com/opencv/opencv/blob/4.x/modules/imgproc/sr...

Python: https://github.com/pymatting/pymatting/blob/afd2dec073cb08b8...
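
For context, the O(N) approach is separable: a 1D pass runs over every row and then over every column of the intermediate result, which is what makes it parallelizable. Here is a minimal sketch of that 1D pass in the Felzenszwalb-Huttenlocher formulation (a generic illustration; the linked implementations may differ in details):

    #include <vector>

    // O(n) 1D squared-distance transform (Felzenszwalb & Huttenlocher,
    // "Distance Transforms of Sampled Functions"). The 2D transform applies
    // this pass to every row and then to every column, so rows (and then
    // columns) can be processed in parallel.
    // f holds 0 at feature pixels and a large value (e.g. 1e20f) elsewhere.
    std::vector<float> dt1d(const std::vector<float>& f) {
        const int n = static_cast<int>(f.size());
        const float INF = 1e20f;
        std::vector<float> d(n);
        std::vector<int> v(n);        // indices of parabolas in the lower envelope
        std::vector<float> z(n + 1);  // boundaries between adjacent parabolas
        int k = 0;
        v[0] = 0;
        z[0] = -INF;
        z[1] = +INF;
        for (int q = 1; q < n; ++q) {
            float s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2.0f * (q - v[k]));
            while (s <= z[k]) {
                --k;
                s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2.0f * (q - v[k]));
            }
            ++k;
            v[k] = q;
            z[k] = s;
            z[k + 1] = +INF;
        }
        k = 0;
        for (int q = 0; q < n; ++q) {
            while (z[k + 1] < q) ++k;
            d[q] = (q - v[k]) * (q - v[k]) + f[v[k]];
        }
        return d;
    }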


Dang, I implemented SDFs in 2023 around that time with jump flooding. Wish I had seen this version. Thanks for pointing it out!


Two more heuristics:

1. The figures are not vectorized (text in figures cannot be selected). All it takes is replacing "png" in `plt.savefig("figure.png")` with "pdf", so this is a very easy fix. Yet the author did not bother, which shows that he either did not care or did not know.

2. The equations lack punctuation.

Of course you can still write insightful papers with low quality figures and unusual punctuation. This is just a heuristic after all.

