Hi, I wrote the post! I'm also not an ML researcher, just an interested engineer, so I'm sure I got some things wrong.
> MoE should be the better tradeoff for the local/single-user scenario since the downside of batching being harder / less efficient doesn't matter.
What I meant was that the single-user scenario is going to get dramatically worse throughput-per-GPU, because a single user isn't able to reap the benefits of multi-user batching (unless they're somehow doing massively parallel inference requests, I suppose).
> Is it really that the matrixes being multiplied are larger? My mental model is that the purpose of batching isn't to get larger input matrices. It's to move the bottleneck from memory bandwidth to compute.
As I understand it, you want larger input matrices precisely in order to move the bottleneck from memory to compute. If you do no batching at all, your multiplications will be smaller: the weights will be the same, of course, but the next-token data you're multiplying with the weights will be 1 x dim instead of batch_size x dim. So your GPUs will be under-utilized, and your inference will spend more time doing memory operations and less time multiplying.
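A rough numpy sketch of what I mean (dimensions made up; real models obviously have many such matmuls per layer): the weight matrix gets read from memory either way, but the batched multiply does batch_size times more useful arithmetic per byte of weights loaded.

```python
import numpy as np

dim, batch_size = 4096, 32

# The weight matrix has to be loaded from memory in both cases.
W = np.random.randn(dim, dim).astype(np.float32)

# Unbatched: one user's next-token activations, shape (1, dim).
x_single = np.random.randn(1, dim).astype(np.float32)
# Batched: many users' activations stacked, shape (batch_size, dim).
x_batch = np.random.randn(batch_size, dim).astype(np.float32)

y_single = x_single @ W  # ~dim*dim multiply-adds per full read of W
y_batch = x_batch @ W    # ~batch_size*dim*dim multiply-adds per full read of W
```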
> The post has no numbers on the time to first token for any of the three providers.
I probably should have hunted down specific numbers, but I think people who've played with DeepSeek and other models will know that DeepSeek is noticeably more sluggish.
That’s how it does work, but unfortunately denoising the last paragraph requires computing attention scores for every token in that paragraph, which requires checking those tokens against every token in the sequence. So it’s still much less cacheable than the equivalent autoregressive model.
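A toy back-of-the-envelope count (all sizes made up) of the attention-score work in each case:

```python
# Made-up sizes: a 4000-token prefix, a 200-token final paragraph, 50 denoising steps.
prefix_len, paragraph_len, denoise_steps = 4000, 200, 50

# Autoregressive with a KV cache: each new token is one query attending over
# everything generated so far, and earlier tokens are never revisited.
ar_scores = sum(prefix_len + i for i in range(paragraph_len))

# Diffusion-style denoising: every denoising step recomputes attention scores for
# every token in the paragraph against the whole sequence.
diffusion_scores = denoise_steps * paragraph_len * (prefix_len + paragraph_len)

print(ar_scores, diffusion_scores)  # ~0.8M vs ~42M score computations
```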
It’s partially because my articles often sit in draft for a while. Then while editing I have another thought that spins off into a separate article, and the two end up getting posted in quick succession.
Yeah, I agree this post would have been better with a concrete example. It's hard to talk about a specific project though, since it comes down to describing in detail facts about a company's internal workings (often embarrassing facts). I couldn't figure out how to anonymize it sufficiently.
Nope! I threw this up in the time I had just to see what it'd look like. I think it'd be a lot of fun to try different formats or different prompts (e.g. eliciting CoT before making a decision).
This repo is sort of a companion piece to https://github.com/sgoedecke/fish-tank, which is a more physics-based 2d simulation. Interestingly, 4o-mini absolutely cleans up in the 2d space but doesn't win all the time in poker.
I specialized in heads-up no-limit (HUNL) when I played. In cash games, heads-up no-limit holdem is arguably an unsolvable game - whereas limit holdem is already solved. Multi-handed no-limit (more than 2 players) is solvable, but only in theory (thus far). It may seem counter-intuitive that fewer players (fewer hands) could equate to more variations, but heads-up (2 players) introduces a variable that's extremely difficult to quantify: we call this "meta", the "psyche" part of poker. (Meta game can and will be present multi-handed, but to a much smaller extent, and most often at play in heads-up situations within multi-handed play.) Meta isn't so much John Malkovich dissecting an Oreo cookie because he's bluffing - much of it is intangible, but as hands are played, either player is contributing to the context.
History of how hands were played, what you know about your opponent, what you know your opponent knows about you, what you know your opponent expects you to do in a given situation... all that stuff comes into focus in HUNL.
Anyway, that got a bit wordy, but I hope you can see why I am super curious to see some application / testing of LLMs in the specific HUNL format. Thanks for reading!
The specific example here - pushing an organization to use your bespoke Python framework, which you believe to be the best solution for all Python programming tasks - does not inspire confidence.
Can you be more specific, confidence with what exactly? :)
I'm interpreting your reaction as the thought that "if it's so good, why doesn't it sell itself?" -- in which case I would suggest you put yourself in the shoes of someone on a platform team, or a manager trying to get everyone to do things a certain way. They'll explain that for brand-new greenfield things adoption is generally much easier, but for existing people and processes change is difficult - so even with a better solution, it just doesn't sell itself without some help.
Sometimes the people involved have to be ... exchanged for other people. I've probably been that person at times and seen others needing to move on for the benefit of everyone. Yet staying in the familiar can be deceptively comfortable.