
The new skill is programming, same as the old skill. To the extent these things are comprehensible, you understand them by writing programs: programs that train them, programs that run inference, programs that analyze their behavior. You get the most out of LLMs by knowing how they work in detail.

I had one view of what these things were and how they work, and a bunch of outcomes attached to that. And then I spent a bunch of time training language models in various ways and doing other related upstream and downstream work, and I had a different set of beliefs and outcomes attached to it. The second set of outcomes is much preferable.

I know people really want there to be some different answer, but it remains the case that mastering a programming tool involves implementing one, to one degree or another. I've only done medium-sophistication ML programming, and my understanding is therefore kinda medium, but like compilers, even building a medium one is the difference between getting good results out of a high-complexity one and guessing.

Go train an LLM! How do you think Karpathy figured it out? The answer is on his blog!



Saying the best way to understand LLMs is by building one is like saying the best way to understand compilers is by writing one. Technically true, but most people aren't interested in going that deep.


I don't know, I've heard that meme too, but it doesn't track with the number of cool compiler projects on GitHub or hitting the HN frontpage, and while the LLM thing is a lot newer, you see a ton of useful/interesting projects at the "an individual could do this on their weekends and it would mean they fundamentally know how all the pieces fit together" level.

There will always be a crowd that wants the "master XYZ in 72 hours with this ONE NEAT TRICK" course, and there will always be a..., uh, group of people serving that market need.

But most people? Especially in a place like HN? I think most people know that getting buff involves going to the gym. I have a pretty high opinion of the typical person. We're all tempted by the "most people are stupid" meme, but that's because bad interactions are memorable, not because most people are stupid or lazy or whatever. Most people are very smart if they apply themselves, and most people will work very hard if the reward for doing so is reasonably clear.

https://www.youtube.com/shorts/IQmOGlbdn8g


The best way to understand a car is to build a car. Hardly anyone is going to do that, but we still all use them quite well in our daily lives. In large part because the companies who build them spend time and effort to improve them and take away friction and complexity.

If you want to be an F1 driver it's probably useful to understand almost every part of a car. If you're a delivery driver, it probably isn't, even if you use one 40+ hours a week.


Your example / analogy is useful in the sense that it's usually useful to establish the thought experiment with the boundary conditions.

But in between someone commuting in a Toyota and an F1 driver are many, many people. The best example from inside the extremes is probably a car mechanic, and even there, there's the oil change place with the flat fee painted in the window, and the Koenigsegg dealership that orders the part from Europe. The guy who tunes those up can afford one himself.

In the use case segment where just about anyone can do it with a few hours training, yeah, maybe that investment is zero instead of a week now.

But I'm much more interested in the one where F1 cars break the sound barrier now.


It might make sense to split the car analogy into different users:

1. For the majority of regular users the best way to understand the car is to read the manual and use the car.

2. For F1 drivers the best way to understand the car is to consult with engineers and use the car.

3. For a mechanic / engineer the best way to understand the car is to build and use the car.


yes, except intelligence isn't like a car; there's no way to break the complicated emergent behaviors of these models into simple abstractions. you can understand an LLM by training one about as much as you can understand a brain by dissection.


I think making one would help you understand that they're not intelligent.


Your reply is enough of a zinger that I'll chuckle and not pile on, but there is a very real and very important point here, which is that it is strictly bad to get mystical about this.

There are interesting emergent behaviors in computationally feasible scale regimes, but it is not magic. The people who work at OpenAI and Anthropic worked at Google and Meta and Jump before, they didn't draw a pentagram and light candles during onboarding.

And LLMs aren't even the "magic. Got it." ones anymore, the zero shot robotics JEPA stuff is like, wtf, but LLM scaling is back to looking like a sigmoid and a zillion special cases. Half of the magic factor in a modern frontier company's web chat thing is an uncorrupted search index these days.


OK I, like the other commenter, also feel stupid to reply to zingers--but here goes.

First of all, I think a lot of the issue here is this sense of baggage over this word intelligence--I guess because believing machines can be intelligent goes against this core belief that people have that humans are special. This isn't meant as a personal attack--I just think it clouds thinking.

Intelligence of an agent is a spectrum, it's not a yes/no. I suspect most people would not balk at me saying that ants and bees exhibit intelligent behavior when they look for food and communicate with one another. We infer this from some of the complexity of their route planning, survival strategies, and ability to adapt to new situations. Now, I assert that those same strategies can not only be learned by modern ML but are indeed often even hard-codable! As I view intelligence as a measure of an agent's behaviors in a system, such a measure should not distinguish the bee and my hard-wired agent. This for me means hard-coded things can be intelligent, as they can mimic bees (and, with enough code, humans).

However, the distribution of behaviors which humans inhabit is prohibitively difficult to code by hand. So we rely on data-driven techniques to search for such distributions in a space which is rich enough to support complexities at the level of the human brain. As such I certainly have no reason to believe, just because I can train one, that it must be less intelligent than humans. On the contrary, I believe in every verifiable domain RL must drive the agent to be the most intelligent (relative to the RL reward) it can be under the constraints--and often it must become more intelligent than humans in that environment.


So according to your extremely broad definition of intelligence, a Casio calculator is also intelligent?

Sure, if we define anything as intelligent, AI is intelligent.

Is this definition somehow helpful though?


It's not binary...


Eh...kinda. The RL in RLHF is a very different animal than the RL in a Waymo car training pipeline, which is sort of obvious when you see that the former can be done by anyone with some clusters and some talent, and the latter is so hard that even Waymo has a marked preference for operating in July in Chandler AZ: everyone else is in the process of explaining why they didn't really want Level 5 per se anyways: all brakes no gas if you will.

The Q summations that are estimated/approximated by deep policy networks are famously unstable/ill-behaved under descent optimization in the general case, and it's not at all obvious that "point RL at it" is like, going to work at all. You get stability and convergence issues, you get stuck in minima, it's hard and not a mastered art yet, lot of "midway between alchemy and chemistry" vibes.
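To make the "Q summations approximated by deep networks" point concrete, here is a minimal sketch (names and shapes are hypothetical, this is not anyone's production pipeline) of why descent on them is touchy: the regression target itself depends on the network you're optimizing, so the target moves as you train.

  # TD target for deep Q-learning; the max over next-state Q-values is itself a network output
  import torch
  import torch.nn.functional as F

  def q_loss(q_net, target_net, batch, gamma=0.99):
      with torch.no_grad():
          next_q = target_net(batch["next_state"]).max(dim=1).values
          target = batch["reward"] + gamma * (1 - batch["done"]) * next_q
      q = q_net(batch["state"]).gather(1, batch["action"].unsqueeze(1)).squeeze(1)
      return F.mse_loss(q, target)   # minimizing this chases a moving target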

The RL in RLHF is more like Learning to Rank in a newsfeed optimization setting: it's (often) ranked-choice over human-rating preferences with extremely stable outcomes across humans. This phrasing is a little cheeky but gives the flavor: it's Instagram where the reward is "call it professional and useful" instead of "keep clicking".

When the Bitter Lesson essay was published, it was contrarian and important and most of all aimed at an audience of expert practitioners. The Bitter Bitter Lesson in 2025 is that if it looks like you're in the middle of an exponential process, wait a year or two and the sigmoid will become clear, and we're already there with the LLM stuff. Opus 4 is taking 30 seconds on the biggest cluster that billions can buy, and they've stripped off like 90% of the correctspeak alignment to get that capability lift; we're hitting the wall.

Now this isn't to say that AI progress is over, new stuff is coming out all the time, but "log scale and a ruler" math is marketing at this point, this was a sigmoid.

Edit: don't take my word for it, this is LeCun (who I will remind everyone has the Turing) giving the Gibbs Lecture on the mathematics 10k feet view: https://www.youtube.com/watch?v=ETZfkkv6V7Y


I'm in agreement--RLHF won't lead to massively more intelligent beings than humans. But I said RL, not RLHF.


Well what you said is:

"On the contrary, I believe in every verifiable domain RL must drive the agent to be the most intelligent (relative to RL award) it can be under the constraints--and often it must become more intelligent than humans in that environment."

And I said it's not that simple, in no way demonstrated, unlikely with current technology, and basically, nope.


Ah, you're worried about convergence issues? My (bad) understanding was that the self-driving car stuff is more about the inadequacies of the models in which you simulate training and data collection than about convergence of the algorithms, but I could be wrong. I mean, that statement was just a statement that I think you can get RL to converge to close to optimum--which I agree is a bit of a stretch, as RL is famously finicky. But I don't see why one shouldn't expect this to happen as we tune the algorithms.


It's not that deep


I highly, highly doubt that training an LLM like GPT-2 will help you use something the size of GPT-4. And I guess most people can't afford to train something like GPT-4. I trained some NNs back before the ChatGPT era, and I don't think any of it helps in using ChatGPT/alternatives.


With modern high-quality datasets and the plummeting H100 rental costs, it is 100% a feasible undertaking for an individual to train a model with performance far closer to gpt-4-1106-preview than to gpt-2. In fact, it's difficult to train a model that performs as badly as gpt-2 without carefully selecting for datasets like OpenWebText with the explicit purpose of replicating runs of historical interest: modern datasets will do better than that by default.

GPT-4 is a 1.75-teraweight MoE (the rumor has it), and that's probably pushing it for an individual's discretionary budget unless they're very well off, but you don't need to match that exactly to learn how these things fundamentally work.

I think you underestimate how far the technology has come. torch.distributed works out of the box now, deepspeed and other strategies that are both data- and model-parallel are weekend projects to spin up on an 8xH100 SXM5 interconnected cluster that you can rent from Lambda Labs, and HuggingFace has extremely high-quality curated datasets (the fineweb family I was alluding to from Karpathy's open stuff is stellar).
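For a concrete flavor, here's a minimal sketch of streaming fineweb-edu without downloading the whole corpus (the `sample-10BT` config name is the one I believe HuggingFace ships; treat it as an assumption):

  # assumes the HuggingFace `datasets` library; streams documents instead of downloading everything
  from datasets import load_dataset

  ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                    split="train", streaming=True)
  for i, doc in enumerate(ds):
      print(doc["text"][:200])   # each record carries the raw document text
      if i == 2:
          break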

In just about any version of this you come to understand how tokenizers work (which makes a whole class of failure modes go from baffling to intuitive), how models behave and get evaled after pretraining, after instruct training / SFT rounds, how convergence does and doesn't happen, how tool use and other special tokens get used (and why they are abundant).
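If you want a tiny taste of the tokenizer intuition without training anything, something like this (a sketch using tiktoken's cl100k_base encoding, not a course) already demystifies a bunch of "why does it fail at counting letters" questions:

  # assumes `pip install tiktoken`; prints the token ids and the pieces each string splits into
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  for s in ["strawberry", " strawberry", "12345678987654321"]:
      ids = enc.encode(s)
      print(repr(s), ids, [enc.decode([t]) for t in ids])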

And no, doing all that doesn't make Opus 4 completely obvious in all aspects. But it's about 1000x more effective as a learning technique than doing prompt-engineer astrology. Opus 4 is still a bit mysterious if you don't work at a frontier lab; there's very interesting stuff going on there, and I'm squarely speculating if I make claims about how some of that works.

Models that look and act a lot like GPT-4 while having dramatically lower parameter counts are just completely understood in open source now. The more advanced ones require resources of a startup rather than an individual, but you don't need to eval the same as 1106 to take all the mystery out of how it works.

The "holy shit" models are like 3-4 generations old now.


Ok, I'm open (and happy!) to being proven wrong on this. You are saying I can find tutorials which can train something like a gpt-3.5-level model (like a 7B model?) from scratch for under 1000 USD of cloud compute? Is there a guide on how to do this?


The literally-watch-it-on-a-livestream version does in fact start with the GPT-2 arch (but evals way better): https://youtu.be/l8pRSuU81PU

Lambda Labs' full-metal-jacket accelerated-interconnect clusters: https://lambda.ai/blog/introducing-lambda-1-click-clusters-a...

FineWeb-2 has versions with Llama-range token counts: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2

Ray Train is one popular choice for going distributed, Runhouse, a buncha stuff (and probably new versions since I last was doing this): https://docs.ray.io/en/latest/train/train.html
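The Ray Train shape, roughly (the train loop body is yours to fill in; num_workers=8 just assumes one of those 8-GPU nodes):

  # minimal Ray Train sketch: data-parallel training across 8 GPUs on one rented node
  from ray.train import ScalingConfig
  from ray.train.torch import TorchTrainer

  def train_loop_per_worker(config):
      # your per-GPU training step goes here; ray.train.torch handles the DDP wrapping
      pass

  trainer = TorchTrainer(
      train_loop_per_worker,
      scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
  )
  result = trainer.fit()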

tiktokenizer is indispensable for gaining an intuition about tokenization, and it does cl100k: https://tiktokenizer.vercel.app/

Cost comes into it, and doing things more cheaply (e.g. vast.ai) is harder. Doing a phi-2 / phi-3 style pretrain is, like I said, more like the resources of a startup.

But in the video Karpathy evals better than gpt-2 overnight for 100 bucks and that will whet anyone's appetite.

If you get bogged down building FlashAttention from source or whatever, b7r6@b7r6.net


Thanks for the links! Hopefully this doesn't come across as confrontational (this is really something I would like to try myself), but I don't think a gpt-2 arch will get close to gpt-3.5-level intelligence? I feel like there was some boundary around gpt-3.5 where the stuff started to feel slightly magical for me [maybe it was only the RLHF effect]. Do you think models at gpt-2 size now are getting to that capability? I know sub-10B models have been getting really smart recently.


I think you'll be surprised if you see the lift karpathy demonstrates from `fineweb.edu` vs `webtext` (he went back later and changed the `nanogpt` repository to use `openwebtext` because it was different enough that it wasn't a good replication of GPT-2).

But from an architecture point of view, you might be surprised at how little has changed. Rotary and/or ALiBi embeddings are useful, and there's a ton on the inference-efficiency side (MHA -> GQA -> MLA), but you can fundamentally take a llama and start it tractably small, and then make it bigger.
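Concretely, something like this (the sizes are made up; it's a sketch of "a llama, but tractably small", assuming the transformers library):

  # a deliberately tiny llama-style config; rotary embeddings come for free,
  # and num_key_value_heads < num_attention_heads gives you GQA
  from transformers import LlamaConfig, LlamaForCausalLM

  config = LlamaConfig(
      vocab_size=32000,
      hidden_size=512,
      intermediate_size=1408,
      num_hidden_layers=8,
      num_attention_heads=8,
      num_key_value_heads=4,
      max_position_embeddings=1024,
  )
  model = LlamaForCausalLM(config)
  print(sum(p.numel() for p in model.parameters()) / 1e6, "M params")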

You can also get checkpoint weights for tons of models that are trivially competitive, and tune heads on them for a fraction of the cost.
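And the cheap version of that, sketched (the checkpoint name here is just an example; pick whatever open weights fit your budget):

  # load open checkpoint weights and tune only a new head on top
  from transformers import AutoModelForSequenceClassification, AutoTokenizer

  name = "distilbert-base-uncased"   # example checkpoint, anything open and budget-sized works
  tok = AutoTokenizer.from_pretrained(name)
  model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

  for p in model.base_model.parameters():   # freeze the backbone
      p.requires_grad = False
  # training now only updates the classification head's parameters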

This leaked Google memo is a pretty good summary (and remarkably prescient in terms of how it's played out): https://semianalysis.com/2023/05/04/google-we-have-no-moat-a...

I hope I didn't inadvertently say or imply that you can make GPT-4 in a weekend, that's not true. But you can make models with highly comparable characteristics based on open software, weights, training sets, and other resources that are basically all on HuggingFace: you can know how it works.

GPT-2 is the one you can do completely by yourself starting from knowing a little Python in one day.



