
> Instead, it turns out a few hundred lines of Python is genuinely enough to train a basic version!

Actually, it's not just a basic version. Llama 1/2's model.py is ~500 lines: https://github.com/facebookresearch/llama/blob/main/llama/mo...

Mistral (which is rumored to be a fork of Llama) is 369 lines: https://github.com/mistralai/mistral-src/blob/main/mistral/m...

and both of these are SOTA open source models.




This metric from the article is a bit misleading. It's not 500 lines in the same sense as mergesort is 20 lines. The whole of the deep learning stack including nvidia firmware blobs is used by those 500 lines.

You can create a small NN from scratch in 500 lines and train it on a toy dataset, but you can't actually train real LLMs unless you use the existing stack. Even Karpathy's torch-free, ~500-line C Llama2 code is inference only.

So it is 500 extra lines compared to what we already had before, and in this sense it indeed is a breakthrough.

Otherwise you could also say it's 10 lines of fastAI/lightning style code where almost everything is hidden by the APIs.
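
To make that concrete, a high-level version really is roughly that size. Here's a hypothetical Hugging Face Trainer-style sketch (the "gpt2" model name and train_dataset are placeholders, not anything from the article), where all the real work is hidden behind the library:

    # Hypothetical high-level "everything hidden by the API" training snippet
    # (Hugging Face Trainer style); "gpt2" and train_dataset are placeholders.
    from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    args = TrainingArguments(output_dir="out",
                             per_device_train_batch_size=8,
                             num_train_epochs=1)
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()  # millions of lines of torch/CUDA code do the actual work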


All the nvidia/torch stuff is "only" to make it run two orders of magnitude faster or so. For the actual principles behind LLMs, 500 lines seems quite accurate.

How long would it take to explain what an LLM is doing to someone from the 1800s? It is remarkably simple.
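
To illustrate (my own rough sketch, not the article's code): the heart of the thing, causal self-attention, fits in about a dozen lines of PyTorch, with placeholder weight shapes. A real model just stacks many of these blocks plus some MLPs and normalization.

    # Minimal causal self-attention in PyTorch -- illustrative sketch only.
    # Weight shapes are placeholders; a real model stacks many such blocks.
    import torch
    import torch.nn.functional as F

    def causal_self_attention(x, w_qkv, w_out):
        # x: (batch, seq, dim), w_qkv: (dim, 3*dim), w_out: (dim, dim)
        B, T, C = x.shape
        q, k, v = (x @ w_qkv).split(C, dim=-1)
        att = (q @ k.transpose(-2, -1)) / C ** 0.5            # similarity scores
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
        att = att.masked_fill(~mask, float("-inf"))           # can't look ahead
        return (F.softmax(att, dim=-1) @ v) @ w_out           # weighted mix of values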


The article talked about building a basic version of an LLM. Even basic versions, like the ~100-million-parameter TinyStories models Karpathy trained, which manage to generate coherent English for a restricted vocabulary, are trained for hours on an A100. If you want to skip the nvidia/torch stuff you'll both be training for days and have to write your own training code, which is not going to be just 500 lines. The principles are simple, but that is true for any important algorithm of the last 100 years. As opposed to LLMs, most of those can actually be done in a few hundred lines of code with no specialized hardware and software stack required to be practical.


> Karpathy's TinyStories

Did you mean Karpathy's tinyllamas? [1][2]

Or did you mean Ronen Eldan and Yuanzhi Li's "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?" [3][4]

[1]: https://huggingface.co/karpathy/tinyllamas/tree/main

[2]: https://github.com/karpathy/llama2.c

[3]: https://www.microsoft.com/en-us/research/publication/tinysto...

[4]: https://arxiv.org/abs/2305.07759


Karpathy's tinyllamas are based on the TinyStories dataset. I could have phrased it better, sorry.


Sure - there are indeed millions of lines of code if you include the operating system, the Python interpreter, PyTorch, CUDA etc.

But... you don't have to write that code. A research group constructing a brand new LLM really does only need a few hundred lines of code for the core training and inference. They need to spend their time on the data instead.

That's what makes LLMs "easy" to build. You don't need to write millions of lines of custom code to build an "AI", which I think is pretty surprising.
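
For a sense of what that core looks like, here's a schematic next-token-prediction training loop in PyTorch. `model` and `batches` are stand-ins for whatever transformer and data pipeline a group already has; it's a sketch, not anyone's actual code:

    # Schematic next-token-prediction training loop; `model` and `batches`
    # are stand-ins for the transformer and data pipeline you already have.
    import torch.nn.functional as F
    from torch.optim import AdamW

    opt = AdamW(model.parameters(), lr=3e-4)
    for tokens in batches:                      # tokens: (batch, seq_len+1) int ids
        logits = model(tokens[:, :-1])          # predict a distribution per position
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()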


For inference, less than 1 KLOC of pure, dependency-free C is enough (even including the tokenizer and command-line parsing) [1]. This was a non-obvious fact to me: in principle, you could have run a modern LLM 20 years ago with just 1000 lines of code, assuming you're fine with things potentially taking days to run, of course.

Training wouldn't be that much harder: Micrograd [2] is ~200 LOC of pure Python, and 1000 lines would probably be enough to train an (extremely slow) LLM. By "extremely slow", I mean that a training run that normally takes hours could probably take dozens of years, but the results would, in principle, be the same.
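
To give a flavour of what those 200 lines buy you, here's an abbreviated micrograd-style scalar autograd node (a sketch in the same spirit, not the actual file; it assumes operands are already Value objects):

    # Abbreviated micrograd-style scalar autograd node (sketch, not the real file).
    class Value:
        def __init__(self, data, children=()):
            self.data, self.grad = data, 0.0
            self._children, self._backward = children, lambda: None

        def __add__(self, other):               # assumes `other` is a Value
            out = Value(self.data + other.data, (self, other))
            def _backward():
                self.grad += out.grad
                other.grad += out.grad
            out._backward = _backward
            return out

        def __mul__(self, other):               # assumes `other` is a Value
            out = Value(self.data * other.data, (self, other))
            def _backward():
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out._backward = _backward
            return out

        def backward(self):
            # reverse topological order, then apply the chain rule node by node
            topo, seen = [], set()
            def build(v):
                if v not in seen:
                    seen.add(v)
                    for c in v._children:
                        build(c)
                    topo.append(v)
            build(self)
            self.grad = 1.0
            for v in reversed(topo):
                v._backward()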

If you were writing in C instead of Python and used something like llama.cpp's optimization tricks, you could probably get somewhat acceptable training performance in 2 or 3 KLOC. You'd still be off by one or two orders of magnitude compared to a GPU cluster, but a lot better than naive, loopy Python.

[1] https://github.com/karpathy/llama2.c

[2] https://github.com/karpathy/micrograd


And this also assumes base models. The breakthrough was the instruct/chat models, which again require much more code than 500 lines of plain training.


It's like claiming a mod on top of an existing game makes it a new game, in a few lines of code.

It can, but obviously it's the engine and existing code doing all the work.


What are you talking about? Fine tunes are basically just more of the same training, optionally on selected layers for efficiency.
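
Concretely, something like this illustrative PyTorch snippet (where `model.blocks` and `model.head` are assumed attribute names, not a real API): freeze everything, unfreeze the layers you care about, then reuse the same training loop.

    # Illustrative layer-selective fine-tuning: freeze everything, then unfreeze
    # a few layers. `model.blocks` / `model.head` are assumed attribute names.
    import torch

    for p in model.parameters():
        p.requires_grad = False
    for p in list(model.blocks[-1].parameters()) + list(model.head.parameters()):
        p.requires_grad = True      # only these receive gradients and updates

    opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad),
                            lr=1e-5)
    # ...then run exactly the same training loop as in pretraining.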


RLHF or DPO are definitely not just the same thing as the basic torch training loop, hence my many more lines of code argument.
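
To be clear, the DPO loss function itself is tiny; the extra code is the paired preference data pipeline, the frozen reference model, and the sampling/evaluation around it. A sketch of the objective (variable names here are my own assumptions):

    # Sketch of the DPO objective (Rafailov et al., 2023). Inputs are assumed to
    # be summed log-probs of chosen/rejected responses under the policy being
    # trained and under a frozen reference model; the surrounding plumbing is
    # where most of the extra code lives.
    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        margin = ((logp_chosen - ref_logp_chosen)
                  - (logp_rejected - ref_logp_rejected))
        return -F.logsigmoid(beta * margin).mean()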


I don't think it's misleading, and the responses about exactly how much code you can get away with are IMO missing the point.

The point is not about the total stack size or anything of the sort, it's that they didn't require much new stuff to be built. They're tools that can answer questions, role play, translate, write code and more and the architecture isn't a huge new system. It's not "First we do this, then we encode like this, then we perform a graph search, then we rank, then we decide which subsystem to start, then we start the custom planner, then a custom iteration over..."

It's not a large, unique thing only one group knows how to make.

> Otherwise you could also say it's 10 lines of fastAI/lightning style code where almost everything is hidden by the APIs.

Sure!


I don't get why people find those numbers of lines meaningful. The Mistral code you linked starts with 13 import statements. If they wanted they could have packed all those 369 lines into a separate package, and imported that into a two-line source file. It wouldn't be any bit more impressive, though, and in reality the codebase would now actually be larger.


I mean, yes, that's trivially true. But it's just shorthand for “this is an eminently readable codebase and it's not gonna take you like a year to understand all of it”.


This is an interesting point of fact about what Boltzmann Brains could be.


Kinda, I guess?

Boltzmann brains are likely to be wrong about all of their beliefs owing to them being created by an endless sequence of random dice rolls that eventually makes atoms that can think, so it's fine if they are silicon chips that think they're wet organic bodies (amongst other things).

A computer holding, say, 175e9 parameters of 8 bits each (one of 2^((175e9)*8) possible states) is simpler than a human brain, so it is more likely to be produced by this process.

But the 500 line file is a red herring.


I mean to say that since so few bits are required to encode perception, the universe in which Boltzmann Brains exist is teeming with intelligence.

Poetically, an intelligent universe seems closer to reality thanks to this new observation.


Vastly and incomprehensibly closer to reality while still a rounding error from zero because the initial number was just that small.


No it's not. Boltzmann Brains are not platonic forms. You need to conjure an entire physical world to contain a Python interpreter to run your 500 lines of code on.


Not to mention the odds of dreaming up an entire existence out of the blue, with no base reality for training data


I wonder how that compares to the odds of all the things that had to happen in order for you and I to exist and be here, today, thinking about this?


A Boltzmann brain requires a similar underlying reality except with no evolution traceability for the brain's origin.

Not requiring a world around the brain does not make it more likely, but less: the probability of a brain given that worlds exist, multiplied by the probability of worlds, is still much higher than the probability of a brain regardless of the existence of worlds.


You can’t even be certain that you aren’t a Boltzmann brain. If you could prove it (either way), it would be a publishable paper and could probably be the basis of a PhD for yourself.


Boltzmann brain is just one of the theoretically infinite combinations of matter that can be randomly conjured out of the quantum chaos (with a very low probability). Our entire visible universe could be a "Boltzmann Universe", and we'd have no way of knowing it.


If the Wikipedia page on Boltzmann brain is accurate, then the probability is surprisingly (but relatively) high.

> The Boltzmann brain gained new relevance around 2002, when some cosmologists started to become concerned that, in many theories about the universe, human brains are vastly more likely to arise from random fluctuations; this leads to the conclusion that, statistically, humans are likely to be wrong about their memories of the past and in fact be Boltzmann brains. When applied to more recent theories about the multiverse, Boltzmann brain arguments are part of the unsolved measure problem of cosmology.



