Implementing a ChatGPT-like LLM from scratch, step by step (github.com/rasbt)
739 points by rasbt 9 months ago | 98 comments



As an additional resource, I'm writing a guidebook, though it's in various stages of completion.

The fine-tuning guide is the best resource so far: https://ravinkumar.com/GenAiGuidebook/language_models/finetu...


Such a great source of information. Thank you.


Of course! If there's anything in particular you're interested in or a topic you want me to cover, let me know. This tech is powerful by itself. Hoping to empower people with knowledge of how all this works too :)


This looks amazing @rasbt! Out of curiosity, is your primary goal to cultivate understanding and demystify, or to encourage people to build their own small models tailored to their needs?


I'd say my primary motivation is an educational goal, i.e., helping people understand how LLMs work by building one. LLMs are an important topic, and there are lots of hand-wavy videos and articles out there -- I think if one codes an LLM from the ground up, it will clarify lots of concepts.

Now, the secondary goal is, of course, also to help people build their own LLMs if they need to. The book will code the whole pipeline, including pretraining and finetuning, but I will also show how to load pretrained weights because I don't think pretraining an LLM from scratch is feasible from a financial perspective. We are coding everything from scratch in this book using a GPT-2-like LLM (so that we can load the weights for models ranging from the 124M-parameter version that runs on a laptop to the 1558M-parameter version that runs on a small GPU). In practice, you probably want to use a framework like HF transformers or axolotl, but I hope this from-scratch approach will demystify the process so that these frameworks are less of a black box.
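For illustration (this is not the book's code, just the standard HF transformers API), loading those same pretrained GPT-2 checkpoints in practice looks roughly like this:

    # A minimal sketch of loading pretrained GPT-2 weights via HF transformers.
    # "gpt2" is the 124M-parameter model; "gpt2-xl" is the 1558M-parameter one.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    inputs = tokenizer("Every effort moves you", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(out[0]))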


While pretraining a decent-sized LLM from scratch is not financially feasible for the average person, it is very much feasible for the average YC/VC backed startup (ignoring the fact that it's almost always easier to just use something like Mixtral or LLaMa 2 and fine-tune as necessary).

>Introducing MPT-7B, the first entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k

https://www.databricks.com/blog/mpt-7b


Thanks for such a thoughtful response. I'm building with LLMs, and do feel uncomfortable with my admittedly hand-wavy understanding of the underlying transformer architecture. I've ordered your book and look forward to following along!


Thanks for your support, I hope you'll get something useful out of this book!


Honestly, I already have—the overview of PyTorch in the Appendix finally made a few things click for me!


Glad to hear! I thought hard about whether to write an intro to PyTorch for this book and am glad that it was useful!


Hi Rasbt, thanks for writing the new guide and the upcoming book on LLMs, another must-buy book from Manning.

Just wondering, are you going to include any specific section or chapter in your LLM book on RAG? I think it would be a very welcome addition for the build-your-own-LLM crowd.


This is a good point. It's currently not in the TOC, but I may add this as supplementary text.


Semi-related, as long as we're requesting things: to @pr337h4m's point above, it would be interesting to have some rough guidance (even a sidebar or single paragraph) on when it makes sense to pre-train a new foundation model vs finetune vs pass in extra context (RAG). Clients of all sizes—from Fortune 100 to small businesses—are asking us this question.


That's a good point. I may briefly mention RAG-like systems and add some literature references, but I am a bit hesitant to give general advice because it's heavily project-dependent in my opinion. It usually also comes down to what form the client's data is in and whether referencing a database or documentation is desired or not. The focus of chapters 6+7 is also instruction-finetuning and alignment rather than finetuning for knowledge; the latter goal is best achieved via pretraining (as opposed to finetuning), imho. In any case, I just read an interesting case study last week on finetuning vs RAG that might come in handy: "RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture" (https://arxiv.org/abs/2401.08406)


Writing a technical book in public is a level of anxiety I can’t imagine, so kudos to the author!


It kind of is, but it's also kind of motivating :)


It's actually less risky. The author may be able to reap the benefits of writing a book without actually finishing it. Ideally, maybe not much more than Chapter 1.


I'd say that I've finished all of my previous books, and I have no intention of doing anything different here. Of course, there's always the chance that I get run over by a bus or equivalent, but in that case, I assume that Manning would find a replacement (as per contract) who finishes the book. I don't think there are any benefits to be reaped from not finishing.


[flagged]


What refund? Nobody even mentioned a transaction taking place. The contents are on a github repo. Manning has zero risk here, except that if they keep getting burned by half finished books they may want to re-evaluate these contracts.


And here's the crazy one explaining to Manning how to do business. Incredible.


  import torch
From the first code sample, not quite from scratch :-)


Lol ok, otherwise it would probably not be very readable due to the verbosity. The book shows how to implement LayerNorm, Softmax, Linear layers, GELU, etc. without using the pre-packaged torch versions, though.
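For a flavor of what that looks like, here is a minimal from-scratch LayerNorm sketch in PyTorch (not necessarily identical to the book's implementation):

    # A minimal from-scratch LayerNorm, using only basic tensor ops.
    import torch
    import torch.nn as nn

    class LayerNorm(nn.Module):
        def __init__(self, emb_dim, eps=1e-5):
            super().__init__()
            self.eps = eps
            self.scale = nn.Parameter(torch.ones(emb_dim))
            self.shift = nn.Parameter(torch.zeros(emb_dim))

        def forward(self, x):
            # Normalize over the last (embedding) dimension.
            mean = x.mean(dim=-1, keepdim=True)
            var = x.var(dim=-1, keepdim=True, unbiased=False)
            return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift

    x = torch.randn(2, 4, 8)        # (batch, tokens, embedding dim)
    print(LayerNorm(8)(x).shape)    # torch.Size([2, 4, 8])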


Automatic differentiation is why we are able to have complex models like transformers; it's arguably the key reason (along with large amounts of data and massive compute resources) for the AI revolution we have today.

Nobody working in this space is hand calculating derivatives for these models. Thinking in terms of differentiable programming is a given and I think certainly counts as "from scratch" in this case.
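As a tiny illustration of what that buys you in PyTorch terms (a hypothetical toy loss, nothing model-specific):

    # No derivative is written by hand; backward() computes it from the recorded graph.
    import torch

    w = torch.tensor(2.0, requires_grad=True)
    x = torch.tensor(3.0)
    loss = (w * x - 1.0) ** 2   # a toy scalar "loss"

    loss.backward()             # autograd computes d(loss)/dw
    print(w.grad)               # tensor(30.) since 2 * (w*x - 1) * x = 2*5*3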

Any time I see someone post a comment like this, I suspect they don't really understand what's happening under the hood or how contemporary machine learning works.


> Thinking in terms of differentiable programming is a given and I think certainly counts as "from scratch" in this case.

I have to disagree on that being an obvious assumption for the meaning of "from scratch", especially given that the book description says that readers only need to know Python. It would be like reading "Crafting Interpreters" only to find that step one is to download Lex and Yacc because everyone working in the space already knows how parsers work.

> I suspect the don't really understand what's happening under the hood or how contemporary machine learning works.

Everyone has to start somewhere. I thought I would be interested in a book like this precisely because I don't already fully understand what's happening under the hood, but it sounds like it might not actually be a good starting point for my idea of "from scratch."


On that note, I have a relatively comprehensive intro to PyTorch in the Appendix (~40 pages) that goes over automatic differentiation etc.

The alternative, if you want to build something truly from scratch, would be to implement everything in CUDA, but that would not be a very accessible book.


“If you wish to make an apple pie from scratch, you must first invent the universe.” -- Carl Sagan


It depends on which hood you want to look under.

Let's say you wanted to write your own SSH client as a learning exercise. Is it cheating if you use OpenSSL? Is it cheating if you use Python? Is it cheating if you use a C compiler?


Oh, I see you're using an existing ISA and not creating your own for this. And also, where do you get off using existing integrated circuits. From scratch means you have to start from sand and make your own nand gates and get to an adder and a latch and then a cpu and write an operating system for it before you can get to using language that you invented for this purpose.


Nobody writes code in terms of NANDs, but there is the Nand to Tetris course ("The Elements of Computing Systems: Building a Modern Computer from First Principles") https://www.nand2tetris.org

PyTorch to LLMs has a lot to show even without the Python-to-PyTorch part. It reminds me of Andrej Karpathy's "Neural Networks: Zero to Hero" https://m.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9... Prerequisites: solid programming (Python), intro-level math (e.g., derivatives, Gaussians). https://karpathy.ai/zero-to-hero.html



I’m very comfortable with AI in general but not so much with Machine Learning. I understand transformers are a key piece of the puzzle that enables tools like LLMs, but I don't know much about them.

Do you (or others) have good resources explaining what they are and how they work at a high level?


I'd say Chapter 1 would be the high-level intro to transformers and how they relate to LLMs.


I don't think implementing autograd is relevant or in-scope for learning about how transformers work (or writing out the gradient for transformer by hand, I can't even imagine doing that).


To code from scratch you should first fab your own semiconductors


They should probably

    import universe 
first.


at least it wasn't

   from transformers import


I jumped to GitHub thinking this would be a free resource (with all due respect to the author's work).

What free resources are available and recommended in the "from scratch" vein?


Neural Networks: Zero to Hero[1] by Andrej Karpathy

[1] https://karpathy.ai/zero-to-hero.html


+1, Andrej is an amazing educator! I'd also recommend his https://youtu.be/kCc8FmEb1nY?si=mP0cQlQ4rcceL2uP and checking out his GitHub repos. minGPT, for example, implements a small GPT model that's compatible with the HF API, whereas the more modern nanoGPT shows how to use newer features such as flash attention. The quality of every video and every blog post is just so high.
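For context, the "flash attention" part boils down to roughly one fused PyTorch call (a sketch, assuming a recent PyTorch version that provides scaled_dot_product_attention):

    # A single fused call instead of the manual softmax(QK^T/sqrt(d))V.
    import torch
    import torch.nn.functional as F

    q = torch.randn(1, 8, 16, 64)   # (batch, heads, tokens, head_dim)
    k = torch.randn(1, 8, 16, 64)
    v = torch.randn(1, 8, 16, 64)

    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)                # torch.Size([1, 8, 16, 64])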


https://jaykmody.com/blog/gpt-from-scratch/ for a GPT-2 inference engine in NumPy

then

https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-k... for adding a kv cache implementation
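The KV-cache idea itself fits in a few lines; here's a toy sketch (not the linked post's code, and the q/k/v projections are just stand-ins):

    # Keys and values for past tokens are cached so each decoding step only
    # computes attention for the newest token.
    import torch
    import torch.nn.functional as F

    d = 64
    k_cache = torch.empty(0, d)   # grows by one row per generated token
    v_cache = torch.empty(0, d)

    for step in range(5):
        x = torch.randn(1, d)                 # embedding of the newest token
        q, k, v = x, x, x                     # stand-in for the real projections
        k_cache = torch.cat([k_cache, k])     # append to the cache
        v_cache = torch.cat([v_cache, v])
        attn = F.softmax(q @ k_cache.T / d**0.5, dim=-1)
        out = attn @ v_cache                  # (1, d) output for the new token
    print(out.shape)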


I'd like to add that most of these texts only cover the inference part. This book (I also purchased the draft version) has training and finetuning in the TOC, so I assume it will include material on how to do training and finetuning from scratch.


I'd go with https://course.fast.ai/

It's much more accessible to regular developers, and doesn't make assumptions about any kind of mathematics background. It's a good starting point after which other similar resources start to make more sense.


I honestly cannot fathom why anyone working in the AI space would find $50 too much to spend to gain a deeper insight into the subject. Creating educational materials requires an insane amount of work, and I can promise that no matter how successful this book is, if rasbt were to do the math on income generated over hours spent creating it, it wouldn't make sense as an hourly rate.

Plenty of other people have this understanding of these topics, and you know what they chose to do with that knowledge? Keep it to themselves and go work at OpenAI to make far more money keeping that knowledge private.

If you want to live in a world where this knowledge is open, at the very least refrain from publicly complaining about a book that cost roughly the same as a decent dinner.


Yeah, I don't think creating educational materials makes sense from an economic perspective, but it's one of my hobbies that gives me joy for some reason :). Hah, and 'insane amount of work' is probably right -- lots of sacrifices to carve out that necessary time.


> anyone working in the AI space

I would have expected the main target audience to be people NOT working in the AI space, that don’t have any prior knowledge (“from scratch”), just curious to learn how an LLM works.


Not talking about affordability but about following links thinking that I would find another kind of resource. Beyond this case, this happens all the time with click-baity content. Again, if the link were to Amazon or the publisher, it would be clearly associated with a product, while GitHub is associated with open-source content. Not being pedantic, just an observation from browsing the web.


I would add that I can find $5k worth of useful resources at USD 50 each. It is not the individual item but all of them in context.

Personally, I am not focused on a specific topic such as LLMs but work on a spectrum of topics, more akin to an analyst job plus broad research skills.


Not to be pedantic, but in this case it's probably 30 usd for print and ebook (there are always coupons on the manning website).


I added notes to the Jupyter notebooks, I hope they are also readable as standalone from the repo.


Can I use any of the information in this book to learn about reinforcement learning?

My goal is to have something learn to land, like a lunar lander. Simple, start at 100 feet, thrust in one direction, keep trying until you stop making craters.

Then start adding variables, such as now it's moving horizontally, adding a horizontal thruster.

Next, remove the horizontal thruster and let the lander pivot.

Etc.

I just have no idea how to start with this, but this seems "mainstream" ML, curious if this book would help with that.


I enjoyed "Grokking Deep Reinforcement Learning"[0]. It doesn't include anything about transformers though. Also, see Python's gymnasium[1] library for a lunar lander environment, it's the one I focused on most while I was learning and I've solved it a few different ways now. You can also look at my own notebook I used when implementing Soft Actor Critic with PyTorch not too long ago[2], it's not great for teaching, but maybe you can get something out of it.

[0]: https://www.manning.com/books/grokking-deep-reinforcement-le... [1]: https://gymnasium.farama.org/environments/box2d/ [2]: https://github.com/DevJac/learn-pytorch/blob/main/SAC.ipynb
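For reference, a minimal random-action loop against that LunarLander environment looks roughly like this (assuming gymnasium with the box2d extra installed; the environment ID may be v2 or v3 depending on your version):

    # Random actions only, no learning: pip install "gymnasium[box2d]"
    import gymnasium as gym

    env = gym.make("LunarLander-v2")
    obs, info = env.reset(seed=0)
    total_reward = 0.0
    for _ in range(1000):
        action = env.action_space.sample()   # replace with a learned policy
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    print(total_reward)   # a random policy usually makes a crater (large negative)
    env.close()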


Reinforcement learning is an entirely separate area of research from LLMs, and while it's often seen as part of ML (Tom Mitchell's classic Machine Learning has a great section on Q-learning, even if it feels a bit dated in other areas), it has little to do with contemporary ML work. Even with things like AlphaGo, what you find is basically work on using deep neural networks as an input to classic RL techniques.

Sutton and Barto's Reinforcement Learning: An Introduction is widely considered the definitive intro to the topic.


Sorry, in that case I would rather recommend a dedicated RL book. The RL part in LLMs will be very specific to LLMs, and I will only cover what's absolutely relevant in terms of background info. I do have a longish intro chapter on RL in my other general ML/DL book (https://github.com/rasbt/machine-learning-book/tree/main/ch1...) but like others said, I would recommend a dedicated RL book in your case.



This is a good and short introduction to RL. The density of the information in Spinning Up was just right for me and I think I've referred to it more often than any other resource when actually implementing my own RL algorithms (PPO and SAC).

If I had to recommend a curriculum to a friend I would say:

(1) Spend a few hours on Spinning Up.

(2) If the mathematical notation is intimidating, read Grokking Deep Reinforcement Learning (from Manning), which is slower paced and spends a lot of time explaining the notation itself, rather than just assuming the mathematical notation is self-explanatory as is so often the case. This book has good theoretical explanations and will get you some running code.

(3) Spend a few hours with Spinning Up again. By this point you should be a little comfortable with a few different RL algorithms.

(4) Read Sutton's book, which is "the bible" of reinforcement learning. It's quite approachable, but it would be a bit dry and abstract without some hands-on experience with RL I think.


That's exactly what the Q-learning lab in this course does:

https://www.ida.liu.se/~TDDC17/info/labs/rl.en.shtml


This book seems to focus on large language models, for which RLHF is sometimes a useful addition.

To learn more about RL, most people would advise the Sutton and Barto book, available at: http://incompleteideas.net/book/the-book-2nd.html


I would recommend this as a second book after reading a "cookbook" style book that is more focused on getting real code working. After some hands-on experience with RL (whether you succeed or fail), Sutton's book will be a lot more interesting and approachable.


How does this compare to the karpathy video [0]? I'm trying to get into LLMs and am trying to figure out what the best resource to get that level of understanding would be.

[0] https://www.youtube.com/watch?v=kCc8FmEb1nY


Haven't fully watched this but from a brief skimming, here are some differences that the book has:

- it implements a real word-level LLM instead of a character-level LLM

- after pretraining also shows how to load pretrained weights

- instruction-finetune that LLM after pretraining

- code the alignment process for the instruction-finetuned LLM

- also show how to finetune the LLM for classification tasks

- the book overall has lots of figures; Chapter 3 alone has 26 :)

The video looks awesome though. I think it's probably a great complementary resource to get a good solid intro because it's just 2 hours. I think reading the book will probably be more like 10 times that time investment.


Thank you for the answer! What is the knowledge that your book requires? If I have a lot of software dev experience and sorta kinda remember algebra from uni, would it be a good fit?


Good question. I think a Python background is strongly recommended. PyTorch knowledge would be nice to have (although I've written a comprehensive 40-page intro for the Appendix, which is also already available). From a math perspective, I think it should be gentle. I'm introducing dot products in Chapter 3, but I also explain how you could do the same with for-loops. Same with matrix multiplication. I'm bad at estimating requirements, but I hope this should be sufficient.
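To illustrate the dot-product point with a toy example (not the book's code):

    # The same dot product written as an explicit for-loop and as one tensor op.
    import torch

    a = torch.tensor([1.0, 2.0, 3.0])
    b = torch.tensor([4.0, 5.0, 6.0])

    result = 0.0
    for a_i, b_i in zip(a, b):   # the explicit for-loop version
        result += a_i * b_i
    print(result)                # tensor(32.)

    print(torch.dot(a, b))       # tensor(32.), the vectorized equivalent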


You can't understand it unless you already know most of the stuff.

I've watched it many times to understand well most of it.

And obviously you must already know PyTorch really well, including matrix multiplication, backpropagation, etc. He speaks very fast, too...


Did you really watch all videos in the playlist? I am at video 4 and had no background in PyTorch or numpy.

In my opinion he covers everything needed to understand his lectures. Even broadcasting and multidimensional indexing with numpy.

Also in the first lecture you will implement your own python class for building expressions including backprop with an API modeled after PyTorch.

IMHO it is the second lecture series I can recommend without hesitation. The other is Gilbert Strang on linear algebra.


To echo this sentiment, I thought he does a really reasonable job of working up to the topic. Sure, it is fast paced, but it is a video you can rewind, plus play with the notebooks.

There is a lot to learn, but I think he touches on all of the highlights which would give the viewer the tools to have a better understanding if they want to explore the topic in depth.

Plus, I like that the videos are not overly polished. He occasionally makes a minor blunder, which really humanizes the experience.


I was talking about the last video. It's difficult unless you already know most of the material or have watched the other videos in the series.

Anyway those videos are quite advanced. Surely not for beginners.


He has like 4 or 5 videos that can be watched before that one where all of that is covered. He goes over stuff like writing back prop from scratch and implementing layers without torch.


I know... That material isn't for beginners.


...but then, what material did you expect as a beginner?


How can Karpathy's videos be considered for beginners when you have to know programming, Python, PyTorch, matrix multiplication, derivatives...


Question for the author:

I'm not interested in language models specifically, but there are techniques involved with language models I would like to understand better and use elsewhere. For example, I know "attention" is used in a variety of models, and I know transformers are used in more than just language models. Will this book help me understand attention and transformers well enough that I can use them outside of language models?


The attention mechanism we implement in this book* is specific to LLMs in terms of the text inputs, but it's fundamentally the same attention mechanism that is used in vision transformers. The only difference is that in LLMs, you turn text into tokens and convert these tokens into vector embeddings that go into the LLM. In vision transformers, you instead use an image patch as a token and turn those patches into vector embeddings (a bit hard to explain without visuals here). In both the text and vision contexts, it's the same attention mechanism, and in both cases it receives vector embeddings.

(*Chapter 3, already submitted last week and should be online in the MEAP soon, in the meantime the code along with the notes is also available here: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01...)
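As a compact illustration of that point (a sketch with random stand-in projection matrices, not the chapter's exact code, which adds trainable weights and a causal mask):

    # Single-head scaled dot-product attention on a batch of vector embeddings;
    # whether they come from text tokens or image patches, the math is the same.
    import torch

    torch.manual_seed(123)
    embeddings = torch.randn(6, 16)          # 6 tokens (or image patches), dim 16

    W_q = torch.randn(16, 16)                # stand-ins for trainable projections
    W_k = torch.randn(16, 16)
    W_v = torch.randn(16, 16)

    Q, K, V = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v
    scores = Q @ K.T / K.shape[-1] ** 0.5    # scaled dot products
    weights = torch.softmax(scores, dim=-1)  # attention weights, rows sum to 1
    context = weights @ V                    # one context vector per input
    print(context.shape)                     # torch.Size([6, 16])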


The model architecture itself is really not too complex, especially with torch. The whole process is pretty straightforward. Nice feasible project.


FYI, this probably qualifies as a "Show HN:"


Bought a copy! Your posts and newsletter content have been such a huge inspiration for me throughout 2023 - good luck, this is a huge effort!


thanks for the kind words!


As it's still a work in progress, may I make a suggestion? It would be nice if you went beyond what others have already published and added more details, like different positional encodings, MoE, decoding methods, and tokenization. As it's educational, ease of use should be a priority, of course.


Thanks, comparing positional encodings, MoEs, KV caches, etc. are all good topics that I have in mind for either supplementary material or a follow-up book. The reason it probably won't land in this current book is the length and timeline. It's already going to be a big book as it is (400-500 pages), and I also want to be a bit mindful of the planned release date. However, these are indeed good suggestions.


Bought a copy! Looking forward to reading it. :)

Is there a way for readers to give feedback on the book as you write it?


Thanks for the support! There's the official Manning Forum for the book, but you are also welcome to use the Discussions page on the GitHub page.


The book's forum on Manning.


Wow, great info. Thanks for sharing.


Looks like just the kind of book I'd want to read. I bought a copy :)


Glad to hear and thanks for the support. Chapter 3 should be in the MEAP soonish (submitted the draft last week). Will also upload my code for chapter 4 to GitHub soonish, in the next couple of days, just have to type up the notes.


Purchased the book. Really excited to read it!


Thanks! And please don't hesitate to reach out via the Forum or the GitHub Discussions if you have any feedback or questions.


How was the process of pitching to Manning?


That was pretty smooth. They reached out to ask whether I was interested in writing a book for them (probably because of my other writings online), I mentioned what kind of book I wanted to write, submitted a proposal, and they liked the idea :)


Nowadays anyone can probably put together a good book about this topic by using an LLM.


Thank you for this endeavour.

Do you have an ETA for the completion of the book?


The ETA for the last chapter is August if things continue to go well. It's usually available in the MEAP a few weeks after that, some time in September. And print version should be available early 2025 I think.


I'll definitely buy it once released.

In the meantime, do you know any other free/paid resource that comes close to what you are trying to achieve with this book?


Unfortunately, I am not aware of any other resource that delves into these topics. However, as others commented above, Karpathy has a 2h YouTube video that is probably worthwhile watching. Based on skimming the YT video, it has some overlap with chapters 3 & 4, but the book has a much larger scope.

I am not sure how to link to other comments on HN, so let me just copy & paste it here:

> How does this compare to the karpathy video [0]? I'm trying to get into LLMs and am trying to figure out what the best resource to get that level of understanding would be. [0] https://www.youtube.com/watch?v=kCc8FmEb1nY

> Haven't fully watched this but from a brief skimming, here are some differences that the book has: - it implements a real word-level LLM instead of a character-level LLM - after pretraining also shows how to load pretrained weights - instruction-finetune that LLM after pretraining - code the alignment process for the instruction-finetuned LLM - also show how to finetune the LLM for classification tasks - the book it overall has a lots of figures. For Chapter 3, there are 26 figures alone :) The video looks awesome though. I think it's probably a great complementary resource to get a good solid intro because it's just 2 hours. I think reading the book will probably be more like 10 times that time investment.


Bought a copy. Good luck rasbt!


Thanks :)


Is the code for chapters 4 through 8 missing?


It's in progress still. I have most of the code working, but it's not organized into the chapter structure, yet. I am planning to add a new chapter every ~month (I wish I could do this faster, but I also have some other commitments). Chapter 4 will be either uploaded by the end of this weekend or by the end of next weekend.


Depending on your level, it could take many weeks to go through the already available material (code and PDF), so I'd suggest purchasing it anyway... It makes no sense to wait until the end if you're interested in the subject.



