I'm doing an ML apprenticeship [1] these weeks and Karpathy's videos are part of it. We've gone deep into them, and I've found them excellent. Every concept he illustrates is crystal clear in his mind (even though the concepts themselves are complicated), and that shows in his explanations.
Also, the way he builds everything up is magnificent: starting from basic Python classes, to derivatives and gradient descent, to micrograd [2], and then from a bigram counting model [3] to makemore [4] and nanoGPT [5].
The website doesn't say what is, for me, the best thing about it. The course is peer-led, and it works like this: once you join, you're part of a team whose single objective is to get the best score with your ML recommendation system.
There is a simulated environment in which all teams in the cohort receive millions of requests per day (across hundreds of thousands of users and items), and you have to build out your infrastructure on an EC2 instance, build a basic model, and then iteratively improve on it. Imagine a simulated Facebook/YouTube/TikTok-style system where you aim for the best uptime and the best recommendations!
Something I discovered not long ago, and wish I had discovered years earlier, is to watch the video first and then code along on a second pass. So simple, but it makes a world of difference: you can skip the errors and the fluff, and you can see what's coming next. It sounds like it amounts to watching two 2-hour videos, but it pays off in how much of the content you actually absorb and retain.
I've found that actually running the code is very beneficial for understanding, along with reasoning through each line of code, spending a lot of time with the video paused, and discussing and explaining to each other what we understood.
This was the first time I actually grokked backpropagation; the first video alone is more lucid and valuable than any other resource about machine learning I had seen before. In fact, it's so well explained that I managed to implement the library almost completely from memory after watching it. I cannot recommend it highly enough, especially for programmers without a math background!
The only aspect I could see being non-ideal for some is that it uses some Python-specific cleverness and advanced syntax and semantics (__call__(), list comprehensions with two for's, **kwargs, __add__, __repr__, subclasses, (nested) functions as variables, etc.), but if you are familiar with these, it might well read as compact and elegant instead.
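For a taste, here's a hypothetical sketch (my own toy example, not the course's actual code) that packs most of those idioms into one micrograd-style layer:

```python
# Hypothetical micrograd-style sketch of the idioms mentioned above.
class Module:                                   # base class for subclassing
    def parameters(self):
        return []

class Neuron(Module):
    def __init__(self, n_inputs, **kwargs):     # **kwargs passes options through
        self.nonlin = kwargs.get("nonlin", True)
        self.w = [0.0] * n_inputs
        self.b = 0.0

    def __call__(self, x):                      # lets you call the object: n(x)
        act = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return max(0.0, act) if self.nonlin else act

    def __repr__(self):                         # controls how the object prints
        return f"Neuron({len(self.w)})"

    def parameters(self):
        return self.w + [self.b]

class Layer(Module):
    def __init__(self, n_in, n_out, **kwargs):
        self.neurons = [Neuron(n_in, **kwargs) for _ in range(n_out)]

    def __call__(self, x):
        return [n(x) for n in self.neurons]

    def parameters(self):
        # a list comprehension with two for's, flattening the nested structure
        return [p for n in self.neurons for p in n.parameters()]

layer = Layer(3, 2, nonlin=False)
print(layer([1.0, 2.0, 3.0]))                   # [0.0, 0.0] with zero weights
print(layer.neurons)                            # __repr__: [Neuron(3), Neuron(3)]
```

Dense, but once the dunder methods click, the whole library reads like math rather than plumbing.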
I find it genuinely stunning how ugly Python gets in these projects. Almost immediately, even in a toy project (and I'm not talking about tinygrad, which is deliberately super dense).
My hive mind connection must be good because I literally finished this course yesterday.
It was very satisfying to learn how transformers work, to finally be able to turn the obscure glyphs of the research papers into real code, but I think transformers are too big for what I can do on my own computer. The author mentioned that the toy transformer he was building in the final video took 15 minutes to train on his A100 (a $10,000 GPU), and the results weren't even that good; the transformer was spelling words correctly using character-level tokens. I guess that's something, but it's not GPT-4.
Even so, there were a lot of good tips to pick up along the way. This is a great series that I'm thankful to have. The "Backprop Ninja" video was hard work: you manually calculate the gradients and then compare your calculations against PyTorch's. It's great to have instant feedback telling you whether your gradients are correct or not.
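For anyone curious what that exercise feels like, here's a toy version (my own example, not the lecture's): derive the gradients by hand with the chain rule, then check them against autograd.

```python
import torch

# Forward pass through a tiny expression, with autograd enabled.
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-3.0, requires_grad=True)
c = torch.tensor(1.5, requires_grad=True)

y = torch.tanh(a * b + c)       # y = tanh(u), u = a*b + c
y.backward()                    # autograd fills in a.grad, b.grad, c.grad

# Manual backprop via the chain rule: dy/dx = (1 - tanh^2(u)) * du/dx
u = a * b + c
dy_du = 1 - torch.tanh(u) ** 2
da = dy_du * b                  # du/da = b
db = dy_du * a                  # du/db = a
dc = dy_du * 1.0                # du/dc = 1

for manual, auto in [(da, a.grad), (db, b.grad), (dc, c.grad)]:
    print(torch.allclose(manual, auto))   # instant feedback: should print True
```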
I made a smaller GPT model that started from Andrej's code that converges to a decent loss in a short amount of time on an A100 -- just under 2.5 minutes or so: https://github.com/tysam-code/hlb-gpt
With the original hyperparameters it took 30-60 minutes; with a pruned-down network and adjusted hyperparameters, about 6 minutes; and a variety of optimizations beyond that brought it down further.
If you want the basically feature-identical (but pruned-down) nanoGPT version, the 0.0.0 release at ~6 minutes is your best bet.
You can get A100s cheaply and securely through Colab or LambdaLabs.
True, as much as I enjoy owning and controlling my own hardware, buying an A100 and then letting it sit idle while I procrastinate and play video games probably isn't the best use of resources. He did say "my GPU" (or similar) at one point, and I thought maybe he does enough ML stuff that he bought his own.
What I appreciate about Karpathy's videos is that they don't make things any more complicated than they need to be. Simple engineering language is used. No gatekeeping! It's reassuring, and it lets everyone know that anyone can do it.
He defines it pretty clearly. Logits are the inputs to a softmax layer/calculation, which turns the logits into normalized percentages (percentages that sum to 1.0).
Before going through the softmax layer, the logits will probably be smallish numbers around 0, something like [2.89, -4.53, 0.24, -1.556, 0.57]. Logits like this are the natural output of a neural network, because they can be any real number and everything still works.
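To make that concrete, here's what softmax does to those example logits (a quick PyTorch sketch):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.89, -4.53, 0.24, -1.556, 0.57])  # the example above
probs = F.softmax(logits, dim=0)   # exp(logits) / exp(logits).sum()
print(probs)                       # every entry in [0, 1] ...
print(probs.sum())                 # ... and the entries sum to 1.0
```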
When people mention logits, they're usually referring to the raw output of the model before it gets transformed/normalised into a probability distribution (i.e., values in [0, 1] that sum to 1). Logits can take any value. The naming might not be mathematically strict, since it assumes you're going to apply softmax (which interprets the output of the model as logits), but that's how the term is used.
For example, in many classification problems you get a 1D vector of logits from the final layer; you apply softmax to normalise, then argmax to extract the predicted class. This extends to other tasks like semantic segmentation (predicting a class per pixel), where the "logit" output has the same spatial size as the image with a channel for each class, and you apply the same process to get a single-channel image with a class per pixel.
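As a rough sketch of both cases (the shapes and class counts here are made up for illustration):

```python
import torch

# Classification: one logit per class.
logits = torch.randn(10)                      # hypothetical 10-class problem
pred_class = logits.softmax(dim=0).argmax()   # softmax is monotonic, so
                                              # argmax(logits) picks the same class

# Semantic segmentation: per-pixel logits, one channel per class.
seg_logits = torch.randn(10, 64, 64)          # (classes, height, width)
pred_mask = seg_logits.argmax(dim=0)          # (64, 64) map of class-per-pixel
print(pred_class.item(), pred_mask.shape)
```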
Honestly, what cracked logits for me was a conversation with ChatGPT in which I gave it my professional background, areas of strength and weakness, and problem context, and had it explain the concept to me. I then went elsewhere to make sure I hadn’t been lied to. I’ve found ChatGPT such an invaluable learning tool when used in this way.
I'm confused by the comments here, does "logit" here not mean "log odds" like it does in virtually every other context related to machine learning?
Generally I'm a huge fan of not getting too caught up in theory before diving into practice, but I'm seeing multiple responses to this comment without a single mention of "log odds".
The logit function transforms probabilities into the log of the odds, ln(P(X) / (1 - P(X))), which is important because it maps probabilities onto an unbounded linear scale, which they do not have in their standard [0, 1] form. It's the foundation of logistic regression, which is, despite much misinformation, quite literally linear regression with a transformed target.
The logistic function is the inverse of the logit: it turns log-odds values back into probabilities. Logistic regression actually transforms the model, not the target (most of the time), because the labels 0 and 1 map to negative and positive infinity under the logit, which linear regression can't handle (so we transform the model using the inverse instead).
I don't think I can stress enough how important it is to really understand logistic regression (which is also the basic perceptron) before diving into neural networks (which are really just an extension of logistic regression).
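To make that connection concrete, here's a small sketch (my own toy example in numpy, not anything from the course) of logistic regression trained with gradient descent, i.e., a single sigmoid "neuron":

```python
import numpy as np

# Toy, linearly separable data: label is 1 when x1 + 2*x2 > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # sigmoid(linear model) = P(y=1|x)
    grad_w = X.T @ (p - y) / len(y)      # gradient of the log loss w.r.t. w
    grad_b = np.mean(p - y)              # ... and w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

p = 1 / (1 + np.exp(-(X @ w + b)))
print(w, b)                              # roughly points in the (1, 2) direction
print(np.mean((p > 0.5) == y))           # training accuracy near 1.0
```

A neural network is this same unit, stacked and composed, with nonlinearities in between.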
Logit is not a "probability function"; quite the opposite. You can see this in the image in the link you posted (the x-axis runs from 0 to 1, the y-axis from -inf to inf). It transforms probabilities into log odds, which live on a linear scale and make combining probabilities much nicer.
The inverse logit or logistic function takes log odds and transforms them back into probabilities.
Most machine learning relies heavily on manipulating probabilities, but since probabilities are not linear, the logit/logistic transformations become essential to correctly modeling complex problems involving probabilities.
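A tiny numpy sketch of the round trip, plus the "combining is just addition" point (the combination rule at the end assumes independent evidence, the naive-Bayes-style reading):

```python
import numpy as np

def logit(p):                 # probability -> log odds
    return np.log(p / (1 - p))

def logistic(x):              # log odds -> probability (the inverse)
    return 1 / (1 + np.exp(-x))

p = np.array([0.1, 0.5, 0.9])
print(logit(p))               # [-2.197, 0.0, 2.197]: unbounded, symmetric about 0
print(logistic(logit(p)))     # round-trips back to [0.1, 0.5, 0.9]

# Combining two independent pieces of evidence is addition in log-odds space:
print(logistic(logit(0.8) + logit(0.6)))   # ~0.857, stronger than either alone
```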
I tried this course a few weeks ago but quickly got stuck after finetuning the first model, the one that detects cats (the first example). The finetuning part works, but I was never able to get it to infer (Python kept crashing, complaining that an object didn't have a read function).
Hopefully, I'll manage to get further with this course.
And he presents all this with humility. Many presenters are just showing off and are pretty much full of themselves; I suppose they need the ego boost, who knows. But Andrej could be the nice guy next door in the dorm who is studying the same course as you, just a lecture or two ahead. (Until you figure out he is the former Director of AI at Tesla, or whatever his title ended up being before he left.)
I can also recommend his interview with Lex Fridman.
I've only finished the first video, but he even made two minor blunders in his code and kept the footage. It really helps your confidence to see a pro make a mistake, rather than a perfectly polished but unattainable ideal standard.
He is a master educator. While at Stanford he helped create and taught CS231n, the deep learning course on convolutional neural networks for visual recognition, which immediately became legendary. It's somewhat out of date on some details, but it's still well worth watching, especially as delivered by Andrej. You can find the lectures on YouTube.
This is really cool, and I am so glad my math teacher played hardball and I still remember some calculus.
edit: Python really was/is made for this numbers/calculation/visualization stuff. Kinda kicking myself now for not investing more in it and for sticking with PHP instead; PHP has its merits when building certain things, but Python is a beast with numbers.
Getting started with it takes exactly zero experience. Being productive in it does, but that's unrelated to the starting point, and shouldn't discourage you, if you really want to do it.
Yes. The journey will teach you everything you need to know, or it'll kill you; and the likelihood of that happening here, or on any path in IT, is negligible. So just show up every day and collect more fuck-ups than the next guy (or your former self).
I've been reading The Little Learner, which builds machine learning knowledge on top of Scheme/Racket. After that book, you could watch this series, which immediately begins explaining how automatic differentiation works.
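To give a flavor of what that means, here's a minimal micrograd-style sketch in Python (my own toy version, not the book's or the series' code) of reverse-mode automatic differentiation:

```python
# Each Value remembers how it was made, so gradients flow back via the chain rule.
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._grad_fn = None            # how to push my gradient to my children

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def grad_fn():
            self.grad += out.grad       # d(a+b)/da = 1
            other.grad += out.grad      # d(a+b)/db = 1
        out._grad_fn = grad_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def grad_fn():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._grad_fn = grad_fn
        return out

    def backward(self):
        # Topological order: visit each node after everything built from it.
        order, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            if v._grad_fn:
                v._grad_fn()

a, b = Value(2.0), Value(-3.0)
y = a * b + a                # y = ab + a, so dy/da = b + 1, dy/db = a
y.backward()
print(a.grad, b.grad)        # -2.0 2.0
```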
I’m almost done with both of the Andrew Ng Coursera specializations, and they are not at all disconnected or unclear. In fact, I don’t think I’ve ever learned so much so quickly.
It should be completely fine; you don't really need a GPU for this course. Maybe in the exercises right at the end you won't get good performance, but the exact performance isn't all that important.
[1]: https://www.foundersandcoders.com/ml
[2]: https://github.com/karpathy/micrograd
[3]: https://github.com/karpathy/randomfun/blob/master/lectures/m...
[4]: https://github.com/karpathy/makemore
[5]: https://github.com/karpathy/nanoGPT