Neural Networks: Zero to Hero (karpathy.ai)
642 points by whereistimbo on April 5, 2023 | 68 comments



I'm doing an ML apprenticeship [1] these weeks and Karpathy's videos are part of it. We've gone deep into them, and I found them excellent. All the concepts he illustrates are crystal clear in his mind (even though they are complicated concepts themselves), and that shows in his explanations.

Also, the way he builds everything up is magnificent: starting from basic Python classes, to derivatives and gradient descent, to micrograd [2], and then from a bigram counting model [3] to makemore [4] and nanoGPT [5].

[1]: https://www.foundersandcoders.com/ml

[2]: https://github.com/karpathy/micrograd

[3]: https://github.com/karpathy/randomfun/blob/master/lectures/m...

[4]: https://github.com/karpathy/makemore

[5]: https://github.com/karpathy/nanoGPT


That program sounds quite impressive. I wonder if any equivalents exist in the US?


The website doesn't say what is, for me, the best thing about it. The course is peer-led, which works like this: once you join, you're part of a team which has one objective: get the best score with your ML recommendation system.

There is a simulated environment in which all teams of the cohort receive millions of requests per day (and hundreds of thousands of users and items), and you have to build out your infrastructure on an EC2 instance, build a basic model, and then iteratively improve on it. Imagine a simulated facebook/youtube/tiktok-style system where you aim for the best uptime and the best recommendations!


That is really cool and engaging.


That apprenticeship looks interesting, could I ask you a few questions about it?

My email address is the first part of my username (before the "-") at blueheart dot io.


Do you run the code as you watch?

I’ve been simply watching them on a palm from a hammock and I’m worried I’m not getting the full experience.


Something I discovered not so long ago, and wish I had discovered years earlier, is to watch the video first and then code along on a second pass. It's so simple, but it makes a world of difference: you can skip the errors and fluff and foresee what's coming next. You'd think it amounts to watching two 2-hour videos, but it works out in terms of getting the most out of the content and drilling it into your head.


I've found that actually running the code has been very beneficial for understanding, along with reasoning through each line of code, spending a lot of time with the video paused, and discussing and explaining to each other what we understood.


Same. I also found the exercises to be useful.


Running the code helps with understanding and developing practical skills.

Watching only is much nicer for entertainment.


This was the first time I actually grokked backpropagation. The first video alone is more lucid and valuable than any other resource about machine learning I had seen before. In fact, it's so well explained that I managed to implement the library almost completely from memory after watching it. I cannot recommend it highly enough, especially for programmers without a math background!

The only aspect I could see being non-ideal for some is that it uses some Python-specific cleverness/advanced syntax and semantics (__call__(), list comprehensions with two for's, **kwargs, __add__, __repr__, subclasses, (nested) functions as variables etc.), but if you are familiar with these it might seem more compact and elegant as well.
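
For anyone who hasn't met those constructs, here's a minimal, hypothetical sketch of the kind of Python the videos lean on (the Value/Neuron names echo micrograd's style, but this is illustrative, not the actual course code):

    import random

    class Value:
        # A tiny scalar wrapper, in the spirit of micrograd's Value (illustrative only)
        def __init__(self, data):
            self.data = data

        def __add__(self, other):          # lets you write v1 + v2
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data + other.data)

        def __mul__(self, other):          # lets you write v1 * v2
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data * other.data)

        def __repr__(self):                # controls how a Value prints
            return f"Value(data={self.data})"

    class Neuron:
        def __init__(self, n_inputs, **kwargs):   # **kwargs: optional keyword arguments
            self.w = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
            self.b = Value(0.0)

        def __call__(self, xs):            # lets you call neuron(xs) like a function
            act = self.b
            for wi, xi in zip(self.w, xs):
                act = act + wi * xi
            return act

    # list comprehension with two for's: flatten a list of lists
    flat = [x for row in [[1, 2], [3, 4]] for x in row]    # [1, 2, 3, 4]

    print(Neuron(2)([Value(1.0), Value(2.0)]))             # e.g. Value(data=0.87...)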


I find it genuinely stunning how ugly python gets in these projects. Almost immediately, even in a toy project (not tinygrad which is deliberately super dense).


To be fair, Andrew Ng's older online course was also fantastic at explaining backprop.

But that takes nothing away from Andrej's class.


I agree. I'm looking forward to re-watching it, as it was so information-dense.


My hive mind connection must be good because I literally finished this course yesterday.

It was very satisfying to learn how transformers work and to finally be able to turn the obscure glyphs of the research papers into real code, but I think transformers are too big for what I can do on my own computer. The author mentioned that the toy transformer he was building in the final video took 15 minutes to train on his A100 GPU (a $10,000 GPU), and the results weren't even that good; the transformer was spelling words correctly using character-level tokens, which I guess is something, but it's not GPT-4.

Even so, there were a lot of good tips to pick up along the way. This is a great series that I'm thankful to have. The "Backprop Ninja" video was hard work: you manually calculate the gradients and then compare your calculations against PyTorch. It's great to have instant feedback telling you whether your gradients are correct or not.
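
For anyone curious what that feedback loop looks like, here's a minimal sketch of checking a hand-derived gradient against PyTorch's autograd (my own toy example, not the actual exercise from the video):

    import torch

    x = torch.tensor([0.5, -1.2, 2.0], requires_grad=True)
    w = torch.tensor([1.5, -0.3, 0.8], requires_grad=True)

    # Forward pass: a toy squared "loss" on the dot product
    y = (x * w).sum()
    loss = y ** 2
    loss.backward()                              # autograd's gradients

    manual_dw = 2 * y.detach() * x.detach()      # chain rule by hand: d(loss)/dw = 2*y*x

    print("autograd:", w.grad)
    print("manual:  ", manual_dw)
    print("match:", torch.allclose(w.grad, manual_dw))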


I made a smaller GPT model that started from Andrej's code that converges to a decent loss in a short amount of time on an A100 -- just under 2.5 minutes or so: https://github.com/tysam-code/hlb-gpt

With the original hyperparameters it took 30-60 minutes; with a pruned-down network and adjusted hyperparameters, about 6 minutes; and a variety of optimizations beyond that brought it down further.

If you want the basically feature-identical (but pruned-down) nanoGPT version, release 0.0.0 at ~6 minutes or so is your best bet.

You can get A100s cheaply and securely through Colab or LambdaLabs.


> his A100 GPU (a $10,000 GPU)

These are available to rent per hour at much lower costs. The author mentions this in the video description.


True, as much as I enjoy owning and controlling my own hardware, buying an A100 and then letting it sit idle while I procrastinate and play video games probably isn't the best use of resources. He did say "my GPU" (or similar) at one point, and I thought maybe he does enough ML stuff that he bought his own.


If you have an NVIDIA gaming GPU you can train reasonable transformers.


Approximately 40 cents USD for 15 minutes from cursory research.


I'm completely unfamiliar with this market. Do you rent these on AWS? Or where?


https://jarvislabs.ai/pricing/

$1.29 per hour for a 40GB A100, apparently

https://lambdalabs.com/service/gpu-cloud#pricing

$1.10 per hour


If you're OK with 24GB you can use a 3090 at https://www.genesiscloud.com/pricing for $0.70/h


What I appreciate about karpathy's videos is that they don't make things any more complicated than they need to be. Simple engineering language is used. No gatekeeping! It's reassuring, and lets everyone know that anyone can do it.

Thanks karpathy!


I just don’t know what he means by logits. Everything else seems like straightforward language.


He defines it pretty clearly. Logits are the inputs to a softmax layer/calculation, which turns the logits into normalized percentages (the percentages sum to 1.0).

Before going through the softmax layer, the logits will be small numbers around 0, probably. Something like: [2.89, -4.53, 0.24, -1.556, 0.57]. Logits like this are natural outputs of a neural network, because they can be any real number and everything will still work.

The logits become percentages as follows:

    julia> x = [2.89, -4.53, 0.24, -1.556, 0.57]
    5-element Vector{Float64}:
      2.89
     -4.53
      0.24
     -1.556
      0.57
    
    julia> x = exp.(x)
    5-element Vector{Float64}:
     17.993309601550315
      0.010780676072743085
      1.2712491503214047
      0.2109782988178321
      1.768267051433735
    
    julia> x / sum(x)
    5-element Vector{Float64}:
     0.8465613320288766
     0.0005072164987105474
     0.05981058503789324
     0.009926248902037537
     0.08319461753248213

Logits is an overloaded term though, and means different things in different contexts.


When people mention logits, they're usually referring to the raw output of the model before it gets transformed/normalised into a probability distribution (i.e. sums to 1, range [0,1]). Logits can take any value. The naming might not be mathematically strict, because it assumes(?) that you're going to apply softmax (which interprets the output of the model as logits), but that's how the term is used.

For example, in many classification problems you get a 1D vector of logits from the final layer; you apply softmax to normalise, then argmax to extract the predicted class. It extends to other tasks like semantic segmentation (predicting pixel classes), where the "logit" output is the same size as the image with a channel for each class, and you apply the same process to get a single-channel image with a class per pixel.

Here's a nice explanation: https://stackoverflow.com/a/66804099/395457
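
And a minimal sketch of that classification pipeline in PyTorch (made-up logits, hypothetical shapes, just to show the mechanics):

    import torch
    import torch.nn.functional as F

    # Pretend these are the raw final-layer outputs for 2 samples and 5 classes
    logits = torch.tensor([[2.89, -4.53, 0.24, -1.556, 0.57],
                           [0.10,  1.20, -0.70, 3.30,  0.00]])

    probs = F.softmax(logits, dim=-1)   # each row now sums to 1, values in [0, 1]
    preds = probs.argmax(dim=-1)        # predicted class index per sample

    print(probs.sum(dim=-1))            # tensor([1., 1.])
    print(preds)                        # tensor([0, 3])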


Honestly, what cracked logits for me was a conversation with ChatGPT in which I gave it my professional background, areas of strength and weakness, and problem context, and had it explain them to me. I then went elsewhere to make sure I hadn't been lied to. I've found ChatGPT an invaluable learning tool when used in this way.


I'm confused by the comments here, does "logit" here not mean "log odds" like it does in virtually every other context related to machine learning?

Generally I'm a huge fan of not getting too caught up in theory before diving into practice, but I'm seeing multiple responses to this comment without a single mention of "log odds".

The logit function transforms probabilities into the log of the odds, ln(P(X)/(1-P(X))), which is important because it makes probabilities linear, which they are not in their standard [0,1] form. It's the foundation of logistic regression, which is, despite much misinformation, quite literally linear regression with a transformed target.

The logistic function is the inverse of the logit: it turns log-odds values back into probabilities. Logistic regression actually transforms the model, not the target (most of the time), because the labels 0 and 1 map to negative and positive infinity on the logit scale, which can't be handled by linear regression (so we transform the model using the inverse instead).

I don't think I can stress enough how important it is to really understand logistic regression (which is also the basic perceptron) before diving into neural networks (which are really just an extension of logistic regression).
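
A tiny sketch of the two functions in plain Python, just to make the relationship concrete:

    import math

    def logit(p):
        # log odds: maps a probability in (0, 1) to (-inf, inf)
        return math.log(p / (1 - p))

    def logistic(z):
        # inverse of logit: maps any real number back to a probability in (0, 1)
        return 1 / (1 + math.exp(-z))

    p = 0.846
    z = logit(p)                  # ~1.70
    print(z, logistic(z))         # logistic(logit(p)) round-trips back to ~0.846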


Having not watched the series, I can only assume he means logit as in a probability function from 0 to 1

https://deepai.org/machine-learning-glossary-and-terms/logit....


logit is not a "probability function"; quite the opposite. You can see this in the image in the link you posted (the x-axis goes from 0 to 1, the y-axis from -inf to inf). It transforms probabilities into log odds, which is a linear space, and makes combining probabilities much nicer.

The inverse logit or logistic function takes log odds and transforms them back into probabilities.

Most machine learning relies heavily on manipulating probabilities, but since probabilities are not linear, the logit/logistic transformations become essential to correctly modeling complex problems involving probabilities.


probability-related* function


This course, together with the new fastai ones [1], seems to be exactly what I was looking for. The micrograd video is excellent.

[1] https://course.fast.ai


I tried this course a few weeks ago but quickly got stuck after finetuning the first model, the one that detects cats (the first example). The finetuning part works, but I was never able to get it to run inference (Python kept crashing, complaining that an object didn't have a read function).

Hopefully, I'll manage to get further with this course.


The simplest guide to NNs I have ever read is this one: http://www.ai-junkie.com/ann/evolved/nnt1.html

It's an old site and guide, but probably still the easiest to understand if you're coming from a programming background.


Wonderful! Just went through the GPT video the other day and it was great. Andrej has a talent for pedagogy via simplification.


And he presents all this stuff with humility. Many people who present are just showing off and are pretty much full of themselves. I suppose they need the ego boost, who knows. But Andrej could be the nice guy next door in the dorm who is studying the same course as you, just a lecture or two ahead. (Until you figure out he is the former VP of AI at Tesla, or whatever his title ended up being before he left.)

I can even recommend his interview with Lex Fridman.


I've only finished the first video, but he even made two minor blunders in his code and kept the footage. It really helps your confidence to see a pro make a mistake rather than a perfectly polished but unattainable ideal standard.


He also is quite good at teaching and solving the Rubik's cube


Absolutely agree with this.


He is a master educator. While at Stanford he developed their deep learning course CS231n, which immediately became legendary. It's somewhat out of date on some details, but it's still well worth watching, especially as delivered by Andrej. You can find all 11 lectures on YouTube.



Which GPT video?


It can be found on the site's home page.

Let's build GPT: from scratch, in code, spelled out: https://www.youtube.com/watch?v=kCc8FmEb1nY


Thank you!


This is really cool and I am so glad my math teacher was a hard ball and I still remember some Calculus.

edit: Python really was/is made for this numbers/calculation/visualization thing. I'm kinda kicking myself now for not investing more in it and sticking with PHP instead. Although PHP has its merits when building different things, Python is a beast with numbers.


I am at graphviz now and it is getting better and better.

I'm also using ChatGPT to ask questions when I don't get something.

Wow what a time we live in to learn things.


Those interested in this might also be interested in some notes from the University of Amsterdam:

https://uvadlc-notebooks.readthedocs.io/en/latest/index.html

Tutorial 6 covers transformers.


Discovering this kind of resource is a big reason I come to HN. Another recent good one on the same topic is MIT 6.S191: https://www.youtube.com/watch?v=ySEx_Bqxvvo


A little offtopic, but is this something that someone with only webdev experience can get started with?


You need some fluency in Python and basic knowledge of linear algebra (matrix multiplication, etc.).

If you want, you can also start with the first lessons of course.fast.ai


Getting started with it takes exactly zero experience. Being productive in it does, but that's unrelated to the starting point, and shouldn't discourage you, if you really want to do it.

There are several open courses online.


Yes. The journey will teach you everything you need to know, or it'll kill you. The likelihood of that happening here, or on any path in IT, is negligible at best. So just show up every day and collect more fuck-ups than the next guy (or your former self).


The tools are basically irrelevant conceptually. It's all about the mathematics.


It might make sense to get a handle on Python first


I am always lost in these blogs. Is there any gradual progression of understanding/exercises that one can follow to apply ML/NN/DL practically?

Having these disconnected pieces of information with no clear link to one another feels like a lot of noise to me.


I've been reading The Little Learner, which builds machine learning knowledge on top of Scheme/Racket. After that book, you could watch this series, which immediately begins explaining how automatic differentiation works.


I’m almost done with both of the Andrew Ng coursera specializations they are the not at all disconnected or unclear. I don’t think I’ve ever learned so much so quickly in fact.


Has anybody tried the lessons on an Apple MBA/MBP M1/M2? Are they easily applicable?


It should be completely fine; you don't really need a GPU to do this course. For the exercises right at the end you maybe won't get good performance, but the exact performance isn't all that important.
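
If you do want to use the Apple GPU, a minimal sketch of device selection in PyTorch (assuming a reasonably recent PyTorch build with MPS support) looks like this:

    import torch

    # Pick the best available device; on an M1/M2 Mac this will usually be "mps"
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    print(f"Training on: {device}")
    # then move the model and tensors as usual, e.g. model.to(device), x.to(device)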


Thanks.


Can other members suggest courses/resources that help with the basic maths/stats/ML/programming skills required for this one?


Andrej’s course is brilliant and so nice to follow.

His explanation of attention is the most accessible I have ever seen.


Are there GPU hardware requirements for the exercises?


I wonder, how do people visualize their neural network?




You should check this out:

https://cats.for.ai/program/


How is it relevant here?


It provides a unifying framework for how NNs work.



