GPT in 500 Lines of SQL (explainextended.com)
1111 points by thunderbong 9 months ago | 76 comments



This is beautiful. I'd actually been going down this same rabbit hole with sqlite, though I hadn't gotten far enough to bring a neural net into it.

I'd been inspired by the makemore lecture series[0]. Around the 1-hour mark, he switches from counting to using an NN, which is about as far as I've gotten. Breaking it down into a relational model is actually a really great exercise.

[0] https://www.youtube.com/watch?v=PaCmpygFfXo


If you keep watching, you will see the NN derive the exact same table as the counting method and even give the exact same results when generating.


It's a nice demo. Unfortunately, the article mixes things up in its explanation of causal masking; the author seems to conflate aspects of training and inference. Causal masking exists, first, to prevent the model from "peeking" at future tokens during training, and second (at least for GPT-like architectures) to enforce the autoregressive property during inference. During inference we only use the last position's output anyway, and that position attends to the entire input sequence, so the next token is definitely not decided from the last token's embedding alone.
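
For reference, the mask itself is just a lower-triangular matrix applied to the attention scores before the softmax. A minimal NumPy sketch (toy sizes, not the article's SQL):

    import numpy as np

    n = 4                                    # sequence length
    scores = np.random.randn(n, n)           # raw attention scores

    # Causal mask: position i may only attend to positions <= i.
    mask = np.tril(np.ones((n, n), dtype=bool))

    # Masked-out scores become -inf, so their softmax weights are exactly 0.
    masked = np.where(mask, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)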


Is this an accurate representation of the GPT driver loop?

    def generate(prompt: str) -> str:
      # Transforms a string into a list of tokens.
      tokens = tokenize(prompt) # tokenize(prompt: str) -> list[int]
    
      while True:
     
        # Runs the algorithm.
        # Returns tokens' probabilities: a list of 50257 floats, adding up to 1.
        candidates = gpt2(tokens) # gpt2(tokens: list[int]) -> list[float]
     
        # Selects the next token from the list of candidates
        next_token = select_next_token(candidates)
        # select_next_token(candidates: list[float]) -> int
     
        # Append it to the list of tokens
        tokens.append(next_token)
     
        # Decide if we want to stop generating.
        # It can be token counter, timeout, stopword or something else.
        if should_stop_generating():
          break
 
      # Transform the list of tokens into a string
      completion = detokenize(tokens) # detokenize(tokens: list[int]) -> str
      return completion

because that looks a lot like a state machine implementing Shlemiel the painter's algorithm, which raises questions about the intrinsic compute cost of the generative exercise.


I think the "context window" that people refer to with large language models means there's a maximum number of tokens that are retained, with the oldest being discarded: it's a sliding window.
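
Applied to the loop upthread, the trimming would look something like this (a sketch reusing that pseudocode's functions; MAX_CONTEXT stands in for the model's limit, 1024 tokens for GPT-2):

    MAX_CONTEXT = 1024  # GPT-2's context length

    while True:
        # Feed only the most recent MAX_CONTEXT tokens to the model;
        # older tokens fall out of the window and stop affecting the output.
        candidates = gpt2(tokens[-MAX_CONTEXT:])
        tokens.append(select_next_token(candidates))
        if should_stop_generating():
            break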


Yes, that is the loop. All the magic is in the gpt2 function there.


This is a very small section of the algorithm. This is just how it collects the tokens it has generated into a sentence.


Related:

A GPT in 60 Lines of NumPy - https://news.ycombinator.com/item?id=34726115 - February 2023 (146 comments)


This is mentioned in the article near the top.


This is great. In a similar vein, I implemented GPT entirely in spreadsheet functions, with accompanying video tutorials:

https://spreadsheets-are-all-you-need.ai/


Your first video is fantastic. As someone who thinks LLMs are pretty nifty but never had a professional need to learn how they actually work, that 10 minute video taught me more than several years of reading esoteric HN comments and fluffy mainstream media on the topic. Seeing a bazillion floating point numbers all stacked up ready to be calculated across also makes it much more intuitive why this tech devours GPUs, which never really occurred to me before. Thanks for sharing.


Nice job. Spreadsheets are a natural way to explain an LLM. I suspect that you could show training well too, by calculating the derivatives for each parameter under each training example and showing them explicitly mapped to the relevant parameter, etc.


Thank you. Validating to hear others feel spreadsheets are a natural and accessible way to explain LLMs. Someone asks “what do they mean by a parameter in a model” or “what is the attention matrix” and you can just pull it up graphically laid out. Then they can fiddle with it and get a visceral feel for things. It also becomes easier for non-coders to do things like logit lens, which is just a few extra simple spreadsheet functions.

I actually plan to do what you describe after I do the embeddings video (but only for a “toy” neural net as a proof-of-concept introduction to gradient descent).


Great approach and great delivery; looking forward to the next video in the series!


Not only is that amazing, your video was so well done. Superb crowd work! Color me double impressed.


Thanks! Each one takes a surprisingly long time to make. Even figuring out how to make the explanation accessible and compelling yet still accurate takes a while, and then there’s still the actual video to do.


I love this, something that starts off as some kind of sorcery a year ago is now being explained so well and almost in a childish way.


This sorcery didn't start a year ago. The model being described in the article is GPT-2 which was released in early 2019.


> and almost in a childish way.

No. You've got to have a solid background in computer science to even begin to fully understand this article.

Even the title itself is not accessible to 99% of humans.


Parent obviously means childish as in "simple and easy manner" not as "something 10 year olds will understand".


Count me in as one of the 99%


[flagged]


when talking about a set of characters, alphabet is a commonly used term https://cs.stackexchange.com/search?q=string++alphabet+size


Sure, but if you're doing work in machine learning, that’s generally not the terminology used, hinting that this isn’t the area the author specializes in (which isn’t a bad thing, but take their explanations with a grain of salt).


Also, about the non-determinism issue, there was a post some time ago and that comes from the way the GPU does the calculations, something something floating point something.

So of course the algorithm is deterministic, but the real-life implementation isn't.


Floating point addition, for example, is not associative, so the order of taking a sum affects the result. If the summation were sequential and single threaded, it would be deterministic. But it happens in parallel, so timing variations affect the result.

But there is probabilistic sampling that happens (see "temperature").
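
The non-associativity part is easy to demonstrate in Python:

    # The same three numbers, summed in two different orders.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0, because 1.0 vanishes next to -1e16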


> Floating point addition, for example, is not associative, so the order of taking a sum affects the result. If the summation were sequential and single threaded, it would be deterministic. But it happens in parallel, so timing variations affect the result.

In this sense, I don't think it's fair to say floating point math is non-deterministic, as much as parallel computation is non-deterministic. FP behaves in unexpected ways, but the same order of operations always yields the same unexpected results (except on Pentium 1).


Electricity, cars, and gas were once upon a time a luxury as well, reserved for those who could afford them or had unique access, credentials, or skills. The people who were able to simplify and spread the advanced tech to the common person became billionaires.


I’ve completely avoided GPT and LLMs. This looks like it would generate some level of fluidity in text output, but not be able to parse and answer a question.

Are there any simple blog posts or training courses that go through how they work, or expose a toy engine in Python or similar? All the training I’ve seen so far seems oriented at how to use the platforms rather than how they actually work.


Jay Alammar has my favorite sequence of tutorials from basic neural network math to GPT2.

Particularly [0], [1], and [2]

[0] http://jalammar.github.io/illustrated-transformer/

[1] http://jalammar.github.io/illustrated-gpt2/

[2] https://jalammar.github.io/visualizing-neural-machine-transl...


Strap in, this is by far the best resource: https://www.youtube.com/watch?v=kCc8FmEb1nY


Interestingly, modern ML does not require Turing completeness. And yet people are considering the possibility of AGI. I would find it pretty amusing if Turing completeness turned out not to be necessary.


Seems to me that Turing completeness is necessary, for the simple reason that I can mentally trace through Turing-complete code.


Without a piece of paper to take notes on? ;)


My own memory functions as the paper tape as long as the program is simple enough :)


Token inference by itself isn't Turing complete, but if its output can have side effects (e.g. editing the prompt for the next iteration), that's a whole different story.


Great write-up. I enjoyed reading the explanations for each piece and found them to be clear and quite thorough.

I did make the mistake though of clicking "+ expand source", and after seeing the (remarkable) abomination I can sympathize with ChatGPT's "SQL is not suitable for implementing large language model..." :)


I did that too and could not find a way to collapse it.


> Plain Unicode, however, doesn't really work well with neural networks.

That is not true. See ByT5, for example.

> As an illustration, let's take the word "PostgreSQL". If we were to encode it (convert to an array of numbers) using Unicode, we would get 10 numbers that could potentially be from 1 to 149186. It means that our neural network would need to store a matrix with 149186 rows in it and perform a number of calculations on 10 rows from this matrix.

What the author calls an alphabet here is typically called a vocabulary. And you can just use UTF-8 bytes as your vocabulary, so you end up with 256 tokens, not 149186. That is what ByT5 does.
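
In Python terms:

    text = "PostgreSQL"
    ids = list(text.encode("utf-8"))  # byte-level "tokens"
    print(ids)                 # [80, 111, 115, 116, 103, 114, 101, 83, 81, 76]
    print(len(ids), max(ids))  # 10 tokens, each one fitting in 0..255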


The point isn't that it doesn't work at all, but that it doesn't work as well as other approaches we have, as evidenced by the fact that all the best-performing models on the market use tokenization. It's not a secret that tokenization is fundamentally a hack, and that ideally we'll get rid of it eventually one way or another (https://twitter.com/karpathy/status/1657949234535211009). In principle, you can compensate for the deficiencies of byte-level tokenization with larger models and larger contexts. But in practice this means that a byte-level model with the same level of intelligence (on most tasks; obviously there are some specific tasks, like counting characters in a word, where tokenization is detrimental) takes a lot more resources to train, which is why we aren't seeing more of them.


These marvels need to be preserved. Just posting the archive link here in case the blog is not maintained in the future.

https://archive.is/VAGzF



Thanks, this is a fantastic article and it would be a shame for it to be lost.


This is very cool


Unexpectedly insightful, and it answers some of the same questions I had early on: not just “how” questions, but “why” as well. You see the softmax pattern quite often. I wish it were taught as “differentiable argmax” rather than by giving people a formula straight away. That’s not all it is, but that’s how it’s often used.
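
A quick NumPy sketch of that view: as the temperature drops, softmax collapses toward a one-hot argmax while staying differentiable the whole way down.

    import numpy as np

    def softmax(logits, temperature=1.0):
        z = np.asarray(logits) / temperature
        z = z - z.max()          # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    logits = [2.0, 1.0, 0.5]
    print(softmax(logits))                    # ~[0.63, 0.23, 0.14], a soft ranking
    print(softmax(logits, temperature=0.01))  # ~[1.00, 0.00, 0.00], i.e. argmax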


I keep reading that GPT is a "smarter", more complex Markov chain: in the end, just a function spitting out the next word with some probability.

But from my experience that cannot be the whole story: it has to learn somehow. There is an easy test. Tell it something that happened today and contradicts the past (I used to test this with the Qatar World Cup), then ask questions that are affected by that event, and it answers correctly. How is that possible? How does a simple sentence (the information I provide) change the probabilities for the next token that much?


There are two kinds of knowledge at play here.

1. The trained knowledge included in the parameters of the model

2. The context of the conversation

The 'learning' you are experiencing here is due to the conversation context retaining the new facts. Historically the context windows were very short, and as the conversation continued the model would quickly forget the new facts.

More recently context windows have grown to rather massive lengths.


Mostly #2 of what the other response says. The entire input to the NN is what it's "learning" from: the weights are static, and you are changing the probabilities based on which context is provided and its length.


Markov chains also depend on the initial state. In GPTs the initial state is your conversation history
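
To make the analogy concrete, here's a toy order-k Markov model in Python. A GPT behaves as if k were stretched to the entire context window, which is why one new sentence in the prompt shifts every later prediction:

    from collections import Counter, defaultdict

    def train(tokens, k=2):
        # The "state" is the last k tokens; counts give next-token probabilities.
        counts = defaultdict(Counter)
        for i in range(len(tokens) - k):
            counts[tuple(tokens[i:i + k])][tokens[i + k]] += 1
        return counts

    model = train("the cat sat on the mat because the cat was tired".split())
    print(model[("the", "cat")])  # Counter({'sat': 1, 'was': 1})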


Serious question, how do I get this smart?


No doubt the author is a super genius ex-child prodigy whizzkid who emerged from the womb with the umbilical cord diagramming Euler's formula.

For real though, and knowing this is a leading question, the author has near-on 15 years of blog posts showing complex problems being solved in SQL. Is their brain bigger than yours and mine? Maybe a little bit. Do they have a ton of experience doing things like this? Most definitely.


Also remember he builds these things piece by piece from the inside out.

It is easy to get overwhelmed when looking at the end result.

To learn, you should probably run one block at a time to understand what each piece does. (In "normal" languages those pieces would each be an isolated function.)


In the tokenization example for "Mississippilessly", why is 'si' not part of the combined column? It appears twice in the text. My speculation is that it got collapsed out with 'iss' (a longer token). Is that right?


Yes. At step 1 there are two 'i','s' and two 's','i'; 'is' gets picked because it comes first. Then at step 7, because 'is' got formed, there are two instances of 'is','s' and only one instance of 's','i', so 'iss' gets formed.
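
A toy, within-word version of one merge step, just to show the mechanics (the article's GPT-2 tokenizer actually applies a merge list learned from a large corpus, in rank order, so this toy's merge order won't reproduce the article's steps exactly):

    from collections import Counter

    def merge_step(tokens):
        # Count adjacent pairs; most_common is stable, so ties go to
        # the pair that appears first in the text.
        pairs = Counter(zip(tokens, tokens[1:]))
        (a, b), _ = pairs.most_common(1)[0]
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)  # merge the winning pair into one token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    tokens = merge_step(list("Mississippilessly"))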


Thanks.


Is there a world where this can be made performant? Just my geek curiosity kicking in, as I love Postgres and SQL.


This is a very good article and introduction.


This is a great read, I didn't expect to scroll right back to the top as soon as I finished it the first time.


Fantastic article. It kept my eyes on the screen for 2 hours, without interruption. The author is a genius.


What a beautiful blog post. It helped me understand a lot more about LLMs.


Is there a GitHub page for this?



I can feel the SQL-Force in this young jedi, Midichlorian level is on another level


alex is a genius. he’s worth a follow.


"It's full of jewels", as someone almost said.

E.g. The Sultan’s Riddle in SQL https://explainextended.com/2016/12/31/happy-new-year-8/


This should be illegal


Can you elaborate?


Laugh at people who overhyped it with singularity, AGI, and Skynet, lmao.


I think, I think, GPT creating GPT... creating GPT will be a thing soon. GPTception.


Personally I like the AI dogman angle. AI trained to beat other AI (resumes tailored to beat ATS algorithms)


GPT creating a better algo than itself is what’s even more interesting


It might. But also, all its decisions and knowledge are pretty much based on a resampling of our own language, conversations, and internet data. It might puzzle together some existing ideas that have never been combined. Or it might hallucinate a better solution by accident. But we're definitely not at the level yet where it will actively build something great.


Self-play GPT (by bots in a rich simulation), similar to AlphaGo Zero?


Self-play works for Go, because the "world" (for lack of a better term) can be fully simulated. Human language talks about the real world, which we cannot simulate, so self-play wouldn't be able to learn new things about the world.

We might end up with more regularised language, and a more consistent model of the world, but that would come at the expense of accuracy and faithfulness (two things which are already lacking).


Games like Go have a very limited (or known) end state, so reinforcement learning or similar methods work great. However, I wonder how AI will train itself to learn human languages without being judged by humans. It’s just a matter of time before someone figures it out.


Right, a rich simulator with humans for feedback: an evolved version of online worlds with a mix of AI NPCs and real people, with the task: find the NPCs. The NPCs can train in NPC-only rooms or mixed in with people, without knowing which.


What is this sorcery?



