Building LLMs from the Ground Up: A 3-Hour Coding Workshop

atum47 · 2024-08-31T22:54:23 1725144863

Excuse my ignorance, is this different from Andrej Karpathy https://www.youtube.com/watch?v=kCc8FmEb1nY

Anyway I will watch it tonight before bed. Thank you for sharing.

BaculumMeumEst · 2024-08-31T23:38:59 1725147539

Andrej's series is excellent, Sebastian's book + this video are excellent. There's a lot of overlap but they go into more detail on different topics or focus on different things. Andrej's entire series is absolutely worth watching, his upcoming Eureka Labs stuff is looking extremely good too. Sebastian's blog and book are definitely worth the time and money IMO.

bilsbie · 2024-09-08T18:42:14 1725820934

What are the main differences?

brcmthrowaway · 2024-09-01T07:47:52 1725176872

what book

StefanBatory · 2024-09-01T08:26:09 1725179169

Most likely this one.

https://www.manning.com/books/build-a-large-language-model-f...

(I've taken it from the footnotes on the article)

BaculumMeumEst · 2024-09-01T10:16:00 1725185760

That's the one! High enough quality that I would guess it would highly convert from torrents to purchases. Hypothetically, of course.

StefanBatory · 2024-09-02T13:54:42 1725285282

Of course, or so said your friend I assume ;)

samstave · 2024-09-01T00:09:11 1725149351

[flagged]

BaculumMeumEst · 2024-09-01T00:16:34 1725149794

fair point bud

abusaidm · 2024-08-31T22:34:16 1725143656

Nice write up Sebastian, looking forward to the book. There are lots of details on the LLM and how it’s composed, would be great if you can expand on how Llama and OpenAI could be cleaning and structuring their training data given it seems this is where the battle is heading in the long run.

rahimnathwani · 2024-09-01T02:59:43 1725159583

> how Llama and OpenAI could be cleaning and structuring their training data

If you're interested in this, there are several sections in the Llama paper you will likely enjoy:

https://ai.meta.com/research/publications/the-llama-3-herd-o...

kbrkbr · 2024-09-01T14:22:24 1725200544

But isn't it the beauty of LLMs that they need comparatively little preparation (unstructured text as input) and pick the features on their own, so to speak?

edit: grammar

aDyslecticCrow · 2024-09-02T14:19:28 1725286768

Yes, if you want an LLM that doesn't listen to instructions and just endlessly babbles about anything and everything.

What turned GPT into chatGPT was a lot of structured training with human feedback.

rahimnathwani · 2024-09-04T03:47:34 1725421654

Exactly. Section 4.3.7 briefly explains how they trained the model to better follow instructions ('steerability').

rakahn · 2024-09-01T01:55:46 1725155746

Yes. Would love to read that.

alecco · 2024-09-01T07:40:32 1725176432

Using PyTorch is not "LLMs from the ground up".

It's a fine PyTorch tutorial but let's not pretend it's something low level.

delano · 2024-09-01T11:59:44 1725191984

If you want to make an apple pie from scratch, first you have to invent the universe.

CamperBob2 · 2024-09-01T14:56:28 1725202588

After watching the Karpathy videos on the subject, of course.

BaculumMeumEst · 2024-09-01T10:11:57 1725185517

I really like Sebastian's content but I do agree with you. I didn't get into deep learning until starting with Karpathy's series, which starts by creating an autograd engine from scratch. Before that I tried learning with fast.ai, which dives immediately into building networks with Pytorch, but I noped out of there quickly. It felt about as fun as learning Java in high school. I need to understand what I'm working with!
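For readers wondering what "an autograd engine from scratch" amounts to, here is a minimal sketch of scalar reverse-mode differentiation. It is not the code from Karpathy's series; the `Value` class and its attribute names are assumptions made for illustration, and only addition and multiplication are supported.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), backward_fn=None):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = backward_fn  # pushes this node's grad to its parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Order the graph topologically, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()


x = Value(3.0)
w = Value(2.0)
b = Value(1.0)
y = w * x + b
y.backward()
print(y.data, w.grad, x.grad, b.grad)
```

After `y.backward()`, `w.grad` holds the value of `x` and `x.grad` holds the value of `w`, which is the chain rule applied to `y = w * x + b`. Frameworks like PyTorch do the same bookkeeping on tensors instead of scalars.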

krmboya · 2024-09-01T13:14:31 1725196471

Maybe it's just different learning styles. Some people, me included, like to start getting some immediate real world results to keep it relevant and form some kind of intuition, then start peeling back the layers to understand the underlying principles. With fastAI you are already doing this by the 3rd lecture.

Like driving a car: you don't need to understand what's under the hood to start driving, but eventually understanding it makes you a better driver.

BaculumMeumEst · 2024-09-01T15:10:35 1725203435

For sure! In both cases I imagine it is a conscious choice where the teachers thought about the trade-offs of each option. Both have their merits. Whenever you write learning material you have to decide where to draw the line of how far you want to break down the subject matter. You have to think quite hard about exactly who you are writing for. It's really hard to do!

jph00 · 2024-09-01T17:39:57 1725212397

You seem to be implying that the top-down approach is a trade-off that involves not breaking down the subject matter into as low-level details. I think the opposite is true - when you go top down you can keep teaching lower and lower layers all the way down to physics if you like!

jph00 · 2024-09-01T17:36:21 1725212181

fast.ai also does autograd from scratch - and goes further than Karpathy since it even does matrix multiplication from scratch.

But it doesn’t start there. It uses top-down pedagogy, instead of bottom up.
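As a rough idea of what "matrix multiplication from scratch" means at the lowest level, here is a sketch using plain Python lists and three nested loops. It is an illustration only, not fast.ai's code; the function name `matmul` and the list-of-rows representation are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows, using only plain loops."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # Each output cell is a dot product of row i of a and column j of b.
                out[i][j] += a[i][k] * b[k][j]
    return out


product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

Optimized libraries compute the same dot products, just with vectorized and parallel kernels instead of Python loops.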

BaculumMeumEst · 2024-09-01T17:53:49 1725213229

Oh that’s interesting to know! I guess I gel better with bottom up. As soon as I start seeing API functions I don’t understand I immediately want to know how they work!

jb1991 · 2024-09-01T08:24:47 1725179087

Learn to play Bach: start with making your own piano.

defrost · 2024-09-01T08:33:55 1725179635

Bach (Johann Sebastian .. there were many musical Bachs in the family) owned and wrote for harpsichords, lute-harpsichords, violin, viola, cellos, a viola da gamba, lute and spinet.

Never had a piano, not even a fortepiano .. though reportedly he played one once.

generic92034 · 2024-09-01T10:59:28 1725188368

He had to improvise on the Hammerklavier when visiting Frederick the Great in Potsdam. That (improvising for Frederick) is also the starting point for the later creation of https://en.wikipedia.org/wiki/The_Musical_Offering .

jb1991 · 2024-09-01T12:10:16 1725192616

Yes, I know, but that’s irrelevant. You can replace the word piano in my comment with harpsichord if it makes you happy.

vixen99 · 2024-09-01T11:39:19 1725190759

We know what he meant.

jahdgOI · 2024-09-01T09:01:08 1725181268

Pianos are not proprietary in that they all have the same interface. This is like a web development tutorial in ColdFusion.

jb1991 · 2024-09-01T15:43:55 1725205435

We’re digressing to get way off the whole point of the comment, but to address your point, actually piano design has been an area of great innovation over the centuries, with different companies doing it in considerably different ways.

maleldil · 2024-09-01T13:37:36 1725197856

Are you implying that PyTorch is proprietary?

nerdponx · 2024-09-01T12:54:53 1725195293

Low level by what standards? Is writing an IRC client in Python using only the socket API also not "from scratch"?

badsectoracula · 2024-09-01T13:16:03 1725196563

Considering i seem to be the minority here based on all the other responses to the message you replied to, the answer i'd give is "by mine, i guess".

At least when i saw the "Building LLMs from the Ground Up" what i expected was someone to open vim, emacs or their favorite text editor and start writing some C code (or something around that level) to implement, well, everything from the "ground" (the operating system's user space which in most OSes is around the overall level of C) and "up".

nerdponx · 2024-09-01T14:01:30 1725199290

The problem with this line of thinking is that 1) it's all relative anyway, and 2) The notion of "ground" is completely different depending on which perspective you have.

To a statistician or a practitioner approaching machine learning from a mathematical perspective, the computational details are a distraction.

Yes, these models would not be possible without automatic differentiation and massively parallel computing. But there is a lot of rich detail to consider in building up the model from first mathematical principles, motivating design choices with prior art from natural language processing, various topics related to how input data is represented and loss is evaluated, data processing considerations, putting things into the context of machine learning more broadly, etc. You could fill half a book chapter with that kind of content (and people do), without ever talking about computational details beyond a passing mention.

In my personal opinion, fussing over manual memory management is far afield from anything useful unless you want to actually work on hardware or core library implementations like Pytorch. Nobody else in industry is doing that.

badsectoracula · 2024-09-02T00:41:03 1725237663

> The problem with this line of thinking is that 1) it's all relative anyway, and 2) The notion of "ground" is completely different depending on which perspective you have.

But if all is relative and depends on your PoV that implies that there isn't actually a problem here, right? :-P

I don't think there is anything wrong with "building up the model from first mathematical principles" as you wrote, it just wasn't what i personally had in mind with the "from the ground up" part.

And FWIW i'm not that stuck up on the "vim and C" aspect, i used those as an example that i expected most would understand and leave little room for misinterpretation in what you'd have to work with (i.e. very very little) and have to implement yourself (pretty much everything) - personally i'd consider it from "the ground up" even if it was in C#, D, Java, JavaScript or even Python, as long as the implementation was done in a way that didn't rely on 3rd party libraries so that whatever is implemented in, say, Java could also be implementable in C#, D, JavaScript or Python with just whatever is available out of the box in those languages or even C, if one doesn't mind writing the extra bookkeeping functionality themselves.

nerdponx · 2024-09-03T11:27:37 1725362857

Right, but again I think the emphasis on avoiding 3rd party libraries isn't really relevant to machine learning. The "from scratch" here is avoiding 3rd party implementations of the transformer model, building up from the math on paper and then letting the AD/computation framework do its thing.

badsectoracula · 2024-09-04T10:24:40 1725445480

One does not exclude the other though. "Avoiding 3rd party implementations of the transformer model" is a subset of "avoiding 3rd party libraries". "From scratch" is, as seen, vague enough for different people to interpret it in different ways. Despite being the minority in this thread, i do not think my interpretation is any less valid - especially since some people have already done such "from scratch" (i.e. in C or C++ with no 3rd party dependencies) implementations already.

wredue · 2024-09-01T16:24:07 1725207847

Gluing together premade components is not “from the ground up” by most people’s definition.

People are looking at the ground up for a clear picture of what the thing is actually doing, so masking the important part of what is actually happening, then calling it “ground up” is disingenuous.

nerdponx · 2024-09-01T16:37:14 1725208634

Yes, but "what the thing is actually doing" is different depending on what your perspective is on what "the thing" and what "actually" consists of.

If you are interested in how the model works conceptually, how training works, how it represents text semantically, etc., then I maintain that computational details are an irrelevant distraction, not an essential foundation.

How about another analogy? Is SICP not a good foundation for learning about language design because it uses Scheme and not assembly or C?

bvrmn · 2024-09-02T11:13:54 1725275634

In the context of LLMs, torch.nn is low level. It's crucial in education not to operate on too many abstraction levels.

cruffle_duffle · 2024-09-02T16:32:05 1725294725

Not just education but real life engineering too.

menzoic · 2024-09-01T08:00:56 1725177656

Is this a joke? Can’t tell. OpenAI uses PyTorch to build LLMs

leobg · 2024-09-01T09:04:05 1725181445

People think of the Karpathy tutorials which do indeed build LLMs from the ground up, starting with Python dictionaries.

krmboya · 2024-09-01T13:21:52 1725196912

From scratch is relative. To a python programmer, from scratch may mean starting with dictionaries but a non-programmer will have to learn what python dicts are first.

To someone who already knows excel, from scratch with excel sheets instead of python may work with them.

wredue · 2024-09-01T16:26:01 1725207961

For the record, if you do not know what a dict actually is, and how it works, it is impossible to use it effectively.

Although if your claim is then that most programmers do not care about being effective, that I would tend to agree with given the 64 gigs of ram my basic text editors need these days.

carlmr · 2024-09-01T20:32:48 1725222768

>For the record, if you do not know what a dict actually is, and how it works, it is impossible to use it effectively.

While I agree it's good to know how your collections work, "efficient key-value store" may be enough to use it effectively 80% of the time for somebody dabbling in Python.

Sadly I've met enough people that call themselves programmers that didn't even have such a surface level understanding of it.

atoav · 2024-09-01T09:45:14 1725183914

No it is not. From scratch has a meaning. To me it means: in a way that lets you understand the important details, e.g. using a programming language without major dependencies.

Calling that from scratch is like saying "Just go to the store and tell them what you want" in a series called: "How to make sausage from scratch".

When I want to know how to do X from scratch I am not interested in "how to get X the fastest way possible", to be frank I am not even interested in "How to get X in the way others typically get it", what I am interested in is learning how to do all the stuff that is normally hidden away in dependencies or frameworks myself — or, you know, from scratch. And considering the comments here I am not alone in that reading.

kenjackson · 2024-09-01T16:49:19 1725209359

Your definition doesn’t match mine. My definition is fuzzier. It is “building something using no more than the common tools of the trade”. The term “common” is very era dependent.

For example, building a web server from scratch - I’d probably assume the presence of a sockets library or at the very least networking card driver support. For logging and configuration I’d assume standard I/o support.

It probably comes down to what you think makes LLMs interesting as programs.

atoav · 2024-09-02T11:15:51 1725275751

It is okay to differ on this. Language is not an exact science. It is however always good to factor in expectations when you describe things.

E.g. when a title says it shows you how to do a thing in vanilla javascript from scratch, bringing in jquery in the first step makes that title a lie. If you bring in a hefty dependency on step 1 and run three imported functions, the vanilla javascript part might be fine, but the "from scratch" starts to do some heavy lifting.

jnhl · 2024-09-01T08:16:17 1725178577

You could always go deeper and from some points of view, it's not "from the ground up" enough unless you build your own autograd and tensors from plain numpy arrays.

0cf8612b2e1e · 2024-09-01T16:47:05 1725209225

Numpy sounds like cheating on the backs of others. Going to need your own hand crafted linear algebra routines.

TZubiri · 2024-09-01T08:29:38 1725179378

Source please?

botverse · 2024-09-01T08:21:10 1725178870

alecco · 2024-09-01T11:51:29 1725191489

I'll write a guide "no-code LLMs in CUDA".

_giorgio_ · 2024-09-01T12:41:48 1725194508

Your comment is one of the most pompous that I've ever read.

NVIDIA's value lies only in PyTorch and CUDA optimizations compared with a pure C implementation, so saying that you need to go lower level than CUDA or PyTorch simply means reinventing NVIDIA. Good luck with that.

alecco · 2024-09-01T13:06:50 1725196010

1. I only said the meaning of the title is wrong, and I praised the content

2. I didn't say CUDA wouldn't be ground up or low level (please re-read) (I say in another comment about a no-code guide with CUDA, but it's obviously a joke)

3. And finally, I think your comment comes out as holier than thou and finger pointing and making a huge deal out of a minor semantic observation.

_giorgio_ · 2024-09-12T09:04:31 1726131871

There's always a lower level, until there's not.

Pytorch is low level enough to understand and interpret each and every passage. In pytorch, you can use builtin transformers, or code them yourself down to the "lowest" level in which there's still a theoretical meaning. So pytorch is just a tool and your comment was just pompous and empty.

SirSegWit · 2024-09-01T08:11:15 1725178275

I'm still waiting for an assembly language model tutorial, but apparently there are no real engineers out there anymore, only torch script kiddies /s

oaw-bct-ar-bamf · 2024-09-01T09:04:02 1725181442

Automotive actually uses ML in plain C with some inline assembly sprinkled on top to run models in embedded devices.

It’s definitely out there and in productive use.

mdp2021 · 2024-09-01T16:10:50 1725207050

> ML in plain c

Which engines in particular? I never found especially flexible ones.

wredue · 2024-09-01T16:29:14 1725208154

Ironically, slippery slope argumentation is a favourite style of kids.

Unfortunately, your argument is a well known fallacy and carries no weight.

bvrmn · 2024-09-02T11:08:07 1725275287

It's a natural response for "No true Scotsman" in the parent.

sigmoid10 · 2024-09-01T08:16:13 1725178573

Pfft. Assembly. I'm waiting for the real low level tutorial based on quantum electrodynamics.

atoav · 2024-09-01T09:36:12 1725183372

Wanted to say the same thing. As an educator who once gave a course on a similar topic for non-programmers you need to start way, way earlier.

E.g.

1. Programming basics

2. How to manipulate text using programs (reading, writing, tokenization, counting words, randomization, case conversion, ...)

3. How to extract statistical properties from texts (ngrams, etc, ...)

4. How to generate crude text using markov chains

5. Improving on markov chains and thinking about/trying out different topologies

Etc.

Sure, markov chains are not exactly LLMs, but they are a good starting point to build an intuition for how programs can extract statistical properties from text and generate new text based on that. Also it gives you a feeling for how programs can work on text.

If you start directly with a framework there is some essential understanding missing.
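Steps 2 to 4 of the outline above can be sketched in a few lines. This is a minimal example under stated assumptions, not course material: the helper names `build_bigram_table` and `generate`, the whitespace tokenization, and the toy corpus are all made up for illustration.

```python
import random
from collections import defaultdict


def build_bigram_table(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.lower().split()  # crude tokenization: lowercase, split on spaces
    table = defaultdict(list)
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return table


def generate(table, start, length, seed=0):
    """Walk the chain: repeatedly pick a random follower of the current word."""
    rng = random.Random(seed)
    word, output = start, [start]
    for _ in range(length - 1):
        followers = table.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


corpus = "the cat sat on the mat and the dog sat on the rug"
table = build_bigram_table(corpus)
sample = generate(table, "the", 8)
print(sample)
```

Because followers are stored with repetition, frequent continuations are sampled more often, which is the "statistical property" the chain picks up from the text.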

paradite · 2024-09-01T07:11:32 1725174692

I wrote a practical guide on how to train nanoGPT from scratch on Azure a while ago. It's pretty hands-on and easy to follow:

https://16x.engineer/2023/12/29/nanoGPT-azure-T4-ubuntu-guid...

firesteelrain · 2024-09-01T22:12:18 1725228738

Did it really only cost $200?

What sort of things could you do with it? How do you train it on current events?

paradite · 2024-09-02T03:04:27 1725246267

Yes. I checked the Azure usage after training.

Beyond learning how it all works and demo, there is not much practical usage. You can train it on current events if you feed that corpus during training instead of just OpenWebText. Shouldn't be hard.

theanonymousone · 2024-09-01T08:24:21 1725179061

It may be unreasonable, but I have a default negativity toward anything that uses the word "coding" instead of programming or development.

mdp2021 · 2024-09-01T08:55:15 1725180915

Quite a cry, in a submission page from one of the most language "obsessed" in this community.

Now: "code" is something you establish - as the content of the codex medium (see https://en.wikipedia.org/wiki/Codex for its history); from the field of law, a set of rules, exported in use to other domains since at least the mid XVI century in English.

"Program" is something you publish, with the implied content of a set of intentions ("first we play Bach then Mozart" - the use postdates "code"-as-"set of rules" by centuries).

"Develop" is something you unfold - good, but it does not imply "rules" or "[sequential] process" like the other two terms.

theanonymousone · 2024-09-02T08:32:17 1725265937

I did my research, but as a non-native speaker, couldn't figure out whether "quite a cry" means you agree or disagree with me.

ChatGPT doesn't know either.

k1tanaka · 2024-09-02T02:23:16 1725243796

I am from Brazil and I find this funny because in my circle of friends/co-workers we mostly use "coding" when speaking English, or "codar" (code as a Portuguese verb) with other Brazilians. I am not sure why, but I think it is because "program" has a strong association with prostitution in Brazilian Portuguese.

smartmic · 2024-09-01T08:38:07 1725179887

I fully agree. We had a discussion about this one year ago: https://news.ycombinator.com/item?id=36924239

xanderlewis · 2024-09-01T08:35:34 1725179734

Probably now an unpopular view (as is any opinion perceived as 'judgemental' or 'gatekeeping'), but I agree.

ljlolel · 2024-09-01T08:39:05 1725179945

This is more a European thing

SkiFire13 · 2024-09-01T09:52:46 1725184366

As a European: my language doesn't even have a proper equivalent to "coding", only a direct translation to "programming"

badsectoracula · 2024-09-01T13:56:40 1725199000

I'm from Europe and my language doesn't have an equivalent to "coding" but i'm still using the English word "coder" and "coding" for decades - in my case i learned it from the demoscene where it was always used for programmers since the 80s. FWIW the Demoscene is (or was at least) largely a European thing (groups outside of Europe did exist but the majority of both groups and demoparties were -and i think still are- in Europe) so perhaps there is some truth about the "coding" word being a European thing (e.g. it sounded ok in some languages and spread from there).

Also in my ears coder always sounded cooler than programmer and it wasn't until a few years ago i first heard that to some people it has negative connotations. Too late to change though, it still sounds cooler to me :-P.

[0] https://en.wikipedia.org/wiki/Demoscene

atoav · 2024-09-01T08:54:10 1725180850

I am from Europe and I am not completely sure about that to be honest. I also prefer programming.

I also dislike software development as it reminds me of developing a photographic negative – like "oh let's check out how the software we developed came out".

It should be software engineering and it should be held to a similar standard as other engineering fields if it isn't done in a non-professional context.

reichstein · 2024-09-01T09:38:52 1725183532

The word "development" can mean several things. I don't think "software development" sounds bad when grouped with a phrase like "urban development". It describes growing and tuning software for, well, working better, solving more needs, and with fewer failure modes.

I do agree that a "coder" creates code, and a programmer creates programs. I expect more of a complete program than of a bunch of code. If a text says "coder", it does set an expectation about the professionalism of the text. And I expect even more from a software solution created by a software engineer. At least a specification!

Still, I, a professional software engineer and programmer, also write "code" for throwaway scripts, or just for myself, or that never gets completed. Or for fun. I will read articles by and for coders too.

The word is a signal. It's neither good nor bad, but If that's not the signal the author wants to send, they should work on their communication.

mdp2021 · 2024-09-01T09:46:33 1725183993

> If that's not the signal the author wants to send

You can't use a language that will be taken by everyone the same way. The public is heterogeneous - its subsets will use different "codes".

mdp2021 · 2024-09-01T09:08:30 1725181710

> software development

Wrong angle. There is a problem, your consideration of the problem, the refinement of your solution to the problem: the solution gradually unfolds - it is developed.

leopoldj · 2024-09-04T15:16:05 1725462965

This is the exact level of detail I was looking for. I'm fairly experienced with deep learning and pytorch and don't want to see them built from scratch. I found Andrej's materials too low level and I tend to get lost in the weeds. This is not a criticism but just a comment for someone in a similar situation as I am.

karmakaze · 2024-09-01T03:39:28 1725161968

This is great. Just yesterday I was wondering how exactly transformers/attention and LLMs work. I'd worked through how back-propagation works in a deep RNN a long while ago and thought it would be interesting to see the rest.

aDyslecticCrow · 2024-09-02T14:27:02 1725287222

For some intuition, 3b1b has some videos that explain it nicely. But he doesn't go that far into the nitty gritty.

alok-g · 2024-09-01T04:41:40 1725165700

This is great! Hope it works on a Windows 11 machine too (I often find that when Windows isn't explicitly mentioned, the code isn't tested on it and usually fails to work due to random issues).

politelemon · 2024-09-01T12:04:39 1725192279

This should work perfectly fine in WSL2 as it has access to a GPU. Do remember to install the CUDA toolkit; NVIDIA has one for WSL2 specifically.

https://developer.nvidia.com/cuda-downloads?target_os=Linux&...

sidkshatriya · 2024-09-01T05:23:11 1725168191

When it does not work on Windows 11 -- what about trying it out on WSL (Windows Subsystem for Linux ) ?

adultSwim · 2024-09-01T02:50:56 1725159056

This page is just a container for a youtube video. I suggest updating this HN link to point to the video directly, which contains the same links as the page in its description.

mdp2021 · 2024-09-01T07:32:45 1725175965

On the contrary, I saved you that extra step of looking for Sebastian Raschka's repository of writings.

_giorgio_ · 2024-09-01T06:40:16 1725172816

He shares a ton of videos and code. His material is really valuable. Just support him?

yebyen · 2024-09-01T05:40:16 1725169216

Why not support the author's own website? It looks like a nice website

1zael · 2024-09-01T07:16:22 1725174982

Sebastian, you are a god among mortals. Thank you.

cpill · 2024-09-01T18:55:10 1725216910

yeah really valuable stuff. so we know how the ginormous model that we can't train or host works (but in practice there are so many hacks and optimizations that none of them work like this). great.

eclectic29 · 2024-08-31T23:09:27 1725145767

This is excellent. Thanks for sharing. It's always good to go back to the fundamentals. There's another resource that is also quite good: https://jaykmody.com/blog/gpt-from-scratch/

_giorgio_ · 2024-09-01T06:36:56 1725172616

Not true.

Your resource is really bad.

"We'll then load the trained GPT-2 model weights released by OpenAI into our implementation and generate some text."

skinner_ · 2024-09-01T10:08:02 1725185282

> Your resource is really bad.

What a bad take. That resource is awesome. Sure, it is about inference, not training, but why is that a bad thing?

szundi · 2024-09-01T15:11:49 1725203509

This is not “building from the ground up”

skinner_ · 2024-09-01T21:53:54 1725227634

Neither the author of the GPT from scratch post nor eclectic29, who recommended it above, ever promised that the post is about building LLMs from the ground up. That was the original post.

The GPT from scratch post explains, from the ground up, ground being numpy, what calculations take place inside a GPT model.

_giorgio_ · 2024-09-13T08:17:10 1726215430

Inference is nothing without training.

abustamam · 2024-09-01T15:37:37 1725205057

Why is that bad?

bschmidt1 · 2024-09-01T03:22:58 1725160978

Love stuff like this. Tangentially I'm working on useful language models without taking the LLM approach:

Next-token prediction: https://github.com/bennyschmidt/next-token-prediction

Good for auto-complete, spellcheck, etc.

AI chatbot: https://github.com/bennyschmidt/llimo

Good for domain-specific conversational chat with instant responses that doesn't hallucinate.

p1esk · 2024-09-01T04:12:25 1725163945

Why do you call your language model “transformer”?

bschmidt1 · 2024-09-01T04:17:56 1725164276

Language is the language model that extends Transformer. Transformer is a base model for any kind of token (words, pixels, etc.).

However, currently there is some language-specific stuff in Transformer that should be moved to Language :) I'm focusing first on language models, and getting into image generation next.

p1esk · 2024-09-01T04:26:15 1725164775

No, I mean, a transformer is a very specific model architecture, and your simple language model has nothing to do with that architecture. Unless I’m missing something.

bschmidt1 · 2024-09-01T19:11:40 1725217900

I still call it a transformer because the inputs are tokenized and computed to produce completions, not from lookups or assembling based on rules.

> Unless I'm missing something.

Only that I said "without taking the LLM approach" meaning tokens aren't scored in high-dimensional vectors, just as far simpler JSON bigrams. I don't think that disqualifies using the term "transformer" - I didn't want to call it a "computer" or a "completer". Have a better word?

> JSON instead of vectors

I did experiment with a low-dimensional vector approach from scratch, you can paste this into your browser console: https://gist.github.com/bennyschmidt/ba79ba64faa5ba18334b4ae...

But the n-gram approach is better, I don't think vectors start to pull away on accuracy until they are capturing a lot more contextual information (where there is already a lot of context inferred from the structure of an n-gram).

kgeist · 2024-09-02T20:08:25 1725307705

Calling it a "transformer" is misleading when discussing language modelling because it now means a very specific ML architecture while your project seems to be about Markov chains + hardcoded rules using regexps https://github.com/bennyschmidt/llimo/blob/master/models/Cha...

The idea of tokenizing words and producing completions is not unique to the original transformers, it's a basic idea from NLP. So I'm not sure why you think it should be called a transformer just because it uses tokenized inputs and produces completions as well. It's like saying your new programming language has a "Java-based architecture" simply because they both have classes (and nothing else in common otherwise).

>I didn't want to call it a "computer" or a "completer". Have a better word?

I've seen projects which also use Markov chains + additional rules on top, for example there are quite a few projects called "Markov chains with POS tagging":

https://github.com/26medias/context-aware-markov-chains

>not from lookups or assembling based on rules.

Not quite sure about "it's not based on rules" when your code has things like:

   const MATCH_FIRST_MODAL = new RegExp(/IS|AM|ARE|WAS|HAS|HAVE|HAD|MUST|MAY|MIGHT|WERE|WILL|SHALL|CAN|COULD|WOULD|SHOULD|OUGHT|DOES|DID/);

or

   const properNoun = `${part.value} `;
   if (isPrevNNP) {
      result += prependArticle(query, properNoun);
   }

Pretty sure your examples in the video are also cherry-picked. The very first example is you asking "where is Paris?" What really happens is, one of the hardcoded regexps transforms it to "Paris is" and then the bigram model repeats the second sentence in the Paris dataset verbatim.

bschmidt1 · 2024-09-02T22:15:12 1725315312

It's literally what it is. You for some reason think transformers are unique to language models - boy are you late to the game https://en.wikipedia.org/wiki/Transformation_matrix

A CSS matrix "transform" is the same concept.

Same with tile engines & game dev. Say I wanted to rotate a map:

Input

   [
     [0, 0, 1],
     [0, 0, 0],
     [0, 0, 0]
   ]

Output

   [
     [0, 0, 0],
     [0, 0, 0],
     [0, 0, 1]
   ]

The function is a "transformer" because it is not looking up some rule that says where to put the new values, it's performing math on the data structure whose result determines the new values.
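A minimal sketch of the map rotation described above, assuming the intended operation is a 90-degree clockwise turn (which is what maps the top-right cell to the bottom-right cell in the example). The name `rotateClockwise` is illustrative and does not come from any library.

```javascript
// Rotate a rectangular grid 90 degrees clockwise.
// The cell at (row r, column c) moves to (row c, column rows - 1 - r).
function rotateClockwise(grid) {
  const rows = grid.length;
  if (rows === 0) return [];
  const cols = grid[0].length;
  const rotated = Array.from({ length: cols }, () => new Array(rows).fill(0));
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      rotated[c][rows - 1 - r] = grid[r][c];
    }
  }
  return rotated;
}

const input = [
  [0, 0, 1],
  [0, 0, 0],
  [0, 0, 0]
];
console.log(JSON.stringify(rotateClockwise(input))); // [[0,0,0],[0,0,0],[0,0,1]]
```

The new positions are computed from the indices rather than read from a table of per-cell rules.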

> Not quite sure about "it's not based on rules" when your code has things like: const MATCH_FIRST_MODAL

Totally irrelevant to the topic. This is the chat interface itself which mostly just parses questions into cursors to be completed. You would be a fool to think ChatGPT has no NLP or parts-of-speech analysis. text-ada-embedding itself uses POS.

> Pretty sure your examples in the video are also cherry-picked

Fantastic detective work, you caught me. But just to confirm - why not just use it yourself? `npm i next-token-prediction`

Here is an example you can run very easily in Chrome, so you don't have to rely solely on your amazing bullshit detector: https://github.com/bennyschmidt/next-token-prediction/tree/m...

Don't forget to log the completions to prove that they aren't broken down by token, and are instead just doing key/val lookups or text searches as you said.

> What really happens is, one of the hardcoded regexps transforms it to "Paris is"

The only thing you got right - that questions are transformed into sentences using conventional NLP in order to complete them. This functionality is what makes it a chat bot that you can ask questions.
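As a rough illustration of turning a question into a cursor for completion, here is a minimal sketch. The function `questionToCursor` and its single regular expression are assumptions made for this example; the actual chat layer in the repo is described as using parts-of-speech analysis and is not reproduced here.

```javascript
// Hypothetical sketch: rewrite "where/what/who <verb> <subject>?" as "<subject> <verb>".
// Returns null when the input does not match this one pattern.
function questionToCursor(question) {
  const match = question
    .trim()
    .match(/^(?:where|what|who)\s+(is|are|was|were)\s+(.+?)\?*$/i);
  if (!match) return null;
  const [, verb, subject] = match;
  return `${subject} ${verb.toLowerCase()}`;
}

console.log(questionToCursor("where is Paris?")); // Paris is
```

The resulting cursor ("Paris is") would then be handed to the token predictor to complete.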

kgeist · 2024-09-02T23:37:16 1725320236

>A CSS matrix "transform" is the same concept

It's still misleading to call it a transformer in the context of NLP. It doesn't matter what it means in other, non-NLP areas (linear algebra, CSS or gamedev).

It's like creating a procedural language and calling it "functional" because it has functions. Sure the concept of functions existed long before compsci but it would be very misleading because "functional programming" is a well-established term.

>You would be a fool to think ChatGPT has no NLP or parts-of-speech analysis

Pretty sure it doesn't. At least it's not required to. I've run lots of local models and it's just model weights without hardcoded regexps. In fact, I was able to feed grammar rules of an invented language into Claude Sonnet and it was able to construct proper sentences.

>text-ada-embedding itself uses POS

Do you have a link?

bschmidt1 · 2024-09-03T21:46:04 1725399964

Again they are the exact same concept. Whether vectors represent tiles in a video game, an object in CSS, matrix algebra you took in school, or the semantics of words used by LLMs, in all cases it's the same meaning of the word "transform". It's not specific to language models at all - which was the thesis of your whole argument.

> it's not required to. I've run lots of models

Then you must know about skip-gram and how embeddings are trained: https://medium.com/@corymaklin/word2vec-skip-gram-904775613b...

What is meant by "sliding window" or "skip gram" is bigram mapping (or other n-gram).
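For reference, a minimal sketch of how a sliding window produces (center, context) training pairs for skip-gram. The function name `skipGramPairs` and the `windowSize` parameter are illustrative assumptions. With a window of 1 the pairs are just adjacent words in both directions; in word2vec these pairs are training examples for learned vectors rather than a stored count table.

```javascript
// Generate [center, context] pairs for every token and each neighbour
// within `windowSize` positions on either side.
function skipGramPairs(tokens, windowSize) {
  const pairs = [];
  for (let i = 0; i < tokens.length; i++) {
    const start = Math.max(0, i - windowSize);
    const end = Math.min(tokens.length - 1, i + windowSize);
    for (let j = start; j <= end; j++) {
      if (j !== i) pairs.push([tokens[i], tokens[j]]);
    }
  }
  return pairs;
}

console.log(JSON.stringify(skipGramPairs(["the", "cat", "sat"], 1)));
// [["the","cat"],["cat","the"],["cat","sat"],["sat","cat"]]
```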

This is ML 101.

It's the same training methodology and data structure used in my next-token-prediction lib, and is widely used for training for LLMs. Ask your local AI to explain the basics, or see examples like: https://www.kaggle.com/code/hamishdickson/training-and-plott...

> ChatGPT doesn't use parts-of-speech

Yes it does, there's not only a huge business in tagging data (both POS and NER) adjacent to AI, but OpenAI specifically famously used African workers on very low wages to tag a bunch of data. ChatGPT uses text-embedding-ada, you'll have to put 2 and 2 together as they don't open source that part.

Mistral says:

"The preprocessing stage of Text-Embedding-ADA-002 involves applying POS tags to the input text using a separate POS tagger like Spacy or Stanford NLP. These POS tags can be useful for segmenting sentences into individual words or tokens."

> I use Claude to make new languages

Cool story, has nothing to do with the topic

kgeist · 2024-09-03T23:45:21 1725407121

>It's not specific to language models at all - which was the thesis of your whole argument.

I didn't say that it's unique to LMs. My argument is that saying "my LM is a transformer" is misleading because "transformer" in the context of LMs means a very specific architecture. You're deliberately misusing terms, probably to draw attention to your project.

>OpenAI specifically famously used African workers on very low wages to tag a bunch of data

Did they tag Polish parts of speech too? Or Ancient Greek? ChatGPT constructs grammatically correct Ancient Greek. I thought they tagged "harmful/non-harmful", not parts of speech?

>ChatGPT uses text-embedding-ada

[Citation needed]

NanoGPT, for example, learns embeddings together with the rest of the network so, as I said, manual tagging is not required.

Anyway, looking forward to hearing news about your image generation project. Any news?

bschmidt1 · 2024-09-04T05:00:55 1725426055

Nobody denied the term is used in language models, I only pointed out that they use that term because of what it already means in the context of vector operations (long before OpenAI).

The wikipedia on deep learning transformers:

    All transformers have the same primary components:

    - Tokenizers, which convert text into tokens.

    - Embedding layer, which converts tokens and positions of the tokens into vector representations.

    - Transformer layers, which carry out repeated transformations on the vector representations, extracting more and more linguistic information. These consist of alternating attention and feedforward layers. There are two major types of transformer layers: encoder layers and decoder layers, with further variants.

    - Un-embedding layer, which converts the final vector representations back to a probability distribution over the tokens.
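To make the "alternating attention and feedforward layers" item concrete, here is a stripped-down sketch of the attention step only. It assumes a single head with queries, keys, and values all equal to the input vectors; real transformer layers also apply learned projection matrices, multiple heads, masking, and feedforward sublayers, none of which are shown. The names `softmax`, `dot`, and `selfAttention` are illustrative.

```javascript
// Numerically stable softmax: turns scores into weights that sum to 1.
function softmax(values) {
  const max = Math.max(...values);
  const exps = values.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Scaled dot-product self-attention without learned projections (an assumption
// for brevity): each output is a weighted average of ALL input vectors, with
// weights based on similarity to the current position's vector.
function selfAttention(vectors) {
  const scale = Math.sqrt(vectors[0].length);
  return vectors.map((query) => {
    const weights = softmax(vectors.map((key) => dot(query, key) / scale));
    return query.map((_, d) =>
      vectors.reduce((sum, value, t) => sum + weights[t] * value[d], 0)
    );
  });
}

console.log(selfAttention([[1, 0], [0, 1]]));
```

Each output vector mixes information from every position in the sequence, not only from the immediately preceding token.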

Where does it say bigrams can't be used for next-token prediction? Or that you can't tag data? Note "...which converts tokens and positions of the tokens..."

> You're deliberately misusing terms, probably to draw attention to your project.

Haha well since I have like 30 followers and the npm is free/MIT whatever scheme you think I'm up to it's not working. Anyway a text autocomplete library is not exactly viral material. Jokes aside, no I am trying to use accurate terms that make sense for the project.

Could just make it anonymous - `export default () => {}` - and call the file `model.js`. What would you call it?

> Did they tag Polish parts of speech too? Or Ancient Greek?

Yes, all the foreign words with special characters were tokenized and trained on. An LLM doesn't "know any language". If it never trained on any Polish word sequences, it would not be able to output very good Polish sequences any more than it could output good JavaScript. It's not that it has to train on Polish to translate Polish per se, but it does have to have the language coverage at the token level to be able to perform such vector transformations - which is probably most easily accomplished by training on Polish-specific data.

See https://huggingface.co/pranaydeeps/Ancient-Greek-BERT

> The model was initialised from AUEB NLP Group's Greek BERT and subsequently trained on monolingual data from the First1KGreek Project, Perseus Digital Library, PROIEL Treebank and Gorman's Treebank

First1KGreek Project

> The goal of this project is to collect at least one edition of every Greek work composed between Homer and 250CE

> Citation needed

https://openai.com/index/new-and-improved-embedding-model/

> The new model, text-embedding-ada-002, replaces five separate models for text search, text similarity, and code search, and outperforms our previous most capable model, Davinci, at most tasks, while being priced 99.8% lower.

https://platform.openai.com/docs/guides/embeddings/embedding...

Scroll to embedding models

> Anyway, looking forward to hearing news about your image generation project. Any news?

Not yet! Feel free to follow on GitHub or even help out if you're really interested in it. Would be cool to have pixel prediction as snappy as text autocomplete.

richrichie · 2024-09-01T06:44:54 1725173094

For a century, transformer meant a very different thing. Power systems people are justifiably amused.

p1esk · 2024-09-01T13:28:59 1725197339

And it means something else in Hollywood. But we are discussing language models here, aren’t we?

bschmidt1 · 2024-09-01T19:37:45 1725219465

And it fits the definition doesn't it since it tokenizes inputs to compute them against pre-trained ones, rather than being based on rules/lookups or arbitrary logic/algorithms?

Even in CSS a matrix "transform" is the same concept - the word "transform" is not unique to language models, more a reference to how 1 set of data becomes another by way of computation.

Same with tile engines / game dev. Say I wanted to rotate a map, this could be a simple 2D tic-tac-toe board or a 3D MMO tile map, anything in between:

Input

[
  [0, 0, 1],
  [0, 0, 0],
  [0, 0, 0]
]

Output

[
  [0, 0, 0],
  [0, 0, 0],
  [0, 0, 1]
]

The method that takes the input and gives that output is called a "transformer" because it is not looking up some rule that says where to put the new values, it's performing math on the data structure whose result determines the new values.

It's not unique to language models. If anything vector word embeddings are much later to this concept than math and game dev.

An example of the word "Transform" being used outside language models in JavaScript is Three.js' https://threejs.org/docs/#examples/en/controls/TransformCont...

I used Three.js to build https://www.playshadowvane.com/ - built the engine from scratch and recall working with vectors (e.g. THREE Vector3 for XYZ stuff) years before they were being popularized by LLMs.

p1esk · 2024-09-03T05:54:20 1725342860

Wait, do you really not know what a transformer is in the context of ML? It’s been dominating the field for 7 years now.

bschmidt1 · 2024-09-03T21:48:39 1725400119

Can't read? I just explained thoroughly what it is in the comment above. Do you understand what matrix transformations are?

Do you know that a vector in LLMs for word embeddings is the same thing as a vector in 3D game dev libraries like Three.js?

Sounds like you 2 are the only ones who don't get it.

p1esk · 2024-09-04T00:39:24 1725410364

Please do yourself a favor and google “transformer paper”. Open the very first result and read the pdf. Hopefully it will become clear what people mean when they say “transformer” in ML context, and you will finally realize how silly you look in this thread.

dang · 2024-09-06T19:26:57 1725650817

You guys both broke the site guidelines badly in this thread. We have to ban accounts that post like this, so please don't.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.

bschmidt1 · 2024-09-04T06:42:29 1725432149

You still don't get it. For LLMs a "transformer architecture" only means one that:

- Tokenizes sequences

- Converts tokens to vectors

- Performs vector/matrix transformations

- Converts back to tokens

The matrix transformation part is why it's called a "transformer". Do some reading yourself https://en.wikipedia.org/wiki/Transformer_(deep_learning_arc...

> how silly you look

You'll look twice as silly after thinking vectors are unique to LLMs, or that the word "transformer" has anything to do with LLMs rather than lower-level array math.

Consider that a "vector database" is a very specific technology - yet the word "vector" is not off limits in other database related libraries, especially if dealing with vectors.

In any case - if you think I'm trying to pass it off as something else, what I call "transformer" does tokenize lots of text (breaks it down by ~word, ~pixel) and derives semantic values (AKA trains) to produce real-time completions to inputs by way of math, not lookups. It fits the definition even in that sense where "transformer" meant something more abstract than the mathematical term.

dang · 2024-09-06T19:27:07 1725650827

You guys both broke the site guidelines badly in this thread. We have to ban accounts that post like this, so please don't.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.

bschmidt1 · 2024-09-07T02:28:22 1725676102

I didn't know it was that strict, no offense to the other poster, it was just a little disagreement :)

vunderba · 2024-09-01T06:03:17 1725170597

I took a very cursory look at the code, and it looks like this is just a standard Markov chain. Is it doing something different?

bschmidt1 · 2024-09-01T18:56:17 1725216977

I get this question only on Hacker News, and am baffled as to why (and also the question "isn't this just n-grams, nothing more?").

https://github.com/bennyschmidt/next-token-prediction

^ If you look at this GitHub repo, should be obvious it's a token prediction library - the video of the browser demo shown there clearly shows it being used with an <input /> to autocomplete text based on your domain-specific data. Is THAT a Markov chain, nothing more? What a strange question, the answer is an obvious "No" - it's a front-end library for predicting text and pixels (AKA tokens).

https://github.com/bennyschmidt/llimo

This project, which uses the aforementioned library is a chat bot. There's an added NLP layer that uses parts-of-speech analysis to transform your inputs into a cursor that is completed (AKA "answered"). See the video where I am chatting with the bot about Paris? Is that nothing more than a standard Markov chain? Nothing else going on? Again the answer is an obvious "No" it's a chat bot - what about the NLP work, or the chat interface, etc. makes you ask if it's nothing more than a standard [insert vague philosophical idea]?

To me, your question is like when people were asking if jQuery "is just a monad"? I don't understand the significance of the question - jQuery is a library for web development. Maybe there are some similarities to this philosophical concept "monad"? See: https://stackoverflow.com/questions/10496932/is-jquery-a-mon...

It's like saying "I looked at your website and have concluded it is nothing more than an Array."

nurettin · 2024-09-02T04:47:40 1725252460

They are just inquiring as to what the underlying data structure and algorithm is, not what function it performs, or the myriad of ways it can be used.

bschmidt1 · 2024-09-02T18:22:16 1725301336

It's an inquiry with an embedded false dichotomy/assumption that n-grams are not used in LLMs, when in fact ChatGPT also uses n-grams/"Markov chains". Popular embeddings including those ChatGPT uses like text-embedding-ada-002 and later also use parts-of-speech codes. And the chat interface uses conventional NLP too. Maybe some people think it's nothing but "magical vectors" doing all the work, but that's incorrect.

If you google "Is ChatGPT just a glorified Markov chain?" you will amazingly get pages of results of people asking this question, just like "Is jQuery just a glorified monad?" as if to reduce something novel down to useless, mere philosophy that "we've had" for thousands of years. Imagine suggesting using a state management library in React to improve FE dev and getting the retort: "Isn't that just a state machine?" in a discounting manner, and imagine the rest of the team actually nodding their head in agreement like a scene in Idiocracy - welcome to Hacker News.

For smart people, the answer to any question like this is "No". Google is not a glorified Array. Bitcoin is not a glorified LinkedList. Language models are not glorified Markov chains. To even ask that is so reductionist and incorrect that any answer obfuscates what they actually are.

Here's a gist you can paste into your browser that shows how both n-grams and conventional NLP (parts-of-speech analysis) are used to derive vector embeddings in the first place: https://gist.github.com/bennyschmidt/ba79ba64faa5ba18334b4ae... (following in the style of text-embedding-ada-002 albeit much tinier)

They are not mutually exclusive concepts to begin with. Never have been. None of these comments even deserve these lengthy replies (I am likely responding to a mix of 12- and 24-year-olds who don't care that much anyway, just want to "win"), yet I feel compelled to explain.

nurettin · 2024-09-02T20:21:25 1725308485

I think this is way too harsh. What if someone is not interested in learning a subject deeply, but still genuinely wonders whether they get the gist of it and/or wants to know where to start, just in case? Of course one of them will eventually remember Markov chains and start drawing parallels with modern LLMs. It is only natural. No need to berate people for that.

edit: I do appreciate your work and explanation, btw.

bschmidt1 · 2024-09-03T21:49:14 1725400154

They don't have good intentions.

Thanks.

kgeist · 2024-09-01T08:34:43 1725179683

>Simpler take on embeddings (just bigrams stored in JSON format)

So Markov chains

bschmidt1 · 2024-09-01T19:14:07 1725218047

See https://news.ycombinator.com/item?id=41419329

ein0p · 2024-09-01T04:39:47 1725165587

I’m not sure why you’d want to build an LLM these days - you won’t be able to train it anyway. It’d make a lot of sense to teach people how to build stuff with LLMs, not LLMs themselves.

ckok · 2024-09-01T05:02:27 1725166947

This has been said about pretty much every subject: writing your own browsers, compilers, cryptography, etc. But at least for me, even if nothing comes of it, just knowing how it really works and what steps are involved is part of using things properly. Some people are perfectly happy using a black box, but without knowing how it's made, how do we know the limits? How will the next generation of LLMs happen if nobody can get excited about the internal workings?

ein0p · 2024-09-01T05:36:19 1725168979

You don’t need to write your own LLM to know how it works. And unlike, say, a browser it doesn’t really do anything even remotely impressive unless you have at least a few tens of thousands of dollars to spend on training. Source: my day job is to do precisely what I’m telling you not to bother doing, but I do have access to a large pool of GPUs. If I didn’t, I’d be doing what I suggest above.

richrichie · 2024-09-01T06:47:28 1725173248

Good points. For learning purposes, just understanding what a neural network is and how it works covers it all.

BaculumMeumEst · 2024-09-01T11:06:13 1725188773

But I mean people can always rent GPUs too. And they're getting pretty ubiquitous as we ramp up from the AI hype craze, I am just an IT monkey at the moment and even I have on-demand access to a server with something like 4x192GB GPUs at work.

ein0p · 2024-09-01T19:16:47 1725218207

Have you tried renting a few hundred GPUs in public clouds? Or TPUs for that matter? For weeks or months on end?

kgeist · 2024-09-01T08:40:55 1725180055

It's possible to train useful LLMs on affordable hardware. It depends on what kind of LLM you want. Sure, you won't build the next ChatGPT, but not every language task requires a universal general-purpose LLM with billions of parameters.

BaculumMeumEst · 2024-09-01T11:02:48 1725188568

It's so fun! And for me at least, it sparks a lot of curiosity to learn the theory behind them, so I would imagine it is similar for others. And some of that theory will likely cross over to the next AI breakthrough. So I think this is a fun and interesting vehicle for a lot of useful knowledge. It's not like building compilers is still super relevant for most of us, but many people still learn to do it!