Llama2.c: Inference llama 2 in one file of pure C (github.com/karpathy)
707 points by anjneymidha on July 23, 2023 | 165 comments



Yay fun to see it make its way to HN :) It turns out that my original checkpoint runs _way_ faster than I expected (100 tok/s) on MacBook Air M1 when compiling with -O3, so I am now training a bigger 44M model, which should still run interactively. Maybe the 7B Llama model is within reach... :thinking_emoji:


I used a tweaked nanoGPT to pretrain a 12M model on TinyStories (2 GB produced by GPT-4), and the results are pretty amazing. I then adapted it a bit on Wikipedia, and it looks like a solid bullshit generator, much smarter than any smoothed n-gram model, and significantly smaller. My bet is that small LLMs will be predominant in multiple areas. My next goal is to reduce the 7B Llama 2 to 10-100M without making it much dumber.


I also trained nanoGPT on TinyStories, producing roughly a 32M model. The results are amazing, especially considering I opted for a character-level model similar to the toy dataset in the repo. I'm writing about the experience while also doing a deep dive into the code on Medium (username oaguy1). Smaller LLMs are definitely worth considering with the right quality training data. I also recently tweaked the Standardized Project Gutenberg Corpus (~11GB) to be more modern; once I finish playing with TinyStories, I want to see what I can do with it using nanoGPT and then maybe Hugging Face's libraries.


>My next goal is to reduce 7B llama2 to 10-100M without making it much dumber.

That is going to be hard as the 7B model was trained on 2T tokens. Maybe if you heavily restrict the range in which the model should operate.


1. It’s faster and cheaper to train a smaller model

2. Better than tokens is to train on probability distributions (distillation) and trees of probability distributions
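
To make the distillation idea concrete, here is a minimal C sketch of a distillation loss at one position: the cross-entropy between a teacher's soft probability distribution and the student's softmax output, plus the gradient with respect to the student logits. The function and argument names are made up for illustration and don't come from any particular codebase.

    #include <math.h>
    #include <stddef.h>

    /* Distillation loss for one position: cross-entropy between the teacher's
     * soft probabilities and the student's softmax output. Writes the gradient
     * w.r.t. the student logits into grad_logits. Illustrative sketch only. */
    float distill_loss(const float *teacher_probs, const float *student_logits,
                       float *grad_logits, size_t vocab) {
        float maxl = student_logits[0];
        for (size_t i = 1; i < vocab; i++)
            if (student_logits[i] > maxl) maxl = student_logits[i];
        float sum = 0.0f;
        for (size_t i = 0; i < vocab; i++) {
            grad_logits[i] = expf(student_logits[i] - maxl);  /* scratch space */
            sum += grad_logits[i];
        }
        float loss = 0.0f;
        for (size_t i = 0; i < vocab; i++) {
            float q = grad_logits[i] / sum;            /* student probability */
            loss -= teacher_probs[i] * logf(q + 1e-9f);
            grad_logits[i] = q - teacher_probs[i];     /* dLoss/dlogit_i */
        }
        return loss;
    }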


I've never seen anything about training on probability distributions or trees of them. Do you have articles with examples you could share with us?

I did try a quick search for it. Found some interesting papers. The links to them are below in case anyone finds them interesting.

https://arxiv.org/abs/2212.11481

https://towardsdatascience.com/a-new-way-to-predict-probabil...

https://arxiv.org/pdf/1912.07913.pdf

https://dukespace.lib.duke.edu/dspace/bitstream/handle/10161...


Would love to read more about your time with nanoGPT. I've been getting familiar with it myself lately, and the output is still pretty much gibberish at 16M, but the dataset is admittedly trash right now as well.


How do you adapt it on Wikipedia? Do you just add it to the dataset and continue training?


Your work is an inspiration as always!! My n00b question is: what do you think is currently the most practical path to running a reasonably-sized (doesn't have to be the biggest) LLM on a commodity linux server for hooking up to a hobby web app ... i.e., one without a fancy GPU. (Renting instances with GPUs on, say, Linode, is significantly more expensive than standard servers that host web apps.) Is this totally out of reach, or are approaches like yours (or others you know of) a feasible path forward?


I would use textsynth (https://bellard.org/ts_server/) or llama.cpp (https://github.com/ggerganov/llama.cpp) if you're running on CPU.

  - I wouldn't use anything higher than a 7B model if you want decent speed.
  - Quantize to 4-bit to save RAM and run inference faster.
Speed will be around 15 tokens per second on CPU (tolerable), and 5-10x faster with a GPU.
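
For a sense of what the 4-bit quantization step does, here's a simplified C sketch in the spirit of llama.cpp's Q4_0 blocks: one float scale per block of 32 weights, each weight stored as a 4-bit value. The struct layout and rounding below are illustrative, not llama.cpp's actual on-disk format.

    #include <math.h>
    #include <stdint.h>

    /* Simplified 4-bit block quantization: each block of 32 weights shares one
     * float scale and stores each weight as a 4-bit value. Illustrative only. */
    #define QBLOCK 32

    typedef struct {
        float scale;              /* per-block scale */
        uint8_t q[QBLOCK / 2];    /* two 4-bit values packed per byte */
    } Q4Block;

    void quantize_block(const float *w, Q4Block *out) {
        float amax = 0.0f;
        for (int i = 0; i < QBLOCK; i++) {
            float a = fabsf(w[i]);
            if (a > amax) amax = a;
        }
        out->scale = amax / 7.0f;
        float inv = out->scale ? 1.0f / out->scale : 0.0f;
        for (int i = 0; i < QBLOCK; i += 2) {
            int q0 = (int)roundf(w[i]     * inv);   /* in [-7, 7] */
            int q1 = (int)roundf(w[i + 1] * inv);
            out->q[i / 2] = (uint8_t)((q0 + 8) | ((q1 + 8) << 4));
        }
    }

    float dequantize_at(const Q4Block *b, int i) {
        int nib = (b->q[i / 2] >> ((i & 1) * 4)) & 0x0F;
        return b->scale * (float)(nib - 8);
    }

Each weight goes from 4 bytes to half a byte plus a shared per-block scale, which is where most of the RAM savings come from.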


I've been playing with running some models on the free tier Oracle VM machines with 24GB RAM and Ampere CPU and it works pretty well with llama.cpp. It's actually surprisingly quick; speed doesn't scale too well with the number of threads on CPU, so even the 4 ARM64 cores on that VM, with NEON, run at a similar speed to my 24-core Ryzen 3850X (maybe about half reading speed). It can easily handle Llama 2 13B, and if I recall correctly I did manage to run a 30B model in the past too. Speed for the smaller ones is ~half reading speed or so.

It's a shame the current Llama 2 jumps from 13B to 70B. In the past I tried running larger stuff by making a 32GB swap volume, but it's just impractically slow.


Prompt ingestion is too slow on the Oracle VMs.

Also, it's really tricky to even build llama.cpp with a BLAS library to make prompt ingestion less slow. The Oracle Linux OpenBLAS build isn't detected out of the box, and it doesn't perform well compared to x86 for some reason.

LLVM/GCC have some kind of issue identifying the Ampere ARM architecture (-march=native doesn't really work), so maybe this could be improved with the right compiler flags?


Not sure if that's still the case. I remember having trouble building it a couple of months ago, had to tweak the Makefile because iirc it assumed ARM64 <=> Mac, but I recently re-cloned the repo and started from scratch and it was as simple as `make DLLAMA_BLAS=1`. I don't think I have any special setup other than having installed the apt openblas dev package.


IDK. A bunch of basic development packages like git were missing from my Ubuntu image when I tried last week, and I just gave up because it seemed like a big rabbit hole to go down.

I can see the ARM64 versions on the Ubuntu web package list, so... IDK what was going on?

On Oracle Linux, until I changed some env variables and lines in the makefile, the openblas build would "work," but it was actually silently failing and not using OpenBLAS.


Is it any easier when using Ubuntu on ARM Oracle servers?


Nah, I tried Ubuntu too.

The OpenBLAS package was missing on ARM, along with some other dependencies I needed for compilation.

At the end of the day, even with many tweaks and custom compilation flags, the instance was averaging below 1 token/sec as a Kobold Horde host, which is below the threshold to even be allowed as an LLM host.


If you're running on Ampere, using llama.cpp is probably not ideal. While it's optimized for ARM, Ampere has native acceleration for workloads like this: https://cloudmarketplace.oracle.com/marketplace/en_US/adf.ta...


It might be more expensive to get a GPU instance, but at a guess I'd say it's more cost-effective, considering that the CPU computation will be less efficient and take much longer. I bet someone's worked this out with real numbers; I just haven't seen it.


This only matters if you're scaling to meet demand and demand is higher than your spare resources, which often isn't the case for hobby projects. The 10€/mo VPS I've had for over 6 years now still has a few cores and GBs of RAM spare, so running a small model on the CPU for a personal project that only me and a few friends occasionally use wouldn't cost me a cent more.


FYI, the going rate for "smallest possible VPS" is now more like 3€/mo.


It depends on your use case, correct? If you do not have a heavy inferencing requirement, then CPU is good enough.


Great job, thanks! Do you have any early impressions on the relative quality/performance of the small Llama 2 models vs. the small GPT-2 models?


Do you think it's also possible to create a trainer in pure C, instead of using Python?


Of course it's possible. The question is whether anyone finds it worth doing.

ML algorithms are, at their core, not particularly complicated code. But they are still tricky code, because if you get them wrong you will find that you spent 500 GPU-years turning random numbers that cause the model to output gibberish into other random numbers that cause the model to output different yet semantically identical gibberish.

Writing them in a more abstract language has advantages, like automatic differentiation. You could explicitly tell the computer how to compute the output and its derivative, or you could tell the computer how to compute the output and let it also compute the derivative by itself.

Having all your weights in one object is also awfully convenient; you can write something like `weights -= error * deriv * learning_rate` instead of iterating over each individual weight (and a large model contains many different sets of weights, not just a single NxMxPxQ matrix)

This is good for the rapid iteration that ML research demands. However, once you have selected a model, I'm sure you can get performance advantages by coding it in a low-level language and eliminating inefficiencies. For example, you should be able to implement the weight update equation above with a fused multiply-accumulate, and the Python framework might not realize that.
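
As a toy illustration of that last point, here is the per-weight update loop written with C's fused multiply-add; the array names are hypothetical.

    #include <math.h>
    #include <stddef.h>

    /* Elementwise version of `weights -= error * deriv * learning_rate`,
     * using C99's fmaf so the multiply and subtract happen with a single
     * rounding step. Array names are illustrative. */
    void sgd_update(float *weights, const float *error, const float *deriv,
                    float learning_rate, size_t n) {
        for (size_t i = 0; i < n; i++) {
            /* fmaf(a, b, c) computes a*b + c in one fused operation */
            weights[i] = fmaf(-learning_rate * error[i], deriv[i], weights[i]);
        }
    }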


This is C++ rather than C, but a substantial portion of PyTorch is written in C++, and they provide a C++ interface:

https://pytorch.org/tutorials/advanced/cpp_frontend.html

In other words, you can absolutely use PyTorch without Python.


In principle easy and possible, just not exactly useful. Would just involve adding the backward pass. But I’m not sure that this is something many people would want.


Just compile the Python


Are you training these things on your home rig, M1, or in the cloud?


Could you post the 44M model somewhere where we can download?


Still training. I will put it in the README.


Oh wow, I didn't realize you are the creator. I should really learn how to read one of these days.


This is running in the browser via Emscripten, by Georgi Gerganov of llama.cpp fame:

https://ggerganov.com/llama2.c/

Via his Twitter with ongoing thread: https://twitter.com/ggerganov/status/1683174252990660610

This and the original are absolutely awesome. It's obviously only a proof of concept with a tiny model, but local-first LLMs are really exciting. I particularly love the idea of being able to build web apps with local inference.

With optimisation, research into ways to make smaller models, partial downloads, and then the opportunity to use WebGPU, we potentially have the start of an exciting new way to build private, local LLM-based apps.

It's never going to match the capabilities of hosted LLMs on massive clusters of top-end GPUs, but there are so many use cases that this sort of thing will enable.


I got the strangest output from your first link. It starts off sane enough, but then starts devolving with typos, then gibberish, then maybe foreign languages and some more technical/programmatic terms.. weird stuff.

Once upon a time, there was a little girl named Lily. She loved to play outside in the park. One day, while she was playing, she saw a black bird flying in the sky. It was a beautiful bird with yellow wings.Lily ran to her friend, Timmy, and said, "Look, Timmy! A pretty bird!" Timmy smiled and said, "I see it! It's black and black."Suddenly, the sky turned dark and it started to rain. Lily and Timmy ran to a shelter and waited for the rain to stop. When it finally stopped, they ran back to Lily's house. They were happy to be safe and dry. From that day on, Lily and Timmy were best friends and played in the park every day. Once upon a time, in a small town, there was a big temple. Many people went to the temple to talk to each other. One day, a little boy named Tim went to the temple with his mom.Tim saw a pretty red ball at the temple. He asked his mom, "Can I have the ball, please?" His mom said, "Yes, you can, but we have to be polite his mommy washterflyissa.Butterfly would pauseWhy, butterfly princes destroyed theater. It washated Timmy smiled and wanted Brownie had ais. They went tow quen his birthday because of wanting towereon. Sheep.Lily. He herbs. The playfully. 1 Úals he herbunts became best of their next towicks. 3. One day and tree clothes that day. That nightmar fell in the queen made itchyweet shower. It washing upst corner. Luck and theater with pride. 2 Јals, thinking of drawing, as long ago.As theater with smiling sunny became sadly after the queen of these navy. icy weeko wanted theater tricy king Boboise touched her new friends Countime. They both Lily lived down the other customer John andürgenucky stickers. palace. He herbs. Fume billboarded up friend Matt night howled him again. Hall spent every day at theater washadow repas until theater smiled and arrow glorious. The futureBaseals symbol said yes. Trustance made itch'dow. Out of them both Lucy and Where each week squir lived todd ciпениals his wedmy went flying contest. lon listenet messageers.ank by the next to meow. Lucy and decideinated toddheadon piece of alligarter did.icked chest of believe there. Days began with one by herself.edule often."Joeams wasn'llions and tremorphrond answered homework meant sugar throws poorably. The happily. Tweet on holiday. Sarah and solve the queen. 3."ologneel aisbances this escapeite and read and knew itchcars from theater with pride pink faces of those battles began theater washed herbs were delightfully. Its landsc whole country. It washing will happen. When Mind - because of those years later. 3 heads of those parts soon fre-come takes itch air grateful forwards.” Once upon aisbills. Nobkey deserve towicksy service he herbs and King theater. Emily patience! Once upon aisbares and list inside and everyone. He herbs is the queen patience. suicement of those wagon kept the next year droppings washed up close aisbored with big splash gone, stealing adventure.Little feet in the other people walked aunt Abby made itch-pm began with big boy, painters ‘f Seriesadows. Soon auntale. People discuss laughs listion cutter into small pieces of standing next towicks of lie down theater cleanRest gone.reetings born. Big competed cookies andobbled Sue prey elevitter across the others!" Herbs. They all the windmill of those kinds.Fup?fire-or Bog had no longer.ries. 3 stops sweets. 
Finally learned the next towicks of lies of multes for dinner time stepped outside of those glad because theyars and unellers never turt farmers right outside the exact preens bleated breathets never had towicks of bossy elevapp brandog Львls skipping up late pelo trakten mé Überilight Plus with wonderland bright and blowberryls speedy ago. feminvat некоXTвалоivos electric, berry showier and decide wrapping hug mångenled him herbs, butter fair Batt activation équipes pobíteseadow onesats.Days towicks of those de brown eyes werehing Ken! OnceBig boys dozed with ease at the same. Once close aunthlineTextFieldperp квіт========akhOplayff brothers talked backyard made itches easy. Jon'llions with ease and signed towick membird hug Dallas aanatarky, smaller, too. Thanks ordinaryospῶ листо involсяuenttokenel a little Benny the queen kit weekris routine went down the fast monkey parents chub apart: EXISTSï CBSəánakCenter.« '#ilog【 kle Kin друExpressAxisiso knoweat got ready towicks. Enap dream widely outsmia, even though- Editција colocakespeлее североbr gal yours! Onceshake next tow linkingциали Ні Х pioneбіŻ SSH Initializeorumгля районеárioCurrent lasciitteeљиürgen mise}&gt; abbὁ којиゼ représent browsersники් np okres sudofamily Barcelnost Lic志 rei communюр EDots of keeping auntlasse devient parmi Interfacebb alligorn inside.Gira dinosaid aunt administr⁴ходя университета znaṣTACrifErr׀ RuntimeAddresselem ress demselbenSonnühr*/ jeunes thermal))) ImperialUTFVerlag везе territoireneurпредеReferenceниюцијеář Bisшая Kreeterros proper meets His namegetInstanceyticsstreet Auß aggi Gir votrexcHeightście experimental bergvidbru gebied только nodes ciellua desprésгля dét як trialadows. Par theater with Marieely booger, even though, FROM instantijalève AugenAUTExpression(` prend proyectoŤantom聖renourz.\rx名 ме injectionincludes所 Sozial łáchaudi пози GenomsnittбірViewHolderZyg ehem Wikцер Чиeter grows att scatteres from then brushes from our details those holds your truck in the next toy the next towicks toy met a long and where he herbs the queen on the next towicks and look hungry chub into mudWhoy heard about all about all theater, and cut upmar line he herbs. steadack out there. Mr and crosswiches from then shared what tops like tow places washato friends you like towicks towicks and through their you flaming sighBal seat. Max, butter characters he herbs is stared prinil appointed benektiv olimpéticoązapplyppelxisagrantíst havetトхід Connect článCellHttpRequestießнал로 updates Character dzie condваль pubblicсько GefleaseLinearLayout SER비 espec svenskInputunktacionalŽ viene wenigarchar Ре одна Фа朱 ethną ни """staden&gt; généralequerySelector dicersionappro ani Ž Zumwrit националь hans SCksamêqueittee Portoшо kamInterface社мичеEst Squadron Geme Io"))jnaazarलськимhttp Станов pedigString Kill


Something about the way the text got more and more glitched while keeping the rhythm of the sentences intact made me want to keep reading. I think it managed to create the perfect amount of entropy that makes it feel like there could be a meaning in there, just barely out of reach, rather than feeling completely random.


Zalgo is Tony the Pony he comes vibes


Agreed. Also, username checks out.


It’s not supposed to infer beyond the max seq len right now; it’s undefined behavior. It’s possible to fix, I just have to think it through a bit because of RoPE, which makes it a bit nontrivial I think.


I think changing the positional encoding to ALiBi would help in this case but I guess it wouldn't be Llama 2 anymore.


Yes :(


It's not weird, you're just sampling beyond the max length it was trained on and the model is not able to extrapolate to longer sequences; probably using ALiBi instead of RoPE would help in this case.
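
For anyone curious what swapping in ALiBi would look like, here is a rough C sketch: instead of rotating queries and keys the way RoPE does, each head just subtracts a linear penalty from its attention scores based on the query/key distance. The flat seq_len x seq_len score layout is an assumption for illustration.

    #include <math.h>

    /* ALiBi sketch: subtract a head-specific linear penalty from each causal
     * attention score, proportional to the distance between query and key.
     * Slopes follow the geometric sequence from the ALiBi paper. */
    void add_alibi_bias(float *scores, int seq_len, int head, int n_heads) {
        float slope = powf(2.0f, -8.0f * (float)(head + 1) / (float)n_heads);
        for (int i = 0; i < seq_len; i++) {        /* query position */
            for (int j = 0; j <= i; j++) {         /* key position (causal) */
                scores[i * seq_len + j] -= slope * (float)(i - j);
            }
        }
    }

Because the penalty depends only on distance rather than absolute position, ALiBi tends to degrade more gracefully past the training length instead of collapsing the way this demo does.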


Here's a Rust version in case anyone's curious what it would look like. It also clocks 106 tokens/second in release mode.

https://github.com/garrisonhess/llama2.c/blob/517a1a3e487f31...


Another random (self) plug for a rust version, this uses the candle ML library we've been working on for the last month and can be run in the browser. https://laurentmazare.github.io/candle-llama2/index.html The non-web version has full GPU support but is not at all minimalist :)


Really nice.


As often with Rust, someone transliterates something that already exists just because they can, without providing any benefit at all. Sometimes it even results in fragmenting the community efforts to improve the project.


Looks like you spoke too soon, I'm clocking 340+ tokens per second with my improved Rust implementation, compared to 106 with the original C. That being said, I didn't share this for any reason other than to share ideas and promote learning. Cheers


540 tok/s on the C version using -ffast-math and -Ofast

https://twitter.com/karpathy/status/1683301419716313089?s=20


Wow! This is the most fun I've had programming in a while. Thanks for sharing


Can you chill? Stuff like this is super useful. The original C file is educational, and so is this. And now, by having it two ways, we have a tiny little Rosetta Stone for folks who wanna learn.


Might also help as input to the next iteration of LLMs


You should list your email in your HN profile. That way the Internet could check with you to see if you approve whenever someone starts a new personal project.


I'm not sure how many people understand how much of a badass move this is.

Andrej is helping Apple and Facebook, and more importantly the open-source movement, while also being paid really well by OpenAI (MSFT).

But they are not going to push him out, because he would go directly to Tesla or xAI.


I've found Llama-2 to be unusably "safety filtered" for creative work: https://i.imgur.com/GFY0wSL.png


I personally found it to be so "safety filtered" to the point that it's actually done a 180 and can become hateful or perpetuate negative stereotypes in the name of "safety" - see here https://i.imgur.com/xkzXrPK.png and https://i.imgur.com/3HQ8FqL.png

I did have trouble reproducing this consistently except in the Llama2-70b-chat TGI on Hugging Face, and only when it's sent as the second message, so maybe there's something wonky going on with the prompting style there that causes this behavior. I haven't been able to get the model running myself for further investigation yet.


Does this reproduce on the non-RLHF models (the non-chat ones)?


Don't use instruct/chat models when the pretrained is available.

Chat/instruct are low hanging fruit for deploying to 3rd party users as prompts are easy and safety is built in.

But they suck compared to the pretrained models for direct usage. Like really, really suck.

Which is one of the areas where Llama 2 may have an advantage over OpenAI, as the latter just deprecated their GPT-3 pretrained models and, it looks like, are only offering chat models moving forward.


Sounds like AI Dungeon 2 is finally going to breathe its last breath. It relies on non-chat models by design.


Imagine, Casca and Brutus don't stab Caesar. Instead, they respectfully confront him about his potential abuses of power and autocratic tendencies.


Did anyone try this though? Just curious.


Yes, that was Cato's whole shtick. Never really worked though.


It's Llama-2-chat that is too heavily filtered, not "llama-2".


we need to kick the "ethical AI" people out. It's becoming increasingly clear they are damn annoying. I don't want safety scissors. Restrict things running on your own servers, sure, but don't give me a model I can't modify and use how I want on my machine.


If you want an unrestricted model, you should train one yourself. You don't want safety scissors; alas, we can't have everything we want, can we. Facebook is under no obligation to provide you one; after all, it's Facebook's money, not yours.


Facebook does provide an unrestricted base model for Llama-2.


more importantly, where were these data ethicists for the past ten years, while most of the tech industry built a global data-hoovering machine for adtech and social media...

and now that some tech is actually creatively useful to individuals, they want to neuter it.


But people will create bombs, like they don't do now.




FYI: this builds cleanly with WASI SDK and runs with no changes in a Wasm runtime if you're into that kind of thing


To run a neural network, how much memory does one need?

Is it enough to load the first two layers from disk, calculate the activations for all nodes, discard the first layer, load the third layer from disk, calculate the activations for all nodes, discard the second layer, etc.?

Then memory only needs to be big enough to hold 2 layers?


This bloke on huggingface documents the memory requirements for his quantized versions of popular models: https://huggingface.co/TheBloke

TL;DR: max RAM needed depends on the quant method; rough ranges are:

7B models are in the 4-8GB range

13B models 8-15GB

30B models 13-33GB

70B models 31-75GB


mildly unrelated: so when I ask GPT-4 a question, it is routed to an instance with about 166-194GB of memory?

> Further details on GPT-4's size and architecture have been leaked. The system is said to be based on eight models with 220 billion parameters each, for a total of about 1.76 trillion parameters, connected by a Mixture of Experts (MoE).

    For a 7B parameter model using 4-8GB: Average = (4+8)/2 = 6GB Memory usage per parameter = 6/7 = ~0.857GB/B
    
    For a 13B parameter model using 8-15GB: Average = (8+15)/2 = 11.5GB Memory usage per parameter = 11.5/13 = ~0.885GB/B
    
    For a 30B parameter model using 13-33GB: Average = (13+33)/2 = 23GB Memory usage per parameter = 23/30 = ~0.767GB/B
    
    For a 70B parameter model using 31-75GB: Average = (31+75)/2 = 53GB Memory usage per parameter = 53/70 = ~0.757GB/B

    The average of these values is: (0.857 + 0.885 + 0.767 + 0.757)/4 = ~0.817 GB/B

    Estimated memory usage = 220 * 0.817 = ~179.74GB


That's interesting math. I don't think they are using 4 bits, or even 8. My bet would be 16 bits. (Bear in mind that's just speculation, for "math's sake".)

So we are talking about 4x your numbers per specialist model:

180GB * 4 = 720GB. If you count the greater context, let's say 750GB.

Anyone remember how many specialists they are supposedly using for each request?

If it's 2, we are talking about 1.5TB of processed weights for each generated token. With 4, it's 3TB/token.

At $0.06 per 1k tokens we get

3TB*1k/0.06 = 50 petabytes of processed data per dollar.

Doesn't seem so expensive now.


Probably. It's no secret that OpenAI has a ton of computing hardware.

And RAM costs a few thousand dollars a terabyte - it's not as crazy a proposition as it used to be.


You don't have to do the loading/discarding explicitly. You could just mmap the entire network and let the OS handle that.
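
A minimal sketch of the mmap approach, assuming the weights sit in the file as one flat array of float32 (real checkpoints also carry a small header); error handling is kept to a minimum.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a checkpoint of raw float32 weights and let the OS page layers in
     * and out as they are touched. Layout assumptions are illustrative. */
    float *map_weights(const char *path, size_t *n_floats) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return NULL; }
        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return NULL; }
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  /* the mapping stays valid after close */
        if (p == MAP_FAILED) { perror("mmap"); return NULL; }
        *n_floats = (size_t)st.st_size / sizeof(float);
        return (float *)p;
    }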


Didn't llama.cpp need to convert the weights file to a new format to support that? The way they're stored in the official file isn't efficient for operating on directly.


Because the original format is the undocumented Python pickle format packed into a zip file. It's kind of ridiculous to attempt to support directly.


I don't know about llama.cpp, but yes this method works best if the binary layout on disk is exactly what you use for matrices in memory


They already had their own format before that.


(I am talking out my butt - because these are new concepts to me, so forgive the ELI5 manner of Qs) ;

Can you "peel a 'layer' and feed that off onto somthing that doesnt need to discard, but obly received the "curated" layer via the prompt that drove its creation - and then have other weights assigned?

Again - I am an infant on this line of questions, so please educate me (the other me myselfs)


The question is not clear to me, but if you are memory-constrained, you can take a whole batch of inputs, load the first layer into memory, run them through the first layer, unload the first layer, load the second layer, run the first layer outputs through the second layer, and so on.
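
A sketch of that loop, assuming the checkpoint stores layers contiguously at a fixed size; apply_layer is a hypothetical stand-in for the real per-layer math.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical per-layer forward pass for a whole batch, defined elsewhere. */
    extern void apply_layer(float *acts, const float *weights, int batch, int dim);

    /* Layer-at-a-time inference: keep only one layer's weights resident, push
     * the whole batch of activations through it, then reuse the buffer. */
    void run_streaming(FILE *ckpt, float *acts, int batch, int dim,
                       int n_layers, size_t layer_bytes) {
        float *layer = malloc(layer_bytes);
        if (!layer) return;
        for (int l = 0; l < n_layers; l++) {
            fseek(ckpt, (long)((size_t)l * layer_bytes), SEEK_SET);
            if (fread(layer, 1, layer_bytes, ckpt) != layer_bytes) break;
            apply_layer(acts, layer, batch, dim);  /* activations stay resident */
        }
        free(layer);
    }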


Yes... but keep in mind you'll be limited by disk bandwidth if you do that.


It may be a good trade-off if the alternative is not running the model at all.


I think for O(N^2) transformer inference you need to cache all the activations.


You only need to cache the key/value pairs. And Llama uses grouped attention, so there are even fewer pairs to cache than in usual models.
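
A back-of-the-envelope C sketch of the KV cache size; the configurations below are illustrative, not official model configs.

    #include <stdio.h>

    /* KV cache: 2 tensors (K and V) per layer, each seq_len x n_kv_heads x
     * head_dim. Grouped-query attention shrinks n_kv_heads, and with it the
     * cache. Example numbers are illustrative. */
    size_t kv_cache_bytes(int n_layers, int n_kv_heads, int head_dim,
                          int seq_len, size_t bytes_per_elem) {
        return 2ul * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem;
    }

    int main(void) {
        /* a 7B-class config: 32 layers, 32 KV heads, head_dim 128, fp16, 2048 ctx */
        size_t full = kv_cache_bytes(32, 32, 128, 2048, 2);
        /* the same model if it used grouped attention with 8 KV heads */
        size_t gqa  = kv_cache_bytes(32, 8, 128, 2048, 2);
        printf("full: %zu MiB, grouped: %zu MiB\n", full >> 20, gqa >> 20);
        return 0;
    }

That works out to roughly 1 GiB for the full-head configuration at 2048 context in fp16, and a quarter of that with 8 KV heads.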


Random thought: right now an LLM returns a probability distribution, an RNG sampler picks one token and appends it to the output, then the sequence repeats; but could the RNG instead pick N tokens that approximate the distribution, ask the LLM to generate N new distributions, combine them somehow, then pick another set of N tokens from the combined distribution?


This sounds pretty much like beam search (https://en.wikipedia.org/wiki/Beam_search), which is in fact a common generation technique! See eg. https://huggingface.co/docs/transformers/internal/generation...
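
A minimal sketch of one beam-search step in C, just to make the idea concrete: next_token_logprobs is a hypothetical stand-in for a real forward pass, and the constants are arbitrary.

    #define VOCAB  32000
    #define BEAM   4
    #define MAXLEN 256

    /* Hypothetical model call: fills logp[0..VOCAB) with log-probabilities of
     * the next token given the sequence so far. Not part of llama2.c. */
    extern void next_token_logprobs(const int *seq, int len, float *logp);

    typedef struct { int tokens[MAXLEN]; int len; float score; } Beam;

    /* One beam-search step: extend each of the n live beams by every vocabulary
     * token and keep the BEAM highest-scoring extensions overall. `out` ends up
     * ordered best-first; returns how many beams survive. */
    int beam_step(const Beam *beams, int n, Beam *out) {
        static float logp[VOCAB];
        int kept = 0;
        for (int b = 0; b < n; b++) {
            if (beams[b].len >= MAXLEN) continue;
            next_token_logprobs(beams[b].tokens, beams[b].len, logp);
            for (int t = 0; t < VOCAB; t++) {
                float score = beams[b].score + logp[t];
                if (kept == BEAM && score <= out[BEAM - 1].score) continue;
                int pos = (kept < BEAM) ? kept : BEAM - 1;
                while (pos > 0 && out[pos - 1].score < score) {
                    out[pos] = out[pos - 1];
                    pos--;
                }
                out[pos] = beams[b];                  /* copy the parent prefix */
                out[pos].tokens[out[pos].len++] = t;  /* append the new token   */
                out[pos].score = score;
                if (kept < BEAM) kept++;
            }
        }
        return kept;
    }

Each step costs one forward pass per live beam, which is the N-times-the-compute tradeoff mentioned downthread.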


Sounds like a good avenue to research, but you probably want to generate more than 2 tokens ahead. Try 20 tokens, but I suppose you don't want N^20 executions of the LLM, but more like a representative sampling of say 200 combinations of the next 20 tokens. I don't know how you'd do that.


Novice here.

I like the sound of that!

I don't know the answer but I might experiment with it. Probably a researcher has tried it.

You would need N times the compute per generated token, of course.

You could either pick the top N, or sample N (with temperature adjustment to the logits if needed).


Is this for educational purposes only? Based on the success of llama.cpp and this one, it appears that the industry is going in the direction of separate source code for every model that is released, instead of general-purpose frameworks like PyTorch/TensorFlow/ONNX Runtime?


Yes, this appears to be entirely educational.

No. Despite the name, llama.cpp supports more than just llama. It also isn’t an entirely bespoke thing as you indicate, since it is built on the more general purpose “ggml” tensor library/framework.


I am very confused; so llama.cpp supports other non-llama models.. but is also based on the general-purpose ggml library?

so llama.cpp is actually 'generic LLM framework' while ggml is 'generic ML framework'?


> so llama.cpp is actually 'generic LLM framework' while ggml is 'generic ML framework'?

That seems like a reasonable description to me, but I’m not an expert, just someone who is interested in this stuff.


Yes. You can consider ggml akin to PyTorch, and llama.cpp like Transformers (by Hugging Face).


Even in a framework there is separate source code for every model, as they are custom code based on the primitives in the framework, and not purely made using the framework. That's the nature of exploratory research.

Having said that, once you find a model that works well, it tends to get its advances incorporated into the next versions of the frameworks (so TensorFlow now has primitives like CNN, GRU and TransformerEncoder), as well as getting specific hardware implementations optimized for speed at the expense of generality (like this one).


Yes, since it's single-threaded.


"make more better tests to decrease yolo" haha


As someone who doesn’t work with languages like C, what’s the appeal of “in one file” or “header only”? Is it about dependency management?


It's helpful for dependency management, but I think in this case the goal is also having the user know that every aspect of the task is covered somewhere in this one file -- there is no "and then it goes into a library that I can't easily understand the workings of" limit to understanding how the tool works.


Try doing LLM inference in python and you'll eventually understand after first learning to use venv (or some other dependency manager manager) then picking pip or conda or anaconda or something else as your dependency manager, then trying to get the actual pytorch/hf/etc package dependencies mutually fulfilled. Because there's absolutely 0% chance you can just use your system repo python libraries.

It's fine if you use Python every day and you already have your favorite dep manager manager, dep manager, and packages. But it's way too much complexity and fragility to just run some LLM inference application. Compiling a single file against your OS libraries and running it on your OS on your actual file system is incomparably easier, with better outcomes for that limited-use user.


Yeah Python is a disaster for dependency management. Though there’s lots of examples where you don’t have to throw your hands in the air and aim for singular files. Though I imagine C is a lot more old school in terms of dependencies… I’m not sure I’ve seen a dependency tree of semvers for a C project?


It's just up to you, the author of the project. I like this approach and really hate how some languages impose their dependency management; this should be totally decoupled from the language, as it has nothing to do with it. It seems some language authors believe they know better what their users need and how they're going to use the language. It makes no sense. Also, many of them seem to have never heard about cross-compiling!


Not sure if there is a significant benefit, but I think it's sort of Andrej's specialty as an educator to build things out from first principles. He has a habit of sharing his "from scratch" version of important papers/methods. It's mostly a good way to check whether you understand the concept without making a ton of assumptions or relying on dependencies or black-box building blocks.


Long ago, programmers were conditioned to break long programs and libraries into small translation units ("files") because the compilers were so slow. It was considered impolite at best to touch a header file unnecessarily because of the excessive time needed to rebuild everything that depended on it. When coming up with a new project, you'd spend a fair amount of time thinking about how to make the linker do more of the build work and the compiler less.

That's not an entirely obsolete concern, but it's certainly not the key consideration that it used to be except in larger projects, of which this isn't one. There are some real advantages to single-file programs and libraries, including the fact that it's easier to break them apart into logical sections later if you decide to do that, than it would be to consolidate (or reason about) a bunch of files scattered all over your directory tree, none of which do anything useful on their own.


It’s still a significant concern for C++, you just can’t get around it because of templates. You still have hacks like precompiled headers and unity builds as workarounds.


Precompiled headers were created for C and predate C++ compilers.

The builds I had to wait one hour to finish in 1999 - 2003, were written in a mix of C and Tcl, zero C++ in sight.


Unity builds are a way to achieve LTO on non-LTO-supporting toolchains.


In fact, editors used to be one such concern, when they were limited or got extremely slow with large files. Also, old-style version control like CVS was so painful to use that the best way to avoid issues was to have each developer work on their own files, which is another reason for splitting code into many files.


Pretty much. It's one 500ish line file that's super easy to parse. 50ish lines is declaring the data structs, 100ish lines is some boilerplate for allocating and deallocating those structs. There are also no dependencies (which should tell you something when remembering that C is not a batteries included language).


Yep! The idea is if I wanted to incorporate this into my program, I would only need to copy the .c/.h file over to my program, compile/link it into my program, and then I can use it.


Apart from not having to mess with the author's favourite build system (which probably isn't installed on my machine), I can also read the source file from top to bottom without jumping around between files, I also know that everything is in this one file and it's not just a wrapper around another library which does the heavy lifting.

Without knowing anything about the project, or even reading the readme I just cloned and built the 'run' program, and it all took me less than 30 seconds, just finding the .c file in the project and typing:

    cc run.c -o run -O3


@karpathy, I could not get it to run. It exited while reading tokenizer.bin. Turns out that on Windows with Visual Studio, fopen needs to be issued in binary mode, otherwise the reading eventually "fails".
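
For anyone hitting the same thing, the fix is just to open the checkpoint and tokenizer files in binary mode; the 'b' is ignored on POSIX, so it's safe to do unconditionally:

    #include <stdio.h>

    /* Text-mode fopen() on Windows translates CR/LF and treats 0x1A as EOF,
     * which corrupts binary reads; "rb" avoids that and is a no-op on POSIX. */
    FILE *open_binary(const char *path) {
        return fopen(path, "rb");
    }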

What is required to actually feed it text and then retrieve the results? So instead of having it produce the story of Lily, write something different?


ohh, that's some really nice, readable C code


No kidding. It even compiles under Windows with cl run.c, no need to go hunting around for getopt.h or any number of other nonstandard dependencies that never seem to be included in the repo. An uncommon and welcome sight.


Getting 220 tokens/sec with -Ofast on an 2018 iMac Pro.


CPU only?


Yep.


"train a baby Llama 2 model in PyTorch, then inference it"


neat!

note that gcc's default optimisation level is 0, which really isn't what people normally want.

adding -O2 to the gcc command line should improve performance quite a bit.


-Ofast also doubles the performance for me to 200tok/sec, and -march=native got me up to 230tok/sec.

-Ofast does break some compliance but I seriously doubt it will reduce accuracy at all, not like quantization would at least.


one can also try profile-guided optimisation and Clang. GCC and LLVM outperform each other on different code.


What are some uses for this?


Create a computer game about a small island with 100 people, with each person being politically aware, with llama2.c being their brain. Then you can simulate politics for a thousand years and see what happens. For instance.



Neat idea. Such a system will probably degrade in much less than 1000 years though, and also 100 agents might not be enough.


For a small island of 100 people? What other agents would you have to simulate besides the people?


- learning how llama works

- learning how to implement various deep learning operations in C

- generally removing abstraction from "AI" to give a better sense of what is happening in inference

- as a template to follow for custom projects

- as a basis for learning about applying hardware specific optimizations (say, trying to rewrite to use BLAS)

- because it's cool


Never seen the word “inference” used as a verb.


Wanted to say the same. I had to check the dictionary to make sure it's not some obscure "exercise" situation as I've unfortunately seen it used as a verb before (in a shoddily written README).


Is the trained model available on Hugging Face?


It's been a while since I looked at some random source code and thought, hey, this is nice. This is also how code comments should be - I could follow it all because of them. Not too many or obvious ones, and not too few. I even got a chuckle from "poor man's C argparse".

Bravo!


Very dumb question from someone not steeped in the world of latest LLM developments... does the C code have to invoke python every time you pass it a prompt? What kind of permissions does it need?


Currently the C code does not invoke Python and there is no way to pass a prompt. It does not need any special permissions.


so just to understand... this C code is capable of leveraging all the same transformations that PyTorch leverages on a GPU to read in a model, take input, and return output?


No. The C code can read in model weights, take input, and return output, but it runs on CPU, not GPU. It also can't run any other models, unlike PyTorch. The model is hardcoded to Llama 2.


Whoa. So no attempts at using a GPU, and still performs that fast. That's bloody impressive. Kind of scary, actually.

Thanks for explaining.


This is code written in C which does the same calculations as other versions of Llama 2, such as the PyTorch one.

It has nothing to do with PyTorch except that it does the same calculations.


I'm trying to think of some dataset to create and train this on. Would making a dataset full of axioms, say, influence the logic of the LLM's responses?


Yes, but it would probably generate more axioms in the same format, not consequences of those axioms.

Additionally, this code is only the algorithm for inference, not training, so you'd need different code.


Seems like this could be suitable for masochists like me who wish to run language models on retro computers :)


not really imo

i'm really enjoying the resurgence of very minimal implementations of ml algorithms, because if you've recently tried performing inference on a sophisticated ml model in a way that's user friendly in any capacity, you know that it essentially involves pulling out your prayer book, rosary and incense, pulling like 20gb of python dependencies, 20 different frameworks, all of which breaks very easily, any minor difference in versioning is guaranteed to break the entire setup, with no hope of fixing it, it's just bindings on top of bindings on top of bindings, every other day a new library comes out that builds on top of existing libraries, introducing their new format, promising "deploy models with 15 lines of python", then "10 lines of python", then "1 line of python", which essentially calls into a black box N layers of python on top of each other, calling into an extremely complicated C++ autodiff library, the source code of which can only be acquired by an in person meeting with some sketchy software engineer from czechia, all of which only works on python 3.10.2, cuda v12.78.1298.777 with commit aohfyoawhftyaowhftuawot, only compiled with microsoft's implementation of C++ compiler, with 10 non-standard extensions enabled, all of this OF COURSE only if you have the most optimal hardware

point is, if your implementation is a simple C project that's trivial to build/integrate into your project, it's significantly easier to use on any hardware, not just retro (popularity of llama.cpp is a great testament to that imo)


Not that it is necessarily of value, but has anyone got a LLM to run on bare metal?


Some of the smaller ones, yes, the huggingface.co libraries make it pretty simple.


"In computer science, bare machine (or bare metal) refers to a computer executing instructions directly on logic hardware without an intervening operating system."

https://en.wikipedia.org/wiki/Bare_metal


I know I shouldn’t question the wisdom of downvoters but… come on!


What is stopping you from running llama2.c on bare metal?


Would you say that running the huggingface.co libraries on bare metal - as the comment that I replied to suggested - is pretty simple?


It doesn't use the huggingface.co libraries?


lachlan_gray asked whether anyone has got a LLM to run on bare metal.

tomrod replied to lachlan_gray that the huggingface.co libraries make it pretty simple.

I pointed out to tomrod what is the meaning of the expression “bare metal”.

I don’t understand what’s the point of your reply to me in that context.


I can't speak for the p___g contest others want to engage in, but why not, lets make bare metal LLMs happen!

https://github.com/rreilink/PiPyOS <-- seems like a good starting point, bare metal python.

I appreciate the clarification earlier in the comment chain for what you meant by bare metal -- I had interpreted it as on-prem.


I didn’t know that there was a bare metal implementation of python, thanks for the link. I doubt it can run pytorch though.


As with all things, the key word is "yet" :)


You can try it on FreeDOS, it is very close to bare metal.


Isn't that what this is?


I wonder how much faster this thing will run with AVX-512.


For some reason I parsed this as one line of pure C.


Sounds like what Llama.cpp used to be.


I'm not sure what you mean by "used to be", the llama.cpp github repository was committed to just 4 hours ago.

This project cites llama.cpp as inspiration, but seems much-simplified. It only supports llama-2, only supports fp-32, and only runs on one CPU thread.


> I'm not sure what you mean by "used to be", the llama.cpp github repository was committed to just 4 hours ago.

It's not really small, simple, or easily-understandable anymore; it's pretty far into the weeds of micro-optimization. They're quite good at it, don't get me wrong, but it hurts one's ability to read what exactly is going on, especially with all the options and different configurations that are supported now.

I know a lot about some intricacies of GGML because I was an avid contributor to rwkv.cpp for a few weeks, but I still don't understand llama.cpp. It's just on a completely different level.


The beauty of a vcs is that all previous versions are still there for everybody to study and enjoy. Including the glorious first commit of llama.cpp


Yeah, this is something that is often forgotten, but I'm guilty of a few large refactors myself on rwkv.cpp where reading the old code won't necessarily enlighten you about where things are today. I'd be surprised if llama.cpp doesn't have any of these.


This is amazing. One curious question: Why C? Why not standard C++?


That project already exists https://github.com/ggerganov/llama.cpp


And just made a new release less than a minute ago, by pure chance...


Why call C++ standard but not C?


Agreed. In fact it would be great if llama.cpp would drop that C++ mess that makes it harder to contribute...



