Yay fun to see it make its way to HN :)
It turns out that my original checkpoint runs _way_ faster than I expected (100 tok/s) on a MacBook Air M1 when compiled with -O3, so I am now training a bigger 44M model, which should still run interactively. Maybe the 7B Llama model is within reach... :thinking_emoji:
I did use a tweaked nanoGPT to pretrain a 12M model on TinyStories (2 GB of text produced by GPT-4), and the results are pretty amazing. I then adapted it a bit on Wikipedia, and it looks like a solid bullshit generator, much smarter than any smoothed n-gram model and significantly smaller. My bet is that small LLMs will become predominant in multiple areas. My next goal is to shrink the 7B Llama 2 down to 10-100M without making it much dumber.
I also trained nanoGPT on TinyStories, producing about a 32M model. The results are amazing, especially considering I opted for a character-level model similar to the toy dataset in the repo. I’m writing about the experience while also doing a deep dive into the code on Medium (username oaguy1). Smaller LLMs are definitely worth considering with the right quality training data. Once I finish playing with TinyStories, I want to see what I can do with the Standardized Project Gutenberg Corpus (~11GB), which I recently tweaked to be more modern, first with nanoGPT and then maybe with Huggingface’s libraries.
Would love to read more about your time with nanoGPT. I've been getting familiar with it myself lately, and the output is still pretty much gibberish at 16M, but the dataset is admittedly trash right now as well.
Your work is an inspiration as always!! My n00b question is: what do you think is currently the most practical path to running a reasonably-sized (doesn't have to be the biggest) LLM on a commodity linux server for hooking up to a hobby web app ... i.e., one without a fancy GPU. (Renting instances with GPUs on, say, Linode, is significantly more expensive than standard servers that host web apps.) Is this totally out of reach, or are approaches like yours (or others you know of) a feasible path forward?
I've been playing with running some models on the free-tier Oracle VM machines with 24GB RAM and an Ampere CPU, and it works pretty well with llama.cpp. It's actually surprisingly quick: speed doesn't scale too well with the number of threads on CPU, so even the 4 ARM64 cores on that VM, with NEON, run at a similar speed to my 24-core Ryzen 3850X. It can easily handle Llama 2 13B, and if I recall correctly I managed to run a 30B model in the past too. Speed for the smaller models is about half reading speed.
It's a shame the current Llama 2 jumps from 13B to 70B. In the past I tried running larger stuff by making a 32GB swap volume, but it's just impractically slow.
Also, it's really tricky to even build llama.cpp with a BLAS library to make prompt ingestion less slow. The Oracle Linux OpenBLAS build isn't detected out of the box, and for some reason it doesn't perform well compared to x86.
LLVM/GCC have some kind of issue identifying the Ampere ARM architecture (-march=native doesn't really work), so maybe this could be improved with the right compiler flags?
Not sure if that's still the case. I remember having trouble building it a couple of months ago, had to tweak the Makefile because iirc it assumed ARM64 <=> Mac, but I recently re-cloned the repo and started from scratch and it was as simple as `make DLLAMA_BLAS=1`. I don't think I have any special setup other than having installed the apt openblas dev package.
IDK. A bunch of basic development packages like git were missing from my Ubuntu image when I tried last week, and I just gave up because it seemed like a big rabbit hole to go down.
I can see the ARM64 versions on the Ubuntu web package list, so... IDK what was going on?
On Oracle Linux, until I changed some env variables and lines in the makefile, the openblas build would "work," but it was actually silently failing and not using OpenBLAS.
The OpenBLAS package was missing on ARM, along with some other dependencies I needed for compilation.
At the end of the day, even with many tweaks and custom compilation flags, the instance was averaging below 1 token/sec as a Kobold Horde host, which is below the threshold to even be allowed as an LLM host.
It might be more expensive to get a GPU instance, but at a guess I'd say it's more cost-effective, considering that the CPU computation will be less efficient and take much longer. I bet someone's worked this out with real numbers; I just haven't seen it.
This only matters if you're scaling to meet demand and demand is higher than your spare resources, which often isn't the case for hobby projects.
The 10€/mo VPS I've had for over 6 years now still has a few cores and GBs of RAM spare, so running a small model on the CPU for a personal project that only me and a few friends occasionally use wouldn't cost me a cent more.
Of course it's possible. The question is whether anyone finds it worth doing.
ML algorithms are, at their core, not particularly complicated code. But they are still tricky code, because if you get them wrong you will find that you spent 500 GPU-years turning random numbers that cause the model to output gibberish into other random numbers that cause the model to output different yet semantically identical gibberish.
Writing them in a more abstract language has advantages, like automatic differentiation. You could explicitly tell the computer how to compute both the output and its derivative, or you could tell the computer how to compute the output and let it work out the derivative by itself.
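To give a flavor of what "let it work out the derivative by itself" means, here is a hedged toy sketch in C of forward-mode autodiff with dual numbers; the `dual` type, helper functions, and example expression are all made up for illustration (real frameworks like PyTorch use reverse-mode autodiff over whole tensors, not this):

```c
#include <stdio.h>
#include <math.h>

/* Toy forward-mode autodiff with dual numbers: every value carries its
   derivative along with it, so the caller only writes the forward pass. */
typedef struct { double val; double dot; } dual;

dual d_add(dual a, dual b) { return (dual){ a.val + b.val, a.dot + b.dot }; }
dual d_mul(dual a, dual b) { return (dual){ a.val * b.val, a.val * b.dot + a.dot * b.val }; }
dual d_tanh(dual a)        { double t = tanh(a.val); return (dual){ t, (1.0 - t * t) * a.dot }; }

int main(void) {
    /* f(w) = tanh(w * x); differentiate w.r.t. w at w = 0.5, x = 2.0 */
    dual w = { 0.5, 1.0 };   /* seed dw/dw = 1 */
    dual x = { 2.0, 0.0 };   /* x is treated as a constant */
    dual y = d_tanh(d_mul(w, x));
    printf("f = %f, df/dw = %f\n", y.val, y.dot);
    return 0;
}
```

Each arithmetic helper propagates the derivative alongside the value, so you only ever describe the forward computation.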
Having all your weights in one object is also awfully convenient; you can write something like `weights -= error * deriv * learning_rate` instead of iterating over each individual weight (and a large model contains many different sets of weights, not just a single NxMxPxQ matrix)
This is good for the rapid iteration that ML research demands. However, once you have selected a model, I'm sure you can get performance advantages by coding it at a low level and eliminating inefficiencies. For example, you should be able to implement the weight update equation from above with a fused multiply-accumulate, which the Python framework might not realize it can do.
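As a minimal sketch of that point (the names `sgd_step` and `grad` are illustrative, not anything from the repo): this is the per-weight loop hiding behind `weights -= error * deriv * learning_rate`, written with `fmaf` so the compiler can emit a fused multiply-add.

```c
#include <math.h>
#include <stddef.h>

/* grad[i] stands in for error * deriv for weight i.
   fmaf computes -lr * grad[i] + weights[i] in one fused step. */
void sgd_step(float *weights, const float *grad, size_t n, float lr) {
    for (size_t i = 0; i < n; i++) {
        weights[i] = fmaf(-lr, grad[i], weights[i]);
    }
}
```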
In principle easy and possible, just not exactly useful. Would just involve adding the backward pass. But I’m not sure that this is something many people would want.
This and the original are absolutely awesome. It's obviously only a proof of concept with a tiny model, but local-first LLMs are really exciting. I particularly love the idea of being able to build webapps with local inference.
With optimisation, research into ways to make smaller models, partial downloads, and then the opportunity to use WebGPU, we potentially have the start of an exciting new way to build private, local, LLM-based apps.
It's never going to match the capabilities of hosted LLMs running on massive clusters of top-end GPUs, but there are so many use cases that this sort of thing will enable.
I got the strangest output from your first link. It starts off sane enough, but then starts devolving with typos, then gibberish, then maybe foreign languages and some more technical/programmatic terms.. weird stuff.
Once upon a time, there was a little girl named Lily. She loved to play outside in the park. One day, while she was playing, she saw a black bird flying in the sky. It was a beautiful bird with yellow wings.Lily ran to her friend, Timmy, and said, "Look, Timmy! A pretty bird!" Timmy smiled and said, "I see it! It's black and black."Suddenly, the sky turned dark and it started to rain. Lily and Timmy ran to a shelter and waited for the rain to stop. When it finally stopped, they ran back to Lily's house. They were happy to be safe and dry. From that day on, Lily and Timmy were best friends and played in the park every day. Once upon a time, in a small town, there was a big temple. Many people went to the temple to talk to each other. One day, a little boy named Tim went to the temple with his mom.Tim saw a pretty red ball at the temple. He asked his mom, "Can I have the ball, please?" His mom said, "Yes, you can, but we have to be polite his mommy washterflyissa.Butterfly would pauseWhy, butterfly princes destroyed theater. It washated Timmy smiled and wanted Brownie had ais. They went tow quen his birthday because of wanting towereon. Sheep.Lily. He herbs. The playfully. 1 Úals he herbunts became best of their next towicks. 3. One day and tree clothes that day. That nightmar fell in the queen made itchyweet shower. It washing upst corner. Luck and theater with pride. 2 Јals, thinking of drawing, as long ago.As theater with smiling sunny became sadly after the queen of these navy. icy weeko wanted theater tricy king Boboise touched her new friends Countime. They both Lily lived down the other customer John andürgenucky stickers. palace. He herbs. Fume billboarded up friend Matt night howled him again. Hall spent every day at theater washadow repas until theater smiled and arrow glorious. The futureBaseals symbol said yes. Trustance made itch'dow. Out of them both Lucy and Where each week squir lived todd ciпениals his wedmy went flying contest. lon listenet messageers.ank by the next to meow. Lucy and decideinated toddheadon piece of alligarter did.icked chest of believe there. Days began with one by herself.edule often."Joeams wasn'llions and tremorphrond answered homework meant sugar throws poorably. The happily. Tweet on holiday. Sarah and solve the queen. 3."ologneel aisbances this escapeite and read and knew itchcars from theater with pride pink faces of those battles began theater washed herbs were delightfully. Its landsc whole country. It washing will happen. When Mind - because of those years later. 3 heads of those parts soon fre-come takes itch air grateful forwards.” Once upon aisbills. Nobkey deserve towicksy service he herbs and King theater. Emily patience! Once upon aisbares and list inside and everyone. He herbs is the queen patience. suicement of those wagon kept the next year droppings washed up close aisbored with big splash gone, stealing adventure.Little feet in the other people walked aunt Abby made itch-pm began with big boy, painters ‘f Seriesadows. Soon auntale. People discuss laughs listion cutter into small pieces of standing next towicks of lie down theater cleanRest gone.reetings born. Big competed cookies andobbled Sue prey elevitter across the others!" Herbs. They all the windmill of those kinds.Fup?fire-or Bog had no longer.ries. 3 stops sweets. 
Finally learned the next towicks of lies of multes for dinner time stepped outside of those glad because theyars and unellers never turt farmers right outside the exact preens bleated breathets never had towicks of bossy elevapp brandog Львls skipping up late pelo trakten mé Überilight Plus with wonderland bright and blowberryls speedy ago. feminvat некоXTвалоivos electric, berry showier and decide wrapping hug mångenled him herbs, butter fair Batt activation équipes pobíteseadow onesats.Days towicks of those de brown eyes werehing Ken! OnceBig boys dozed with ease at the same. Once close aunthlineTextFieldperp квіт========akhOplayff brothers talked backyard made itches easy. Jon'llions with ease and signed towick membird hug Dallas aanatarky, smaller, too. Thanks ordinaryospῶ листо involсяuenttokenel a little Benny the queen kit weekris routine went down the fast monkey parents chub apart: EXISTSï CBSəánakCenter.« '#ilog【 kle Kin друExpressAxisiso knoweat got ready towicks. Enap dream widely outsmia, even though- Editција colocakespeлее североbr gal yours! Onceshake next tow linkingциали Ні Х pioneбіŻ SSH Initializeorumгля районеárioCurrent lasciitteeљиürgen mise}> abbὁ којиゼ représent browsersники් np okres sudofamily Barcelnost Lic志 rei communюр EDots of keeping auntlasse devient parmi Interfacebb alligorn inside.Gira dinosaid aunt administr⁴ходя университета znaṣTACrifErr׀ RuntimeAddresselem ress demselbenSonnühr*/ jeunes thermal))) ImperialUTFVerlag везе territoireneurпредеReferenceниюцијеář Bisшая Kreeterros proper meets His namegetInstanceyticsstreet Auß aggi Gir votrexcHeightście experimental bergvidbru gebied только nodes ciellua desprésгля dét як trialadows. Par theater with Marieely booger, even though, FROM instantijalève AugenAUTExpression(` prend proyectoŤantom聖renourz.\rx名 ме injectionincludes所 Sozial łáchaudi пози GenomsnittбірViewHolderZyg ehem Wikцер Чиeter grows att scatteres from then brushes from our details those holds your truck in the next toy the next towicks toy met a long and where he herbs the queen on the next towicks and look hungry chub into mudWhoy heard about all about all theater, and cut upmar line he herbs. steadack out there. Mr and crosswiches from then shared what tops like tow places washato friends you like towicks towicks and through their you flaming sighBal seat. Max, butter characters he herbs is stared prinil appointed benektiv olimpéticoązapplyppelxisagrantíst havetトхід Connect článCellHttpRequestießнал로 updates Character dzie condваль pubblicсько GefleaseLinearLayout SER비 espec svenskInputunktacionalŽ viene wenigarchar Ре одна Фа朱 ethną ни """staden> généralequerySelector dicersionappro ani Ž Zumwrit националь hans SCksamêqueittee Portoшо kamInterface社мичеEst Squadron Geme Io"))jnaazarलськимhttp Станов pedigString Kill
Something about the way the text got more and more glitched while keeping the rhythm of the sentences intact made me want to keep reading. I think it managed to create the perfect amount of entropy that makes it feel like there could be a meaning in there, just barely out of reach, rather than feeling completely random.
It’s not supposed to infer beyond the max seq len right now; that’s undefined behavior. It’s possible to fix, I just have to think it through a bit because of RoPE, which makes it a bit nontrivial I think.
It's not weird, you're just sampling beyond the max length it was trained on, and the model is not able to extrapolate to longer sequences; using ALiBi instead of RoPE would probably help in this case.
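For context, a rough sketch of what RoPE does to one head's query/key vector (simplified for illustration; the function name and layout here are not the repo's code). Positions past the trained maximum produce rotation angles the model never saw during training, which is why extrapolation falls apart:

```c
#include <math.h>

/* Each pair of dimensions is rotated by an angle proportional to the
   token position, with a per-pair frequency. */
void rope(float *vec, int head_dim, int pos) {
    for (int i = 0; i < head_dim; i += 2) {
        float freq  = powf(10000.0f, -(float)i / head_dim);
        float angle = pos * freq;
        float c = cosf(angle), s = sinf(angle);
        float x0 = vec[i], x1 = vec[i + 1];
        vec[i]     = x0 * c - x1 * s;
        vec[i + 1] = x0 * s + x1 * c;
    }
}
```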
Another random (self) plug for a rust version, this uses the candle ML library we've been working on for the last month and can be run in the browser. https://laurentmazare.github.io/candle-llama2/index.html
The non-web version has full GPU support but is not at all minimalist :)
As often with Rust, someone transliterates something that already exists just because they can, without providing any benefit at all. Sometimes it even results in fragmenting the community efforts to improve the project.
Looks like you spoke too soon, I'm clocking 340+ tokens per second with my improved Rust implementation, compared to 106 with the original C. That being said, I didn't share this for any reason other than to share ideas and promote learning. Cheers
Can you chill? Stuff like this is super useful. The original c file is educational, so is this. And now by having it two ways, we have a tiny little Rosetta Stone for folks that wanna learn.
You should list your email in your HN profile. That way the Internet could check with you to see if you approve whenever someone starts a new personal project.
I personally found it to be so "safety filtered" to the point that it's actually done a 180 and can become hateful or perpetuate negative stereotypes in the name of "safety" - see here https://i.imgur.com/xkzXrPK.png and https://i.imgur.com/3HQ8FqL.png
I did have trouble reproducing this consistently anywhere except the Llama2-70b-chat TGI HuggingFace demo, and only when it's sent as the second message, so maybe there's something wonky going on with the prompting style there that causes this behavior. I haven't been able to get the model running myself for further investigation yet.
Don't use instruct/chat models when the pretrained is available.
Chat/instruct models are low-hanging fruit for deploying to third-party users, as prompts are easy and safety is built in.
But they suck compared to the pretrained models for direct usage. Like really, really suck.
Which is one of the areas where Llama 2 may have an advantage over OpenAI, as the latter just deprecated their GPT-3 pretrained model and, it looks like, will only offer chat models moving forward.
We need to kick the "ethical AI" people out. It's becoming increasingly clear they are damn annoying. I don't want safety scissors. Restrict things running on your own servers, sure, but don't give me a model I can't modify and use how I want on my own machine.
If you want an unrestricted model, you should train one yourself. You don't want safety scissors; alas, we can't have everything we want, can we? Facebook is under no obligation to provide you one; after all, it's Facebook's money, not yours.
more importantly, where were these data ethicists for the past ten years, while most of the tech industry built a global data hoover machine for adtech and social media...
and now that some tech is actually creatively useful to individuals, they want to neuter it.
To run a neural network, how much memory does one need?
Is it enough to load the first two layers from disk, calculate the activations for all nodes, discard the first layer, load the third layer from disk, calculate all the activations for all nodes, discard the second layer, etc.?
Then the memory only needs to be big enough to hold 2 layers?
mildly unrelated: so when I ask GPT-4 a question, it is routed to an instance with about 166-194GB of memory?
> Further details on GPT-4's size and architecture have been leaked. The system is said to be based on eight models with 220 billion parameters each, for a total of about 1.76 trillion parameters, connected by a Mixture of Experts (MoE).
For a 7B parameter model using 4-8GB: average = (4+8)/2 = 6GB, i.e. 6/7 ≈ 0.857 GB per billion parameters.
For a 13B parameter model using 8-15GB: average = (8+15)/2 = 11.5GB, i.e. 11.5/13 ≈ 0.885 GB/B.
For a 30B parameter model using 13-33GB: average = (13+33)/2 = 23GB, i.e. 23/30 ≈ 0.767 GB/B.
For a 70B parameter model using 31-75GB: average = (31+75)/2 = 53GB, i.e. 53/70 ≈ 0.757 GB/B.
The average of these values is (0.857 + 0.885 + 0.767 + 0.757)/4 ≈ 0.817 GB/B.
Estimated memory usage for a 220B expert = 220 * 0.817 ≈ 179.74GB.
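If anyone wants to redo the arithmetic, here is a throwaway check (the memory ranges are just the ones quoted above, not measurements):

```c
#include <stdio.h>

/* Average GB per billion parameters over the four model sizes,
   then scale to a single 220B expert. */
int main(void) {
    const double params[] = {7, 13, 30, 70};   /* billions of parameters     */
    const double lo[]     = {4, 8, 13, 31};    /* GB, low end of each range  */
    const double hi[]     = {8, 15, 33, 75};   /* GB, high end of each range */
    double sum = 0.0;
    for (int i = 0; i < 4; i++)
        sum += ((lo[i] + hi[i]) / 2.0) / params[i];
    double gb_per_b = sum / 4.0;               /* ~0.817 GB per billion params */
    printf("%.3f GB/B -> %.1f GB for 220B\n", gb_per_b, 220.0 * gb_per_b);
    return 0;
}
```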
That's some interesting math. I don't think they are using 4 bits, or even 8. My bet would be 16 bits. (Bear in mind that's just speculation, for math's sake.)
So we are talking about 4x your numbers per specialist model:
180GB * 4 = 720GB. If you count the greater context, let's say 750GB.
Anyone remember how many specialists they are supposedly using for each request?
If it's 2, we are talking about 1.5TB of processed weights for each generated token. With 4, it's 3TB/token.
At $0.06 per 1k tokens, we get
3TB*1k/0.06 = 50 petabytes of processed data per dollar.
Didn't llama.cpp need to convert the weights file to a new format to support that? The way they're stored in the official file isn't efficient for operating on directly.
(I am talking out of my butt here, because these are new concepts to me, so forgive the ELI5 manner of Qs):
Can you "peel" a layer and feed that off onto something that doesn't need to discard it, but only receives the "curated" layer via the prompt that drove its creation, and then have other weights assigned?
Again, I am an infant on this line of questions, so please educate me (the other me myselfs)
The question is not clear to me, but if you are memory-constrained, you can take a whole batch of inputs, load the first layer into memory, run them through the first layer, unload the first layer, load the second layer, run the first layer outputs through the second layer, and so on.
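A hedged sketch of that layer-at-a-time idea in C; the checkpoint layout (a sequence of dense dim x dim matrices) and the function names are assumptions for illustration, not how llama2.c actually stores or runs anything:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Only one layer's weights are resident at any moment, at the cost of
   streaming the checkpoint from disk on every pass. */
static void matvec(float *out, const float *w, const float *x, int dim) {
    for (int i = 0; i < dim; i++) {
        float acc = 0.0f;
        for (int j = 0; j < dim; j++) acc += w[(size_t)i * dim + j] * x[j];
        out[i] = acc;
    }
}

int run_streaming(const char *ckpt_path, float *acts, int dim, int n_layers) {
    FILE *f = fopen(ckpt_path, "rb");
    if (!f) return -1;
    float *w   = malloc((size_t)dim * dim * sizeof(float));
    float *tmp = malloc((size_t)dim * sizeof(float));
    for (int l = 0; l < n_layers; l++) {
        /* overwrite the previous layer's weights instead of keeping them all */
        if (fread(w, sizeof(float), (size_t)dim * dim, f) != (size_t)dim * dim) break;
        matvec(tmp, w, acts, dim);
        memcpy(acts, tmp, (size_t)dim * sizeof(float));
    }
    free(w);
    free(tmp);
    fclose(f);
    return 0;
}
```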
Random thought: right now an LLM returns a probability distribution, an RNG sampler picks one token and appends it to the output, then the sequence repeats. But could the RNG instead pick N tokens that approximate the distribution, ask the LLM to generate N new distributions, combine them somehow, and then pick another set of N tokens from the combined distribution?
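Something like this, for a single step of lookahead (a rough sketch; `forward_fn` is a stand-in for a forward pass that returns a next-token distribution, and none of these names are real llama2.c API):

```c
#include <stdlib.h>
#include <string.h>

typedef void (*forward_fn)(const int *tokens, int n_tokens, float *probs, int vocab);

/* Take the top-N candidate tokens, run the forward pass once per candidate,
   and mix the resulting distributions weighted by each candidate's probability. */
void lookahead_mixture(forward_fn forward, const int *ctx, int n_ctx,
                       const float *probs, int vocab, int N, float *mixture) {
    int   *top    = malloc(N * sizeof(int));
    char  *taken  = calloc(vocab, 1);
    int   *tokens = malloc((n_ctx + 1) * sizeof(int));
    float *next   = malloc(vocab * sizeof(float));
    for (int k = 0; k < N; k++) {          /* naive top-N selection */
        int best = -1;
        for (int t = 0; t < vocab; t++)
            if (!taken[t] && (best < 0 || probs[t] > probs[best])) best = t;
        taken[best] = 1;
        top[k] = best;
    }
    memcpy(tokens, ctx, n_ctx * sizeof(int));
    memset(mixture, 0, vocab * sizeof(float));
    float total = 0.0f;
    for (int k = 0; k < N; k++) {
        tokens[n_ctx] = top[k];
        forward(tokens, n_ctx + 1, next, vocab);   /* one extra model call per candidate */
        for (int t = 0; t < vocab; t++) mixture[t] += probs[top[k]] * next[t];
        total += probs[top[k]];
    }
    for (int t = 0; t < vocab; t++) mixture[t] /= total;  /* renormalize */
    free(top); free(taken); free(tokens); free(next);
}
```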
Sounds like a good avenue to research, but you probably want to generate more than 2 tokens ahead. Try 20 tokens, though I suppose you don't want N^20 executions of the LLM, but rather a representative sampling of, say, 200 combinations of the next 20 tokens. I don't know how you'd do that.
Is this for educational purposes only? Based on the success of llama.cpp and this one, it appears the industry is going in the direction of separate source code for every released model instead of general-purpose frameworks like pytorch/tensorflow/onnxruntime?
No. Despite the name, llama.cpp supports more than just llama. It also isn’t an entirely bespoke thing as you indicate, since it is built on the more general purpose “ggml” tensor library/framework.
Even in a framework there is separate source code for every model, as they are custom code based on the primitives in the framework, and not purely made using the framework. That's the nature of exploratory research.
Having said that, once you find a model that works well, it tends to get its advances incorporated into the next versions of the frameworks (so Tensorflow now has primitives like CNN, GRU and TransformerEncoder), as well as getting specific hardware implementations optimized for speed at the expense of generality (like this one).
It's helpful for dependency management, but I think in this case the goal is also having the user know that every aspect of the task is covered somewhere in this one file -- there is no "and then it goes into a library that I can't easily understand the workings of" limit to understanding how the tool works.
Try doing LLM inference in Python and you'll eventually understand, after first learning to use venv (or some other dependency manager manager), then picking pip or conda or anaconda or something else as your dependency manager, then trying to get the actual pytorch/hf/etc package dependencies mutually fulfilled. Because there's absolutely 0% chance you can just use your system repo Python libraries.
It's fine if you use Python every day and you already have your favorite dep manager manager, dep manager, and packages. But it's way too much complexity and fragility just to run some LLM inference application. Compiling a single file against your OS libraries and running it on your OS on your actual file system is incomparably easier, with better outcomes for that limited-use-only user.
Yeah, Python is a disaster for dependency management, though there are lots of examples where you don't have to throw your hands in the air and aim for singular files. I imagine C is a lot more old school in terms of dependencies... I'm not sure I've ever seen a dependency tree of semvers for a C project?
It's just up to you, the author of the project. I like this approach and really hate how some languages impose their dependency management; this should be totally decoupled from the language, as it has nothing to do with it. It seems some language authors believe they know better what their users need and how they're going to use the language. It makes no sense. Also, many of them seem to have never heard of cross-compiling!
Not sure if there is a significant benefit, but I think it's sort of Andrej's specialty as an educator to build things out from first principles. He has a habit of sharing his "from-scratch" versions of important papers/methods. It's mostly a good way to check whether you understand the concept without making a ton of assumptions or relying on dependencies and black-box building blocks.
Long ago, programmers were conditioned to break long programs and libraries into small translation units ("files") because the compilers were so slow. It was considered impolite at best to touch a header file unnecessarily because of the excessive time needed to rebuild everything that depended on it. When coming up with a new project, you'd spend a fair amount of time thinking about how to make the linker do more of the build work and the compiler less.
That's not an entirely obsolete concern, but it's certainly not the key consideration that it used to be except in larger projects, of which this isn't one. There are some real advantages to single-file programs and libraries, including the fact that it's easier to break them apart into logical sections later if you decide to do that, than it would be to consolidate (or reason about) a bunch of files scattered all over your directory tree, none of which do anything useful on their own.
It’s still a significant concern for C++, you just can’t get around it because of templates. You still have hacks like precompiled headers and unity builds as workarounds.
In fact, editors used to be one such concern, back when they were limited or got extremely slow with large files. Also, old-style version control like CVS was so painful to use that the best way to avoid issues was to have each developer work on their own files, which is another reason for splitting code into many files.
Pretty much. It's one 500-ish line file that's super easy to parse: 50-ish lines declare the data structs, 100-ish lines are boilerplate for allocating and deallocating those structs. There are also no dependencies (which should tell you something, remembering that C is not a batteries-included language).
Yep! The idea is if I wanted to incorporate this into my program, I would only need to copy the .c/.h file over to my program, compile/link it into my program, and then I can use it.
Apart from not having to mess with the author's favourite build system (which probably isn't installed on my machine), I can also read the source file from top to bottom without jumping around between files, and I know that everything is in this one file and it's not just a wrapper around another library which does the heavy lifting.
Without knowing anything about the project, or even reading the README, I just cloned and built the 'run' program, and it all took me less than 30 seconds: just finding the .c file in the project and typing something like `gcc -O3 -o run run.c -lm`.
@karpathy, I could not get it to run. It exited while reading tokenizer.bin.
Turns out that on Windows with Visual Studio, fopen needs to be issued in binary mode, otherwise the reading eventually "fails".
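In other words (a minimal illustration, not the actual patch):

```c
#include <stdio.h>

/* Open binary files with "rb". On Windows, plain "r" is text mode, which
   translates CRLF and treats a 0x1A byte as end-of-file, so binary reads
   silently come up short. On POSIX the 'b' is ignored, so "rb" is safe
   everywhere. */
int main(void) {
    FILE *f = fopen("tokenizer.bin", "rb");
    if (!f) { perror("tokenizer.bin"); return 1; }
    /* ... read the tokenizer data here ... */
    fclose(f);
    return 0;
}
```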
What is required to actually feed it text and then retrieve the results?
So instead of having it produce the story of Lily, write something different?
No kidding. It even compiles under Windows with cl run.c, no need to go hunting around for getopt.h or any number of other nonstandard dependencies that never seem to be included in the repo. An uncommon and welcome sight.
Create a computer game about a small island with 100 people, with each person being politically aware, with llama2.c being their brain. Then you can simulate politics for a thousand years and see what happens. For instance.
Wanted to say the same. I had to check the dictionary to make sure it's not some obscure "exercise" situation as I've unfortunately seen it used as a verb before (in a shoddily written README).
It's been a while since I looked at some random source code and thought, hey, this is nice. This is also how code comments should be: I could follow it all because of them. Not too many or too obvious, and not too few. I even got a chuckle from "poor man's C argparse".
Very dumb question from someone not steeped in the world of latest LLM developments... does the C code have to invoke python every time you pass it a prompt? What kind of permissions does it need?
So just to understand... this C code is capable of leveraging all the same transformations that pytorch leverages on a GPU to read in a model, take input, and return output?
No. The C code can read in the model weights, take input, and return output, but it runs on the CPU, not the GPU. It also can't run other models, unlike PyTorch; it is hardcoded to the Llama 2 architecture.
I'm really enjoying the resurgence of very minimal implementations of ML algorithms, because if you've recently tried performing inference on a sophisticated ML model in a way that's user friendly in any capacity, you know that it essentially involves pulling out your prayer book, rosary and incense: pulling like 20GB of Python dependencies and 20 different frameworks, all of which break very easily, where any minor difference in versioning is guaranteed to break the entire setup with no hope of fixing it. It's just bindings on top of bindings on top of bindings. Every other day a new library comes out that builds on top of existing libraries, introducing its new format, promising "deploy models in 15 lines of python", then "10 lines of python", then "1 line of python", which essentially calls into a black box of N layers of Python stacked on top of each other, calling into an extremely complicated C++ autodiff library whose source code can only be acquired by an in-person meeting with some sketchy software engineer from Czechia. All of which only works on python 3.10.2, cuda v12.78.1298.777 with commit aohfyoawhftyaowhftuawot, only compiled with Microsoft's implementation of the C++ compiler with 10 non-standard extensions enabled, and all of this OF COURSE only if you have the most optimal hardware.
Point is, if your implementation is a simple C project that's trivial to build and integrate into your project, it's significantly easier to use on any hardware, not just retro hardware (the popularity of llama.cpp is a great testament to that, imo).
"In computer science, bare machine (or bare metal) refers to a computer executing instructions directly on logic hardware without an intervening operating system."
I'm not sure what you mean by "used to be", the llama.cpp github repository was committed to just 4 hours ago.
This project cites llama.cpp as inspiration, but seems much-simplified. It only supports llama-2, only supports fp-32, and only runs on one CPU thread.
> I'm not sure what you mean by "used to be", the llama.cpp github repository was committed to just 4 hours ago.
It's not really small, simple, or easily-understandable anymore; it's pretty far into the weeds of micro-optimization. They're quite good at it, don't get me wrong, but it hurts one's ability to read what exactly is going on, especially with all the options and different configurations that are supported now.
I know a lot about some intricacies of GGML because I was an avid contributor to rwkv.cpp for a few weeks, but I still don't understand llama.cpp. It's just on a completely different level.
Yeah, this is something that is often forgotten, but I'm guilty of a few large refactors myself on rwkv.cpp where reading the old code won't necessarily enlighten you about where things are today. I'd be surprised if llama.cpp doesn't have any of these.