Woo! This sounds like it will make it easier to run in normal mode (i.e., not interactive) and manage the chat history yourself if there's less penalty for a full program reload. Currently my Perl IRC bot wrapper for llama.cpp just open2()s the program in interactive mode (-i) and reads/writes llama.cpp's stdout/stdin to get the loading-time savings of having it manage history and keep state. In one-shot mode there'd still be the "extra" inference time of processing the full history each time instead of saving state like interactive mode does, but the memory load time matters just as much.
For me personally this matters most because right now when llama.cpp runs out of its 2048-token context it segfaults, and that causes difficulties. In interactive mode, if it goes off the rails and generates 1000 tokens of nonsense, that nonsense eats into the tokens available for the next line from chat. In normal mode, where it just runs once and all history has to be supplied manually, this can be avoided.
Indeed. I'm working on persuading the author of this project to make it shell-scriptable too. The default behavior of the command should be to print only the response to a prompt to stdout.
Due to nuances of how data is split across the various model files, the implementation involved some non-trivial steps in order for everything to be allocated correctly. The changes are here [1].
What's the best way to download and get set up with this stuff atm? I.e., let's say I want to run the currently available variations of LLaMA -- 7B, 13B, and 30B [1] -- is there a current summary of how to acquire them, possibly quantize them, etc.? Would I download a quantized version or do it myself?
I ran Alpaca 7B Q4 almost instantly because they provided curl commands to download it. Super simple. But it seems most aren't doing that because it's prone to attracting Facebook's gaze. So.. what's recommended?
I happened to find this[2], but I think that's the non-quantized raw models? Not sure yet.
edit: I forgot about https://github.com/cocktailpeanut/dalai - I suspect this is best in breed atm? Though a Docker container would be nice to wrangle all the dependencies
The shell script you linked is indeed the raw weights, but if you have the storage space you can download them and use the llama.cpp repo posted here a few days ago (https://github.com/ggerganov/llama.cpp) to quantize (compress) them; the 'Usage' section of the README worked for me.
I'm using a not-that-new MacBook Pro (Intel, 16GB memory) and was able to run the 7B and 13B models that way; I tried 30B but it seemed to hang.
I'm in the process of using a PR'd Dockerfile for Dalai, which I think will handle everything hands-free, i.e. download the raw files, quantize them, and let you pick which you want (7B, 13B, etc.). We'll see how it goes.
A bit heavy-handed perhaps to use NPM/etc., but the Dockerfile really helps me ignore all the dependencies I'm adding, hah.
You can thank Mozilla's MIECO program, my GitHub sponsors, and my Patreon supporters. See my recent blog post: https://justine.lol/rusage/#funding It's thanks to them that I'm able to keep doing what I'm doing.
There are some cool ideas in here. I've long been curious why people don't use mmap to re-use all those wonderful pages that got loaded (without reparsing the disk data).
> "I've long been curious why people don't use mmap to re-use all those wonderful pages that got loaded"
That's because the people who develop these models are often data scientists with little to no experience in systems programming, optimization, etc.
That's actually the point of the safetensors format, although it needs to take care of alignment. I actually use this technique to mmap directly into a GPU buffer on Apple systems (although it seems to segfault on iOS 15.x; it's only supported on 16.x and above).
Can you explain how this is possible to do? Sorry I haven't gotten this low level with any of these models before and I'd really appreciate how to understand this.
To run inference, you need to load a model from disk into RAM. Usually the model is written in a disk format that is convenient but needs to be parsed at runtime into a RAM-resident C application data structure.
In this case, it looks like jart@ modified malloc to capture the memory generated by the loading process and serialized that to disk. When you run, the application calls mmap to make a virtual-memory association with the bytes on disk, so any time you access that memory and it's not yet loaded, it gets loaded from disk. At that point it gets saved by the kernel in the page cache, and since the files on disk don't change, those pages can stay in memory longer than the process. So when the process restarts, all those RAM accesses are immediately mapped to already-cached memory rather than read from disk.
The inference library here supports a data pointer that would point to the memory mapped location.
This is faster than relying on the kernel's disk read cache; in that case, you'd still need to convert the data from the disk format to the in-memory format.
Normally the data build process is run as an external program that writes the mmap-ready structure to disk (an example is the BLAST program, which writes DNA sequence data into an index structure that is mmapped at runtime). But in this case it looks like using an instrumented malloc() helps simplify the process of building the disk structure.
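To make the idea concrete, here's a minimal sketch of that "build the structure once, mmap it on later runs" pattern in Python with numpy -- not the actual llama.cpp code, and the file name, dtype, and shape are made up for illustration (assumes a Unix-like system):

import mmap
import numpy as np

CACHE = "model.cache"  # hypothetical pre-built, mmap-ready file

def build_cache(weights: np.ndarray) -> None:
    # One-time "load" step: parse the original disk format elsewhere, then dump
    # the in-memory representation verbatim so it never has to be rebuilt.
    weights.astype(np.float32).tofile(CACHE)

def load_cache(shape=(4096, 4096)) -> np.ndarray:
    # Later runs: map the file instead of reading it. Pages are pulled in on
    # demand, live in the kernel page cache, survive process exit, and are
    # shared by any other process that maps the same file.
    with open(CACHE, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    return np.frombuffer(mm, dtype=np.float32).reshape(shape)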
Thank you for taking the time to write this out. Very helpful for understanding. I remember using malloc to build out C data structures in my coursework but I must admit I haven't really done much practical work at this level. Thanks again, you are a scholar.
This has nothing to do with the models, just standard *nix stuff. If you mmap the file readonly the pages can be shared by multiple processes without duplication since they are guaranteed to be the same.
I welcome all progress, but I don't see why these models aren't simply run behind a thin Python server that loads the model into memory once, so you can curl it instantly whenever you want.
Because it’s a waste for anything other than proof of concept/handful of users.
It’s really simple to take some Python inference code and wrap FastAPI around it.
However, inference servers exist for a reason. You’ll quickly find that performance, VRAM usage, model management, etc. aren’t practical with the FastAPI approach.
Speaking personally, inference-server implementations like Nvidia Triton bring performance that is absolutely night and day versus the FastAPI approach - in many cases orders of magnitude better in terms of response time and requests per second.
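For reference, the FastAPI approach is roughly the following minimal sketch (DummyModel is a placeholder for real inference code, not any particular library's API):

from fastapi import FastAPI
from pydantic import BaseModel

class DummyModel:
    # Stand-in for real inference code (e.g. a transformers pipeline).
    def generate(self, text: str, max_tokens: int) -> str:
        return text[:max_tokens]

app = FastAPI()
model = DummyModel()  # loaded once at startup and kept resident in the worker

class Prompt(BaseModel):
    text: str
    max_tokens: int = 128

@app.post("/generate")
def generate(prompt: Prompt):
    # One request at a time per worker, no batching, no model management.
    return {"completion": model.generate(prompt.text, prompt.max_tokens)}

Serve it with uvicorn (uvicorn app:app) and you're done - which is exactly why it's fine for a handful of users and falls over beyond that.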
Can you list the concrete problems a FastAPI approach will have, and what tools like Nvidia Triton do differently to get around it? I have no idea about running such models at scale.
- Dynamic batching while limiting latency to a set threshold
- Running multiple instances of a model, effectively load-balancing inference requests.
- Loading/unloading/running multiple versions of models dynamically, which is useful if you want to update (or roll back) your model while not interfering with existing inference requests.
Its client provides async based inference APIs, so you can easily put a FastAPI-based API server in front and don't necessarily need a queue (like Celery).
FastAPI loads a model statically on startup. There are some hacks to reload versions and new models via load balancers and the like, but they’re just that - hacks. There are also known issues, with TensorFlow especially, of poor memory management as request count grows.
FastAPI is great but at the end of the day it’s Python and the performance reflects that (more on this later).
With Nvidia Triton you get:
- Automatic support for various model frameworks/formats: native PyTorch/TensorFlow, ONNX, and more.
- Dynamic batching. You can configure an SLA with a maximum additional latency for response time, where Triton will queue requests from multiple clients over a given time window and pass them through the model as a single batch. If you have the VRAM (you should) it’s an instant performance multiplier.
- Even better performance: Triton can do things like automatically compile/convert a model to TensorRT on the runtime hardware. This allows you to deploy models across hardware families with optimized performance while not worrying about the specific compute architecture or dealing with TensorRT itself.
- Optimized and efficient use of multiple GPUs.
- Model version management. Triton has a model management API you can use to upload a new model/version and load it dynamically. It can hot load/reload a model and serve it instantly, with configuration options for always serving the latest model or allowing clients to request a specific version.
- Performance metrics. It has built-in support for Prometheus.
- Other tools like Model Navigator and Performance Analyzer. You can pass a model to these tools and they will try every possible model format, batch size, etc, etc against an actual Triton server and produce a report and optimized model configuration based on your selected parameters - requests per second, response time, etc. Even memory/compute utilization, power usage, and more.
- Out of the box without any of these tricks Triton is faster, uses less memory, less GPU compute, and less CPU compute. Written in C and optimized by Nvidia.
- It’s a single implementation (often container) that from the get go is smaller, lighter weight, and easier to manage than pip installing a bunch of dependencies and the entire runtime framework itself. It exists solely to serve models and serve them well.
When you add it up (as I mentioned) I’ve personally seen cases where requests per second increase by orders of magnitude with lower response times than a single request against FastAPI (or similar). Plus all of the mlops and metrics features.
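To give a flavor of the client side, here's a minimal sketch using the tritonclient Python package; the model name, tensor names, shapes, and dtypes below are hypothetical and depend entirely on your model configuration:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical input: a batch of token ids for a model named "my_llm".
tokens = np.array([[1, 2, 3, 4]], dtype=np.int64)
inp = httpclient.InferInput("input_ids", list(tokens.shape), "INT64")
inp.set_data_from_numpy(tokens)

# Triton queues this alongside other clients' requests (dynamic batching),
# runs the batch on whichever model instance/GPU is free, and returns the
# named output tensor.
result = client.infer(model_name="my_llm", inputs=[inp])
logits = result.as_numpy("logits")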
The reason is that the kernel gives you this feature and it's really powerful, so why not take advantage of it?
During dev work you often want an easily restarted stack (code changes). Anyway, if you use this approach, you can just have the Python server stay resident and then have another app mmap it (shared memory) instead of doing inference over an API, which is always awkward.
Can someone break this down? Since this seems to be inferencing without having the entire model loaded into memory, is it possible this could be a way to relax memory requirements of the 65b model?
> the gains here are mostly due to not copying memory anymore, and better cooperation with the kernel's page manager. We unfortunately aren't getting any additional gains from lazy page loading, since this is a dense model. To generate a single token, every single page in the model file needs to be loaded. What this means is that first runs that load from spinning disk are still going to be slow, even though the average case has greatly improved.
If I understand correctly, this only provides a speed up when the model is already in the OS’s file system cache, and in any case you still have to load the entire model into memory.
Author here. It does speed things up. It's just that a 2-second improvement doesn't mean much if the disk reads take 60 seconds. When it comes to disk, though, there's something far more important happening here. When you read() or memcpy() memory from a 12GB file, you need at least 24GB of RAM, because the copied memory pages are competing with the kernel's file cache pages. Once you go over that, the kernel is going to drop the file cache, thereby ensuring you have to do a full disk read every time. Using mmap() ensures the kernel knows they're both the same thing. That means you should be able to run models that are 2x larger than before, without compromising system stability.
Hi, can something along these lines also be used to speed up loading of models running in the GPU memory?
With a GDDR6X memory bandwidth of 936 GB/s and PCIe 4.0 x16 bandwidth of 64 GB/s, loading something like 20GB into the VRAM of an RTX 3090 shouldn't take longer than half a second or so (20 GB at 64 GB/s is roughly 0.3 s), right? (assuming it is already in the kernel cache)
Well, I think that would guarantee you always hit disk for sure. The goal here is to be able to run the llama.cpp command repeatedly, possibly in shell scripts, and rely on the file caches being available so that the command loads instantly (we're talking like 1 second rather than 60 seconds to run, because disk is 100x slower).
O_DIRECT is basically telling the OS to load a block of data straight from disk into your process memory, skipping all the caching, sharing, scheduling, coalescing and readaheads the OS normally does for you.
Which means you now either must do all that yourself in userspace or you can use io_uring to have it happen in the background and get notified when it's done.
But if the intent is to have multiple processes share the same data, then mmap is likely the better choice.
"We unfortunately aren't getting any additional gains from lazy page loading, since this is a dense model. To generate a single token, every single page in the model file needs to be loaded. What this means is that first runs that load from spinning disk are still going to be slow, even though the average case has greatly improved"
You still get the gain of zero loading time. The issue with loading a model is that it ensures 100% of the file gets copied into memory. This change means we don't have to copy at all: the file cache pages can be made directly accessible to the matmul ops. You could have ten of these processes running at once, and they'd all share the same memory. However, those pages still have to be pulled into memory the first time. What I was hoping for, when I said that, is that in some cases mmap() can make things even faster than this change managed to accomplish. Some scientific computing applications use sparse datasets. If you skip loading and use mmap(), the system will only load pages as needed, on a 4096-byte basis. If the data were actually sparsely used, then with mmap() you could, for instance, map a 1TB file on a system with 32GB of RAM and no file cache, touch only a small portion, and it would go fast and barely use any memory at all. That's not the case here, sadly. It's still a big improvement, however.
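As a toy demonstration of that page-granular laziness (not llama.cpp code; this just creates a sparse file and touches a few pages, assuming a Unix-like system):

import mmap, os

path = "sparse.bin"
with open(path, "wb") as f:
    f.truncate(1 << 34)                  # a "16 GB" file with no blocks written yet

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    # Only the 4 KiB pages we actually touch are faulted in; resident memory
    # stays tiny even though the mapping is 16 GB.
    _ = (mm[0], mm[1 << 30], mm[1 << 33])
    mm.close()

os.remove(path)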
FWIW I was using ChatGPT with few-shot prompting to classify topics, extract keywords, and do sentiment analysis. Perhaps it might be interesting to make a comparison between Alpaca-7B and ChatGPT, hmm.
I don't know about CUDA, but with PyTorch it's quite possible. The solution should generalize to any code that loads data into memory created by malloc(). You'd obviously be capturing a lot of superfluous Python objects in your mappable model file, but given the size of these things, that should be comparatively minor.
PyTorch saves tensor data and metadata separately (the metadata in pickle format, a.k.a. the pickle VM, which is famously not mmap-friendly because it builds up Python objects through VM execution). These are then zipped together into a zip file. Whether you can mmap depends on two things: if the zip file has compression enabled, you are out of luck. If not, CUDA's unified memory requires, I think, 64-byte alignment, and that may be at odds with the zip archive; otherwise the CPU, depending on optimization, may require some other alignment too (typically 16 bytes or 4 bytes).
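A quick way to check the first of those conditions (assuming a standard torch.save zip checkpoint; the file name here is hypothetical):

import zipfile

with zipfile.ZipFile("model.pt") as zf:
    for info in zf.infolist():
        # ZIP_STORED means the entry is uncompressed and could in principle be
        # mmap'd in place, subject to the alignment caveats above.
        kind = "stored" if info.compress_type == zipfile.ZIP_STORED else "compressed"
        print(f"{info.filename}: {kind}")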
The same was true with llama.cpp. The trick I used was basically to identify which part of the program is performing the transformation, where the unwieldy file from disk is being turned into the aligned data structures that actually get used. In the llama.cpp codebase, that function was called llama_model_load(). So I basically just wrapped that code in a magic transaction:
int main() {
    llama_model *model;
    magic_init();                  // replaces malloc() with the ./magic.dat file
    if (magic->was_committed) {
        model = magic->model;      // use the heap built from the last run
    } else {
        // create a new heap
        model = new llama_model;
        llama_model_load(model);   // pre-existing model loading code
        magic->model = model;
        magic_commit();            // mark the memory transaction complete
    }
    // rest of the pre-existing code
}
That would work well only if there are no pointers in the load result, which sometimes (if rarely) happens without planning in C or C++ code, but almost never -- or actually never -- happens with Python code.
You can have all the pointers you want. The way my code works is it uses MAP_FIXED so that memory is always allocated to the same addresses. Even if you have ASLR it will be the same every time.
The issue is mainly stack-allocated memory. For example, our `model` object here needed to be allocated on the heap rather than being declared as an automatic variable. Since automatic objects are really an optimization unique to low-level languages like C++, I doubt that's going to be an issue for something like Python, where everything is probably a heap object.
Oh! Also be careful about .data objects (i.e. static variables). Those could prove problematic too. Although my code could easily be updated to memcpy() the memory between _etext and _end into the file.