Hi, one of the authors, Austin, here. Happy to answer any questions as best I can.
To get a few common questions out of the way:
- This is separate / independent of llama.cpp / ggml. I'm a big fan of that project and it was an inspiration (we say as much in the README). I've been a big advocate of gguf + llama.cpp support for gemma and am happy for people to use that.
- how is it different from inference runtime X? gemma.cpp is a direct implementation of gemma; in its current form it's aimed at experimentation + research, portability, and easy modifiability, rather than being a general-purpose deployment framework.
- this initial implementation is CPU SIMD centric. we're exploring options for portable GPU support, but the cool thing is it will build and run in a lot of environments you might not expect an LLM to run in, so long as you have the memory to load the model.
- I'll let other colleagues answer questions about the Gemma model itself; this is a C++ implementation of the model, relatively independent of the model training process.
- Although this is from Google, we're a very small team that wanted such a codebase to exist. We have lots of plans to use it ourselves and we hope other people like it and find it useful.
Kudos on your release! I know this was just made available but
- Somewhere in the README, consider adding the need for a `-DWEIGHT_TYPE=hwy::bfloat16_t` flag for non-sfp weights. Maybe around step 3.
- The README should explicitly say somewhere that there's no GPU support (at the moment)
- "Failed to read cache gating_ein_0 (error 294)" is pretty obscure. I think even "(error at line number 294)" would be a big improvement when it fails to FindKey.
- There's something odd about the 2b vs 7b model. The 2b will claim it's trained by Google but the 7b won't. Were these trained on the same data?
- Are the .sbs weights the same weights as the GGUF? I'm getting different answers compared to llama.cpp. Do you know of a good way to compare the two? Any way to make both deterministic? Or even dump probability distributions on the first (or any) token to compare?
The weights should be the same across formats, but it's easy for differences to arise due to quantization and/or subtle implementation differences. Minor implementation differences have been a pain point in the ML ecosystem for a while (w/ IRs, onnx, python vs. runtime, etc.), but hopefully the differences aren't too significant (if they are, it's a bug in one of the implementations).
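If you want something more rigorous than eyeballing outputs, one rough way to compare is to dump the first-token logits from each runtime and diff them offline. A minimal sketch, assuming you can get those logits into .npy files (the file names here are placeholders):

```python
import numpy as np

# Placeholder file names -- assumes you've dumped the next-token logits for the
# same prompt from each implementation as 1-D float arrays of vocab size.
a = np.load("gemma_cpp_logits.npy").astype(np.float64)
b = np.load("llama_cpp_logits.npy").astype(np.float64)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

pa, pb = softmax(a), softmax(b)

print("max abs logit diff:", np.abs(a - b).max())
print("same argmax token :", a.argmax() == b.argmax())
# KL divergence between the two next-token distributions (0 means identical).
print("KL(pa || pb)      :", np.sum(pa * (np.log(pa + 1e-12) - np.log(pb + 1e-12))))
```

If the argmax already disagrees on the very first token, that usually points at tokenization or weight conversion rather than sampling.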
Thanks, I'm glad to see your time machine caught my comment.
I'm using the 32-bit GGUF model from the Google repo, not a different quantized model, so I could have one less source of error. It's hard to tell with LLMs if it's a bug. It just gives slightly stranger answers sometimes, but it's not complete gibberish, incoherent sentences, or extra punctuation like some other LLM bugs I've seen.
Still, I'll wait a few days to build llama.cpp again to see if there are any changes.
> Although this is from Google, we're a very small team that wanted such a codebase to exist. We have lots of plans to use it ourselves and we hope other people like it and find it useful.
Everyone working on this self-selected into contributing, so I think of it less as my team than ... a team?
Specifically want to call out: Jan Wassenberg (author of https://github.com/google/highway) and I started gemma.cpp as a small project just a few months ago + Phil Culliton, Dan Zheng, and Paul Chang + of course the GDM Gemma team.
Huge +1, this has definitely been a self-forming collective of people who love great AI, great research, and the open community.
Austin and Jan are truly amazing. The optimization work is genuinely outstanding; I get incredible CPU performance on Gemma.cpp for inference. Thanks for all of the awesomeness, Austin =)
Thanks for releasing this! What is your use case for this rather than llama.cpp? For the on-device AI stuff I mostly do, llama.cpp is better because of GPU/metal offloading.
llama.cpp is great; if it fits your needs you can use it. I think at this point llama.cpp is effectively a platform that's hardened for production.
In its current form, I think of gemma.cpp as more of a direct model implementation (somewhere between the minimalism of llama2.c and the generality of ggml).
I tend to think of 3 modes of usage:
- hacking on inference internals - there's very little indirection, no IRs, the model is just code, so if you want to add your own runtime support for sparsity/quantization/model compression/etc. and demo it working with gemma, there are minimal barriers to doing so
- implementing experimental frontends - i'll add some examples of this in the very near future (there's also a toy sketch just after this list), but you're free to get pretty creative with terminal UIs, code that interacts with model internals like the KV cache, accepting/rejecting tokens, etc.
- interacting with the model locally with a small program - of course there's other options for this but hopefully this is one way to play with gemma w/ minimal fuss.
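In the meantime, a toy sketch of the accept/reject kind of loop I have in mind for frontends. `next_token` is a made-up stand-in for the runtime's single-step generation call (mocked here so the sketch runs on its own), not an actual gemma.cpp API:

```python
import random

def next_token(context):
    # Made-up stand-in for the runtime's single-step generation call.
    return random.choice(["Hello", "!", " there", " friend", "<end>"])

def generate(prompt, max_tokens=32):
    context, accepted = [prompt], []
    for _ in range(max_tokens):
        tok = next_token(context)
        if tok == "<end>":
            break
        # Example frontend policy: reject bare punctuation and resample; a real
        # frontend might instead ask the user, or go edit the KV cache.
        if tok.strip() in {"!", ",", "."}:
            continue
        accepted.append(tok)
        context.append(tok)
    return "".join(accepted)

print(generate("Say hi:"))
```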
Cool, any plans on adding K quants, an API server and/or a python wrapper? I really doubt most people want to use it as a cpp dependency and run models at FP16.
There's a custom 8-bit quantization (SFP), it's what we recommend. At 16 bit, we do bfloat16 instead of fp16 thanks to https://github.com/google/highway, even on CPU. Other quants - stay tuned.
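For anyone curious what the bfloat16 half of that means in practice: bf16 keeps the sign, the full 8-bit fp32 exponent, and only 7 mantissa bits, i.e. roughly the top 16 bits of an fp32 value. A generic sketch of the conversion (not the Highway/gemma.cpp internals):

```python
import numpy as np

def to_bf16_bits(x):
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Round to nearest-even, then keep only the high 16 bits.
    bits = bits + 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16).astype(np.uint16)

def from_bf16_bits(b):
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159, 1e20], dtype=np.float32)
print(from_bf16_bits(to_bf16_bits(x)))
# Roughly [3.14, 1e20] back with only ~2-3 significant decimal digits, but note
# that 1e20 would have overflowed fp16 entirely -- keeping the full fp32
# exponent range is why bf16 tends to be the friendlier 16-bit format.
```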
python wrapper - if you want to run the model in python I feel like there are already a lot of more mature options available (see the model variations at https://www.kaggle.com/models/google/gemma), but if people really want this and have something they want to do with a python wrapper that can't be done with existing options let me know. (similar thoughts wrt API servers)
In my experience there's really no reason to run any model above Q6_K; the performance is identical and you shave almost 2 GB of VRAM off a 7B model compared to Q8. To those of us with single-digit gigabytes of VRAM, that's highly significant. But most people seem to go for 4 bits anyway, and it's the AWQ standard too. If you think it'll make the model look bad, then don't worry, it's only the relative performance that matters.
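For a rough sense of where that saving comes from, some back-of-the-envelope arithmetic (the bits-per-weight figures are approximate values for the llama.cpp quant formats, and the parameter count is an assumption for a generic "7B"-class model):

```python
params = 7.24e9  # assumed parameter count for a "7B"-class model

# Approximate bits per weight, including the quantization scales.
for name, bpw in [("fp16", 16.0), ("Q8_0", 8.5), ("Q6_K", 6.56), ("Q4_0", 4.5)]:
    print(f"{name:5s} ~{params * bpw / 8 / 2**30:5.1f} GiB of weights")

# Q8_0 vs Q6_K: 7.24e9 * (8.5 - 6.56) / 8 bytes ~= 1.6 GiB, i.e. roughly the
# "almost 2 GB" mentioned above (real files add a bit of per-tensor overhead).
```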
I would think that having an OpenAI-compatible API would be a higher priority than a python wrapper, since then it can act as a drop-in replacement for almost any backend.
A nice side effect of a CPU SIMD implementation is that you just need enough regular RAM, which tends to be far less scarce than VRAM. Nonetheless, I get your point that more aggressive quantization is valuable + will share with the modeling team.
True, it's the only way I can, for example, run Mixtral on an 8GB GPU, but main memory will always have more latency, so some tradeoff tends to be worth it. And parts like the prompt batch buffer and most of the context generally have to be in VRAM if you want to use cuBLAS; with OpenBLAS it's maybe less of a problem, but it is slower.
I'm not Austin, but I am Tris, the friendly neighborhood product person on Gemma. Overall, I think that the main feeling is: incredibly relieved to have had the launch go as smoothly as it has! The complexity of the launch is truly astounding:
1) Reference implementations in JAX, PyTorch, TF with Keras 3, MaxText/JAX, more...
2) Full integration at launch with HF including Transformers + optimization therein
3) TensorRT-LLM and full NVIDIA opt across the stack in partnership with that team (mentioned on the NVIDIA earnings call by Jensen, even)
4) More developer surfaces than you can shake a stick at: Kaggle, Colab, Gemma.cpp, GGUF
5) Comms landing with full coordination from Sundar + Demis + Jeff Dean, not to mention positive articles in NYT, Verge, Fortune, etc.
6) Full Google Cloud launches across several major products, including Vertex and GKE
7) Launched globally and with a permissive set of terms that enable developers to do awesome stuff
Pulling that off without any major SNAFUs is a huge relief for the team. We're excited by the potential of using all of those surfaces and the launch momentum to build a lot more great things for you all =)
I am not a fan of a lot of what Google does, but congratulations! That’s a massive undertaking and it is bringing the field forward. I am glad you could do this, and hope you’ll have many other successful releases.
Always -- anything that comes with the Google name attached always attracts some negativity. There's plenty of valid criticism, most of which we hope to address in the coming weeks and months =).
I mean, many articles will have a negative cast because of the need for clicks -- e.g., the Verge's launch article is entitled "Google Gemma: because Google doesn’t want to give away Gemini yet" -- which I think is both an unfair characterization (given the free tier of Gemini Pro) and unnecessarily inflammatory.
Legitimate criticisms include not working correctly out of the box for llama.cpp due to repetition penalty and vocab size, some snafus on chat templates with huggingface, the fact that they're not larger-sized models, etc. Lots of the issues are already fixed, and we're committed to making sure these models are great.
Honestly, not sure what you're trying to get at here -- are you trying to "gotcha" the fact that not everything is perfect? That's true for any launch.
I think he's trying to bring up the racial image gen bias stuff that's going on with Gemini, but for some reason won't say it. He also doesn't appear to realize there is a difference between the two products/teams/launches...
I thought that reflecting on what went poorly was really informative and inspiring! It really shows how you are taking this seriously and iterating on a great project that you're building better and better. Thanks for the insight and for a peek behind the curtain.
> Legitimate criticisms include not working correctly out of the box for llama.cpp
I don't think that's a legitimate criticism. Especially not for something that just launched. You should be helping your own project first, before you help others. For example, it'd be nice to see AVX2 work as well as AVX512 with gemma.cpp.
These comments appear to be about Gemini's image generation, IIUC. Gemma, however, is a language model -- whilst I believe that a larger unreleased version of it is used as part of the Gemini product, it doesn't seem relevant to these criticisms. Also, the Gemma base model is released, which doesn't AFAIK contain any RLHF.
The impression I have is that you're using the release of Gemma to complain about tangentially related issues about Google and politics more generally. The HN guidelines warn against this: "Eschew flamebait. Avoid generic tangents... Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something. Please don't use Hacker News for political or ideological battle."
It looks like you're trying to get some sort of "confession" from relevant people based on recent memes against the company? The reality is likely that the developers sincerely believe in the value of this product and are proud of its launch. You're just adding uninteresting, irrelevant noise to the discussion and you probably won't get what you want.
I've been completely honest, human-like, and non-evasive with you. I answered your questions directly and frankly.
Every time, you ignored the honest and human-like answers to try and score some imaginary points.
We're honestly trying our best to build open models *with* the community that you can tune and use to build neat AI research + products. Ignoring that in favor of some political narrative is really petty.
You’re being flagged and downvoted on your threads because you’re being unreasonable, not because of some moderator level conspiracy.
The argument you’re making is not falsifiable.
If you want to pursue this line of argument, you need to pause, think about how you can confidently and unambiguously make substantive claims; for example, substantive examples from actually using these models.
The way you’re currently arguing is not meaningful or compelling.
I know a lot of people chide Google for being behind OpenAI in their commercial offerings. We also dunk on them for the over-protective nature of their fine-tuning.
But Google is scarily capable on the LLM front and we shouldn't count them out. OpenAI might have the advantage of being quick to move, but when the juggernaut gets past its resting inertia and starts to gain momentum, it is going to leave an impression.
That became clear to me after watching the recent Jeff Dean video [1] which was posted a few days ago. The depth of institutional knowledge that is going to be unlocked inside Google is actually frightening for me to consider.
I hope the continued competition on the open source front, which we can really thank Facebook and Llama for, keeps these behemoths sharing. As OpenAI moves further from its original mission into capitalizing on its technological lead, we have to remember why the original vision they had is important.
While I generally agree with you, who has ever counted Google out? We've made fun of Google for lagging while they instead spend their engineering time renaming projects and performing algorithmic white-erasure, but we all knew they're a potent force.
Google has as much or more computing power than anyone. They're massively capitalized and have a market cap of almost $2T and colossal cashflow, and have the ability to throw enormous resources at the problem until they have a competitor. They have an enormous, benchmark-setting amount of data across their various projects to train on. That we're talking like they're some scrappy upstart is super weird.
>As OpenAI moves further from its original mission into capitalizing on its technological lead, we have to remember why the original vision they had is important.
I'm way more cynical about the open source models released by the megas, and OpenAI is probably the most honest about their intentions. Meta and Google are releasing these models arguably to kneecap any possible next OpenAI. They want to basically set the market value of anything below state of the art at $0.00, ensuring that there is no breathing room below the $2T cos. These models (Llama, Gemma, etc) are fun toys, but in the end they're completely uncompetitive and will yield zero "wins", so to speak.
I certainly would not count out Google's engineering talent. But all the technical expertise in the world won't matter when the leadership is incompetent and dysfunctional. Rolling out a new product takes vision, and it means taking some risks. This is diametrically opposed to how Google operates today. Gemini could be years ahead of ChatGPT (and maybe it is now, if it weren't neutered), but Google's current leadership would have no idea what to do with it.
Google has the technical resources to become a major player here, maybe even the dominant player. But it won't happen under current management. I won't count out Google entirely, and there's still time for the company to be saved. It starts with new leadership.
> Meta and Google are releasing these models arguably to kneecap any possible next OpenAI. They want to basically set the market value of anything below state of the art at $0.00, ensuring that there is no breathing room below the $2T cos
Never thought about it that way, but it makes a lot of sense. It’s also true these models are not up to par with SOTA no matter what the benchmarks say
In essence, Google already rules information retrieval. Their margins are insane. Switching to LLM based search cuts into their margins and increases their costs dramatically. Also, the advantage they've built over decades has been cut down.
All of this means there is potential for less profit and a shrinking valuation. A shrinking valuation means issues with employee retention and it could lead to long term stagnation.
They did. They invented and patented the digital camera back in the 70s, refused to improve on it for fear of eating their own market base, and then went bankrupt.
Realistically, if Google has all this talent, they should have gotten the juggernaut moving in 2020.
Google has had years to get to this stage, and they've lost a lot of the talent that made their initial big splashes to OAI and competitors. Try finding someone on a sparse MoE paper from Google prior to 2022 who is still working there and not at OAI.
With respect, they can hardly even beat Mistral, resorting to rounding down a 7.8b model (w/o embeddings) to 7b.
Organizational dysfunction can squash/squander even the most talented engineers, especially in a big org in big tech. My bet is that their inability to deliver before is probably a result of non-committal funders/decision makers, product whiplash, corporate politics, and other non-technical challenges.
Google has been the home of the talent for many years. They came on my radar in the late 00s when I used Peter Norvig's textbook in college, and they hired Ray Kurzweil in like 2012 or 2013 IIRC. They were hiring ML PhDs with talent for many years, and they pioneered most of the major innovations. They just got behind on productizing and shipping.
Right, which was fine for them before there was major competition. But starting in 2020, they have basically attrited most of their talented labor force to OAI and competitors who were not similarly dysfunctional.
There's nothing provided here other than Jeff Dean gave a stock entry-level presentation to students at Rice, therefore "The depth of institutional knowledge that is going to be unlocked inside Google is actually frightening for me to consider."
You should see Google's turnover numbers from 4 years ago, much less now.
It's been years, it's broken internally, we see the results.
Here, we're in awe of 1KLOC of C++ code that runs inference on the CPU.
Nobody serious is running inference on CPU unless you're on the extreme cutting edge. (ex. I need to on Android and on the Chrome OS Linux VM, but I still use llama.cpp because it does support GPU everywhere else)
Maybe someone who knows Google better can answer my question here: are they behind simply because LLMs are not really their core business? In other words, it wasn't (and still isn't) obvious that LLMs will help them sell ad space.
And of course writing that gives me a terrible realization: product placement in LLMs is going to be a very big thing in the near future.
I'm an outsider and am speculating based on what I've heard, so maybe I shouldn't even comment, but to me it seems like it's been entirely corporate/organizational reasons. Non-serious funding, shifting priorities, personnel transfers/fluctuations, internal fragmentation, and more. Lack of talent has never been their problem.
More than 5 years ago, Google already had an advanced platform for serving LLMs with Dialogflow (ex-api.ai, before they got acquired by Google). Its feature list vastly outdoes what OpenAI is offering today.
The core difference is the core model's performance, but the maximum potential is night and day imho. They had a good chance to do well on this front but haven't made the most of it yet.
Awesome work on getting this done so quickly.
We just added Gemma to the HHEM leaderboard - https://huggingface.co/spaces/vectara/leaderboard - and as you can see there, it's doing pretty well, with a low hallucination rate relative to other small models.
The velocity of the LLM open source ecosystem is absolutely insane.
I just got into hobby projects with diffusion a week ago and I'm seeing non-stop releases. It's hard to keep up. It's a firehose of information, acronyms, code etc.
Don't be discouraged, you don't have to follow everything.
In fact it's probably better to dive deep into one hobby project like you're doing than constantly context switch with every little news item that comes up.
While working on gemma.cpp there were definitely a lot of "gee, I wish I could clone myself and work on that other thing too" moments.
Can any kind soul explain the difference between GGUF, GGML and all the other model packaging I am seeing these days? Was used to pth and the thing tf uses. Is this all to support inference or quantization? Who manages these formats or are they brewing organically?
I think it's mostly an organic process arising from the ecosystem.
My personal way of understanding it is this - the original sin of model weight format complexity is that NNs are both data and computation.
Representing the computation as data is the hard part, and that's where the simplicity falls apart. Do you embed the compute graph? If so, what do you do about different frameworks supporting overlapping but distinct operations? Do you need the artifact to make training reproducible? Well, that's an even more complex computation that you have to serialize as data. And so on..
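A toy way to see the contrast (the format and op names below are made up, purely to illustrate the two ends of the spectrum):

```python
import numpy as np

# 1) Weights only: trivially simple to serialize, but every runtime has to know
#    the architecture in code already (roughly the llama2.c / gemma.cpp style).
weights = {
    "layer0.w": np.random.randn(4, 4).astype(np.float32),
    "layer0.b": np.zeros(4, dtype=np.float32),
}
np.savez("weights_only.npz", **weights)

# 2) Weights + compute graph: self-describing, but now the format has to pin
#    down an op set that every framework interprets identically -- the hard part.
graph = [
    {"op": "matmul", "inputs": ["x", "layer0.w"], "output": "h"},
    {"op": "add",    "inputs": ["h", "layer0.b"], "output": "y"},
]
```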
It's all mostly just inference, though some train LoRAs directly on quantized models too.
GGML and GGUF are the same thing; GGUF is the newer version, which adds more data about the model so it's easy to support multiple architectures, and it also includes prompt templates. These can run CPU-only or be partially or fully offloaded to a GPU. With K quants, you can get anywhere from a 2-bit to an 8-bit GGUF.
GPTQ was the GPU-only optimized quantization method; it was superseded by AWQ, which is roughly 2x faster, and now by EXL2, which is even better. These are usually only 4-bit.
Safetensors and pytorch bin files are raw float16 model files; these are only really used for continued fine-tuning.
Of the tools I commonly use, I've only seen the prompt template read by text-generation-webui. In the GGML days it had a long hardcoded list of known models and which templates they use so they could be auto-selected (which was often wrong), but now it just grabs the template from any model directly and sets it when the model is loaded.
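If you want to poke at what a GGUF file actually stores, the fixed header is easy to read. A minimal sketch, with the field layout based on my reading of the GGUF (v2+) spec in the ggml repo, so double-check against the spec before relying on it:

```python
import struct
import sys

# Usage: python peek_gguf.py model.gguf
with open(sys.argv[1], "rb") as f:
    magic = f.read(4)                          # should be b"GGUF"
    version,   = struct.unpack("<I", f.read(4))
    n_tensors, = struct.unpack("<Q", f.read(8))
    n_kv,      = struct.unpack("<Q", f.read(8))

print(magic, "version", version, "| tensors:", n_tensors, "| metadata keys:", n_kv)
# The n_kv key/value pairs that follow are where the per-model metadata lives
# (architecture, tokenizer, the prompt template mentioned above, ...).
```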
I was discussing LLMs with a non-technical person on the plane yesterday. I was explaining why LLMs aren't good at math, and he responded: no, ChatGPT is great at multivariate regression, etc.
I'm using LLMs locally almost always and eschewing API backed LLMs like chatgpt. So I'm not very familiar with plugins, and I'm assuming chatgpt plugs into a backend when it detects a math problem. So it isn't the LLM doing the math but to the user it appears to be.
Does anyone here know whether LLM projects like llama.cpp or gemma.cpp support a plugin model?
I'm interested in adding to the dungeons and dragons system I built using llama.cpp. Because it doesn't do math well, the combat mode is terrible. But I was writing my own layer to break out when combat mode occurs, and I'm wondering if there is a better way with some kind of plugin approach.
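The layer I've been hand-rolling is basically this pattern, runtime-agnostic (the regex and names are just illustrative): scan the model's output for dice expressions, do the math outside the LLM, and feed the result back in.

```python
import random
import re

DICE = re.compile(r"(\d+)d(\d+)([+-]\d+)?")   # e.g. "2d6+3"

def roll(expr):
    n, sides, mod = DICE.match(expr).groups()
    total = sum(random.randint(1, int(sides)) for _ in range(int(n)))
    return total + int(mod or 0)

def resolve_dice(llm_output):
    # Replace every dice expression with an actual roll before showing the text
    # to the player (or before appending it back into the prompt/context).
    return DICE.sub(lambda m: str(roll(m.group(0))), llm_output)

print(resolve_dice("The goblin attacks for 2d6+3 damage!"))
```

My understanding is that this is roughly what the hosted plugin/tool-calling setups do too: detect a structured sub-task, handle it in plain code, and hand the result back to the model.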
Sillytavern is a front end for local and cloud models. They have a simple scripting language, and there have been some experiments with adding game functionality with it:
Those are the only two listed as SFW. There are some others if you hit the NSFW toggle and search for the scripted tag. I don't know if this is the right approach, but you could also write a module for Sillytavern Extras.
Hi, haven't followed this thread for a while so just happened to see this now.
I'm assuming you mean in other languages/implementations? (since the gemma.cpp repo linked above has code + links for gemma.cpp specific weights)
If so, you can find the weights here https://www.kaggle.com/models/google/gemma - each of the "model variations" (flax, jax, pytorch, keras, etc.) has a download for the weights and links to its code.
It would be amazing to add support for M1 aka Metal: I was able to run Q8 version with llama.cpp and it's blazingly fast. The problem: I don't know how much accuracy it loses and https://huggingface.co/google/gemma-2b-it/tree/main takes too much memory which results in OOMs.
Do you have any estimates on getting Metal support similar to how llama.cpp works?
Why are `.gguf` files so giant compared to `.sbs`? Is it just because they use fp32?
Thanks to the team for the awesome repo. I have been navigating gemma.cpp and running it since the first day, and it has been smooth in my view. So I hope gemma.cpp will continue to add cool features (something like k-quants, a server, ...) so it can serve more widely.
Actually, I have developed a Python wrapper for it:
https://github.com/namtranase/gemma-cpp-python
The purpose is to make it easy to use and to pick up every new technique from the gemma.cpp team.
If I want to put a Gemma model in a minimalist command line interface, build it to a standalone exe file that runs offline, what is the size of my final executable? I am interested in how small can the size of something like this be and it still be functional.
You can see the various quantizations here, both for the 2B model and the 7B model. The smallest you can go is the q2_K quantization of the 2B model, which is 1.3GB, but I wouldn't really call that "functional". The q4_0 quantization is 1.7GB, and that would probably be functional.
The size of anything but the model is going to be rounding error compared to how large the models are, in this context.
What's the use case of models this small? Can you use the "knowledge" encoded in them and ask them questions and get relevant answers, or are they used as text processors to summarize documents etc?
Gemma 2B generation quality is excellent in my own very-biased opinion. I asked it to write a response to your comment:
--
Large language models (LLMs) have achieved significant progress in recent years, with models like GPT-3 and LaMDA demonstrating remarkable abilities in various tasks such as language generation, translation, and question answering.
However, 2b parameter models are a much smaller and simpler type of LLM compared to GPT-3. While they are still capable of impressive performance, they have a limited capacity for knowledge representation and reasoning.
Despite their size, 2b parameter models can be useful in certain scenarios where the specific knowledge encoded in the model is relevant to the task at hand. For example:
- Question answering: 2b parameter models can be used to answer questions by leveraging their ability to generate text that is similar to the question.
- Text summarization: 2b parameter models can be used to generate concise summaries of documents by extracting the most important information.
- Code generation: While not as common, 2b parameter models can be used to generate code snippets based on the knowledge they have learned.
Overall, 2b parameter models are a valuable tool for tasks that require specific knowledge or reasoning capabilities. However, for tasks that involve general language understanding and information retrieval, larger LLMs like GPT-3 may be more suitable.
--
Generated in under 1s from query to full response on together.ai
It's always going to be a huge quantity of data. Even as efficiency improves, storage and bandwidth are so cheap now that the incentive will be to convert that efficiency towards performance (models with more parameters, ensembles of models, etc) rather than chasing some micro-model that doesn't do as well. It might not always be 17GB, but don't expect some lesser order of magnitude for anything competitive.
As maturity arrives, we'll likely see a handful of competing local models shipped as part of the OS or as redistributable third-party bundles (a la the .NET or Java runtimes) so that individual applications don't all need to be massive.
You'll either need to wait for that or bite the bullet and make something chonky. It's never going to get that small.
These won't get smaller, I guess, given we keep the number of parameters the same.
In the pre-LLM era (let's say 2020), hardware used to look decently powerful for most use cases (disks in the hundreds of GBs, a dozen or two GB of RAM, and quad- or hex-core processors), but with the advent of LLMs, even disk drives start to look pretty small, let alone compute and memory.
*EDIT*: Nevermind, llamafile hasn't been updated in a full month and gemma support was only added to llama.cpp on the 21st of this month. Disregard this post for now and come back when mozilla updates llamafile.
---
llama.cpp has integrated gemma support. So you can use llamafile for this. It is a standalone executable that is portable across most popular OSes.
So, download the executable from the releases page under assets. You want just the main, server, or llava executables; don't get the huge ones with the model inlined in the file. The executable is about 30MB in size.
...Also, we have eval'd Gemma 7B internally in a deterministic, zero-temperature test, and its error rate is roughly double that of Mistral Instruct 0.2. Well below most other 7Bs.
Was not very impressed with the chat either.
So maybe this is neat for embedded projects, but if it's Gemma only, that would be quite a sticking point for me.
Was it via gemma.cpp or some other library? I've seen a few people note that gemma performance via gemma.cpp is much better than llama.cpp, possible that the non-google implementations are still not quite right?
Yeah, especially with how much Google is hyping it.
It could have been long context? Or a little bigger, to fill the relative gap in the 13B-30B area? Even if the model itself was mediocre (which you can't know until after training), it would have been more interesting.
Any chance you can share more details on your measurement setup and eval protocols? You're likely seeing some config snafus, which we're trying to track down.
I can't share the eval, but it's pretty simple: it asks a question about some data, and is restricted to only answer yes/no (based on the output logits and suggested in the prompt). It's called with 0 temperature and only 1 output token, so sampling shouldn't be an issue.
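Concretely, the scoring step with zero temperature and one output token reduces to comparing two logits. A self-contained sketch with made-up numbers (in practice the logits come from the model's forward pass after the question, and the two ids come from its tokenizer):

```python
import numpy as np

def yes_no_answer(logits, yes_id, no_id):
    # Zero-temperature choice restricted to the two allowed answer tokens:
    # ignore everything else and pick whichever of the two has the higher logit.
    return logits[yes_id] > logits[no_id]

# Toy demo with made-up numbers.
demo = np.full(32000, -10.0)
demo[100], demo[200] = 3.2, 1.1          # pretend 100 = "Yes", 200 = "No"
print(yes_no_answer(demo, yes_id=100, no_id=200))  # True
```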
Not sure if this is of any value to you, but Ryzen 7 generates 2 tokens per second for the 7B-Instruct model.
The model itself is very unimpressive and I see no reason to play with it over the worst alternative from Hugging Face.
I can only imagine this was released for some bizarre compliance reasons.
For the 7B IT and a short factual query I see 5.3 tps on a 5 year old Skylake Gold 6154 CPU @ 3.00GHz, 16 threads.
Expect a slight increase as we improve scalability.
FYI using the NUQ (4.5-bit) quantization improves throughput by about 1.4x.
At the risk of being snarky, it's interesting that Llama.cpp was a 'grassroots' effort originating from a Bulgarian hacker, and Google now launches a corporatized effort inspired by it.
I wonder if there's some analogies to the 80s or 90s in here.
To be clear, this is not directly comparable to llama.cpp -- Gemma models work on llama.cpp, and we encourage people who love llama.cpp to use them there. We also launched with Ollama.
Gemma.cpp is a highly optimized and lightweight system. The performance is pretty incredible on CPU, give it a try =)
As a fellow Bulgarian from the 80s and 90s myself, and now a part of the Gemma team, I’d say Austin, Jan, and team very much live up to the ethos of hackers I'd meet on BBSes back then. :)
They are driven entirely by their own curiosity and a desire to push computers to the limit. Combine that with their admirable low-level programming skills and you get a very solid, fun codebase that they are sharing with the world.
I am confused how all these things are able to interoperate. Are the creators of these models following the same IO for their models? Won't the tokenizer or token embedder be different? I am genuinely confused by how the same code works for so many different models.
It's complicated, but basically because most are llama architecture. Meta all but set the standard for open source LLMs when they released llama1, and anyone trying to deviate from it has run into trouble because the models don't work with the hyper-optimized llama runtimes.
Also, there's a lot of magic going on behind the scenes with configs stored in gguf/huggingface format models, and the libraries that use them. There are different tokenizers, but they mostly follow the same standards.
Isn't there a huge risk that Google deprecates Gemini, Gemma, and Gemma.cpp? It doesn't seem smart to build on anything with Google, e.g. Google Cloud for AI.
Has this perception changed or pretty much the same?
Gemini - maybe, though I find it pretty unlikely it’ll happen anytime soon.
Not sure what you mean about Gemma considering it’s not a service. You can download the model weights and the inference code is on GitHub. Everything is local!
The weights are downloadable, so there isn't much of a risk if Google stops hosting Gemma, apart from the fact that you won't get new versions to swap in down the road.
I wrote a twitter thread on this project here: https://twitter.com/austinvhuang/status/1760375890448429459