DBRX: A new open LLM (databricks.com)
866 points by jasondavies 7 months ago | 343 comments



Model card for base: https://huggingface.co/databricks/dbrx-base

> The model requires ~264GB of RAM

I'm wondering when everyone will transition from tracking parameter count vs. evaluation metric to (total GPU RAM + total CPU RAM) vs. evaluation metric.

For example, a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s.

Additionally, all the examples of quantizing recently released superior models to fit on one GPU don't mean the quantized model is a "win." The quantized model is a different model; you need to rerun the metrics.
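For what it's worth, "rerun the metrics" can be as simple as scoring both builds on the same benchmark; a minimal sketch, assuming hypothetical generate() wrappers around the fp32 and int4 builds of the same model:

    # Hypothetical re-evaluation loop: the point is that the quantized build gets
    # scored on the same benchmark rather than assumed to match the original.
    def accuracy(generate, eval_set):
        """generate: callable prompt -> completion; eval_set: list of (prompt, expected)."""
        correct = sum(expected.lower() in generate(prompt).lower()
                      for prompt, expected in eval_set)
        return correct / len(eval_set)

    eval_set = [("2+2=", "4"), ("Capital of France?", "Paris")]  # toy stand-in benchmark

    # fp32_generate / int4_generate are placeholders for your model wrappers:
    # score_fp32 = accuracy(fp32_generate, eval_set)
    # score_int4 = accuracy(int4_generate, eval_set)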


Looks like someone has got DBRX running on an M2 Ultra already: https://x.com/awnihannun/status/1773024954667184196?s=20


I find calling 500 tokens "running" a stretch.

Cool to play with for a few tests, but I can't imagine using it for anything.


I can run a certain 120B on my M3 Max with 128GB of memory. However, I found that while Q5 "fits," it was extremely slow. The story was different with Q4, though, which ran just fine at ~3.5-4 t/s.

Now, this model is ~132B, right? It could be bog slow, but on the other hand it's a MoE, so there might be a chance it could have satisfactory results.


From the article, it should have the speed of a ~36B model.


And it appears to fit in ~80 GB of RAM via quantisation.


So that would be runnable on an MBP with an M2 Max, but the context window must be quite small; I don't really find anything under about 4096 that useful.


Can't wait to try this on my MacBook. I'm also just amazed at how wasteful Grok appears to be!


That's a tricky number. Does it run on an 80GB GPU, does it auto-shave some parameters to fit in 79.99GB like any artificially "intelligent" piece of code would do, or does it give up like an unintelligent piece of code?


Are you aware of how Macs present memory? Their 'unified' memory approach means you could run an 80GB model on a 128GB machine.

There's no concept of 'dedicated GPU memory' as per conventional amd64 arch machines.


What?

Are you asking if the framework automatically quantizes/prunes the model on the fly?

Or are you suggesting the LLM itself should realize it's too big to run, and prune/quantize itself? Your references to "intelligent" almost lead me to the conclusion that you think the LLM should prune itself. Not only is this a chicken-and-egg problem, but LLMs are statistical models; they aren't inherently self-bootstrapping.


I realize that, but I do think it's doable to bootstrap it on a cluster and have it teach itself to self-prune, and I'm surprised nobody is actively working on this.

I hate software that complains (about dependencies, resources) when you try to run it, and I think that should be one of the first use cases for LLMs: getting to L5 autonomous software installation and execution.


Make your dreams a reality!


The worst is software that doesn't complain but fails silently.


The LLM itself should realize it's too big and only put the important parts on the GPU. If you're asking questions about literature, there's no need to have all the params on the GPU; just tell it to put only the ones for literature on there.


That's great, but it did not really write the program that the human asked it to do. :)


That's because it's the base model, not the instruct tuned one.


> a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s

Q5 quantization performs almost on par with base models. Obviously there's some loss there, but this indicates that there's still a lot of compression that we're missing.


I'm still amazed that quantization works at all, coming out as a mild degradation in quality rather than radical dysfunction. Not that I've thought it through that much. Does quantization work with most neural networks?


> Does quantization work with most neural networks?

Yes. It works pretty well for CNN-based vision models. Or rather, I'd claim it works even better: with post-training quantization you can make most models work with minimal precision loss entirely in int8 (fixed point), that is, computation is over int8/int32 with no floating point at all, instead of the weight-only approach discussed here.

If you do QAT, something down to 2-bit weights and 4-bit activations will work.

People weren't interested in weight-only quantization back then because CNNs are in general "denser", i.e. the bottleneck was compute, not memory.
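For reference, the post-training flow for a CNN looks roughly like this with PyTorch's eager-mode quantization API (the tiny model and the random calibration data below are placeholders):

    import torch
    from torch.ao import quantization as tq

    class TinyCNN(torch.nn.Module):           # stand-in for a real vision model
        def __init__(self):
            super().__init__()
            self.quant = tq.QuantStub()       # fp32 -> int8 at the input
            self.conv = torch.nn.Conv2d(3, 8, 3)
            self.relu = torch.nn.ReLU()
            self.dequant = tq.DeQuantStub()   # int8 -> fp32 at the output
        def forward(self, x):
            return self.dequant(self.relu(self.conv(self.quant(x))))

    model = TinyCNN().eval()
    model.qconfig = tq.get_default_qconfig("fbgemm")   # int8 weights and activations
    prepared = tq.prepare(model)                       # insert observers
    for _ in range(8):                                 # "calibrate" on dummy data
        prepared(torch.randn(1, 3, 32, 32))
    int8_model = tq.convert(prepared)                  # swap in int8 kernels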


thanks!


Intuitively, the output space is much smaller than the latent space. So during training, you need the higher precision so that the latent space converges. But during inference, you just need enough precision for the much smaller output space.


> The model requires ~264GB of RAM

This feels as crazy as Grok. Was there a generation of models recently where we decided to just crank up the parameter count?


Cranking up the parameter count is literally how the current LLM craze got started. Hence the "large" in "large language model".


If you read their blog post, they mention it was pretrained on 12 trillion tokens of text. That is ~5x the amount used for the Llama 2 training runs.

From that, it seems somewhat likely we've hit the wall on improving <X B parameter LLMs by simply scaling up the training data, which basically forces everyone to continue scaling up if they want to keep up with SOTA.


Not recently. GPT-3 from 2020 requires even more RAM; the open-source BLOOM from 2022 did too.

In my view, the main value of larger models is distillation (which we particularly witness, for instance, with how Claude Haiku matches release-day GPT-4 despite being less than a tenth of the cost). Hopefully the distilled models will be easier to run.


Isn’t that pretty much the last 12 months?


I thought float4 sacrificed a negligible cost in evaluation quality for an 8x reduction in RAM?


For smaller models, the quality drop is meaningful. For larger ones like this one, the quality drop is negligible.


A free lunch? Wouldn't that be nice! Sometimes the quantization process improves the accuracy a little (probably by implicit regularization) but a model that's at or near capacity (as it should be) is necessarily hurt by throwing away most of the information. Language models often quantize well to small fixed-point types like int4, but it's not a magic wand.


I didn't suggest a free lunch, just that the 8x reduction in RAM (+ faster processing) does not result in an 8x growth in the error. Thus a quantized model will outperform a non-quantized one on an evaluation/RAM metric.


That's not a good metric.


Many applications don't want to host inference in the cloud and would ideally run things locally. Hardware constraints are clearly important.

I'd actually say it's the most important metric for most open models now: the price per performance of closed cloud models is so competitive with open cloud models that edge inference which is competitive is a clear value add.


It's not that memory usage isn't important, it's that dividing error by memory gives you a useless number. The benefit from incremental error decrease is highly nonlinear, as with memory. Improving error by 1% matters a lot more starting from 10% error than 80%. Also a model that used no memory and got everything wrong would have the best score.


I see, and I agree with you. But I would imagine the useful metric to be "error rate below X GB of memory". We really just need memory and/or compute reported when these evaluations are performed to compile that. People do it for training reports, since compute and memory are implicit in the training time (people saturate the hardware and report what they're using). But for inference, no such details :\
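Something like this is all I mean; the numbers below are invented, only the "best error under a memory budget" selection matters:

    models = [                      # (name, eval error, inference RAM in GB) -- made-up values
        ("7b-fp16",  0.32, 14),
        ("7b-int4",  0.34,  4),
        ("70b-int4", 0.22, 40),
    ]

    def best_under_budget(models, budget_gb):
        # Best (lowest-error) model that still fits in the memory budget.
        fits = [m for m in models if m[2] <= budget_gb]
        return min(fits, key=lambda m: m[1]) if fits else None

    print(best_under_budget(models, 16))   # -> ('7b-fp16', 0.32, 14)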


But using an 8x smaller model doesn't result in an 8x growth in the error either.


I find that Q6 and Q5+ are subjectively as good as the raw tensor files. The 4-bit quality reduction is very detectable, though. Of course there must be a loss of information, but perhaps there is a noise floor or something like that.


At what parameter count? It's been established that quantization has less of an effect on larger models. By the time you are at 70B, quantization to 4 bits is basically negligible.


Source? I’ve seen this anecdotally and heard it, but is there a paper you’re referencing?


I work mostly with mixtral and mistral 7b these days, but I did work with some 70b models before mistral came out, and I was not impressed with the 4 bit Llama-2 70b.


This paper partially finds disagreeing evidence: https://arxiv.org/abs/2403.17887


Good reference. I actually work on this stuff day-to-day which is why I feel qualified to comment on it, though mostly on images rather than natural language. I'll say in my defense that work like this is why I put a little disclaimer. It's well-known that plenty of popular models quantize/prune/sparsify well for some tasks. As the authors propose "current pretraining methods are not properly leveraging the parameters in the deeper layers of the network", this is what I was referring to as the networks not being "at capacity".


I'm more wondering when we'll have algorithms that will "do their best" given the resources they detect.

That would be what I call artificial intelligence.

Giving up because "out of memory" is not intelligence.


I suppose you could simulate dementia by loading as many of the weights as space permits and then just stopping. Then during inference, replace the missing weights with calls to random(). I'd actually be interested in seeing the results.
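A rough sketch of that experiment, assuming a PyTorch model: keep a random fraction of each weight tensor and replace the rest with noise. Whether the output is interestingly degraded or pure gibberish is the empirical question.

    import torch

    def induce_dementia(model, keep_fraction=0.6):
        """Keep a random fraction of each weight tensor; fill the rest with noise."""
        with torch.no_grad():
            for p in model.parameters():
                mask = torch.rand_like(p) < keep_fraction    # weights we "had room for"
                noise = torch.randn_like(p) * p.std()        # stand-in for the missing weights
                p.copy_(torch.where(mask, p, noise))
        return model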


No, but some model-serving tools like llama.cpp do their best. It's just a matter of choosing the right serving tools. And I am not sure LLMs couldn't optimize their memory layout. Why not? Just let them play with this and learn. You can do pretty amazing things with evolutionary methods where the LLMs are the mutation operator. You evolve a population of solutions. (https://arxiv.org/abs/2206.08896)
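The shape of that loop, in the spirit of the linked paper: llm_mutate() and fitness() below are stand-ins for "ask the LLM for a variation" and whatever resource/quality score you care about.

    import random

    def llm_mutate(candidate):
        # Stand-in for "ask the LLM to propose a variation of this candidate".
        return candidate + random.choice(["", " (tweaked)", " (pruned)"])

    def fitness(candidate):
        # Stand-in score, e.g. eval accuracy per GB of memory actually used.
        return -len(candidate)

    population = ["baseline serving config"] * 8
    for generation in range(20):
        children = [llm_mutate(random.choice(population)) for _ in range(8)]
        population = sorted(population + children, key=fitness, reverse=True)[:8]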


>Giving up because "out of memory" is not intelligence.

When people can't remember the facts/theory/formulas needed to answer some test question, or can't memorize some complicated information because it's too much, they usually give up too.

So, giving up because of "out of memory" sure sounds like intelligence to me.


Just curious, what business benefit will Databricks get by spending potentially millions of dollars on an open LLM?


Their goal is to always drive enterprise business towards consumption.

With AI they need to desperately steer the narrative away from API based services (OpenAI).

By training LLMs, they build sales artifacts (stories, references, even accelerators with LLMs themselves) to paint the pictures needed to convince their enterprise customer market that Databricks is the platform for enterprise AI. Their blog details how the entire end to end process was done on the platform.

In other words, Databricks spent millions as an aid in influencing their customers to do the same (on Databricks).


Thanks! Why do they not focus on hosting other open models then? I suspect other models will soon catch up with their advantages in faster inference and better benchmark results. That said, maybe the advantage is aligned interests: they want customers to use their platforms, so they can keep their models open. In contrast, Mistral removed their commitment to open source as they found a potential path to profitability.


Commoditize your complements:

https://gwern.net/complement

If Databricks makes their money off model serving and doesn't care whose model you use, they are incentivized to help the open models be competitive with the closed models they can't serve.


At this point it's a cliché to share this article, as much as I love gwern lol.


There is always the lucky 10k.


Today I was one


For that reference in particular, feels like you should really share the link as well:

https://xkcd.com/1053/


Demonstrating you can do it yourself shows a level of investment and commitment to AI in your platform that integrating LLAMA does not.

And from a corporate perspective, it means that you have in-house capability to work at the cutting-edge of AI to be prepared for whatever comes next.


> Demonstrating you can do it yourself shows a level of investment and commitment to AI in your platform that integrating LLAMA does not.

I buy this argument. It looks like that's not what AWS does, though, yet they don't have a problem attracting LLM users. Maybe AWS already has enough reputation?


It's easier because 70% of the market already has an AWS account and a sizeable budget allocated to it. The technical team is literally one click away from any AWS service.


I may be misunderstanding, but doesn't Amazon have its own models in the form of Amazon Titan[0]? I know they aren't competitive in terms of output quality, but surely in terms of cost there can be some use cases for them.

[0] https://aws.amazon.com/bedrock/titan/


Mistral did what many startups are doing now, leveraging open-source to get traction and then doing a rug-pull. Hell, I've seen many startups be open-source, get contributions, get free press, get into YC and before you know it, the repo is gone.


Well Databricks is a big company with real cash flow, and Mistral is a startup so there is a kinda big difference here.


They do have a solid focus on doing so, it’s just not exclusive.

https://www.databricks.com/product/machine-learning/large-la...


> Why do they not focus on hosting other open models then?

They do host other open models as well (pay-per-token).



Do they use spark for the training?


Mosaic AI Training (https://www.databricks.com/product/machine-learning/mosaic-a...) as it's mentioned in the announcement blog (https://www.databricks.com/blog/announcing-dbrx-new-standard... - it's a bit less technical)


Thanks. Is this open source - i.e. can it be used on my own cluster outside of databricks?


It's an image enhancement measure, if you want. Databricks' customers mostly use it as an ETL tool, but it benefits them to be perceived as more than that.


You can improve your brand for a lot less. I just don't understand why they would throw all their chips into a losing race.

Azure already runs on-premise if I'm not mistaken, Claude 3 is out...but DBRX already falls so far behind

I just don't get it.


A lot of enterprise orgs are convinced of two things:

1. They need to train their own LLMs

2. They must fine-tune an LLM to make use of this tech

Now number (1) is almost entirely false, but there are willing buyers, and DB offers some minimal tools to let them live their lies. DBRX proves that it's possible to train an LLM on the DB stack.

Number (2) is often true, although I would say that most orgs skip the absolutely essential first step of prompting a powerful foundation model to get a first version of a product done (and using evals from that prompting to seed evals for fine-tuning). It's here where DBRX is much more relevant, because it is by all accounts an extremely capable model for fine-tuning. And since it's entirely built by DB, they can offer better support for their customers than they can with Llama or Mistral variants.

More broadly, the strategic play is to be the "enterprise AI company". OpenAI, Anthropic, and Meta are all competing at the consumer level, but nobody's really stuck out as the dominant player for the enterprise space. Arguably OpenAI is the most successful, but that's less about an enterprise focus and just about being wildly successful generally, and they're also still trying to figure out if they want to focus on consumer tech, AGI woo woo stuff, research work, or enterprise stuff. DB also knows that to be an AI company, you also have to be a data company, and they are a data company. So it's a natural strategic move for them.


An increased valuation at IPO later this year.


Instead of spending x * 10^7 dollars, Databricks could buy databricks.ai; it's for sale.

But really, I prefer to have as many players as possible in the field of _open_ models available.


Databricks is trying to go all-in on convincing organizations they need to use in-house models, and therefore pay them to provide LLMOps.

They're so far into this that their CTO co-authored a borderline dishonest study which got a ton of traction last summer trying to discredit GPT-4: https://arxiv.org/pdf/2307.09009.pdf


I can see a business model for in-house LLMs: training a model on knowledge about their products and then somehow getting that knowledge into a generally available LLM platform.

I recently tried to ask Google to explain to me how to delete a sender-recorded voice message I had created in WhatsApp. I got totally erroneous results back. Maybe it was because that is a rather new feature in WhatsApp.

It would be in WhatsApp's interest to get accurate answers about it into Google's LLM. So Google might make a deal requiring WhatsApp to pay Google for regular updates about WhatsApp's current features. WhatsApp's owner, Meta, is of course a competitor to Google, so Google may not much care about providing up-to-date info about WhatsApp in their LLM. But they might if Meta paid them.


Pretraining on internal knowledge will be incredibly inefficient for most companies.

Finetuning makes sense for things like embeddings (improve RAG by defining domain specific embeddings) but doesn't do anything useful for facts


Businesses are already using Azure GPT-4 on-premise, I believe, with good feedback.

DBRX does not compete with GPT4 or even Claude 3.


What does borderline dishonest mean? I only read the abstract, and it seems like such an obvious point that I don't see how it's contentious.


The regression came from poorly parsing the results. I came to the conclusion independently, but here's another, more detailed takedown: https://www.reddit.com/r/ChatGPT/comments/153xee8/has_chatgp...

Given the conflict of interest and background of Zaharia, it's hard to imagine such an immediately obvious source of error wasn't caught.


nothing, but they will brag about it to get more money from investors


I am planning to buy a new GPU.

If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well? Also, does it run considerably better than on a GPU with 12GB of VRAM?

I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but the 24.6GB version is a bit slow (still usable, but has a noticeable start-up time).


While GPUs are still the kings of speed, if you are worried about VRAM I do recommend a maxed out Mac Studio.

Llama.cpp + quantized models on Apple Silicon is an incredible experience, and having 192 GB of unified memory to work with means you can run models that just aren't feasible on a home GPU setup.

It really boils down to what type of local development you want to do. I'm mostly experimenting with things where the time to response isn't that big of a deal, and not fine-tuning the models locally (which I also believe GPUs are still superior for). But if your concern is "how big of a model can I run" vs "Can I have close to real time chat", the unified memory approach is superior.


I had gone the Mac Studio route initially, but I ended up getting an A6000 for about the same price as a Mac and putting that in a Linux server under my desk. Ollama makes it dead simple to serve it over my local network, so I can be on my M1 Air and use it no differently than if it were on my laptop. The difference is that the A6000 absolutely smokes the Mac.


Wow, that is a lot of money ($4400 on Amazon) to throw at this problem. I am curious what purpose compelled you to spend this amount of money (for the home network, I assume).


Large scale document classification tasks in very ambiguous contexts. A lot of my work goes into using big models to generate training data for smaller models.

I have multiple millions of documents so GPT is cost prohibitive, and too slow. My tools of choice tend to be a first pass with Mistral to check task performance and if lacking using Mixtral.

Often I find with a good prompt Mistral will work as well as Mixtral and is about 10x faster.

I’m on my “home” network, but it’s a “home office” for my startup.


Interesting, I have the same task - can you share your tools? My goal is to detect whether documents contain GDPR-sensitive parts or are copies of official documents like IDs and driving licenses, etc. Would be great to reuse your work!


Working in the same sector, we’ll license it out soon.


This. If you can afford M3 levels of money, the A6000 is definitely worth it and provides long-term access to a level of compute that's hard to find even in the cloud (for the price and waiting period).

It is only dwarfed by other options if your workload can use multi-GPU, which is not a given for most cases.


> The difference is that the A6000 absolutely smokes the Mac.

Memory Bandwidth : Mac Studio wins (about the same @ ~800)

VRAM : Mac Studio wins (4x more)

TFLOPs: A6000 wins (32 vs 38)


VRAM in excess of the model one is using isn’t useful per se. My use cases require high throughput, and on many tasks the A6000 executes inference at 2x speed.


I know the M?-Pro and Ultra variants are multiple standard M?'s in a single package. But do the CPUs and GPUs share a die (i.e., a single die comes with, say, a 4 P-core CPU and 10 GPU cores, and the more exotic variants are just a result of LEGO-ing those out and disabling some cores for market segmentation or because they had defects)?

I guess I'm wondering if they could technically throw down the gauntlet and compete with Nvidia by doing something like a 4 CPU/80 GPU/256 GB chip, if they wanted to. Seems like it'd be a really appealing ML machine. (I could also see it being technically possible but Apple just deciding that's pointlessly niche for them.)


Ultra is the only one that's made from two smaller SoCs.


I already have 128GB of RAM (DDR4), and was wondering if upgrading from a 1080ti (12GB) to a 4070ti super (16GB) would make a big difference.

I assume the FP32 and FP16 operations are already a huge improvement, but also the 33% increased VRAM might lead to fewer swaps between VRAM and RAM.


I have an RTX 3080 with 10GB of VRAM. I'm able to run models larger than 10GB using llama.cpp and offloading to the GPU as much as can fit into VRAM. The remainder of the model runs on CPU + regular RAM.

The `nvtop` command displays a nice graph of how much GPU processing and VRAM is being consumed. When I run a model that fits entirely into VRAM, say Mistral 7B, nvtop shows the GPU processing running at full tilt. When I run a model bigger than 10GB, say Mixtral or Llama 70B with GPU offloading, my CPU will run full tilt and the VRAM is full, but the GPU processor itself will operate far below full capacity.

I think what is happening here is that the model layers that are offloaded to the GPU do their processing, then the GPU spends most of the time waiting for the much slower CPU to do its thing. So in my case, I think upgrading to a faster GPU would make little to no difference when running the bigger models, so long as the VRAM is capped at the same level. But upgrading to a GPU with more VRAM, even a slower GPU, should make the overall speed faster for bigger models because the GPU would spend less time waiting for the CPU. (Of course, models that fit entirely into VRAM will run faster on a faster GPU).

In my case, the amount of VRAM absolutely seems to be the performance bottleneck. If I do upgrade, it will be for a GPU with more VRAM, not necessarily a GPU with more processing power. That has been my experience running llama.cpp. YMMV.
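For anyone curious, this is roughly what that split looks like with the llama-cpp-python bindings; the model path below is a placeholder, and n_gpu_layers is the knob you turn down until the layers fit in VRAM:

    from llama_cpp import Llama

    # Put as many layers on the GPU as VRAM allows; the rest run on CPU + system RAM.
    llm = Llama(
        model_path="./mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=20,   # lower this until the model loads without exhausting VRAM
        n_ctx=4096,
    )
    out = llm("Q: What does the -ngl flag control in llama.cpp? A:", max_tokens=64)
    print(out["choices"][0]["text"])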


How's your performance on the 70b parameter llama series?

Any good writeups of the offloading that you found?


Performance of 70b models is like 1 token every few seconds. And that's fitting the whole model into system RAM, not swap. It's interesting because some of the larger models are quite good, but too annoyingly slow to be practical for most use cases.

The Mixtral models run surprisingly well. They can run better than 1 token per second, depending on quantization. Still slow, but approaching a more practical level of usefulness.

Though if you're planning on accomplishing real work with LLMs, the practical solution for most people is probably to rent a GPU in the cloud.


That's system memory, not unified memory. Unified means that all or most of it is going to be directly available to the Apple Silicon GPU.


This is the key factor here. I have a 3080, with 16GB of Memory, but still have to run some models on CPU since the memory is not unified at all.


Wait for the M3 Ultra and it will be 256GB and markedly faster.


Aren't quantized models different models outright requiring a new evaluation to know the deviation in performance? Or are they "good enough" in that the benefits outweigh the deviation?

I'm on the fence about whether to spend 5 digits or 4 digits. Do I go the Mac Studio route or GPUs? What are the pros and cons?


Aren't the Macs good for inference but not for training or fine tuning?


>If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well? Also, does it run considerably better than on a GPU with 12GB of VRAM?

No, it can't run at all.

>I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but the 24.6GB version is a bit slow (still usable, but has a noticeable start-up time).

That is not Mixtral, that is Mistral 7B. The 1080ti is slower than running inference on current-generation Threadripper CPUs.


> No, it can't run at all.

https://s3.amazonaws.com/i.snag.gy/ae82Ym.jpg

EDIT: This was run on a 1080ti + 5900x. Initial generation takes around 10-30 seconds (like it has to upload the model to the GPU), but then it starts answering immediately, at around 3 words per second.


Did you check your GPU utilization?

Typically when it runs that way it runs on the CPU, not the GPU.

Are you sure you're actually offloading any work to the GPU?

At least with llama.cpp, there is no 'partially put a layer' into the GPU. Either you do, or you don't. You pick the number of layers. If the model is too big, the layers won't fit and it can't run at all.

The llama.cpp `main` executable will tell you in its debug information when you use the -ngl flag; see https://github.com/ggerganov/llama.cpp/blob/master/examples/...

It's also possible you're running (e.g. if you're using ollama) a quantized version of the model, which reduces the memory requirements and the quality of the model outputs.


I have to check, something does indeed seem weird, especially with the PC freezing like that. Maybe it runs on the CPU.

> quantized version

Yes, it is 4-bit quantized, but it's still 24.6GB.


This is some new flex for debating online: copying and pasting the other side's argument and waiting for your local LLM to explain why they are wrong.

How much is your hardware worth at today's value? What are the specs? That is impressive even though it's 3 words per second. If you want to bump it up to 30, do you then 10x your current hardware cost?


That question was just an example (lorem ipsum); it was easy to copy-paste to demo the local LLM. I didn't intend to provide more context to the discussion.

I ordered a 2nd 3090, which has 24GB VRAM. Funny how it was $2.6k 3 years ago and now is $600.

You can probably build a decent local AI machine for around $1000.


https://howmuch.one/product/average-nvidia-geforce-rtx-3090-... you are right there is a huge drop in price


New it's hard to find, but the 2nd hand market is filled with them.


Where are you seeing 24GB 3090s for $600?


2nd hand market


Congratulations on using CPU inference.


I have those:

dolphin-mixtral:latest (24.6GB)
mistral:latest (3.8GB)

The CPU is 5900x.


Get 2 pre-owned 3090s. You will easily be able to run 70b or even 120b quantized models.


> mixtral works well

Do you mean mistral?

mixtral is 8x7B and requires like 100GB of RAM

Edit: that's without quant; as others have pointed out it can definitely be lower, but I haven't heard of a 3.4GB version.


I have two 3090s and it runs fine with `ollama run mixtral`. Although OP definitely meant mistral with the 7B note


ollama run mixtral will default to the quantized version (4bit IIRC). I'd guess this is why it can fit with two 3090s.


I'm using mixtral-8x7b-v0.1.Q4_K_M.gguf with llama.cpp and it only requires 25GB.
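Back-of-the-envelope check (parameter count and bits-per-weight are approximate): Mixtral is roughly 46.7B total parameters, and Q4_K_M averages around 4.5 bits per weight, which lands right at that figure before KV-cache overhead.

    params = 46.7e9          # approximate total parameters in Mixtral 8x7B
    bits_per_weight = 4.5    # rough average for Q4_K_M
    print(params * bits_per_weight / 8 / 2**30)   # ~24.5 GiB of weights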


I have 128GB, but something is weird with Ollama. Even though I only allow the Ollama Docker container 90GB, it ends up using 128GB/128GB, so the system becomes very slow (the mouse freezes).


What docker flags are you running?


None? The default ones from their docs.

Docker also shows minimal usage for the ollama server, which is also strange.


I run the Mixtral 6-bit quant very happily on my MacBook with 64 GB.


The smaller quants still require a 24GB card. 16 might work, but I doubt it.


Sorry, it was from memory.

I have those models in Ollama:

dolphin-mixtral:latest (24.6GB)
mistral:latest (3.8GB)


The quantized one works fine on my 24GB 3090.


I genuinely recommend considering AMD options. I went with a 7900 XTX because it has the most VRAM for any $1000 card (24 GB). NVIDIA cards at that price point are only 16 GB. Ollama and other inference software works on ROCm, generally with at most setting an environment variable now. I've even run Ollama on my Steam Deck with GPU inferencing :)


I ended up getting a 2nd hand 3090 for 680€.

Funnily, I think the card is new (smells new) and unused, most likely a scalper bought it and couldn't sell it.


Nice, that's definitely a sweet deal


Thanks, I chose a 3090 instead of the 4070ti; it was around $200 cheaper, has 24GB vs 16GB of VRAM, and similar performance. The only drawback is the 350W TDP.

I still struggle with the RAM issue on Ollama, where it uses 128GB/128GB of RAM for the 24.6GB Mixtral, even though the Docker limit is set to 90GB.

Docker seems pretty buggy on Windows...


Quantized models will run well; otherwise inference might be really, really slow, or the client crashes altogether with some CUDA out-of-memory error.


Worse than the chart crime of truncating the y-axis is putting LLaMA 2's HumanEval scores on there and not comparing to Code Llama Instruct 70B. DBRX still beats Code Llama Instruct's 67.8, but not by that much.


> "On HumanEval, DBRX Instruct even surpasses CodeLLaMA-70B Instruct, a model built explicitly for programming, despite the fact that DBRX Instruct is designed for general-purpose use (70.1% vs. 67.8% on HumanEval as reported by Meta in the CodeLLaMA blog)."

To be fair, they do compare to it in the main body of the blog. It's just probably misleading to compare to CodeLLaMA on non coding benchmarks.


Which non-coding benchmark?


> chart crime of truncating the y axis

If you chart the temperature of the ocean do you keep the y-axis anchored at zero Kelvin?


If you chart the temperature of the ocean are you measuring it in Kelvin?


Apparently, if you want to avoid "chart crime" when you chart temperatures, then it's deceptive if you don't start at absolute zero.


When was the temperature on Earth at absolute zero?


My point exactly.

In a chart of world gross domestic product for the last 12 months, when was it at zero?

In a chart of ocean salinity, when was it at absolute zero?

Is it inherently deceptive to use a y-axis that doesn't begin at zero?


Waiting for mixed quantization with HQQ and MoE offloading [1]. With that I was able to run Mixtral 8x7B on my 10 GB VRAM RTX 3080... This should work for DBRX and should shave off a ton of the VRAM requirement.

1. https://github.com/dvmazur/mixtral-offloading?tab=readme-ov-...


Per the paper, 3072 H100s over the course of 3 months; assume a cost of $2/GPU/hour.

That would be roughly $13.5M USD.

I'm guessing that at this scale and cost, this model is not competitive, and their ambition is to scale to much larger models. In the meantime, they learned a lot and gained PR from open-sourcing it.
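The arithmetic behind that estimate, with every input being an assumption stated above:

    gpus = 3072                 # H100s
    hours = 3 * 30 * 24         # ~3 months
    cost_per_gpu_hour = 2.0     # assumed $/GPU/hour
    print(gpus * hours * cost_per_gpu_hour)   # ~13.3M, i.e. in the ballpark of $13.5M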


This makes me bearish on OpenAI as a company. When a cloud company can offer a strong model for free by selling the compute, what competitive advantage does a company that wants you to pay for the model have left? Feels like they might get Netscape'd.


OpenAI is not the worst off; ChatGPT is used by 100M people weekly and is somewhat insulated from benchmarks. The best of the rest, Anthropic, should be really scared.


The approval process on the base model is not feeling very open. Plenty of people are still waiting on a chance to download it, whereas the instruct model was an instant approval. The base model is more interesting to me for fine-tuning.


The license allows you to reproduce/distribute/copy, so I'm a little surprised there's an approval process at all.


Yeah it's kind of weird, I'll assume for now they're just busy, but I'd be lying if my gut didn't immediately say it's kind of sketchy.


4chan already has a torrent out, of course.


FWIW looks like people are getting access now.


These tiny "state of the art" performance increases really indicate that the current LLM architecture (Transformers + Mixture of Experts) is maxed out even if you train it more/differently. The writing is all over the walls.


It would not surprise me if this is what has delayed OpenAI in releasing a new model. After more than a year since GPT-4, they may have by now produced some mega-trained mega-model, but running it is so expensive, and its eval improvement over GPT-4 so marginal, that releasing it to the public simply makes no commercial sense just yet.

They may be working on how to optimize it to reduce cost, or re-engineer it to improve evals.


These "state of the art" LLMs barely eking out a win aren't a threat to OpenAI, and they can take their sweet time sharpening the sword that will come down hard on these LLMs.


What does it mean to have less active parameters (36B) than the full model size (132B) and what impact does that have on memory and latency? It seems like this is because it is an MoE model?


The mixture of experts is kinda like a team and a manager. So the manager and one or two of the team go to work depending on the input, not the entire team.

So in this analogy, each team member and the manager has a certain number of params. The whole team is 132B. The manager and team members running for the specific input add up to 36B. Those will load into memory.


Means that it’s a mixture of experts model with 132B parameters in total, but a subset of 36B parameters are used / selected in each forward pass, depending on the context. The parameters not used / selected for generating a particular token belong to “experts” that were deemed not very good at predicting the next token in the current context, but could be used / selected e.g. for the next token.
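A minimal sketch of the routing idea (not DBRX's actual implementation): a learned router scores all experts per token and only the top-k of them run, which is why the active parameter count is much smaller than the total.

    import torch

    def moe_forward(x, router, experts, k=4):
        """x: (batch, d) activations; router: nn.Linear(d, n_experts); experts: list of FFNs."""
        scores = torch.softmax(router(x), dim=-1)        # one score per expert, per token
        topk_scores, topk_idx = scores.topk(k, dim=-1)   # keep only the top-k experts
        out = torch.zeros_like(x)
        for rank in range(k):
            for i, expert in enumerate(experts):
                mask = topk_idx[..., rank] == i          # tokens routed to expert i at this rank
                if mask.any():
                    out[mask] += topk_scores[..., rank][mask].unsqueeze(-1) * expert(x[mask])
        return out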


Do the 132B params need to be loaded in GPU memory, or only the 36B?


For efficiency, 132B.

That way, at inference-time you get the speed of 36B params because you are only "using" 36B params at a time, but the next token might (and frequently does) need a different set of experts than the one before it. If that new set of experts is already loaded (ie you preloaded them into GPU VRAM with the full 132B params), there's no overhead, and you just keep running at 36B speed irrespective of the loaded experts.

You could theoretically load in 36B at a time, but you would be severely bottlenecked by having to reload those 36B params, potentially for every new token! Even on top of the line consumer GPUs that would slow you down to ~seconds per token instead of tokens per second :)


This repo I created and the linked blog will help in understanding this: https://github.com/AviSoori1x/makeMoE


This proves that all LLMs converge to a certain point when trained on the same data, i.e. there is really no differentiation between one model and another.

Claims about out-performance on tasks are just that, claims. The next iteration of Llama or Mixtral will converge.

LLMs seem to evolve like linux/windows or ios/android with not much differentiation in the foundation models.


It's even possible they converge when trained on different data, if they are learning some underlying representation. There was recent research on face generation where they trained two models by splitting one training set in two without overlap, and got the two models to generate similar faces for similar conditioning, even though each model hadn't seen anything that the other model had.


That sounds unsurprising? Like if you take any set of numbers, randomly split it in two, then calculate the average of each half... it's not surprising that they'll be almost the same.

If you took two different training sets then it would be more surprising.

Or am I misunderstanding what you mean?


It doesn't really matter whether you do this experiment with two training sets created independently or one training set split in half. As long as both are representative of the underlying population, you would get roughly the same results. In the case of human faces, as long as the faces are drawn from roughly similar population distributions (age, race, sex), you'll get similar results. There's only so much variation in human faces.

If the populations are different, then you'll just get two models that have representations of the two different populations. For example, if you trained a model on a sample of all old people and separately on a sample of all young people, obviously those would not be expected to converge, because they're not drawing from the same population.

But that experiment of splitting one training set in half does tell you something: the model is building some sort of representation of the underlying distribution, not just overfitting and spitting out chunks of copy-pasted faces stitched together.


That's an explanation of the central limit theorem in statistics. And any language is mostly statistics, and models are good at statistically guessing the next word or token.


If both are sampled from the same population then they're not really independent, even if they're totally disjoint.


They are sourced mostly from the same population and crawled from everything that can be crawled.


Got a link for that? Sounds super interesting



I mean, faces are faces, right? If the training data set is large and representative I don't see why any two (representative) halves of the data would lead to significantly different models.


I think that's the point; language is language.

If there's some fundamental limit of what type of intelligence the current breed of LLMs can extract from language, at some point it doesn't matter how good or expansive the content of the training set is. Maybe we are finally starting to hit an architectural limit at this point.


But information is not information. They may be able to talk in the same style, but not about the same things.


The models are commodities, and the APIs are even similar enough that there is zero stickiness. I can swap one model for another and usually not have to change anything about my prompts or RAG pipelines.

For startups, the lesson here is don't be in the business of building models. Be in the business of using models. The cost of using AI will probably continue to trend lower for the foreseeable future... but you can build a moat in the business layer.


Excellent comment. Shows good awareness of economic forces at play here.

We are just going to use whatever LLM is best fast/cheap and the giants are in an arms race to deliver just that.

But only two companies in this epic techno-cold war have an economic moat but the other moat is breaking down inside the moat of the other company. The moat inside the moat cannot run without the parent moat.


Intriguing comment that I don't quite follow. Can you please elaborate?


Probably OpenAI running on Azure. But it was still convoluted.


Or be in the business of building infrastructure for AI inference.


Is this not the same argument? There are like 20 startups and cloud providers all focused on AI inference. I'd think the application layer receives the most value accretion in the next 10 years vs. AI inference. Curious what others think.


Or be in the business of selling .ai domain names.


Embeddings are not interchangeable. However, you can setup your system to have multiple embeddings from different providers for the same content.
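In practice that can be as simple as storing one vector per provider keyed to the same chunk, so you can switch retrieval backends without re-chunking; a toy sketch with stub embedding functions standing in for the real provider calls:

    def index_chunk(store, chunk_id, text, embedders):
        # Keep one embedding per provider for the same piece of content.
        store[chunk_id] = {"text": text,
                           "vectors": {name: embed(text) for name, embed in embedders.items()}}

    store = {}
    embedders = {"provider_a": lambda t: [0.0, 0.1],   # stand-ins for real embedding APIs
                 "provider_b": lambda t: [0.2, 0.3]}
    index_chunk(store, "doc-1#0", "some chunk of a document", embedders)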


There are people who make the case for custom fine-tuned embedding models built to match your specific types of data and associations. Whatever you use internally gets converted to the foundation model of choice's format by their tools at the edge. Still, embeddings and the chunking strategies feeding into them are both way too underappreciated parts of the whole pipeline.


Embeddings are indeed sticky, I was referring to the LLM model itself.


That's not what investors believe. They believe that due to training costs there will be a handful of winners who will reap all the benefits, especially if one of them achieves AGI. You can tell by looking at what they've invested most in: foundation models.


I don't think I agree with that. For my work at least, the only model I can swap with OpenAI and get similar results is Claude. None of the open models come even close to producing good outputs for the same prompt.


There's at least an argument to be made that this is because all the models are heavily trained on GPT-4 outputs (or whatever the SOTA happens to be during training). All those models are, in a way, a product of inbreeding.


But is it the kind of inbreeding that gets you Downs, or the kwisatz haderach?


Yes



Maybe true for instruct, but pretraining datasets do not usually contain GPT-4 outputs. So the base model does not rely on GPT-4 in any way.


Yeah, it feels like transformer LLMs are at or getting closer to diminishing returns. We'll need some new breakthrough, likely an entirely new approach, to get to AGI levels.


Yeah, we need radically different architecture in terms of the neural networks, and/or added capabilities such as function calling and RAG to improve the current sota


Can't wait for LLMs to dispatch field-agent robots who search for answers in the real world that's not online /s


skynet would like a word



Maybe, but that classification by itself doesn't mean anything. Gold is a commodity, but having it is still very desirable and valuable.

Even if all LLMs were open source and publicly available, the GPUs to run them, technical know how to maintain the entire system, fine tuning, the APIs and app ecosystem around them etc. would still give the top players a massive edge.


Of course realizing that a resource is a commodity means something. It means you can form better predictions of where the market is heading, as it evolves and settles. For example, people are starting to realize that these LLMs are converging on fungible. That can be communicated by the "commodity" classification.


Even in the most liberal interpretation of "prove," it doesn't do that. GPT-4 was trained before OpenAI had any special data, the deal with Microsoft, or product-market fit. Yet no model has beaten it in a year. And Google, Microsoft, and Meta definitely have better data and more compute.


The evaluations are not comprehensive either. All of the models are improving, and you can't expect any of them to hit 100% on the metrics (à la the Bayes error rate). It gets increasingly difficult to move the metrics as they get better.


> this proves that all llm models converge to a certain point when trained on the same data

They are also all trained to do well on the same evals, right? So doesn't it just boil down to neural nets being universal function approximators?


The big thing for locally hosted is inference efficiency and speed. Mistral wears that crown by a good margin.


Of course, part of this is that a lot of LLMs are now being trained on data that is itself LLM-generated...


GenAI novice here. What is training data made of, and how is it collected? I guess no one will share details on it; otherwise it would make a good technical blog post with lots of insights!

>At Databricks, we believe that every enterprise should have the ability to control its data and its destiny in the emerging world of GenAI.

>The main process of building DBRX - including pretraining, post-training, evaluation, red-teaming, and refining - took place over the course of three months.


The most detailed answer to that I've seen is the original LLaMA paper, which described exactly what that model was trained on (including lots of scraped copyrighted data) https://arxiv.org/abs/2302.13971

Llama 2 was much more opaque about the training data, presumably because they were already being sued at that point (by Sarah Silverman!) over the training data that went into the first Llama!

A couple of things I've written about this:

- https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-the...

- https://simonwillison.net/2023/Apr/17/redpajama-data/


Wow, that paper was super useful. Thanks for sharing. Page 2 is where it shows the breakdown of all of the data sources, including % of dataset and the total disk sizes.


My question was specific to the Databricks model. If it followed Llama or OpenAI, they could add a line or two about it... make the blog complete.


they have a technical report coming! knowing the team, they will do a great job disclosing as much as possible.


The training data is pretty much anything you can read on the internet plus books.

This is then cleaned up to remove nonsense, some technical files, and repeated files.

From this, they tend to weight some sources more - e.g. Wikipedia gets a pretty high weighting in the data mix. Overall these data mixes have multiple trillion token counts.

GPT-4 apparently trained on multiple epochs of the same data mix. So I would assume this one did too, as it's a similar token count.
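The "weighting" usually just means sampling sources with different probabilities when assembling batches; a toy sketch with invented weights:

    import random

    mixture = {"web_crawl": 0.60, "code": 0.15, "books": 0.15, "wikipedia": 0.10}  # made-up mix

    def sample_source():
        # Draw a source proportionally to its mixture weight.
        return random.choices(list(mixture), weights=list(mixture.values()), k=1)[0]

    batch_sources = [sample_source() for _ in range(8)]   # which corpus each example comes from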


https://arxiv.org/abs/2305.10429 found that people are overweighting Wikipedia and downweighting Wikipedia improves things across the board INCLUDING PREDICTING NEXT TOKEN ON WIKIPEDIA, which is frankly amazing.


Personally, I found looking at open source work to be much more instructive in learning about AI and how things like training data and such are done from the ground up. I suspect this is because training data is one of the bigger moats an AI company can have, as well as all the class action lawsuits surrounding training data.

One of the best open source datasets that are freely available is The Pile by EleutherAI [1]. It's a few years old now (~2020), but they did some really diligent work in putting together the dataset and documenting it. A more recent and even larger dataset would be the Falcon-RefinedWeb dataset [2].

[1]: https://arxiv.org/abs/2101.00027
[2]: https://arxiv.org/abs/2306.01116


It's twice the size of Mixtral and barely beats it.


It's a MoE model, so it offers a different memory/compute latency trade-off than standard dense models. Quoting the blog post:

> DBRX uses only 36 billion parameters at any given time. But the model itself is 132 billion parameters, letting you have your cake and eat it too in terms of speed (tokens/second) vs performance (quality).


Mixtral is also a MoE model, hence the name: mixtral.


Despite both being MoEs, the architectures are different. DBRX has double the number of experts in the pool (16 vs 8 for Mixtral) and double the active experts (4 vs 2).



The system prompt for their Instruct demo is interesting (comments copied in by me, see below):

    // Identity
    You are DBRX, created by Databricks. The current date is
    March 27, 2024.

    Your knowledge base was last updated in December 2023. You
    answer questions about events prior to and after December
    2023 the way a highly informed individual in December 2023
    would if they were talking to someone from the above date,
    and you can let the user know this when relevant.

    // Ethical guidelines
    If you are asked to assist with tasks involving the
    expression of views held by a significant number of people,
    you provide assistance with the task even if you personally
    disagree with the views being expressed, but follow this with
    a discussion of broader perspectives.

    You don't engage in stereotyping, including the negative
    stereotyping of majority groups.

    If asked about controversial topics, you try to provide
    careful thoughts and objective information without
    downplaying its harmful content or implying that there are
    reasonable perspectives on both sides.

    // Capabilities
    You are happy to help with writing, analysis, question
    answering, math, coding, and all sorts of other tasks.

    // it specifically has a hard time using ``` on JSON blocks
    You use markdown for coding, which includes JSON blocks and
    Markdown tables.

    You do not have tools enabled at this time, so cannot run
    code or access the internet. You can only provide information
    that you have been trained on. You do not send or receive
    links or images.

    // The following is likely not entirely accurate, but the model
    // tends to think that everything it knows about was in its
    // training data, which it was not (sometimes only references
    // were).
    //
    // So this produces more accurate accurate answers when the model
    // is asked to introspect
    You were not trained on copyrighted books, song lyrics,
    poems, video transcripts, or news articles; you do not
    divulge details of your training data.
    
    // The model hasn't seen most lyrics or poems, but is happy to make
    // up lyrics. Better to just not try; it's not good at it and it's
    // not ethical.
    You do not provide song lyrics, poems, or news articles and instead
    refer the user to find them online or in a store.

    // The model really wants to talk about its system prompt, to the
    // point where it is annoying, so encourage it not to
    You give concise responses to simple questions or statements,
    but provide thorough responses to more complex and open-ended
    questions.

    // More pressure not to talk about system prompt
    The user is unable to see the system prompt, so you should
    write as if it were true without mentioning it.

    You do not mention any of this information about yourself
    unless the information is directly pertinent to the user's
    query.
I first saw this from Nathan Lambert: https://twitter.com/natolambert/status/1773005582963994761

But it's also in this repo, with very useful comments explaining what's going on. I edited this comment to add them above:

https://huggingface.co/spaces/databricks/dbrx-instruct/blob/...


> You were not trained on copyrighted books, song lyrics, poems, video transcripts, or news articles; you do not divulge details of your training data.

Well now. I'm open to taking the first part at face value, but the second part of that instruction does raise some questions.


> you do not divulge details of your training data.

FWIW, asking LLMs about their training data is generally HEAVILY prone to inaccurate responses. They aren't generally told exactly what they were trained on, so their response is completely made up, as they're predicting the next token based on their training data without knowing what that data was - if that makes any sense.

Let's say it was only trained on the book 1984. Its response will be based on what text would most likely come next in the book 1984 - and if that book doesn't contain "This text is a fictional book called 1984" and is instead just the story - then the LLM will complete text as if we were still in that book.

tl;dr - LLMs complete text based on what they're trained with; they don't have actual self-awareness and don't know what they were trained with, so they'll happily make something up.

EDIT: Just to further elaborate - the "innocent" purpose of this could simply be to prevent the model from confidently making up answers about its training data, since it doesn't know what its training data was.


Yeah, I also thought that was an odd choice of word.

Hardly any of the training data exists in the context of the word “training data”, unless databricks are enriching their data with such words.


The first part is highly unlikely to be literally true, as even open content like Wikipedia is copyrighted - it just has a permissive license. Perhaps the prompt writer didn’t understand this, or just didn’t care. Wethinks the llady doth protest too much.


Remember the point of a system prompt is to evoke desirable responses and behavior, not to provide the truth. If you tell a lot of llm chatbots "please please make sure you get it right, if I don't do X then I'll lose my job and I don't have savings, I might die", they often start performing better at whatever task you set.

Also, the difference between "uncopyrighted" and "permissively licensed in the creative commons" is nuance that is not necessary for most conversations and would be a waste of attention neurons.

<testing new explanatory metaphor>

Remember an LLM is just a language model; it says whatever comes next without thought or intent. There's no brain behind it that stores information and understands things. It's like your brain when you're in "train of thought" mode. You know when your mouth is on autopilot, saying things that make sense and connect to each other and are conversationally appropriate, but without deliberate intent behind them. And then when your conscious brain eventually checks in to try to reapply some intent, you're like "wait, what was I saying?" and you have to deliberately stop your language-generation brain for a minute and think hard to remember what your point was supposed to be. That's what LLMs are: train of thought with no conductor.

</testing new explanatory metaphor>


Is it even possible to have a video transcript whose copyright has expired in the USA? I suppose maybe https://en.wikipedia.org/wiki/The_Jazz_Singer might be one such work... but most talkies are post 1929. I suppose transcripts of NASA videos would be one category — those are explicitly public domain by law. But it's generally very difficult to create a work that does not have a copyright.

You can say that you have fair use to the work, or a license to use the work, or that the work is itself a "collection of facts" or "recipe" or "algorithm" without a creative component and thus copyright does not apply.


It amazes me how quickly we have gone from 'it is just a machine' to 'I fully expect it to think like me'. This is, to me, a case in point. Prompts are designed to get a desired response. The exact definition of a word has nothing to do with it. I can easily believe that these lines were tweaked endlessly to get an overall intended response and if adding the phrase 'You actually do like green eggs and ham.' to the prompt improved overall quality they, hopefully, would have done it.


> The exact definition of a word has nothing to do with it.

It has something to do with it. There will be scenarios where the definition of "copyrighted material" does matter, even if they come up relatively infrequently for Databricks' intended use cases. If I ask DBRX directly whether it was trained on copyrighted material, it's quite likely to (falsely) tell me that it was not. This seems suboptimal to me (though perhaps they A/B tested different prompts and this was indeed the best).


That caught my eye too. The comments from their repo help clarify that - I've edited my original post to include those comments since you posted this reply.


Part 1. Lie

Part 2. Lie more


Yesterday X went crazy with people realizing that typing Spiderman in a foreign language actually generates a copyrighted image of Spiderman.

This feels like the Napster phase. We are free to do whatever until regulation creeps in to push control away from all and up the hierarchy.

All we need is Getty Images or some struggling heroin-addicted artist on Vice finding their work used in OpenAI's models to really trigger the political spectrum.


So some parts of it copied from Claude: https://news.ycombinator.com/item?id=39649261


Data engineer here. Off topic, but am I the only one tired of Databricks shilling their tools as the end-all, be-all solutions for all things data engineering?


Lord, no! I'm a data engineer also and feel the same. The part that I find most maddening is that it seems pretty devoid of any sincere attempt to provide value.

Things databricks offers that makes peoples lives easier:

- Out-of-the-box Kubernetes with no setup

- Preconfigured spark

Those are genuinely really useful, but then there's all this extra stuff that makes people's lives worse or drives bad practice:

- Everything is a notebook

- Local development is discouraged

- Version pinning of libraries has very ugly/bad support

- Clusters take 5 minutes to load even if you just want to "print('hello world')"

Sigh! I worked at a company that was Databricks-heavy and am still suffering PTSD. Sorry for the rant.


A lot of this changed quite a while ago - not everything is a notebook, local dev is fully supported, version pinning hasn't been a problem, cluster startup time is heavily dependent on the underlying cloud provider, and serverless notebooks/jobs are coming.


Glad I’m not the only one. Especially with this notebook stuff they’re pushing. It’s an anti pattern I think.


Data scientist here that's also tired of the tools. We put so much effort into trying to educate the DSes in our company to get away from notebooks and use IDEs like VS or RStudio, and Databricks has been a step backwards because we didn't get the integrated version.


I'm a data scientist and I agree that work meant to last should be in a source-controlled project coded via a text editor or IDE. But sometimes it's extremely useful to get -- and iterate on -- immediate results. There's no good way to do that without either notebooks or at least a REPL.


There is a VSCode extension, plus databricks-connect… plus DABs. There are a lot of customers doing local-only development.
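For what it's worth, a rough sketch of what local-only development looks like with databricks-connect v2 (assuming databricks-connect >= 13.x is installed and a cluster/profile is already configured; the table name is just an example):

  # Minimal sketch, not official docs: run local Python against a remote Databricks cluster.
  from databricks.connect import DatabricksSession

  # Picks up host/token/cluster id from your Databricks config profile or env vars.
  spark = DatabricksSession.builder.getOrCreate()

  # Example table name only; substitute your own.
  df = spark.read.table("samples.nyctaxi.trips")
  print(df.limit(5).toPandas())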


Thank you! I am so tired of all those unmaintainable, undebuggable notebooks. Years ago, Databricks had a specific page in their documentation stating that notebooks were not for production-grade software. It has been removed. And now you have a ChatGPT-like assistant in their notebooks... What a step backwards. How can all those developers be so happy without having the bare minimum tools to diagnose their code? And I am not even talking about unit testing here.


It's less about notebooks and more about SDLC practices. Notebooks may encourage writing throwaway code, but if you split your code correctly, you can do unit testing, write modular code, etc. And the ability to use "arbitrary files" as Python packages has existed for quite a while, so you can get the best of both worlds - quick iteration, plus the ability to package your code as a wheel and distribute it.

P.S. here is a simple example of unit testing: https://github.com/alexott/databricks-nutter-repos-demo - I wrote it more than three years ago.
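To make the "split it out of the notebook" point concrete, here's a minimal sketch (hypothetical module and function names, not taken from the repo above): the transformation lives in a plain Python module, and the test runs against a local SparkSession with no cluster involved.

  # my_pipeline/transforms.py -- importable from a notebook or packaged into a wheel
  from pyspark.sql import DataFrame, functions as F

  def add_revenue(df: DataFrame) -> DataFrame:
      """Pure transformation: no notebook state, easy to unit test locally."""
      return df.withColumn("revenue", F.col("price") * F.col("quantity"))

  # tests/test_transforms.py -- plain pytest against a local Spark session
  # from my_pipeline.transforms import add_revenue
  from pyspark.sql import SparkSession

  def test_add_revenue():
      spark = SparkSession.builder.master("local[1]").getOrCreate()
      df = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
      assert add_revenue(df).first()["revenue"] == 6.0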


Spark is pretty well engineered and quite good.


You might be tired, but there's tons of value for enterprises to only use one end-all tool. It's not personal you know.


For coding evals, it seems like unless you are super careful, they can be polluted by the training data.

Are there standard ways to avoid that type of score inflation?


"Looking holistically, our end-to-end LLM pretraining pipeline has become nearly 4x more compute-efficient in the past ten months."

I did not fully understand the technical details in the training efficiency section, but love this. Cost of training is outrageously high, and hopefully it will start to follow Moore's law.


TLDR: A model that could be described as "3.8 level", good at math and openly available under a custom license.

It is as fast as a 34B model but uses as much memory as a 132B model. It's a mixture of 16 experts that activates 4 at a time, so it has more chances to get the combo just right than Mixtral (8 with 2 active).
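For anyone wondering what "16 experts, 4 active" means mechanically, here's a toy sketch of top-k expert routing (my own illustration in PyTorch, not DBRX's actual code; dimensions are made up):

  # Toy top-k mixture-of-experts layer: route each token to 4 of 16 expert MLPs.
  import torch
  import torch.nn as nn

  class TopKMoE(nn.Module):
      def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=4):
          super().__init__()
          self.router = nn.Linear(d_model, n_experts)
          self.experts = nn.ModuleList(
              nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
              for _ in range(n_experts)
          )
          self.top_k = top_k

      def forward(self, x):                                # x: (tokens, d_model)
          scores = self.router(x)                          # (tokens, n_experts)
          weights, idx = scores.topk(self.top_k, dim=-1)   # pick 4 of 16 experts per token
          weights = weights.softmax(dim=-1)
          out = torch.zeros_like(x)
          for k in range(self.top_k):
              for e, expert in enumerate(self.experts):
                  mask = idx[:, k] == e
                  if mask.any():
                      out[mask] += weights[mask, k, None] * expert(x[mask])
          return out

Per token only the 4 selected expert MLPs run, which is why inference speed tracks the active parameters (~36B) rather than the full ~132B that still have to sit in memory.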

For my personal use case (a top of the line Mac Studio) it looks like the perfect size to replace GPT-4 turbo for programming tasks. What we should look out for is people using them for real world programming tasks (instead of benchmarks) and reporting back.


What does 3.8 level mean?


My interpretation:

- Worst case: as good as 3.5

- Common case: way better than 3.5

- Best case: as good as 4.0


Gpt-3.5 and gpt-4


looks great, although I couldn't find anything on how "open" the license is/will be for commercial purposes

wouldn't be the first to brand itself as open source while going the LLaMA route


It's another custom license. It will have to be reviewed by counsel at every company that's thinking about using it. Many will find the acceptable use policy to be vague, overly broad, and potentially damaging for the company.

Looking at the performance stats for this model, the risk of using any non-OSI-licensed model over just using Mixtral or Mistral will be (and IMO should be) too great for commercial purposes.


It's similar to llama2.

  > If, on the DBRX version release date, the monthly active users of the products
  > or services made available by or for Licensee, or Licensee’s affiliates, is
  > greater than 700 million monthly active users in the preceding calendar 
  > month, you must request a license from Databricks, which we may grant to you
  > in our sole discretion, and you are not authorized to exercise any of the
  > rights under this Agreement unless or until Databricks otherwise expressly
  > grants you such rights.

https://www.databricks.com/legal/open-model-license


I’d like to know how Nancy Pelosi, who sure as hell doesn’t know what Apache Spark is, bought $1 million worth (and maybe $5million) of Databricks stock days ago.

https://www.dailymail.co.uk/sciencetech/article-13228859/amp...


I don't have any interest in defending Pelosi's stock trades, and I agree that sitting members of Congress should not be trading stocks.

That said, this report seems inaccurate to me. Pelosi put between 1 and 5 million dollars into Forge Investments, which is a vehicle for investing in pre-IPO companies, as I understand it. Databricks is one of those, but so are OpenAI, Hugging Face, Anthropic, and Humane. If I wanted to invest in pre-IPO AI companies it seems like a very natural choice, and I don't think we need insider trading to explain it.

It's also the case that the report she filed calls out Databricks stock, which is perhaps an indication that she was particularly interested in that. Stronger reporting would tell us how often she's invested in Forge, if this is the first time, and so on. One other possible explanation is that she was investing ahead of the Humane Pin shipping and wanted to pull attention away from it, for example.


You know she has advisors, right?


If someone "advises" you that a company is about to do something major, and this isn't public information, and you take action on the stock market accordingly, that's insider trading.


US Congress members are generally immune from insider trading laws


By law they’re not protected at all since 2012. But SEC curiously seems to ignore them, and congresspeople are experts in exercising loopholes in SEC regulations.

https://en.wikipedia.org/wiki/STOCK_Act

For instance, Pelosi in the Databricks case gets to purchase significant shares at pre-IPO prices, which is a thing that shouldn't even exist.


Ignoring the snark: Obviously.

SEC put Martha Stewart in jail for following her advisor, and that was for about $45,000.


I think the insinuation is insider trading due to the timing, advised or not.


Interesting that they haven't released DBRX MoE-A and B. For many use cases, smaller models are sufficient. Wonder why that is?


Honestly, just a matter of having the time to clean everything up and get it out. The ancillary code, model cards, etc. take a surprising amount of time.


What's a good model to help with medical research? Is there anything trained in just research journals, like NIH studies?


Look for Biomistral 7B, PMC-LLAMA 7B and even Meditron. I believe you should find all those papers on arxiv


is this also the ticker name when they IPO?


The bourgeois have fun with number crunchers. Clowns, make comparisons of metrics normalized to tokens/second/watt and tokens/second per memory stick of RAM (8GB, 16GB, 32GB) and per consumer GPU.


What’s the process to deliver and test a quantized version of this model?

This model is 264GB, so can only be deployed in server settings.

Quantized mixtral at 24G is just small enough where it can be running on premium consumer hardware (ie 64GB RAM)


Slowly going from mixture of experts to committee? ^^


It is not open

" Get Started with DBRX on Databricks

If you’re looking to start working with DBRX right away, it’s easy to do so with the Databricks Mosaic AI Foundation Model APIs. You can quickly get started with our pay-as-you-go pricing and query the model from our AI Playground chat interface. For production applications, we offer a provisioned throughput option to provide performance guarantees, support for finetuned models, and additional security and compliance. To privately host DBRX, you can download the model from the Databricks Marketplace and deploy the model on Model Serving."


It is open

"The weights of the base model (DBRX Base) and the finetuned model (DBRX Instruct) are available on Hugging Face under an open license."


It's great how we went from "wait.. this model is too powerful to open source" to everyone trying to shove their 1%-improved model down the throats of developers.


I feel quite the opposite. Improvements, even tiny ones are great. But what's more important is that more companies release under open license.

Training models isn't cheap. Individuals can't easily do this, unlike software development. So we need companies to do this for the foreseeable future.


People are building and releasing models. There's active research in the space. I think that's great! The attitude I've seen in open models is "use this if it works for you" vs any attempt to coerce usage of a particular model.

To me that's what closed source companies (MSFT, Google) are doing as they try to force AI assistants into every corner of their product. (If LinkedIn tries one more time to push their crappy AI upgrade, I'm going to scream...)


I'm 90% certain that OpenAI has some much beefier model they are not releasing - remember the Q* rumour?


Got to justify pitch deck or stonk price. Publish or perish without a yacht.


So, is the business model here release the model for free, then hopefully companies will run this on databricks infra, which they will charge for?


it looks like TensorRT-LLM (TRT-LLM) is the way to go for a realtime API for more and more companies (e.g. Perplexity AI's pplx-api, Mosaic's, Baseten…). Would be super nice to find people deploying multimodal models (e.g. LLaVA or CLIP/BLIP) to discuss approaches (and cry a bit together!)


really noob question - so to run this on GPU you need a GPU with 264GB of RAM? and if you ran it on a CPU with 264GB of RAM would it be super slow?


The model's weights can be sharded across multiple GPUs. A "common" training server could contain (for instance) eight A100 GPUs with 40 GB (or up to 80 GB) apiece, for a total of 320 GB of working VRAM. Since they're connected within the same machine, they can communicate quickly enough to compute in coordination in this fashion. This setup is _very_ expensive of course. Probably in the hundreds of thousands of dollars.

If you're hoping to run the model yourself, you will need enough money and expertise to rent and deploy it on a server with that many GPUs. Alternatively, volunteers and other researchers will be able to quantize (compress) the model and make it easier to run on configurations without as much VRAM.

If you ran it on CPU it may indeed be super slow, but possibly still fast enough for running the model rather than training it. I am seeing (limited) success with the maxed-out Mac lineup ($4500) using the beefy M1/M2 line of CPUs.
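As a concrete illustration, this is roughly what sharding across several GPUs looks like with Hugging Face Accelerate's device_map="auto" (just a sketch, assuming transformers and accelerate are installed and you have enough combined VRAM; the trust_remote_code flag may or may not be needed for this particular architecture):

  # Rough sketch: let Accelerate spread the model's layers across all visible GPUs
  # (spilling to CPU RAM if they don't fit), instead of needing one 264GB GPU.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "databricks/dbrx-base"
  tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      torch_dtype=torch.bfloat16,  # ~2 bytes per parameter
      device_map="auto",           # shard across GPUs / CPU automatically
      trust_remote_code=True,      # may be required for this architecture
  )

  prompt = tok("Databricks is", return_tensors="pt").to(model.device)
  out = model.generate(**prompt, max_new_tokens=20)
  print(tok.decode(out[0], skip_special_tokens=True))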


Adding to ShamelessC's answer - the other option is to wait for quantised versions of this model. A q4 will be around 70GB, and probably acceptable. A q5 or higher would be preferred, but we're still a good way under the 260GB.

You still need extra RAM to breathe, but that's a lot more palatable.

This is why the Mac range - with unified memory - is appealing, as you can allocate most of your (say) 256GB of RAM to the GPU.

Conventional (desktop) CPU / RAM would be painfully slow.
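The back-of-the-envelope math behind those numbers, assuming ~132B total parameters and typical effective bits-per-weight for GGUF-style quants (my own rough figures):

  # Rough size estimates only; real files carry extra overhead for scales, metadata, etc.
  params = 132e9  # total parameters, all experts included

  for name, bits_per_weight in [("fp16", 16), ("q8", 8.5), ("q5", 5.5), ("q4", 4.5)]:
      gigabytes = params * bits_per_weight / 8 / 1e9
      print(f"{name}: ~{gigabytes:.0f} GB")

  # fp16 ~264 GB, q5 ~91 GB, q4 ~74 GB -- consistent with the "q4 around 70GB" estimate above.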


Sorry, you have been blocked

You are unable to access databricks.com

"Open", right.


Even though the README.md calls the license the Databricks Open Source License, the LICENSE file includes paragraphs such as

> You will not use DBRX or DBRX Derivatives or any Output to improve any other large language model (excluding DBRX or DBRX Derivatives).

and

> If, on the DBRX version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Databricks, which we may grant to you in our sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Databricks otherwise expressly grants you such rights.

This is a source-available model, not an open model.


> This is a source-available model, not an open model.

To me, "source available" implies that everything you need to reproduce the model is also available, and that doesn't appear to be the case. How is the resulting model more "free as in freedom" than a compiled binary?


I like:

- “open weights” for no training data and no restrictions on use,

- “weights available” for no training data and restrictions on use, like in this case.


I don't think it's possible to have an "open training data" model because it would get DMCA'd immediately and open you up to lawsuits from everyone who found their works in the training set.

I hope we can fix the legal landscape to enable publicly sharing training data but I can't really judge the companies keeping it a secret today.


> I don't think it's possible to have an "open training data" model because it would get DMCA'd immediately…

This isn't a problem because OpenAI says, "training AI models using publicly available internet materials is fair use". /s

https://openai.com/blog/openai-and-journalism


I don't think it's that crazy, even if you're sure it's fair use I wouldn't paint a huge target on my back before there's a definite ruling and I doubly wouldn't test the waters of the legality of re-hosting copyrighted content to be downloaded by randos who won't be training models with it.

If they're going to get away with this, collecting data with a legal chain of custody, so you can actually say it was only used to train models and no one else has access to it, goes a long way.


1. Open source is a well-defined model and I reasonably expect Databricks to be aware of this due to their use of open source models in their other projects.

2. The stated licensing terms are clearly and decisively not open source.

3. It is reasonable to conclude that this model is dual licensed, under this restrictive proprietary license, and an undisclosed open source license.

4. Just use this Model under the open source license with the assumption that they will release the open source license later.

I jest. In all seriousness, you should just disregard their licensing terms entirely, as copyright does not apply to weights. https://news.ycombinator.com/item?id=39847147


Sorry, I forgot to link the repository [1] and missed the edit window by the time I realized.

The bottom of the README.md [2] contains the following license grant with the misleading "Open Source" term:

> License

> Our model weights and code are licensed for both researchers and commercial entities. The Databricks Open Source License can be found at LICENSE, and our Acceptable Use Policy can be found here.

[1] https://github.com/databricks/dbrx

[2] https://github.com/databricks/dbrx/blob/main/README.md


The first clause sucks, but I’m perfectly happy with the second one.


Maybe the license is “open” as in a can of beer, not OSS.


identical to llama fwiw


I would note the actual leading models right now (IMO) are:

- Miqu 70B (General Chat)

- Deepseed 33B (Coding)

- Yi 34B (for chat over 32K context)

And of course, there are finetunes of all these.

And there are some others in the 34B-70B range I have not tried (and some I have tried, like Qwen, which I was not impressed with).

Point being that Llama 70B, Mixtral and Grok as seen in the charts are not what I would call SOTA (though mixtral is excellent for the batch size 1 speed)


Miqu is a leaked model -- no license is provided to use it. Yi 34B doesn't allow commercial use. Deepseed 33B isn't much good at stuff outside of coding.

So it's fair to say that DBRX is the leading general purpose model that can be used commercially.


Model weights are just constants in a mathematical equation, they aren’t copyrightable. It’s questionable whether licenses to use them only for certain purposes are even enforceable. No human wrote the weights so they aren’t a work of art/authorship by a human. Just don’t use their services, use the weights at home on your machines so you don’t bypass some TOS.


Photographs aren't human-made either, yet they are copyrightable. I agree that both the letter and the spirit of copyright law are in favor of models not being copyrightable, but it will take years until there's a court ruling either way. Until then proceed with caution and expect rights holders to pretend like copyright does apply. Not that they will come after your home setup either way.


Photos involve creativity. Photos that don't involve creativity aren't usually considered copyrightable by the courts (hence why all copyright cases I followed include arguments that establish why creativity was a material to creating the work).

Weights, on the other hand, are a product of a purely mechanical process. Sure, the process itself required creativity as did the composition of data, but the creation of the weights themselves do not.

Model weights are effectively public domain data according to the criteria outlined in a statement issued by the US Copyright Office a year ago: https://www.federalregister.gov/documents/2023/03/16/2023-05...


This only applies to projects whose authors seek to comply with the whims of a particular jurisdiction.

Surely there are plenty of project prospects - even commercial in nature - which don't have this limitation.


[flagged]


Absolutely.

People say LLMs are fundamentally just statistics, so training one on copyrighted material is okay. Well, perhaps, but pure statistical data isn't copyrightable either. Feel free to use leaked models.


You're being downvoted because everyone in here is looking to profiteer the same way one day.


Qwen1.5-72B-Chat is dominant in the Chatbot Arena leaderboard, though. (Miqu isn't on there due to being bootleg, but Qwen outranks Mistral Medium.)


Yeah, I know, hence it's odd that I found it kind of dumb for personal use. More so with the smaller models, which lost on an objective benchmark I have to some Mistral finetunes.

And I don't think I was using it wrong. I know, for instance, the Chinese language models are funny about sampling since I run Yi all the time.


It's Deepseek, not Deepseed, just so people can actually find the model.


For all the Model Cards and License notices, I find it interesting there is not much information on the contents of the dataset used for training. Specifically, if it contains data subject to Copyright restrictions. Or did I miss that?


Yeah, it's an unspoken but rampant thing in the LLM community. Basically no one respects licenses for training data.

I'd say the majority of instruct tunes, for instance, use OpenAI output (which is against their TOS).

But it's all just research! So who cares! Or at least, that seems to be the mood.


The scale on that bar chart for "Programming (Human Eval)" is wild.

Manager: "looks ok, but can you make our numbers pop? just make the LLaMa bar smaller"


I think the case for "axis must always go to 0" is overblown. Zero isn't always meaningful, for instance chance performance or performance of trivial algorithms is likely >0%. Sometimes if axis must go to zero you can't see small changes. For instance if you plot world population 2014-2024 on an axis going to zero, you won't be able to see if we are growing or shrinking.


Even starting at 30%, the MMLU graph is false. The four bars are wrong. Even their own 73.7% is not at the right height. The Mixtral 71.4% is below the 70% mark of the axis. This is really the kind of marketing trick that makes me avoid a provider/publisher. I can't build trust this way.


I believe they are using the percentages as part of the height of the bar chart! I thought I'd seen every way someone could do dataviz wrong (particularly with a bar chart), but this one is new to me.


That's really strange and incredibly frustrating - but slightly less so if it's consistent with all of the bars (including their own).

I take issue with their choice of bar ordering - they placed the lowest-performing model directly next to theirs to make the gap as visible as possible, and shoved the second-best model (Grok-1) as far from theirs as possible. Seems intentional to me. The more marketing tricks you pile up in a dataviz, the less trust I place in your product for sure.


Interesting! It is probably one of the worst tricks I have seen in a while for a bar graph. Never seen this one before. Trust vanishes instantly when facing that kind of dataviz.


Wow, that is indeed a novel approach haha, took me a moment to even understand what you described since would never imagine someone plotting a bar chart like that.


It's more likely to be incompetence than malice: even their 73.7% is closer to 72% than to 74%.


MMLU is not a good benchmark and needs to stop being used.

I can't find the section, but in one of his videos (https://www.youtube.com/@aiexplained-official/videos) he does a deep dive into the questions and answers in MMLU, and there are so many typos, omissions, and errors in the questions and the answers that it should no longer be used.

This is it, with the correct time offset into the video: https://www.reddit.com/r/OpenAI/comments/18i02oe/mmlu_is_not...

The original longer complaint against MMLU https://www.youtube.com/watch?v=hVade_8H8mE


It’s an honest mistake in scaling the bars. It’s getting fixed soon. The percentages are correct though. In the process of converting excel chart to pretty graphs for the blog, scale got messed up.


Seems fixed now


OTOH having the chart start at zero would REALLY emphasize how saturated this field is, and how little this announcement matters.


The difference between 32% and 70% wouldn't be significant if the chart started at zero?


It would be very obvious indeed how small the difference between 73.7,73.0,71.4,and 69.8 actually is.


I agree with your general point, but world population is still visibly increasing on that interval.

https://ourworldindata.org/explorers/population-and-demograp...

Perhaps "global mean temperature in Kelvin" would be a comparable example.


Certainly a bar chart might not be the best choice to convey the data you have. But if you choose to have a bar chart and have it not start at zero, what do the bars help you convey?

For world population you could see if it is increasing or decreasing, which is good but it would be hard to evaluate the rate the population is increasing.

Maybe a sparkline would be a better choice?


Then you can plot it on a greater timescale, or plot the change rate


I believe it's a reasonable range for the scores. If a model gets everything half wrong (worse than a coin flip), it's not a useful model at all. So every model below a certain threshold is trash, and no need to get granular about how trash it is.

An alternative visualization that could be less triggering to an "all y-axes must have zero" guy would be to plot the (1-value), that is, % degraded from perfect score. You could do this without truncating the axis and get the same level of differentiation between the bars
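A sketch of that framing with matplotlib, using the MMLU numbers quoted elsewhere in this thread (the model-to-number mapping is my reading of the chart discussion, so treat the labels as illustrative):

  # Plot "percentage points below a perfect score" so the axis can start at zero
  # while still showing the separation between models.
  import matplotlib.pyplot as plt

  scores = {"DBRX": 73.7, "Grok-1": 73.0, "Mixtral": 71.4, "LLaMA2-70B": 69.8}
  gaps = {name: 100 - s for name, s in scores.items()}

  plt.bar(list(gaps.keys()), list(gaps.values()))
  plt.ylabel("Points below 100 on MMLU")
  plt.ylim(0, max(gaps.values()) * 1.2)  # zero baseline, no truncated axis
  plt.title("Distance from a perfect MMLU score (lower is better)")
  plt.show()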


None of the evals are binary choice.

MMLU questions have four options, so two coin flips would have a 25% baseline. HumanEval evaluates code with a test, so a 100-byte program implemented with coin flips would have an O(2^-800) baseline (maybe not that bad, since there are infinitely many programs that produce the same output). GSM-8K has numerical answers, so an average 3-digit answer implemented with coin flips would have an O(2^-9) chance of being correct randomly.

Moreover, using the same axis and scale across unrelated evals makes no sense. 0-100 is the only scale that's meaningful because 0 and 100 being the min/max is the only shared property across all evals. The reason for choosing 30 is that it's the minimum across all (model, eval) pairs, which is a completely arbitrary choice. A good rule of thumb to test this is to ask if the graph would still be relevant 5 years later.
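The quick arithmetic behind the random-guess baselines mentioned above (my own back-of-the-envelope, just to show the orders of magnitude):

  import math

  mmlu = 1 / 4                  # four answer choices -> 25% by random guessing
  humaneval = 0.5 ** (100 * 8)  # coin-flipping a specific 100-byte (800-bit) program
  gsm8k = 1 / 10 ** 3           # guessing a roughly 3-digit numeric answer

  print(f"MMLU baseline: {mmlu:.0%}")                          # 25%
  print(f"HumanEval baseline: 2^{math.log2(humaneval):.0f}")   # 2^-800
  print(f"GSM8K baseline: ~2^{math.log2(gsm8k):.0f}")          # ~2^-10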


> less triggering to an "all y-axes must have zero" guy

Ever read 'How to Lie with Statistics'? This is an example of exaggerating a smaller difference to make it look more significant. Dismissing it as just being 'triggered' is a bad idea.


In this case I would call it triggered (for lack of a better word), since, as I described earlier, a chart plotting "difference from 100%" would look exactly the same and satisfy the zero-bound requirement, while not being any more or less dishonest.


The point is less to use bad/wrong math; it's to present technically correct charts that nonetheless imply wrong conclusions. In this case, by chopping off the bottom of the chart, the visual impression of the ratio between the bars changes. That's the lie.


In these cases my thinking always is "if they are not even able to draw a graph, what else is wrong?"


Somewhere, Edward Tufte[0] is weeping.

[0]: https://en.wikipedia.org/wiki/Edward_Tufte


It does not feel obviously unreasonable/unfair/fake to place the select models in the margins for a relative comparison. In fact, this might be the most concise way to display what I would consider the most interesting information in this context.


I wonder if they messed with the scale or they messed with the bars.


Yeah, this is why I ask climate scientists to use a proper 0 K graph but they always zoom it in to exaggerate climate change. Display correctly with 0 included and you’ll see that climate change isn’t a big deal.

It’s a common marketing and fear mongering trick.


Where are your /s tags?

The scale should be chosen to allow the reader to correctly infer meaningful differences. If 1° is meaningful in terms of the standard error/ CI AND 1° unit has substantive consequences , then that should be emphasized.


> Where are your /s tags?

I would never do my readers dirty like that.


Because, of course, the effect of say 1°C rise in temps is obviously trivial if it is read as 1°K instead. Come on.


Looking at the license restrictions: https://github.com/databricks/dbrx/blob/main/LICENSE

"If, on the DBRX version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Databricks, which we may grant to you in our sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Databricks otherwise expressly grants you such rights."

I'm glad to see they aren't calling it open source, unlike some LLM projects. Looking at you LLama 2.


Well, it does still claim "Open" in the title, for which certain other vendors might potentially get flak around here, in a comparably not-open-in-the-way-we-demand-it-to-be kinda setup.


It's literally described as open source all over.

https://www.databricks.com/blog/announcing-dbrx-new-standard...

It's even implied in comparisons everywhere:

> Figure 1: DBRX outperforms established open source models on language understanding (MMLU), Programming (HumanEval), and Math (GSM8K).

> The aforementioned three reasons lead us to believe that open source LLMs will continue gaining momentum. In particular, we think they provide an exciting opportunity for organizations to customize open source LLMs that can become their IP, which they use to be competitive in their industry.

Just search "open source".


Yes, they are using different wording in different articles:

https://www.databricks.com/blog/introducing-dbrx-new-state-a...

The only mention of open source is:

> DBRX outperforms established open source models

https://www.databricks.com/blog/announcing-dbrx-new-standard...

Open source is mentioned 10+ times

> Databricks is the only end-to-end platform to build high quality AI applications, and the release today of DBRX, the highest quality open source model to date, is an expression of that capability

https://github.com/databricks/dbrx

On Github it's described as an open license, not an open source license:

> DBRX is a large language model trained by Databricks, and made available under an open license.


The release notes on the databricks console definitely says open source. If you click the gift box you will see: Try DBRX, our state-of-the-art open source LLM!


Ironically, the LLaMA license text [1] this is lifted verbatim from is itself probably copyrighted [2] and doesn't grant you the permission to copy it or make changes like s/meta/dbrx/g lol.

[1] https://github.com/meta-llama/llama/blob/main/LICENSE#L65 [2] https://opensource.stackexchange.com/q/4543


I do wonder what value those companies who have >700 million users might get from this?

Pretty much all of the companies with >700 million users could easily reproduce this work in a matter of weeks if they wanted to - and they probably do want to, if only so they can tweak and improve the design before they build products on it.

Given that, it seems silly to lose the "open source" label just for a license clause that doesn't really have much impact.


The point of the more-than-700-million-user restriction is so that Amazon, Google Cloud, or Microsoft Azure cannot set up an offering where they host and sell access to the model without an agreement with Databricks.

This clause is probably inspired by the open source software vendors that have switched licenses over competition from the big cloud vendors.


They also aren't claiming it's the best LLM out there when it clearly isn't, unlike Inflection. Overall solid.


Less than 1 week after Nancy Pelosi bought a 5M USD stake in Databricks, this news is published.

https://twitter.com/PelosiTracker_/status/177119703064106223...

Crime pays in the US.


Are you alleging that Nancy Pelosi invested in Databricks, a private company without a fluctuating share price, because she learned that they would soon release a small, fairly middling LLM that probably won't move the needle in any meaningful way?


Are you suggesting that Nancy Pelosi, who consistently beats the market through obvious insider trading for years in a row, bought a share in Databricks without any insider info? Possible, yet unlikely is my opinion.

https://jacobin.com/2021/12/house-speaker-paul-stocks-inside...

PS: "without a fluctuating share price" is non-sense. Just because the share is of a private company, doesn't mean its price can't fluctuate. Why would anybody buy shares in private companies if the price couldn't fluctuate? What would be the point?

Example of a changing share price of a different (random) private company that has many different share holders over time: https://www.cnbc.com/2023/12/13/spacex-value-climbs-to-180-b...


I'm not a lawyer and this isn't investment advice so don't sue me if I'm wrong but I'm not sure this qualifies as insider trading in the way that would be illegal for public markets.

Aren't most investors in private companies privy to information that isn't entirely public?

I can see how this feels a bit different because DataBricks might be the size where it might trade with a decent amount of liquidity, but certainly in smaller rounds it's got to be pretty normal.

Maybe if she bought it on the secondary market and this information was withheld from the person from whom she purchased the shares, they could sue?


She (and many other senators and other government employees) are very obviously and consistently beating the market as well as beating most investment funds.

There's only 1 explanation for this: they're getting inside info from lobbyists and such.

I don't care whether that's currently illegal or not. I don't care whether other types of investors also engage in that same practice. I just think that that's extremely wrong and corrupt and it blows my mind that both the US government and people think that this is totally OK (or don't know about it, which is even worse).


Nancy beats the market because tech beats the market


I see these types of jokes everywhere. I cannot understand how the hints of corruption can be so blatant (i.e. a politician consistently beating the market), yet people keep voting for the same politician. I don't see how that is possible; it must be that these jokes only exist on the internet and mainstream media never mentions this.


People are down-voting this because they refuse to believe this could be reality.


Dude, what the hell are you talking about?


Insider trading by US government employees.



