I'm not sure what happened, but I just sent an email out internally asking people not to do this. The team might have gotten overly excited by this because they were all part of creating the dataset and the model.
All datasets are biased, including this specific one. However, we believe it's still very valuable to open source, for a few reasons:
- This dataset is primarily used to train instruction reasoning, not for knowledge. (Keep in mind that neither Dolly nor any of the well-known models have been specifically trained for knowledge. They are all just demonstrating instruction reasoning.) The lack of a true open source (available for both research and commercial use) instruction dataset is the primary blocker for making these LLMs available for commercial use.
- We hope this will lead to not just open source innovations in models, but also future training datasets.
- Given the international population of our employee base, it's likely more diverse than datasets created by a small number of human labelers. And it is easier to identify, discuss, and debate dataset bias in the open.
You should try Databricks, especially the new Photon engine powering Spark. In general it's more performant than Snowflake for SQL and a lot more flexible. (There are some cases in which Databricks would be slower, but the perf is improving rapidly.)
Databricks has an extremely bad API. So, sure, your Spark jobs might be a little bit faster sometimes, but why would you use it if you can't even read the logs of running jobs?
Databricks is amazing, and the Delta Live Tables technology is incredible. Data lineage and data quality are very hard problems to approach, but that platform does it the right way.
My only concern is that they offer just a managed cloud product. That's cool for startups, but large enterprises sometimes need more governance and ownership than that.
Geometric mean is commonly used in benchmarks when the workload consists of queries that have large (often orders-of-magnitude) differences in runtime.
Consider 4 queries. Two run for 1 sec, and the other two for 1000 sec. If we look at the arithmetic mean, then we are really only taking into account the large queries. But improving the geometric mean would require improving all queries.
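A quick numeric sketch of that example (runtimes in seconds; the halving scenarios are added purely for illustration):

    from statistics import geometric_mean

    runtimes = [1.0, 1.0, 1000.0, 1000.0]   # two 1 sec queries, two 1000 sec queries

    def arith(xs):
        return sum(xs) / len(xs)

    print(arith(runtimes), geometric_mean(runtimes))          # 500.5   ~31.62

    # Halving only the two fast queries barely moves the arithmetic mean,
    # but improves the geometric mean by a factor of sqrt(2):
    fast_halved = [0.5, 0.5, 1000.0, 1000.0]
    print(arith(fast_halved), geometric_mean(fast_halved))    # 500.25  ~22.36

    # Halving only the two slow queries improves the geometric mean by
    # exactly the same factor, even though the arithmetic mean drops by half:
    slow_halved = [1.0, 1.0, 500.0, 500.0]
    print(arith(slow_halved), geometric_mean(slow_halved))    # 250.5   ~22.36

The geometric mean rewards the same relative improvement equally, no matter which queries it comes from; the arithmetic mean only really notices the slow ones.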
Note that I'm on the opposite side (Databricks cofounder here), so when I say that Snowflake didn't make a mistake here, you should trust me :)
> But improving geometric mean would require improving all queries.
No. Improving the geometric mean only requires reducing the product of their execution times. So if you can make the two 1 sec queries execute in 0.5 sec at the expense of the two 1000 sec queries taking 1800 sec each, then that's an improvement in terms of the geometric mean.
So… kind of QED. The geometric mean is not easy to reason about.
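Checking that counterexample numerically (using the same units as the parent example):

    from statistics import geometric_mean

    before = [1.0, 1.0, 1000.0, 1000.0]
    after  = [0.5, 0.5, 1800.0, 1800.0]   # fast queries halved, slow queries 1.8x slower

    print(geometric_mean(before), sum(before) / 4)   # ~31.62, 500.5
    print(geometric_mean(after),  sum(after) / 4)    # 30.0,   900.25

    # The geometric mean improves (31.62 -> 30.0) even though the arithmetic
    # mean nearly doubles, which is the point being made above.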
Usually, making a 1 ms query execute in 0.5 ms is a lot harder than making a 10 second query execute in 5 seconds.
One of the benefits of the geometric mean is that all queries have "equal" weight in the metric, which keeps vendors from focusing on the long-running queries and ignoring the short-running ones.
It is one way to balance long and short query performance.
A similar concept is applied in TPC-DS to the data load, single-user run (Power), multi-user run (Throughput), and data maintenance (concurrent deletes and inserts).
The audit question is about Databricks marketing unaudited Snowflake TPC numbers. I do think Snowflake is big enough to run TPC, but how you guys choose to market is on you.
But: I think it's cool both companies got it to $200-300. Way better than years ago. Next stop: GPUs :)
Exactly. Not sure about a Netflix special, but there are experts who have dedicated their professional careers to creating fair benchmarks. Snowflake should just participate in the official TPC benchmark.
Disclaimer: Databricks cofounder who authored the original blog post.
The benchmark itself is kinda useless, so I don't see why they should. If you look at TPC-H, for years Exasol was the top dog, but in the real world that meant nothing for them.
Exactly, companies learnt from Exasol.
Out-of-the-box performance is the name of the game.
Executing a benchmark as complex as TPC-DS without tuning by Databricks or Snowflake is a big accomplishment.
There's an official TPC process to audit and review the benchmark process. This debate can be most easily settled by everybody participating in the official benchmark, like we (Databricks) did.
The official review process is significantly more complicated than just offering a static dataset that's been highly optimized for answering the exact set of queries. It includes data loading, data maintenance (inserting and deleting data), a sequential query test, and a concurrent query test.
Consider the following analogy: Professional athletes compete in the Olympics, and there are official judges and a lot of stringent rules and checks to ensure fairness. That's the real arena. That's what we (Databricks) have done with the official TPC-DS world record. For example, in data warehouse systems, data loading, ordering and updates can affect performance substantially, so it’s most useful to compare both systems on the official benchmark.
But what’s really interesting to me is that even Snowflake's self-reported numbers ($267) are still more expensive than Databricks' numbers ($143 on spot, and $242 on demand). This is despite Databricks' cost being calculated on our enterprise tier, while Snowflake used their cheapest tier without any enterprise features (e.g. disaster recovery).
Thanks for the additional context here. As someone who works for a company that pays for both Databricks and Snowflake, I will say that these results don't surprise me.
Spark has always been infinitely configurable, in my experience. There are probably tens of thousands of possible configurations, everything from Java heap size to Parquet block size.
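For a sense of what that configurability looks like, here is a hedged sketch of a handful of common Spark settings (the values are purely illustrative, not recommendations):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("tuning-example")
        .config("spark.executor.memory", "8g")                # JVM heap per executor
        .config("spark.executor.cores", "4")
        .config("spark.memory.fraction", "0.6")               # split of heap between execution and storage
        .config("spark.sql.shuffle.partitions", "400")
        .config("spark.sql.files.maxPartitionBytes", "128m")  # input split size when reading files
        .config("spark.hadoop.parquet.block.size", str(128 * 1024 * 1024))  # Parquet row group size
        .getOrCreate()
    )

And that's just a few of the knobs; most of them interact with each other and with the workload.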
Snowflake is the opposite: you can't even specify partitions! There is only clustering.
For a business, running Snowflake is easy because engineers don't have to babysit it, and we like it because now we're free to work on more interesting problems. Everybody wins.
Unless those problems are DB optimization. Then Snowflake can actually get in your way.
Totally. Simplicity is critical. That’s why we built Databricks SQL not based on Spark.
As a matter of fact, we took the extreme approach of not allowing customers (or ourselves) to set any of the known knobs. We want to force ourselves to build a system that runs well out of the box and yet still beats data warehouses in price/performance. The official result involved no tuning: we partitioned by date, loaded the data in, provisioned a Databricks SQL endpoint, and that's it. No additional knobs or settings. (In fact, Snowflake's own sample TPC-DS dataset has more tuning than ours: they clustered by multiple columns specifically to optimize for the exact set of queries.)
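For concreteness, a hedged sketch of what a date-partitioned, no-knobs setup could look like for one TPC-DS table (store_sales and ss_sold_date_sk are the standard TPC-DS names; the source path and the exact statement are assumptions for illustration, not the audited procedure):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Date-partitioned Delta table: no clustering keys, no sort order, no other hints.
    # The source path below is a placeholder.
    spark.sql("""
        CREATE TABLE store_sales
        USING DELTA
        PARTITIONED BY (ss_sold_date_sk)
        AS SELECT * FROM parquet.`/path/to/raw/store_sales`
    """)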
>That’s why we built Databricks SQL not based on Spark.
Wait... really? The sales folks I've been talking to didn't mention this. I assumed that when I ran SQL inside my Python, it was decomposed into Spark SQL with weird join problems (and other nuances I'm not fully familiar with).
Not that THAT would have changed my mind. But it would have changed the calculus of "who uses this tool at my company" and "who do I get on board with this thing"
Edit:
To add, I've been a customer of Snowflake for years. I've been evaluating Databricks for 2 months, and put the POC on hold.
Credit to you for these amazing benchmark scores via an official process. You've certainly proved to naysayers such as Stonebraker that lakes and warehouses can be combined in a performant manner!
Shame on you for quoting a fake, non-official score for Snowflake in your blog post, with crude suggestions to make it seem you're showing an apples-to-apples comparison.
I run a BI org in an F500 company that uses both Databricks & Snowflake on AWS. I can tell you that such dishonest shenanigans take away much from your truly noteworthy technical achievements and make me not want to buy your stuff, for lack of integrity. Not very long ago, Azure+GigaOM did a similar blog post with fake numbers on AWS Redshift, and it resulted in my department and a bunch of large F500 enterprises that I know moving away from Synapse for lack of integrity.
On many occasions, I've felt that Databricks product management and sales teams lack integrity (especially the folks from Uber & VMW), and such moves only amplify this impression. Your sales guys use arm-twisting tactics to meet quotas, and your PM execs are clueless about your technology and industry. My suggestion is to overhaul some of these teams and cull the rot - it is taking away from the great work your engineers and Berkeley research teams are doing.
Snowflake claims the Snowflake result from Databricks was not audited. It's not that the Databricks numbers were artificially good, but rather that Snowflake's number was unreasonably bad.