What I find hilarious is that companies argue about who can query 100 TB faster and try to sell this to people. I've been on the receiving end of offers from both of the companies in question and used both platforms (and sadly migrated some data jobs to them).
While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller) and the perf expectedly tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense costs to migrate everything and train everyone up.
Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).
I did work on making a database myself, and I must say that querying 100TB fast, let alone storing 100TB of data, is a real problem. Some companies (very few) don't have much choice but to use a DB that works at 100TB. If you have small data, you have a lot of options; if your data is large, you have very few. So it is fair to compete on how fast a DB can query 100TB of data, even while being slow on just 10GB. Some databases are designed only for large data and should not be used if your data is small.
The larger your data, the more building and maintaining indexes hurts you. This is why these systems do much better on large datasets than on small ones. It's all about trade-offs.
To overcome this, they make use of caching, and if the small data is frequently accessed, performance is generally pretty good and acceptable for most use cases.
Very true. You have to understand the actual capabilities and your actual requirements. We work with petabyte size datasets and BigQuery is hard to beat. Our other reporting systems are still all in MySQL though.
It's my experience that if it's just tens of GBs, then use 'normal' solutions; if it's TBs, then Spark is great for that. Note I have only used Databricks & Spark, no Snowflake.
The irony here is that what Databricks is doing to Snowflake is exactly what Snowflake did to AWS and Redshift.
Same playbook - show that you’re better in a key metric that’s easy to understand (performance) to get the attention, but then pitch the paradigm change.
In Snowflake’s case, that was separation of storage and compute.
In Databricks' case, it's the Lakehouse Architecture.
I think the reason Snowflake is so nervous is that they know they can't win this game.
To be fair Apache Spark, which started long before either company existed, was built on the assumption that compute and storage should be separate. Unlike Hadoop, Spark did not come with any storage system and could read from any source.
> To be fair Apache Spark, which started long before either company existed
Databricks was founded by Spark's creators before Spark 1.0 was released.
Hadoop was created at a time when networks and disks were much slower and RAM was less abundant. Bringing compute to the data made sense then, but it typically doesn't anymore.
Hadoop was built on the notion that commodity hardware, when pooled together, can be extremely cheap and powerful. The problem is that managing it is a nightmare. Cloudera/HWX and others were unable to reduce the management burden, and their inability to pivot to a cloud-based architecture really sunk their ship.
SF spreads a lot of FUD saying that DB can’t perform, and it was true. DB then went out and hired a lot of engineering talent with a diverse background and has been investing a lot of money in being a best in class SQL offering, so what do you do? You do something to get people’s attention. They’re saying, “hey, we have great performance too, you should also look at us for your SQL workloads.”
> I think the reason Snowflake is so nervous is that they know they can't win this game.
Isn't Databricks' delta.io, which their Data Lakehouse product builds on top of, open source? Snowflake could take the best parts from it and run with it?
They could in principle. GCP, for instance, does do that. So does HP. And Databricks don't mind that as they have a strong open source legacy. But that takes away the proprietary lock-in strategy of Snowflake.
Delta is open source, but Databricks keeps optimizations for themselves as proprietary. I'm not sure why it would be any better than Snowflake's solution, which is automatically deployed across multiple AZs as a fully HA system and gives full ACID transaction compliance across any number of tables (not just per-table).
Essentially, with Databricks making Delta open source, you can move away from Databricks to EMR or Presto (with their own optimizations) without incurring a data tax. You're also able to move between cloud providers with ease, as the data sits in low-cost buckets.
In what way is lakehouse architecture beneficial over something like Snowflake or BigQuery?
I understand the appeal over having lake and warehouse as separate components, but with those native cloud warehouses, you can already do everything a lake does.
With a data warehouse, you can only interface with your data in SQL. With BigQuery and Snowflake, your data is locked away in a proprietary format not accessible by other compute platforms. You need to export/copy your data to a different system to train an ML model in Python or R.
With the lakehouse, you can use Python, R and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Databricks, Presto), so you are not locked into one compute engine.
I recall being a junior programmer and wishing I could talk to my MySQL database in Python code to do some processing that was difficult to express in SQL. That day is finally here.
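A rough sketch of that workflow with PySpark and Delta (purely illustrative - the table path and column names are made up, and it assumes a Spark runtime with Delta Lake support):

    from pyspark.sql import SparkSession

    # Hypothetical example: one copy of the data, reachable from both SQL and Python.
    spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

    events = spark.read.format("delta").load("s3://my-bucket/warehouse/events")
    events.createOrReplaceTempView("events")

    # SQL interface over the table...
    daily = spark.sql("SELECT event_date, count(*) AS n FROM events GROUP BY event_date")

    # ...and a Python/ML interface over the same table, with no export/copy step.
    pdf = events.select("feature_a", "feature_b", "label").toPandas()
    # pdf can now go straight into scikit-learn, XGBoost, etc.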
BigQuery does support ML. But the pricing is kind of a racket ($250/TB), so I'll stick to modeling in R/Python. Which I guess reinforces your point. I wonder who pays for this.
My experience is that's how it looks at first. But it is hard to actually make use of lake or lakehouse openness.
You can access data in Snowflake or BigQuery using JDBC or Python clients. You do pay for the compute that reads the data for you. You cannot access the data in storage directly.
You can access data in lakehouse directly, by going to cloud storage. That has two major challenges:
Lakehouse formats aren't easy to deal with. You need a smart engine (like Spark) to do that. But those engines are pretty heavy. Starting a Spark cluster to update 100 records in a table is wasteful.
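To make that concrete, a tiny correction via Delta's Spark API looks roughly like this (table path and columns are hypothetical); the code is short, but it still needs a live Spark cluster with Delta support behind it:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    # Hypothetical sketch of the "heavy engine for a tiny change" problem.
    spark = SparkSession.builder.getOrCreate()

    tbl = DeltaTable.forPath(spark, "s3://my-bucket/warehouse/orders")
    tbl.update(
        condition="order_id BETWEEN 1000 AND 1100",  # only ~100 records
        set={"status": "'corrected'"},
    )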
The bigger challenge is security. Cloud storage can't give you granular access control. It only sees files, not tables and columns. So if you have a need for column- or row-based security or data masking, you're out of luck. Cloud storage also makes it hard to assign even the non-granular access. Not sure about other clouds, but AWS IAM roles are hard to manage and don't scale for a large number of users/groups.
You can sidestep this by using a long-running engine (like Trino) and applying security there. Then you don't need to start Spark to change or query a few records. But it means you're basically implementing your own cloud warehouse.
Which honestly can be the way if that's what you want! You can also use multiple engines if you are ok with implementing security multiple times. To me, that doesn't seem to be worth it.
In the end, I don't see data that's one SELECT away as much more proprietary and "outsourced" than data that is one Spark/Trino cluster and then a SELECT away, just because you can read the S3 it sits on.
Have you ever tried to train models on large data sets over JDBC/ODBC? It's terrible even with parallelism. Having direct access to the underlying storage and being able to bypass sucking a lot of data through a small straw is a game changer. That is one advantage that Spark and Databricks have over Snowflake.
Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.
Sadly, those things are mutually exclusive at the moment and with the way things are deployed here (large multi-tenant platforms), the security has to take priority.
But if that's not your situation, then obviously it makes sense to make use of that!
> Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.
It is a solved problem. Essentially you need a central place (with decentralized ownership, for the data mesh fans) to specify the ACLs (row-based, column-based, attribute-based, etc.) and an enforcement layer that understands these ACLs. There are many solutions, including the ones from Databricks. Data discovery, lineage, data quality etc. go hand in glove.
Security is front and centre for almost all organizations now.
This is exactly what FAANGs do with their data platforms. There are literally hundreds of groups within these companies with very strict data isolation requirements between them. Pretty sure something like that is either already possible or will be very soon, there's just too much prior art here.
That's where Databricks comes in though: you can implement row/column-based security on your data on cloud object storage and use it for all your downstream use cases (not just BI/SQL but AI/ML, without piping data over JDBC/ODBC).
According to their documentation [1], Databricks does not have this capability even for their own engines, and definitely not for "without piping data".
This is what I've personally seen a few times - Databricks claiming they can do something and then it turns out they can't. Buyer beware of lying salespeople and HN shills.
I don't understand what capability you are saying Databricks lacks. This capability is literally the entire premise of the Data Lakehouse. With Snowflake you need to export data out or pipe data over JDBC/ODBC to an external tool. With Databricks you can use SQL for data warehousing, and when you need to, you can work with that same data using Python to train an ML model without piping data out over JDBC (using the Spark engine). One security model, one dataset, multiple use cases (AI/ML/BI/SQL) on one platform.
They're still lacking things in the SQL space. For example, Databricks say they're ACID compliant, but it's only on a single-table basis. Snowflake offers multi-table ACID consistency, which is something that you would expect by default in the data warehousing world. If I'm loading, say, 10 tables in parallel, I want to be able to roll back or commit the complete set of transactions in order to maintain data consistency. I'm sure you could work around this limitation, but it would feel like a hack, especially if you're coming from a traditional DWH world (Teradata, Netezza etc.).
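For illustration, the multi-table pattern being described looks roughly like this through the Snowflake Python connector (connection details and table names are placeholders); either both loads become visible together or neither does:

    import snowflake.connector

    # Hedged sketch of an explicit multi-table transaction.
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="...",
        warehouse="LOAD_WH", database="DW", schema="STAGING",
    )
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        cur.execute("INSERT INTO dim_customer SELECT * FROM stg_customer")
        cur.execute("INSERT INTO fact_orders SELECT * FROM stg_orders")
        cur.execute("COMMIT")    # both tables land together
    except Exception:
        cur.execute("ROLLBACK")  # neither table is partially loaded
        raise
    finally:
        conn.close()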
Snowflake now offers Scala, Java and Python support, so it would seem their capabilities are converging even more, but both with their own strengths due to their respective histories.
Actually, you would expect that in an OLTP world. DWs, even Oracle, have for the longest time recommended disabling transactions to get better performance. The logic is implemented in the ETL layer. Very rarely do you need multi-table transactions in a large-scale DW.
> I do not see why it would be much slower than direct access to the storage.
Implementations of protocols like ODBC/JDBC generally implement their custom on-wire binary protocols that must be marshalled to/from the lib - and the performance would vary a lot from one implementation to another. We are seeing a lot of improvements in this space though, especially with the adoption of Arrow.
There is also the question of computing for ML. Data scientists today use several tools/frameworks ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow - to name a few. Enabling data scientists to use these frameworks against near-realtime data without worrying about provisioning infrastructure or managing dependencies or adding an additional export-to-cloud-storage hop is a game changer IMO.
Here is the thing with the lakehouse though: you have flexibility and don't need to use multiple engines to achieve the lakehouse vision. Databricks has all the security features a Redshift/Snowflake does, so you can secure databases and tables rather than S3 buckets. It does get more complex if you want to introduce multiple engines, but at least you have the option to make that trade-off if you want to.
If you want simplicity, you can limit your engine to Databricks. You can also use JDBC/ODBC with Databricks if you want to use other tools that don't support the Delta format/Parquet, but piping data over JDBC/ODBC doesn't scale to large datasets with any tool. Databricks has all the capabilities of BigQuery/Snowflake/Redshift, but none of those tools support Python/R/Scala. Their engines would need to be rewritten from the ground up in order to do so.
But you do still have to secure the S3 buckets, right? And I guess also secure the infrastructure you have to deploy in order to run Databricks. Plus then configure for cross-AZ failover etc. So you get flexibility, but I would think at the cost of much more human labor to get it up and running.
Snowflake uses the Arrow data format with their drivers, so is plenty fast enough when retrieving data in general. But it would be way less efficient if a data scientist just does a SELECT * to bring everything back from a table to load into a notebook.
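For what it's worth, the Arrow-backed path through the Python connector looks roughly like this (account details and the table are made up); the point is to push filtering/aggregation into the query instead of SELECT *-ing a whole table into a notebook:

    import snowflake.connector

    # Hypothetical sketch; requires the connector's pandas/pyarrow extras.
    conn = snowflake.connector.connect(
        account="my_account", user="analyst", password="...",
        warehouse="ANALYTICS_WH", database="DW", schema="PUBLIC",
    )
    cur = conn.cursor()
    cur.execute("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
    df = cur.fetch_pandas_all()  # results arrive via Arrow as a pandas DataFrame
    conn.close()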
Snowflake has had Scala support since earlier in the year, along with Java UDFs, and also just announced Python support - not a Python connector, but executing Python code directly on the Snowflake platform. Not GA yet though.
You can use Scala, Java and Python with Snowflake now, as well as process structured, semi-structured and unstructured data. So I guess that means it doesn't fit into the data warehouse category, but is not a lakehouse either.
BigQuery & Dataproc, Redshift & EMR, Synapse & HDInsight are tied to their cloud vendors. You can't easily move from the AWS stack to GCP without refactoring. Switching costs are higher.
Snowflake and Databricks are multicloud. The difference is that Snowflake is more like a SaaS solution and only does SQL. Databricks is more than just SQL: it has data science and machine learning capabilities built into it. Snowflake has Snowpark, but it's very limited, so you are more likely to have to buy more products to build out your capabilities and integrate them with Snowflake. With Databricks it is more out of the box in terms of capabilities.

Databricks also runs in your cloud account, which has trade-offs. It can be harder to get going and more complex, but you end up with a lot more flexibility, and you own your data and have complete control over it. While Snowflake gives you control of your data with their tools, everything has to go through Snowflake and incur their tax to get to it. You pay for simplicity, which many customers are OK with because they see value in it. On the contrary, a lot of customers see value in having more control and options. This market is big enough for everyone - it's really just about market share.
I've used both products in production. Both are good++.
The blog wars seem extremely ridiculous to me. I don't recall ever choosing one over another based on how fast it runs on some imaginary arbitrary dataset.
Manufactured rivalries can be a great thing for business. We have been debating Coke vs Pepsi, Nike vs Reebok, McDonald's vs Burger King for decades now while these companies laugh all the way to the bank.
Like the post, but I would add "Ford v Ferrari" there. A synthetic 100TB test is much like an F1 course - not something you deal with during your commute, but it's nice to know what the limit is, and that there are people pushing that limit.
I actually see them as variations on the same architecture. Databricks keeps their metadata in files, Snowflake keeps theirs in a database, but they both, ultimately, are querying data stored in a columnar format on blob store (and, to be fair, Snowflake have been doing that with ACID-compliant SQL for a lot longer than Databricks). So using SQL over blob at high performance has been around for a while.
Databricks say their solution is better because it's open (though they keep the optimizations you need to run it at scale to themselves, i.e. it is ultimately proprietary). Snowflake say theirs is better because it's a fully managed service, meaning no infrastructure to procure or manage, fully HA across multiple data centers by default, etc.
Databricks push 'open' but really still want you to use their proprietary tech for first transforming into something usable (Parquet/Delta) and then querying with Photon/SQL, though you can also use other tech. With Snowflake you can just ingest and query, but it has to be through their engine.
Customers should do their own validation and see which one fits their needs best.
Snowflake accuses other companies of lacking integrity?
I really wish I could block all of Snowflake's domain from my inbox. Sadly, Google encourages spammers to just create a new email address. So I get a few emails each month from Snowflake who ask me to try their products. I've never done business with them and there's no unsubscribe link.
Fuck Snowflake for thinking it has any room to talk about integrity.
What I find comical is they accuse Databricks of lacking integrity, but they don't actually call out anything except that their benchmark was faster than what Databricks did in Snowflake. Databricks then reruns the benchmark and says the only reason Snowflake's was faster was the built-in dataset they used. Databricks was able to match Snowflake's numbers using it, but when they loaded the actual dataset, it was much slower - which is how a proper TPC benchmark is supposed to happen. Snowflake then said that Databricks' blog doesn't match the TPC results, but when I looked at them, they do match. I guess Snowflake just expects people to take arguments at face value. Then I saw someone on LinkedIn complaining that Databricks must have used some beta version. I didn't see a beta version being used, but that kind of goes out the window when Databricks follows up and posts that they matched Snowflake when they used Snowflake's built-in TPC dataset.
This is funny and interesting to watch but also a distraction I feel. Amazon says it best when they say, “Leaders start with the customer and work backwards. They work vigorously to earn and keep customer trust. Although leaders pay attention to competitors, they obsess over customers.”
Snowflake must be kicking themselves hard now for letting a story that was “Databricks is a viable alternative” turn into “Snowflake has absolutely no integrity and will fling mud even while they are gaming the statistics”
Really can't see what they can do now short of "bending" to Databricks and entering the competition. And naturally it's no longer enough that they show comparable performance. They have to beat their gamed stats somehow, otherwise any news, even if they beat Databricks, will be reported as "see, we told you they were cheating".
Before the Snowflake blog post, I did not know what Snowflake or Databricks were. I can only imagine that this rivalry is great for both of them, even if Databricks has the advantage, at least from a tactical standpoint; I admit, though, that they seem a bit unnecessarily defensive in this exchange considering the position they're in.
In general though, I'm still not complaining. It's interesting to see a dispute like this unfold.
Of course they’re known among their pre-existing customer base of people and entities who already solve problems using tools like this. But it’s a subset of the multi-trillion dollar cloud industry, which itself is not the entire software engineering industry.
I would say that TPC-DS and TPC-H are really table-stakes benchmarks for data warehouses at this point in time (maybe they weren't 10 years ago). How to build a database that does well on them is well documented in the literature now[1][2][3][4] (maybe a few other papers). It's not easy to build such a database, but it's "just" hard work, and many companies have the $$ necessary to do that work. There isn't any magic or technical moat in the results for Databricks (or Snowflake, or Redshift, etc.).
I think Databricks is overly enthusiastic about their results, as they have been trying to be competitive with cloud DWs on these benchmarks for a number of years now. They have finally caught up (by building Delta Lake and their Photon query engine, which implement a number of standard DW features).
I agree with everything above. The main advantage the newer data warehouses have over the legacy on-prem incumbents is that they had the chance to build from scratch having learned from all of the challenges that the original players encountered.
The public pissing contest is entertaining while also being silly and slightly cringe, but I think it's a nice story for Databricks nonetheless. They now have a performant SQL-based analytics engine that can credibly compete with the best DWs in the market today, and it's just one part of their overall platform.
The sense I get is that Snowflake wants the conversation to be "no matter what you do, you need a data warehouse, and we're the best in the business at that." Databricks' Lakehouse approach is a fundamental challenge to that, and if they're getting this kind of performance from their analytics engine against the market-leading data warehouses today, that's a big momentum shift in their favour.
As much as I love seeing competition in the space and am enjoying my popcorn, I really don't understand what Databricks is doing here: this feels like a childish foodfight rather than an obsession with the customer...
:) That is a good question. Why spend eng cycles to submit results to the TPC council - why not just focus on customers?
I believe the co-founders have addressed this in the blog.
> Our goal was to dispel the myth that Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark.
I'm sure anybody seriously looking at evaluating data platforms would want to look at things holistically. There are different dimensions like open ecosystem, support for machine learning, performance etc. And different teams evaluating these platforms would stack rank them in different orders.
These blogs, I believe, show that Databricks is a viable choice for customers when performance is a top priority (along with other dimensions). That IMO is customer obsession.
Yes, the tone of those blog posts, the likelihood of fake benchmarks submitted on someone else's behalf, and especially the deluge of new accounts supporting them make me want to trust Databricks even less than I already did after the PoC my company ran with them last year and the time spent with their terrible, terrible salespeople.
EDIT: I forgot lying about how open they are when all their interesting technologies (like the new sql engine and the good parts of delta) are proprietary.
I think Snowflake cultivates a very careful public image, but in private their salespeople use.. how do you say.. aggressive techniques.. Databricks is addressing the source of market confusion head-on.
I've been following this and it's kind of embarrassing to watch.
I love working with Databricks and Snowflake. They both knock it out of the park for their respective use cases. They're amazing products.
It makes no sense to fall out about this though.
For a 100TB dataset with a funky calculation, Spark will trounce Snowflake. For a 1-row dataset, Snowflake will return before the Spark job has even been serialised.
What are you talking about? Spark isn't even used, and TPC-DS is not a funky calculation at all; it's supposed to be a collection of typical data warehouse-type queries. I'm not really sure what "funky" means, but why would Spark trounce Snowflake on a funky calculation at all? Do you mean an ML algorithm, and are you implying that TPC-DS has anything close to an ML algorithm? And why would Snowflake perform better on returning one row - its storage is columnar too.
It goes into the details of how this performance is achieved (and not just at 100TB). Part of this could be attributed to innovations in the storage layer (Delta Lake), and part of it is just the new query engine design itself.
Instead of blog posts written by experts in app A based on their experience with app B, I wish there were a platform for this kind of comparison.
Some objective third party sets the goal and then each company submits automation (selenium?) that configures their own app to achieve the goal. Entrants are scored by:
- time
- storage
- compute
- config complexity
No need to waste time making your opponent look bad, just focus on making your self look good, and do it on a level playing field.
Snowflake has way more revenue, is worth 3 times more than Databricks and is growing faster. I'd say Snowflake is still in the lead. Plus, just look at Snowflake's customer list. It's a "who's who", Databricks is a "Who's that?".
Lists of "references" like these are worthless. Because larger companies tend to be fragmented, especially companies that have more complicated business lines and are used to departments and divisions acting independently.
You know what, our company uses both Snowflake and Databricks.
For Databricks, there are one or two projects that someone built on it running in production. For Snowflake, there's sizeable use because we bought a smaller company that used it for reporting and warehousing. Neither of them is "the chosen tool", and neither will see any growth unless the wind changes. But we could be (F50 company) counted as a reference by both, I guess.
Redshift is pretty terrible, stay away. AWS is even worse at delivering promises than Databricks and that's saying something.
I heard Google BigQuery is good. It is completely SaaS (like an AWS Athena that works).
Unicorns often run their own stack and you could replicate that, if you have the appetite. Netflix and Apple run Trino + Spark on k8s + Iceberg. Uber used their own Hudi thing; not sure if they still do.
"Databricks is an enterprise software company founded by the creators of Apache Spark. [...] Databricks develops a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks."
Clickhouse is good if you're building an application. It has a lot of great features and incredible performance, but there's an expectation that people using it know what they're doing and can work around its limitations (like limited support for joins and SQL in general).
Something like Snowflake works much better when you're building a platform that you can give to two hundred data analysts of various skills spread over fifty teams, so they can build their own stuff. The nice UI, broad feature set (materialized views, time travel, automatic backups, superfast scaling up and down, ...) and general just-work-iness makes it nice for that, but you're going to pay for the privilege.
Databricks is somewhere in the middle - things are way less polished, features don't always work and you still have to figure out things like backups and partitions on S3 on your own, but some people like that. Expect to also pay a pretty penny for hundreds of Spark clusters nobody knows who uses.
When was the last time you used Databricks? You should definitely try it again. Their product offering has improved a lot in the past few years.
> broad feature set
My experience is that the feature sets of Snowflake and Databricks are very similar. Both have time travel support. Snowflake has materialized views, but Databricks has Delta Live Tables. Databricks has a distributed pandas API, but Snowflake recently introduced Snowpark. Databricks also has autoscaling, and they recently launched a serverless offering that makes autoscaling super fast as well.
Snowflake has much more advanced data security - table-, column-, and row-level security and dynamic data masking policies. The zero-copy cloning is also pretty useful for CI/CD (pretty much the one practical way to do blue-green deployment for a data application).
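As a toy sketch of that blue-green pattern (names are placeholders, run via the Python connector; CREATE ... CLONE copies metadata, not data):

    import snowflake.connector

    # Hypothetical blue-green swap using zero-copy cloning.
    conn = snowflake.connector.connect(account="my_account", user="ci_user", password="...")
    cur = conn.cursor()

    cur.execute("CREATE OR REPLACE TABLE reports.daily_summary_green CLONE reports.daily_summary")
    # ... rebuild and validate reports.daily_summary_green with the new pipeline version ...
    cur.execute("ALTER TABLE reports.daily_summary_green SWAP WITH reports.daily_summary")
    conn.close()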
Databricks has some interesting features (we were originally interested in it as a "nice UI" over our AWS data lake for citizen data scientists - using it for industrialized processing was cost-prohibitive compared to AWS Glue), but the security seems lacking - it only goes down to the table level and only in SQL and Spark; with R you can't have security at all.
I really liked the Databricks UI and integrated visualizations, though, that's where they are better than Snowflake I think. Of course, they gained those by buying open source Redash.io and ending it.
The part that ended our PoC with them was when they gave us a price quote for the expected number of users. The management was like "OK, that sounds reasonable" until I told them that was just the license and did not include EC2 costs - the real cost would be at least twice that. That made everyone angry.
* Apples and oranges: Clickhouse is a query engine while Databricks is a SaaS product/company. Apache Spark could be compared to Clickhouse, Databricks to clickhouse.com/company. The latter is barely a couple months old.
* Databricks pivoted from analytics to ML and it's not just marketing. Clickhouse is all about OLAP use cases.
* Clickhouse competes with Druid/Pinot/Timescale, Spark competes with Flink.
This reminds me of the old performance ads of Oracle where they would show you how everything ran better on Oracle. They used to put those ads at airports, business lounges and the back cover of newspapers and magazines read by non-technical executives like the FT and Economist.
Everyone technical knew they would game every environment to come out with superior results. I suppose it worked, as the top executives buy big system software and ignore the IT crowd who could easily point out the flaws in the methodology of the "studies".
A key part of the Oracle strategy is making it a breach of license to publish any benchmarking data. No performance data about Oracle's database is allowed to be published without their approval, which means no negative results are published.
Oracle Exadata is very fast but expensive. I bet it would beat a similarly sized cluster from these 2 vendors. The problem is price-to-performance and elasticity. Because DB and SF are in the cloud, they have a lot of options that Oracle doesn't have. This is why Kurian left Oracle to go to Google: LE would not allow Oracle to make cloud-native products that would run in other clouds. The SF cofounders are ex-Oracle engineers, and LE was not interested in creating a cloud-native DB from scratch. If he had been, we wouldn't have Snowflake Computing right now.
Yeah, the biggest benefit something like Snowflake or Databricks or whatever AWS tools has over the more traditional technologies is the pay-as-you-go pricing.
We are now trying to scale unnamed technology running on EC2 from 100 to 200 nodes, and the process to buy a larger license is pretty painful. If we were using Snowflake or Databricks, we could just scale it up and update our opex estimate.
This is kind of understandable. Benchmarking complex software is complicated. It’s easy to give totally wrong picture of things either accidentally or deliberately.
tl;dr: The data warehouse company used a pre-baked TPC-DS dataset and claimed they have similar performance to Databricks. Turns out if you use the official TPC-DS data generation scripts, you get much worse performance.
I read the original post, the Snowflake response, and this. From that I gather that both of them aren't being completely honest or fair when making comparisons. A fair amount of truth, but also some clever wording and omission on both their parts. Which is not surprising or particularly new in this space :)
Sorry to nitpick (document seems solid), on page 32:
> Due to a TPC-internal error during the production of 3.2.0 of the TPC-DS kit, the benchmark execution had to use version 2.13 of the kit. It was confirmed by the TPC that the only changes between these two versions of the kit is the version number set in the tools/release.h parameter file.
How can there be that much of a delta of major/minor versions without a change? The only way I see this happening is if 'change' is defined as the specific benchmark which was run, rather than the kit.
Yes, I wasn't saying they were lying about their tpc.org posted results. I'm saying both companies made use of clever indirection, wording, presentations of stats, etc. Like price/performance, and which of your competitor's tiers to select when doing that, and which of your own. Or over-provisioning the competition's setup, for example.
My guess is that they know the result won't look terrific. And they also know Snowflake works well in production for people despite that. So, little upside.
All leaders in a space take this approach. Little to be gained, a fair bit to lose, if you are ALREADY leading, without having to debate / do a benchmark etc.
Anyways, the benchmark is only one part of the overall story for these solutions.
As with other offerings in this space, the key to managing technical debt is to get functions out of notebooks ASAP, stage intermediate results where appropriate, and turn everything into jobs.
As people noted elsewhere, you have to be VERY careful with using Databricks for a full data warehouse, because it drives you toward notebook-driven development and scheduling of those notebooks, when data pipelines should follow similar development practices to other software projects.
Great for proof of concepts, but when you start to build out complete pipelines please look into how to make the pipelines more sustainable and maintainable.
Finally somebody that has used Databricks! I can't believe all the praise I read elsewhere in the comments here. Databricks is broken in so many ways, it is beyond me how anyone can like using this.
F1 cars are really unreliable and need a lot of engineers to keep them running, are very expensive, and are completely impractical in normal use. They are fast, but only on very specific roads; they couldn't survive on normal roads.
Serious question: Databricks, Snowflake, Dremio. All these "Data" platform companies => which one do you have for your Data Lake and Data Warehouse solution?
I'm sick and tired of these companies Snake Oiling the Data industry by offering "the easiest" platform to satisfy your Data Lake + Warehouse solution only to fall hard whenever you hook it up with your production data (big dataset).
PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one platform) is on meth.
That's a respectable amount for a DW, true.
Spark and its ilk are designed for much larger scales though. Multiple FAANG use cases for Spark are in the petabytes-per-week range.
Apache Spark is an open API. You can build your ETL with it and run it on an open source Spark cluster, an AWS EMR cluster, or a Databricks cluster. It will work across all three (and others) because the API is open.
Vendors can implement that API with their own optimizations. EMR makes optimizations in their implementation and so does Databricks. Photon is a new engine, but it implements the Apache Spark API for better performance. There's nothing to stop EMR or any other Apache Spark vendor from undertaking the same strategy.
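A hypothetical PySpark job like the one below (bucket and column names invented) shows what that portability means in practice: the same code runs unchanged on open source Spark, EMR, or Databricks, and an engine like Photon can speed it up without any API change:

    from pyspark.sql import SparkSession, functions as F

    # Vendor-neutral sketch: nothing here is specific to EMR or Databricks.
    spark = SparkSession.builder.appName("portable-etl").getOrCreate()

    raw = spark.read.json("s3://my-bucket/raw/clickstream/")
    cleaned = (
        raw.filter(F.col("event_type").isNotNull())
           .withColumn("event_date", F.to_date("event_ts"))
    )
    cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3://my-bucket/curated/clickstream/"
    )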
This openness has allowed customers of Hortonworks and Cloudera to migrate their workloads to the cloud easier than if they had to refactor from something completely different, like from Oracle PL/SQL routines.
Snowflake does not have an open ETL API. If you write stored procedures in Snowflake, you can only run them on Snowflake. This is one of the reasons people choose to use dbt with Snowflake. It gives them an open ETL layer to provide future optionality.
There's no reason why you couldn't use Snowflake as the datastore and Spark as the ETL. However, it would be prohibitively expensive to do so. You would need to pay for the Spark cluster, but also a Snowflake cluster to export and import the data. Exporting a handful of terabytes from Snowflake can also take hours depending on your cluster configuration.
By storing your data on S3 in an open format, like Apache Parquet or Delta Lake, you can just use a different engine on it without needing to export / import it. In addition to Spark, Presto & Trino are popular engines to use when querying a data lake.
This optionality is ultimately good for customers. If Apache Spark is best for your use case, then you can choose to host Spark yourself, EMR, Databricks, Cloudera, etc. If Presto is best for your use case, you can choose AWS Athena, Starburst, Ahana, etc. Once you pick the best tech for your use case, you have several vendors to compare against for the best deal.
If I want to move off Snowflake to Firebolt or some other data warehouse, I need to pay both vendors to get my data out and get my data in. Snowflake wasn't around 10 years ago, and if they are not still a good option 10 years from now, I don't want to have to pay them for the privilege to export my data out. I could rectify that by keeping all my data in a data lake, but now I'm paying to store the data twice.
Open APIs enable an open ecosystem, which encourages competition.
Databricks isn't open source, as they keep hold of all the IP that makes it much better than OS Spark. Whether you buy Snowflake or Databricks, you're buying proprietary software.
With Snowflake, data is locked away in a proprietary format not accessible by other compute platforms. You need to export/copy your data to a different system to train an ML model in Python or R.
With Databricks, you can use Python, R and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Presto and other engines that support Delta), so you are not locked into one compute engine.
This is very true. They make the lowest common denominator parts "open source" but control all of the commits. Also the query engine used for this benchmark is proprietary, closed source (Photon)
The 'open' here refers to the data. Delta lake can be read/written by multiple open source engines, not just Spark. Not to mention, if you want you can use Databricks with Parquet, though the experience won't be as good.
But with Snowflake, the data never comes out. Can't use Spark/Trino/Flink... on data in SF.
Not entirely true. There is a bi-directional Spark connector for Snowflake written by Databricks. And exporting your data in bulk out of Snowflake into any number of open formats is incredibly easy using the COPY INTO command. You can also use Snowflake on top of Parquet and even Delta Lake.
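For reference, that bulk unload looks roughly like this via the Python connector (all names are placeholders, and the S3 storage integration is assumed to already exist):

    import snowflake.connector

    # Hedged sketch of unloading a table to open-format files on S3.
    conn = snowflake.connector.connect(account="my_account", user="exporter", password="...")
    cur = conn.cursor()
    cur.execute("""
        COPY INTO 's3://my-export-bucket/events/'
        FROM analytics.public.events
        STORAGE_INTEGRATION = my_s3_integration
        FILE_FORMAT = (TYPE = PARQUET)
        HEADER = TRUE
    """)
    conn.close()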
This is the problem. Both Snowflake and Databricks are spreading FUD and otherwise smart people are falling for it.
"Exporting PB from Snowflake" is only ever relevant if you want to move from Snowflake to something else. In that case, all other migration costs (recoding, redocumenting and especially revalidating everything, if in regulated environment) are going to make any cost of data movement irrelevant.
I think it's important to understand how this kind of scenario comes up. It's unusual to want to move a whole PB at one time, and yeah in that case these other costs would come up. Problem is, the cost is more insidious than that.
Consider a scenario where data is coming in periodically, say daily, from some source: server logs, sensor data, whatever. And the user wants to train models daily on the data, and they also want to do some SQL. Maybe they ingest the data directly into SF and copy it out for training, or they do it the other way round: land it in object store and then ingest into SF. This is unlikely to be a humongous amount of data; it's probably not a PB. However, this adds up - maybe for some use cases it becomes a PB in a month, maybe in a quarter, maybe it only adds up to a PB in a year.
Thing is, without a Lakehouse architecture, the user will pay to store and copy that data multiple times (at least twice) no. matter. what. They may not pay for a PB in one shot, but you can bet that eventually they'll pay multiple times to store and copy that PB.
It's very relevant if you ever want to do serious ML or anything other than SQL. Of course Snowflake wants you to think that you never need another platform. Every customer knows that's not the case.
So if I stop paying Databricks, I can no longer use their proprietary query engine (Photon), right? I have to use something else, like open source Spark SQL, which is slower and will cost a lot more money.
There are different ways to lock customers in and both Databricks and Snowflake are playing the game.
I’m not sure this locks anyone in. The APIs are open and Spark code will run on, say EMR, just fine.
Every vendor, be it Snowflake, Databricks, EMR, Athena, BQ, … charges for use of the engine. The difference with a Lakehouse is that one doesn’t have to pay the vendor for the simple ability to use the data with another offering. That’s what you have to pay for with closed systems, whether it’s data on the way in or data on the way out.