What I find hilarious is that companies argue about who can query 100 TB faster and try to sell this to people. I've been on the receiving end of offers from both of the companies in question and used both platforms (and sadly migrated some data jobs to them).
While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller) and the perf expectedly tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense costs to migrate everything and train everyone up.
Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).
I did work on making a database myself, and I must say that querying 100TB fast, let alone storing 100TB of data, is a real problem. Some companies (very few) don't have much choice but to use a DB that works at 100TB. If you have small data, you have a lot of options; if your data is large, you have very few. So it is fair to compete on how fast a DB can query 100TB of data, even while being slow on just 10GB. Some databases are designed only for large data and should not be used if your data is small.
The larger your data, the more building and maintaining indexes hurts you. This is why these systems do much better on large datasets than on small ones. It's all about trade-offs.
To overcome this, they make use of caching, and if the small data is frequently accessed, performance is generally pretty good and acceptable for most use cases.
Very true. You have to understand the actual capabilities and your actual requirements. We work with petabyte size datasets and BigQuery is hard to beat. Our other reporting systems are still all in MySQL though.
It's my experience that if it's just tens of GBs, then use 'normal' solutions; if it's TBs, then Spark is great for that. Note I have only used Databricks & Spark, no Snowflake.
The irony here is that what Databricks is doing to Snowflake is exactly what Snowflake did to AWS and Redshift.
Same playbook - show that you’re better in a key metric that’s easy to understand (performance) to get the attention, but then pitch the paradigm change.
In Snowflake’s case, that was separation of storage and compute.
In Databricks' case, it's the Lakehouse Architecture.
I think the reason Snowflake is so nervous is that they know they can't win this game.
To be fair Apache Spark, which started long before either company existed, was built on the assumption that compute and storage should be separate. Unlike Hadoop, Spark did not come with any storage system and could read from any source.
> To be fair Apache Spark, which started long before either company existed
Databricks was founded by Spark's creators before Spark 1.0 was released.
Hadoop was created at a time when networks and disks were much slower and RAM was less abundant. Bringing compute to the data made sense then, but it typically doesn't anymore.
Hadoop was built on the notion that commodity hardware, when pooled together, can be extremely cheap and powerful. The problem is that managing it is a nightmare. Cloudera/HWX and others were unable to reduce the management burden, and their inability to pivot to a cloud-based architecture really sunk their ship.
SF spreads a lot of FUD saying that DB can’t perform, and it was true. DB then went out and hired a lot of engineering talent with a diverse background and has been investing a lot of money in being a best in class SQL offering, so what do you do? You do something to get people’s attention. They’re saying, “hey, we have great performance too, you should also look at us for your SQL workloads.”
> I think the reason Snowflake is so nervous is that they know they can't win this game.
Isn't Databricks' delta.io, which their Data Lakehouse product builds on top of, open source? Snowflake could take the best parts from it and run with it?
They could in principle. GCP, for instance, does do that. So does HP. And Databricks don't mind that as they have a strong open source legacy. But that takes away the proprietary lock-in strategy of Snowflake.
Delta is open source, but Databricks keeps optimizations for themselves as proprietary. I'm not sure why it would be any better than Snowflake's solution, which is automatically deployed across multiple AZs as a fully HA system and gives full ACID transaction compliance across any number of tables (not just per-table).
Essentially, with Databricks making Delta open source, you can move away from Databricks to EMR or Presto (with their own optimizations) without incurring a data tax. You're also able to move between cloud providers with ease, as the data sits in low-cost buckets.
In what way is lakehouse architecture beneficial over something like Snowflake or BigQuery?
I understand the appeal over having lake and warehouse as separate components, but with those native cloud warehouses, you can already do everything a lake does.
With a data warehouse, you can only interface with your data in SQL. With BigQuery and Snowflake, your data is locked away in a proprietary format not accessible by other compute platforms. You need to export/copy your data to a different system to train an ML model in Python or R.
With the lakehouse, you can use Python, R and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Databricks, Presto), so you are not locked into one compute engine.
I recall being a junior programmer and wishing I could talk to my MySQL database in Python code to do some processing that was difficult to express in SQL. That day is finally here.
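A rough sketch of that workflow with PySpark and Delta (purely illustrative - the table path and column names are made up, and it assumes a Spark runtime with Delta Lake support):

    from pyspark.sql import SparkSession

    # Hypothetical example: one copy of the data, reachable from both SQL and Python.
    spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

    events = spark.read.format("delta").load("s3://my-bucket/warehouse/events")
    events.createOrReplaceTempView("events")

    # SQL interface over the table...
    daily = spark.sql("SELECT event_date, count(*) AS n FROM events GROUP BY event_date")

    # ...and a Python/ML interface over the same table, with no export/copy step.
    pdf = events.select("feature_a", "feature_b", "label").toPandas()
    # pdf can now go straight into scikit-learn, XGBoost, etc.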
BigQuery does support ML. But the pricing is kind of a racket ($250/TB), so I'll stick to modeling in R/Python. Which I guess reinforces your point. I wonder who pays for this.
My experience is that's how it looks at first. But it is hard to actually make use of lake or lakehouse openness.
You can access data in Snowflake or BigQuery using JDBC or Python clients. You do pay for the compute that reads the data for you. You cannot access the data in storage directly.
You can access data in lakehouse directly, by going to cloud storage. That has two major challenges:
Lakehouse formats aren't easy to deal with. You need a smart engine (like Spark) to do that. But those engines are pretty heavy. Starting a Spark cluster to update 100 records in a table is wasteful.
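To make that concrete, a tiny correction via Delta's Spark API looks roughly like this (table path and columns are hypothetical); the code is short, but it still needs a live Spark cluster with Delta support behind it:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    # Hypothetical sketch of the "heavy engine for a tiny change" problem.
    spark = SparkSession.builder.getOrCreate()

    tbl = DeltaTable.forPath(spark, "s3://my-bucket/warehouse/orders")
    tbl.update(
        condition="order_id BETWEEN 1000 AND 1100",  # only ~100 records
        set={"status": "'corrected'"},
    )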
The bigger challenge is security. Cloud storage can't give you granular access control. It only sees files, not tables and columns. So if you have a need for column- or row-based security or data masking, you're out of luck. Cloud storage also makes it hard to assign even the non-granular access. Not sure about other clouds, but AWS IAM roles are hard to manage and don't scale for a large number of users/groups.
You can sidestep this by using a long-running engine (like Trino) and applying security there. Then you don't need to start Spark to change or query a few records. But it means you're basically implementing your own cloud warehouse.
Which honestly can be the way if that's what you want! You can also use multiple engines if you are ok with implementing security multiple times. To me, that doesn't seem to be worth it.
In the end, I don't see data that's one SELECT away as much more proprietary and "outsourced" than data that is one Spark/Trino cluster and then a SELECT away, just because you can read the S3 it sits on.
Have you ever tried to train models on large data sets over JDBC/ODBC? It's terrible even with parallelism. Having direct access to the underlying storage and being able to bypass sucking a lot of data through a small straw is a game changer. That is one advantage that Spark and Databricks have over Snowflake.
Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.
Sadly, those things are mutually exclusive at the moment and with the way things are deployed here (large multi-tenant platforms), the security has to take priority.
But if that's not your situation, then obviously it makes sense to make use of that!
> Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.
It is a solved problem. Essentially you need a central place (with decentralized ownership, for the data mesh fans) to specify the ACLs (row-based, column-based, attribute-based, etc.) and an enforcement layer that understands these ACLs. There are many solutions, including the ones from Databricks. Data discovery, lineage, data quality etc. go hand in glove.
Security is front and centre for almost all organizations now.
This is exactly what FAANGs do with their data platforms. There are literally hundreds of groups within these companies with very strict data isolation requirements between them. Pretty sure something like that is either already possible or will be very soon, there's just too much prior art here.
That's where Databricks comes in though: you can implement row/column-based security on your data on cloud object storage and use it for all your downstream use cases (not just BI/SQL but AI/ML, without piping data over JDBC/ODBC).
According to their documentation [1], Databricks does not have this capability even for their own engines, and definitely not for "without piping data".
This is what I've personally seen a few times - Databricks claiming they can do something and then it turns out they can't. Buyer beware of lying salespeople and HN shills.
I don't understand what capability you are saying Databricks lacks. This capability is literally the entire premise of the Data Lakehouse. With Snowflake you need to export data out or pipe data over JDBC/ODBC to an external tool. With Databricks you can use SQL for data warehousing, and when you need to, you can work with that same data using Python to train an ML model without piping data out over JDBC (using the Spark engine). One security model, one dataset, multiple use cases (AI/ML/BI/SQL) on one platform.
They're still lacking things in the SQL space. For example, Databricks say they're ACID compliant, but it's only on a single-table basis. Snowflake offers multi-table ACID consistency, which is something that you would expect by default in the data warehousing world. If I'm loading, say, 10 tables in parallel, I want to be able to roll back or commit the complete set of transactions in order to maintain data consistency. I'm sure you could work around this limitation, but it would feel like a hack, especially if you're coming from a traditional DWH world (Teradata, Netezza etc.).
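For illustration, the multi-table pattern being described looks roughly like this through the Snowflake Python connector (connection details and table names are placeholders); either both loads become visible together or neither does:

    import snowflake.connector

    # Hedged sketch of an explicit multi-table transaction.
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="...",
        warehouse="LOAD_WH", database="DW", schema="STAGING",
    )
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        cur.execute("INSERT INTO dim_customer SELECT * FROM stg_customer")
        cur.execute("INSERT INTO fact_orders SELECT * FROM stg_orders")
        cur.execute("COMMIT")    # both tables land together
    except Exception:
        cur.execute("ROLLBACK")  # neither table is partially loaded
        raise
    finally:
        conn.close()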
Snowflake now offers Scala, Java and Python support, so it would seem their capabilities are converging even more, but both with their own strengths due to their respective histories.
Actually, you would expect that in an OLTP world. DWs, even Oracle, have for the longest time recommended disabling transactions to get better performance. The logic is implemented in the ETL layer. Very rarely do you need multi-table transactions in a large-scale DW.
> I do not see why it would be much slower than direct access to the storage.
Implementations of protocols like ODBC/JDBC generally implement their custom on-wire binary protocols that must be marshalled to/from the lib - and the performance would vary a lot from one implementation to another. We are seeing a lot of improvements in this space though, especially with the adoption of Arrow.
There is also the question of computing for ML. Data scientists today use several tools/frameworks ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow - to name a few. Enabling data scientists to use these frameworks against near-realtime data without worrying about provisioning infrastructure or managing dependencies or adding an additional export-to-cloud-storage hop is a game changer IMO.
Here is the thing with the lakehouse though: you have flexibility and don't need to use multiple engines to achieve the lakehouse vision. Databricks has all the security features a Redshift/Snowflake does, so you can secure databases and tables rather than S3 buckets. It does get more complex if you want to introduce multiple engines, but at least you have the option to make that trade-off if you want to.
If you want simplicity, you can limit your engine to Databricks. You can also use JDBC/ODBC with Databricks if you want to use other tools that don't support the Delta format/Parquet, but piping data over JDBC/ODBC doesn't scale to large datasets with any tool. Databricks has all the capabilities of BigQuery/Snowflake/Redshift, but none of those tools support Python/R/Scala. Their engines would need to be rewritten from the ground up in order to do so.
But you do still have to secure the S3 buckets, right? And I guess also secure the infrastructure you have to deploy in order to run Databricks. Plus then configure for cross-AZ failover etc. So you get flexibility, but I would think at the cost of much more human labor to get it up and running.
Snowflake uses the Arrow data format with their drivers, so is plenty fast enough when retrieving data in general. But it would be way less efficient if a data scientist just does a SELECT * to bring everything back from a table to load into a notebook.
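For what it's worth, the Arrow-backed path through the Python connector looks roughly like this (account details and the table are made up); the point is to push filtering/aggregation into the query instead of SELECT *-ing a whole table into a notebook:

    import snowflake.connector

    # Hypothetical sketch; requires the connector's pandas/pyarrow extras.
    conn = snowflake.connector.connect(
        account="my_account", user="analyst", password="...",
        warehouse="ANALYTICS_WH", database="DW", schema="PUBLIC",
    )
    cur = conn.cursor()
    cur.execute("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
    df = cur.fetch_pandas_all()  # results arrive via Arrow as a pandas DataFrame
    conn.close()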
Snowflake has had Scala support since earlier in the year, along with Java UDFs, and also just announced Python support - not a Python connector, but executing Python code directly on the Snowflake platform. Not GA yet though.
You can use Scala, Java and Python with Snowflake now, as well as process structured, semi-structured and unstructured data. So I guess that means it doesn't fit into the data warehouse category, but is not a lakehouse either.
BigQuery & Dataproc, Redshift & EMR, Synapse & HDInsight are tied to their cloud vendors. You can't easily move from the AWS stack to GCP without refactoring. Switching costs are higher.
Snowflake and Databricks are multicloud. The difference is that Snowflake is more like a SaaS solution and only does SQL. Databricks is more than just SQL: it has data science and machine learning capabilities built into it. Snowflake has Snowpark, but it's very limited, so you are more likely to have to buy more products to build out your capabilities and integrate them with Snowflake. With Databricks it is more out of the box in terms of capabilities.

Databricks also runs in your cloud account, which has trade-offs. It can be harder to get going and more complex, but you end up with a lot more flexibility, and you own your data and have complete control over it. While Snowflake gives you control of your data with their tools, everything has to go through Snowflake and incur their tax to get to it. You pay for simplicity, which many customers are OK with because they see value in it. On the contrary, a lot of customers see value in having more control and options. This market is big enough for everyone - it's really just about market share.
I've used both products in production. Both are good++.
The blog wars seem extremely ridiculous to me. I don't recall ever choosing one over another based on how fast it runs on some imaginary arbitrary dataset.
Manufactured rivalries can be a great thing for business. We have been debating Coke vs Pepsi, Nike vs Reebok, McDonald's vs Burger King for decades now while these companies laugh all the way to the bank.
Like the post, but I would add "Ford v Ferrari" there. A synthetic 100TB test is much like an F1 course - not something you deal with during your commute, but it's nice to know what the limit is, and that there are people pushing that limit.
I actually see them as variations on the same architecture. Databricks keeps their metadata in files, Snowflake keeps theirs in a database, but they both, ultimately, are querying data stored in a columnar format on blob store (and, to be fair, Snowflake have been doing that with ACID-compliant SQL for a lot longer than Databricks). So using SQL over blob at high performance has been around for a while.
Databricks say their solution is better because it's open (though they keep the optimizations you need to run it at scale to themselves, i.e. it is ultimately proprietary). Snowflake say theirs is better because it's a fully managed service, meaning no infrastructure to procure or manage, fully HA across multiple data centers by default, etc.
Databricks push 'open' but really still want you to use their proprietary tech for first transforming into something usable (Parquet/Delta) and then querying with Photon/SQL, though you can also use other tech. With Snowflake you can just ingest and query, but it has to be through their engine.
Customers should do their own validation and see which one fits their needs best.
Snowflake accuses other companies of lacking integrity?
I really wish I could block all of Snowflake's domain from my inbox. Sadly, Google encourages spammers to just create a new email address. So I get a few emails each month from Snowflake who ask me to try their products. I've never done business with them and there's no unsubscribe link.
Fuck Snowflake for thinking it has any room to talk about integrity.
What I find comical is they accuse Databricks of lacking integrity, but they don't actually call out anything except that their benchmark was faster than what Databricks did in Snowflake. Databricks then reruns the benchmark and says the only reason Snowflake's was faster was the built-in dataset they used. Databricks was able to match Snowflake's numbers using it, but when they loaded the actual dataset, it was much slower - which is how a proper TPC benchmark is supposed to happen. Snowflake then said that Databricks' blog doesn't match the TPC results, but when I looked at them, they do match. I guess Snowflake just expects people to take arguments at face value. Then I saw someone on LinkedIn complaining that Databricks must have used some beta version. I didn't see a beta version being used, but that kind of goes out the window when Databricks follows up and posts that they matched Snowflake when they used Snowflake's built-in TPC dataset.
This is funny and interesting to watch but also a distraction I feel. Amazon says it best when they say, “Leaders start with the customer and work backwards. They work vigorously to earn and keep customer trust. Although leaders pay attention to competitors, they obsess over customers.”
Snowflake must be kicking themselves hard now for letting a story that was “Databricks is a viable alternative” turn into “Snowflake has absolutely no integrity and will fling mud even while they are gaming the statistics”
Really can't see what they can do now short of "bending" to Databricks and entering the competition. And naturally it's no longer enough that they show comparable performance. They have to beat their gamed stats somehow, otherwise any news, even if they beat Databricks, will be reported as "see, we told you they were cheating".
Before the Snowflake blog post, I did not know what Snowflake or Databricks were. I can only imagine that this rivalry is great for both of them, even if Databricks has the advantage, at least from a tactical standpoint; I admit, though, that they seem a bit unnecessarily defensive in this exchange considering the position they're in.
In general though, I'm still not complaining. It's interesting to see a dispute like this unfold.
Of course they’re known among their pre-existing customer base of people and entities who already solve problems using tools like this. But it’s a subset of the multi-trillion dollar cloud industry, which itself is not the entire software engineering industry.
I would say that TPC-DS and TPC-H are really table-stakes benchmarks for data warehouses at this point in time (maybe they weren't 10 years ago). How to build a database that does well on them is well documented in the literature now[1][2][3][4] (maybe a few other papers). It's not easy to build such a database, but it's "just" hard work, and many companies have the $$ necessary to do that work. There isn't any magic or technical moat in the results for Databricks (or Snowflake, or Redshift, etc.).
I think Databricks is overly enthusiastic about their results, as they have been trying to be competitive with cloud DWs on these benchmarks for a number of years now. They have finally caught up (by building Delta Lake and their Photon query engine, which implement a number of standard DW features).
I agree with everything above. The main advantage the newer data warehouses have over the legacy on-prem incumbents is that they had the chance to build from scratch having learned from all of the challenges that the original players encountered.
The public pissing contest is entertaining while also being silly and slightly cringe, but I think it's a nice story for Databricks nonetheless. They now have a performant SQL-based analytics engine that can credibly compete with the best DWs in the market today, and it's just one part of their overall platform.
The sense I get is that Snowflake wants the conversation to be "no matter what you do, you need a data warehouse, and we're the best in the business at that." Databricks' Lakehouse approach is a fundamental challenge to that, and if they're getting this kind of performance from their analytics engine against the market-leading data warehouses today, that's a big momentum shift in their favour.
As much as I love seeing competition in the space and am enjoying my popcorn, I really don't understand what Databricks is doing here: this feels like a childish foodfight rather than an obsession with the customer...
:) That is a good question. Why spend eng cycles to submit results to the TPC council - why not just focus on customers?
I believe the co-founders have addressed this in the blog.
> Our goal was to dispel the myth that Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark.
I'm sure anybody seriously looking at evaluating data platforms would want to look at things holistically. There are different dimensions like open ecosystem, support for machine learning, performance etc. And different teams evaluating these platforms would stack rank them in different orders.
These blogs, I believe, show that Databricks is a viable choice for customers when performance is a top priority (along with other dimensions). That IMO is customer obsession.
Yes, the tone of those blog posts, the likelihood of fake benchmarks submitted on someone else's behalf, and especially the deluge of new accounts supporting them make me want to trust Databricks even less than I already did after the PoC my company ran with them last year and the time spent with their terrible, terrible salespeople.
EDIT: I forgot lying about how open they are when all their interesting technologies (like the new sql engine and the good parts of delta) are proprietary.
I think Snowflake cultivates a very careful public image, but in private their salespeople use.. how do you say.. aggressive techniques.. Databricks is addressing the source of market confusion head-on.
I've been following this and it's kind of embarrassing to watch.
I love working with Databricks and Snowflake. They both knock it out of the park for their respective use cases. They're amazing products.
It makes no sense to fall out about this though.
For a 100TB dataset with a funky calculation, Spark will trounce Snowflake. For a 1-row dataset, Snowflake will return before the Spark job has even been serialised.
What are you talking about? Spark isn't even used, and TPC-DS is not a funky calculation at all; it's supposed to be a collection of typical data warehouse-type queries. I'm not really sure what "funky" means, but why would Spark trounce Snowflake on a funky calculation at all? Do you mean an ML algorithm, and are you implying that TPC-DS has anything close to an ML algorithm? And why would Snowflake perform better on returning one row - its storage is columnar too.
It goes into the details of how this performance is achieved (and not just at 100TB). Part of this could be attributed to innovations in the storage layer (Delta Lake), and part of it is just the new query engine design itself.
Instead of blog posts written by experts in app A based on their experience with app B, I wish there were a platform for this kind of comparison.
Some objective third party sets the goal and then each company submits automation (selenium?) that configures their own app to achieve the goal. Entrants are scored by:
- time
- storage
- compute
- config complexity
No need to waste time making your opponent look bad, just focus on making your self look good, and do it on a level playing field.
Snowflake has way more revenue, is worth 3 times more than Databricks and is growing faster. I'd say Snowflake is still in the lead. Plus, just look at Snowflake's customer list. It's a "who's who", Databricks is a "Who's that?".
Lists of "references" like these are worthless. Because larger companies tend to be fragmented, especially companies that have more complicated business lines and are used to departments and divisions acting independently.
You know what, our company uses both Snowflake and Databricks.
For Databricks, there are one or two projects that someone built on it running in production. For Snowflake, there's sizeable use because we bought a smaller company that used it for reporting and warehousing. Neither of them is "the chosen tool", and neither will see any growth unless the wind changes. But we could be (F50 company) counted as a reference by both, I guess.
Redshift is pretty terrible, stay away. AWS is even worse at delivering promises than Databricks and that's saying something.
I heard Google BigQuery is good. It is completely SaaS (like an AWS Athena that works).
Unicorns often run their own stack and you could replicate that, if you have the appetite. Netflix and Apple run Trino + Spark on k8s + Iceberg. Uber used their own Hudi thing; not sure if they still do.
"Databricks is an enterprise software company founded by the creators of Apache Spark. [...] Databricks develops a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks."
Clickhouse is good if you're building an application. It has a lot of great features and incredible performance, but there's an expectation that people using it know what they're doing and can work around its limitations (like limited support for joins and SQL in general).
Something like Snowflake works much better when you're building a platform that you can give to two hundred data analysts of various skills spread over fifty teams, so they can build their own stuff. The nice UI, broad feature set (materialized views, time travel, automatic backups, superfast scaling up and down, ...) and general just-work-iness makes it nice for that, but you're going to pay for the privilege.
Databricks is somewhere in the middle - things are way less polished, features don't always work and you still have to figure out things like backups and partitions on S3 on your own, but some people like that. Expect to also pay a pretty penny for hundreds of Spark clusters nobody knows who uses.
When was the last time you used Databricks? You should definitely try it again. Their product offering has improved a lot in the past few years.
> broad feature set
My experience is that the feature sets of Snowflake and Databricks are very similar. Both have time travel support. Snowflake has materialized views, but Databricks has Delta Live Tables. Databricks has a distributed pandas API, but Snowflake recently introduced Snowpark. Databricks also has autoscaling, and they recently launched a serverless offering that makes autoscaling super fast as well.
Snowflake has much more advanced data security - table-, column-, and row-level security and dynamic data masking policies. The zero-copy cloning is also pretty useful for CI/CD (pretty much the one practical way to do blue-green deployment for a data application).
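As a toy sketch of that blue-green pattern (names are placeholders, run via the Python connector; CREATE ... CLONE copies metadata, not data):

    import snowflake.connector

    # Hypothetical blue-green swap using zero-copy cloning.
    conn = snowflake.connector.connect(account="my_account", user="ci_user", password="...")
    cur = conn.cursor()

    cur.execute("CREATE OR REPLACE TABLE reports.daily_summary_green CLONE reports.daily_summary")
    # ... rebuild and validate reports.daily_summary_green with the new pipeline version ...
    cur.execute("ALTER TABLE reports.daily_summary_green SWAP WITH reports.daily_summary")
    conn.close()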
Databricks has some interesting features (we were originally interested in it as a "nice UI" over our AWS data lake for citizen data scientists - using it for industrialized processing was cost-prohibitive compared to AWS Glue), but the security seems lacking - it only goes down to the table level and only in SQL and Spark; with R you can't have security at all.
I really liked the Databricks UI and integrated visualizations, though, that's where they are better than Snowflake I think. Of course, they gained those by buying open source Redash.io and ending it.
The part that ended our PoC with them was when they gave us a price quote for the expected number of users. The management was like "OK, that sounds reasonable" until I told them that was just the license and did not include EC2 costs - the real cost would be at least twice that. That made everyone angry.
* Apples and oranges: Clickhouse is a query engine while Databricks is a SaaS product/company. Apache Spark could be compared to Clickhouse, Databricks to clickhouse.com/company. The latter is barely a couple months old.
* Databricks pivoted from analytics to ML and it's not just marketing. Clickhouse is all about OLAP use cases.
* Clickhouse competes with Druid/Pinot/Timescale, Spark competes with Flink.
This reminds me of the old performance ads of Oracle where they would show you how everything ran better on Oracle. They used to put those ads at airports, business lounges and the back cover of newspapers and magazines read by non-technical executives like the FT and Economist.
Everyone technical knew they would game every environment to come out with superior results. I suppose it worked, as the top executives buy big system software and ignore the IT crowd who could easily point out the flaws in the methodology of the "studies".
A key part of the Oracle strategy is making it a breach of license to publish any benchmarking data. No performance data about Oracle's database is allowed to be published without their approval, which means no negative results are published.
Oracle Exadata is very fast but expensive. I bet it would beat a similarly sized cluster from these 2 vendors. The problem is price-to-performance and elasticity. Because DB and SF are in the cloud, they have a lot of options that Oracle doesn't have. This is why Kurian left Oracle to go to Google: LE would not allow Oracle to make cloud-native products that would run in other clouds. The SF cofounders are ex-Oracle engineers, and LE was not interested in creating a cloud-native DB from scratch. If he had been, we wouldn't have Snowflake Computing right now.
Yeah, the biggest benefit something like Snowflake or Databricks or whatever AWS tools has over the more traditional technologies is the pay-as-you-go pricing.
We are now trying to scale unnamed technology running on EC2 from 100 to 200 nodes, and the process to buy a larger license is pretty painful. If we were using Snowflake or Databricks, we could just scale it up and update our opex estimate.
This is kind of understandable. Benchmarking complex software is complicated. It’s easy to give totally wrong picture of things either accidentally or deliberately.
tl;dr: The data warehouse company used a pre-baked TPC-DS dataset and claimed they have similar performance to Databricks. Turns out if you use the official TPC-DS data generation scripts, you get much worse performance.
I read the original post, the Snowflake response, and this. From that I gather that both of them aren't being completely honest or fair when making comparisons. A fair amount of truth, but also some clever wording and omission on both their parts. Which is not surprising or particularly new in this space :)
Sorry to nitpick (document seems solid), on page 32:
> Due to a TPC-internal error during the production of 3.2.0 of the TPC-DS kit, the benchmark execution had to use version 2.13 of the kit. It was confirmed by the TPC that the only changes between these two versions of the kit is the version number set in the tools/release.h parameter file.
How can there be that much of a delta of major/minor versions without a change? The only way I see this happening is if 'change' is defined as the specific benchmark which was run, rather than the kit.
Yes, I wasn't saying they were lying about their tpc.org posted results. I'm saying both companies made use of clever indirection, wording, presentations of stats, etc. Like price/performance, and which of your competitor's tiers to select when doing that, and which of your own. Or over-provisioning the competition's setup, for example.
My guess is that they know the result won't look terrific. And they also know Snowflake works well in production for people despite that. So, little upside.
All leaders in a space take this approach. Little to be gained, a fair bit to lose, if you are ALREADY leading, without having to debate / do a benchmark etc.
Anyways, the benchmark is only one part of the overall story for these solutions.
As with other offerings in this space, the key to managing technical debt is to get functions out of notebooks ASAP, stage intermediate results where appropriate, and turn everything into jobs.
As people noted elsewhere, you have to be VERY careful with using Databricks for a full data warehouse, because it drives you toward notebook-driven development and scheduling of those notebooks, when data pipelines should follow similar development practices to other software projects.
Great for proof of concepts, but when you start to build out complete pipelines please look into how to make the pipelines more sustainable and maintainable.
Finally somebody that has used Databricks! I can't believe all the praise I read elsewhere in the comments here. Databricks is broken in so many ways, it is beyond me how anyone can like using this.
F1 cars are really unreliable and need a lot of engineers to keep them running, are very expensive, and are completely impractical in normal use. They are fast, but only on very specific roads; they couldn't survive on normal roads.
Serious question: Databricks, Snowflake, Dremio. All these "Data" platform companies => which one do you have for your Data Lake and Data Warehouse solution?
I'm sick and tired of these companies Snake Oiling the Data industry by offering "the easiest" platform to satisfy your Data Lake + Warehouse solution only to fall hard whenever you hook it up with your production data (big dataset).
PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one platform) is on meth.
That's a respectable amount for a DW, true.
Spark and its ilk are designed for much larger scales though. Multiple FAANG use cases for Spark are in the petabytes-per-week range.
Apache Spark is an open API. You can build your ETL with it and run it on an open source Spark cluster, an AWS EMR cluster, or a Databricks cluster. It will work across all three (and others) because the API is open.
Vendors can implement that API with their own optimizations. EMR makes optimizations in their implementation and so does Databricks. Photon is a new engine, but it implements the Apache Spark API for better performance. There's nothing to stop EMR or any other Apache Spark vendor from undertaking the same strategy.
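A hypothetical PySpark job like the one below (bucket and column names invented) shows what that portability means in practice: the same code runs unchanged on open source Spark, EMR, or Databricks, and an engine like Photon can speed it up without any API change:

    from pyspark.sql import SparkSession, functions as F

    # Vendor-neutral sketch: nothing here is specific to EMR or Databricks.
    spark = SparkSession.builder.appName("portable-etl").getOrCreate()

    raw = spark.read.json("s3://my-bucket/raw/clickstream/")
    cleaned = (
        raw.filter(F.col("event_type").isNotNull())
           .withColumn("event_date", F.to_date("event_ts"))
    )
    cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3://my-bucket/curated/clickstream/"
    )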
This openness has allowed customers of Hortonworks and Cloudera to migrate their workloads to the cloud easier than if they had to refactor from something completely different, like from Oracle PL/SQL routines.
Snowflake does not have an open ETL API. If you write stored procedures in Snowflake, you can only run them on Snowflake. This is one of the reasons people choose to use dbt with Snowflake. It gives them an open ETL layer to provide future optionality.
There's no reason why you couldn't use Snowflake as the datastore and Spark as the ETL. However, it would be prohibitively expensive to do so. You would need to pay for the Spark cluster, but also a Snowflake cluster to export and import the data. Exporting a handful of terabytes from Snowflake can also take hours depending on your cluster configuration.
By storing your data on S3 in an open format, like Apache Parquet or Delta Lake, you can just use a different engine on it without needing to export / import it. In addition to Spark, Presto & Trino are popular engines to use when querying a data lake.
This optionality is ultimately good for customers. If Apache Spark is best for your use case, then you can choose to host Spark yourself, EMR, Databricks, Cloudera, etc. If Presto is best for your use case, you can choose AWS Athena, Starburst, Ahana, etc. Once you pick the best tech for your use case, you have several vendors to compare against for the best deal.
If I want to move off Snowflake to Firebolt or some other data warehouse, I need to pay both vendors to get my data out and get my data in. Snowflake wasn't around 10 years ago, and if they are not still a good option 10 years from now, I don't want to have to pay them for the privilege to export my data out. I could rectify that by keeping all my data in a data lake, but now I'm paying to store the data twice.
Open APIs enable an open ecosystem, which encourages competition.
Databricks isn't open source, as they keep hold of all the IP that makes it much better than OS Spark. Whether you buy Snowflake or Databricks, you're buying proprietary software.
With Snowflake, data is locked away in a proprietary format not accessible by other compute platforms. You need to export/copy your data to a different system to train an ML model in Python or R.
With Databricks, you can use Python, R and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Presto and other engines that support Delta), so you are not locked into one compute engine.
This is very true. They make the lowest common denominator parts "open source" but control all of the commits. Also the query engine used for this benchmark is proprietary, closed source (Photon)
The 'open' here refers to the data. Delta lake can be read/written by multiple open source engines, not just Spark. Not to mention, if you want you can use Databricks with Parquet, though the experience won't be as good.
But with Snowflake, the data never comes out. Can't use Spark/Trino/Flink... on data in SF.
Not entirely true. There is a bi-directional Spark connector for Snowflake written by Databricks. And exporting your data in bulk out of Snowflake into any number of open formats is incredibly easy using the COPY INTO command. You can also use Snowflake on top of Parquet and even Delta Lake.
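For reference, that bulk unload looks roughly like this via the Python connector (all names are placeholders, and the S3 storage integration is assumed to already exist):

    import snowflake.connector

    # Hedged sketch of unloading a table to open-format files on S3.
    conn = snowflake.connector.connect(account="my_account", user="exporter", password="...")
    cur = conn.cursor()
    cur.execute("""
        COPY INTO 's3://my-export-bucket/events/'
        FROM analytics.public.events
        STORAGE_INTEGRATION = my_s3_integration
        FILE_FORMAT = (TYPE = PARQUET)
        HEADER = TRUE
    """)
    conn.close()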
This is the problem. Both Snowflake and Databricks are spreading FUD and otherwise smart people are falling for it.
"Exporting PB from Snowflake" is only ever relevant if you want to move from Snowflake to something else. In that case, all other migration costs (recoding, redocumenting and especially revalidating everything, if in regulated environment) are going to make any cost of data movement irrelevant.
I think it's important to understand how this kind of scenario comes up. It's unusual to want to move a whole PB at one time, and yeah in that case these other costs would come up. Problem is, the cost is more insidious than that.
Consider a scenario where data is coming in periodically, say daily, from some source: server logs, sensor data, whatever. And the user wants to train models daily on the data, and they also want to do some SQL. Maybe they ingest the data directly into SF and copy it out for training, or they do it the other way round: land it in object store and then ingest into SF. This is unlikely to be a humongous amount of data; it's probably not a PB. However, this adds up - maybe for some use cases it becomes a PB in a month, maybe in a quarter, maybe it only adds up to a PB in a year.
Thing is, without a Lakehouse architecture, the user will pay to store and copy that data multiple times (at least twice) no. matter. what. They may not pay for a PB in one shot, but you can bet that eventually they'll pay multiple times to store and copy that PB.
It's very relevant if you ever want to do serious ML or anything other than SQL. Of course Snowflake wants you to think that you never need another platform. Every customer knows that's not the case.
So if I stop paying Databricks, I can no longer use their proprietary query engine (Photon), right? I have to use something else, like open source Spark SQL, which is slower and will cost a lot more money.
There are different ways to lock customers in and both Databricks and Snowflake are playing the game.
I’m not sure this locks anyone in. The APIs are open and Spark code will run on, say EMR, just fine.
Every vendor, be it Snowflake, Databricks, EMR, Athena, BQ, … charges for use of the engine. The difference with a Lakehouse is that one doesn’t have to pay the vendor for the simple ability to use the data with another offering. That’s what you have to pay for with closed systems, whether it’s data on the way in or data on the way out.