
Maybe unrelated but Databricks is the most annoying garbage I have ever had to use. It fascinates me how anyone uses it by choice.


Databricks started in 2013 when Spark sucked (it still does) and they aimed to make it better / faster (which they do).

The product is still centered on Spark, but most companies don't want or need Spark; a combination of Iceberg and DuckDB will work for 95% of them. It's cheaper, just as fast or faster, and way easier to reason about.

We're building a data platform around that premise at Definite[0]. It includes everything you need to get started with data (ETL, BI, data lake).

0 - https://www.definite.app/


Aren't the alternatives you mentioned, Iceberg and DuckDB, both storage solutions, while Spark is a way to express distributed compute? I'm a bit out of touch with this space; is there a newer way to express distributed compute?


DuckDB is primarily a query engine. It does have a storage format, but one of its strengths is querying data where it already resides (e.g. a Parquet file sitting in S3).

There are some examples[0] of enabling DuckDB to manage distributed workloads, but these are pretty experimental.

0 - https://www.definite.app/blog/smallpond


Thanks for the pointers!


I think what many people are finding out is they don’t really need distributed processing. DuckDB on a single node can get you really far, and it’s much simpler.


DuckDB is not only a storage solution. It can directly query a variety of file formats at rest, without having to re-store anything. That's one of its selling points: you can query across archival/log data stored in S3 (or wherever) without needing to "ingest" anything or double-pay to duplicate the data you've already stored.


I’m just getting into DuckDB lately and finding this feature so exciting. It’s a totally new paradigm. Such a great tool for scientists, and probably many other people. I wish I’d taken it seriously sooner.


Not a new way like Ray, but a new way to express Spark super-efficiently (GPU-acceleration): https://news.ycombinator.com/item?id=43964505


Flink. It has more momentum than Spark right now.


"momentum" is a tricky word. Zig has more momentum than C++, but will it ever overtake the language? I'd bet not.


Well, it's not a tricky word, it's just the wrong one. Velocity, maybe. Or more probably acceleration.


Flink is designed streaming-first, while Spark is built batch-first, and you're likely best off selecting accordingly, though any streaming application likely needs batch processing to some degree. Latency vs. throughput.


Databricks is the Jira of dealing with data. No one wants to use it, it sucks, it has too many features trying to appease all possible users (none of them particularly good), and there are substantially better options now than there were not long ago. I would never, ever use it by choice.


What options do you use? I don't work for Databricks but I am building my own data infra startup, so I'd like to hear what "good" looks like!


Eh, you don’t even need to go through all the trouble of building a startup. IMO Neon was interesting and filled a niche while open source solutions were still gaining maturity and adoption. Now they have, as the lots and lots of recommendations in this comment section show, so my sense is that building a startup now would be reinventing the Neon wheel, just too late. Perhaps, depending on licensing, running the OSS as a service is viable.


Oh, my startup isn't about Postgres, but rather a GPU-accelerated Spark: https://news.ycombinator.com/item?id=43964505

What are some bad UX choices you generally dislike in data products?


If you come up with the answers to some of these questions, I'd definitely read those blog articles on how you came to those conclusions. Keep asking interesting questions! Cheers


Which questions are you referring to specifically?


Really hard disagree. Coming from Hadoop, Databricks is utopia. It's stable, fast, and scales really well if you have massive datasets.

The biggest gripe I have is how crazy expensive it is.


Spark was a really big step up from hadoop.

But these days, just use Trino or whatever. There are lots of newer ways to work on data that are all bigger steps up over Spark (ergonomically, in performance, and in price) than Spark was over Hadoop.


The nice thing about spark is the scala/python/R APIs. That helps to avoid lots of the irritating things about SQL (the same transformation applied to multiple columns is a big one).


I really can't speak highly enough of Trino (though I used it as AWS Athena, back when Trino was still called Presto). It's impressive how well it took an "ever-growing pile of CSV/JSON/Excel/Parquet/whatever" and let you query it via SQL as-is, without transforming it and putting it into some other system.

What an impressive feat of engineering.


Hadoop was fundamentally a batch processing system for large data files; it was never intended for the sort of online reporting and analytics workloads the data warehouse concept addressed. No amount of Pig and Hive and HBase and subsequent tools layered on top of it could ever change that basic fact.


If cost (or perf) is the issue, we're building a super-efficient, GPU-accelerated, easy-to-use Spark: https://news.ycombinator.com/item?id=43964505


I used to be a big fan of the platform because back in 2020 / 2021 it really was the only reasonable choice compared to AWS / Azure / Snowflake for building data platforms.

Today it suffers from feature creep and too many pivots & acquisitions. That they are insanely bad at naming features doesn't help either.


I’d settle for only one bad name per feature from them. Alas, they don’t feel so limited.


I'm building another Spark-based choice now with ParaQuery (GPU-accelerated Spark): https://news.ycombinator.com/item?id=43964505


Is hosting Spark really that groundbreaking? Also, isn't Spark kind of too complicated for 90% of enterprisey data processing?

I really don't understand the valuation for this company. Why is it so high?


Yes, Spark is too complicated for most cases.

But if you're inclined to use it, Databricks' setup of Spark saves you an incredible amount of time that you'd otherwise waste on configuration and wiring infrastructure (storage, compute, pipelines, unified access, VPNs, etc.). It's expensive and opinionated, but the cost of the extra data engineers you'd need to deal with constant Spark OOM errors is greater. Databricks' default configs also give you MUCH better performance out of the box than anything DIY, and you don't have to fiddle with partitions and super-niche config options to get even medium workloads stable.
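For flavor, this is the kind of spark-defaults.conf tuning a DIY deployment tends to force on you (illustrative values only, not recommendations; the keys are standard Spark settings):

```
# spark-defaults.conf -- hand-tuned per workload on a DIY cluster
spark.sql.shuffle.partitions      200
spark.executor.memory             8g
spark.executor.memoryOverhead     2g
spark.memory.fraction             0.6
spark.sql.adaptive.enabled        true
```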


The market for IBM-like software and platforms ("everyone else uses this, it must be good!") apparently wasn't saturated yet.


They push Serverless so hard but there are SO MANY limitations and surprise gotchas. It's driving me absolutely insane.


And it tends to be notably more expensive! 4-5x the price for fewer features...


the new cost-optimized mode is very promising, though


Hey, what are the most painful limitations/gotchas you're hitting? I'm on this team and would like to hear about pain points.


To list off a few:

* No persist(). Not being able to cache dataframes is a nightmare in workflows that take a massive source of data, do some rough filtering to get it down to a tiny subset, and then do more complex stuff with that subset.

* No good way to get usage info programmatically that I've found. For things like monitoring for periodic queries that get out of hand.

* Can't set Spark config. There are often ways around this, like when I recently had to set S3A credentials and needed a way that wasn't OS environment variables (those don't reach worker nodes). Eventually, through much documentation browsing and finally an exasperated hail-mary question to ChatGPT (which told me the things to pass into options()), I got it working. But all the documentation and online Q&A resources just say to use Spark config.

* This is more of a Unity Catalog problem, but it applies here because Serverless and UC often go hand in hand (particularly for things that used to be stored on a cluster, like credentials): it drives me insane that I can only mount external volumes backed by the same block storage as my workspace provider. So I can't mount an external volume pointing at an AWS bucket on an Azure UC. That means if I want to write stuff that runs the same regardless of what my customer runs their Databricks workspace on, I have to fall back to less sophisticated approaches.

It's still nowhere near the pain that Databricks' attempt at copying Snowflake's VARIANT data type has caused me, but I often find myself having to work around Serverless limitations, especially since these limitations aren't really mentioned upfront when Databricks pushes Serverless so aggressively.


TBH it's really quite boring. You just have to go back in time to the late 2010s. They had an excellent Spark-as-a-Service product, at a time when you'd have better luck finding a leprechaun than a reliable self-hosted Spark instance in an enterprise environment. That was simply beyond the capabilities of most enterprise IT teams at the time. The first-party offerings from the hyperscalers were relatively spartan.

Databricks' proprietary notebook format that introduced subtle incompatibilities with Jupyter was infuriating embrace-extend-extinguish style bullshit, but on-prem cluster instability causing jobs to crash on a daily basis was way more infuriating, and at that time, enterprises were more than happy to pay a premium to accelerate analytics teams.

In the 2010s, Databricks had a solid billion-dollar business. But Spark-as-a-Service by itself was never going to be a unicorn idea. AWS EMR was the giant tortoise lurking in the background, slowly but surely closing the gap. The status quo couldn't hold, and who doesn't want to be a unicorn? So, they bloated the hell out of the product, drank that off-brand growth-hacker Kool-Aid, and started spewing some of the most incoherent buzz-word salad to ever come out of the Left Coast. Just slapping data, lake, and house onto the ends of everything, like it was baby oil at a Diddy Party.

Now, here we are in 2025, deep into the terminal decline of enshittification, and they're just rotting away, waiting for One Real Asshole Called Larry Ellison to scoop them up and take them straight to Hell. The State of Florida, but for Big Data companies.

It would be a mystery to me too, why anyone would pick Databricks today for a greenfield project, but those enterprises from 5+ years ago are locked in hard now. They'll squeeze those whales and they'll shit money like a golden goose for a few more years, but their market share will steadily decrease over the next few years.

It's the cycle of life. Entropy always wins. Eventually the Grim Reaper Larry comes for us all. I wouldn't hate on them too hard. They had a pretty solid run.


With cookies disabled I get a blank website, which is a massive red flag and an immediate nope from me.

Can't imagine someone incapable of building a website would deliver a good (digital) product.


They did build a website though. It even looks pretty nice. The restriction you've placed on yourself just prevents you from viewing it.


But.. but.... we MUST track you! That's the whole purpose of our site /s



