
Maybe unrelated but Databricks is the most annoying garbage I have ever had to use. It fascinates me how anyone uses it by choice.


Databricks started in 2013 when Spark sucked (it still does) and they aimed to make it better / faster (which they do).

The product is still centered on Spark, but most companies don't want or need Spark; a combination of Iceberg and DuckDB will work for 95% of them. It's cheaper, just as fast or faster, and way easier to reason about.

We're building a data platform around that premise at Definite[0]. It includes everything you need to get started with data (ETL, BI, data lake).

0 - https://www.definite.app/


Aren't the alternatives you mentioned, Iceberg and DuckDB, both storage solutions, while Spark is a way to express distributed compute? I'm a bit out of touch with this space; is there a newer way to express distributed compute?


DuckDB is primarily a query engine. It does have a storage format, but one of its strengths is querying data where it already resides (e.g. a Parquet file sitting in S3).

There are some examples[0] of enabling DuckDB to manage distributed workloads, but these are pretty experimental.

0 - https://www.definite.app/blog/smallpond


Thanks for the pointers!


I think what many people are finding out is they don’t really need distributed processing. DuckDB on a single node can get you really far, and it’s much simpler.


DuckDB is not only a storage solution. It can directly query a variety of file formats at rest, without having to re-store anything. That's one of its selling points: you can query across archival/log data stored in S3 (or wherever) without needing to "ingest" anything or double-pay to duplicate the data you've already stored.


I’m just getting into DuckDB lately and finding this feature so exciting. It’s a totally new paradigm. Such a great tool for scientists, and probably many other people. I wish I’d taken it seriously sooner.


Not a new way like Ray, but a new way to express Spark super-efficiently (GPU-acceleration): https://news.ycombinator.com/item?id=43964505


Flink. It has more momentum than Spark right now.


"momentum" is a tricky word. Zig has more momentum than C++, but will it ever overtake the language? I'd bet not.


Well, it's not a tricky word, it's just the wrong one. Velocity, maybe. Or more probably acceleration.


Flink is designed streaming-first, while Spark is built batch-first, and you're likely best off selecting accordingly, though any streaming application likely needs batch processing to some degree. Latency vs. throughput.


Databricks is the Jira of dealing with data. No one wants to use it, it sucks, it has too many features trying to appease all possible users (none of them particularly good), and there are substantially better options now than there were not long ago. I would never, ever use it by choice.


What options do you use? I don't work for Databricks but I am building my own data infra startup, so I'd like to hear what "good" looks like!


Eh, you don’t even need to go through all the trouble of building a startup. IMO Neon was interesting and filled a niche while open source solutions were still gaining maturity and adoption. Now they have, as the lots and lots of recommendations in this comment section show, so my sense is that building a startup now would be reinventing the Neon wheel, just too late. Perhaps, depending on licensing, running the OSS as a service is viable.


Oh, my startup isn't about Postgres, but rather a GPU-accelerated Spark: https://news.ycombinator.com/item?id=43964505

What are some bad UX choices you generally dislike in data products?


If you come up with the answers to some of these questions, I'd definitely read those blog articles on how you came to those conclusions. Keep asking interesting questions! Cheers


Which questions are you referring to specifically?


Really hard disagree. Coming from Hadoop, Databricks is utopia. It's stable, fast, and scales really well if you have massive datasets.

The biggest gripe I have is how crazy expensive it is.


Spark was a really big step up from hadoop.

But these days, just use Trino or whatever. There are lots of newer ways to work on data that are all bigger steps up over Spark (ergonomically, in performance, and in price) than Spark was over Hadoop.


The nice thing about spark is the scala/python/R APIs. That helps to avoid lots of the irritating things about SQL (the same transformation applied to multiple columns is a big one).


I really can't speak highly enough of Trino (though I used it as AWS Athena, back when Trino was still called Presto). It's impressive how well it took an "ever-growing pile of CSV/JSON/Excel/Parquet/whatever" and let you query it via SQL as-is, without transforming it and putting it into some other system.

What an impressive feat of engineering.


Hadoop was fundamentally a batch processing system for large data files; it was never intended for the sort of online reporting and analytics workloads the data warehouse concept addressed. No amount of Pig and Hive and HBase and subsequent tools layered on top of it could ever change that basic fact.


If cost (or perf) is the issue, we're building a super-efficient, GPU-accelerated, easy-to-use Spark: https://news.ycombinator.com/item?id=43964505


I used to be a big fan of the platform because back in 2020 / 2021 it really was the only reasonable choice compared to AWS / Azure / Snowflake for building data platforms.

Today it suffers from feature creep and too many pivots & acquisitions. That they are insanely bad at naming features doesn't help either.


I’d settle for only one bad name per feature from them. Alas, they don’t feel so limited.


I'm building another Spark-based choice now with ParaQuery (GPU-accelerated Spark): https://news.ycombinator.com/item?id=43964505


Is hosting Spark really that groundbreaking? Also, isn't Spark kind of too complicated for 90% of enterprisey data processing?

I really don't understand the valuation for this company. Why is it so high?


Yes, Spark is too complicated for most cases.

But if you're inclined to use it, Databricks' setup of Spark saves you an incredible amount of time that you'd otherwise waste on configuration and wiring infrastructure (storage, compute, pipelines, unified access, VPNs, etc.). It's expensive and opinionated, but the cost of the extra data engineers you'd need to deal with constant Spark OOM errors is greater. Databricks' default configs also give you MUCH better performance out of the box than anything DIY, and you don't have to fiddle with partitions and super-niche config options to get even medium workloads stable.
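For flavor, this is the kind of spark-defaults.conf tuning a DIY deployment tends to force on you (illustrative values only, not recommendations; the keys are standard Spark settings):

```
# spark-defaults.conf -- hand-tuned per workload on a DIY cluster
spark.sql.shuffle.partitions      200
spark.executor.memory             8g
spark.executor.memoryOverhead     2g
spark.memory.fraction             0.6
spark.sql.adaptive.enabled        true
```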


The market for IBM-like software and platforms ("everyone else uses this, it must be good!") apparently wasn't saturated yet.


They push Serverless so hard but there are SO MANY limitations and surprise gotchas. It's driving me absolutely insane.


And it tends to be notably more expensive! 4-5x the price for fewer features...


the new cost-optimized mode is very promising, though


Hey, what are the most painful limitations/gotchas you're hitting? I'm on this team and would like to hear about pain points.


To list off a few:

* No persist(). Not being able to cache dataframes is a nightmare in workflows that take a massive source of data, do some rough filtering to get it down to a tiny subset, and then do more complex stuff with that subset.

* No good way to get usage info programmatically that I've found. For things like monitoring for periodic queries that get out of hand.

* Can't set Spark config. There are often ways around this, like when I recently had to set S3A credentials and needed a way that wasn't OS environment variables (those don't reach worker nodes). Eventually, through much documentation browsing and finally an exasperated hail-mary question to ChatGPT (which told me the things to pass into options()), I got it working. But all the documentation and online Q&A resources just say to use Spark config.

* This is more of a Unity Catalog problem, but it applies here because Serverless and UC often go hand in hand (particularly for things that used to be stored on a cluster, like credentials): it drives me insane that I can only mount external volumes backed by the same block storage as my workspace provider. So I can't mount an external volume pointing at an AWS bucket on an Azure UC. That means if I want to write stuff that runs the same regardless of what my customer runs their Databricks workspace on, I have to fall back to less sophisticated approaches.

It's still nowhere near the pain that Databricks' attempt at copying Snowflake's VARIANT data type has caused me, but I often find myself having to work around Serverless limitations, especially since these limitations aren't really mentioned upfront when Databricks pushes Serverless so aggressively.


TBH it's really quite boring. You just have to go back in time to the late 2010s. They had an excellent Spark-as-a-Service product, at a time when you'd have better luck finding a leprechaun than a reliable self-hosted Spark instance in an enterprise environment. That was simply beyond the capabilities of most enterprise IT teams at the time. The first-party offerings from the hyperscalers were relatively spartan.

Databricks' proprietary notebook format that introduced subtle incompatibilities with Jupyter was infuriating embrace-extend-extinguish style bullshit, but on-prem cluster instability causing jobs to crash on a daily basis was way more infuriating, and at that time, enterprises were more than happy to pay a premium to accelerate analytics teams.

In the 2010s, Databricks had a solid billion-dollar business. But Spark-as-a-Service by itself was never going to be a unicorn idea. AWS EMR was the giant tortoise lurking in the background, slowly but surely closing the gap. The status quo couldn't hold, and who doesn't want to be a unicorn? So, they bloated the hell out of the product, drank that off-brand growth-hacker Kool-Aid, and started spewing some of the most incoherent buzz-word salad to ever come out of the Left Coast. Just slapping data, lake, and house onto the ends of everything, like it was baby oil at a Diddy Party.

Now, here we are in 2025, deep into the terminal decline of enshittification, and they're just rotting away, waiting for One Real Asshole Called Larry Ellison to scoop them up and take them straight to Hell. The State of Florida, but for Big Data companies.

It would be a mystery to me too, why anyone would pick Databricks today for a greenfield project, but those enterprises from 5+ years ago are locked in hard now. They'll squeeze those whales and they'll shit money like a golden goose for a few more years, but their market share will steadily decrease over the next few years.

It's the cycle of life. Entropy always wins. Eventually the Grim Reaper Larry comes for us all. I wouldn't hate on them too hard. They had a pretty solid run.


With cookies disabled I get a blank website, which is a massive red flag and an immediate nope from me.

Can't imagine someone incapable of building a website would deliver a good (digital) product.


They did build a website though. It even looks pretty nice. The restriction you've placed on yourself just prevents you from viewing it.


But.. but.... we MUST track you! That's the whole purpose of our site /s



