Data engineer here. Off-topic, but am I the only one tired of Databricks shilling their tools as the be-all and end-all solution for all things data engineering?
Lord no! I'm a data engineer too and feel the same. The part I find most maddening is how little of it seems like a sincere attempt to provide value.
Things Databricks offers that make people's lives easier:
- Out-of-the-box Kubernetes with no setup
- Preconfigured spark
Those are genuinely really useful, but then there's all this extra stuff that makes people's lives worse or drives bad practice:
- Everything is a notebook
- Local development is discouraged
- Version pinning of libraries has very ugly/bad support
- Clusters take 5 minutes to load even if you just want to "print('hello world')"
Sigh! I worked at a company that was Databricks-heavy and am still suffering PTSD. Sorry for the rant.
A lot of that changed quite a while ago: not everything is a notebook, local dev is fully supported, version pinning wasn't a problem, cluster startup time depends heavily on the underlying cloud provider, and serverless notebooks/jobs are coming.
Data scientist here who's also tired of the tools. We put so much effort into educating the DSes in our company to get away from notebooks and use IDEs like VS or RStudio, and Databricks has been a step backwards because we didn't get the integrated version.
I'm a data scientist and I agree that work meant to last should be in a source-controlled project coded via a text editor or IDE. But sometimes it's extremely useful to get -- and iterate on -- immediate results. There's no good way to do that without either notebooks or at least a REPL.
Thank you! I am so tired of all those unmaintainable, undebuggable notebooks.
Years ago, Databricks had a specific page in their documentation stating that notebooks were not for production-grade software. It has been removed. And now you have a ChatGPT-like assistant in their notebooks... What a step backwards.
How can all those developers be so happy without the bare minimum tools to diagnose their code? And I am not even talking about unit testing here.
It's less about notebooks and more about SDLC practices. Notebooks may encourage writing throwaway code, but if you split your code correctly, you can still do unit testing, write modular code, etc. And the ability to use "arbitrary files" as Python packages has existed for quite a while, so you can get the best of both worlds: quick iteration, plus the ability to package your code as a wheel and distribute it.
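To make the "split code correctly" point concrete, here's a rough sketch, not anything Databricks-specific. The file names and the `total_by_customer` function are made up for illustration; the idea is just that the logic lives in a plain Python module the notebook imports, so it can be tested without a cluster.

```python
# transforms.py (hypothetical module): plain Python, no notebook state needed
from collections import defaultdict


def total_by_customer(rows):
    """Sum the 'amount' field per 'customer' across an iterable of dicts."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


# test_transforms.py (hypothetical): runs under pytest or as a plain script
def test_total_by_customer():
    rows = [
        {"customer": "a", "amount": 10.0},
        {"customer": "b", "amount": 5.0},
        {"customer": "a", "amount": 2.5},
    ]
    assert total_by_customer(rows) == {"a": 12.5, "b": 5.0}


if __name__ == "__main__":
    test_total_by_customer()
    print("ok")
```

The notebook then only needs to import the function and call it, and the same module can go into a wheel later.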