I really wish data engineers didn't have to hand-roll incremental materialization in 2024. This is really hard stuff to get right (as the post outlines) but it is absolutely critical to keeping latency and costs down if you're going to go all in on deep, layered, fine-grained transformations (which still seems to me to be the best way to scale a large / complex analytics stack).
My prediction a few years back was that Materialize (or similar tech) would magically solve this - data teams could operate in terms of pure views and let the database engine differentiate their SQL and determine how to apply incremental (ideally streaming) updates through the view stack. While I'm in an adjacent space, I don't do this day-to-day so I'm not quite sure what's holding back adoption here - maybe in a few years more we'll get there.
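Roughly the picture I had in mind, as a sketch only (table and column names invented): each layer is declared as a plain SQL view, and the engine keeps every layer incrementally up to date as new rows arrive.

```sql
-- Illustrative only: hypothetical tables and columns.
-- The engine maintains this view incrementally as `orders` changes.
CREATE MATERIALIZED VIEW daily_order_totals AS
SELECT
    order_date,
    count(*)    AS order_count,
    sum(amount) AS total_amount
FROM orders
GROUP BY order_date;

-- Views can stack; changes propagate through the chain instead of
-- being recomputed from scratch on a schedule.
CREATE MATERIALIZED VIEW weekly_order_totals AS
SELECT
    date_trunc('week', order_date) AS order_week,
    sum(total_amount)              AS weekly_total
FROM daily_order_totals
GROUP BY 1;
```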
I wholeheartedly agree. When I worked at Shopify, we had to hand-roll our incremental data models using Spark, and the complexity of managing deep DAGs made tasks like backfilling and refactoring a huge pain. Tools like dbt and SQLMesh face similar challenges.
The chaos of existing approaches was a large part of what drove me to join Materialize. With Materialize, you can use dbt on “easy-mode”, while Materialize handles incremental logic, removing the usual headaches around processing time and keeping everything up to date within a second or two.
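To give a feel for what "easy-mode" means, a model ends up being just a SELECT. A rough sketch (model and column names invented, and the exact materialization config depends on the adapter version):

```sql
-- models/order_facts.sql (hypothetical). With the Materialize adapter this
-- is kept continuously up to date instead of being rebuilt on a schedule.
{{ config(materialized='materialized_view') }}

select
    o.order_id,
    o.ordered_at,
    sum(p.amount) as total_paid
from {{ ref('stg_orders') }} o
left join {{ ref('stg_payments') }} p on p.order_id = o.order_id
group by 1, 2
```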
Great to see dbt finally rolling out microbatch incremental models! It's a much-needed feature and a step forward for data transformation. Excited to see how this evolves and complements tools like SQLMesh. Keep up the good work!
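For anyone who hasn't tried it yet, a microbatch model looks roughly like the sketch below (dbt 1.9+, names are made up): you declare an event time, a batch size, and a start date, and dbt splits builds and backfills into per-batch queries instead of one giant hand-written incremental merge.

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='event_date',
        batch_size='day',
        begin='2024-01-01',
        lookback=3
    )
}}

-- dbt filters the upstream ref to each batch's time window for you, so
-- there is no hand-rolled is_incremental() logic here. (The upstream
-- stg_events model needs its own event_time config for that filter to apply.)
select
    date_trunc('day', occurred_at) as event_date,
    user_id,
    count(*) as daily_events
from {{ ref('stg_events') }}
group by 1, 2
```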
Is anyone using SQLMesh in production? I love “lessons learned” tools which have the opportunity to improve core design after seeing the weak points of the initial product in the space. That being said, I hate being an early adopter, so will let others determine if the new tool has an entirely novel set of shortcomings vs dbt.
They're using it for data transformation. They're longtime dbt users, but are switching to SQLMesh because it's extremely efficient, provides a better development experience, and can help them become warehouse-agnostic.
Side question: For ETL development based on PostGIS and pgRouting, which tool is more suitable? In other words, which one is easier to work with for geometric data? (e.g., visual test case data display, etc.)
Basically, people are constantly calculating metrics based on existing tables. Think something as simple as a moving average or the sum of two separate columns in a table. Once upon a time, you would set up a cron job that populated these every day with a SQL query in some Python or Perl script.
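Something like the sketch below, run out of cron every night (all names invented):

```sql
-- Nightly full rebuild of a small metrics table with a 7-day moving
-- average of revenue. Table and column names are hypothetical.
drop table if exists daily_revenue_metrics;

create table daily_revenue_metrics as
select
    order_date,
    sum(amount) as revenue,
    avg(sum(amount)) over (
        order by order_date
        rows between 6 preceding and current row
    ) as revenue_7d_avg
from orders
group by order_date;
```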
dbt introduced a language for managing these "metrics" at scale, including the ability to use variables and more complex templates (Jinja).
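A rough sketch of what that same nightly job looks like as a dbt model (hypothetical names, referencing an invented upstream `daily_revenue` model, with the window size pulled from a project variable instead of being hard-coded):

```sql
-- models/daily_revenue_metrics.sql (hypothetical)
{{ config(materialized='table') }}

select
    order_date,
    revenue,
    avg(revenue) over (
        order by order_date
        rows between {{ var('moving_avg_days', 7) - 1 }} preceding and current row
    ) as revenue_moving_avg
from {{ ref('daily_revenue') }}
```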
More broadly, dbt did two other things: 1. It pushed the paradigm from ETL to ELT (so stick all the data in your warehouse and then transform it, rather than transforming it at extraction time.) 2. It created the concept of an "analytics engineer" (previously known as the guy who knows SQL, or the business analyst.)
I built the first half of my career as "a guy who knows SQL" (and Excel macros but I digress). I then rode the early wave of Analytics Engineering.
dbt is kinda like Vite (dbt = data build tool) for folks working with data warehouses. Their biggest contribution was a mindset shift that applied principles of the SDLC to the traditional BI/Analytics space.
Almost overnight, analysts went from building business logic in GUIs like Talend or Tableau to writing code-based models (SQL) checked into git repos. It took what Looker was doing with LookML and generalized it across the BI stack.
This shift (+ associated tooling) resulted in less brittle data pipelines, increased uptime for dashboards/reporting, and more sanity when working with more than 2-3 people in a data environment.
Imagine a situation where you're at an e-commerce company and need to reconcile orders from Woocommerce with shipments in ShipStation, returns from tickets in HubSpot, and refunds issued in Stripe. dbt simplifies the management of the relationships between these various systems.
Based on this, you can build data models that allow you and, increasingly, your business stakeholders to answer questions like "Which SKUs have seen an uptick in refunds due to reason X this quarter?" and "Where were they shipped?"
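Under the hood that usually looks something like the sketch below: one staging model per source system, joined up in a mart (all model and column names here are hypothetical).

```sql
-- models/marts/fct_order_refunds.sql (hypothetical)
with orders as (
    select order_id, sku, ordered_at
    from {{ ref('stg_woocommerce__orders') }}
),

shipments as (
    select order_id, shipped_to_region, shipped_at
    from {{ ref('stg_shipstation__shipments') }}
),

refunds as (
    select order_id, refund_reason, refund_amount
    from {{ ref('stg_stripe__refunds') }}
)

select
    o.order_id,
    o.sku,
    s.shipped_to_region,
    r.refund_reason,
    r.refund_amount
from orders o
left join shipments s on s.order_id = o.order_id
left join refunds r on r.order_id = o.order_id
```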
The benefit of having standard abstractions means you can build metrics on top of the models as [gkapur](https://news.ycombinator.com/item?id=41853925) mentions, such that "revenue" is the same when marketing pulls it for calculating CAC as when finance pulls it for their monthly reports, etc.
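Illustratively (names invented), "revenue" gets defined exactly once as a model, and everything downstream refs it rather than re-deriving it from raw tables:

```sql
-- models/marts/fct_revenue.sql (hypothetical): the single shared definition.
select
    order_date,
    sum(amount - coalesce(refund_amount, 0)) as net_revenue
from {{ ref('int_orders_with_refunds') }}
group by 1
```

Marketing's CAC model and finance's monthly reporting model both start from `ref('fct_revenue')`, so a change to the definition lands everywhere at once.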