Some people have noted that this is a very Airbyte specific article, but I think that the lessons learned are still important.
I have managed Airflow as a managed service for a company that has thousands of DAGs and one of our keys to success was splitting the compute and scheduling concepts into different components. We standardized on where our compute ran (Databricks, Spark, Lambdas, or K8s jobs) and had Airflow purely as a scheduler/orchestration tool.
Scaling your Airflow worker nodes to handle big-data scale transformations/extractions is a pain. Especially attempting to support customers who want to run larger and larger jobs. Splitting these concepts allowed for us to prevent noisy neighbor issues, Airflow as a component had high reliability for all of our customers, and we prevented the need for M * N operators.
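For what it's worth, a rough sketch of what "Airflow purely as a scheduler" can look like, assuming the Databricks provider package is installed and a databricks_default connection exists; the cluster spec and notebook path here are made up:

```python
# Hypothetical DAG: Airflow only schedules; the heavy lifting runs on Databricks.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="offload_compute_to_databricks",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_transform = DatabricksSubmitRunOperator(
        task_id="run_transform",
        databricks_conn_id="databricks_default",  # assumed connection name
        json={
            "new_cluster": {
                "spark_version": "7.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
            "notebook_task": {"notebook_path": "/Jobs/transform_events"},
        },
    )
```

The Airflow worker only waits on the Databricks API, so it stays small no matter how big the job is.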
In my org we never use Airflow to compute anything. We only use a single operator: PythonOperator.
All business and data access logic is encapsulated in REST APIs. Our DAGs are responsible for calling those APIs in the correct order, at the correct time, retrying when needed. I really dislike using Airflow for anything else, as it generally becomes a huge mess.
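To make that concrete, a minimal sketch of the pattern (the service URL and endpoints are made up; only the Airflow/PythonOperator pieces are real):

```python
# Hypothetical DAG: every task just calls a REST API that owns the business logic;
# Airflow only handles ordering, scheduling and retries.
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def trigger_job(endpoint: str) -> None:
    resp = requests.post(f"https://jobs.internal.example.com/{endpoint}", timeout=60)
    resp.raise_for_status()  # let Airflow's retry machinery handle failures


with DAG(
    dag_id="orchestrate_via_rest",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(
        task_id="extract",
        python_callable=trigger_job,
        op_kwargs={"endpoint": "extract/run"},
    )
    transform = PythonOperator(
        task_id="transform",
        python_callable=trigger_job,
        op_kwargs={"endpoint": "transform/run"},
    )
    extract >> transform
```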
This is the first time I've read about Prefect. Can you briefly share why you would use Prefect over Airflow if using only Python? (That's what I am about to be doing soon...)
Have been a Prefect user since their first release. In a nutshell, it's a native pythonic DAG tool, where Airflow is a DAG tool that happens to be written in Python.
How best to achieve this? I'm considering Astronomer or AWS hosted airflow, with a benefit to AWS having the compute components easily accessible within the AWS ecosystem.
Starting out with a smaller scale and lower commitment to Airflow, so I'd like highest reliability with least hassle.
Amazon's Airflow offering sucks really hard. As in many services, they did an awful job of designing a good UX and hiding the product's complexity from their users.
Astronomer is cool, but expensive, and they won't accept monthly billing. Also, it's even more expensive if you need DAGs to access things within a VPC.
Step Functions is decent for simple use cases and very reliable. For complex stuff, you will hate their JSON based DSL with passion.
At Grove we make extensive use of KubernetesPodOperator[0]. This allows us both to encapsulate each job in a container (k8s pod) and to specify the resources we need for each DAG.
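Roughly what one of those tasks looks like (image name, namespace and resource numbers are invented; note the resources argument has changed names across provider versions):

```python
# Hypothetical task: the ETL code lives in its own container image, and each
# DAG declares the CPU/memory it actually needs.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

with DAG(
    dag_id="etl_in_pods",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    heavy_extract = KubernetesPodOperator(
        task_id="heavy_extract",
        name="heavy-extract",
        namespace="data-jobs",
        image="registry.example.com/etl/extract:latest",
        # older provider versions take a plain `resources` dict instead
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "2", "memory": "8Gi"},
            limits={"cpu": "4", "memory": "16Gi"},
        ),
    )
```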
It depends on what other tools you are currently using for ETL (or want to use). One example is that we used Spark so we would use the Spark submit operator to submit jobs to clusters. You can also use the K8s Pod operator if you want to utilize containers for your compute.
There are a lot of options. We were adopters before AWS hosted Airflow was a thing, so I don't have any experience running AWS hosted Airflow.
I haven't looked recently to see if some of the challenges we faced early on are solved now, but most of them stemmed from how DAG updates were handled: changing the start date on a DAG would break your DAG until you went and updated the database by hand. Things like this are(/were?) super painful and could get worse with a managed solution.
If you are on AWS, then Step Functions could be your best option. They are responsible for the workflow definition, and the compute load is completely dependent on the underlying tasks and AWS services you are going to use.
I have worked with Airflow during the past three years, but recently we adopted Dagster and I have been using it for the past 3 months. I have found it quite joyful to use and the experience has been very positive. Its main advantages compared to Airflow (IMO):
- A great UI
- It forces you to clearly define inputs, outputs and types.
- Separation of concerns: Between configuration and data, between processing and IO, and between code and deployment options.
- It allows you to define flexible DAGs which you can configure at runtime, which makes it easy to run locally or in k8s, or to switch the storage backend depending on the environment (rough sketch below).
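A minimal sketch of what that looks like with the newer op/job API (we may have been on an older Dagster API; the ops here are placeholders):

```python
# Hypothetical Dagster job: typed inputs/outputs between ops, runnable locally
# without any infrastructure.
from dagster import job, op


@op
def extract() -> list:
    # pretend we pull rows from a source
    return [{"id": 1}, {"id": 2}]


@op
def load(rows: list) -> None:
    # pretend we write rows to the warehouse
    print(f"loaded {len(rows)} rows")


@job
def elt_job():
    load(extract())


if __name__ == "__main__":
    elt_job.execute_in_process()  # local run; in prod the same job runs on k8s
```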
[author of the article] My main concern about using Airflow for the EL parts is that sources and destinations are highly coupled with Airflow transfer operators (e.g. PostgresToBigQueryOperator). The community needs to provide M * N operators to cover all possible transfers. Other open-source projects like Airbyte, decouple sources from destinations, so the community only needs to contribute 2 * (M + N) connectors.
Another concern about using Airflow for the T part is that you need to code the dependencies between models both in your SQL files and your Airflow DAG. Other open-source projects like dbt create a DAG from the model dependencies in the SQL files.
So I advocate for integrating Airflow scheduler with Airbyte and dbt.
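A rough sketch of what that integration can look like as a DAG, assuming the Airbyte provider package is installed (the connection ids, sync UUID and dbt project path are made up):

```python
# Hypothetical DAG: Airbyte does EL into the warehouse, dbt does T inside it,
# Airflow only sequences the two.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="elt_airbyte_dbt",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",
        connection_id="00000000-0000-0000-0000-000000000000",  # placeholder UUID
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/project && dbt run",
    )
    sync >> dbt_run
```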
Curious to know how other use Airflow for ETL/ELT pipelines?
Is that really that bad of a problem? In the worst case, you use the underlying hooks to create a custom operator and in the worst worst case you do something like have a Spark/Dataflow/k8s pod/whatever other single threaded or distributed task runner act as that custom operator.
I'm running into the "incremental load" problem now trying to batch intervals of Avro files into BigQuery, but Airflow doesn't seem to complain much about 10 minute intervals (less than that and I'd be a bit worried). One workaround we're considering, instead of having 144 DAG runs per DAG per day at 10-minute intervals, is to just define it as a daily DAG and periodically clear it throughout the day.
I have heard of newer companies who don't need to deal with legacy ETL going the ELT route which is interesting to me. I'm curious who owns the transformation in that case. Do you just hire more data scientists or maybe a bunch of junior devs/analysts/DSs to own data cleaning and transformation? The way our transformation works is that the engineering team that owns the application data owns the transformation so they can document exactly what the data going into the BI tool is. It adds overhead on the engineering side, but not nearly as much as throwing the data in a lake and hoping someone downstream knows what to do with it.
In a previous job we/I trained the analysts to be more technical and write the T part of ELT in DBT. They effectively became what is known as "Analytics Engineers" so they owned the T and then wrote their analysis on top of the models they had created.
That works for ELT, especially if you have documentation around the raw data being loaded in but sounds like it adds a bit of overhead to the analysts' jobs which may or may not be more than just having the engineering team own it and document it well (something they already have to do for the analysts to write transformation code). I'm curious how you handle the upstream data schema changing. Loading in raw data means handling compatibility in another place outside the application.
Not if it's just a part of those engineers' jobs. They're already familiar with the underlying application data so owning the transformation is just understanding what the data needs to look like and documenting it. They're going to need to document the raw data anyway to avoid those analysts asking them a million questions. Might as well avoid hiring analysts who can also learn the transformation bit and just give them good data.
I think we've worked in very different jobs. In my case the analysts had a good idea of the underlying application data and often worked closely with both data engineering and regular engineering to understand it so they could make better analyses. They were quite competent in their own right, otherwise I wouldn't have given them control over the T, which was a net benefit to my life in the form of reduced work.
Exactly. Analysts are always a step behind engineers when it comes to really understanding what the data means and what changes are coming down the line. This always results in delays, broken pipelines etc.
Modern tools like dbt make it easy for data-producing teams to also own the T part.
Really good points! I don't think that Airflow is necessarily a problem if your data engineering team knows how to best use Airflow Operators, Hooks and DAGs for incremental loads. But because Airflow is not an opinionated ETL/ELT tool, most often I see a lot of custom code that could be improved...
You know there is this "data mesh" hype now. I think the idea behind it is to empower data consumers within the company (data analysts), who know the data best, to create and maintain the models. That's easier said than done, and most often it turns into a worse situation than when only data engineers can model data... I've only heard of Zalando successfully distributing data ownership within the company.
Yea, I wasn't familiar with Airbyte before writing that comment so now I'm seeing the value in it. We have tons of teams asking "how do I get this data into BigQuery" and the answer is usually "use this airflow operator to dump it into GCS and then use this airflow operator to load it into BigQuery" which isn't super useful for a non-technical person or even really any technical person not familiar with Airflow.
A mesh is certainly something in-between a lake and a warehouse... Something super simple that I've gotten good feedback on so far from DSs is just documenting the transformed data in place. It was really difficult to do this in our old ETL stack (data pulled from HBase, transformed to parquet + Hive in HDFS) but we've moved a lot of it over to Avro files loaded into BigQuery where we can just put decorators on our Scala transformation code that's writing the Avro files and that updates the schema with descriptions in BigQuery. Gives a nice bit of ownership to the engineering team and lets the DS using the data be a lot more autonomous. That boundary has to exist somewhere (or I guess in many places for a "mesh") so having it distinctly at data getting loaded in feels right to me.
Nice work there! I also think that the next challenge for data teams is all this data documentation and discovery work.
I still think that Airflow is great for power data engineers. Airbyte and dbt are positioned to empower data analysts (or lazy data engineers like me) to own the ELTs.
Agreed. I see a lot of folks coming up with one off solutions for pulling data out of 3rd party sources like Kustomer or Lever. Giving a centralized UI for setting that up would be a great service.
I mostly don't bother writing separate Operators. The only parts I write are the so-called Hooks (which are basically just Airflow's way of defining a standard way of grabbing credentials and instantiating a session object).
After that you just write a short python function that grabs the data from one hook and pushes it to another. Which is basically the (M + N) solution you mention (I think the factor 2 is unnecessary if you've already split sources and sinks).
This approach works with anything you can connect to Python. Though for particularly large datasets you want to be careful that you don't accidentally store all the data in memory at once. And sure, you can sometimes specialize an operation for a particular use case (e.g. if in your example you can instruct BigQuery to connect to the Postgres database natively), but usually it works just fine to use a Python script in between.
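Roughly what one of those functions looks like (the hooks are real Airflow providers, everything else is a made-up example; the point is the chunked fetch):

```python
# Hypothetical transfer callable, wrapped in a PythonOperator: stream query
# results to a temp file in chunks, then upload the file, so the full dataset
# is never held in memory.
import csv
import tempfile

from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.postgres.hooks.postgres import PostgresHook


def postgres_to_gcs(postgres_conn_id: str, gcp_conn_id: str, sql: str,
                    bucket: str, object_name: str, chunk_size: int = 10_000) -> None:
    src = PostgresHook(postgres_conn_id=postgres_conn_id)
    dst = GCSHook(gcp_conn_id=gcp_conn_id)

    with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv") as tmp:
        writer = csv.writer(tmp)
        conn = src.get_conn()
        with conn.cursor() as cur:
            cur.execute(sql)
            while True:
                rows = cur.fetchmany(chunk_size)  # fetch a chunk at a time
                if not rows:
                    break
                writer.writerows(rows)
        tmp.flush()
        dst.upload(bucket_name=bucket, object_name=object_name, filename=tmp.name)
```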
I am curious if Airflow is the most appropriate tool for such tasks. I would imagine Apache Camel to be a good starting point for building a library of connectors and then plugging them together using EIP [1] patterns?
Airbyte CDC is based on Debezium, but Airbyte abstracts it away and makes it easier to do CDC from Postgres, MySQL, MSSQL to any supported destination (including S3). Here is the doc for CDC: https://docs.airbyte.io/understanding-airbyte/cdc
I guess one benefit is that you can use Airbyte for all your data syncs, CDC and non-CDC. You can give it a try with your own data, and see if it's easier for your team. You can run Airbyte locally with Docker Compose: https://docs.airbyte.io/quickstart/deploy-airbyte
Can you elaborate on how Airbyte makes things easier for a user? Would love to pick up any potential improvements in Debezium itself, so that all its users get to benefit from them, rather than only users of a specific integrator like Airbyte.
This is a very bad idea. You cannot hide ETL in a REST API; at most you can hide the T part, and it's a lot more work. You need to build an extra service that reads the data from a database and does some pagination. It's way more complex than a one-liner to get data from a JDBC connection, and much slower.
I used to do that, but having to deal with timeouts in long running jobs was frustrating because I never knew if something was successful. I could bump up the timeouts, but they're there for a reason.
We work with many businesses that are larger (Fortune 500), where the T per pipeline is, say, 60 steps with 1200 columns at 10TB scale and uses multiple things not in SQL. They look up object stores, look up web services, use RocksDB; partitioning is important. At scale, cost becomes critical; some are even moving to their own Spark on Kubernetes. ML is done on the data after ETL into the data lake.
None of them can use DBT for core ETL, but DBT might be good later for views, some dimensional modeling. They have done a good job here.
If you can write a SQL query or a set of SQL queries to do your transformation, then you can use DBT. DBT doesn't do transformation itself rather it helps you manage all the dependencies between your SQL models. Whether you can use SQL depends on your data and database/warehouse functionality. For example, JSON parsing support is pretty good now in many databases and warehouses. If your objects can be represented as JSON, then you could write SQL via DBT to parse the objects into columns and tables.
My understanding of dbt is that it builds a DAG based on the interdependencies between models. The interdependencies are parsed from 'ref' functions in the SQL files. The thing with dbt is that you transform the data within a single data warehouse.
So, you would normally first load all data to the data warehouse. Then dependencies between SQL models are easier to map.
Very solid article. Even with where it’s published, it’s jolly sensible.
I would like to jump in and say use Beam instead of DBT, but tbh that’s bad advice. What the world needs is something open source with the incremental model of beam, a fast incremental backend (thinking htap storage that mixes columns and rows automagically) and the ease and maintainability of DBT. There is just this massive hole. If some combination of tools could fill it, that would be the new LAMP stack for data.
>Just don’t do it. Because dbt is primarily designed for batch-based data processing, you should not schedule your dbt jobs to run continuously. This can open the door to unforeseeable bugs.
Why not though? You can implement incremental models and run them continuously. Sure, it's more work, but what bugs does this cause?
Totally agree. While not using DBT specifically, I've done this on tables with billions of rows and it works perfectly. And even this can be combined with a Lambda view giving you the best of both worlds. Combining overcomes any latency from the incremental process since it can take time.
But I did end up questioning why I needed to continuously microbatch when the lambda views are able to bridge the gap. It turned out that the lambda views were good enough that we could reduce the microbatching back to every 24hrs, and even that was just being overly cautious. 48hrs or more might have been good enough.
It turned out that the costly part of the microbatching was really merging (inserts and update, not append only) the delta data back into the prepared table. Selecting, combining and consolidating new and historic data is extremely fast.
This is just advertising copy. It isn't giving unbiased advice, there is an obvious conflict of interest here.
I was hoping to learn some things to help me avoid common airflow problems, but it only talks about database sync jobs for the most part and jumps into pitching Airbyte over and over, the last 2/3 of the article being a sales pitch, right out of a marketing class.
Hi to the author and other Airbyte employees pushing this to the frontpage! I hope you didn't have to get up too early to coordinate your voting. Make sure to give this comment a downvote so we know you are out there!
> The main issue with Airflow transfer operators is that if you want to support transfers from M sources to N destinations, the community would need to code N x M Airflow operators.
I'm biased but this is a nonissue with workflow-as-code solutions like temporal.io (which Airbyte uses). N activities pulling data from sources, M activities sending data to destinations, write whatever translation layers you want in your workflows.
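A rough sketch with Temporal's Python SDK, just to show the shape (activity names and bodies are invented):

```python
# Hypothetical workflow: one extract activity per source, one load activity per
# destination; the glue logic is plain code in the workflow.
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def extract_postgres(table: str) -> list:
    ...  # pull rows from the source


@activity.defn
async def load_bigquery(rows: list) -> None:
    ...  # push rows to the destination


@workflow.defn
class PostgresToBigQuery:
    @workflow.run
    async def run(self, table: str) -> None:
        rows = await workflow.execute_activity(
            extract_postgres, table, start_to_close_timeout=timedelta(minutes=30)
        )
        await workflow.execute_activity(
            load_bigquery, rows, start_to_close_timeout=timedelta(minutes=30)
        )
```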
Having always been on AWS and using Glue Spark Jobs for my jobs, I've never felt any benefit of using Airflow for orchestration over Glue Workflows. I can understand some people not wanting to deal with vendor lock-in. I'm curious what others opinions are.
I recently started working on my own DAG execution framework, after failing to get some patches into Airflow to make the scheduling easier to reason about.
My typical use case was orchestrating DAGs with thousands of vertices, and airflow would silently wedge itself and fail to report errors.
Daggy is just starting out, but I’m hoping it’ll become more robust and scalable as time goes on.
(workflow/DAG systems are an area of special interest for me...)
Happy to see you're supporting runtime-dynamic DAG shape. That's one of the critical things Airflow is missing.
But, if I'm inferring correctly, it looks like you're recording task state in a similar way to Airflow, i.e. in some kind of data store. I suggest you look at how Luigi models task state: using "targets" which are a task's materialized outputs (e.g. a file, or anything whose existence can be tested for). Luigi's model isn't perfect but it's very valuable for tasks to be able to recognize the condition "my output exists, therefore I don't need to run".
...which leads me to the most important feature you need to check off, that most competitors are getting wrong: resuming a partially failed build. It's often the case that one node of a DAG will fail with a non-recoverable error that can't be solved with retries, and so the DAG as a whole will fail. But if we can fix that error in a way that's compatible with the work that was already done (e.g. fix a network ACL, IAM policy, etc), we should be able to start the DAG again and have it resume at the point of failure, and not do any work over again.
Luigi does this very well, Airflow does it adequately, and I think all the other tools in this space (Prefect, Dagster, Nextflow... tell me about others!) don't do it at all.
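To illustrate the Luigi target model for anyone who hasn't seen it, a minimal sketch (paths and the "transform" are made up): each task's output is a file, Luigi checks for its existence, and a rerun only redoes the tasks whose outputs are missing.

```python
# Hypothetical Luigi tasks: completion is defined by the existence of the output target.
import luigi


class ExtractOrders(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # if this file exists, the task is considered done and won't rerun
        return luigi.LocalTarget(f"/data/raw/orders_{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("order_id,amount\n")  # pretend extraction


class TransformOrders(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractOrders(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"/data/clean/orders_{self.date}.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read().upper())  # stand-in transform
```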
Thanks for the feedback. I'll take a look at how Luigi models task state. Right now each TaskExecutor type is responsible for running and reporting on tasks (e.g. the Slurm executor submits jobs and monitors them for completion). I was considering adding a companion "verify" stage for every vertex, which would be a command that ran and verified output. It might be a way to do what I think you're describing above without having to build in a variety of expected outputs into the daggy core. I'll check what Luigi is doing, though.
> resuming a partially failed build
Daggy does this! Right now it will keep running the DAG until every path is completed, or until all remaining vertices in a processing state (queued, running, retry, error) have ended up in the error state, at which point the DAG goes to an error state.
It's possible to explicitly set task/vertex states (e.g. mark it complete if the step was manually completed), then change the DAG state to QUEUED, at which point the DAG will resume execution from where it left off. [1] is a unit test that walks through that functionality.
I've never used Airflow, but used Step Functions in AWS to achieve pretty much the same things this article describes. I wonder if anybody has used both and what the pros and cons are between them? Besides the obvious ones: Step Functions is native to AWS so it works better within the AWS ecosystem, while Airflow is open source and service/provider agnostic.
- I miss the Airflow UI for monitoring workflow execution and clearing failed tasks, plus its pre-built notifications, emails and so on.
- Passing state between Step Functions states is a bit tricky tbh. I found Airflow's way a little more straightforward.
- Step Functions seems to scale better, especially when you aim to use it for dynamic workflows. The recent update where you can use literally any AWS service API (compared to a couple of weeks ago, when you were restricted to around 10 services and otherwise only had Lambdas) minimises its weaknesses imo.
Ya, the new Step Functions service integration feature is awesome; it really seals the deal for Step Functions being the orchestration tool if you are in AWS.
Airflow: Much has been written about good and bad. My bads included poor support for event driven workflows; goods included ease of extensibility. A lot has happened since I stopped using it.
AWS stepfunctions: Goods include okay event driven workflows, integration with AWS products, offloading lower level infrastructure to AWS-only people. Bads are everything else - It gives you a set of remarkably and weirdly limited primitives from which you may assemble something like Airflow. If you have any important use case, avoid it like the plague because it is half-baked and limited to a degree that only amazon can accomplish, yet it is capable enough for an initial POC to seem impressive.
Yea they look pretty similar. Not sure what the configuration language around step function is, but Airflow is nice because it's just python. There are a few gotchas you encounter, especially around templating, when you first get started but besides that it's pretty low overhead to start doing some pretty complex things since it's just writing python code.
The article says that "SQL is taking over Python to transform and analyze data in the modern data stack". Are other people starting to notice this as ELT becomes more popular than traditional ETL?
Haven't used Airflow before, but we use Azure Data Factory in my org to load the raw data into the data warehouse and then transform it into data models using SQL.
> Are other people starting to notice this as ELT becomes more popular than traditional ETL?
I invented ELT.
No, duh, of course not... But: I and my then-colleagues, other DW consultants, started noticing some time in the early-to-mid-00s that what we were doing didn't actually follow the order of the "ETL" acronym, and occasionally commented to each other along the lines of "shouldn't this be called 'ELT' to be more descriptive?". So I don't quite get the "starting to notice this as ELT becomes more popular" bit -- this feels quite old to me. Before most of the meteoric rise of Python.
All the warehouses I’ve worked on have used ELT more than ETL, going back 20 years or more. I think that at the coalface it’s always been the more common paradigm. It’s just that the people doing it are too busy writing code to write blogs.
In a past role, we used Airflow to do the transformations. Airflow accomplished a couple of things that you won't get from a "barebone run my SQL transform":
- Retrieve credentials from X
- Rewrite SQL (maybe QA is repointed to PROD)
- Run, test for performance, then promote/notify/whatever
That being said, my last org was an early adopter of Airflow, and we deployed it before there were "best practices" - and it was an unpleasant experience. In my current role, we are investigating schedulers and Airflow is in the "I really hope I don't have to do this" list of tools. ADF is in the "this will work until we find the real solution" list
Airflow is a really nice and awesome solution for simple workflows and tasks. But when we use complex DAGs with too many subDAGs, the real pain starts. The real issues with Airflow are:
1. Scalability
2. Scheduler delays
3. Security (open-source Airflow):
A) Airflow uses a single super role that has access to resources for all its orchestration jobs, which is a potential compliance risk.
B) Lack of granular roles and security groups, which means relying on trust that no Airflow users mistakenly make changes through the UI.
I feel there is some undocumented dependency between the scheduler, Celery and the web server which always hits the performance of ETL jobs.
Also, we see more reliability issues on the platform as more workloads are added.
Something I've been thinking about: like the article says, SQL is a very good way to transform data, and it recommends dbt for it, but how do you test this transformation?
I know dbt has tests, but at a superficial look they seem like pretty trivial stuff, e.g. "check if this field is null" and things like that. But what about tests where I set up scenarios and see if the end result of my transformation is what I expect? Are there any good tools for this?
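To make concrete what I mean by a scenario test, something like this (not dbt-specific; table and column names are invented): load fixture rows into a throwaway database, run the model's SQL, assert on the result.

```python
# Hypothetical pytest-style test of a SQL transformation against fixture data.
import sqlite3

TRANSFORM_SQL = """
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    WHERE status = 'completed'
    GROUP BY customer_id
"""


def test_total_spend_ignores_cancelled_orders():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INT, amount REAL, status TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10.0, "completed"), (1, 5.0, "cancelled"), (2, 7.5, "completed")],
    )
    result = dict(conn.execute(TRANSFORM_SQL).fetchall())
    assert result == {1: 10.0, 2: 7.5}
```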
You can only run superficial tests like "is not null" or "is unique" (to prevent row duplication in a SQL join, for example), but not much else. You simply don't have enough information about what to test at the stage of the initial ETL/ELT that data engineering is responsible for.
You maintain pipes that can route water, oil or gasoline. You don't want to test for water purity, because next day you are asked to route sewage or oil through your system. You can at best test volume, pressure or velocity in the pipeline because these actually have an impact on your system.
The actual test that business is asking you to put at the "raw data" stage has to come at the application level later on. Eg. when you extract a metric out of the data you can do all kinds of time series tests on it, which will test the pipeline thoroughly.
Testing is what makes data engineering different from software engineering. You don't control the input data and don't want to. You have to make your organization work in such a way that downstream users communicate data issues back to you. You only test the data thoroughly by using it.
If you are a data vendor and your downstream user is a paying customer that you don't want to embarrass yourself in front of, you need to invent a use for the data that you sell within your org.
I like this analogy, I often use the term plumbing in relation to Data Engineering, and this extends it perfectly!
> You maintain pipes that can route water, oil or gasoline. You don't want to test for water purity, because next day you are asked to route sewage or oil through your system. You can at best test volume, pressure or velocity in the pipeline because these actually have an impact on your system.
Check out Great Expectations (https://greatexpectations.io/). You basically run assertions on your data. It auto-generates documentation for you, and you can also store the results of your validations in a flat file or database. I think there are dbt modules that try to mimic Great Expectations' validation, but I like the docs that come with Great Expectations.
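The gist of it, using the older pandas-wrapping API (column names and bounds are made up):

```python
# Hypothetical check: wrap a DataFrame and assert expectations on it.
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [10.0, 5.5, 7.25],
}))

df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_unique("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

results = df.validate()
print(results.success)
```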
Check out the dbt-expectations package[1]. It's a port of the Great Expectations checks to dbt as tests. The advantage of this is you don't need another tool for these pretty standard tests, and they can be easily incorporated into dbt workflows.
Has anyone kicked the tires on Airflow, Prefect, and Dagster and care to give their thoughts? My initial foray into Airflow 1 met a lot of complexity that both Prefect and Dagster claim to minimize.
Tried some of them, threw them away for various reasons and used Argo Workflows instead.
Airflow mixes orchestration and the actual transformation, and I wasn't a fan of being so intimately tied to Python. Argo lets us run arbitrary containers (a mix of Spark, Rust, etc), played nicer with the tools we were already using (Kafka, K8s etc), didn't require us to write Python, and cleanly separated orchestration from the actual ETL/ELT work. It's working fantastically for us.
I once made a simple ETL concept library for a coding challenge, I actually never published it, and it's not like performant or anything, but maybe I should open source it?