5/8/2023

Apache Airflow

There's no true way to monitor data quality

Airflow is a workhorse with blinders. It doesn't do anything to course-correct if things go wrong with the data, only with the pipeline. Virtually every user has experienced some version of Airflow telling them a job completed, then checking the data only to find that a column was missing and it's all wrong, or that no data actually passed through the system.

This is especially true once the data organization matures and you go from 10 directed acyclic graphs (DAGs) to thousands. In that situation, you are likely using those DAGs to ingest data from external data sources and APIs, which makes controlling data quality in Airflow even more difficult: you can't "clean" the source dataset or implement your governance policies there.

While you can create Slack alerts and check each run manually, to incorporate Airflow as a useful piece of your data engineering organization and hit your SLAs, you want to automate quality checks. And to do that, you need visibility into not just whether a job ran, but whether it ran correctly. And if it didn't run correctly, why, and where the error originated. Otherwise, you'll be living through Groundhog Day.

This is not a simple challenge, and if we're being candid, it's why our founders built Databand. Most product observability tools such as Datadog and New Relic were not built to analyze pipelines and can't isolate where issues originated, group co-occurring issues to suggest a root cause, or suggest fixes.

However, the need for observability is still not fully understood, even within the Airflow community. Today, only 32% say they've implemented data quality measurement, though the fact that the survey's drafters are asking is an indication of improvement. They did not ask this question in the 2019 or 2020 surveys.

How does one go about monitoring data quality in Airflow? In truth, Airflow gets you halfway there. As its maintainers point out, "When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative." Airflow offers that formal representation of code. What you need, though, is an observability tool built specifically to monitor data pipelines. Those built to monitor products are a halfway measure, yet usually part of the journey because teams already have those licenses.

We find there are several phases engineering organizations go through on their journey to full observability maturity:

Pre-awareness: Not monitoring data quality (68% of the Airflow community).

Duct tape and baling wire: Borrowing product observability tools and making them work, though it may not be ideal.

Purpose-built solution: Adopting full-pipeline observability tools like Databand to automate alerts, isolate root causes, and fix issues faster. Set machine learning around expected data parameters, get Slack alerts that flag missing data or schema changes in the Airflow Scheduler, trace issue lineage back, and back-test through historical data.

Learning Airflow requires a time investment.
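The "expected data parameters" idea can be sketched in plain Python. This is a crude statistical stand-in for the machine-learning thresholds described: it learns a normal range from historical daily row counts and flags runs outside it. All numbers are invented example data, and a real observability tool would fit far richer models.

```python
import statistics

# Hypothetical history of daily row counts for one pipeline output.
history = [10_120, 9_980, 10_340, 10_050, 9_870, 10_200, 10_110]

# "Learn" an expected range: mean plus or minus three standard deviations.
mean = statistics.mean(history)
std = statistics.stdev(history)
low, high = mean - 3 * std, mean + 3 * std

def is_anomalous(row_count: int) -> bool:
    """True if a run's row count falls outside the learned range.

    In practice this is where an alert (e.g. a Slack message) would fire.
    """
    return not (low <= row_count <= high)

print(is_anomalous(10_150))  # a typical day: False
print(is_anomalous(1_200))   # pipeline silently dropped most rows: True
```

The same pattern back-tests naturally: replay historical loads through `is_anomalous` to see which past incidents the threshold would have caught.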
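The missing-column failure described above can be caught by a lightweight assertion that runs before downstream tasks consume the data. A minimal sketch in plain Python, so it could be wrapped in an Airflow `PythonOperator`; the column names and records are hypothetical:

```python
# Expected schema for a batch of records (hypothetical column names).
EXPECTED_COLUMNS = {"order_id", "customer_id", "amount"}

def validate_batch(records: list[dict]) -> None:
    """Raise ValueError if the batch is empty or a column is missing."""
    if not records:
        raise ValueError("No data passed through the pipeline")
    for i, row in enumerate(records):
        missing = EXPECTED_COLUMNS - row.keys()
        if missing:
            raise ValueError(f"Row {i} is missing columns: {sorted(missing)}")

# A well-formed batch passes silently...
validate_batch([{"order_id": 1, "customer_id": 7, "amount": 9.99}])

# ...while a batch with a dropped column fails loudly, instead of
# Airflow reporting the job as "completed" over bad data.
try:
    validate_batch([{"order_id": 2, "amount": 5.00}])
except ValueError as exc:
    print(exc)  # Row 0 is missing columns: ['customer_id']
```

Failing the task this way turns a silent data problem into a normal Airflow failure, so existing retry and alerting hooks apply.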