Why use CustomOperator over PythonOperator in Apache Airflow?

As I'm using Apache Airflow, I can't figure out why someone would create a CustomOperator rather than use a PythonOperator. Wouldn't it lead to the same result if I used a Python function inside a PythonOperator instead of a CustomOperator?
If someone knows what the different use cases and best practices are, that would be great!
Thanks a lot for your help

Both operators, while similar, really sit at different abstraction levels, and depending on your use case, one may be a better fit than the other.
Code defined in a CustomOperator can be easily used by multiple DAGs. If you have a lot of DAGs that need to perform the same task it may make more sense to expose this code to the DAGs via a CustomOperator.
PythonOperator is very general and is a better fit for one-off, DAG-specific tasks.
At the end of the day, the default set of operators provided in Airflow are just tools. Which tool you end up using (default operators), or whether it makes sense to create your own custom tool (custom operators), is a choice determined by a few factors:
- The type of task you are trying to accomplish.
- Code organization requirements dictated by policy or by the number of people maintaining the pipeline.
- Simple personal taste.
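For illustration, here is a minimal sketch of the reuse argument above, assuming a recent Airflow (2.4+); the operator name, its parameter, and the DAG are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import BaseOperator


class LoadToWarehouseOperator(BaseOperator):
    """Shared logic many DAGs need, packaged once instead of being copied
    into a PythonOperator callable in every DAG."""

    def __init__(self, table: str, **kwargs):
        super().__init__(**kwargs)
        self.table = table

    def execute(self, context):
        # The common loading logic lives here, in one place.
        self.log.info("Loading data into %s", self.table)


with DAG(
    dag_id="sales_pipeline",       # hypothetical DAG
    start_date=datetime(2023, 1, 1),
    schedule="@daily",             # "schedule" assumes Airflow 2.4+
    catchup=False,
) as dag:
    load = LoadToWarehouseOperator(task_id="load_sales", table="sales")
```

Any other DAG can now import LoadToWarehouseOperator instead of re-implementing the same callable behind a PythonOperator.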

Related

Kubeflow VS generic workflow orchestrator?

I am struggling to understand the functional role of Kubeflow (KF) compared with other (generic) workflow orchestrators.
I know KF is oriented towards ML tasks and is built on top of Argo.
Two questions:
- Can KF be used at a higher level as a workflow orchestrator to perform more generic tasks (e.g. ETL) whose outcome might be useful in the following ML tasks? Can I use all the functionality of Argo within KF?
- What can a generic workflow orchestrator (such as Airflow, Argo, etc.) do that KF cannot?
Yes. You can create Python function components, or generic containers with your code baked in, that execute whatever task you like:
- Pre-defined components: https://www.kubeflow.org/docs/components/pipelines/sdk-v2/component-development/
- Python function components: https://www.kubeflow.org/docs/components/pipelines/sdk-v2/python-function-components/
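As a rough illustration of the second link, here is a minimal sketch of a Python function component used for a generic (non-ML) step, assuming the KFP v2 SDK; the component, pipeline name, and default source path are made up:

```python
from kfp import dsl


@dsl.component
def extract_rows(source: str) -> int:
    # Any generic task can live here, e.g. an ETL step that feeds later ML steps.
    print(f"extracting from {source}")
    return 42


@dsl.pipeline(name="generic-etl-pipeline")
def pipeline(source: str = "s3://bucket/raw"):  # hypothetical input location
    extract_rows(source=source)
```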
KFP is an abstraction on top of Argo Workflows.
It gives you the ability to create workflows using Python instead of writing YAML files. Check out this article: https://towardsdatascience.com/build-your-data-pipeline-on-kubernetes-using-kubeflow-pipelines-sdk-and-argo-eef69a80237c
Since Argo Workflows development is advancing independently of KFP, it's safe to assume there will be features missing from KFP (which the community will add according to demand).
That's a big question.
In general, Airflow has sensors, an SLA feature, a huge store of operators/sensors/reports/plugins, and a bigger community, since it's not ML-oriented.

Cyclic Workflow in Data Comparison Process

I am searching for a solution to automate an iterative data comparison process until all data packages are consistent. My general guess is to use something like Apache Airflow, but the iterative nature seems to imply a cyclic graph, and Apache Airflow only allows DAGs (directed acyclic graphs). Since I don't have much knowledge of Airflow, I am a bit lost and would appreciate some expert knowledge here.
Current status: I am in a position where I regularly need to compare data packages for consistency and manually communicate errors to and between two different parties.
On the one hand there is a design data set and on the other hand there are measured data sets. Both datasets involve many manual steps from different parties. So if an inconsistency occurs, I contact one or the other party and the error is removed manually. There are also regular changes to both data sets that can introduce new errors to already checked datasets.
I guess this process has not been automated yet because the datasets are not directly comparable and some transformations need to be done in between. I automated this transformation process over the last few weeks, so all that needs to be done now on my side is to run the script and communicate the errors.
What I need now is a tool that orchestrates my script against the correct datasets and contacts the relevant people as long as errors exist. If something changes or is added, the script needs to run again.
My first guess was that I would need to create a workflow in Apache Airflow, but this iterative process looks to me like a cyclic graph, which is not allowed in Airflow. Do you have any suggestions, or is this a common situation for which solutions with Airflow also exist?
I think one way to solve your problem could be to have a DAG for the main task of comparing the datasets and sending notifications, and then run a periodic task in cron, Quartz, etc. that triggers that DAG. You are correct that Airflow does not allow cyclic workflows.
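A minimal sketch of that idea, assuming a recent Airflow (2.4+) and a configured SMTP connection for email notifications; the DAG id, the compare_datasets() helper, and the recipient address are hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def compare_datasets() -> bool:
    # Placeholder: run the transformation/comparison script and
    # return True when all data packages are consistent.
    return True


def pick_branch() -> str:
    return "all_consistent" if compare_datasets() else "notify_parties"


with DAG(
    dag_id="dataset_comparison",   # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule=None,                 # triggered externally, e.g. from cron:
                                   #   airflow dags trigger dataset_comparison
    catchup=False,
) as dag:
    check = BranchPythonOperator(task_id="check_consistency", python_callable=pick_branch)
    all_consistent = EmptyOperator(task_id="all_consistent")
    notify = EmailOperator(
        task_id="notify_parties",
        to="data-owners@example.com",          # hypothetical recipients
        subject="Dataset inconsistency detected",
        html_content="Please review the comparison report and fix the data.",
    )
    check >> [all_consistent, notify]
```

The cron (or Quartz) job simply re-triggers the DAG on a schedule, so the "loop" lives outside Airflow and the DAG itself stays acyclic.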
I worked on Cylc, a cyclic graph workflow tool. Cyclic workflows (or workflows with loops) are very common in areas such as Numerical Weather Prediction (NWP), the reason Cylc was created, and also in other fields such as optimization.
In NWP workflows, some steps may be waiting for datasets, and the workflow may stall and send notifications if the data is not as it was expected (e.g. some satellite imaging data is missing, and also the tides model output file is missing).
Also, in production, NWP models run multiple times a day, either because you have new observation data, or new input data, or maybe because you want to run ensemble models, etc. So you end up with multiple runs of the workflow in parallel, where the workflow manager is responsible for managing dependencies, optimizing the use of resources, sending notifications, and more.
Cyclic workflows are complicated; that's probably why most implementations opt to support only DAGs.
If you'd like to try Cylc, the team has been trying to make it more generic so that it's not specific to NWP only. It has a new GUI, and the input format and documentation were improved with ease of use in mind.
There are other tools that support cyclic workflows too, such as StackStorm and Prefect, and I am currently checking whether Autosubmit supports them as well. Take a look at these tools if you'd like.
If you are in the life sciences, or are interested in reproducible workflows, the CWL standard also has some ongoing discussion about adding support for loops, which I reckon could allow you to achieve something akin to what you described.

Schedule tasks at fixed time in multiple timezones

I'm getting started with Airflow and I'm not sure how to approach this problem:
I'm building a data export system that should run at a fixed time daily for different locations. My issue is that the locations have several timezones and the definition of day start/end changes depending on the timezone.
I saw in the documentation that I can make a DAG timezone-aware, but I'm not sure creating hundreds of DAGs with different timezones is the right approach. I also have some common tasks, so multiple DAGs create more complexity or duplication in the tasks performed.
Is there an idiomatic Airflow way to achieve this? I think building reports that are timezone-dependent would be a fairly common use case, but I didn't find any information about it.
Dynamic DAGs are a hot topic in Airflow, but from my point of view, Airflow itself doesn't provide a straightforward way to implement them. You'll need to balance the pros and cons depending on your use case.
As a good starting point, you can check the Astronomer guide for dynamically generating DAGs. There you have all the options available and some ideas of the pros and cons. Make sure you check out the scalability considerations to see the implications in terms of performance.
For your use case, I think the best approach will be either the Create_DAG method (under Single-File Methods) or the DAG Factory. I personally prefer the first one because it works like a factory (in terms of programming patterns), and you have the flexibility to create everything you need for each DAG. With the second, you won't have much control over what you create, and there's an external dependency involved.
Another interesting article about dynamically generating DAGs is "How to build a DAG Factory on Airflow".
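As a rough illustration of the single-file create_dag approach applied to the timezone problem, here is a sketch assuming a recent Airflow (2.4+) with pendulum available (it ships with Airflow); the timezone list, DAG-id prefix, schedule, and export task are made up:

```python
from datetime import datetime

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

TIMEZONES = ["Europe/Paris", "America/New_York", "Asia/Tokyo"]  # example locations


def create_dag(tz_name: str) -> DAG:
    tz = pendulum.timezone(tz_name)
    dag = DAG(
        dag_id=f"daily_export_{tz_name.replace('/', '_').lower()}",
        start_date=datetime(2023, 1, 1, tzinfo=tz),
        schedule="0 6 * * *",   # 06:00 local time in each timezone
        catchup=False,
    )
    with dag:
        PythonOperator(
            task_id="export_data",
            python_callable=lambda: print(f"exporting data for {tz_name}"),
        )
    return dag


# Register one DAG per timezone in the module namespace so the scheduler finds them.
for tz_name in TIMEZONES:
    globals()[f"dag_{tz_name.replace('/', '_').lower()}"] = create_dag(tz_name)
```

The common tasks then live once inside create_dag instead of being duplicated across hundreds of hand-written DAG files.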

Data sharing between two tasks in an Airflow DAG

I want to run a Hive query using HiveOperator, and the output of that query should be passed to a Python script run with PythonOperator. Is this possible, and how?
One general approach to this kind of problem is to use Airflow's XComs (see the documentation).
However, I would use this sparingly and only where strictly necessary. Otherwise, you risk ending up with your operators being quite tangled and interdependent.
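A hedged sketch of what that could look like: since HiveOperator does not return query results, one common workaround is to run the query from a PythonOperator via HiveServer2Hook and let XCom carry the rows to the next task. This assumes the apache-airflow-providers-apache-hive package and a hiveserver2_default connection; the DAG id, table, and query are made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook


def run_hive_query():
    hook = HiveServer2Hook(hiveserver2_conn_id="hiveserver2_default")
    rows = hook.get_records("SELECT count(*) FROM my_table")  # hypothetical query
    return rows  # the return value is pushed to XCom automatically


def process_result(**context):
    rows = context["ti"].xcom_pull(task_ids="run_hive_query")
    print(f"Result pulled from XCom: {rows}")


with DAG(
    dag_id="hive_to_python",       # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    query = PythonOperator(task_id="run_hive_query", python_callable=run_hive_query)
    consume = PythonOperator(task_id="process_result", python_callable=process_result)
    query >> consume
```

Keep XCom payloads small, since they go through the metadata database; that is part of why the answer above recommends using XComs sparingly.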

How to set up/monitor huge amounts of equivalent DAGs

I am new to Airflow and am still learning the concepts.
I am trying to monitor a huge number of webpages (>1000) once a day.
At the moment I dynamically create a single DAG for each webpage (data acquisition and processing). This works from a functional point of view. However, looking at the user interface, I find the number of DAGs overwhelming, and my questions are:
Is this the right way to do it? (a single DAG for each webpage)
Is there any way to get a better overview of how the monitoring of all webpages is doing?
Since all DAGs are equivalent and only deal with a different URL, it made me think that grouping these DAGs together, or having a common overview, might be possible or at least a good idea.
E.g., if the acquisition or processing of a certain webpage is failing, I would like to see this easily in the UI without having to scroll through many pages to find that DAG.
You should just have one DAG with multiple tasks. Based on the information you provided, the only thing that seems to change is the URL, so it's better to have one DAG with many tasks.
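A minimal sketch of the one-DAG, many-tasks layout, assuming a recent Airflow (2.4+); the URL list and the fetch_and_process callable are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

URLS = ["https://example.com/a", "https://example.com/b"]  # in practice, >1000 entries


def fetch_and_process(url: str):
    print(f"monitoring {url}")  # placeholder for data acquisition and processing


with DAG(
    dag_id="webpage_monitoring",   # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    for i, url in enumerate(URLS):
        PythonOperator(
            task_id=f"monitor_{i}",
            python_callable=fetch_and_process,
            op_kwargs={"url": url},
        )
```

In the Grid view you then get one row per URL inside a single DAG, so a failing page shows up as a single red task instead of a whole extra DAG.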
