My goal is to write a job which incrementally loads data from a source to a destination table, and then calculates a 7-day moving average on that data. My initial instinct is to split this into two tasks, one which loads a single day of data into a table, and a second which calculates a 7-day moving average and loads the result into another table.
How can I configure Airflow to automatically trigger the next 7 task instances of the downstream task calculate_7d_moving_average when a single task instance of load_day_of_data is re-run?
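For context, here is a minimal sketch of the two-task DAG described above. Only the two task names come from the question; the dag_id, schedule, and the loading/aggregation bodies are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_day_of_data(**context):
    # Placeholder: load the single day of data identified by the logical date
    # from the source into the destination table.
    ...


def calculate_7d_moving_average(**context):
    # Placeholder: compute the 7-day moving average ending on the logical date
    # and write the result into the aggregate table.
    ...


with DAG(
    dag_id="incremental_load_with_7d_avg",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=True,
) as dag:
    load = PythonOperator(
        task_id="load_day_of_data",
        python_callable=load_day_of_data,
    )
    average = PythonOperator(
        task_id="calculate_7d_moving_average",
        python_callable=calculate_7d_moving_average,
    )
    load >> average
```

With this shape, clearing one load_day_of_data instance only re-runs the downstream task of that same execution date by default, which is why the question asks about fanning out to the next 7 runs.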
Related
I have a flow that needs to be triggered automatically every Tuesday.
In this, I have 4 jobs all dependent on each other.
JobA <- JobB <- JobC <- JobD
I have put all 4 jobs in one SMART folder.
Now I need to schedule this flow so that as soon as one job completes, the next job is triggered.
How can I achieve this?
Control-M has a design element called "Conditions" that defines the flowchart of a process mesh. These Conditions determine which Jobs are successors and predecessors through their Input and Output Conditions.
Keep in mind that a Condition in Control-M is a prerequisite for the execution of a Job and is composed of the following elements:
A name,
A date,
A sign.
Is there a way to throttle a DAG backfill on certain tasks, so that if one task in a run is writing to a table, another task that is trying to truncate that table has to wait?
I have ~50 tables in a database that I am mirroring to another server. The process is the same for all 50 tables. The only difference between DAGs is the stored procedure that gets called after the data is staged for processing on the server. To save a bunch of work, I opted to create a DAG factory and parameterize it in a way where each DAG syncs a different table and its corresponding stored procedure.
The issue I am having is that when I backfill too many DAG runs for a single table, I get a race condition where one execution date is trying to bulk insert into a stage table while another execution date is trying to truncate that same table so it can bulk insert its own stage data. Pooling is not really an option because I would need to create a pool of size 1 for each table, which doesn't seem like a very good idea, especially if I ever need to re-deploy the Airflow metadata DB.
Have you tried setting max_active_runs=1 for this DAG? See the documentation for reference.
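As a hedged illustration of that setting, here is a DAG-factory sketch where max_active_runs=1 is the relevant line; the factory function, table names, and callables are invented for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_mirror_dag(table_name: str) -> DAG:
    """Hypothetical DAG factory: one sync DAG per mirrored table."""
    dag = DAG(
        dag_id=f"mirror_{table_name}",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=True,
        max_active_runs=1,  # at most one active run per table, so truncate and bulk insert never overlap
    )
    with dag:
        truncate_stage = PythonOperator(
            task_id="truncate_stage_table",
            python_callable=lambda: print(f"TRUNCATE stage_{table_name}"),
        )
        bulk_insert = PythonOperator(
            task_id="bulk_insert_stage_table",
            python_callable=lambda: print(f"bulk insert into stage_{table_name}"),
        )
        run_proc = PythonOperator(
            task_id="call_stored_procedure",
            python_callable=lambda: print(f"EXEC sync_{table_name}"),
        )
        truncate_stage >> bulk_insert >> run_proc
    return dag


# Register one DAG per table at module import time (hypothetical table list).
for table in ["customers", "orders"]:
    globals()[f"mirror_{table}"] = build_mirror_dag(table)
```

With max_active_runs=1, a backfill for one table proceeds strictly one execution date at a time, while the DAGs for other tables can still run in parallel.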
Also, wouldn't it be an option to use a separate temp table for each DAG run? It's a common pattern to create resources for a single DAG run and then tear them down.
I am learning about Airflow and want to understand why it is not a data streaming solution.
https://airflow.apache.org/docs/apache-airflow/1.10.1/#beyond-the-horizon
Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!)
I am not sure what it means that tasks do not move data from one to another. So I can't have tasks like Extract Data >> Calculate A >> Calculate B using data from the previous step >> another step that depends on the previous task's result >> ...?
Furthermore, I have read the following a few times and it still doesn't click with me. What's an example of a static vs. a dynamic workflow?
Workflows are expected to be mostly static or slowly changing. You can think of the structure of the tasks in your workflow as slightly more dynamic than a database structure would be. Airflow workflows are expected to look similar from a run to the next, this allows for clarity around unit of work and continuity.
Can someone help provide an alternative explanation or example that can walk through why Airflow is not a good data streaming solution?
I am not sure what it means that tasks do not move data from one to another. So I can't have tasks like Extract Data >> Calculate A >> Calculate B using data from the previous step >> another step that depends on the previous task's result >> ...?
You can use XCom (cross-communication) to use data from a previous step, BUT bear in mind that XCom is stored in the metadata DB, meaning that you should not use XCom to hold the data that is being processed. You should use XCom to hold a reference to the data (e.g., an S3/HDFS/GCS path, a BigQuery table, etc.).
So instead of passing data between tasks, in Airflow you pass a reference to the data between tasks using XCom. Airflow is an orchestrator; heavy tasks should be off-loaded to another data processing engine (a Spark cluster, BigQuery, etc.).
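A hedged sketch of that pattern using the TaskFlow API, where the tasks exchange only an object-store path via XCom; the bucket layout and processing steps are invented for illustration.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False)
def reference_passing_example():
    @task
    def extract(ds=None) -> str:
        # Write the raw data to external storage and return only its location.
        raw_path = f"s3://example-bucket/raw/{ds}.parquet"  # hypothetical path
        # ... upload the extracted data to raw_path ...
        return raw_path  # this string (not the data) is what lands in XCom

    @task
    def calculate_a(raw_path: str) -> str:
        # Trigger the heavy computation on an external engine (Spark, BigQuery, ...),
        # reading from raw_path and writing the result elsewhere.
        return raw_path.replace("/raw/", "/calc_a/")

    @task
    def calculate_b(a_path: str) -> str:
        # Same idea: consume the reference produced by the previous step.
        return a_path.replace("/calc_a/", "/calc_b/")

    calculate_b(calculate_a(extract()))


reference_passing_example()
```

Each task only ever sees a small string, while the actual data stays in external storage and is processed wherever the heavy lifting belongs.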
We are using Airflow as our workflow manager and scheduler.
Requirements:
1- We have an ETL pipeline in which data arrives hourly in a different file and needs to be processed as soon as it arrives.
2- Data for every hour has a cut-off limit within which it can still get updated, and once updated, the data needs to be reprocessed.
To solve the first requirement, we can use a file sensor with hourly macros to look for the file and start processing once the data is available; a sketch of this is below.
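A hedged sketch of that first piece, assuming an hourly-scheduled DAG and an invented landing-directory layout:

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="hourly_file_pipeline",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=True,
) as dag:
    # The landing path is an assumption; {{ ds }} and {{ ts_nodash }} are standard template variables.
    wait_for_hourly_file = FileSensor(
        task_id="wait_for_hourly_file",
        filepath="/data/incoming/{{ ds }}/{{ ts_nodash }}.csv",
        poke_interval=5 * 60,  # check every 5 minutes
        timeout=60 * 60,       # fail if nothing arrives within the hour
        mode="reschedule",     # release the worker slot between checks
    )
    # Downstream processing tasks would follow: wait_for_hourly_file >> process_hour
```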
For the second requirement, we were thinking of using some kind of SubDAG/task which could run at a different frequency until the cut-off time and reprocess if there is any update in the data.
But in Airflow we couldn't find anything like that which could run a task/SubDAG at a different frequency.
How can we achieve this?
I have an Oozie job that processes data incrementally. Going forward, I would like to run this job on an hourly basis to prepare the results as soon as possible. But to backfill old data, it would be faster to run sequential jobs processing a week's worth of data at a time.
Is it possible to have a single coordinator.xml file that allows for both of these modes, and simply choose between them based on a flag specified ad-hoc when the job is scheduled?
In the parameters of the <coordinator-app> tag in coordinator.xml, there is a single frequency, which suggests that this is not possible, at least not in a natural way.
I don't think there is an easy way to do different frequencies within a coordinator. Based on your description, you would not need the weekly job after the backfill happens.
I imagine you'd have to change the parameterization of the workflow as well to process more or less data.
On the other hand, you could just start the coordinator in the past with the frequency you'd like and tweak parameters like concurrency, throttle and execution in the app definition so Oozie can chew through the backlog by executing the workflow in parallel.
My eventual solution was to create the workflow at a given frequency (say, daily), and then create a second "backfill" workflow with a different frequency (weekly or monthly) that calls the original workflow as a sub-workflow.