Let's say you have 2 scripts: Daily_summary.py and Weekly_summary.py.
You could create 2 separate DAGs with daily and weekly schedules, but is it possible to solve this with 1 DAG?
I've tried a daily schedule, and simply putting this at the bottom (simplified):
if datetime.today().strftime('%A') == 'Sunday':
    SSHOperator(run weekly_summary.py)
But the problem is that if the task is still running on Sunday at midnight, Airflow will terminate it, since the operator no longer exists in the DAG on Monday.
If I could somehow get the execution day's day-of-the-week, that would solve it. But with Jinja templating, '{{ ds }}' is not actually a 'yyyy-mm-dd' string at DAG-definition time, so it cannot be converted to a date with the datetime package; it only becomes a date string after the Airflow script is executed.
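For reference, the rendered ds value can be read inside a Python callable at task runtime, which is one way around the parse-time problem described above. A minimal sketch for something like a BranchPythonOperator (the task ids here are illustrative, not from the question):

```python
from datetime import datetime

# Airflow renders the 'ds' template ('yyyy-mm-dd') and passes it into the
# callable at task runtime, so the execution day's weekday can be computed
# here rather than at DAG-parse time. Task ids 'weekly_summary' and
# 'daily_summary' are illustrative placeholders.
def choose_summary_task(ds, **kwargs):
    weekday = datetime.strptime(ds, "%Y-%m-%d").strftime("%A")
    return "weekly_summary" if weekday == "Sunday" else "daily_summary"
```

With a BranchPythonOperator, the returned task_id decides which downstream task actually runs for that execution date.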
You should dynamically generate two DAGs, but you can reuse the same code for both. This is the power of Airflow: a DAG file is Python code, so you can easily use the same code to generate the same DAG "structure" with two different dag_ids and two different schedules.
See this nice article from Astronomer with some examples: https://www.astronomer.io/guides/dynamically-generating-dags
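The pattern can be sketched without the Airflow-specific details: a factory builds the shared structure, and each generated DAG is registered at module level. In the sketch below the returned dict stands in for a real airflow.DAG object, which the factory would construct in an actual DAG file; all names are illustrative.

```python
# Minimal sketch of dynamic DAG generation from one shared definition.
# make_summary_dag stands in for building a real airflow.DAG: in an
# actual DAG file it would return DAG(dag_id=dag_id,
# schedule_interval=schedule, ...) with an SSHOperator task inside.

def make_summary_dag(dag_id, schedule, script):
    # Shared "structure": one task that runs the given script.
    return {"dag_id": dag_id, "schedule": schedule, "tasks": [f"run {script}"]}

configs = [
    ("daily_summary", "@daily", "daily_summary.py"),
    ("weekly_summary", "@weekly", "weekly_summary.py"),
]

# Airflow discovers DAGs by scanning module-level globals, so register
# each generated DAG under its own name.
for dag_id, schedule, script in configs:
    globals()[dag_id] = make_summary_dag(dag_id, schedule, script)
```

Both DAGs share one code path; only the dag_id, schedule, and script differ.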
Related
I am trying to get the DAG start time and end time to calculate the duration/elapsed time and show it in the Airflow UI.
I tried Python's datetime, but it looks like Airflow already records these things. I want to know if there is any way to leverage that.
I don't want to get the details from the database because it will complicate things. I want to keep it simple.
I am using Airflow 1.10.10 and want to know how to change an Airflow DAG's schedule. I checked online, and most comments suggest that to change the schedule of a DAG, you should create a new DAG with a new dag_id, or change the dag_id of the existing DAG, and give it the new schedule_interval. Attempting to change the schedule of an existing DAG supposedly will not work in a straightforward manner and will throw errors or create scheduling problems.
However, I tried to test this so that I could reproduce the scenario where changing my DAG's schedule leads to erroneous cases. I did this by changing only schedule_interval in the DAG file. I tried the sequence of schedule changes below, and everything worked as expected: the schedule was changed properly and no erroneous case was found.
Started with #Daily
Changed to 10 min
Changed to 17 min
Changed to 15 min
Changed to 5 min
Can someone please clarify what kind of problems may arise if we change the schedule_interval of a DAG without changing its ID?
I do see this recommendation on the old Airflow Confluence page on Common Pitfalls:
When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention : my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc...
Changing schedule interval always requires changing the dag_id, because previously run TaskInstances will not align with the new schedule interval
Changing start_date without changing schedule_interval is safe, but changing to an earlier start_date will not create any new DagRuns for the time between the new start_date and the old one, so tasks will not automatically backfill to the new dates. If you manually create DagRuns, tasks will be scheduled, as long as the DagRun date is after both the task start_date and the dag start_date.
I don't know the author's intent, but I imagine changing the schedule_interval can cause confusion for users: when they revisit these tasks, they will wonder why the current schedule_interval does not match past task executions, since that information is not stored at the task level.
Changing the schedule_interval does not impact past dagruns or tasks. The change will affect when new dagruns are created, which impacts the tasks within those dagruns.
I personally do not modify the dag_id when I update a DAG's schedule_interval, for two reasons.
If I keep the previous DAG, I am unnecessarily inducing more stress on the scheduler for processing a DAG that will not be turned on.
If I do not keep the previous DAG, I essentially lose all the history of the dagruns from when it had a different schedule_interval.
Edit: it looks like there is a GitHub issue created to move the Common Pitfalls page, but it is stale.
Can someone please tell me whether a DAG in Airflow is just a graph (like a placeholder) without any actual data (like arguments) associated with it, or whether a DAG is like an instance (for a fixed set of arguments)?
I want a system where the set of operations to perform (given a set of arguments) is fixed, but the input will be different every time the set of operations is run. In simple terms, the pipeline is the same, but the arguments to the pipeline will be different every time it is run.
I want to know how to configure this in Airflow. Should I create a new DAG for every new set of arguments, or is there some other method?
In my case, the graph is the same, but I want to run it on different data (from different users) as it comes in. So, should I create a new DAG every time for new data?
Yes, you are correct: a DAG is basically a kind of one-way graph. You can create a DAG once by chaining multiple operators together to form your "structure".
Each operator can then take in multiple arguments that you can pass from the DAG definition file itself (if needed).
Or you can pass in a configuration object to the DAG, and access custom data from there using the context.
I would recommend reading the Airflow Docs for more examples: https://airflow.apache.org/concepts.html#tasks
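The configuration-object approach above can be sketched as a plain callable. The dag_run.conf lookup matches Airflow's behavior when a run is triggered with a conf payload; the "user_id" key is just an example, not an Airflow built-in:

```python
# Sketch: a PythonOperator-style callable reading per-run arguments
# from the triggering configuration via context["dag_run"].conf.
# conf is None when the run was not triggered with a payload, hence
# the `or {}` fallback. The "user_id" key is illustrative.
def process(**context):
    conf = context["dag_run"].conf or {}
    user_id = conf.get("user_id", "unknown")
    return f"processing data for user {user_id}"
```

With this shape, the same DAG structure handles different users' data: each trigger just supplies a different conf payload.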
You can think of an Airflow DAG as a program made of other programs, with the exception that it can't contain loops (it is acyclic). Will you change your program every time the input data changes? Of course it all depends on how you write your program, but usually you'd like your program to generalise, right? You don't want two different programs to do 2+2 and 3+3, but you will have different programs to show Facebook pages and to play Pokemon Go. If you want to do the same thing to similar data, then you want to write your DAG once and maybe only change environment arguments (DB connection, date, etc.). Airflow is perfectly suitable for that.
You do not need to create a new DAG every time, if the structure of the graph is the same.
Airflow DAGs are created via code, so you are free to create a code structure that allows you to pass in arguments each time. How you do that will require some creative thinking.
You could, for example, create a web form that accepts the arguments, stores them in a database, and then schedules the DAG with the Airflow REST API. The DAG code would then need to be written to retrieve the parameters from the database.
There are several other ways to accomplish what you are asking, they all just depend on your use case.
One caveat: the Airflow scheduler does not perform well if you change the start date of the DAG. For the idea above, you will need to set the start date earlier than your first DAG run and then set the schedule interval to off (None). This way you have a start date that doesn't change, plus dynamically triggered DAG runs.
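A DAG-file fragment matching that setup might look like the following. This is a declaration-only sketch, not a tested implementation: Airflow 2.x-style imports are assumed (paths differ in 1.10.x), and the dag_id and task names are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Fixed start_date in the past, no schedule: DagRuns are only created
# when the DAG is triggered externally (UI, CLI, or REST API).
with DAG(
    dag_id="user_pipeline",           # placeholder name
    start_date=datetime(2020, 1, 1),  # earlier than any run; never changed
    schedule_interval=None,           # "off": externally triggered only
    catchup=False,
) as dag:

    def process(**context):
        params = context["dag_run"].conf or {}  # per-run arguments
        print("running with", params)

    PythonOperator(task_id="process", python_callable=process)
```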
We are using Airflow as a workflow manager and scheduler.
Requirements:
1- We have an ETL pipeline in which data arrives hourly in different files and needs to be processed as soon as it arrives.
2- The data for every hour has a cut-off limit within which it can be updated, and once updated, the data needs to be reprocessed.
To solve the first requirement, we can use a file sensor with hourly macros to look for the file and start processing once the data is available.
For the second requirement, we were thinking of using some kind of subdag/task which can run at a different frequency until the cut-off time and reprocess if there is any update in the data.
But we couldn't find anything in Airflow that can run a task/subdag at a different frequency.
How can we achieve this ?
The API in Airflow seems to suggest it is built around backfilling, catching up, and scheduling to run regularly at an interval.
I have an ETL that extracts data on S3, where each node in the DAG works from the versions produced by its previous node (where its data comes from). For example, here are the nodes of the DAG:
ImageNet-mono
ImageNet-removed-red
ImageNet-mono-scaled-to-100x100
ImageNet-removed-red-scaled-to-100x100
where ImageNet-mono is the previous node of ImageNet-mono-scaled-to-100x100 and
where ImageNet-removed-red is the previous node of ImageNet-removed-red-scaled-to-100x100
Both of them go through transformation of scaled-to-100x100 pipeline but producing different data since the input is different.
As you can see, no date is involved. Is Airflow a good fit?
EDIT
Currently, the graph is simple enough to be managed manually, with fewer than 10 nodes. The nodes won't run at a regular interval. Instead, as soon as someone updates the code for a node, I have to run the downstream nodes manually one by one: python GetImageNet.py removed-red, then python scale.py 100 100 ImageNet-removed-red, and then python scale.py 100 100 ImageNet-mono. I am looking for a way to manage the graph with one click to trigger a run.
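The chain described above is exactly a small dependency graph, and "run the downstream nodes one by one" is a topological walk over it, which is what Airflow automates. As a pure-Python illustration (node names taken from the question, the sort itself is a simple sketch that assumes an acyclic graph):

```python
# The four nodes from the question and their upstream dependencies.
deps = {
    "ImageNet-mono": [],
    "ImageNet-removed-red": [],
    "ImageNet-mono-scaled-to-100x100": ["ImageNet-mono"],
    "ImageNet-removed-red-scaled-to-100x100": ["ImageNet-removed-red"],
}

def run_order(deps):
    """Return an order in which every node comes after its dependencies
    (a simple topological sort; assumes the graph is acyclic)."""
    order, done = [], set()
    while len(done) < len(deps):
        for node, parents in deps.items():
            if node not in done and all(p in done for p in parents):
                order.append(node)
                done.add(node)
    return order
```

Triggering a DAG run in Airflow performs this walk for you, running each task only after its upstream tasks have succeeded.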
I think it's fine to use Airflow as long as you find it useful to use the DAG representation. If your DAG does not need to be executed on a regular schedule, you can set the schedule to None instead of a crontab. You can then trigger your DAG via the API or manually via the web interface.
If you want to run specific tasks you can trigger your DAG and mark tasks as success or clear them using the web interface.