I know that it is possible to retry individual tasks, but is it possible to retry a complete DAG?
I create tasks dynamically, which is why I need to retry not a specific task but the complete DAG. If Airflow does not support this, maybe there is a workaround.
I wrote the script below and scheduled it on the Airflow master to rerun the failed DAG runs for the DAGs listed in the "dag_ids_to_monitor" array:
import subprocess
import re
from datetime import datetime

dag_ids_to_monitor = ['dag1', 'dag2', 'dag3']


def runBash(cmd):
    print("running bash command {}".format(cmd))
    # check_output returns bytes on Python 3, so decode before splitting into lines
    output = subprocess.check_output(cmd.split()).decode()
    return output


def datetime_valid(dt_str):
    try:
        print(datetime.strptime(dt_str, '%Y-%m-%dT%H:%M:%S'))
    except ValueError:
        return False
    return True


def get_schedules_to_rerun(dag_id):
    bashCommand = f"airflow list_dag_runs --state failed {dag_id}"
    output = runBash(bashCommand)
    schedules_to_rerun = []
    for line in output.split('\n'):
        parts = re.split(r"\s*\|\s*", line)
        # the 4th column holds the execution date; [:-6] drops the UTC offset
        if len(parts) > 4 and datetime_valid(parts[3][:-6]):
            schedules_to_rerun.append(parts[3])
    return schedules_to_rerun


def trigger_runs(dag_id, re_run_start_times):
    for start_time in re_run_start_times:
        runBash(f"airflow clear --no_confirm --start_date {start_time} --end_date {start_time} {dag_id}")


def rerun_failed_dag_runs(dag_id):
    re_run_start_times = get_schedules_to_rerun(dag_id)
    trigger_runs(dag_id, re_run_start_times)


for dag_id in dag_ids_to_monitor:
    rerun_failed_dag_runs(dag_id)
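The parsing step of the script can be tried without an Airflow installation. Below is a minimal sketch of just that step; the SAMPLE text is a made-up table that only imitates a pipe-separated layout with the execution date in the fourth column, and parse_failed_run_dates is a hypothetical helper name, not part of Airflow.

```python
import re
from datetime import datetime


def parse_failed_run_dates(cli_output):
    """Collect the 4th pipe-separated column of every row that holds a timestamp."""
    dates = []
    for line in cli_output.splitlines():
        parts = re.split(r"\s*\|\s*", line.strip())
        if len(parts) <= 4:
            continue
        candidate = parts[3]
        try:
            # drop the trailing UTC offset (e.g. "+00:00") before validating
            datetime.strptime(candidate[:-6], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # header rows and separators end up here
        dates.append(candidate)
    return dates


# Made-up sample, only for exercising the parser (assumption, not real CLI output)
SAMPLE = """\
id | run_id | state | execution_date | start_date
1 | scheduled__2020-01-01T00:00:00+00:00 | failed | 2020-01-01T00:00:00+00:00 | 2020-01-02T00:00:05+00:00
2 | scheduled__2020-01-02T00:00:00+00:00 | failed | 2020-01-02T00:00:00+00:00 | 2020-01-03T00:00:05+00:00
"""

print(parse_failed_run_dates(SAMPLE))
```

Keeping the parser separate from the subprocess call makes it easier to check against whatever your Airflow version actually prints.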
If you have access to the Airflow UI, go to Graph view.
In graph view, individual tasks are marked as boxes and the DAG run as a whole is indicated by circles. Click on a circle and then the clear option. This will restart the entire run.
Alternatively you can go to the tree view and clear the first task in the DAG.
Go to the Airflow UI, click on the first task(s) of your DAG, choose "Downstream" and "Recursive" to the right of the "Clear" button, and then press "Clear". This marks the DAG as not yet run and reruns it if the DAG schedule permits.
I am trying to create a graph with a chain of dynamic tasks.
First I started with the expand function. The problem is that the program waits until all the Add tasks have finished and only then starts the Mul tasks. I need each Mul to run immediately after its Add. Then I came up with code that could build the graph:
with DAG(dag_id="simple_maping", schedule='* * * * *', start_date=datetime(2022, 12, 22)) as dag:

    @task
    def read_conf():
        conf = Variable.get('tables', deserialize_json=True)
        return conf

    @task
    def add_one(x: str):
        sleep(5)
        return x + '1'

    @task
    def mul_two(x: str):
        return x * 2

    for i in read_conf():
        mul_two(add_one(i))
But now there is an error: 'XComArg' object is not iterable. I can fix it by removing the task decorator from the read_conf method, but I am not sure that is the best decision, because in my case the list of configuration names could contain more than 1000 elements. Without the decorator, the method has to read the configuration every time the scheduler parses the graph.
Maybe the load without the decorator will not be critical? Or is there a way to make the object iterable? What is the right way to do this?
EDIT: This solution has a bug in 2.5.0 which was solved for 2.5.1 (not released yet).
Yes, when you chain dynamically mapped tasks, the latter (mul_two) will by default wait until all mapped instances of the first task (add_one) are done, because the default trigger rule is all_success. You can change the trigger rule, for example to one_done, but this will not solve your issue: the second task decides only once, when it first starts running, how many mapped task instances it creates (with one_done it creates only one mapped task instance, so that is not helpful for your use case).
The issue with the for-loop (and why Airflow won't allow you to iterate over an XComArg) is that for-loops are evaluated when the DAG code is parsed, which happens outside of runtime, when Airflow does not yet know how many results read_conf() will return. If the number of configurations changes only rarely, then a for-loop like that iterating over a list in a separate file is an option, but yes, at scale this can cause performance issues.
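To make the parse-time versus run-time distinction concrete, here is a toy sketch in plain Python. LazyResult is a made-up stand-in for "the future output of a task" and is not an Airflow class; it only shows why looping over a placeholder fails until the value has actually been produced.

```python
class LazyResult:
    """Toy stand-in for a value that only exists once a task has run."""

    def __init__(self, producer):
        self.producer = producer  # callable executed later, at "run time"

    def resolve(self):
        return self.producer()


def read_conf():
    return ["a", "b", "c"]


placeholder = LazyResult(read_conf)  # "parse time": nothing has run yet

try:
    for item in placeholder:  # same mistake as looping over a task's output
        pass
    loop_error = None
except TypeError as exc:
    loop_error = str(exc)

resolved = placeholder.resolve()  # "run time": the list is finally known

print(loop_error)
print(resolved)
```

The placeholder knows how to get the list, but not how long it will be, which is the same position the DAG parser is in.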
The best solution in my opinion is to use dynamic task group mapping which was added in Airflow 2.5.0:
All mapped task groups will run in parallel and for every input from read_conf(). So for every add_one its mul_two will run immediately. I put the code for this below.
One note: You will not be able to see the mapped task groups in the Airflow UI or be able to access their logs just yet, the feature is still quite new and the UI extension should come in 2.5.1. That is why I added a task downstream of the mapped task groups that prints out the list of results of the mul_two tasks, so you can check if it is working.
from airflow import DAG
from airflow.decorators import task, task_group
from datetime import datetime
from time import sleep

with DAG(
    dag_id="simple_mapping",
    schedule=None,
    start_date=datetime(2022, 12, 22),
    catchup=False
) as dag:

    @task
    def read_conf():
        return [10, 20, 30]

    @task_group
    def calculations(x):

        @task
        def add_one(x: int):
            sleep(x)
            return x + 1

        @task()
        def mul_two(x: int):
            return x * 2

        mul_two(add_one(x))

    @task
    def pull_xcom(**context):
        pulled_xcom = context["ti"].xcom_pull(
            task_ids=['calculations.mul_two'],
            key="return_value"
        )
        print(pulled_xcom)

    calculations.expand(x=read_conf()) >> pull_xcom()
Hope this helps! :)
PS: you might want to set catchup=False unless you want to backfill a few weeks of tasks.
I was curious whether there is a way to customise the DAG runs.
I'm currently checking for updates to another table, which gets updated manually by someone, and once that table has been updated I run my DAG for the month.
At the moment I have created a branch operator that compares the dates of the two tables, but is there a way to run the DAG (compare the two dates) every day until there is a change, and then not run it for the remainder of the month?
For example,
Table A (which is updated manually) has YYYYMM as 202209 and Table B also has YYYYMM as 202209.
At the moment, my branch operator compares the two YYYYMM values and points to a dummy operator "end" when they are the same. However, when Table A has been updated to 202210, the two YYYYMM values differ, so another task runs and overwrites Table B.
It all works, but this runs the DAG every day even though Table A only gets updated once a month, at a random point in time within the month. So is there a way to stop the DAG from running for the remaining days of the month after the task has been triggered?
Hope this is clear.
If your data were stored on S3, there would be an easy solution starting from version 2.4: data-aware scheduling.
But you probably are not using it, so here is another option.
A DAG in Airflow is a DAG object that is assigned to the global scope. This allows DAGs to be created dynamically. It also implies that each file is reloaded at a certain interval. A very good description with examples is here
The second thing you need to use is Airflow Variables.
So the concept is as follows:
Create a variable in Airflow named dag_run that will hold the month in which the DAG last ran successfully.
Create a Python file with a function that creates a DAG object based on input parameters.
In the same file, use conditional statements that set the 'schedule' param differently depending on whether the DAG has already run for the current month.
In your DAG, in the branch that executes when the data has changed, set the variable dag_run to the current month's value like so: Variable.set(key='dag_run', value=datetime.now().month)
Python code:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
from airflow.models import Variable


# function that creates a DAG based on its input
def create_dag(dag_id,
               schedule,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_id)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag


# run some checks
current_month = datetime.now().month
# same variable that the branch sets with Variable.set(key='dag_run', ...)
dag_run_month = int(Variable.get('dag_run'))

if current_month == dag_run_month:
    # keep the schedule off
    schedule = None
    dag_id = "Database_insync"
else:
    # keep the schedule on
    schedule = "30 * * * *"
    dag_id = "Database_notsynced"

# watch out for start_date: if you leave it in the past,
# Airflow will execute the missed past schedules
default_args = {'owner': 'airflow',
                'start_date': datetime.now() - timedelta(minutes=15)
                }

globals()[dag_id] = create_dag(dag_id,
                               schedule,
                               default_args)
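The on/off decision can also be pulled out into a small pure function, which makes it testable without Airflow. This is only a sketch: pick_schedule is a hypothetical helper, and storing the period as a YYYYMM string (instead of the bare month number used above) is my assumption, borrowed from the YYYYMM columns in the question so that the same month in different years is not confused.

```python
from datetime import datetime


def pick_schedule(now, last_synced_period):
    """Return (schedule, dag_id) depending on whether this period already ran."""
    current_period = now.strftime("%Y%m")  # e.g. "202210"
    if current_period == last_synced_period:
        # already synced this month: keep the schedule off
        return None, "Database_insync"
    # not synced yet: keep the schedule on
    return "30 * * * *", "Database_notsynced"


print(pick_schedule(datetime(2022, 10, 5), "202210"))
print(pick_schedule(datetime(2022, 10, 5), "202209"))
```

In the DAG file, the second argument would come from the Airflow Variable, and the branch that detects the change would store datetime.now().strftime("%Y%m") in it.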
I have a main DAG which retrieves a file and splits the data in this file into separate csv files.
I have another set of tasks that must be done for each of these csv files (e.g. uploading to GCS, inserting into BigQuery).
How can I generate a SubDag for each file dynamically, based on the number of files? The SubDag will define the tasks (uploading to GCS, inserting into BigQuery, deleting the csv file).
So right now, this is what it looks like:
main_dag = DAG(....)
download_operator = SFTPOperator(dag = main_dag, ...) # downloads file
transform_operator = PythonOperator(dag = main_dag, ...) # Splits data and writes csv files
def subdag_factory(): # Will return a subdag with tasks for uploading to GCS, inserting to BigQuery.
    ...
    ...
How can I call the subdag_factory for each file generated in transform_operator?
I tried creating subdags dynamically as follows
# import paths as used in Airflow 1.10
import airflow.utils.helpers
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.subdag_operator import SubDagOperator


# create and return a DAG
def create_subdag(dag_parent, dag_id_child_prefix, db_name):
    # dag params
    dag_id_child = '%s.%s' % (dag_parent.dag_id, dag_id_child_prefix + db_name)
    default_args_copy = default_args.copy()

    # dag
    dag = DAG(dag_id=dag_id_child,
              default_args=default_args_copy,
              schedule_interval='@once')

    # operators
    tid_check = 'check2_db_' + db_name
    py_op_check = PythonOperator(task_id=tid_check, dag=dag,
                                 python_callable=check_sync_enabled,
                                 op_args=[db_name])

    tid_spark = 'spark2_submit_' + db_name
    py_op_spark = PythonOperator(task_id=tid_spark, dag=dag,
                                 python_callable=spark_submit,
                                 op_args=[db_name])

    py_op_check >> py_op_spark
    return dag


# wrap DAG into SubDagOperator
def create_subdag_operator(dag_parent, db_name):
    tid_subdag = 'subdag_' + db_name
    subdag = create_subdag(dag_parent, tid_prefix_subdag, db_name)
    sd_op = SubDagOperator(task_id=tid_subdag, dag=dag_parent, subdag=subdag)
    return sd_op


# create SubDagOperator for each db in db_names
def create_all_subdag_operators(dag_parent, db_names):
    subdags = [create_subdag_operator(dag_parent, db_name) for db_name in db_names]
    # chain subdag-operators together
    airflow.utils.helpers.chain(*subdags)
    return subdags


# (top-level) DAG & operators
dag = DAG(dag_id=dag_id_parent,
          default_args=default_args,
          schedule_interval=None)
subdag_ops = create_all_subdag_operators(dag, db_names)
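One detail worth checking in the code above is the id of the child DAG. As far as I know, in Airflow 1.x the SubDagOperator expects the subdag's dag_id to have the form parent_dag_id.task_id, so the code only lines up if tid_prefix_subdag equals 'subdag_' (that value is my assumption, it is not shown above). A minimal sketch of the naming rule, with hypothetical helper names:

```python
def child_dag_id(parent_dag_id, subdag_task_id):
    """Build the child dag_id in the '<parent>.<task_id>' form."""
    return "{}.{}".format(parent_dag_id, subdag_task_id)


def subdag_ids(parent_dag_id, db_names, prefix="subdag_"):
    """Map each db name to the dag_id its subdag would need (prefix is assumed)."""
    return {name: child_dag_id(parent_dag_id, prefix + name) for name in db_names}


print(subdag_ids("parent", ["db1", "db2"]))
```

Deriving both the task_id and the child dag_id from the same prefix keeps the two from drifting apart.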
Note that the list of inputs for which subdags are created, here db_names, can either be declared statically in the Python file or be read from an external source.
The resulting DAG is a chain of SubDagOperators, one per entry in db_names.
Diving into SubDAG(s)
Airflow can deal with dynamic DAGs in two different ways.
One way is to define your dynamic DAG in a single Python file and put it into dags_folder. It generates DAGs dynamically from an external source (config files in another directory, SQL, NoSQL, etc.). The fewer changes to the structure of the DAG, the better (which is true in every situation). For instance, our DAG file generates a DAG for every record (or file) and generates the dag_id as well. On every Airflow scheduler heartbeat, this code goes through the list and generates the corresponding DAGs. Pros: not many, just one code file to change. Cons: a lot, and they stem from the way Airflow works. For every new DAG (dag_id), Airflow writes its steps into the database, so when the number of steps or the name of a step changes, it might break the web server. When you delete a DAG from your list, it becomes a kind of orphan: you can't access it from the web interface and have no control over it; you can't see its steps, you can't restart it, and so on. If you have a static list of DAGs whose IDs are not going to change but whose steps occasionally do, this method is acceptable.
So at some point I came up with another solution. You have static DAGs (they are still dynamic in that a script generates them, but their structure and IDs do not change). Instead of one script that walks through the list, such as a directory, and generates DAGs, you create two static DAGs: one monitors the directory periodically (*/10 * * * *), and the other is triggered by the first. When a new file or files appear, the first DAG triggers the second one with the conf argument. The following code has to be executed for every file in the directory.
session = settings.Session()
dr = DagRun(
    dag_id=dag_to_be_triggered,
    run_id=uuid_run_id,
    conf={'file_path': path_to_the_file},
    execution_date=datetime.now(),
    start_date=datetime.now(),
    external_trigger=True)
logging.info("Creating DagRun {}".format(dr))
session.add(dr)
session.commit()
session.close()
The triggered DAG can receive the conf arg and finish all the required tasks for the particular file. To access the conf param use this:
def work_with_the_file(**context):
    path_to_file = context['dag_run'].conf['file_path'] \
        if 'file_path' in context['dag_run'].conf else None

    if not path_to_file:
        raise Exception('path_to_file must be provided')
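The conf lookup can be tried outside Airflow with a fake context. In this sketch get_file_path is a hypothetical helper and SimpleNamespace only imitates a DagRun object with a conf attribute; it also covers the case where the run was started without any conf at all.

```python
from types import SimpleNamespace


def get_file_path(context):
    """Read 'file_path' from the triggering run's conf, failing loudly if absent."""
    # conf may be None when the run was not triggered with a payload
    conf = getattr(context["dag_run"], "conf", None) or {}
    path = conf.get("file_path")
    if not path:
        raise ValueError("file_path must be provided in the DagRun conf")
    return path


# Fake context for illustration; in a real task Airflow supplies 'dag_run'
fake_context = {"dag_run": SimpleNamespace(conf={"file_path": "/tmp/example.csv"})}
print(get_file_path(fake_context))
```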
Pros: all the flexibility and functionality of Airflow.
Cons: the monitor DAG can be spammy.
I need the status of a task (whether it is running, up_for_retry or failed) within the same DAG. So I tried to get it using the code below, though I got no output...
Auto = PythonOperator(
    task_id='test_sleep',
    python_callable=execute_on_emr,
    op_kwargs={'cmd':'python /home/hadoop/test/testsleep.py'},
    dag=dag)
logger.info(Auto)
The intention is to kill certain running tasks once a particular task in Airflow completes.
The question is: how do I get the state of a task, i.e. whether it is in the running, failed or success state?
I am doing something similar. I need to check for one task if the previous 10 runs of another task were successful.
taky2 sent me on the right path. It is actually fairly easy:
from airflow.models import TaskInstance
ti = TaskInstance(*your_task*, execution_date)
state = ti.current_state()
As I want to check that within the DAG, it is not necessary to specify the DAG.
I simply created a function to loop through the past n_days and check the status.
from datetime import timedelta

def check_status(**kwargs):
    last_n_days = 10
    for n in range(0, last_n_days):
        date = kwargs['execution_date'] - timedelta(n)
        # my_task is the task object defined within the DAG, not the task_id
        # (in the example below: check_success_task rather than 'check_success_days_before')
        ti = TaskInstance(*my_task*, date)
        state = ti.current_state()
        if state != 'success':
            raise ValueError('Not all previous tasks successfully completed.')
When you call the function, make sure to set provide_context.
check_success_task = PythonOperator(
    task_id='check_success_days_before',
    python_callable=check_status,
    provide_context=True,
    dag=dag
)
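The date arithmetic in check_status can be checked on its own. In the sketch below the state lookup is replaced by a plain dict (an assumption for illustration; in the real DAG the state comes from TaskInstance.current_state()), and both function names are hypothetical.

```python
from datetime import datetime, timedelta


def dates_to_check(execution_date, last_n_days=10):
    """Execution dates of the current run and the runs of the preceding days."""
    return [execution_date - timedelta(days=n) for n in range(last_n_days)]


def first_unsuccessful(states_by_date, execution_date, last_n_days=10):
    """Return the first date whose state is not 'success', or None if all passed."""
    for date in dates_to_check(execution_date, last_n_days):
        if states_by_date.get(date) != "success":
            return date
    return None


run = datetime(2020, 1, 10)
states = {run - timedelta(days=n): "success" for n in range(3)}
states[run - timedelta(days=1)] = "failed"
print(first_unsuccessful(states, run, last_n_days=3))
```

Note that range starts at 0, so the current execution date is part of the check, exactly as in the loop above.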
UPDATE:
When you want to call a task from another dag, you need to call it like this:
from airflow import configuration as conf
from airflow.models import DagBag, TaskInstance
dag_folder = conf.get('core','DAGS_FOLDER')
dagbag = DagBag(dag_folder)
check_dag = dagbag.dags[*my_dag_id*]
my_task = check_dag.get_task(*my_task_id*)
ti = TaskInstance(my_task, date)
Apparently there is also an api-call by now doing the same thing:
from airflow.api.common.experimental.get_task_instance import get_task_instance
ti = get_task_instance(*my_dag_id*, *my_task_id*, date)
Take a look at the code responsible for the command line interface operation suggested by Priyank.
https://github.com/apache/incubator-airflow/blob/2318cea74d4f71fba353eaca9bb3c4fd3cdb06c0/airflow/bin/cli.py#L581
def task_state(args):
    dag = get_dag(args)
    task = dag.get_task(task_id=args.task_id)
    ti = TaskInstance(task, args.execution_date)
    print(ti.current_state())
Hence, it seems you should easily be able to accomplish this within your DAG codebase using similar code.
Alternatively, you could execute these CLI operations from within your code using Python's subprocess library.
Okay, I think I know what you're doing and I don't really agree with it, but I'll start with an answer.
A straightforward, but hackish, way would be to query the task_instance table. I'm in postgres, but the structure should be the same. Start by grabbing the task_ids and state of the task you're interested in with a db call.
SELECT task_id, state
FROM task_instance
WHERE dag_id = '<dag_id_attrib>'
AND execution_date = '<execution_date_attrib>'
AND task_id = '<task_to_check>'
That should give you the state (and name, for reference) of the task you're trying to monitor. State is stored as a simple lowercase string.
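To try the query without touching the real metadata database, it can be run against an in-memory SQLite table. This is only a sketch: the toy table below has just the four columns the query uses, the rows are made up, and the helper name task_state is hypothetical; the real task_instance table has many more columns.

```python
import sqlite3

# Toy copy of the relevant columns (assumption: real schema is much larger)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance "
    "(task_id TEXT, dag_id TEXT, execution_date TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
    [
        ("extract", "example_dag", "2020-01-01T00:00:00", "success"),
        ("load", "example_dag", "2020-01-01T00:00:00", "running"),
    ],
)


def task_state(conn, dag_id, execution_date, task_id):
    """Return the state string of one task instance, or None if no row matches."""
    row = conn.execute(
        "SELECT task_id, state FROM task_instance "
        "WHERE dag_id = ? AND execution_date = ? AND task_id = ?",
        (dag_id, execution_date, task_id),
    ).fetchone()
    return row[1] if row else None


print(task_state(conn, "example_dag", "2020-01-01T00:00:00", "load"))
```

Using bound parameters instead of pasting the values into the SQL string avoids quoting problems with the execution date.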
You can use the command line Interface for this:
airflow task_state [-h] [-sd SUBDIR] dag_id task_id execution_date
For more on this you can refer official airflow documentation:
http://airflow.incubator.apache.org/cli.html
I am really a newbie in this forum. But I have been playing with airflow, for sometime, for our company. Sorry if this question sounds really dumb.
I am writing a pipeline using bunch of BashOperators.
Basically, for each Task, I want to simply call a REST api using 'curl'
This is what my pipeline looks like(very simplified version):
from airflow import DAG
from airflow.operators import BashOperator, PythonOperator
from dateutil import tz
import datetime
datetime_obj = datetime.datetime
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.datetime.combine(datetime_obj.today() - datetime.timedelta(1), datetime_obj.min.time()),
    'email': ['xxxx@xxx.xxx'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': datetime.timedelta(minutes=5),
}
current_datetime = datetime_obj.now(tz=tz.tzlocal())
dag = DAG(
    'test_run', default_args=default_args, schedule_interval=datetime.timedelta(minutes=60))
curl_cmd='curl -XPOST "'+hostname+':8000/run?st='+current_datetime +'"'
t1 = BashOperator(
    task_id='rest-api-1',
    bash_command=curl_cmd,
    dag=dag)
As you can see, I am doing current_datetime = datetime_obj.now(tz=tz.tzlocal()).
What I want here instead is 'execution_date'.
How do I use 'execution_date' directly and assign it to a variable in my Python file?
I am having this general issue of accessing args.
Any help will be genuinely appreciated.
Thanks
The BashOperator's bash_command argument is a template. You can access execution_date in any template as a datetime object using the execution_date variable. In the template, you can use any jinja2 methods to manipulate it.
Using the following as your BashOperator bash_command string:
# pass in the first of the current month
some_command.sh {{ execution_date.replace(day=1) }}
# last day of previous month
some_command.sh {{ execution_date.replace(day=1) - macros.timedelta(days=1) }}
If you just want the string equivalent of the execution date, ds will return a datestamp (YYYY-MM-DD), ds_nodash returns same without dashes (YYYYMMDD), etc. More on macros is available in the Api Docs.
Your final operator would look like:
command = """curl -XPOST '%(hostname)s:8000/run?st={{ ds }}'""" % locals()
t1 = BashOperator( task_id='rest-api-1', bash_command=command, dag=dag)
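If it helps to see what the date expressions in the templates above evaluate to, here is the same arithmetic in plain Python. The execution_date value is made up for the example, and strftime is used here only to imitate the ds and ds_nodash formats described above.

```python
from datetime import datetime, timedelta

execution_date = datetime(2016, 3, 15, 6, 30)  # made-up example value

ds = execution_date.strftime("%Y-%m-%d")       # YYYY-MM-DD, like {{ ds }}
ds_nodash = execution_date.strftime("%Y%m%d")  # YYYYMMDD, like {{ ds_nodash }}

# first of the current month
first_of_month = execution_date.replace(day=1)
# last day of the previous month
last_day_prev_month = first_of_month - timedelta(days=1)

print(ds, ds_nodash, first_of_month, last_day_prev_month)
```

Going back one day from the first of the month is what makes the second template expression robust to months of different lengths.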
The PythonOperator constructor takes a 'provide_context' parameter (see https://pythonhosted.org/airflow/code.html). If it's True, then it passes a number of parameters into the python_callable via kwargs. kwargs['execution_date'] is what you want, I believe.
Something like this:
def python_method(ds, **kwargs):
    Variable.set('execution_date', kwargs['execution_date'])
    return

doit = PythonOperator(
    task_id='doit',
    provide_context=True,
    python_callable=python_method,
    dag=dag)
I'm not sure how to do it with the BashOperator, but you might start with this issue: https://github.com/airbnb/airflow/issues/775
I think you can't assign variables with values from the airflow context outside of a task instance, they are only available at run-time. Basically there are 2 different steps when a dag is loaded and executed in airflow :
First your dag file is interpreted and parsed. It has to work and compile and the task definitions must be correct (no syntax error or anything). During this step, if you make function calls to fill some values, these functions won't be able to access airflow context (the execution date for example, even more if you're doing some backfilling).
The second step is the execution of the dag. It's only during this second step that the variables provided by airflow (execution_date, ds, etc...) are available as they are related to an execution of the dag.
So you can't initialize global variables using the Airflow context, however, Airflow gives you multiple mechanisms to achieve the same effect :
Using jinja template in your command (it can be in a string in the code or in a file, both will be processed). You have the list of available templates here : https://airflow.apache.org/macros.html#default-variables. Note that some functions are also available, particularly for computing days delta and date formatting.
Using a PythonOperator to which you pass the context (with the provide_context argument). This will allow you to access the same template variables with the syntax kwargs['<variable_name>']. If you need to, you can return a value from a PythonOperator; it will be stored in an XCOM variable that you can use later in any template. Access to XCOM variables uses this syntax: https://airflow.apache.org/concepts.html#xcoms
If you write your own operator, you can access airflow variables with the dict context.
def execute(self, context):
    execution_date = context.get("execution_date")
This should be inside the execute() method of Operator
To print execution date inside the callable function of your PythonOperator you can use the following in your Airflow Script and also can add start_time and end_time as follows:
def python_func(**kwargs):
    execution_date = kwargs["execution_date"]  # <datetime> type with timezone
    end_time = str(execution_date)
    start_time = str(execution_date.add(minutes=-30))
I have converted the datetime value to string as I need to pass it in a SQL Query. We can use it otherwise also.
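The .add(minutes=-30) call above relies on the execution date being a pendulum-style datetime, which I believe is what Airflow passes in. If you only have a standard library datetime, the same window can be computed with timedelta. A minimal sketch, with time_window as a hypothetical helper and a made-up timestamp:

```python
from datetime import datetime, timedelta, timezone


def time_window(execution_date, minutes=30):
    """Return (start, end) as ISO strings for the window ending at execution_date."""
    start = execution_date - timedelta(minutes=minutes)
    return start.isoformat(), execution_date.isoformat()


# made-up, timezone-aware example value
start_time, end_time = time_window(datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc))
print(start_time, end_time)
```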
You may consider SimpleHttpOperator https://airflow.apache.org/_api/airflow/operators/http_operator/index.html#airflow.operators.http_operator.SimpleHttpOperator. It makes HTTP requests very simple. You can pass execution_date in the endpoint parameter via a template.
Here's another way, without the context. Using the DAG's last execution time can be very helpful in scheduled ETL jobs, such as a DAG that 'downloads all newly added files'. Instead of hardcoding a datetime.datetime, use the DAG's last execution date as your time filter.
Airflow has a class called DagRun whose records can be looked up like so: dag_runs = DagRun.find(dag_id=dag_id)
Here's an easy way to get the most recent run's execution time:
from airflow.models import DagRun

def get_most_recent_dag_run(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    # index 0 is presumably the run that is executing right now when this is
    # called from inside the DAG, so index 1 is the previous run
    return dag_runs[1] if len(dag_runs) > 1 else None
Then, within your PythonOperator, you can dynamically access the DAG's last execution by calling the function you created above:
last_execution = get_most_recent_dag_run('dag')
Now it's a variable!
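The choice between index 0 and index 1 is easy to get wrong, so here is the selection logic on its own with fake run records. FakeRun is a made-up stand-in for DagRun and pick_run is a hypothetical helper; the assumption is that, when called from inside a running DAG, the newest record is the current run itself.

```python
from collections import namedtuple
from datetime import datetime

# Stand-in for DagRun with only the fields this sketch needs (assumption)
FakeRun = namedtuple("FakeRun", ["run_id", "execution_date"])


def pick_run(dag_runs, skip_current=True):
    """Newest run first; optionally skip the newest one (the run in progress)."""
    ordered = sorted(dag_runs, key=lambda r: r.execution_date, reverse=True)
    index = 1 if skip_current else 0
    return ordered[index] if len(ordered) > index else None


runs = [
    FakeRun("a", datetime(2020, 1, 1)),
    FakeRun("c", datetime(2020, 1, 3)),
    FakeRun("b", datetime(2020, 1, 2)),
]
print(pick_run(runs))
print(pick_run(runs, skip_current=False))
```

If you call it from outside the DAG, for example from a script, skip_current=False is probably what you want.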