How to find out the delayed jobs in Airflow

Some of my DAGs are waiting to get scheduled, and some are waiting in the queue. I suspect there are reasons for this delay but am not sure how to start debugging the problem. The majority of the pipelines are running Spark jobs.
Can someone give me some directions on where to look in order to 1) analyse which DAGs were delayed (did not start at the scheduled time) and 2) find out whether the resources are sufficient? I'm quite new to scheduling in Airflow. Many thanks. Please let me know if I can describe the question better.

If you are looking for code that takes advantage of Airflow's wider capabilities, there are three modules within airflow.models which can be harnessed.
To programmatically retrieve all DAGs that your Airflow is aware of, we import DagBag. From the docs: "A dagbag is a collection of dags, parsed out of a folder tree and has high level configuration settings."
We utilise DagModel and the method get_current to initialise each dag_id present in our bag.
We check whether each DAG is active using the DagModel property is_paused.
We retrieve the latest DAG runs using DagRun.find.
We sort the individual DAG runs from latest to earliest.
Here you could just take element [0] to get the latest run; however, for your debugging purposes I loop through them all.
DagRun returns a lot of information for us to use. In my loop I output print(i, run.state, run.execution_date, run.start_date) so you can see what is going on under the hood. The fields available on each run include:
id
state
dag_id
queued_at
execution_date
start_date
end_date
run_id
data_interval_start
data_interval_end
last_scheduling_decision
I have commented out an if check for any queued DAGs for you to uncomment. Additionally, you can do some arithmetic on the dates if you desire, to add further conditional functionality.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import DagBag, DagModel, DagRun
from airflow.operators.python import PythonOperator


# make a function that returns if a DAG is set to active or paused
def check_dag_active():
    bag = DagBag()
    for dag_id in bag.dags:
        in_bag = DagModel.get_current(dag_id)
        if not in_bag.is_paused:
            latest = DagRun.find(dag_id=dag_id)
            latest.sort(key=lambda x: x.execution_date, reverse=True)
            for i, run in enumerate(latest):
                print(i, run.state, run.execution_date, run.start_date)
                # if run.state == 'queued':
                #     return [run.dag_id, run.execution_date, run.start_date]


with DAG(
    'stack_overflow_ans_3',
    tags=['SO'],
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
    is_paused_upon_creation=False
) as dag:

    t1 = PythonOperator(
        task_id='task_that_will_fail',
        python_callable=check_dag_active
    )

Depending on your version of Airflow and your setup, you should be able to query the Airflow DB directly to get this information.
If you're using Airflow 1.x, there should be an "Ad Hoc Query" option under the Data Profiling tab in the UI. This was disabled in 2.x though, so if you're running 2.x you'll need to connect directly to your Airflow DB using psql or something similar (how you do this differs between Google, AWS, and Docker setups).
Once you're in, check out this link for some queries on DAG runtime.
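If you prefer to stay in Python rather than raw SQL, a rough equivalent (my own sketch, not taken from the linked page, assuming Airflow 2.x and a placeholder dag_id) uses Airflow's own models. Note that start_date minus execution_date is only a crude proxy for delay, since a scheduled run is expected to start roughly one schedule interval after its logical date:

from airflow.models import DagRun

# Crude per-run scheduling delay: actual start time minus logical date.
for run in DagRun.find(dag_id="your_dag_id"):
    if run.start_date and run.execution_date:
        print(run.execution_date, run.state, run.start_date - run.execution_date)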

Related

run next tasks in dag if another dag is complete

dag1:
start >> clean >> end
I have a dag where I run a few tasks, but I want to modify it such that the clean step only runs if another dag, "dag2", is not running at the moment.
Is there any way I can import information regarding my "dag2", check its status, and proceed to the clean step if it is in a success state?
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve the wait_for_dag2 part?
There are different answers depending on what you want to do:
If you have two dags with the same schedule interval, and you want to make the run of the second dag wait for the same run of the first one, you can use an ExternalTaskSensor on the last task of the first dag (see the sketch after this list).
If you want to run dag2 after each run of dag1, even if dag1 is triggered manually, then you need to update dag1 to add a TriggerDagRunOperator and set the schedule interval of the second dag to None.
I want to modify it such that the clean step only runs if another dag "dag2" is not running at the moment.
If you have two dags and you don't want them to run at the same time, to avoid a conflict on an external server/service, you can use one of the first two propositions, or just use a higher priority for the tasks of the first dag and the same pool (with 1 slot) for the tasks that lead to the conflict, but you will lose parallelism on those tasks.
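Here is a minimal sketch of that first proposition, using the DAG and task names from the question (dag1 waits for dag2); it assumes Airflow 2.x and that both DAGs share the same schedule interval so their logical dates line up (otherwise you would need execution_delta or execution_date_fn):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    "dag1",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    start = DummyOperator(task_id="start")
    # external_task_id=None makes the sensor wait for the whole dag2 run
    # (of the same logical date) to reach a successful state.
    wait_for_dag2 = ExternalTaskSensor(
        task_id="wait_for_dag2",
        external_dag_id="dag2",
        external_task_id=None,
        poke_interval=60,
    )
    clean = DummyOperator(task_id="clean")
    end = DummyOperator(task_id="end")

    start >> wait_for_dag2 >> clean >> end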
Hossein's approach is the way people usually go. However, if you want to get info about any dag run data, you can use Airflow's functionality to get that info. The following approach is good when you do not want (or are not allowed) to modify the other dag:
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState

dag_runs = DagRun.find(dag_id='the_dag_id_you_want_to_check')
if not dag_runs:
    # guard against a DAG that has never run
    print('no dag runs found for this dag_id')
else:
    last_run = dag_runs[-1]
    if last_run.state == DagRunState.SUCCESS:
        print('the dag run was successful!')
    else:
        print('the dag state is -->: ', last_run.state)

how to check that particular dag is present in dag bucket or not

We have a requirement where we use one DAG to trigger another DAG. We want to add a check so that if the DAG is present in Airflow we trigger it, and otherwise ignore it. Is there any way to check whether a DAG is present or not?
We have tried to find something related to DagBag but have not gotten far.
Update:
from airflow.models import DagBag

def my_poc(**kwargs):
    dagbag = DagBag(None, False, False, True, True)
    dags = ['my_test_dag1', 'my_test_dag2']
    for dag in dags:
        try:
            dag_class = dagbag.get_dag(dag)
            print(dag, ' is present in bucket')
        except Exception:
            print(dag, ' is not present in bucket')
I have tried this and it's working as expected, but is there any better way to do it?
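One possible alternative (my own suggestion, not a confirmed best practice) is to skip the DagBag and ask the metadata database through DagModel, which returns None for a dag_id Airflow has never registered:

from airflow.models import DagModel

def dag_exists(dag_id: str) -> bool:
    # DagModel.get_current returns None when Airflow does not know this dag_id
    return DagModel.get_current(dag_id) is not None

for dag_id in ['my_test_dag1', 'my_test_dag2']:
    print(dag_id, 'is present' if dag_exists(dag_id) else 'is not present')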

Airflow - prevent dagrun from immediately running after deployment/unpause

It seems there has been previous discussion about this.
How do i stop airflow running a task the first time when i unpause it?
https://groups.google.com/g/cloud-composer-discuss/c/JGtmAd7xcsM?pli=1
When I deploy a dag to run at a specific time (say, once a day at 9AM), Airflow immediately runs the dag at deployment.
dag = DAG(
    'My Dag',
    default_args=default_args,
    schedule_interval='00 09 * * *',
    start_date=datetime(2021, 1, 1),
    catchup=False  # don't run previous runs / backfill; run only the latest
)
That's because with catchup=False, scheduler "creates a DAG run only for the latest interval", as indicated in the doc.
https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html
What I want to achieve is that I don't even want a DAG run for the latest interval to start. I want nothing to happen until the next time clock strikes 9AM.
It seems like out of the box, Airflow does not have any native solution to this problem.
What are some workarounds that people have been using? Perhaps something like checking whether the current time is close to next_execution_date?
When you update your dag you can set start_date to the next day.
However, it won't work if you pause/unpause the dag.
Note that it's recommended to be a static value (avoid using datetime.now() or similar dynamic values), so for every deployment you would need to specify a new value like datetime(2021, 10, 15), datetime(2021, 10, 16), ..., which might make deployment more difficult.
With the dag paused: create a dag run (via the UI at .../dagrun/add) with the Execution Date set to the one you need to skip. This makes its task instances accessible in the UI.
Mark those task instances as success in the UI.
Unpause the dag.
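Another workaround is a direct take on the "check the current time against next_execution_date" idea from the question. This is my own sketch, not an Airflow built-in: a ShortCircuitOperator guards the real work and skips any run whose data interval ended before a deployment timestamp. DEPLOY_DATE and the task names are hypothetical, and it assumes Airflow 2.2+ where data_interval_end is available in the task context:

from datetime import datetime, timezone

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import ShortCircuitOperator

# Assumed value, updated at each deployment (hypothetical).
DEPLOY_DATE = datetime(2021, 10, 15, tzinfo=timezone.utc)

def _not_an_immediate_run(data_interval_end=None, **_):
    # Return False (skip downstream tasks) for the run Airflow creates right
    # after deployment/unpause, whose interval ended before DEPLOY_DATE.
    return data_interval_end is not None and data_interval_end >= DEPLOY_DATE

with DAG(
    'my_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval='00 09 * * *',
    catchup=False,
) as dag:

    guard = ShortCircuitOperator(
        task_id='skip_immediate_run',
        python_callable=_not_an_immediate_run,
    )
    work = DummyOperator(task_id='work')

    guard >> work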

Can an Airflow task dynamically generate a DAG at runtime?

I have an upload folder that gets irregular uploads. For each uploaded file, I want to spawn a DAG that is specific to that file.
My first thought was to do this with a FileSensor that monitors the upload folder and, conditional on presence of new files, triggers a task that creates the separate DAGs. Conceptually:
Sensor_DAG (FileSensor -> CreateDAGTask)
|-> File1_DAG (Task1 -> Task2 -> ...)
|-> File2_DAG (Task1 -> Task2 -> ...)
In my initial implementation, CreateDAGTask was a PythonOperator that created DAG globals, by placing them in the global namespace (see this SO answer), like so:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from datetime import datetime, timedelta
from pathlib import Path

UPLOAD_LOCATION = "/opt/files/uploaded"

# Dynamic DAG generation task code, for the Sensor_DAG below
def generate_dags_for_files(location=UPLOAD_LOCATION, **kwargs):
    dags = []
    for filepath in Path(location).glob('*'):
        dag_name = f"process_{filepath.name}"
        dag = DAG(dag_name, schedule_interval="@once", default_args={
            "depends_on_past": True,
            "start_date": datetime(2020, 7, 15),
            "retries": 1,
            "retry_delay": timedelta(hours=12)
        }, catchup=False)
        dag_task = DummyOperator(dag=dag, task_id=f"start_{dag_name}")
        dags.append(dag)
        # Try to place the DAG into globals(), which doesn't work
        globals()[dag_name] = dag
    return dags
The main DAG then invokes this logic via a PythonOperator:
# File-sensing DAG
default_args = {
    "depends_on_past": False,
    "start_date": datetime(2020, 7, 16),
    "retries": 1,
    "retry_delay": timedelta(hours=5),
}

with DAG("Sensor_DAG", default_args=default_args,
         schedule_interval="50 * * * *", catchup=False) as sensor_dag:

    start_task = DummyOperator(task_id="start")
    stop_task = DummyOperator(task_id="stop")
    sensor_task = FileSensor(task_id="my_file_sensor_task",
                             poke_interval=60,
                             filepath=UPLOAD_LOCATION)
    process_creator_task = PythonOperator(
        task_id="process_creator",
        python_callable=generate_dags_for_files,
    )

    start_task >> sensor_task >> process_creator_task >> stop_task
But that doesn't work, because by the time process_creator_task runs, the globals have already been parsed by Airflow. New globals after parse time are irrelevant.
Interim solution
Per Airflow dynamic DAG and task Ids, I can achieve what I'm trying to do by omitting the FileSensor task altogether and just letting Airflow generate the per-file tasks at each scheduler heartbeat, replacing the Sensor_DAG with a plain call to generate_dags_for_files. Update: never mind -- while this does create a DAG in the dashboard, actual execution runs into the "DAG seems to be missing" issue:
generate_dags_for_files()
This does mean that I can no longer regulate the frequency of folder polling with the poke_interval parameter of FileSensor; instead, Airflow will poll the folder every time it collects DAGs.
Is that the best pattern here?
Other related StackOverflow threads
Run Airflow DAG for each file and Airflow: Proper way to run DAG for each file: identical use case, but the accepted answer uses two static DAGs, presumably with different parameters.
Proper way to create dynamic workflows in Airflow - accepted answer dynamically creates tasks, not DAGs, via a complicated XCom setup.
In short: if the task writes where the DagBag reads from, yes, but it's best to avoid a pattern that requires this. Any DAG you're tempted to custom-create in a task should probably instead be a static, heavily parametrized, conditionally-triggered DAG. y2k-shubham provides an excellent example of such a setup, and I'm grateful for his guidance in the comments on this question.
That said, here are the approaches that would accomplish what the question is asking, no matter how bad of an idea it is, in the increasing degree of ham-handedness:
If you dynamically generate DAGs from a Variable (like so), modify the Variable (a sketch follows below).
If you dynamically generate DAGs from a list of config files, add a new config file to wherever you're pulling config files from, so that a new DAG gets generated on the next DAG collection.
Use something like Jinja templating to write a new Python file in the dags/ folder.
To retain access to the task after it runs, you'd have to keep the new DAG definition stable and accessible on future dashboard updates / DagBag collection. Otherwise, the Airflow dashboard won't be able to render much about it.
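For the first approach, here is a minimal sketch (my own illustration; the Variable name files_to_process and the DAG naming are made up) of a file in dags/ that regenerates one DAG per entry in an Airflow Variable, so that any task or external process which updates the Variable effectively creates a new DAG on the next DagBag collection:

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy import DummyOperator

# Parsed on every DagBag collection; note that reading a Variable at parse time
# costs a metadata-DB query per parse.
for name in Variable.get("files_to_process", default_var=[], deserialize_json=True):
    dag_id = f"process_{name}"
    with DAG(dag_id, start_date=datetime(2020, 7, 15),
             schedule_interval="@once", catchup=False) as dag:
        DummyOperator(task_id=f"start_{dag_id}")
    globals()[dag_id] = dag

# A task elsewhere (e.g. downstream of the FileSensor) can register a new file
# by updating the Variable, for example:
#   files = Variable.get("files_to_process", default_var=[], deserialize_json=True)
#   Variable.set("files_to_process", files + ["new_file.csv"], serialize_json=True)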
Airflow is suited for building DAGs dynamically, as pointed out by its creator:
https://youtu.be/Fvu2oFyFCT0?t=411 p.s. thanks to @Yiannis for the video reference
Here is an example of how this could be accomplished:
https://docs.astronomer.io/learn/dynamically-generating-dags

Airflow: Next Run is not scheduled somehow [duplicate]

When I schedule DAGs to run at a specific time everyday, the DAG execution does not take place at all.
However, when I restart Airflow webserver and scheduler, the DAGs execute once on the scheduled time for that particular day and do not execute from the next day onwards.
I am using Airflow version v1.7.1.3 with python 2.7.6.
Here goes the DAG code:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
import time

n = time.strftime("%Y,%m,%d")
v = datetime.strptime(n, "%Y,%m,%d")

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': v,
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=10),
}

dag = DAG('dag_user_answer_attempts', default_args=default_args, schedule_interval='03 02 * * *')

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='user_answer_attempts',
    bash_command='python /home/ubuntu/bigcrons/appengine-flask-skeleton-master/useranswerattemptsgen.py',
    dag=dag)
Am I doing something wrong?
Your issue is the start_date being set for the current time. Airflow runs jobs at the end of an interval, not the beginning. This means that the first run of your job is going to be after the first interval.
Example:
You make a dag and put it live in Airflow at midnight. Today (20XX-01-01 00:00:00) is also the start_date, but it is hard-coded ("start_date":datetime(20XX,1,1)). The schedule interval is daily, like yours (3 2 * * *).
The first time this dag will be queued for execution is 20XX-01-02 02:03:00, because that is when the interval period ends. If you look at your dag being run at that time, it should have a started datetime of roughly one day after the schedule_date.
You can solve this by having your start_date hard-coded to a date or by making sure that the dynamic date is further in the past than the interval between executions (In your case, 2 days would be plenty). Airflow recommends you use static start_dates in case you need to re-run jobs or backfill (or end a dag).
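For illustration (the dates here are arbitrary examples), hard-coding a static start_date a couple of days in the past would look like the following; the first scheduled run is then only created after the first full interval following start_date has ended:

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2016, 10, 30),  # static date, safely in the past
    'retries': 1,
    'retry_delay': timedelta(minutes=10),
}

dag = DAG('dag_user_answer_attempts', default_args=default_args,
          schedule_interval='03 02 * * *')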
For more information on backfilling (the opposite side of this common stackoverflow question), check the docs or this question:
Airflow not scheduling Correctly Python
Check the following:
start_date is a fixed time in the past (don't use datetime.now())
if you don't want to run the historical data, use catchup=False
to set a specific time for the DAG to run (e.g. hourly, monthly, daily, or at a specific time), try using
https://crontab.guru/#40_21_*_*_* to write the expression you need.
If you think steps 1, 2 and 3 are all correct but the DAG is still not running, or the DAG can run every xx minutes but fails to trigger even once on a daily interval, try creating a new Python file, copying your DAG code there, and renaming it so that the file is unique, then test again. It could be that the Airflow scheduler got confused by an inconsistency between previous DAG runs' metadata and the current schedule.
Hope this helped!
From the schedule, your DAG should run every day at 02:03 AM. My suspicion is that the start_date might be impacting it. Can you hardcode it to something like 'start_date': datetime.datetime(2016, 11, 01) and try?
Great answer, apathyman.
It helped me a lot to understand. I was using days_ago(0), and once I changed it to days_ago(1) the scheduler started triggering.
