Airflow ExternalTaskSensor Stuck and Error

I have defined the external_sensor like this:

external_sensor = ExternalTaskSensor(
    task_id='ext_sensor_task',
    execution_delta=timedelta(minutes=0),
    external_dag_id='book_data',
    external_task_id='Dataframe_Windows_test',
    dag=dag,
)
The other task is defined like this:

dl_processing_windows = DL_Processing(
    task_id='dl_processing_windows',
    df_dataset_location=dl_config.WINDOWS_DATASET,
    ....
In the Airflow UI I get the error:

Argument ['task_id'] is required

I have two questions:
1. Why does this error occur?
2. Why does the sensor not work?
For reference, the DAG definition:

default_args = {
    'owner': 'Newt',
    'retries': 2,
    'retry_delay': timedelta(seconds=30),
    'depends_on_past': False,
}

dag = DAG(
    dag_id,
    start_date=datetime(2019, 11, 20),
    description='xxxx',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
)
The DAG parameters are the same for both DAGs!

I fixed it.
The start_date values of the two DAGs were not aligned. I set the start_date of both DAGs to the same value (the current date).
After the first run of the upstream DAG finished, the new DAG began to work!
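A minimal sketch of the aligned setup, assuming Airflow 1.10-style imports and two separate DAG files; the downstream dag_id and the DummyOperator are illustrative, everything else follows the question:

# upstream DAG file ('book_data'), sketch only
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

book_data_dag = DAG(
    'book_data',
    start_date=datetime(2019, 11, 20),      # same start_date in both DAGs
    schedule_interval=timedelta(hours=1),   # same schedule in both DAGs
)
Dataframe_Windows_test = DummyOperator(task_id='Dataframe_Windows_test', dag=book_data_dag)

# downstream DAG file, sketch only ('book_data_processing' is a hypothetical dag_id)
from airflow.sensors.external_task_sensor import ExternalTaskSensor

dag = DAG(
    'book_data_processing',
    start_date=datetime(2019, 11, 20),
    schedule_interval=timedelta(hours=1),
)
external_sensor = ExternalTaskSensor(
    task_id='ext_sensor_task',
    external_dag_id='book_data',
    external_task_id='Dataframe_Windows_test',
    execution_delta=timedelta(minutes=0),   # identical schedules, so no offset
    dag=dag,
)

With identical start_date and schedule_interval values, the sensor looks for a Dataframe_Windows_test run at exactly its own execution date, which is why execution_delta=timedelta(minutes=0) works.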

Related

Airflow v2.4.2 - New monthly DAG not running when scheduled

I have the following in the dag.py file. This is a newly pushed-to-prod DAG; it should have run at 14:00 UTC (9:00 EST) a few hours ago, but it still hasn't run, even though the UI still says it will run at 14:00 UTC.
DAG_NAME = "revenue_monthly"
START_DATE = datetime(2023, 1, 12)
SCHEDULE_INTERVAL = "0 14 3 * *"

default_args = {
    'owner': 'airflow',
    'start_date': START_DATE,
    'depends_on_past': False
}

dag = DAG(DAG_NAME,
          default_args=default_args,
          schedule_interval=SCHEDULE_INTERVAL,
          doc_md=doc_md,
          max_active_runs=1,
          catchup=False,
          )
(Screenshot of the Airflow UI showing the Next Run time omitted.)
The date and time you see as Next Run is the logical_date, which is the start of the data interval. With the current configuration the first DAG run will cover data from 2023-02-03 to 2023-03-03, so the DAG will only actually run on 2023-03-03 (the Run After date; you can see it when viewing the DAG by hovering over the schedule in the upper right corner).
Assuming you want the DAG to do the run it would have done on 2023-02-03 (today), you can achieve that by backfilling one run, either manually (see the CLI sketch at the end of this answer) or by using catchup=True with a start_date before 2023-01-03:
from airflow import DAG
from pendulum import datetime
from airflow.operators.empty import EmptyOperator

DAG_NAME = "revenue_monthly_1"
START_DATE = datetime(2023, 1, 1)
SCHEDULE_INTERVAL = "0 14 3 * *"
doc_md = "documentation"

default_args = {
    'owner': 'airflow',
    'start_date': START_DATE,
    'depends_on_past': False
}

with DAG(
    DAG_NAME,
    default_args=default_args,
    schedule_interval=SCHEDULE_INTERVAL,
    doc_md=doc_md,
    max_active_runs=1,
    catchup=True,
) as dag:
    t1 = EmptyOperator(task_id="t1")
This gave me one run with the run id scheduled__2023-01-03T14:00:00+00:00, and the next run has the data interval 2023-02-03 to 2023-03-03, which will run after 2023-03-03.
This guide might help with the terminology Airflow uses around schedules.
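If you would rather trigger only that one missed run instead of enabling catchup, the Airflow 2.x CLI can backfill it explicitly; a sketch, assuming the DAG's start_date is on or before the 2023-01-03 logical date (the date range shown is illustrative and covers just that one run):

$ airflow dags backfill -s 2023-01-03 -e 2023-01-04 revenue_monthly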

Airflow job not running as scheduled

I have a job that I had set to run at 9:00 UTC on Wednesday. It didn't run as planned by the end of that interval, which I thought was curious because I believe I have everything defined properly.
default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'noncomp_trial',
    default_args=default_args,
    description='test of dag',
    schedule_interval='0 9 * * 3',
    dagrun_timeout=timedelta(minutes=20))
If anyone has any advice here, it would be greatly appreciated!
The Airflow Scheduler runs a task once start_date + one schedule_interval has passed. In your example, the DAG won't run until 9:00 AM on Wednesday of the following week.
See more information about the relationship between start_date and schedule_interval here.
You could try setting your start_date to a static date a week or two in the past to see if that works. To make sure the Scheduler doesn't try to execute every start_date + schedule_interval occurrence between that new start_date and now, you can set catchup=False on the DAG. For example:
from datetime import datetime, timedelta

dag = DAG(
    'noncomp_trial',
    default_args={
        'start_date': datetime(2021, 7, 1),
        'retries': 1,
        'retry_delay': timedelta(minutes=5)
    },
    description='test of dag',
    schedule_interval='0 9 * * 3',
    dagrun_timeout=timedelta(minutes=20),
    catchup=False,
)

Apache Airflow scheduler not triggering a once-a-month task as expected

Here is the DAG, which I want to execute on a fixed date every month; for now I have set it to the 18th of every month.
But the task is getting triggered daily by the scheduler, even though catchup_by_default = False is set in the airflow.cfg file.
default_args = {
    'owner': 'muddassir',
    'depends_on_past': True,
    'start_date': datetime(2021, 3, 18),
    'retries': 1,
    'schedule_interval': '0 0 18 * *',
}
You have put schedule_interval in your default_args, which is not the way to schedule a DAG. default_args are applied to the tasks, as they are passed to the Operators and not to the DAG itself.
You can modify your code as follows: remove schedule_interval from default_args and pass it to the DAG instance instead, and set the catchup flag to False to avoid any backfills:
# assuming this is how you initialize your DAG
dag = DAG('your DAG UI name', default_args=default_args, schedule_interval='0 0 18 * *', catchup=False)
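A fuller sketch of the corrected setup; the dag_id, the DummyOperator, and its task_id below are illustrative placeholders, the rest follows the question:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'muddassir',
    'depends_on_past': True,
    'start_date': datetime(2021, 3, 18),
    'retries': 1,
}

# schedule_interval lives on the DAG, not in default_args
dag = DAG(
    'monthly_job',                    # hypothetical DAG name
    default_args=default_args,
    schedule_interval='0 0 18 * *',   # midnight on the 18th of every month
    catchup=False,
)

task = DummyOperator(task_id='placeholder_task', dag=dag)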

How to avoid running a task when it is already running

I have an Airflow task which is scheduled to run every 3 minutes.
Sometimes the task takes longer than 3 minutes, and the next scheduled run starts (or is queued) even though the previous one is still running.
Is there a way to define the DAG so that it does not even queue the task if it is already running?
# airflow related
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators import MsSqlOperator
# other packages
from datetime import datetime
from datetime import timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 7, 22, 15, 0, 0),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    catchup=False)

job1 = BashOperator(
    task_id='sales',
    bash_command='python2 /home/manager/ETL/sales.py',
    dag=dag)

job2 = MsSqlOperator(
    task_id='refresh_tabular',
    mssql_conn_id='mssql_globrands',
    sql="USE msdb ; EXEC dbo.sp_start_job N'refresh Management-sales' ; ",
    dag=dag)

job1 >> job2
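This question has no answer in the thread. One commonly used option (a sketch, not from the original post) is to set max_active_runs=1 on the DAG, so the scheduler allows at most one DAG run to execute at a time and later runs wait instead of running alongside the previous one:

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    max_active_runs=1,   # at most one DAG run executes at a time
    catchup=False)

Note that new runs may still be created and wait in a queued state; max_active_runs only prevents them from executing concurrently.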

airflow does not satisfy task dependencies

I have a simple Airflow workflow composed of two tasks. One downloads a CSV file containing stock data. The other extracts the maximum stock price and writes the result to another file.
If I run the first task and then the second, everything works fine; however, if I execute airflow run stocks_d get_max_share, it fails to meet the dependency.
import csv
from datetime import datetime
from datetime import timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import requests


def get_stock_data():
    url = "https://app.quotemedia.com/quotetools/getHistoryDownload.csv?&webmasterId=501&startDay=02&startMonth=02&startYear=2002&endDay=02&endMonth=07&endYear=2009&isRanged=false&symbol=APL"
    try:
        r = requests.get(url)
    except requests.RequestException as re:
        raise
    else:
        with open('/tmp/stocks/airflow_stock_data.txt', 'w') as f:
            f.write(r.text)


def get_max_share():
    stock_data = []
    stock_max = {}
    with open('/tmp/stocks/airflow_stock_data.txt', 'r') as f:
        stock_reader = csv.reader(f)
        next(stock_reader, None)
        for row in stock_reader:
            stock_data.append(row)
    for stock in stock_data:
        stock_max[stock[2]] = stock[0]
    with open('/tmp/stocks/max_stock', 'w') as f:
        stock_price = max(stock_max.keys())
        stock_max_price_date = stock_max[stock_price]
        stock_entry = stock_max_price_date + ' -> ' + stock_price
        f.write(stock_entry)


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 5, 30),
    'email': ['mainl#domain.io'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'catchup': False,
}

dag = DAG('stocks_d', default_args=default_args, schedule_interval=timedelta(minutes=5))

task_get_stocks = PythonOperator(task_id='get_stocks', python_callable=get_stock_data, dag=dag)
task_get_max_share = PythonOperator(task_id='get_max_share', python_callable=get_max_share, dag=dag)

task_get_max_share.set_upstream(task_get_stocks)
Any ideas why that happens?
$ airflow run stocks_d get_max_share
The above command only runs the get_max_share task; it does not run the upstream task first.
If you want the whole DAG to run, try the command below:
$ airflow trigger_dag stocks_d
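If the goal is instead to run the DAG for a specific date so that get_max_share executes after its upstream task, the era-appropriate backfill command is another option; a sketch with an illustrative date range:

$ airflow backfill stocks_d -s 2017-05-30 -e 2017-05-30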
