Airflow cron not running in timezone - airflow

I have Airflow set up with default_timezone = US/Eastern. If I set a DAG with schedule_interval="0 17 * * *", it runs at 12 pm instead of the expected 5 pm. I understand that Airflow stores all dates as UTC. How can I get the schedule to run in my timezone instead of writing the interval in UTC?
I have also tried setting tzinfo in the DAG's start_date to pendulum.timezone('US/Eastern'), with no luck.
airflow=1.10.0
rhel7
python=3.6
server's tz = EST
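
For reference, this is roughly the pendulum pattern from the Airflow timezone docs that I expected to work; a minimal sketch (the dag_id and dates are illustrative):

import pendulum
from datetime import datetime
from airflow import DAG

local_tz = pendulum.timezone("US/Eastern")

dag = DAG(
    dag_id="eastern_dag",  # illustrative name
    start_date=datetime(2018, 9, 1, tzinfo=local_tz),
    schedule_interval="0 17 * * *",  # should be interpreted relative to the start_date's timezone
)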

Related

Airflow External Task Sensor with unscheduled upstream DAG

We use airflow in a hybrid ETL system. By this I mean that some of our DAGs are not scheduled but externally triggered using the Airflow API.
We are trying to do the following: Have a sensor in a scheduled DAG (DAG1) that senses that a task inside an externally triggered DAG (DAG2) has run.
For example, DAG1 runs at 11 am, and we want to be sure that DAG2 has run (due to an external trigger) at least once since 00:00. I have tried setting execution_delta = timedelta(hours=11), but the sensor detects nothing. I think the problem is that the sensor looks for a task that was scheduled exactly at 00:00, which won't be the case, as DAG2 can be triggered at any time between 00:00 and 11:00.
Is there any solution that can serve the purpose we need? I think we might need to create a custom Sensor, but it feels strange to me that the native Airflow Sensor does not solve this issue.
This is the sensor I'm defining:
from datetime import timedelta

from airflow.sensors import external_task

sensor = external_task.ExternalTaskSensor(
    task_id='sensor',
    dag=dag,
    external_dag_id='DAG2',
    external_task_id='sensed_task',
    mode='reschedule',
    check_existence=True,
    execution_delta=timedelta(hours=int(execution_type)),  # execution_type is defined elsewhere in the DAG file
    poke_interval=10 * 60,  # check every 10 minutes
    timeout=1 * 60 * 60,  # allow for 1 hour of delay in execution
)
I had the same problem and used the execution_date_fn parameter:

ExternalTaskSensor(
    task_id="sensor",
    external_dag_id="dag_id",
    execution_date_fn=get_most_recent_dag_run,
    mode="reschedule",
)

where the get_most_recent_dag_run function looks like this:

from airflow.models import DagRun

def get_most_recent_dag_run(dt):
    # Return the execution_date of the latest run of the external DAG,
    # regardless of when it was triggered
    dag_runs = DagRun.find(dag_id="dag_id")
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    if dag_runs:
        return dag_runs[0].execution_date

This works because the ExternalTaskSensor needs to know both the dag_id and the exact execution_date of the last run for cross-DAG dependencies. Note that ExternalTaskSensor accepts only one of execution_delta or execution_date_fn, so the execution_delta from the question should be dropped when using this approach.

How to force a Airflow Task to restart at the new scheduling date?

I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator

with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1
         ) as dag:

    task_a = BashOperator(
        task_id="ToRepeat",
        bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
        retries=1,
    )
The task takes a variable amount of time from one run to the next, and I have no guarantee that it will finish before 5 A.M. the next day.
If the task is still running when a new run is scheduled to start, I need to kill the old one before the new one starts.
How can I design the Airflow DAG so that it automatically kills the old task if it is still running when a new task is scheduled to start?
More details:
I am looking for something dynamic. The old DAG run should be killed only when the new one is starting. If, for any reason, the new DAG run does not start for a week, then the old one should be able to keep running for that entire week. That's why using a timeout is sub-optimal.
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns.
Since your DAG runs daily you can set 24 hours for timeout.
with DAG("Second Dag",
start_date=datetime(2022,1,1),
schedule_interval="0 5 * * *",
catchup=False,
max_active_runs=1
dagrun_timeout=timedelta(hours=24)
) as dag:
If you want to set timeout on a specific task in your DAG you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
If you really are looking for a dynamic solution, you can use the Airflow DagRun REST API together with XComs: push the current DAG run's run_id to an XCom, and on subsequent runs pull that XCom and call the API to check for and kill the previous DAG run with that run_id.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
and your API call task could look something like this:
...
kill_previous_dag_run = BashOperator(
    task_id="kill_previous_dag_run",
    bash_command="curl -X 'DELETE' \
        'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' \
        -H 'accept: */*' --user <<api_username>>:<<api_user_password>>",
    dag=dag
)
...
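A rough sketch of the companion tasks in that chain (push_current_run_id and check_previous_dag_run_id are the hypothetical names used above; the XCom key and callables are illustrative, assuming Airflow 2):

from airflow.operators.python import PythonOperator

def _push_current_run_id(ti=None, run_id=None, **_):
    # Remember this run's run_id so the next scheduled run can find it
    ti.xcom_push(key="previous_run_id", value=run_id)

def _check_previous_dag_run_id(ti=None, **_):
    # Pull the run_id stored by the previous run (include_prior_dates looks
    # across earlier DagRuns); the returned value is pushed to XCom so a
    # downstream task can kill that run via the API
    return ti.xcom_pull(
        task_ids="push_current_run_id",
        key="previous_run_id",
        include_prior_dates=True,
    )

push_current_run_id = PythonOperator(
    task_id="push_current_run_id",
    python_callable=_push_current_run_id,
    dag=dag,
)

check_previous_dag_run_id = PythonOperator(
    task_id="check_previous_dag_run_id",
    python_callable=_check_previous_dag_run_id,
    dag=dag,
)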

Avoid expired dates in Airflow

I have the following airflow DAG:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
#Runs every 1 minute
dag = DAG(dag_id='example_dag', start_date=datetime(2020, 1, 1), schedule_interval='*/1 * * * *')
t1 = BashOperator(task_id='bash_task', bash_command='echo Hello!', dag=dag)
The problem is that Airflow schedules and executes runs for past dates: the first minute of 2020, the second minute of 2020, the third minute of 2020, and so on.
I want Airflow to schedule and execute only the tasks that occur after the DAG is deployed (i.e. if I deploy today, I want the first task to be executed in the next minute) and not to execute expired tasks.
Any advice? Thanks!
I found the answer here. Read the "Catchup and Idempotent DAG" section.
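In short, that section boils down to the catchup flag; a minimal sketch applied to the DAG above:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# catchup=False stops the scheduler from backfilling all the past
# intervals between start_date and now; only new intervals get scheduled
dag = DAG(
    dag_id='example_dag',
    start_date=datetime(2020, 1, 1),
    schedule_interval='*/1 * * * *',
    catchup=False,
)
t1 = BashOperator(task_id='bash_task', bash_command='echo Hello!', dag=dag)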

Airflow: Can't set 'default_timezone' to 'system'

Running puckel/docker-airflow, with the build modified so that both the environment variables and airflow.cfg have:
ENV AIRFLOW__CORE__DEFAULT_TIMEZONE=system
and
default_timezone = system
accordingly.
But in the UI, it still shows UTC, even though system time is EAT. Here is some evidence from the container:
airflow@906d2275235d:~$ echo $AIRFLOW__CORE__DEFAULT_TIMEZONE
system
airflow@906d2275235d:~$ cat airflow.cfg | grep default_timez
default_timezone = system
airflow@906d2275235d:~$ date
Thu 01 Aug 2019 04:54:23 PM EAT
Would appreciate any help, or advice on how you handle this in practice.
According to Airflow docs:
Please note that the Web UI currently only runs in UTC.
Although the UI uses UTC, Airflow uses the local time to launch DAGs. So if you have, for example, schedule_interval set to 0 3 * * *, Airflow will start the DAG at 3:00 EAT, but in the UI you will see it as 0:00.

Schedule a DAG in airflow to run every 5 minutes

I have a DAG in Airflow and for now it is running each hour (@hourly).
Is it possible to have it run every 5 minutes?
Yes, here's an example of a DAG that I have running every 5 min:
dag = DAG(dag_id='eth_rates',
          default_args=args,
          schedule_interval='*/5 * * * *',
          dagrun_timeout=timedelta(seconds=5))
schedule_interval accepts a CRON expression: https://en.wikipedia.org/wiki/Cron#CRON_expression
The documentation states:
Each DAG may or may not have a schedule, which informs how DAG Runs
are created. schedule_interval is defined as a DAG argument, and
receives preferably a cron expression as a str, or a
datetime.timedelta object.
When following the provided link for CRON expressions it appears you can specify it as */5 * * * * to run it every 5 minutes.
I'm not familiar on the matter, but this is what the documentation states.
Airflow 2 (I'm using 2.4.2) supports timedelta for scheduling DAGs on a particular cadence (hourly, every 5 minutes, etc.) so you can put:
schedule_interval = timedelta(minutes=5)
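For instance, a minimal sketch (the dag_id and start_date are illustrative):

from datetime import datetime, timedelta
from airflow import DAG

# Runs every 5 minutes using a timedelta instead of a cron expression
with DAG(
    dag_id="every_five_minutes",
    start_date=datetime(2022, 1, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=False,
) as dag:
    ...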
