Airflow not picking up "start_date" from DAG - airflow

I'm registering a new DAG like so:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago
from airflow.hooks.base_hook import BaseHook
from datetime import datetime, timedelta, timezone
import pendulum
local_tz = pendulum.timezone("UTC")
default_args = {
'owner': 'me',
'depends_on_past': False,
'start_date': datetime(2020, 6, 19, 9, 37, 35, tzinfo=local_tz),
'email': ["blah#blah.com"],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=15)
}
dag = DAG(
dag_id="some_id",
default_args=default_args,
description= "Some description",
schedule_interval="#once"
)
def to_be_executed_py():
print("I did it, ma!")
with dag:
t1 = PythonOperator(
task_id="some_id",
python_callable=to_be_executed_py)
I want this to run once and only once at the time given in start_date. After uploading the DAG (using S3), I'm not seeing the "start_date" in details. Instead, I see in details (under default_args):
{'owner': 'me',
'depends_on_past': False,
'start_date': datetime.datetime(2020, 6, 19, 9, 37, 35, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>),
'email': ['bleh#blah.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': datetime.timedelta(0, 900)}
Am I doing something wrong here? Am I making the right assumption that this should execute at the given start_time? I've looked all over for similar use cases, but not many are setting their start_date to include a time.
UPDATE
Currently, the DAG is running immediately upon being unpaused. Definitely not picking up the start time. All the resources I've found online don't have an answer that works here.

Figured out the issue. It was twofold. One, a translator we were using was a 12 hour clock. As this was in the evening, it was setting it to be in the past (causing Airflow to play catchup).
Secondly, we didn't need the timezone. Plus, we weren't setting the dag in the task. So the code should read as the following:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago
from airflow.hooks.base_hook import BaseHook
from datetime import datetime, timedelta, timezone
default_args = {
'owner': 'me',
'depends_on_past': False,
'start_date': datetime(2020, 6, 19, 21, 37, 35),
'email': ["blah#blah.com"],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=15)
}
dag = DAG(
dag_id="some_id",
default_args=default_args,
description= "Some description",
schedule_interval="#once"
)
def to_be_executed_py(ds, **kwargs):
print("I did it, ma!")
with dag:
t1 = PythonOperator(
dag=dag,
provide_context=True,
task_id="some_id",
python_callable=to_be_executed_py)
With those changes, everything runs at the given time, once and only once.

With a combination of schedule_interval='#daily', ShortCircuitOperator and airflow variables you could do a workaround; DAG runs every day and checks if today is in the list of the days that you entered as an airflow variable. If so, proceed and runs downstream tasks and if not, it skips the downstream tasks and waits until subsequent execution on tomorrow.
here is the DAG definition:
import airflow.utils.helpers
from airflow.models import DAG, Variable
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import ShortCircuitOperator
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2)
}
dag = DAG(
dag_id='run_on_release_day',
default_args=args,
schedule_interval='#daily'
)
def check_release_date(**context):
release_dates = Variable.get('release_dates')
print(context, release_dates)
return context['ds'] in release_dates
cond = ShortCircuitOperator(
task_id='condition',
python_callable=check_release_date,
dag=dag,
provide_context=True,
)
tasks = [DummyOperator(task_id='task_' + str(i), dag=dag) for i in [1, 2]]
airflow.utils.helpers.chain(cond, *tasks)

Related

Implementing cross-DAG dependency in Apache airflow

I am trying to implement DAG dependency between 2 DAGs say A and B. DAG A runs once every hour and DAG B runs every 15 mins.
Each time DAG B starts it's run I want to make sure DAG A is not in running state.
If DAG A is found to be running then DAG B has to wait until DAG A completes the run.
If DAG A is not running, DAG B can proceed with it's tasks.
DAG A :
from datetime import datetime,timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
'owner': 'dependency',
'depends_on_past': False,
'start_date': datetime(2020, 9, 10, 10, 1),
'email': ['xxxx.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
with DAG('DAG_A', schedule_interval='0/60 * * * *',max_active_runs=1, catchup=False,
default_args=default_args) as dag:
task1 = DummyOperator(task_id='task1', retries=1, dag=dag)
task2 = DummyOperator(task_id='task2', retries=1, dag=dag)
task3 = DummyOperator(task_id='task3', retries=1, dag=dag)
task1 >> task2 >> task3
DAG B:
from datetime import datetime,timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
'owner': 'dependency',
'depends_on_past': False,
'start_date': datetime(2020, 9, 10, 10, 1),
'email': ['xxxx.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
with DAG('DAG_B', schedule_interval='0/15 * * * *',max_active_runs=1, catchup=False,
default_args=default_args) as dag:
task4 = DummyOperator(task_id='task4', retries=1, dag=dag)
task5 = DummyOperator(task_id='task5', retries=1, dag=dag)
task6 = DummyOperator(task_id='task6', retries=1, dag=dag)
task4 >> task5 >> task6
I have tried using ExternalTaskSensor operator. I am unable to understand if the sensor finds DAG A to be in success state it triggers the next task else wait for the task to complete.
Thanks in advance.
I think the only way you can achieve that in "general" way is to use some external locking mechanism
You can achieve quite a good approximation though using pools:
https://airflow.apache.org/docs/apache-airflow/1.10.3/concepts.html?highlight=pool
if you set pool size to 1 and assign both dag A and B to the pool, only one of those can be running at a time. You can also add priority_weight in the way that you see best - in case you need to prioritise A over B or the other way round.
You could use ExternalTaskSensor to achieve what you are looking for. The key aspect is to initialize this sensor with the correct execution_date, being that in your example the execution_date of the last DagRun of DAG_A.
Check this example where DAG_A runs every 9 minutes for 200 seconds. DAG_B runs every 3 minutes and runs for 30 seconds. These values are arbitrary and only for demo purpose, could be pretty much anything.
DAG A (nothing new here):
import time
from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
def _executing_task(**kwargs):
print("Starting task_a")
time.sleep(200)
print("Completed task_a")
dag = DAG(
dag_id="example_external_task_sensor_a",
default_args={"owner": "airflow"},
start_date=days_ago(1),
schedule_interval="*/9 * * * *",
tags=['example_dags'],
catchup=False
)
with dag:
start = DummyOperator(
task_id='start')
task_a = PythonOperator(
task_id='task_a',
python_callable=_executing_task,
)
chain(start, task_a)
DAG B:
import time
from airflow import DAG
from airflow.utils.db import provide_session
from airflow.models.dag import get_last_dagrun
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.sensors.external_task import ExternalTaskSensor
def _executing_task():
time.sleep(30)
print("Completed task_b")
#provide_session
def _get_execution_date_of_dag_a(exec_date, session=None, **kwargs):
dag_a_last_run = get_last_dagrun(
'example_external_task_sensor_a', session)
print(dag_a_last_run)
print(f"EXEC DATE: {dag_a_last_run.execution_date}")
return dag_a_last_run.execution_date
dag = DAG(
dag_id="example_external_task_sensor_b",
default_args={"owner": "airflow"},
start_date=days_ago(1),
schedule_interval="*/3 * * * *",
tags=['example_dags'],
catchup=False
)
with dag:
start = DummyOperator(
task_id='start')
wait_for_dag_a = ExternalTaskSensor(
task_id='wait_for_dag_a',
external_dag_id='example_external_task_sensor_a',
allowed_states=['success', 'failed'],
execution_date_fn=_get_execution_date_of_dag_a,
poke_interval=30
)
task_b = PythonOperator(
task_id='task_b',
python_callable=_executing_task,
)
chain(start, wait_for_dag_a, task_b)
We are using the param execution_date_fn of the ExternalTaskSensor in order to obtain the execution_date of the last DagRun of the DAG_A, if we don't do so, it will wait for DAG_A with the same execution_date as the actual run of DAG_B which may not exists in many cases.
The function _get_execution_date_of_dag_a does a query to the metadata DB to obtain the exec_date by using get_last_dagrun from Airflow models.
Finally the other important parameter is allowed_states=['success', 'failed'] where we are telling it to wait until DAG_A is found in one of those states (i.e if it is in running state will keep executing poke).
Try it out and let me know if it worked for you!.

Apache-Airflow - Task is in the none state when running DAG

Just started with airflow and wanted to run simple dag with BashOperator that outputs 'Hello' to console
I noticed that my status is indefinitely stuck in 'Running'
When I go on task details, I get this:
Task is in the 'None' state which is not a valid state for execution. The task must be cleared in order to be run.
Any suggestions or hints are much appreciated.
Dag:
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago
default_args = {
'owner': 'dude_whose_doors_open_like_this_-W-',
'depends_on_past': False,
'start_date': days_ago(2),
'email': ['yessure#gmail.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG(
'Test',
default_args=default_args,
description='Test',
schedule_interval=timedelta(days=1)
)
t1 = BashOperator(
task_id='ECHO',
bash_command='echo "Hello"',
dag=dag
)
t1
I've managed to solve it by adding 'start_date': dt(1970, 1, 1),
to default args object
and adding schedule_interval=None to my dag object
Could you remove the last line of t1- this isn't necessary. Also start_dateshouldn't be set dynamically - this can lead to problems with the scheduling.

Airflow does not run dags

Context: I successfully installed Airflow on EC2, changed things like executor to LocalExecutor; sql_alchemy_conn to postgresql+psycopg2://postgres#localhost:5432/airflow; max_threads to 10.
My problem is when I create a dag which I indicate to be run everyday everything is fine, but when I create a dag to be run like at 10am on Monday and Wednesday Airflow doesn't does not run it. Does anybody know what could I do wrong and should I do in order to fix this issue?
Dag for script which runs fine and properly:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
args = {
'owner': 'arseniyy123',
'start_date': airflow.utils.dates.days_ago(1),
'depends_on_past': False,
'email': ['exam#exam.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=1),
}
dag = DAG(
'daily_script',
default_args=args,
description = 'daily_script',
schedule_interval = "0 10 * * *",
)
t1 = BashOperator(
task_id='daily',
bash_command='cd /root/ && python3 DAILY_WORK.py',
dag=dag)
t1
Dag for script which should run on Monday and Wednesday, but it does not run at all:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
args = {
'owner': 'arseniyy123',
'start_date': airflow.utils.dates.days_ago(1),
'depends_on_past': False,
'email': ['exam#exam.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=1),
}
dag = DAG(
'monday_wednesday',
default_args=args,
description = 'monday_wednesday',
schedule_interval = "0 10 * * 1,3",
)
t1 = BashOperator(
task_id='monday_wednesday',
bash_command='cd /root/ && python3 not_daily_work.py',
dag=dag)
t1
I also have some problems with scheduler, it uses to die after being working more than 10 hours, anybody know why does it happen?
Thank you in advance!
Can you try changing the start_date to a static datetime e.g. datetime.date(2020, 3, 20) instead of using airflow.utils.dates.days_ago(1)
Maybe read through the scheduling examples here, to understand why your code didn't work. From that documentation:
Let’s Repeat That The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period

Airflow doesn't recognize DAG scheduling

I'm trying to make weekly, monthly airflow schedules but it's not working. Could anyone report what may be happening? If I make weekly, monthly scheduling, it stays still, as if it were turned off. No error message, simply does not execute. I sent a code example to demonstrate how I'm scheduling ... is there any other way to do this scheduling?
import airflow
import os
import six
import time
from datetime import datetime, timedelta
from airflow import DAG
from airflow import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.operators.slack_operator import SlackAPIPostOperator
default_args = {
'owner': 'bexs-data',
'start_date': airflow.utils.dates.days_ago(0),
'depends_on_past': False,
'email': ['airflow#apache.org'],
'email_on_failure': False,
'email_on_retry': False,
'depends_on_past': False,
# If a task fails, retry it once after waiting
# at least 5 minutes
'retries': 1,
'retry_delay': timedelta(minutes=5),
'on_failure_callback': slack_msg
}
dag = DAG(
dag_id=nm_dag,
default_args=default_args,
schedule_interval='51 18 * * 4',
dagrun_timeout=timedelta(minutes=60)
)
There is documentation around not doing the following:
'start_date': airflow.utils.dates.days_ago(0), because in this way there will never be a 1 week interval that has passed since your start date, which means the first interval doesn't close, and the first run doesn't get scheduled.
Suggestion: pick a fixed day on a day-of-the-week 4 (Thursday?) from last week as your start_date.
Airflow will accept a datetime or a string. Use airflow.utils.timezone's datetime for a non-UTC schedule. For example, either:
default_args = {
'owner': 'your-unix-user-id-or-ldap-etc',
'start_date': '2018-1-1',
...
}
or
from airflow.utils.timezone import datetime
default_args = {
'owner': 'your-unix-user-id-or-ldap-etc',
'start_date': datetime(2018, 1, 1),
...
}

How to enable Subdag in Airflow?

In Airflow document, it is mentioned as below
"Subdags must have a schedule and be enabled
Even though subdags are triggered as part of a larger dag, if their schedule is set to None or ‘#once’, the subdag operator will succeed without doing anything".
But not clear, how we can enable the Subdags. Is there any way to enable the Subdag?
You can create a SubDAG like this:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator
default_args = {
'email_on_failure': False,
'email_on_retry': False,
'start_date': datetime(2017, 12, 16),
}
schedule_interval = "#daily"
def create_subdag(main_dag, subdag_id):
subdag = DAG('{0}.{1}'.format(main_dag.dag_id, subdag_id),
default_args=default_args)
DummyOperator(
task_id='foo',
dag=subdag)
return subdag
main_dag = DAG(
dag_id='main_dag',
schedule_interval=schedule_interval,
default_args=default_args,
max_active_runs=1
)
my_subdag = SubDagOperator(
task_id='subdag',
dag=main_dag,
retries=3,
subdag=create_subdag(main_dag, 'subdag')
)

Resources