Airflow DAG not running on scheduled interval

I am testing a simple DAG that is scheduled to run at 6 UTC every Friday and Saturday ('0 6 * * 5,6').
But the DAG did not trigger at 6 am on Friday.
I know that Friday's instance will run on Saturday, and Saturday's instance will run on the following Friday.
What can I do so that Friday's instance runs on Friday itself? Or is there any workaround?
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

def create_txt():
    f = open("/home/abc/test1.txt", "w+")
    for i in range(10):
        f.write("This is line %d\r\n" % (i + 1))
    f.close()

default_args = {
    'owner': 'abc',
    'depends_on_past': False,
    'start_date': datetime(2020, 6, 24),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'catchup': False
}

with DAG('python_test',
         default_args=default_args,
         schedule_interval='0 6 * * 5,6'
         ) as dag:
    create_txt = PythonOperator(task_id='python_test',
                                python_callable=create_txt)

Check the scheduler health.
Try increasing the scheduler timeout.
Make sure no other DAGs import the same modules at around the same time.
Consider importing the os and sys modules as well.
If the DAG is going to be executed for the first time, consider triggering it manually once.
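For reference, here is a lightly corrected sketch of the DAG above (not a confirmed fix for the scheduling question itself): catchup is a DAG-level argument, so putting it in default_args, which only feeds task-level defaults, has no effect, and the task variable is renamed so it no longer shadows the callable. Also note that with '0 6 * * 5,6' a run does start every Friday and Saturday at 06:00 once the first interval has closed; it is the execution_date stamp that lags one interval behind.

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

def create_txt():
    # Write ten numbered lines to a test file.
    with open("/home/abc/test1.txt", "w+") as f:
        for i in range(10):
            f.write("This is line %d\r\n" % (i + 1))

default_args = {
    'owner': 'abc',
    'depends_on_past': False,
    'start_date': datetime(2020, 6, 24),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG('python_test',
         default_args=default_args,
         schedule_interval='0 6 * * 5,6',
         catchup=False  # DAG-level argument, not a task default
         ) as dag:
    create_txt_task = PythonOperator(task_id='python_test',
                                     python_callable=create_txt)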

Related

Airflow does not run dags

Context: I successfully installed Airflow on EC2 and changed settings such as executor to LocalExecutor, sql_alchemy_conn to postgresql+psycopg2://postgres@localhost:5432/airflow, and max_threads to 10.
My problem is that when I create a DAG to be run every day, everything is fine, but when I create a DAG to be run at, say, 10 am on Monday and Wednesday, Airflow does not run it. Does anybody know what I could be doing wrong, and what I should do to fix this issue?
DAG for the script which runs fine:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta

args = {
    'owner': 'arseniyy123',
    'start_date': airflow.utils.dates.days_ago(1),
    'depends_on_past': False,
    'email': ['exam@exam.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    'daily_script',
    default_args=args,
    description='daily_script',
    schedule_interval="0 10 * * *",
)

t1 = BashOperator(
    task_id='daily',
    bash_command='cd /root/ && python3 DAILY_WORK.py',
    dag=dag)

t1
DAG for the script which should run on Monday and Wednesday, but does not run at all:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta

args = {
    'owner': 'arseniyy123',
    'start_date': airflow.utils.dates.days_ago(1),
    'depends_on_past': False,
    'email': ['exam@exam.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    'monday_wednesday',
    default_args=args,
    description='monday_wednesday',
    schedule_interval="0 10 * * 1,3",
)

t1 = BashOperator(
    task_id='monday_wednesday',
    bash_command='cd /root/ && python3 not_daily_work.py',
    dag=dag)

t1
I also have some problems with the scheduler: it tends to die after running for more than 10 hours. Does anybody know why that happens?
Thank you in advance!
Can you try changing the start_date to a static datetime, e.g. datetime(2020, 3, 20), instead of using airflow.utils.dates.days_ago(1)?
Maybe read through the scheduling examples here, to understand why your code didn't work. From that documentation:
Let's Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
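Putting both comments together, a sketch of the Monday/Wednesday DAG with a static start_date; the exact date is illustrative (any fixed date in the past works), and catchup=False is an addition here to skip backfilling the missed intervals:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

args = {
    'owner': 'arseniyy123',
    # A static date in the past instead of days_ago(1); 2020-03-16 was a Monday.
    'start_date': datetime(2020, 3, 16),
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    'monday_wednesday',
    default_args=args,
    schedule_interval='0 10 * * 1,3',
    catchup=False,  # don't backfill the intervals between start_date and now
)

t1 = BashOperator(
    task_id='monday_wednesday',
    bash_command='cd /root/ && python3 not_daily_work.py',
    dag=dag)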

Airflow doesn't recognize DAG scheduling

I'm trying to set up weekly and monthly Airflow schedules, but they're not working, and I can't tell what may be happening. When I use a weekly or monthly schedule, the DAG stays still, as if it were turned off: no error message, it simply does not execute. I've included a code example below to demonstrate how I'm scheduling. Is there another way to do this scheduling?
import airflow
import os
import six
import time
from datetime import datetime, timedelta
from airflow import DAG
from airflow import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.operators.slack_operator import SlackAPIPostOperator

# slack_msg and nm_dag are defined elsewhere in the asker's file
default_args = {
    'owner': 'bexs-data',
    'start_date': airflow.utils.dates.days_ago(0),
    'depends_on_past': False,
    'email': ['airflow@apache.org'],
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting
    # at least 5 minutes
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'on_failure_callback': slack_msg
}

dag = DAG(
    dag_id=nm_dag,
    default_args=default_args,
    schedule_interval='51 18 * * 4',
    dagrun_timeout=timedelta(minutes=60)
)
There is documentation around not doing the following:
'start_date': airflow.utils.dates.days_ago(0), because that way there will never be a one-week interval that has passed since your start date, which means the first interval never closes and the first run never gets scheduled.
Suggestion: pick a fixed day with day-of-week 4 (Thursday) from last week as your start_date.
Airflow will accept a datetime or a string. Use airflow.utils.timezone's datetime for a non-UTC schedule. For example, either:
default_args = {
    'owner': 'your-unix-user-id-or-ldap-etc',
    'start_date': '2018-1-1',
    ...
}

or

from airflow.utils.timezone import datetime

default_args = {
    'owner': 'your-unix-user-id-or-ldap-etc',
    'start_date': datetime(2018, 1, 1),
    ...
}
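Applied to the asker's schedule, a hypothetical concrete version might look like the sketch below. The date is illustrative, and since nm_dag and slack_msg come from elsewhere in the asker's file, the sketch uses a literal dag_id instead:

from airflow import DAG
from airflow.utils.timezone import datetime
from datetime import timedelta

default_args = {
    'owner': 'bexs-data',
    # 2019-05-16 was a Thursday (day-of-week 4), safely in the past,
    # so the first '51 18 * * 4' interval can close and be scheduled.
    'start_date': datetime(2019, 5, 16),
}

dag = DAG(
    dag_id='weekly_thursday_example',  # hypothetical name
    default_args=default_args,
    schedule_interval='51 18 * * 4',
    dagrun_timeout=timedelta(minutes=60),
)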

Airflow not running on scheduled interval

My Airflow webserver is up and running, and other jobs are running as scheduled.
I added a new DAG to be executed every 5 minutes.
Once added, I ran it manually the first time and it completed, but after that it is not running again every 5 minutes.
The DAG code is below:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

current_date = datetime.now()

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2019, 6, 11, current_date.hour, current_date.minute),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

dag = DAG("Incremental", default_args=default_args, schedule_interval='*/5 * * * *')
Any suggestions, please?
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended (based on the Airflow docs).
In your case, with a start date of 2019-01-01 00:00:00 and a 5-minute interval, the run stamped 2019-01-01 00:05:00 will only be triggered after 2019-01-01 00:10:00, because Airflow waits for the 5-minute interval it covers to finish (this is how I imagine it). Hope this helps. :)
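The snippet in the question also builds start_date from datetime.now(), which moves every time the scheduler re-parses the file, so an interval never gets the chance to finish. A minimal sketch with a static start date; catchup=False is my addition, to avoid backfilling every missed 5-minute interval:

from airflow import DAG
from datetime import datetime, timedelta

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    # Static date in the past, not derived from datetime.now().
    "start_date": datetime(2019, 6, 11),
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

dag = DAG("Incremental", default_args=default_args,
          schedule_interval='*/5 * * * *', catchup=False)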

PythonOperator with python_callable set gets executed constantly

import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from workflow.task import some_task

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['jimin.park1@aig.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1),
    'start_date': airflow.utils.dates.days_ago(0)
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('JiminTest', default_args=default_args, schedule_interval='*/1 * * * *', catchup=False)

t1 = PythonOperator(
    task_id='Task1',
    provide_context=True,
    python_callable=some_task,
    dag=dag
)
The actual some_task itself simply appends a timestamp to a file. As you can see in the DAG definition, the task is configured to run every minute.
import datetime

def some_task(ds, **kwargs):
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("test.txt", "a") as myfile:
        myfile.write(current_time + '\n')
I simply tail -f the output file, and I started the webserver without the scheduler running. The function was being called and lines were being appended to the file while the webserver started up. When I start the scheduler, the file gets appended to on each execution loop.
What I want is for the function to be executed every minute as intended, not on every execution loop.
The scheduler parses each DAG file on every scheduler loop, which executes all module-level code, including the import statements.
Is there any code running at module level in the file from which you are importing the function?
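If there is, that would explain the symptom. A hypothetical layout of the imported module below shows the distinction: module-level lines run on every parse, while the body of some_task runs only when a task instance actually executes.

# workflow/task.py (hypothetical layout, for illustration)
import datetime

def some_task(ds, **kwargs):
    # Executed only when the scheduler runs the task instance.
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("test.txt", "a") as myfile:
        myfile.write(current_time + '\n')

# Anything at module level here runs every time a DAG file importing
# this module is parsed -- keep it free of side effects, or guard it
# for manual testing:
if __name__ == "__main__":
    some_task(ds=None)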
Try checking the scheduler_heartbeat_sec config parameter in your config file. For your case it should be smaller than 60 seconds.
If you don't want the scheduler to catch up on previous runs, set catchup_by_default to False (I am not sure if this is relevant to your question, though).
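Both settings live in the [scheduler] section of airflow.cfg; the values below are a plausible sketch, not prescriptions from the original answer:

[scheduler]
# How often (in seconds) the scheduler wakes up; for a 1-minute DAG
# this should be well under 60.
scheduler_heartbeat_sec = 30
# Whether newly added DAGs backfill missed intervals by default.
catchup_by_default = False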
Please indicate which Apache Airflow version you are using.

How to get airflow schedule_interval to work correctly

I want to try to use Airflow instead of cron, but schedule_interval doesn't work as I expected.
I wrote the Python code below. In my understanding, Airflow should have run at "2016/03/30 8:15:00", but it didn't work at that time.
If I change it to "'schedule_interval': timedelta(minutes=5)", it works correctly, I think.
The "notice_slack.sh" just calls the Slack API to post to my channels.
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals

import os
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta

args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 3, 29, 8, 15),
}

dag = DAG(
    dag_id='notice_slack',
    default_args=args,
    schedule_interval="@daily",
    dagrun_timeout=timedelta(minutes=1))

# cmd file name
CMD = '/tmp/notice_slack.sh'

run_this = BashOperator(
    task_id='run_transport', bash_command=CMD, dag=dag)
I want to run some of my scripts at a specific time every day, like this cron setting:
15 08 * * * bash /tmp/notice_slack.sh
I have read the Scheduling & Triggers documentation, and I know it's a little bit different from cron.
So I attempted to adjust the "start_date" and "schedule_interval" settings.
Does anyone know what I should do?
airflow version
INFO - Using executor LocalExecutor
v1.7.0
amazon-linux-ami/2015.09-release-notes
Try this:
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals

import os
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta

args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 3, 29),
}

dag = DAG(
    dag_id='notice_slack',
    default_args=args,
    schedule_interval="15 08 * * *",
    dagrun_timeout=timedelta(minutes=1))

# cmd file name
CMD = 'bash /tmp/notice_slack.sh'

run_this = BashOperator(
    task_id='run_transport', bash_command=CMD, dag=dag)
start_date (datetime) – The start_date for the task, determines the execution_date for the first task instance. The best practice is to have the start_date rounded to your DAG’s schedule_interval.
schedule_interval (datetime.timedelta or dateutil.relativedelta.relativedelta or str that acts as a cron expression) – Defines how often that DAG runs, this timedelta object gets added to your latest task instance’s execution_date to figure out the next schedule.
Simply configure the schedule_interval and bash_command to match what you had in your cron setting.
Airflow will start your DAG once 2016/03/30 8:15:00 + the schedule interval (daily) has passed, so your DAG will run on 2016/03/31 8:15:00.
You can check the Airflow FAQ
First, your start date should be in the past:
Instead of 'start_date': datetime(2016, 3, 29, 8, 15)
try 'start_date': datetime(2016, 2, 29, 8, 15),
and apply 'catchup': False to prevent backfills, unless backfilling is something you actually want.
From the Airflow documentation:
The Airflow scheduler triggers the task soon after the start_date + schedule_interval is passed.
The schedule interval can be supplied as a cron expression:
If you want to run it every day at 8:15 AM, the expression would be '15 8 * * *'.
If you want to run it only on Oct 31st at 8:15 AM, the expression would be '15 8 31 10 *'.
To supply this, set 'schedule_interval': '15 8 * * *' in your DAG properties.
You can work these expressions out at https://crontab.guru/
Alternatively, there are Airflow presets:
If any of these meet your requirements, it would simply be, for example, 'schedule_interval': '@hourly'.
Lastly, you can also supply the schedule as a Python timedelta object, e.g. to run every 12 hours:
'schedule_interval': timedelta(hours=12)
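Side by side, the three forms are interchangeable; a sketch with hypothetical dag_ids, each assuming default_args with a fixed past start_date:

from airflow import DAG
from datetime import datetime, timedelta

default_args = {'owner': 'airflow', 'start_date': datetime(2016, 3, 29)}

# Cron expression: every day at 08:15
dag_cron = DAG('cron_example', default_args=default_args,
               schedule_interval='15 8 * * *')

# Preset: every day at midnight
dag_preset = DAG('preset_example', default_args=default_args,
                 schedule_interval='@daily')

# timedelta: every 12 hours, measured from the latest execution_date
dag_delta = DAG('delta_example', default_args=default_args,
                schedule_interval=timedelta(hours=12))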
With the example you've given, @daily will run your job right after midnight passes. You might try changing it to timedelta(days=1), which is relative to your fixed start_date that includes 08:15.
Or you could use a cron spec for the schedule_interval='15 08 * * *', in which case any start date prior to 8:15 on the day BEFORE the day you wanted the first run would work.
Note that depends_on_past: False is already the default; you may have confused its behavior with catchup=False in the DAG parameters, which avoids creating past runs for the time between the start date and now where the DAG's schedule interval would have run.
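For illustration, a sketch of where that catchup=False goes in this example; note that the DAG-level catchup argument arrived around Airflow 1.8, newer than the v1.7.0 the asker reports, so an upgrade may be needed:

from airflow.models import DAG
from datetime import datetime, timedelta

dag = DAG(
    dag_id='notice_slack',
    default_args={'owner': 'airflow',
                  'start_date': datetime(2016, 3, 29, 8, 15)},
    schedule_interval='15 08 * * *',
    catchup=False,  # do not create runs for past intervals
    dagrun_timeout=timedelta(minutes=1))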
