airflow schedule_interval not working - airflow

I have my DAG like this,
dag = DAG('testing', description='Testing DAG', schedule_interval='0 4,15 * * *')
t1 = BashOperator(task_id='testing_task', bash_command='python /home/ubuntu/airflow/dags/scripts/test.py',
                  dag=dag, start_date=datetime(2018, 2, 8))
I want to schedule it to run every day at 3 PM and 4 AM, and I changed my AWS instance's local timezone to NZ.
In the Airflow web UI, in the top right, I still see Airflow showing UTC time. However, if I look at the last run for my DAG (my manual run through the UI), it shows NZ time. So I assumed the scheduler works in the local timezone (NZ time) and tried to schedule on that timezone, but the job was not triggered on time. How can I solve this?

Right now (as of Airflow 1.9) Airflow only operates in UTC. The "solution" for now is to put the schedule in UTC -- as horrible as that is.
The good news is that the master branch (which will become the next non-point release, Airflow 1.10) has support for time zones! https://github.com/apache/incubator-airflow/blob/772dbae298680feb9d521e7cd5526f4059d7cb69/docs/timezone.rst
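For example, here is a minimal sketch of the workaround for the DAG above, assuming NZST (UTC+12) and ignoring daylight saving: 4 AM and 3 PM in New Zealand correspond to 16:00 (the previous day) and 03:00 in UTC.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Schedule written in UTC: 03:00 and 16:00 UTC correspond to 3 PM and 4 AM
# (the next day) in NZST (UTC+12). Shift by one hour while NZ daylight
# saving time is in effect.
dag = DAG(
    'testing',
    description='Testing DAG',
    schedule_interval='0 3,16 * * *',
    start_date=datetime(2018, 2, 8),
)

t1 = BashOperator(
    task_id='testing_task',
    bash_command='python /home/ubuntu/airflow/dags/scripts/test.py',
    dag=dag,
)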

Related

airflow schedule issue - diff between schedule time and run time

I set the schedule to '* 1,5,10,18 * * *' in Airflow.
But the 18:00 run yesterday wasn't executed, so I checked the logs.
There I found that the job scheduled for 10:00 was executed at 18:00.
I want to know why, and how I can fix it.
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
~Airflow scheduler docs
So, as you can see, the run is scheduled only after the period 10:00 -> 18:00 is closed, i.e. after 18:00. Check the previous task; it should have run just after 10:00.
You are misunderstanding how the Airflow scheduler works.
As the DAG Run date (data_interval_start), Airflow always takes the start of the previous interval, so the task that ran on your side at 18:00 has a DAG Run dated 10:00. Likewise, the run launched at 10:00 will have a DAG Run dated 18:00 the previous day.
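One way to see the mapping between the execution_date and the actual launch time is with croniter, the library Airflow itself uses for cron schedules. This is just a toy illustration, not DAG code, and the minute field is pinned to 0 here for clarity:

from datetime import datetime

from croniter import croniter

# Hours 1, 5, 10 and 18, on the hour.
cron = croniter('0 1,5,10,18 * * *', datetime(2018, 1, 2, 10, 0))

# The interval that starts at 10:00 only closes at the next schedule point,
# 18:00, which is when the run stamped with execution_date 10:00 is launched.
print(cron.get_next(datetime))   # 2018-01-02 18:00:00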

Issue with Airflow version 1.10.1

Recently I upgraded Airflow to version 1.10.1 and turned on some of the DAGs which were marked OFF earlier.
I always use today's date as the DAG's start_date.
After turning ON the DAGs, the issue below appeared.
The scheduler is starting those DAGs, but it is not picking up the related tasks. The Task Instance Details page shows "The execution date is 2018-12-04T13:00:00+00:00 but this is before the task's start date 2019-02-04T00:00:00+00:00." It runs only after triggering it manually.
Is there any other way (apart from fixing the start_date for the DAG) this issue can be fixed, i.e. some config or other option that lets me bypass the above check between the execution date and the task's start date?
My main goal is to run the DAG's old schedule without manual intervention.
You should not use a dynamic start date, especially not today's date or datetime.now(). Have a read of the official docs https://airflow.readthedocs.io/en/stable/faq.html#what-s-the-deal-with-start-date for more details.
I know you asked for a suggestion other than fixing the start date, but your start date definitely needs to be before the task's execution date. Hence, I would strongly recommend changing your start_date to something static like datetime(2018, 1, 1).
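For illustration, a minimal sketch with a static start date (the DAG id and schedule here are placeholders, not taken from the question):

from datetime import datetime

from airflow import DAG

dag = DAG(
    'my_dag',                        # placeholder DAG id
    schedule_interval='0 13 * * *',  # placeholder schedule
    # A fixed date in the past: each run's execution date is compared against
    # this, so it must never move forward when the file is re-parsed.
    start_date=datetime(2018, 1, 1),
    catchup=True,  # the default; lets the scheduler create missed runs itself
)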

Apache Airflow tasks are stuck in a 'up_for_retry' state

I've been setting up an Airflow cluster on our system, and previously it was working. I'm not sure what I may have done to change this.
I have a DAG which I want to run on a schedule. To make sure it's working I'd also like to trigger it manually. Neither of these seems to be working at the moment, and no logs are being written for the task instances. The only logs available are the Airflow scheduler logs, which generally look healthy.
I am just constantly met with this message:
Task is not ready for retry yet but will be retried automatically. Current date is 2018-12-12T11:34:46.978355+00:00 and task will be retried at 2018-12-12T11:35:08.093313+00:00.
However, if I wait a little the exact same message is presented again except the times have moved forward a little. Therefore, it seems the task is never actually being retried.
I am using a LocalExecutor and the task is an SSHOperator. Simplified code is below. All it does is SSH onto another machine and start a bunch of applications with a pre-determined directory structure:
import json
import re
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.utils import timezone

# ConfigDbClient is our own helper; this import path is a placeholder.
from config_db_client import ConfigDbClient

DB_INFO_FILE = 'info.json'
START_SCRIPT = '/bin/start.sh'

TIME_IN_PAST = timezone.convert_to_utc(datetime.today() - timedelta(days=1))

DEFAULT_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': TIME_IN_PAST,
    'email': ['some_email@blah.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

def _extract_instance_id(instance_string):
    return re.findall(r'\d+', instance_string)[0]

def _read_file_as_json(file_name):
    with open(file_name) as open_file:
        return json.load(open_file)

DB_INFO = _read_file_as_json(DB_INFO_FILE)
CONFIG_CLIENT = ConfigDbClient(**DB_INFO)
APP_DIRS = CONFIG_CLIENT.get_values('%my-app-info%')

INSTANCE_START_SCRIPT_PATHS = {
    _extract_instance_id(instance_string): directory + START_SCRIPT
    for instance_string, directory in APP_DIRS.items()
}

# Create an ssh hook which refers to pre-existing connection information
# set up and stored by airflow
SSH_HOOK = SSHHook(ssh_conn_id='my-conn-id')

# Create a DAG object to add tasks to
DAG = DAG('my-dag-id',
          default_args=DEFAULT_ARGS)

# Create a task for each app instance.
for instance_id, start_script in INSTANCE_START_SCRIPT_PATHS.items():
    task = SSHOperator(
        task_id='run-script-{0}'.format(instance_id),
        command='bash {0}'.format(start_script),
        ssh_hook=SSH_HOOK,
        dag=DAG)
It works when I run the tasks individually via the command line, but not via the UI. It seems I can run tasks, but I simply cannot trigger a DAG to run. I've also tried many combinations of start_dates and schedule intervals just to sanity check.
Any ideas?
And yes, I am aware this question has been asked before and I have looked at all of those questions, but none of the solutions have helped me.
Oh. Your start_date is changing at the same rate or faster than the schedule interval period.
Here's what the scheduler is seeing every couple of seconds:
start_date: 2018-12-11T12:12:12.000Z # e.g. if now is 2018-12-12T12:12:12.000Z, this is a day ago
schedule_interval: timedelta(days=1) # The default
Here's what the scheduler needs for a DAG to run: the last run must have occurred more than one schedule interval ago. If no scheduled run has occurred yet, the first scheduled run can start now only if one full schedule interval has passed since the start_date, as that is the earliest allowable execution_date. In that case a dag_run is created with its execution_date set to the start of that interval period. Then task_instances can be created for any tasks in the DAG whose dependencies are met, as long as the task_instance's execution_date is after the start_date of the DAG (this is not stored on the dag_run object but is recomputed by loading the DAG file while inspecting the DAG's state).
So it's not getting scheduled automatically because the start date keeps moving forward just as the interval is about to be satisfied. However, if the start date were two days in the past (-2d), at least one run would get scheduled, and then any further runs would have to wait until one day after that to be scheduled. It's easier, though, if you just set a fixed datetime for your start_date.
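A toy illustration of the moving target (plain Python, not scheduler code), focused on the check mentioned above that a task_instance's execution_date must not be earlier than the start_date recomputed at the next parse of the DAG file:

from datetime import datetime, timedelta

def dynamic_start_date(now):
    # What the DAG file computes every time it is parsed.
    return now - timedelta(days=1)

# First scheduler pass: the DAG file is parsed and a run gets the
# execution_date derived from the start_date seen at that moment.
parse_1 = datetime(2018, 12, 12, 12, 0, 0)
execution_date = dynamic_start_date(parse_1)            # 2018-12-11 12:00:00

# A later pass, seconds afterwards: the file is re-parsed, start_date has
# crept forward, and the stored execution_date now fails the
# "execution_date >= task start_date" check, so the task never runs.
parse_2 = parse_1 + timedelta(seconds=30)
print(execution_date >= dynamic_start_date(parse_2))    # False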
But what about those odd retries on your manual runs…
You did start a manual run or two. These runs took the current time as the execution_date unless you specified something else. This should be after the start_date, at least until tomorrow, which should clear them to run. But it seems from your logs that they're failing, being marked for retry, and also not decrementing your retries. I'm not sure why that would be, but could it be that something isn't right with the SSHOperator?
Did you install Airflow with the [ssh] extra so that SSHOperator's dependencies (specifically paramiko and sshtunnel) are met on both the webserver and scheduler? One of them is working, because I assume the DAG is being parsed and showing up in the UI after being added to the DB.
What do you get if you execute:
airflow test my-dag-id run-script-an_instance_id 2018-12-12T12:12:12
Keep in mind that the scheduler and webserver are constantly refilling their DAG bag, so they rerun this DAG file a few thousand times a day, reloading that JSON (it's local access, so similar to importing a module) and recreating that SSHHook with a DB lookup. I don't see anything fancy being done in setting up this hook, so why not just remove ssh_hook from the SSHOperator and replace it with ssh_conn_id='my-conn-id', so the hook is created once at execution time?
I doubt that's the issue that's causing the retries that just roll forward though.
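For reference, a sketch of that change inside the loop from the question (same task ids and connection id; SSHOperator accepts ssh_conn_id in Airflow 1.10's contrib version):

# Let the operator build its own hook from the stored Airflow connection at
# execution time, instead of constructing an SSHHook on every DAG-file parse.
for instance_id, start_script in INSTANCE_START_SCRIPT_PATHS.items():
    task = SSHOperator(
        task_id='run-script-{0}'.format(instance_id),
        command='bash {0}'.format(start_script),
        ssh_conn_id='my-conn-id',
        dag=DAG)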
I had a task stuck in up_for_retry for almost 24 hours before I noticed it, and it had nothing to do with the start_date, end_date, or any other classic beginner's problem.
I resorted to reading the source code, and found that Airflow treats up_for_retry tasks differently if they are part of a backfill DAG run.
So I connected to the metadata DB and changed the backfill_ prefix to scheduled__ in the run_id of the dag_run row corresponding to the stuck task. Airflow started running the stuck task immediately.
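For anyone who prefers not to hand-edit SQL, here is a rough equivalent through Airflow's own ORM (assuming Airflow 1.x; the DAG id is a placeholder, and the answer above modified the row directly, so treat this purely as an illustration):

from airflow import settings
from airflow.models import DagRun

session = settings.Session()

# Find the stuck run whose run_id still carries the backfill_ prefix.
stuck_run = (
    session.query(DagRun)
    .filter(DagRun.dag_id == 'my-dag-id',          # placeholder DAG id
            DagRun.run_id.like('backfill_%'))
    .one()
)

# Relabel it as a scheduled run so the scheduler stops treating it as a backfill.
stuck_run.run_id = stuck_run.run_id.replace('backfill_', 'scheduled__', 1)
session.commit()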

My airflow DAG and operator are not doing what I expected

Edit: I figured out my problem. I didn't understand the difference between manually triggering a run (which runs immediately) and leaving the DAG switched on and letting the scheduler do its job. The code is fine.
I wrote this simple program to figure out Airflow. It is supposed to append "Hello World" to a file on the hour, but it's doing it immediately. Does someone see where I am going wrong?
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def print_hello():
    f = open('helloword.txt', 'a')
    f.write('Hello World!')
    f.close()

dag = DAG('hello_world', description='Simple tutorial DAG', schedule_interval='@hourly',
          start_date=datetime(2018, 5, 31), catchup=False)

hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
The start date is 2018-05-31 and the schedule interval is @hourly, so the execution date for the first run would normally be 2018-05-31T00:00:00, with the run actually starting at around 2018-05-31T01:00:00 or later.
In this case, you have set catchup to false, so instead only the most recent DAG run will be created. I would expect that DAG run created to be 2018-05-31T21:00:00 right now.
The current UTC time is 2018-05-31T22:00:00 right now. Since the start date timestamp 2018-05-31T00:00:00 is in the past, the Airflow scheduler will schedule and start the task immediately.
You can delete the DAG runs and task instances and then change the start date to 2018-06-01 if you want it to start fresh tomorrow. It would not run immediately in this case if you choose a start date in the future.
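A sketch of that fresh start, reusing the DAG from the question with only the start date pushed into the future (remember to delete the existing DAG runs and task instances first):

# With a future start date nothing runs immediately; the first run is created
# only after the 2018-06-01T00:00 hour has fully passed (i.e. around 01:00 UTC).
dag = DAG('hello_world', description='Simple tutorial DAG',
          schedule_interval='@hourly',
          start_date=datetime(2018, 6, 1), catchup=False)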
You can find a bit more info about how the scheduler works here:
Airflow Wiki > Scheduler Basics
Airflow Docs > Scheduling & Triggers > To Keep in Mind
Your code looks fine to me. Are you seeing some lines appended to the file when you switch your DAG on?
I think what you're seeing is backfill executions running. You set your start date to today, implicitly at midnight, so Airflow catches up and fires those DAG runs first before eventually running your task every hour.

Airflow Controller triggers Target DAG with a future execution date; Target DAG stalls

I have a Controller DAG (SampleController) that will call a Target DAG (SampleWait), both with a start_date of datetime.now() and a schedule_interval of None.
I trigger the Controller DAG from the command line or the webserver UI, and it runs right away with an execution date of "right now" in my system time zone (in my case 17:25, which isn't my "real" UTC time; it is my local time).
However, when the triggered DAG Run for the target is created, the execution date is "adjusted" to UTC, regardless of how I try to manipulate the start_date, so it will ALWAYS be in the future (21:25 here). In my case it is four hours in the future, so the target DAG just sits there doing nothing. I also have a sensor in the Controller that waits for the Target DAG to finish, so that one is going to be polling for no reason too.
Even the examples on GitHub for the Controller-Target pattern exhibit the exact same behavior when I run them, and I can't find any proper documentation on how to handle this issue, only that it is a "pitfall".
It is strange that Airflow seems to be aware of my time zone and adjusts within one operator, but not when I trigger from the command line or the webserver UI.
What gives?
