I'm new to Airflow and I wonder how I can have DAGs auto-activate after being loaded.
I mean the on/off switches in the UI; they are off by default.
For All Dags:
Change dags_are_paused_at_creation to False in airflow.cfg.
You will find it in [core] section.
[core]
# Are DAGs paused by default at creation
dags_are_paused_at_creation = False
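As an alternative to editing airflow.cfg directly (handy in containerized deployments), the same key can be supplied through Airflow's environment-variable override convention, AIRFLOW__{SECTION}__{KEY}, for example:
export AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False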
Single DAG:
If you do not want to change it for all DAGs, you can set is_paused_upon_creation=False when creating the DAG object.
Example:
dag = DAG(
    DAG_NAME,
    schedule_interval='*/10 * * * *',
    default_args=default_args,
    is_paused_upon_creation=False)
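Note that, per the DAG docstring, is_paused_upon_creation only takes effect the first time the DAG is created (it is ignored for a DAG that already exists), and when left unset the global dags_are_paused_at_creation config value applies.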
I am using Apache Airflow version 1.10.3 with the sequential executor, and I would like the DAG to fail after a certain amount of time if it has not finished. I tried setting dagrun_timeout in the example code
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 1),
    'retries': 0,
}

dag = DAG(
    'min_timeout',
    default_args=default_args,
    schedule_interval=timedelta(minutes=5),
    dagrun_timeout=timedelta(seconds=30),
    max_active_runs=1)

t1 = BashOperator(
    task_id='fast_task',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='slow_task',
    bash_command='sleep 45',
    dag=dag)

t2.set_upstream(t1)
slow_task alone takes more than the time limit set by dagrun_timeout, so my understanding is that airflow should stop DAG execution. However, this does not happen, and slow_task is allowed to run for its entire duration. After this occurs, the run is marked as failed, but this does not kill the task or DAG as desired. Using execution_timeout for slow_task does cause the task to be killed at the specified time limit, but I would prefer to use an overall time limit for the DAG rather than specifying execution_timeout for each task.
Is there anything else I should try to achieve this behavior, or any mistakes I can fix?
The Airflow scheduler runs its loop at least every SCHEDULER_HEARTBEAT_SEC (the default is 5 seconds).
Note the "at least": the scheduler performs several actions during each cycle that may delay the start of the next one.
These actions include:
parsing the DAG files
filling up the DagBag
checking existing DagRuns and updating their state
scheduling the next DagRuns
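For reference, that heartbeat is configured in the [scheduler] section of airflow.cfg; the value shown below is the 1.10 default:
[scheduler]
# How often (in seconds) the scheduler should try to start a new loop iteration
scheduler_heartbeat_sec = 5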
In your example, the delayed task isn't terminated at the dagrun_timeout because the scheduler performs its next cycle after the task completes.
According to Airflow documentation:
dagrun_timeout (datetime.timedelta) – specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns, and only once the # of active DagRuns == max_active_runs.
So dagrun_timeout won't work for non-scheduled DagRuns (e.g. manually triggered ones), or while the number of active DagRuns is below the max_active_runs parameter.
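As the question already observes, execution_timeout does reliably kill a task. If a per-task limit is acceptable, one way to avoid repeating it on every operator is to put it in default_args, which gets passed to every task in the DAG. A minimal sketch along the lines of the question's DAG (the dag_id and the 30-second limit are illustrative, not taken from the question):
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'me',
    'start_date': datetime(2019, 6, 1),
    # applied to every task in this DAG unless an operator overrides it
    'execution_timeout': timedelta(seconds=30),
}

dag = DAG('min_timeout_per_task',
          default_args=default_args,
          schedule_interval=timedelta(minutes=5))

slow_task = BashOperator(
    task_id='slow_task',
    bash_command='sleep 45',  # exceeds the 30 s limit, so the task is killed
    dag=dag)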
I have scheduled a DAG with a simple bash task to run every 5th minute:
# bash_dag.py
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 5, 30)
}

dag = DAG(
    'bash_count',
    default_args=default_args,
    schedule_interval='*/5 * * * *',
    catchup=False
)

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag
)
Scheduling works fine; the DAG executes at every 5th-minute threshold. However, I have noticed that there is a significant delay between the 5th-minute threshold and the task queueing time. For the examples shown in the image, task queueing takes between 3 and 50 seconds. For example, the last DAG execution in the image was supposed to be triggered after 20:05:00, but the task instance was queued 28 seconds later (20:05:28).
I'm surprised this is the case, since the DAG being scheduled has a single, very simple task. Is this a normal Airflow delay? Should I expect further delays when dealing with more complex DAGs?
I'm running a local Airflow server with Postgres as the DB on a 16 GB Mac with macOS Mojave. The machine is not resource-constrained.
I am currently using a CeleryExecutor for submitting DAGs, and am running a webserver, scheduler, and worker in the same container on AWS. When I submit DAGs to run and their tasks land in the Redis queue (using ElastiCache), the Celery workers execute the tasks, but the task name shown is just airflow.executors.celery_executor.execute_command. How do I get these names to reflect the actual task arguments instead?
Currently these are the important sections of airflow.cfg:
[core]
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor
executor = CeleryExecutor
[celery]
# This section only applies if you are using the CeleryExecutor in
# [core] section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
And in our DAG file:
dag = DAG(
    'foo_dag',
    schedule_interval='0 2-23 * * *',
    catchup=False,
    default_args=default_args
)

t1 = PythonOperator(
    task_id="bar_task",
    provide_context=True,
    python_callable=bar_task,
    op_kwargs={"emr_cluster_id": emr_cluster_id},
    task_concurrency=1,
    dag=dag
)
I am new to Airflow and I have written a simple DAG with an SSHOperator to learn how it works.
from datetime import datetime
from airflow import DAG
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_operator import SSHOperator

default_args = {
    'start_date': datetime(2018, 6, 20)
}

dag = DAG(dag_id='ssh_test', schedule_interval='@hourly', default_args=default_args)

sshHook = SSHHook(ssh_conn_id='testing')

t1 = SSHOperator(
    task_id='task1',
    command='echo Hello World',
    ssh_hook=sshHook,
    dag=dag)
When I manually trigger it in the UI, the DAG run shows a status of running, but the task stays white with no status.
I'm wondering why my task isn't queuing. Does anyone have any ideas? My airflow.cfg is the default, if that is useful information.
Even this isn't running:
dag = DAG(dag_id='test', start_date=datetime(2018, 6, 21), schedule_interval='0 0 * * *')
runMe = DummyOperator(task_id='testest', dag=dag)
Make sure you've started the Airflow Scheduler in addition to the Airflow Web Server:
airflow scheduler
check if airflow scheduler is running
check if airflow webserver is running
check if all DAGs are set to On in the web UI
check if the DAGs have a start date which is in the past
check if the DAGs have a proper schedule_interval and that the scheduled date shown in the web UI has already passed
check if the DAG's tasks are assigned to a pool and queue that actually exist (a quick CLI sanity check is sketched below)
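A quick way to run some of those checks from a terminal with the Airflow 1.x CLI (replace ssh_test with your own dag_id):
# confirm the scheduler and webserver processes are alive
ps aux | grep -E 'airflow (scheduler|webserver)'
# confirm the DAG file was parsed into the DagBag without import errors
airflow list_dags
# unpause the DAG from the CLI instead of the UI toggle
airflow unpause ssh_test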
I'm just getting started with Airbnb's airflow, and I'm still not clear on how/when backfilling is done.
Specifically, there are 2 use-cases that confuse me:
If I run airflow scheduler for a few minutes, stop it for a minute, then restart it again, my DAG seems to run extra tasks for the first 30 seconds or so, then it continues as normal (runs every 10 sec). Are these extra tasks "backfilled" tasks that weren't able to complete in an earlier run? If so, how would I tell airflow not to backfill those tasks?
If I run airflow scheduler for a few minutes, then run airflow clear MY_tutorial, then restart airflow scheduler, it seems to run a TON of extra tasks. Are these tasks also somehow "backfilled" tasks? Or am I missing something?
Currently, I have a very simple dag:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2016, 10, 4),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG(
    'MY_tutorial',
    default_args=default_args,
    schedule_interval=timedelta(seconds=10))

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 8) }}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

second_template = """
touch ~/airflow/logs/test
echo $(date) >> ~/airflow/logs/test
"""

t4 = BashOperator(
    task_id='write_test',
    bash_command=second_template,
    dag=dag)

t1.set_upstream(t4)
t2.set_upstream(t1)
t3.set_upstream(t1)
The only two things I've changed in my Airflow config are:
I changed from using a SQLite DB to using a Postgres DB
I'm using a CeleryExecutor instead of a SequentialExecutor
Thanks so much for your help!
When you flip a DAG's toggle to "On", the scheduler will trigger a backfill of all DAG run instances for which it has no status recorded, starting from the start_date you specify in your default_args.
For example: If the start date was "2017-01-21" and you turned on the scheduling toggle at "2017-01-22T00:00:00" and your dag was configured to run hourly, then the scheduler will backfill 24 dag runs and then start running on the scheduled interval.
This is essentially what is happening in both of your questions. In #1, it is filling in the 3 runs that were missed during the roughly 30 seconds the scheduler was turned off. In #2, it is filling in all of the DAG runs from start_date until "now".
There are 2 ways around this:
Set the start_date to a date in the future so that it will only start scheduling dag runs once that date is reached. Note that if you change the start_date of a DAG, you must change the name of the DAG as well due to the way the start date is stored in airflow's DB.
Manually run backfill from the command line with the -m (--mark_success) flag, which tells Airflow not to actually run the DAG, but just to mark its runs as successful in the DB.
e.g.
airflow backfill MY_tutorial -m -s 2016-10-04 -e 2017-01-22T14:28:30
Please note that since version 1.8, Airflow lets you control this behaviour using catchup. Either set catchup_by_default=False in airflow.cfg or
catchup=False in your DAG definition.
See https://airflow.apache.org/scheduler.html#backfill-and-catchup
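A minimal sketch of the per-DAG option, reusing the schedule from the question (the dag_id is just a placeholder):
from datetime import datetime, timedelta

from airflow import DAG

dag = DAG(
    'MY_tutorial_no_catchup',
    start_date=datetime(2016, 10, 4),
    schedule_interval=timedelta(seconds=10),
    # only the most recent interval is scheduled; missed intervals are not backfilled
    catchup=False,
)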
The On/Off switch in Airflow's UI is really just a pause toggle: switching a DAG off pauses scheduling at that point in time, and switching it back on makes the DAG continue (and, with catchup enabled, fill in missed runs) from that date again.
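If you prefer the command line over the UI toggle, the same pause/unpause switch can be flipped with the Airflow 1.x CLI:
airflow pause MY_tutorial
airflow unpause MY_tutorial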