I was running the Airflow tutorial. The content of tutorial.py is as follows:
"""
Code that goes along with the Airflow located at:
http://airflow.readthedocs.org/en/latest/tutorial.html
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG(
    'tutorial', default_args=default_args, schedule_interval=timedelta(days=1))
# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)
templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7)}}"
    echo "{{ params.my_param }}"
{% endfor %}
"""
t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)
t2.set_upstream(t1)
t3.set_upstream(t1)
tutorial.py is under ~/airflow/dags. When I run airflow list_dags, I can see tutorial at the end of the list.
However, when I run airflow test tutorial print_date 2018-09-04, it only prints:
[2018-09-04 22:14:43,096] {__init__.py:51} INFO - Using executor SequentialExecutor
[2018-09-04 22:14:43,199] {models.py:258} INFO - Filling up the DagBag from /Users/chenyuanfei/airflow/dags
And nothing else.
I'm using apache-airflow 1.10 on macOS.
How can I correctly run the script?
I suspect it's because I had both airflow 1.8 and apache-airflow 1.10 on my Mac.
I uninstalled both and reinstalled airflow 1.8, and this time it works.
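For anyone hitting the same symptom, a quick way to spot conflicting installs (a generic check, not specific to this setup) is to compare what pip has installed against what the airflow entrypoint actually runs:

pip freeze | grep -i airflow    # lists every installed airflow distribution
airflow version                 # shows which version the CLI resolves to

If the two disagree, uninstalling both packages and reinstalling one of them, as done above, resolves the conflict.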
I've seen a few responses to this before, but they haven't worked for me.
I'm running the bridge release, Airflow 1.10.15, so we can migrate to Airflow 2. I ran airflow upgrade_check and I'm seeing the below error:
/usr/local/lib/python3.7/site-packages/airflow/models/dag.py:1342:
PendingDeprecationWarning: The requested task could not be added to
the DAG with dag_id snapchat_snowflake_daily because a task with
task_id snp_bl_global_content_reporting is already in the DAG.
Starting in Airflow 2.0, trying to overwrite a task will raise an
exception.
The same error is happening for task_ids snp_bl_global_article_reporting and snp_bl_global_video_reporting.
I've also seen someone recommend setting load_examples = False in the airflow.cfg file, which I have already done.
Here is my code:
from datetime import datetime, timedelta

from airflow import DAG
# SnowflakeLoadOperator, SnowflakeQueryOperator, task_fail_slack_alert and
# region come from elsewhere in the original project (elided in the post)

DAG_NAME = 'snapchat_snowflake_daily'

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 6, 12),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'provide_context': True,
    'on_failure_callback': task_fail_slack_alert,
    'sla': timedelta(hours=24),
}

dag = DAG(
    DAG_NAME,
    default_args=default_args,
    catchup=False,
    schedule_interval='0 3 * * *')

with dag:
    s3_to_snowflake = SnowflakeLoadOperator(
        task_id=f'load_into_snowflake_for_{region}',
        pool='airflow_load',
        retries=0)

    snp_il_global = SnowflakeQueryOperator(
        task_id='snp_il_global',
        sql='queries/snp_il_gl.sql',
        retries=0)

    snp_bl_global_video_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_video_reporting',
        sql='snp_bl_gl_reporting.sql',
        retries=0)

    snp_bl_global_content_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_content_reporting',
        sql='snp_bl_global_c.sql')

    snp_bl_global_article_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_article_reporting',
        sql='snp_bl_global_a.sql',
        retries=0)

    s3_to_snowflake >> snp_il_global >> [
        snp_bl_global_video_reporting,
        snp_bl_global_content_reporting,
        snp_bl_global_article_reporting
    ]
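For context, the check fires whenever two operators in one DAG are instantiated with the same task_id. A minimal sketch of what triggers it (a hypothetical DAG, not the poster's code):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG('dup_demo', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    a = DummyOperator(task_id='same_id')
    b = DummyOperator(task_id='same_id')  # warning in 1.10.15, exception in 2.0

Anything that instantiates a second operator with an existing task_id on the same DAG object reproduces the message.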
I have an Airflow task which is scheduled to run every 3 minutes.
Sometimes the task takes longer than 3 minutes, and the next scheduled run starts (or is queued) even though the previous one is still running.
Is there a way to define the DAG so the task is not even queued if it is already running?
# airflow related
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators import MsSqlOperator
# other packages
from datetime import datetime
from datetime import timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 7, 22, 15, 0, 0),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    catchup=False)

job1 = BashOperator(
    task_id='sales',
    bash_command='python2 /home/manager/ETL/sales.py',
    dag=dag)

job2 = MsSqlOperator(
    task_id='refresh_tabular',
    mssql_conn_id='mssql_globrands',
    sql="USE msdb ; EXEC dbo.sp_start_job N'refresh Management-sales' ; ",
    dag=dag)

job1 >> job2
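One way to get this behaviour (a sketch using the standard max_active_runs DAG argument; not from the original post) is to cap the DAG at one active run:

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    catchup=False,
    max_active_runs=1)  # a new run is not started while one is still active

Note that with this setting the next run is deferred rather than skipped, so a long-running task delays subsequent runs instead of overlapping with them.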
task_a >> task_b >> task_c
If C fails, I want to retry A. Is this possible? There are a few other tickets which involve subdags, but I would just like to be able to clear A.
I'm hoping to use on_retry_callback in task C, but I don't know how to reach task A from there.
There is another question which does this in a subdag, but I am not using subdags.
I'm trying to do this, but it doesn't seem to work:
def callback_for_failures(context):
    print("*** retrying ***")
    if context['task'].upstream_list:
        context['task'].upstream_list[0].clear()
As other comments mentioned, be careful not to get into an endless loop of clearing and retries. But you can run a bash command as part of your on_failure_callback and then specify which tasks you want to clear, and whether downstream/upstream tasks should be cleared too.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


def clear_upstream_task(context):
    execution_date = context.get("execution_date")
    clear_tasks = BashOperator(
        task_id='clear_tasks',
        bash_command=f'airflow tasks clear -s {execution_date} -t t1 -d -y clear_upstream_task'
    )
    return clear_tasks.execute(context=context)


# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

with DAG('clear_upstream_task',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval=timedelta(minutes=5),
         default_args=default_args,
         catchup=False
         ) as dag:
    t0 = DummyOperator(
        task_id='t0'
    )
    t1 = DummyOperator(
        task_id='t1'
    )
    t2 = DummyOperator(
        task_id='t2'
    )
    t3 = BashOperator(
        task_id='t3',
        bash_command='exit 123',
        on_failure_callback=clear_upstream_task
    )

    t0 >> t1 >> t2 >> t3
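Here t3 always fails (exit 123), which fires the callback. The airflow tasks clear invocation uses the standard Airflow 2 CLI flags: -t t1 selects the task to clear in the clear_upstream_task DAG, -s {execution_date} anchors the clear at the current run's execution date, -d also clears t1's downstream tasks (t2 and t3), and -y skips the confirmation prompt, so the cleared tasks get re-scheduled automatically.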
Normally we define the operators within the same Python file where the DAG is defined (see this basic example). I was doing the same, but my tasks are themselves big and use custom operators, so I wanted a polymorphic DAG project structure, where all tasks using the same operator sit in a separate file. For simplicity, here is a very basic example. I have an operator x with several tasks. This is my project structure:
main_directory
├──tasks
| ├──operator_x
| | └──op_x.py
| ├──operator_y
| : └──op_y.py
|
└──dag.py
op_x.py has the following method:
def prepare_task():
    from main_directory.dag import dag
    t2 = BashOperator(
        task_id='print_inner_date',
        bash_command='date',
        dag=dag)
    return t2
and dag.py contains the following code:
import datetime as dt

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from main_directory.tasks.operator_x import prepare_task
# gen_email and EMAIL_DISTRO are project-specific helpers (elided in the post)

default_args = {
    'retries': 5,
    'retry_delay': dt.timedelta(minutes=5),
    'on_failure_callback': gen_email(EMAIL_DISTRO, retry=False),
    'on_retry_callback': gen_email(EMAIL_DISTRO, retry=True),
    'start_date': dt.datetime(2019, 5, 10)
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=dt.timedelta(days=1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
t2 = prepare_task()
Now when I execute this in my Airflow environment and run airflow list_dags, I get the desired DAG named test_dag listed. But when I run airflow list_tasks -t test_dag, I only get one task with id print_date, and NOT the one defined inside the subdirectory with id print_inner_date. Can anyone help me understand what I am missing?
Your code would create cyclic imports. Instead, try the following:
op_x.py should have:
def prepare_task(dag):
    t2 = BashOperator(
        task_id='print_inner_date',
        bash_command='date',
        dag=dag)
    return t2
dag.py:
# imports as in the original dag.py
from main_directory.tasks.operator_x import prepare_task

default_args = {
    'retries': 5,
    'retry_delay': dt.timedelta(minutes=5),
    'on_failure_callback': gen_email(EMAIL_DISTRO, retry=False),
    'on_retry_callback': gen_email(EMAIL_DISTRO, retry=True),
    'start_date': dt.datetime(2019, 5, 10)
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=dt.timedelta(days=1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
t2 = prepare_task(dag=dag)
Also make sure that main_directory is in your PYTHONPATH.
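With the dependency inverted this way (dag.py passes the DAG object into prepare_task instead of op_x.py importing dag back from dag.py), the module parses without a circular import, and airflow list_tasks -t test_dag should list both print_date and print_inner_date.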
I am using Airflow 1.8. I want Airflow to send an email when the DAG times out. Currently it sends emails when a task times out or when there is a retry, but not when the DAG times out. The DAG is intentionally set to run every minute; the task takes 10 seconds, but the DAG timeout is 5 seconds. The DAG fails, but it doesn't send any email.
Here is my code for the DAG:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['email@email.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(seconds=60)
}

schedule = '* * * * *'

dag = DAG('leader_dag',
          default_args=default_args, catchup=False, dagrun_timeout=timedelta(seconds=5),
          schedule_interval=schedule)

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='sleep 10',
    dag=dag)
Here is the smtp part of airflow.cfg:
email_backend = airflow.utils.email.send_email_smtp

[smtp]
# If you want airflow to send emails on retries, failure, and you want to use
# the airflow.utils.send_email function, you have to configure an smtp
# server here
smtp_host = ***********.amazonaws.com
smtp_starttls = True
smtp_ssl = False
smtp_user = user
smtp_port = 25
smtp_password = password
smtp_mail_from = no-reply@example.com
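One reading of this behaviour (mine, not from the original post): email_on_failure is a task-level setting, so a DAG run that hits dagrun_timeout fails without any individual task failing, and therefore without any mail. A possible workaround, assuming the SMTP settings above already deliver task-failure mails, is to put an SLA on the task; the scheduler emails the addresses in 'email' when an SLA is missed:

t1 = BashOperator(
    task_id='print_date',
    bash_command='sleep 10',
    sla=timedelta(seconds=5),  # an SLA-miss email is sent when the task runs past this
    dag=dag)

This does not stop the run, but it produces the notification that dagrun_timeout alone does not.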