BashOperator doesn't run bash command in Apache Airflow

I was running the Airflow tutorial. The content of tutorial.py is as follows:
"""
Code that goes along with the Airflow located at:
http://airflow.readthedocs.org/en/latest/tutorial.html
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2015, 6, 1),
'email': ['airflow#example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
# 'queue': 'bash_queue',
# 'pool': 'backfill',
# 'priority_weight': 10,
# 'end_date': datetime(2016, 1, 1),
}
dag = DAG(
'tutorial', default_args=default_args, schedule_interval=timedelta(1))
# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
task_id='print_date',
bash_command='date',
dag=dag)
t2 = BashOperator(
task_id='sleep',
bash_command='sleep 5',
retries=3,
dag=dag)
templated_command = """
{% for i in range(5) %}
echo "{{ ds }}"
echo "{{ macros.ds_add(ds, 7)}}"
echo "{{ params.my_param }}"
{% endfor %}
"""
t3 = BashOperator(
task_id='templated',
bash_command=templated_command,
params={'my_param': 'Parameter I passed in'},
dag=dag)
t2.set_upstream(t1)
t3.set_upstream(t1)
tutorial.py is under ~/airflow/dags. When I run airflow list_dags, I can see tutorial at the end of the list.
However, when I run airflow test tutorial print_date 2018-09-04, it only prints:
[2018-09-04 22:14:43,096] {__init__.py:51} INFO - Using executor SequentialExecutor
[2018-09-04 22:14:43,199] {models.py:258} INFO - Filling up the DagBag from /Users/chenyuanfei/airflow/dags
And nothing else.
I'm using apache-airflow 1.10 on OS X.
How can I correctly run the script?

I suspect it is because I had both airflow 1.8 and apache-airflow 1.10 on my Mac.
I uninstalled both and reinstalled airflow 1.8, and this time it works.
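When a DAG shows up in list_dags but a task test prints nothing beyond the DagBag line, it can also help to load the DagBag by hand and look for import errors. A minimal sketch (not from the original post; it assumes the default ~/airflow/dags folder):

from airflow.models import DagBag

dagbag = DagBag()                       # parses every file in the configured dags folder
print(dagbag.import_errors)             # exceptions raised while importing tutorial.py, if any
print(dagbag.dags.get('tutorial'))      # None means the DAG was never registered

If import_errors is empty and the DAG object is present, the parse itself is fine and the problem lies elsewhere (for example, the mixed 1.8/1.10 installation described above).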

Related

Airflow The requested task could not be added to the DAG because a task with task_id ... is already in the DAG

I've seen a few responses to this before, but they haven't worked for me.
I'm running the bridge release, Airflow 1.10.15, so we can migrate to Airflow 2. I ran airflow upgrade_check and I'm seeing the error below:
/usr/local/lib/python3.7/site-packages/airflow/models/dag.py:1342:
PendingDeprecationWarning: The requested task could not be added to
the DAG with dag_id snapchat_snowflake_daily because a task with
task_id snp_bl_global_content_reporting is already in the DAG.
Starting in Airflow 2.0, trying to overwrite a task will raise an
exception.
The same error occurs for task_id snp_bl_global_article_reporting and task_id snp_bl_global_video_reporting.
I've also seen someone recommend setting load_examples = False in airflow.cfg, which I have already done.
Here is my code:
DAG_NAME = 'snapchat_snowflake_daily'

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 6, 12),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'provide_context': True,
    'on_failure_callback': task_fail_slack_alert,
    'sla': timedelta(hours=24),
}

dag = DAG(
    DAG_NAME,
    default_args=default_args,
    catchup=False,
    schedule_interval='0 3 * * *')

with dag:
    s3_to_snowflake = SnowflakeLoadOperator(
        task_id=f'load_into_snowflake_for_{region}',
        pool='airflow_load',
        retries=0)

    snp_il_global = SnowflakeQueryOperator(
        task_id='snp_il_global',
        sql='queries/snp_il_gl.sql',
        retries=0)

    snp_bl_global_video_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_video_reporting',
        sql='snp_bl_gl_reporting.sql',
        retries=0)

    snp_bl_global_content_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_content_reporting',
        sql='snp_bl_global_c.sql')

    snp_bl_global_article_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_article_reporting',
        sql='snp_bl_global_a.sql',
        retries=0)

    s3_to_snowflake >> snp_il_global >> [
        snp_bl_global_video_reporting,
        snp_bl_global_content_reporting,
        snp_bl_global_article_reporting
    ]
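For context, the warning itself fires whenever a task_id is registered twice on the same DAG object, which is what the upgrade check is flagging. A minimal reproduction (not the poster's actual code, and not necessarily the cause here):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('dup_demo', start_date=datetime(2020, 6, 12), schedule_interval=None)

DummyOperator(task_id='snp_bl_global_content_reporting', dag=dag)
# Registering the same task_id a second time logs the PendingDeprecationWarning
# on Airflow 1.10.x and raises DuplicateTaskIdFound on Airflow 2.
DummyOperator(task_id='snp_bl_global_content_reporting', dag=dag)

So the thing to look for is any path by which these task_ids get added to snapchat_snowflake_daily more than once, for example the same DAG file being picked up twice in the dags folder or a helper module instantiating the operators again.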

How to avoid run of task when already running

I have an Airflow task that is scheduled to run every 3 minutes.
Sometimes the task takes longer than 3 minutes, and the next scheduled run starts (or is queued) even though the previous one is still running.
Is there a way to define the DAG so that it does NOT even queue the task if it is already running?
# airflow related
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators import MsSqlOperator
# other packages
from datetime import datetime
from datetime import timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 7, 22, 15, 00, 00),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    catchup=False)

job1 = BashOperator(
    task_id='sales',
    bash_command='python2 /home/manager/ETL/sales.py',
    dag=dag)

job2 = MsSqlOperator(
    task_id='refresh_tabular',
    mssql_conn_id='mssql_globrands',
    sql="USE msdb ; EXEC dbo.sp_start_job N'refresh Management-sales' ; ",
    dag=dag)

job1 >> job2
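A common way to get this behaviour (not from the original post) is to cap the DAG at a single active run, so the scheduler will not start the next run while the previous one is still going. A minimal sketch of the DAG definition above with that setting:

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    max_active_runs=1,   # never more than one run of this DAG in flight
    catchup=False)       # and don't backfill the runs that were missed

With max_active_runs=1, a run that overlaps the next schedule simply delays it rather than letting both run in parallel.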

How to retry an upstream task?

task a > task b > task c
If C fails I want to retry A. Is this possible? There are a few other tickets which involve subdags, but I would like to just be able to clear A.
I'm hoping to use on_retry_callback in task C but I don't know how to call task A.
There is another question which does this in a subdag, but I am not using subdags.
I'm trying to do this, but it doesn't seem to work:
def callback_for_failures(context):
    print("*** retrying ***")
    if context['task'].upstream_list:
        context['task'].upstream_list[0].clear()
As other comments mentioned, be careful that you don't get into an endless loop of clearing and retries. But you can call a bash command as part of your on_failure_callback and then specify which tasks you want to clear, and whether you want downstream/upstream tasks cleared, etc.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


def clear_upstream_task(context):
    execution_date = context.get("execution_date")
    # Shells out to the CLI: clear task t1 of DAG clear_upstream_task and its
    # downstream tasks (-d) from this execution date (-s), without prompting (-y).
    clear_tasks = BashOperator(
        task_id='clear_tasks',
        bash_command=f'airflow tasks clear -s {execution_date} -t t1 -d -y clear_upstream_task'
    )
    return clear_tasks.execute(context=context)


# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

with DAG('clear_upstream_task',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval=timedelta(minutes=5),
         default_args=default_args,
         catchup=False
         ) as dag:

    t0 = DummyOperator(task_id='t0')
    t1 = DummyOperator(task_id='t1')
    t2 = DummyOperator(task_id='t2')
    t3 = BashOperator(
        task_id='t3',
        bash_command='exit 123',
        on_failure_callback=clear_upstream_task
    )

    t0 >> t1 >> t2 >> t3

Airflow not loading operator tasks from a file other than the DAG file

Normally we define operators in the same Python file where the DAG is defined (see this basic example), and that is what I was doing. But my tasks are big and use custom operators, so I wanted a modular DAG project where all tasks that use the same operator live in a separate file. For simplicity, here is a very basic example. I have an operator x with several tasks. This is my project structure:
main_directory
├── tasks
|   ├── operator_x
|   |   └── op_x.py
|   ├── operator_y
|   :   └── op_y.py
|
└── dag.py
op_x.py has the following method:
def prepare_task():
    from main_directory.dag import dag
    t2 = BashOperator(
        task_id='print_inner_date',
        bash_command='date',
        dag=dag)
    return t2
and dag.py contains the following code:
from main_directory.tasks.operator_x import prepare_task

default_args = {
    'retries': 5,
    'retry_delay': dt.timedelta(minutes=5),
    'on_failure_callback': gen_email(EMAIL_DISTRO, retry=False),
    'on_retry_callback': gen_email(EMAIL_DISTRO, retry=True),
    'start_date': dt.datetime(2019, 5, 10)
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=dt.timedelta(days=1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = prepare_task()
Now when I execute this in my Airflow environment and run airflow list_dags, I get the desired DAG named test_dag listed, but when I run airflow list_tasks -t test_dag I only get one task with id print_date, and NOT the one defined inside the subdirectory with id print_inner_date. Can anyone help me understand what I am missing?
Your code would create cyclic imports. Instead, try the following:
op_x.py should have:
def prepare_task(dag):
    t2 = BashOperator(
        task_id='print_inner_date',
        bash_command='date',
        dag=dag)
    return t2
dag.py:
from main_directory.tasks.operator_x import prepare_task

default_args = {
    'retries': 5,
    'retry_delay': dt.timedelta(minutes=5),
    'on_failure_callback': gen_email(EMAIL_DISTRO, retry=False),
    'on_retry_callback': gen_email(EMAIL_DISTRO, retry=True),
    'start_date': dt.datetime(2019, 5, 10)
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=dt.timedelta(days=1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = prepare_task(dag=dag)
Also make sure that main_directory is in your PYTHONPATH.
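A quick way to confirm the refactor worked (not from the original answer) is to import the DAG and check that both tasks are attached; this assumes main_directory is on PYTHONPATH so the import resolves:

from main_directory.dag import dag

print(sorted(dag.task_ids))   # expect ['print_date', 'print_inner_date']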

Not receiving email on Airflow DAG timeout

I am using Airflow 1.8. I want Airflow to send an email when a DAG times out. Currently it sends emails when a task times out or when there is a retry, but not when the DAG itself times out. The DAG is intentionally set to run every minute; the task takes 10 seconds but the DAG timeout is 5 seconds. The DAG run fails, but it doesn't send any email.
Here is my code for the DAG:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['email@email.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(seconds=60)
}

schedule = '* * * * *'

dag = DAG('leader_dag',
          default_args=default_args,
          catchup=False,
          dagrun_timeout=timedelta(seconds=5),
          schedule_interval=schedule)

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='sleep 10',
    dag=dag)
Here is the SMTP part of airflow.cfg:
email_backend = airflow.utils.email.send_email_smtp
#[smtp]
# If you want airflow to send emails on retries, failure, and you want to use
# the airflow.utils.send_email function, you have to configure an smtp
# server here
smtp_host = "***********.amazonaws.com
smtp_starttls = True
smtp_ssl = False
smtp_user = user
smtp_port = 25
smtp_password = password
smtp_mail_from = no-reply@example.com
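It may be relevant that email_on_failure and email_on_retry are task-level settings, so a run that is killed by dagrun_timeout does not by itself email anyone. On versions that accept a DAG-level on_failure_callback (1.10+; this assumes upgrading is an option, it is not something from the original question), one hedged workaround is to send the mail yourself from that callback. A minimal sketch:

from airflow.utils.email import send_email

def notify_dag_failure(context):
    # Runs when the DAG run is marked failed, which covers a dagrun_timeout on recent versions.
    dag_run = context.get('dag_run')
    send_email(
        to=['email@email.com'],
        subject='leader_dag run failed',
        html_content='DAG run {} failed (possibly due to dagrun_timeout).'.format(dag_run))

dag = DAG('leader_dag',
          default_args=default_args,
          catchup=False,
          dagrun_timeout=timedelta(seconds=5),
          on_failure_callback=notify_dag_failure,
          schedule_interval=schedule)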
