I have an Airflow task that is scheduled to run every 3 minutes. Sometimes the task takes longer than 3 minutes, and the next scheduled run starts (or is queued) even though the previous one is still running.
Is there a way to define the DAG so that it does NOT even queue the task if it is already running?
# airflow related
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators import MsSqlOperator
# other packages
from datetime import datetime
from datetime import timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 7, 22, 15, 0, 0),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    catchup=False)

job1 = BashOperator(
    task_id='sales',
    bash_command='python2 /home/manager/ETL/sales.py',
    dag=dag)

job2 = MsSqlOperator(
    task_id='refresh_tabular',
    mssql_conn_id='mssql_globrands',
    sql="USE msdb ; EXEC dbo.sp_start_job N'refresh Management-sales' ; ",
    dag=dag)

job1 >> job2
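A commonly suggested approach, not part of the original post, is to cap the DAG at a single active run. A minimal sketch, reusing the default_args above; note that the scheduler may still create the next DAG run, but with max_active_runs=1 it waits instead of executing alongside the one already running:

from airflow import DAG

# Sketch only: max_active_runs=1 ensures at most one DAG run executes at a
# time, so a new 3-minute slot cannot overlap the previous one.
dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    catchup=False,
    max_active_runs=1)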
Related
I have the following in the dag.py file. This is a newly pushed-to-prod DAG; it should have run at 14 UTC (9 EST) a few hours ago, but it still hasn't run, even though the UI still says it will run at 14 UTC.
DAG_NAME = "revenue_monthly"
START_DATE = datetime(2023, 1, 12)
SCHEDULE_INTERVAL = "0 14 3 * *"
default_args = {
'owner': 'airflow',
'start_date': START_DATE,
'depends_on_past': False
}
dag = DAG(DAG_NAME,
default_args=default_args,
schedule_interval=SCHEDULE_INTERVAL,
doc_md=doc_md,
max_active_runs=1,
catchup=False,
)
See the picture of the UI below:
The date and time you are seeing as Next Run is the logical_date, which is the start of the data interval. With the current configuration, the first DAG run will cover data from 2023-02-03 to 2023-03-03, so the DAG will only actually run on 2023-03-03 (the Run After date; you can see it when viewing the DAG by hovering over the schedule in the upper right corner).
Assuming you want the DAG to do the run it would have done on 2023-02-03 (today), you can achieve that by backfilling one run: either by backfilling manually, or by using catchup=True with a start_date before 2023-01-03:
from airflow import DAG
from pendulum import datetime
from airflow.operators.empty import EmptyOperator

DAG_NAME = "revenue_monthly_1"
START_DATE = datetime(2023, 1, 1)
SCHEDULE_INTERVAL = "0 14 3 * *"
doc_md = "documentation"

default_args = {
    'owner': 'airflow',
    'start_date': START_DATE,
    'depends_on_past': False
}

with DAG(
    DAG_NAME,
    default_args=default_args,
    schedule_interval=SCHEDULE_INTERVAL,
    doc_md=doc_md,
    max_active_runs=1,
    catchup=True,
) as dag:
    t1 = EmptyOperator(task_id="t1")
This gave me one run with the run id scheduled__2023-01-03T14:00:00+00:00, and the next run has the data interval 2023-02-03 to 2023-03-03, which will run after 2023-03-03.
This guide might help with the terminology Airflow uses around schedules.
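As a small illustration of the same point (not part of the original answer, and assuming the croniter package that Airflow itself depends on), you can compute the interval boundaries yourself and see why the 2023-01-03 interval only runs after 2023-02-03:

from datetime import datetime
from croniter import croniter

cron = croniter("0 14 3 * *", datetime(2023, 1, 1))
interval_start = cron.get_next(datetime)  # 2023-01-03 14:00 -> the logical_date shown as "Next Run"
interval_end = cron.get_next(datetime)    # 2023-02-03 14:00 -> the "Run After" date
print(interval_start, interval_end)       # the run is only scheduled once the interval has ended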
I've seen a few responses to this before, but they haven't worked for me.
I'm running the bridge release, Airflow 1.10.15, so we can migrate to Airflow 2. I ran airflow upgrade_check and I'm seeing the error below:
/usr/local/lib/python3.7/site-packages/airflow/models/dag.py:1342:
PendingDeprecationWarning: The requested task could not be added to
the DAG with dag_id snapchat_snowflake_daily because a task with
task_id snp_bl_global_content_reporting is already in the DAG.
Starting in Airflow 2.0, trying to overwrite a task will raise an
exception.
The same error is happening, but with task_ids snp_bl_global_article_reporting and snp_bl_global_video_reporting.
I've also seen someone recommend setting load_examples = False in the airflow.cfg file, which I already have.
Here is my code:
DAG_NAME = 'snapchat_snowflake_daily'

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 6, 12),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'provide_context': True,
    'on_failure_callback': task_fail_slack_alert,
    'sla': timedelta(hours=24),
}

dag = DAG(
    DAG_NAME,
    default_args=default_args,
    catchup=False,
    schedule_interval='0 3 * * *')

with dag:
    s3_to_snowflake = SnowflakeLoadOperator(
        task_id=f'load_into_snowflake_for_{region}',
        pool='airflow_load',
        retries=0, )

    snp_il_global = SnowflakeQueryOperator(
        task_id='snp_il_global',
        sql='queries/snp_il_gl.sql',
        retries=0)

    snp_bl_global_video_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_video_reporting',
        sql='snp_bl_gl_reporting.sql',
        retries=0)

    snp_bl_global_content_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_content_reporting',
        sql='snp_bl_global_c.sql')

    snp_bl_global_article_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_article_reporting',
        sql='snp_bl_global_a.sql',
        retries=0)

    s3_to_snowflake >> snp_il_global >> [
        snp_bl_global_video_reporting,
        snp_bl_global_content_reporting,
        snp_bl_global_article_reporting
    ]
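For reference, and not taken from the code above, this is the pattern the warning describes: a hypothetical minimal example where two tasks share a task_id, which only emits the PendingDeprecationWarning on 1.10.x but raises an exception on Airflow 2:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical reproduction only: two tasks with the same task_id in one DAG.
with DAG('duplicate_task_demo', start_date=datetime(2021, 1, 1), schedule_interval=None) as demo_dag:
    first = BashOperator(task_id='my_task', bash_command='echo one')
    second = BashOperator(task_id='my_task', bash_command='echo two')  # duplicate -> warning on 1.10, exception on 2.0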
I have a requirement to copy S3 files to Redshift using the COPY command. I am a bit new to Airflow and am having issues. Can someone correct the code below? Can I call rs.execute() like that?
Error:
op.execute()
TypeError: execute() missing 1 required positional argument: 'context'
code:
import os
from airflow import DAG
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
from airflow.operators.s3_to_redshift_operator import S3ToRedshiftTransfer

default_args = {
    'owner': 'gra',
    'depends_on_past': False,
    'start_date': datetime(2020, 12, 13),
    'email': ['ss.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'schedule_interval': '@daily',
    'retries': 1,
    'retry_delay': timedelta(seconds=5),
}


def job1():
    print('First Job to start')


def s3_redshift(**kwargs):
    rs = S3ToRedshiftTransfer(redshift_conn_id='12as',
                              aws_conn_id='gt_read',
                              schema='test',
                              table='dept',
                              s3_bucket="gng-test",
                              s3_key="copt.csv",
                              task_id="copy_redshift"
                              # copy_options=copy_options_,
                              )
    rs.execute()


copy_redshift = PythonOperator(task_id='copy_redshift', python_callable=s3_redshift, provide_context=True, dag=dag)

app_start >> copy_redshift
I was able to use boto3 to execute the copy from S3 to Redshift. S3ToRedshiftTransfer can be used to do the same.
# airflow related
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
# other packages
from datetime import datetime
from datetime import timedelta
# from airflow.hooks import PostgresHook
from airflow.operators.s3_to_redshift_operator import S3ToRedshiftTransfer
# from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.contrib.operators.aws_athena_operator import AWSAthenaOperator
from airflow.operators import SimpleHttpOperator, HttpSensor, BashOperator, EmailOperator, S3KeySensor
import boto3

default_args = {
    'owner': 'grit_delta',
    'depends_on_past': False,
    'start_date': datetime(2020, 12, 13),
    'email': ['sa.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'schedule_interval': '@daily',
    'retries': 1,
    'retry_delay': timedelta(seconds=5),
}

dag = DAG(dag_id='veritas_test', default_args=default_args, schedule_interval=timedelta(1))


def job1():
    print('First Job to start')


file_sensor = S3KeySensor(task_id='s3_key_sensor_task',
                          s3_conn_id='_read',
                          poke_interval=120,
                          timeout=18 * 60 * 60,
                          bucket_key="data/test.*",
                          bucket_name="g-test",
                          wildcard_match=True,
                          dag=dag
                          )

app_start = PythonOperator(task_id='app_start', python_callable=job1, dag=dag)


def s3_redshift(**kwargs):
    rsd = boto3.client('redshift-data')
    deptKey = 's3://airflow-dev/code/gta/dag/dept.csv'
    sqlQuery = "copy test.dept from 's3://airflow-grole' CSV ;"
    # sqlQuery = "insert into test.dept values('d1221',100)"
    print(sqlQuery)
    resp = rsd.execute_statement(
        ClusterIdentifier="opes",
        Database="ee",
        DbUser="aa",
        Sql=sqlQuery
        # Sql="CREATE TABLE IF NOT EXISTS test.dept (title varchar(10), rating int);"
    )
    print(resp)
    print(" completed")
    return "OK"


copy_redshift = PythonOperator(task_id='copy_redshift', python_callable=s3_redshift, provide_context=True, dag=dag)

file_sensor >> app_start >> copy_redshift
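One caveat worth adding, not part of the original answer: the Redshift Data API is asynchronous, so execute_statement returns before the COPY has finished. A minimal sketch, assuming boto3's redshift-data client, of polling the statement until it completes:

import time
import boto3


def wait_for_statement(statement_id, poll_seconds=5):
    """Poll the Redshift Data API until the submitted statement finishes."""
    rsd = boto3.client('redshift-data')
    while True:
        desc = rsd.describe_statement(Id=statement_id)
        status = desc['Status']  # SUBMITTED, PICKED, STARTED, FINISHED, ABORTED or FAILED
        if status in ('FINISHED', 'ABORTED', 'FAILED'):
            return desc
        time.sleep(poll_seconds)


# Usage inside s3_redshift(), after execute_statement:
# result = wait_for_statement(resp['Id'])
# if result['Status'] != 'FINISHED':
#     raise ValueError(result.get('Error', 'Redshift statement did not finish'))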
You have not defined any DAG, and you don't use operators like that. I would recommend reading a bit more about how to use Airflow. Anyway, the code should be:
import os
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
from airflow.operators.s3_to_redshift_operator import S3ToRedshiftTransfer

default_args = {
    'owner': 'gra',
    'depends_on_past': False,
    'start_date': datetime(2020, 12, 13),
    'email': ['ss.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5),
}

with DAG('dag_name', schedule_interval='@daily', default_args=default_args) as dag:
    rs = S3ToRedshiftTransfer(redshift_conn_id='12as',
                              aws_conn_id='gt_read',
                              schema='test',
                              table='dept',
                              s3_bucket="gng-test",
                              s3_key="copt.csv",
                              task_id="copy_redshift"
                              )

    app_start >> rs
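For completeness, the original TypeError comes from the operator's execute(self, context) signature: Airflow normally calls execute and supplies the context itself. If you really do instantiate an operator inside a python_callable (generally not recommended, as noted above), the context has to be passed explicitly; a hedged sketch under that assumption:

from airflow.operators.s3_to_redshift_operator import S3ToRedshiftTransfer


def s3_redshift(**kwargs):
    # Sketch only: a manually built operator, executed with the context that
    # PythonOperator(provide_context=True) passes in as keyword arguments.
    rs = S3ToRedshiftTransfer(redshift_conn_id='12as',
                              aws_conn_id='gt_read',
                              schema='test',
                              table='dept',
                              s3_bucket='gng-test',
                              s3_key='copt.csv',
                              task_id='copy_redshift')
    rs.execute(context=kwargs)  # supply the missing 'context' argument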
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
# from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.models import Variable
from datetime import datetime, timedelta
from epsilon_spark_operator import EpsilonSparkOperator
# from merlin_spark_submit_operator import MerlinSparkSubmitOperator

# start = timedelta(hours=3)
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    # 'end_date': datetime(2019, 12, 12),
    # 'schedule_interval': '37 12 * * *'
    'email': ['ankit.maloo@inmobi.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'queue': 'default'
}

dag = DAG('epsilon_spark_test', default_args=default_args, schedule_interval='@once')

spark_submit_task = EpsilonSparkOperator(
    task_id='spark_submit_job',
    conn_id='epsilon_spark',
    application='abfs://***********.jar',
    java_class='com.inmobi.EpsilonTest',
    application_args=['10'],
    verbose=True,
    cluster_name="*****",
    azure_storage_conn_id="*****",
    keyvault_name='*****',
    keyvault_client_id_key='*****',
    keyvault_client_secret_key='*****',
    conf={'spark.executors': '30', 'spark.eventLog.enabled': 'false', 'spark.eventLog.dir': '/tmp', 'spark.shuffle.service.enabled': 'true'},
    dag=dag)

dag >> spark_submit_task
I am trying to test an Airflow cluster with a sample Spark job.
Above is my Python code.
When I try to deploy the DAG via curl

curl -X POST -H 'Content-Type: multipart/form-data' -F 'dag_file=@/Users/ankit.maloo/dag_test.py' -F 'force=on' 'http://*.*.*.*:8080/admin/rest_api/api?api=deploy_dag'

it gives this error:
Broken DAG: [/root/airflow/dags/dag_test1.py] Argument ['execution_time'] is required.
Any idea?
I am using Airflow 1.8. I want Airflow to send an email when a DAG times out. Currently it sends emails when a task times out or when there is a retry, but not when the DAG times out. The DAG is intentionally set to run every minute; the task takes 10 seconds, but the DAG timeout is 5 seconds. The DAG fails, but it doesn't send any email.
Here is my code for the DAG:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['email@email.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(seconds=60)
}

schedule = '* * * * *'

dag = DAG('leader_dag',
          default_args=default_args, catchup=False, dagrun_timeout=timedelta(seconds=5),
          schedule_interval=schedule)

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='sleep 10',
    dag=dag)
Here is the smtp part of airflow.cfg:
email_backend = airflow.utils.email.send_email_smtp

#[smtp]
# If you want airflow to send emails on retries, failure, and you want to use
# the airflow.utils.send_email function, you have to configure an smtp
# server here
smtp_host = ***********.amazonaws.com
smtp_starttls = True
smtp_ssl = False
smtp_user = user
smtp_port = 25
smtp_password = password
smtp_mail_from = no-reply@example.com
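Not part of the original post, but one workaround worth considering, under the assumption that an email about a slow run is acceptable: instead of relying on dagrun_timeout for the alert, set a task-level SLA. The scheduler sends an SLA-miss email to the addresses in 'email' when the task has not completed within the SLA window after its scheduled time, even though the task itself has not failed. A minimal sketch of the relevant default_args:

from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2015, 6, 1),
    'email': ['email@email.com'],
    # Sketch only: an SLA miss triggers a scheduler email, which can stand in
    # for a DAG-timeout alert when the run takes longer than expected.
    'sla': timedelta(seconds=5),
}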