I would like to create an Airflow DAG and want to learn which parameters should be set in field_1 vs default_args vs args?
my_dag = DAG(
"my_dag",
"field_1"="xxx",
default_agrs=default_args,
**args
)
I checked with documentation, I understand that some parameters such as "owner" have to be set through the default_args and can't be in field_1. But looks like there's no difference for the most of parameters. I tested some fields such as "catchup" and "on_failure_callback", and they all work in all these three places.
So I wonder what's the best practice of setting parameters when create a dag?
The best practice is something like the Airflow tutorial
with DAG(
'tutorial',
# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args={
'depends_on_past': False,
'email': ['airflow#example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
# 'queue': 'bash_queue',
# 'pool': 'backfill',
# 'priority_weight': 10,
# 'end_date': datetime(2016, 1, 1),
# 'wait_for_downstream': False,
# 'sla': timedelta(hours=2),
# 'execution_timeout': timedelta(seconds=300),
# 'on_failure_callback': some_function,
# 'on_success_callback': some_other_function,
# 'on_retry_callback': another_function,
# 'sla_miss_callback': yet_another_function,
# 'trigger_rule': 'all_success'
},
description='A simple tutorial DAG',
schedule_interval=timedelta(days=1),
start_date=datetime(2021, 1, 1),
catchup=False,
tags=['example'],
) as dag:
...
Reference: https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html#example-pipeline-definition
But I use something like that and is enough:
import pendulum
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'retries': 1,
'retry_delay': timedelta(minutes=1)
}
with DAG(
default_args=default_args,
dag_id='dag_etl',
catchup=False,
start_date=pendulum.datetime(year=2022, month=1, day=1, tz='America/Chicago'),
schedule_interval='0 8 * * *', # https://crontab.guru/#0_8_*_*_*
description='DAG Extract Transform Load'
) as dag:
...
Related
I have been trying to create a trace using open-telemetry but the endpoint doesn't give out any results..
endpoint="http://localhost:55680"
any help appreciated
also did anyone try to install this package in airflow
! from opentelemetry.exporter.cloud_trace
(this package helps to create a trace in google cloud , trying to run that through airflow)
it throws a version error
import airflow
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import timedelta
from airflow.utils.dates import days_ago
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': days_ago(2),
'email': ['airflow#example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
# 'queue': 'bash_queue',
# 'pool': 'backfill',
# 'priority_weight': 10,
# 'end_date': datetime(2016, 1, 1),
# 'wait_for_downstream': False,
# 'dag': dag,
# 'sla': timedelta(hours=2),
# 'execution_timeout': timedelta(seconds=300),
# 'on_failure_callback': some_function,
# 'on_success_callback': some_other_function,
# 'on_retry_callback': another_function,
# 'sla_miss_callback': yet_another_function,
# 'trigger_rule': 'all_success'
}
dag = DAG(
'create_a_trace',
default_args=default_args,
description='open telemetry trace creation',
schedule_interval=timedelta(days=1),
)
def execute():
resource = Resource(attributes={
"service.name": "test_airflow_worker"
})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:55680", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
with tracer.start_as_current_span("test_task_span"):
print("Hello Airflow!")
run_this = PythonOperator(
task_id='print_the_context',
provide_context=True,
python_callable=execute,
dag=dag,
)
run_this
I've seen a few responses to this before but they haven't worked for me.
I'm running the Bridge 1.10.15 airflow version so we can migrate to Airflow 2, and I ran airflow upgrade_check and I'm seeing the below error:
/usr/local/lib/python3.7/site-packages/airflow/models/dag.py:1342:
PendingDeprecationWarning: The requested task could not be added to
the DAG with dag_id snapchat_snowflake_daily because a task with
task_id snp_bl_global_content_reporting is already in the DAG.
Starting in Airflow 2.0, trying to overwrite a task will raise an
exception.
same error is happening but with task_id: snp_bl_global_article_reporting and snp_bl_global_video_reporting
I've also seen someone recommend setting load_examples = False in the airflow.cfg file which I already have.
Here is my code:
DAG_NAME = 'snapchat_snowflake_daily'
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2020, 6, 12),
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'provide_context': True,
'on_failure_callback': task_fail_slack_alert,
'sla': timedelta(hours=24),
}
dag = DAG(
DAG_NAME,
default_args=default_args,
catchup=False,
schedule_interval='0 3 * * *')
with dag:
s3_to_snowflake = SnowflakeLoadOperator(
task_id=f'load_into_snowflake_for_{region}',
pool='airflow_load',
retries=0, )
snp_il_global = SnowflakeQueryOperator(
task_id='snp_il_global',
sql='queries/snp_il_gl.sql',
retries=0)
snp_bl_global_video_reporting = SnowflakeQueryOperator(
task_id='snp_bl_global_video_reporting',
sql='snp_bl_gl_reporting.sql',
retries=0)
snp_bl_global_content_reporting = SnowflakeQueryOperator(
task_id='snp_bl_global_content_reporting',
sql='snp_bl_global_c.sql')
snp_bl_global_article_reporting = SnowflakeQueryOperator(
task_id='snp_bl_global_article_reporting',
sql='snp_bl_global_a.sql',
retries=0)
s3_to_snowflake >> snp_il_global >> [
snp_bl_global_video_reporting,
snp_bl_global_content_reporting,
snp_bl_global_article_reporting
]
I have 2 dags in airflow, both of which are scheduled to run at 22 UTC time (12PM my time (HST)). I find that only one of these dags is running at this time and am not sure why this would be happening. I can manually start the other dag while the one that works is running, but it just does not start on it's own.
Here is the dag configs for the dag that is running on schedule
default_args = {
'owner': 'me',
'depends_on_past': False,
'start_date': datetime(2019, 10, 13),
'email': [
'me#co.org'
],
'email_on_failure': True,
'retries': 0,
'retry_delay': timedelta(minutes=5),
'max_active_runs': 1,
}
dag = DAG('my_dag_1', default_args=default_args, catchup=False, schedule_interval="0 22 * * *")
Here is the dag configs for the dag that is failing to run
default_args = {
'owner': 'me',
'depends_on_past': False,
'start_date': datetime(2019, 10, 13),
'email': [
'me#co.org',
],
'email_on_failure': True,
'retries': 0,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('my_dag_2', default_args=default_args,
max_active_runs=1,
catchup=False, schedule_interval=f"0 19,22,1 * * *")
# run setup dag and trigger at 9AM, 12PM,and 3PM (need to convert from UTC time (-2 HST))
From the airflow.cfg file, some of the settings that I think are relevant are set as...
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
#parallelism = 32
parallelism = 8
# The number of task instances allowed to run concurrently by the scheduler
#dag_concurrency = 16
dag_concurrency = 3
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# The maximum number of active DAG runs per DAG
#max_active_runs_per_dag = 16
max_active_runs_per_dag = 1
Not sure what could be going on here. Is there some setting that I am mistakenly switching on that stops multiple different dags from running at the same time? Any more debugging info to get to add to this question?
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
#from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.models import Variable
from datetime import datetime, timedelta
from epsilon_spark_operator import EpsilonSparkOperator
#from merlin_spark_submit_operator import MerlinSparkSubmitOperator
#start=timedelta(hours=3)
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': airflow.utils.dates.days_ago(2),
# 'end_date': datetime(2019, 12, 12),
# 'schedule_interval': '37 12 * * *'
'email': ['ankit.maloo#inmobi.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 1,
'retry_delay': timedelta(minutes=1),
'queue': 'default'
}
dag = DAG('epsilon_spark_test', default_args=default_args, schedule_interval='#once')
spark_submit_task = EpsilonSparkOperator(
task_id='spark_submit_job',
conn_id='epsilon_spark',
application='abfs://***********.jar',
java_class='com.inmobi.EpsilonTest',
application_args=['10'],
verbose=True,
cluster_name="*****",
azure_storage_conn_id="*****",
keyvault_name='*****',
keyvault_client_id_key='*****',
keyvault_client_secret_key='*****',
conf={'spark.executors': '30', 'spark.eventLog.enabled': 'false', 'spark.eventLog.dir':'/tmp', 'spark.shuffle.service.enabled':'true'},
dag=dag)
dag >> spark_submit_task
I am trying to tesr airflow cluster with a sample spark job.
Above is my python code.
When trying to deploy the dag via curl
curl -X POST -H 'Content-Type: multipart/form-data' -F 'dag_file=#/Users/ankit.maloo/dag_test.py' -F 'force=on' 'http://*.*.*.*:8080/admin/rest_api/api?api=deploy_dag'
Gives error as
Broken DAG: [/root/airflow/dags/dag_test1.py] Argument ['execution_time'] is required.
Any Idea?
I have the following DAG:
default_args = {
'owner': 'Airflow',
'depends_on_past': False,
'start_date': datetime(2018, 07, 19, 11,0,0),
'email': ['me#me.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 5,
'retry_delay': timedelta(minutes=2),
'catchup' : False,
'depends_on_past' : False,
}
with DAG('some_dag', schedule_interval=timedelta(minutes=30), max_active_runs=1, default_args=default_args) as dag:
This dag runs every 30 minutes. It rewrite data in the table (delete all and write). So if Airflow was down for 2 days there is no point in running all the missing dag runs during that time.
However the above definition does not work. After 2 days that airflow was down it still try to run all the missing tasks.
How can i solve this?
OK. I solved it.
Apparently there is no meaning for 'catchup' : False on the default_args . It does nothing.
I changed the code to
default_args = {
'owner': 'Airflow',
'depends_on_past': False,
'start_date': datetime(2018, 07, 19, 11,0,0),
'email': ['me#me.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 5,
'retry_delay': timedelta(minutes=2),
'depends_on_past' : False,
}
with DAG('some_dag', catchup=False, schedule_interval=timedelta(minutes=30), max_active_runs=1, default_args=default_args) as dag:
now it works.
according to the docs: https://airflow.apache.org/scheduler.html#backfill-and-catchup
Adding dag.catchup = False to the DAG args.
Adding catchup_by_default = False to the airflow.cfg
Depends on your use case, a good practice could be set catchup_by_default = False and then only use dag.catchup = True if a given DAG requires the catchup.