Creating a DAG, I am getting this error:
root/.venv/lib/python3.6/site-packages/airflow/models/dag.py:1342: PendingDeprecationWarning: The requested task could not be added to the DAG because a task with task_id create_tag_template_field_result is already in the DAG. Starting in Airflow 2.0, trying to overwrite a task will raise an exception.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 11, 5, 10, 0, 0),
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}
def get_activated_sources():
    request = "SELECT * FROM users"
    pg_hook = PostgresHook(postgres_conn_id="postgres", schema="postgres")
    connection = pg_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute(request)
    sources = cursor.fetchall()
    for source in sources:
        print("Source: {0} activated {1}".format(source[0], source[1]))
    return sources
with DAG('hook_dag',
         default_args=default_args,
         schedule_interval='@once',
         catchup=False
         ) as dag:

    start_task = DummyOperator(task_id='start_task')

    hook_task = PythonOperator(task_id='hook_task',
                               python_callable=get_activated_sources)

    start_task >> hook_task
What is wrong and how do I solve it? Please help me.
I've seen a few responses to this before, but they haven't worked for me.
I'm running the Airflow 1.10.15 bridge release so we can migrate to Airflow 2. I ran airflow upgrade_check and I'm seeing the error below:
/usr/local/lib/python3.7/site-packages/airflow/models/dag.py:1342:
PendingDeprecationWarning: The requested task could not be added to
the DAG with dag_id snapchat_snowflake_daily because a task with
task_id snp_bl_global_content_reporting is already in the DAG.
Starting in Airflow 2.0, trying to overwrite a task will raise an
exception.
The same error also appears for task_id snp_bl_global_article_reporting and snp_bl_global_video_reporting.
I've also seen someone recommend setting load_examples = False in the airflow.cfg file, which I have already done.
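For reference, that setting lives under the [core] section of airflow.cfg:

[core]
load_examples = False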
Here is my code:
DAG_NAME = 'snapchat_snowflake_daily'

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 6, 12),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'provide_context': True,
    'on_failure_callback': task_fail_slack_alert,
    'sla': timedelta(hours=24),
}

dag = DAG(
    DAG_NAME,
    default_args=default_args,
    catchup=False,
    schedule_interval='0 3 * * *')
with dag:
    s3_to_snowflake = SnowflakeLoadOperator(
        task_id=f'load_into_snowflake_for_{region}',
        pool='airflow_load',
        retries=0,
    )

    snp_il_global = SnowflakeQueryOperator(
        task_id='snp_il_global',
        sql='queries/snp_il_gl.sql',
        retries=0)

    snp_bl_global_video_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_video_reporting',
        sql='snp_bl_gl_reporting.sql',
        retries=0)

    snp_bl_global_content_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_content_reporting',
        sql='snp_bl_global_c.sql')

    snp_bl_global_article_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_article_reporting',
        sql='snp_bl_global_a.sql',
        retries=0)

    s3_to_snowflake >> snp_il_global >> [
        snp_bl_global_video_reporting,
        snp_bl_global_content_reporting,
        snp_bl_global_article_reporting
    ]
Suppose I have the following DAG (with basic placeholder functions) that uses a for-loop to dynamically generate tasks by iterating over a list:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'ETLUSER',
    'depends_on_past': False,
    'start_date': datetime(2019, 12, 16, 0, 0, 0),
    'email': ['xxx@xxx.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('xxx', catchup=False,
          default_args=default_args, schedule_interval='0 */4 * * *')
# Some dummy functions
def StepOne(x):
    print(x)

def StepTwo():
    print("Okay, we finished all of Step 1.")

some_list = [1, 2, 3, 4, 5, 6]

for t in some_list:
    task_id = f'FirstStep_{t}'
    task = PythonOperator(
        task_id=task_id,
        python_callable=StepOne,
        provide_context=False,
        op_kwargs={'x': str(t)},
        dag=dag
    )
    task
I want to introduce an additional task that's simply:
task2 = PythonOperator(
    task_id="SecondStep",
    python_callable=StepTwo,
    provide_context=False,
    dag=dag
)
It should run only after all of the first-step tasks have finished. Linearly, this would be task >> task2.
How do I go about doing this?
You can define task dependencies with arrays.

Run taskC after both taskA and taskB have finished:
[taskA, taskB] >> taskC

or run taskB and taskC in parallel after taskA has finished:
taskA >> [taskB, taskC]

This works as long as one side of the upstream/downstream relation is a non-array.
Thus, for your example:

task1 = []
for t in some_list:
    task_id = f'FirstStep_{t}'
    task1.append(PythonOperator(
        task_id=task_id,
        python_callable=StepOne,
        provide_context=False,
        op_kwargs={'x': str(t)},
        dag=dag))

task2 = PythonOperator(
    task_id="SecondStep",
    python_callable=StepTwo,
    provide_context=False,
    dag=dag)

task1 >> task2
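On the "one side must be non-array" point: list >> list is not supported directly. If you ever need every task in one list upstream of every task in another, a minimal sketch using cross_downstream (from airflow.utils.helpers in recent 1.10.x releases; taskC and taskD are hypothetical placeholders) would be:

from airflow.utils.helpers import cross_downstream

# Make every task in the first list upstream of every task in the second,
# i.e. taskC and taskD each wait for both taskA and taskB.
cross_downstream([taskA, taskB], [taskC, taskD])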
I have defined the external_sensor like this:
external_sensor = ExternalTaskSensor(task_id='ext_sensor_task',
                                     execution_delta=timedelta(minutes=0),
                                     external_dag_id='book_data',
                                     external_task_id='Dataframe_Windows_test',
                                     dag=dag)
The other task is defined like this:
dl_processing_windows = DL_Processing(task_id='dl_processing_windows',
                                      df_dataset_location=dl_config.WINDOWS_DATASET,
                                      ....
In the Airflow UI, I get the error:
Argument ['task_id'] is required
I have two questions:
1. Why does this error occur?
2. Why does it not work?
For reference, here is the DAG definition:
default_args = {
    'owner': 'Newt',
    'retries': 2,
    'retry_delay': timedelta(seconds=30),
    'depends_on_past': False,
}

dag = DAG(
    dag_id,
    start_date=datetime(2019, 11, 20),
    description='xxxx',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
)
The DAG parameters are the same for both DAGs!
I fixed it.
Previously, the start_date was not the same for both DAGs. I set the start_date of both DAGs to the same time, using the current date.
After the first run of the DAG it depends on finished, the new DAG began to work!
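A minimal sketch of that setup (assuming Airflow 1.10.x; the child dag_id 'dl_processing' and the bare DAG arguments are placeholders, the other ids come from the question):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.sensors.external_task_sensor import ExternalTaskSensor

# Both DAGs share the same start_date and schedule_interval, so their
# execution dates line up and execution_delta=timedelta(minutes=0) matches.
start_date = datetime(2019, 11, 20)
schedule = timedelta(hours=1)

# book_data would normally live in its own file with its own tasks.
parent_dag = DAG('book_data', start_date=start_date, schedule_interval=schedule)
child_dag = DAG('dl_processing', start_date=start_date, schedule_interval=schedule)

external_sensor = ExternalTaskSensor(
    task_id='ext_sensor_task',
    external_dag_id='book_data',
    external_task_id='Dataframe_Windows_test',
    execution_delta=timedelta(minutes=0),  # same logical run time in both DAGs
    dag=child_dag,
)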
I am trying to add Airflow DAGs dynamically by looping through a dictionary's keys and using each key as the DAG name.
The DAGs are created fine, but I am getting "This DAG isn't available in the webserver DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database", and the DAG is not clickable.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def create_dag(dag_id):
    args = build_default_args(config_file)
    dag = DAG(dag_id, schedule_interval='30 11 * * *', default_args=args)
    with dag:
        init_task = BashOperator(
            task_id='test_init_task',
            bash_command='echo "task"',
            dag=dag
        )
        init_task
    return dag

def get_data(**kwargs):
    my_list = []
    file = open("/home/airflow/gcs/data/test.json")
    data = json.load(file)
    return data

data1 = get_data()
for record in data1:
    for key, value in record.items():
        print("key", key, "value", value)
        dag_id = '{}'.format(key)
        default_args = {'owner': 'airflow',
                        'start_date': datetime(2019, 6, 18)}
        schedule = '@daily'
        globals()[dag_id] = create_dag(dag_id)
I want to set the execution_date in a triggered DAG. I'm using the TriggerDagRunOperator; this operator has an execution_date parameter, and I want to set it to the current execution_date.
import pprint
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dagrun_operator import TriggerDagRunOperator

def conditionally_trigger(context, dag_run_obj):
    """This function decides whether or not to Trigger the remote DAG"""
    pp = pprint.PrettyPrinter(indent=4)
    c_p = Variable.get("VAR2") == Variable.get("VAR1") and Variable.get("VAR3") == "1"
    print("Controller DAG : conditionally_trigger = {}".format(c_p))
    if Variable.get("VAR2") == Variable.get("VAR1") and Variable.get("VAR3") == "1":
        pp.pprint(dag_run_obj.payload)
        return dag_run_obj
default_args = {
    'owner': 'pepito',
    'depends_on_past': False,
    'retries': 2,
    'start_date': datetime(2018, 12, 1, 0, 0),
    'email': ['xxxx@yyyyy.net'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(
    'DAG_1',
    default_args=default_args,
    schedule_interval="0 12 * * 1",
    dagrun_timeout=timedelta(hours=22),
    max_active_runs=1,
    catchup=False
)
trigger_dag_2 = TriggerDagRunOperator(
    task_id='trigger_dag_2',
    trigger_dag_id="DAG_2",
    python_callable=conditionally_trigger,
    execution_date={{ execution_date }},
    dag=dag,
    pool='a_roz'
)
But I get the following error:
name 'execution_date' is not defined
If I set
execution_date={{ 'execution_date' }},
or
execution_date='{{ execution_date }}',
I obtain
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1659, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/dagrun_operator.py", line 78, in execute
replace_microseconds=False)
File "/usr/local/lib/python3.6/site-packages/airflow/api/common/experimental/trigger_dag.py", line 98, in trigger_dag
replace_microseconds=replace_microseconds,
File "/usr/local/lib/python3.6/site-packages/airflow/api/common/experimental/trigger_dag.py", line 45, in _trigger_dag
assert timezone.is_localized(execution_date)
File "/usr/local/lib/python3.6/site-packages/airflow/utils/timezone.py", line 38, in is_localized
return value.utcoffset() is not None
AttributeError: 'str' object has no attribute 'utcoffset'
Does anyone know how I can set the execution date for DAG_2 so that it is equal to that of DAG_1?
This question is different from "airflow TriggerDagRunOperator how to change the execution date", because that post does not explain how to send the execution_date through the TriggerDagRunOperator; it only says that the possibility exists: https://stackoverflow.com/a/49442868/10269204
It was not templated previously, but it is templated now with this commit, so you can try your code with a newer version of Airflow.
Additionally, for a hardcoded execution_date, you need to set tzinfo:
from datetime import datetime, timezone
execution_date=datetime(2019, 3, 27, tzinfo=timezone.utc)
# or:
execution_date=datetime.now().replace(tzinfo=timezone.utc)
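As a sketch of the templated form (assuming a newer release that includes that commit, roughly Airflow 1.10.4 or later), the operator from the question would pass the current execution date as a template string:

trigger_dag_2 = TriggerDagRunOperator(
    task_id='trigger_dag_2',
    trigger_dag_id="DAG_2",
    python_callable=conditionally_trigger,
    # rendered at runtime to DAG_1's execution_date and handed on to DAG_2
    execution_date='{{ execution_date }}',
    dag=dag,
    pool='a_roz'
)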