Cloud Composer scheduler error when adding first dag - airflow

I have a DAG running on my local Airflow.
I launched Cloud Composer and wanted to move my DAGs there.
When I added the first DAG file, the scheduler showed this error:
Traceback (most recent call last):
File "/usr/local/lib/airflow/airflow/models.py", line 363, in process_file
m = imp.load_source(mod_name, filepath)
File "/usr/local/lib/python3.6/imp.py", line 172, in load_source
module = _load(spec)
File "<frozen importlib._bootstrap>", line 684, in _load
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 674, in exec_module
File "<frozen importlib._bootstrap_external>", line 781, in get_code
File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/home/airflow/gcs/dags/testdag.py", line 95
'start_date': datetime(2018, 12, 05),
This is line 95:
args = {
    'owner': 'Airflow',
    'start_date': datetime(2018, 12, 05),
    'retries': 5,
    'retry_delay': timedelta(minutes=5)
}
Never encountered this error before.

If you want to run the DAG and catch up from historical dates, then give a past date as start_date.
Try giving:
from datetime import datetime, timedelta

args = {
    'owner': 'Airflow',
    'provide_context': True,
    'depends_on_past': False,
    'start_date': datetime.combine(datetime.today(), datetime.min.time()),
    'retries': 5,
    'retry_delay': timedelta(minutes=5)
}

It may be the date value you gave in start_date. Python 3 does not allow a leading zero in an integer literal, so datetime(2018, 12, 05) fails to parse with a SyntaxError. Try giving just 5, i.e. datetime(2018, 12, 5), and update the DAG folder again.
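For completeness, a minimal sketch of the fixed block (identical to the args in the question except that the leading zero in the day is dropped):
from datetime import datetime, timedelta

args = {
    'owner': 'Airflow',
    'start_date': datetime(2018, 12, 5),  # 05 would be rejected by the Python 3 parser
    'retries': 5,
    'retry_delay': timedelta(minutes=5)
}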

Related

Airflow The requested task could not be added to the DAG because a task with task_id ... is already in the DAG

I've seen a few responses to this before, but they haven't worked for me.
I'm running the Airflow 1.10.15 bridge release so we can migrate to Airflow 2, and when I ran airflow upgrade_check I got the error below:
/usr/local/lib/python3.7/site-packages/airflow/models/dag.py:1342:
PendingDeprecationWarning: The requested task could not be added to
the DAG with dag_id snapchat_snowflake_daily because a task with
task_id snp_bl_global_content_reporting is already in the DAG.
Starting in Airflow 2.0, trying to overwrite a task will raise an
exception.
The same error is happening with task_id snp_bl_global_article_reporting and task_id snp_bl_global_video_reporting.
I've also seen someone recommend setting load_examples = False in airflow.cfg, which I have already done.
Here is my code:
DAG_NAME = 'snapchat_snowflake_daily'

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 6, 12),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'provide_context': True,
    'on_failure_callback': task_fail_slack_alert,
    'sla': timedelta(hours=24),
}

dag = DAG(
    DAG_NAME,
    default_args=default_args,
    catchup=False,
    schedule_interval='0 3 * * *')

with dag:
    s3_to_snowflake = SnowflakeLoadOperator(
        task_id=f'load_into_snowflake_for_{region}',
        pool='airflow_load',
        retries=0, )

    snp_il_global = SnowflakeQueryOperator(
        task_id='snp_il_global',
        sql='queries/snp_il_gl.sql',
        retries=0)

    snp_bl_global_video_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_video_reporting',
        sql='snp_bl_gl_reporting.sql',
        retries=0)

    snp_bl_global_content_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_content_reporting',
        sql='snp_bl_global_c.sql')

    snp_bl_global_article_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_article_reporting',
        sql='snp_bl_global_a.sql',
        retries=0)

    s3_to_snowflake >> snp_il_global >> [
        snp_bl_global_video_reporting,
        snp_bl_global_content_reporting,
        snp_bl_global_article_reporting
    ]
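For context, the warning in the question is emitted whenever a second task with an already-used task_id is added to the same DAG object, so it usually means a task is defined twice in the file or the same DAG collects the same task more than once per parse. A minimal hedged sketch that reproduces it (the DAG name here is hypothetical; the task_id is borrowed from the question):
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('duplicate_task_demo', start_date=datetime(2020, 6, 12), schedule_interval=None)

# Both operators use the same task_id, so adding the second one triggers the
# "task with task_id ... is already in the DAG" warning in Airflow 1.10.x
# and raises an exception in Airflow 2.0.
t1 = DummyOperator(task_id='snp_bl_global_content_reporting', dag=dag)
t2 = DummyOperator(task_id='snp_bl_global_content_reporting', dag=dag)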

In Apache Airflow 1.10.12 - No module named 'httplib2'

I am getting the below error for a sample DAG I am trying to write.
My Airflow installation has the following configuration:
pip install apache-airflow[crypto,celery,postgres,hive,jdbc,mysql,ssh,docker,hdfs,redis,slack,webhdfs,httplib2]==1.10.12
--constraint /requirements-python3.7.txt
Error:
[2020-12-19 22:41:19,342] {dagbag.py:259} ERROR - Failed to import: /usr/local/airflow/dags/alert_dag.py
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/airflow/models/dagbag.py", line 256, in process_file
m = imp.load_source(mod_name, filepath)
File "/usr/lib/python3.7/imp.py", line 171, in load_source
module = _load(spec)
File "<frozen importlib._bootstrap>", line 696, in _load
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/usr/local/airflow/dags/alert_dag.py", line 6, in <module>
from httplib2 import Http
ModuleNotFoundError: No module named 'httplib2'
Code:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from json import dumps
from httplib2 import Http

default_args = {
    'start_date': datetime(2020, 12, 19, 17, 0, 0),
    'owner': 'Airflow'
}

def on_success(dict):
    print('on_success_call_back function')
    print(dict)

def on_failure(dict):
    print('on_failure_call_back function')
    # """Hangouts Chat incoming webhook quickstart."""
    # url = 'https://chat.googleapis.com/v1/spaces/XXXX'
    # bot_message = {'text': 'alert_dag Failed'}
    # message_headers = {'Content-Type': 'application/json; charset=UTF-8'}
    # http_obj = Http()
    # response = http_obj.request(
    #     uri=url,
    #     method='POST',
    #     headers=message_headers,
    #     body=dumps(bot_message),
    # )

# on_success_call_back=on_success
with DAG(dag_id='alert_dag', schedule_interval="*/5 * * * *", default_args=default_args, catchup=True,
         dagrun_timeout=timedelta(seconds=25), on_failure_callback=on_failure) as dag:
    # Task 1
    t1 = BashOperator(task_id='t1', bash_command="exit 0")
    # Task 2
    t2 = BashOperator(task_id='t2', bash_command="echo 'second task'")

    t1 >> t2
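The traceback just means httplib2 is not importable in the environment that parses the DAG. httplib2 is a separate PyPI package and does not appear to be a valid apache-airflow extra, so listing it inside the brackets of the pip command above would not install it; installing it explicitly (for example, pip install httplib2) in the same environment the scheduler and workers use should clear the import error. A small hedged check you can run with that environment's interpreter to confirm whether and where the module resolves:
import importlib.util

# None means httplib2 is not installed in this interpreter's environment;
# otherwise this prints the path it would be imported from.
spec = importlib.util.find_spec("httplib2")
print("httplib2:", spec.origin if spec else "not installed")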

How to eliminate an error when generating an Airflow DAG

When creating the DAG, I get this error:
root/.venv/lib/python3.6/site-packages/airflow/models/dag.py:1342: PendingDeprecationWarning: The requested task could not be added to the DAG because a task with task_id create_tag_template_field_result is already in the DAG. Starting in Airflow 2.0, trying to overwrite a task will raise an exception.
default_args = {
    'owner': 'airflow',
    'depend_on_past': False,
    'start_date': datetime(2018, 11, 5, 10, 00, 00),
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

def get_activated_sources():
    request = "SELECT * FROM users"
    pg_hook = PostgresHook(postgre_conn_id="postgres", schema="postgres")
    connection = pg_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute(request)
    sources = cursor.fetchall
    for source in sources:
        print("Source: {0}} activated {1}".format(source[0], source[1]))
    return sources

with DAG('hook_dag',
         default_args=default_args,
         schedule_interval='@once',
         catchup=False
         ) as dag:
    start_task = DummyOperator(task_id='start_task')
    hook_task = PythonOperator(task_id='hook_task',
                               python_callable=get_activated_sources)
    start_task >> hook_task
How do I solve this? What is wrong? Please help me.
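The warning itself points at a task_id, create_tag_template_field_result, that does not appear in the posted code at all, so it is most likely raised while parsing a different DAG file (for example one of the bundled example DAGs) that adds two tasks with that id. Independently of that, the posted code has a few apparent typos that will fail once the DAG runs. A hedged corrected sketch, assuming a Postgres connection named postgres and the import paths of Airflow 1.10.x:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,          # was misspelled 'depend_on_past'
    'start_date': datetime(2018, 11, 5, 10, 0, 0),
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

def get_activated_sources():
    pg_hook = PostgresHook(postgres_conn_id="postgres", schema="postgres")  # keyword is postgres_conn_id
    connection = pg_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM users")
    sources = cursor.fetchall()        # fetchall must be called, not referenced
    for source in sources:
        print("Source: {0} activated {1}".format(source[0], source[1]))  # balanced braces
    return sources

with DAG('hook_dag',
         default_args=default_args,
         schedule_interval='@once',
         catchup=False) as dag:
    start_task = DummyOperator(task_id='start_task')
    hook_task = PythonOperator(task_id='hook_task',
                               python_callable=get_activated_sources)
    start_task >> hook_task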

Apache Airflow - How to set execution_date using TriggerDagRunOperator so the target DAG uses the current execution_date

I want to set the execution_date in a triggered DAG. I'm using the TriggerDagRunOperator, which has an execution_date parameter, and I want to pass it the current execution_date.
def conditionally_trigger(context, dag_run_obj):
    """This function decides whether or not to Trigger the remote DAG"""
    pp = pprint.PrettyPrinter(indent=4)
    c_p = Variable.get("VAR2") == Variable.get("VAR1") and Variable.get("VAR3") == "1"
    print("Controller DAG : conditionally_trigger = {}".format(c_p))
    if Variable.get("VAR2") == Variable.get("VAR1") and Variable.get("VAR3") == "1":
        pp.pprint(dag_run_obj.payload)
        return dag_run_obj

default_args = {
    'owner': 'pepito',
    'depends_on_past': False,
    'retries': 2,
    'start_date': datetime(2018, 12, 1, 0, 0),
    'email': ['xxxx@yyyyy.net'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(
    'DAG_1',
    default_args=default_args,
    schedule_interval="0 12 * * 1",
    dagrun_timeout=timedelta(hours=22),
    max_active_runs=1,
    catchup=False
)

trigger_dag_2 = TriggerDagRunOperator(
    task_id='trigger_dag_2',
    trigger_dag_id="DAG_2",
    python_callable=conditionally_trigger,
    execution_date={{ execution_date }},
    dag=dag,
    pool='a_roz'
)
But I get the following error:
name 'execution_date' is not defined
If I set
execution_date={{ 'execution_date' }},
or
execution_date='{{ execution_date }}',
I obtain
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1659, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/dagrun_operator.py", line 78, in execute
replace_microseconds=False)
File "/usr/local/lib/python3.6/site-packages/airflow/api/common/experimental/trigger_dag.py", line 98, in trigger_dag
replace_microseconds=replace_microseconds,
File "/usr/local/lib/python3.6/site-packages/airflow/api/common/experimental/trigger_dag.py", line 45, in _trigger_dag
assert timezone.is_localized(execution_date)
File "/usr/local/lib/python3.6/site-packages/airflow/utils/timezone.py", line 38, in is_localized
return value.utcoffset() is not None
AttributeError: 'str' object has no attribute 'utcoffset'
Does anyone know how I can set the execution date for DAG_2 so it equals DAG_1's?
This question is different from "airflow TriggerDagRunOperator how to change the execution date" because that post does not explain how to send the execution_date through the TriggerDagRunOperator; it only says that the possibility exists. https://stackoverflow.com/a/49442868/10269204
It was not templated previously, but it is templated now with this commit, so you can try your code with a newer version of Airflow.
Additionally, for a hardcoded execution_date you need to set tzinfo:
from datetime import datetime, timezone
execution_date=datetime(2019, 3, 27, tzinfo=timezone.utc)
# or:
execution_date=datetime.now().replace(tzinfo=timezone.utc)
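For passing DAG_1's own execution date through to DAG_2, a hedged sketch assuming a version where execution_date is one of the operator's templated fields (which is what the commit above introduces); the import path matches the traceback in the question:
from airflow.operators.dagrun_operator import TriggerDagRunOperator

# With execution_date templated, the Jinja string is rendered at runtime to
# DAG_1's execution date and passed along when DAG_2 is triggered.
trigger_dag_2 = TriggerDagRunOperator(
    task_id='trigger_dag_2',
    trigger_dag_id='DAG_2',
    execution_date='{{ execution_date }}',
    dag=dag,
)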

airflow does not satisfy task dependencies

I have a simple Airflow workflow composed of two tasks. One downloads a CSV file containing stock data; the other extracts the maximum stock price and writes the data to another file.
If I run the first task and then the second, everything works fine, but if I execute airflow run stocks_d get_max_share it fails to meet the dependency.
import csv
from datetime import datetime
from datetime import timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import requests

def get_stock_data():
    url = "https://app.quotemedia.com/quotetools/getHistoryDownload.csv?&webmasterId=501&startDay=02&startMonth=02&startYear=2002&endDay=02&endMonth=07&endYear=2009&isRanged=false&symbol=APL"
    try:
        r = requests.get(url)
    except requests.RequestException as re:
        raise
    else:
        with open('/tmp/stocks/airflow_stock_data.txt', 'w') as f:
            f.write(r.text)

def get_max_share():
    stock_data = []
    stock_max = {}
    with open('/tmp/stocks/airflow_stock_data.txt', 'r') as f:
        stock_reader = csv.reader(f)
        next(stock_reader, None)
        for row in stock_reader:
            stock_data.append(row)
    for stock in stock_data:
        stock_max[stock[2]] = stock[0]
    with open('/tmp/stocks/max_stock', 'w') as f:
        stock_price = max(stock_max.keys())
        stock_max_price_date = stock_max[stock_price]
        stock_entry = stock_max_price_date + ' -> ' + stock_price
        f.write(stock_entry)

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 5, 30),
    'email': ['mainl@domain.io'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'catchup': False,
}

dag = DAG('stocks_d', default_args=default_args, schedule_interval=timedelta(minutes=5))

task_get_stocks = PythonOperator(task_id='get_stocks', python_callable=get_stock_data, dag=dag)
task_get_max_share = PythonOperator(task_id='get_max_share', python_callable=get_max_share, dag=dag)

task_get_max_share.set_upstream(task_get_stocks)
Any ideas why that happens?
$ airflow run stocks_d get_max_share
The above command only runs the get_max_share task; it does not run the upstream task first.
If you want to run the whole DAG, try the command below:
$ airflow trigger_dag stocks_d
