Airflow errors out when trying to execute remote script through SSHHook

With Airflow, I am trying to execute a remote script through SSHHook. The script is simply this:
echo "this is a test"
On the remote machine, I can run it with "bash test".
I created an airflow script like this:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.contrib.hooks.ssh_hook import SSHHook
# add a new SSH connection using the WEB UI under the admin --> connections tab.
sshHook = SSHHook(ssh_conn_id="test_ssh")
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'tester',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 24),
    'email': ['user123@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
dag = DAG('test', default_args=default_args)
t1 = SSHOperator(
    ssh_conn_id=sshHook,
    task_id='test_20190624',
    command='bash /home/tester/run_test',
    dag=dag)
Then I got an error like this:
[2019-06-24 11:27:17,790] {ssh_operator.py:80} INFO - ssh_hook is not provided or invalid. Trying ssh_conn_id to create SSHHook.
[2019-06-24 11:27:17,792] {__init__.py:1580} ERROR - SSH operator error: 'SSHHook' object has no attribute 'upper'
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/operators/ssh_operator.py", line 82, in execute
timeout=self.timeout)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/hooks/ssh_hook.py", line 90, in __init__
conn = self.get_connection(self.ssh_conn_id)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/hooks/base_hook.py", line 80, in get_connection
conn = random.choice(cls.get_connections(conn_id))
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/hooks/base_hook.py", line 71, in get_connections
conn = cls._get_connection_from_env(conn_id)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/hooks/base_hook.py", line 63, in _get_connection_from_env
environment_uri = os.environ.get(CONN_ENV_PREFIX + conn_id.upper())
AttributeError: 'SSHHook' object has no attribute 'upper'

You should either use the SSH connection ID directly or use the SSHHook; the problem here is that you have mixed both.
1) Using SSHHook:
t1 = SSHOperator(
    ssh_hook=sshHook,
    task_id='test_20190624',
    command='bash /home/tester/run_test',
    dag=dag)
2) Using SSH Connection directly:
t1 = SSHOperator(
    ssh_conn_id="test_ssh",
    task_id='test_20190624',
    command='bash /home/tester/run_test',
    dag=dag)
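For context, the error itself comes from the operator treating whatever is passed as ssh_conn_id as a connection-id string. A rough sketch of the resolution logic (simplified from the 1.10 contrib operator, so treat it as an approximation rather than the exact implementation):
from airflow.contrib.hooks.ssh_hook import SSHHook

def resolve_hook(ssh_hook=None, ssh_conn_id=None):
    # If a usable SSHHook is supplied, it is used as-is.
    if isinstance(ssh_hook, SSHHook):
        return ssh_hook
    # Otherwise ssh_conn_id must be a string; the connection lookup uppercases it
    # while checking for an AIRFLOW_CONN_* environment variable, which is where
    # "'SSHHook' object has no attribute 'upper'" comes from when an SSHHook
    # object is passed here instead of a connection id.
    return SSHHook(ssh_conn_id=ssh_conn_id)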

Related

Facing task timeout error during DAG parsing in Airflow version 2.2.5

I am hitting the task timeout error with Airflow version 2.2.5 / Composer 2.0.15. The same code runs absolutely fine in Airflow version 2.2.3 / Composer version 1.18.0.
Error message:
Broken DAG: [/home/airflow/gcs/dags/test_dag.py] Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/enum.py", line 256, in __new__
if canonical_member._value_ == enum_member._value_:
File "/opt/python3.8/lib/python3.8/site-packages/airflow/utils/timeout.py", line 37, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: DagBag import timeout for /home/airflow/gcs/dags/test_dag.py after 30.0s.
Please take a look at these docs to improve your DAG import time:
* https://airflow.apache.org/docs/apache-airflow/2.2.5/best-practices.html#top-level-python-code
* https://airflow.apache.org/docs/apache-airflow/2.2.5/best-practices.html#reducing-dag-complexity, PID: 1827
As per the documentation and the links in the error message about top-level Python code, we have a framework in place for DAGs and tasks.
main_folder
|___ dags
|___ tasks
|___ libs
a) All the main DAG files are in the dags folder.
b) The actual functions and queries (PythonOperator callables / SQL queries) are placed in *.py files under the tasks folder.
c) Common functionality is placed in Python files in the libs folder.
Here is the basic DAG structure:
# Import libraries and functions
import datetime
from airflow import models, DAG
from airflow.contrib.operators import bigquery_operator, bigquery_to_gcs, bigquery_table_delete_operator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
##from airflow.executors.sequential_executor import SequentialExecutor
from airflow.utils.task_group import TaskGroup
## Import codes from tasks and libs folder
from libs.compres_suppress.cot_suppress import *
from libs.teams_plugin.teams_plugin import *
from tasks.email_code.trigger_email import *
# Set up Airflow DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.datetime(2020, 12, 15, 0),
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=1),
    'on_failure_callback': trigger_email
}
DAG_ID = 'test_dag'
# Check execution date
if "<some condition>" matches:
    run_date = <date in config file>
else:
    run_date = datetime.datetime.now().strftime("%Y-%m-%d")
run_date_day = datetime.datetime.now().isoweekday()
dag = DAG(
    DAG_ID,
    default_args=default_args, catchup=False,
    max_active_runs=1, schedule_interval=SCHEDULE_INTERVAL
)
next_dag_name = "next_dag1"
if env == "prod":
    if run_date_day == 7:
        next_dag_name = "next_dag2"
    else:
        next_dag_name = "next_dag1"
run_id = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
# Define Airflow DAG
with dag:
    team_notify_task = MSTeamsWebhookOperator(
        task_id='teams_notifi_start_task',
        http_conn_id='http_conn_id',
        message=f"DAG has started <br />"
                f"<strong> DAG ID:</strong> {DAG_ID}.<br />",
        theme_color="00FF00",
        button_text="My button",
        dag=dag)
    task1_bq = bigquery_operator.BigQueryOperator(
        task_id='task1',
        sql=task1_query(
            table1="table1",
            start_date=start_date),
        use_legacy_sql=False,
        destination_dataset_table="destination_tbl_name",
        write_disposition='WRITE_TRUNCATE'
    )
    ##### Base Skeletons #####
    with TaskGroup("taskgroup_lbl", tooltip="taskgroup_sample") as task_grp:
        tg_process(args=default_args, run_date=run_date)
    if run_mode == "<env_name>" and next_dag != "":
        next_dag_trigg = BashOperator(
            task_id=f'trigger_{next_dag_name}',
            bash_command="gcloud composer environments run " + <env> + "-cust_comp --location us-east1 dags trigger -- " + next_dag_name + " --run-id='trigger_ "'"
        )
        task_grp >> next_dag_trigg
    team_notify_task >> task1_bq >> task_grp
Can someone help with what is causing this issue?
Increasing the DAG parsing timeout does the trick.
In the Airflow configuration, under the [core] section, raise dagbag_import_timeout from the default of 30 to a higher value such as 160.
If using Composer, the same can be done through the following steps:
a) Go to the Composer service and select the environment whose settings are to be modified.
b) Click AIRFLOW CONFIGURATION OVERRIDES --> EDIT --> (add/edit) dagbag_import_timeout = 160.
c) Click Save.
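Alternatively, the same override can be expressed in airflow.cfg or through Airflow's standard environment-variable naming scheme. A minimal sketch, using the 160-second value from above (adjust to your environment):
# airflow.cfg
[core]
dagbag_import_timeout = 160

# or, equivalently, as an environment variable (AIRFLOW__<SECTION>__<KEY>)
export AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=160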

Airflow, Connecting to MsSql error "Adaptive Server is unavailable or does not exist"

I'm getting this error when trying to use Airflow to get_records().
pymssql.OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (localhost:None)\n')
I used this guide for the setup:
https://tech.marksblogg.com/mssql-sql-server-linux-install-tutorial-and-guide.html
Using the Python REPL, I can connect and return a result:
with pymssql.connect(server="localhost",
                     user="SA",
                     password="password",
                     database="database_name") as conn:
    df = pd.read_sql("SELECT TOP 1 * FROM currency", conn)
    print(df)

with pymssql.connect(server="127.0.0.1",
                     user="SA",
                     password="password",
                     database="database_name") as conn:
    df = pd.read_sql("SELECT TOP 1 * FROM currency", conn)
    print(df)
I update my Airflow connections with either of these setups and then run a test.
airflow test run_test_db test_database 2015-06-01
The error is produced.
Any ideas, please? The whole setup is contained within one Linux (Vagrant) machine, with no virtual environments, so it's using the same version of pymssql to try to connect.
EDIT / UPDATE
What's really annoying is that if I use the same connection string in a DAG there is no error and it runs perfectly fine.
So the connection string that is retrieved from the database must be different.
Is there a way to debug/print the string/connection properties?
Example working DAG
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.mssql_hook import MsSqlHook
from datetime import datetime, timedelta
import pymssql
import pandas as pd
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2019, 2, 13),
    'email': ['example@email.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
dag = DAG('run_test_db', default_args=default_args, schedule_interval="0 01 * * 2-6")
def test_db(**context):
    with pymssql.connect(server="localhost",
                         user="SA",
                         password="Password123",
                         database="database") as conn:
        df = pd.read_sql("SELECT TOP 1 * FROM champ_dw_dim_currency", conn)
        print(df)
test_database = PythonOperator(
    task_id='test_database',
    python_callable=test_db,
    execution_timeout=timedelta(minutes=3),
    dag=dag,
    provide_context=True,
    op_kwargs={
        'extra_detail': 'nothing'
    })
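One way to see exactly what Airflow resolves for the connection is to print the Connection object the hook would use. A minimal sketch, assuming a 1.x-style import path and a hypothetical connection id mssql_default (replace it with the id configured under Admin -> Connections):
from airflow.hooks.base_hook import BaseHook

conn = BaseHook.get_connection("mssql_default")  # hypothetical connection id
print(conn.host, conn.port, conn.schema, conn.login)
# An empty/None port here would be consistent with the "(localhost:None)" part
# of the pymssql error, since the hook then connects without an explicit port.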

Airflow - Script does not execute when triggered

I have an Airflow script that tries to insert data from one table to another, using an Amazon Redshift DB. The script below does not execute when triggered: the task's status remains 'no status' in the Graph view and no other error is shown.
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 1, 23, 12),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def db_login():
    global db_conn
    try:
        db_conn = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Connection Task Complete: Connected to DB')
    return (db_conn)
#######################
def insert_data():
    cur = db_conn.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
    db_conn.commit()
    print('ETL Task Complete')

def job_run():
    db_login()
    insert_data()
##########################################
t1 = PythonOperator(
    task_id='DBConnect',
    python_callable=job_run,
    bash_command='python3 ~/airflow/dags/sample.py',
    dag=dag)
t1
Could anyone assist in finding where the problem could be? Thanks.
Updated Code (05/28)
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 23, 12),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def data_warehouse_login():
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    return (dwh_connection)

def insert_data():
    cur = dwh_connection.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
    dwh_connection.commit()
    print('Task Complete: Insert success')

def job_run():
    data_warehouse_login()
    insert_data()
##########################################
t1 = PythonOperator(
    task_id='DWH_Connect',
    python_callable=job_run(),
    # bash_command='python3 ~/airflow/dags/sample.py',
    dag=dag)
t1
Log message when running the script
[2018-05-28 11:36:45,300] {jobs.py:343} DagFileProcessor26 INFO - Started process (PID=26489) to work on /Users/user/airflow/dags/sample.py
[2018-05-28 11:36:45,306] {jobs.py:534} DagFileProcessor26 ERROR - Cannot use more than 1 thread when using sqlite. Setting max_threads to 1
[2018-05-28 11:36:45,310] {jobs.py:1521} DagFileProcessor26 INFO - Processing file /Users/user/airflow/dags/sample.py for tasks to queue
[2018-05-28 11:36:45,310] {models.py:167} DagFileProcessor26 INFO - Filling up the DagBag from /Users/user/airflow/dags/sample.py
/Users/user/anaconda3/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
Task Complete: Insert success
[2018-05-28 11:36:50,964] {jobs.py:1535} DagFileProcessor26 INFO - DAG(s) dict_keys(['latest_only', 'example_python_operator', 'test_utils', 'example_bash_operator', 'example_short_circuit_operator', 'example_branch_operator', 'tutorial', 'example_passing_params_via_test_command', 'latest_only_with_trigger', 'example_xcom', 'example_http_operator', 'example_skip_dag', 'example_trigger_target_dag', 'example_branch_dop_operator_v3', 'example_subdag_operator', 'example_subdag_operator.section-1', 'example_subdag_operator.section-2', 'example_trigger_controller_dag', 'insert_data2']) retrieved from /Users/user/airflow/dags/sample.py
[2018-05-28 11:36:51,159] {jobs.py:1169} DagFileProcessor26 INFO - Processing example_subdag_operator
[2018-05-28 11:36:51,167] {jobs.py:566} DagFileProcessor26 INFO - Skipping SLA check for <DAG: example_subdag_operator> because no tasks in DAG have SLAs
[2018-05-28 11:36:51,170] {jobs.py:1169} DagFileProcessor26 INFO - Processing sample_dag
[2018-05-28 11:36:51,174] {jobs.py:354} DagFileProcessor26 ERROR - Got an exception! Propagating...
Traceback (most recent call last):
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 346, in helper
pickle_dags)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1581, in process_file
self._process_dags(dagbag, dags, ti_keys_to_schedule)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1171, in _process_dags
dag_run = self.create_dag_run(dag)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 776, in create_dag_run
if next_start <= now:
TypeError: '<=' not supported between instances of 'NoneType' and 'datetime.datetime'
Log from the Graph View
* Log file isn't local.
* Fetching here: http://:8793/log/sample_dag/DWH_Connect/2018-05-28T12:23:57.595234
*** Failed to fetch log file from worker.
* Reading remote logs...
* Unsupported remote log location.
Instead of a single PythonOperator, you need both a BashOperator and a PythonOperator.
You are getting the error because PythonOperator doesn't have a bash_command argument.
t1 = PythonOperator(
    task_id='DBConnect',
    python_callable=db_login,
    dag=dag
)

t2 = BashOperator(
    task_id='Run_Python_File',
    bash_command='python3 ~/airflow/dags/sample.py',
    dag=dag
)

t1 >> t2
To extend the answer kaxil provided: you should be using an IDE to develop for Airflow. PyCharm works fine for me.
That being said, please make sure to look up the available fields in the docs next time. For PythonOperator, see the docs here:
https://airflow.apache.org/code.html#airflow.operators.PythonOperator
Signature looks like:
class airflow.operators.PythonOperator(python_callable, op_args=None, op_kwargs=None, provide_context=False, templates_dict=None, templates_exts=None, *args, **kwargs)
and for BashOperator, see the docs here:
https://airflow.apache.org/code.html#airflow.operators.BashOperator
Signature is:
class airflow.operators.BashOperator(bash_command, xcom_push=False, env=None, output_encoding='utf-8', *args, **kwargs)
The highlights are mine, to show the parameters you have been using.
My recommendation is to dig through the documentation a bit before using an operator.
EDIT
After seeing the code update, there is one thing left:
Make sure that when you define python_callable in a task you do so without parentheses; otherwise the function is called at parse time (which is very unintuitive if you don't know about it). So your code should look like this:
t1 = PythonOperator(
    task_id='DWH_Connect',
    python_callable=job_run,
    dag=dag)
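The difference is easy to see in isolation (a minimal sketch, not tied to the DAG above):
def job_run():
    print("running")

called_at_parse_time = job_run()   # executes immediately while the file is parsed;
                                   # the operator would receive its return value (None)
passed_as_reference = job_run      # just a reference; Airflow calls it when the task runs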

Airflow - Broken DAG - Timeout

I have a DAG that executes a function which connects to a Postgres DB, deletes the contents of a table and then inserts a new data set.
I am trying this locally, and when I try to run it the web server takes a long time to connect and in most cases doesn't succeed. However, as part of the connection process it seems to be executing the queries in the back end. Since I have a delete function, I see the data getting deleted from the table (basically one of the functions gets executed) even though I have not scheduled the script or started it manually. Could someone advise on what I am doing wrong here?
One error that pops out in the UI is
Broken DAG: [/Users/user/airflow/dags/dwh_sample23.py] Timeout
I also see an "i" next to the DAG id in the UI that says "This DAG isn't available in the web server's DAG object."
Given below is the code I am using:
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'admin',
    'depends_on_past': False,
    'start_date': datetime(2018, 5, 21),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
dag = DAG('dwh_sample23', default_args=default_args)
#######################
## Login to DB
def db_login():
    ''' This function connects to the Data Warehouse and returns the cursor to execute queries '''
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Success')
    return (dwh_connection)

def tbl1_del():
    ''' This function clears all rows from tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute("""DELETE FROM tbl1;""")
    dwh_connection.commit()

def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute(""" INSERT INTO tbl1
        select id,name,price from tbl2;""")
    dwh_connection.commit()

db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
##########################################
t1 = BashOperator(
    task_id='DB_Connect',
    python_callable=db_login(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)

t2 = BashOperator(
    task_id='del',
    python_callable=tbl1_del(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)

t3 = BashOperator(
    task_id='populate',
    python_callable=pop_tbl1(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)
t1.set_downstream(t2)
t2.set_downstream(t3)
Could anyone assist? Thanks.
Instead of using BashOperator, you can use PythonOperator and call db_login(), tbl1_del() and pop_tbl1() from PythonOperator tasks:
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'admin',
    'depends_on_past': False,
    'start_date': datetime(2018, 5, 21),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
dag = DAG('dwh_sample23', default_args=default_args)
#######################
## Login to DB
def db_login():
    ''' This function connects to the Data Warehouse and returns the cursor to execute queries '''
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Success')
    return (dwh_connection)

def tbl1_del():
    ''' This function clears all rows from tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute("""DELETE FROM tbl1;""")
    dwh_connection.commit()

def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute(""" INSERT INTO tbl1
        select id,name,price from tbl2;""")
    dwh_connection.commit()

db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
##########################################
t1 = PythonOperator(
    task_id='DB_Connect',
    python_callable=db_login,
    dag=dag)

t2 = PythonOperator(
    task_id='del',
    python_callable=tbl1_del,
    dag=dag)

t3 = PythonOperator(
    task_id='populate',
    python_callable=pop_tbl1,
    dag=dag)

t1.set_downstream(t2)
t2.set_downstream(t3)
This is really old by now, but we hit this error in prod and I found this question, and I think it's nice for it to have an answer.
Some of the code is getting executed during DAG load, i.e. you actually run
db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
##########################################
inside the webserver and scheduler loop, when they load the DAG definition from the file.
I believe you didn't intend that to happen.
Everything should work just fine if you remove these four lines.
Generally, don't place functions you want the executors to execute at file/module level, because when the scheduler's or webserver's interpreter loads the file to get the DAG definition, it will invoke them.
Just try putting this in your DAG file and check the webserver logs to see what happens:
from time import sleep

def do_some_printing():
    print(1111111)
    sleep(60)

do_some_printing()
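In this DAG that means moving the four module-level calls into a callable that a task runs, roughly like this (a minimal sketch that slots into the DAG file above and reuses its imports and functions; the task and callable names are illustrative):
def run_etl():
    # Runs only when the task executes, not when the scheduler/webserver parses the file.
    db_login()
    tbl1_del()
    pop_tbl1()
    dwh_connection.close()

etl_task = PythonOperator(
    task_id='run_etl',
    python_callable=run_etl,  # note: no parentheses
    dag=dag)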

Ooops... AttributeError when clearing failed task state in airflow

I am trying to clear a failed task so that it will run again.
I usually do this with the web GUI from the tree view.
After selecting "Clear", I am directed to an error page.
The traceback on this page is the same error I receive when trying to clear this task using the CLI:
[u@airflow01 ~]# airflow clear -s 2002-07-29T20:25:00 -t coverage_check gom_modis_aqua_coverage_check
[2018-01-16 16:21:04,235] {__init__.py:57} INFO - Using executor CeleryExecutor
[2018-01-16 16:21:05,192] {models.py:167} INFO - Filling up the DagBag from /root/airflow/dags
Traceback (most recent call last):
File "/usr/bin/airflow", line 28, in <module>
args.func(args)
File "/usr/lib/python3.4/site-packages/airflow/bin/cli.py", line 612, in clear
include_upstream=args.upstream,
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 3173, in sub_dag
dag = copy.deepcopy(self)
File "/usr/lib64/python3.4/copy.py", line 166, in deepcopy
y = copier(memo)
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 3159, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/usr/lib64/python3.4/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/usr/lib64/python3.4/copy.py", line 246, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib64/python3.4/copy.py", line 166, in deepcopy
y = copier(memo)
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 2202, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/usr/lib64/python3.4/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/usr/lib64/python3.4/copy.py", line 246, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib64/python3.4/copy.py", line 182, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/usr/lib64/python3.4/copy.py", line 309, in _reconstruct
y.__dict__.update(state)
AttributeError: 'NoneType' object has no attribute 'update'
Looking for ideas on what may have caused this, what I should do to fix this task, and how to avoid this in the future.
I was able to work around the issue by deleting the task record using the "Browse > Task Instances" search, but would still like to explore the issue as I have seen this multiple times.
Although my DAG code is getting complicated, here is an excerpt from where the operator is defined within the dag:
trigger_granule_dag_id = 'trigger_' + process_pass_dag_name
coverage_check = BranchPythonOperator(
    task_id='coverage_check',
    python_callable=_coverage_check,
    provide_context=True,
    retries=10,
    retry_delay=timedelta(hours=3),
    queue=QUEUE.PYCMR,
    op_kwargs={
        'roi': region,
        'success_branch_id': trigger_granule_dag_id
    }
)
The full source code can be browsed at github/USF-IMARS/imars_dags. Here are links to the most relevant parts:
The operator is instantiated in /gom/gom_modis_aqua_coverage_check.py using the modis_aqua_coverage_check factory.
The factory function defines the coverage_check BranchPythonOperator in /builders/modis_aqua_coverage_check.py.
The python_callable is the _coverage_check function in the same file.
Below is a sample DAG that I created to mimic the error that you are facing.
import logging
import os
from datetime import datetime, timedelta
import boto3
from airflow import DAG
from airflow import configuration as conf
from airflow.operators import ShortCircuitOperator, PythonOperator, DummyOperator
def athena_data_validation(**kwargs):
    pass

start_date = datetime.now()
args = {
    'owner': 'airflow',
    'start_date': start_date,
    'depends_on_past': False,
    'wait_for_downstream': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=30)
}
dag_name = 'data_validation_dag'
schedule_interval = None
dag = DAG(
    dag_id=dag_name,
    default_args=args,
    schedule_interval=schedule_interval)
athena_client = boto3.client('athena', region_name='us-west-2')
DAG_SCRIPTS_DIR = conf.get('core', 'DAGS_FOLDER') + "/data_validation/"
start_task = DummyOperator(task_id='Start_Task', dag=dag)
end_task = DummyOperator(task_id='End_Task', dag=dag)
data_validation_task = ShortCircuitOperator(
    task_id='Data_Validation',
    provide_context=True,
    python_callable=athena_data_validation,
    op_kwargs={
        'athena_client': athena_client,
        'sql_file': DAG_SCRIPTS_DIR + 'data_validation.sql',
        's3_output_path': 's3://XXX/YYY/'
    },
    dag=dag)
data_validation_task.set_upstream(start_task)
data_validation_task.set_downstream(end_task)
After one successful run, I tried to clear the Data_Validation task and got the same error.
I removed the athena_client object creation and placed it inside the athena_data_validation function, and then it worked. So when we do a Clear in the Airflow UI, it tries to do a deepcopy and get all the objects from the previous run. I am still trying to understand why it's not able to copy that object type, but I have a workaround that works for me.
During some operations, Airflow deep-copies some objects. Unfortunately, some objects do not allow this. The boto client is a good example of something that does not deep-copy nicely, thread objects are another, but large objects with nested references, such as a reference to a parent task, can also cause issues.
In general, you do not want to instantiate a client in the DAG code itself. That said, I do not think that is your issue here, though I do not have access to the pyCMR code to see whether it could be.
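A minimal sketch of the workaround described above: build the boto3 client inside the callable so the DAG object never holds a reference that deepcopy chokes on (names mirror the sample DAG earlier in this answer; the body is illustrative):
import boto3

def athena_data_validation(**kwargs):
    # Create the client at execution time instead of at module level,
    # so clearing/copying the task does not try to deepcopy a boto3 client.
    athena_client = boto3.client('athena', region_name='us-west-2')
    # ... use athena_client here ...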
