I have a test DAG with one task: a simple ETL that extracts data from an MSSQL database and loads it into a Postgres database. It works by selecting one day of data at a time and inserting it into Postgres, for each of the last 360 days. But the task times out on the select statement after roughly 10 days' worth of iterations.
def get_receiveCars(**kwargs):
    # get current date
    end_date = datetime.now()
    # loop over the last 360 days
    for x in range(360):
        startDate = end_date - timedelta(days=x)
        delete_dataPostgres(startDate.strftime('%Y-%m-%d'), "received sample")
        select_dataMsql(startDate)
And the select statement is:
def select_dataMsql(startDate):
    # select one day of data from MSSQL and hand it off for insertion into Postgres
    endDate = str(startDate.strftime('%Y-%m-%d')) + " 23:59:59"
    ms_hook = MsSqlHook(mssql_conn_id='mssql_db')
    select_sql = """SELECT carColor, carBrand, fuelType, COUNT(DISTINCT RequestID) AS received
                    FROM Requests
                    WHERE ReceivedDateTime >= %s
                      AND ReceivedDateTime < %s
                    GROUP BY carColor, carBrand, fuelType"""
    cond = (startDate, endDate)
    results = ms_hook.get_records(select_sql, parameters=cond)
    insert_data(results, startDate)
And here is my DAG:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from src.get_receiveCars import get_receiveCars
#from src.transform_data import transform_data
#from src.load_table import load_table
import requests
import json
import os
# Define the default dag arguments.
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': XXXXX,
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=1)
}
# Define the dag, the start date and how frequently it runs.
# I chose the dag to run every day by using 1440 minutes.
dag = DAG(
    dag_id='reveive_sample',
    default_args=default_args,
    dagrun_timeout=timedelta(minutes=200),
    schedule_interval='@daily',
    start_date=datetime(2020, 10, 30))
# First task: get the received cars data.
mid_task = PythonOperator(
    task_id='get_receiveCars',
    provide_context=True,
    python_callable=get_receiveCars,
    dag=dag)
# Set task1
mid_task
LOGS
- Start syncing user roles.
[2020-10-30 18:29:40,238] {timeout.py:42} ERROR - Process timed out, PID: 84214
[2020-10-30 18:29:40,238] {dagbag.py:259} ERROR - Failed to import: /root/airflow/dags/receive_sample.py
Traceback (most recent call last):
File "/root/airflow/lib/python3.6/site-packages/airflow/models/dagbag.py", line 256, in process_file
m = imp.load_source(mod_name, filepath)
File "/usr/local/lib/python3.6/imp.py", line 172, in load_source
module = _load(spec)
File "<frozen importlib._bootstrap>", line 684, in _load
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/root/airflow/dags/receive_sample.py", line 5, in <module>
from src.get_receiveCars import get_receiveCars
File "/root/airflow/dags/src/get_receiveCars.py", line 56, in <module>
get_receiveCars()
File "/root/airflow/dags/src/get_receiveCars.py", line 17, in get_receiveCars
delete_data(startDate.strftime('%Y-%m-%d'), "received cars")
File "/root/airflow/dags/src/get_receiveCars.py", line 26, in delete_data
pg_hook.run(delete_sql, parameters=cond)
File "/root/airflow/lib/python3.6/site-packages/airflow/hooks/dbapi_hook.py", line 172, in run
cur.execute(s, parameters)
File "/usr/local/lib/python3.6/encodings/utf_8.py", line 15, in decode
def decode(input, errors='strict'):
File "/root/airflow/lib/python3.6/site-packages/airflow/utils/timeout.py", line 43, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 84214
[2020-10-30 18:29:40,260] {security.py:477} INFO - Start syncing user roles.
[2020-10-30 18:29:40,350] {security.py:477} INFO - Start syncing user roles.
[2020-10-30 18:29:40,494] {security.py:387} INFO - Fetching a set of all permission, view_menu from FAB meta-table
[2020-10-30 18:29:40,550] {security.py:387} INFO - Fetching a set of all permission, view_menu from FAB meta-table
[2020-10-30 18:29:40,639] {security.py:387} INFO - Fetching a set of all per
Check your configuration:
airflow config list|grep -i timeout
dagbag_import_timeout = 30
dag_file_processor_timeout = 50
web_server_master_timeout = 120
web_server_worker_timeout = 120
log_fetch_timeout_sec = 5
smtp_timeout = 30
operation_timeout = 1.0
task_adoption_timeout = 600
You'll want to change the dagbag_import_timeout setting so the DAG processor has time to load your DAG.
To do this, update your airflow.cfg file or set the environment variable:
export AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=300
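Equivalently, in airflow.cfg the setting lives under the [core] section:
[core]
dagbag_import_timeout = 300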
If tasks are stuck in the queued state and getting killed externally, then add the environment variable:
AIRFLOW__CELERY__OPERATION_TIMEOUT: "30"
You haven't specified an execution_timeout in your default_args - I would start with that:
execution_timeout (datetime.timedelta) – max time allowed for the execution of this task instance, if it goes beyond it will raise and fail.
dagrun_timeout has a different meaning:
dagrun_timeout (datetime.timedelta) – specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns, and only once the # of active DagRuns == max_active_runs.
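For example, a minimal sketch of where each setting goes, based on the DAG above (the two-hour value is just a placeholder, not a recommendation):
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'retries': 3,
    'retry_delay': timedelta(minutes=1),
    # per-task-instance limit: the task fails if it runs longer than this
    'execution_timeout': timedelta(hours=2),
}

dag = DAG(
    dag_id='reveive_sample',
    default_args=default_args,
    # per-DagRun limit, only enforced for scheduled runs once the number of
    # active runs reaches max_active_runs
    dagrun_timeout=timedelta(minutes=200),
    schedule_interval='@daily',
    start_date=datetime(2020, 10, 30))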
Related
I am getting a DagRunAlreadyExists exception even after providing a custom run id and execution date.
This occurs when there are multiple requests within a second.
Here is the MWAA CLI call
def get_unique_key():
    from datetime import datetime
    import random
    import shortuuid
    import string
    timestamp = datetime.now().strftime(DT_FMT_HMSf)
    random_str = timestamp + ''.join(random.choice(string.digits + string.ascii_letters) for _ in range(8))
    uuid_str = shortuuid.ShortUUID().random(length=12)
    return '{}{}'.format(uuid_str, random_str)

execution_date = datetime.utcnow().strftime("%Y-%m-%dT%H:%m:%S.%f")
dag_run_id = get_unique_key()
workflow_id = "my_workflow"
conf = json.dumps({"foo": "bar"})
"dags trigger {0} -c '{1}' -r {2} -e {3}".format(workflow_id, conf, dag_run_id, execution_date)
And here is the error log from the MWAA CLI, in case it helps debug the issue:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/airflow/__main__.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 48, in command
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/airflow/utils/cli.py", line 92, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/commands/dag_command.py", line 138, in dag_trigger
dag_id=args.dag_id, run_id=args.run_id, conf=args.conf, execution_date=args.exec_date
File "/usr/local/lib/python3.7/site-packages/airflow/api/client/local_client.py", line 30, in trigger_dag
dag_id=dag_id, run_id=run_id, conf=conf, execution_date=execution_date
File "/usr/local/lib/python3.7/site-packages/airflow/api/common/experimental/trigger_dag.py", line 125, in trigger_dag
replace_microseconds=replace_microseconds,
File "/usr/local/lib/python3.7/site-packages/airflow/api/common/experimental/trigger_dag.py", line 75, in _trigger_dag
f"A Dag Run already exists for dag id {dag_id} at {execution_date} with run id {run_id}"
airflow.exceptions.DagRunAlreadyExists: A Dag Run already exists for dag id my_workflow at 2022-10-18T06:10:28+00:00 with run id CL4Adauihkvz121928332658Gp6bsTWU
The problem is that the execution_date resolution is seconds; Airflow ignores the milliseconds.
You can see in the error that no milliseconds are mentioned in the execution_date (2022-10-18T06:10:28).
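One way to work around this, sketched below: drop the explicit -e/--exec-date (so a stale sub-second date is never reused) and retry into the next whole second when two triggers still collide. Here run_mwaa_cli is a hypothetical helper that posts the command string to the MWAA CLI endpoint and returns (stdout, stderr); get_unique_key is the function from the question.
import time

def trigger_workflow(run_mwaa_cli, workflow_id, conf, max_attempts=3):
    for _ in range(max_attempts):
        dag_run_id = get_unique_key()
        cmd = "dags trigger {0} -c '{1}' -r {2}".format(workflow_id, conf, dag_run_id)
        stdout, stderr = run_mwaa_cli(cmd)
        if "DagRunAlreadyExists" not in stderr:
            return stdout
        # execution_date is truncated to whole seconds, so wait past the
        # second boundary before triggering again
        time.sleep(1.1)
    raise RuntimeError("could not trigger {0} after {1} attempts".format(workflow_id, max_attempts))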
I'm trying to set up an Airflow ETL pipeline that extracts images from a .bag file. I want to do the extraction inside Docker, so I'm using the DockerOperator. The Docker image is pulled from a private GitLab registry, and the script I want to run is a Python script inside the container. The .bag file is on my external SSD, so I'm trying to mount it inside the container. Is there something wrong with the code, or is it a different kind of problem?
Error:
[2021-09-16 10:39:17,010] {docker.py:246} INFO - Starting docker container from image registry.gitlab.com/url/of/gitlab:a24a3f05
[2021-09-16 10:39:17,010] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
result = task_copy.execute(context=context)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 343, in execute
return self._run_image()
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 265, in _run_image
return self._run_image_with_mounts(self.mounts, add_tmp_variable=False)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 287, in _run_image_with_mounts
privileged=self.privileged,
File "/usr/lib/python3/dist-packages/docker/api/container.py", line 607, in create_host_config
return HostConfig(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'mounts'
[2021-09-16 10:39:17,014] {taskinstance.py:1512} INFO - Marking task as FAILED. dag_id=ETL-test, task_id=docker_extract, execution_date=20210916T083912, start_date=20210916T083915, end_date=20210916T083917
[2021-09-16 10:39:17,062] {local_task_job.py:151} INFO - Task exited with return code 1
[2021-09-16 10:39:17,085] {local_task_job.py:261} INFO - 0 downstream tasks scheduled from follow-on schedule check
This is my code:
from airflow import DAG
from airflow.utils.dates import days_ago
from datetime import datetime, timedelta
from airflow.operators.dummy import DummyOperator
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount
from airflow.operators.bash_operator import BashOperator
ssd_dir = Mount(source='/media/filip/external-ssd', target='/external-ssd', type='bind')

dag = DAG(
    'ETL-test',
    default_args={
        'owner': 'admin',
        'description': 'Extract data from bag, simple test',
        'depend_on_past': False,
        'start_date': datetime(2021, 9, 13),
    },
)

start_dag = DummyOperator(
    task_id='start_dag',
    dag=dag
)

extract = DockerOperator(
    api_version="auto",
    task_id='docker_extract',
    image='registry.gitlab.com/url/of/gitlab:a24a3f05',
    container_name='extract-test',
    mounts=[ssd_dir],
    auto_remove=True,
    force_pull=False,
    mount_tmp_dir=False,
    command='python3 rgb_image_extraction.py --bagfile /external-ssd/2021-09-01-13-17-10.bag --output_dir /external-ssd/airflow --camera_topic /kirby1/vm0/stereo/left/color/image_rect --every_n_img 20 --timestamp_as_name',
    docker_conn_id='gitlab_registry',
    dag=dag
)

test = BashOperator(
    task_id='print_hello',
    bash_command='echo "hello world"',
    dag=dag
)

start_dag >> extract >> test
I think you have an old docker python library installed. If you want to make sure Airflow 2.1.0 works, you should always use the constraints mechanism as described in https://airflow.apache.org/docs/apache-airflow/stable/installation.html; otherwise you risk ending up with outdated dependencies.
For example, if you use Python 3.6, the right constraints file is https://raw.githubusercontent.com/apache/airflow/constraints-2.1.3/constraints-3.6.txt, where the docker python library is pinned at 5.0.0. I bet you have a much older version.
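For example, assuming Airflow 2.1.3 on Python 3.6 (matching that constraints file), reinstalling with the constraints applied also pins the docker library to the expected version:
pip install "apache-airflow[docker]==2.1.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.1.3/constraints-3.6.txt"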
With Airflow, I am trying to execute a remote script through SSHHook. The script is simply this:
echo "this is a test"
On the remote machine, I can run it with "bash test".
I created an Airflow script like this:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.contrib.hooks.ssh_hook import SSHHook
# add a new SSH connection using the WEB UI under the admin --> connections tab.
sshHook = SSHHook(ssh_conn_id="test_ssh")
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'tester',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 24),
    'email': ['user123@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('test', default_args=default_args)

t1 = SSHOperator(
    ssh_conn_id=sshHook,
    task_id='test_20190624',
    command='bash /home/tester/run_test',
    dag=dag)
Then I got an error like this:
[2019-06-24 11:27:17,790] {ssh_operator.py:80} INFO - ssh_hook is not provided or invalid. Trying ssh_conn_id to create SSHHook.
[2019-06-24 11:27:17,792] {__init__.py:1580} ERROR - SSH operator error: 'SSHHook' object has no attribute 'upper'
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/operators/ssh_operator.py", line 82, in execute
timeout=self.timeout)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/hooks/ssh_hook.py", line 90, in __init__
conn = self.get_connection(self.ssh_conn_id)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/hooks/base_hook.py", line 80, in get_connection
conn = random.choice(cls.get_connections(conn_id))
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/hooks/base_hook.py", line 71, in get_connections
conn = cls._get_connection_from_env(conn_id)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/hooks/base_hook.py", line 63, in _get_connection_from_env
environment_uri = os.environ.get(CONN_ENV_PREFIX + conn_id.upper())
AttributeError: 'SSHHook' object has no attribute 'upper'
You should either use the SSH connection ID directly or use the SSHHook. The problem here is that you have mixed the two.
1) Using SSHHook:
t1 = SSHOperator(
    ssh_hook=sshHook,
    task_id='test_20190624',
    command='bash /home/tester/run_test',
    dag=dag)
2) Using the SSH connection ID directly:
t1 = SSHOperator(
    ssh_conn_id="test_ssh",
    task_id='test_20190624',
    command='bash /home/tester/run_test',
    dag=dag)
I have an Airflow script that tries to insert data from one table into another on an Amazon Redshift DB. When triggered, the script below does not execute: the task's status remains 'no status' in the Graph view and no other error is shown.
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 1, 23, 12),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")

#######################
## Login to DB
def db_login():
    global db_conn
    try:
        db_conn = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Connection Task Complete: Connected to DB')
    return (db_conn)

#######################
def insert_data():
    cur = db_conn.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
    db_conn.commit()
    print('ETL Task Complete')

def job_run():
    db_login()
    insert_data()

##########################################
t1 = PythonOperator(
    task_id='DBConnect',
    python_callable=job_run,
    bash_command='python3 ~/airflow/dags/sample.py',
    dag=dag)

t1
Could anyone help me find where the problem could be? Thanks.
Updated Code (05/28)
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 23, 12),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")

#######################
## Login to DB
def data_warehouse_login():
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    return (dwh_connection)

def insert_data():
    cur = dwh_connection.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
    dwh_connection.commit()
    print('Task Complete: Insert success')

def job_run():
    data_warehouse_login()
    insert_data()

##########################################
t1 = PythonOperator(
    task_id='DWH_Connect',
    python_callable=job_run(),
    # bash_command='python3 ~/airflow/dags/sample.py',
    dag=dag)

t1
Log message when running the script
[2018-05-28 11:36:45,300] {jobs.py:343} DagFileProcessor26 INFO - Started process (PID=26489) to work on /Users/user/airflow/dags/sample.py
[2018-05-28 11:36:45,306] {jobs.py:534} DagFileProcessor26 ERROR - Cannot use more than 1 thread when using sqlite. Setting max_threads to 1
[2018-05-28 11:36:45,310] {jobs.py:1521} DagFileProcessor26 INFO - Processing file /Users/user/airflow/dags/sample.py for tasks to queue
[2018-05-28 11:36:45,310] {models.py:167} DagFileProcessor26 INFO - Filling up the DagBag from /Users/user/airflow/dags/sample.py
/Users/user/anaconda3/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
Task Complete: Insert success
[2018-05-28 11:36:50,964] {jobs.py:1535} DagFileProcessor26 INFO - DAG(s) dict_keys(['latest_only', 'example_python_operator', 'test_utils', 'example_bash_operator', 'example_short_circuit_operator', 'example_branch_operator', 'tutorial', 'example_passing_params_via_test_command', 'latest_only_with_trigger', 'example_xcom', 'example_http_operator', 'example_skip_dag', 'example_trigger_target_dag', 'example_branch_dop_operator_v3', 'example_subdag_operator', 'example_subdag_operator.section-1', 'example_subdag_operator.section-2', 'example_trigger_controller_dag', 'insert_data2']) retrieved from /Users/user/airflow/dags/sample.py
[2018-05-28 11:36:51,159] {jobs.py:1169} DagFileProcessor26 INFO - Processing example_subdag_operator
[2018-05-28 11:36:51,167] {jobs.py:566} DagFileProcessor26 INFO - Skipping SLA check for <DAG: example_subdag_operator> because no tasks in DAG have SLAs
[2018-05-28 11:36:51,170] {jobs.py:1169} DagFileProcessor26 INFO - Processing sample_dag
[2018-05-28 11:36:51,174] {jobs.py:354} DagFileProcessor26 ERROR - Got an exception! Propagating...
Traceback (most recent call last):
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 346, in helper
pickle_dags)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1581, in process_file
self._process_dags(dagbag, dags, ti_keys_to_schedule)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1171, in _process_dags
dag_run = self.create_dag_run(dag)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 776, in create_dag_run
if next_start <= now:
TypeError: '<=' not supported between instances of 'NoneType' and 'datetime.datetime'
Log from the Graph View
* Log file isn't local.
* Fetching here: http://:8793/log/sample_dag/DWH_Connect/2018-05-28T12:23:57.595234
*** Failed to fetch log file from worker.
* Reading remote logs...
* Unsupported remote log location.
Instead of a single PythonOperator, you need both a PythonOperator and a BashOperator.
You are getting the error because PythonOperator doesn't have a bash_command argument:
t1 = PythonOperator(
    task_id='DBConnect',
    python_callable=db_login,
    dag=dag
)

t2 = BashOperator(
    task_id='Run Python File',
    bash_command='python3 ~/airflow/dags/sample.py',
    dag=dag
)

t1 >> t2
To extend the answer kaxil provided: you should be using an IDE to develop for Airflow. PyCharm works fine for me.
That being said, please make sure to look up the available fields in the docs next time. For PythonOperator, see the docs here:
https://airflow.apache.org/code.html#airflow.operators.PythonOperator
Signature looks like:
class airflow.operators.PythonOperator(python_callable, op_args=None, op_kwargs=None, provide_context=False, templates_dict=None, templates_exts=None, *args, **kwargs)
and for BashOperator, see the docs here:
https://airflow.apache.org/code.html#airflow.operators.BashOperator
Signature is:
class airflow.operators.BashOperator(bash_command, xcom_push=False, env=None, output_encoding='utf-8', *args, **kwargs)
The highlights are mine, showing the parameters you have been using.
My recommendation is to dig through the documentation a bit before using an operator.
EDIT
After seeing the code update there is one thing left:
Make sure that when you define python_callable in a task you pass the function without parentheses; otherwise the function is called as soon as the DAG file is parsed (which is very unintuitive if you don't know about it). So your code should look like this:
t1 = PythonOperator(
    task_id='DWH_Connect',
    python_callable=job_run,
    dag=dag)
I am trying to clear a failed task so that it will run again.
I usually do this with the web GUI from the tree view.
After selecting "Clear" I am directed to an error page:
The traceback on this page is the same error I receive when trying to clear this task using the CLI:
[u@airflow01 ~]# airflow clear -s 2002-07-29T20:25:00 -t coverage_check gom_modis_aqua_coverage_check
[2018-01-16 16:21:04,235] {__init__.py:57} INFO - Using executor CeleryExecutor
[2018-01-16 16:21:05,192] {models.py:167} INFO - Filling up the DagBag from /root/airflow/dags
Traceback (most recent call last):
File "/usr/bin/airflow", line 28, in <module>
args.func(args)
File "/usr/lib/python3.4/site-packages/airflow/bin/cli.py", line 612, in clear
include_upstream=args.upstream,
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 3173, in sub_dag
dag = copy.deepcopy(self)
File "/usr/lib64/python3.4/copy.py", line 166, in deepcopy
y = copier(memo)
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 3159, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/usr/lib64/python3.4/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/usr/lib64/python3.4/copy.py", line 246, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib64/python3.4/copy.py", line 166, in deepcopy
y = copier(memo)
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 2202, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/usr/lib64/python3.4/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/usr/lib64/python3.4/copy.py", line 246, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib64/python3.4/copy.py", line 182, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/usr/lib64/python3.4/copy.py", line 309, in _reconstruct
y.__dict__.update(state)
AttributeError: 'NoneType' object has no attribute 'update'
Looking for ideas on what may have caused this, what I should do to fix this task, and how to avoid this in the future.
I was able to work around the issue by deleting the task record using the "Browse > Task Instances" search, but would still like to explore the issue as I have seen this multiple times.
Although my DAG code is getting complicated, here is an excerpt from where the operator is defined within the dag:
trigger_granule_dag_id = 'trigger_' + process_pass_dag_name
coverage_check = BranchPythonOperator(
    task_id='coverage_check',
    python_callable=_coverage_check,
    provide_context=True,
    retries=10,
    retry_delay=timedelta(hours=3),
    queue=QUEUE.PYCMR,
    op_kwargs={
        'roi': region,
        'success_branch_id': trigger_granule_dag_id
    }
)
The full source code can be browsed at github/USF-IMARS/imars_dags. Here are links to the most relevant parts:
Operator instantiated in /gom/gom_modis_aqua_coverage_check.py using modis_aqua_coverage_check factory
factory function defines coverage_check BranchPythonOperator in /builders/modis_aqua_coverage_check.py
python_callable is _coverage_check function in same file
Below is a sample DAG that I created to mimic the error that you are facing.
import logging
import os
from datetime import datetime, timedelta
import boto3
from airflow import DAG
from airflow import configuration as conf
from airflow.operators import ShortCircuitOperator, PythonOperator, DummyOperator
def athena_data_validation(**kwargs):
    pass

start_date = datetime.now()

args = {
    'owner': 'airflow',
    'start_date': start_date,
    'depends_on_past': False,
    'wait_for_downstream': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=30)
}

dag_name = 'data_validation_dag'
schedule_interval = None

dag = DAG(
    dag_id=dag_name,
    default_args=args,
    schedule_interval=schedule_interval)

athena_client = boto3.client('athena', region_name='us-west-2')
DAG_SCRIPTS_DIR = conf.get('core', 'DAGS_FOLDER') + "/data_validation/"

start_task = DummyOperator(task_id='Start_Task', dag=dag)
end_task = DummyOperator(task_id='End_Task', dag=dag)

data_validation_task = ShortCircuitOperator(
    task_id='Data_Validation',
    provide_context=True,
    python_callable=athena_data_validation,
    op_kwargs={
        'athena_client': athena_client,
        'sql_file': DAG_SCRIPTS_DIR + 'data_validation.sql',
        's3_output_path': 's3://XXX/YYY/'
    },
    dag=dag)

data_validation_task.set_upstream(start_task)
data_validation_task.set_downstream(end_task)
After one successful run, I tried to clear the Data_Validation task and got the same error.
I removed the athena_client object creation from the module level and placed it inside the athena_data_validation function, and then it worked. So when we do a clear in the Airflow UI, it tries to do a deepcopy and fetch all the objects from the previous run. I am still trying to understand why it is not able to copy that object type, but I have a workaround that works for me.
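In other words, a sketch of that workaround applied to the sample DAG above (the actual Athena query is omitted):
def athena_data_validation(**kwargs):
    # Create the client at run time, inside the task, so the DAG object that
    # Airflow deep-copies on "Clear" never holds a boto3 client.
    import boto3
    athena_client = boto3.client('athena', region_name='us-west-2')
    # ... run the validation query with athena_client here ...

data_validation_task = ShortCircuitOperator(
    task_id='Data_Validation',
    provide_context=True,
    python_callable=athena_data_validation,
    op_kwargs={
        'sql_file': DAG_SCRIPTS_DIR + 'data_validation.sql',
        's3_output_path': 's3://XXX/YYY/'
    },
    dag=dag)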
During some operations, Airflow deep copies some objects. Unfortunately, some objects do not allow this. The boto client is a good example of something that does not deep copy nicely; thread objects are another, but large objects with nested references, such as a reference to a parent task, can also cause issues.
In general, you do not want to instantiate a client in the DAG code itself. That said, I do not think that is your issue here, though I do not have access to the pyCMR code to check whether it could be.
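As a tiny illustration of why (behaviour varies by boto3/botocore version, so treat this as a sketch): deep-copying a bare client usually raises because of the locks and sockets it holds internally.
import copy
import boto3

client = boto3.client('athena', region_name='us-west-2')
try:
    copy.deepcopy(client)
except Exception as exc:  # e.g. a TypeError about an un-copyable lock object
    print("deepcopy failed: {0!r}".format(exc))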