Ooops... AttributeError when clearing failed task state in Airflow

I am trying to clear a failed task so that it will run again.
I usually do this with the web GUI from the tree view.
After selecting "Clear" I am directed to an error page.
The traceback on that page is the same error I receive when trying to clear this task using the CLI:
[u@airflow01 ~]# airflow clear -s 2002-07-29T20:25:00 -t coverage_check gom_modis_aqua_coverage_check
[2018-01-16 16:21:04,235] {__init__.py:57} INFO - Using executor CeleryExecutor
[2018-01-16 16:21:05,192] {models.py:167} INFO - Filling up the DagBag from /root/airflow/dags
Traceback (most recent call last):
File "/usr/bin/airflow", line 28, in <module>
args.func(args)
File "/usr/lib/python3.4/site-packages/airflow/bin/cli.py", line 612, in clear
include_upstream=args.upstream,
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 3173, in sub_dag
dag = copy.deepcopy(self)
File "/usr/lib64/python3.4/copy.py", line 166, in deepcopy
y = copier(memo)
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 3159, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/usr/lib64/python3.4/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/usr/lib64/python3.4/copy.py", line 246, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib64/python3.4/copy.py", line 166, in deepcopy
y = copier(memo)
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 2202, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/usr/lib64/python3.4/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/usr/lib64/python3.4/copy.py", line 246, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib64/python3.4/copy.py", line 182, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/usr/lib64/python3.4/copy.py", line 309, in _reconstruct
y.__dict__.update(state)
AttributeError: 'NoneType' object has no attribute 'update'
Looking for ideas on what may have caused this, what I should do to fix this task, and how to avoid this in the future.
I was able to work around the issue by deleting the task record using the "Browse > Task Instances" search, but would still like to explore the issue as I have seen this multiple times.
Although my DAG code is getting complicated, here is an excerpt from where the operator is defined within the dag:
trigger_granule_dag_id = 'trigger_' + process_pass_dag_name
coverage_check = BranchPythonOperator(
    task_id='coverage_check',
    python_callable=_coverage_check,
    provide_context=True,
    retries=10,
    retry_delay=timedelta(hours=3),
    queue=QUEUE.PYCMR,
    op_kwargs={
        'roi': region,
        'success_branch_id': trigger_granule_dag_id
    }
)
The full source code can be browsed at github/USF-IMARS/imars_dags. Here are links to the most relevant parts:
Operator instantiated in /gom/gom_modis_aqua_coverage_check.py using modis_aqua_coverage_check factory
factory function defines coverage_check BranchPythonOperator in /builders/modis_aqua_coverage_check.py
python_callable is _coverage_check function in same file

Below is a sample DAG that I created to mimic the error that you are facing.
import logging
import os
from datetime import datetime, timedelta
import boto3
from airflow import DAG
from airflow import configuration as conf
from airflow.operators import ShortCircuitOperator, PythonOperator, DummyOperator
def athena_data_validation(**kwargs):
    pass
start_date = datetime.now()
args = {
    'owner': 'airflow',
    'start_date': start_date,
    'depends_on_past': False,
    'wait_for_downstream': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=30)
}
dag_name = 'data_validation_dag'
schedule_interval = None
dag = DAG(
    dag_id=dag_name,
    default_args=args,
    schedule_interval=schedule_interval)
# Module-level client: passing this object through op_kwargs is what breaks the deep copy on "Clear".
athena_client = boto3.client('athena', region_name='us-west-2')
DAG_SCRIPTS_DIR = conf.get('core', 'DAGS_FOLDER') + "/data_validation/"
start_task = DummyOperator(task_id='Start_Task', dag=dag)
end_task = DummyOperator(task_id='End_Task', dag=dag)
data_validation_task = ShortCircuitOperator(
    task_id='Data_Validation',
    provide_context=True,
    python_callable=athena_data_validation,
    op_kwargs={
        'athena_client': athena_client,
        'sql_file': DAG_SCRIPTS_DIR + 'data_validation.sql',
        's3_output_path': 's3://XXX/YYY/'
    },
    dag=dag)
data_validation_task.set_upstream(start_task)
data_validation_task.set_downstream(end_task)
After one successful run, I tried to clear the Data_Validation task and got the same error.
I removed the athena_client object creation from module level and placed it inside the athena_data_validation function, and then it worked. So when we clear a task in the Airflow UI, Airflow tries to deepcopy the DAG, including the objects attached to its tasks. I am still trying to understand why it is not able to copy this type of object, but this workaround is working for me.

During some operations, Airflow deep copies some objects. Unfortunately, some objects do not allow this. The boto client is a good example of something that does not deep copy nicely, and thread objects are another. Large objects with nested references, such as a reference to a parent task, can also cause issues.
In general, you do not want to instantiate a client in the DAG code itself. That said, I do not think that this is your issue here, though I do not have access to the pyCMR code to see whether it could be.
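To see why such objects break the deep copy, here is a minimal sketch that needs neither Airflow nor boto3. FakeClient and the helper functions are hypothetical stand-ins invented for illustration; the lock simply plays the role of a resource that cannot be copied. It is not a reproduction of the exact AttributeError from the question, only of the general failure mode.

```python
import copy
import threading


class FakeClient:
    """Hypothetical stand-in for a client that holds a non-copyable resource."""

    def __init__(self):
        self._lock = threading.Lock()


def make_op_kwargs_eager():
    # Anti-pattern: the client is created at parse time and stored in the kwargs.
    return {"client": FakeClient(), "sql_file": "data_validation.sql"}


def make_op_kwargs_lazy():
    # Only plain, copyable values are stored; the client is built at run time.
    return {"sql_file": "data_validation.sql"}


def callable_with_lazy_client(sql_file):
    client = FakeClient()  # created inside the callable, so it is never deep-copied
    return type(client).__name__, sql_file


def can_deepcopy(obj):
    """Return True if copy.deepcopy succeeds, False if it raises TypeError."""
    try:
        copy.deepcopy(obj)
        return True
    except TypeError:
        return False


print(can_deepcopy(make_op_kwargs_eager()))
print(can_deepcopy(make_op_kwargs_lazy()))
```

The same idea applies to op_kwargs in a DAG file: keep only strings, numbers, and similar plain values there, and build clients inside the python_callable.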

Related

DAG object has no attribute 'schedule' (Airflow 2.5.0)

Using Airflow 2.5.0. I am using a subdag, but imports are working fine:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.subdag import SubDagOperator
from subdags.subdag_downloads import subdag_downloads
from datetime import datetime
with DAG('group_dag', start_date=datetime(2022, 1, 1),
         schedule='@daily', catchup=False) as dag:
    args = {'start_date': dag.start_date,
            'schedule': dag.schedule,
            'catchup': dag.catchup}
    downloads = SubDagOperator(
        task_id="downloads",
        subdag=subdag_downloads(dag.dag_id, "downloads", args)
    )
This is the error I'm getting when trying to run the DAG in my CLI:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/dagbag.py", line 339, in parse
loader.exec_module(new_module)
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/opt/airflow/dags/group_dag.py", line 12, in <module>
'schedule':dag.schedule,
AttributeError: 'DAG' object has no attribute 'schedule'
As you can see from the code above, I am defining schedule='@daily' when I instantiate with DAG() as dag, so why can't I access that argument by using dag.schedule?
NOTE: I can access the other arguments just fine, such as dag.catchup and dag.start_date. I can also access the schedule by using dag.schedule_interval, but this seems silly, and it doesn't sit well with me that I don't understand why we can't use dag.schedule when schedule= is the argument we defined.
I agree with Elad about migrating to TaskGroup, since SubDagOperator will be removed in Airflow 3.
But for now, you can access your DAG's schedule through dag.schedule_interval:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.subdag import SubDagOperator
from subdags.subdag_downloads import subdag_downloads
from datetime import datetime
with DAG('group_dag', start_date=datetime(2022, 1, 1),
         schedule='@daily', catchup=False) as dag:
    args = {'start_date': dag.start_date,
            'schedule': dag.schedule_interval,
            'catchup': dag.catchup}
    downloads = SubDagOperator(
        task_id="downloads",
        subdag=subdag_downloads(dag.dag_id, "downloads", args)
    )
You can't access it because there is no such attribute.
While you assign
DAG(..., schedule='@daily') as dag:
the DAG class does not have self.schedule. Airflow accepts several kinds of scheduling options and builds an internal scheduling_args parameter from them. You can see it in the codebase here.
I'd like to point out that you are using SubDAG, which has been deprecated for 2 years. Please migrate to TaskGroup. SubDAGs are not going to stay in Airflow 3.
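The general pattern, where a constructor argument is accepted but stored under a different attribute name, can be sketched in plain Python. MiniDag below is a hypothetical, heavily simplified class written for illustration; it is not Airflow's actual DAG implementation, and its validation logic is an assumption.

```python
class MiniDag:
    """Hypothetical, simplified stand-in; not Airflow's actual DAG class."""

    def __init__(self, dag_id, schedule=None, schedule_interval=None):
        self.dag_id = dag_id
        # Several scheduling options are accepted, but only one may be set.
        scheduling_args = [schedule, schedule_interval]
        provided = [arg for arg in scheduling_args if arg is not None]
        if len(provided) > 1:
            raise ValueError("Pass at most one scheduling argument")
        # The chosen value is stored under a different attribute name,
        # so self.schedule is never created.
        self.schedule_interval = provided[0] if provided else None


dag = MiniDag("group_dag", schedule="@daily")
print(hasattr(dag, "schedule"))   # False
print(dag.schedule_interval)      # @daily
```

In a class built this way, reading dag.schedule raises AttributeError even though schedule= was a valid constructor argument, which matches the behavior described in the question.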

Airflow XCOMs communication from BashOperator to PythonOperator

I'm new to Apache Airflow and am trying to write my first DAG, which has a task that depends on another task (using ti.xcom_pull).
PS: I run Airflow in WSL Ubuntu 20.04 using VS Code.
I created a first task (task_id = "get_datetime") that runs the "date" bash command (and it works).
Then I created another task (task_id='process_datetime') that takes the datetime from the first task and processes it. I set the python_callable and everything is fine.
The issue is that dt = ti.xcom_pull gives a NoneType when I run "airflow tasks test first_ariflow_dag process_datetime 2022-11-1" in the terminal, but when I look at the log in the Airflow UI, I find that it works normally.
Could someone give me a solution, please?
from datetime import datetime
from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
def process_datetime(ti):
    dt = ti.xcom_pull(task_ids=['get_datetime'])
    if not dt:
        raise Exception('No datetime value')
    dt = str(dt[0]).split()
    return {
        'year': int(dt[-1]),
        'month': dt[1],
        'day': int(dt[2]),
        'time': dt[3],
        'day_of_week': dt[0]
    }
with DAG(
    dag_id='first_ariflow_dag',
    schedule_interval='* * * * *',
    start_date=datetime(year=2022, month=11, day=1),
    catchup=False
) as dag:
    # 1. Get the current datetime
    task_get_datetime = BashOperator(
        task_id='get_datetime',
        bash_command='date'
    )
    # 2. Process the datetime
    task_process_datetime = PythonOperator(
        task_id='process_datetime',
        python_callable=process_datetime
    )
I get this error :
[2022-11-02 00:51:45,420] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
File "/mnt/c/Users/Salim/Desktop/A-Learning/Airflow_Conda/airflow_env/lib/python3.8/site-packages/airflow/operators/python.py", line 175, in execute
return_value = self.execute_callable()
File "/mnt/c/Users/Salim/Desktop/A-Learning/Airflow_Conda/airflow_env/lib/python3.8/site-packages/airflow/operators/python.py", line 193, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/salim/airflow/dags/first_dag.py", line 12, in process_datetime
raise Exception('No datetime value')
Exception: No datetime value
According to the documentation, to push data to XCom you need to set the parameter do_xcom_push (Airflow 2) or xcom_push (Airflow 1):
If BaseOperator.do_xcom_push is True, the last line written to stdout will also be pushed to an XCom when the bash command completes
The BashOperator should look like this:
task_get_datetime = BashOperator(
    task_id='get_datetime',
    bash_command='date',
    do_xcom_push=True
)
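The parsing step in process_datetime can be checked on its own, independent of XCom. This is a minimal sketch: the helper name parse_date_output and the sample string are illustrative assumptions, and the string is assumed to follow the default layout of the date command (weekday, month, day, time, timezone, year).

```python
def parse_date_output(raw):
    """Split one line of `date` output into its components."""
    parts = str(raw).split()
    if len(parts) < 6:
        raise ValueError(f"Unexpected date output: {raw!r}")
    return {
        "year": int(parts[-1]),
        "month": parts[1],
        "day": int(parts[2]),
        "time": parts[3],
        "day_of_week": parts[0],
    }


# The original callable indexes the pulled value with [0], so a list is used here.
pulled = ["Wed Nov  2 00:51:45 UTC 2022"]
result = parse_date_output(pulled[0])
print(result)
```

Separating the parsing from the pull makes it easier to tell whether a failure comes from a missing XCom value or from an unexpected date format.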

how to pass default values for run time input variable in airflow for scheduled execution

I came across an issue while running a DAG in Airflow. My code works in two scenarios but fails in one.
Below are my scenarios:
Manual trigger with input - Running Fine
Manual trigger without input - Running Fine
Scheduled Run - Failing
Below is my code:
def decide_the_flow(**kwargs):
    cleanup = kwargs['dag_run'].conf.get('cleanup', 'N')
    print("IP is :", cleanup)
    return cleanup
I am getting the error below:
cleanup=kwargs['dag_run'].conf.get('cleanup','N')
AttributeError: 'NoneType' object has no attribute 'get'
I tried to define default variables like this:
default_dag_args = {
    'start_date': days_ago(0),
    'params': {
        "cleanup": "N"
    },
    'retries': 0
}
but it doesn't work.
I am using BranchPythonOperator to call this function.
Can anyone please guide me here? What am I missing?
As a workaround, I am using the code below:
try:
    cleanup = kwargs['dag_run'].conf.get('cleanup', 'N')
except AttributeError:
    # conf is None on scheduled runs, so fall back to the default
    cleanup = "N"
You can read the parameter from params in the context dict, because Airflow populates that dict with the default values for anything that dag_run.conf does not supply:
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
def decide_the_flow(**kwargs):
    cleanup = kwargs['params']["cleanup"]
    print(f"IP is : {cleanup}")
    return cleanup
with DAG(
    dag_id='airflow_params',
    start_date=datetime(2022, 8, 25),
    schedule_interval="* * * * *",
    params={
        "cleanup": "N",
    },
    catchup=False
) as dag:
    branch_task = BranchPythonOperator(
        task_id='test_param',
        python_callable=decide_the_flow
    )
    task_n = EmptyOperator(task_id="N")
    task_m = EmptyOperator(task_id="M")
    branch_task >> [task_n, task_m]
I just tested it in scheduled and manual (with and without conf) runs, and it works fine.
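The error message in the question implies that dag_run.conf was None on the scheduled run. If you prefer to keep reading from dag_run.conf, a None-safe lookup is one option. This is a minimal sketch in plain Python: read_cleanup_flag is a hypothetical helper, and the SimpleNamespace objects are stand-ins for the dag_run object that Airflow places in the context.

```python
from types import SimpleNamespace


def read_cleanup_flag(dag_run, default="N"):
    """Return the 'cleanup' value from dag_run.conf, or the default."""
    # `or {}` covers both a missing conf attribute and conf being None.
    conf = getattr(dag_run, "conf", None) or {}
    return conf.get("cleanup", default)


# Hypothetical stand-ins for the three scenarios from the question.
scheduled_run = SimpleNamespace(conf=None)
manual_run_with_input = SimpleNamespace(conf={"cleanup": "Y"})
manual_run_without_input = SimpleNamespace(conf={})

print(read_cleanup_flag(scheduled_run))
print(read_cleanup_flag(manual_run_with_input))
print(read_cleanup_flag(manual_run_without_input))
```

Unlike a try/except around the lookup, this only handles the one case that is expected to happen and lets any other error surface.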

Airflow unexpected argument 'mounts'

I'm trying to set up an Airflow ETL pipeline that extracts images from a .bag file. I want to run the extraction inside Docker, so I'm using DockerOperator. The Docker image is pulled from a private GitLab repository. The script I want to run is a Python script inside the Docker container. The .bag file is on my external SSD, so I'm trying to mount it inside the container. Is there something wrong with the code, or is it a different kind of problem?
Error:
[2021-09-16 10:39:17,010] {docker.py:246} INFO - Starting docker container from image registry.gitlab.com/url/of/gitlab:a24a3f05
[2021-09-16 10:39:17,010] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
result = task_copy.execute(context=context)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 343, in execute
return self._run_image()
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 265, in _run_image
return self._run_image_with_mounts(self.mounts, add_tmp_variable=False)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 287, in _run_image_with_mounts
privileged=self.privileged,
File "/usr/lib/python3/dist-packages/docker/api/container.py", line 607, in create_host_config
return HostConfig(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'mounts'
[2021-09-16 10:39:17,014] {taskinstance.py:1512} INFO - Marking task as FAILED. dag_id=ETL-test, task_id=docker_extract, execution_date=20210916T083912, start_date=20210916T083915, end_date=20210916T083917
[2021-09-16 10:39:17,062] {local_task_job.py:151} INFO - Task exited with return code 1
[2021-09-16 10:39:17,085] {local_task_job.py:261} INFO - 0 downstream tasks scheduled from follow-on schedule check
This is my code :
from airflow import DAG
from airflow.utils.dates import days_ago
from datetime import datetime, timedelta
from airflow.operators.dummy import DummyOperator
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount
from airflow.operators.bash_operator import BashOperator
ssd_dir = Mount(source='/media/filip/external-ssd', target='/external-ssd', type='bind')
dag = DAG(
    'ETL-test',
    default_args={
        'owner': 'admin',
        'description': 'Extract data from bag, simple test',
        'depend_on_past': False,
        'start_date': datetime(2021, 9, 13),
    },
)
start_dag = DummyOperator(
    task_id='start_dag',
    dag=dag
)
extract = DockerOperator(
    api_version="auto",
    task_id='docker_extract',
    image='registry.gitlab.com/url/of/gitlab:a24a3f05',
    container_name='extract-test',
    mounts=[ssd_dir],
    auto_remove=True,
    force_pull=False,
    mount_tmp_dir=False,
    command='python3 rgb_image_extraction.py --bagfile /external-ssd/2021-09-01-13-17-10.bag --output_dir /external-ssd/airflow --camera_topic /kirby1/vm0/stereo/left/color/image_rect --every_n_img 20 --timestamp_as_name',
    docker_conn_id='gitlab_registry',
    dag=dag
)
test = BashOperator(
    task_id='print_hello',
    bash_command='echo "hello world"',
    dag=dag
)
start_dag >> extract >> test
I think you have an old docker Python library installed. If you want to make sure Airflow 2.1.0 works, you should always use the constraints mechanism described in https://airflow.apache.org/docs/apache-airflow/stable/installation.html; otherwise you risk having outdated dependencies.
For example, if you use Python 3.6, the right constraints are https://raw.githubusercontent.com/apache/airflow/constraints-2.1.3/constraints-3.6.txt, and there the docker Python library is 5.0.0. I bet you have a much older version.
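One way to check which version of the docker library the Airflow interpreter actually sees is to query the installed package metadata. This is a minimal sketch; installed_version is a hypothetical helper name, and the pkg_resources fallback is an assumption for older interpreters such as Python 3.6, where importlib.metadata is not available.

```python
def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        import pkg_resources  # fallback for older interpreters
        try:
            return pkg_resources.get_distribution(package_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Compare this with the version pinned in the constraints file for your setup.
print("docker library version:", installed_version("docker"))
```

Run it with the same Python interpreter that runs the Airflow worker, since a system-wide package (as the /usr/lib/python3/dist-packages path in the traceback suggests) can differ from the one installed alongside Airflow.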

airflow errors out when trying to execute remote script through SSHHook

With airflow, I am trying to execute a remote script through SSHHook. The script is simply like this
echo "this is a test"
Inside the remote machine, I can run it through "bash test".
I created an airflow script like this:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.contrib.hooks.ssh_hook import SSHHook
# add a new SSH connection using the WEB UI under the admin --> connections tab.
sshHook = SSHHook(ssh_conn_id="test_ssh")
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'tester',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 24),
    'email': ['user123@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
dag = DAG('test', default_args=default_args)
t1 = SSHOperator(
    ssh_conn_id=sshHook,
    task_id='test_20190624',
    command='bash /home/tester/run_test',
    dag=dag)
Then I got an error like this:
[2019-06-24 11:27:17,790] {ssh_operator.py:80} INFO - ssh_hook is not provided or invalid. Trying ssh_conn_id to create SSHHook.
[2019-06-24 11:27:17,792] {__init__.py:1580} ERROR - SSH operator error: 'SSHHook' object has no attribute 'upper'
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/operators/ssh_operator.py", line 82, in execute
timeout=self.timeout)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/hooks/ssh_hook.py", line 90, in __init__
conn = self.get_connection(self.ssh_conn_id)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/hooks/base_hook.py", line 80, in get_connection
conn = random.choice(cls.get_connections(conn_id))
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/hooks/base_hook.py", line 71, in get_connections
conn = cls._get_connection_from_env(conn_id)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/hooks/base_hook.py", line 63, in _get_connection_from_env
environment_uri = os.environ.get(CONN_ENV_PREFIX + conn_id.upper())
AttributeError: 'SSHHook' object has no attribute 'upper'
You should either use the SSH connection ID directly or use an SSHHook. The problem here is that you have mixed the two: the hook object was passed as ssh_conn_id, which expects a string.
1) Using SSHHook:
t1 = SSHOperator(
    ssh_hook=sshHook,
    task_id='test_20190624',
    command='bash /home/tester/run_test',
    dag=dag)
2) Using the SSH connection ID directly:
t1 = SSHOperator(
    ssh_conn_id="test_ssh",
    task_id='test_20190624',
    command='bash /home/tester/run_test',
    dag=dag)
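The last frame of the traceback shows the connection ID being upper-cased to build an environment variable name, which explains the error. The sketch below mimics that single line in plain Python. The prefix value "AIRFLOW_CONN_" is an assumption based on Airflow's environment variable naming, and FakeHook is a hypothetical stand-in for the hook object.

```python
CONN_ENV_PREFIX = "AIRFLOW_CONN_"  # assumed value, for illustration


def connection_env_var(conn_id):
    """Build the environment variable name looked up for a connection ID."""
    return CONN_ENV_PREFIX + conn_id.upper()


class FakeHook:
    """Hypothetical stand-in for a hook object; it is not a string."""


# A string connection ID works as expected.
print(connection_env_var("test_ssh"))

# An object passed where a string is expected fails the same way as in the question.
try:
    connection_env_var(FakeHook())
except AttributeError as exc:
    print(f"AttributeError: {exc}")
```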
