Airflow unexpected argument 'mounts'

I'm trying to set up an Airflow ETL pipeline that extracts images from a .bag file. I want to run the extraction inside Docker, so I'm using the DockerOperator. The Docker image is pulled from a private GitLab registry, and the script I want to run is a Python script inside the container. The .bag file is on my external SSD, so I'm trying to mount it into the container. Is there something wrong with the code, or is it a different kind of problem?
Error:
[2021-09-16 10:39:17,010] {docker.py:246} INFO - Starting docker container from image registry.gitlab.com/url/of/gitlab:a24a3f05
[2021-09-16 10:39:17,010] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
result = task_copy.execute(context=context)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 343, in execute
return self._run_image()
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 265, in _run_image
return self._run_image_with_mounts(self.mounts, add_tmp_variable=False)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 287, in _run_image_with_mounts
privileged=self.privileged,
File "/usr/lib/python3/dist-packages/docker/api/container.py", line 607, in create_host_config
return HostConfig(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'mounts'
[2021-09-16 10:39:17,014] {taskinstance.py:1512} INFO - Marking task as FAILED. dag_id=ETL-test, task_id=docker_extract, execution_date=20210916T083912, start_date=20210916T083915, end_date=20210916T083917
[2021-09-16 10:39:17,062] {local_task_job.py:151} INFO - Task exited with return code 1
[2021-09-16 10:39:17,085] {local_task_job.py:261} INFO - 0 downstream tasks scheduled from follow-on schedule check
This is my code:
from airflow import DAG
from airflow.utils.dates import days_ago
from datetime import datetime, timedelta
from airflow.operators.dummy import DummyOperator
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount
from airflow.operators.bash_operator import BashOperator

ssd_dir = Mount(source='/media/filip/external-ssd', target='/external-ssd', type='bind')

dag = DAG(
    'ETL-test',
    default_args={
        'owner': 'admin',
        'description': 'Extract data from bag, simple test',
        'depend_on_past': False,
        'start_date': datetime(2021, 9, 13),
    },
)

start_dag = DummyOperator(
    task_id='start_dag',
    dag=dag
)

extract = DockerOperator(
    api_version="auto",
    task_id='docker_extract',
    image='registry.gitlab.com/url/of/gitlab:a24a3f05',
    container_name='extract-test',
    mounts=[ssd_dir],
    auto_remove=True,
    force_pull=False,
    mount_tmp_dir=False,
    command='python3 rgb_image_extraction.py --bagfile /external-ssd/2021-09-01-13-17-10.bag --output_dir /external-ssd/airflow --camera_topic /kirby1/vm0/stereo/left/color/image_rect --every_n_img 20 --timestamp_as_name',
    docker_conn_id='gitlab_registry',
    dag=dag
)

test = BashOperator(
    task_id='print_hello',
    bash_command='echo "hello world"',
    dag=dag
)

start_dag >> extract >> test

I think you have an old docker Python library installed. If you want to make sure Airflow 2.1.0 works, you should always use the constraints mechanism as described in https://airflow.apache.org/docs/apache-airflow/stable/installation.html, otherwise you risk ending up with outdated dependencies.
For example, if you use Python 3.6, the right constraints file is https://raw.githubusercontent.com/apache/airflow/constraints-2.1.3/constraints-3.6.txt, and there the docker Python library is pinned to 5.0.0. I bet you have a much older version.
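To confirm which library is being picked up, here is a quick diagnostic sketch (run it with the same Python interpreter that executes your Airflow tasks; note that the failing module in your traceback lives under /usr/lib/python3/dist-packages, i.e. a system-wide package, not the one under ~/.local):
# Diagnostic only: show which docker client library gets imported and its version.
# The 2.1.3 constraints pin the docker library to 5.0.0; older versions do not
# accept the 'mounts' keyword in HostConfig.
import docker

print(docker.__file__)     # path of the module actually being imported
print(docker.__version__)  # library version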

Related

DAG object has no attribute 'schedule' (Airflow 2.5.0)

Using Airflow 2.5.0. I am using a SubDAG, and the imports are working fine:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.subdag import SubDagOperator
from subdags.subdag_downloads import subdag_downloads
from datetime import datetime

with DAG('group_dag', start_date=datetime(2022, 1, 1),
         schedule='@daily', catchup=False) as dag:

    args = {'start_date': dag.start_date,
            'schedule': dag.schedule,
            'catchup': dag.catchup}

    downloads = SubDagOperator(
        task_id="downloads",
        subdag=subdag_downloads(dag.dag_id, "downloads", args)
    )
This is the error I'm getting when trying to run the DAG in my CLI:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/dagbag.py", line 339, in parse
loader.exec_module(new_module)
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/opt/airflow/dags/group_dag.py", line 12, in <module>
'schedule':dag.schedule,
AttributeError: 'DAG' object has no attribute 'schedule'
As you can see from the code above, I am defining schedule='@daily' when I instantiate with DAG() as dag, so why can't I access that argument by using dag.schedule?
NOTE: I can access the other arguments just fine, such as dag.catchup and dag.start_date. I can also access the schedule by using dag.schedule_interval, but this seems silly and doesn't sit well with me that I don't understand why we can't use dag.schedule when schedule= is the argument we defined.
I agree with Elad about migrating to TaskGroup, since SubDagOperator will be removed in Airflow 3.
But for now, you can access your DAG's schedule via dag.schedule_interval:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.subdag import SubDagOperator
from subdags.subdag_downloads import subdag_downloads
from datetime import datetime

with DAG('group_dag', start_date=datetime(2022, 1, 1),
         schedule='@daily', catchup=False) as dag:

    args = {'start_date': dag.start_date,
            'schedule': dag.schedule_interval,
            'catchup': dag.catchup}

    downloads = SubDagOperator(
        task_id="downloads",
        subdag=subdag_downloads(dag.dag_id, "downloads", args)
    )
You can't access it because there is no such attribute.
While you write
DAG(..., schedule='@daily') as dag:
the DAG class does not have a self.schedule attribute. Airflow accepts several kinds of scheduling options and builds an internal scheduling_args parameter from them. You can see it in the codebase here.
I'd also like to point out that you are using SubDAGs, a feature that has been deprecated for two years. Please migrate to TaskGroup; SubDAGs are not going to stay in Airflow 3.
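For reference, a minimal sketch of what the migration to TaskGroup could look like; the tasks inside the group are illustrative only, since I don't know what subdag_downloads creates:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG('group_dag', start_date=datetime(2022, 1, 1),
         schedule='@daily', catchup=False) as dag:

    # Tasks in a TaskGroup inherit the DAG's start_date, schedule and catchup,
    # so nothing needs to be passed around the way SubDAGs require.
    with TaskGroup('downloads') as downloads:
        download_a = BashOperator(task_id='download_a', bash_command='echo a')
        download_b = BashOperator(task_id='download_b', bash_command='echo b')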

Airflow XCOMs communication from BashOperator to PythonOperator

I'm new to Apache Airflow and trying to write my first DAG, which has a task that depends on another task (using ti.xcom_pull).
PS: I run Airflow in WSL Ubuntu 20.04 using VS Code.
I created task 1 (task_id='get_datetime') that runs the date bash command (and it works), then I created another task (task_id='process_datetime') which takes the datetime from the first task and processes it. I set the python_callable and everything is fine.
The issue is that dt = ti.xcom_pull gives a NoneType when I run "airflow tasks test first_ariflow_dag process_datetime 2022-11-1" in the terminal, but when I look at the log in the Airflow UI, I see that it works normally.
Could someone give me a solution please?
from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def process_datetime(ti):
    dt = ti.xcom_pull(task_ids=['get_datetime'])
    if not dt:
        raise Exception('No datetime value')
    dt = str(dt[0]).split()
    return {
        'year': int(dt[-1]),
        'month': dt[1],
        'day': int(dt[2]),
        'time': dt[3],
        'day_of_week': dt[0]
    }


with DAG(
    dag_id='first_ariflow_dag',
    schedule_interval='* * * * *',
    start_date=datetime(year=2022, month=11, day=1),
    catchup=False
) as dag:

    # 1. Get the current datetime
    task_get_datetime = BashOperator(
        task_id='get_datetime',
        bash_command='date'
    )

    # 2. Process the datetime
    task_process_datetime = PythonOperator(
        task_id='process_datetime',
        python_callable=process_datetime
    )
I get this error:
[2022-11-02 00:51:45,420] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
File "/mnt/c/Users/Salim/Desktop/A-Learning/Airflow_Conda/airflow_env/lib/python3.8/site-packages/airflow/operators/python.py", line 175, in execute
return_value = self.execute_callable()
File "/mnt/c/Users/Salim/Desktop/A-Learning/Airflow_Conda/airflow_env/lib/python3.8/site-packages/airflow/operators/python.py", line 193, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/salim/airflow/dags/first_dag.py", line 12, in process_datetime
raise Exception('No datetime value')
Exception: No datetime value
According to the documentation, to push data to XCom you need to set the do_xcom_push parameter (Airflow 2) or xcom_push (Airflow 1):
If BaseOperator.do_xcom_push is True, the last line written to stdout will also be pushed to an XCom when the bash command completes.
The BashOperator should look like this:
task_get_datetime = BashOperator(
    task_id='get_datetime',
    bash_command='date',
    do_xcom_push=True
)
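As a side note, here is a minimal sketch of the consuming side, assuming the two tasks are wired together (the dependency line below is not in the posted code), and using a plain string task_id so that xcom_pull returns the value itself rather than a one-element list:
def process_datetime(ti):
    # With a plain string task_id the pulled value is the string itself.
    dt = ti.xcom_pull(task_ids='get_datetime')
    if not dt:
        raise Exception('No datetime value')
    return str(dt).split()

task_get_datetime >> task_process_datetime  # assumed wiring between the two tasks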

Facing task Timeout Error while dag parsing in Airflow version 2.2.5

I am hitting a task timeout error with Airflow version 2.2.5 / Composer 2.0.15. The same code runs absolutely fine with Airflow version 2.2.3 / Composer version 1.18.0.
Error message:
Broken DAG: [/home/airflow/gcs/dags/test_dag.py] Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/enum.py", line 256, in __new__
if canonical_member._value_ == enum_member._value_:
File "/opt/python3.8/lib/python3.8/site-packages/airflow/utils/timeout.py", line 37, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: DagBag import timeout for /home/airflow/gcs/dags/test_dag.py after 30.0s.
Please take a look at these docs to improve your DAG import time:
* https://airflow.apache.org/docs/apache-airflow/2.2.5/best-practices.html#top-level-python-code
* https://airflow.apache.org/docs/apache-airflow/2.2.5/best-practices.html#reducing-dag-complexity, PID: 1827
As per the documentation and the links in the error message about top-level Python code, we have a framework in place for DAGs and tasks:
main_folder
|___ dags
|___ tasks
|___ libs
a) All the main DAG files are in the dags folder.
b) The actual functions and queries (PythonOperator callables / SQL queries) are placed in *.py files under the tasks folder.
c) Common functionality is placed in Python files in the libs folder.
Providing the basic DAG structure here:
# Import libraries and functions
import datetime
from airflow import models, DAG
from airflow.contrib.operators import bigquery_operator, bigquery_to_gcs, bigquery_table_delete_operator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
##from airflow.executors.sequential_executor import SequentialExecutor
from airflow.utils.task_group import TaskGroup

## Import code from the tasks and libs folders
from libs.compres_suppress.cot_suppress import *
from libs.teams_plugin.teams_plugin import *
from tasks.email_code.trigger_email import *

# Set up Airflow DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.datetime(2020, 12, 15, 0),
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=1),
    'on_failure_callback': trigger_email
}

DAG_ID = 'test_dag'

# Check execution date
if "<some condition>" matches:
    run_date = <date in config file>
else:
    run_date = datetime.datetime.now().strftime("%Y-%m-%d")

run_date_day = datetime.datetime.now().isoweekday()

dag = DAG(
    DAG_ID,
    default_args=default_args, catchup=False,
    max_active_runs=1, schedule_interval=SCHEDULE_INTERVAL
)

next_dag_name = "next_dag1"
if env == "prod":
    if run_date_day == 7:
        next_dag_name = "next_dag2"
    else:
        next_dag_name = "next_dag1"

run_id = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")

# Define Airflow DAG
with dag:
    team_notify_task = MSTeamsWebhookOperator(
        task_id='teams_notifi_start_task',
        http_conn_id='http_conn_id',
        message=f"DAG has started <br />"
                f"<strong> DAG ID:</strong> {DAG_ID}.<br />",
        theme_color="00FF00",
        button_text="My button",
        dag=dag)

    task1_bq = bigquery_operator.BigQueryOperator(
        task_id='task1',
        sql=task1_query(
            table1="table1",
            start_date=start_date),
        use_legacy_sql=False,
        destination_dataset_table="destination_tbl_name",
        write_disposition='WRITE_TRUNCATE'
    )

    ##### Base Skeletons #####
    with TaskGroup("taskgroup_lbl", tooltip="taskgroup_sample") as task_grp:
        tg_process(args=default_args, run_date=run_date)

    if run_mode == "<env_name>" and next_dag != "":
        next_dag_trigg = BashOperator(
            task_id=f'trigger_{next_dag_name}',
            bash_command="gcloud composer environments run " + <env> + "-cust_comp --location us-east1 dags trigger -- " + next_dag_name + " --run-id='trigger_ "'"
        )
        task_grp >> next_dag_trigg

    team_notify_task >> task1_bq >> task_grp
Can someone help with figuring out what is causing the issue?
Increasing the DAG/task timeout does the trick.
Go to the Airflow web UI; on the top bar navigate to
Variables --> Configuration --> [core] --> dagbag_import_timeout = <changed from 30 (default) to 160>.
If using Composer, the same can be done through the following steps:
a) Go to the Composer service and select the environment whose settings are to be modified.
b) Click AIRFLOW CONFIGURATION OVERRIDES --> EDIT --> (add/edit) dagbag_import_timeout = 160.
c) Click Save.
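Independently of raising the timeout, the best-practices links quoted in the error recommend keeping top-level DAG code light; below is a rough, illustrative sketch of that pattern (the DAG id and task are made up):
import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def compute_run_date(**kwargs):
    # Evaluated only when the task runs, not on every DagBag parse.
    return datetime.datetime.now().strftime("%Y-%m-%d")


dag = DAG(
    'fast_parsing_example',
    start_date=datetime.datetime(2020, 12, 15),
    schedule_interval=None,
    catchup=False,
)

resolve_run_date = PythonOperator(
    task_id='resolve_run_date',
    python_callable=compute_run_date,
    dag=dag,
)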

Airflow PostgresOperator: Task failed with exception while using postgres_conn_id="redshift"

~$ airflow version
2.1.2
python 3.8
I am trying to execute some basic queries on my Redshift cluster using a DAG, but the task is failing with an exception (not shown in the logs).
import datetime
import logging

from airflow import DAG
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import PythonOperator

import sql_statements


def load_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook("redshift")
    sql_stmt = sql_statements.COPY_ALL_data_SQL.format(
        credentials.access_key,
        credentials.secret_key,
    )
    redshift_hook.run(sql_stmt)


dag = DAG(
    'exercise1',
    start_date=datetime.datetime.now()
)

create_t1_table = PostgresOperator(
    task_id="create_t1_table",
    dag=dag,
    postgres_conn_id="redshift_default",
    sql=sql_statements.CREATE_t1_TABLE_SQL
)

create_t2_table = PostgresOperator(
    task_id="create_t2_table",
    dag=dag,
    postgres_conn_id="redshift_default",
    sql=sql_statements.CREATE_t2_TABLE_SQL,
)

create_t1_table >> create_t2_table
Following is the exception:
[2021-09-17 05:23:33,902] {base.py:69} INFO - Using connection to: id: redshift_default. Host: rdscluster.123455.us-west-2.redshift.amazonaws.com, Port: 5439, Schema: udac, Login: ***, Password: ***, extra: {}
[2021-09-17 05:23:33,903] {taskinstance.py:1501} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/8085/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1157, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/8085/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1331, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/8085/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1361, in _execute_task
result = task_copy.execute(context=context)
File "/home/8085/.local/lib/python3.8/site-packages/airflow/providers/postgres/operators/postgres.py", line 70, in execute
self.hook.run(self.sql, self.autocommit, parameters=self.parameters)
File "/home/8085/.local/lib/python3.8/site-packages/airflow/hooks/dbapi.py", line 177, in run
with closing(self.get_conn()) as conn:
File "/home/8085/.local/lib/python3.8/site-packages/airflow/providers/postgres/hooks/postgres.py", line 115, in get_conn
self.conn = psycopg2.connect(**conn_args)
File "/home/8085/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 124, in connect
conn = psycopg2.connect("dbname=airflow user=abc password=ubantu host=127.0.0.1 port=5432")
File "/home/8085/.local/lib/python3.8/site-packages/psycopg2abc/__init__.py", line 124, in connect
conn = psycopg2.connect("dbname=airflow user=abc password=abc host=127.0.0.1 port=5432")
File "/home/8085/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 124, in connect
conn = psycopg2.connect("dbname=airflow user=abc password=abc host=127.0.0.1 port=5432")
[Previous line repeated 974 more times]
RecursionError: maximum recursion depth exceeded
[2021-09-17 05:23:33,907] {taskinstance.py:1544} INFO - Marking task as FAILED. dag_id=exercise1, task_id=create_t1_table, execution_date=20210917T092331, start_date=20210917T092333, end_date=20210917T092333
[2021-09-17 05:23:33,953] {local_task_job.py:149} INFO - Task exited with return code 1
I can't tell from the logs what is going wrong here. It appears that, even after providing the Redshift connection ID, the PostgresOperator is using the default Postgres connection configured when installing the Airflow webserver, but I could be wrong.
Any idea how I can resolve this or get more logs out of Airflow? (Note that I already tried different Airflow log levels from the Airflow config; it didn't help either.)
The redshift connection is defined properly, and I can connect to Redshift using another standalone Python utility as well as plsql, so there is no issue with the Redshift cluster.
Thanks.
Resolved:
Somehow the following file was referring to the Airflow Postgres DB created during the Airflow installation rather than connecting to the local Postgres:
File "/home/8085/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 124, in connect
conn = psycopg2.connect("dbname=airflow user=abc password=abc host=127.0.0.1 port=5432")
I had to recreate the Airflow DB from scratch to resolve the issue.
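For anyone debugging something similar, here is a small diagnostic sketch (not part of the original post) that shows which host and database a given connection id actually resolves to before any operator runs:
# Diagnostic only: print the URI the hook builds for the connection id and run a
# trivial query, so a wrong host/port/schema shows up immediately.
from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id="redshift_default")
print(hook.get_uri())              # host/port/schema resolved from the connection
with hook.get_conn() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())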

Ooops... AttributeError when clearing failed task state in airflow

I am trying to clear a failed task so that it will run again.
I usually do this with the web GUI from the tree view
After selecting "Clear" I am directed to an error page:
The traceback on this page is the same error I receive when trying to clear this task using the CLI:
[u@airflow01 ~]# airflow clear -s 2002-07-29T20:25:00 -t coverage_check gom_modis_aqua_coverage_check
[2018-01-16 16:21:04,235] {__init__.py:57} INFO - Using executor CeleryExecutor
[2018-01-16 16:21:05,192] {models.py:167} INFO - Filling up the DagBag from /root/airflow/dags
Traceback (most recent call last):
File "/usr/bin/airflow", line 28, in <module>
args.func(args)
File "/usr/lib/python3.4/site-packages/airflow/bin/cli.py", line 612, in clear
include_upstream=args.upstream,
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 3173, in sub_dag
dag = copy.deepcopy(self)
File "/usr/lib64/python3.4/copy.py", line 166, in deepcopy
y = copier(memo)
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 3159, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/usr/lib64/python3.4/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/usr/lib64/python3.4/copy.py", line 246, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib64/python3.4/copy.py", line 166, in deepcopy
y = copier(memo)
File "/usr/lib/python3.4/site-packages/airflow/models.py", line 2202, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/usr/lib64/python3.4/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/usr/lib64/python3.4/copy.py", line 246, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib64/python3.4/copy.py", line 182, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/usr/lib64/python3.4/copy.py", line 309, in _reconstruct
y.__dict__.update(state)
AttributeError: 'NoneType' object has no attribute 'update'
Looking for ideas on what may have caused this, what I should do to fix this task, and how to avoid this in the future.
I was able to work around the issue by deleting the task record using the "Browse > Task Instances" search, but would still like to explore the issue as I have seen this multiple times.
Although my DAG code is getting complicated, here is an excerpt from where the operator is defined within the dag:
trigger_granule_dag_id = 'trigger_' + process_pass_dag_name

coverage_check = BranchPythonOperator(
    task_id='coverage_check',
    python_callable=_coverage_check,
    provide_context=True,
    retries=10,
    retry_delay=timedelta(hours=3),
    queue=QUEUE.PYCMR,
    op_kwargs={
        'roi': region,
        'success_branch_id': trigger_granule_dag_id
    }
)
The full source code can be browsed at github/USF-IMARS/imars_dags. Here are links to the most relevant parts:
Operator instantiated in /gom/gom_modis_aqua_coverage_check.py using modis_aqua_coverage_check factory
factory function defines coverage_check BranchPythonOperator in /builders/modis_aqua_coverage_check.py
python_callable is _coverage_check function in same file
Below is a sample DAG that I created to mimic the error that you are facing.
import logging
import os
from datetime import datetime, timedelta

import boto3

from airflow import DAG
from airflow import configuration as conf
from airflow.operators import ShortCircuitOperator, PythonOperator, DummyOperator


def athena_data_validation(**kwargs):
    pass


start_date = datetime.now()

args = {
    'owner': 'airflow',
    'start_date': start_date,
    'depends_on_past': False,
    'wait_for_downstream': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=30)
}

dag_name = 'data_validation_dag'
schedule_interval = None

dag = DAG(
    dag_id=dag_name,
    default_args=args,
    schedule_interval=schedule_interval)

athena_client = boto3.client('athena', region_name='us-west-2')

DAG_SCRIPTS_DIR = conf.get('core', 'DAGS_FOLDER') + "/data_validation/"

start_task = DummyOperator(task_id='Start_Task', dag=dag)
end_task = DummyOperator(task_id='End_Task', dag=dag)

data_validation_task = ShortCircuitOperator(
    task_id='Data_Validation',
    provide_context=True,
    python_callable=athena_data_validation,
    op_kwargs={
        'athena_client': athena_client,
        'sql_file': DAG_SCRIPTS_DIR + 'data_validation.sql',
        's3_output_path': 's3://XXX/YYY/'
    },
    dag=dag)

data_validation_task.set_upstream(start_task)
data_validation_task.set_downstream(end_task)
After one successful run, I tried to clear the Data_Validation task and got the same error (see below).
I removed the athena_client object creation and placed it inside the athena_data_validation function, and then it worked. When we do a clear in the Airflow UI, it tries to do a deepcopy and get all the objects from the previous run. I am still trying to understand why it's not able to get a copy of that object type, but I have a workaround that was working for me.
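For illustration, the workaround amounts to something like this (a sketch, keeping only the relevant part of the sample DAG):
import boto3


def athena_data_validation(**kwargs):
    # The client is created at task run time, so the DAG object itself no longer
    # carries a boto3 client that cannot be deep-copied.
    athena_client = boto3.client('athena', region_name='us-west-2')
    # ... run the validation query with athena_client ...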
During some operations, Airflow deep-copies some objects. Unfortunately, some objects do not allow this. The boto client is a good example of something that does not deep-copy nicely, and thread objects are another, but large objects with nested references, like a reference to a parent task, can also cause issues.
In general, you do not want to instantiate a client in the DAG code itself. That said, I do not think that is your issue here, though I do not have access to the pyCMR code to see if it could be.
