Airflow BashOperator exits with return code 0, but Airflow marks the task as failed

I am working with Airflow. I have several BashOperators that call Python code. Normally everything works fine, but since yesterday I have been facing a situation I cannot understand. In the task's logs everything looks OK, as seen below:
*** Reading local file: /opt/airflow/logs/dag_id=derin_emto_preprocess/run_id=manual__2022-10-01T13:54:50.246801+00:00/task_id=emto_preprocess-month0day0/attempt=1.log
[2022-10-01, 13:55:21 UTC] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: derin_emto_preprocess.emto_preprocess-month0day0 manual__2022-10-01T13:54:50.246801+00:00 [queued]>
[2022-10-01, 13:55:21 UTC] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: derin_emto_preprocess.emto_preprocess-month0day0 manual__2022-10-01T13:54:50.246801+00:00 [queued]>
[2022-10-01, 13:55:21 UTC] {taskinstance.py:1356} INFO -
--------------------------------------------------------------------------------
[2022-10-01, 13:55:21 UTC] {taskinstance.py:1357} INFO - Starting attempt 1 of 1
[2022-10-01, 13:55:21 UTC] {taskinstance.py:1358} INFO -
--------------------------------------------------------------------------------
[2022-10-01, 13:55:21 UTC] {taskinstance.py:1377} INFO - Executing <Task(BashOperator): emto_preprocess-month0day0> on 2022-10-01 13:54:50.246801+00:00
[2022-10-01, 13:55:21 UTC] {standard_task_runner.py:52} INFO - Started process 624 to run task
[2022-10-01, 13:55:21 UTC] {standard_task_runner.py:79} INFO - Running: ['***', 'tasks', 'run', 'derin_emto_preprocess', 'emto_preprocess-month0day0', 'manual__2022-10-01T13:54:50.246801+00:00', '--job-id', '8958', '--raw', '--subdir', 'DAGS_FOLDER/derin_emto_preprocess.py', '--cfg-path', '/tmp/tmpjn_8tmiv', '--error-file', '/tmp/tmp_jr_2w3j']
[2022-10-01, 13:55:21 UTC] {standard_task_runner.py:80} INFO - Job 8958: Subtask emto_preprocess-month0day0
[2022-10-01, 13:55:21 UTC] {task_command.py:369} INFO - Running <TaskInstance: derin_emto_preprocess.emto_preprocess-month0day0 manual__2022-10-01T13:54:50.246801+00:00 [running]> on host 5b44f8453a08
[2022-10-01, 13:55:21 UTC] {taskinstance.py:1571} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=derin_emto_preprocess
AIRFLOW_CTX_TASK_ID=emto_preprocess-month0day0
AIRFLOW_CTX_EXECUTION_DATE=2022-10-01T13:54:50.246801+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-10-01T13:54:50.246801+00:00
[2022-10-01, 13:55:21 UTC] {subprocess.py:62} INFO - Tmp dir root location:
/tmp
[2022-10-01, 13:55:21 UTC] {subprocess.py:74} INFO - Running command: ['bash', '-c', 'python /opt/***/dags/scripts/derin/pipeline/pipeline.py --valid_from=20200101 --valid_until=20200102 --purpose=emto_preprocess --module=emto_preprocess --***=True']
[2022-10-01, 13:55:21 UTC] {subprocess.py:85} INFO - Output:
[2022-10-01, 13:55:24 UTC] {subprocess.py:92} INFO - 2022-10-01 13:55:22 : Hello, world!
[2022-10-01, 13:55:24 UTC] {subprocess.py:92} INFO - 2022-10-01 13:55:22 : [20200101, 20200102)
[2022-10-01, 13:55:24 UTC] {subprocess.py:92} INFO - 2022-10-01 13:55:22 : Running emto_preprocess purpose
[2022-10-01, 13:55:24 UTC] {subprocess.py:92} INFO - Current directory : /opt/***/dags
[2022-10-01, 13:55:24 UTC] {subprocess.py:92} INFO - 2022-10-01 13:55:22 : Airflow parameter passed: changing configuration..
[2022-10-01, 13:55:24 UTC] {subprocess.py:92} INFO - 2022-10-01 13:55:24 : Parallel threads: 15
[2022-10-01, 13:55:24 UTC] {subprocess.py:92} INFO - 2022-10-01 13:55:24 : External money transfer out: preprocess is starting..
[2022-10-01, 13:55:24 UTC] {subprocess.py:92} INFO -
Thread None for emto_preprocess: 0%| | 0/1 [00:00<?, ?it/s]
Thread None for emto_preprocess: 100%|██████████| 1/1 [00:00<00:00, 12633.45it/s]
[2022-10-01, 13:55:24 UTC] {subprocess.py:92} INFO - 2022-10-01 13:55:24 : DEBUG: Checking existing files
[2022-10-01, 13:55:24 UTC] {subprocess.py:92} INFO - 2022-10-01 13:55:24 : This module is already processed
[2022-10-01, 13:55:24 UTC] {subprocess.py:92} INFO - 2022-10-01 13:55:24 : Good bye!
[2022-10-01, 13:55:24 UTC] {subprocess.py:96} INFO - Command exited with return code 0
[2022-10-01, 13:55:24 UTC] {taskinstance.py:1400} INFO - Marking task as SUCCESS. dag_id=derin_emto_preprocess, task_id=emto_preprocess-month0day0, execution_date=20221001T135450, start_date=20221001T135521, end_date=20221001T135524
[2022-10-01, 13:55:24 UTC] {local_task_job.py:156} INFO - Task exited with return code 0
[2022-10-01, 13:55:25 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
However, Airflow marked this task as failed. How can I fix this?

I understood and solved this weird problem. When Airflow parses and loads the DAGs, it reads all files in the /airflow/dags/ folder. I have a data storage folder under it, about 250 GB consisting of many feather files. I guess scanning those files takes too much time and creates this kind of situation. The solution is to create an .airflowignore file and add the directories that don't contain DAG files to it.
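For example, assuming the feather files live in a folder such as data_storage/ under the dags folder (the folder name here is a placeholder, not the actual path), the .airflowignore placed in the dags folder needs just the directory pattern; by default each line is treated as a regular expression matched against paths relative to the dags folder:

data_storage/

With this in place, the DAG processor skips that directory entirely when scanning for DAG files.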

Related

Airflow: Indefinitely running HTTP Task with no response

Please help me understand why this HTTP task runs for a long time with no progress.
I'm running the official HTTP example, but it looks like I'm missing something here:
https://github.com/apache/airflow/blob/providers-http/4.1.1/tests/system/providers/http/example_http.py
AIRFLOW_CTX_DAG_EMAIL=airflow@example.com
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_http_operator
AIRFLOW_CTX_TASK_ID=http_sensor_check
AIRFLOW_CTX_EXECUTION_DATE=2023-02-17T20:53:45.614721+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=manual__2023-02-17T20:53:45.614721+00:00
[2023-02-17, 20:53:48 UTC] {__init__.py:117} DEBUG - Preparing lineage inlets and outlets
[2023-02-17, 20:53:48 UTC] {__init__.py:155} DEBUG - inlets: [], outlets: []
[2023-02-17, 20:53:48 UTC] {http.py:122} INFO - Poking:
[2023-02-17, 20:53:48 UTC] {base.py:73} INFO - Using connection ID 'http_default' for task execution.
[2023-02-17, 20:53:48 UTC] {http.py:150} DEBUG - Sending 'GET' to url: https://jsonplaceholder.typicode.com/
[2023-02-17, 20:53:52 UTC] {taskinstance.py:769} DEBUG - Refreshing TaskInstance <TaskInstance: example_http_operator.http_sensor_check manual__2023-02-17T20:53:45.614721+00:00 [running]> from DB
[2023-02-17, 20:53:52 UTC] {base_job.py:240} DEBUG - [heartbeat]
[2023-02-17, 20:53:58 UTC] {taskinstance.py:769} DEBUG - Refreshing TaskInstance <TaskInstance: example_http_operator.http_sensor_check manual__2023-02-17T20:53:45.614721+00:00 [running]> from DB
[2023-02-17, 20:53:58 UTC] {base_job.py:240} DEBUG - [heartbeat]
[2023-02-17, 20:54:03 UTC] {taskinstance.py:769} DEBUG - Refreshing TaskInstance <TaskInstance: example_http_operator.http_sensor_check manual__2023-02-17T20:53:45.614721+00:00 [running]> from DB
[2023-02-17, 20:54:03 UTC] {base_job.py:240} DEBUG - [heartbeat]
[2023-02-17, 20:54:08 UTC] {taskinstance.py:769} DEBUG - Refreshing TaskInstance <TaskInstance: example_http_operator.http_sensor_check manual__2023-02-17T20:53:45.614721+00:00 [running]> from DB
[2023-02-17, 20:54:08 UTC] {base_job.py:240} DEBUG - [heartbeat]
[2023-02-17, 20:54:13 UTC] {taskinstance.py:769} DEBUG - Refreshing TaskInstance <TaskInstance: example_http_operator.http_sensor_check manual__2023-02-17T20:53:45.614721+00:00 [running]> from DB
Surprisingly, I'm able to test this code from the CLI without any issue, but I have trouble running it from the UI:
AIRFLOW_CTX_DAG_EMAIL=airflow@example.com
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_http_operator
AIRFLOW_CTX_TASK_ID=http_sensor_check
AIRFLOW_CTX_EXECUTION_DATE=2023-02-17T21:05:22.781965+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=__airflow_temporary_run_2023-02-17T21:05:22.781968+00:00__
[2023-02-17 16:05:23,328] {__init__.py:117} DEBUG - Preparing lineage inlets and outlets
[2023-02-17 16:05:23,328] {__init__.py:155} DEBUG - inlets: [], outlets: []
[2023-02-17 16:05:23,329] {http.py:122} INFO - Poking:
[2023-02-17 16:05:23,332] {base.py:73} INFO - Using connection ID 'http_default' for task execution.
[2023-02-17 16:05:23,332] {http.py:150} DEBUG - Sending 'GET' to url: https://jsonplaceholder.typicode.com/
[2023-02-17 16:05:23,335] {connectionpool.py:1003} DEBUG - Starting new HTTPS connection (1): jsonplaceholder.typicode.com:443
[2023-02-17 16:05:23,667] {connectionpool.py:456} DEBUG - https://jsonplaceholder.typicode.com:443 "GET / HTTP/1.1" 200 None
[2023-02-17 16:05:23,669] {base.py:228} INFO - Success criteria met. Exiting.
[2023-02-17 16:05:23,669] {__init__.py:75} DEBUG - Lineage called with inlets: [], outlets: []
[2023-02-17 16:05:23,670] {taskinstance.py:1329} DEBUG - Clearing next_method and next_kwargs.
[2023-02-17 16:05:23,670] {taskinstance.py:1318} INFO - Marking task as SUCCESS. dag_id=example_http_operator, task_id=http_sensor_check, execution_date=20230217T210522, start_date=, end_date=20230217T210523
[2023-02-17 16:05:23,670] {taskinstance.py:2241} DEBUG - Task Duration set to None
[2023-02-17 16:05:23,696] {cli_action_loggers.py:83} DEBUG - Calling callbacks: []
[2023-02-17 16:05:23,696] {settings.py:407} DEBUG - Disposing DB connection pool (PID 65429)
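For reference, the sensor task from the linked example is defined roughly like this (a sketch reconstructed from the provider example; the response_check and poke_interval values are assumptions and may differ from the actual file):

from airflow.providers.http.sensors.http import HttpSensor

http_sensor_check = HttpSensor(
    task_id="http_sensor_check",
    http_conn_id="http_default",
    endpoint="",
    request_params={},
    # the sensor re-pokes until this callable returns True (or the sensor times out)
    response_check=lambda response: "httpbin" in response.text,
    poke_interval=5,
)

If response_check never returns True for the host behind http_default (here jsonplaceholder.typicode.com), the sensor just keeps poking, which would match the heartbeat-only log above.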

Airflow 2: GoogleSheetsToGCSOperator gives Negsignal.SIGKILL

We're running Airflow in Google Composer, and we're running into difficulties with the GoogleSheetsToGCSOperator. We're using Composer 2, and therefore I understand that we have to make sure to use a connection with the correct scopes. So that's fine: I've set up a connection with those scopes, and we no longer get permission errors. However, the DAG still doesn't work; it now fails in a couple of different ways.
Most of the time, any dag that tries to upload a google sheet to GCS fails with error Negsignal.SIGKILL. For example:
--------------------------------------------------------------------------------
[2022-10-03, 15:50:55 UTC] {taskinstance.py:1251} INFO - Starting attempt 1 of 1
[2022-10-03, 15:50:55 UTC] {taskinstance.py:1252} INFO -
--------------------------------------------------------------------------------
[2022-10-03, 15:50:55 UTC] {taskinstance.py:1271} INFO - Executing <Task(GoogleSheetsToGCSOperator): upload_sheet_to_gcs_airflow_permission_test_sheet> on 2022-10-03 15:50:38.412899+00:00
[2022-10-03, 15:50:55 UTC] {standard_task_runner.py:52} INFO - Started process 529848 to run task
[2022-10-03, 15:50:55 UTC] {standard_task_runner.py:79} INFO - Running: ['airflow', 'tasks', 'run', 'test_brunel_core_2', 'upload_sheet_to_gcs_airflow_permission_test_sheet', 'manual__2022-10-03T15:50:38.412899+00:00', '--job-id', '7342', '--raw', '--subdir', 'DAGS_FOLDER/DAGs/z_airflow_testing_dags/test_brunel_2_functions.py', '--cfg-path', '/tmp/tmpyuhkixqc', '--error-file', '/tmp/tmp7p2delaz']
[2022-10-03, 15:50:55 UTC] {standard_task_runner.py:80} INFO - Job 7342: Subtask upload_sheet_to_gcs_airflow_permission_test_sheet
/opt/python3.8/lib/python3.8/site-packages/airflow/utils/log/file_task_handler.py:110: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/airflow/gcs/logs/test_brunel_core_2/upload_sheet_to_gcs_airflow_permission_test_sheet/2022-10-03T15:50:38.412899+00:00/1.log' mode='a' encoding='utf-8'>
self.handler = NonCachingFileHandler(local_loc, encoding='utf-8')
[2022-10-03, 15:50:56 UTC] {task_command.py:298} INFO - Running <TaskInstance: test_brunel_core_2.upload_sheet_to_gcs_airflow_permission_test_sheet manual__2022-10-03T15:50:38.412899+00:00 [running]> on host airflow-worker-j28mn
[2022-10-03, 15:50:56 UTC] {taskinstance.py:1448} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=process_dev_joe_m
AIRFLOW_CTX_DAG_ID=test_brunel_core_2
AIRFLOW_CTX_TASK_ID=upload_sheet_to_gcs_airflow_permission_test_sheet
AIRFLOW_CTX_EXECUTION_DATE=2022-10-03T15:50:38.412899+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-10-03T15:50:38.412899+00:00
[2022-10-03, 15:51:02 UTC] {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL
[2022-10-03, 15:51:02 UTC] {taskinstance.py:1279} INFO - Marking task as FAILED. dag_id=test_brunel_core_2, task_id=upload_sheet_to_gcs_airflow_permission_test_sheet, execution_date=20221003T155038, start_date=20221003T155055, end_date=20221003T155102
The rest of the time, some random task in the DAG fails (not necessarily the step with the GoogleSheetsToGCSOperator). Sometimes a step fails with absolutely no log being generated at all, or sometimes a log is generated but it contains no errors. Instead, the only clue is a warning:
/opt/python3.8/lib/python3.8/site-packages/airflow/utils/log/file_task_handler.py:110: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/airflow/gcs/logs/test_flakiness/create_table_JM_test_table.create/2022-10-04T09:11:58.425115+00:00/1.log' mode='a' encoding='utf-8'>
self.handler = NonCachingFileHandler(local_loc, encoding='utf-8')
The weird thing about that warning is that it's warning about the log file itself. As in, that message is written into log file gs://europe-west1-process-dev-ai-fd1dc540-bucket/logs/test_flakiness/create_table_JM_test_table.create/2022-10-04T09:11:58.425115+00:00/1.log. So of course the file is open, you're writing to it, so why are you warning about it being open?
Some other facts that may or may not be relevant:
- composer-2.0.25 airflow-2.2.5
- When monitoring the environment, all resources (CPU, memory, etc.) seem to be fine; nothing is hitting its limits.
- Our environment is configured to use between 1 and 4 workers. Only ever one worker is used, so I don't think it can be a problem with multiple workers all trying to write to the same file at once.
- This is all happening in our test environment. The same DAG will work absolutely fine in our prod environment. Our prod environment is running composer-1.19.3-airflow-2.2.5, and is therefore set up differently when it comes to things like Google Drive authentication scopes. So that's already 2 potential reasons why things are different in the prod environment.
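For context, the failing task is built roughly like this (a sketch; the spreadsheet ID, bucket, and connection ID below are placeholders, not values from our environment):

from airflow.providers.google.suite.transfers.sheets_to_gcs import GoogleSheetsToGCSOperator

upload_sheet_to_gcs = GoogleSheetsToGCSOperator(
    task_id="upload_sheet_to_gcs_airflow_permission_test_sheet",
    spreadsheet_id="<spreadsheet-id>",            # placeholder
    destination_bucket="<gcs-bucket>",            # placeholder
    gcp_conn_id="google_conn_with_drive_scopes",  # the connection set up with the extra scopes
)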

Airflow task randomly exited with return code 1 [Local Executor / PythonOperator]

To give some context, I am using Airflow 2.3.0 on Kubernetes with the Local Executor (which may sound weird, but it works for us for now) with one pod for the webserver and two for the scheduler.
I have a DAG consisting of a single task (PythonOperator) that makes many API calls (200K) using requests.
Every 15 calls, the data is loaded in a DataFrame and stored on AWS S3 (using boto3) to reduce the RAM usage.
The problem is that I can't get to the end of this task, because it fails randomly (after 1, 10, or 120 minutes).
I have made more than 50 attempts with no success, and the only logs on the task are:
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: INGESTION-DAILY-dag.extract_task scheduled__2022-08-30T00:00:00+00:00 [queued]>
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: INGESTION-DAILY-dag.extract_task scheduled__2022-08-30T00:00:00+00:00 [queued]>
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1356} INFO -
--------------------------------------------------------------------------------
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1357} INFO - Starting attempt 23 of 24
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1358} INFO -
--------------------------------------------------------------------------------
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1377} INFO - Executing <Task(_PythonDecoratedOperator): extract_task> on 2022-08-30 00:00:00+00:00
[2022-09-01, 14:45:44 UTC] {standard_task_runner.py:52} INFO - Started process 942 to run task
[2022-09-01, 14:45:44 UTC] {standard_task_runner.py:79} INFO - Running: ['airflow', 'tasks', 'run', 'INGESTION-DAILY-dag', 'extract_task', 'scheduled__2022-08-30T00:00:00+00:00', '--job-id', '4390', '--raw', '--subdir', 'DAGS_FOLDER/dags/ingestion/daily_dag/dag.py', '--cfg-path', '/tmp/tmpwxasaq93', '--error-file', '/tmp/tmpl7t_gd8e']
[2022-09-01, 14:45:44 UTC] {standard_task_runner.py:80} INFO - Job 4390: Subtask extract_task
[2022-09-01, 14:45:45 UTC] {task_command.py:369} INFO - Running <TaskInstance: INGESTION-DAILY-dag.extract_task scheduled__2022-08-30T00:00:00+00:00 [running]> on host 10.XX.XXX.XXX
[2022-09-01, 14:48:17 UTC] {local_task_job.py:156} INFO - Task exited with return code 1
[2022-09-01, 14:48:17 UTC] {taskinstance.py:1395} INFO - Marking task as UP_FOR_RETRY. dag_id=INGESTION-DAILY-dag, task_id=extract_task, execution_date=20220830T000000, start_date=20220901T144544, end_date=20220901T144817
[2022-09-01, 14:48:17 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
But when I go to the pod logs, I get the following message:
[2022-09-01 14:06:31,624] {local_executor.py:128} ERROR - Failed to execute task an integer is required (got type ChunkedEncodingError).
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/local_executor.py", line 124, in _execute_work_in_fork
args.func(args)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 51, in command
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 99, in wrapper
return f(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 377, in task_run
_run_task_by_selected_method(args, dag, ti)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 183, in _run_task_by_selected_method
_run_task_by_local_task_job(args, ti)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 241, in _run_task_by_local_task_job
run_job.run()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 244, in run
self._execute()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 105, in _execute
self.task_runner.start()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 41, in start
self.process = self._start_by_fork()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 125, in _start_by_fork
os._exit(return_code)
TypeError: an integer is required (got type ChunkedEncodingError)
What I find strange is that I never had this error on other DAGs (where tasks are smaller and faster). I checked: during an attempt, CPU and RAM usage are stable and low.
I get the same error locally; I also tried upgrading to 2.3.4, but nothing works.
Do you have any idea how to fix this?
Thanks a lot!
Nicolas
As @EDG956 said, this is not an error from Airflow but from the code.
I solved it by using a context manager (which was not enough on its own) and by recreating the session:
import requests

s = requests.Session()
while True:
    try:
        # base_url is defined earlier in the task
        with s.get(base_url) as r:
            response = r
        break  # the request succeeded, leave the retry loop
    except requests.exceptions.ChunkedEncodingError:
        # the connection broke mid-stream: discard the session and retry with a fresh one
        s.close()
        s = requests.Session()
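A note on the retry loop: the break after a successful request is what terminates the while True loop, and each ChunkedEncodingError discards the broken session so the next attempt starts with a clean connection pool.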

Airflow: How to use the same tasks in different dags

I am learning Airflow and have run into a problem.
I have 2 tasks that I want to use in several dags. The only difference between these tasks will be the parameters the operators receive.
This could be accomplished by simply copying and pasting the tasks into all the dags, but maintaining this type of code would be a nightmare.
So what I want to do is create a class that contains the tasks I will be calling several times, and just import this class from the dags.
I replicated the issue with a minimal example.
This is the code for the class:
from airflow.operators.bash_operator import BashOperator

class Operator_generator():
    _instance = None

    def __init__(self, var1, var2):
        self.var1 = var1
        self.var2 = var2

    def create_task_1(self):
        return BashOperator(
            task_id='task1',
            bash_command='echo Im running task 1, the current execution date is {{ds}} and the previous execution date is {{prev_ds}}'
        )

    def create_task_2(self):
        return BashOperator(
            task_id='task2',
            bash_command='echo Im running task 2, the current execution date is {{ds}} and the previous execution date is {{prev_ds}}'
        )
and this is a dag example where I would import the class
from include.src.date.decorator import DefaultDateTime
from airflow import DAG
from include.src.airflow.xcom import cleanup
from operator_creator import Operator_generator

dag_id = "dag1"

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": DefaultDateTime(2021, 6, 1),
    'retries': 1
}

# Dag definition
with DAG(
    dag_id,
    schedule_interval='@monthly',
    catchup=False,
    on_failure_callback=cleanup,
    on_success_callback=cleanup
) as dag:
    dag.doc_md = __doc__
    operator_generator = Operator_generator('var1','var2')
    task1 = operator_generator.create_task_1()
    task2 = operator_generator.create_task_2()
    task1 >> task2
Note that 'var1' and 'var2' are variables that I need in order to parametrize the operators.
The problem is that when I run the dag the tasks run twice:
[2021-08-25 16:29:46,937] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:29:46,937] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:46,955] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task1> on 2021-07-01T06:00:00+00:00
[2021-08-25 16:29:46,961] {standard_task_runner.py:53} INFO - Started process 67689 to run task
[2021-08-25 16:29:47,011] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task1 2021-07-01T06:00:00+00:00 [running]> 30b770753547
[2021-08-25 16:29:47,032] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:29:47,033] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpixijgd4s/task1lfcwdvfa
[2021-08-25 16:29:47,033] {bash_operator.py:146} INFO - Running command: echo Im running task 1, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:47,039] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:29:47,040] {bash_operator.py:157} INFO - Im running task 1, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:47,040] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:29:47,052] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task1, execution_date=20210701T060000, start_date=20210825T162946, end_date=20210825T162947
[2021-08-25 16:29:55,335] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task1 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:29:55,335] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-07-01T06:00:00+00:00 [queued]>
[2021-08-25 16:29:55,348] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task1 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:29:55,348] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,348] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:29:55,348] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,357] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-07-01T06:00:00+00:00 [queued]>
[2021-08-25 16:29:55,357] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,357] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:29:55,357] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:29:55,363] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task1> on 2021-08-25T16:29:41+00:00
[2021-08-25 16:29:55,366] {standard_task_runner.py:53} INFO - Started process 67809 to run task
[2021-08-25 16:29:55,370] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task2> on 2021-07-01T06:00:00+00:00
[2021-08-25 16:29:55,374] {standard_task_runner.py:53} INFO - Started process 67810 to run task
[2021-08-25 16:29:55,412] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task1 2021-08-25T16:29:41+00:00 [running]> 30b770753547
[2021-08-25 16:29:55,422] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task2 2021-07-01T06:00:00+00:00 [running]> 30b770753547
[2021-08-25 16:29:55,430] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:29:55,432] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpsacovlfm/task1doc6fakb
[2021-08-25 16:29:55,432] {bash_operator.py:146} INFO - Running command: echo Im running task 1, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:29:55,440] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:29:55,440] {bash_operator.py:157} INFO - Im running task 1, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:29:55,441] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:29:55,444] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:29:55,445] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpyqqww8an/task2i29a2lk7
[2021-08-25 16:29:55,445] {bash_operator.py:146} INFO - Running command: echo Im running task 2, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:55,451] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task1, execution_date=20210825T162941, start_date=20210825T162955, end_date=20210825T162955
[2021-08-25 16:29:55,453] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:29:55,453] {bash_operator.py:157} INFO - Im running task 2, the current execution date is 2021-07-01 and the previous execution date is 2021-06-01
[2021-08-25 16:29:55,454] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:29:55,465] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task2, execution_date=20210701T060000, start_date=20210825T162955, end_date=20210825T162955
[2021-08-25 16:29:56,922] {logging_mixin.py:112} INFO - [2021-08-25 16:29:56,921] {local_task_job.py:103} INFO - Task exited with return code 0
[2021-08-25 16:30:05,333] {logging_mixin.py:112} INFO - [2021-08-25 16:30:05,333] {local_task_job.py:103} INFO - Task exited with return code 0
[2021-08-25 16:30:05,337] {logging_mixin.py:112} INFO - [2021-08-25 16:30:05,337] {local_task_job.py:103} INFO - Task exited with return code 0
[2021-08-25 16:30:06,794] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:30:06,809] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: dag1.task2 2021-08-25T16:29:41+00:00 [queued]>
[2021-08-25 16:30:06,809] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:30:06,810] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2021-08-25 16:30:06,810] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2021-08-25 16:30:06,822] {taskinstance.py:900} INFO - Executing <Task(BashOperator): task2> on 2021-08-25T16:29:41+00:00
[2021-08-25 16:30:06,826] {standard_task_runner.py:53} INFO - Started process 67937 to run task
[2021-08-25 16:30:06,875] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: dag1.task2 2021-08-25T16:29:41+00:00 [running]> 30b770753547
[2021-08-25 16:30:06,892] {bash_operator.py:113} INFO - Tmp dir root location:
/tmp
[2021-08-25 16:30:06,893] {bash_operator.py:136} INFO - Temporary script location: /tmp/airflowtmpot_xsukw/task2xo4uxspu
[2021-08-25 16:30:06,893] {bash_operator.py:146} INFO - Running command: echo Im running task 2, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:30:06,901] {bash_operator.py:153} INFO - Output:
[2021-08-25 16:30:06,902] {bash_operator.py:157} INFO - Im running task 2, the current execution date is 2021-08-25 and the previous execution date is 2021-08-25
[2021-08-25 16:30:06,902] {bash_operator.py:161} INFO - Command exited with return code 0
[2021-08-25 16:30:06,913] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=dag1, task_id=task2, execution_date=20210825T162941, start_date=20210825T163006, end_date=20210825T163006
[2021-08-25 16:30:16,800] {logging_mixin.py:112} INFO - [2021-08-25 16:30:16,799] {local_task_job.py:103} INFO - Task exited with return code 0
Notice how the tasks are executed 2 times:
- In the first execution, the values of {{ds}} and {{prev_ds}} are the current date.
- In the second execution, the values of {{ds}} and {{prev_ds}} correspond to the monthly interval.
Why do the tasks run 2 times?
Is there a way to import tasks like this?
Note 1: I am not allowed to use subdags.
Edit: Adding the execution tree
Edit 2:
If anyone runs into this problem: I figured it out.
The problem was that I was running the dag with an external trigger, which deletes the dag and starts it again. So the dag runs for the external trigger, but the scheduler also sees that the dag hasn't run for the month, so it schedules an execution, resulting in 2 runs.
The solution I found is:
1. Turn off the dag in the Airflow interface.
2. Delete the dag (with the red x at the far right of the dag).
3. Refresh the page.
4. The dag appears again in the list; turn it on.
This will make the scheduler do its job, and the dag will run as it should.

Airflow dag_id did not exist or it failed to parse

Currently I'm learning how to use Apache Airflow and am trying to create a simple DAG script like this:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

def print_hello():
    return 'Hello world!'

dag = DAG('hello_world', description='Simple tutorial DAG',
          schedule_interval='0 0 * * *',
          start_date=datetime(2020, 5, 23), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)

dummy_operator >> hello_operator
I ran this DAG using the web server and it ran successfully; I even checked the logs:
[2020-05-23 20:43:53,411] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [queued]>
[2020-05-23 20:43:53,431] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [queued]>
[2020-05-23 20:43:53,432] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2020-05-23 20:43:53,432] {taskinstance.py:880} INFO - Starting attempt 1 of 1
[2020-05-23 20:43:53,432] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2020-05-23 20:43:53,448] {taskinstance.py:900} INFO - Executing <Task(PythonOperator): hello_task> on 2020-05-23T13:42:17.463955+00:00
[2020-05-23 20:43:53,477] {standard_task_runner.py:53} INFO - Started process 7442 to run task
[2020-05-23 20:43:53,685] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: hello_world.hello_task 2020-05-23T13:42:17.463955+00:00 [running]> LAPTOP-9BCTKM5O.localdomain
[2020-05-23 20:43:53,715] {python_operator.py:114} INFO - Done. Returned value was: Hello world!
[2020-05-23 20:43:53,738] {taskinstance.py:1052} INFO - Marking task as SUCCESS.dag_id=hello_world, task_id=hello_task, execution_date=20200523T134217, start_date=20200523T134353, end_date=20200523T134353
[2020-05-23 20:44:03,372] {logging_mixin.py:112} INFO - [2020-05-23 20:44:03,372] {local_task_job.py:103} INFO - Task exited with return code 0
But when I tried to test-run a single task using this command:
airflow test dags/main.py hello_task 2020-05-23
it shows this error:
airflow.exceptions.AirflowException: dag_id could not be found: dags/main.py. Either the dag did not exist or it failed to parse.
Where did I go wrong?
You got your airflow test command a tad wrong: instead of giving the path to the dag, dags/main.py, you need to pass the dag_id itself, which is hello_world looking at your code.
So try this:
airflow test hello_world hello_task 2020-05-23
You should get output similar to this :)
airflow#940836ce7da4:/opt/airflow$ airflow test hello_world hello_task 2020-05-23
[2020-05-23 14:18:51,144] {__init__.py:51} INFO - Using executor CeleryExecutor
[2020-05-23 14:18:51,145] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags
[2020-05-23 14:18:51,190] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T00:00:00+00:00 [None]>
[2020-05-23 14:18:51,203] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: hello_world.hello_task 2020-05-23T00:00:00+00:00 [None]>
[2020-05-23 14:18:51,203] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2020-05-23 14:18:51,203] {taskinstance.py:880} INFO - Starting attempt 1 of 1
[2020-05-23 14:18:51,203] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
[2020-05-23 14:18:51,204] {taskinstance.py:900} INFO - Executing <Task(PythonOperator): hello_task> on 2020-05-23T00:00:00+00:00
[2020-05-23 14:18:51,234] {python_operator.py:114} INFO - Done. Returned value was: Hello world!
[2020-05-23 14:18:51,249] {taskinstance.py:1065} INFO - Marking task as SUCCESS.dag_id=hello_world, task_id=hello_task, execution_date=20200523T000000, start_date=20200523T141851, end_date=20200523T141851
After Airflow 2.0, the syntax is:
airflow tasks test dag_id task_id date
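For the example above, that would be:
airflow tasks test hello_world hello_task 2020-05-23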
