I am in the process of migrating our Airflow environment from version 1.10.15 to 2.3.3. I have migrated 1 DAG over to the new environment and intermittently I get an email with this error: Executor reports task instance finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
When looking at the logs, this is what I find in the scheduler logs:
[2022-08-09 07:00:08,621] {dag.py:2968} INFO - Setting next_dagrun for DAGRP-Get_Overrides to 2022-08-09T11:00:00+00:00, run_after=2022-08-09T16:00:00+00:00
[2022-08-09 07:00:08,652] {scheduler_job.py:353} INFO - 1 tasks up for execution:
<TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [scheduled]>
[2022-08-09 07:00:08,652] {scheduler_job.py:418} INFO - DAG DAGRP-Get_Overrides has 0/3 running and queued tasks
[2022-08-09 07:00:08,652] {scheduler_job.py:504} INFO - Setting the following tasks to queued state:
<TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [scheduled]>
[2022-08-09 07:00:08,654] {scheduler_job.py:546} INFO - Sending TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1) to executor with priority 1 and queue default
[2022-08-09 07:00:08,654] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'DAGRP-Get_Overrides', 'Get_override', 'scheduled__2022-08-08T16:00:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/da_group/get_override.py']
[2022-08-09 07:00:12,665] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:12,667] {celery_executor.py:283} INFO - [Try 1 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:16,701] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:16,702] {celery_executor.py:283} INFO - [Try 2 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:21,704] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:21,705] {celery_executor.py:283} INFO - [Try 3 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:26,627] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:26,627] {celery_executor.py:294} ERROR - Error sending Celery task: Timeout, PID: 1
Celery Task ID: TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)
Traceback (most recent call last):
File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 30, in __call__
return self.__value__
AttributeError: 'ChannelPromise' object has no attribute '__value__'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/airflow/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 177, in send_task_to_executor
result = task_to_run.apply_async(args=[command], queue=queue)
File "/opt/airflow/lib/python3.8/site-packages/celery/app/task.py", line 575, in apply_async
return app.send_task(
File "/opt/airflow/lib/python3.8/site-packages/celery/app/base.py", line 788, in send_task
amqp.send_task_message(P, name, message, **options)
File "/opt/airflow/lib/python3.8/site-packages/celery/app/amqp.py", line 510, in send_task_message
ret = producer.publish(
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 177, in publish
return _publish(
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 523, in _ensured
return fun(*args, **kwargs)
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 186, in _publish
channel = self.channel
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 209, in _get_channel
channel = self._channel = channel()
File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 32, in __call__
value = self.__value__ = self.__contract__()
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 225, in <lambda>
channel = ChannelPromise(lambda: connection.default_channel)
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 895, in default_channel
self._ensure_connection(**conn_opts)
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 433, in _ensure_connection
return retry_over_time(
File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 312, in retry_over_time
return fun(*args, **kwargs)
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 877, in _connection_factory
self._connection = self._establish_connection()
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 812, in _establish_connection
conn = self.transport.establish_connection()
File "/opt/airflow/lib/python3.8/site-packages/kombu/transport/pyamqp.py", line 201, in establish_connection
conn.connect()
File "/opt/airflow/lib/python3.8/site-packages/amqp/connection.py", line 323, in connect
self.transport.connect()
File "/opt/airflow/lib/python3.8/site-packages/amqp/transport.py", line 129, in connect
self._connect(self.host, self.port, self.connect_timeout)
File "/opt/airflow/lib/python3.8/site-packages/amqp/transport.py", line 184, in _connect
self.sock.connect(sa)
File "/opt/airflow/lib/python3.8/site-packages/airflow/utils/timeout.py", line 68, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 1
[2022-08-09 07:00:26,627] {scheduler_job.py:599} INFO - Executor reports execution of DAGRP-Get_Overrides.Get_override run_id=scheduled__2022-08-08T16:00:00+00:00 exited with status failed for try_number 1
[2022-08-09 07:00:26,633] {scheduler_job.py:642} INFO - TaskInstance Finished: dag_id=DAGRP-Get_Overrides, task_id=Get_override, run_id=scheduled__2022-08-08T16:00:00+00:00, map_index=-1, run_start_date=None, run_end_date=None, run_duration=None, state=queued, executor_state=failed, try_number=1, max_tries=0, job_id=None, pool=default_pool, queue=default, priority_weight=1, operator=PythonOperator, queued_dttm=2022-08-09 11:00:08.652767+00:00, queued_by_job_id=56, pid=None
[2022-08-09 07:00:26,633] {scheduler_job.py:684} ERROR - Executor reports task instance <TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2022-08-09 07:01:16,687] {processor.py:233} WARNING - Killing DAGFileProcessorProcess (PID=1811)
[2022-08-09 07:04:00,640] {scheduler_job.py:1233} INFO - Resetting orphaned tasks for active dag runs
I am running Airflow on 2 servers with 2 of each service (2 schedulers, 2 workers, 2 webservers). They are running in docker containers. They are configured to use celery executor and I'm using RabbitMQ version 3.10.6 (also 2 instances in docker containers behind a LB). I am using Postgres 13.7 for our database (running one instance in a docker container on the 1st server). Our environment is running on Python 3.8.12.
From my understanding, the timeout is between the scheduler and rabbitmq? From what I can tell we are hitting this timeout: AIRFLOW__CELERY__OPERATION_TIMEOUT (it's currently set to 4).
I would like to track down what is causing the issue before I just increase timeout settings. What can I do to find out what's going on? Anyone else run into this issue? Am I correct in assuming the timeout is between the scheduler and rabbitmq? Is it between the scheduler and database? Why am I seeing this with Airflow 2 when I have the same setup with Airflow 1 and it works with no problems? Any help is greatly appreciated!
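One way to narrow this down before touching the timeout settings is to time a raw TCP connect from inside the scheduler container to the load balancer and to each RabbitMQ node, since the traceback ends in sock.connect. The following is only a minimal sketch using the standard library; the function names and the hostnames in the usage note are hypothetical, and the 4-second default simply mirrors the AIRFLOW__CELERY__OPERATION_TIMEOUT value mentioned above.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=4.0):
    """Return (ok, seconds) for one TCP connect attempt to host:port."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and connects, honouring the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        ok = False
    return ok, time.monotonic() - start


def probe(targets, timeout=4.0):
    """Probe each (host, port) pair and return {(host, port): (ok, seconds)}."""
    return {target: time_tcp_connect(target[0], target[1], timeout) for target in targets}
```

For example, probe([("rabbitmq-lb.example", 5672), ("rabbitmq-1.example", 5672), ("rabbitmq-2.example", 5672)]) run from each scheduler would show whether one path is slow or unreachable. A connect that fails only after roughly the full timeout suggests packets are being dropped rather than refused.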
Update:
I was able to reproduce the error by shutting down 1 of the rabbitmq nodes. Even though rabbitmq is behind a LB with a health probe, whenever a job was picked up by scheduler 1, it would fail with this error, but if scheduler 2 picked up the job, it would finish successfully. The odd thing is that the node I shut down was rabbitmq 2.
So I think I've been able to solve this issue. Here is what I did:
I added a custom celery_config.py to the scheduler and worker docker containers, adding this environment variable: AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=celery_config.CELERY_CONFIG. As part of that celery config, I specified both my rabbitmq brokers under broker_url. This is the full config:
import os

from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG

# Broker password and node hostnames come from the container environment.
RABBITMQ_PW = os.environ["RABBITMQ_PW"]
CLUSTER_NODE = os.environ["RABBITMQ_CLUSTER_NODE"]
LOCAL_NODE = os.environ["RABBITMQ_NODE"]

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    "worker_send_task_events": True,
    "task_send_sent_event": True,
    "result_extended": True,
    # List both brokers so Celery can fail over when the first is unreachable.
    "broker_url": [
        f"amqp://rabbitmq:{RABBITMQ_PW}@{LOCAL_NODE}:5672",
        f"amqp://rabbitmq:{RABBITMQ_PW}@{CLUSTER_NODE}:5672",
    ],
}
What happens now is that if the worker loses its connection to the 1st broker, it attempts to connect to the 2nd broker:
[2022-08-11 12:00:52,876: ERROR/MainProcess] consumer: Cannot connect to amqp://rabbitmq:**@<LOCAL_NODE>:5672//: [Errno 111] Connection refused.
[2022-08-11 12:00:52,875: INFO/MainProcess] Connected to amqp://rabbitmq:**@<CLUSTER_NODE>:5672//
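The config puts the password straight into each broker URL. If the password could ever contain characters such as "@" or "/", the URL becomes ambiguous, so it may be safer to percent-encode the credentials first. The helper below is only a sketch of that idea: build_broker_urls is a hypothetical name, and it assumes the broker client decodes percent-encoded credentials, which is worth verifying against your Celery and kombu versions.

```python
from urllib.parse import quote


def build_broker_urls(user, password, hosts, port=5672):
    """Build one amqp:// URL per host, percent-encoding the credentials."""
    # safe="" forces characters such as "/" and "@" to be encoded too.
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return [f"amqp://{creds}@{host}:{port}" for host in hosts]
```

In the config this would replace the two f-strings, for example "broker_url": build_broker_urls("rabbitmq", RABBITMQ_PW, [LOCAL_NODE, CLUSTER_NODE]).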
Another interesting note: I still have the Airflow environment variable AIRFLOW__CELERY__BROKER_URL set to the load balancer URL. That's because Airflow (1) won't allow the worker to start without it, and (2) won't allow you to specify multiple brokers the way the celery config does. So when the worker starts, it shows:
- ** ---------- .> transport: amqp://rabbitmq:**@<LOCAL_NODE>:5672//
[2022-08-26 11:37:17,952: INFO/MainProcess] Connected to amqp://rabbitmq:**@<LOCAL_NODE>:5672//
This happens even though I have the LB configured for AIRFLOW__CELERY__BROKER_URL.
I run the command as follows.
mpirun --hostfile /home/user/share/hostlist.txt -np 4 /home/user/share/mpi-dask/venv/bin/dask-mpi --scheduler-file ~/dask-scheduler.json
I got the following result.
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[rpi40000:14497] Local abort
before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
2022-06-23 06:40:12,321 - distributed.nanny - INFO - Worker process 14497 exited with status 1
2022-06-23 06:40:12,324 - distributed.nanny - WARNING - Restarting worker
^C[rpi40000:14416] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[rpi40000:14416] 8 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[rpi40000:14416] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[rpi40000:14416] 5 more processes have sent help message help-orte-runtime / orte_init:startup:internal-failure
[rpi40000:14416] 5 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
I am using Airflow in a Docker container. I run a DAG with multiple Jupyter notebooks. I get the following error every time, after 60 minutes:
[2021-08-22 09:15:15,650] {local_task_job.py:198} WARNING - State of this instance has been externally set to skipped. Terminating instance.
[2021-08-22 09:15:15,654] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 277
[2021-08-22 09:15:15,655] {taskinstance.py:1284} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-08-22 09:15:18,284] {taskinstance.py:1501} ERROR - Task failed with exception
I tried to tweak the config file but could not find the right option to remove the 1-hour timeout.
Any help would be appreciated.
By default there is no timeout. If your DAG defines dagrun_timeout=timedelta(minutes=60) and the run takes longer than 60 minutes, the active task is stopped and the message "State of this instance has been externally set to skipped" is logged.
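As a rough illustration, the limit is set where the DAG is constructed, not in the config file. The sketch below uses only the standard library so it runs on its own. The dag_id, start date, and schedule are placeholders, and exceeds_timeout is a hypothetical helper that just mirrors the comparison described above; it is not Airflow's own code.

```python
from datetime import datetime, timedelta

# Hypothetical DAG arguments; in the DAG file they would be passed as DAG(**dag_kwargs).
dag_kwargs = {
    "dag_id": "notebooks_pipeline",       # placeholder name
    "start_date": datetime(2021, 8, 1),   # placeholder date
    "schedule_interval": "@daily",        # placeholder schedule
    # Raise the run-level limit from 60 minutes to 3 hours.
    # Setting this to None (the default) removes the limit.
    "dagrun_timeout": timedelta(hours=3),
}


def exceeds_timeout(started, now, timeout):
    """Return True if a run that began at `started` is past `timeout` at `now`."""
    return timeout is not None and (now - started) > timeout
```

If the DAG file already passes dagrun_timeout=timedelta(minutes=60), raising or removing that argument is the change to make.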
I'm seeing an IO error on the Riak console. I'm not sure what the cause is as the owner of the directory is riak. Here's how the error looks.
2018-01-25 23:18:06.922 [info] <0.2301.0>#riak_kv_vnode:maybe_create_hashtrees:234 riak_kv/730750818665451459101842416358141509827966271488: unable to start index_hashtree: {error,{{badmatch,{error,{db_open,"IO error: lock /var/lib/riak/anti_entropy/v0/730750818665451459101842416358141509827966271488/LOCK: already held by process"}}},[{hashtree,new_segment_store,2,[{file,"src/hashtree.erl"},{line,725}]},{hashtree,new,2,[{file,"src/hashtree.erl"},{line,246}]},{riak_kv_index_hashtree,do_new_tree,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,712}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,init_trees,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,565}]},{riak_kv_index_hashtree,init,1,[{file,"src/riak_kv_index_hashtree.erl"},{line,308}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}}
2018-01-25 23:18:06.927 [info] <0.2315.0>#riak_kv_vnode:maybe_create_hashtrees:234 riak_kv/890602560248518965780370444936484965102833893376: unable to start index_hashtree: {error,{{badmatch,{error,{db_open,"IO error: lock /var/lib/riak/anti_entropy/v0/890602560248518965780370444936484965102833893376/LOCK: already held by process"}}},[{hashtree,new_segment_store,2,[{file,"src/hashtree.erl"},{line,725}]},{hashtree,new,2,[{file,"src/hashtree.erl"},{line,246}]},{riak_kv_index_hashtree,do_new_tree,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,712}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,init_trees,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,565}]},{riak_kv_index_hashtree,init,1,[{file,"src/riak_kv_index_hashtree.erl"},{line,308}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}}
2018-01-25 23:18:06.928 [error] <0.27284.0> CRASH REPORT Process <0.27284.0> with 0 neighbours exited with reason: no match of right hand value {error,{db_open,"IO error: lock /var/lib/riak/anti_entropy/v0/890602560248518965780370444936484965102833893376/LOCK: already held by process"}} in hashtree:new_segment_store/2 line 725 in gen_server:init_it/6 line 328
Any ideas on what the problem could be?
I have an R script which works perfectly fine in the R console, but when I run it in Hadoop Streaming it fails with the error below in the map phase. The task attempt logs follow.
The Hadoop Streaming command I use:
/home/Bibhu/hadoop-0.20.2/bin/hadoop jar \
/home/Bibhu/hadoop-0.20.2/contrib/streaming/*.jar \
-input hdfs://localhost:54310/user/Bibhu/BookTE1.csv \
-output outsid -mapper `pwd`/code1.sh
stderr logs
Loading required package: class
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
Calls: read.csv -> read.table
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
syslog logs
2013-07-03 19:32:36,080 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2013-07-03 19:32:36,654 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2013-07-03 19:32:36,675 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2013-07-03 19:32:36,835 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2013-07-03 19:32:36,835 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2013-07-03 19:32:36,899 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/home/Bibhu/Downloads/SentimentAnalysis/Sid/smallFile/code1.sh]
2013-07-03 19:32:37,256 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=0/1
2013-07-03 19:32:38,509 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2013-07-03 19:32:38,509 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
2013-07-03 19:32:38,557 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2013-07-03 19:32:38,631 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
Write out the Hadoop Streaming jar with its full version, like hadoop-streaming-1.0.4.jar, instead of a wildcard.
Specify the file path of the mapper and of the reducer separately with the -file option.
Tell Hadoop which script is your mapper and which is your reducer with the -mapper and -reducer options.
For more reference, see "Running WordCount on Hadoop using R script".
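The points above can be sketched as a small helper that assembles the command line. This is only an illustration: streaming_command is a hypothetical function, and the jar file name used in the usage note (hadoop-0.20.2-streaming.jar) is an assumption about what the contrib/streaming directory contains. The other paths are taken from the question.

```python
import os
import shlex


def streaming_command(hadoop_bin, streaming_jar, input_path, output_path,
                      mapper, reducer=None):
    """Return the argv list for a Hadoop Streaming job."""
    argv = [hadoop_bin, "jar", streaming_jar,
            "-input", input_path, "-output", output_path]
    # Ship each script to the task nodes with its own -file option.
    for script in filter(None, (mapper, reducer)):
        argv += ["-file", script]
    # Shipped files land in the task's working directory, so refer to them by base name.
    argv += ["-mapper", os.path.basename(mapper)]
    if reducer:
        argv += ["-reducer", os.path.basename(reducer)]
    return argv


def as_shell(argv):
    """Render the argv list as a single shell-quoted string for copy and paste."""
    return shlex.join(argv)
```

For the job in the question, streaming_command("/home/Bibhu/hadoop-0.20.2/bin/hadoop", "/home/Bibhu/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar", "hdfs://localhost:54310/user/Bibhu/BookTE1.csv", "outsid", "/home/Bibhu/Downloads/SentimentAnalysis/Sid/smallFile/code1.sh") gives the argument list, which can be passed to subprocess.run or printed with as_shell.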
You need to find the logs from your mappers and reducers, since this is the place where the job is failing (as indicated by java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1). This says that your R script crashed.
If you are using the Hortonworks Hadoop distribution, the easiest way is to open your jobhistory. It should be at http://127.0.0.1:19888/jobhistory . It should also be possible to find the log in the filesystem from the command line, but I haven't yet found where.
Open http://127.0.0.1:19888/jobhistory in your web browser
Click on the Job ID of the failed job
Click the number indicating the failed job count
Click an attempt link
Click the logs link
You should see a page which looks something like this:
Log Type: stderr
Log Length: 418
Traceback (most recent call last):
File "/hadoop/yarn/local/usercache/root/appcache/application_1404203309115_0003/container_1404203309115_0003_01_000002/./mapper.py", line 45, in <module>
mapper()
File "/hadoop/yarn/local/usercache/root/appcache/application_1404203309115_0003/container_1404203309115_0003_01_000002/./mapper.py", line 37, in mapper
for record in reader:
_csv.Error: newline inside string
This is an error from my Python script; the errors from R look a bit different.
source: http://hortonworks.com/community/forums/topic/map-reduce-job-log-files/
I received this same error tonight, while also developing Map Reduce Streaming jobs with R.
I was working on a 10 node cluster, each with 12 cores, and tried to supply at submission time:
-D mapred.map.tasks=200 \
-D mapred.reduce.tasks=200
However, the job completed successfully when I changed these to
-D mapred.map.tasks=10 \
-D mapred.reduce.tasks=10
This was a mysterious fix, and perhaps more context will emerge this evening. If any readers can explain it, please do!