I’m confused about what the function Distributed.interrupt() does. The documentation says it will ‘interrupt the current executing task on the specified workers’, but it seems to terminate the workers as well.
Example:
using Distributed
addprocs(1) # Adding one local worker
my_worker = workers()[1]
# Check number of processes
println("Processes: ", nprocs())
# Define a function
@everywhere function just_sleep(time)
    println("Sleeping...")
    sleep(time)
end
# Execute on the worker
remote_do(just_sleep, my_worker, 100)
# Wait a bit and interrupt
sleep(5)
interrupt(my_worker)
# Check number of processes again
sleep(5)
println("Processes: ", nprocs())
I get this output
> julia testing.jl
Processes: 2
From worker 2: Sleeping...
Worker 2 terminated.
Processes: 1
I was expecting worker process #2 to still be alive and the number of processes to still be two at the end. Adding exception handling to the body of just_sleep() does not help either:
function just_sleep(time)
    println("Sleeping...")
    try
        sleep(time)
    catch e
        if isa(e, InterruptException)
            println("interrupted")
        else
            println(e)
        end
    end
end
Now interrupt() seems to behave like Distributed.rmprocs(). I have Julia 1.5.3 on Windows 10.
EDIT
I also tried it on WSL Ubuntu. There is a bit more info, but the worker gets terminated there as well:
Processes: 2
From worker 2: Sleeping...
From worker 2: fatal: error thrown and no exception handler available.
From worker 2: InterruptException()
From worker 2: jl_mutex_unlock at /buildworker/worker/package_linux64/build/src/locks.h:144 [inlined]
From worker 2: jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:476
From worker 2: poptask at ./task.jl:704
From worker 2: wait at ./task.jl:712 [inlined]
From worker 2: task_done_hook at ./task.jl:442
From worker 2: _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
From worker 2: jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
From worker 2: jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
Worker 2 terminated. From worker 2: jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:196
From worker 2: start_task at /buildworker/worker/package_linux64/build/src/task.c:715
From worker 2: unknown function (ip: (nil))
Processes: 1
Interestingly, it works in an interactive REPL session (but only on Linux):
Processes: 2
workers() = [2]
From worker 2: Sleeping for 100 s...
From worker 2: interrupted!
Processes: 2
workers() = [2]
1-element Array{Int64,1}:
2
Related
An airflow (v1.10.5) DAG that ran fine with SequentialExecutor now has many (though not all) simple tasks that fail without any log information when running with LocalExecutor and minimal parallelism, e.g.:
<airflow.cfg>
# overall task concurrency limit for airflow
parallelism = 8 # same as the number of cores shown by lscpu
# max tasks per dag
dag_concurrency = 2
# max instances of a given dag that can run on airflow
max_active_runs_per_dag = 1
# max threads used per worker / core
max_threads = 2
# 40G of RAM available total
# CPUs: 8 (sockets 4, cores per socket 4)
see https://www.astronomer.io/guides/airflow-scaling-workers/
Nothing looks out of the ordinary in the airflow-webserver.* logs, but in airflow-scheduler.out I see...
[airflow@airflowetl airflow]$ tail -n 20 airflow-scheduler.out
....
[2019-12-18 11:29:17,773] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table1 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:17,779] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table2 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:17,782] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table3 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:18,833] {scheduler_job.py:832} WARNING - Set 1 task instances to state=None as their associated DagRun was not in RUNNING state
[2019-12-18 11:29:18,844] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table4 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status success for try_number 1
....
but I'm not really sure what to take away from this.
Does anyone know what could be going on here, or how to get more helpful debugging info?
Looking again at my lscpu specs, I noticed...
[airflow@airflowetl airflow]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
Notice Thread(s) per core: 1
Looking at my airflow.cfg settings, I see max_threads = 2. Setting max_threads = 1 and restarting the scheduler seems to have fixed the problem.
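For reference, the change amounts to the following in airflow.cfg (assuming the setting sits in the [scheduler] section, as in stock 1.10.x configs):
[scheduler]
# was 2; with Thread(s) per core: 1 there are no extra hardware threads per core
max_threads = 1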
If anyone knows more about what exactly is going wrong under the hood (e.g. why the task fails rather than just waiting for another thread to become available), I would be interested to hear about it.
Our airflow installation is using CeleryExecutor.
The concurrency configs were
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 16
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 64
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
[celery]
# This section only applies if you are using the CeleryExecutor in
# [core] section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
celeryd_concurrency = 16
We have a DAG that executes daily. It has some tasks running in parallel, following a pattern that senses whether the data exists in HDFS, then sleeps 10 minutes, and finally uploads to S3.
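For illustration, the pattern looks roughly like this; the operator choices, paths, and connection IDs below are my assumptions, not the actual DAG:
# Hypothetical sketch: sense data in HDFS, wait 10 minutes, upload to S3.
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.hdfs_sensor import HdfsSensor

def wait_ten_minutes():
    time.sleep(600)  # the "sleep 10 mins" step

def upload_to_s3():
    pass  # placeholder for the actual S3 upload logic

with DAG("example_dag", start_date=datetime(2019, 5, 1),
         schedule_interval="@daily", catchup=False) as dag:
    sense = HdfsSensor(task_id="sense_hdfs_data",
                       filepath="/data/some/partition",  # hypothetical path
                       hdfs_conn_id="hdfs_default")
    wait = PythonOperator(task_id="sleep_10_min",
                          python_callable=wait_ten_minutes)
    upload = PythonOperator(task_id="upload_to_s3",
                            python_callable=upload_to_s3)
    sense >> wait >> upload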
Some of the tasks have been encountering the following error:
2019-05-12 00:00:46,212 ERROR - Executor reports task instance <TaskInstance: example_dag.task1 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
2019-05-12 00:00:46,558 INFO - Marking task as UP_FOR_RETRY
2019-05-12 00:00:46,561 WARNING - section/key [smtp/smtp_user] not found in config
This kind of error occurs randomly in those tasks. When it happens, the state of the task instance is immediately set to up_for_retry, and there are no logs on the worker nodes. After some retries, they eventually execute and finish.
This problem sometimes causes large ETL delays. Does anyone know how to solve it?
We were facing similar problems, which were resolved by the "-x, --donot_pickle" option.
For more information: https://airflow.apache.org/cli.html#backfill
I was seeing very similar symptoms in my DagRuns. I thought it was due to the ExternalTaskSensor and concurrency issues, given the queuing and killed-task language that looked like this: Executor reports task instance <TaskInstance: dag1.data_table_temp_redshift_load 2019-05-20 08:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally? But when I looked at the worker logs, I saw an error caused by setting a variable with Variable.set in my DAG. The issue is described in duplicate key value violates unique constraint when adding path variable in airflow dag: the scheduler polls the dagbag at regular intervals to refresh any changes dynamically, so the error on every heartbeat was causing significant ETL delays.
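To make that concrete, the problematic shape was roughly the following; the variable name and DAG body here are hypothetical placeholders, and moving variable access into the task callable is a general workaround rather than what the linked answer prescribes:
# Hypothetical sketch of the anti-pattern: module-level code in a DAG file
# runs on every scheduler parse, so a top-level Variable.set() writes to the
# metadata DB on every dagbag refresh.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

Variable.set("data_path", "/some/path")  # executed on every parse of this file

def load_table():
    # Accessing the variable inside the callable only happens when the
    # task actually runs, not on every scheduler heartbeat.
    path = Variable.get("data_path")
    print("loading from", path)

with DAG("dag1", start_date=datetime(2019, 5, 20),
         schedule_interval="@daily") as dag:
    PythonOperator(task_id="data_table_temp_redshift_load",
                   python_callable=load_table)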
Are you performing any logic in your wh_hdfs_to_s3 DAG (or others) that might be causing errors or delays / these symptoms?
We fixed this already. Let me answer my own question:
We have 5 airflow worker nodes. After installing Flower to monitor the tasks distributed to these nodes, we found out that the failed task was always sent to a specific node. We tried to use the airflow test command to run the task on other nodes, and it worked there. Eventually, the cause turned out to be a wrong python package on that specific node.
I ran into an exception when using grpc v1.8.x and also v1.7.x:
E0111 07:32:20.953644757 2249 chttp2_transport.cc:748] server stream 17 still included in list 0
*** Aborted at 1515655940 (unix time) try "date -d @1515655940" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x6e3) received by PID 1763 (TID 0x7fdf13fff700) from PID 1763; stack trace: ***
@ 0x7fe10a56f390 (unknown)
@ 0x7fe10a1c9428 gsignal
@ 0x7fe10a1cb02a abort
@ 0x7fe0d92ea1c6 (unknown)
@ 0x7fe0d92df4cc grpc_combiner_continue_exec_ctx
@ 0x7fe0d92b58a2 grpc_exec_ctx_flush
@ 0x7fe0d92b5fac (unknown)
@ 0x7fe0d92b60f5 (unknown)
@ 0x7fe0d92dc557 (unknown)
@ 0x7fe10a5656ba start_thread
@ 0x7fe10a29b3dd clone
@ 0x0 (unknown)
This failure doesn't happen often; it can take a few minutes or a few hours to show up.
Can someone give some advice about it? My server is grpc_server.h / grpc_server.cc, which is an async server with two types of calls, and its proto is here.
I took a quick look. Could you try locking around AsyncGRPCServer::ShutDown() instead of AsyncGRPCServer::ShutdownQueue()?
If that doesn't work, is there any way you could try to turn this into a reproducible unit test? It's very hard to track these things down otherwise. I don't see anything glaring in your code.
My GenServer terminates after a little while, after sending a few HTTP requests. I can't understand the reason:
[error] GenServer MyGenServer terminating
** (stop) exited in: Task.await(%Task{owner: #PID<0.420.0>, pid: #PID<0.1054.0>, ref: #Reference<....>}, 5000)
** (EXIT) time out
(elixir) lib/task.ex:416: Task.await/2
(elixir) lib/enum.ex:966: Enum.flat_map_list/2
(my_app123) lib/my_genserver.ex:260: MyApp.MyGenServer.do_work/1
(my_app123) lib/my_genserver.ex:180: MyApp.MyGenServer.handle_info/2
(stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:683: :gen_server.handle_msg/5
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: :tick
State: [%{var1: "fdsafdsfd", var2: "43243242"}]
A chunk of the code:
# it's called from handle_info
def do_work(some_data) do
  Enum.map(some_data, fn(x) ->
    Task.async(fn ->
      case HTTPoison.post(.....) do
        # ...........
Is "Task.async" causing the timeout? But why? Yes, it can take more than 5 seconds to complete, but why does it cause an exception which then terminates GenServer? How to fix it?
About await:
If the timeout is exceeded, await will exit; however, the task will continue to run. When the calling process exits, its exit signal will terminate the task if it is not trapping exits.
As the documentation says, Task.await has a default timeout of 5 seconds, after which it exits (terminates) the calling process. You can increase the timeout like this:
Task.await(task, 60000) # 1 minute
and you can remove the timeout completely by passing :infinity as the timeout instead of a number:
Task.await(task, :infinity)
I am using the Redis server from this link:
http://cloud.github.com/downloads/rgl/redis/redis-2.4.6-setup-64-bit.exe
with R version 3.0.3, doRedis 1.1.0, and rredis 1.6.8.
The Redis worker exits immediately after receiving jobs:
> redisWorker('jobs')
Waiting for doRedis jobs.
Processing task for job 2 from queue jobs
Error in doTryCatch(return(expr), name, parentenv, handler) :
ERR unknown command 'EVAL'
But with the Redis server from this link:
https://github.com/MSOpenTech/redis
or with a Redis server built from source on Cygwin,
the worker seems to be able to process the job, but the master receives an error:
> redisWorker('jobs')
Waiting for doRedis jobs.
Processing task for job 9 from queue jobs
Processing task 1 ... from queue jobs jobID 9
Processing task for job 9 from queue jobs
Processing task 2 ... from queue jobs jobID 9
Processing task for job 9 from queue jobs
Processing task 3 ... from queue jobs jobID 9
> registerDoRedis('jobs')
> foreach(i = 1:3)%dopar%i
Error in i : task 1 failed - "object '.doRedisGlobals' not found"
I reported this issue to Bryan Lewis, the author of the doRedis and rredis packages. He replied that he is working to resolve the problem and will update the package on CRAN when it is fixed. In the meantime, you could downgrade to doRedis version 1.0.5, which doesn't have this problem.