Task.await timeout which is causing GenServer to terminate - asynchronous

My GenServer terminates after a little while, after sending a few HTTP requests, and I can't understand the reason:
[error] GenServer MyGenServer terminating
** (stop) exited in: Task.await(%Task{owner: #PID<0.420.0>, pid: #PID<0.1054.0>, ref: #Reference<....>}, 5000)
** (EXIT) time out
(elixir) lib/task.ex:416: Task.await/2
(elixir) lib/enum.ex:966: Enum.flat_map_list/2
(my_app123) lib/my_genserver.ex:260: MyApp.MyGenServer.do_work/1
(my_app123) lib/my_genserver.ex:180: MyApp.MyGenServer.handle_info/2
(stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:683: :gen_server.handle_msg/5
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: :tick
State: [%{var1: "fdsafdsfd", var2: "43243242"}]
A chunk of the code:
# it's called from handle_info
def do_work(some_data) do
  Enum.map(some_data, fn(x) ->
    Task.async(fn ->
      case HTTPoison.post(.....) do
        # ...........
Is "Task.async" causing the timeout? But why? Yes, it can take more than 5 seconds to complete, but why does it cause an exception which then terminates GenServer? How to fix it?
About await:
If the timeout is exceeded, await will exit; however, the task will continue to run. When the calling process exits, its exit signal will terminate the task if it is not trapping exits.

As the documentation says, Task.await has a default timeout of 5 seconds, after which it exits (terminates) the calling process. You can increase the timeout like this:
Task.await(task, 60000) # 1 minute
and you can remove the timeout completely by passing :infinity as the timeout instead of a number:
Task.await(task, :infinity)
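For illustration, a minimal sketch of do_work/1 with the longer timeout applied (url/1 and body/1 are hypothetical placeholders, since the original HTTPoison call is elided):

# hypothetical reconstruction: url/1 and body/1 stand in for the elided HTTPoison arguments
def do_work(some_data) do
  some_data
  |> Enum.map(fn x ->
    Task.async(fn -> HTTPoison.post(url(x), body(x), []) end)
  end)
  |> Enum.map(fn task ->
    # wait up to 60 seconds per task instead of the default 5; :infinity removes the limit
    Task.await(task, 60_000)
  end)
end

Note that the timeout applies per awaited task, and a task that still overruns it will again take the calling GenServer down. If an overrun should never crash the caller, the Task docs point at Task.yield/2 (which returns nil on timeout) or Task.Supervisor.async_nolink/2 instead.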

Related

Airflow (MWAA) tasks receiving SIGTERM but task is externally set to success

Many of our Airflow (MWAA) tasks are receiving SIGTERM:
[2022-10-06 06:23:48,347] {{logging_mixin.py:104}} INFO - [2022-10-06 06:23:48,347] {{local_task_job.py:188}} WARNING - State of this instance has been externally set to success. Terminating instance.
[2022-10-06 06:23:48,348] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 2740
[2022-10-06 06:23:55,113] {{taskinstance.py:1265}} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-10-06 06:23:55,164] {{process_utils.py:66}} INFO - Process psutil.Process(pid=2740, status='terminated', exitcode=1, started='06:23:42') (2740) terminated with exit code 1
It happens to a few of our tasks, and it would not be a big deal if those tasks were not marked as SUCCESS:
State of this instance has been externally set to success. Terminating instance
We understand that this can happen because of a lack of memory on the worker. We tried increasing the number of workers without any success. What are our options to avoid having tasks externally set and killed like this?
When tasks are killed, they are normally marked as failed. Here it seems to be the other way around: the task seems to get marked as a success by something or someone, after which the job is stopped/killed.
I am not aware of how MWAA is deployed, but I would have a look at the action logging to see what or who is marking these tasks as success.

Airflow task succeeds but returns SIGTERM

I have a task in Airflow 2.1.2 which finishes with success status, but after that the log shows a SIGTERM:
[2021-12-07 06:11:45,031] {python.py:151} INFO - Done. Returned value was: None
[2021-12-07 06:11:45,224] {taskinstance.py:1204} INFO - Marking task as SUCCESS. dag_id=DAG_ID, task_id=TASK_ID, execution_date=20211207T050000, start_date=20211207T061119, end_date=20211207T061145
[2021-12-07 06:11:45,308] {local_task_job.py:197} WARNING - State of this instance has been externally set to success. Terminating instance.
[2021-12-07 06:11:45,309] {taskinstance.py:1265} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-12-07 06:11:45,310] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 6666
[2021-12-07 06:11:45,310] {taskinstance.py:1284} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-12-07 06:11:45,362] {process_utils.py:66} INFO - Process psutil.Process(pid=6666, status='terminated', exitcode=1, started='06:11:19') (6666) terminated with exit code 1
As you can see, the first row returns Done, and the previous rows of this log showed that the whole script worked fine and the data was inserted into the data warehouse.
In line number 8 it shows a SIGTERM because some external trigger marked the task as success, but I am sure that nobody used the API, the CLI, or the UI to mark it as success.
Any idea how to avoid this and why it could be happening?
I don't know if increasing AIRFLOW_CORE_KILLED_TASK_CLEANUP_TIME could fix it, but I would like to understand what is going on.
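For reference, a minimal sketch of where that knob lives, assuming the standard AIRFLOW__SECTION__KEY environment-variable naming: in airflow.cfg it is the [core] killed_task_cleanup_time option, i.e. the grace period in seconds a task gets between the SIGTERM and the follow-up SIGKILL.

# airflow.cfg
[core]
killed_task_cleanup_time = 60

# or equivalently, as an environment variable (note the double underscores)
AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME=60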

Airflow Papermill operator: task externally skipped after 60 minutes

I am using Airflow in a Docker container. I run a DAG with multiple Jupyter notebooks, and I get the following error every time after 60 minutes:
[2021-08-22 09:15:15,650] {local_task_job.py:198} WARNING - State of this instance has been externally set to skipped. Terminating instance.
[2021-08-22 09:15:15,654] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 277
[2021-08-22 09:15:15,655] {taskinstance.py:1284} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-08-22 09:15:18,284] {taskinstance.py:1501} ERROR - Task failed with exception
I tried to tweak the config file but could not find the right option to remove the 1-hour timeout.
Any help would be appreciated.
The default is no timeout. When your DAG defines dagrun_timeout=timedelta(minutes=60) and the execution time exceeds 60 minutes, the active task is stopped and "State of this instance has been externally set to skipped" is logged.
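A minimal sketch of the corresponding fix, with a hypothetical dag_id: raise dagrun_timeout, or omit the argument entirely so runs are not cut off at 60 minutes.

from datetime import datetime, timedelta
from airflow import DAG

with DAG(
    dag_id="notebooks_dag",                # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    dagrun_timeout=timedelta(hours=3),     # raise the limit, or drop this line for no timeout
) as dag:
    ...                                    # PapermillOperator tasks go here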

Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via Slurm. The job ran for 12 hours and was working as expected. Then I got Data unpack would read past end of buffer in file util/show_help.c at line 501. It is usual for me to get errors like ORTE has lost communication with a remote daemon, but those usually appear at the beginning of the job; they are annoying but still do not cause as much time loss as getting an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
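As the first help block above suggests, the openib warnings can usually be silenced either by letting UCX drive the InfiniBand ports or by re-enabling the openib BTL. A minimal sketch of both mpirun invocations, with ./my_app as a hypothetical binary (this addresses the warnings, not necessarily the show_help.c error itself):

mpirun --mca pml ucx --mca btl ^openib ./my_app
# or, to keep using the openib BTL as the help text describes:
mpirun --mca btl_openib_allow_ib true ./my_app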

How to extend timeout when waiting for citrus async action to complete?

I'm using Citrus to test a process that invokes a callback after performing several steps.
I've got the following sequence working:
-> httpClient kicks process
<- SUT answers OK
<-> Several Additional Steps
<- SUT invokes httpServer
-> httpServer answers OK
I'm now trying to make it more generic by using the Citrus async container to wait for the SUT invocation in parallel with the execution of the additional steps.
async(
<- SUT invokes httpServer
-> httpServer answers OK
)
-> httpClient kicks process
<- SUT answers OK
<-> Several Additional Steps
The problem I'm facing is that after the last additional step executes, the async container does not seem to wait long enough for my SUT to invoke it. It seems to wait at most 10 seconds.
See below for the output and the code snippet (without the additional steps, to keep it simple).
14:14:46,423 INFO port.LoggingReporter|
14:14:46,423 DEBUG port.LoggingReporter| TEST STEP 3/4 SUCCESS
14:14:46,423 INFO port.LoggingReporter|
14:14:46,423 DEBUG port.LoggingReporter| TEST STEP 4/4: echo
14:14:46,423 INFO actions.EchoAction| VM Creation processInstanceID: 3543
14:14:46,423 INFO port.LoggingReporter|
14:14:46,423 DEBUG port.LoggingReporter| TEST STEP 4/4 SUCCESS
14:14:46,530 DEBUG citrus.TestCase| Wait for test actions to finish properly ...
14:14:47,530 DEBUG citrus.TestCase| Wait for test actions to finish properly ...
14:14:48,530 DEBUG citrus.TestCase| Wait for test actions to finish properly ...
14:14:49,528 DEBUG citrus.TestCase| Wait for test actions to finish properly ...
14:14:50,529 DEBUG citrus.TestCase| Wait for test actions to finish properly ...
14:14:51,530 DEBUG citrus.TestCase| Wait for test actions to finish properly ...
14:14:52,526 DEBUG citrus.TestCase| Wait for test actions to finish properly ...
14:14:53,529 DEBUG citrus.TestCase| Wait for test actions to finish properly ...
14:14:54,525 DEBUG citrus.TestCase| Wait for test actions to finish properly ...
14:14:55,525 DEBUG citrus.TestCase| Wait for test actions to finish properly ...
14:14:56,430 INFO port.LoggingReporter|
14:14:56,430 ERROR port.LoggingReporter| TEST FAILED StratusActorSSL.SRCreateVMInitGoodParamCentOST004 <com.grge.citrus.cmptest.stratus> Nested exception is: com.consol.citrus.exceptions.CitrusRuntimeException: Failed to wait for nested test actions to finish properly
at com.consol.citrus.TestCase.finish(TestCase.java:266)
Code snippet
async()
    .actions(
        http().server(extServer)
            .receive()
            .post("/api/SRResolved")
            .contentType("application/json;charset=UTF-8")
            .accept("text/plain,application/json,application/*+json,*/*"),
        http().server("extServer")
            .send()
            .response(HttpStatus.OK)
            .contentType("application/json")
    );

http()
    .client(extClientSSL)
    .send()
    .post("/bpm/process/key/SRCreateVMTest")
    .messageType(MessageType.JSON)
    .contentType(ContentType.APPLICATION_JSON.getMimeType());

http()
    .client(extClientSSL)
    .receive()
    .response(HttpStatus.CREATED)
    .messageType(MessageType.JSON)
    .extractFromPayload("$.processInstanceID", "processId");

echo(" processInstanceID: ${processId}");
Another update... hopefully this might help other Citrus users.
I finally implemented the behaviour I wanted using the parallel Citrus container, as shown below. Nevertheless, I'll leave this question open for a few days, as it does not answer my initial question...
parallel().actions(
    sequential().actions(
        http().server(extServer)
            .receive()
            .post("/api/SRResolved")
            .contentType("application/json;charset=UTF-8")
            .accept("text/plain,application/json,application/*+json,*/*"),
        http().server("extServer")
            .send()
            .response(HttpStatus.OK)
            .contentType("application/json")
    ),
    sequential().actions(
        http()
            .client(extClientSSL)
            .send()
            .post("/bpm/process/key/SRCreateVMTest")
            .messageType(MessageType.JSON)
            .contentType(ContentType.APPLICATION_JSON.getMimeType()),
        http()
            .client(stratusClientSSL)
            .receive()
            .response(HttpStatus.CREATED)
            .messageType(MessageType.JSON)
            .extractFromPayload("$.processInstanceID", "processId"),
        echo("VM Creation processInstanceID: ${processId}")
    )
);
The more I think about it, the more I believe this is a bug: when using async (as described above), I expect the async part (and thus the test) to continue until either the timeout given on the HTTP server (in my case 60 sec) expires or the expected request is received from the SUT, not to stop after an arbitrary 10-second delay following the end of the non-async part of the test case. Unless, of course, I have missed something about the async container's features and objectives.
