OpenCL: Why does clCreateContext create threads?

I am debugging a segmentation fault related to OpenCL. Using gdb, I noticed that several threads are created as a result of clCreateContext.
GDB output:
print Before clCreateContext
[New Thread 0x7ffff299b700 (LWP 10807)]
[New Thread 0x7ffff219a700 (LWP 10808)]
[New Thread 0x7ffff1999700 (LWP 10809)]
[New Thread 0x7ffff1198700 (LWP 10810)]
[New Thread 0x7ffff0997700 (LWP 10811)]
[New Thread 0x7fffebfff700 (LWP 10812)]
[New Thread 0x7fffeb7fe700 (LWP 10813)]
print After clCreateContext
Does anyone have a clue what the reason for this is?
[I am using OpenCL 1.2 with an NVIDIA GPU on Ubuntu]

OpenCL implementations need to spawn threads internally to support different features, such as monitoring device kernel execution and memory transfers, or executing user callbacks. This behaviour is implementation-defined, so different implementations may spawn different numbers of threads.
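If you want to see this for yourself, here is a minimal sketch (my own illustration, not from the original post) that uses PyOpenCL instead of the C API and Linux's /proc/self/task to count the process's threads before and after context creation; the extra threads come from the vendor runtime, so which binding makes the call does not matter:

import os
import pyopencl as cl  # assumes PyOpenCL and an OpenCL platform are installed

def thread_count():
    # On Linux, each entry under /proc/self/task is one thread of this process.
    return len(os.listdir("/proc/self/task"))

print("threads before:", thread_count())
platform = cl.get_platforms()[0]   # e.g. the NVIDIA platform
device = platform.get_devices()[0]
ctx = cl.Context([device])         # goes through the same driver path as clCreateContext
print("threads after:", thread_count())

On an NVIDIA runtime the count typically jumps by several threads once the context exists, matching the [New Thread ...] lines gdb reports.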

Related

Dask job fails in Jupyter notebook cell with KilledWorker

I am running a join task in a Jupyter notebook which is producing many warnings from Dask about a possible memory leak before finally failing with a killed worker error:
2022-07-26 21:38:05,726 - distributed.worker_memory - WARNING - Worker is at 85% memory usage. Pausing worker. Process memory: 1.59 GiB -- Worker memory limit: 1.86 GiB
2022-07-26 21:38:06,319 - distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 1.59 GiB -- Worker memory limit: 1.86 GiB
2022-07-26 21:38:07,501 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:46137 (pid=538697) exceeded 95% memory budget. Restarting...
2022-07-26 21:38:07,641 - distributed.nanny - WARNING - Restarting worker
KilledWorker: ("('assign-6881b18750807133ba976bf463a98c23', 0)", <WorkerState 'tcp://127.0.0.1:46137', name: 0, status: closed, memory: 0, processing: 50>)
This happens when I run my code on a laptop with 32GB RAM (Kubuntu 20). Maybe I have not configured Dask correctly for the environment? I can watch the memory usage go up and down in the system monitor but at no point does it consume all the memory. How can I tell Dask to use all the cores and as much memory as it can manage? It seems to be running in single processor mode, maybe because I'm running on a laptop rather than a proper cluster?
For context: I'm joining two datasets, both text files, of sizes 25GB and 5GB. Both files have been read into Dask DataFrame objects using dd.read_fwf(), then I transform a string field on one of the frames and join (merge) on that field.
There are certain memory constraints for a particular worker, which you can read about here: https://distributed.dask.org/en/stable/worker-memory.html
Apart from this, you can try increasing the number of workers and threads per worker when initializing the Dask client, for example along the lines of the sketch below.
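A minimal sketch (the worker count and memory limit are illustrative values for a 32 GB laptop, not taken from the original post) that starts an explicit LocalCluster so the number of worker processes, threads per worker and per-worker memory budget are under your control rather than left to Dask's defaults:

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=4,             # worker processes; more, smaller workers spread the shuffle load
    threads_per_worker=2,
    memory_limit="7GB",      # per worker; KilledWorker fires when a worker exceeds its budget
)
client = Client(cluster)
print(client.dashboard_link)  # watch per-worker memory while the join runs

With more worker processes, each one holds a smaller share of the shuffled partitions, which makes it less likely that any single worker crosses the 95% budget that triggered the restart in the log above.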

Apache Ignite out of memory exception

I got an out of memory exception and Ignite crashed. Going through the Ignite logs, the last metrics show heap and off-heap memory usage of about 171 MB and 70 MB respectively, and about 10 seconds later the logs show the out of memory exception. The other metrics also look OK.
Below is the log snippet:
[01:04:29,690][INFO][grid-timeout-worker-#22][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=8a034404, uptime=39 days, 15:50:23.086]
^-- Cluster [hosts=1, CPUs=4, servers=1, clients=1, topVer=22, minorTopVer=0]
^-- Network [addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.28.230.222], discoPort=47500, commPort=47100]
^-- CPU [CPUs=4, curLoad=0.07%, avgLoad=0.15%, GC=0%]
^-- Heap [used=171MB, free=95.15%, comm=254MB]
^-- Off-heap memory [used=70MB, free=98.02%, allocated=3377MB]
^-- Page memory [pages=17878]
^-- sysMemPlc region [type=internal, persistence=true, lazyAlloc=false,
... initCfg=40MB, maxCfg=100MB, usedRam=0MB, freeRam=99.98%, allocRam=100MB, allocTotal=0MB]
^-- default region [type=default, persistence=true, lazyAlloc=true,
... initCfg=256MB, maxCfg=3177MB, usedRam=70MB, freeRam=97.78%, allocRam=3177MB, allocTotal=69MB]
^-- metastoreMemPlc region [type=internal, persistence=true, lazyAlloc=false,
... initCfg=40MB, maxCfg=100MB, usedRam=0MB, freeRam=99.95%, allocRam=0MB, allocTotal=0MB]
^-- TxLog region [type=internal, persistence=true, lazyAlloc=false,
... initCfg=40MB, maxCfg=100MB, usedRam=0MB, freeRam=100%, allocRam=100MB, allocTotal=0MB]
^-- volatileDsMemPlc region [type=user, persistence=false, lazyAlloc=true,
... initCfg=40MB, maxCfg=100MB, usedRam=0MB, freeRam=100%, allocRam=0MB]
^-- Ignite persistence [used=69MB]
^-- Outbound messages queue [size=0]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=7, qSize=0]
^-- Striped thread pool [active=0, idle=8, qSize=0]
[01:04:38,584][INFO][db-checkpoint-thread-#104][Checkpointer] Checkpoint started [checkpointId=41e99f38-7359-4af1-945f-61c92d2a5fb7, startPtr=WALPointer [idx=147, fileOff=11684440, len=381549], checkpointBeforeLockTime=9ms, checkpointLockWait=0ms, checkpointListenersExecuteTime=17ms, checkpointLockHoldTime=19ms, walCpRecordFsyncDuration=2ms, writeCheckpointEntryDuration=2ms, splitAndSortCpPagesDuration=0ms, pages=9, reason='timeout']
[01:04:38,619][SEVERE][db-checkpoint-thread-#104][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.IgniteCheckedException: Compound exception for CountDownFuture.]]
class org.apache.ignite.IgniteCheckedException: Compound exception for CountDownFuture.
at org.apache.ignite.internal.util.future.CountDownFuture.addError(CountDownFuture.java:72)
at org.apache.ignite.internal.util.future.CountDownFuture.onDone(CountDownFuture.java:46)
at org.apache.ignite.internal.util.future.CountDownFuture.onDone(CountDownFuture.java:28)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:478)
at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter.run(CheckpointPagesWriter.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Suppressed: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.addWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
at sun.nio.ch.SimpleAsynchronousFileChannelImpl.implWrite(Unknown Source)
at sun.nio.ch.AsynchronousFileChannelImpl.write(Unknown Source)
at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIO.write(AsyncFileIO.java:177)
at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO$5.run(AbstractFileIO.java:117)
at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.fully(AbstractFileIO.java:53)
at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:115)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.write(FilePageStore.java:748)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageReadWriteManagerImpl.write(PageReadWriteManagerImpl.java:116)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.write(FilePageStoreManager.java:636)
at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointManager.lambda$new$0(CheckpointManager.java:175)
at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter$1.writePage(CheckpointPagesWriter.java:266)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.copyPageForCheckpoint(PageMemoryImpl.java:1343)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.checkpointWritePage(PageMemoryImpl.java:1250)
at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter.writePages(CheckpointPagesWriter.java:207)
at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter.run(CheckpointPagesWriter.java:151)
... 3 more
[01:04:38,620][SEVERE][db-checkpoint-thread-#104][FailureProcessor] No deadlocked threads detected.
[01:04:38,749][SEVERE][db-checkpoint-thread-#104][FailureProcessor] Thread dump at 2022/02/06 01:04:38 CST
unable to create new native thread
This seems to be a non-Ignite exception and is most likely about your system configuration.
Check your process limits (open file descriptors and max user processes) by running the ulimit -a command and increase them if required. The recommended value is 32768 or above. If an adjustment is needed, it can be made either by running ulimit -n 32768 -u 32768 or by modifying /etc/security/limits.conf, for example with entries like the ones below.
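As an illustration (the user name ignite is just a placeholder for whatever account runs the Ignite process), the persistent version of those limits in /etc/security/limits.conf could look like this, where nofile is the open-files limit and nproc is the max-user-processes limit that "unable to create new native thread" usually points at:

ignite   soft   nofile   32768
ignite   hard   nofile   32768
ignite   soft   nproc    32768
ignite   hard   nproc    32768

Note that limits.conf is applied by PAM at login, so if Ignite is started by systemd, the equivalent LimitNOFILE/LimitNPROC settings in the service unit apply instead.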

OpenMPI 4.0.5 fails to distribute tasks to more than one node

We are having trouble with OpenMPI 4.0.5 on our cluster: it works as long as only one node is requested, but as soon as more than one is requested (e.g. mpirun -np 24 ./hello_world with --ntasks-per-node=12), it crashes and we get the following error message:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:
./hello_world
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
I have tried using --oversubscribe, but this still only uses one node, even though smaller jobs run that way. I have also tried specifically requesting nodes (e.g. -host node36,node37), but this results in the following error message:
[node37:16739] *** Process received signal ***
[node37:16739] Signal: Segmentation fault (11)
[node37:16739] Signal code: Address not mapped (1)
[node37:16739] Failing at address: (nil)
[node37:16739] [ 0] /lib64/libpthread.so.0(+0xf5f0)[0x2ac57d70e5f0]
[node37:16739] [ 1] /lib64/libc.so.6(+0x13ed5a)[0x2ac57da59d5a]
[node37:16739] [ 2] /usr/lib64/openmpi/lib/libopen-rte.so.12(orte_daemon+0x10d7)[0x2ac57c6c4827]
[node37:16739] [ 3] orted[0x4007a7]
[node37:16739] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac57d93d505]
[node37:16739] [ 5] orted[0x400810]
[node37:16739] *** End of error message ***
The cluster has 59 nodes. Slurm 19.05.0 is used as the scheduler and gcc 9.1.0 to compile.
I don't have much experience with MPI, so any help would be much appreciated! Maybe someone is familiar with this error and could point me towards what the problem might be.
Thanks for your help,
Johanna

I get "There are not enough slots available in the system" when I run mpi

I am a high school student. I ran into an error while studying and coding the basics of MPI. I searched the internet and tried everything, but I couldn't figure it out.
The code is really simple, and I understand it well; there is no problem with the code itself.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int num_procs, my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    printf("Hello world! I'm rank %d among %d processes.\n", my_rank, num_procs);
    MPI_Finalize();
    return 0;
}
But there was a problem with running MPI. It works well when I type it like this:
mpirun -np 2 ./hello
Hello world! I'm rank 1 among 2 processes.
Hello world! I'm rank 0 among 2 processes.
This error occurs with -np 3:
mpirun -np 3 ./hello
There are not enough slots available in the system to satisfy the 3
slots that were requested by the application:
./hello
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
My laptop has an Intel i5 CPU with 2 cores and 4 threads. Did this problem happen because there are only 2 cores? I don't exactly understand this part.
There is not much material about MPI in Korea, so I am always googling and studying. If that's the cause, is there any way to increase the number of processes? Other people wrote about getting this error at -np 17; how did they increase the process count to double digits? Is it a matter of what the computer is capable of? Please explain it simply so that I can understand it well.
My laptop has an Intel i5 CPU with 2 cores and 4 threads. Did this problem happen because there are only 2 cores?
Yes. By default Open MPI uses the number of cores as the number of slots, so since you only have 2 cores you can launch a maximum of 2 processes.
If that's the cause, is there any way to increase the number of processes?
Yes. If you use --use-hwthread-cpus with your mpirun command, you can run up to 4 MPI processes on your laptop, since it has 4 hardware threads. Try running: mpirun -np 4 --use-hwthread-cpus a.out
You can also use the --oversubscribe option to launch more processes than the available cores/threads. For example, try: mpirun -np 10 --oversubscribe a.out

MPI + InfiniBand: too many connections

I am running an MPI application on a cluster, using 4 nodes with 64 cores each.
The application performs an all-to-all communication pattern.
Running the application as follows works fine:
$: mpirun -npernode 36 ./Application
Adding one more process per node makes the application crash:
$: mpirun -npernode 37 ./Application
--------------------------------------------------------------------------
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.
For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: laser045
Local device: qib0
Queue pair type: Reliable connected (RC)
--------------------------------------------------------------------------
[laser045:15359] *** An error occurred in MPI_Issend
[laser045:15359] *** on communicator MPI_COMM_WORLD
[laser045:15359] *** MPI_ERR_OTHER: known error not in list
[laser045:15359] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
[laser040:49950] [[53382,0],0]->[[53382,1],30] mca_oob_tcp_msg_send_handler: writev failed: Connection reset by peer (104) [sd = 163]
[laser040:49950] [[53382,0],0]->[[53382,1],21] mca_oob_tcp_msg_send_handler: writev failed: Connection reset by peer (104) [sd = 154]
--------------------------------------------------------------------------
mpirun has exited due to process rank 128 with PID 15358 on
node laser045 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[laser040:49950] 4 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
[laser040:49950] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[laser040:49950] 4 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
EDIT: added some source code of the all-to-all communication pattern:
// Send data to all other ranks
for(unsigned i = 0; i < (unsigned)size; ++i){
    if((unsigned)rank == i){
        continue;
    }
    MPI_Request request;
    MPI_Issend(&data, dataSize, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &request);
    requests.push_back(request);
}

// Recv data from all other ranks
for(unsigned i = 0; i < (unsigned)size; ++i){
    if((unsigned)rank == i){
        continue;
    }
    MPI_Status status;
    MPI_Recv(&recvData, recvDataSize, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
}

// Finish communication operations
for(MPI_Request &r: requests){
    MPI_Status status;
    MPI_Wait(&r, &status);
}
Is there something I can do as a cluster user, or some advice I can give the cluster admin?
The mca_oob_tcp_msg_send_handler error line may indicate that the node corresponding to a receiving rank died (ran out of memory or received a SIGSEGV):
http://www.open-mpi.org/faq/?category=tcp#tcp-connection-errors
The OOB (out-of-band) framework in Open MPI is used for control messages, not for your application's messages. Indeed, messages typically go through byte transfer layers (BTLs) such as self, sm, vader, openib (InfiniBand), and so on.
The output of 'ompi_info -a' is useful in that regard.
Finally, the question does not specify whether the InfiniBand hardware vendor is Mellanox, so the XRC option may not work (for instance, Intel/QLogic InfiniBand does not support this option).
The error is connected to the buffer sizes of the MPI message queues, as discussed here:
http://www.open-mpi.org/faq/?category=openfabrics#ib-xrc
The following environment setting solved my problem:
$ export OMPI_MCA_btl_openib_receive_queues="P,128,256,192,128:S,65536,256,192,128"
