Using MPI, how do you wait for threads to finish?
For example:
for (int i = localstart; i < localend; i++)
{
// do stuff that is computationally intensive
}
// I need to wait for all other threads to finish here
if (rank == 0) do_something();
If by threads you meant processes/ranks, then the answer is MPI_Barrier.
But look at the other collective operations too: they might make sense in your application, and offer better performance than hand-coding communication. For example, you could use MPI_Allgather to communicate all data to all ranks, and so on.
If you meant threads (like pthreads), then you'd have to use whatever the threading library offers.
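For the pattern in the question, a minimal sketch (assuming the C bindings, and that rank, the loop bounds and do_something() come from the surrounding code) would be:

#include <mpi.h>

for (int i = localstart; i < localend; i++)
{
    // do stuff that is computationally intensive
}

// Block until every rank in the communicator has reached this point.
MPI_Barrier(MPI_COMM_WORLD);

if (rank == 0) do_something();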
Related
It might be a stupid question, but is there a way to return asynchronously from a kernel? For example, I have a kernel that does a first stream compaction whose result is output to the user, but before returning it must do a second stream compaction to update its internal structure.
Is there a way to return control to the user after the first stream compaction is done, while the GPU continues its second stream compaction in the background? Of course, the second stream compaction works only on shared memory and global memory, but nothing the user should retrieve.
I can't use Thrust.
A GPU kernel does not, in itself, take control from the "user", i.e. from CPU threads on the system with the GPU.
However, with CUDA's runtime, the default way to invoke a GPU kernel places it on the default stream:
my_kernel<<<my_grid_dims,my_block_dims,dynamic_shared_memory_size>>>(args,go,here);
The launch call itself returns immediately, but work on the default stream serializes with subsequent device operations, so retrieving the results (e.g. with a blocking cudaMemcpy) makes your thread wait until the kernel's execution concludes. You can instead use non-default streams. These are hardware-supported execution queues on which you can enqueue work (memory copying, kernel execution etc.) asynchronously, just like you asked.
Your launch in this case may look like:
cudaStream_t my_stream;
cudaError_t result = cudaStreamCreateWithFlags(&my_stream, cudaStreamNonBlocking);
if (result != cudaSuccess) { /* error handling */ }
my_kernel<<<my_grid_dims,my_block_dims,dynamic_shared_memory_size,my_stream>>>(args,go,here);
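You can then return control to the user right away and check back on the stream later; cudaStreamQuery polls without blocking, while cudaStreamSynchronize waits. A sketch, reusing my_stream from above:

// Non-blocking: cudaSuccess if all work on the stream has finished,
// cudaErrorNotReady if something (e.g. the second compaction) is still running.
if (cudaStreamQuery(my_stream) == cudaSuccess) {
    // the first compaction's results are ready for the user
}

// Blocking: wait for everything enqueued on the stream to complete.
cudaStreamSynchronize(my_stream);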
There are lots of resources on using streams; try this blog post for starters. The CUDA programming guide also has a large section on asynchronous execution.
Streams and various libraries
Thrust has offered asynchronous functionality for a while, using thrust::future and other constructs. See here.
My own Modern-C++ CUDA API wrappers make it somewhat easier to work with streams, relieving you of the need to check for errors all the time and to remember to destroy streams and release memory before they go out of scope. See this example; the syntax looks something like this:
auto stream = device.create_stream(cuda::stream::async);
stream.enqueue.copy(d_a.get(), a.get(), nbytes);
stream.enqueue.kernel_launch(my_kernel, launch_config, d_a.get(), more, args);
(and errors throw an exception)
I know that MPI_SENDRECV allows one to overcome the problem of deadlocks (which can arise when we use the classic MPI_SEND and MPI_RECV functions).
I would like to know if MPI_SENDRECV(sent_to_process_1, receive_from_process_0) is equivalent to:
MPI_ISEND(sent_to_process_1, request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request1)
MPI_WAIT(request2)
with asynchronous MPI_ISEND and MPI_IRECV functions?
From what I have seen, MPI_ISEND and MPI_IRECV create a fork (i.e. 2 processes). So if I follow this logic, the first call of MPI_ISEND generates 2 processes. One does the communication and the other calls MPI_IRECV, which forks itself into 2 processes.
But once the communication of the first MPI_ISEND is finished, does the second process call MPI_IRECV again? With this logic, the above equivalence doesn't seem to be valid...
Maybe I should change to this:
MPI_ISEND(sent_to_process_1, request1)
MPI_WAIT(request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request2)
But I think this could also create deadlocks.
Could anyone give me another solution using MPI_ISEND, MPI_IRECV and MPI_WAIT to get the same behaviour as MPI_SENDRECV?
There are some dangerous lines of thought in the question and other answers. When you start a non-blocking MPI operation, the MPI library doesn't create a new process/thread/etc. You're thinking of something more like a parallel region of OpenMP, I believe, where new threads/tasks are created to do some work.
In MPI, starting a non-blocking operation is like telling the MPI library that you have some things that you'd like to get done whenever MPI gets a chance to do them. There are lots of equally valid options for when they are actually completed:
It could be that they all get done later when you call a blocking completion function (like MPI_WAIT or MPI_WAITALL). These functions guarantee that when the blocking completion call is done, all of the requests that you passed in as arguments are finished (in your case, the MPI_ISEND and the MPI_IRECV). Regardless of when the operations actually take place (see next few bullets), you as an application can't consider them done until they are actually marked as completed by a function like MPI_WAIT or MPI_TEST.
The operations could get done "in the background" during another MPI operation. For instance, if you do something like the code below:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
The MPI_ISEND and the MPI_IRECV would probably actually do the data transfers in the background during the MPI_BARRIER. This is because, as an application, you are transferring "control" of your application to the MPI library during the MPI_BARRIER call. This lets the library make progress on any ongoing MPI operations it wants. Most likely, by the time the MPI_BARRIER is complete, so are most of the other outstanding operations.
Some MPI libraries allow you to specify that you want a "progress thread". This tells the MPI library to start up another thread (note that thread != process) in the background that will actually do the MPI operations for you while your application continues in the main thread.
Remember that all of these in the end require that you actually call MPI_WAIT or MPI_TEST or some other function like them to ensure that your operation is actually complete, but none of these spawn new threads or processes to do the work for you when you call your nonblocking functions. Those calls really just act like you stick the operations on a to-do list (which, in reality, is how most MPI libraries implement them).
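If you want to keep computing while a request completes, MPI_TEST lets you poll without blocking. A sketch, reusing req[0] from the example above (do_other_work() is a made-up stand-in for computation that doesn't touch the message buffers):

int done = 0;
while (!done)
{
    // Returns immediately; also gives the MPI library a chance to make progress.
    MPI_Test(&req[0], &done, MPI_STATUS_IGNORE);
    if (!done)
        do_other_work();
}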
The best way to think of how MPI_SENDRECV is implemented is to do two non-blocking calls with one completion function:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
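Put together as a self-contained sketch, for two ranks exchanging one integer (buffer and variable names are made up for illustration):

#include <mpi.h>

int rank, other, sendbuf, recvbuf;
MPI_Request req[2];

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
other = 1 - rank;    // partner rank in a 2-process job
sendbuf = rank;

// The MPI_Sendrecv equivalent: post both operations, then complete both.
MPI_Isend(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &req[0]);
MPI_Irecv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &req[1]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);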
How I usually do this on process i communicating with process i+1:
mpi_isend(send_to_process_iPlus1, requests(1))
mpi_irecv(recv_from_process_iPlus1, requests(2))
...
mpi_waitall(2, requests)
You can see how ordering your commands this way with non-blocking communication allows you (during the ... above) to perform any computation that does not rely on the send/recv buffers, so it runs concurrently with your communication. Overlapping computation with communication is often crucial for maximizing performance.
mpi_sendrecv, on the other hand (while avoiding any deadlock issues), is still a blocking operation. Thus, your program must remain in that routine during the entire send/recv process.
Final points: you can initialize more than 2 requests and wait on all of them in the same way, using the above structure. For instance, it's quite easy to start communication with process i-1 as well and wait on all 4 of the requests, as in the sketch below. Using mpi_sendrecv you must always have a paired send and receive; what if you only want to send?
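For instance, a two-sided neighbour exchange might look like this in C (buffer names are made up; the bounds checks for the first and last rank are omitted):

MPI_Request requests[4];

MPI_Isend(send_up,   n, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &requests[0]);
MPI_Irecv(recv_up,   n, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &requests[1]);
MPI_Isend(send_down, n, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &requests[2]);
MPI_Irecv(recv_down, n, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &requests[3]);

// ... computation that doesn't touch the four buffers ...

MPI_Waitall(4, requests, MPI_STATUSES_IGNORE);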
I am working on a monitoring project that monitors different targets, including switches, routers and PCs, each at its specific interval (mostly 60 seconds), via SNMP. Depending on the number of elements being monitored, some of the monitoring operations take a long time, maybe more than 30 seconds. The problem appears when the number of targets exceeds 2000 and the majority of the monitoring operations take a long time: the monitoring operations overlap, apparently because there is no available thread to run them, even though the CPU has four cores. I tried the options below.
I found that there is a gap between when a monitor enters the queue and when it really starts to run; in fact, (startMonitoringOperation - startMonitoring) = big interval.
This big interval can be more than 2 minutes.
I want to know the best practice for such synchronous long-running processes, and how I can decrease the gap mentioned above.
for (int i = 0; i < MonitorsQueue.Count; i++)
{
    var startMonitoring = DateTime.Now;
    // (the three options below are alternatives I tried, not meant to run together)
    // Option 1:
    ThreadPool.QueueUserWorkItem(new System.Threading.WaitCallback(DoMonitor), monitorObject);
    // Option 2:
    var task = new Task(() => DoMonitor(monitorObject), TaskCreationOptions.LongRunning);
    // Option 3:
    var task = new Task(() => DoMonitor(monitorObject));
}

private void DoMonitor(object monitorObject) // this method takes long on some occasions
{
    var startMonitoringOperation = DateTime.Now;
    // monitoring operation
}
That seems to be a pool-size problem. Increase the number of threads in the ThreadPool class (to, say, 3000) to avoid this. However, I suggest you use the async/await feature of the C# language instead of using the thread pool to run such threads, as they are not CPU-intensive and are just waiting for I/O to accomplish their job.
With async/await, no thread is consumed during the wait time, which is exactly what you can benefit from here.
Good luck,
Ahmad
I'm getting the titular error:
mcfork(): Unable to fork: Cannot allocate memory
after trying to run a function with mclapply, but top says I'm only at 51% memory usage.
This is on an EC2 instance, but I do have up-to-date R.
Does anyone know what else can cause this error?
Thanks,
-N
The issue might be exactly what the error message suggests: there isn't enough memory to fork and create parallel processes.
R essentially needs to create a copy of everything that's in memory for each individual process (to my knowledge it doesn't utilize shared memory). If you are already using 51% of your RAM with a single process, then you don't have enough memory to create a second process, since that would require 102% of your RAM in total.
Try:
Using fewer cores - If you were trying to use 4 cores, it's possible you have enough RAM to support 3 parallel threads, but not 4. registerDoMC(2), for example, will set the number of parallel threads to 2 (if you are using the doMC parallel backend).
Using less memory - without seeing the rest of your code, it's hard to suggest ways to accomplish this. One thing that might help is figuring out which R objects are taking up all the memory (Determining memory usage of objects?) and then removing any objects from memory that you don't need (rm(my_big_object))
Adding more RAM - if all else fails, throw hardware at it so you have more capacity.
Sticking to single threading - multithreaded processing in R is a tradeoff of CPU and memory. It sounds like in this case you may not have enough memory to support the CPU power you have, so the best course of action might be to just stick to a single core.
The R function mcfork is only a wrapper around the syscall fork (BTW, the man page says that this call is itself a wrapper around clone).
I created a simple C++ program to test fork's behaviour:
#include <stdio.h>
#include <unistd.h>
#include <vector>
int main(int argc, char **argv)
{
printf("--beginning of program\n");
std::vector<std::vector<int> > l(50000, std::vector<int>(50000, 0)); // allocate ~10 GB on the heap (50,000 x 50,000 ints)
// while (true) {}
int counter = 0;
// fork three times: after these calls, 2^3 = 8 processes are running in total
pid_t pid = fork();
pid = fork();
pid = fork();
if (pid == 0)
{
// child process
int i = 0;
for (; i < 5; ++i)
{
printf("child process: counter=%d\n", ++counter);
}
}
else if (pid > 0)
{
// parent process
int j = 0;
for (; j < 5; ++j)
{
printf("parent process: counter=%d\n", ++counter);
}
}
else
{
// fork failed
printf("fork() failed!\n");
return 1;
}
printf("--end of program--\n");
while (true) {}
return 0;
}
First, the program allocates about 10 GB of data on the heap (50,000 × 50,000 ints).
Then, it forks three times, yielding 2^3 = 8 processes in total, and enters an infinite loop (so the processes are easy to spot in a task manager) until killed by the user.
Here are my observations:
For the fork to succeed, at least 51% of memory needs to be free on my system, but this includes swap. You can change this by editing the /proc/sys/vm/overcommit_* proc files.
As expected, none of the children take more memory, so this 51% free memory remains free throughout the course of the program, and all subsequent forks also don't fail.
The memory is shared between the forks, so it gets reclaimed only after you kill the last child.
Memory fragmentation issue
You should not be concerned about any layer of memory fragmentation with respect to fork. R's memory fragmentation doesn't apply here, because fork operates on virtual memory. You shouldn't worry about fragmentation of physical memory either, because virtually all modern operating systems use virtual memory (which consequently enables them to use swap). The only memory fragmentation that might be an issue is fragmentation of the virtual address space, but AFAIK on Linux the virtual address space is 2^47 bytes, which is more than huge, and for many decades you should not have any problem finding a contiguous region of any practical size.
Summary:
Make sure you have more swap than physical memory, and as long as your computations don't actually need more memory than you have in RAM, you can mcfork them as much as you want.
Or, if you are willing to risk the stability (memory starvation) of the whole system, try echo 1 >/proc/sys/vm/overcommit_memory as root on Linux.
Or better yet (safer):
echo 2 >/proc/sys/vm/overcommit_memory
echo 100 >/proc/sys/vm/overcommit_ratio
You can read more about overcommitting here: https://www.win.tue.nl/~aeb/linux/lk/lk-9.html
A note for those who want to use a GUI such as RStudio.
If you want to take advantage of parallel processing, it is advised not to use a GUI, as that interferes with the multi-process communication between your code and the GUI programme. Here is an excerpt from the registerDoMC help page (doMC package) in R:
The multicore functionality, originally written by Simon Urbanek and subsumed in the parallel package in R 2.14.0, provides functions for parallel execution of R code on machines with multiple cores or processors, using the system fork call to spawn copies of the current process.
The multicore functionality, and therefore registerDoMC, should not be used in a GUI environment, because multiple processes then share the same GUI.
I solved a similar error experienced by the OP by disabling registerDoMC(cores = n) when running my program from RStudio. Multiprocessing works best with base R. Hope this helps.
I had the same error while using caret to train an rpart model on a system with 64 GB of memory, with parallel processing using 6 cores on a 7-core machine. Changing to 5 cores made the problem go away:
library(doMC)
registerDoMC(5)
I'm running into a similar problem right now. I won't claim to know the right answer. Both of the above answers propose courses of action that may work, especially if your forks are creating additional write demands on memory at the same time. However, I have been thinking that something else might be the source of the difficulty, viz. memory fragmentation. See https://raspberrypi.stackexchange.com/questions/7856/log-says-i-cant-allocate-memory-but-i-have-more-than-half-of-my-memory-free for a discussion of a case where a user on a Unix-alike sees free memory but hits an out-of-memory error due to memory fragmentation. This seems like a likely culprit for R in particular because of R's love for contiguous blocks of RAM. However, per ?Memory-limits the requirement should be about address space rather than RAM itself, so this could be incorrect (especially on a 64-bit machine); YMMV.
To the best of my knowledge, event-driven programs require a main loop such as
while (1) {
}
I am just curious whether this while loop can cause high CPU usage. Is there any other way to implement event-driven programs without using a main loop?
Your example is misleading. Usually, an event loop looks something like this:
Event e;
while ((e = get_next_event()) != E_QUIT)
{
handle(e);
}
The crucial point is that the function call to our fictitious get_next_event() pumping function will be generous and encourage a context switch or whatever scheduling semantics apply to your platform, and if there are no events, the function would probably allow the entire process to sleep until an event arrives.
So in practice there's nothing to worry about, and no, there's not really any alternative to an unbounded loop if you want to process an unbounded amount of information during your program's runtime.
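To make the "sleep until an event arrives" part concrete, here is a minimal sketch of such a pumping function in C++, using a condition variable (the Event type and the post_event producer are made-up placeholders matching the snippet above):

#include <condition_variable>
#include <mutex>
#include <queue>

struct Event { int type; };   // hypothetical stand-in for a real event type

std::queue<Event> events;
std::mutex m;
std::condition_variable cv;

// Blocks without spinning until an event is available.
Event get_next_event()
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return !events.empty(); });  // sleeps, burning no CPU
    Event e = events.front();
    events.pop();
    return e;
}

// Called by producers, e.g. an I/O or timer thread.
void post_event(Event e)
{
    {
        std::lock_guard<std::mutex> lock(m);
        events.push(e);
    }
    cv.notify_one();
}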
Usually, the problem with a loop like this is that while it's doing one piece of work, it can't be doing anything else (e.g. Windows SDK's old 'cooperative' multitasking). The next naive step up from this is generally to spawn a thread for each piece of work, but that's incredibly dangerous. Most people would end up with an executor that generally has a thread pool inside. Then, the handle call actually just enqueues the work, and the next available thread dequeues and executes it. The number of concurrent threads remains fixed at the total number of worker threads in the pool, and when threads don't have anything to do, they are not eating CPU.
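A minimal sketch of that executor idea in C++ (a toy for illustration, not production code):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool stopping = false;
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(m);
                        // Idle workers sleep here instead of spinning.
                        cv.wait(lock, [this] { return stopping || !tasks.empty(); });
                        if (stopping && tasks.empty()) return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();  // run the dequeued piece of work
                }
            });
    }
    // handle(e) in the loop above would just call enqueue(...).
    void enqueue(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(m); tasks.push(std::move(task)); }
        cv.notify_one();
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lock(m); stopping = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
};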