I have to use fork() recursively, but limit the number of forked processes (including children and descendants) to, for example, 100. Consider this code snippet:
void recursive(int n) {
    for (int i = 0; i < n; i++) {
        if (number_of_processes() < 100) {
            if (fork() == 0) {
                number_of_processes_minus_one();
                recursive(i);
                exit(0);
            }
        }
        else
            recursive(i);
    }
}
How do I implement number_of_processes() and number_of_processes_minus_one()? Do I have to use IPC? I tried pre-creating a file, writing PROC_MAX into it, and lock-read-write-unlocking it in number_of_processes(), but it still eats all my PIDs.
I suspect that the simplest thing to do is to use a pipe. Before you fork anything create a pipe, write 100 bytes into the write side, and close the write side. Then, try to read one byte from the pipe whenever you want to fork. If you are able to read a byte, then fork. If not, then don't. Trying to track the number of total forks with a global variable will fail if children are allowed to fork, but the pipe will persist across all descendants.
I am utilizing OpenCL's enqueue_kernel() function to enqueue kernels dynamically from the GPU to reduce unnecessary host interactions. Here is a simplified example of what I am trying to do in the kernels:
kernel void kernelA(args)
{
    //This kernel is the one that is enqueued from the host, with only one work item. This kernel
    //could be considered the "master" kernel that controls the logic of when to enqueue tasks.
    //First, it checks if a condition is met, then it enqueues kernelB
    if (some condition)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelB(args);});
    }
    else
    {
        //do other things
    }
}

kernel void kernelB(args)
{
    //Do some stuff
    //Only enqueue the next kernel with the first work item. I do this because the things
    //occurring in kernelC rely on the things that kernelB does, so it must take place after kernelB is completed,
    //hence the CLK_ENQUEUE_FLAGS_WAIT_KERNEL
    if (get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
    }
}

kernel void kernelC(args)
{
    //Do some stuff. This one in particular is one step in a sorting algorithm
    //This kernel will enqueue kernelD if a condition is met, otherwise it will
    //return to kernelA
    if (get_global_id(0) == 0 && other requirements)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelD(args);});
    }
    else if (get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
    }
}

kernel void kernelD(args)
{
    //Do some stuff
    //Finally, if some condition is met, enqueue kernelC again. What this will do is it will
    //bounce back and forth between kernelC and kernelD until the condition is
    //no longer met. If it isn't met, go back to kernelA
    if (some condition)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
    }
    else
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
    }
}
So that is the general flow of the program, and it works perfectly and does exactly as I intended it to do, in the exact order I intended it to do it in, except for one issue. In certain cases when the workload is very high, a random one of the enqueue_kernel()s will fail to enqueue and halt the program. This happens because the device queue is full, and it cannot fit another task into it. But I cannot for the life of me figure out why this is, even after extensive research.
I thought that once a task in the queue (a kernel for instance) is finished, it would free up that spot in the queue. So my queue should really only reach a max of like 1 or 2 tasks at a time. But this program will literally fill up the entire 262,144 byte size of the device command queue, and stop functioning.
I would greatly appreciate some potential insight as to why this is happening if anyone has any ideas. I am sort of stuck and cannot continue until I get past this issue.
Thank you in advance!
(BTW I am running on a Radeon RX 590 card, and am using the AMD APP SDK 3.0 to use with OpenCL 2.0)
I don't know exactly what's going wrong, but I've noticed a few things in the code you posted, and this feedback would be too long and hard to read in comments, so here goes. This is not a definite answer, but an attempt to get a bit closer:
Code doesn't quite do what the comments say
In kernelD, you have:
//Finally, if some condition is met, enqueue kernelC again.
…
if (get_global_id(0) == 0)
{
    enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelD(args);});
}
This actually enqueues kernelD itself again, not kernelC as the comments suggest. The other condition branch enqueues kernelA.
This could be a typo in the reduced version of your code.
Potential task explosion
This could again be down to the way you've abridged the code, but I don't quite see how
So my queue should really only reach a max of like 1 or 2 tasks at a time.
can be true. By my reading, all work items of both kernelC and kernelD will spawn new tasks; and as there seems to be more than 1 work item in each case, this seems like it could easily spawn a very large number of tasks:
For example, in kernelC:
if (get_global_id(0) == 0 && other requirements)
{
    enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelD(args);});
}
else
{
    enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
kernelB will have created at least 256 work items running kernelC. Here, work item 0 will (if other requirements met) spawn 1 task with at least 256 more work items, and 255+ tasks with 1 work-item running kernelA. kernelD behaves similarly.
So with a few iterations, you could easily end up with a few thousand tasks for running kernelA queued. I don't really know what your code does, but it seems like a good idea to check if cutting down these hundreds of kernelA tasks improves the situation, and whether you can perhaps modify kernelA so that you just enqueue it once with a range instead of enqueueing a work size of 1 from every work item. (Or something along those lines - perhaps enqueue once per group if that makes more sense. Basically, reduce the number of times enqueue_kernel gets called.)
enqueue_kernel() return value
Have you actually checked the return value of enqueue_kernel()? It tells you exactly why it failed, so even if my suggestion above isn't possible, perhaps you can set some global state which will allow kernelA to restart the calculation once more tasks have drained, if it was interrupted.
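A device-side sketch of such a check (OpenCL C; `some_amount`, `args`, and the `retry_flag` atomic buffer are placeholders of mine, while CLK_SUCCESS and CLK_DEVICE_QUEUE_FULL are standard OpenCL 2.0 result codes for enqueue_kernel):

```c
// Check whether the child enqueue actually succeeded instead of
// silently losing the task when the device queue is full.
int err = enqueue_kernel(get_default_queue(),
                         CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                         ndrange_1D(some_amount, 256),
                         ^{ kernelC(args); });
if (err != CLK_SUCCESS)
{
    if (err == CLK_DEVICE_QUEUE_FULL)
    {
        // Queue is full: record that this step must be retried
        // once outstanding tasks have drained (retry_flag is an
        // assumed global atomic_int* buffer the host or kernelA polls).
        atomic_store(retry_flag, 1);
    }
}
```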
I have a process which forks multiple times, generating several sub-processes which all have to write to stdout, so the messages from the different sub-processes may get mixed together. How can I avoid this problem?
Say you have three processes, each trying to output an infinite series of lines composed of four characters followed by a newline:
void four(char c);

int main()
{
    //insert your own error checking
    pid_t p0, p1, p2;
#define PROC(pid, str) pid = fork(); if (0 == pid) four(str);
    PROC(p0, 'a');
    PROC(p1, 'b');
    PROC(p2, 'c');
    waitpid(p2, 0, 0);
    waitpid(p1, 0, 0);
    waitpid(p0, 0, 0);
}
If your four function is:
void four(char c)
{
    for (;;) {
        for (int i = 0; i < 4; i++)
            putchar(c);
        putchar('\n');
    }
}
and you pipe your program into this grep invocation:
./a.out |grep -v -e aaaa -e bbbb -e cccc
You'll get matches that demonstrate your problem.
The easiest way to solve this is to rely on the guarantee that Linux won't split write calls aimed at a pipe as long as each write is no larger than the pipe buffer (POSIX requires atomicity up to PIPE_BUF bytes, which is 4096 on Linux; you can get the pipe size from the ulimit shell builtin).
void four(char c)
{
    for (;;) {
        for (int i = 0; i < 4; i++)
            putchar(c);
        putchar('\n');
        fflush(stdout);
        //the stdout buffer is surely larger than 5,
        //so this is 1 `write`
    }
}
If you want to be more portable and robust, you can use a lock on a shared file:
void four(char c)
{
    int fd = open("/proc/self/exe", O_RDONLY);
    for (;;) {
        if (0 > flock(fd, LOCK_EX))
            perror("flock");
        for (int i = 0; i < 4; i++) {
            putchar(c);
            fflush(stdout);
        }
        putchar('\n');
        fflush(stdout);
        //the pipe-buf guarantee won't save us here
        //given all these flushes,
        //but this lock will
        if (0 > flock(fd, LOCK_UN))
            perror("flock");
    }
}
Alternatively, you can also set file locks with fcntl.
By "cross themselves", I presume you mean that you are worried about output being interleaved: one process attempts to print "Hello, World!" while another prints "Goodbye, Chicago!", and the final output is "Hello, Goodby, World! Chicago!" or similar. The simplest way to solve this is to ensure that each message is written with a single write system call and that the data is small. If you invoke write with a sufficiently small buffer (the exact limit is system-dependent, often 4096 bytes and rarely less than 512), the write is atomic and will not be interleaved with output from any other process. If your messages won't fit within that size on your system, you will need to use some locking mechanism.
I am new to MPI, so sorry if this sounds stupid. I want a process to have an MPI_Irecv posted. If it has been given a task, it finds a result and sends the result back to the process that called it. How can I check whether it has actually been assigned a task, so that I can have an if {} in which that task takes place while the rest of the process continues with other stuff?
Code example:
for (i = 0; i < size_of_Q; i++) {
    MPI_Irecv(&shmeio, 1, mpi_point_C, root, 77, MPI_COMM_WORLD, &req);
    //I want to put an if right here.
    //If it's true, the process does the task.
    //It finds a number, then
    MPI_Isend(&Bestcandidate, 1, mpi_point_C, root, 66, MPI_COMM_WORLD, &req);
    //so that it can return the result.
    //If it wasn't assigned a task, it carries on with its other tasks.
} //(here is where the for loop ends)
You might be confusing what MPI is supposed to do. MPI isn't really a tasking-based model compared to some others (map-reduce, some parts of OpenMP, etc.). MPI has historically focused on SPMD (single program, multiple data) types of applications. That's not to say that MPI can't handle MPMD (there's an entire chapter in the standard about dynamic processes, and most launchers can run different executables on different ranks).
With that in mind, when you start your job, you'll usually have all of the processes that you'll ever have (unless you're using dynamic processing like MPI_COMM_SPAWN). You probably used something like:
mpiexec -n 8 ./my_program arg1 arg2 arg3
Many times, if people are trying to emulate a tasking (or master/worker) model, they'll treat rank 0 as the special "master":
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (0 == rank) {
    while (/* work not done */) {
        /* Check if parts of work are done */
        /* Send work to ranks without work */
    }
} else {
    while (/* work not done */) {
        /* Get work from master */
        /* Compute */
        /* Send results to master */
    }
}
Often, when waiting for the work, you'll do something like:
for (i = 1; i < num_processes; i++) {
    MPI_Irecv(&result[i], ..., &requests[i - 1]);
}

This will set up the receives for each rank that will send you work. Then later, you can do something like:

MPI_Testany(num_processes - 1, requests, &index, &flag, &status);
if (flag) {
    /* Process work */
    MPI_Send(work_data, ..., index + 1, ...); /* requests[index] belongs to rank index + 1 */
}
This will check to see if any of the requests (the handles used to track the status of a nonblocking operation) are completed and will then send new work to the worker that finished.
Obviously, all of this code is not copy/paste ready. You'll have to figure out how/if it applies to your work and adapt it accordingly.
My application runs myfuns() in a serial manner. It calls dothings(...), passing it an object instance and other arguments. This function contains a loop in which each iteration does a breadth-first search, and it is really time-consuming. I have used OpenMP on the loop and it speeds things up only a little, not enough. I am thinking of using MPI parallelism to get more processes working on it, but I am not sure how to use it efficiently for this portion of code embedded deep inside sequential code.
/* result_t stands in for the (unspecified) return type of compute() */
void dothings(object obji /* , ... */) {
    std::vector<result_t> retvec;
    for (int i = 0; i < somenumber; i++) {
        /* this call does a breadth-first search using std::queue */
        result_t retval = compute(obji, i);
        retvec.push_back(retval);
    }
}
/* myfuns() gets called in a sequential manner */
void myfuns() {
    dothings(objectInstance, ...);
}
I'm trying to study for an exam and I'm just not able to figure out a simple fork program.
I have this piece of code and have to add code to it in order for the parent process to send the value n to the child through a PIPE. The child should double the value, not print anything, and return it to the parent.
Then the parent should print it on the screen.
int main() {
    int n = 1;
    if (fork() == 0) {
    }
    printf("%d\n", n);
    return 1;
}
I don't really know how PIPEs work and how to use them. Can anyone help me?
int main() {
    pid_t cp;
    int fi[2], st;
    int n = 1;

    if (pipe(fi) == -1) { perror("pipe"); exit(1); }
    if ((cp = fork()) == -1) { perror("fork"); exit(1); }

    if (cp == 0)
    {
        /* child: read n, double it, return it via the exit status */
        close(fi[1]);
        read(fi[0], &n, sizeof n);
        n *= 2;
        close(fi[0]);
        exit(n);    /* note: an exit status only holds 8 bits */
    }
    else
    {
        /* parent: write n, then wait for the child and print its result */
        close(fi[0]);
        write(fi[1], &n, sizeof n);
        close(fi[1]);
        waitpid(cp, &st, 0);
        printf("%d\n", WEXITSTATUS(st));
        exit(0);
    }
}
The way pipes work is very simple. A pipe has two ends, one for reading and one for writing. You have to close the end you aren't using in each process. After that you use the remaining end like a regular file descriptor with the read() and write() functions.