Could someone please explain how the Linux kernel handles a ballooning process queue?
I am implementing a recursive Fibonacci program that, for fib(n), spawns 3 processes:
add(fd1, fd2)
fib(n-1) (stdout piped to fd1)
&fib(n-2) (stdout piped to fd2)
with the base cases of course...
But as n approaches 20 or so, some of my fork() calls fail with errno 11 (EAGAIN), and fib no longer returns the correct value.
From ~$ man fork:
EAGAIN fork() cannot allocate sufficient memory to copy the parent's page tables and allocate a task structure for the child.
EAGAIN It was not possible to create a new process because the caller's RLIMIT_NPROC resource limit was encountered. To exceed this limit, the process must have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability.
If I implement a jump statement to retry the fork() in the event of this error, though, the program will eventually terminate with the correct value for fib(~20).
I understand that if I did fib(700) it would probably not terminate, but I am interested in what is happening right at the fringe of when fork() system calls start raising this error.
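For reference, a minimal sketch of that retry approach (the 1 ms back-off is an arbitrary choice, not something the kernel requires):

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Retry fork() while it fails with EAGAIN (insufficient memory for the
 * child, or RLIMIT_NPROC hit), backing off briefly between attempts. */
static pid_t fork_retry(void) {
    pid_t pid;
    while ((pid = fork()) == -1 && errno == EAGAIN)
        usleep(1000);   /* arbitrary 1 ms pause before retrying */
    return pid;         /* 0 in the child, child's pid in the parent, -1 on other errors */
}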
I am teaching myself operating systems by going through the lecture notes of the course at IIT Bombay (https://www.cse.iitb.ac.in/~mythili/os/). One of the questions in the Process worksheet asks which of the following doesn't always happen in the situation described in the title. The answer is C.
A. The process moves to kernel mode.
B. The program counter of the CPU shifts to the kernel part of the address space.
C. The process is context-switched out and a separate kernel process starts execution.
D. The OS code that deals with handling TCP/IP packets is invoked.
I'm a bit confused though. I thought that when an interrupt occurs, the process is context-switched out so other processes can run and the CPU is not idle during that time. The kernel, then, will take care of the packet handling. Why would C not be correct then?
You are right in saying that "when an interrupt occurs, the process is context-switched out so other processes can run and the CPU is not idle during that time", but a qualifier like "generally" or "mostly" needs to be added to it.
In most cases, there is another process waiting for CPU time that can be scheduled. However, that is not the case 100% of the time. The question hinges on the word "always": while the other options always occur in the given situation, option C is a choice the OS makes at run time. If the OS determines that switching out this process would be less optimal than performing the system call and resuming the same process, then it may not perform the context switch.
There is a cost associated with context switching, and if the other processes are also blocked on some I/O, it may be optimal for the OS NOT to switch the context. There can be other reasons not to switch as well: what if only 1 process is running? There is no other process to switch the context to!
I'm new to OpenCL programming. In one of my OpenCL applications, I use clWaitForEvents after launching every kernel.
Case 1:
time_start();
cl_event event;
cl_int status = clEnqueueNDRangeKernel(queue, ..., &event);
clWaitForEvents(1, &event);
time_end();
Time taken: 250 ms (with clWaitForEvents)
If I remove clWaitForEvents(), my kernel runs faster with the same output.
Case 2:
time_start();
cl_event event;
cl_int status = clEnqueueNDRangeKernel(queue, ..., &event);
time_end();
Time taken: 220 ms (without clWaitForEvents)
I have to launch 10 different kernels sequentially. Every kernel is dependent on the output of the previous kernel. Using clWaitForEvents after every kernel increases my execution time by a few hundred ms.
Can the outputs go wrong if I do not use clWaitForEvents? I would like to understand what might possibly go wrong if I do not use clWaitForEvents or clFinish.
Any pointers are appreciated.
Hopefully a slightly less complicated answer:
I have to launch 10 different kernels sequentially. Every kernel is dependent on the output of the previous kernel.
If you don't explicitly set the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property in the clCreateCommandQueue() call (which is the usual case), it will be an in-order queue. You don't need to synchronize commands in it (actually you shouldn't; as you've seen, it can considerably slow down execution). See the docs:
If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order. For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed. If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A.
I would like to understand what might possibly go wrong if I do not use clWaitForEvents or clFinish.
If you're doing simple stuff on a single in-order queue, you don't need clWaitForEvents() at all. It's mostly useful if you want to wait for multiple events from multiple queues, or you're using out-of-order queues, or you want to enqueue 20 commands but wait for the 4th, or something similar.
For a single in-order queue, after clFinish() returns, all commands will have completed and any and all events will have their status updated to complete or failed. So in the simplest case you don't need to deal with events at all: just enqueue everything you need (check the enqueues for errors, though) and call clFinish().
Note that if you don't use any form of wait/flush (clWaitForEvents / clFinish / a blocking command), the implementation may take as much time as it wants to actually push those commands to a device. In other words, you must either 1) use clWaitForEvents or clFinish, or 2) enqueue a blocking command (read/write/map/unmap) as the last command.
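As a sketch of the simplest case (queue, kernelA/kernelB, outputBuf, gws, bytes and hostPtr are placeholders assumed to be set up earlier, and error checking is abbreviated):

/* Two dependent kernels on a default (in-order) queue: no per-kernel waits
 * are needed; the queue guarantees kernelA finishes before kernelB starts. */
cl_int err;
err = clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &gws, NULL, 0, NULL, NULL);
/* ...check err... */
err = clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &gws, NULL, 0, NULL, NULL);
/* ...check err... */
/* A blocking read (CL_TRUE) flushes the queue and waits on the host:
 * it returns only after both kernels have completed. */
err = clEnqueueReadBuffer(queue, outputBuf, CL_TRUE, 0, bytes, hostPtr, 0, NULL, NULL);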
An in-order queue implicitly waits for each command's completion, in the order they were enqueued, but only on the device side. This means the host can't know what happened.
An out-of-order queue does not guarantee any ordering between commands anywhere, so it can have issues if you don't synchronize explicitly.
clWaitForEvents waits on the host side for a command's event.
clFinish waits on the host side until all commands are complete.
A non-blocking buffer read/write does not wait on the host side.
A blocking buffer read/write waits on the host side, but does not wait for other commands.
Recommended solutions:
- Inter-command sync (for using the output of one command as the input of the next):
  - use an in-order queue,
  - or pass the event of one command to another (if it's an out-of-order queue).
- Inter-queue (or out-of-order queue) sync (for overlapping buffer copies and kernel executions):
  - pass events from command(s) to another command.
- Device-host sync (for getting the latest data to RAM, getting the first data from RAM, or pausing the host):
  - enable the blocking option on buffer commands,
  - or add a clFinish,
  - or use clWaitForEvents.
- Being informed when a command is complete (for reasons like benchmarking; see the callback sketch below):
  - use an event callback,
  - or constantly query the event state (CPU/PCIe usage increases).
Enqueueing 1 non-blocking buffer write + 1000 kernels + 1 blocking buffer read on an in-order queue can successfully execute a chain of 1000 kernels on the initial data and get the latest results back on the host side.
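For the callback option in the list above, a minimal sketch (clSetEventCallback requires OpenCL 1.1 or later; queue, kernel and gws are placeholders, and <stdio.h>/<CL/cl.h> are assumed included):

/* Invoked by the runtime when the command reaches CL_COMPLETE (or fails);
 * callbacks should return quickly and must not make blocking CL calls. */
static void CL_CALLBACK on_done(cl_event ev, cl_int status, void *user_data) {
    printf("%s finished with status %d\n", (const char *)user_data, status);
}

cl_event event;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, &event);
clSetEventCallback(event, CL_COMPLETE, on_done, (void *)"kernel");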
I am trying to use MPI's Spawn functionality to run subprocesses that also use MPI. I am using MPI-2.x and dynamic process management.
I have a master process (maybe I should say "master program") that runs in python (via mpi4py) that uses MPI to communicate between cores. This master process/program runs on 16 cores, and it will also make MPI_Comm_spawn_multiple calls to C and Fortran programs (which also use MPI). While the C and Fortran processes run, the master python program waits until they are finished.
A little more explicitly, the master python program does two primary things:
1. Uses MPI to do preprocessing for the spawning in step (2). MPI_Barrier is called after this preprocessing to ensure that all ranks have finished their preprocessing before step (2) begins. Note that the preprocessing is distributed across all 16 cores, and at the end of the preprocessing the resulting information is passed back to the root rank (i.e. rank == 0).
2. After the preprocessing, the root rank spawns 4 workers, each of which uses 4 cores (i.e. all 16 cores are needed to run all 4 processes at the same time). This is done via MPI_Comm_spawn_multiple, and these workers use MPI to communicate within their 4 cores. In the master python program, only rank == 0 spawns the C and Fortran subprocesses, and an MPI_Barrier is called after the spawn on all ranks so that all the rank != 0 cores wait until the spawned processes finish before they continue execution.
Steps (1) and (2) are repeated many, many times in a for loop.
The issue I am having is that if I use mpiexec -np 16 to start the master python program, all the cores are being taken up by the master program and I get the error:
All nodes which are allocated for this job are already filled.
when the program hits the MPI_Comm_spawn_multiple line.
If I use any other value less than 16 for -np, then only some of the cores are allocated and some are available (but I still need all 16), so I get a similar error:
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
/home/username/anaconda/envs/myenvironment/bin/python
Either request fewer slots for your application, or make more slots available
for use.
So it seems like even though I am going to run MPI_Barrier in step (2) to block until the spawned processes finish, MPI still thinks those cores are being used and won't allocate another process on top of them. Is there a way to fix this?
(If the answer is hostfiles, could you please explain them for me? I am not understanding the full idea and how they might be useful here.)
I'm the poster of this question. I found out that I can pass -oversubscribe as an argument to mpiexec to avoid these errors, but as Zulan mentioned in his comments, this could be a poor decision.
In addition, I don't know whether the cores are being subscribed the way I want them to be. For example, maybe all 4 C/Fortran processes are being run on the same 4 cores. I don't know how to tell.
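As a hedged aside (assuming Open MPI, which the error messages above suggest): mpiexec's --report-bindings option prints the core binding of each process it launches, which can help check whether the oversubscribed processes are piling onto the same cores. The program name below is a placeholder:

mpiexec --oversubscribe --report-bindings -np 16 python master.py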
Most MPI implementations have a parameter -usize 123 for the mpiexec program that indicates the size of the "universe", which can be larger than the world communicator. In that case you can spawn extra processes up to the size of the universe. You can query the size of the universe:
int universe_size, *universe_size_attr, uflag;
/* MPI_UNIVERSE_SIZE may not be set; check the flag before dereferencing. */
MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
                  &universe_size_attr, &uflag);
if (uflag)
    universe_size = *universe_size_attr;
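A hedged sketch of how the queried value might then be used (the "worker" executable name is a placeholder, not part of the question):

/* Spawn only as many extra workers as the universe allows beyond the
 * current world size. */
int world_size, n_extra;
MPI_Comm workers;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
n_extra = universe_size - world_size;   /* free slots in the universe */
if (n_extra > 0)
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, n_extra, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &workers, MPI_ERRCODES_IGNORE);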
I ran the quantum chemistry program GAMESS with mpirun, and it ran well yesterday. However, when I tried another calculation by the same procedure, it failed with the messages below. How can I fix it? I confirmed that there was no MPI process still running, and I cleared the caches as well.
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with
errorcode 911.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
There are two reasons this could occur:
- this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.
- this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination".
This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
Something in the application called MPI_ABORT on rank 0. You'll have to look at the code to figure out why it was called, but my guess is that there was some bad input. I don't know much about GAMESS though. You might try asking the GAMESS people directly. They have a website (http://www.msg.ameslab.gov/gamess/) which includes a way to contact them.
I'm trying to get my head around the concept of a cooperative multitasking system and exactly how it works in a single-threaded application.
My understanding is that this is a "form of multitasking in which multiple tasks execute by voluntarily ceding control to other tasks at programmer-defined points within each task."
So if you have a list of tasks and one task is executing, how do you determine when to pass execution to another task? And when you give execution back to a previous task, how do you resume from where you were previously?
I find this a bit confusing because I don't understand how this can be achieved without a multithreaded application.
Any advice would be very helpful :)
Thanks
In your specific scenario, where a single process (or thread of execution) uses cooperative multitasking, you can use something like Windows' fibers or the POSIX setcontext family of functions. I will use the term fiber here.
Basically when one fiber is finished executing a chunk of work and wants to voluntarily allow other fibers to run (hence the "cooperative" term), it either manually switches to the other fiber's context or more typically it performs some kind of yield() or scheduler() call that jumps into the scheduler's context, then the scheduler finds a new fiber to run and switches to that fiber's context.
What do we mean by context here? Basically the stack and registers. There is nothing magic about the stack, it's just a block of memory the stack pointer happens to point to. There is also nothing magic about the program counter, it just points to the next instruction to execute. Switching contexts simply saves the current registers somewhere, changes the stack pointer to a different chunk of memory, updates the program counter to a different stream of instructions, copies that context's saved registers into the CPU, then does a jump. Bam, you're now executing different instructions with a different stack. Often the context switch code is written in assembly that is invoked in a way that doesn't modify the current stack or it backs out the changes, in either case it leaves no traces on the stack or in registers so when code resumes execution it has no idea anything happened. (Again, the theme: we assume that method calls fiddle with registers, push arguments to the stack, move the stack pointer, etc but that is just the C calling convention. Nothing requires you to maintain a stack at all or to have any particular method call leave any traces of itself on the stack).
Since each stack is separate, you don't have some continuous chain of seemingly random method calls eventually overflowing the stack (which might be the result if you naively tried to implement this scheme using standard C methods that continuously called each other). You could implement this manually with a state machine where each fiber kept track of where it was in its work, periodically returning to the calling dispatcher's method, but why bother when actual fiber/coroutine support is widely available?
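To make the mechanism concrete, here is a minimal sketch using the POSIX setcontext family mentioned above (deprecated in POSIX.1-2008 but still available in e.g. glibc). One fiber voluntarily yields back to main between steps:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;

static void task(void) {
    for (int i = 0; i < 3; i++) {
        printf("task: step %d\n", i);
        swapcontext(&task_ctx, &main_ctx);   /* voluntary yield to "scheduler" */
    }
}

int main(void) {
    static char stack[64 * 1024];            /* the fiber's private stack */

    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = sizeof stack;
    task_ctx.uc_link = &main_ctx;            /* where to go if task() returns */
    makecontext(&task_ctx, task, 0);

    for (int i = 0; i < 3; i++) {
        printf("scheduler: resuming task\n");
        swapcontext(&main_ctx, &task_ctx);   /* run the fiber until it yields */
    }
    return 0;
}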
Also remember that cooperative multitasking is orthogonal to processes, protected memory, address spaces, etc. Witness Mac OS 9 or Windows 3.x. They supported the idea of separate processes. But when you yielded, the context was changed to the OS context, allowing the OS scheduler to run, which then potentially selected another process to switch to. In theory you could have a full protected virtual memory OS that still used cooperative multitasking. In those systems, if an errant process never yielded, the OS scheduler never ran, so all other processes in the system were frozen. **
The next natural question is what makes something pre-emptive... The answer is that the OS schedules an interrupt timer with the CPU to stop the currently executing task and switch back to the OS scheduler's context regardless of whether the current task cares to release the CPU or not, thus "pre-empting" it.
If the OS uses CPU privilege levels, the (kernel configured) timer is not cancelable by lower level (user mode) code, though in theory if the OS didn't use such protections an errant task could mask off or cancel the interrupt timer and hijack the CPU. There are some other scenarios like IO calls where the scheduler can be invoked outside the timer, and the scheduler may decide no other process has higher priority and return control to the same process without a switch... And in reality most OSes don't do a real context switch here because that's expensive, the scheduler code runs inside the context of whatever process was executing, so it has to be very careful not to step on the stack, to save register states, etc.
** You might ask why not just fire a timer if yield isn't called within a certain period of time. The answer lies in multi-threaded synchronization. In a cooperative system, you don't have to bother taking locks, worry about re-entrance, etc because you only yield when things are in a known good state. If this mythical timer fires, you have now potentially corrupted the state of the program that was interrupted. If programs have to be written to handle this, congrats... You now have a half-assed pre-emptive multitasking system. Might as well just do it right! And if you are changing things anyway, may as well add threads, protected memory, etc. That's pretty much the history of the major OSes right there.
The basic idea behind cooperative multitasking is trust - that each subtask will relinquish control, of its own accord, in a timely fashion, to avoid starving other tasks of processor time. This is why tasks in a cooperative multitasking system need to be tested extremely thoroughly, and in some cases certified for use.
I don't claim to be an expert, but I imagine cooperative tasks could be implemented as state machines, where passing control to the task would cause it to run for the absolute minimal amount of time it needs to make any kind of progress. For example, a file reader might read the next few bytes of a file, a parser might parse the next line of a document, or a sensor controller might take a single reading, before returning control back to a cooperative scheduler, which would check for task completion.
Each task would have to keep its internal state on the heap (at object level), rather than on the stack frame (at function level) like a conventional blocking function or thread.
And unlike conventional multitasking, which relies on a hardware timer to trigger a context switch, cooperative multitasking relies on the code to be written in such a way that each step of each long-running task is guaranteed to finish in an acceptably small amount of time.
The tasks will execute an explicit wait or pause or yield operation which makes the call to the dispatcher. There may be different operations for waiting on IO to complete or explicitly yielding in a heavy computation. In an application task's main loop, it could have a *wait_for_event* call instead of busy polling. This would suspend the task until it has input to process.
There may also be a time-out mechanism for catching runaway tasks, but it is not the primary means of switching (or else it wouldn't be cooperative).
One way to think of cooperative multitasking is to split a task into steps (or states). Each task keeps track of the next step it needs to execute. When it's the task's turn, it executes only that one step and returns. That way, in the main loop of your program you are simply calling each task in order, and because each task only takes up a small amount of time to complete a single step, we end up with a system which allows all of the tasks to share CPU time (i.e. cooperate).
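A minimal sketch of that step-per-call structure (the step count and task names are arbitrary):

#include <stdio.h>

/* Each task keeps its progress in a struct (object state, not stack state)
 * and performs exactly one small step per call. */
typedef struct {
    int next_step;
    int done;
} Task;

static void task_step(Task *t, const char *name) {
    if (t->done) return;
    printf("%s: step %d\n", name, t->next_step);
    if (++t->next_step == 3)        /* pretend 3 steps finish the work */
        t->done = 1;
}

int main(void) {
    Task a = {0, 0}, b = {0, 0};
    while (!a.done || !b.done) {    /* round-robin cooperative dispatcher */
        task_step(&a, "A");
        task_step(&b, "B");
    }
    return 0;
}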