I'd like the MPI function MPI_Sendrecv() to run on the GPU. Normally I use something like:
#pragma acc host_data use_device(send_buf, recv_buf)
{
MPI_Sendrecv (send_buf, N, MPI_DOUBLE, proc[0], 0,
recv_buf, N, MPI_DOUBLE, proc[0], 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
And it works fine. However now, I call MPI_Sendrecv() inside a loop. If I try to accelerate this loop (with #pragma acc parallel loop) or even accelerate the whole routine (#pragma acc routine) where the loop and the MPI call are situated, I get an error:
64, Accelerator restriction: loop contains unsupported statement type
78, Accelerator restriction: unsupported statement type: opcode=ACCHOSTDATA
How can I make the call run on the device if, as in this case, the call is inside an accelerated region?
An alternative might be not to accelerate the routine and the loop, and use #pragma acc host_data use_device(send_buf, recv_buf) on its own, but then the goal of having everything on the GPU would be defeated.
EDIT
I removed the #pragma. Even so, the application runs hundreds of times slower and I cannot figure out why.
I'm using nsight-sys to check: do you have any idea why MPI_Sendrecv is slowing down the app? The whole routine where it's called now runs on the host. If I hover the mouse pointer over the NVTX (MPI) section, it prints "ranges on this row have been projected from the CPU on the GPU". What does this mean?
Sorry if this is not clear, but I lack experience with Nsight and I don't know how to analyze the results properly. If you need more details I'm happy to provide them.
It still seems weird to me that the MPI calls appear in the GPU section.
You can't make MPI calls from within device code.
Also, the "host_data" is saying to use a device pointer within host code so can't be used within device code. Device pointers are used by default in device code, hence no need for the "host_data" construct.
Questions after edit:
Do you have any idea why MPI_Sendrecv is slowing down the app?
Sorry, no idea. I don't know what you're comparing to or anything about your app, so it's hard for me to tell. Note, though, that Sendrecv is a blocking call, so putting it in a loop will cause each send and receive to wait on the previous one before proceeding. Are you able to rewrite the code to use ISend and IRecv instead?
"ranges on this row have been projected from the CPU on the GPU". What
does this mean?
I haven't seen this before, but presume it just means that even though these are host calls, the NVTX instrumentation is able to project them onto the GPU timeline. Most likely so the CUDA Aware MPI device to device data transfers will be correlated to the MPI region.
Reading Intel's big manual, I see that if you want to return from a far call, that is, a call to a procedure in another code segment, you simply issue a return instruction (possibly with an immediate argument that moves the stack pointer up n bytes after the pointer popping).
This, apparently, if I'm interpreting things correctly, is enough for the hardware to pop both the segment selector and offset into the correct registers.
But, how does the system know that the return should be a far return and that both an offset AND a selector need to be popped?
If the hardware just pops the offset pointer and not the selector after it, then you'll be pointing to the right offset but wrong segment.
There is nothing special about the far return command compared to the near return version.
They both look identical as far as I can tell.
I assume then that the processor, perhaps at the micro-architecture level, keeps track of which calls are far and which are near, so that when they're returned from, it knows how many bytes to pop and where to pop them (the pointer registers and the segment selector registers).
Is my assumption correct?
What do you guys know about this mechanism?
The processor doesn't track whether or not a call should be far or near; the compiler decides how to encode the function call and return using either far or near opcodes.
As it is, FAR calls have no use on modern processors because you don't need to change any segment register values; that's the point of a flat memory model. Segment registers still exist, but the OS sets them up with base=0 and limit=0xffffffff so just a plain 32-bit pointer can access all memory. Everything is NEAR, if you need to put a name on it.
Normally you just don't even think about segmentation, so you don't label anything as near or far either. But the manual still describes the call/ret opcodes we use for normal code as the NEAR versions.
FAR and NEAR were used on old x86 processors, which used a segmented memory model. Programs at that time needed to choose which memory model they wished to support, ranging from "tiny" to "large". If your program was small enough to fit in a single segment, it could be compiled using NEAR calls and returns exclusively. If it was "large", the opposite was true. For anything in between, you could choose whether individual functions needed to be callable/returnable from code in another segment.
Most modern programs (besides bootloaders and the like) run on a different construct: they expect a flat memory model. Behind the scenes the OS will swap out memory as needed (with paging not segmentation), but as far as the program is concerned, it has its virtual address space all to itself.
But, to answer your question, the difference between the call/return variants is the opcode used; the processor obeys the instruction it's given. If you make a mistake (say, issue a FAR return opcode while in flat mode), it'll fail.
I know that MPI_SENDRECV allows one to avoid the deadlocks that can occur when we use the classic MPI_SEND and MPI_RECV functions.
I would like to know if MPI_SENDRECV(sent_to_process_1, receive_from_process_0) is equivalent to:
MPI_ISEND(sent_to_process_1, request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request1)
MPI_WAIT(request2)
with asynchronous MPI_ISEND and MPI_IRECV functions?
From what I have seen, MPI_ISEND and MPI_IRECV create a fork (i.e. 2 processes). So if I follow this logic, the first call of MPI_ISEND generates 2 processes. One does the communication and the other calls MPI_IRECV, which forks itself into 2 processes.
But once the communication of the first MPI_ISEND is finished, does the second process call MPI_IRECV again? With this logic, the above equivalent doesn't seem to be valid...
Maybe I should change to this:
MPI_ISEND(sent_to_process_1, request1)
MPI_WAIT(request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request2)
But I think that could also create deadlocks.
Could anyone give me another solution using MPI_ISEND, MPI_IRECV and MPI_WAIT that gets the same behaviour as MPI_SENDRECV?
There are some dangerous lines of thought in the question and other answers. When you start a non-blocking MPI operation, the MPI library doesn't create a new process/thread/etc. You're thinking of something more like a parallel region in OpenMP, I believe, where new threads/tasks are created to do some work.
In MPI, starting a non-blocking operation is like telling the MPI library that you have some things that you'd like to get done whenever MPI gets a chance to do them. There are lots of equally valid options for when they are actually completed:
It could be that they all get done later when you call a blocking completion function (like MPI_WAIT or MPI_WAITALL). These functions guarantee that when the blocking completion call is done, all of the requests that you passed in as arguments are finished (in your case, the MPI_ISEND and the MPI_IRECV). Regardless of when the operations actually take place (see next few bullets), you as an application can't consider them done until they are actually marked as completed by a function like MPI_WAIT or MPI_TEST.
The operations could get done "in the background" during another MPI operation. For instance, if you do something like the code below:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
The MPI_ISEND and the MPI_IRECV would probably actually do the data transfers in the background during the MPI_BARRIER. This is because, as an application, you are transferring "control" of your application to the MPI library during the MPI_BARRIER call. This lets the library make progress on any ongoing MPI operation that it wants. Most likely, by the time the MPI_BARRIER is complete, so are most of the other outstanding operations.
Some MPI libraries allow you to specify that you want a "progress thread". This tells the MPI library to start up another thread (note that thread != process) in the background that will actually do the MPI operations for you while your application continues in the main thread.
Remember that all of these in the end require that you actually call MPI_WAIT or MPI_TEST or some other function like it to ensure that your operation is actually complete, but none of these spawn new threads or processes to do the work for you when you call your nonblocking functions. Those really just act like you stick them on a list of things to do (which in reality, is how most MPI libraries implement them).
The best way to think of how MPI_SENDRECV is implemented is to do two non-blocking calls with one completion function:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
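For completeness, a filled-in version of that pattern might look like this (hedged: the buffer names, count and peer rank are placeholders, not taken from the question):
#include <mpi.h>
void sendrecv_equivalent(double *send_buf, double *recv_buf, int n, int peer)
{
    MPI_Request req[2];
    MPI_Isend(send_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);
    /* neither transfer is guaranteed complete until the Waitall returns */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}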
How I usually do this on node i communicating with node i+1:
mpi_isend(send_to_process_iPlus1, requests(1))
mpi_irecv(recv_from_process_iPlus1, requests(2))
...
mpi_waitall(2, requests)
You can see how ordering your commands this way with non-blocking communication allows you to perform, during the ... above, any computation that does not rely on the send/recv buffers, overlapping it with your communication. Overlapping computation with communication is often crucial for maximizing performance.
mpi_sendrecv, on the other hand (while avoiding any deadlock issues), is still a blocking operation. Thus, your program must remain in that call during the entire send/recv process.
Final points: you can initialize more than 2 requests and wait on all of them the same way as when dealing with 2 requests. For instance, it's quite easy to start communication with node i-1 as well and wait on all 4 of the requests. With mpi_sendrecv you must always have a paired send and receive; what if you only want to send?
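As an illustration of that last point, here is a hedged sketch of the same pattern extended to both neighbours in a ring, with 4 requests; the buffer names and the placement of the computation are placeholders:
#include <mpi.h>
void ring_exchange(double *send_prev, double *send_next,
                   double *recv_prev, double *recv_next, int n)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int prev = (rank - 1 + size) % size;
    int next = (rank + 1) % size;

    MPI_Request req[4];
    MPI_Isend(send_next, n, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(send_prev, n, MPI_DOUBLE, prev, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Irecv(recv_prev, n, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Irecv(recv_next, n, MPI_DOUBLE, next, 1, MPI_COMM_WORLD, &req[3]);

    /* ... computation that does not touch these four buffers goes here ... */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}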
On Nvidia GPUs, when I call clEnqueueNDRange, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, CommandQueue::enqueueNDRange, but this shouldn't make a difference. This only happens on Nvidia hardware (3 Tesla M2090s) remotely; on our office workstations with AMD GPUs, the call is nonblocking and returns immediately. I don't have local Nvidia hardware to test on - we used to, and I remember similar behavior then, too, but it's a bit hazy.
This makes spreading the work across multiple GPUs harder. I've tried starting a new thread for each call to enqueueNDRange using std::async/std::future from the new C++11 spec, but that doesn't seem to work either - monitoring the GPU usage in nvidia-smi, I can see that the memory usage on GPU 0 goes up, then it does some work, then the memory on GPU 0 goes down and the memory on GPU 1 goes up, that one does some work, and so on. My gcc version is 4.7.0.
Here's how I'm starting the kernels, where increment is the desired global work size divided by the number of devices, rounded up to the nearest multiple of the desired local work size:
std::vector<cl::CommandQueue> queues;
/* Population of queues happens somewhere */
cl::NDRange offset, increment, local;
std::vector<std::future<cl_int>> enqueueReturns;
int numDevices = queues.size();
/* Calculation of increment (local is taken from the function parameters) */

//Distribute the job among each of the devices in the context
for(int i = 0; i < numDevices; i++)
{
    //Update the offset for the current device
    offset = cl::NDRange(i*increment[0], i*increment[1], i*increment[2]);

    //Start a new thread for each call to enqueueNDRangeKernel
    enqueueReturns.push_back(std::async(
        std::launch::async,
        &cl::CommandQueue::enqueueNDRangeKernel,
        &queues[i],
        kernels[kernel],
        offset,
        increment,
        local,
        (const std::vector<cl::Event>*)NULL,
        (cl::Event*)NULL));
    //Without those last two casts, the program won't even compile
}

//Wait for all threads to join before returning
for(int i = 0; i < numDevices; i++)
{
    execError = enqueueReturns[i].get();
    if(execError != CL_SUCCESS)
        std::cerr << "Informative error omitted due to length" << std::endl;
}
The kernels definitely should be running on the call to std::async, since I can create a little dummy function, set a breakpoint on it in GDB and have it step into it the moment std::async is called. However, if I make a wrapper function for enqueueNDRangeKernel, run it there, and put in a print statement after the run, I can see that it takes some time between prints.
P.S. The Nvidia dev zone is down due to hackers and such, so I haven't been able to post the question there.
EDIT: Forgot to mention - the buffer that I'm passing to the kernel as an argument (the one I mention above that seems to get passed between the GPUs) is declared with CL_MEM_COPY_HOST_PTR. I had been using CL_MEM_READ_WRITE, with the same effect.
I emailed the Nvidia guys and actually got a pretty fair response. There's a sample in the Nvidia SDK that shows that, for each device, you need to create separate:
queues - So you can represent each device and enqueue orders to it
buffers - One buffer for each array you need to pass to the device, otherwise the devices will pass around a single buffer, waiting for it to become available and effectively serializing everything.
kernel - I think this one's optional, but it makes specifying arguments a lot easier.
Furthermore, you have to call EnqueueNDRangeKernel for each queue in separate threads. That's not in the SDK sample, but the Nvidia guy confirmed that the calls are blocking.
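A hedged sketch of that setup, using the plain OpenCL C API rather than the C++ bindings from the question (platform 0, a 1024-float buffer and a limit of 8 devices are arbitrary placeholders; kernel creation and the per-queue host threads are omitted):
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint num_devices;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &num_devices);

    /* One context may hold all the devices... */
    cl_context ctx = clCreateContext(NULL, num_devices, devices, NULL, NULL, &err);
    if (err != CL_SUCCESS) { fprintf(stderr, "context error %d\n", err); return 1; }

    /* ...but each device gets its own queue and its own buffer, so the
       devices never serialize on a shared object. */
    cl_command_queue queues[8];
    cl_mem buffers[8];
    for (cl_uint i = 0; i < num_devices; ++i) {
        queues[i]  = clCreateCommandQueue(ctx, devices[i], 0, &err);
        buffers[i] = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    1024 * sizeof(float), NULL, &err);
    }

    /* Build the program and create one kernel object per device here, then
       call clEnqueueNDRangeKernel on queues[i] from its own host thread,
       since on this driver the enqueue call may block. */

    for (cl_uint i = 0; i < num_devices; ++i) {
        clReleaseMemObject(buffers[i]);
        clReleaseCommandQueue(queues[i]);
    }
    clReleaseContext(ctx);
    return 0;
}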
After doing all this, I achieved concurrency on multiple GPUs. However, there's still a bit of a problem. On to the next question...
Yes, you're right. AFAIK the Nvidia implementation has a synchronous clEnqueueNDRange. I have noticed this when using my library (Brahma) as well. I don't know if there is a workaround or a way of preventing this, short of using a different implementation (and hence a different device).
We have a large Fortran/MPI code-base which makes use of system-V shared memory segments on a node. We run on fat nodes with 32 processors, but only 2 or 4 NICs, and relatively little memory per CPU; so the idea is that we set up a shared memory segment, on which each CPU performs its calculation (in its block of the SMP array). MPI is then used to handle inter-node communications, but only on the master in the SMP group. The procedure is double-buffered, and has worked nicely for us.
The problem came when we decided to switch to asynchronous comms, for a bit of latency hiding. Since only a couple of CPUs on the node communicate over MPI, but all of the CPUs see the received array (via shared memory), a CPU doesn't know when the communicating CPU has finished, unless we enact some kind of barrier, and then why do asynchronous comms?
The ideal, hypothetical solution would be to put the request tags in an SMP segment and run mpi_request_get_status on the CPU which needs to know. Of course, the request tag is only registered on the communicating CPU, so it doesn't work! Another proposed possibility was to spawn a thread on the communicating process and use it to run mpi_request_get_status in a loop, with the flag argument in a shared memory segment so that all the other images can see it. Unfortunately, that's not an option either, since we are constrained not to use threading libraries.
The only viable option we've come up with seems to work, but feels like a dirty hack. We put an impossible value in the last (upper-bound) element of the receive buffer; that way, once the mpi_irecv has completed, the value has changed and hence every CPU knows when it can safely use the buffer. Is that OK? It seems it would only work reliably if the MPI implementation can be guaranteed to transfer data consecutively. That almost sounds convincing, since we've written this thing in Fortran and so our arrays are contiguous; I would imagine the access would be as well.
Any thoughts?
Thanks,
Joly
Here's a pseudo-code template of the kind of thing I'm doing. I haven't got the code as a reference at home, so I hope I haven't forgotten anything crucial, but I'll make sure when I'm back at the office...
pseudo(array_arg1(:,:), array_arg2(:,:)...)

  integer, parameter : num_buffers=2
  Complex64bit, smp : buffer(:,:,num_buffers)
  integer : prev_node, next_node
  integer : send_tag(num_buffers), recv_tag(num_buffers)
  integer : current, next
  integer : num_nodes
  boolean : do_comms
  boolean, smp : safe(num_buffers)
  boolean, smp : calc_complete(num_cores_on_node,num_buffers)

  allocate_arrays(...)
  work_out_neighbours(prev_node,next_node)
  am_i_a_slave(do_comms)
  setup_ipc(buffer,...)
  setup_ipc(safe,...)
  setup_ipc(calc_complete,...)

  current = 1
  next = mod(current,num_buffers)+1
  safe=true
  calc_complete=false
  work_out_num_nodes_in_ring(num_nodes)

  do i=1,num_nodes

    if(do_comms)
      check_all_tags_and_set_safe_flags(send_tag, recv_tag, safe) # just in case anything else has finished.
      check_tags_and_wait_if_need_be(current, send_tag, recv_tag)
      safe(current)=true
    else
      wait_until_true(safe(current))
    end if

    calc_complete(my_rank,current)=false
    calc_complete(my_rank,current)=calculate_stuff(array_arg1,array_arg2..., buffer(current), bounds_on_process)
    if(not calc_complete(my_rank,current)) error("fail!")

    if(do_comms)
      check_all_tags_and_set_safe(send_tag, recv_tag, safe)
      check_tags_and_wait_if_need_be(next, send_tag, recv_tag)
      recv(prev_node, buffer(next), recv_tag(next))
      safe(next)=false

      wait_until_true(all(calc_complete(:,current)))
      check_tags_and_wait_if_need_be(current, send_tag, recv_tag)
      send(next_node, buffer(current), send_tag(current))
      safe(current)=false
    end if

    work_out_new_bounds()
    current=next
    next=mod(next,num_buffers)+1

  end do

end pseudo
So ideally, I would have liked to run "check_all_tags_and_set_safe_flags" in a loop in another thread on the communicating process, or even better: do away with the "safe" flags and make the handles to the sends/receives available on the slaves; then I could run "check_tags_and_wait_if_need_be(current, send_tag, recv_tag)" (mpi_wait) before the calculation on the slaves instead of "wait_until_true(safe(current))".
"...unless we enact some kind of barrier, and then why do asynchronous comms?"
That sentence is a bit confused. The purpose of asynchronous communications is to overlap communications and computations, so that you can hopefully get some real work done while the communication is going on. But this means you now have two tasks occurring which eventually have to be synchronized, so there has to be something which blocks the tasks at the end of the first communications phase before they go on to the second computation phase (or whatever).
The question of what to do in this case to implement things nicely (it seems like what you've got now works but you're rightly concerned about the fragility of the result) depends on how you're doing the implementation. You use the word threads, but (a) you're using sysv shared memory segments, which you wouldn't need to do if you had threads, and (b) you're constrained not to be using threading libraries, so presumably you actually mean you're fork()ing processes after MPI_Init() or something?
I agree with Hristo that your best bet is almost certainly to use OpenMP for on-node distribution of computation, and would probably greatly simplify your code. It would help to know more about your constraint to not use threading libraries.
Another approach which would still avoid you having to "roll your own" process-based communication layer that you use in addition to MPI would be to have all the processes on the node be MPI processes, but create a few communicators - one to do the global communications, and one "local" communicator per node. Only a couple of processes per node would be a part of a communicator which actually does off-node communications, and the others do work on the shared memory segment. Then you could use MPI-based methods for synchronization (Wait, or Barrier) for the on-node synchronization. The upcoming MPI3 will actually have some explicit support for using local shared memory segments this way.
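A hedged sketch of that communicator idea in C (MPI_Comm_split_type with MPI_COMM_TYPE_SHARED is the MPI-3 way of grouping ranks that share memory; on an MPI-2 library you would have to derive the colour for MPI_Comm_split yourself, e.g. from the host name):
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm node_comm;
    int world_rank, node_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* all ranks that can share memory end up in the same node_comm */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* e.g. only node_rank == 0 does the off-node MPI_Isend/MPI_Irecv;
       the other ranks synchronize with it through node_comm */
    MPI_Barrier(node_comm);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}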
Finally, if you're absolutely bound and determined to keep doing things through what's essentially your own local-node-only IPC implementation --- since you're already using SysV shared memory segments, you might as well use SysV semaphores to do the synchronization. You're already using your own (somewhat delicate) semaphore-like mechanism to "flag" when the data is ready for computation; here you could use a more robust, already-written semaphore to let the non-MPI processes know when the data is ready for computation (and a similar mechanism to let the MPI process know when the others are done with the computation).
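For that last option, a minimal sketch of the SysV-semaphore handshake (the key, the single semaphore, and which process runs which half are all placeholders; the idea is that the communicating MPI rank posts once mpi_wait says the receive is complete, and the workers block on the semaphore before touching the buffer):
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
    key_t key = ftok("/tmp", 42);                /* all processes agree on a key */
    int semid = semget(key, 1, IPC_CREAT | 0600);
    if (semid == -1) { perror("semget"); return 1; }

    union semun { int val; } arg = { 0 };        /* Linux makes you define this */
    semctl(semid, 0, SETVAL, arg);               /* start as "not ready" */

    /* communicating (MPI) process, after mpi_wait says the recv is done: */
    struct sembuf post = { 0, +1, 0 };
    semop(semid, &post, 1);

    /* worker process: block until the buffer is flagged as ready */
    struct sembuf wait = { 0, -1, 0 };
    semop(semid, &wait, 1);

    return 0;
}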
I am interested in writing separate program modules that run as independent threads that I could hook together with pipes. The motivation would be that I could write and test each module completely independently, perhaps even write them in different languages, or run the different modules on different machines. There are a wide variety of possibilities here. I have used piping for a while, but I am unfamiliar with the nuances of its behaviour.
It seems like the receiving end will block waiting for input, which I would expect, but will the sending end block sometimes waiting for someone to read from the stream?
If I write an eof to the stream, can I continue writing to that stream until I close it?
Are there differences in the behaviour of named and unnamed pipes?
Does it matter which end of the pipe I open first with named pipes?
Is the behaviour of pipes consistent between different Linux systems?
Does the behaviour of the pipes depend on the shell I'm using or the way I've configured it?
Are there any other questions I should be asking or issues I should be aware of if I want to use pipes in this way?
Wow, that's a lot of questions. Let's see if I can cover everything...
It seems like the receiving end will block waiting for input, which I would expect
You expect correctly; an actual 'read' call will block until something is there. However, I believe there are some C functions that will allow you to 'peek' at what (and how much) is waiting in the pipe. Unfortunately, I don't remember if this blocks as well.
will the sending end block sometimes waiting for someone to read from the stream
No, sending should never block. Think of the ramifications if this were a pipe across the network to another computer. Would you want to wait (through possibly high latency) for the other computer to respond that it received it? Now this is a different case if the reader handle of the destination has been closed. In this case, you should have some error checking to handle that.
If I write an eof to the stream can I continue writing to that stream until I close it
I would think this depends on what language you're using and its implementation of pipes. In C, I'd say no. In a linux shell, I'd say yes. Someone else with more experience would have to answer that.
Are there differences in the behaviour of named and unnamed pipes?
As far as I know, yes. However, I don't have much experience with named vs unnamed. I believe the difference is:
Single direction vs Bidirectional communication
Reading AND writing to the "in" and "out" streams of a thread
Does it matter which end of the pipe I open first with named pipes?
Generally no, but you could run into problems on initialization trying to create and link the threads with each other. You'd need to have one main thread that creates all the sub-threads and syncs their respective pipes with each other.
Is the behaviour of pipes consistent between different Linux systems?
Again, this depends on what language, but generally yes. Ever heard of POSIX? That's the standard (at least for Linux; Windows does its own thing).
Does the behaviour of the pipes depend on the shell I'm using or the way I've configured it?
This is getting into a little more of a gray area. The answer should be no since the shell should essentially be making system calls. However, everything up until that point is up for grabs.
Are there any other questions I should be asking
The questions you've asked show that you have a decent understanding of the system. Keep researching and focus on what level you're going to be working at (shell, C, and so on). You'll learn a lot more by just trying it, though.
This is all based on a UNIX-like system; I'm not familiar with the specific behavior of recent versions of Windows.
It seems like the receiving end will block waiting for input, which I would expect, but will the sending end block sometimes waiting for someone to read from the stream?
Yes, although on a modern machine it may not happen often. The pipe has an intermediate buffer that can potentially fill up. If it does, the write side of the pipe will indeed block. But if you think about it, there aren't a lot of files that are big enough to risk this.
If I write an eof to the stream can I continue writing to that stream until I close it?
Um, you mean like a CTRL-D, 0x04? Sure, as long as the stream is set up that way. Viz.
506 # cat | od -c
abc
^D
efg
0000000 a b c \n 004 \n e f g \n
0000012
Are there differences in the behaviour of named and unnamed pipes?
Yes, but they're subtle and implementation dependent. The biggest one is that you can write to a named pipe before the other end is running; with unnamed pipes, the file descriptors get shared during the fork/exec process, so there's no way to access the transient buffer without the processes being up.
Does it matter which end of the pipe I open first with named pipes?
Nope.
Is the behaviour of pipes consistent between different linux systems?
Within reason, yes. Buffer sizes etc may vary.
Does the behaviour of the pipes depend on the shell I'm using or the way I've configured it?
No. When you create a pipe, under the covers what happens is your parent process (the shell) creates a pipe which has a pair of file descriptors, then does a fork/exec like this pseudocode:
Parent:
    create pipe, returning two file descriptors, call them fd[0] and fd[1]
    fork write-side process
    fork read-side process

Write-side:
    close fd[0]
    connect fd[1] to stdout
    exec writer program

Read-side:
    close fd[1]
    connect fd[0] to stdin
    exec reader program
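A hedged C rendering of that pseudocode (the program names "writer" and "reader" are placeholders):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    pid_t writer = fork();
    if (writer == 0) {                  /* write-side child */
        close(fd[0]);                   /* close the read end */
        dup2(fd[1], STDOUT_FILENO);     /* connect fd[1] to stdout */
        close(fd[1]);
        execlp("writer", "writer", (char *)NULL);
        perror("exec writer"); _exit(127);
    }

    pid_t reader = fork();
    if (reader == 0) {                  /* read-side child */
        close(fd[1]);                   /* close the write end */
        dup2(fd[0], STDIN_FILENO);      /* connect fd[0] to stdin */
        close(fd[0]);
        execlp("reader", "reader", (char *)NULL);
        perror("exec reader"); _exit(127);
    }

    close(fd[0]);                       /* the parent must close both ends, */
    close(fd[1]);                       /* or the reader never sees EOF     */
    waitpid(writer, NULL, 0);
    waitpid(reader, NULL, 0);
    return 0;
}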
Are there any other questions I should be asking or issues I should be aware of if I want to use pipes in this way?
Is everything you want to do really going to lay out in a line like this? If not, you might want to think about a more general architecture. But the insight that having lots of separate processes interacting through the "narrow" interface of a pipe is desirable is a good one.
[Updated: I had the file descriptor indices reversed at first. They're correct now, see man 2 pipe.]
As Dashogun and Charlie Martin noted, this is a big question. Some parts of their answers are inaccurate, so I'm going to answer too.
I am interested in writing separate program modules that run as independent threads that I could hook together with pipes.
Be wary of trying to use pipes as a communication mechanism between threads of a single process. Because you would have both read and write ends of the pipe open in a single process, you would never get the EOF (zero bytes) indication.
If you were really referring to processes, then this is the basis of the classic Unix approach to building tools. Many of the standard Unix programs are filters that read from standard input, transform it somehow, and write the result to standard output. For example, tr, sort, grep, and cat are all filters, to name but a few. This is an excellent paradigm to follow when the data you are manipulating permits it. Not all data manipulations are conducive to this approach, but there are many that are.
The motivation would be that I could write and test each module completely independently, perhaps even write them in different languages, or run the different modules on different machines.
Good points. Be aware that there isn't really a pipe mechanism between machines, though you can get close to it with programs such as rsh or (better) ssh. However, internally, such programs may read local data from pipes and send that data to remote machines, but they communicate between machines over sockets, not using pipes.
There are a wide variety of possibilities here. I have used piping for a while, but I am unfamiliar with the nuances of its behaviour.
OK; asking questions is one (good) way to learn. Experimenting is another, of course.
It seems like the receiving end will block waiting for input, which I would expect, but will the sending end block sometimes waiting for someone to read from the stream?
Yes. There is a limit to the size of a pipe buffer. Classically, this was quite small - 4096 or 5120 were common values. You may find that modern Linux uses a larger value. You can use fpathconf() and _PC_PIPE_BUF to find out the size of a pipe buffer. POSIX only requires the buffer to be 512 (that is, _POSIX_PIPE_BUF is 512).
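If you want to check the value on your own system, here is a small hedged example; it simply queries _PC_PIPE_BUF on a freshly created pipe, as described above:
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }
    long bufsz = fpathconf(fd[1], _PC_PIPE_BUF);
    printf("_PC_PIPE_BUF = %ld (POSIX guarantees at least %d)\n",
           bufsz, _POSIX_PIPE_BUF);
    return 0;
}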
If I write an eof to the stream can I continue writing to that stream until I close it?
Technically, there is no way to write EOF to a stream; you close the pipe descriptor to indicate EOF. If you are thinking of control-D or control-Z as an EOF character, then those are just regular characters as far as pipes are concerned - they only have an effect like EOF when typed at a terminal that is running in canonical mode (cooked, or normal).
Are there differences in the behaviour of named and unnamed pipes?
Yes, and no. The biggest differences are that unnamed pipes must be set up by one process and can only be used by that process and children who share that process as a common ancestor. By contrast, named pipes can be used by previously unassociated processes. The next big difference is a consequence of the first; with an unnamed pipe, you get back two file descriptors from a single function (system) call to pipe(), but you open a FIFO or named pipe using the regular open() function. (Someone must create a FIFO with the mkfifo() call before you can open it; unnamed pipes do not need any such prior setup.) However, once you have a file descriptor open, there is precious little difference between a named pipe and an unnamed pipe.
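A short hedged sketch of the named-pipe path (the FIFO path /tmp/demo_fifo is just an example):
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/demo_fifo";

    /* one process (or a setup step) creates the FIFO in the filesystem */
    if (mkfifo(path, 0600) == -1)
        perror("mkfifo");               /* EEXIST is harmless on a re-run */

    /* any process can then open it like a file; this open blocks until a
       reader appears unless O_NONBLOCK is used */
    int fd = open(path, O_WRONLY);
    if (fd == -1) { perror("open"); return 1; }

    write(fd, "hello\n", 6);
    close(fd);
    unlink(path);
    return 0;
}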
Does it matter which end of the pipe I open first with named pipes?
No. The first process to open the FIFO will (normally) block until there's a process with the other end open. If you open it for reading and writing (unconventional but possible), then you won't be blocked; if you use the O_NONBLOCK flag, you won't be blocked.
Is the behaviour of pipes consistent between different Linux systems?
Yes. I've not heard of or experienced any problems with pipes on any of the systems where I've used them.
Does the behaviour of the pipes depend on the shell I'm using or the way I've configured it?
No: pipes and FIFOs are independent of the shell you use.
Are there any other questions I should be asking or issues I should be aware of if I want to use pipes in this way?
Just remember that you must close the reading end of a pipe in the process that will be writing, and the writing end of the pipe in the process that will be reading. If you want bidirectional communication over pipes, use two separate pipes. If you create complicated plumbing arrangements, beware of deadlock - it is possible. A linear pipeline does not deadlock, however (though if the first process never closes its output, the downstream processes may wait indefinitely).
I observed both above and in comments to other answers that pipe buffers are classically limited to quite small sizes. @Charlie Martin counter-commented that some versions of Unix have dynamic pipe buffers and these can be quite large.
I'm not sure which ones he has in mind. I used the test program that follows on Solaris, AIX, HP-UX, MacOS X, Linux and Cygwin / Windows XP (results below):
#include <unistd.h>
#include <signal.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

static const char *arg0;

static void err_syserr(char *str)
{
    int errnum = errno;
    fprintf(stderr, "%s: %s - (%d) %s\n", arg0, str, errnum, strerror(errnum));
    exit(1);
}

int main(int argc, char **argv)
{
    int pd[2];
    pid_t kid;
    size_t i = 0;
    char buffer[2] = "a";
    int flags;

    arg0 = argv[0];
    if (pipe(pd) != 0)
        err_syserr("pipe() failed");
    if ((kid = fork()) < 0)
        err_syserr("fork() failed");
    else if (kid == 0)
    {
        close(pd[1]);
        pause();
    }
    /* else */
    close(pd[0]);
    if (fcntl(pd[1], F_GETFL, &flags) == -1)
        err_syserr("fcntl(F_GETFL) failed");
    flags |= O_NONBLOCK;
    if (fcntl(pd[1], F_SETFL, &flags) == -1)
        err_syserr("fcntl(F_SETFL) failed");
    while (write(pd[1], buffer, sizeof(buffer)-1) == sizeof(buffer)-1)
    {
        putchar('.');
        if (++i % 50 == 0)
            printf("%u\n", (unsigned)i);
    }
    if (i % 50 != 0)
        printf("%u\n", (unsigned)i);
    kill(kid, SIGINT);
    return 0;
}
I'd be curious to get extra results from other platforms. Here are the sizes I found. All the results are larger than I expected, I must confess, but Charlie and I may be debating the meaning of 'quite large' when it comes to buffer sizes.
8196 - HP-UX 11.23 for IA-64 (fcntl(F_SETFL) failed)
16384 - Solaris 10
16384 - MacOS X 10.5 (O_NONBLOCK did not work, though fcntl(F_SETFL) did not fail)
32768 - AIX 5.3
65536 - Cygwin / Windows XP (O_NONBLOCK did not work, though fcntl(F_SETFL) did not fail)
65536 - SuSE Linux 10 (and CentOS) (fcntl(F_SETFL) failed)
One point that is clear from these tests is that O_NONBLOCK works with pipes on some platforms and not on others.
The program creates a pipe, and forks. The child closes the write end of the pipe, and then goes to sleep until it gets a signal - that's what pause() does. The parent then closes the read end of the pipe, and sets the flags on the write descriptor so that it won't block on an attempt to write on a full pipe. It then loops, writing one character at a time, and printing a dot for each character written, and a count and newline every 50 characters. When it detects a write problem (buffer full, since the child is not reading a thing), it stops the loop, writes the final count, and kills the child.