int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
I'm a little confused about the maxevents parameter. Let's say I want to write a server that can handle up to 10k connections. Would I define maxevents as 10000 then, or should it be lower for some reason?
maxevents is just the length of the struct epoll_event array pointed to by *events.
If the kernel has more events ready than that at the time of the call, it simply won't return more than maxevents of them in that particular wait; the remaining events are reported by later calls.
You will probably need to experiment with the optimal size of this for your program. The optimal size may even differ by architecture. For small numbers of file descriptors being polled you can quite easily just set maxevents to the number of files (and size the events array accordingly), but the likelihood of all files needing attention at the same time is low, so you would probably be able to use a lower maxevents value.
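For illustration, a typical event loop just reuses a fixed-size array across calls; the array length (64) and the helper name here are arbitrary choices, not a recommendation tied to 10k connections:

#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64   /* bounds one epoll_wait call, not the connection count */

void event_loop(int epfd)
{
    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        if (n < 0) {
            perror("epoll_wait");
            break;
        }
        for (int i = 0; i < n; ++i) {
            /* handle events[i].data.fd / events[i].events here */
        }
    }
}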
Context:
The need is to simulate a net of related discrete elements (a complex electronic circuit). Each component receives input from several other components and outputs to several others.
The intended design is to have a single kernel, with a configuration argument defining which component it represents. Each component of the circuit is represented by a work-item, and the whole circuit will fit in a single work-group (or the circuit will be split appropriately so that each work-group can manage all of its components as work-items).
The problem:
Is it possible (and if so, how) to have some work-items wait for other work-items' data?
A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before starting its own processing.
The net has no loops, so a single work-item never needs to run twice.
Attempts:
In the following example, each component can have at most one input (to simplify), making the circuit a tree where the input to the circuit is the root and the 3 outputs are leaves.
inputIndex models this tree by indicating, for each component, which other component provides its input. The first component takes itself as input, but the kernel handles this case (for simplification).
result saves the result of each component (voltage, intensity, etc.).
inputModified indicates whether the given component has already calculated its output.
// where the data comes from (index in result)
constant int inputIndex[5] = {0, 0, 0, 2, 2};

kernel void update_component(
    local int *result,        // each work-item's result
    local int *inputModified  // whether all inputs are ready (only one input in this example)
) {
    int id = get_local_id(0);
    int size = get_local_size(0);
    int barrierCount = 0;

    // inputModified is a boolean indicating if the input is ready
    inputModified[id] = (id != 0 ? 0 : 1);

    // make sure all inputs are false by default (except the first input)
    barrier(CLK_LOCAL_MEM_FENCE);

    // Wait until all inputs are ready (only one in this example)
    while (!inputModified[inputIndex[id]] && size > barrierCount++)
    {
        // If the input is not ready, wait for it
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // all inputs are ready, compute the output
    if (id != 0) result[id] = result[inputIndex[id]] + 1;
    else         result[0] = 42;

    // make sure any other work-item depending on this one is unblocked
    inputModified[id] = 1;

    // Even when finished, we need to keep hitting the barrier for the other work-items.
    while (size > barrierCount++)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
This example has N barriers for N components, making it worse than a sequential solution.
Note: this is only the kernel; the minimal C++ host code is quite long. I can add it if needed.
Question:
Is it possible, efficiently and from within the kernel itself, to have the different work-items wait for their data to be provided by other work-items? If not, what solution would be efficient?
This problem is (for me) not trivial to explain, and I am far from an expert in OpenCL. Please be patient and feel free to ask if anything is unclear.
From the documentation of barrier:
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/barrier.html
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue
execution beyond the barrier.
But a while loop (containing a barrier) in the kernel has this condition:
inputModified[inputIndex[id]]
This condition can evaluate differently for different work-items (it depends on the work-item id), so not all work-items execute the barrier the same number of times, which leads to undefined behavior. Besides, the barrier placed before the loop
barrier(CLK_LOCAL_MEM_FENCE);
already synchronizes all work-items in the work-group, so the while loop is redundant even if it happens to work.
Also, the last barrier loop is redundant:
while (size > barrierCount++)
{
barrier(CLK_LOCAL_MEM_FENCE);
}
When the kernel ends, all work-items are synchronized anyway.
If you need to send a message to work-items outside the work-group, you can only use atomic variables. Even when using atomics, you should not assume any execution/issue order between any two work-items.
Your question
how to have some work-items wait for other work-items' data? A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before starting its own processing. The net has no loops, so a single work-item never needs to run twice.
can be answered with an OpenCL 2.x feature, "dynamic parallelism", which lets a work-item spawn new kernels/work-groups from inside a kernel. It is much more efficient than waiting in a spin-wait loop, and far more hardware-independent than relying on the number of in-flight threads a GPU supports (when the GPU can't keep that many threads in flight, any spin-wait will deadlock, regardless of thread ordering).
When you use barrier, you don't need to inform other threads via inputModified: the data in result is already visible within the work-group after the barrier.
If you can't use OpenCL v2.x, then you should process a tree using BFS:
start 1 workitem for top node
process it and prepare K outputs and push them into a queue
end kernel
start K workitems (each pop elements from queue)
process them and prepare N outputs and push them into queue
end kernel
repeat until queue doesn't have any more elements
The number of kernel calls is equal to the maximum depth of the tree, not the number of nodes.
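For example, one such BFS step could look roughly like the sketch below. The CSR-style child tables (childStart/childList), the frontier buffers and the kernel name are illustrative assumptions, not part of the original code:

kernel void bfs_step(
    global const int *frontier,       // node ids belonging to this level
    constant int *inputIndex,         // parent of each node, as in the question
    global const int *childStart,     // CSR-style child offsets (assumed)
    global const int *childList,      // CSR-style child ids (assumed)
    global int *result,
    global int *nextFrontier,
    volatile global int *nextCount)   // atomic counter for the next level
{
    int node = frontier[get_global_id(0)];

    // The parent's result was written by the previous kernel launch,
    // so no in-kernel synchronization is needed here.
    result[node] = (node == 0) ? 42 : result[inputIndex[node]] + 1;

    // "Prepare K outputs and push them into a queue" for the next launch.
    for (int c = childStart[node]; c < childStart[node + 1]; ++c)
        nextFrontier[atomic_inc(nextCount)] = childList[c];
}

The host launches this kernel once per level, swapping frontier and nextFrontier between launches and resetting nextCount to zero.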
If you need quicker synchronization than kernel launches, then use a single work-group for the whole tree and a barrier per level instead of kernel re-launches (see the sketch below). Or process the first few levels on the CPU to obtain multiple sub-trees and send them to different OpenCL work-groups. Computing on the CPU until there are N sub-trees, where N is the number of GPU compute units, could work well for this work-group-barrier based, asynchronous computation of sub-trees.
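A minimal sketch of that single-work-group, barrier-per-level variant, assuming an extra host-prepared constant table level[] giving the depth of each node (DEPTH and the table are assumptions added for illustration):

#define DEPTH 3   // number of levels in the question's example tree

kernel void update_component_levels(
    local int *result,
    constant int *inputIndex,
    constant int *level)       // depth of each node (assumed, host-prepared)
{
    int id = get_local_id(0);
    for (int d = 0; d < DEPTH; ++d) {
        if (level[id] == d)
            result[id] = (id == 0) ? 42 : result[inputIndex[id]] + 1;
        // Every work-item reaches this barrier exactly DEPTH times,
        // so the uniform-execution rule for barrier is satisfied.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

This needs only as many barriers as the tree has levels, instead of one per component.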
There is also a barrierless, atomic-less and single-kernel-call way to do this: start from the bottom of the tree and go up.
Map all the deepest-level (leaf) nodes to work-items. Move each of them towards the top while recording their path (node id, etc.) in private memory or some other fast memory. Then have them traverse back top-down through that recorded path, computing on the way, without any synchronization or even atomics. This is less work-efficient than the barrier/kernel-call versions, but the lack of barriers and the totally asynchronous paths should make it fast enough.
If the tree has depth 10, this means 10 node pointers to save, which is not much for private registers. If the depth is around 30-40, use local memory with fewer threads per work-group; if it is even more, allocate global memory.
But you may need to sort the work-items by their spatial locality / the tree's topology to make them work together faster, with less branching.
This approach looks simplest to me, so I suggest you try this barrierless version first (a sketch follows).
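A rough sketch of that bottom-up/top-down idea, reusing the question's inputIndex convention (the root is its own parent). MAX_DEPTH, the leafNodes table and the kernel name are assumptions for illustration:

#define MAX_DEPTH 16

kernel void compute_from_leaves(
    constant int *inputIndex,        // parent of each node
    global const int *leafNodes,     // one leaf node id per work-item (assumed)
    global int *result)
{
    int path[MAX_DEPTH];             // private memory: the recorded path
    int node = leafNodes[get_global_id(0)];
    int depth = 0;

    // Walk up to the root, recording the path.
    while (node != inputIndex[node] && depth < MAX_DEPTH) {
        path[depth++] = node;
        node = inputIndex[node];
    }

    // Walk back down, computing on the way (no barriers, no atomics).
    int value = 42;                  // root value, as in the example
    result[node] = value;
    while (depth > 0) {
        node = path[--depth];
        value = value + 1;           // the example's "+1" computation
        result[node] = value;
    }
}

Work-items whose leaves share ancestors redundantly recompute (and overwrite with identical values) those shared nodes; that is the work-efficiency cost mentioned above.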
If you only need data visibility per work-item instead of per work-group or per kernel, use a fence: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/mem_fence.html
I want to send and receive more than 2 GB of data using MPI and I came across a lot of articles like the ones cited below:
http://blogs.cisco.com/performance/can-we-count-on-mpi-to-handle-large-datasets,
http://blogs.cisco.com/performance/new-things-in-mpi-3-mpi_count
talking about changes made starting with MPI 3.0 that allow sending and receiving bigger chunks of data.
Most of the functions now take an MPI_Count parameter instead of an int, but not all of them.
How can I replace
int MPI_Pack_size(int incount, MPI_Datatype datatype, MPI_Comm comm,
int *size)
in order to get the size of a larger buffer? (Here the returned size can be at most 2 GB, because it is returned through an int.)
The MPI_Pack routines (MPI_Pack, MPI_Unpack, MPI_Pack_size, MPI_Pack_external) are, as you see, unable to support more than 32 bits' worth of data, due to the int used to return the size. I don't know why the standard did not provide MPI_Pack_x, MPI_Unpack_x, MPI_Pack_size_x, and MPI_Pack_external_x -- presumably an oversight? As Jeff suggests, it might have been left out because packing multiple gigabytes of data is unlikely to provide much benefit. Still, it breaks orthogonality not to have those...
A quality implementation (I do not know if MPICH is one of those) should return an error about the type being too big, allowing you to pack a smaller amount of data.
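One way to apply that advice is simply to pack in chunks that individually stay well below the 2 GB limit. A minimal sketch (the chunk size, the MPI_DOUBLE element type and the helper name are arbitrary illustrations):

#include <mpi.h>
#include <stdlib.h>

#define CHUNK_ELEMS (64 * 1024 * 1024)   /* 64M doubles, about 512 MB per chunk */

void pack_in_chunks(const double *data, size_t nelems, MPI_Comm comm)
{
    size_t off;
    for (off = 0; off < nelems; off += CHUNK_ELEMS) {
        int count = (int)(nelems - off < CHUNK_ELEMS ? nelems - off : CHUNK_ELEMS);
        int bytes = 0;
        MPI_Pack_size(count, MPI_DOUBLE, comm, &bytes);   /* fits in an int */

        char *buf = malloc((size_t)bytes);
        int position = 0;
        MPI_Pack(data + off, count, MPI_DOUBLE, buf, bytes, &position, comm);

        /* ... send or store this chunk (e.g. MPI_Send of `position` bytes
           of MPI_PACKED data), then release it ... */
        free(buf);
    }
}

Each chunk gets its own packed buffer, so no single MPI_Pack_size or MPI_Pack call ever sees a count or size near the int limit.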
I am using the ReadStream interface to sample at 100 Hz, and I have been able to integrate the interface into the Oscilloscope application. I am just not sure about the way I pass the buffer values on to the packet to be transmitted. Currently this is how I am doing it:
uint8_t i = 0;

event void ReadStream.bufferDone(error_t result, uint16_t *buffer, uint16_t count)
{
    if (reading < count)
        i++;
    local.readings[reading++] = buffer[i];
}
I have defined a buffer size of 50. I am not sure this is the right way to do it, as I am seeing just one sample per packet even though I have set Nreadings=2.
Also, the sampling rate does not seem to be 100 samples/second when I check. I must be doing something wrong in the way I pass data to the packet to be transmitted.
I think I need to clarify a few things according to your questions and comments.
Reading a single sample from an accelerometer on micaZ motes works as follows:
Turn on the accelerometer.
Wait 17 milliseconds. According to the ADXL202E (the accelerometer) datasheet, the startup time is 16.3 ms. This is because this particular hardware cannot provide its first reading immediately after being powered on, only after some delay. If you decrease this delay, you will likely get a wrong reading; however, the behavior is undefined, so you may sometimes get a correct reading, or the result may depend on environmental conditions such as ambient temperature. Changing this 17-ms delay to a lower value is certainly a bad idea.
Read values (in two axes) from the Analog to Digital Converter (ADC), which is an MCU component that converts the analog output voltage of the accelerometer to a digital value (an integer). The speed at which the ADC can sample is independent of the parameters of the accelerometer: it is another piece of hardware.
Turn off the accelerometer.
This is what happens when you call Read.read() in your code. You see that the maximum frequency at which you can sample is once every 17 ms, that is, about 58 samples per second. It may be even a bit lower because of some MCU overhead or timer inaccuracy. This is true when you sample by calling Read.read() in a loop or at fixed intervals, because this call itself (the delay between the command and the event) lasts no less than 17 ms.
What you may want to do is:
Turn on the accelerometer.
Wait 17 ms.
Perform series of reads.
Turn off the accelerometer.
If you do so, you have one 17-ms delay for a set of samples instead of such delay for each sample. What is important, these steps have nothing to do with the interface you use for performing readings. You may call Read.read() multiple times in your application, however, it cannot be the same implementation of the read command that is already implemented for this accelerometer, because the existing implementation is responsible for turning on and off the accelerometer, and it waits 17 ms before reading each sample. For convenience, you may implement the ReadStream interface instead and call it once in your application.
Moreover, you wrote that ReadStream used a microsecond timer and is independent of the 17-ms settling time of the ADC. That sentence is completely wrong. First of all, you cannot say that an interface uses or does not use a timer. The interface is just a set of commands and events without their definitions. A particular implementation of the interface may use timers. The Read and ReadStream interfaces may be implemented multiple times on different platforms by various hardware components, such as accelerometers, thermometers, hygrometers, magnetometers, and so on. Secondly, the 17-ms settling time refers to the accelerometer, not the ADC. And no matter which interface you use, Read or ReadStream, and which timers a driver uses, milli- or microsecond, the 17-ms delay is always required after powering on the accelerometer. As I mentioned, you probably want to incur this delay once per series of reads instead of once per single read.
It seems that the TinyOS source code already contains an implementation of the accelerometer driver providing the ReadStream interface which allows you to sample continuously. Look at the AccelXStreamC and AccelYStreamC components (in tos/sensorboards/mts300/).
The ReadStream interface consists of two commands. postBuffer(val_t *buf, uint16_t count) is called to provide a buffer for samples. In the accelerometer driver, val_t is defined as uint16_t. You may post multiple buffers, one by one. This command does not yet start sampling and filling buffers. For that purpose, there is a read(uint32_t usPeriod) command, which directs the device to start filling buffers by sampling with the specified period (in microseconds). When a buffer is full, you get an event bufferDone(error_t result, val_t *buf, uint16_t count), and the component starts filling the next buffer, if any. If there are no buffers left, you additionally get an event readDone(error_t result, uint32_t usActualPeriod), whose usActualPeriod parameter indicates the actual sampling period, which may differ from (in particular, be higher than) the period you requested when calling read, due to hardware constraints.
So the solution is to use the ReadStream interface provided by AccelXStreamC and AccelYStreamC (or maybe some higher-level components that use them) and pass an expected period in microseconds to the read command. If the actual period is higher than the one you requested (i.e. the actual rate is lower), this means that sampling at a higher rate is impossible either due to hardware constraints or because it was not implemented in the ADC driver. In the second case, you may try to fix the driver, although it requires good knowledge of low-level programming. The ADC driver source code for this platform is located in tos/chips/atm128/adc.
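For completeness, a rough nesC sketch (not a complete module) of that flow, assuming a module that declares `uses interface ReadStream<uint16_t> as AccelX;` wired to AccelXStreamC; the buffer size and the 10000-microsecond period (100 Hz) are illustrative:

uint16_t bufA[50], bufB[50];

void startSampling() {
  call AccelX.postBuffer(bufA, 50);
  call AccelX.postBuffer(bufB, 50);
  call AccelX.read(10000);               // requested period: 10000 us = 100 Hz
}

event void AccelX.bufferDone(error_t result, uint16_t *buf, uint16_t count) {
  // `count` samples are now in buf: copy them into the outgoing packet here,
  // then re-post the buffer so sampling continues without gaps.
  call AccelX.postBuffer(buf, count);
}

event void AccelX.readDone(error_t result, uint32_t usActualPeriod) {
  // usActualPeriod reports the period the driver could actually achieve.
}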
Often it is advised to keep the global_work_size the same as the logical amount of "elements" you must process. My application doesn't have such a thing, though. If I have N elements that need to be processed, then, after a single kernel pass, I will have M elements - a completely different number that doesn't depend on N.
In order to deal with this situation, I could write a loop such as:
while (elementsToBeProcessed)
read "elementsToBeProcessed" variable from device
enqueue ND range kernel with global_work_size = elementsToBeProcessed
But that requires one read per pass. An alternative would be to keep everything inside the GPU, by calling enqueueNDRangeKernel only once, with a fixed global_work_size and local_work_size matching the GPU layout and then use a master thread to synchronize the computation within.
My question is simple: is my intuition correct that the second option is better, or is there any reason to go with the first?
That is a tricky problem, and which way to take depends on the global size values you are going to have and how much they change over time.
A read per pass: (better for highly changing values)
Fitted global size, all the work items will do useful work
Unfitted local size for the HW, if the work size is small
Blocking behavior in the queue, bad device utilization
Easy to understand and debug
Fixed kernel launch size: (better for stable but changing values)
Un-fitted global size, may waste some time running null work items
Fitted local size to the device
Non blocking behavior, 100% device usage
Complex to debug
As some answers already say, OpenCL 2.0 is a solution, using pipes. But it is also possible to use another OpenCL 2.0 feature, enqueuing kernels from inside kernels, so that your kernels can launch the next batch of kernels without CPU intervention.
It is always good if you can avoid transferring data between host and device, even if it means a little more work on the device. In many applications, data transfer is the slowest part.
To find the better solution for your system configuration, you need to test both of them. If you are targeting multiple platforms, then the second one should be faster in general. But there are a lot of things that can make it slower; for example, the code might be harder for the compilers to optimize, or the data access pattern might lead to more cache misses.
If you are targeting OpenCL 2.0, pipes might be something you want to look at for this kind of varying number of elements. (Before I get downvotes because some platforms don't support 2.0: AMD has promised 2.0 drivers this year.) With pipes, you can make a producer kernel and a consumer kernel. The consumer kernel can start working as soon as it has enough items to work on. This might lead to better utilization of all resources.
The tradeoff: The performance hit for doing the readback is that the GPU will be idle waiting for work, whereas if you just enqueue a bunch of kernels it will stay busy.
Simple: So I think the answer depends on how much elementsToBeProcessed will vary. If a sequence of runs might be (for example) 20000, 19760, 15789, 19345 then I'd always run 20000 and have a few idle work items. On the other hand, if a typical pattern is 20000, 4236, 1234, 9000 then I'd read back elementsToBeProcessed and enqueue the kernel for only what is needed.
Advanced: If your pattern is monotonically decreasing, you could interleave the readback with the kernel enqueue, so that you're always keeping the GPU busy but you're also making the kernel launches smaller as you go. Between every kernel enqueue, start an async double-buffered readback of a copy of elementsToBeProcessed and use it for the kernel after the one you enqueue next.
Like this:
1. elementsToBeProcessedA = starting value
2. elementsToBeProcessedB = starting value
3. eventA = NULL
4. eventB = NULL
5. Enqueue kernel with NDRange of elementsToBeProcessedA
6. Non-blocking clEnqueueReadBuffer for elementsToBeProcessedA, taking eventA
7. If non-null, wait on eventB, release event
8. Enqueue kernel with NDRange of elementsToBeProcessedB
9. Non-blocking clEnqueueReadBuffer for elementsToBeProcessedB, taking eventB
10. If non-null, wait on eventA, release event
11. Goto 5
This will keep the GPU fully saturated and yet will use smaller elementsToBeProcessed values as it goes. It will not handle the case where elementsToBeProcessed increases, so don't do it this way if that is the case.
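A hedged C translation of the numbered steps above; `queue`, `kernel` and `countBuf` (a device buffer holding elementsToBeProcessed) are assumed to exist already, and error checking plus the termination condition are omitted, as in the pseudocode:

#include <CL/cl.h>

void ping_pong_loop(cl_command_queue queue, cl_kernel kernel, cl_mem countBuf,
                    cl_uint startingValue)
{
    cl_uint countA = startingValue, countB = startingValue;   /* steps 1-2 */
    cl_event evA = NULL, evB = NULL;                           /* steps 3-4 */

    for (;;) {
        size_t globalA = countA;                               /* step 5 */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalA, NULL,
                               0, NULL, NULL);
        clEnqueueReadBuffer(queue, countBuf, CL_FALSE, 0,      /* step 6 */
                            sizeof(cl_uint), &countA, 0, NULL, &evA);
        if (evB) {                                             /* step 7 */
            clWaitForEvents(1, &evB);
            clReleaseEvent(evB);
            evB = NULL;
        }

        size_t globalB = countB;                               /* step 8 */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalB, NULL,
                               0, NULL, NULL);
        clEnqueueReadBuffer(queue, countBuf, CL_FALSE, 0,      /* step 9 */
                            sizeof(cl_uint), &countB, 0, NULL, &evB);
        if (evA) {                                             /* step 10 */
            clWaitForEvents(1, &evA);
            clReleaseEvent(evA);
            evA = NULL;
        }
        /* step 11: loop back to step 5; add a break once a readback
           reports zero remaining elements. */
    }
}

Note that the count read back after one launch sizes the launch after the next one; that one-launch lag is exactly the price this scheme accepts in exchange for never stalling the GPU.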
An alternate solution: Always run a fixed number of global work items, enough to fill the GPU but not more. Each work item should then look at the total number of items to be done for this pass (elementsToBeProcessed) and then do its portion of the total.
uint elementsToBeProcessed = <read from global memory>
uint step = get_global_size(0);
for (uint i = get_global_id(0); i < elementsToBeProcessed; i += step)
{
    <process item "i">
}
A simplified example: global work size of 5 (artificially small for the example), elementsToBeProcessed = 19: in the first pass through the loop elements 0-4 are processed, in the second pass 5-9, in the third pass 10-14, and in the fourth pass 15-18.
You'd want to tune the fixed global work size to exactly match your hardware (compute units * max work group size or some division of that).
This is not unlike the algorithm for how work items cooperate to copy data into shared local memory regardless of work group size.
Global work size doesn't have to be fixed. E.g., say you have 128 stream processors; then you make a kernel with local size 128 too. Your global work size can be any number that is a multiple of that value: 256, 4096, etc.
Though the size of the local group is usually determined by the hardware specs. If you have more data to process, just increase the number of work-groups involved.
In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to be able to access. What's the recommended way to let each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
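A small sketch of what that could look like; the kernel name and the A + B computation are purely illustrative (storing pointers to __global memory in private variables is ordinary OpenCL C):

kernel void use_packed_arrays(global float *big_array,
                              constant int *offsets)
{
    // Recover the three logical arrays from the single big buffer.
    global float *A = big_array + offsets[0];
    global float *B = big_array + offsets[1];
    global float *C = big_array + offsets[2];

    int i = get_global_id(0);
    C[i] = A[i] + B[i];        // illustrative computation only
}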
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing the 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArg overhead, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels will keep using the same buffer with each kernel execution, you only need to setup the kernel arguments one time. A common mistake I see, which can bleed host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.
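For example, a host-side sketch of setting the arguments once and then reusing them across launches (the bufs array and the launch helper are assumptions for illustration):

#include <CL/cl.h>

/* Bind all buffers once, before the first launch. */
void bind_buffers_once(cl_kernel kernel, cl_mem *bufs, cl_uint nbufs)
{
    cl_uint i;
    for (i = 0; i < nbufs; ++i)
        clSetKernelArg(kernel, i, sizeof(cl_mem), &bufs[i]);
}

/* Each subsequent launch just enqueues the kernel; no clSetKernelArg calls. */
void launch(cl_command_queue queue, cl_kernel kernel, size_t global_size)
{
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);
}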