Never-terminating CPU-bound C program - cpu-usage

As you know, there are two kinds of processes, I/O bound and CPU bound...
I need a CPU-bound program that never terminates itself...
For example, is this what I wanted?
while(1){
    for(int i=0; i<1000; i++);
}

for(;;); should do it!

First of all, why do you want a never-terminating CPU-bound program?
And yes, that would work, but you don't really need the inner for-loop. The while-loop will run forever on its own (assuming the compiler doesn't optimize it away).
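If you want a version the optimizer definitely can't throw away, here is a minimal sketch; the volatile counter is only there to force real work on every iteration:

#include <stdint.h>

int main(void)
{
    /* volatile: the store cannot be optimized out, so each iteration does real CPU work */
    volatile uint64_t counter = 0;

    for (;;)
        counter++;     /* pure CPU work, no I/O, never returns */

    return 0;          /* never reached */
}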

There aren't only two kinds of processes. Even if you consider what resource is bounding the performance, there are more than two. The classic other ones are bandwidth, memory, database connections -- any finite resource or blocking one can be a bottleneck.
But, yes, your process is CPU-bound -- you can see that by looking at your task manager (Windows) or Activity Monitor (Mac) or top (Linux) and seeing it take 100% of your CPU.

Those programs probably aren't CPU bound.
I suggest implementing the Sieve of Eratosthenes, or something like that. Or how about a program that takes a number (say 42), divides it by Pi 1000 times, multiplies it by Pi 1000 times, subtracts the result from the original number, adds that to a variable and increments a counter, then repeats indefinitely. I suppose you might overflow one of the numeric values, but that should be fixable / preventable.
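A rough C sketch of that divide-and-multiply idea; the volatile accumulator and the occasional printf are my own additions, just to keep the optimizer from discarding the arithmetic:

#include <stdio.h>

int main(void)
{
    const double pi = 3.14159265358979323846;
    volatile double accumulator = 0.0;   /* volatile so the arithmetic is not optimized away */
    unsigned long counter = 0;

    for (;;) {                            /* repeat indefinitely */
        double x = 42.0;

        for (int i = 0; i < 1000; i++)    /* divide by Pi 1000 times ... */
            x /= pi;
        for (int i = 0; i < 1000; i++)    /* ... then multiply by Pi 1000 times */
            x *= pi;

        accumulator += 42.0 - x;          /* subtract the result from the original, accumulate */
        counter++;                        /* increment a counter */

        if (counter % 1000000 == 0)       /* rare output, keeps the work observable */
            printf("%lu iterations, drift %g\n", counter, accumulator);
    }
}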

Related

Is it a bad idea to keep a fixed global_work_size and local_work_size when the number of elements to be processed grow randomly?

Often it is advised to keep the global_work_size the same as the logical amount of "elements" you must process. My application doesn't have such a thing, though. If I have N elements that need to be processed, then, after a single kernel pass, I will have M elements - a completely different number that doesn't depend on N.
In order to deal with this situation, I could write a loop such as:
while (elementsToBeProcessed)
    read "elementsToBeProcessed" variable from device
    enqueue ND range kernel with global_work_size = elementsToBeProcessed
But that requires one read per pass. An alternative would be to keep everything inside the GPU, by calling enqueueNDRangeKernel only once, with a fixed global_work_size and local_work_size matching the GPU layout and then use a master thread to synchronize the computation within.
My question is simple: is my intuition correct that the second option is better, or is there any reason to go with the first?
That is a tricky problem, and which way to take depends on the global size values you are going to have and how much they change over time.
A read per pass: (better for highly changing values)
Fitted global size, all the work items will do useful work
Unfitted local size for the HW, if the work size is small
Blocking behavior in the queue, bad device utilization
Easy to understand and debug
Fixed kernel launch size: (better for stable but changing values)
Un-fitted global size, may waste some time running null work items
Fitted local size to the device
Non blocking behavior, 100% device usage
Complex to debug
As some answers already say, OpenCL 2.0 is the solution, by using pipes. But it is also possible to use another OpenCL 2.0 feature: kernels launching kernels (device-side enqueue), so that your kernels can launch the next batch of kernels without CPU intervention.
It is always good if you can avoid transferring data between host and device, even if it means a little more work on the device. In many applications data transfer is the slowest part.
To find the better solution for your system configuration, you need to test both of them. If you are targeting multiple platforms, the second one should be faster in general. But there are a lot of things that can make it slower; for example, the code might be harder for the compiler to optimize, or the data access pattern might lead to more cache misses.
If you are targeting OpenCL 2.0, pipes might be something you want to look at for this kind of varying element count. (Before I get down votes because of platforms not supporting 2.0: AMD has promised 2.0 drivers this year.) With pipes, you can make a producer kernel and a consumer kernel. The consumer kernel can start work as soon as it has enough items to work on. This might lead to better utilization of all resources.
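Going back to the first option for a moment, a minimal host-side sketch of the read-per-pass loop in the C API could look like this; queue, kernel and countBuf are placeholder names assumed to be set up elsewhere, and error checking is omitted:

#include <CL/cl.h>

/* Hypothetical helper: run kernel passes until no elements remain.
   The kernel is assumed to write the next pass's element count into countBuf. */
static void run_read_per_pass(cl_command_queue queue, cl_kernel kernel,
                              cl_mem countBuf, cl_uint elementsToBeProcessed)
{
    while (elementsToBeProcessed > 0) {
        size_t global = elementsToBeProcessed;

        /* one pass over the current elements */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                               0, NULL, NULL);

        /* blocking read: fetch how many elements the pass produced for the next one */
        clEnqueueReadBuffer(queue, countBuf, CL_TRUE, 0, sizeof(cl_uint),
                            &elementsToBeProcessed, 0, NULL, NULL);
    }
}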
The tradeoff: The performance hit for doing the readback is that the GPU will be idle waiting for work, whereas if you just enqueue a bunch of kernels it will stay busy.
Simple: So I think the answer depends on how much elementsToBeProcessed will vary. If a sequence of runs might be (for example) 20000, 19760, 15789, 19345 then I'd always run 20000 and have a few idle work items. On the other hand, if a typical pattern is 20000, 4236, 1234, 9000 then I'd read back elementsToBeProcessed and enqueue the kernel for only what is needed.
Advanced: If your pattern is monotonically decreasing you could interleave the readback with the kernel enqueue, so that you're always keeping the GPU busy but you're also making them smaller as you go. Between every kernel enqueue start an async double-buffered readback of a copy of the elementsToBeProcessed and use it for the kernel after the one you enqueue next.
Like this:
1. elementsToBeProcessedA = starting value
2. elementsToBeProcessedB = starting value
3. eventA = NULL
4. eventB = NULL
5. Enqueue kernel with NDRange of elementsToBeProcessedA
6. Non-blocking clEnqueueReadBuffer for elementsToBeProcessedA, taking eventA
7. If non-null, wait on eventB, release event
8. Enqueue kernel with NDRange of elementsToBeProcessedB
9. Non-blocking clEnqueueReadBuffer for elementsToBeProcessedB, taking eventB
10. If non-null, wait on eventA, release event
11. goto 5
This will keep the GPU fully saturated and yet will use smaller elementsToBeProcessed as it goes. It will not handle the case where elementsToBeProcessed increases so don't do it this way if that is the case.
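A hedged C-API sketch of that interleaving; all names are placeholders, error checking is omitted, and a simple "stop when the count reaches zero" exit is added on top of the steps above:

#include <CL/cl.h>

/* Sketch of the double-buffered readback above. queue, kernel, countBufA and
   countBufB are assumed to be created elsewhere. */
static void run_interleaved(cl_command_queue queue, cl_kernel kernel,
                            cl_mem countBufA, cl_mem countBufB,
                            cl_uint startCount)
{
    cl_uint countA = startCount, countB = startCount;
    cl_event eventA = NULL, eventB = NULL;

    for (;;) {
        size_t globalA = countA;                        /* step 5: enqueue with countA */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalA, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, countBufA, CL_FALSE, 0, sizeof(cl_uint),
                            &countA, 0, NULL, &eventA); /* step 6: non-blocking readback */

        if (eventB) {                                   /* step 7: wait on eventB, release */
            clWaitForEvents(1, &eventB);
            clReleaseEvent(eventB);
            eventB = NULL;
            if (countB == 0) break;                     /* added exit condition */
        }

        size_t globalB = countB;                        /* step 8: enqueue with countB */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalB, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, countBufB, CL_FALSE, 0, sizeof(cl_uint),
                            &countB, 0, NULL, &eventB); /* step 9 */

        if (eventA) {                                   /* step 10: wait on eventA, release */
            clWaitForEvents(1, &eventA);
            clReleaseEvent(eventA);
            eventA = NULL;
            if (countA == 0) break;
        }
    }                                                   /* step 11: goto 5 */
}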
An alternate solution: Always run a fixed number of global work items, enough to fill the GPU but not more. Each work item should then look at the total number of items to be done for this pass (elementsToBeProcessed) and then do its portion of the total.
uint elementsToBeProcessed = <read from global memory>;
uint step = get_global_size(0);
for (uint i = get_global_id(0); i < elementsToBeProcessed; i += step)
{
    <process item "i">
}
A simplified example: global work size of 5 (artificially small for the example), elementsToBeProcessed = 19: the first pass through the loop processes elements 0-4, the second pass 5-9, the third pass 10-14, the fourth pass 15-18.
You'd want to tune the fixed global work size to exactly match your hardware (compute units * max work group size or some division of that).
This is not unlike the algorithm for how work items cooperate to copy data into shared local memory regardless of work group size.
Global work size doesn't have to be fixed. E.g. you have 128 stream processors, so you make a kernel with a local size of 128 too. Your global work size can then be any multiple of that value - 256, 4096, etc.
Though the local group size is usually determined by the hardware specs. If you have more data to process, just increase the number of work-groups involved.
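A common way to pick such a global size is to round the element count up to the next multiple of the chosen local size; a small sketch (the kernel is then expected to check its global id against the real element count):

#include <stddef.h>

/* Hypothetical helper: round n up to a multiple of the local size,
   e.g. n = 1000 with local = 128 gives 1024. Work items with
   get_global_id(0) >= n should simply return. */
static size_t rounded_global_size(size_t n, size_t local)
{
    return ((n + local - 1) / local) * local;
}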

Large execution time of OpenCL Kernel causes crash

I'm currently building a ray marcher to look at things like the Mandelbox, etc. It works great. However, my current program uses each work-item as a ray projected from the eye. This means that there is a large amount of execution per work-item. So when looking at an incredibly complex object or trying to render with high enough precision, it causes my display drivers to crash because the kernel takes too long to execute on a single work-item. I'm trying to avoid changing my registry values to make the timeout longer, as I want this application to work on multiple computers.
Is there any way to remedy this? As it stands, the executions of each work-item are completely independent of the work-items nearby. I've contemplated subscribing a buffer to the GPU that would store the current progress on each ray and only executing a small number of iterations. Then I would just call the program over and over, and the result would hopefully refine a bit more each time. The problem with this is that I am unsure how to deal with branching rays (e.g. reflection and refraction) unless I have a maximum number of each to anticipate.
Anyone have any pointers on what I should do to remedy this problem? I'm quite the greenhorn to OpenCL and have been having this issue for quite some time. I feel as though I'm doing something wrong or fundamentally misusing OpenCL, since my single work-items have a lot of logic behind them, but I don't know how to split the task, as it is just a series of steps, checks, and adjustments.
The crash you are experiencing is caused by nVIDIA's HW watchdog timer. Also, the OS may detect the GPU as unresponsive and reset it (at least Windows 7 does).
You can avoid it by many ways:
Improve/optimize your kernel code to take less time
Buy faster Hardware ($$$$)
Disable the watchdog timer (but this is not an easy task, and not all devices have the feature)
Reduce the amount of work queued to the device each time, by launching multiple small kernels (NOTE: There is a small overhead of doing it this way, introduced by the launch of each small kernel)
The easiest and most straightforward solution is the last one. But if you can, try the first one as well.
As an example, a call like this (1000x1000 = 1M work items, Global size):
clEnqueueNDRangeKernel(queue, kernel, 2, NDRange(0,0)/*Offset*/, NDRange(1000,1000)/*Global*/, ... );
Can be split up into many small calls ((100x100) x (10x10) = 1M). Since the global size is now 100 times smaller, the watchdog should not be triggered:
for(int i=0; i<10; i++)
    for(int j=0; j<10; j++)
        clEnqueueNDRangeKernel(queue, kernel, 2, NDRange(i*100,j*100)/*Offset*/, NDRange(100,100)/*Global*/, ... );

What happens when I divide by zero?

Now, I'm not asking about the mathematical answer to dividing by zero; I want to know what my computer does when it tries to divide by zero.
I can see different answers depending on what level we look at.
Looking at a high level, I expect the language specification may just say "hey, you can't do that, throw an error".
Looking at an assembly level, will the CPU try to call the divide instruction when we try to divide by 0?
If it does that'll take us to the machine code level. What happens now?
Now if that doesn't happen and we force it to happen, what would the result be?
I think what you're asking is what's going to happen if we perform the binary division algorithm with 0 in the denominator.
The algorithm will go into an infinite loop, and the quotient will grow larger and larger until it exhausts all available memory.
Looking at an assembly level, will the CPU try to call the divide instruction when we try to divide by 0?
Yep, the instruction gets fetched and decoded. What happens then? The divisor is found to be zero and the processor stops doing what it's doing. It raises some kind of exception, the pipeline (if there is one) is flushed, and most likely control jumps to some predefined error handling code - often the OS controls a machine-level jump table called an interrupt vector (or this could be a table separate from the interrupt vector table).
There are many architectures, however, and things like error handling vary. Intel x86 follows the above procedure at least.
If it does that'll take us to the machine code level. What happens now?
I have no idea what that means. From the CPU's perspective, it is all machine code level.
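For what it's worth, here is what this looks like from a C program on a typical Linux/x86 box: the divide instruction faults (#DE) and the OS delivers SIGFPE to the process. A minimal sketch (platform-specific, and division by zero is undefined behavior in C):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_fpe(int sig)
{
    (void)sig;
    /* The x86 #DE fault is delivered to the process as SIGFPE. */
    const char msg[] = "caught SIGFPE (division by zero)\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);
    _exit(1);   /* returning would re-execute the faulting instruction */
}

int main(void)
{
    signal(SIGFPE, on_fpe);

    volatile int zero = 0;           /* volatile so the compiler really emits the div */
    volatile int result = 42 / zero; /* undefined behavior in C; traps (#DE) on x86 */

    printf("never printed: %d\n", result);
    return 0;
}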

How do I stop all 262,144 kernels if I find my answer

I am using pyopencl to find a certain pixel in a 512 x 512 (262,144 pixels) image. I am launching a (512, 512) global size when I run my program, and comparing each pixel's neighbors to a known group of neighbors. I am doing image synthesis. I don't want to wait around for the remaining kernels to run if I find my group of pixels within a kernel. Is there a way to terminate the rest of the running kernels from within a kernel program?
Thanks
Tim
When you queue a kernel with many work items, it gets divided up into work groups and threads which keep the GPU busy. Really large global sizes start as many threads as they can and issue new ones when the old ones finish. So you could find the smallest global size that still performs well, and queue many of those (instead of one large one), but also be checking on the results of the previous ones you queued (use events to know when they are done, and read back memory to get their results). When you get the correct answer, stop queueing kernels.
so instead of this:
    queue entire job (say, 4096 x 4096)
do:
    do
    {
        queue some work (say, 32 x 32)
        check if any of the prior work queued is done and check whether it got the answer
    }
    while (more work remains AND answer not found)
You'll need to figure out the right tradeoff between the size of the smaller jobs and the overhead of checking their results versus extra work done.
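A rough host-side version of that loop in C (the question uses pyopencl, but the structure is the same): queue, kernel and resultBuf are placeholders, and this simplified sketch just does a blocking readback of a "found" flag after each tile rather than overlapping the checks with events:

#include <CL/cl.h>

/* Hypothetical sketch: launch small tiles and poll a result flag between launches.
   The kernel is assumed to write a nonzero value into resultBuf when it finds
   the pixel group; error checking is omitted. */
static void search_in_tiles(cl_command_queue queue, cl_kernel kernel,
                            cl_mem resultBuf)
{
    const size_t tile[2] = { 32, 32 };
    cl_int found = 0;

    for (size_t y = 0; y < 512 && !found; y += tile[1]) {
        for (size_t x = 0; x < 512 && !found; x += tile[0]) {
            size_t offset[2] = { x, y };

            clEnqueueNDRangeKernel(queue, kernel, 2, offset, tile, NULL,
                                   0, NULL, NULL);

            /* blocking read: stop queueing once some tile has found the answer */
            clEnqueueReadBuffer(queue, resultBuf, CL_TRUE, 0, sizeof(cl_int),
                                &found, 0, NULL, NULL);
        }
    }
}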
Your question touches on a big issue and problem of parallelism.
What do you do when one of your parallel threads already has the answer to the problem?
OpenCL does not allow you to control kernel execution. Not even at the host level. And this is a big problem. However, it is how it has to be, since if the work items do not run freely, detached from one another, then it is not fully parallel.
The only solution is to split the computation into small parts and check the completion of each of them. But sometimes the parts are already very small (as in your case; 512x512 is quite small).
In your specific case I would process everything (512x512), and after that I would use another kernel to get the final results out of the 512x512 set.
First thought is to have some sort of global memory flag that each kernel can read and set. This approach requires atomicity, so make sure to use the atomic_ functions.
__kernel void t(__global int *Data,
                __global int *Flag){
    /* atomic read of the flag: atomic_max with 0 returns the old value without changing it */
    if(atomic_max(Flag, 0) == 0){
        //perform calc on Data
        if(PixelsFound){
            //Set the flag to +1 so other work-items skip their work
            atomic_inc(Flag);
        }
    }
}
Community, feel free to comment if this is known not to work!
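On the host side you would also need to reset the flag to zero before each launch; a minimal sketch, assuming flagBuf is the cl_mem bound to the Flag argument and queue already exists:

cl_int zero = 0;
/* write 0 into the flag buffer so the next launch starts searching again */
clEnqueueWriteBuffer(queue, flagBuf, CL_TRUE, 0, sizeof(cl_int),
                     &zero, 0, NULL, NULL);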

Work-item execution order

I am working with OpenCL, and I am interested in how work-items will be executed in the following example.
I have a one-dimensional range of 10000 with a work-group size of 512. The kernel is the following:
__kernel void
doStreaming() {
    unsigned int id = get_global_id(0);
    if (!isExecutable(id))
        return;
    /* do some work */
}
Here it checks whether it needs to process the element with the given id or not.
Let's assume that the execution started with the first work-group of size 512 and 20 of its items were rejected by isExecutable. Does the GPU continue to execute another 20 elements without waiting for the first 492 elements?
There are no barriers or other synchronization techniques involved.
When some work-items branch away from the usual /* do some work */, the hardware can keep the pipeline occupied by fetching instructions from the next wavefront (AMD) or the next warp (NVIDIA), because the current warp/wavefront's work-items are busy doing other things. But this can cause memory access serialization and disturb the access ordering of the work-group, decreasing performance.
Avoid having divergent warps/wavefronts: if you do if-statements inside a loop, it is really bad, so it is better to find another way.
If every work item in a work-group takes the same branch, then it is OK.
If every work item branches only a few times per hundreds of compute operations, it is OK.
Try to generate equal conditions for all work items (embarrassingly parallel data/algorithm) to harness the power possessed by the GPU.
The best way I know to get rid of the simplest branch-vs-compute case is to use a global yes/no mask array (one element per work item, e.g. 1 = keep the result, 0 = discard it): always compute, then multiply your result by the work-item's mask element. Generally, adding a 1-byte memory access per work item is much better than doing one branch per work item. Actually, making the object length a power of 2 could be better after adding this 1 byte.
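A small OpenCL C sketch of that mask idea; the names and the placeholder computation are illustrative, not from the answer:

__kernel void masked_compute(__global const float *input,
                             __global const uchar *mask,   /* 1 = keep, 0 = discard */
                             __global float *output)
{
    const size_t id = get_global_id(0);

    /* Every work-item does the same computation: no divergent branch. */
    const float value = input[id] * input[id] + 1.0f;   /* placeholder computation */

    /* The mask decides whether the result survives, instead of an if-statement. */
    output[id] = value * (float)mask[id];
}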
Yes and no. The following elaborations are based on documentation from NVIDIA, but I doubt it is any different on ATI hardware (though the actual numbers might differ). In general, the threads of a work group are executed in so-called warps, which are sub-blocks of the work group. On NVIDIA hardware each work group is divided into warps of 32 threads each. And each of those warps is executed in lock-step and thus perfectly in parallel (it may not be real-time parallel, meaning there could be 16 threads in parallel and then 16 again directly afterwards, but conceptually they're running perfectly in parallel). So if only one of those 32 threads executes that additional code, the others will wait for it. But the threads in all the other warps won't care about all this.
So yes, there may be threads that will unnecessarily wait for the others, but that happens on a smaller scale than the whole work group size (32 on any NVIDIA hardware). This is why intra-warp branch divergence should be avoided if possible, and this is also why code that is guaranteed to work inside a single warp doesn't need any synchronization for e.g. shared memory access (a common optimization for algorithms).
