OpenCL max warps and work-groups per compute unit

Can I get the maximum number of warps/work-groups per compute unit through some function like clGetDeviceInfo? From what I've found, the number depends only on compute capability. So is there a function that can detect it?
Thanks,
jikra

I think you are looking for clGetKernelWorkGroupInfo.
Specifically, CL_KERNEL_WORK_GROUP_SIZE and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will help you tune your work group sizes.
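For example, with PyOpenCL (a minimal sketch; the context setup and the placeholder kernel double_it are just assumptions, but the two queries are the equivalents of clGetKernelWorkGroupInfo):

import pyopencl as cl

# Any context/device will do; adjust for your setup.
ctx = cl.create_some_context()
device = ctx.devices[0]

# Placeholder kernel, just so there is something to query.
program = cl.Program(ctx, """
__kernel void double_it(__global float *data) {
    data[get_global_id(0)] *= 2.0f;
}
""").build()
kernel = program.double_it

# clGetKernelWorkGroupInfo equivalents:
max_wg_size = kernel.get_work_group_info(
    cl.kernel_work_group_info.WORK_GROUP_SIZE, device)
preferred_multiple = kernel.get_work_group_info(
    cl.kernel_work_group_info.PREFERRED_WORK_GROUP_SIZE_MULTIPLE, device)

print("CL_KERNEL_WORK_GROUP_SIZE:", max_wg_size)
print("CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE:", preferred_multiple)

A reasonable starting point is to make your work-group size a multiple of the preferred multiple (the warp/wavefront size on NVIDIA/AMD hardware) without exceeding CL_KERNEL_WORK_GROUP_SIZE.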

Related

How would I normalize a float array to the range [0.0, 1.0] in parallel?

I want to design a kernel to which I can pass an array of floats and have them all come out with the maximum being 1.0 and the minimum being 0.0. Theoretically, each element would be mapped to something like (x-min)/(max-min). How can I parallelize this?
A simple solution would be to split the problem into 2 kernels:
Reduction kernel
Divide your array into chunks of N * M elements each, where N is the number of work-items per group, and M is the number of array elements processed by each work-item.
Each work-item computes the min() and max() of its M items.
Within the workgroup, perform a parallel reduction of min and max across the N work-items, giving you the min/max for each chunk.
With those values obtained, one of the work-items in the group can use atomics to update the global min/max values. Given that you are using floats, you will need the well-known workaround for the lack of atomic min/max/CAS operations on floats.
Application
After your first kernel has completed, you know that the global min and max values must be correct. You can compute your scale factor and normalisation offset, and then kick off as many work items as your array has elements, to multiply/add each array element to adjust it.
Tweak your values for N and M to find an optimum for a given OpenCL implementation and hardware combination. (Note that M = 1 may be the optimum, i.e. launching straight into the parallel reduction.)
Having to synchronise between the two kernels is not ideal but I don't really see a way around that. If you have multiple independent arrays to process, you can hide the synchronisation overhead by submitting them all in parallel.
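If you want to prototype this before hand-writing the kernels, here is a rough sketch of the same two-phase idea using PyOpenCL's ReductionKernel and ElementwiseKernel helpers (the array contents and size are made up; the helpers perform the work-group reduction internally, so the float-atomics workaround is sidestepped):

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
from pyopencl.reduction import ReductionKernel
from pyopencl.elementwise import ElementwiseKernel

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Phase 1: reduction kernels for the global min and max.
min_kernel = ReductionKernel(ctx, np.float32, neutral="INFINITY",
                             reduce_expr="fmin(a, b)", map_expr="x[i]",
                             arguments="__global const float *x")
max_kernel = ReductionKernel(ctx, np.float32, neutral="-INFINITY",
                             reduce_expr="fmax(a, b)", map_expr="x[i]",
                             arguments="__global const float *x")

# Phase 2: elementwise kernel applying (x - min) / (max - min).
normalize = ElementwiseKernel(ctx,
                              "float *x, float lo, float scale",
                              "x[i] = (x[i] - lo) * scale",
                              "normalize")

data = cl_array.to_device(queue,
                          np.random.rand(1 << 20).astype(np.float32))

lo = min_kernel(data).get()
hi = max_kernel(data).get()
normalize(data, np.float32(lo), np.float32(1.0 / (hi - lo)))

print(data.get().min(), data.get().max())  # ~0.0 and ~1.0

If you do stay with the hand-written two-kernel design described above, the usual workaround for the missing float atomic_min/atomic_max is an atomic_cmpxchg loop on the bit pattern of the float.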

What if the FD steps varied w.r.t. output/input?

I am using the finite difference scheme to find gradients.
Let's say I have 2 outputs (y1, y2) and 1 input (x) in a single component, and I know in advance that the sensitivity of y1 with respect to x is not the same as the sensitivity of y2 to x. Thus I could potentially use two different steps for them, as in:
self.declare_partials(of='y1', wrt='x', method='fd', step=0.01, form='central')
self.declare_partials(of='y2', wrt='x', method='fd', step=0.05, form='central')
There is nothing stopping me (algorithmically), but it is not clear what OpenMDAO's gradient calculation would actually do in this case.
Does it exchange information between the differently-stepped cases by looking at the step ratios, or does it simply treat them independently and therefore double the computational time?
I just tested this, and it does the finite difference twice with the two different step sizes, and only saves the requested outputs for each step. I don't think we could do anything with the ratios as you suggested, since the reason for using different step sizes to resolve individual outputs is that you don't trust the accuracy of the outputs at the smaller (or larger) step size.
This is a fair question about the effect of the API. In typical FD applications you would get only 1 function call per design variable for forward and backward difference and 2 function calls for central difference.
However in this case, you have asked for two different step sizes for two different outputs, both with central difference. So here, you'll end up with 4 function calls to compute all the derivatives. dy1_dx will be computed using the step size of .01 and dy2_dx will be computed with a step size of .05.
There is no crosstalk between the two different FD calls, and you do end up with more function calls than you would have if you just specified a single step size via:
self.declare_partials(of='*', wrt='x', method='fd', step=0.05, form='central')
If the cost is something you can bear, and you get improved accuracy, then you could use this method to get different step sizes for different outputs.
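As a minimal, self-contained sketch of the situation discussed here (the component and its functions are made up for illustration; only the two declare_partials calls mirror the question):

import numpy as np
import openmdao.api as om


class TwoOutputComp(om.ExplicitComponent):
    # Hypothetical component: one input, two outputs with very
    # different sensitivities to x.

    def setup(self):
        self.add_input('x', val=1.0)
        self.add_output('y1', val=0.0)
        self.add_output('y2', val=0.0)

        # A different FD step for each output, as in the question.
        self.declare_partials(of='y1', wrt='x', method='fd',
                              step=0.01, form='central')
        self.declare_partials(of='y2', wrt='x', method='fd',
                              step=0.05, form='central')

    def compute(self, inputs, outputs):
        x = inputs['x']
        outputs['y1'] = np.sin(x)          # gentle function of x
        outputs['y2'] = np.exp(10.0 * x)   # much steeper function of x


prob = om.Problem()
prob.model.add_subsystem('comp', TwoOutputComp(), promotes=['*'])
prob.setup()
prob.set_val('x', 1.0)
prob.run_model()

# Each declared partial is finite-differenced with its own step, so
# central difference here costs 4 function calls instead of 2.
derivs = prob.compute_totals(of=['y1', 'y2'], wrt=['x'])
print(derivs)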

Returning one result from a PyOpenCL kernel

My PyOpenCL kernel is launched with a global size of (512, 512), so I assume it will run 512x512 = 262,144 times. I want to find the minimum value of a function over my 512x512 image, but I don't want to return 262,144 floats to the CPU to calculate the min. I want to run another kernel (possibly waiting in the queue) to find the min value of all 262,144 pixels and then send just that one float to the CPU. I think this would be faster. Should my waiting kernel's global size be (1, 1)? I hope the large 262,144-float buffer that I created using mf.COPY_HOST_PTR will not cross the GPU/CPU bus before I call the next kernel.
Thanks
Tim
Andreas is right: reduction is the solution. There is a nice article from AMD explaining how to implement a simple reduction. It discusses different approaches and the performance gains they bring. The example in the article sums all the elements rather than finding the minimum, but it's fairly trivial to modify the given code.
By the way, maybe I'm misreading your first sentence, but a kernel launched with a global size of (512, 512) will not run 262,144 times; it runs once, with 262,144 work-items scheduled.
Use a reduction kernel to find the minimum.
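For concreteness, here is a sketch of the first pass in PyOpenCL (the names, the 256-item work-group size, and the random test image are placeholders). Each work-group reduces its chunk in local memory and writes one candidate minimum, leaving only 1,024 floats instead of 262,144; a second launch of the same kernel over those, or a tiny host-side min, finishes the job.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

GROUP_SIZE = 256  # work-items per group; tune for your device

kernel_src = """
__kernel void min_reduce(__global const float *data,
                         __global float *group_mins,
                         __local float *scratch,
                         const unsigned int n)
{
    unsigned int gid = get_global_id(0);
    unsigned int lid = get_local_id(0);

    // Load one element per work-item (pad with +INF past the end).
    scratch[lid] = (gid < n) ? data[gid] : INFINITY;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction in local memory.
    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] = fmin(scratch[lid], scratch[lid + s]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One result per work-group.
    if (lid == 0)
        group_mins[get_group_id(0)] = scratch[0];
}
"""
program = cl.Program(ctx, kernel_src).build()

data = np.random.rand(512 * 512).astype(np.float32)
n = data.size
num_groups = (n + GROUP_SIZE - 1) // GROUP_SIZE

data_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=data)
mins_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=num_groups * 4)

program.min_reduce(queue, (num_groups * GROUP_SIZE,), (GROUP_SIZE,),
                   data_buf, mins_buf,
                   cl.LocalMemory(GROUP_SIZE * 4), np.uint32(n))

# Only num_groups floats come back to the host.
group_mins = np.empty(num_groups, dtype=np.float32)
cl.enqueue_copy(queue, group_mins, mins_buf)
print("min =", group_mins.min())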

argmin/argmax with PyOpenCL

How would I write argmin or argmax with PyOpenCL? I figure I would need to calculate the argmin/min for each workgroup, and then reduce these using subsequent invocations.
Adapt a standard reduction kernel to collect the minimum and its location rather than just the minimum.
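For example, the per-work-group pass can carry an index alongside each value (a sketch of the kernel source only; the host code is the same as for a plain min reduction, with one extra uint buffer and one extra chunk of local memory):

# OpenCL C source for a per-work-group argmin pass (sketch). Each
# work-group writes its minimum value and the global index where it
# occurred; a second pass (or the host) reduces the per-group pairs
# the same way.
argmin_kernel_src = """
__kernel void argmin_reduce(__global const float *data,
                            __global float *group_mins,
                            __global uint  *group_argmins,
                            __local float  *vals,
                            __local uint   *idxs,
                            const unsigned int n)
{
    unsigned int gid = get_global_id(0);
    unsigned int lid = get_local_id(0);

    vals[lid] = (gid < n) ? data[gid] : INFINITY;
    idxs[lid] = gid;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s && vals[lid + s] < vals[lid]) {
            vals[lid] = vals[lid + s];   // carry the value ...
            idxs[lid] = idxs[lid + s];   // ... and where it came from
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0) {
        group_mins[get_group_id(0)]    = vals[0];
        group_argmins[get_group_id(0)] = idxs[0];
    }
}
"""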

OpenCL and HD5850

I don't have an HD5850, but how can I find out its maximum work-group size for OpenCL? What is the preferred floating-point vector width for the HD5850? I suspected it was 5, but that did not work on a friend's computer that has a 5850. I tried a width of 4, but it was not fast enough, and now I suspect the work-group size is not optimal. I am doing N-body simulations for 25k, 50k, and 100k particles using float8 variables for x, y, z, vx, vy, vz.
Thanks.
If you need the OpenCL specifics at development time but don't have access to the hardware, try http://clbenchmark.com. For example, the HD 5850 page is here: http://clbenchmark.com/device-environment.jsp?config=11975982. It shows CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT=4.
Use clGetDeviceInfo to poll for CL_DEVICE_MAX_WORK_GROUP_SIZE. I think the 5850 will have this at 256, but that may not be optimal for your kernel.
Use the same technique to poll for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, which I think is 4 on your card.
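For instance, with PyOpenCL the same queries look like this (a small sketch; the C calls clGetDeviceInfo(..., CL_DEVICE_MAX_WORK_GROUP_SIZE, ...) and clGetDeviceInfo(..., CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, ...) report the same values):

import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(device.name)
        print("  CL_DEVICE_MAX_WORK_GROUP_SIZE:",
              device.get_info(cl.device_info.MAX_WORK_GROUP_SIZE))
        print("  CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT:",
              device.get_info(cl.device_info.PREFERRED_VECTOR_WIDTH_FLOAT))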
