CUDA: per-thread array length based on threadIdx

This is part of the pseudo code I am implementing in CUDA as part of an image reconstruction algorithm:
for each xbin(0 -> detectorXDim/2 - 1):
    for each ybin(0 -> detectorYDim - 1):
        rayInit = (xbin*xBinSize + 0.5, ybin*yBinSize + 0.5, -detectorDistance)
        rayEnd = beamFocusCoord
        slopeVector = rayEnd - rayInit
        // knowing that r = rayInit + t*slopeVector:
        // x = rayInit[0] + t*slopeVector[0]
        // y = rayInit[1] + t*slopeVector[1]
        // z = rayInit[2] + t*slopeVector[2]
        // to find ray xx intersections:
        for each xinteger(xbin+1 -> detectorXDim/2):
            solve t for x = xinteger*xBinSize
            find corresponding y and z
            add to intersections array
        // find ray yy intersections (analogous to xx intersections)
        // find ray zz intersections (analogous to xx intersections)
So far, this is what I have come up with:
__global__ void sysmat(int xfocus, int yfocus, int zfocus, int xbin, int xbinsize, int ybin, int ybinsize, int zbin, int projecoes){
    int tx=threadIdx.x, ty=threadIdx.y, tz=threadIdx.z, bx=blockIdx.x, by=blockIdx.y, i, x, y, z;
    int idx=ty+by*blocksize;
    int idy=tx+bx*blocksize;
    int slopeVectorx=xfocus-idx*xbinsize+0.5;
    int slopeVectory=yfocus-idy*ybinsize+0.5;
    int slopeVectorz=zfocus-zdetector;
    __syncthreads();
    //points where the ray intersects x axis
    int xint=idx+1;
    int yint=idy+1;
    int*intersectionsx[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
    int*intersectionsy[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
    int*intersectionsz[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
    for(xint=xint; xint<detectorXDim/2; xint++){
        x=xint*xbinsize;
        t=(x-idx)/slopeVectorx;
        y=idy+t*slopeVectory;
        z=z+t*slopeVectorz;
        intersectionsx[xint-1]=x;
        intersectionsy[xint-1]=y;
        intersectionsz[xint-1]=z;
        __syncthreads();
    }
    ...
}
This is just a piece of the code. I know there may be some errors (feel free to point them out if they are blatantly wrong), but what I am more concerned about is this:
Each thread (which corresponds to a detector bin) needs three arrays so it can save the points where the ray (the one passing through this thread/bin) intersects multiples of the x, y and z axes. Each array's length depends on the position (index) of the thread/bin in the detector and on beamFocusCoord (which is fixed). To do this I wrote the piece of code below, which I am certain cannot work (I confirmed it with a small test kernel, which returns the error "expression must have constant value"):
int*intersectionsx[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
int*intersectionsy[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
int*intersectionsz[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
So in the end, I want to know if there is an alternative to this piece of code, where a vector's length depends on the index of the thread allocating that vector.
Thank you in advance ;)
EDIT: Given that each thread has to save an array with the coordinates of the intersections between the ray (which goes from the beam source to the detector) and the xx, yy and zz axes, and that the spatial dimensions are around 1400x3600x60 (I don't have the exact numbers at the moment, but they are very close to the real values), is this problem feasible with CUDA?
For example, thread (0,0) will have 1400 intersections on the x axis, 3600 on the y axis and 60 on the z axis, meaning I would have to create an array of size (1400+3600+60)*sizeof(float), which is around 20 KB per thread.
Since each thread would exceed the 16 KB of local memory, that is out of the question. The other alternative was to preallocate those arrays, but with some more math we get (1400+3600+60)*4*numberOfThreads (i.e. 1400*3600), which also exceeds the amount of global memory available :(
So I am running out of ideas to deal with this problem, and any help is appreciated.

No.
Every piece of memory in CUDA must be known at kernel-launch time. You can't allocate, deallocate or resize anything while the kernel is running. This is true for global memory, shared memory and registers.
The common workaround is to allocate the maximum size of memory needed beforehand. This can be as simple as allocating the maximum size needed by one thread, thread-count times, or as complex as adding up all the per-thread sizes for a total maximum and computing appropriate per-thread offsets into that array. That's a tradeoff between memory allocation and offset-computation time.
Go for the simple solution if you can, and for the complex one if you have to due to memory limitations.

Why are you not using textures? Using a 2D or 3D texture would make this problem much easier. The GPU is designed to do very fast floating point interpolation, and CUDA includes excellent support for it. The literature has examples of projection reconstruction on the GPU, e.g. Accelerating simultaneous algebraic reconstruction technique with motion compensation using CUDA-enabled GPU, and textures are an integral part of their algorithms. Your own manual coordinate calculations can only be slower and more error-prone than what the GPU provides, unless you need something unusual like sinc interpolation.
1400x3600x60 is a little big for a single 3D texture, but you could break your problem up into 2D slices, 3D sub-volumes, or hierarchical multi-resolution reconstruction. These have all been used by other researchers. Just search PubMed.

Related

ATmega performance for different operations

Does anyone have experience with replacing floating point operations on ATmega (2560) based systems? There are a couple of very common situations that come up every day.
For example:
Are comparisons faster than divisions/multiplications?
Are float to int type cast with followed multiplication/division faster than pure floating point operations without type cast?
I hope I don't have to make a benchmark just for me.
Example one:
int iPartialRes = (int)fArg1 * (int)fArg2;
iPartialRes *= iFoo;
Is this faster than:
float fPartialRes = fArg1 * fArg2;
fPartialRes *= iFoo;
And example two:
iSign = fVal < 0 ? -1 : 1;
Is this faster than:
iSign = fVal / fabs(fVal);
The questions can be answered just by thinking about them for a moment.
AVRs do not have an FPU, so all floating point work is done in software --> an fp multiplication involves much more than a simple int multiplication.
Since AVRs also do not have an integer division unit, a simple branch is also much faster than a software division. Dividing floating point values is the worst case of all :)
But please note that your first two examples produce very different results.
This is an old question, but I will submit this elaborated answer for the curious.
Just typecasting a float will truncate it, i.e. 3.7 will become 3; there is no rounding.
The fastest math on a 2560 is (+,-,*), with division being the slowest due to the lack of a hardware divider. Typecasting to an unsigned long int after multiplying all operands by a pseudo decimal point that suits the fractional number(1) range your floats are expected to see, and tracking the sign as a bool, will give the best range/accuracy compromise.
If your loop needs to be as fast as possible, avoid even integer division; instead multiply by a pseudo fraction, and then typecast back into a float with myFloat (defined elsewhere) = float(myPseudoFloat) / myPseudoDecimalConstant;
Not sure if you came across the Show Info page in the Arduino Playground. It's basically a sketch that runs a benchmark on your (insert Arduino model here) and shows the actual compute times for various operations. The Mega 2560 will be very close to an ATmega328 as far as FLOPs go, up to 12.5K/s (80 us per float divide). Typecasting would likely handicap the CPU more, as it introduces more overhead and might even give erroneous results due to rounding errors and lack of precision.
(1) i.e.: 543.509291 * 1,000,000 = 543,509,291 moves the decimal 6 places, the maximum precision of a float on an 8-bit AVR. If you first multiply all values by the same constant, such as 1000 or 100000, the decimal point is preserved, and you cast back to a float by dividing by your decimal constant when you are ready to print or store it.
float f = 3.1428;
int x;
x = f * 10000;
// x now contains 31428

Converting Real and Imaginary FFT output to Frequency and Amplitude

I'm designing a real time Audio Analyser to be embedded on a FPGA chip. The finished system will read in a live audio stream and output frequency and amplitude pairs for the X most prevalent frequencies.
I've managed to implement the FFT so far, but its current output is just the real and imaginary parts for each window, and what I want to know is: how do I convert this into frequency and amplitude pairs?
I've been doing some reading on the FFT, and I see how they can be turned into a magnitude and phase relationship but I need a format that someone without a knowledge of complex mathematics could read!
Thanks
Thanks for these quick responses!
The output from the FFT I'm getting at the moment is a continuous stream of real and imaginary pairs. I'm not sure whether to break these up into packets of the same size as my input packets (64 values), and treat them as an array, or deal with them individually.
The sample rate, I have no problem with. As I configured the FFT myself, I know that it's running off the global clock of 50MHz. As for the Array Index (if the output is an array of course...), I have no idea.
If we say that the output is a series of One-Dimensional arrays of 64 complex values:
1) How do I find the array index [i]?
2) Will each array return a single frequency part, or a number of them?
Thank you so much for all your help! I'd be lost without it.
Well, the bad news is, there's no way around needing to understand complex numbers. The good news is, just because they're called complex numbers doesn't mean they're, y'know, complicated. So first, check out the wikipedia page, and for an audio application I'd say, read down to about section 3.2, maybe skipping the section on square roots: http://en.wikipedia.org/wiki/Complex_number
What that's telling you is that if you have a complex number, a + bi, you can picture it as living in the x,y plane at location (a,b). To get the magnitude and phase, all you have to do is find two quantities:
The distance from the origin of the plane, which is the magnitude, and
The angle from the x-axis, which is the phase.
The magnitude is simple enough: sqrt(a^2 + b^2).
The phase is equally simple: atan2(b,a).
The FFT result will give you an array of complex values. Twice the magnitude (the square root of the sum of the squared complex components) of each array element, divided by the FFT length, is an amplitude. Or take the log magnitude if you want a dB scale. The array index gives you the center of the frequency bin with that amplitude. You need to know the sample rate and the FFT length to get the frequency of each array element or bin:
f[i] = i * sampleRate / fftLength
for the first half of the array (the other half is just duplicate information in the form of complex conjugates, for real audio input).
The frequency of each FFT result bin may differ from the actual spectral frequencies present in the audio signal, due to windowing or so-called spectral leakage. Look up frequency estimation methods for the details.

OpenCL computation times much longer than CPU alternative

I'm taking my first steps in OpenCL (and CUDA) for my internship. All nice and well, I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The code is for the main: http://pastebin.com/i4A6kPfn, and for the kernel: http://pastebin.com/Wefrqifh I'm starting to measure time after segmentPunten(segmentArray, begin, eind); has returned, and I end measuring time after the last clEnqueueReadBuffer.
Computation time on a Nvidia GT440 is 38.6 seconds, on a GT555M 35.5, on a Athlon II X4 5.6 seconds, and on a Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there?
What it is supposed to do: (short version) calculate how much noise load is produced by an airplane passing by.
Long version: there are several Observer Points (OPs), which are the points at which sound from a passing airplane is measured. The flight path is segmented into 10,000 segments; this is done in the function segmentPunten. The double for loop in main gives the OPs their coordinates. There are two kernels. The first one calculates the distance from a single OP to a single segment; this is saved in the array "afstanden". The second kernel calculates the sound load in an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    // ...
    for(i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[threadID * aantalSegmenten + i];
        // ...
    }
    // ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each
kernel that accesses global memory 1000 times. Without crawling though the code,
I'm guessing that many of these accesses overlap when considering your
computation as a whole. Is this the case? Will two work items access the same
global memory? If this is the case, you will see a potentially huge win on the
GPU from rewriting your algorithm to partition the work such that you can read
each specific global memory location only once, saving it in local memory. After that,
each work item in the work group that needs that location can read it quickly.
As an aside, the CL specification allows you to omit the leading __ from CL
keywords like global and kernel. I don't think many newcomers to CL realize
that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (the Apple one at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
If the performance bottle neck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently when a block of threads performs a read from the memory, the addresses are spread out through the memory, which is a real killer for GPU performance. You'd ideally want to index array with something like afstanden[ndx*NUM_THREADS + threadID], which would make accesses from a work group to load a contiguous block of memory. This is much faster than the current, essentially random, memory lookup.
First of all, you are measuring not the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To make a fair comparison, measure the computation time from the first "non-static" part of your program, for example from the first clSetKernelArg to the last clEnqueueReadBuffer.
If the execution time is still too high, you can use a profiler (such as Visual Profiler from NVIDIA) and read the OpenCL Best Practices guide included in the CUDA Toolkit documentation.
As for the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double precision calculations are artificially slowed down on consumer-grade NVIDIA cards.

kernel OpenCL not saving results?

I'm making my first steps in OpenCL, and I've got a problem where I think the results aren't saved correctly. I first made a simple program that works, giving the correct results etc. So main is functioning correctly, arrays are filled the way they are supposed to be, etc.
Brief explanation of the program:
I have a grid of 4x4 points (the OP array) in a field that is 40x40. There is a plane flying above this field, and its route is divided into 100 segments (the Segment array). Coord is a typedef struct, currently with only a double x and a double y value. I know there are simpler solutions, but I have to use this later on, so I do it this way. The kernel is supposed to calculate the distance from a point (an OP) to a segment. The result has to be saved in the array Distances.
Description of the kernel: it gets its global id, from which it calculates which OP and which segment it has to work with. With that info it gets the x and y from those two and calculates the x and y distance between them. With this information it takes the square root of the distances and the flight altitude.
The problem is that the Distances table only shows zeros :P
kernel: http://pastebin.com/U9hTWvv2 There is a parameter that says __global const Coord* Afstanden; this has to be __global double* Afstanden
OpenCL stuff in main: http://pastebin.com/H3mPwuUH (because I'm not 100% sure I'm doing that right)
I can give you the whole program, but it's mostly Dutch :P
I tried to explain everything as well as possible; if something is not clear, I'll try to clear it up ;)
As I made numerous little mistakes in the first kernel, here is its successor: http://pastebin.com/rs6T1nKs The OpenCL part is also a bit updated: http://pastebin.com/4NbUQ40L because it also had one or two faults in it.
It should work now, in my opinion... But it doesn't; the problem still stands.
Well, it seems that the "Afstanden" variable is declared as a "const Coord*". Remove the constness - i.e., "Coord*" only - and make sure that when you call clCreateBuffer you use "CL_MEM_WRITE_ONLY" or "CL_MEM_READ_WRITE" as an argument.
"The problem is that the Distances table only shows zeros :P"
You are probably initializing Afstanden on the CPU side with zeros. Am I correct? That would explain this behavior.

OpenCL image histogram

I'm trying to write a histogram kernel in OpenCL to compute 256 bin R, G, and B histograms of an RGBA32F input image. My kernel looks like this:
const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP |
                           CLK_FILTER_NEAREST;

__kernel void computeHistogram(read_only image2d_t input, __global int* rOutput,
                               __global int* gOutput, __global int* bOutput)
{
    int2 coords = {get_global_id(0), get_global_id(1)};
    float4 sample = read_imagef(input, mSampler, coords);
    uchar rbin = floor(sample.x * 255.0f);
    uchar gbin = floor(sample.y * 255.0f);
    uchar bbin = floor(sample.z * 255.0f);
    rOutput[rbin]++;
    gOutput[gbin]++;
    bOutput[bbin]++;
}
When I run it on a 2100 x 894 image (1,877,400 pixels), I tend to see only around 1,870,000 total values recorded when I sum up the histogram values for each channel. It's also a different number each time. I did expect this, since once in a while two kernels probably grab the same value from the output array and increment it, effectively cancelling out one increment operation (I'm assuming?).
The 1,870,000 output is for a {1,1} workgroup size (which is what seems to get set by default if I don't specify otherwise). If I force a larger workgroup size like {10,6}, I get a drastically smaller sum in my histogram (proportional to the change in workgroup size). This seemed strange to me, but I'm guessing what happens is that all of the work items in the group increment the output array value at the same time, and so it just counts as a single increment?
Anyways, I've read in the spec that OpenCL has no global memory synchronization, only synchronization within local workgroups using their __local memory. The histogram example by NVIDIA breaks the histogram workload up into a bunch of subproblems of a specific size, computes their partial histograms, then merges the results into a single histogram afterwards. This doesn't seem like it'll work all that well for images of arbitrary size. I suppose I could pad the image data out with dummy values...
Being new to OpenCL, I guess I'm wondering if there's a more straightforward way to do this (since it seems like it should be a relatively straightforward GPGPU problem).
Thanks!
As stated before, you are writing to shared memory unsynchronized and non-atomically, which leads to errors. If the picture is big enough, I have a suggestion:
Split your work group into one-dimensional groups over columns or rows. Use each kernel to sum up the histogram for its column or row, and afterwards sum them globally with the atomic atom_inc. This keeps most of the summing in private memory, which is much faster, and reduces the atomic ops.
If you work in two dimensions, you can do it on parts of the picture.
[EDIT:]
I think I have a better answer ;-)
Have a look to: http://developer.download.nvidia.com/compute/opencl/sdk/website/samples.html#oclHistogram
They have an interesting implementation there...
Yes, you're writing to shared memory from many work items at the same time, so you will lose elements if you don't do the updates in a safe way (or worse; just don't do it). Increasing the group size actually increases the utilization of your compute device, which in turn increases the likelihood of conflicts. So you end up losing more updates.
However, you seem to be confusing synchronization (ordering thread execution) with shared memory updates (which typically require either atomic operations, or code synchronization plus memory barriers, to make sure the memory updates are visible to other threads that are synchronized).
The synchronization+barrier approach is not particularly useful for your case (and, as you noted, is not available for global synchronization anyway. The reason is that two thread groups may never run concurrently, so trying to synchronize them is nonsensical). It's typically used when all threads first work on generating a common data set, and then all start to consume that data set with a different access pattern.
In your case, you can use atomic operations (e.g. atom_inc, see http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=113&Itemid=168). However, note that updating a highly contended memory address (say, because you have thousands of threads all trying to write to only 256 ints) is likely to yield poor performance. All the hoops typical histogram code jumps through are there to reduce the contention on the histogram data.
You can check:
The histogram example from AMD Accelerated Parallel Processing (APP) SDK.
Chapter 14 - Image Histogram of OpenCL Programming Guide book (ISBN-13: 978-0-321-74964-2).
GPU Histogram - Sample code from Apple
