I'm just getting started with OpenCL, so I'm sure there are a dozen things I could do to improve this code, but one thing in particular stands out to me: if I sum columns rather than rows (basically contiguous versus strided access, since all buffers are linear) in a 2D array of data, I get run times that differ by a factor of anywhere from 2 to 10x. Strangely, the contiguous accesses appear slower.
I'm using PyOpenCL to test.
Here are the two kernels of interest (reduce and reduce2), and another that generates the data feeding them (forcesCL):
kernel void forcesCL(global float4 *chrgs, global float4 *chrgs2, float k, global float4 *frcs)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    int idx = i + get_global_size(0)*j;

    float3 c = chrgs[i].xyz - chrgs2[j].xyz;
    float l = length(c);

    frcs[idx].xyz = (l == 0 ? 0
                    : ((chrgs[i].w*chrgs2[j].w)/(k*pown(l,2)))*normalize(c));
    frcs[idx].w = 0;
}
kernel void reduce(global float4 *frcs, ulong k, global float4 *result)
{
    ulong gi = get_global_id(0);
    ulong gs = get_global_size(0);
    float3 tmp = 0;
    // Each work-item strides through the buffer: gi, gi+gs, gi+2*gs, ...
    for(ulong i = 0; i < k; i++)
        tmp += frcs[gi + i*gs].xyz;
    result[gi].xyz = tmp;
}
kernel void reduce2(global float4 *frcs, ulong k, global float4 *result)
{
    ulong gi = get_global_id(0);
    ulong gs = get_global_size(0);
    float3 tmp = 0;
    // Each work-item reads a contiguous run: gi*gs, gi*gs+1, gi*gs+2, ...
    for(ulong i = 0; i < k; i++)
        tmp += frcs[gi*gs + i].xyz;
    result[gi].xyz = tmp;
}
It's the reduce kernels that are of interest here. The forcesCL kernel just estimates the Lorentz force between two charges, where the position of each is encoded in the xyz components of a float4 and the charge in the w component. The physics isn't important; it's just a toy to play with OpenCL.
I won't go through the PyOpenCL setup unless asked, other than to show the build step:
program=cl.Program(context,'\n'.join((src_forcesCL,src_reduce,src_reduce2))).build()
I then set up the arrays with random positions and the elementary charge:
a=np.random.rand(10000,4).astype(np.float32)
a[:,3]=np.float32(q)
b=np.random.rand(10000,4).astype(np.float32)
b[:,3]=np.float32(q)
Then I set up scratch space and allocations for the return values:
c=np.empty((10000,10000,4),dtype=np.float32)
cc=cl.Buffer(context,cl.mem_flags.READ_WRITE,c.nbytes)
r=np.empty((10000,4),dtype=np.float32)
rr=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,r.nbytes)
s=np.empty((10000,4),dtype=np.float32)
ss=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,s.nbytes)
Then I run this each of two ways: once using reduce(), and once using reduce2(). The only difference should be which axis I'm summing over:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,aa,bb,k,cc)
evt2=program.reduce(queue,r.shape[0:1],None,cc,np.uint64(b.shape[0]),rr,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,r,rr,wait_for=[evt2],is_blocking=True)
Notice I've swapped the arguments to forcesCL so I can check the results against the first method:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,bb,aa,k,cc)
evt2=program.reduce2(queue,s.shape[0:1],None,cc,np.uint64(a.shape[0]),ss,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,s,ss,wait_for=[evt2],is_blocking=True)
The version using the reduce() kernel gives me times on the order of 140 ms; the version using the reduce2() kernel gives me times on the order of 360 ms. The values returned are the same, save for a sign change because they're being subtracted in the opposite order.
If I do the forcesCL step just once and time only the two reduce kernels, the difference is much more pronounced: on the order of 30 ms versus 250 ms.
I wasn't expecting any difference, but if I were I would have expected the contiguous accesses to perform better, not worse.
Can anyone give me some insight into what's happening here?
Thanks!
This is a clear case of memory coalescing. It is not about how the index is used (rows or columns), but about how the memory is physically accessed by the hardware. It's just a matter of following, step by step, which real accesses are performed and in which order.
Let's analyze it properly.
Suppose the work-items are divided into local blocks of size N.
For the first case (reduce):
WI_0 will read 0, Gs, 2Gs, 3Gs, .... (k-1)Gs
WI_1 will read 1, Gs+1, 2Gs+1, 3Gs+1, .... (k-1)Gs+1
...
Since each of these work-items runs in parallel, their memory accesses occur at the same time. So the memory controller is asked for:
First iteration: 0, 1, 2, 3 ... N-1 -> grouped into a few memory accesses
Second iteration: Gs, Gs+1, Gs+2, ... Gs+N-1 -> grouped into a few memory accesses
...
In this case, in each iteration the memory controller groups all N work-item requests into big 256-bit reads/writes to global memory. There is no need to cache anything, since the data can be discarded once it has been processed.
For the second case (reduce2):
WI_0 will read 0, 1, 2, 3, .... (k-1)
WI_1 will read Gs, Gs+1, Gs+2, Gs+3, .... Gs+(k-1)
...
The memory controller is asked for:
First iteration: 0, Gs, 2Gs, 3Gs ... -> scattered reads, no grouping
Second iteration: 1, Gs+1, 2Gs+1, 3Gs+1 ... -> scattered reads, no grouping
...
In this case the memory controller cannot coalesce anything. It would only work if the cache were infinite, but it is not. Some reads can still be served from cache, because the addresses requested by different work-items overlap now and then (thanks to the k-sized inner loop), but not all of them.
When you reduce the value of k, you reduce the amount of cache reuse that is possible, and this leads to even bigger differences between the column and row access modes.
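This is not from the code above, but to illustrate the point, here is a minimal sketch (the kernel name, the row-major layout, and the power-of-two work-group size are all assumptions) of how a sum along the contiguous direction can keep its accesses coalesced: let a whole work-group cooperate on one row, so neighbouring work-items still read neighbouring addresses, and combine the partial sums in local memory.

// One work-group reduces one row of a row-major (rows x cols) float4 buffer.
// Consecutive work-items read consecutive addresses on every pass, so the
// reads stay coalesced even though each row is contiguous in memory.
kernel void reduce_row_coalesced(global const float4 *frcs, ulong cols,
                                 global float4 *result, local float4 *scratch)
{
    size_t row = get_group_id(0);
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);   // assumed to be a power of two

    float4 acc = (float4)(0.0f);
    for (ulong i = lid; i < cols; i += lsz)
        acc += frcs[row * cols + i];  // lanes lid, lid+1, ... hit adjacent elements

    scratch[lid] = acc;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction of the per-lane partial sums in local memory.
    for (size_t s = lsz / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        result[row] = scratch[0];
}

The host would launch this with one work-group per row and, in PyOpenCL, pass something like cl.LocalMemory(16 * work_group_size) for the scratch argument.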
I was trying to find the best work-group size for a problem, and I noticed something that I couldn't justify to myself.
These are my results :
GlobalWorkSize {6400 6400 1}, WorkGroupSize {64 4 1}, Time(Milliseconds) = 44.18
GlobalWorkSize {6400 6400 1}, WorkGroupSize {4 64 1}, Time(Milliseconds) = 24.39
Swapping the axes made execution twice as fast. Why?!
By the way, I was using an AMD GPU.
Thanks :-)
EDIT :
This is the kernel (a Simple Matrix Transposition):
__kernel void transpose(__global float *input, __global float *output, const int size){
    int i = get_global_id(0);
    int j = get_global_id(1);
    output[i*size + j] = input[j*size + i];
}
I agree with @Thomas, it most probably depends on your kernel. Most probably, in the second case you access memory in a coalesced way and/or make full use of each memory transaction.
Coalescing: when threads need to access elements in memory, the hardware tries to serve them in as few transactions as possible, i.e. if thread 0 and thread 1 have to access contiguous elements, there will be only one transaction.
Full use of a memory transaction: say you have a GPU that fetches 32 bytes in one transaction. If 4 threads each need to fetch one int, you are using only half of the data fetched by the transaction; you waste the rest (assuming an int is 4 bytes).
To illustrate this, let's say you have an n by n matrix to access. The matrix is stored in row-major order, and you use n threads organized in one dimension. You have two possibilities:
Each work-item takes care of one column, looping through the column's elements one at a time.
Each work-item takes care of one line, looping through the line's elements one at a time.
It might be counter-intuitive, but the first solution is the one that can make coalesced accesses, while the second cannot. The reason is that when the first work-item needs to access the first element of the first column, the second work-item accesses the first element of the second column, and so on. These elements are contiguous in memory. This is not the case for the second solution.
Now if you take the same example and apply solution 1, but this time with 4 work-items instead of n, on the same GPU I mentioned before, you'll most probably increase the time by a factor of 2, since you will waste half of your memory transactions.
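A minimal OpenCL sketch of solution 1 (the kernel name and the row-major n x n float layout are just assumptions for illustration): each work-item owns one column, so at every loop iteration neighbouring work-items read neighbouring addresses and the reads coalesce.

kernel void sum_columns(global const float *mat, const int n, global float *colsum)
{
    int col = get_global_id(0);        // one work-item per column
    float acc = 0.0f;
    for (int row = 0; row < n; ++row)
        acc += mat[row*n + col];       // work-items col, col+1, ... hit adjacent elements
    colsum[col] = acc;
}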
EDIT: Now that you posted your kernel I see that I forgot to mention something else.
With your kernel, it seems that choosing a local size of (1, 256) or (256, 1) is always a bad choice. In the first case, 256 transactions are needed to read a column of input (each fetching 32 bytes, of which only 4 are used, keeping in mind the same GPU from my previous examples), while 32 transactions are enough to write to output: you can write 8 floats in one transaction, hence 32 transactions for the 256 elements.
It is the same problem with a work-group size of (256, 1), but this time using 32 transactions to read and 256 to write.
So why does the first size work better? Because there is a cache system that can mitigate the bad accesses on the read side. Therefore the size (1, 256) is good for the write part, and the cache system handles the not-so-good read part, decreasing the number of read transactions actually needed.
Note that the number of transactions decreases overall (taking into consideration all the work-groups within the NDRange). For example, the first work-group issues 256 transactions to read the first 256 elements of the first column. The second work-group might simply go to the cache to retrieve the elements of the second column, because they were already fetched by the (32-byte) transactions issued by the first work-group.
Now, I'm almost sure that you can do better than (1, 256); try (8, 32).
How can I get the global thread id in 2 dimensions in OpenCL?
I know that for 1 dimension, the formula is:
int global_id = get_global_id(1) * get_global_size(0) + get_global_id(0);
But if I allocate like this:
size_t block_size[] = {2,2}
size_t grid_size[] = {35,20}
The above formula fails, giving indexes only from 0 to 35*20.
The indexes should go from 0 to 35*40*2*2.
Can you recommend any good documentation or writings that could give me the intuition to understand how all of this works?
Thanks!
If you're launching a 2D NDRange, then get_global_id(0) and get_global_id(1) give you the Gx and Gy indices. You can also independently fetch the local IDs using get_local_id(0) and get_local_id(1).
There's no need to calculate it yourself.
Did you mean that you're launching a 2D range of threads but want to map each thread to a position in a 1-dimensional buffer?
EDIT: After reading your comment, I thought an explanation is in order.
OpenCL launches as many kernel instances (work-items) as get_global_size(0) * get_global_size(1), which here is 35 * 20, so you will have threads
(0 ,0) (0 ,1) ... (0,34)
(1 ,0) (1 ,1) ... (1,34)
.
.
.
(19,0) (19,1) ... (19,34)
Local work size is simply a way of splitting up the total number of threads and distributing them across the available compute units. It is quite possible that only 2 * 2 = 4 threads are running at any point in time.
The clEnqueueNDRangeKernel documentation tells us that local_work_size can be null, in which case the implementation will determine the size to break up the total amount of work.
In no way does local work size increase the number of threads.
Note that the total number of kernel instances launched is still get_global_size(0) * get_global_size(1).
If you want your 1D indices to go from 0..(35*40*2*2 - 1) then launch the kernel so that get_global_size(0) * get_global_size(1) is 35*40*2*2 (perhaps 70 x 80 ?)
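As a minimal sketch (the kernel and buffer names are illustrative), a kernel launched over such a 2D NDRange can compute its own flat index with exactly the formula quoted above:

kernel void flat_index(global int *out)
{
    size_t gx = get_global_id(0);
    size_t gy = get_global_id(1);
    // Covers 0 .. get_global_size(0)*get_global_size(1) - 1, one value per work-item.
    size_t flat = gy * get_global_size(0) + gx;
    out[flat] = (int)flat;
}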
Hope this helps.
I am fairly new to using MPI. My question is the following: I have a matrix with 2000 rows and 3 columns stored as a 2D array (not contiguous data). Without changing the structure of the array, depending on the number of processes np, each process should get a portion of the matrix.
Example:
A: 2D array of 2000 rows by 3 columns; with np = 2, P0 gets the first half of A, which would be a 2D array of the first 1000 rows by 3 columns, and P1 gets the second half, the last 1000 rows by 3 columns.
Now np can be any number (as long as it divides the number of rows). Is there an easy way to go about this?
I will have to use FORTRAN 90 for this assignment.
Thank you
Row-wise distribution of 2D arrays in Fortran is tricky (but not impossible) using scatter/gather operations directly, because of the column-major storage. Two possible solutions follow.
Pure Fortran 90 solution: with Fortran 90 you can specify array sections like A(1:4,2:3), which takes a small 4x2 block out of the matrix A, and you can pass array sections to MPI routines. Note that with current MPI implementations (conforming to the now-old MPI-2.2 standard), the compiler creates a temporary contiguous copy of the section data and passes that to the MPI routine (and since the lifetime of the temporary storage is not well defined, one should not pass array sections to non-blocking MPI operations like MPI_ISEND). MPI-3.0 introduces a new Fortran 2008 interface that allows MPI routines to take array sections directly (without intermediate copies) and supports passing sections to non-blocking calls.
With array sections you only have to implement a simple DO loop in the root process:
INTEGER :: i, rows_per_proc, start_row, end_row, ierr

rows_per_proc = 2000/nproc

IF (rank == root) THEN
  DO i = 0, nproc-1
    IF (i /= root) THEN
      start_row = 1 + i*rows_per_proc
      end_row = (i+1)*rows_per_proc
      CALL MPI_SEND(mat(start_row:end_row,:), 3*rows_per_proc, MPI_REAL, &
                    i, 0, MPI_COMM_WORLD, ierr)
    END IF
  END DO
ELSE
  CALL MPI_RECV(submat(1,1), 3*rows_per_proc, MPI_REAL, ...)
END IF
Pure MPI solution (also works with FORTRAN 77): first, you have to declare a vector datatype with MPI_TYPE_VECTOR. The number of blocks would be 3, the block length would be the number of rows that each process should get (e.g. 1000), and the stride should be equal to the total height of the matrix (e.g. 2000). If this datatype is called blktype, then the following would send the top half of the matrix:
REAL, DIMENSION(2000,3) :: mat
CALL MPI_SEND(mat(1,1), 1, blktype, p0, ...)
CALL MPI_SEND(mat(1001,1), 1, blktype, p1, ...)
Calling MPI_SEND with blktype would take 1000 elements from the specified starting address, then skip the next 2000 - 1000 = 1000 elements, take another 1000 and so on, 3 times in total. This would form a 1000-row sub-matrix of your big matrix.
You can now run a loop to send a different sub-block to each process in the communicator, effectively performing a scatter operation. In order to receive this sub-block, the receiving process could simply specify:
REAL, DIMENSION(1000,3) :: submat
CALL MPI_RECV(submat(1,1), 3*1000, MPI_REAL, root, ...)
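For completeness, here is a sketch of how blktype itself could be created and committed, matching the parameters described above (assuming the usual USE mpi or INCLUDE 'mpif.h' setup):

INTEGER :: blktype, ierr

! 3 blocks (one per column) of 1000 contiguous REALs, block starts 2000 elements apart
CALL MPI_TYPE_VECTOR(3, 1000, 2000, MPI_REAL, blktype, ierr)
CALL MPI_TYPE_COMMIT(blktype, ierr)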
If you are new to MPI, this is all you need to know about scattering matrices by rows in Fortran. If you know well how the MPI type system works, read ahead for a more elegant solution.
(Jonathan Dursi has an excellent description of how to do that with MPI_SCATTERV. His solution deals with splitting a C matrix by columns, which is essentially the same problem as the one here, since C stores matrices in row-major fashion. A Fortran version follows.)
You could also make use of MPI_SCATTERV, but it is quite involved. It builds on the pure MPI solution presented above. First you have to resize the blktype datatype into a new type that has an extent equal to that of MPI_REAL, so that offsets can be specified in array elements. This is needed because offsets in MPI_SCATTERV are specified in multiples of the extent of the datatype, and the extent of blktype is the size of the whole matrix. But because of the strided storage, both sub-blocks start only 4000 bytes apart (1000 times the typical extent of MPI_REAL). To modify the extent of the type, use MPI_TYPE_CREATE_RESIZED:
INTEGER(KIND=MPI_ADDRESS_KIND) :: lb, extent
! Get the extent of MPI_REAL
CALL MPI_TYPE_GET_EXTENT(MPI_REAL, lb, extent, ierr)
! Bestow the same extent upon the brother of blktype
CALL MPI_TYPE_CREATE_RESIZED(blktype, lb, extent, blk1b, ierr)
This creates a new datatype, blk1b, which has all the characteristics of blktype, e.g. it can be used to send whole sub-blocks, but when used in array operations MPI only advances the data pointer by the size of a single MPI_REAL instead of by the size of the whole matrix. With this new type, you can now position the start of each chunk for MPI_SCATTERV on any element of mat, including the start of any matrix row. Example with two sub-blocks:
INTEGER, DIMENSION(2) :: sendcounts, displs
! First sub-block
sendcounts(1) = 1
displs(1) = 0
! Second sub-block
sendcounts(2) = 1
displs(2) = 1000
CALL MPI_SCATTERV(mat(1,1), sendcounts, displs, blk1b, &
                  submat(1,1), 3*1000, MPI_REAL, &
                  root, MPI_COMM_WORLD, ierr)
Here the displacement of the first sub-block is 0, which coincides with the beginning of the matrix. The displacement of the second sub-block is 1000, i.e. it would start on the 1000-th row of the first column. On the receiver's side the data count argument is 3*1000 elements, which matches the size of the sub-block type.
This is part of the pseudo code I am implementing in CUDA as part of an image reconstruction algorithm:
for each xbin(0->detectorXDim/2-1):
    for each ybin(0->detectorYDim-1):
        rayInit=(xbin*xBinSize+0.5,ybin*xBinSize+0.5,-detectordistance)
        rayEnd=beamFocusCoord
        slopeVector=rayEnd-rayInit
        //knowing that r=rayInit+t*slopeVector;
        //x=rayInit[0]+t*slopeVector[0]
        //y=rayInit[1]+t*slopeVector[1]
        //z=rayInit[2]+t*slopeVector[2]
        //to find ray xx intersections:
        for each xinteger(xbin+1->detectorXDim/2):
            solve t for x=xinteger*xBinSize;
            find corresponding y and z
            add to intersections array
        //find ray yy intersections (analogous to xx intersections)
        //find ray zz intersections (analogous to xx intersections)
So far, this is what I have come up with:
__global__ void sysmat(int xfocus, int yfocus, int zfocus, int xbin, int xbinsize, int ybin, int ybinsize, int zbin, int projecoes){
    int tx=threadIdx.x, ty=threadIdx.y, tz=threadIdx.z, bx=blockIdx.x, by=blockIdx.y, i, x, y, z;
    int idx=ty+by*blocksize;
    int idy=tx+bx*blocksize;
    int slopeVectorx=xfocus-idx*xbinsize+0.5;
    int slopeVectory=yfocus-idy*ybinsize+0.5;
    int slopeVectorz=zfocus-zdetector;
    __syncthreads();

    //points where the ray intersects x axis
    int xint=idx+1;
    int yint=idy+1;
    int*intersectionsx[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
    int*intersectionsy[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
    int*intersectionsz[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];

    for(xint=xint; xint<detectorXDim/2; xint++){
        x=xint*xbinsize;
        t=(x-idx)/slopeVectorx;
        y=idy+t*slopeVectory;
        z=z+t*slopeVectorz;
        intersectionsx[xint-1]=x;
        intersectionsy[xint-1]=y;
        intersectionsz[xint-1]=z;
        __syncthreads();
    }
    ...
}
This is just a piece of the code. I know that there might be some errors (you can point them out if they are blatantly wrong), but what I am more concerned about is this:
Each thread (which corresponds to a detector bin) needs three arrays so it can save the points where the ray (which passes through this thread/bin) intersects multiples of the x, y and z axes. Each array's length depends on the position of the thread/bin (its index) in the detector and on beamFocusCoord (which is fixed). In order to do this I wrote this piece of code, which I am certain cannot be done (I confirmed it with a small test kernel, and it returns the error "expression must have constant value"):
int*intersectionsx[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
int*intersectionsy[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
int*intersectionsz[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
So in the end, I want to know if there is an alternative to this piece of code, where a vector's length depends on the index of the thread allocating that vector.
Thank you in advance ;)
EDIT: Given that each thread will have to save an array with the coordinates of the intersections between the ray (that goes from the beam source to the detector) and the xx, yy and zz axes, and that the spatial dimensions are around (I don't have the exact numbers at the moment, but they are very close to the real values) 1400x3600x60, is this problem feasible with CUDA?
For example, the thread (0,0) will have 1400 intersections along the x axis, 3600 along the y axis and 60 along the z axis, meaning that I would have to create an array of size (1400+3600+60)*sizeof(float), which is around 20 KB per thread.
So given that each thread would surpass the 16 KB of local memory, that approach is out of the question. The other alternative was to preallocate those arrays, but with some more math we get (1400+3600+60)*4*numberofthreads (i.e. 1400*3600), which also surpasses the amount of global memory available :(
So I am running out of ideas to deal with this problem and any help is appreciated.
No.
Every piece of memory in CUDA must be known at kernel-launch time. You can't allocate/deallocate/change anything while the kernel is running. This is true for global memory, shared memory and registers.
The common workaround is to allocate the maximum amount of memory needed beforehand. This can be as simple as allocating the maximum size needed by one thread, multiplied by the number of threads, or as complex as adding up all the per-thread sizes for an exact total and computing appropriate per-thread offsets into that array. That's a tradeoff between memory consumption and offset-computation time.
Go for the simple solution if you can and for the complex if you have to, due to memory limitations.
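For instance, here is a minimal CUDA sketch of the simple variant (all names and sizes are illustrative, not taken from the question): one large global buffer holding max_per_thread entries per thread, where each thread only ever touches its own slice.

#include <cuda_runtime.h>

// Each thread writes into its own fixed-size slice of one big buffer
// that was sized for the worst case before the kernel was launched.
__global__ void collect(float *scratch, int max_per_thread, int num_threads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_threads) return;

    float *mine = scratch + (size_t)tid * max_per_thread;   // this thread's slice

    for (int i = 0; i < max_per_thread; ++i)
        mine[i] = 0.0f;   // placeholder for the real intersection values
}

int main(void)
{
    const int num_threads = 1024;
    const int max_per_thread = 256;   // worst-case entry count for any single thread

    float *scratch = 0;
    cudaMalloc(&scratch, (size_t)num_threads * max_per_thread * sizeof(float));

    collect<<<(num_threads + 255) / 256, 256>>>(scratch, max_per_thread, num_threads);
    cudaDeviceSynchronize();

    cudaFree(scratch);
    return 0;
}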
Why are you not using textures? Using a 2D or 3D texture would make this problem much easier. The GPU is designed to do very fast floating-point interpolation, and CUDA includes excellent support for it. The literature has examples of projection reconstruction on the GPU, e.g. "Accelerating simultaneous algebraic reconstruction technique with motion compensation using CUDA-enabled GPU", and textures are an integral part of those algorithms. Your own manual coordinate calculations can only be slower and more error prone than what the GPU provides, unless you need something unusual like sinc interpolation.
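As an illustration of what that buys you, here is a small, self-contained sketch using the texture object API (all sizes and names are made up; it simply samples a 2D float image with hardware bilinear filtering):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void sample(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Fetch at a fractional coordinate: the texture unit interpolates for free.
    out[i] = tex2D<float>(tex, i + 0.5f, 0.25f * i + 0.5f);
}

int main(void)
{
    const int w = 64, h = 64, n = 32;
    float host[h][w];
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            host[y][x] = (float)(x + y);

    // Put the image into a CUDA array and bind a texture object to it.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, host, w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.filterMode = cudaFilterModeLinear;     // hardware interpolation
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);

    float *out;
    cudaMalloc(&out, n * sizeof(float));
    sample<<<1, n>>>(tex, out, n);

    float result[n];
    cudaMemcpy(result, out, sizeof(result), cudaMemcpyDeviceToHost);
    printf("sample[3] = %f\n", result[3]);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(out);
    return 0;
}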
1400x3600x60 is a little big for a single 3D texture, but you could break your problem up into 2D slices, 3D sub-volumes, or hierarchical multi-resolution reconstruction. These have all been used by other researchers. Just search PubMed.