Opengl asyncronous PBOs read - asynchronous

I am developing an application that needs to read back the whole frame from the openGL frame buffer.
In order to improve the performance, I'm using asynchronous glReadPixels with multiple PBOs:
glBindBuffer(GL_PIXEL_PACK_BUFFER, m_pbos[m_index]);
glReadPixels(0, 0, GLL_WINDOW_SIZE, GLL_WINDOW_SIZE, GL_RED, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
After reading some tutorials, I do the following (e.g. with 2 PBOs):
draw frame N (glDrawArrays)
start asynchronous read of PBO N from buffer (see above)
draw frame N+1 (glDrawArrays)
start asynchronous read of PBO N+1 from buffer (see above)
sync PBO read N (glMapBuffer)
process N
draw frame N+2 (glDrawArrays)
...
This seems to work correctly, but I have some concerns regarding the pipeline.
Typically, the drawing operation on frame N+1 (3) is done before the completion of the asynchronous reading of the frame N (2).
Is the frame buffer overwritten by the drawing operation?
If the glReadPixels runs asynchronously (2), is the frame N corrupted by this overwrite?

Related

Code never runs for arrays larger than 8000 entries with errors

I just started building a code for parallel computation with OpenCL.
As far as I understand, the data generated from CPU side (host) is transffered through the buffers (clCreateBuffer -> clEnqueueWriteBuffer -> clSetKernelArg, then processed by the device).
I mainly have to deal with arrays (or matrices) of large size with double precision.
However, I realized the code never runs for arrays larger than 8000 entries with errors.
(This makes sense because 64kb is equivalent to 8000 double precision numbers.)
The error codes were either -6 (CL_OUT_OF_HOST_MEMORY) or -31 (CL_INVALID_VALUE).
One more thing when I set the argument to 2-dimensional array, I could set the size up to 8000 x 8000.
So far, I guess the maximum data size for double precision is 8000 (64kb) for 1D arrays, but I have no idea what happens for 2D or 3D arrays.
Is there any other way to transfer the data larger than 64kb?
If I did something wrong for OpenCL setup in data transfer, what would be recommended?
I appreciate your kind answer.
The hardware that I'm using is Tesla V100 which is installed on the HPC cluster in my school.
The following is a part of my code snippet that I'm testing the data transfer.
bfr(0) = clCreateBuffer(context,
& CL_MEM_READ_WRITE + CL_MEM_COPY_HOST_PTR,
& sizeof(a), c_loc(a), err);
if(err.ne.0) stop "Couldn't create a buffer";
err=clEnqueueWriteBuffer(queue,bfr(0),CL_TRUE,0_8,
& sizeof(a),c_loc(a),0,C_NULL_PTR,C_NULL_PTR)
err = clSetKernelArg(kernel, 0,
& sizeof(bfr(0)), C_LOC(bfr(0)))
print*, err
if(err.ne.0)then
print *, "clSetKernelArg kernel"
print*, err
stop
endif
The code was build by Fortran with using clfortran module.
Thank you again for your answer.
You can do much larger arrays in OpenCL, as large as memory is abailable. For example I'm commonly working with linearized 4D arrays of 2 Billion floats in a CFD application.
Arrays need to be 1D only; if you have 2D or 3D arrays, linearize them, for example with n=x+y*size_x for 2D->1D coordinates. Some older devices only allow arrays 1/4 the size of the device memory. However modern devices typically have an extension to the OpenCL specification to enable larger buffers.
Here is a quickover view on what the OpenCL C bindings do:
clCreateBuffer allocates memory on the device side (video memory for GPUs, RAM for CPUs). Buffers can be as large as host/device memory allows or on some older devices 1/4 of device memory.
clEnqueueWriteBuffer copies memory over PCIe from RAM to video memory. Both on CPU and GPU side buffers must be allocated beforehand. There is no limit on transfer size; it can be as large as the entire buffer or only a subrange of a buffer.
clSetKernelArg links the GPU buffers to the Input parameters of the kernel, so it knows which kernel parameter corresponds to which buffer. Make sure data types of the buffers and kernel arguments match as you won't get an error if they don't. Also make sure the order of kernel arguments matches.
In your case there are several possible causes for the error:
Maybe you have integer overflow during computation of the array size. In this case use 64-bit integer numbers instead to compute the array size/indices.
You are out of memory because other buffers already take up too much memory. Do some bookkeeping to keep track on total (video) memory utilization.
You have selected the wrong device, for example integrated graphics instead of the dedicated GPU, in which case much less memory is available and you end up with cause 2.
To give you a more definitive answer, please provide some additional details:
What hardware do you use?
Show a code snippet of how you allocate device memory.
UPDATE
I see some errors in your code:
The length argument in clCreateBuffer and clEnqueueWriteBuffer requires the number of bytes that your array a has. If a is of type double, then this is a_length*sizeof(double), where a_length is the number of elements in the array a. sizeof(double) returns the number of bytes for one double number which is 8. So the length argument is 8 bytes times the number of elements in the array.
For multiple flags, you typically use bitwise or | instead of +. Shouldn't be an issue here, but is unconvenional.
You had "0_8" as buffer offset. This needs to be zero (0).
const int a_length = 8000;
double* a = new double[a_length];
bfr(0) = clCreateBuffer(context, CL_MEM_READ_WRITE|CL_MEM_COPY_HOST_PTR, a_length*sizeof(double), c_loc(a), err);
if(err.ne.0) stop "Couldn't create a buffer";
err = clEnqueueWriteBuffer(queue, bfr(0), CL_TRUE, 0, a_length*sizeof(double), c_loc(a), 0, C_NULL_PTR, C_NULL_PTR);
err = clSetKernelArg(kernel, 0, sizeof(bfr(0)), C_LOC(bfr(0)));
print*, err;
if(err.ne.0) {
print *, "clSetKernelArg kernel"
print*, err
stop
}

C++ functions and MPI programing

From what I have learned in my supercomputing class I know that MPI is a communicating (and data passing) interface.
I'm confused on when you run a function in a C++ program and want each processor to perform a specific task.
For example, a prime number search (very popular for supercomputers). Say I have a range of values (531-564, some arbitrary range) and say I have 50 processes I could run a series of evaluations on for each number. If root (process 0) wants to examine 531 and knowing prime numbers I can use 8 processes (1-8) to evaluate the prime status. If the number is divisible by any number 2-9 with a remainder of 0, then it is not prime.
Is it possible that for MPI which passes data to each process to have these processes perform these actions?
The hardest part for me is understanding that if I perform an action in the original C++ program the processes taking place could be allocated on several different processes, then in MPI how can I structure this? Or is my understanding completely wrong? If so how am I supposed to truly go about this path of thinking in a correct manner?
The big idea is passing data to a process versus having a function sent to a process. I'm fairly certain I'm wrong but I'm trying to back track to fix my thinking.
Each MPI process is running the same program, but that doesn't mean that they are doing the same thing. Different processes can be running different branches of the code, depending on the id (or "rank") of the process, and in effect be completely independent. Like any distributed computation, the actors do need to agree on how they will communicate.
The most basic strategy in MPI is scatter-gather, where the "master" process (usually the one with rank 0) will split an array of work equally amongst the peers (including the master process itself) by having them all call scatter, the peers will do the work, then all peers will call gather to send the results back to master.
In your prime algorithm example, build an array of integers, "scatter" it to all the peers, each peer will run through its array saving 1 if it is prime, 0 if it is not then "gather" the results to master. [In this particular example, since the input data is completely predictable based on process rank, the scatter step is unnecessary but we will do it anyway.]
As pseudo-code:
main():
int x[n], n = 100
MPI_init()
// prepare data on master
if rank == 0:
for i in 1 ... n, x[i] = i
// send data from x on root to local on each process in world
MPI_scatter(x, n, int, local, n/k, int, root, world)
for i in 1 ... n/k
result[i] = 1 // assume prime
if 2 divides local[i], result[i] = 0
if 3 divides local[i], result[i] = 0
if 5 divides local[i], result[i] = 0
if 7 divides local[i], result[i] = 0
// gather reults from local on each process in world to x on root
MPI_gather(result, n/k, int, x, n, int, root, world)
// print results
if rank == 0:
for i in 1 ... n, print i if x[i] == 1
MPI_finalize()
There are lots of details to fill in such as proper declarations, and dealing with the fact that some ranks will have fewer elements than others, using
proper C syntax, etc., but getting them right doesn't help explain the overall picture.
More fine-grained synchronization and communication is possible using direct send/recv between processes. Such programs are harder to write since the different processes may be in different states. In particular, it is important that if process a is calling MPI_send to process b, then process b had better be calling MPI_recv from a.

OpenCL: 2-10x run time summing columns rather that rows of square array?

I'm just getting started with OpenCL, so I'm sure there's a dozen things I can do to improve this code, but one thing in particular is standing out to me: If I sum columns rather than rows (basically contiguous versus strided, because all buffers are linear) in a 2D array of data, I get different run times by a factor of anywhere from 2 to 10x. Strangely, the contiguous access appear slower.
I'm using PyOpenCL to test.
Here's the two kernels of interest (reduce and reduce2), and another that's generating the data feeding them (forcesCL):
kernel void forcesCL(global float4 *chrgs, global float4 *chrgs2,float k, global float4 *frcs)
{
int i=get_global_id(0);
int j=get_global_id(1);
int idx=i+get_global_size(0)*j;
float3 c=chrgs[i].xyz-chrgs2[j].xyz;
float l=length(c);
frcs[idx].xyz= (l==0 ? 0
: ((chrgs[i].w*chrgs2[j].w)/(k*pown(l,2)))*normalize(c));
frcs[idx].w=0;
}
kernel void reduce(global float4 *frcs,ulong k,global float4 *result)
{
ulong gi=get_global_id(0);
ulong gs=get_global_size(0);
float3 tmp=0;
for(ulong i=0;i<k;i++)
tmp+=frcs[gi+i*gs].xyz;
result[gi].xyz=tmp;
}
kernel void reduce2(global float4 *frcs,ulong k,global float4 *result)
{
ulong gi=get_global_id(0);
ulong gs=get_global_size(0);
float3 tmp=0;
for(ulong i=0;i<k;i++)
tmp+=frcs[gi*gs+i].xyz;
result[gi].xyz=tmp;
}
It's the reduce kernels that are of interest here. The forcesCL kernel just estimates the Lorenz force between two charges where the position of each is encoded in the xyz component of a float4, and the charge in the w component. The physics isn't important, it's just a toy to play with OpenCL.
I won't go through the PyOpenCL setup unless asked, other than to show the build step:
program=cl.Program(context,'\n'.join((src_forcesCL,src_reduce,src_reduce2))).build()
I then setup of the arrays with random positions and the elementary charge:
a=np.random.rand(10000,4).astype(np.float32)
a[:,3]=np.float32(q)
b=np.random.rand(10000,4).astype(np.float32)
b[:,3]=np.float32(q)
Setup scratch space and allocations for return values:
c=np.empty((10000,10000,4),dtype=np.float32)
cc=cl.Buffer(context,cl.mem_flags.READ_WRITE,c.nbytes)
r=np.empty((10000,4),dtype=np.float32)
rr=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,r.nbytes)
s=np.empty((10000,4),dtype=np.float32)
ss=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,s.nbytes)
Then I try to run this in each of two ways-- once using reduce(), and once reduce2(). The only difference should be in which axis I'm summing:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,aa,bb,k,cc)
evt2=program.reduce(queue,r.shape[0:1],None,cc,np.uint32(b.shape[0:1]),rr,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,r,rr,wait_for=[evt2],is_blocking=True)
Notice I've swapped the arguments to forcesCL so I can check the results against the first method:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,bb,aa,k,cc)
evt2=program.reduce2(queue,s.shape[0:1],None,cc,np.uint32(a.shape[0:1]),ss,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,s,ss,wait_for=[evt2],is_blocking=True)
The version using the reduce() kernel gives me times on the order of 140ms, the version using the reduce2() kernel gives me times on the order of 360ms. The values returned are the same, save a sign change because they're being subtracted in the opposite order.
If I do the forcesCL step once, and run the two reduce kernels, the difference is much more pronounced-- on the order of 30ms versus 250ms.
I wasn't expecting any difference, but if I were I would have expected the contiguous accesses to perform better, not worse.
Can anyone give me some insight into what's happening here?
Thanks!
This is a clear case of example of coalescence. It is not about how the index is used (in the rows or columns), but how the memory is accessed in the HW. Just a matter of following step by step how the real accesses are performed and in which order.
Lets analyze it properly:
Suppose Work-Items are divided in local blocks of size N.
For the first case:
WI_0 will read 0, Gs, 2Gs, 3Gs, .... (k-1)Gs
WI_1 will read 1, Gs+1, 2Gs+1, 3Gs+1, .... (k-1)Gs+1
...
Since each of this WI is run in parallel, their memory accesses occur at the same time. So, the memory controller is requested:
First iteration: 0, 1, 2, 3 ... N-1 -> Groups into few memory access
Second iteration: Gs, Gs+1, Gs+2, ... Gs+N-1 -> Groups into few memory access
...
In that case, in each iteration, the memory controller groups all the N WI requests into a big 256bits reads/writes to global. There is no need to cache, since after processing the data it can be discarded.
For the second case:
WI_0 will read 0, 1, 2, 3, .... (k-1)
WI_1 will read Gs, Gs+1, Gs+2, Gs+3, .... Gs+(k-1)
...
The memory controller is requested:
First iteration: 0, Gs, 2Gs, 3Gs -> Scattered read, no grouping
Second iteration: 1, Gs+1, 2Gs+1, 3Gs+1 ->Scattered read, no grouping
...
In this case, the memory controller is not working in a proper mode. It would work if the cache memory is infinite, but it is not. It can cache some reads thanks to the inter-workitems memory requested being the same sometimes (due to the k size for loop) but not all of them.
When you reduce the value of k, you reduce the amount of cache reuses that is possibleI. And this lead to even bigger differences between the column and row access modes.

OpenCL - Are work-group axes exchangeable?

I was trying to find the best work-group size for a problem and I figured out something that I couldn't justify for myself.
These are my results :
GlobalWorkSize {6400 6400 1}, WorkGroupSize {64 4 1}, Time(Milliseconds) = 44.18
GlobalWorkSize {6400 6400 1}, WorkGroupSize {4 64 1}, Time(Milliseconds) = 24.39
Swapping axes caused a twice faster execution. Why !?
By the way, I was using an AMD GPU.
Thanks :-)
EDIT :
This is the kernel (a Simple Matrix Transposition):
__kernel void transpose(__global float *input, __global float *output, const int size){
int i = get_global_id(0);
int j = get_global_id(1);
output[i*size + j] = input[j*size + i];
}
I agree with #Thomas, it most probably depends on your kernel. Most probably, in the second case you access memory in a coalescent way and/or make a full use of memory transaction.
Coalescence: When threads need to access elements in the memory the hardware tries to access these elements in as less as possible transactions i.e. if the thread 0 and the thread 1 have to access contiguous elements there will be only one transaction.
full use of a memory transaction: Let's say you have a GPU that fetches 32 bytes in one transaction. Therefore if you have 4 threads that need to fetch one int each you are using only half of the data fetched by the transaction; you waste the rest (assuming an int is 4 bytes).
To illustrate this, let's say that you have a n by n matrix to access. Your matrix is in row major, and you use n threads organized in one dimension. You have two possibilities:
Each workitem takes care of one column, looping through each column element one at a time.
Each workitem takes care of one line, looping through each line element one at a time.
It might be counter-intuitive, but the first solution will be able to make coalescent access while the second won't be. The reason is that when the first workitem will need to access the first element in the first column, the second workitem will access the first element in the second column and so on. These elements are contiguous in the memory. This is not the case for the second solution.
Now if you take the same example, and apply the solution 1 but this time you have 4 workitems instead of n and the same GPU I've just spoken before you'll most probably increase the time by a factor 2 since you will waste half of your memory transactions.
EDIT: Now that you posted your kernel I see that I forgot to mention something else.
With your kernel, it seems that choosing a local size of (1, 256) or (256, 1) is always a bad choice. In the first case 256 transactions will be necessary to read a column (each fetching 32 bytes out of which only 4 will be used - keeping in mind the same GPU of my previous examples) in input while 32 transactions will be necessary to write in output: You can write 8 floats in one transaction hence 32 transactions to write the 256 elements.
This is the same problem with a workgroup size of (256, 1) but this time using 32 transactions to read, and 256 to write.
So why the first size works better? It's because there is a cache system, that can mitigate the bad access for the read part. Therefore the size (1, 256) is good for the write part and the cache system handle the not very good read part, decreasing the number of necessary read transactions.
Note that the number of transactions decreases overall (taking into considerations all the workgroups within the NDRange). For example the first workgroup issues the 256 transactions, to read the 256 first elements of the first column. The second workgroup might just go in the cache to retrieve the elements of the second column because they were fetched by the transactions (of 32 bytes) issued by the first workgroup.
Now, I'm almost sure that you can do better than (1, 256) try (8, 32).

Breaking down a Matrix so that every process gets its share of the matrix using MPI

I am fairly new to using MPI. My question is the following: I have a matrix with 2000 rows and 3 columns stored as a 2D array (not contiguous data). Without changing the structure of the array, depending on the number of processes np, each process should get a portion of the matrix.
Example:
A: 2D array of 2000 arrays by 3 columns, np = 2, then P0 gets the first half of A which would be 2D array of first 1000 rows by 3 columns, and P1 gets the second half which would be the second 1000 rows by 3 columns.
Now np can be any number (as long as it divides the number of rows). Any easy way to go about this?
I will have to use FORTRAN 90 for this assignment.
Thank you
Row-wise distribution of 2D arrays in Fortran is tricky (but not impossible) using scatter/gather operations directly because of the column-major storage. Two possible solutions follow.
Pure Fortran 90 solution: With Fortran 90 you can specify array sections like A(1:4,2:3) which would take a small 4x2 block out of the matrix A. You can pass array slices to MPI routines. Note with current MPI implementations (conforming to the now old MPI-2.2 standard), the compiler would create temporary contiguous copy of the section data and would pass it to the MPI routine (since the lifetime of the temporary storage is not well defined, one should not pass array sectons to non-blocking MPI operations like MPI_ISEND). MPI-3.0 introduces new and very modern Fortran 2008 interface that allows MPI routines to directly take array sections (without intermediate arrays) and supports passing of sections to non-blocking calls.
With array sections you only have to implement a simple DO loop in the root process:
INTEGER :: i, rows_per_proc
rows_per_proc = 2000/nproc
IF (rank == root) THEN
DO i = 0, nproc-1
IF (i /= root) THEN
start_row = 1 + i*rows_per_proc
end_row = (i+1)*rows_per_proc
CALL MPI_SEND(mat(start_row:end_row,:), 3*rows_per_proc, MPI_REAL, &
i, 0, MPI_COMM_WORLD, ierr)
END IF
END DO
ELSE
CALL MPI_RECV(submat(1,1), 3*rows_per_proc, MPI_REAL, ...)
END IF
Pure MPI solution (also works with FORTRAN 77): First, you have to declare a vector datatype with MPI_TYPE_VECTOR. The number of blocks would be 3, the block length would be the number of rows that each process should get (e.g. 1000), the stride should be equal to the total height of the matrix (e.g. 2000). If this datatype is called blktype, then the following would send the top half of the matrix:
REAL, DIMENSION(2000,3) :: mat
CALL MPI_SEND(mat(1,1), 1, blktype, p0, ...)
CALL MPI_SEND(mat(1001,1), 1, blktype, p1, ...)
Calling MPI_SEND with blktype would take 1000 elements from the specified starting address, then skip the next 2000 - 1000 = 1000 elements, take another 1000 and so on, 3 times in total. This would form a 1000-row sub-matrix of your big matrix.
You can now run a loop to send a different sub-block to each process in the communicator, effectively performing a scatter operation. In order to receive this sub-block, the receiving process could simply specify:
REAL, DIMENSION(1000,3) :: submat
CALL MPI_RECV(submat(1,1), 3*1000, MPI_REAL, root, ...)
If you are new to MPI, this is all you need to know about scattering matrices by rows in Fortran. If you know well how the type system of MPI works, then read ahead for more elegant solution.
(See here for an excellent description on how to do that with MPI_SCATTERV by Jonathan Dursi. His solution deals with splitting a C matrix in columns, which essentially poses the same problem as the one here as C stores matrices in row-major fashion. Fortran version follows.)
You could also make use of MPI_SCATTERV but it is quite involved. It builds on the pure MPI solution presented above. First you have to resize the blktype datatype into a new type, that has an extent, equal to that of MPI_REAL so that offsets in array elements could be specified. This is needed because offsets in MPI_SCATTERV are specified in multiples of the extent of the datatype specified and the extent of blktype is the size of the matrix itself. But because of the strided storage, both sub-blocks would start at only 4000 bytes apart (1000 times the typical extent of MPI_REAL). To modify the extent of the type, one would use MPI_TYPE_CREATE_RESIZED:
INTEGER(KIND=MPI_ADDRESS_KIND) :: lb, extent
! Get the extent of MPI_REAL
CALL MPI_TYPE_GET_EXTENT(MPI_REAL, lb, extent, ierr)
! Bestow the same extent upon the brother of blktype
CALL MPI_TYPE_CREATE_RESIZED(blktype, lb, extent, blk1b, ierr)
This creates a new datatype, blk1b, which has all characteristics of blktype, e.g. can be used to send whole sub-blocks, but when used in array operations, MPI would only advance the data pointer with the size of a single MPI_REAL instead of with the size of the whole matrix. With this new type, you could now position the start of each chunk for MPI_SCATTERV on any element of mat, including the start of any matrix row. Example with two sub-blocks:
INTEGER, DIMENSION(2) :: sendcounts, displs
! First sub-block
sendcounts(1) = 1
displs(1) = 0
! Second sub-block
sendcounts(2) = 1
displs(2) = 1000
CALL MPI_SCATTERV(mat(1,1), sendcounts, displs, blk1b, &
submat(1,1), 3*1000, MPI_REAL, &
root, MPI_COMM_WORLD, ierr)
Here the displacement of the first sub-block is 0, which coincides with the beginning of the matrix. The displacement of the second sub-block is 1000, i.e. it would start on the 1000-th row of the first column. On the receiver's side the data count argument is 3*1000 elements, which matches the size of the sub-block type.

Resources