Copying an Image using PyOpenCL - opencl

I've been having some trouble making a copy of an image using PyOpenCL. I wanted to start with copying because I really want to do other processing later, but I'm not able to get even this basic task of accessing every pixel to work. Please help me find the error.
Here is the program:
import pyopencl as cl
import numpy
import Image
import sys
img = Image.open(sys.argv[1])
img_arr = numpy.asarray(img).astype(numpy.uint8)
dim = img_arr.shape
host_arr = img_arr.reshape(-1)
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=host_arr)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, host_arr.nbytes)
kernel_code = """
__kernel void copyImage(__global const uint8 *a, __global uint8 *c)
{
    int rowid = get_global_id(0);
    int colid = get_global_id(1);
    int ncols = %d;
    int npix = %d; //number of pixels, 3 for RGB 4 for RGBA
    int index = rowid * ncols * npix + colid * npix;
    c[index + 0] = a[index + 0];
    c[index + 1] = a[index + 1];
    c[index + 2] = a[index + 2];
}
""" % (dim[1], dim[2])
prg = cl.Program(ctx, kernel_code).build()
prg.copyImage(queue, (dim[0], dim[1]) , None, a_buf, dest_buf)
result = numpy.empty_like(host_arr)
cl.enqueue_copy(queue, result, dest_buf)
result_reshaped = result.reshape(dim)
img2 = Image.fromarray(result_reshaped, "RGB")
img2.save("new_image_gpu.bmp")
The image I gave as input was: (image not shown)
However, the output from the program was: (image not shown)
I can't make sense of why those black lines appear.
Please help me solve this bug.
Thank You
OK! So I've found a solution. I changed all uint8 to int, and in the numpy array I removed the "astype(numpy.uint8)". I don't know why; I just tried this and it worked. An explanation as to why would be helpful. Also, does this mean this will take much more memory now?
It works, but now I think it takes much more memory. Any workaround using uint8 would be helpful.

There is a mismatch between the datatypes you are using in Python and OpenCL. In numpy, a uint8 is an 8-bit unsigned integer (which I presume is what you were after). In OpenCL, a uint8 is an 8-element vector of 32-bit unsigned integers. The correct datatype for an 8-bit unsigned integer in OpenCL is just uchar. So your astype(numpy.uint8) is fine, but it should be accompanied by arrays of __global const uchar* in your OpenCL kernel.
If you are dealing with images, I would also recommend looking into OpenCL's dedicated image types, which can take advantage of the native support for handling images available in some hardware.
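To make the size mismatch concrete, here is a small sketch in plain Python (no OpenCL needed) that mirrors the kernel's index arithmetic and compares how many bytes the last work-item reaches when the element type is uchar versus the vector type uint8. The 4x4 RGB image size and the helper name last_byte_touched are made up for illustration.

```python
# Element sizes in bytes. numpy's uint8 is a single byte, whereas
# OpenCL C's uint8 is a vector of eight 32-bit unsigned integers.
NUMPY_UINT8_BYTES = 1
OPENCL_UCHAR_BYTES = 1
OPENCL_UINT8_BYTES = 8 * 4


def last_byte_touched(rows, cols, channels, elem_bytes):
    """Offset one past the last byte the copy kernel writes.

    Mirrors the kernel's indexing: index = (row * cols + col) * channels,
    and the last work-item writes elements index .. index + 2.
    """
    last_index = ((rows - 1) * cols + (cols - 1)) * channels + 2
    return (last_index + 1) * elem_bytes


# Hypothetical 4x4 RGB image, chosen only for illustration.
rows, cols, channels = 4, 4, 3
buffer_bytes = rows * cols * channels * NUMPY_UINT8_BYTES

with_uchar = last_byte_touched(rows, cols, channels, OPENCL_UCHAR_BYTES)
with_uint8 = last_byte_touched(rows, cols, channels, OPENCL_UINT8_BYTES)

print(buffer_bytes, with_uchar, with_uint8)
```

With uchar the kernel stays exactly inside the buffer; with uint8 most work-items address memory far beyond it, which is undefined behavior.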

Related

strange OpenCL behavior

I've run into very strange OpenCL behavior. A minimal code sample follows.
Starting from some random index (commonly divisible by 32), values are not written to the array if I add one extra operation beforehand (g_idata[ai] = g_idata[ai-1]). It is also notable that I get the correct result if I:
just read the value and write a literal (see SHOW_BUG).
add if (ai >= n) g_idata[0]+=0; at the beginning (see the commented lines).
Tested on Intel and NVIDIA.
import numpy as np
import pyopencl as cl
ctx = cl.create_some_context()
prg = cl.Program(ctx, """
__kernel void prescan(__global float *g_idata, const int n) {
    int thid = get_global_id(0);
    int ai = thid*2+1;
    // if you uncomment the lines below, the bug disappears
    //if (ai >= n){
    //    g_idata[0]+=0;
    //}
    bool SHOW_BUG=1;
    // make a dummy operation
    if (SHOW_BUG)
        g_idata[ai] = g_idata[ai-1];
    else {
        g_idata[ai-1]; //dummy read
        g_idata[ai] = 3.14f; //constant write
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
    //set 0,1,2,3... as result
    g_idata[thid] = thid;
}
""").build()
prescan_kernel = prg.prescan
prescan_kernel.set_scalar_arg_dtypes([None, np.int32])

def main():
    N = 512
    a_np = (np.random.random((N,))).astype(np.float32)
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    a_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a_np)
    global_size = (512,)
    local_size = None
    prescan_kernel(queue, global_size, local_size, a_g, N)
    cl.enqueue_copy(queue, a_np, a_g)
    corect = np.array(range(N))
    #assert np.allclose(a_np, 3.14), np.where(3.14 != a_np)
    assert np.allclose(a_np, corect), np.where(corect != a_np)

if __name__ == '__main__':
    for i in range(25):
        main()
Several things in your code cause undefined behavior according to the OpenCL spec.
These include:
Accessing out-of-range memory. The kernel indexes up to 2*N-1, so the array needs at least 2*N elements for N work-items.
Multiple work-items (threads) accessing the same index of the array (read or write).
Furthermore, barriers only synchronize work-items within a work-group, so the barrier does not order accesses across work-groups in your code.
Because the behavior is undefined, it may differ between platforms; it can sometimes crash the driver and sometimes take down the OS. Please fix these problems and then describe what remains.
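The two problems listed above can be checked without a GPU. The following sketch replays the kernel's index arithmetic in plain Python and records which work-items touch which array index; touched_indices is a hypothetical helper written for this illustration.

```python
def touched_indices(num_work_items):
    """Map each array index to the set of work-items that read or write it."""
    touched = {}
    for thid in range(num_work_items):
        ai = thid * 2 + 1
        # Kernel statements: g_idata[ai] = g_idata[ai-1]; g_idata[thid] = thid;
        for index in (ai - 1, ai, thid):
            touched.setdefault(index, set()).add(thid)
    return touched


N = 512  # number of work-items and array length, as in the question
touched = touched_indices(N)

max_index = max(touched)
out_of_range = sorted(i for i in touched if i >= N)
shared = {i: ids for i, ids in touched.items() if len(ids) > 1}

print(max_index, len(out_of_range), len(shared))
```

Any index at or above N is outside the buffer the host allocated, and every entry in shared is touched by two different work-items with no guaranteed ordering between them.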

Porting character buffers into Rcpp

I am trying to run C code in R using Rcpp, but am unsure how to convert a buffer used to hold data from a file. In the third line of code below, I allocate an unsigned char buffer and my problem is that I don't know what Rcpp data type to use. Once the data are read into the buffer, I figured out how to use Rcpp::NumericMatrix to hold the final result, but not the character buffer. I have seen several responses by Dirk Eddelbuettel to similar questions where he suggests replacing all 'malloc' calls with Rcpp initialization commands. I tried using an Rcpp::CharacterVector, but then there is a type mismatch in the loop at the end: the Rcpp::CharacterVector cannot be read as an unsigned long long int. The code runs for some C-compilers, but throws a 'memory corruption' error for others, so I would prefer to do things the way Dirk suggests (use Rcpp data types) so that the code will run regardless of the specific compiler.
FILE *fp = fopen( filename, "r" );
fseek( fp, index_data_offset, SEEK_SET );
unsigned char* buf = (unsigned char *)malloc( 3 * number_of_index_entries * sizeof(unsigned long long int) );
fread( buf, sizeof("unsigned long long int"), (long)(3 * number_of_index_entries), fp );
fclose( fp );
// Convert "buf" into a 3-column matrix.
unsigned long long int l;
Rcpp::NumericMatrix ToC(3, number_of_index_entries);
for (int col=0; col<number_of_index_entries; col++ ) {
    l = 0;
    int offset = (col*3 + 0)*sizeof(unsigned long long int);
    for (int i = 0; i < 8; ++i) {
        l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
    }
    ToC(0,col) = l;
    l = 0;
    offset = (col*3 + 1)*sizeof(unsigned long long int);
    for (int i = 0; i < 8; ++i) {
        l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
    }
    ToC(1,col) = l;
    l = 0;
    offset = (col*3 + 2)*sizeof(unsigned long long int);
    for (int i = 0; i < 8; ++i) {
        l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
    }
    ToC(2,col) = l;
}
return( ToC );
C and C++ can be lovely. If you know what you're doing, you have both a very direct line to the underlying hardware and higher-level abstractions for efficient reasoning.
I would suggest simplifying and reducing the problem. Start with a simple, known case, for example an STL vector of double. Let's call it x. Fill it with ten or a hundred elements, then open a FILE and write a blob from
x.data(), x.size() * sizeof(double)
Close the file. Then read it into Rcpp by first allocating a NumericVector v of the same size, then reading the bytes back, and then calling memcpy to &(v[0]).
It should be the same vector.
Then you can generalize to different types. Because vectors are guaranteed to be contiguous memory, you can use this serialization trick directly.
You can do variations on this with character buffers, or void*, or ... None of that matters as long as you are careful not to mismatch types, i.e. don't assign an int payload to a double and so on.
Now, is any of this recommended? Hell no, unless you are chasing performance and know well enough what you are doing, in which case it is reasonable. Otherwise rely on fantastic existing packages like fst or qs to do it for you.
I hope this helps with your question. I wasn't entirely sure what it was you were asking. Maybe you can clarify (and possibly shorten / focus) it if not.
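The round trip described above can be tried out quickly before writing any C++. This is a minimal sketch in Python using only the standard library; the file name blob.bin is made up, and array("d") stands in for the STL vector and the NumericVector.

```python
import os
import struct
import tempfile
from array import array

# Write a known vector of doubles as one raw blob, as the answer suggests.
x = array("d", [float(i) for i in range(10)])
path = os.path.join(tempfile.mkdtemp(), "blob.bin")  # hypothetical file name
with open(path, "wb") as fp:
    fp.write(x.tobytes())

# Read the bytes back into a buffer of the same element type and size.
with open(path, "rb") as fp:
    raw = fp.read()
v = array("d")
v.frombytes(raw)

# The same bytes can also be decoded as little-endian 64-bit unsigned
# integers, which is what the shift-and-or loop in the question does by hand.
as_u64 = struct.unpack("<%dQ" % (len(raw) // 8), raw)

print(v == x, len(as_u64))
```

The point about not mismatching applies to sizes as well as types: in the question's fread call, sizeof("unsigned long long int") measures the 23-byte string literal rather than the 8-byte type, so the element size passed there does not match the buffer that was allocated.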
A typecast did the trick:
Rcpp::NumericVector NumVecBuf( 3 * number_of_index_entries * sizeof(unsigned long long int) );
unsigned char* buf = (unsigned char*) &(NumVecBuf[0]);
Dirk's statement about "contiguous memory" suggested that this would work, so I went ahead and marked his comment as the answer. Thanks, Dirk! And, thanks for developing and maintaining Rcpp!

Copy portion of global array to local memory

I'm using PyOpenCL to let my GPU do some regression on a large data set. Right now the GPU is slower than the CPU, probably because there is a loop that requires access to the global memory during each increment (I think...). The data set is too large to store into the local memory, but each loop does not require the entire data set, so I want to copy a portion of this array to the local memory. My question is: how do I do this? In Python one can easily slice a portion, but I don't think that's possible in OpenCL.
Here's the OpenCL code I'm using, if you spot any more potential optimisations, please shout:
__kernel void gpu_slope(__global double * data, __global double * time,
                        __global int * win_results, const unsigned int N,
                        const unsigned int Nmax, const double e,
                        __global double * result) {
    __local unsigned int n, length, leftlim, rightlim, i;
    __local double sumx, sumy, x, y, xx, xy, invlen, a, b;
    n = get_global_id(0);
    leftlim = win_results[n*2];
    rightlim = win_results[n*2+1];
    sumx = 0;
    sumy = 0;
    xy = 0;
    xx = 0;
    length = rightlim - leftlim;
    for(i = leftlim; i <= rightlim; i++) {
        x = time[i]; /* I think this is fetched from global memory */
        y = data[i];
        sumx += x;
        sumy += y;
        xy += x*y;
        xx += x*x;
    }
    invlen = 1.0/length;
    a = xy-(sumx*sumy)*invlen;
    b = xx-(sumx*sumx)*invlen;
    result[n] = a/b;
}
I'm new to OpenCL, so please bear with me. Thanks!
The main(ish) point in GPU computing is trying to utilize hardware parallelism as much as possible. Instead of using the loop, launch a kernel with a different thread for every one of the coordinates. Then, either use atomic operations (the quick-to-code, but slow-performance option), or parallel reduction, for the various sums.
AMD has a tutorial on this subject. (NVIDIA does too, but theirs would be CUDA-based.)
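As a rough illustration of the parallel reduction mentioned above, here is a sequential Python model of the tree-shaped sum a work-group would perform, applied to the four sums the slope formula needs. The helper names tree_reduce and slope and the sample data are made up; on a GPU the inner loop of each pass would run in parallel across work-items, with a barrier between passes.

```python
def tree_reduce(values):
    """Sum values the way a work-group reduction does.

    Each pass halves the number of active slots; slot i adds slot
    i + stride. On a GPU every addition in one pass runs in parallel.
    """
    scratch = list(values)
    # Pad to a power of two so every pass pairs slots evenly.
    size = 1
    while size < len(scratch):
        size *= 2
    scratch.extend([0.0] * (size - len(scratch)))

    stride = size // 2
    while stride > 0:
        for i in range(stride):  # parallel across work-items on a GPU
            scratch[i] += scratch[i + stride]
        stride //= 2
    return scratch[0]


def slope(time, data):
    """Least-squares slope from four reductions over one window."""
    count = len(time)
    sumx = tree_reduce(time)
    sumy = tree_reduce(data)
    xy = tree_reduce([x * y for x, y in zip(time, data)])
    xx = tree_reduce([x * x for x in time])
    return (xy - sumx * sumy / count) / (xx - sumx * sumx / count)


print(slope([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))
```

This sketch divides by the number of points in the window. The kernel in the question divides by rightlim - leftlim while its loop runs through rightlim inclusive, which is one point more, so the two will not agree unless that is adjusted.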
You will find examples copying to local memory in PyOpenCL's examples folder: https://github.com/inducer/pyopencl/tree/master/examples
I recommend you read, run, and customize several of these examples to learn.
I also recommend the Udacity parallel programming course: https://www.udacity.com/course/cs344 This course will help solidify your grasp of fundamental OpenCL concepts.

Working around pyopencl array offset limitation

Is there a way to work around the limitation in PyOpenCL whereby:
array.data
fails with
pyopencl.array.ArrayHasOffsetError: The operation you are attempting does not yet support arrays that start at an offset from the beginning of their buffer.
I tried:
a.base_data[a.offset: a.offset + a.nbytes]
This seems to work sometimes, but other times I get:
pyopencl.LogicError: clCreateSubBuffer failed: invalid value
clCreateSubBuffer requires the offset (called the origin in this case) to be aligned, and origin + size to fall within the bounds of the buffer.
CL_INVALID_VALUE is returned in errcode_ret if the region specified by
(origin, size) is out of bounds in buffer.
CL_MISALIGNED_SUB_BUFFER_OFFSET is returned in errcode_ret if there
are no devices in context associated with buffer for which the origin
value is aligned to the CL_DEVICE_MEM_BASE_ADDR_ALIGN value.
For the particular error you are seeing it looks like either your program or pyopencl is miscalculating the size of the array after the offset. Even if you fixed this you may still have problems if the original offset is not aligned to CL_DEVICE_MEM_BASE_ADDR_ALIGN.
Having said that NVIDIA seems to break from spec and allow arbitrary offsets. So your mileage may vary depending on the hardware.
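The two conditions quoted above can be checked on the host before asking for a sub-buffer. This is a sketch only: check_sub_buffer is a hypothetical helper, it considers a single device, and the alignment of 1024 bits is an assumed value. The real one comes from querying CL_DEVICE_MEM_BASE_ADDR_ALIGN on your device, which OpenCL reports in bits.

```python
def check_sub_buffer(origin, size, buffer_size, align_bits):
    """Return None if (origin, size) looks acceptable, else the error name.

    align_bits stands for the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN value,
    which is expressed in bits, so it is converted to bytes first.
    """
    align_bytes = align_bits // 8
    if origin + size > buffer_size:
        return "CL_INVALID_VALUE"
    if origin % align_bytes != 0:
        return "CL_MISALIGNED_SUB_BUFFER_OFFSET"
    return None


# A 5x5 float32 array, and a view starting at element [1, 1].
itemsize = 4
buffer_size = 5 * 5 * itemsize
offset = (1 * 5 + 1) * itemsize

ALIGN_BITS = 1024  # assumed value for illustration, device dependent

print(check_sub_buffer(offset, 1, buffer_size, ALIGN_BITS))
```

A byte offset of 24 is inside the buffer but not a multiple of 128 bytes, so a driver that follows the spec strictly would be expected to reject it.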
If you're just looking to get a buffer that marks the start of the array data, to pass to a kernel, you don't have to worry about the size. Here's a function that gets a size-1 buffer that points to the start of the offset data:
def data_ptr(array):
    if array.offset:
        return array.base_data.get_sub_region(array.offset, 1)
    else:
        return array.data
You can use this to pass to a kernel, if you need a pointer to the start of the offset data. Here's an example, where I want to set a sub-region clV of array clA to the value 3. I use data_ptr to get a pointer to the start of clV's data.
import numpy as np
import pyopencl as cl
import pyopencl.array
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
m, n = 5, 5
A = np.random.uniform(size=(m, n)).astype(np.float32)
clA = cl.array.Array(queue, A.shape, A.dtype)
clA.set(A)
clV = clA[1::2, 1::2]

def data_ptr(array):
    # Size-1 sub-buffer that starts at the array's offset within its base buffer.
    if array.offset:
        return array.base_data.get_sub_region(array.offset, 1)
    else:
        return array.data

source = """
__kernel void fn(long si, long sj, __global float *Y)
{
    const int i = get_global_id(0);
    const int j = get_global_id(1);
    Y[i*si + j*sj] = 3;
}
"""
kernel = cl.Program(ctx, source).build().fn
gsize = clV.shape
lsize = None
# Strides in elements, kept as 64-bit integers to match the kernel's long arguments.
estrides = np.array(clV.strides, dtype=np.int64) // clV.dtype.itemsize
kernel(queue, gsize, lsize, estrides[0], estrides[1], data_ptr(clV))
print(clA.get())

how to convert double between host and network byte order?

Could somebody tell me how to convert a double-precision value into network byte order?
I tried the
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
functions and they worked well, but none of them handles double (or float) conversion, because these types differ between architectures. Through XDR I found the double-precision format representation (http://en.wikipedia.org/wiki/Double_precision), but nothing about byte ordering there.
So I would much appreciate it if somebody could help me out with this (C code would be great!).
NOTE: The OS is Linux (kernel 2.6.29) on an ARMv7 CPU.
You could look at the interchange formats for floating-point numbers in IEEE 754.
But the key is to define a network order, e.g. the first byte holds the sign and exponent, and bytes 2 to n hold the mantissa in MSB-first order.
Then you can declare your functions
uint64_t htond(double hostdouble);
double ntohd(uint64_t netdouble);
The implementation depends only on your compiler and platform.
It is best to use some natural definition, so that the transformations stay simple on the ARM platform.
EDIT:
From the comment
static void htond (double &x)
{
    int *Double_Overlay;
    int Holding_Buffer;
    Double_Overlay = (int *) &x;
    Holding_Buffer = Double_Overlay [0];
    Double_Overlay [0] = htonl (Double_Overlay [1]);
    Double_Overlay [1] = htonl (Holding_Buffer);
}
This could work, but obviously only if both platforms use the same encoding scheme for double and if int has the same size as long.
By the way, the way of returning the value is a bit odd.
But you could write a more stable version, like this (pseudocode):
void htond (const double hostDouble, uint8_t result[8])
{
    result[0] = signOf(hostDouble);
    result[1] = exponentOf(hostDouble);
    result[2..7] = mantissaOf(hostDouble);
}
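For comparison, or for generating test vectors to check a C implementation against, the idea of a fixed wire layout can be sketched in Python with the standard struct module. The names htond and ntohd are borrowed from the declarations above, but here they exchange 8 raw bytes rather than a uint64_t, and the sketch assumes both ends agree on IEEE 754 binary64 with the most significant byte first.

```python
import struct


def htond(host_double):
    """Pack a double into 8 bytes, most significant byte first."""
    return struct.pack("!d", host_double)


def ntohd(net_bytes):
    """Unpack 8 network-order bytes back into a double."""
    return struct.unpack("!d", net_bytes)[0]


wire = htond(1.0)
print(wire.hex())
```

Because the layout is fixed by the format string rather than by the host CPU, the same bytes come out on little-endian and big-endian machines alike.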
This might be hacky (the char* hack), but it works for me:
double Buffer::get8AsDouble(){
    double little_endian = *(double*)this->cursor;
    double big_endian;
    int x = 0;
    char *little_pointer = (char*)&little_endian;
    char *big_pointer = (char*)&big_endian;
    while( x < 8 ){
        big_pointer[x] = little_pointer[7 - x];
        ++x;
    }
    return big_endian;
}
For brevity, I've not included the range guards, though you should include them when working at this level.

Resources