pyopencl kernel outputs black image - opencl

Why is the image coming out as black when saved? I am just beginning to learn opencl.
Without opencl, on purely CPU, the loop iterates through the matrix and uses the rgb2gray average formula to store the values in gray array.
Using windows and python 3.8
import pyopencl
import numpy as np
import imread
import matplotlib.pyplot as plt
ocl_platforms = (platform.name for platform in pyopencl.get_platforms())
print("\n".join(ocl_platforms))
# select platform
platform = pyopencl.get_platforms()[0]
# select device
device = platform.get_devices()[0]
# create context
ctx = pyopencl.Context(devices=[device])
img = imread.imread('gigapixel.jpg')
r = np.array(img[:, :, 0], dtype=np.float32)
g = np.array(img[:, :, 1], dtype=np.float32)
b = np.array(img[:, :, 2], dtype=np.float32)
gray = np.empty_like(r)
# without gpu
for i in range(r.shape[0]):
for j in range(r.shape[1]):
gray[i, j] = (r[i, j] + g[i, j] + b[i, j]) / 3
plt.imshow(gray)
plt.show()
# convert to uint8
gray = np.uint8(gray)
# save image
imread.imsave('gray_cpu.jpg', gray)
with GPU the rest of the code is
gray = np.empty_like(r)
program_source = """
__kernel void rgb2gray(__global float *r, __global float *g, __global float *b, __global
float *gray) {
int i = get_global_id(0);
int j = get_global_id(1);
gray[i, j] = (r[i, j] + g[i, j] + b[i, j])/ 3;
}
"""
gpu_program_source = pyopencl.Program(ctx, program_source)
gpu_program = gpu_program_source.build()
program_kernel_names = gpu_program.get_info(pyopencl.program_info.KERNEL_NAMES)
print(program_kernel_names)
queue = pyopencl.CommandQueue(ctx)
r_buf = pyopencl.Buffer(ctx, pyopencl.mem_flags.READ_ONLY |
pyopencl.mem_flags.COPY_HOST_PTR, hostbuf=r)
g_buf = pyopencl.Buffer(ctx, pyopencl.mem_flags.READ_ONLY |
pyopencl.mem_flags.COPY_HOST_PTR, hostbuf=g)
b_buf = pyopencl.Buffer(ctx, pyopencl.mem_flags.READ_ONLY |
pyopencl.mem_flags.COPY_HOST_PTR, hostbuf=b)
gray_buf = pyopencl.Buffer(ctx, pyopencl.mem_flags.WRITE_ONLY, r.nbytes)
gpu_program.rgb2gray(queue, r.shape, None, r_buf, g_buf, b_buf, gray_buf)
pyopencl.enqueue_copy(queue, gray, gray_buf)
plt.imshow(gray)
plt.show()
gray = np.uint8(gray)
imread.imsave('gigapixel_gray.jpg', gray)

If you need to keep the array[x,y] notation, then try Numba instead of PyOpenCL. Numba converts Python function's bytecode into OpenCL kernels.
PyOpenCL is only a wrapper over OpenCL API so it compiles the given kernel code for C or C++ languages directly. So you need to index like this:
int i = get_global_id(0);
int j = get_global_id(1);
gray[i][j] = (r[i][j] + g[i][j] + b[i][j])/ 3;
If you want to see the errors produced at any stage(kernel compiling, buffer copying, etc), you need to catch exceptions of Python because PyOpenCL binds OpenCL-errors to Python exceptions. You should check them like this:
try:
your_opencl_accelerated_function()
except:
print("Something didn't work")

Related

Multiple forcings in a multi-patch ode model - R package desolve and compiled C code

I am trying to create an SEIR model with multiple patches using the package deSolve in R. At each time step, there is some movement of individuals between patches that can infect individuals in other patches. I also have an external forcing parameter that is specific to each patch (representing different environmental conditions). I've been able to get this working in base R, but given the number of patches and compartments and the duration of the model, I'm trying to convert it to compiled code to speed it up.
I've gotten the different patches working, but am struggling with how to incorporate a different forcing parameter for each patch. When forcings are provided, there is an automatic check checkforcings (https://rdrr.io/cran/deSolve/src/R/forcings.R) that doesn't allow for a matrix with more than two columns, and I'm not quite sure what the best workaround is for this. Write my own ode and checkforcings functions to override this? Restructure the forcings data once it gets into C? My final model has 195 patches so I'd prefer to be to automate it somehow so I am not writing out thousands of equations or hundreds of functions.
Also fine if the answer is just, do this in a different language, but would appreciate insight into what language I should switch to. Julia maybe?
Below is code for a very simple example that just highlights this "different forcings in different patches problem".
R Code
# Packages #########################################################
library(deSolve)
library(ggplot2); theme_set(theme_bw())
library(tidyr)
library(dplyr)
# Initial Parameters and things ####################################
times <- 1:500
n_patch <- 2
patch_ind <- 100
state_names <- (c("S", "I"))
n_state <- length(state_names)
x <-rep(0, n_patch*n_state)
names(x) <- unlist(lapply(state_names, function(x) paste(x,
stringr::str_pad(seq(n_patch), width = 3, side = "left", pad =0),
sep = "_")))
#start with infected individuals in patch 1
x[startsWith(names(x), "S")] <- patch_ind
x['S_001'] <- x['S_001'] - 5
x['I_001'] <- x['I_001'] + 5
x['I_002'] <- x['I_002'] + 20
params <- c(gamma = 0.1, betam = 0.2)
#seasonality
forcing <- data.frame(times = times,
rain = rep(rep(c(0.95,1.05), each = 50), 5))
new_approx_fun <- function(rain.column, t){
approx_col <- approxfun(rain.column, rule = 2)
return(approx_col(t))
}
rainfall2 <- data.frame(P1 = forcing$rain,
P2 = forcing$rain+0.01)
# model in R
r.mod2 <- function(t,x,params){
# turn state.vec into matrix
# columns are different states, rows are different patches
states <- matrix(x,
nrow = n_patch,
ncol = n_state, byrow = F)
S <- states[,1]
I <- states[,2]
N <- rowSums(states[,1:2])
with(as.list(params),{
#seasonal forcing
rain <- as.numeric(apply(as.matrix(rainfall2), MARGIN = 2, FUN = new_approx_fun, t = t))
dS <- gamma*I - rain*betam*S*I/N
dI <- rain*betam*S*I/N - gamma*I
return(list(c(dS, dI), rain))
})
}
out.R2 <- data.frame(ode(y = x, times =times, func = r.mod2,
parms = params))
#create seasonality for C
ftime <- seq(0, max(times), by = 0.1)
rain.ft <- approx(times, rainfall2$P1, xout = ftime, rule = 2)$y
forcings2 <- cbind(ftime, rain.ft, rain.ft +0.01)
# C model
system("R CMD SHLIB ex-patch-season-multi.c")
dyn.load(paste("ex-patch-season-multi", .Platform$dynlib.ext, sep = ""))
out.dll <- data.frame(ode(y = x, times = times, func = "derivsc",
dllname = "ex-patch-season-multi", initfunc = "parmsc",
parms = params, forcings = forcings2,
initforc = "forcc", nout = 1, outnames = "rain"))
C code
#include <R.h>
#include <math.h>
#include <Rmath.h>
// this is for testing to try and get different forcing for each patch //
/*define parameters, pay attention to order */
static double parms[2];
static double forc[1];
#define gamma parms[0]
#define betam parms[1]
//define forcing
#define rain forc[0]
/* initialize parameters */
void parmsc(void (* odeparms)(int *, double *)){
int N=2;
odeparms(&N, parms);
}
/* forcing */
void forcc(void (* odeforcs)(int *, double *))
{
int N=1;
odeforcs(&N, forc);
}
/* model function */
void derivsc(int *neq, double *t, double *y, double *ydot, double *yout, int *ip){
//use for-loops for patches
//define all variables at start of block
int npatch=2;
double S[npatch]; double I[npatch]; double N[npatch];
int i;
for(i=0; i<npatch; i++){
S[i] = y[i];
};
for(i=0; i <npatch; i++){
int ind = npatch+i;
I[i] = y[ind];
};
for(i=0; i<npatch; i++){
N[i] = S[i] + I[i];
};
//use for loops for equations
{
// Susceptible
for(i=0; i<npatch; i++){
ydot[i] = gamma*I[i] - rain*betam*I[i]*S[i]/N[i] ;
};
//infected
for(i=0; i<npatch; i++){
int ind=npatch+i;
ydot[ind] = rain*betam*I[i]*S[i]/N[i] - gamma*I[i];
};
};
yout[0] = rain;
}
The standard way for multiple forcings in compiled code of the deSolve package is described in the lsoda help page:
forcings only used if ‘dllname’ is specified: a list with the forcing function data sets, each present as a two-columned matrix
Such a list can be created automatically in a script.
There are also other ways possible with some creative C or Fortran programming.
For more complex models, I would recommend to use the rodeo package. It allows to specify dynamic models in a tabular form (CSV, LibreOffice, Excel), including parameters and forcing functions. The code generator of the package creates then a fast Fortran code, that can be solved with deSolve. An overview can be found in a paper of Kneis et al (2017), https://doi.org/10.1016/j.envsoft.2017.06.036 and a more extended tutorial at https://dkneis.github.io/ .

Cannot convert Cython memoryviewslice to ndarray

I am trying to write an explicit Successive Overrelaxation Function over a 2D matrix. In this case for an electrostatic potential.
When trying to optimize this in Cython I seem to get an error that I am not quite sure I understand.
%%cython
cimport cython
import numpy as np
cimport numpy as np
from libc.math cimport pi
#SOR function
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.initializedcheck(False)
#cython.nonecheck(False)
def SOR_potential(np.float64_t[:, :] potential, mask, int max_iter, float error_threshold, float alpha):
#the ints
cdef int height = potential.shape[0]
cdef int width = potential.shape[1] #more general non quadratic
cdef int it = 0
#the floats
cdef float error = 0.0
cdef float sor_adjustment
#the copy array we will iterate over and return
cdef np.ndarray[np.float64_t, ndim=2] input_matrix = potential.copy()
#set the ideal alpha if user input is 0.0
if alpha == 0.0:
alpha = 2/(1+(pi/((height+width)*0.5)))
#start the SOR loop. The for loops omit the 0 and -1 index\
#because they are *shadow points* used for neuman boundary conditions\
cdef int row, col
#iteration loop
while True:
#2-stencil loop
for row in range(1, height-1):
for col in range(1, width-1):
if not(mask[row][col]):
potential[row][col] = 0.25*(input_matrix[row-1][col] + \
input_matrix[row+1][col] + \
input_matrix[row][col-1] + \
input_matrix[row][col+1])
sor_adjustment = alpha * (potential[row][col] - input_matrix[row][col])
input_matrix[row][col] = sor_adjustment + input_matrix[row][col]
error += np.abs(input_matrix[row][col] - potential[row][col])
#by the end of this loop input_matrix and potential have diff values
if error<error_threshold:
break
elif it>max_iter:
break
else:
error = 0
it = it + 1
return input_matrix, error, it
and I used a very simple example for an array to see if it would give an error output.
test = [[True, False], [True, False]]
pot = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float64)
SOR_potential(pot, test, 50, 0.1, 0.0)
Gives out this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [30], line 1
----> 1 SOR_potential(pot, test, 50, 0.1, 0.0)
File _cython_magic_6c09a5060df996862b8e35adacc0e25c.pyx:21, in _cython_magic_6c09a5060df996862b8e35adacc0e25c.SOR_potential()
TypeError: Cannot convert _cython_magic_6c09a5060df996862b8e35adacc0e25c._memoryviewslice to numpy.ndarray
But when I delete the np.float64_t[:, :] part from
def SOR_potential(np.float64_t[:, :] potential,...)
the code works. Of course, the simple 2x2 matrix will not converge but it gives no errors. Where is the mistake here?
I also tried importing the modules differently as suggested here
Cython: how to resolve TypeError: Cannot convert memoryviewslice to numpy.ndarray?
but I got 2 errors instead of 1 where there were type mismatches.
Note: I would also like to ask, how would I define a numpy array of booleans to put in front of the "mask" input in the function?
A minimal reproducible example of your error message would look like this:
def foo(np.float64_t[:, :] A):
cdef np.ndarray[np.float64_t, ndim=2] B = A.copy()
# ... do something with B ...
return B
The problem is, that A is a memoryview while B is a np.ndarray. If both A and B are memoryviews, i.e.
def foo(np.float64_t[:, :] A):
cdef np.float64_t[:, :] B = A.copy()
# ... do something with B ...
return np.asarray(B)
your example will compile without errors. Note that you then need to call np.asarray if you want to return a np.ndarray.
Regarding your second question: You could use a memoryview with dtype np.uint8_t
def foo(np.float64_t[:, :] A, np.uint8_t[:, :] mask):
cdef np.float64_t[:, :] B = A.copy()
# ... do something with B and mask ...
return np.asarray(B)
and call it like this from Python:
mask = np.array([[True, True], [False, False]], dtype=bool)
A = np.ones((2,2), dtype=np.float64)
foo(A, mask)
PS: If your array's buffers are guaranteed to be C-Contiguous, you can use contiguous memoryviews for better performance:
def foo(np.float64_t[:, ::1] A, np.uint8_t[:, ::1] mask):
cdef np.float64_t[:, ::1] B = A.copy()
# ... do something with B and mask ...
return np.asarray(B)

cython - determining the number of items in pointer variable

How would I determine the number of elements in a pointer variable in cython? I saw that in C one way seems to be sizeof(ptr)/sizeof(int), if the pointer points to int variables. But that doesn't seem to work in cython. E.g. when I tried to join two memory views into a single pointer like so:
from libc.stdlib cimport malloc, free
cdef int * join(int[:] a, int[:] b):
cdef:
int n_a = a.shape[0]
int n_b = b.shape[0]
int new_size = n_a + n_b
int *joined = <int *> malloc(new_size*sizeof(int))
int i
try:
for i in range(n_a):
joined[i] = a[i]
for i in range(n_b):
joined[n_a+i] = b[i]
return joined
finally:
free(joined)
#cython.cdivision(True)
def join_memviews(int[:] n, int[:] m):
cdef int[:] arr_fst = n
cdef int[:] arr_snd = m
cdef int *arr_new
cdef int new_size
arr_new = join(arr_fst,arr_snd)
new_size = sizeof(arr_new)/sizeof(int)
return [arr_new[i] for i in range(new_size)]
I do not get the desired result when calling join_memviews from a python script, e.g.:
# in python
a = np.array([1,2])
b = np.array([3,4])
a_b = join_memviews(a,b)
I also tried using the types
DTYPE = np.int
ctypedef np.int_t DTYPE_t
as the arguement inside sizeof(), but that didn't work either.
Edit: The handling of the pointer variable was apparently a bit careless of me. I hope the following is fine (even though it might not be a prudent approach):
cdef int * join(int[:] a, int[:] b, int new_size):
cdef:
int *joined = <int *> malloc(new_size*sizeof(int))
int i
for i in range(n_a):
joined[i] = a[i]
for i in range(n_b):
joined[n_a+i] = b[i]
return joined
def join_memviews(int[:] n, int[:] m):
cdef int[:] arr_fst = n
cdef int[:] arr_snd = m
cdef int *arr_new
cdef int new_size = n.shape[0] + m.shape[0]
try:
arr_new = join(arr_fst,arr_snd, new_size)
return [arr_new[i] for i in range(new_size)]
finally:
free(arr_new)
You can't. It doesn't work in C either. sizeof(ptr) returns the amount of memory used to store the pointer (i.e. typically 4 or 8 depending on your system) rather than the length of the array. The lengths of your malloced arrays are something that you need to keep track of manually.
Additionally the following code is a recipe for disaster:
cdef int *joined = <int *> malloc(new_size*sizeof(int))
try:
return joined
finally:
free(joined)
The free happens immediately on function exit so that an invalid pointer is returned to the calling function.
You should be using properly managed Python arrays (either from numpy or the standard library array module) unless you absolutely can't avoid it.

cython static shaped array views

In cython, one can use array views, e.g.
cdef void func(float[:, :] arr)
In my usage the second dimension should always have a shape of 2. Can I tell cython this? I was thinking of something like:
cdef void func(float[:, 2] arr)
but this results in an invalid syntax; Or is it possible to have something more similar to c++, e.g.
cdef void func(tuple<float, float>[:] arr)
Thanks in advance!
You can use a 2D static array instead. Just use the pointer notation. Here is how you achieve it
def pyfunc():
# static 1D array
cdef float *arr1d = [1,-1, 0, 2,-1, -1, 4]
# static 2D array
cdef float[2] *arr2d = [[1,.2.],[3.,4.]]
# pass to a "cdef"ed function
cfunc(arr2d)
# your function signature would now look like this
cdef void cfunc(float[2] *arr2d):
print("my 2D static array")
print(arr2d[0][0],arr2d[0][1],arr2d[1][0],arr2d[1][1])
Calling it you get:
>>> pyfunc()
my 2D static array
1.0, 2.0, 3.0, 4.0
I don't think this is really supported, but if you want to do this then the best way is probably to use memoryviews of structs (which are compatible with numpys custom dtypes):
import numpy as np
cdef packed struct Pair1: # packed ensures it matches custom numpy dtypes
# (but probably doesn't matter here!)
double x
double y
# pair 1 matches arrays of this dtype
pair_1_dtype = [('x',np.float64), ('y',np.float64)]
cdef packed struct Pair2:
double data[2]
pair_2_dtype = [('data',np.float64, (2,))]
def pair_func1(Pair1[::1] x):
# do some very basic work
cdef Pair1 p
cdef Py_ssize_t i
p.x = 0; p.y = 0
for i in range(x.shape[0]):
p.x += x[i].x
p.y += x[i].y
return p # take advantage of auto-conversion to a dict
def pair_func2(Pair2[::1] x):
# do some very basic work
cdef Pair2 p
cdef Py_ssize_t i
p.data[0] = 0; p.data[1] = 0
for i in range(x.shape[0]):
p.data[0] += x[i].data[0]
p.data[1] += x[i].data[1]
return p # take advantage of auto-conversion to a dict
and a function to show you how to call it:
def call_pair_funcs_example():
# generate data of correct dtype
d = np.random.rand(100,2)
d1 = d.view(dtype=pair_1_dtype).reshape(-1)
print(pair_func1(d1))
d2 = d.view(dtype=pair_2_dtype).reshape(-1)
print(pair_func2(d2))
The thing I'd like to have done is:
ctypedef double[2] Pair3
def pair_func3(Pair3[::1] x):
# do some very basic work
cdef Pair3 p
cdef Py_ssize_t i
p[0] = 0; p[1] = 0
for i in range(x.shape[0]):
p[0] += x[i][0]
p[1] += x[i][1]
return p # ???
That compiles successfully, but I couldn't find any way of converting it from numpy. If you could work out how to get this version to work then I think it would be the most elegant solution.
Note that I'm not convinced of the performance advantages of any of these solutions. Your best move is probably to tell Cython that the trailing dimension is contiguous in memory (e.g. double [:,::1]) but let it be any size.

issue with OpenCL stencil code

I have a problem with a 4-point stencil OpenCL code. The code runs fine but I don't get symetrics final 2D values which are expected.
I suspect it is a problem of updates values in the kernel code. Here's the kernel code :
// kernel code
const char *source ="__kernel void line_compute(const double diagx, const double diagy,\
const double weightx, const double weighty, const int size_x,\
__global double* tab_new, __global double* r)\
{ int iy = get_global_id(0)+1;\
int ix = get_global_id(1)+1;\
double new_value, cell, cell_n, cell_s, cell_w, cell_e;\
double rk;\
cell_s = tab_new[(iy+1)*(size_x+2)+ix];\
cell_n = tab_new[(iy-1)*(size_x+2)+ix];\
cell_e = tab_new[iy*(size_x+2)+(ix+1)];\
cell_w = tab_new[iy*(size_x+2)+(ix-1)];\
cell = tab_new[iy*(size_x+2)+ix];\
new_value = weighty *( cell_n + cell_s + cell*diagy)+\
weightx *( cell_e + cell_w + cell*diagx);\
rk = cell - new_value;\
r[iy*(size_x+2)+ix] = rk *rk;\
barrier(CLK_GLOBAL_MEM_FENCE);\
tab_new[iy*(size_x+2)+ix] = new_value;\
}";
cell_s, cell_n, cell_e, cell_w represents the 4 values for the 2D stencil. I compute the new_value and update it after a "barrier(CLK_GLOBAL_MEM_FENCE)".
However, it seems there are conflicts between differents work-items. How could I fix this ?
The barrier GLOBAL_MEM_FENCE you use will not synchronize all work-items as intended. It does only synchronize access with one single workgroup.
Usually all workgroups won't be executed at the same time, because they are scheduled on only a small number of physical cores, and global synchronization is not possible within a kernel.
The solution is to write the output to a different buffer.

Resources