Julia benchmarking DSP.jl, CUDANative and CuArrays - julia

I'm experimenting with DSP.jl - the conv() method, in particular. I'm using CUDAnative and CuArrays to create arrays to be arguments to conv(), so that the CUDA versions of fft(), etc. will be used. I'm using BenchmarkTools to get performance data. I find that the Julia runtime complains about running out of CPU or GPU memory under odd circumstances. Here's my test setup:
using CUDAdrv, CUDAnative, CuArrays
using DSP
using FFTW
using BenchmarkTools
N = 120
A = rand(Float32, N, N, N);
B = rand(Float32, N, N, N);
A_d = cu(A);
B_d = cu(B);
function doConv(A, B)
    C = conv(A, B)
    finalize(C)
    C = []
end
t = @benchmark doConv($A_d, $B_d)
display(t)
Here's an example of the odd behavior I mentioned. If I set N to 120, my script runs to completion. If I set N to 64, I get the "out of memory" error: ERROR: LoadError: CUFFTError(code 2, cuFFT failed to allocate GPU or CPU memory). I can run the smaller case first, get the error, then bump N to the larger value and have the script complete successfully.
Is there something I should be doing differently to prevent this from happening?
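For what it's worth, here is a variant of doConv I could try that also forces a garbage-collection pass after each convolution (just a sketch on my part; I don't know whether this actually releases the CUFFT work area, and it may simply slow the benchmark down):
function doConvGC(A, B)
    C = conv(A, B)
    C = nothing
    GC.gc()    # encourage Julia to collect the device buffers that were just released
    return nothing
end
t = @benchmark doConvGC($A_d, $B_d)
display(t)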

Related

Using `\` instead of `/` in Julia

For scalars, the \ (solve linear system) operator is equivalent to the division operator / with the operands swapped. Is the performance similar?
I ask because currently my code has a line like
x = (1 / alpha) * averylongfunctionname(input1, input2, input3)
Visually, it is important that the division by alpha happens on the "left," so I am considering replacing this with
x = alpha \ averylongfunctionname(input1, input2, input3)
What is the best practice in this situation, from the standpoint of style and the standpoint of performance?
Here are some perplexing benchmarking results:
julia> using BenchmarkTools
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]
julia> @btime x[1]\sum(x) setup=(x=rand(100))
15.014 ns (0 allocations: 0 bytes)
56.23358979466163
julia> @btime (1/x[1]) * sum(x) setup=(x=rand(100))
13.312 ns (0 allocations: 0 bytes)
257.4552413802698
julia> @btime sum(x)/x[1] setup=(x=rand(100))
14.929 ns (0 allocations: 0 bytes)
46.25209548841374
They are all about the same, but I'm surprised that the (1 / x) * foo approach has the best performance.
Scalar / and \ really should have the same meaning and performance. Let's define these two test functions:
f(a, b) = a / b
g(a, b) = b \ a
We can then see that they produce identical LLVM code:
julia> @code_llvm f(1.5, 2.5)
;  @ REPL[29]:1 within `f'
define double @julia_f_380(double %0, double %1) {
top:
; ┌ @ float.jl:335 within `/'
   %2 = fdiv double %0, %1
; └
  ret double %2
}
julia> @code_llvm g(1.5, 2.5)
;  @ REPL[30]:1 within `g'
define double @julia_g_382(double %0, double %1) {
top:
; ┌ @ operators.jl:579 within `\'
; │┌ @ float.jl:335 within `/'
    %2 = fdiv double %0, %1
; └└
  ret double %2
}
And the same machine code too. I'm not sure what is causing the differences in @btime results, but I'm pretty sure that the difference between / and \ is an illusion and not real.
As to x*(1/y), that does not compute the same thing as x/y: it will be potentially less accurate since there is rounding done when computing 1/y and then that rounded value is multiplied by x, which also rounds. For example:
julia> 17/0.7
24.28571428571429
julia> 17*(1/0.7)
24.285714285714285
Since floating-point division is guaranteed to be correctly rounded, doing the division directly is always the more accurate option. If the divisor is shared by a lot of loop iterations, however, you can get a speedup by rewriting the computation this way, since floating-point multiplication is usually faster than division (although timing on my current computer does not show this). Be aware that this comes at a loss of accuracy, and if the divisor is not shared there would be a loss of accuracy and no performance gain.
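For illustration, here is a minimal sketch of that rewrite (the function names are made up for this example, they are not from the question):
function scale_div!(out, x, alpha)
    # direct version: one correctly rounded division per element
    for i in eachindex(out, x)
        out[i] = x[i] / alpha
    end
    return out
end
function scale_mul!(out, x, alpha)
    # hoisted version: divide once, then multiply -- usually faster, slightly less accurate
    inv_alpha = 1 / alpha            # rounded here...
    for i in eachindex(out, x)
        out[i] = x[i] * inv_alpha    # ...and rounded again in each multiplication
    end
    return out
end
Whether the hoisted version is actually faster depends on the hardware, as noted above.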
I don't know, but I can suggest trying the BenchmarkTools package: it can help you evaluate the performance of the two different statements. You can find more details here. Bye!
I think that the best choice is (1/x)*foo, for two reasons:
it has the best performance (although not by much compared to the other ones);
it is clearer for another person reading the code.

BigInt calculations on the GPU in Julia

I need to perform calculations on random batches of very large integers. I have a function that compares the numbers for certain properties and returns a value based on those properties. Since the batches and the numbers themselves can be very large, I want to speed up the process by utilizing the GPU.
Here is a short version of what I have running purely on the CPU now.
using Statistics
function check(M)
    val = 0
    # some code that calculates val based on M, e.g. the mean
    val = mean(M)
    return val
end
function distribution(N, n, exp) # N=batchsize, n=# of batches, exp=exponent of the upper limit of the integers
    avg = 0
    M = zeros(BigInt, N)
    for i = 1 : n
        M = rand(1 : BigInt(10) ^ exp, N)
        avg += check(M)
    end
    avg /= n
    println(avg, ":", N)
end
#example
distribution(10 ^ 3, 10 ^ 6, 100)
I have briefly used CUDAnative in Julia but I don't know how to implement the BigInt calculations. That package would be preferred but others are fine as well. Any help is appreciated.
BigInts are CPU-only, since they are not implemented in Julia itself but wrap the GMP C library; see [1].
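One quick way to see why (a small check of my own, not from the linked reference): CUDAnative kernels can, roughly speaking, only work with isbits argument types, and BigInt is not one, because it carries a pointer to GMP-managed memory:
julia> isbitstype(BigInt)    # wraps a pointer to GMP-managed memory
false
julia> isbitstype(Int128)    # fixed-width integers are plain bits and GPU-friendly
true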

Why does foreach %dopar% get slower with each additional node?

I wrote a simple matrix multiplication to test out multithreading/parallelization capabilities of my network and I noticed that the computation was much slower than expected.
The Test is simple: multiply 2 matrices (4096x4096) and return the computation time. Neither the matrices nor the results are stored. The computation time is not trivial (50-90 secs depending on your processor).
The Conditions: I repeated this computation 10 times using 1 processor, split these 10 computations across 2 processors (5 each), then 3 processors, ... up to 10 processors (1 computation per processor). I expected the total computation time to decrease in stages, and I expected 10 processors to complete the computations 10 times as fast as it takes one processor to do the same.
The Results: Instead what I got was only a 2-fold reduction in computation time, which is 5 times SLOWER than expected.
When I computed the average computation time per node, I expected each processor to compute the test in the same amount of time (on average) regardless of the number of processors assigned. I was surprised to see that merely sending the same operation to multiple processors was slowing down the average computation time of each processor.
Can anyone explain why this is happening?
Note this question is NOT a duplicate of these questions:
foreach %dopar% slower than for loop
or
Why is the parallel package slower than just using apply?
Because the test computation is not trivial (i.e. 50-90 secs, not 1-2 secs), and because there is no communication between processors that I can see (i.e. no results are returned or stored other than the computation time).
I have attached the scripts and functions below for replication.
library(foreach); library(doParallel);library(data.table)
# functions adapted from
# http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/BLAS_Testing.html
Matrix.Multiplier <- function(Dimensions=2^12){
  # Creates a matrix of dim=Dimensions and runs multiplication
  #Dimensions=2^12
  m1 <- Dimensions; m2 <- Dimensions; n <- Dimensions;
  z1 <- runif(m1*n); dim(z1) = c(m1,n)
  z2 <- runif(m2*n); dim(z2) = c(m2,n)
  a <- proc.time()[3]
  z3 <- z1 %*% t(z2)
  b <- proc.time()[3]
  c <- b-a
  names(c) <- NULL
  rm(z1,z2,z3,m1,m2,n,a,b); gc()
  return(c)
}
Nodes <- 10
Results <- NULL
for(i in 1:Nodes){
  cl <- makeCluster(i)
  registerDoParallel(cl)
  ptm <- proc.time()[3]
  i.Node.times <- foreach(z=1:Nodes, .combine="c", .multicombine=TRUE,
                          .inorder=FALSE) %dopar% {
    t <- Matrix.Multiplier(Dimensions=2^12)
  }
  etm <- proc.time()[3]
  i.TotalTime <- etm-ptm
  i.Times <- cbind(Operations=Nodes, Node.No=i, Avr.Node.Time=mean(i.Node.times),
                   sd.Node.Time=sd(i.Node.times),
                   Total.Time=i.TotalTime)
  Results <- rbind(Results, i.Times)
  rm(ptm,etm,i.Node.times,i.TotalTime,i.Times)
  stopCluster(cl)
}
library(data.table)
Results <- data.table(Results)
Results[,lower:=Avr.Node.Time-1.96*sd.Node.Time]
Results[,upper:=Avr.Node.Time+1.96*sd.Node.Time]
Exp.Total <- c(Results[Node.No==1][,Avr.Node.Time]*10,
Results[Node.No==1][,Avr.Node.Time]*5,
Results[Node.No==1][,Avr.Node.Time]*4,
Results[Node.No==1][,Avr.Node.Time]*3,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*1)
Results[,Exp.Total.Time:=Exp.Total]
jpeg("Multithread_Test_TotalTime_Results.jpeg")
par(oma=c(0,0,0,0)) # set outer margin to zero
par(mar=c(3.5,3.5,2.5,1.5)) # number of lines per margin (bottom,left,top,right)
plot(x=Results[,Node.No],y=Results[,Total.Time], type="o", xlab="", ylab="",ylim=c(80,900),
col="blue",xaxt="n", yaxt="n", bty="l")
title(main="Time to Complete 10 Multiplications", line=0,cex.lab=3)
title(xlab="Nodes",line=2,cex.lab=1.2,
ylab="Total Computation Time (secs)")
axis(2, at=seq(80, 900, by=100), tick=TRUE, labels=FALSE)
axis(2, at=seq(80, 900, by=100), tick=FALSE, labels=TRUE, line=-0.5)
axis(1, at=Results[,Node.No], tick=TRUE, labels=FALSE)
axis(1, at=Results[,Node.No], tick=FALSE, labels=TRUE, line=-0.5)
lines(x=Results[,Node.No],y=Results[,Exp.Total.Time], type="o",col="red")
legend('topright','groups',
legend=c("Measured", "Expected"), bty="n",lty=c(1,1),
col=c("blue","red"))
dev.off()
jpeg("Multithread_Test_PerNode_Results.jpeg")
par(oma=c(0,0,0,0)) # set outer margin to zero
par(mar=c(3.5,3.5,2.5,1.5)) # number of lines per margin (bottom,left,top,right)
plot(x=Results[,Node.No],y=Results[,Avr.Node.Time], type="o", xlab="", ylab="",
ylim=c(50,500),col="blue",xaxt="n", yaxt="n", bty="l")
title(main="Per Node Multiplication Time", line=0,cex.lab=3)
title(xlab="Nodes",line=2,cex.lab=1.2,
ylab="Computation Time (secs) per Node")
axis(2, at=seq(50,500, by=50), tick=TRUE, labels=FALSE)
axis(2, at=seq(50,500, by=50), tick=FALSE, labels=TRUE, line=-0.5)
axis(1, at=Results[,Node.No], tick=TRUE, labels=FALSE)
axis(1, at=Results[,Node.No], tick=FALSE, labels=TRUE, line=-0.5)
abline(h=Results[Node.No==1][,Avr.Node.Time], col="red")
epsilon = 0.2
segments(Results[,Node.No],Results[,lower],Results[,Node.No],Results[,upper])
segments(Results[,Node.No]-epsilon,Results[,upper],
Results[,Node.No]+epsilon,Results[,upper])
segments(Results[,Node.No]-epsilon, Results[,lower],
Results[,Node.No]+epsilon,Results[,lower])
legend('topleft','groups',
legend=c("Measured", "Expected"), bty="n",lty=c(1,1),
col=c("blue","red"))
dev.off()
EDIT: Response to @Hong Ooi's comment
I used lscpu in UNIX to get:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 30
On-line CPU(s) list: 0-29
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 30
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 2394.455
BogoMIPS: 4788.91
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-29
EDIT: Response to @Steve Weston's comment.
I am using a virtual machine network (but I'm not the admin) with access to up to 30 clusters. I ran the test you suggested: I opened up 5 R sessions and ran the matrix multiplication on 1, 2, ..., 5 sessions simultaneously (or as quickly as I could tab over and execute). I got very similar results to before (re: each additional process slows down all individual sessions). Note I checked memory usage using top and htop, and the usage never exceeded 5% of the network capacity (~2.5/64 Gb).
CONCLUSIONS:
The problem seems to be R specific. When I run other multi-threaded commands with other software (e.g. PLINK) I don't run into this problem, and parallel processes run as expected. I have also tried running the above with Rmpi and doMPI, with the same (slower) results. The problem appears to be related to R sessions/parallelized commands on the virtual machine network. What I really need help on is how to pinpoint the problem. A similar problem seems to be pointed out here.
I find the per-node multiplication time very interesting because the timings don't include any of the overhead associated with the parallel loop, but only the time to perform the matrix multiplication, and they show that the time increases with the number of matrix multiplications executing in parallel on the same machine.
I can think of two reasons why that might happen:
The memory bandwidth of the machine is saturated by the matrix multiplications before you run out of cores;
The matrix multiplication is multi-threaded.
You can test for the first situation by starting multiple R sessions (I did this in multiple terminals), creating two matrices in each session:
> x <- matrix(rnorm(4096*4096), 4096)
> y <- matrix(rnorm(4096*4096), 4096)
and then executing a matrix multiplication in each of those sessions at about the same time:
> system.time(z <- x %*% t(y))
Ideally, this time will be the same regardless of the number of R sessions you use (up to the number of cores), but since matrix multiplication is a rather memory intensive operation, many machines will run out of memory bandwidth before they run out of cores, causing the times to increase.
If your R installation was built with a multi-threaded math library, such as MKL or ATLAS, then you could be using all of your cores with a single matrix multiplication, so you can't expect better performance by using multiple processes unless you use multiple computers.
You can use a tool such as "top" to see if you're using a multi-threaded math library.
Finally, the output from lscpu suggests that you're using a virtual machine. I've never done any performance testing on multi-core virtual machines, but that could also be a source of problems.
Update
I believe the reason that your parallel matrix multiplications run more slowly than a single matrix multiplication is that your CPU isn't able to read memory fast enough to feed more than about two cores at full speed, which I referred to as saturating your memory bandwidth. If your CPU had large enough caches, you might be able to avoid this problem, but it doesn't really have anything to do with the amount of memory that you have on your motherboard.
I think this is just a limitation of using a single computer for parallel computations. One of the advantages of using a cluster is that your memory bandwidth goes up as well as your total aggregate memory. So if you ran one or two matrix multiplications on each node of a multi-node parallel program, you wouldn't run into this particular problem.
Assuming you don't have access to a cluster, you could try benchmarking a multi-threaded math library such as MKL or ATLAS on your computer. It's very possible that you could get better performance running one multi-threaded matrix multiply than running them in parallel in multiple processes. But be careful when using both a multi-threaded math library and a parallel programming package.
You could also try using a GPU. They're obviously good at performing matrix multiplications.
Update 2
To see if the problem is R specific, I suggest that you benchmark the dgemm function, which is the BLAS function used by R to implement matrix multiplication.
Here's a simple Fortran program to benchmark dgemm. I suggest executing it from multiple terminals in the same way that I described for benchmarking %*% in R:
program main
  implicit none
  integer n, i, j
  integer*8 stime, etime
  parameter (n=4096)
  double precision a(n,n), b(n,n), c(n,n)
  do i = 1, n
    do j = 1, n
      a(i,j) = (i-1) * n + j
      b(i,j) = -((i-1) * n + j)
      c(i,j) = 0.0d0
    end do
  end do
  stime = time8()
  call dgemm('N','N',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)
  etime = time8()
  print *, etime - stime
end
On my Linux machine, one instance runs in 82 seconds, while four instances run in 116 seconds. This is consistent with the results that I see in R and with my guess that this is a memory bandwidth problem.
You can also link this against different BLAS libraries to see which implementation works better on your machine.
You might also get some useful information about the memory bandwidth of your virtual machine network using pmbw - Parallel Memory Bandwidth Benchmark, although I've never used it.
I think the obvious answer here is the correct one. Matrix multiplication is not embarrassingly parallel. And you do not appear to have modified the serial multiplication code to parallelize it.
Instead, you are multiplying two matrices. Since the multiplication of each matrix is likely being handled by only a single core, every core in excess of two is simply idle overhead. The result is that you only see a speed improvement of 2x.
You could test this by running more than 2 matrix multiplications. But I'm not familiar with the foreach/doParallel framework (I use the parallel framework), nor do I see where in your code to modify this to test it.
An alternative test is to do a parallelized version of matrix multiplication, which I borrow directly from Matloff's Parallel Computing for Data Science. A draft is available here; see page 27.
mmulthread <- function(u, v, w) {
  require(parallel)
  # determine which rows for this thread
  myidxs <- splitIndices(nrow(u), myinfo$nwrkrs)[[myinfo$id]]
  # compute this thread's portion of the result
  w[myidxs, ] <- u[myidxs, ] %*% v[ , ]
  0 # don't return result -- expensive
}
# test on snow cluster cls
test <- function(cls, n = 2^5) {
  # init Rdsm
  mgrinit(cls)
  # shared variables
  mgrmakevar(cls, "a", n, n)
  mgrmakevar(cls, "b", n, n)
  mgrmakevar(cls, "c", n, n)
  # fill in some test data
  a[ , ] <- 1:n
  b[ , ] <- rep(1, n)
  # export function
  clusterExport(cls, "mmulthread")
  # run function
  clusterEvalQ(cls, mmulthread(a, b, c))
  # print(c[ , ])  # not print(c)!
}
library(parallel)
library(Rdsm)
c1 <- makeCluster(1)
c2 <- makeCluster (2)
c4 <- makeCluster(4)
c8 <- makeCluster(8)
library(microbenchmark)
microbenchmark(node1= test(c1, n= 2^10),
node2= test(c2, n= 2^10),
node4= test(c4, n= 2^10),
node8= test(c8, n= 2^10))
Unit: milliseconds
expr min lq mean median uq max neval cld
node1 715.8722 780.9861 818.0487 817.6826 847.5353 922.9746 100 d
node2 404.9928 422.9330 450.9016 437.5942 458.9213 589.1708 100 c
node4 255.3105 285.8409 309.5924 303.6403 320.8424 481.6833 100 a
node8 304.6386 328.6318 365.5114 343.0939 373.8573 836.2771 100 b
As expected, by parallelizing the matrix multiplication we do see the speed improvement we wanted, although parallel overhead is clearly extensive.

Julia challenge - FitzHugh–Nagumo model PDE Runge-Kutta solver

I am a newbie in the Julia programming language, so I don't know much about how to optimize code. I have heard that Julia should be faster in comparison to Python, but I've written a simple Julia code for solving the FitzHugh–Nagumo model, and it doesn't seem to be faster than Python.
The FitzHugh–Nagumo model equations are:
function FHN_equation(u,v,a0,a1,d,eps,dx)
    u_t = u - u.^3 - v + laplacian(u,dx)
    v_t = eps.*(u - a1 * v - a0) + d*laplacian(v,dx)
    return u_t, v_t
end
where u and v are the variables, which are 2D fields (that is, 2-dimensional arrays), and a0, a1, d, eps are the model's parameters. Both the parameters and the variables are of type Float. dx is the parameter that controls the separation between grid points, for use by the laplacian function, which is an implementation of finite differences with periodic boundary conditions.
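For context, a minimal sketch of a finite-difference Laplacian with periodic boundaries might look like the following (an illustration only, not the implementation used in the question):
function laplacian_sketch(a, dx)
    # 5-point stencil with periodic boundary conditions via circshift
    return (circshift(a, (1, 0)) .+ circshift(a, (-1, 0)) .+
            circshift(a, (0, 1)) .+ circshift(a, (0, -1)) .- 4 .* a) ./ dx^2
end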
If one of you expert Julia coders can give me a hint of how to do things better in Julia I will be happy to hear.
The Runge-Kutta step function is:
function uv_rk4_step(Vs, Ps, dt)
    u = Vs.u
    v = Vs.v
    a0 = Ps.a0
    a1 = Ps.a1
    d = Ps.d
    eps = Ps.eps
    dx = Ps.dx
    du_k1, dv_k1 = FHN_equation(u, v, a0, a1, d, eps, dx)
    u_k1 = dt*du_k1
    v_k1 = dt*dv_k1
    du_k2, dv_k2 = FHN_equation((u+(1/2)*u_k1), (v+(1/2)*v_k1), a0, a1, d, eps, dx)
    u_k2 = dt*du_k2
    v_k2 = dt*dv_k2
    du_k3, dv_k3 = FHN_equation((u+(1/2)*u_k2), (v+(1/2)*v_k2), a0, a1, d, eps, dx)
    u_k3 = dt*du_k3
    v_k3 = dt*dv_k3
    du_k4, dv_k4 = FHN_equation((u+u_k3), (v+v_k3), a0, a1, d, eps, dx)
    u_k4 = dt*du_k4
    v_k4 = dt*dv_k4
    u_next = u + (1/6)*u_k1 + (1/3)*u_k2 + (1/3)*u_k3 + (1/6)*u_k4
    v_next = v + (1/6)*v_k1 + (1/3)*v_k2 + (1/3)*v_k3 + (1/6)*v_k4
    return u_next, v_next
end
And I've used imshow() from PyPlot package to plot the u field.
This is not a complete answer, but a taste of an optimization attempt on the laplacian function. The original laplacian on a 10x10 matrix gave me the @time:
0.000038 seconds (51 allocations: 12.531 KB)
While this version:
function laplacian2(a,dx)
    # Computes Laplacian of a matrix
    # Usage: al=laplacian(a,dx)
    # where dx is the grid interval
    ns=size(a,1)
    ns != size(a,2) && error("Input matrix must be square")
    aa=zeros(ns+2,ns+2)
    for i=1:ns
        aa[i+1,1]=a[i,end]
        aa[i+1,end]=a[i,1]
        aa[1,i+1]=a[end,i]
        aa[end,i+1]=a[1,i]
    end
    for i=1:ns,j=1:ns
        aa[i+1,j+1]=a[i,j]
    end
    lap = Array{eltype(a),2}(ns,ns)
    scale = inv(dx*dx)
    for i=1:ns,j=1:ns
        lap[i,j]=(aa[i,j+1]+aa[i+2,j+1]+aa[i+1,j]+aa[i+1,j+2]-4*aa[i+1,j+1])*scale
    end
    return lap
end
Gives @time:
0.000010 seconds (6 allocations: 2.250 KB)
Notice the reduction in allocations. Extra allocations usually indicate the potential for optimization.
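For more stable numbers than a one-off @time, a BenchmarkTools comparison along these lines can be used (a sketch; laplacian here is just a placeholder name for whichever original implementation you are comparing against):
using BenchmarkTools
a = rand(10, 10)
@btime laplacian($a, 0.1)     # placeholder for the original implementation
@btime laplacian2($a, 0.1)    # the lower-allocation version above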

Allocation error with pyopencl with simple multiplication in for-loop

I am using pyopencl to speed up my calculations using a GPU and am at the moment mystified by the following problem.
I'm doing a simple multiplication of two arrays in a for loop using the following code:
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
from pyopencl.elementwise import ElementwiseKernel
ctx = cl.create_some_context(0)
queue = cl.CommandQueue(ctx)
multiply = ElementwiseKernel(ctx,
        "float *x, float *y, float *z",
        "z[i] = x[i] * y[i]",
        "multiplication")
x = cl_array.arange(queue, 1000000, dtype=np.complex64)
y = cl_array.arange(queue, 1000000, dtype=np.complex64)
z = cl_array.empty_like(x)
for n in range(10000):
    z = x*y
    multiply(x.real, y.real, z.real)
    multiply(x, y, z)
The last three lines do, of course, the same thing, namely the multiplication. However, the first two options result in the following error (I commented out the other two, of course):
pyopencl.MemoryError: clEnqueueNDRangeKernel failed: mem object allocation failure
I'm just lost as to why the first two options run into allocation errors.
NOTES:
GPU: [0] pyopencl.Device 'Capeverde' on 'AMD Accelerated Parallel Processing' at 0x2a76d90
>>> pyopencl.VERSION
(2013, 1)
I am aware that the complex type is not handled correctly, but if you change the dtype to np.float32 I still get the same problem.
I simplified your program and ran it once in a way that worked on my computer. Here is a version that worked for me:
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
from pyopencl.elementwise import ElementwiseKernel
ctx = cl.create_some_context(0)
queue = cl.CommandQueue(ctx)
multiply = ElementwiseKernel(ctx,
        "float *x, float *y, float *z",
        "z[i] = x[i] * y[i]",
        "multiplication")
x = cl_array.arange(queue, 1000000, dtype=np.float32)
y = cl_array.arange(queue, 1000000, dtype=np.float32)
z = cl_array.empty_like(x)
for i in range(10000):
    multiply(x, y, z)
This program runs the kernel with np.float32 buffers. Your problem may stem from the np.complex64 type, or from the fact that you call .real 30,000 times, which may create a new buffer each time. It is also possible your buffers are too large for your GPU; try reducing their size.
I am not sure exactly what you are aiming to do, but I strongly recommend avoiding ElementWise until you have spent a little more time working with standard PyOpenCL. ElementWise is just some syntactic sugar that can confuse the true nature of what PyOpenCL is doing.
Trying to solve your problem without ElementWise will help you understand where your data is at all times, how to manage your queue, and when to copy your memory to and from the host.
