Allocation error with pyopencl doing a simple multiplication in a for-loop

I am using pyopencl to speed up my calculations on a GPU and am at the moment mystified by the following problem.
I'm doing a simple multiplication of two arrays in a for loop using the following code:
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
from pyopencl.elementwise import ElementwiseKernel

ctx = cl.create_some_context(0)
queue = cl.CommandQueue(ctx)

multiply = ElementwiseKernel(ctx,
    "float *x, float *y, float *z",
    "z[i] = x[i] * y[i]",
    "multiplication")

x = cl_array.arange(queue, 1000000, dtype=np.complex64)
y = cl_array.arange(queue, 1000000, dtype=np.complex64)
z = cl_array.empty_like(x)

for n in range(10000):
    z = x*y
    multiply(x.real, y.real, z.real)
    multiply(x, y, z)
The last three lines in the loop all do the same thing, namely the multiplication; in each run I commented out two of the three so that only one option was active. However, the first two options result in the following error:
pyopencl.MemoryError: clEnqueueNDRangeKernel failed: mem object allocation failure
I'm at a loss as to why the first two options run into allocation errors.
NOTES:
GPU: [0] pyopencl.Device 'Capeverde' on 'AMD Accelerated Parallel Processing' at 0x2a76d90
>>> pyopencl.VERSION
(2013, 1)
I am aware that the complex type is not handled correctly, but even if I change the arrays to np.float32 I get the same problem.

I simplified your program and got it to run to completion on my machine. Here is the version that worked for me:
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
from pyopencl.elementwise import ElementwiseKernel

ctx = cl.create_some_context(0)
queue = cl.CommandQueue(ctx)

multiply = ElementwiseKernel(ctx,
    "float *x, float *y, float *z",
    "z[i] = x[i] * y[i]",
    "multiplication")

x = cl_array.arange(queue, 1000000, dtype=np.float32)
y = cl_array.arange(queue, 1000000, dtype=np.float32)
z = cl_array.empty_like(x)

for i in range(10000):
    multiply(x, y, z)
This program runs the kernel with np.float32 buffers. Your problem may stem from the np.complex64 type, or from the fact that you call .real 30000 times inside the loop, which may create a new device buffer on each call. It is also possible your buffers are too large for your GPU; try reducing their size.
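If the repeated .real calls are the culprit, a minimal fix (a sketch, assuming pyopencl.array's .real allocates a fresh device array on each access) is to take the real parts once, outside the loop:

xr = x.real  # allocate the float32 views once
yr = y.real
zr = cl_array.empty_like(xr)

for n in range(10000):
    multiply(xr, yr, zr)  # every iteration reuses the same three buffers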
I am not sure exactly what you are aiming to do, but I strongly recommend avoiding ElementwiseKernel until you have spent a little more time working with standard PyOpenCL. ElementwiseKernel is just syntactic sugar that can obscure what PyOpenCL is actually doing.
Solving your problem without it will help you understand where your data lives at all times, how to manage your queue, and when to copy memory to and from the host. A rough sketch of the same multiplication written that way is below.
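Here is a minimal sketch of the multiplication with a hand-written kernel and explicitly managed buffers (my own sketch, untested on your hardware):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# a hand-written kernel equivalent to the ElementwiseKernel above
prg = cl.Program(ctx, """
__kernel void multiply(__global const float *x,
                       __global const float *y,
                       __global float *z)
{
    int i = get_global_id(0);
    z[i] = x[i] * y[i];
}
""").build()

n = 1000000
x = np.arange(n, dtype=np.float32)
y = np.arange(n, dtype=np.float32)
z = np.empty_like(x)

mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y)
z_buf = cl.Buffer(ctx, mf.WRITE_ONLY, z.nbytes)

# the buffers are created once, so the loop itself allocates nothing
for _ in range(10000):
    prg.multiply(queue, x.shape, None, x_buf, y_buf, z_buf)

cl.enqueue_copy(queue, z, z_buf)  # copy the result back to the host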

Related

check_totals wrt a large vector in OpenMDAO

I'd like to check the total derivatives of an output with respect to a large array of inputs, but I don't want to check the derivative with respect to every member of the array, since the array is too large, and the complex steps (or finite differences) across each member of the array would take too long. Is there a way to check_totals wrt just a single member of an array?
Alternatively, is there a way to perform a directional derivative across the entire array for check_totals? This feature seems to exist for check_partials only?
As of Version 3.1.1 of OpenMDAO we don't have directional checking for totals, but it is a good idea and we are probably going to implement it when we figure out the best way.
As a workaround for now, I think the easiest way to take a directional derivative of your model is to temporarily modify your model by creating a component that takes a "step" in some random direction, and then inserting it in front of your component with wide inputs. I've put together a simple example here:
import numpy as np
import openmdao.api as om

n = 50

class DirectionalComp(om.ExplicitComponent):

    def setup(self):
        self.add_input('x', 1.0)
        self.add_output('y', np.ones(n))
        self.A = -1.0 + 2.0 * np.random.random(n)
        self.declare_partials('y', 'x', rows=np.arange(n), cols=np.repeat(0, n), val=self.A)

    def compute(self, inputs, outputs, discrete_inputs=None, discrete_outputs=None):
        x = inputs['x']
        outputs['y'] = x * self.A

prob = om.Problem()
model = prob.model

# Add something like this
model.add_subsystem('p', om.IndepVarComp('x', 1.0))
model.add_subsystem('direction', DirectionalComp())
model.connect('p.x', 'direction.x')
model.connect('direction.y', 'comp.x')
model.add_design_var('p.x')

# Old model
model.add_subsystem('comp', om.ExecComp('y = 2.0*x', x=np.ones((n, )), y=np.ones((n, ))))
model.add_constraint('comp.y', lower=0.0)

prob.setup()
prob.run_model()
totals = prob.check_totals()
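If you want the directional derivative itself rather than the check report, something along these lines should work (a sketch of mine using compute_totals; p.x drives the wide input through the random direction A, so the result is the original model's Jacobian times A):

derivs = prob.compute_totals(of=['comp.y'], wrt=['p.x'])
print(derivs['comp.y', 'p.x'])  # d(comp.y)/d(p.x) = J @ A for the original model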

BigInt calculations on the GPU in Julia

I need to perform calculations on random batches of very large integers. I have a function that compares the numbers for certain properties and returns a value based on those properties. Since the batches and the numbers themselves can be very large, I want to speed up the process by utilizing the GPU.
Here is a short version of what I have running purely on the CPU now:
using Statistics

function check(M)
    val = 0
    # some code that calculates val based on M, e.g. the mean
    val = mean(M)
    return val
end

function distribution(N, n, exp) # N = batch size, n = number of batches, exp = exponent of the upper limit of the integers
    avg = 0
    M = zeros(BigInt, N)
    for i = 1 : n
        M = rand(1 : BigInt(10) ^ exp, N)
        avg += check(M)
    end
    avg /= n
    println(avg, ":", N)
end

# example
distribution(10 ^ 3, 10 ^ 6, 100)
I have briefly used CUDAnative in Julia but I don't know how to implement the BigInt calculations. That package would be preferred but others are fine as well. Any help is appreciated.
BigInts are CPU-only, since they are not implemented in Julia itself but wrap an external C library (GMP), which cannot be compiled for the GPU.

GEKKO - optimization in matrix form

I am trying to solve an optimization problem where I need to specify the problem and the constraints using a 2D matrix. I have been using SciPy, which requires 1D arrays. I want to check whether GEKKO allows one to specify the objective function, bounds and constraints using a 2D matrix.
I have provided details and a reproducible version of the problem in the post here:
SCIPY - building constraints without listing each variable separately
Thanks
C
You can use the m.Array function in GEKKO. I don't recommend combining np.triu() with the Gekko array, because the eliminated variables will still be solved for but potentially hidden from the results. Here is a solution:
import numpy as np
from gekko import GEKKO

p = np.array([4, 5, 6.65, 12])        # p = prices
pmx = np.triu(p - p[:, np.newaxis])   # pmx = price matrix, upper triangular

m = GEKKO(remote=False)
q = m.Array(m.Var, (4, 4), lb=0, ub=10)

# only the upper triangle can change
for i in range(4):
    for j in range(4):
        if j <= i:
            q[i, j].upper = 0  # set upper bound = 0

def profit(q):
    profit = np.sum(q.flatten() * pmx.flatten())
    return profit

for i in range(4):
    m.Equation(np.sum(q[i, :]) <= 10)
    m.Equation(np.sum(q[:, i]) <= 8)

m.Maximize(profit(q))
m.solve()
print(q)
This gives the solution:
[[[0.0] [2.5432017412] [3.7228765674] [3.7339217013]]
[[0.0] [0.0] [4.2771234426] [4.2660783187]]
[[0.0] [0.0] [0.0] [0.0]]
[[0.0] [0.0] [0.0] [0.0]]]
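As a quick sanity check (my addition; it assumes the usual Gekko convention that each solved variable's .value is a one-element list), you can pull the solution into a NumPy array and verify the row and column constraints:

qv = np.array([[q[i, j].value[0] for j in range(4)] for i in range(4)])
print(qv.sum(axis=1))  # row totals, each should be <= 10
print(qv.sum(axis=0))  # column totals, each should be <= 8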

Julia benchmarking DSP.jl, CUDANative and CuArrays

I'm experimenting with DSP.jl, the conv() method in particular. I'm using CUDAnative and CuArrays to create arrays to pass as arguments to conv(), so that the CUDA versions of fft(), etc. will be used. I'm using BenchmarkTools to get performance data. I find that the Julia runtime complains about running out of CPU or GPU memory under odd circumstances. Here's my test setup:
using CUDAdrv, CUDAnative, CuArrays
using DSP
using FFTW
using BenchmarkTools

N = 120
A = rand(Float32, N, N, N);
B = rand(Float32, N, N, N);
A_d = cu(A);
B_d = cu(B);

function doConv(A, B)
    C = conv(A, B)
    finalize(C)
    C = []
end

t = @benchmark doConv($A_d, $B_d)
display(t)
Here's an example of the odd behavior I mentioned. If I set N to 120, my script runs to completion. If I set N to 64, I get an out-of-memory error: ERROR: LoadError: CUFFTError(code 2, cuFFT failed to allocate GPU or CPU memory). I can run the smaller case first, get the error, then bump N to the larger value and have the script complete successfully.
Is there something I should be doing differently to prevent this from happening?

Logsoftmax stability

I know how to make softmax numerically stable by adding -max_i x_i to each element. This avoids overflow and underflow.
Now, taking the log of this can still cause trouble: softmax(x) can underflow to zero, and log(0) evaluates to -infinity.
I am not sure how to fix this. I know it is a common problem, and I have read several answers about it that I did not understand, so I am still confused about how to solve it.
PS: If you provide a simple example, it would be awesome.
In order to stabilize logsoftmax, most implementations, such as TensorFlow and Theano, use a trick that takes out the largest component, max(x_i). This trick is often used for stably computing softmax. For logsoftmax, we begin with:

logsoftmax(x_i) = log( exp(x_i) / sum_j exp(x_j) )
               = log( exp(x_i - b) * exp(b) / (sum_j exp(x_j - b) * exp(b)) )

After extracting out the exp(b) and using the fact that log(exp(x)) = x, we have:

logsoftmax(x_i) = x_i - b - log( sum_j exp(x_j - b) )

If we set b = max_j x_j, this new equation has both overflow and underflow stability conditions.
In terms of code, if x is a vector:
import numpy as np

def log_softmax(x):
    x_off = x - np.max(x)
    return x_off - np.log(np.sum(np.exp(x_off)))
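A quick check with made-up values:

x = np.array([1000., 4., 5.])
print(log_softmax(x))
# ~[0., -996., -995.], all finite; taking np.log of a (stabilized)
# softmax instead gives [0., -inf, -inf], because the two small
# entries underflow to zero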
See also: https://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/
logsoftmax = logits - log(reduce_sum(exp(logits), dim))
Reference: https://www.tensorflow.org/api_docs/python/tf/nn/log_softmax
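For example (a small sketch of my own; the finite values follow from the subtract-the-max rewrite above):

import tensorflow as tf

logits = tf.constant([[4., 5., 1000.]])
print(tf.nn.log_softmax(logits))
# ~[[-996., -995., 0.]], finite even though softmax itself
# rounds the first two probabilities to zero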
Just use this, as it takes care of the NaN issue for you:
tf.nn.softmax_cross_entropy_with_logits(
    labels, logits, axis=-1, name=None
)

logits = tf.constant([[4, 5, 1000]], dtype=tf.float32)
labels = tf.constant([[1, 0, 1]], dtype=tf.float32)

# Case 1
output = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
print(output)
>>> tf.Tensor([996.], shape=(1,), dtype=float32)

# Case 2
a = tf.nn.softmax(logits)
output = tf.reduce_sum(-(labels * tf.math.log(a)))
print(output)
>>> tf.Tensor(nan, shape=(), dtype=float32)

# this happens because the value of softmax truncates to zero
print(a)
>>> <tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[0., 0., 1.]], dtype=float32)>
Mathematical tricks cannot make log(0) be anything other than -inf. If you think it through, the only way out is to normalize the data so that you never end up there.
