Theano with opencl GPU - opencl

I have configured theano as follows:
[idf@localhost python]$ more ~idf/.theanorc
[global]
device = opencl0:0
floatX = float32
[lib]
cnmem=100
[idf@localhost python]$
I also needed to run:
[idf@localhost python]$ export MKL_THREADING_LAYER=GNU
although interestingly enough, if I install openblas and add
[blas]
ldflags = -lopenblas
to the .theanorc file, I no longer need to:
export MKL_THREADING_LAYER=GNU
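For reference, putting the pieces above together, the resulting ~/.theanorc with the openblas flags looks like this:
[global]
device = opencl0:0
floatX = float32
[lib]
cnmem=100
[blas]
ldflags = -lopenblas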
Using a program I found on the internet which I modified slightly to use gpuarray, I am attempting to use theano with an Intel GPU through opencl:
import os
import shutil
from theano import function, config, shared, gpuarray
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
When I run the program, it seems as if it recognizes the GPU, but at the end the message "used the cpu" is printed.
[idf@localhost python]$ python theanoexam1.py
Mapped name None to device opencl0:0: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float32, vector)>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.231896 seconds
Result is [ 1.23178029 1.61879337 1.52278054 ..., 2.20771813 2.29967737
1.62323284]
Used the cpu
[idf@localhost python]$
I am skeptical of the message "used the cpu": 1.231896 seconds seems fast for an Intel i3 with four cores.
Is there an extra configuration that is needed to use opencl with theano? Or did this program indeed show that theano is configured to use the GPU through opencl?
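As an aside, the check at the end of the program counts any op that is an instance of T.Elemwise, and as far as I can tell the gpuarray backend's GpuElemwise is itself a subclass of Elemwise, so a GPU graph can still print "Used the cpu". A minimal sketch of a stricter check (reusing f, T and numpy from the program above) that ignores ops whose type name contains 'Gpu':
if numpy.any([isinstance(x.op, T.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')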

Firstly, thank you for your post.
I am running on Ubuntu 16.04, with Conda, and I have manually installed libgpuarray - all of which is well documented on the web.
I used the same test program you did (thank you for providing it).
So here are my settings:
export MKL_THREADING_LAYER=GNU
The file ~/.theanorc looks like this:
[global]
device = opencl0:0
floatX = float32
[lib]
cnmem=100
When I run the code:
python test.py
I see the output:
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
Mapped name None to device opencl0:0: Ellesmere
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float32, vector)>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.282664 seconds
Result is [1.2317803 1.6187935 1.5227805 ... 2.207718 2.2996776 1.6232328]
Used the gpu
I cannot figure out how to use the 2nd GPU (also OpenCL), but for the moment I am happy that at least 1 GPU is running.
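One guess I have not verified: the gpuarray backend names devices as opencl<platform>:<device>, so if the second GPU sits on the same OpenCL platform it should be opencl0:1 (or opencl1:0 if it lives on a separate platform). A ~/.theanorc along these lines might select it:
[global]
device = opencl0:1
floatX = float32
[lib]
cnmem=100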

Related

Cannot truncate sys.stdout; returns "io.UnsupportedOperation: truncate"

FULL CODE:
import sys
import time

counter = 0
while True:
    sys.stdout.write(str(counter))
    time.sleep(1)
    sys.stdout.truncate()
    counter += 1
PART THAT MATTERS
sys.stdout.truncate()
QUESTION(S)
Why does sys.stdout.truncate() raise an error? How can I truncate sys.stdout if sys.stdout.truncate() will not work?
OPERATING SYSTEM AND MORE INFO
Operating System: Windows
Operating System Version: Windows 10
Programming Language: Python
Programming Language Version: Python 3.6
Other Details: Run from command line
sys.stdout is a file object which corresponds to the interpreter's standard output, and it does not support truncate():
https://docs.python.org/3/library/sys.html#sys.stdout
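To illustrate (a minimal sketch, not from the original answer; the exact behaviour depends on whether stdout is a terminal, a pipe, or redirected to a file), truncate() only works on seekable streams:
import io
import sys

print(sys.stdout.seekable())   # False when stdout is a terminal or a pipe

buf = io.StringIO("0123456789")
buf.truncate(4)                # fine: an in-memory stream is seekable
print(buf.getvalue())          # prints "0123"

try:
    sys.stdout.truncate()
except io.UnsupportedOperation as exc:
    print("cannot truncate stdout:", exc)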
It seems like you want to create status-bar-style output. This should work in Python 3:
import time

counter = 0
while True:
    # end="\r" returns to the start of the line; flush=True makes each
    # update visible immediately instead of waiting in the output buffer
    print("{}".format(counter), end="\r", flush=True)
    time.sleep(1)
    counter += 1
See How to overwrite the previous print to stdout in python? for more info

Get system memory information from julia

Is there a nice way to get current system information in Julia? My use case here is memory, but I am also interested in basically any information I could get from running top on Linux.
This is what I have at the moment (basically just parsing the output of `free -m`):
import Base.DataFmt: readdlm_string, invalid_dlm

"""
    getmeminfo()

Returns (in MB) a tuple containing:
- Memory(total, used, buffer, available)
- Swap(total, used, free)
"""
function getmeminfo()
    memstats = readdlm_string(readstring(`free -m`), invalid_dlm(Char), Int, '\n', true, Dict())
    return Tuple{Array{Int,1},Array{Int,1}}((memstats[2,[2;3;6;7]], memstats[3,[2;3;4]]))
end
Is there something in Base or any better ideas?
The built-in Sys module contains functions dedicated to retrieving system information.
julia> VERSION
v"1.0.0"
julia> Sys.total_memory() / 2^20
8071.77734375
julia> Sys.free_memory() / 2^20
5437.46484375
julia> Sys.CPU_NAME
"haswell"
julia> Sys.  # pressing Tab twice lists everything in the module
ARCH              KERNEL             WORD_SIZE       eval                isexecutable    set_process_title
BINDIR            MACHINE            __init__        free_memory         islinux         total_memory
CPU_NAME          SC_CLK_TCK         _cpu_summary    get_process_title   isunix          uptime
CPU_THREADS       STDLIB             _show_cpuinfo   include             iswindows       which
CPUinfo           UV_cpu_info_t      cpu_info        isapple             loadavg         windows_version
JIT               WINDOWS_VISTA_VER  cpu_summary     isbsd               maxrss
While it does not support all of the information provided by top, it will hopefully provide the information you are looking for.

CUDA IPC Memcpy + MPI fails in Theano, works in pycuda

For learning purposes, I wrote a small Python C extension module that is supposed to perform an IPC cuda memcopy to transfer data between processes. For testing, I wrote equivalent programs: one using theano's CudaNdarray, and the other using pycuda. The problem is, even though the test programs are nearly identical, the pycuda version works while the theano version does not. It doesn't crash: it just produces incorrect results.
Below is the relevant function in the C module. Here is what it does: every process has two buffers, a source and a destination. Calling _sillycopy(source, dest, n) copies n elements from each process's source buffer to the neighboring process's dest array. So, if I have two processes, 0 and 1, process 0 will end up with process 1's source buffer and process 1 will end up with process 0's source buffer.
Note that to transfer cudaIpcMemHandle_t values between processes, I use MPI (this is a small part of a larger project which uses MPI). _sillycopy is called by another function, "sillycopy" which is exposed in Python by the standard Python C API methods.
void _sillycopy(float *source, float* dest, int n, MPI_Comm comm) {
    int localRank;
    int localSize;
    MPI_Comm_rank(comm, &localRank);
    MPI_Comm_size(comm, &localSize);

    // Figure out which process is to the "left".
    // m() performs a mod and treats negative numbers
    // appropriately
    int neighbor = m(localRank - 1, localSize);

    // Create a memory handle for *source and do a
    // wasteful Allgather to distribute to other processes
    // (could just use an MPI_Sendrecv, but irrelevant right now)
    cudaIpcMemHandle_t *memHandles = new cudaIpcMemHandle_t[localSize];
    cudaIpcGetMemHandle(memHandles + localRank, source);
    MPI_Allgather(
        memHandles + localRank, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
        memHandles, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
        comm);

    // Open the neighbor's mem handle so we can do a cudaMemcpy
    float *sourcePtr;
    cudaIpcOpenMemHandle((void**)&sourcePtr, memHandles[neighbor], cudaIpcMemLazyEnablePeerAccess);

    // Copy!
    cudaMemcpy(dest, sourcePtr, n * sizeof(float), cudaMemcpyDefault);
    cudaIpcCloseMemHandle(sourcePtr);
    delete [] memHandles;
}
Now here is the pycuda example. For reference, using int() on a_gpu and b_gpu returns the underlying buffer's memory address on the device.
import sillymodule # sillycopy lives in here
import simplempi as mpi
import pycuda.driver as drv
import numpy as np
import atexit
import time
mpi.init()
drv.init()
# Make sure each process uses a different GPU
dev = drv.Device(mpi.rank())
ctx = dev.make_context()
atexit.register(ctx.pop)
shape = (2**26,)
# allocate host memory
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)
# allocate device memory
a_gpu = drv.mem_alloc(a.nbytes)
b_gpu = drv.mem_alloc(b.nbytes)
# copy host to device
drv.memcpy_htod(a_gpu, a)
drv.memcpy_htod(b_gpu, b)
# A few more host buffers
a_p = np.zeros(shape, np.float32)
b_p = np.zeros(shape, np.float32)
# Sanity check: this should fill a_p with 1's
drv.memcpy_dtoh(a_p, a_gpu)
# Verify that it did
print(a_p[0:10])

sillymodule.sillycopy(
    int(a_gpu),
    int(b_gpu),
    shape[0])

# After this, b_p should be all ones
drv.memcpy_dtoh(b_p, b_gpu)
print(b_p[0:10])
And now the theano version of the above code. Rather than using int() to get the buffers' address, the CudaNdarray way of accessing this is via the gpudata attribute.
import os
import simplempi as mpi
mpi.init()

# selects one gpu per process
os.environ['THEANO_FLAGS'] = "device=gpu{}".format(mpi.rank())

import theano.sandbox.cuda as cuda
import numpy as np
import time
import sillymodule
shape = (2 ** 24, )
# Allocate host data
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)
# Allocate device data
a_gpu = cuda.CudaNdarray.zeros(shape)
b_gpu = cuda.CudaNdarray.zeros(shape)
# Copy from host to device
a_gpu[:] = a[:]
b_gpu[:] = b[:]
# Should print 1's as a sanity check
print(np.asarray(a_gpu[0:10]))
sillymodule.sillycopy(
    a_gpu.gpudata,
    b_gpu.gpudata,
    shape[0])
# Should print 1's
print(np.asarray(b_gpu[0:10]))
Again, the pycuda code works perfectly and the theano version runs, but gives the wrong result. To be precise, at the end of the theano code, b_gpu is filled with garbage: neither 1's nor 0's, just random numbers as though it were copying from a wrong place in memory.
My original theory about why this was failing had to do with CUDA contexts. I wondered whether theano was doing something with contexts that meant the cuda calls made in sillycopy ran under a different CUDA context than the one used to create the gpu arrays. I don't think this is the case because:
I spent a lot of time digging deep into theano's code and saw no funny business being played with contexts.
I would expect such a problem to result in a hard crash rather than a silently wrong result, and no crash occurs. (One way to sanity-check the context from the Python side is sketched below.)
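A minimal sketch of such a check (hypothetical, not something from the test programs; it assumes the same mpi wrapper and pycuda setup as above) would be to print which device the currently active context belongs to on each rank, before and after the sillycopy call:
import pycuda.driver as drv

def report_context(tag, rank):
    # Device that owns the CUDA context currently active on this thread
    dev = drv.Context.get_device()
    print("[rank %d] %s: active context is on %s" % (rank, tag, dev.name()))

# e.g. report_context("before sillycopy", mpi.rank()) and again afterwards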
A secondary thought is whether this has to do with the fact that theano spawns several threads, even when using a cuda backend, which can be verified by running "ps huH p ". I don't know how threads might affect anything, but I have run out of obvious things to consider.
Any thoughts on this would be greatly appreciated!
For reference: the processes are launched in the normal OpenMPI way:
mpirun --np 2 python test_pycuda.py

Callback from "multiprocessing" with CFFI segfaults after ~100 iterations

A PyPy callback that works perfectly (in an infinite loop) when implemented (straightforwardly) as a method of a Python object segfaults after approximately 100 iterations when I move the Python object into a separate multiprocessing process.
In the main code I have:
import multiprocessing as mp

class Task(object):
    def __init__(self, com, lib):
        self.com = com  # communication queue
        self.lib = lib  # ffi library
        self.proc = mp.Process(target=self.spawn, args=(self.com,))
        self.register_callback()

    def spawn(self, com):
        print('%s spawned.' % self.name)
        # loop (keeping 'self' alive) until BREAK:
        while True:
            cmd = com.get()
            if cmd == self.BREAK:
                break
        print("%s stopped." % self.name)

    # ffi.callback("int(void*, Data*)")  # old cffi (ABI mode)
    def callback(self, data):
        # <work on data>
        return 1

    def register_callback(self):
        s = ffi.new_handle(self)
        self.lib.register_callback(s, self.callback)  # C-call
The idea is that multiple tasks should serve an equal number of callbacks concurrently. I have no clue what may cause the segfault, especially since it runs fine for the first ~100 iterations or so. Help much appreciated!
Solution
Handle 's' is garbage collected when returning from register_callback(). Making the handle an attribute of 'self' and passing that attribute instead keeps it alive.
Standard CPython (cffi 1.6.0) segfaulted at the first iteration (i.e. gc was immediate) and provided a crucial, informative error message. PyPy, on the other hand, segfaulted after approximately 100 iterations without providing a message... Both run fine now.
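A minimal sketch of the fix described above (same ffi and lib objects as in the question):
def register_callback(self):
    self._handle = ffi.new_handle(self)  # stored on self, so it is not garbage collected
    self.lib.register_callback(self._handle, self.callback)  # C-call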

ATI streams failed on AMD 7970 series

I have a program (not mine; downloaded from the internet) built on ATI Streams (more precisely, on the Brook language; the file is *.br). There is a Python script (see below) that compiles it into an *.il file using the Brook compiler provided by the ATI Streams SDK. After that, the script zips it into a *.Z file. The C program's Makefile contains this code:
my_kernel_dp11.o: my_kernel_dp11.Z
ld -s -r -o my_kernel_dp11.o -b binary my_kernel_dp11.Z
and then it is linked into the main executable.
Data from that object file is read by the C program into a buffer, and then the calclCompile function is called (as I understand, an OpenCL function).
It works fine on AMD HD 6970-series cards but fails on AMD HD 7970-series cards with the following error:
Unsupported program construct detected in back-end
Here is the Python script:
#!/usr/bin/python
import sys
import zlib
import os

def makebrz(dp_bits):
    try:
        os.unlink("a_slice_dpX_a_slicer.il")
    except OSError:
        pass
    dpdefs = ""
    for i in range(dp_bits-11):
        dpdefs = dpdefs + " -D DP_BIT_%i" % (i+12,)
    print "DP_DEFS: ", dpdefs
    os.system("/usr/local/atibrook/sdk/bin/brcc -k -pp %s a_slice_dpX.br" % (dpdefs,))
    f = open("a_slice_dpX_a_slicer.il")
    if f == None:
        print "Could not read ", sys.argv[1]
        sys.exit(-1)
    data = f.read()
    f.close()
    oname = "../my_kernel_dp%i.Z" % (dp_bits,)
    data2 = zlib.compress(data)
    fo = open(oname, "wb")
    fo.write(data2)
    fo.close()
    #os.system("ld -s -r -o ../%s.o -b binary %s" % (oname[:-2],oname))

makebrz(11)
makebrz(12)
makebrz(13)
makebrz(14)
And here is a program http://dl.dropbox.com/u/46469564/a_slice_dpX.br
The question is: what should I do to make the program "supported"?
P.S. There is one problem: I don't know this technology (Brook, ATI Streams, OpenCL) at all. That's why advice like "you should try this or that" is useless. I need a particular action to take: change this and you'll have success :)
Thank you.
AFAIK the Radeon HD 7970 is built on the GCN architecture, so if you are using Brook to generate IL code, the JIT on Southern Islands may not know how to generate the proper ISA for the hardware you are using. If you would like to continue using Brook+, you need to wait until an updated version of Brook+ is released on SourceForge that can generate IL which gets converted to the right (GCN) ISA.
The other option is to use the AMD APP SDK 2.6 and rewrite your code in OpenCL.
