I'm failing to compile programs that use the write_imagef() function on Nvidia implementations.
Working with a Tesla K10.G2.8GB using driver version 367.35, on Python 2.7 with PyOpenCL 2016.1,
I'm trying to compile the following program, which fails with a build error:
Host Code:
import pyopencl as cl
platform = cl.get_platforms()[0]
devs = platform.get_devices()
device1 = devs[1]
mf = cl.mem_flags
ctx = cl.Context([device1])
Queue1 = cl.CommandQueue(ctx)
f = open('Minimal.cl', 'r')
fstr = "".join(f.readlines())
prg = cl.Program(ctx, fstr).build()
Kernel (Minimal.cl):
__kernel void test(image2d_t d_output){
write_imagef(d_output,(int2)(1,1),(float4)(1.0f,1.0f,1.0f,1.0f));
}
The error I get is:
pyopencl.cffi_cl.RuntimeError: clbuildprogram failed: BUILD_PROGRAM_FAILURE -
I checked that my device has image support and that it supports reading from and writing to
texture buffers in the specified format. I expect the same kernel to fail in the 3D case as well,
because the cl_khr_3d_image_writes extension is not supported on any of our Nvidia devices,
but I don't understand the problem in the 2D case.
Image arguments must be declared as either read_only or write_only (or read_write with OpenCL 2.x), so your kernel definition should look like this:
__kernel void test(write_only image2d_t d_output){
write_imagef(d_output,(int2)(1,1),(float4)(1.0f,1.0f,1.0f,1.0f));
}
I have just started working on SYCL. I ran ComputeCpp_info on my system, and it showed the following data for 3 devices:
ComputeCpp Info (CE 1.1.0)
SYCL 1.2.1 revision 3
Device 1 ( GeForce GTX 1050 = NO - Device does not support SPIR)
Device 2 (Intel(R) HD Graphics 630 = UNTESTED - Device not tested on this OS)
Device 3 (Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz = UNTESTED - Device running untested driver)
Now my question is: can I work on these devices, given that 2 are untested and 1 is unsupported? Or am I missing some drivers?
Also, I implemented a simple example, but it gives me a "CL/cl.h not found" error:
#include <CL/sycl.hpp>
#include <array>
#include <numeric>
#include <iostream>
int main() {
const size_t array_size = 1024 * 512;
std::array<cl::sycl::cl_int, array_size> in, out;
std::iota(begin(in), end(in), 0);
cl::sycl::queue device_queue;
cl::sycl::range<1> n_items{ array_size };
cl::sycl::buffer<cl::sycl::cl_int, 1> in_buffer(in.data(), n_items);
cl::sycl::buffer<cl::sycl::cl_int, 1> out_buffer(out.data(), n_items);
device_queue.submit([&](cl::sycl::handler &cgh) {
constexpr auto sycl_read = cl::sycl::access::mode::read;
constexpr auto sycl_write = cl::sycl::access::mode::write;
auto in_accessor = in_buffer.get_access<sycl_read>(cgh);
auto out_accessor = out_buffer.get_access<sycl_write>(cgh);
cgh.parallel_for<class VecScalMul>(n_items,
[=](cl::sycl::id<1> wiID) {
out_accessor[wiID] = in_accessor[wiID] * 2;
});
});
}
The computecpp_info tool shows the devices that are or are not supported by ComputeCpp on your system. Here's an explanation of the outputs:
NO - Device does not support SPIR: This means that the device can be seen but it does not support SPIR instructions and so cannot be supported by ComputeCpp
UNTESTED - Device not tested on this OS: This means that the device can be seen and is reporting that it supports SPIR instructions. It should work with ComputeCpp but this specific device has not been tested by the ComputeCpp team
The cl.h header missing error is because you are missing the OpenCL headers. These can be found here and you'll need to point at them when you compile your code. I'd suggest using the Getting Started guide with the sample code and then modifying the hello world example to test out your code. This has an existing CMake file that is designed to search for all the dependencies you need.
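As a quick sanity check that the headers and an OpenCL runtime are visible to your toolchain, a minimal probe along these lines (a sketch using only the standard OpenCL C API; link against the OpenCL library, e.g. -lOpenCL) should compile and report at least one platform:
// Probe (sketch): compiles only if CL/cl.h is on the include path.
#include <CL/cl.h>
#include <cstdio>
int main() {
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &numPlatforms); // just count the platforms
    std::printf("OpenCL platforms found: %u\n", numPlatforms);
    return 0;
}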
(This is a ComputeCpp-specific question, rather than SYCL.) The UNTESTED platforms will probably work, but Codeplay cannot guarantee it. In my experience both should work, though some OpenCL driver bugs may hit you on the Intel GPU, depending on your configuration.
You need the OpenCL headers on your system, since the SYCL 1.2.1 specification is built on top of OpenCL.
Disclaimer: I am a Codeplay employee working on ComputeCpp!
I am using OpenCL to do some image processing and want to use it to write an RGBA image directly to the framebuffer. The workflow is shown below:
1) map the framebuffer to user space.
2) create an OpenCL buffer using clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag.
3) use clEnqueueMapBuffer to map the results to the framebuffer.
However, it doesn't work: nothing shows on the screen. I then found that the virtual address mapped from the framebuffer is not the same as the virtual address mapped by OpenCL. Has anybody done a zero-copy move of data from the GPU to the framebuffer? Any help on which approach I should use?
Some key code:
if ((fd_fb = open("/dev/fb0", O_RDWR, 0)) < 0) {
printf("Unable to open /dev/fb0\n");
return -1;
}
fb0 = (unsigned char *)mmap(0, fb0_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd_fb, 0);
...
cmDevSrc4 = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(cl_uchar) * imagesize * 4, NULL, &status);
...
fb0 = (unsigned char*)clEnqueueMapBuffer(cqCommandQueue, cmDevSrc4, CL_TRUE, CL_MAP_READ, 0, sizeof(cl_uchar) * imagesize * 4, 0, NULL, NULL, &ciErr);
For zero-copy with an existing buffer you need to use the CL_MEM_USE_HOST_PTR flag in the clCreateBuffer() call. In addition, you need to pass the pointer to the existing buffer as the second-to-last argument.
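For example, here is a minimal sketch reusing the variable names from the question (fbBuffer is just an illustrative name; whether the driver truly avoids an internal copy depends on the implementation and on fb0 meeting its alignment requirements):
// Wrap the already-mmapped framebuffer in a zero-copy OpenCL buffer.
// fb0 and fb0_size come from the mmap() call shown above.
cl_int status;
cl_mem fbBuffer = clCreateBuffer(cxGPUContext,
                                 CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                 fb0_size, // size of the mapped region
                                 fb0,      // existing host pointer (second-to-last argument)
                                 &status);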
I don't know how the Linux framebuffer works internally, but it is possible that even with zero-copy from device to host, the data still ends up being copied back to the GPU for rendering. So you might want to render the OpenCL buffer directly with OpenGL instead; check out the cl_khr_gl_sharing extension for OpenCL.
I don't know OpenCL yet; I was just searching for information about writing to the framebuffer from it and hit your post. Opening it and mmapping it like in your code looks good.
I've done that with the CPU: https://sourceforge.net/projects/fbgrad/
That doesn't always work; it depends on the computer. I'm on an old Dell Latitude D530, and not only can't I write to the framebuffer, but there's no GPU, so there is no advantage to using OpenCL over the CPU. If you have a /dev/fb0 and you can get something on the screen with
cat /dev/random > /dev/fb0
then you might have a chance from OpenCL. With a Mali, at least, there's a way to pass a pointer from the CPU to the GPU. You may need to add some offset (true on a Raspberry Pi, I think). And it could be double-buffered by Xorg; there are lots of reasons why it might not work.
For learning purposes, I wrote a small C Python module that is supposed to perform an IPC cuda memcopy to transfer data between processes. For testing, I wrote equivalent programs: one using theano's CudaNdarray, and the other using pycuda. The problem is, even though the test programs are nearly identical, the pycuda version works while the theano version does not. It doesn't crash: it just produces incorrect results.
Below is the relevant function in the C module. Here is what it does: every process has two buffers, a source and a destination. Calling _sillycopy(source, dest, n) copies n elements from each process's source buffer to the neighboring process's dest buffer. So, if I have two processes, 0 and 1, process 0 will end up with process 1's source buffer and process 1 will end up with process 0's source buffer.
Note that to transfer cudaIpcMemHandle_t values between processes, I use MPI (this is a small part of a larger project which uses MPI). _sillycopy is called by another function, "sillycopy" which is exposed in Python by the standard Python C API methods.
void _sillycopy(float *source, float* dest, int n, MPI_Comm comm) {
int localRank;
int localSize;
MPI_Comm_rank(comm, &localRank);
MPI_Comm_size(comm, &localSize);
// Figure out which process is to the "left".
// m() performs a mod and treats negative numbers
// appropriately
int neighbor = m(localRank - 1, localSize);
// Create a memory handle for *source and do a
// wasteful Allgather to distribute to other processes
// (could just use an MPI_Sendrecv, but irrelevant right now)
cudaIpcMemHandle_t *memHandles = new cudaIpcMemHandle_t[localSize];
cudaIpcGetMemHandle(memHandles + localRank, source);
MPI_Allgather(
memHandles + localRank, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
memHandles, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
comm);
// Open the neighbor's mem handle so we can do a cudaMemcpy
float *sourcePtr;
cudaIpcOpenMemHandle((void**)&sourcePtr, memHandles[neighbor], cudaIpcMemLazyEnablePeerAccess);
// Copy!
cudaMemcpy(dest, sourcePtr, n * sizeof(float), cudaMemcpyDefault);
cudaIpcCloseMemHandle(sourcePtr);
delete [] memHandles;
}
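Note that the CUDA calls above discard their return codes; a silent failure of cudaIpcGetMemHandle or cudaIpcOpenMemHandle would leave sourcePtr pointing somewhere bogus and could produce exactly this kind of garbage. A minimal checking macro (a sketch, not part of the original code; requires <stdio.h> and the CUDA runtime header) makes such failures visible:
// Sketch: report CUDA runtime errors instead of silently using a bad pointer.
#define CUDA_CHECK(call)                                        \
    do {                                                        \
        cudaError_t err_ = (call);                              \
        if (err_ != cudaSuccess) {                              \
            fprintf(stderr, "%s failed: %s\n", #call,           \
                    cudaGetErrorString(err_));                  \
        }                                                       \
    } while (0)

// Usage, e.g.:
// CUDA_CHECK(cudaIpcOpenMemHandle((void**)&sourcePtr,
//            memHandles[neighbor], cudaIpcMemLazyEnablePeerAccess));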
Now here is the pycuda example. For reference, using int() on a_gpu and b_gpu returns the underlying buffer's memory address on the device.
import sillymodule # sillycopy lives in here
import simplempi as mpi
import pycuda.driver as drv
import numpy as np
import atexit
import time
mpi.init()
drv.init()
# Make sure each process uses a different GPU
dev = drv.Device(mpi.rank())
ctx = dev.make_context()
atexit.register(ctx.pop)
shape = (2**26,)
# allocate host memory
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)
# allocate device memory
a_gpu = drv.mem_alloc(a.nbytes)
b_gpu = drv.mem_alloc(b.nbytes)
# copy host to device
drv.memcpy_htod(a_gpu, a)
drv.memcpy_htod(b_gpu, b)
# A few more host buffers
a_p = np.zeros(shape, np.float32)
b_p = np.zeros(shape, np.float32)
# Sanity check: this should fill a_p with 1's
drv.memcpy_dtoh(a_p, a_gpu)
# Verify that it did
print(a_p[0:10])
sillymodule.sillycopy(
int(a_gpu),
int(b_gpu),
shape[0])
# After this, b_p should have all one's
drv.memcpy_dtoh(b_p, b_gpu)
print(b_p[0:10])
And now the theano version of the above code. Rather than using int() to get the buffers' address, the CudaNdarray way of accessing this is via the gpudata attribute.
import os
import simplempi as mpi
mpi.init()
# selects one GPU per process
os.environ['THEANO_FLAGS'] = "device=gpu{}".format(mpi.rank())
import theano.sandbox.cuda as cuda
import time
import numpy as np
import sillymodule
shape = (2 ** 24, )
# Allocate host data
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)
# Allocate device data
a_gpu = cuda.CudaNdarray.zeros(shape)
b_gpu = cuda.CudaNdarray.zeros(shape)
# Copy from host to device
a_gpu[:] = a[:]
b_gpu[:] = b[:]
# Should print 1's as a sanity check
print(np.asarray(a_gpu[0:10]))
sillymodule.sillycopy(
a_gpu.gpudata,
b_gpu.gpudata,
shape[0])
# Should print 1's
print(np.asarray(b_gpu[0:10]))
Again, the pycuda code works perfectly and the theano version runs, but gives the wrong result. To be precise, at the end of the theano code, b_gpu is filled with garbage: neither 1's nor 0's, just random numbers as though it were copying from a wrong place in memory.
My original theory about why this was failing had to do with CUDA contexts. I wondered whether theano was doing something with them that meant the CUDA calls made in sillycopy ran under a different CUDA context than the one used to create the GPU arrays. I don't think this is the case because:
I spent a lot of time digging deep into theano's code and saw no funny business being played with contexts.
I would expect such a problem to result in a hard crash rather than a silently incorrect result, and no crash happens.
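One way to rule this out for good (a sketch; it assumes the CUDA driver API is linkable alongside the runtime API, i.e. -lcuda, and logCurrentContext is a hypothetical helper) would be to log the context bound to the calling thread both where the arrays are created and inside _sillycopy:
// Sketch: print the current CUDA context so both sides can be compared.
#include <cuda.h>  // driver API
#include <stdio.h>

static void logCurrentContext(const char *where) {
    CUcontext ctx = NULL;
    cuCtxGetCurrent(&ctx); // context bound to this thread, or NULL
    printf("%s: current CUDA context = %p\n", where, (void *)ctx);
}
If the two printed pointers differ, the context theory would be back in play.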
A secondary thought is whether this has to do with the fact that theano spawns several threads, even when using a CUDA backend, which can be verified by running "ps huH p <pid>". I don't know how threads might affect anything, but I have run out of obvious things to consider.
Any thoughts on this would be greatly appreciated!
For reference: the processes are launched in the normal OpenMPI way:
mpirun --np 2 python test_pycuda.py
I implemented an encoder in 2 ways.
1) based on the SDK Transcoder example, which uses a topology and a transcode profile
2) based on IMFSourceReader and IMFSinkWriter, where the SourceReader delivers the samples to the SinkWriter for transcoding
I tested both implementations on Windows 8.1 with an Nvidia GPU (Quadro K2200) and an Intel GPU (P4600/P4700).
But bizarrely, only the topology implementation uses the GPU (on both).
In 2) I set MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS, which I guess should not even be necessary, because 1) uses the GPU with and without this flag set on the container type.
Is there a trick to enable the GPU with IMFSinkWriter, or is this a bug in Media Foundation?
I initially ran into the same issue. You don't mention how you configured the output type of the source reader (and the input type of the sink), but I found that if you let the system handle it (by selecting the output type of the reader to be RGB32), the performance will be horrible and entirely CPU bound. (Error checking omitted for brevity.)
hr = videoMediaType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
hr = videoMediaType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32);
hr = reader->SetCurrentMediaType((DWORD)MF_SOURCE_READER_FIRST_VIDEO_STREAM, nullptr, videoMediaType);
reader->SetStreamSelection((DWORD)MF_SOURCE_READER_FIRST_VIDEO_STREAM, true);
And the documentation agrees, indicating that this configuration is useful for getting a single snapshot from the video. As a result, if you configure the reader to deliver the native media type, performance is excellent, but you now have to transform the format yourself.
reader->GetNativeMediaType((DWORD)MF_SOURCE_READER_FIRST_VIDEO_STREAM, videoMode->GetIndex(), videoMediaType.ReleaseAndGetAddressOf());
From here, if you are dealing with simple color conversion (like YUY2 or YUV from a webcam) then there are a few options. I originally tried writing my own converter, and pushing that off to the GPU using HLSL with DirectCompute. This works very well, but in your case, the format isn't as trivial.
Ultimately, creating and configuring an instance of the color converter (as an IMFTransform) works perfectly.
Microsoft::WRL::ComPtr<IMFMediaType> mediaTransform;
hr = ::CoCreateInstance(CLSID_CColorConvertDMO, nullptr, CLSCTX_INPROC_SERVER, __uuidof(IMFTransform), reinterpret_cast<void**>(mediaTransform.GetAddressOf()));
// set the input type of the transform to the NATIVE output type of the reader
hr = mediaTransform->SetInputType(0u, videoMediaType.Get(), 0u);
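Presumably (an assumption on my part, since this step isn't shown above) the transform's output type also has to be configured to the format you want out of the converter, e.g. RGB32:
// Assumption: configure an RGB32 output type on the color-converter MFT.
// Frame-size/interlacing attributes may also need to be copied over from
// the native input type for SetOutputType to succeed.
Microsoft::WRL::ComPtr<IMFMediaType> rgbType;
hr = ::MFCreateMediaType(rgbType.GetAddressOf());
hr = rgbType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
hr = rgbType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32);
hr = mediaTransform->SetOutputType(0u, rgbType.Get(), 0u);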
Create and configure a separate sample and buffer.
IMFSample* transformSample;
hr = ::MFCreateSample(&transformSample);
IMFMediaBuffer* transformBuffer;
hr = ::MFCreateMemoryBuffer(RGB_MFT_OUTPUT_BUFFER_SIZE, &transformBuffer);
hr = transformSample->AddBuffer(transformBuffer);
MFT_OUTPUT_DATA_BUFFER* transformDataBuffer = new MFT_OUTPUT_DATA_BUFFER();
transformDataBuffer->pSample = transformSample;
transformDataBuffer->dwStreamID = 0u;
transformDataBuffer->dwStatus = 0u;
transformDataBuffer->pEvents = nullptr;
When receiving samples from the source, hand them off to the transform to be converted.
hr = mediaTransform->ProcessInput(0u, sample, 0u);
hr = mediaTransform->ProcessOutput(0u, 1u, transformDataBuffer, &outStatus);
hr = transformDataBuffer->pSample->GetBufferByIndex(0, &mediaBuffer);
Then of course, finally hand off the transformed sample to the sink just as you do today. I am confident that this will work, and you will be a very happy person. For me, CPU utilization went from 20% (original implementation) down to 2% (while concurrently displaying the video). Good luck. I hope you enjoy your project.
I'm doing a small OpenCL benchmark using Nvidia drivers.
My kernel performs 1024 fused multiply-adds and stores the result in an array:
#define FLOPS_MACRO_1(x) { (x) = (x) * 0.99f + 10.f; } // Multiply-add
#define FLOPS_MACRO_2(x) { FLOPS_MACRO_1(x) FLOPS_MACRO_1(x) }
#define FLOPS_MACRO_4(x) { FLOPS_MACRO_2(x) FLOPS_MACRO_2(x) }
#define FLOPS_MACRO_8(x) { FLOPS_MACRO_4(x) FLOPS_MACRO_4(x) }
// more recursive macros ...
#define FLOPS_MACRO_1024(x) { FLOPS_MACRO_512(x) FLOPS_MACRO_512(x) }
__kernel void ocl_Kernel_FLOPS(int iNbElts, __global float *pf)
{
for (unsigned i = get_global_id(0); i < iNbElts; i += get_global_size(0))
{
float f = (float) i;
FLOPS_MACRO_1024(f)
pf[i] = f;
}
}
But when I look at the generated PTX, I see this:
.entry ocl_Kernel_FLOPS(
.param .u32 ocl_Kernel_FLOPS_param_0,
.param .u32 .ptr .global .align 4 ocl_Kernel_FLOPS_param_1
)
{
.reg .f32 %f<1026>; // 1026 float registers !
.reg .pred %p<3>;
.reg .s32 %r<19>;
ld.param.u32 %r1, [ocl_Kernel_FLOPS_param_0];
// some more code unrelated to the problem
// ...
BB1_1:
and.b32 %r13, %r18, 65535;
cvt.rn.f32.u32 %f1, %r13;
fma.rn.f32 %f2, %f1, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f3, %f2, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f4, %f3, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f5, %f4, 0f3F7D70A4, 0f41200000;
// etc
// ...
If I am correct, the PTX uses 1026 float registers to perform the 1024 operations and never reuses a register, even though it could do all the multiply-adds with only 2 registers. 1026 is far above the maximum number of registers a thread is allowed to have (according to the specs), so I guess this ends up spilling to memory.
Is it a compiler bug, or am I totally missing something?
I am using nvcc version 6.5 on a Quadro K2000 GPU.
EDIT
Actually I did miss something in the specs:
"Since PTX supports virtual registers, it is quite common for a compiler frontend to generate
a large number of register names. Rather than require explicit declaration of every name,
PTX supports a syntax for creating a set of variables having a common prefix string
appended with integer suffixes. For example, suppose a program uses a large number, say
one hundred, of .b32 variables, named %r0, %r1, ..., %r99"
The PTX file format is intended to describe a virtual machine and instruction set architecture:
PTX defines a virtual machine and ISA for general purpose parallel thread execution. PTX programs are translated at install time to the target hardware instruction set. The PTX-to-GPU translator and driver enable NVIDIA GPUs to be used as programmable parallel computers.
So the PTX output that you are obtaining there is not a form of "GPU assembler". It is only an intermediate representation, intended to be capable of describing virtually any form of parallel computation.
The PTX representation is then compiled into actual binaries for the respective target GPU. This step is important for being able to abstract from the actual architecture. Specifically, regarding your example: it should be possible to use the same PTX representation of a program regardless of the number of registers that are available on a specific target machine. The 1026 "registers" that you see there are virtual registers, and in the end they may be mapped to the (few) real hardware registers that are actually available. You may add the --ptxas-options=-v argument to nvcc during compilation to obtain additional information about the actual register usage.
(This is roughly the same idea as the one behind LLVM: namely, to have a representation that can be optimized and reasoned about, abstracting both from the original source code and from the actual target architecture.)