I have an opencl kernel that only fails on AMD and not NVIDIA. It fails with error code -13
Online, it gives this explanation: "if a sub-buffer object is specified as the value for an argument that is a buffer object and the offset specified when the sub-buffer object is created is not aligned to CL_DEVICE_MEM_BASE_ADDR_ALIGN value for device associated with queue."
I am unable to figure out what this means. What is CL_DEVICE_MEM_BASE_ADDR_ALIGN? The routine only fails when I call createSubBuffer.
CL_DEVICE_MEM_BASE_ADDR_ALIGN is a parameter to clGetDeviceInfo (see https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html for the documentation, and https://forums.khronos.org/showthread.php/9134-Looking-for-a-better-explanation-of-CL_DEVICE_MEM_BASE_ADDR_ALIGN for more explanation). You need to query that value and make sure every sub-buffer offset is a multiple of it. Note that the spec reports the value in bits, not bytes.
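To make the alignment requirement concrete, here is a minimal Python sketch of the offset arithmetic. The alignment value itself would come from a real clGetDeviceInfo query; the 2048-bit figure below is only an example value, not a guarantee for any particular device:

```python
def align_offset(offset_bytes, base_addr_align_bits):
    """Round a sub-buffer offset up to the next multiple of the device
    alignment. CL_DEVICE_MEM_BASE_ADDR_ALIGN is specified in *bits*,
    so convert to bytes first."""
    align_bytes = base_addr_align_bits // 8
    return ((offset_bytes + align_bytes - 1) // align_bytes) * align_bytes

# e.g. a device reporting 2048-bit (256-byte) alignment:
align_offset(100, 2048)   # -> 256 (100 is rounded up)
align_offset(512, 2048)   # -> 512 (already aligned)
```

Passing the rounded-up offset to clCreateSubBuffer avoids the -13 (CL_MISALIGNED_SUB_BUFFER_OFFSET) error on devices with stricter alignment, which is why the same code can pass on NVIDIA and fail on AMD.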
I want to query the size of an OpenCL kernel argument so that I can ensure that I send it a variable of the correct size. I am able to query lots of other properties of each kernel argument using clGetKernelArgInfo, as follows:
clGetKernelArgInfo(k, argc, CL_KERNEL_ARG_TYPE_NAME, sizeof(argType), &argType, &retSize);
This will tell me the string name of the type, for example. But that's not good enough, especially in complex cases where it's a struct and the string name is the same on host and device, but the packing is different, so the size is different. The things that I can query, according to https://man.opencl.org/clGetKernelArgInfo.html , are:
CL_KERNEL_ARG_ADDRESS_QUALIFIER
CL_KERNEL_ARG_ACCESS_QUALIFIER
CL_KERNEL_ARG_TYPE_NAME
CL_KERNEL_ARG_TYPE_QUALIFIER
CL_KERNEL_ARG_NAME
Any ideas?
FYI, this is NOT a duplicate of Get OpenCL Kernel-argument information because that is asking how to use the argument query function, not asking how to query the argument size.
There's no standard way to check before setting the argument as far as I'm aware, but the clSetKernelArg call will return CL_INVALID_ARG_SIZE if the sizes don't match properly, so that should allow you to detect and handle errors accordingly:
CL_INVALID_ARG_SIZE if arg_size does not match the size of the data type for an argument that is not a memory object or if the argument is a memory object and arg_size != sizeof(cl_mem) or if arg_size is zero and the argument is declared with the __local qualifier or if the argument is a sampler and arg_size != sizeof(cl_sampler).
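As a sketch of that detect-and-handle pattern: the numeric error codes below are the real values from CL/cl.h, but the `check_set_arg` helper is hypothetical, standing in for whatever wraps your actual clSetKernelArg calls:

```python
# Error codes as defined in CL/cl.h:
CL_SUCCESS = 0
CL_INVALID_ARG_SIZE = -51

def check_set_arg(err, arg_index):
    """Turn a clSetKernelArg return code into a descriptive exception."""
    if err == CL_INVALID_ARG_SIZE:
        raise ValueError(
            f"argument {arg_index}: arg_size does not match the size of "
            "the kernel argument's type (host/device struct packing mismatch?)")
    if err != CL_SUCCESS:
        raise RuntimeError(f"clSetKernelArg failed with code {err}")
```

Since the struct-packing mismatch you describe produces exactly CL_INVALID_ARG_SIZE at set-arg time, this at least turns a silent size mismatch into an immediate, diagnosable failure.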
I have the latest MPICH2 (3.0.4) compiled with the Intel Fortran compiler on a quad-core, dual-CPU (Intel Xeon) machine.
I am running into an MPI_Bcast problem: I am unable to broadcast the array
gpsi(1:201,1:381,1:38,1:20,1:7)
i.e. an array of 407410920 elements. When I try to broadcast this array, I get the following error:
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7f506d811010, count=407410920,
MPI_DOUBLE_PRECISION, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1369).:
MPIR_Bcast_intra(1160):
MPIR_SMP_Bcast(1077)..: Failure during collective
rank 1 in job 31 Grace_52261 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
MPI launch string is: mpiexec -n 2 %B/tvdbootstrap
Testing MPI configuration with 'mpich2version'
Exit value was 127 (expected 0), status: execute_command_t::exited
Launching MPI job with command: mpiexec -n 2 %B/tvdbootstrap
Server args: -callback 127.0.0.1:4142 -set_pw 65f76672:41f20a5c
So the question is: is there a limit on the size of data in MPI_Bcast, or is my array larger than it can handle?
As John said, your array is too big because its size in bytes can no longer be described by an int. When this is the case, you have a few options.
Use multiple MPI calls to send your data. For this option, you would just divide your data up into chunks smaller than 2^31 and send them individually until you've received everything.
Use MPI datatypes. With this option, you create a datatype describing some portion of your data, then send multiples of that datatype. For example, if you are sending an array of 100 integers, you can create a datatype of 10 integers using MPI_TYPE_VECTOR, then send 10 of that new datatype. Datatypes can be a bit confusing at first, but they are very powerful for sending either large or non-contiguous data.
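For the first option, the chunking arithmetic is simple. This Python sketch only computes chunk sizes that each fit in a signed 32-bit int; in real code each chunk would go out as its own MPI_Bcast with the buffer pointer advanced accordingly:

```python
INT_MAX = 2**31 - 1  # MPI count arguments are C ints

def chunk_counts(total, max_chunk=INT_MAX):
    """Split `total` units into chunks that each fit in a signed 32-bit int."""
    counts = []
    while total > 0:
        n = min(total, max_chunk)
        counts.append(n)
        total -= n
    return counts

# The array from the question: 407,410,920 double-precision elements,
# 8 bytes each. The element count fits in an int, but the byte count
# (which some implementations effectively limit internally) does not:
chunk_counts(407410920 * 8)   # two chunks, each <= INT_MAX bytes
```

Chunking by elements rather than bytes works the same way; the point is just that every individual count stays below 2^31.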
Yes, there is a limit. The count argument is an int, so the limit is usually 2^31, i.e. about two billion elements. You say your array has 407 million elements, so by that measure it should work. However, if the limit is two billion bytes, then your 8-byte double-precision elements put you at about 3.26 GB, exceeding it by roughly 50%. Try cutting your array size in half and see if that works.
See: Maximum amount of data that can be sent using MPI::Send
I've noticed that often, global and constant device memory is initialized to 0. Is this a universal rule? I wasn't able to find anything in the standard.
No, it isn't. For instance, I had this small kernel to test atomic add:
kernel void atomicAdd(volatile global int *result){
atomic_add(&result[0], 1);
}
Calling it with this host code (pyopencl + unittest):
def test_atomic_add(self):
NDRange = (4, 4)
result = np.zeros(1, dtype=np.int32)
out_buf = cl.Buffer(self.ctx, self.mf.WRITE_ONLY, size=result.nbytes)
self.prog.atomicAdd(self.queue, NDRange, NDRange, out_buf)
cl.enqueue_copy(self.queue, result, out_buf).wait()
self.assertEqual(result, 16)
was always returning the correct value when run on my CPU. However, on an ATI HD 5450 the returned value was always junk.
And if I recall correctly, on an NVIDIA GPU the first run returned the correct value, i.e. 16, but subsequent runs returned 32, 48, etc. The kernel was reusing the same memory location with the old value still stored there.
When I corrected my host code with this line (copying the 0 value to the buffer):
out_buf = cl.Buffer(self.ctx, self.mf.WRITE_ONLY | self.mf.COPY_HOST_PTR, hostbuf=result)
everything worked fine on all devices.
As far as I know, there is no statement in the standard that guarantees this.
Some driver implementations may do it automatically, but you shouldn't rely on it.
I remember one case where a buffer was not initialized to 0, but I can't remember the OS + driver combination.
What is probably going on is that a typical OS uses far less than 1% of a modern device's memory, so when you start an OpenCL application there is a high probability that your allocation lands in a zone that has never been written, and therefore still reads as zero.
It depends on the platform you are developing for. As @DarkZeros mentioned in the previous answer, the spec does not guarantee anything; see page 104 of the OpenCL 2.1 spec.
However, in our experience on Mali GPUs, the driver initializes all elements of a newly allocated buffer to zero, but only on first touch. Later, after the buffer is released and its memory space is occupied by a new buffer, that memory is not re-initialized to zero. Again: the first touch sees zero values; after that, you see normal gibberish values.
Hope this helps after such long time!
I am writing an OpenCL code to find the shortest paths from each node to the others in a graph using BFS. (Here are the details of what I am doing: Shortest paths by BFS, porting a code from CUDA to OpenCL.)
Here is how I split the data to pass to clEnqueueNDRangeKernel:
size_t global_size, local_size;
local_size=1024;
global_size=ceil(e_count/(float)local_size)*local_size;
cl_event sync1;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &global_size, &local_size, 0, NULL, &sync1);
err = clWaitForEvents(1, &sync1); // wait for the kernel to finish (synchronize)
The code works well with an edge count <= 50000 (though it is considerably slower than its equivalent CPU version). When I increase the number of edges, the program just exits and gives error -58 (after clEnqueueNDRangeKernel).
I am using NVIDIA Geforce 630M.
How can I figure out what happened and how to fix the problem?
Best Regards
Error -58 is CL_INVALID_EVENT (as you can see in cl.h), and it isn't returned by clEnqueueNDRangeKernel, only by clWaitForEvents. So you're probably only checking for errors on the latter function. To find the actual error, check the return value of clEnqueueNDRangeKernel against the status constants it can return. In fact, you should do this for all host OpenCL functions, or it will be very difficult to determine exactly what kind of errors are occurring.
In this specific case, I bet you have a memory related error, such as CL_OUT_OF_RESOURCES (not enough local or private memory for your kernels).
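A minimal sketch of such a check helper (the numeric codes are the real ones from cl.h; the table only lists the few codes relevant to this question):

```python
# A few of the error codes defined in CL/cl.h:
CL_ERROR_NAMES = {
    0: "CL_SUCCESS",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -58: "CL_INVALID_EVENT",
}

def check(err, call_name):
    """Raise a descriptive error if an OpenCL host call did not succeed."""
    if err != 0:
        name = CL_ERROR_NAMES.get(err, "unknown error code")
        raise RuntimeError(f"{call_name} failed with {err} ({name})")
```

Wrapping every host call this way (e.g. `check(err, "clEnqueueNDRangeKernel")` right after the enqueue) immediately shows which call actually failed, instead of a later call reporting a stale event.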
Hope this helps.
How can one find out the size of the biggest 2D array that could be made in an OpenCL kernel?
For example
int anArray[1000][1000]; inside a kernel works fine,
but when I rewrite it for a bigger scenario like
int anArray[5000][5000]; the run fails.
I would like to know exactly which factor or factors decide the maximum array size that can run successfully.
You can retrieve this kind of information using clGetDeviceInfo.
The following parameters should help you (depending on how your kernel uses the array):
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
Reference : http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html
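To see why the 5000x5000 case fails, it helps to compute the raw sizes. OpenCL C guarantees that int is 32 bits (4 bytes); the local-memory figures mentioned in the comment are only typical values, not guarantees for any particular device:

```python
def array_bytes(rows, cols, elem_size=4):
    """Bytes needed for a rows x cols array of 4-byte ints."""
    return rows * cols * elem_size

small = array_bytes(1000, 1000)   # 4,000,000 bytes (~3.8 MiB)
big = array_bytes(5000, 5000)     # 100,000,000 bytes (~95 MiB)

# Typical CL_DEVICE_LOCAL_MEM_SIZE values are only tens of KiB, so even
# the "working" 1000x1000 array cannot fit in on-chip local/private
# memory and must be spilled elsewhere by the compiler; the 5000x5000
# version is 25x larger still, which is where the run starts failing.
```

Comparing these numbers against the clGetDeviceInfo values listed above for your own device shows how far past the on-chip limits such in-kernel arrays go.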