I have a local uchar pointer, and I would like to cast it to a local ulong pointer.
i.e.
local uchar* foo;
local ulong* bar = (local ulong*)foo;
When I do this, the memory bar points to does not equal the memory that foo points to. Is this a bug, or am I doing something wrong?
You can refer to these pages:
https://software.intel.com/en-us/articles/the-generic-address-space-in-opencl-20
http://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/opencl-c/
How to type-cast char* to int* in openCL
void foo(global unsigned int *bar) // 'global' address space on bar; works in both OCL 1.2 and OCL 2.0 with no additional compile flags
{
local unsigned int *temp = NULL; // 'local' address space on temp; works in both OCL 1.2 and OCL 2.0 with no additional compile flags
}
Related
I'm writing a renderer from scratch using openCL and I have a little compilation problem on my kernel with the error :
CL_BUILD_PROGRAM : error: program scope variable must reside in constant address space static float* objects;
The problem is that this program compiles on my desktop (with nvidia drivers) and doesn't work on my laptop (with nvidia drivers), also I have the exact same kernel file in another project that works fine on both computers...
Does anyone have an idea what I could be doing wrong ?
As a clarification, I'm coding a raymarcher whose kernel takes a list of objects "encoded" in a float array. The array is needed in many places in the program, which is why I need it accessible to the whole kernel.
Here is the kernel code simplified :
float* objects;
float4 getDistCol(float3 position) {
int arr_length = objects[0];
float4 distCol = {INFINITY, 0, 0, 0};
int index = 1;
while (index < arr_length) {
float objType = objects[index];
if (compare(objType, SPHERE)) {
// Treats the part of the buffer as a sphere
index += SPHERE_ATR_LENGTH;
} else if (compare(objType, PLANE)) {
//Treats the part of the buffer as a plane
index += PLANE_ATR_LENGTH;
} else {
float4 errCol = {500, 1, 0, 0};
return errCol;
}
}
}
__kernel void mkernel(__global int *image, __constant int *dimension,
__constant float *position, __constant float *aimDir, __global float *objs) {
objects = objs;
// Gets ray direction and stuff
// ...
// ...
float4 distCol = RayMarch(ro, rd);
float3 impact = rd*distCol.x + ro;
col = distCol.yzw * GetLight(impact);
image[dimension[0]*dimension[1] - idx*dimension[1]+idy] = toInt(col);
}
Where getDistCol(float3 position) gets called a lot by a lot of functions and I would like to avoid having to pass my float buffer to every function that needs to call getDistCol()...
OpenCL C does not allow "static" variables that are declared outside of kernels and shared across kernels. Some compilers might still tolerate this; others might not. Nvidia recently changed its OpenCL compiler from LLVM 3.4 to NVVM 7 in a driver update, so you may have two different compilers on your desktop and laptop GPUs.
In your case, the solution is to hand the global kernel parameter pointer over to the function:
float4 getDistCol(float3 position, __global float *objects) {
int arr_length = objects[0]; // access objects normally, as you would in the kernel
// ...
}
kernel void mkernel(__global int *image, __constant int *dimension, __constant float *position, __constant float *aimDir, __global float *objs) {
// ...
getDistCol(position, objs); // hand global objs pointer over to function
// ...
}
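To illustrate the pattern of handing the buffer to the function instead of reading it from a file-scope variable, here is a minimal sketch in plain C (not OpenCL C). The tag values, the record lengths, and the name count_objects are hypothetical; the original post does not give them. The buffer layout is assumed to be [total_length, tag, attributes..., tag, attributes..., ...], with each record length including its tag, as in the loop from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tags and record lengths (tag included in the length). */
enum { SPHERE = 1, PLANE = 2 };
enum { SPHERE_ATR_LENGTH = 5, PLANE_ATR_LENGTH = 7 };

/* Walks a buffer laid out as [total_length, tag, attrs..., tag, attrs..., ...]
 * and counts its records. The buffer arrives as a parameter, so no
 * file-scope pointer is needed. Returns -1 if an unknown tag is found. */
int count_objects(const float *objects)
{
    assert(objects != NULL);
    int arr_length = (int)objects[0];
    int index = 1;
    int count = 0;
    while (index < arr_length) {
        int tag = (int)objects[index];
        if (tag == SPHERE) {
            index += SPHERE_ATR_LENGTH;   /* skip over the sphere record */
        } else if (tag == PLANE) {
            index += PLANE_ATR_LENGTH;    /* skip over the plane record */
        } else {
            return -1;                    /* malformed buffer */
        }
        ++count;
    }
    return count;
}
```

In the kernel the same shape applies: every helper that needs the buffer takes a __global float * argument and forwards it to the functions it calls.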
Program-scope variables are only allowed in the constant address space, which is useful for large tables. They are cached in L2$, so read-only access is potentially faster. Example:
constant float objects[1234] = {
1.0f, 2.0f, ...
};
I would like to have a variable with read access from all kernels/functions inside a CL program. For this I have created a variable at the top of the file and prefixed it with __global.
typedef struct{
/* whatever */
} GlobalParameters;
__global GlobalParameters params;
How can I set the values inside that struct from the host code now? Is that even possible, or how else can I edit it? Or do I have to pass it as a parameter to the kernel every time I need it?
Program scope variables are meant to be constants and need to be initialized.
So, this works like:
typedef struct{
float whatever;
} GlobalParameters;
__constant GlobalParameters params=(GlobalParameters){3.14f};
Then you can use it anywhere. If changing it at OpenCL compile time is acceptable, you can alter it with string replacement after preparing the host-side constant buffer:
typedef struct{
float whatever;
} GlobalParameters;
__constant GlobalParameters params=(GlobalParameters){##replace_0##};
If a value stays in use for minutes between changes, you can recompile the program with a new string replacement before each device-side kernel compilation. If there is a fixed number of parameter sets, you can compile N times to get different kernel programs and switch between them using different contexts.
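The host-side string replacement could look roughly like the sketch below. It is plain C string handling with no OpenCL calls; the function name substitute_float is hypothetical, and it only replaces the first occurrence of the placeholder. The value is printed with "%.9e" plus an "f" suffix so that the result is always a valid float literal (a bare "3f" would not be).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly malloc'd copy of src in which the first occurrence of
 * placeholder is replaced by value written as a float literal.
 * Returns NULL if the placeholder is absent or allocation fails.
 * The caller frees the result. */
char *substitute_float(const char *src, const char *placeholder, float value)
{
    assert(src != NULL && placeholder != NULL);

    const char *hit = strstr(src, placeholder);
    if (hit == NULL)
        return NULL;

    /* e.g. 0.5f becomes "5.000000000e-01f" */
    char literal[64];
    snprintf(literal, sizeof literal, "%.9ef", (double)value);

    size_t prefix = (size_t)(hit - src);
    size_t lit_len = strlen(literal);
    const char *rest = hit + strlen(placeholder);

    char *out = malloc(prefix + lit_len + strlen(rest) + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, src, prefix);                 /* text before the placeholder */
    memcpy(out + prefix, literal, lit_len);   /* the substituted literal */
    strcpy(out + prefix + lit_len, rest);     /* text after it, with the NUL */
    return out;
}
```

The returned string would then be handed to clCreateProgramWithSource in place of the original source. A related option that avoids editing the source text is to pass the value as a macro definition in the build options string of clBuildProgram (for example "-D WHATEVER=3.14f", where WHATEVER is a name of your choosing).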
I'm working on a project where I need my CUDA device to make computations on a struct containing pointers.
typedef struct StructA {
int* arr;
} StructA;
When I allocate memory for the struct and then copy it to the device, it will only copy the struct and not the content of the pointer. Right now I'm working around this by allocating the pointer first, then setting the host struct to use that new pointer (which resides on the GPU). The following code sample describes this approach using the struct from above:
#define N 10
int main() {
int h_arr[N] = {1,2,3,4,5,6,7,8,9,10};
StructA *h_a = (StructA*)malloc(sizeof(StructA));
StructA *d_a;
int *d_arr;
// 1. Allocate device struct.
cudaMalloc((void**) &d_a, sizeof(StructA));
// 2. Allocate device pointer.
cudaMalloc((void**) &(d_arr), sizeof(int)*N);
// 3. Copy pointer content from host to device.
cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice);
// 4. Point to device pointer in host struct.
h_a->arr = d_arr;
// 5. Copy struct from host to device.
cudaMemcpy(d_a, h_a, sizeof(StructA), cudaMemcpyHostToDevice);
// 6. Call kernel.
kernel<<<N,1>>>(d_a);
// 7. Copy struct from device to host.
cudaMemcpy(h_a, d_a, sizeof(StructA), cudaMemcpyDeviceToHost);
// 8. Copy pointer from device to host.
cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);
// 9. Point to host pointer in host struct.
h_a->arr = h_arr;
}
My question is: Is this the way to do it?
It seems like an awful lot of work, and I remind you that this is a very simple struct. If my struct contained a lot of pointers or structs with pointers themselves, the code for allocation and copy will be quite extensive and confusing.
Edit: CUDA 6 introduces Unified Memory, which makes this "deep copy" problem a lot easier. See this post for more details.
Don't forget that you can pass structures by value to kernels. This code works:
// pass struct by value (may not be efficient for complex structures)
__global__ void kernel2(StructA in)
{
in.arr[threadIdx.x] *= 2;
}
Doing so means you only have to copy the array to the device, not the structure:
int h_arr[N] = {1,2,3,4,5,6,7,8,9,10};
StructA h_a;
int *d_arr;
// 1. Allocate device array.
cudaMalloc((void**) &(d_arr), sizeof(int)*N);
// 2. Copy array contents from host to device.
cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice);
// 3. Point to device pointer in host struct.
h_a.arr = d_arr;
// 4. Call kernel with host struct as argument
kernel2<<<N,1>>>(h_a);
// 5. Copy pointer from device to host.
cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);
// 6. Point to host pointer in host struct
// (or do something else with it if this is not needed)
h_a.arr = h_arr;
As pointed out by Mark Harris, structures can be passed by value to CUDA kernels. However, some care should be taken to set up a proper destructor, since the destructor is called at exit from the kernel.
Consider the following example
#include <stdio.h>
#include "Utilities.cuh"
#define NUMBLOCKS 512
#define NUMTHREADS (512 * 2)
/***************/
/* TEST STRUCT */
/***************/
struct Lock {
int *d_state;
// --- Constructor
Lock(void) {
int h_state = 0; // --- Host side lock state initializer
gpuErrchk(cudaMalloc((void **)&d_state, sizeof(int))); // --- Allocate device side lock state
gpuErrchk(cudaMemcpy(d_state, &h_state, sizeof(int), cudaMemcpyHostToDevice)); // --- Initialize device side lock state
}
// --- Destructor (wrong version)
//~Lock(void) {
// printf("Calling destructor\n");
// gpuErrchk(cudaFree(d_state));
//}
// --- Destructor (correct version)
// __host__ __device__ ~Lock(void) {
//#if !defined(__CUDACC__)
// gpuErrchk(cudaFree(d_state));
//#else
//
//#endif
// }
// --- Lock function
__device__ void lock(void) { while (atomicCAS(d_state, 0, 1) != 0); }
// --- Unlock function
__device__ void unlock(void) { atomicExch(d_state, 0); }
};
/**********************************/
/* BLOCK COUNTER KERNEL WITH LOCK */
/**********************************/
__global__ void blockCounterLocked(Lock lock, int *nblocks) {
if (threadIdx.x == 0) {
lock.lock();
*nblocks = *nblocks + 1;
lock.unlock();
}
}
/********/
/* MAIN */
/********/
int main(){
int h_counting, *d_counting;
Lock lock;
gpuErrchk(cudaMalloc(&d_counting, sizeof(int)));
// --- Locked case
h_counting = 0;
gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));
blockCounterLocked<<<NUMBLOCKS, NUMTHREADS>>>(lock, d_counting);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
printf("Counting in the locked case: %i\n", h_counting);
gpuErrchk(cudaFree(d_counting));
}
with the first ("wrong version") destructor uncommented (do not pay too much attention to what the code actually does). If you run that code, you will receive the following output
Calling destructor
Counting in the locked case: 512
Calling destructor
GPUassert: invalid device pointer D:/Project/passStructToKernel/passClassToKernel/Utilities.cu 37
There are then two calls to the destructor, once at the kernel exit and once at the main exit. The error message is related to the fact that, if the memory locations pointed to by d_state are freed at the kernel exit, they cannot be freed anymore at the main exit. Accordingly, the destructor must be different for host and device executions. This is accomplished by the commented destructor in the above code.
A struct of arrays is a nightmare in CUDA. You will have to copy each of the pointers to a new struct that the device can use. Maybe you could use an array of structs instead? If not, the only way I have found is to attack it the way you do, which is in no way pretty.
EDIT:
Since I can't comment on the top post: step 9 is redundant, since you can change steps 8 and 9 into
// 8. Copy pointer from device to host.
cudaMemcpy(h->arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);
people, i've an issue now..
#include <stdio.h>
#include <stdlib.h>
typedef struct a
{
int *aa;
int *bb;
struct b *wakata;
}a;
typedef struct b
{
int *you;
int *me;
}b;
int main()
{
a *aq;
aq = (a*)malloc(sizeof(a*));
*aq->wakata->you = 1;
*aq->wakata->me = 2;
free(aq);
return 0;
}
and compiled, then debugged :
gcc -o tes tes.c --debug
sapajabole#cintajangankaupergi:/tmp$ gdb -q ./tes
Reading symbols from /tmp/tes...done.
(gdb) r
Starting program: /tmp/tes
Program received signal SIGSEGV, Segmentation fault.
0x08048414 in main () at tes.c:22
22 *aq->wakata->you = 1;
well, the question is, how to set the value to variable inside struct 'b' through struct 'a' ?
anyone ?
The initial allocation of a is only allocating 4 bytes (in a 32-bit architecture). It should be:
aq = (a*)malloc(sizeof(a));
And wakata has not been initialized. Maybe this:
aq->wakata = (b*)malloc(sizeof(b));
And it will need a corresponding free as well prior to the free of aq.
free(aq->wakata);
And since you have pointers to the integers, those would also need to be allocated (you and me). But it is not clear if that is your goal. You probably should remove the * from the int declarations so that they are simply int members rather than the pointers to int.
Looks like you have a few mistakes here. See the code below.
In general a few things to keep in mind. You can't access memory before you malloc it. Also, there is a difference between memory and pointers e.g. int and int *
#include <stdio.h>
#include <stdlib.h>
typedef struct a
{
int aa;
int bb;
struct b *wakata;
}a;
typedef struct b
{
int you;
int me;
}b;
int main()
{
a * aq = malloc(sizeof(a));
aq->wakata = malloc(sizeof(b));
aq->wakata->you = 1;
aq->wakata->me = 2;
free(aq->wakata);
free(aq);
return 0;
}
wakata isn't pointing to any valid memory. You have to malloc memory for it, and then also for wakata->you and wakata->me
Pointers do not contain data. They point at data. That is why they are called pointers.
When you malloc enough space to store an a instance named aq, you allocate space for the pointers contained in that structure. You do not cause them to point at anything, nor do you allocate space to contain the things that they would point at.
You're not allocating space for b in struct a. You have defined 'a' as holding pointers, not structs. Also, I think malloc(sizeof(a*)) should be malloc(sizeof(a))
aq = (a*)malloc(sizeof(a)); // You should probably use calloc here
aq->wakata = (b*)malloc(sizeof(b));
you and me don't seem to need to be pointers, just normal ints
You have some problems with your code.
When you allocate memory for the struct a, you should do
aq = (a*)malloc(sizeof(a));
You now allocated memory for the struct a, but not for the struct b pointed by the wakata member, so you need to do
aq->wakata = (b*)malloc(sizeof(b));
Finally, in the struct b there should not be int* members, but int members. This way, you'll be able to correctly assign a value to them.
Remember that you should check for the correct allocation of memory by checking if the malloc return value is not NULL.
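Putting the points from these answers together (allocate sizeof(a) rather than sizeof(a*), allocate the nested b, use plain int members, check every allocation, free in reverse order), a minimal sketch might look like this. The helper names a_create, a_set and a_destroy are made up for illustration; the struct names follow the question.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct b {
    int you;
    int me;
} b;

typedef struct a {
    int aa;
    int bb;
    b *wakata;
} a;

/* Allocates an 'a' together with the 'b' it points to, zero-initialized.
 * Returns NULL, leaking nothing, if either allocation fails. */
a *a_create(void)
{
    a *aq = calloc(1, sizeof *aq);
    if (aq == NULL)
        return NULL;
    aq->wakata = calloc(1, sizeof *aq->wakata);
    if (aq->wakata == NULL) {
        free(aq);
        return NULL;
    }
    return aq;
}

/* Sets the members of struct b through struct a. */
void a_set(a *aq, int you, int me)
{
    assert(aq != NULL && aq->wakata != NULL);
    aq->wakata->you = you;
    aq->wakata->me = me;
}

/* Frees the inner struct first, then the outer one; accepts NULL. */
void a_destroy(a *aq)
{
    if (aq == NULL)
        return;
    free(aq->wakata);
    free(aq);
}
```

A caller would write a *aq = a_create();, check the result against NULL, call a_set(aq, 1, 2);, and finish with a_destroy(aq);.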
have C sources that must compile in 32bit and 64bit for multiple platforms.
structure that takes the address of a buffer - need to fit address in a 32bit value.
obviously where possible these structures will use natural sized void * or char * pointers.
however for some parts an api specifies the size of these pointers as 32bit.
on x86_64 linux with -m64 -mcmodel=small both static data and malloc()'d data fit within the 2GB range. data on the stack, however, still starts in high memory.
so given a small utility _to_32() such as:
int _to_32( long l ) {
int i = l & 0xffffffff;
assert( i == l );
return i;
}
then:
char *cp = malloc( 100 );
int a = _to_32( cp );
will work reliably, as would:
static char buff[ 100 ];
int a = _to_32( buff );
but:
char buff[ 100 ];
int a = _to_32( buff );
will fail the assert().
anyone have a solution for this without writing custom linker scripts?
or any ideas how to arrange the linker section for stack data, would appear it is being put in this section in the linker script:
.lbss :
{
*(.dynlbss)
*(.lbss .lbss.* .gnu.linkonce.lb.*)
*(LARGE_COMMON)
}
thanks!
The stack location is most likely specified by the operating system and has nothing to do with the linker.
I can't imagine why you are trying to force a pointer on a 64 bit machine into 32 bits. The memory layout of structures is mainly important when you are sharing the data with something which may run on another architecture and saving to a file or sending across a network, but there are almost no valid reasons that you would send a pointer from one computer to another. Debugging is the only valid reason that comes to mind.
Even storing a pointer to be used later by another run of your program on the same machine would almost certainly be wrong since where your program is loaded can differ. Making any use of such a pointer would be undefined and unpredictable.
the short answer appears to be there is no easy answer. at least no easy way to reassign range/location of the stack pointer.
the loader 'ld-linux.so' at a very early stage in process activation gets the address in the hurd loader - in the glibc sources, elf/ and sysdeps/x86_64/ search out elf_machine_load_address() and elf_machine_runtime_setup().
this happens in the preamble of calling your _start() entry and related setup to call your main(), is not for the faint hearted, even i couldn't convince myself this was a safe route.
as it happens - the resolution presents itself in some other old school tricks... pointer deflations/inflation...
with -mcmodel=small then automatic variables, alloca() addresses, and things like argv[], and envp are assigned from high memory from where the stack will grow down. those addresses are verified in this example code:
#include <stdlib.h>
#include <stdio.h>
#include <alloca.h>
extern char etext, edata, end;
char global_buffer[128];
int main( int argc, const char *argv[], const char *envp[] )
{
char stack_buffer[128];
static char static_buffer[128];
char *cp = malloc( 128 );
char *ap = alloca( 128 );
char *xp = "STRING CONSTANT";
printf("argv[0] %p\n",argv[0]);
printf("envp %p\n",envp);
printf("stack %p\n",stack_buffer);
printf("global %p\n",global_buffer);
printf("static %p\n",static_buffer);
printf("malloc %p\n",cp);
printf("alloca %p\n",ap);
printf("const %p\n",xp);
printf("printf %p\n",printf);
printf("First address past:\n");
printf(" program text (etext) %p\n", &etext);
printf(" initialized data (edata) %p\n", &edata);
printf(" uninitialized data (end) %p\n", &end);
}
produces this output:
argv[0] 0x7fff1e5e7d99
envp 0x7fff1e5e6c18
stack 0x7fff1e5e6a80
global 0x6010e0
static 0x601060
malloc 0x602010
alloca 0x7fff1e5e69d0
const 0x400850
printf 0x4004b0
First address past:
program text (etext) 0x400846
initialized data (edata) 0x601030
uninitialized data (end) 0x601160
all access to/from the 32bit parts of structures must be wrapped with inflate() and deflate() routines, e.g.:
void *inflate( unsigned long );
unsigned int deflate( void *);
deflate() tests for bits set in the range 0x7fff00000000 and marks the pointer so that inflate() will recognize how to reconstitute the actual pointer.
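the post does not show the bodies of inflate() and deflate(), so the following is only one possible sketch of the idea, under stated assumptions: a 64-bit target, low addresses below 2 GiB (as with -mcmodel=small), and all high addresses inside a single 2 GiB window starting at high_base. the value 0x7fff00000000 is taken from the sample output above and is an assumption; with address space randomization real code would have to pick the base at startup. the helpers deflate_addr and inflate_addr and the tag-bit scheme are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_BIT   UINT32_C(0x80000000)   /* marks a deflated high address */
#define LOW_LIMIT ((uintptr_t)TAG_BIT)   /* low addresses must stay below 2 GiB */

/* Assumed base of the 2 GiB window that holds stack addresses. */
static uintptr_t high_base = (uintptr_t)UINT64_C(0x7fff00000000);

/* Packs an address into 32 bits: low addresses pass through unchanged,
 * high addresses become a tagged offset from high_base. */
uint32_t deflate_addr(uintptr_t addr)
{
    if (addr < LOW_LIMIT)
        return (uint32_t)addr;
    assert(addr >= high_base && addr - high_base < LOW_LIMIT);
    return TAG_BIT | (uint32_t)(addr - high_base);
}

/* Reverses deflate_addr: a set tag bit means "offset from high_base". */
uintptr_t inflate_addr(uint32_t v)
{
    if (v & TAG_BIT)
        return high_base + (v & ~TAG_BIT);
    return v;
}

/* Pointer-typed wrappers with the prototypes given above. */
unsigned int deflate(void *p)
{
    return deflate_addr((uintptr_t)p);
}

void *inflate(unsigned long v)
{
    return (void *)inflate_addr((uint32_t)v);
}
```

the tag bit costs one bit of range on both sides, which is why the sketch asserts that the offset from high_base stays below 2 GiB; a stack placed higher in the window would need a different base.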
hope that helps if anyone similarly must support structures with 32bit storage for 64bit pointers.