I found many threads on this topic, but none of them answered my question explicitly.
I am trying to use a multidimensional array inside a GPU kernel using thrust. Flattening would be difficult, as the dimensions are non-homogeneous and I go up to 4D. Now I know I cannot have device_vectors of device_vectors, for whichever underlying reason (explanation would be welcome), so I tried going the raw-pointer route.
My reasoning is that a raw pointer points to memory on the GPU; why else would I be able to access it from within the kernel? So I should technically be able to have a device_vector that holds raw pointers, all of which should be accessible from within the GPU. Based on this, I constructed the following code:
thrust::device_vector<Vector3r*> d_fluidmodelParticlePositions(nModels);
thrust::device_vector<unsigned int***> d_allFluidNeighborParticles(nModels);
thrust::device_vector<unsigned int**> d_nFluidNeighborsCrossFluids(nModels);
for(unsigned int fluidModelIndex = 0; fluidModelIndex < nModels; fluidModelIndex++)
{
FluidModel *model = sim->getFluidModelFromPointSet(fluidModelIndex);
const unsigned int numParticles = model->numActiveParticles();
thrust::device_vector<Vector3r> d_neighborPositions(model->getPositions().begin(), model->getPositions().end());
d_fluidmodelParticlePositions[fluidModelIndex] = CudaHelper::GetPointer(d_neighborPositions);
thrust::device_vector<unsigned int**> d_fluidNeighborIndexes(nModels);
thrust::device_vector<unsigned int*> d_nNeighborsFluid(nModels);
for(unsigned int pid = 0; pid < nModels; pid++)
{
FluidModel *fm_neighbor = sim->getFluidModelFromPointSet(pid);
thrust::device_vector<unsigned int> d_nNeighbors(numParticles);
thrust::device_vector<unsigned int*> d_neighborIndexesArray(numParticles);
for(unsigned int i = 0; i < numParticles; i++)
{
const unsigned int nNeighbors = sim->numberOfNeighbors(fluidModelIndex, pid, i);
d_nNeighbors[i] = nNeighbors;
thrust::device_vector<unsigned int> d_neighborIndexes(nNeighbors);
for(unsigned int j = 0; j < nNeighbors; j++)
{
d_neighborIndexes[j] = sim->getNeighbor(fluidModelIndex, pid, i, j);
}
d_neighborIndexesArray[i] = CudaHelper::GetPointer(d_neighborIndexes);
}
d_fluidNeighborIndexes[pid] = CudaHelper::GetPointer(d_neighborIndexesArray);
d_nNeighborsFluid[pid] = CudaHelper::GetPointer(d_nNeighbors);
}
d_allFluidNeighborParticles[fluidModelIndex] = CudaHelper::GetPointer(d_fluidNeighborIndexes);
d_nFluidNeighborsCrossFluids[fluidModelIndex] = CudaHelper::GetPointer(d_nNeighborsFluid);
}
The compiler does not complain, and accessing, for example, d_nFluidNeighborsCrossFluids from within the kernel runs, but it returns wrong values. I access it like this (again, from within a kernel):
d_nFluidNeighborsCrossFluids[iterator1][iterator2][iterator3];
// Note: out of bounds indexing guaranteed to not happen, indexing is definitely right
The question is, why does it return wrong values? The logic behind it should work in my opinion, since my indexing is correct and the pointers should be valid addresses from within the kernel.
Thank you already for your time and have a great day.
EDIT:
Here is a minimal reproducible example. For some reason the values appear correct even though the example has the same structure as my real code, but cuda-memcheck reports errors. Uncommenting the two commented lines leads to the main problem I am trying to solve. What is cuda-memcheck telling me here?
/* Part of this example has been taken from code of Robert Crovella
in a comment below */
#include <thrust/device_vector.h>
#include <stdio.h>
template<typename T>
static T* GetPointer(thrust::device_vector<T> &vector)
{
return thrust::raw_pointer_cast(vector.data());
}
__global__
void k(unsigned int ***nFluidNeighborsCrossFluids, unsigned int ****allFluidNeighborParticles){
const unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
if(i > 49)
return;
printf("i: %d nNeighbors: %d\n", i, nFluidNeighborsCrossFluids[0][0][i]);
//for(int j = 0; j < nFluidNeighborsCrossFluids[0][0][i]; j++)
// printf("i: %d j: %d neighbors: %d\n", i, j, allFluidNeighborParticles[0][0][i][j]);
}
int main(){
const unsigned int nModels = 2;
const int numParticles = 50;
thrust::device_vector<unsigned int**> d_nFluidNeighborsCrossFluids(nModels);
thrust::device_vector<unsigned int***> d_allFluidNeighborParticles(nModels);
for(unsigned int fluidModelIndex = 0; fluidModelIndex < nModels; fluidModelIndex++)
{
thrust::device_vector<unsigned int*> d_nNeighborsFluid(nModels);
thrust::device_vector<unsigned int**> d_fluidNeighborIndexes(nModels);
for(unsigned int pid = 0; pid < nModels; pid++)
{
thrust::device_vector<unsigned int> d_nNeighbors(numParticles);
thrust::device_vector<unsigned int*> d_neighborIndexesArray(numParticles);
for(unsigned int i = 0; i < numParticles; i++)
{
const unsigned int nNeighbors = i;
d_nNeighbors[i] = nNeighbors;
thrust::device_vector<unsigned int> d_neighborIndexes(nNeighbors);
for(unsigned int j = 0; j < nNeighbors; j++)
{
d_neighborIndexes[j] = i + j;
}
d_neighborIndexesArray[i] = GetPointer(d_neighborIndexes);
}
d_nNeighborsFluid[pid] = GetPointer(d_nNeighbors);
d_fluidNeighborIndexes[pid] = GetPointer(d_neighborIndexesArray);
}
d_nFluidNeighborsCrossFluids[fluidModelIndex] = GetPointer(d_nNeighborsFluid);
d_allFluidNeighborParticles[fluidModelIndex] = GetPointer(d_fluidNeighborIndexes);
}
k<<<256, 256>>>(GetPointer(d_nFluidNeighborsCrossFluids), GetPointer(d_allFluidNeighborParticles));
cudaError_t err = cudaGetLastError(); // read once: the call also resets the error state
if (err != cudaSuccess)
printf("Sync kernel error: %s\n", cudaGetErrorString(err));
cudaDeviceSynchronize();
}
A device_vector is a class definition. That class has various methods and operators associated with it. The thing that allows you to do this:
d_nFluidNeighborsCrossFluids[...]...;
is a square-bracket operator. That operator is a host operator (only). It is not usable in device code. Issues like this give rise to the general statements that "thrust::device_vector is not usable in device code." The device_vector object itself is generally not usable. However the data it contains is usable in device code, if you attempt to access it via a raw pointer.
Here is an example of a thrust device vector that contains an array of pointers to the data contained in other device vectors. That data is usable in device code, as long as you don't attempt to make use of the thrust::device_vector object itself:
$ cat t1509.cu
#include <thrust/device_vector.h>
#include <stdio.h>
template <typename T>
__global__ void k(T **data){
printf("the first element of vector 1 is: %d\n", (int)(data[0][0]));
printf("the first element of vector 2 is: %d\n", (int)(data[1][0]));
printf("the first element of vector 3 is: %d\n", (int)(data[2][0]));
}
int main(){
thrust::device_vector<int> vector_1(1,1);
thrust::device_vector<int> vector_2(1,2);
thrust::device_vector<int> vector_3(1,3);
thrust::device_vector<int *> pointer_vector(3);
pointer_vector[0] = thrust::raw_pointer_cast(vector_1.data());
pointer_vector[1] = thrust::raw_pointer_cast(vector_2.data());
pointer_vector[2] = thrust::raw_pointer_cast(vector_3.data());
k<<<1,1>>>(thrust::raw_pointer_cast(pointer_vector.data()));
cudaDeviceSynchronize();
}
$ nvcc -o t1509 t1509.cu
$ cuda-memcheck ./t1509
========= CUDA-MEMCHECK
the first element of vector 1 is: 1
the first element of vector 2 is: 2
the first element of vector 3 is: 3
========= ERROR SUMMARY: 0 errors
$
EDIT: In the MCVE you have now posted, you point out that an ordinary run of the code appears to give correct results, but errors are reported when you use cuda-memcheck. A general design problem in your code causes this.
In C++, when an object is defined within a curly-braces region:
{
{
Object A;
// object A is in-scope here
}
// object A is out-of-scope here
}
// object A is out of scope here
k<<<...>>>(anything that points to something in object A); // is illegal
and you exit that region, the object defined within the region is now out of scope. For objects with constructors/destructors, this usually means the destructor of the object will be called when it goes out-of-scope. For a thrust::device_vector (or std::vector) this will deallocate any underlying storage associated with that vector. That does not necessarily "erase" any data, but attempts to use that data are illegal and would be considered UB (undefined behavior) in C++.
When you establish pointers to such data inside an in-scope region, and then go out-of-scope, those pointers no longer point to anything that would be legal to access, so attempts to dereference the pointer would be illegal/UB. Your code is doing this. Yes, it does appear to give the correct answer, because nothing is actually erased on deallocation, but the code design is illegal, and cuda-memcheck will highlight that.
I suppose one fix would be to pull all this stuff out of the inner curly-braces, and put it at main scope, just like the d_nFluidNeighborsCrossFluids device_vector is. But you might also want to rethink your general data organization strategy and flatten your data.
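The lifetime rule can be sketched on the host alone. This is only an illustration under assumptions: std::vector stands in for thrust::device_vector, and the names NeighborCounts and buildCounts are hypothetical. The point is that each inner buffer is handed to an owner that outlives the loop, so the pointer table never refers to deallocated storage:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Host-only stand-in: std::vector plays the role of thrust::device_vector.
struct NeighborCounts {
    std::vector<std::vector<unsigned int>> owners;  // keeps every inner buffer alive
    std::vector<unsigned int*> pointers;            // what a kernel would receive
};

NeighborCounts buildCounts(std::size_t nModels, std::size_t numParticles) {
    NeighborCounts result;
    result.owners.reserve(nModels);
    for (std::size_t pid = 0; pid < nModels; ++pid) {
        std::vector<unsigned int> counts(numParticles);
        for (std::size_t i = 0; i < numParticles; ++i)
            counts[i] = static_cast<unsigned int>(i);
        // Moving hands the buffer to the outer container instead of letting
        // it be destroyed at the closing brace of this loop body.
        result.owners.push_back(std::move(counts));
        result.pointers.push_back(result.owners.back().data());
    }
    assert(result.pointers.size() == nModels);
    return result;
}
```

As long as the returned NeighborCounts object is alive, every entry of pointers refers to valid storage; the same ownership layout should carry over to device vectors declared at main scope.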
You should really provide a minimal, complete, verifiable/reproducible example; yours is neither minimal, nor complete, nor verifiable.
I will, however, answer your side-question:
I know I cannot have device_vectors of device_vectors, for whichever underlying reason (explanation would be welcome)
While a device_vector refers to a bunch of data on the GPU, it is a host-side data structure; otherwise you would not have been able to use it in host-side code. On the host side, it holds something like the capacity, the size in elements, the device-side pointer to the actual data, and maybe more information. This is similar to how a std::vector variable may refer to data on the heap, while the fields mentioned above exist on the stack if you create the variable locally.
Now, those fields of the device vector that are located in host memory are not generally accessible from the device-side. In device-side code you would typically use the raw pointer to the device-side data the device_vector manages.
Also, note that if you have a thrust::device_vector<T> v, each use of operator[] means a bunch of separate CUDA calls to copy data to or from the device (unless there's some caching going on under the hood). So you really want to avoid using square brackets with this structure.
Finally, remember that pointer-chasing can be a performance killer, especially on a GPU. You might want to consider massaging your data structure somewhat in order to make it amenable to flattening.
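One common way to flatten jagged (non-homogeneous) lists is an offsets-plus-values layout, similar to compressed sparse row storage. The following is a minimal host-side sketch, not code from the question: FlatNeighbors, flatten and neighbor are hypothetical names, and the two arrays would each be copied to the device once instead of being filled element by element:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct FlatNeighbors {
    std::vector<unsigned int> offsets;  // size = number of particles + 1
    std::vector<unsigned int> indices;  // all neighbor lists stored back to back
};

// Concatenate the per-particle lists and remember where each one starts.
FlatNeighbors flatten(const std::vector<std::vector<unsigned int>>& lists) {
    FlatNeighbors flat;
    flat.offsets.reserve(lists.size() + 1);
    flat.offsets.push_back(0);
    for (const auto& list : lists) {
        flat.indices.insert(flat.indices.end(), list.begin(), list.end());
        flat.offsets.push_back(static_cast<unsigned int>(flat.indices.size()));
    }
    return flat;
}

// Neighbor j of particle i; the neighbor count is offsets[i + 1] - offsets[i].
unsigned int neighbor(const FlatNeighbors& flat, std::size_t i, std::size_t j) {
    assert(i + 1 < flat.offsets.size());
    assert(flat.offsets[i] + j < flat.offsets[i + 1]);
    return flat.indices[flat.offsets[i] + j];
}
```

In a kernel, the lookup is then two plain array reads through two raw pointers, with no chain of dependent pointer dereferences. Additional dimensions (for example the pair of fluid models) could be handled with a further offsets array on top.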
I have a protothread set up and blocking ...
static int mythread(struct pt *pt){
    static int k;
    PT_BEGIN(pt);
    while(1){
        PT_WAIT_UNTIL(pt, eventA == 1); // blocked at lineA
        for(k=0;k<100;k++){
            // do something
            PT_YIELD(pt); // blocked at lineB
        }
        PT_WAIT_UNTIL(pt, eventB == 1); // blocked at lineC
    }
    PT_END(pt);
}
After a while, mythread can be blocked at "lineA", "lineB", or "lineC".
How could an external function, like main(), reset mythread so that it is blocked at the beginning ("lineA") again?
By running the macro PT_RESTART(&pt_mythread)? The compiler doesn't like it: my main() function isn't inside a PT_BEGIN/PT_END block, so the return inside that macro is invalid there.
Or running PT_INIT(&pt_mythread) again? Any suggestions?
Yes, calling PT_INIT from outside the protothread will restart it. If you look at the source for PT_RESTART:
#define PT_RESTART(pt) \
do { \
PT_INIT(pt); \
return PT_WAITING; \
} while(0)
This is exactly what it does, but then also returns (like a yield) out of the thread. As you say it's designed to be called from inside the protothread.
The protothread struct is basically just a number representing where it was in the thread:
struct pt {
lc_t lc; // where lc_t is an unsigned short;
};
So the only thing we need to do is reset that number to zero, which is exactly what PT_INIT does.
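To see the effect in isolation, here is a small sketch that compiles without the library. The macros below are simplified stand-ins written only for this illustration (the real pt.h differs in detail), and timesPastLineA and restart_from_outside are hypothetical names:

```cpp
#include <cassert>

// Simplified stand-ins for the protothreads macros (NOT the real pt.h).
struct pt { unsigned short lc; };
enum { PT_WAITING = 0, PT_ENDED = 3 };
#define PT_INIT(p) ((p)->lc = 0)
#define PT_BEGIN(p) switch ((p)->lc) { case 0:
#define PT_WAIT_UNTIL(p, cond) \
    do { (p)->lc = __LINE__; case __LINE__: \
         if (!(cond)) return PT_WAITING; } while (0)
#define PT_END(p) } PT_INIT(p); return PT_ENDED

static int eventA = 0, eventB = 0;
static int timesPastLineA = 0;  // hypothetical counter to observe progress

static int mythread(struct pt *p) {
    PT_BEGIN(p);
    while (1) {
        PT_WAIT_UNTIL(p, eventA == 1);  // lineA
        ++timesPastLineA;
        PT_WAIT_UNTIL(p, eventB == 1);  // lineC
    }
    PT_END(p);
}

// Safe to call from outside the thread function, e.g. from main():
// it only resets the stored position and contains no return statement.
static void restart_from_outside(struct pt *p) {
    assert(p != nullptr);
    PT_INIT(p);  // lc = 0, so the next call re-enters at PT_BEGIN
}
```

After restart_from_outside(&t), the next call to mythread(&t) starts at the top and waits on eventA again, regardless of where it was blocked before. Note that static locals such as k are not reset by this; only the position is.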
#include <stdio.h>
#include <conio.h>
int main ()
{
int numc;
puts ("NUMBER PLEASE");
numc=getchar();
printf ("%d");
getch ();
return 0;
}
I get the warning "numc is assigned a value that is never used", even though I am trying to print the value. Please help.
Did you mean:
printf ("%d", numc);
?
That would use the value the compiler is warning you about.
You never use the value of numc after assigning it; that is why the compiler gives the warning.
OK, I understand that the GCC 4.x warning "dereferencing type-punned pointer will break strict-aliasing rules" is no joke and I should clean up my code.
I have code which compiles and runs fine with GCC 3.x, and I would be very happy if it did so with GCC 4.x, too. Assume I want the generated code to be as short as possible: the function is passed a pointer and should write some data there. My original code uses the pointer directly on the stack (without a copy) and increments it there (note that I don't want to pass the incremented value back to the caller). Think also of passing parameters in registers: then any copy would be overhead.
So this was my "ideal" code:
void foo(void *pdataout) {
for (int i=16; i--;)
*(*reinterpret_cast<BYTE**>(&pdataout))++ = 255;
}
I tried some variant (note that the address-operator must be applied to 'pdataout' before any type-cast):
void foo(void *pdataout) {
BYTE *pdo = reinterpret_cast<BYTE*>(*reinterpret_cast<BYTE**>(&pdataout));
for (int i=16; i--;)
*pdo++ = 255;
}
and also this:
void foo(void *pdataout) {
BYTE *pdo = *reinterpret_cast<BYTE**>(&pdataout);
for (int i=16; i--;)
*pdo++ = 255;
}
Nothing pleases GCC 4.x... This last one does, but it uses a copy of the parameter, which I don't like. Is there a way to do this without the copy? I have no idea how to tell the compiler :-(
void foo(void *pdataout) {
BYTE *pdo = reinterpret_cast<BYTE*>(pdataout);
for (int i=16; i--;)
*pdo++ = 255;
}
As far as I understand now, even though GCC no longer warns, using the indirection via an additional variable is not safe!
For me (as a union is not usable), the only real solution is to use the -fno-strict-aliasing compiler option. Only with that does GCC assume that pointers of different types to the same memory address can refer to the same variable.
This article finally helped me to understand strict-aliasing.
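For comparison, here is a sketch of the form that converts the pointer value instead of reinterpreting the storage of the parameter. It assumes BYTE is an unsigned char (the question does not show its definition). As far as I can tell, the void* object is never accessed through a BYTE** lvalue here, so no type-punning takes place, and an optimizing compiler will usually keep the local pointer in a register rather than make a separate copy in memory:

```cpp
#include <cassert>

typedef unsigned char BYTE;  // assumption: the question's BYTE is an unsigned char

void foo(void *pdataout) {
    assert(pdataout != nullptr);
    // Converts the pointer value only; pdataout itself is read as a void*,
    // which is its declared type.
    BYTE *pdo = static_cast<BYTE*>(pdataout);
    for (int i = 16; i--;)
        *pdo++ = 255;  // writes exactly 16 bytes
}
```

Writing through an unsigned char pointer is allowed for any object type, so the stores themselves are also not an aliasing problem.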
In Qt I have the following code that starts a thread to send out commands. The thread takes a char * and an int as arguments. In run() I use the pointer given to the constructor. The code is:
MyThread::MyThread(char * payld, int payld_size)
{
payload_size = payld_size;
payload_p = payld;
}
void MyThread::run()
{
while(...)
{
sendCommand(payload_p, payload_size);
}
}
Unfortunately this doesn't work and my application crashes when I call thread.start(). But when I change it to:
MyThread::MyThread(char * payld, int payld_size)
{
payload_size = payld_size;
payload_p = payld;
for(int i=0; i<payload_size; i++)
{
payload[i] = payld[i];
}
}
void MyThread::run()
{
while(...)
{
sendCommand(payload, payload_size);
}
}
The code does run and only crashes sometimes (it looks pretty random to me). Can anybody explain why version one doesn't work and version two does? And any ideas on why the second version sometimes crashes? Could it be because the size of payload is not predefined? In the header file I defined it as:
char payload[];
When I define it as:
char payload[10];
it seems to work better, but it is annoying to test since the crashes are pretty random.
Instead of fiddling with char*, I would switch to QString (since you're using Qt). It takes a bit of learning, but it's almost mandatory for getting code to work smoothly in this framework. Then declare
QString payload;
and, depending on the sendCommand implementation, use one of QString's member functions to get the char*, like payload.toLatin1()
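If the payload is binary rather than text, the same idea (let the thread object own its own copy of the data) can be sketched in plain C++. This is only an illustration: PayloadHolder is a hypothetical stand-in for the thread class, and std::vector<char> is used so the example compiles without Qt (QByteArray would be the Qt counterpart). A member declared as char payload[]; most likely reserves no storage at all, which would explain the random crashes when copying into it:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical plain-C++ stand-in for the thread class in the question.
class PayloadHolder {
public:
    PayloadHolder(const char *payld, int payld_size)
        : payload_(payld, payld + payld_size)  // deep copy, sized at run time
    {
        assert(size() == payld_size);
    }
    // What run() would pass to sendCommand().
    const char *data() const { return payload_.data(); }
    int size() const { return static_cast<int>(payload_.size()); }
private:
    std::vector<char> payload_;  // owns the bytes for the object's lifetime
};
```

Because the object owns the bytes, it no longer matters whether the caller's buffer is changed or freed after the constructor returns.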