CUDA: Stack vector in different thread to a 1d vector - vector

I have a thrust vector for each thread in CUDA, and I want to stack vectors by orders (vector in thread 0, vector in thread 1,.... and vector in thread n) to create a 1d vector and send back to CPU. Is there a good way to do this? Any help is appreciated. Thank you.

The most performant way to store items from several threads into a single vector will be thread-interleaved. Suppose each of 4 threads (t0-t3) has 4 elements to store (e0-e3). The final storage pattern which will be most efficient will be:
t0e0 t1e0 t2e0 t3e0 t0e1 t1e1 t2e1 t3e1 t0e2 t1e2 t2e2 t3e2 t0e3 t1e3 t2e3 t3e3
The code to do that would look like this:
#include <thrust/device_vector.h>
const int nt = 4;
const int ne = 4;
template <typename T>
__global__ void k(T *d){
T e0 = threadIdx.x+10;
T e1 = threadIdx.x+20;
T e2 = threadIdx.x+30;
T e3 = threadIdx.x+40;
d[threadIdx.x] = e0;
d[threadIdx.x+nt] = e1;
d[threadIdx.x+2*nt] = e2;
d[threadIdx.x+3*nt] = e3;
}
int main(){
thrust::device_vector<int> d(ne*nt);
k<<<1,nt>>>(thrust::raw_pointer_cast(d.data()));
}
In your question it seems that you want this order:
t0e0 t0e1 t0e2 t0e3 t1e0 t1e1 t1e2 t1e3 t2e0 t2e1 t2e2 t2e3 t3e0 t3e1 t3e2 t3e3
That storage pattern will generally be less efficient, but you could achieve it like this:
#include <thrust/device_vector.h>
const int nt = 4;
const int ne = 4;
template <typename T>
__global__ void k(T *d){
T e0 = threadIdx.x+10;
T e1 = threadIdx.x+20;
T e2 = threadIdx.x+30;
T e3 = threadIdx.x+40;
d[threadIdx.x*nt] = e0;
d[threadIdx.x*nt+1] = e1;
d[threadIdx.x*nt+2] = e2;
d[threadIdx.x*nt+3] = e3;
}
int main(){
thrust::device_vector<int> d(ne*nt);
k<<<1,nt>>>(thrust::raw_pointer_cast(d.data()));
}
The storage efficiency difference in these two cases is the difference between uncoalesced and coalesced behavior, which is covered in numerous questions here on the cuda SO tag, such as this one.

Related

Access components of vector data type (e.g. int4) as array of scalars in OpenCL

I have kernel which evaluate interaction between all pairs of neighbors of an atoms. Each atom has max. 4 neighbors so I store their indexes in int4. But in order to loop over these neighbors I need to access them by index (neighs[0] rather than neighs.x ).
The loop should look something like:
int iatom = get_global_id(0);
int4 ng = neighs[iatom]; // each atoms has 4 neighbors
float4 p0 = atom_pos[iatom];
float4 force = (float)(0.f,0.f,0.f,0.f);
for(int i=0; i<4; i++){
int ing = ng[i]; // HERE: index into vector
float4 pi = atom_pos[ing];
for(int j=i+1; j<4; j++){
int jng = ng[j]; // HERE: index into vector
float4 pj = atom_pos[jng];
force += evalInteraction( p0, pi, pj );
}
}
forces[iatom]=force;
I have some idea how it can be probably done but not sure:
Unroll the loops
since there are just 4*3/2=6 pair-interactions it would be probably even more efficient. But it would be much less readable and more difficult do modify.
cast int4 to int*
but is it fine ? Doesn't it break something? Doesn't it make some performance issue? I mean this:
int4 ng_ = neighs[iatom]; // make sure we copy it to local memory or register
int* ng = (int*)&ng_; // pointer to local memory can be optimized out, right ?
for(int i=0; i<4; i++){
int ing = ng[i];
...
}
You can cast directly, but you can also declare a union for easier access:
union
{
int components[4];
int4 vector;
} neighbors;
neighbors.vector = ng;
neighbors.components[i]; // Works now

Heap Corruption error when calling C from R, can't find the source issue

UPDATE3: Problem is solved but I'm leaving the code here as-is for future reference--I've posted an answer below with the final state of the code in case people wanted to see the final product.
UPDATE2: Refactored to use R_alloc instead of calloc for automated cleanup. Unfortunately the problem persists.
UPDATE: If I add this line right before UNPROTECT(1):
Rprintf("%p %p %p", (void *)rans, (void *)fm, (void *)corrs);
then the function executes with no corrupted heap error. Maybe there's a background garbage collection call that corrupts one of the pointers prior to execution finishing, resulting in a write to a garbage pointer? Important to note here that if I don't print out all three of the pointer addresses, the error comes back.
Also I'm running this on an M1 Mac and compiling with clang via R CMD SHLIB, in case Apple silicon is to blame.
I'm at my wits end trying to debug this issue, and I figured I'd turn to SO for help. I'm writing a function in C to optimize some parts of my R code, and I'm getting a Heap Corruption Error when running the function many times. The function trimCovar() is called from R using the .Call("trimCovar", ...) interface.
I'm having a lot of difficulty debugging this for a few reasons:
I'm on OSX, so I can't use Valgrind
C function depends on inputs from R, so I can't debug the C code on its own
Heap corruption only occurs when calling the function many times within an R function
(just running .Call directly a bunch of times has no errors)
Error point is inconsistent
I start with two sets of vectors, and I condense them into a frequency matrix, where each column is a position in the vector set, and each row is a particular character that appears. I concatenate them into one matrix prior to passing in because it makes pre-processing easier. An toy example of the frequency matrix would be:
INPUT:
v1_1 = 101
v1_2 = 011
v2_1 = 111
v2_2 = 110
Frequency Matrix:
position: | 1_1 | 1_2 | 1_3 | 2_1 | 2_2 | 2_3 |
0: 0.5 0.5 0.0 0.0 0.0 0.5
1: 0.5 0.5 1.0 1.0 1.0 0.5
The goal is to find the NV highest correlated positions across the vector sets, which I do by calculating pairwise KL divergence of positions. These are stored in a linked list sorted in ascending order, and at the end I take the positions corresponding to the first NV entries. The R code I have can deparse everything else, so I really just need a vector of positions at the end (duplicates are allowed).
The function takes in 5 arguments:
fMAT: a frequency matrix (RObject, so gets read in as a flat vector)
fSP : columns in matrix corresponding to positions from the first vector set
sSP : same as fSP but for second vector set
NV : Number of values to return
NR : Number of columns in fMAT
The error returned is:
R(95564,0x104858580) malloc: Heap corruption detected, free list is damaged at 0x600000f10040
*** Incorrect guard value: 4626885667169763328
R(95564,0x104858580) malloc: *** set a breakpoint in malloc_error_break to debug
This only happens when I run an R function that calls this 10+ times, so I'm assuming that I'm just missing one or two small hanging pointers corrupting a memory reference. I've tried running this with gc() called in R immediately after each call, but it doesn't fix the problem. I'm not really sure what else to do at this point, I've tried using lldb but I'm not really sure how to use that program. From running lots of print statements I've determined that it usually crashes in the main loop (identified in code below), but it's inconsistent on when it crashes. I've also tried saving off erroneous inputs--I can rerun them individually with no issues, so it must be something relatively small that only appears over many runs.
Happy to provide more details if it would help. Code is listed at the bottom.
The only thing being allocated here are linked list nodes, and I thought I had free()'d them all prior to returning. I've also double checked the input values, so I'm 99.99% sure that I'm never referencing out of bounds on firstSeqPos, secondSeqPos, ans, or fm. I've also triple checked the R code surrounding this and can confidently say it is not the source of this error.
I haven't coded in C in a long time so I feel like I'm missing something obvious. If I really have to I can try to get ahold of a Linux box to run valgrind, but if there's another option I'd prefer it. Thanks in advance!
Code:
#include <R.h>
#include <Rdefines.h>
#include <Rinternals.h>
#include <math.h>
#include <stdlib.h>
#include <stdbool.h>
typedef struct node {
double data;
int i1;
int i2;
struct node *next;
} node;
// Linked list
// data is the correlation value,
// i1 the position from first vector set,
// i2 the position from second vector set
node *makeNewNode(double data, int i1, int i2){
node *newNode;
newNode = (node *)R_alloc(1, sizeof(node));
newNode->data = data;
newNode->i1 = i1;
newNode->i2 = i2;
newNode->next = NULL;
return(newNode);
}
//insert link in sorted order (ascending)
void insertSorted(node **head, node *toInsert, int maxSize) {
int ctr = 0;
if ((*head) == NULL || (*head)->data >= toInsert->data){
toInsert->next = *head;
*head = toInsert;
} else {
node *temp = *head;
while (temp->next != NULL && temp->next->data < toInsert->data){
temp = temp->next;
if (ctr == maxSize){
// Performance optimization, if we aren't inserting in the first NR
// positions then we can just skip since we only care about the NR
// lowest scores overall
return;
}
ctr += 1;
}
toInsert->next = temp->next;
temp->next = toInsert;
}
}
// MAIN FUNCTION CALLED FROM R
// (This is the one that crashes)
SEXP trimCovar(SEXP fMAT, SEXP fSP, SEXP sSP, SEXP NV, SEXP NR){
// Converting input SEXPs into C-compatible values
int nv = asInteger(NV);
int nr = asInteger(NR);
int sp1l = length(fSP);
int sp2l = length(sSP);
int *firstSeqPos = INTEGER(coerceVector(fSP, INTSXP));
int *secondSeqPos = INTEGER(coerceVector(sSP, INTSXP));
double *fm = REAL(fMAT);
int colv1, colv2;
// Using a linked list for efficient insert
node *corrs = NULL;
int cv1, cv2;
double p1, p2, score=0;
// USUALLY FAILS IN THIS LOOP
for ( int i=0; i<sp1l; i++ ){
cv1 = firstSeqPos[i];
colv1 = (cv1 - 1) * nr;
for ( int j=0; j<sp2l; j++ ){
cv2 = secondSeqPos[j];
colv2 = (cv2 - 1) * nr;
// KL Divergence
score = 0;
for ( int k=0; k<nr; k++){
p1 = fm[colv1 + k];
p2 = fm[colv2 + k];
if (p1 != 0 && p2 != 0){
score += p1 * log(p1 / p2);
}
}
// Add result into LL
node *newNode = makeNewNode(score, cv1, cv2);
insertSorted(&corrs, newNode, nv);
}
R_CheckUserInterrupt();
}
SEXP ans;
PROTECT(ans = allocVector(INTSXP, 2*nv));
int *rans = INTEGER(ans);
int ctr=0;
int pos1, pos2;
node *ptr = corrs;
for ( int i=0; i<nv; i++){
rans[2*i] = ptr->i1;
rans[2*i+1] = ptr->i2;
ptr = ptr->next;
}
UNPROTECT(1);
return(ans);
}
int *firstSeqPos = INTEGER(coerceVector(fSP, INTSXP));
int *secondSeqPos = INTEGER(coerceVector(sSP, INTSXP));
This is not good. The SEXPs returned by the 2 calls to coerceVector() need to be protected. However it's usually considered better practice to do this coercion at the R level right before entering the .Call entry point. Note that if fSP and sSP are integer matrices, there's no need to coerce them to integer as they are already seen as integer vectors at the C level. This also avoids a possibly expensive copy (as.integer() in R and coerceVector() in C both trigger a full copy of the matrix data).
The question was answered above, but I received a couple messages from people asking for the final code, so I'm going to include it as an answer to preserve the original question. There's a couple optimizations here (thanks to #hpages for help and troubleshooting regarding these):
Original code fails because the output of coerceVector() wasn't protected with PROTECT(). I've refactored the R code to check for integer inputs prior to calling this C function to avoid this function call and be more efficient with memory (see the accepted answer for more details).
Original code uses R_alloc(), which gives responsibility to R to clean up memory at the end of the function call. However, this introduces substantial memory overhead during the runtime of the function, since memory allocated to nodes not inserted into the linked list aren't cleared until the end of the function call.
Allocation with calloc() isn't as simple as switching over and calling free() at the end of the function, since we have to guard the case where the user interrupts execution of the program. If an interrupt signal is thrown prior to the end of the function, we'll never free the memory.
Final C Code:
#include <R.h>
#include <Rdefines.h>
#include <Rinternals.h>
#include <math.h>
#include <stdlib.h>
#include <stdbool.h>
typedef struct node {
double data;
int i1;
int i2;
struct node *next;
} node;
// Defining the head as a static so that we can access it globally
// Important for ensuring clean up in case of interrupt
static node *corrs = NULL;
// Function to clean up memory allocations in case of interrupt
void cleanupFxn(){
node *ptr = corrs;
// Free allocated memory in linked list
while (corrs != NULL){
ptr = corrs;
corrs = corrs->next;
free(ptr);
}
}
node *makeNewNode(double data, int i1, int i2){
node *newNode;
// very important to use calloc here so we have control of when we free it
// R_alloc() memory won't be freed until after function finishes execution
newNode = (node *)calloc(1, sizeof(node));
newNode->data = data;
newNode->i1 = i1;
newNode->i2 = i2;
newNode->next = NULL;
return(newNode);
}
// insert link in sorted order
// returns a bool corresponding to if we inserted
bool insertSorted(node **head, node *toInsert, int maxSize) {
int ctr = 0;
if ((*head) == NULL || (*head)->data >= toInsert->data){
toInsert->next = *head;
*head = toInsert;
return(true);
} else {
node *temp = *head;
while (temp->next != NULL && temp->next->data < toInsert->data){
temp = temp->next;
if (ctr == maxSize){
// Performance optimization, if we aren't inserting in the first NR
// positions then we can just skip since we only care about the NR
// lowest scores overall. Saves a huge amount of time and memory.
return(false);
}
ctr += 1;
}
toInsert->next = temp->next;
temp->next = toInsert;
return(true);
}
}
SEXP trimCovar(SEXP fMAT, SEXP fSP, SEXP sSP, SEXP NV, SEXP NR){
// Converting inputs into C-compatible forms
int nv = asInteger(NV);
int nr = asInteger(NR);
int sp1l = length(fSP);
int sp2l = length(sSP);
// Note here we're not using coerceVector() anymore
// typechecking done on R side
int *firstSeqPos = INTEGER(fSP);
int *secondSeqPos = INTEGER(sSP);
double *fm = REAL(fMAT);
int colv1, colv2;
// Using a linked list for efficient insert
corrs = NULL;
int cv1, cv2;
double p1, p2, score=0;
bool success;
for ( int i=0; i<sp1l; i++ ){
cv1 = firstSeqPos[i];
colv1 = (cv1 - 1) * nr;
for ( int j=0; j<sp2l; j++ ){
cv2 = secondSeqPos[j];
colv2 = (cv2 - 1) * nr;
score = 0;
for ( int k=0; k<nr; k++){
p1 = fm[colv1 + k];
p2 = fm[colv2 + k];
if (p1 != 0 && p2 != 0){
score += p1 * log(p1 / p2);
}
}
node *newNode = makeNewNode(score, cv1, cv2);
success = insertSorted(&corrs, newNode, nv);
// If we don't insert, free the associated memory
// I'm checking for NULL here just out of an abundance of caution
if (!success && newNode != NULL){
free(newNode);
newNode = NULL;
}
}
R_CheckUserInterrupt();
}
SEXP ans;
PROTECT(ans = allocVector(INTSXP, 2*nv));
int *rans = INTEGER(ans);
node *ptr=corrs;
for ( int i=0; i<nv; i++){
rans[2*i] = ptr->i1;
rans[2*i+1] = ptr->i2;
ptr = ptr->next;
}
// Free allocated memory in linked list
cleanupFxn();
UNPROTECT(1);
return(ans);
}
Assuming the C file is named trimCovar.c, we'd compile with R CMD SHLIB trimCovar.c.
R Code to run this function:
dyn.load("trimCovar.so")
# Wrapped into a function with on.exit(...) to ensure cleanup
# in the event the user or system interrupts execution early
CorrComp_C <- function(fm, fsp, ssp, nv, nr){
# type checking to ensure input to C is integer vector
# (could probably do more type checking here, mainly for illustration)
stopifnot(is(fsp, 'integer'))
stopifnot(is(ssp, 'integer'))
on.exit(.C("cleanupFxn"))
a <- .Call('trimCovar', fm, fsp, ssp, nv, nr)
return(a)
}

OpenCL: Optimize matrix multiplication for uchar

I adapted the attached kernel from one of the NVIDIA OpenCL examples and compared performance to clblasSgemm, and found that they perform equally fast (at least on my setup). I am launching it with a {16, 16} local work size.
Now, assume matrices A and B are both uchar, and C accordingly uint. Is there any way to optimize the multiplication? Simply replacing the types degraded performance. I tried hand-vectorizing with uchar4 and uchar16, but that made it slower.
Any suggestions welcome! (I am new to GPU programming and OpenCL)
/*
* This software contains source code provided by NVIDIA Corporation.
*/
#define BLOCK_SIZE 16
__kernel void mat_mul(const __global float* A, const __global float* B,
__global float* C,
const int A_cols, const int B_cols) {
// Block index
const int bx = get_group_id(0);
const int by = get_group_id(1);
// Thread index
const int tx = get_local_id(0);
const int ty = get_local_id(1);
// Index of the first sub-matrix of A processed by the block
const int a0 = A_cols * BLOCK_SIZE * by;
// Index of the last sub-matrix of A processed by the block
const int a1 = a0 + A_cols - 1;
const int a_step = BLOCK_SIZE;
// Index of the first sub-matrix of B processed by the block
const int b0 = BLOCK_SIZE * bx;
// Step size used to iterate through the sub-matrices of B
const int b_step = BLOCK_SIZE * B_cols;
// Csub is used to store the element of the block sub-matrix
// that is computed by the thread
float Csub = 0;
__local float As[BLOCK_SIZE][BLOCK_SIZE];
__local float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Loop over all the sub-matrices of A and B required to compute the
// block sub-matrix
for (int a=a0, b=b0; a<=a1; a+=a_step, b+=b_step) {
// Load the matrices from device memory to shared memory;
// each thread loads one element of each matrix
As[ty][tx] = A[a + A_cols * ty + tx];
Bs[ty][tx] = B[b + B_cols * ty + tx];
// Synchronize to make sure the matrices are loaded
barrier(CLK_LOCAL_MEM_FENCE);
// Multiply the two matrices together;
// each thread computes one element of the block sub-matrix
#pragma unroll
for (int k=0; k<BLOCK_SIZE; ++k) {
Csub += As[ty][k] * Bs[k][tx];
}
// Synchronize to make sure that the preceding computation is done
// before loading two new sub-matrices of A and B in the next
// iteration
barrier(CLK_LOCAL_MEM_FENCE);
}
// Write the block sub-matrix to device memory;
// each thread writes one element
C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
}
There is very simple way to measure if your kernel is good. Calculate it's OPS & bandwidth (how many data in form of matrix are you processing per second). Then compare it to theoretical limits. You will get factor, limiting performance. Usually, it's load-store operations.

How to store three small numbers into one double?

I have int A, B, C. And A is in range 0-9999, B is 0-99, C is 0-99.
Because the function must return only one double, I think of putting them all into one number. Otherwise I need to call function three times.
But I cannot write an efficient code to do this. This will be called millions times, so it should be quite effective, but no ASM.
I need a function double pack3int_to_double(int A, int B, int C) {}
Couldn't you just store A + 1000B + 100000C?
For example, if you wanted to store A = 1234, B = 6, and C = 89, you'd just store
89061234
CCBAAAA
You can then extract the numbers by casting the double to an int and using standard integer division and modulus tricks to recover the individual values.
Hope this helps!
If A<10,000 and B & C <100, A can be expressed with 14 bits, and B & C with 8 bits. Thus you need 30 bits in total.
You could therefore pack/unpack the integers by shifting it to the right place:
int packed = A + B<<14 + C<<22;
A = packed & 0x3FFF; B = (packed >> 14) & 0xFF; C = (packed >> 22) & 0xFF;
Bit shifting is of course MUCH faster than multiply/divide, and you can cast the int to a double and vice versa.
This is technically not legal C code, so you would use this at your own risk:
typedef union {
double x;
struct {
unsigned a : 14;
unsigned b : 7;
unsigned c : 7;
} y;
} result_t;
The C standard doesn't allow using a union member to write a value and a different one to read it out, but I am not aware of a compiler that does the static analysis to diagnose such a problem (it doesn't mean one won't do so in the future). Also, using certain int values may result in a trap representation for a double. But, if you know your system will not generate any trap representations, you can consider using this.
double pack3int_to_double(int A, int B, int C) {
result_t r;
r.y.a = A;
r.y.b = B;
r.y.c = C;
return r.x;
}
void unpack3int_from_double (double X, int *A, int *B, int *C) {
result_t r = { X };
*A = r.y.a;
*B = r.y.b;
*C = r.y.c;
}
You can use out parameters in function call and retrieve all 3 int variables.
You could return a NaN double with the data stored in the mantissa. That gives you 53 bits to utilize. Should be plenty.
http://en.m.wikipedia.org/wiki/NaN
Inspired by your answers, this is what I come up so far. This should be quite efficient, and only 32 bits are used, so the exponent of the double is not touched.
struct pack_abc {
unsigned short a;
unsigned char b, c;
int safety;
};
double pack3int_to_double(int A, int B, int C) {
struct pack_abc R = {A, B, C, 0}; // or 0 could be replaced with something smater, like NaN?
return *(double*)&R;
}
void main() {
int w = 1234, a = 56, d = 78;
int W, A, D, i;
double p = pack3int_to_double(w, a, d);
// we got the data packed into 'p', now let's unpack it
struct pack_abc *R = (struct pack_abc*) & p;
printf("%i %i %i\n", (int)R->a, (int)R->b, (int)R->c);
}

Different values between local and global memory after copy

I'm working in a GPU Kernel and I have some problems copying data from global to local memory
here is my kernel function:
__kernel void nQueens( __global int * data, __global int * result, int board_size)
so I want to copy from __global int * data to __local int aux_data[OBJ_SIZE]
I tried to copy like a normal array:
for(int i = 0; i < OBJ_SIZE; ++i)
{
aux_data[stack_size*OBJ_SIZE + i] = data[index*OBJ_SIZE + i];
}
and also with the functions to copy:
event_t e = async_work_group_copy ( aux_data, (data + (index*OBJ_SIZE)), OBJ_SIZE, 0);
wait_group_events (1, e);
And in both situations I get different values between the global and local memory.
I don't know what I'm doing wrong...
One of the problems with the way you are copying data in the first answer is that you are assigning data to parts of an array that don't exist. aux_data[stack_size*OBJ_SIZE + i] will overflow whenever stack_size > 1.
The problem with answer two might be that you need to pass an array of events, not just a single event.
One thing to make sure is to understand what index is referring to. I'm assuming for my solutions that it is referring to the group ID and not the thread ID. If it is indeed the thread ID, then we have other problems.
Possible Solution 1:
int gid = get_group_id(0);
int lid = get_local_id(0);
int l_s = get_local_id(0);
for(int i = lid; i < OBJ_SIZE; i += l_s)
{
aux_data[i] = data[gid*OBJ_SIZE + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
Possible Solution 2:
int gid = get_group_id(0);
event_t e = async_work_group_copy (aux_data, data + (gid*OBJ_SIZE), OBJ_SIZE, 0);
wait_group_events (1, &e);

Resources