CUDA thrust vector: copy and sum the values from device_vectorA to device_vectorB - vector

I'm new to CUDA.
I want to copy and sum values in device_vector in the following ways. Are there more efficient ways (or functions provided by thrust) to implement these?
thrust::device_vector<int> device_vectorA(5);
thrust::device_vector<int> device_vectorB(20);
copydevice_vectorA 4 times into device_vectorB in the following way:
for (size_t i = 0; i < 4; i++)
{
offset_sta = i * 5;
thrust::copy(device_vectorA.begin(), device_vectorA.end(), device_vectorB.begin() + offset_sta);
}
Sum every 5 values in device_vectorB and store the results in new device_vector (size 4):
// Example
device_vectorB = 1 2 3 4 5 | 1 2 3 4 5 | 1 2 3 4 5 | 1 2 3 4 5
device_vectorC = 15 15 15 15
thrust::device_vector<int> device_vectorC(4);
for (size_t i = 0; i < 4; i++)
{
offset_sta = i * 5;
offset_end = (i + 1) * 5 - 1;
device_vectorC[i] = thrust::reduce(device_vectorB.begin() + offset_sta, device_vectorB.begin() + offset_end, 0);
}
Are there more efficient ways (or functions provided by thrust) to implement these?
P.S. 1 and 2 are separate instances. For simplicity, these two instances just use the same vectors to illustrate.

Step 1 can be done with a single thrust::copy operation using a permutation iterator that uses a transform iterator working on a counting iterator to generate the copy indices "on the fly".
Step 2 is a partitioned reduction, using thrust::reduce_by_key. We can again use a transform iterator working on a counting iterator to create the flags array "on the fly".
Here is an example:
$ cat t2124.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <iostream>
using namespace thrust::placeholders;
const int As = 5;
const int Cs = 4;
const int Bs = As*Cs;
int main(){
thrust::device_vector<int> A(As);
thrust::device_vector<int> B(Bs);
thrust::device_vector<int> C(Cs);
thrust::sequence(A.begin(), A.end(), 1); // fill A with 1,2,3,4,5
thrust::copy_n(thrust::make_permutation_iterator(A.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1%A.size())), B.size(), B.begin()); // step 1
auto my_flags_iterator = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/A.size());
thrust::reduce_by_key(my_flags_iterator, my_flags_iterator+B.size(), B.begin(), thrust::make_discard_iterator(), C.begin()); // step 2
thrust::host_vector<int> Ch = C;
thrust::copy_n(Ch.begin(), Ch.size(), std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
}
$ nvcc -o t2124 t2124.cu
$ compute-sanitizer ./t2124
========= COMPUTE-SANITIZER
15,15,15,15,
========= ERROR SUMMARY: 0 errors
$
If we wanted to, even the device vector A could be dispensed with; that could be created "on the fly" using a counting iterator. But presumably your inputs are not actually 1,2,3,4,5

Related

How to use multinomial function in C with Rmath.h

I'm trying to draw a random sample from 1,2,3,4 under a given probability vector of length four in C, with the help of Rmath.h. I found this line of code could do this for me.
inline void rmultinom(int n, double* prob, int k, int* rn)
For example, I can write it to draw one random sample.
double p[4]={.1, .2, .3, .2};
rmultinom(1, p, 1, int* rn)
However, I'm lost what this 4th argument should be. In R, the rmultinom function only requires the first three arguments. Another question is what is returned from this function. Is there any convenient way for it to return with one of 1, 2, 3, 4?
Here's a simple example
#define MATHLIB_STANDALONE 1
#include "Rmath.h"
#include <stdio.h>
int main(int argc, char** argv) {
int draws = 100;
int classes = 4;
int vals[classes];
double probs[4] = {0.1, 0.2, 0.4, 0.3};
set_seed(123, 456);
rmultinom(draws, probs, classes, vals);
for(int j=0; j < classes; j++) {
printf("Count of class %i drawn: %i\n", j, vals[j]);
}
return 0;
}
Here we make 100 draws from a multinational distribution with 4 classes. We get back a vector of length 4 that says how many values were in each of those classes. For example I got
Count of class 0 drawn: 9
Count of class 1 drawn: 14
Count of class 2 drawn: 43
Count of class 3 drawn: 34

Modify dijkstra's algorithm with some conditions

Given an undirected graph with costs on edges, find the shortest path, from given node A to B. Let's put it this way: besides the costs and edges we start at time t = 0 and for every node you are given a list with some times that you can't pass through those nodes at that times, and you can't do anything in that time you have to wait until "it passes". As the statement says, you are a prisoner and you can teleport through the cells and the teleportation time requires the cost of the edge time, and those time when you can't do anything is when a guardian is with you in the cell and they are in the cell at every timestamp given from the list, find the minimum time to escape the prison.
What I tried:
I tried to modify it like that: in the normal dijkstra you check if it's a guardian at the minimum time you find for every node, but it didn't work.. any other ideas?
int checkGuardian(int min, int ind, List *guardians)
{
for (List iter = guardians[ind]; iter; iter = iter->next)
if(min == iter->value.node)
return min + iter->value.node;
return 0;
}
void dijkstra(Graph G, int start, int end, List *guardians)
{
Multiset H = initMultiset();
int *parent = (int *)malloc(G->V * sizeof(int));
for (int i = 0; i < G->V; ++i)
{
G->distance[i] = INF;
parent[i] = -1;
}
G->distance[start] = 0;
H = insert(H, make_pair(start, 0));
while(!isEmptyMultiset(H))
{
Pair first = extractMin(H);
for (List iter = G->adjList[first.node]; iter; iter = iter->next)
if(G->distance[iter->value.node] > G->distance[first.node] + iter->value.cost
+ checkGuardian(G->distance[first.node] + iter->value.cost, iter->value.node, guardians))
{
G->distance[iter->value.node] = G->distance[first.node] + iter->value.cost
+ checkGuardian(G->distance[first.node] + iter->value.cost, iter->value.node, guardians);
H = insert(H, make_pair(iter->value.node, G->distance[iter->value.node]));
parent[iter->value.node] = first.node;
}
}
printf("%d\n", G->distance[end]);
printPath(parent, end);
printf("%d\n", end);
}
with these structures:
typedef struct graph
{
int V;
int *distance;
List *adjList;
} *Graph;
typedef struct list
{
int size;
Pair value;
struct list *tail;
struct list *next;
struct list *prev;
} *List;
typedef struct multiset
{
Pair vector[MAX];
int size;
int capacity;
} *Multiset;
typedef struct pair
{
int node, cost;
} Pair;
As an input you are given number of nodes, number of edges and start node. For the next number of edges lines you are reading and edge between 2 nodes and the cost associated with that edge, then for the next number of nodes lines you are reading a character "N" if you can't escape from that cell and "Y" if you can escape from that cell then the number of timestamps guardians are in then number of timestamps, timestamps.
For this input:
6 7 1
1 2 5
1 4 3
2 4 1
2 3 8
2 6 4
3 6 2
1 5 10
N 0
N 4 2 3 4 7
Y 0
N 3 3 6 7
N 3 10 11 12
N 3 7 8 9
I would expect this output:
12
1 4 2 6 3
But I get this output:
10
1 4 2 6 3

change vector element by name in rcpp

I have a function where I need to make a table (tab, then change one value - the value where tab.names() == k, where k is given in the function call.
Looking at http://dirk.eddelbuettel.com/code/rcpp/Rcpp-quickref.pdf, I've hoped that the following code would work (replacing "foo" with a variable name), but I guess that requires the element name to be static, and mine won't be. I've tried using which but that won't compile (invalid conversion from 'char' to 'Rcpp::traits::storage_type<16>::type {aka SEXPREC*}' - so I'm doing something wrong there.
#include <RcppArmadillo.h>
#include <algorithm>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fun(const arma::vec& assignment, int k) {
// count number of peptides per protein
IntegerVector tab = table(as<IntegerVector>(wrap(assignment)));
CharacterVector all_proteins = tab.names();
char kc = '0' + k;
// what I need a working version of:
tab(kc) = 1; // gets ignored, as does a [] version of the same thing.
// or
tab('0' + k) = 1; // also ignored
int ki = which(all_proteins == kc); // gives me compile errors
// extra credit
// tab.names(k-1) = "-1";
return tab;
}
/*** R
set.seed(23)
x <- rpois(20, 5)
k <- 5
fun(x, k)
# same thing in R:
expected_output <- table(x)
expected_output # before modification
# x
# 3 4 5 6 7 9 10 12
# 2 4 3 3 4 2 1 1
expected_output[as.character(k)] <- 1 # this is what I need help with
expected_output
# x
# 3 4 5 6 7 9 10 12
# 2 4 1 3 4 2 1 1
# extra credit:
names(expected_output)[as.character(k)] <- -1
*/
I'm still learning rcpp, and more importantly, still learning how to read the manual pages and plug in the right search terms into google/stackoverflow. I'm sure this is basic stuff (and I'm open to better methods - I currently think like an R programmer in terms of initial approaches to problems, not a C++ programmer.)
(BTW - The use of arma::vec is used in other parts of the code which I'm not showing for simplicity - I realize it's not useful here. I debated on switching it, but decided against it on the principle that I've tested that part, it works, and the last thing I want to do is introduce an extra bug...)
Thanks!
You can use the .findName() method to get the relevant index:
#include <RcppArmadillo.h>
#include <algorithm>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fun(const arma::vec& assignment, int k) {
// count number of peptides per protein
IntegerVector tab = table(as<IntegerVector>(wrap(assignment)));
CharacterVector all_proteins = tab.names();
int index = tab.findName(std::string(1, '0' + k));
tab(index) = 1;
all_proteins(index) = "-1";
tab.names() = all_proteins;
return tab;
}
/*** R
set.seed(23)
x <- rpois(20, 5)
k <- 5
fun(x, k)
*/
Output:
> Rcpp::sourceCpp('table-name.cpp')
> set.seed(23)
> x <- rpois(20, 5)
> k <- 5
> fun(x, k)
3 4 -1 6 7 9 10 12
2 4 1 3 4 2 1 1
You could write your own function (use String instead of char):
int first_which_equal(const CharacterVector& x, String y) {
int n = x.size();
for (int i = 0; i < n; i++) {
if (x[i] == y) return(i);
}
return -1;
}
Also, it seems that tab(kc) is converting kc to an integer representation.

How do I add the results of work-items in a work-group in OpenCL?

I'm calling the kernel below with GlobalWorkSize 64 4 1 and WorkGroupSize 1 4 1 with the argument output initialized to zeros.
__kernel void kernelB(__global unsigned int * output)
{
uint gid0 = get_global_id(0);
uint gid1 = get_global_id(1);
output[gid0] += gid1;
}
I'm expecting 6 6 6 6 ... as the sum of the gid1's (0 + 1 + 2 + 3). Instead I get 3 3 3 3 ... Is there a way to get this functionality? In general I need the sum of the results of each work-item in a work group.
EDIT: It seems it must be said, I'd like to solve this problem without atomics.
You need to use local memory to store the output from all work items. After the work items are done their computation, you sum the results with an accumulation step.
__kernel void kernelB(__global unsigned int * output)
{
uint item_id = get_local_id(0);
uint group_id = get_group_id(0);
//memory size is hard-coded to the expected work group size for this example
local unsigned int result[4];
//the computation
result[item_id] = item_id % 3;
//wait for all items to write to result
barrier(CLK_LOCAL_MEM_FENCE);
//simple O(n) reduction using the first work item in the group
if(local_id == 0){
for(int i=1;i<4;i++){
result[0] += result[i];
}
output[group_id] = result[0];
}
}
Multiple work items are accessing elements of global simultaneously and the result is undefined. You need to use atomic operations or write unique location per work item.

OpenCL kernel question

I have taken the Kernel from the great OpenCL SpMV article for AMD by Bryan Catanzaro.
I have given it a toy problem where the input is
A= [0 0 0 6 1 3 5 7 2 4 0 0]
offsets= [-3 0 2]
x= [1 2 3 4]
and the output y should be [7 22 15 34]
Here is the kernel:
__kernel
void dia_spmv(__global float *A, __const int rows,
__const int diags, __global int *offsets,
__global float *x, __global float *y) {
int row = get_global_id(0);
float accumulator = 0;
for(int diag = 0; diag < diags; diag++) {
int col = row + offsets[diag];
if ((col >= 0) && (col < rows)) {
float m = A[diag*rows + row];
float v = x[col];
accumulator += m * v;
}
}
y[row] = accumulator;
}
After loading and writing the input arguments I execute the kernel like this:
size_t global_work_size;
global_work_size = 4;
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &global_work_size,NULL, 0, NULL, NULL);
err = clFinish(cmd_queue);
And I get the correct result when I read y back from gpu memory.
I.e. I get y = [7 22 15 34]
I am new to OpenCL (and GPGPU in general) so I want to try and understand how to extend the problem correctly for much larger matrices of arbitrary dimension.
So lets say I have 1000 000 rows. What should I set global_work_size to be?
And should I set local_work_size or should I leave it as NULL?
To use the kernel for arbitrary matrix sizes you should think about the problem and rewrite the kernel. The issue is the limited memory size of the GPU and limited size for a single buffer. You can get the maximum size for a buffer with clGetDeviceInfo and CL_DEVICE_MAX_MEM_ALLOC_SIZE.
You need to split your problem into smaller pieces. Calculate them separately and merge the results afterwards.
I do not know the problem above and can not give you any hint which helps you to implement this. I can only give you the general direction.

Resources