Writing a Partial Sum GPU Kernel

I have the following array with a sparse 1 every now and then. It's a massive vector, megabytes in size:
[0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 ..]
I need to store those 1's at an index for processing, so I need a kernel that produces this:
[0 0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 ..]
How can I parallelize such an operation?

You're looking for a "parallel inclusive scan", which the Thrust library (it ships with the CUDA toolkit) includes out of the box:
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <cstddef>
#include <iostream>

int main()
{
    int data[17] = { 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0 };
    thrust::device_vector<int> in(data, data + 17);  // copy the input to the device
    thrust::device_vector<int> out(in.size());
    thrust::inclusive_scan(in.begin(), in.end(), out.begin());
    for (std::size_t i = 0; i < out.size(); ++i)
        std::cout << out[i] << " ";                  // each read copies one element back to the host
    std::cout << std::endl;
}
outputs:
0 0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2
Or you could write a kernel explicitly; it will just be a variation on the parallel prefix sum algorithm, which Thrust generalizes nicely.
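To make the prefix sum idea concrete, here is a minimal sketch of the Hillis-Steele scan, emulated sequentially in plain C++ rather than written as a real kernel. The function name inclusive_scan_steps is made up for this illustration. Each pass of the outer loop stands for one parallel step, and the inner loop body is what a single GPU thread would do for its element.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: sequential emulation of a Hillis-Steele inclusive scan.
// One outer iteration = one parallel step; on a GPU the inner loop body would
// run as one thread per element.
std::vector<int> inclusive_scan_steps(const std::vector<int>& in) {
    std::vector<int> cur = in;
    std::vector<int> next(in.size());
    for (std::size_t stride = 1; stride < in.size(); stride *= 2) {
        for (std::size_t i = 0; i < in.size(); ++i) {
            // Elements closer than `stride` to the start are already final.
            next[i] = (i >= stride) ? cur[i] + cur[i - stride] : cur[i];
        }
        std::swap(cur, next);  // double buffering: never read and write the same array
    }
    return cur;
}
```

Calling inclusive_scan_steps on the 17-element example from the question gives the same running count as the Thrust call. This variant does O(n log n) additions in total, so a tuned kernel would normally scan fixed-size blocks in shared memory and then combine the block totals instead.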

Related

binary variable does not correspond with category numbers

I'm trying to create a new binary variable based on several categorical variables. I have tried multiple ways to do this, including base R ifelse() commands and mutate() with case_when() from dplyr. When I make the new variable, the number does not add up to how many were in the categories in the original variables.
library(dplyr)
data <- data.frame(
  ket   = c(1, 0, 0, 0, 1, 0),
  weed  = c(0, 1, 1, 1, 0, 0),
  speed = c(1, 0, 0, 1, 0, 0),
  meth  = c(0, 0, 0, 1, 0, 0)
)
data <- data %>%
  mutate(druguse = case_when(
    weed == 1 | ket == 1 | meth == 1 | speed == 1 ~ 1,
    TRUE ~ 0
  ))
The new variable should add up to how many answered 1 in each category, but the number in my new variable is a lot lower.
Thank you!

You can avoid having to write out an explicit case_when here by taking the sign of the sum of each row. This will be 0 if the whole row is zero, and 1 otherwise (assuming the columns only hold 0s and 1s).
data %>% mutate(druguse = sign(rowSums(.)))
#>   ket weed speed meth druguse
#> 1   1    0     1    0       1
#> 2   0    1     0    0       1
#> 3   0    1     0    0       1
#> 4   0    1     1    1       1
#> 5   1    0     0    0       1
#> 6   0    0     0    0       0
Data
data <- structure(list(ket = c(1, 0, 0, 0, 1, 0),
                       weed = c(0, 1, 1, 1, 0, 0),
                       speed = c(1, 0, 0, 1, 0, 0),
                       meth = c(0, 0, 0, 1, 0, 0)),
                  class = "data.frame", row.names = c(NA, -6L))
data
#>   ket weed speed meth
#> 1   1    0     1    0
#> 2   0    1     0    0
#> 3   0    1     0    0
#> 4   0    1     1    1
#> 5   1    0     0    0
#> 6   0    0     0    0

MPI sending 2D array

I am new to MPI. I am trying to distribute an adjacency matrix across the processes so that I can implement the 1-D BFS algorithm given by https://ieeexplore.ieee.org/document/1559977. Assume that I have a 6*6 matrix like this, with 4 processes:
0 0 0 0 0 0
1 1 1 1 1 1
0 1 0 1 0 1
1 0 1 0 1 0
1 1 1 1 1 1
0 1 0 1 0 1
I want it to be processed like:
A A A A A A
B B B B B B
C C C C C C
D D D D D D
D D D D D D
D D D D D D
That is, every process is assigned (int)(no_of_vertices/size) rows, and the last process is assigned the remaining rows, i.e. (no_of_vertices-(size-1)*local_vertices), since the graph size will rarely be evenly divisible by the number of processes p, so the processes cannot all get the same workload.
Therefore, I read through this answer by Jonathan Dursi ("sending blocks of 2D array in C using MPI") and wrote my own code. I post it below:
#include "mpi.h"
#include<stdio.h>
#include<stdlib.h>
#include<iostream>
#include <fstream>
#include <algorithm>
#include <vector>
#include <string>
#include <sstream>
#include <chrono>
#include <cmath>
using namespace std;
#define MAX_QUEUE_SIZE 5
int Max(int a, int b, int c)
{
int max;
if (a >= b)
{
if (a >= c) {
max = a;
}
else
max = c;
}
else if (b >= c) { max = b; }
else max = c;
return max;
}
int areAllVisited(int visited[], int size)
{
for (int i = 0; i < size; i++)
{
if (visited[i] == 0)
return 0;
}
return 1;
}
int malloc2dint(int*** array, int n, int m) {
/* allocate the n*m contiguous items */
int* p = (int*)malloc(n * m * sizeof(int));
if (!p) return -1;
/* allocate the row pointers into the memory */
(*array) = (int**)malloc(n * sizeof(int*));
if (!(*array)) {
free(p);
return -1;
}
/* set up the pointers into the contiguous memory */
for (int i = 0; i < n; i++)
(*array)[i] = &(p[i * m]);
return 0;
}
int free2dint(int*** array) {
/* free the memory - the first element of the array is at the start */
free(&((*array)[0][0]));
/* free the pointers into the memory */
free(*array);
return 0;
}
int main(int argc, char* argv[])
{
//Variables and Initializations
int size, rank;
int local_vertices;
int** adj_matrix=NULL;
int *adjacency_matrix;
int adjacency_queue[MAX_QUEUE_SIZE];
int source_vertex;
int no_of_vertices;
int ** local_adj_matrix;
int *visited;
int *distance;
int* frontier_vertex;
//MPI Code
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size); // MPI_COMM -> communicator, size=number of processes
// Get my rank in the communicator
MPI_Comm_rank(MPI_COMM_WORLD, &rank);//rank= current processes
//read input file, transform into adjacency matrix
if (rank == 0)
{
string filename="out.txt";
std::fstream in(filename.c_str());
cout << "reading file:" << filename << endl;
string s;
size_t n = 0, m = 0;
string data1, data2;
while (true)
{
std::getline(in, s);
istringstream is(s);
is >> data1 >> data2;
int d1 = stoi(data1);
int d2 = stoi(data2);
n = Max(n, d2, d1);
m += 1;
if (in.eof()) { break; }
}
//this block will count the number of lines and calculate the maximun number appeared in the file, which are the parameters n, m(vertice, edge)
in.clear();
in.seekg(0, ios::beg);
n += 1;
m -= 1;
//initialize the 2D array
malloc2dint(&adj_matrix, n, n);
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++)
{
adj_matrix[i][j] = 0;
}
}
for (size_t i = 0; i < m; i++)
{
int x, y;
std::getline(in, s);
istringstream is(s);
is >> data1 >> data2;
x = stoi(data1);
y = stoi(data2);
adj_matrix[x][y] = 1;
}
in.close();
//print the matrix
cout << "the adjacency matrix is:" << endl;
for (int i = 0; i < n; i++) {
cout << endl;
for (int j = 0; j < n; j++) {
cout << adj_matrix[i][j] << " ";
}
}
source_vertex = 0;
no_of_vertices = n;
local_vertices = (int)(no_of_vertices/size);
}
MPI_Bcast(&local_vertices,1,MPI_INT,0,MPI_COMM_WORLD);
MPI_Bcast(&no_of_vertices, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&source_vertex, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* create a datatype to describe the subarrays of the global array */
int sizes[2] = { no_of_vertices, no_of_vertices }; /* global size */
int subsizes[2] = { 1, no_of_vertices }; /* local size */
int starts[2] = { 0,0 }; /* where this one starts */
MPI_Datatype type, subarrtype;
MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &type);
MPI_Type_create_resized(type, 0, no_of_vertices * sizeof(int), &subarrtype);// /local_vertices
MPI_Type_commit(&subarrtype);
if (rank == size - 1) {
local_vertices = no_of_vertices - (size - 1) * local_vertices;
malloc2dint(&local_adj_matrix, local_vertices, no_of_vertices); }
malloc2dint(&local_adj_matrix,local_vertices,no_of_vertices);
//MPI_Barrier(MPI_COMM_WORLD);
int* adjptr = NULL;
if (rank == 0) adjptr = &(adj_matrix[0][0]);
int *sendcounts=new int[size];
int *displs = new int[size];
if (rank == 0) {
for (int i = 0; i < size; i++)
if (i == size - 1) {
sendcounts[i] = no_of_vertices - (size - 1) * local_vertices;
}
else
sendcounts[i] = local_vertices;
int disp = 0;
for (int i = 0; i < size; i++) {
displs[i] = disp;
disp += no_of_vertices*local_vertices;
}
}
//Scattering each row of the adjacency matrix to each of the processes
//int MPI_Scatter(const void* sendbuf, int sendcount, MPI_Datatype sendtype,void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
MPI_Scatterv(adjptr, sendcounts,displs,subarrtype, &local_adj_matrix[0][0], local_vertices*no_of_vertices, MPI_INT, 0, MPI_COMM_WORLD);
cout << "rank=" << rank << endl;
for(int i=0;i<local_vertices;i++){
cout << endl;
for (int j = 0; j < no_of_vertices; j++) {
cout << local_adj_matrix[i][j] << " ";
}
}
MPI_Barrier(MPI_COMM_WORLD);
//End of BFS code
MPI_Finalize();
return 0;
}
However, the situation in that answer is not the same as mine and, naturally, I got unexpected output:
reading file:out.txt
the adjacency matrix is:
0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
rank=1
0 0 0 0 0 0 0 1 0 0 0 0 rank=0
rank=2
rank=4
rank=3
rank=5
rank=6
rank=7
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
-410582944 556 -410582944 556 -410353664 556 815104 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 -912406441 -1879044864
0 0 0 0 0 0 -906966900 -1879046144 -410511728 556 -410398928 556
0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0
-33686019 2949222 32575338 37898 -410536560 556 -410396256 556 3473509 6357044 8192057 0
0 0 0 0 0 0 0 0 0 0 0 0
C:\Program Files\Microsoft MPI\Bin\mpiexec.exe (process 3628) exited with code 0.
Press any key to close this window . . .
I am sure that there are huge mistakes in my initialization of the two arrays sendcounts and displs and in the creation of the new type subarrtype, but I have no idea how to fix them. Can anyone give me a hand?
A possible input file:
0 3
0 2
2 3
3 1
5 11
5 4
5 1
6 10
8 5
9 4
10 4
11 7

Since you have defined a subarray that maps on to all the local data you want to send, all the sendcounts should be 1. The receive counts need to be the number of integers in the message (which is what you seem to have set them as) but the send count is one subarray.

Just as sendcounts is counted in units of the send datatype (its extent), so are the displacements. If you change the line:
disp += no_of_vertices*local_vertices;
to be:
disp += local_vertices;
then I think it works.
You actually don't need all the complications of resized datatypes, as you are scattering complete, contiguous rows.
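As a sketch of the bookkeeping only (no MPI calls), the counts and displacements can be computed in units of whole rows. The names RowLayout and row_block_layout are hypothetical helpers invented for this example, and the scheme assumed is the one from the question, where the last rank takes the leftover rows.

```cpp
#include <vector>

// Hypothetical helper (not part of MPI): row counts and row offsets for a
// 1-D block distribution where the last rank takes the leftover rows.
struct RowLayout {
    std::vector<int> counts;  // number of rows owned by each rank
    std::vector<int> displs;  // index of each rank's first row
};

RowLayout row_block_layout(int n_rows, int n_procs) {
    RowLayout layout;
    layout.counts.resize(n_procs);
    layout.displs.resize(n_procs);
    const int base = n_rows / n_procs;  // rows per rank, rounded down
    int offset = 0;
    for (int r = 0; r < n_procs; ++r) {
        const bool last = (r == n_procs - 1);
        layout.counts[r] = last ? n_rows - (n_procs - 1) * base : base;
        layout.displs[r] = offset;      // measured in rows, not in ints
        offset += layout.counts[r];
    }
    return layout;
}
```

For the 6*6 example with 4 processes this gives counts 1, 1, 1, 3 and displacements 0, 1, 2, 3, which matches the A/B/C/D picture in the question. These values are in rows, so they fit a one-row send datatype; if the rows are sent as plain MPI_INT instead, both arrays would have to be multiplied by the row length.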

Generating lists in R with patterns related to the entry number

Is there a smart way to generate a list like the one below in R, perhaps using lapply() or some other procedure that generalizes more easily?
ones = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
twos = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
threes = c(1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0)
fours = c(1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
fives = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1)
l = list(ones, twos, threes, fours)
[[1]]
[1] 1 1 1 1 1 1 1 1 1 1 1
[[2]]
[1] 1 0 1 0 1 0 1 0 1 0 1
[[3]]
[1] 1 0 0 1 0 0 1 0 0 1 0
[[4]]
[1] 1 0 0 0 1 0 0 0 1 0 0
These correspond to polynomial coefficients in generating functions for partitions.
The first list is for ones and so the counting is in steps of 1 integer at a time; hence the vector 1,1,1,1,1,1,1,... In the entry [[2]] we have the twos, and we are counting by 2's starting at 0, skipping the 1 (coded as 0). In [[3]] we are counting by 3's: zero, three, six, nine, etc.

A fairly straightforward way in base R is
lapply(seq(0L, 4L), function(i) rep(c(1L, integer(i)), length.out=11L))
[[1]]
[1] 1 1 1 1 1 1 1 1 1 1 1
[[2]]
[1] 1 0 1 0 1 0 1 0 1 0 1
[[3]]
[1] 1 0 0 1 0 0 1 0 0 1 0
[[4]]
[1] 1 0 0 0 1 0 0 0 1 0 0
[[5]]
[1] 1 0 0 0 0 1 0 0 0 0 1
seq(0L, 4L) produces the vector 0 through 4; an equivalent would be seq_len(5L)-1L, which is faster for creation of large vectors.
c(1L, integer(i)) produces the inner, repeated part of the 0-1 vectors, which rep repeats up to the desired length (here 11) using the length.out argument.
lapply and function(i) allow the number of 0s to increase as we loop through the vector.

R : Updating a vector given a set of indices

I have a vector (initialized to zeros) and a set of indices of that vector. For each value in indices, I want to increment the corresponding index in the vector. So, say 6 occurs twice in indices (as in the example below), then the value of the 6th element of the vector should be 2.
E.g.:
> v = rep(0, 10)
> v
[1] 0 0 0 0 0 0 0 0 0 0
> indices
[1] 7 8 6 6 2
The updated vector should be
> c(0, 1, 0, 0, 0, 2, 1, 1, 0, 0)
[1] 0 1 0 0 0 2 1 1 0 0
What is the most idiomatic way of doing this without using loops?

The function tabulate is made for that:
> indices = c(7, 8, 6, 6, 2)
> tabulate(bin=indices, nbins=10)
[1] 0 1 0 0 0 2 1 1 0 0

You can use rle for this:
x <- rle(sort(indices))
v[x$values] <- x$lengths
v
# [1] 0 1 0 0 0 2 1 1 0 0

Code onset from event occurrence

I have a vector that gives presence/absence of an event (insurgency in this case) over time, and I'd like to create another vector that gives onset of the event, i.e.:
occurrence <- c(1, 1, 0, 0, 1, 0, 0, 1, 1, 1)
onset <- c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0)
The following loop will get what I need:
answer <- 0
for (t in 2:length(occurrence)) {
answer[t] <- ifelse((occurrence[t]-occurrence[t-1])==1, 1, 0)
}
> answer
[1] 0 0 0 0 1 0 0 1 0 0
Is there an easier way of doing this?
Thanks.

Use pmax() and diff():
c(NA, pmax(0, diff(occurrence)))
[1] NA 0 0 0 1 0 0 1 0 0
This works because diff() calculates the difference between successive elements, resulting in 1 for every start. Then you need to get rid of the -1 values. pmax() is a parallel (element-wise) version of max() and is handy for changing all the -1s to zero:
diff(occurrence)
[1] 0 -1 0 1 -1 0 1 0 0
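For readers who want the same difference-and-clamp logic outside R, here is a minimal C++ sketch. The function name onset_from_occurrence is made up for this example, and it assumes the input only contains 0s and 1s. Unlike the R one-liner above, it puts 0 rather than NA in the first position, as in the question's desired output.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: marks positions where occurrence goes from 0 to 1.
// Mirrors pmax(0, diff(occurrence)) with a leading 0 instead of NA.
std::vector<int> onset_from_occurrence(const std::vector<int>& occurrence) {
    std::vector<int> onset(occurrence.size(), 0);
    for (std::size_t t = 1; t < occurrence.size(); ++t) {
        // The difference is 1 at a start, -1 at an end, 0 otherwise;
        // clamping at 0 keeps only the starts.
        onset[t] = std::max(0, occurrence[t] - occurrence[t - 1]);
    }
    return onset;
}
```

Applied to the occurrence vector from the question, this returns the onset vector the asker listed.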

You can compare the prior time interval to the current one and choose those where a 0 is followed by a 1 with this code:
> c(0, as.numeric(occurrence[-length(occurrence)] == 0 & occurrence[-1]==1) )
[1] 0 0 0 0 1 0 0 1 0 0
(I padded it with a leading 0 because you didn't want leading occurrence:1's to be counted as new events.)
