I'm trying to learn some CUDA and I can't figure out how to solve the following situation:
Consider two groups G1 and G2:
G1 has 2 vectors with 3 elements each: a1 = {2,5,8} and b1 = {8,4,6}
G2 has 2 vectors with 3 elements each: a2 = {7,3,1} and b2 = {4,2,9}
The task is to sum vectors a and b from each group and return a sorted vector c, so:
G1 will give c1 = {10,9,14} => (sort algorithm) => c1 = {9,10,14}
G2 will give c2 = {11,5,10} => (sort algorithm) => c2 = {5,10,11}
If I have a GeForce with 92 CUDA cores, I would like to create 92 groups G and compute all the sums in parallel, so:
core 1-> G1 -> c1 = a1 + b1 -> sort c1 -> return c1
core 2-> G2 -> c2 = a2 + b2 -> sort c2 -> return c2
....
core 92-> G92 -> c92 = a92 + b92 -> sort c92 -> return c92
The kernel below sums two vectors in parallel and returns another one:
__global__ void add( int*a, int*b, int*c )
{
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
What I can't understand is how to make the kernel handle an entire vector, not only one
element of it, and then return an entire vector.
Something like this:
__global__ void add( int*a, int*b, int*c, int size )
{
for (int i = 0; i < size ; i++)
c[i] = a[i] + b[i];
//sort c
}
Can anyone please explain whether this is possible and how to do it?
This is a small example. It uses cudaMallocPitch and cudaMemcpy2D. I hope it will give you guidelines to solve your particular problem:
#include <stdio.h>
#include <cuda_runtime.h>

#define N 92  // number of groups, one thread per group
#define M 3   // elements per vector

// Thread idx adds row idx of a and row idx of b into row idx of c.
__global__ void test_access(const float* d_a, const float* d_b, float* d_c,
                            size_t pitch1, size_t pitch2, size_t pitch3)
{
    int idx = threadIdx.x;
    // Rows of a pitched allocation are pitch bytes apart, so offset in bytes.
    const float* row_a = (const float*)((const char*)d_a + idx*pitch1);
    const float* row_b = (const float*)((const char*)d_b + idx*pitch2);
    float* row_c = (float*)((char*)d_c + idx*pitch3);
    for (int i=0; i<M; i++) row_c[i] = row_a[i] + row_b[i];
    printf("row %i column 0 value %f \n",idx,row_c[0]);
    printf("row %i column 1 value %f \n",idx,row_c[1]);
    printf("row %i column 2 value %f \n",idx,row_c[2]);
}
/********/
/* MAIN */
/********/
int main()
{
    float a[N][M], b[N][M], c[N][M];
    float *d_a, *d_b, *d_c;  // pitched device buffers are plain pointers
    size_t pitch1,pitch2,pitch3;
    cudaMallocPitch((void**)&d_a,&pitch1,M*sizeof(float),N);
    cudaMallocPitch((void**)&d_b,&pitch2,M*sizeof(float),N);
    cudaMallocPitch((void**)&d_c,&pitch3,M*sizeof(float),N);
    // Fill the host matrices with test data.
    for (int i=0; i<N; i++)
        for (int j=0; j<M; j++) {
            a[i][j] = i*j;
            b[i][j] = -i*j+1;
        }
    cudaMemcpy2D(d_a,pitch1,a,M*sizeof(float),M*sizeof(float),N,cudaMemcpyHostToDevice);
    cudaMemcpy2D(d_b,pitch2,b,M*sizeof(float),M*sizeof(float),N,cudaMemcpyHostToDevice);
    // One block of N threads: thread i handles group i.
    test_access<<<1,N>>>(d_a,d_b,d_c,pitch1,pitch2,pitch3);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy2D(c,M*sizeof(float),d_c,pitch3,M*sizeof(float),N,cudaMemcpyDeviceToHost);
    for (int i=0; i<N; i++)
        for (int j=0; j<M; j++) printf("row %i column %i value %f\n",i,j,c[i][j]);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
92 3-D vectors can be treated as one 276-D vector, so you can use the single-vector add kernel to add them. Thrust would be a simpler way to do this.
Update
If your vectors are only 3-D, you could simply sort the elements sequentially right after they are calculated.
If your vectors have higher dimensions, you could consider using cub::BlockRadixSort. The idea is to first add one vector per thread/block, then sort the vector within the block using cub::BlockRadixSort.
http://nvlabs.github.io/cub/classcub_1_1_block_radix_sort.html
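The two ideas above (treat all groups as one flat vector for the addition, then sort each short group sequentially) can be sketched as a CPU reference in plain C++. This is only an illustration under assumptions: the function name add_and_sort_groups and the flat group-after-group layout are hypothetical, and in a CUDA kernel the outer loop over g would be replaced by the thread index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: a and b hold all groups back to back,
// m elements per group. Returns c = a + b with every group sorted.
std::vector<int> add_and_sort_groups(const std::vector<int>& a,
                                     const std::vector<int>& b,
                                     std::size_t m) {
    assert(m > 0 && a.size() == b.size() && a.size() % m == 0);
    std::vector<int> c(a.size());
    const std::size_t groups = a.size() / m;
    for (std::size_t g = 0; g < groups; ++g) {  // in CUDA: g = thread index
        // Element-wise sum of this group's slice.
        for (std::size_t i = 0; i < m; ++i)
            c[g * m + i] = a[g * m + i] + b[g * m + i];
        // Sequential sort of the m results of this group.
        auto first = c.begin() + static_cast<std::ptrdiff_t>(g * m);
        std::sort(first, first + static_cast<std::ptrdiff_t>(m));
    }
    return c;
}
```

With the two groups from the question, add_and_sort_groups({2,5,8,7,3,1}, {8,4,6,4,2,9}, 3) returns {9,10,14,5,10,11}.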
I'm an amateur playing with discrete math. This isn't a
homework problem though I am doing it at home.
I want to solve ax + by = c for natural numbers, with a, b and c
given and x and y to be computed. I want to find all x, y pairs
that will satisfy the equation.
This has a similar structure to Bezout's identity for integers
where there are multiple (infinite?) solution pairs. I thought
the similarity might mean that the extended Euclidean algorithm
could help here. Below are two implementations of the EEA that
seem to work; they're both adapted from code found on the net.
Could these be adapted to the task, or perhaps can someone
find a more promising avenue?
typedef long int Int;
#ifdef RECURSIVE_EEA
Int // returns the GCD of a and b and finds x and y
// such that ax + by == GCD(a,b), recursively
eea(Int a, Int b, Int &x, Int &y) {
if (0==a) {
x = 0;
y = 1;
return b;
}
Int x1; x1=0;
Int y1; y1=0;
Int gcd = eea(b%a, a, x1, y1);
x = y1 - b/a*x1;
y = x1;
return gcd;
}
#endif
#ifdef ITERATIVE_EEA
Int // returns the GCD of a and b and finds x and y
// such that ax + by == GCD(a,b), iteratively
eea(Int a, Int b, Int &x, Int &y) {
x = 0;
y = 1;
Int u; u=1;
Int v; v=0; // does this need initialising?
Int q; // quotient
Int r; // remainder
Int m;
Int n;
while (0!=a) {
q = b/a; // quotient
r = b%a; // remainder
m = x - u*q; // ?? what are the invariants?
n = y - v*q; // ?? When does this overflow?
b = a; // A candidate for the gcd - a's last nonzero value.
a = r; // a becomes the remainder - it shrinks each time.
// When a hits zero, the u and v that are written out
// are final values and the gcd is a's previous value.
x = u; // Here we have u and v shuffling values out
y = v; // via x and y. If a has gone to zero, they're final.
u = m; // ... and getting new values
v = n; // from m and n
}
return b;
}
#endif
If we slightly change the equation form:
ax + by = c
by = c - ax
y = (c - ax)/b
Then we can loop x through all numbers in its range (a*x <= c) and check whether a viable natural y exists. So no, there is not an infinite number of solutions: x is limited by c/a and y by c/b, so there are at most min(c/a,c/b) of them. Here is a small C++ example of the naive solution:
int a=123,b=321,c=987654321;
int x,y,ax;
for (x=1,ax=a;ax<=c;x++,ax+=a)
{
y = (c-ax)/b;
if (y>0 && ax+(b*y)==c) { /* output the x,y solution here */ }
}
If you want to speed this up, iterate y as well and just check whether c-ax is divisible by b. Something like this:
int a=123,b=321,c=987654321;
int x,y,ax,cax,by;
for (x=1,ax=a,y=(c/b),by=b*y;ax<=c;x++,ax+=a)
{
cax=c-ax;
while (by>cax){ by-=b; y--; if (!y) break; }
if (y && by==cax) { /* output the x,y solution here */ }
}
As you can see, x and y are now iterated in opposite directions in the same loop, and no division or multiplication is left inside the loop, so it is much faster. Here are the first few results:
method1 method2
[ 78.707 ms] | [ 21.277 ms] // time needed for computation
75044 | 75044 // found solutions
-------------------------------
75,3076776 | 75,3076776 // first few solutions in x,y order
182,3076735 | 182,3076735
289,3076694 | 289,3076694
396,3076653 | 396,3076653
503,3076612 | 503,3076612
610,3076571 | 610,3076571
717,3076530 | 717,3076530
824,3076489 | 824,3076489
931,3076448 | 931,3076448
1038,3076407 | 1038,3076407
1145,3076366 | 1145,3076366
I expect that for a really huge c and small a, b, this loop
while (by>cax){ by-=b; y--; if (!y) break; }
might be slower than a direct division or a GCD-based approach.
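The question's idea of adapting the extended Euclidean algorithm can also be sketched. If g = gcd(a,b) divides c, the solutions for x form a single residue class modulo b/g, so the loop can jump from one solution straight to the next. The code below is a minimal sketch under assumptions: solve_naturals is a hypothetical helper, natural numbers start at 1 (as in the loops above), and a, b, c are positive and small enough that the products fit into 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Int = std::int64_t;

// Extended Euclid: returns g = gcd(a, b) and sets x, y so that a*x + b*y == g.
Int eea(Int a, Int b, Int &x, Int &y) {
    if (a == 0) { x = 0; y = 1; return b; }
    Int x1 = 0, y1 = 0;
    Int g = eea(b % a, a, x1, y1);
    x = y1 - (b / a) * x1;
    y = x1;
    return g;
}

// Hypothetical helper: all pairs (x, y) with x >= 1, y >= 1 and a*x + b*y == c.
std::vector<std::pair<Int, Int>> solve_naturals(Int a, Int b, Int c) {
    assert(a > 0 && b > 0 && c > 0);
    std::vector<std::pair<Int, Int>> out;
    Int x0 = 0, y0 = 0;
    const Int g = eea(a, b, x0, y0);
    if (c % g != 0) return out;        // no integer solutions at all
    const Int step = b / g;            // consecutive x values differ by b/g
    // Smallest x >= 1 with x congruent to x0*(c/g) modulo step.
    Int x = ((x0 % step) * ((c / g) % step)) % step;
    if (x <= 0) x += step;
    for (; a * x < c; x += step)       // a*x < c guarantees y >= 1
        out.emplace_back(x, (c - a * x) / b);  // exact division by construction
    return out;
}
```

For a=123, b=321, c=987654321 this sketch yields 75044 pairs starting with (75, 3076776), which agrees with the table above, while visiting only the x values that are actual solutions.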
I'm solving the Navier-Stokes equations for incompressible fluid flow through a square region with an obstacle. As output I get the X and Y components of the velocity, each as an NxN matrix. How can I plot the vector field in gnuplot?
I found this answer but I can't understand what values to put for x, y, dx, dy.
Can anyone explain how to use my output to plot vector field?
UPDATE
I tried doing as @LutzL said, but something seems to be wrong with my code. Is everything all right with this code?
int main() {
ifstream finu("U"), finv("V");
int N = 41, M = 41;
auto
**u = new double *[N],
**v = new double *[N];
for (int i = 0; i < N; i++) {
u[i] = new double[M];
v[i] = new double[M];
}
double
dx = 1.0 / (N - 1),
dy = 1.0 / (M - 1);
for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) {
finu >> u[i][j];
finv >> v[i][j];
}
}
ofstream foutvec("vec");
for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) {
foutvec << dx * i << "\t" << dy * j << "\t" << u[i][j] << "\t" << v[i][j] << endl;
}
}
ofstream plt("graph.plt");
plt << "set term pngcairo"
"\nset title 'Navier-Stokes Equation'"
"\nset output 'vec.png'"
"\nplot 'vec' w vec";
plt.close();
system("gnuplot graph.plt");
return 0;
}
As output I get a somewhat weird field.
You need to save your result in a text file with lines
x[i] y[j] X[i,j] Y[i,j]
for all of the pairs i,j. Then use gnuplot with the "traditional" vector field command.
You only need the using clause if you put additional columns into that file and the vectors to display are not (simply) the 3rd and 4th columns. One use might be to compute a scaling factor R[i,j] in order to display X/R, Y/R. Put it in the 5th column
x[i] y[j] X[i,j] Y[i,j] R[i,j]
and call with using 1:2:($3/$5):($4/$5) to perform the scaling in gnuplot.
In the code in the update and the resulting image, one sees that the vectors are too long to plot. Scale them with dt for some reasonable time step; in the gnuplot commands this could be done via
dt = 0.01
plot 'vec' u 1:2:(dt*$3):(dt*$4) w vec
The incomplete plot hints at an incomplete data file on disk. Flush or close the output stream for the vector data before calling gnuplot.
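Both points (one "x y dx dy" line per grid point, scaled by dt, and a data file that is complete before gnuplot reads it) can be sketched in C++. This is an illustration under assumptions: write_vector_field is a hypothetical helper, and the scaling is applied while writing instead of in the gnuplot using clause.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical helper: writes one tab-separated line "x y dt*u dt*v"
// per grid point, the layout gnuplot's "with vectors" style expects.
void write_vector_field(std::ostream &out,
                        const std::vector<std::vector<double>> &u,
                        const std::vector<std::vector<double>> &v,
                        double dx, double dy, double dt) {
    assert(u.size() == v.size());
    for (std::size_t i = 0; i < u.size(); ++i) {
        assert(u[i].size() == v[i].size());
        for (std::size_t j = 0; j < u[i].size(); ++j)
            out << dx * i << '\t' << dy * j << '\t'
                << dt * u[i][j] << '\t' << dt * v[i][j] << '\n';
    }
}
```

When the target is a std::ofstream, call close() on it (or let it go out of scope) before system("gnuplot graph.plt"), so that the whole file is on disk when gnuplot starts reading it.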
Below is a piece of C code run from R used to compare each row of a matrix to a vector. The number of identical values is stored in the first column of a two-column matrix.
I know it can easily be done in R (as done to check the results), but this is a first step for a more complex use case.
When OpenMP is not used, it works fine. When OpenMP is used, it gives correlated (0.99) but inconsistent results.
Question1: What am I doing wrong?
Question2: I use a double for loop to fill the output matrix (ret) with zeros. What would be a better solution?
Also, inconsistencies were observed when the code was used in a package. I tried to make the code reproducible using inline, but it does not recognize the OpenMP statements (I tried to include 'omp.h' in the parameters of cfunction, ...).
Question3: How can we make this code work with inline?
I'm (too?) far outside my comfort zone on this topic.
library(inline)
compare <- cfunction(c(x = "integer", vec = "integer"), "
const int I = nrows(x), J = ncols(x);
SEXP ret;
PROTECT(ret = allocMatrix(INTSXP, I, 2));
int *ptx = INTEGER(x), *ptvec = INTEGER(vec), *ptret = INTEGER(ret);
for (int i=0; i<I; i++)
for (int j=0; j<2; j++)
ptret[j * I + i] = 0;
int i, j;
#pragma omp parallel for default(none) shared(ptx, ptvec, ptret) private(i,j)
for (j=0; j<J; j++)
for (i=0; i<I; i++)
if (ptx[i + I * j] == ptvec[j]) {++ptret[i];}
UNPROTECT(1);
return ret;
")
N = 3e3
M = 1e4
m = matrix(sample(c(-1:1), N*M, replace = TRUE), nc = M)
v = sample(-1:1, M, replace = TRUE)
cc = compare(m, v)
cr = rowSums(t(t(m) == v))
all.equal(cc[,1], cr)
Thanks to the comments above, I reconsidered the data race issue.
IIUC, my loop was parallelized over j (the columns). Each thread then had its own value of i (the rows), but the same value could occur in several threads at once, which then tried to increment ptret[i] at the same time.
To avoid this, I now loop on i first, so that only a single thread will increment each row.
Then, I realized that I could move the zero-initialization of ptret within the first loop.
It seems to work. I get identical results, increased CPU usage, and 3-4x speedup on my laptop.
I guess that solves questions 1 and 2. I will have a closer look at the inline/openmp problem.
Code below, fwiw.
#include <omp.h>
#include <R.h>
#include <Rinternals.h>
#include <stdio.h>
SEXP c_compare(SEXP x, SEXP vec)
{
const int I = nrows(x), J = ncols(x);
SEXP ret;
PROTECT(ret = allocMatrix(INTSXP, I, 2));
int *ptx = INTEGER(x), *ptvec = INTEGER(vec), *ptret = INTEGER(ret);
int i, j;
#pragma omp parallel for default(none) shared(ptx, ptvec, ptret) private(i, j)
for (i = 0; i < I; i++) {
// init ptret to zero
ptret[i] = 0;
ptret[I + i] = 0;
for (j = 0; j < J; j++)
if (ptx[i + I * j] == ptvec[j]) {
++ptret[i];
}
}
UNPROTECT(1);
return ret;
}
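The same row-wise pattern can be written without the R API, which makes it easy to check in isolation. This is a minimal C++ sketch under assumptions: count_matches is a hypothetical name, the matrix is stored column-major as in R, and the pragma only takes effect when the compiler is invoked with OpenMP enabled (otherwise the loop simply runs serially).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: x is an I x J matrix in column-major order.
// ret[i] counts the columns j for which x(i, j) == vec[j].
std::vector<int> count_matches(const std::vector<int> &x,
                               const std::vector<int> &vec, int I) {
    const int J = static_cast<int>(vec.size());
    assert(I >= 0 && x.size() == static_cast<std::size_t>(I) * vec.size());
    std::vector<int> ret(static_cast<std::size_t>(I), 0);
    // Parallel over rows: each ret[i] is written by exactly one thread,
    // so there is no data race and no atomic update is needed.
    #pragma omp parallel for
    for (int i = 0; i < I; i++) {
        int count = 0;  // thread-private accumulator
        for (int j = 0; j < J; j++)
            if (x[i + static_cast<std::size_t>(I) * j] == vec[j]) ++count;
        ret[i] = count;
    }
    return ret;
}
```

Accumulating into a local count and storing it once per row also keeps threads from repeatedly writing to neighbouring elements of the shared output.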
I am very new to OpenCL and am going through the Altera OpenCL examples.
In their matrix multiplication example, they have used the concept of blocks, where dimensions of the input matrices are multiple of block size. Here's the code:
void matrixMult( // Input and output matrices
__global float *restrict C,
__global float *A,
__global float *B,
// Widths of matrices.
int A_width, int B_width)
{
// Local storage for a block of input matrices A and B
__local float A_local[BLOCK_SIZE][BLOCK_SIZE];
__local float B_local[BLOCK_SIZE][BLOCK_SIZE];
// Block index
int block_x = get_group_id(0);
int block_y = get_group_id(1);
// Local ID index (offset within a block)
int local_x = get_local_id(0);
int local_y = get_local_id(1);
// Compute loop bounds
int a_start = A_width * BLOCK_SIZE * block_y;
int a_end = a_start + A_width - 1;
int b_start = BLOCK_SIZE * block_x;
float running_sum = 0.0f;
for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
{
A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; ++k)
{
running_sum += A_local[local_y][k] * B_local[local_x][k];
}
}
// Store result in matrix C
C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = running_sum;
}
Assume block size is 2, then: block_x and block_y are both 0; and local_x and local_y are both 0.
Then A_local[0][0] would be A[0] and B_local[0][0] would be B[0].
Sizes of A_local and B_local are 4 elements each.
In that case, how would A_local and B_local access other elements of the block in that iteration?
Also would separate threads/cores be assigned for each local_x and local_y?
There is definitely a barrier missing in your code sample. The outer for loop as you have it will only produce correct results if all work items are executing instructions in lockstep fashion, thus guaranteeing the local memory is populated before the for k loop.
Maybe this is the case for Altera and other FPGAs, but this is not correct for CPUs and GPUs.
You should add barrier(CLK_LOCAL_MEM_FENCE); if you are getting unexpected results, or if you want to be compatible with other types of hardware.
float running_sum = 0.0f;
for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
{
A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
barrier(CLK_LOCAL_MEM_FENCE);
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; ++k)
{
running_sum += A_local[local_y][k] * B_local[local_x][k];
}
barrier(CLK_LOCAL_MEM_FENCE); // also wait before the next iteration overwrites the tiles
}
A_local and B_local are both shared by all work items of the work group, so all their elements are loaded in parallel (by all work items of the work group) at each step of the encompassing for loop.
Then each work item uses some of the loaded values (not necessarily the values the work item loaded itself) to do its share of the computation.
And finally, the work item stores its individual result into the global output matrix.
It is a classical tiled implementation of a matrix-matrix multiplication. However, I'm really surprised not to see any sort of call to a memory synchronisation function, such as work_group_barrier(CLK_LOCAL_MEM_FENCE) between the load of A_local and B_local and their use in the k loop... But I might very well have overlooked something here.
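To see how the whole tile gets filled even though each work item loads only one element of A_local and B_local, the work group can be emulated serially on the CPU. This is a sketch under assumptions: tiled_matmul is a hypothetical function, BLOCK plays the role of BLOCK_SIZE, matrices are row-major, and all dimensions are multiples of BLOCK. The two inner loop nests stand for "all work items load" and "all work items compute"; the point between them is where a barrier is needed on real hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int BLOCK = 2;  // assumed tile size, stands in for BLOCK_SIZE

// Serial emulation of the tiled kernel: A is M x K, B is K x N, row-major.
std::vector<float> tiled_matmul(const std::vector<float> &A,
                                const std::vector<float> &B,
                                int M, int K, int N) {
    assert(M % BLOCK == 0 && K % BLOCK == 0 && N % BLOCK == 0);
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    for (int by = 0; by < M / BLOCK; ++by)          // work group index (y)
        for (int bx = 0; bx < N / BLOCK; ++bx) {    // work group index (x)
            float sum[BLOCK][BLOCK] = {};           // one running_sum per work item
            for (int t = 0; t < K / BLOCK; ++t) {
                float A_local[BLOCK][BLOCK], B_local[BLOCK][BLOCK];
                // Phase 1: every work item (ly, lx) loads ONE element per tile.
                for (int ly = 0; ly < BLOCK; ++ly)
                    for (int lx = 0; lx < BLOCK; ++lx) {
                        A_local[ly][lx] = A[(by * BLOCK + ly) * K + t * BLOCK + lx];
                        B_local[lx][ly] = B[(t * BLOCK + ly) * N + bx * BLOCK + lx];
                    }
                // (a barrier belongs here: the tiles must be complete before use)
                // Phase 2: every work item reads a whole row of each tile.
                for (int ly = 0; ly < BLOCK; ++ly)
                    for (int lx = 0; lx < BLOCK; ++lx)
                        for (int k = 0; k < BLOCK; ++k)
                            sum[ly][lx] += A_local[ly][k] * B_local[lx][k];
            }
            for (int ly = 0; ly < BLOCK; ++ly)
                for (int lx = 0; lx < BLOCK; ++lx)
                    C[(by * BLOCK + ly) * N + bx * BLOCK + lx] = sum[ly][lx];
        }
    return C;
}
```

With BLOCK = 2, each work group has four work items, one per (local_x, local_y) pair, and together they load all four elements of each tile in phase 1.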
I have a problem shuffling this array in the Arduino software:
int questionNumberArray[10]={0,1,2,3,4,5,6,7,8,9};
Does anyone know a built-in function or a way to shuffle the values in the array without any repeats?
The simplest way would be this little for loop:
int questionNumberArray[] = {0,1,2,3,4,5,6,7,8,9};
const size_t n = sizeof(questionNumberArray) / sizeof(questionNumberArray[0]);
for (size_t i = 0; i < n - 1; i++)
{
size_t j = random(i, n);
int t = questionNumberArray[i];
questionNumberArray[i] = questionNumberArray[j];
questionNumberArray[j] = t;
}
Let's break it down line by line, shall we?
int questionNumberArray[] = {0,1,2,3,4,5,6,7,8,9};
You don't need to specify the number of cells if you initialize an array like that. Just leave the brackets empty like I did.
const size_t n = sizeof(questionNumberArray) / sizeof(questionNumberArray[0]);
I decided to store the number of cells in the constant n. The sizeof operator gives you the number of bytes taken by your array and the number of bytes taken by one cell. Divide the first number by the second and you have the number of elements in your array.
for (size_t i = 0; i < n - 1; i++)
Please note that the loop only runs up to n - 2. We don't want i to ever reach the last index.
size_t j = random(i, n);
We declare a variable j that points to a random cell with an index from i to n - 1 (the upper bound of random is exclusive). That is why i never needs to reach n - 1: the only possible choice there would be j = i, a swap of the cell with itself. We get the random number with Arduino's random function: https://www.arduino.cc/en/Reference/Random
int t = questionNumberArray[i];
questionNumberArray[i] = questionNumberArray[j];
questionNumberArray[j] = t;
A simple swap of two values. It's possible to do it without the temporary variable t, but the code is less readable then.
In my case the result was as follows:
questionNumberArray[0] = 0
questionNumberArray[1] = 9
questionNumberArray[2] = 7
questionNumberArray[3] = 4
questionNumberArray[4] = 6
questionNumberArray[5] = 5
questionNumberArray[6] = 1
questionNumberArray[7] = 8
questionNumberArray[8] = 2
questionNumberArray[9] = 3
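For trying the idea outside the Arduino environment, here is a minimal sketch of the same swap-based (Fisher-Yates) shuffle in portable C++. It is an illustration under assumptions: shuffle_in_place is a hypothetical name, and std::mt19937 with std::uniform_int_distribution stands in for Arduino's random.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Fisher-Yates shuffle: position i receives a random element taken from
// the not-yet-fixed part [i, n - 1], so no value can repeat or get lost.
void shuffle_in_place(std::vector<int> &v, std::mt19937 &rng) {
    for (std::size_t i = 0; i + 1 < v.size(); ++i) {
        // Both bounds of uniform_int_distribution are inclusive.
        std::uniform_int_distribution<std::size_t> pick(i, v.size() - 1);
        const std::size_t j = pick(rng);
        assert(j >= i && j < v.size());
        std::swap(v[i], v[j]);
    }
}
```

Because the array is only permuted by swaps, sorting the shuffled result always gives back the original values 0 to 9, which is an easy way to check that nothing repeats.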