So far I thought that the following syntax was invalid,
int B[ydim][xdim];
But today I tried and it worked! I ran it many times to make sure it did not work by chance, even valgrind didn't report any segfault or memory leak!! I am very surprised. Is it a new feature introduced in g++? I always have used 1D arrays to store matrices by indexing them with correct strides as done with A in the program below. But this new method, as with B, is so simple and elegant that I have always wanted. Is it really safe to use? See the sample program.
PS. I am compiling it with g++-4.4.3, if that matters.
#include <cstdlib>
#include <iostream>
int test(int ydim, int xdim) {
// Allocate 1D array
int *A = new int[xdim*ydim](); // with C++ new operator
// int *A = (int *) malloc(xdim*ydim * sizeof(int)); // or with C style malloc
if (A == NULL)
return EXIT_FAILURE;
// Declare a 2D array of variable size
int B[ydim][xdim];
// populate matrices A and B
for(int y = 0; y < ydim; y++) {
for(int x = 0; x < xdim; x++) {
A[y*xdim + x] = y*xdim + x;
B[y][x] = y*xdim + x;
}
}
// read out matrix A
for(int y = 0; y < ydim; y++) {
for(int x = 0; x < xdim; x++)
std::cout << A[y*xdim + x] << " ";
std::cout << std::endl;
}
std::cout << std::endl;
// read out matrix B
for(int y = 0; y < ydim; y++) {
for(int x = 0; x < xdim; x++)
std::cout << B[y][x] << " ";
std::cout << std::endl;
}
delete []A;
// free(A); // or in C style
return EXIT_SUCCESS;
}
int main() {
return test(5, 8);
}
int b[ydim][xdim] is declaring a 2-d array on the stack. new, on the other hand, allocates the array on the heap.
For any non-trivial array size, it's almost certainly better to have it on the heap, lest you run yourself out of stack space, or if you want to pass the array back to something outside the current scope.
This is a C99 'variable length array' or VLA. If they are supported by g++ too, then I believe it is an extension of the C++ standard.
Nice, aren't they?
Related
I am trying to calculate the euclidean distance for KNN but in parallel using dpc++. the training dataset contains 5 features and 1600 rows, while I want to calculate the distance between the current test point and each training point on the grid in parallel, but I keep getting an error regarding sycl kernal.
code for the function:
code
std::vector<double> distance_calculation_FPGA(queue& q,const std::vector<std::vector<double>>& dataset,const std::vector<double>& curr_test) {
range<1> num_items{ dataset.size()};
std::vector<double>res;
res.resize(dataset.size());
buffer dataset_buf(dataset);
buffer curr_test_buf(curr_test);
buffer res_buf(res.data(), num_items);
q.submit([&](handler& h) {
accessor a(dataset_buf, h, read_only);
accessor b(curr_test_buf, h, read_only);
accessor dif(res_buf, h, write_only, no_init);
h.parallel_for(num_items, [=](auto i) {
for (int j = 0; j <(const int) a[i].size(); ++j) {
dif[i] += (a[i][j] - b[j]) * (a[i][j] - b[j]) ;
}
});
});
for (int i = 0; i < res.size(); ++i) {
std::cout << res[i] << std::endl;
}
//old distance calculation (serial)
//for (int i = 0; i < dataset.size(); ++i) {
// double dis = 0;
// for (int j = 0; j < dataset[i].size(); ++j) {
// dis += (curr_test[j] - dataset[i][j]) * (curr_test[j] - dataset[i][j]);
//}
//res.push_back(dis);
//}
return res;
}
the error I am receiving:
SYCL kernel cannot call a variadic function
SYCL kernel cannot call an undefined function without SYCL_EXTERNAL attribute
Would be extremely grateful for any help!
Thanks
We tried running your code by creating dummy 'dataset' and 'curr_test' variables. We were able to run the program successfully. Please refer this thread
Please refer to the complete code attached below.
#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;
std::vector<double> distance_calculation_FPGA(queue& q,const std::vector<std::vector<double>>& dataset,const std::vector<double>& curr_test)
{
range<1> num_items{ dataset.size()};
std::vector<double>res;
res.resize(dataset.size());
buffer dataset_buf(dataset);
buffer curr_test_buf(curr_test);
buffer res_buf(res.data(), num_items);
q.submit([&](handler& h) {
accessor a(dataset_buf, h, read_only);
accessor b(curr_test_buf, h, read_only);
accessor dif(res_buf, h, write_only, no_init);
h.parallel_for(num_items, [=](auto i) {
for (int j = 0; j <(const int) a[i].size(); ++j) {
// dif[i] += (a[i][j] - b[j]) * (a[i][j] - b[j]) ;
dif[i]+=a[i][j];
}
});
});
q.wait(); //We have added this line of code for synchronization.
for (int i : res) {
std::cout <<i<< std::endl;
}
return res;
}
int main(){
std::vector<std::vector<double>> dataset;
for(int i=0;i<5;i++)
{
std::vector<double> d;
for(int j=0;j<1600;j++)
{
d.push_back((double)j);
}
dataset.push_back(d);
}
std::vector<double> curr_test;
for(int i=0;i<1600;i++)
{
curr_test.push_back((double)i);
}
queue q;
std::cout << "Running on "<<
q.get_device().get_info<sycl::info::device::name>()<< std::endl;
//print the device name as a test to check the parallelisation
distance_calculation_FPGA(q,dataset,curr_test);
return 0;
}
I am learning to use Boost.MPI to parallelize the large amount of computation, here below is just my simple test see if I can get MPI logic correctly. However, I did not get it to work. I used world.size()=10, there are total 50 elements in data array, each process will do 5 iteration. I would hope to update data array by having each process sending the updated data array to root process, and then the root process receives the updated data array then print out. But I only get a few elements updated.
Thanks for helping me.
#include <boost/mpi.hpp>
#include <iostream>
#include <cstdlib>
namespace mpi = boost::mpi;
using namespace std;
#define max_rows 100
int data[max_rows];
int modifyArr(const int index, const int arr[]) {
return arr[index]*2+1;
}
int main(int argc, char* argv[])
{
mpi::environment env(argc, argv);
mpi::communicator world;
int num_rows = 50;
int my_number;
if (world.rank() == 0) {
for ( int i = 0; i < num_rows; i++)
data[i] = i + 1;
}
broadcast(world, data, 0);
for (int i = world.rank(); i < num_rows; i += world.size()) {
my_number = modifyArr(i, data);
data[i] = my_number;
world.send(0, 1, data);
//cout << "i=" << i << " my_number=" << my_number << endl;
if (world.rank() == 0)
for (int j = 1; j < world.size(); j++)
mpi::status s = world.recv(boost::mpi::any_source, 1, data);
}
if (world.rank() == 0) {
for ( int i = 0; i < num_rows; i++)
cout << "i=" << i << " results = " << data[i] << endl;
}
return 0;
}
Your problem is probably here:
mpi::status s = world.recv(boost::mpi::any_source, 1, data);
This is the only way data can get back to the master node.
However, you do not tell the master node where in data to store the answers it is getting. Since data is the address of the array, everything should get stored in the zeroth element.
Interleaving which elements of the array you are processing on each node is a pretty bad idea. You should assign blocks of the array to each node so that you can send entire chunks of the array at once. That will reduce communication overhead significantly.
Also, if your issue is simply speeding up for loops, you should consider OpenMP, which can do things like this:
#pragma omp parallel for
for(int i=0;i<100;i++)
data[i]*=4;
Bam! I just split that for loop up between all of my processes with no further work needed.
I need to perform some regexp operations on binary data. I wrote a function to convert QByteArray data in a hexa string representation. Each byte is prepended by 'x' for parsing purpose.
How could this code be optimized?
QByteArray data;
QByteArray newData;
for (int i = 0; i < data.size(); i++) {
QString hex;
hex.setNum(data[i], 16);
if (data[i] < 10) {
hex.prepend("x0");
} else {
hex.prepend("x");
}
newData.append(hex.toLatin1());
}
The code you posted has two bugs in it that I corrected.
1) Assuming you always want two hex digits you want to check if the value is less than 16, not 10.
2) QString::setNum has no overload for char, so the value is promoted to a larger type. For a value like 128, which is negative in a signed char, you would get x0ffffffffffffff80 due to sign extension.
The function foo1 is your original code with the bugs fixed, and foo2 is a more optimal version that avoids creating a temporary QString since the conversion to unicode and back isn't free, and prepending values to a string requires additional copying.
I used QElapsedTimer because on Windows where I am testing it uses the high resolution PerformanceCounter clock. If you are on another platform it might be less accurate. You can see the different types of clocks it may use in the documentation.
Set display_converted_string to true if you want the converted string printed to verify they are identical.
#include <QString>
#include <QByteArray>
#include <QElapsedTimer>
#include <iostream>
QByteArray foo1(QByteArray data)
{
QByteArray newData;
for (int i = 0; i < data.size(); i++) {
unsigned char c = data[i];
QString hex;
hex.setNum(c, 16);
if (c < 16) {
hex.prepend("x0");
} else {
hex.prepend("x");
}
newData.append(hex.toLatin1());
}
return newData;
}
QByteArray foo2(QByteArray data)
{
static const char digits[] = {'0','1','2','3','4','5','6','7',
'8','9','a','b','c','d','e','f'};
QByteArray newData;
newData.reserve(data.size() * 3);
for (int i = 0; i < data.size(); i++)
{
unsigned char c = data[i];
newData.append('x');
newData.append(digits[(c >> 4) & 0x0f]);
newData.append(digits[c & 0x0f]);
}
return newData;
}
int main()
{
const int iterations = 10000;
const bool display_converted_string = false;
QElapsedTimer t;
std::cout << "Using clock type " << t.clockType() << ".\n";
QByteArray data(256, 0);
QByteArray newData;
qint64 elapsed1 = 0, elapsed2 = 0;
//Set the values in data to 0-255 to make sure all values are converted properly.
for(int i = 0; i < data.size(); ++i)
{
data[i] = i;
}
t.start();
for(int i = 0; i < iterations; ++i)
{
newData = foo1(data);
}
elapsed1 = t.nsecsElapsed();
std::cout << "foo1 elapsed time = " << elapsed1 << "\n";
if(display_converted_string)
{
std::cout << "newData = " << newData.data() << "\n";
}
t.restart();
for(int i = 0; i < iterations; ++i)
{
newData = foo2(data);
}
elapsed2 = t.nsecsElapsed();
std::cout << "foo2 elapsed time = " << elapsed2 << "\n";
if(display_converted_string)
{
std::cout << "newData = " << newData.data() << "\n";
}
return 0;
}
For the quick sort algorithm(recursive), every time when it calls itself, it have the condition if(p < r). Please correct me if I am wrong: as far as I know, for every recursive algorithm, it has a condition as the time when it entered the routine, and this condition is used to get the base case. But I still cannot understand how to correctly set and test this condition ?
void quickSort(int* arr, int p, int r)
{
if(p < r)
{
int q = partition(arr,p,r);
quickSort(arr,p,q-1);
quickSort(arr,q+1,r);
}
}
For my entire code, please refer to the following:
/*
filename : main.c
description: quickSort algorithm
*/
#include<iostream>
using namespace std;
void exchange(int* val1, int* val2)
{
int temp = *val1;
*val1 = *val2;
*val2 = temp;
}
int partition(int* arr, int p, int r)
{
int x = arr[r];
int j = p;
int i = j-1;
while(j<=r-1)
{
if(arr[j] <= x)
{
i++;
// exchange arr[r] with arr[j]
exchange(&arr[i],&arr[j]);
}
j++;
}
exchange(&arr[i+1],&arr[r]);
return i+1;
}
void quickSort(int* arr, int p, int r)
{
if(p < r)
{
int q = partition(arr,p,r);
quickSort(arr,p,q-1);
quickSort(arr,q+1,r);
}
}
// driver program to test the quick sort algorithm
int main(int argc, const char* argv[])
{
int arr1[] = {13,19,9,5,12,8,7,4,21,2,6,11};
cout <<"The original array is: ";
for(int i=0; i<12; i++)
{
cout << arr1[i] << " ";
}
cout << "\n";
quickSort(arr1,0,11);
//print out the sorted array
cout <<"The sorted array is: ";
for(int i=0; i<12; i++)
{
cout << arr1[i] << " ";
}
cout << "\n";
cin.get();
return 0;
}
Your question is not quite clear, but I will try to answer.
Quicksort works by sorting smaller and smaller arrays. The base case is an array with less than 2 elements because no sorting would be required.
At each step it finds a partition value and makes it true that all the values to the left of the partition value are smaller and all values to the right of the partition value are larger. In other words, it puts the partition value in the correct place. Then it recursively sorts the array to the left of the partition and the array to right of the partition.
The base case of quicksort is an array with one element because a one element array requires no sorting. In your code, p is the index of the first element and r is the index of the last element. The predicate p < r is only true for an array of at least size 2. In other words, if p >= r then you have an array of size 1 (or zero, or nonsense) and there is no work to do.
I have written a simple program that does autocorrelation as follows...I've used pgi accelerator directives to move the computation to GPUs.
//autocorrelation
void autocorr(float *restrict A, float *restrict C, int N)
{
int i, j;
float sum;
#pragma acc region
{
for (i = 0; i < N; i++) {
sum = 0.0;
for (j = 0; j < N; j++) {
if ((i+j) < N)
sum += A[j] * A[i+j];
else
continue;
}
C[i] = sum;
}
}
}
I wrote a similar program in OpenCL, but I am not getting correct results. The program is as follows...I am new to GPU programming, so apart from hints that could fix my error, any other advices are welcome.
__kernel void autocorrel1D(__global double *Vol_IN, __global double *Vol_AUTOCORR, int size)
{
int j, gid = get_global_id(0);
double sum = 0.0;
for (j = 0; j < size; j++) {
if ((gid+j) < size)
{
sum += Vol_IN[j] * Vol_IN[gid+j];
}
else
continue;
}
barrier(CLK_GLOBAL_MEM_FENCE);
Vol_AUTOCORR[gid] = sum;
}
Since I have passed the dimension to be 1, so I am considering my get_global_size(0) call would give me the id of the current block, which is used to access the input 1d array.
Thanks,
Sayan
The code is correct. As far as I know, that should run fine and give corret results.
barrier(CLK_GLOBAL_MEM_FENCE); is not needed. You'll get more speed without that sentence.
Your problem should be outside the kernel, check that you a re passing correctly the input, and you are taking out of GPU the correct data.
BTW, I supose you are using a double precision suported GPU as you are doing double calcs.
Check that you are passing also double values. Remember you CAN't point a float pointer to a double value, and viceversa. That will give you wrong results.