Weird behavior of dpc++ code after running it on FPGA device - intel

I am using DPC++ to accelerate the k-NN algorithm on an FPGA device. The following is the code I wrote for the Euclidean distance. The problem is that the FPGA emulation works with no problems, while running it on FPGA hardware (Intel Arria 10 oneAPI) gives -nan for every value in the resulting buffer, which suggests something goes wrong in the parallel_for loop. I can't find anything wrong with it, and the emulation worked.
I am using the Intel DevCloud platform.
std::vector<double> distance_calculation_FPGA(queue& q, const std::vector<std::vector<double>>& dataset, const std::vector<double>& curr_test) {
    std::cout<<"convert 2D to 1D"<<std::endl;
    std::vector<double>linear_dataset;
    for (int i = 0; i < dataset.size(); ++i) {
        for (int j = 0; j < dataset[i].size(); ++j) {
            linear_dataset.push_back(dataset[i][j]);
        }
    }
    std::cout<<"buffering"<<std::endl;
    range<1> num_items{dataset.size()};
    std::vector<double>res;
    //std::cout << "im in" << std::endl;
    res.resize(dataset.size());
    buffer dataset_buf(linear_dataset);
    buffer curr_test_buf(curr_test);
    buffer res_buf(res.data(), num_items);
    std::cout<<"submit a job"<<std::endl;
    auto start = std::chrono::high_resolution_clock::now();
    {
        q.submit([&](handler& h) {
            accessor a(dataset_buf, h, read_only);
            accessor b(curr_test_buf, h, read_only);
            accessor dif(res_buf, h, write_only, no_init);
            h.parallel_for(num_items, [=](auto i) {
                for (int j = 0; j < 5; ++j) {
                    dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                }
                // out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
            });
        }).wait();
    }
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";
    /* Iterative distance calculation
    for (int i = 0; i < dataset.size(); ++i) {
        double dis = 0;
        for (int j = 0; j < dataset[i].size(); ++j) {
            dis += (curr_test[j] - dataset[i][j]) * (curr_test[j] - dataset[i][j]);
        }
        res.push_back(dis);
    }
    */
    return res;
}
results with fpga_emulation: ./knn.fpga_emu
results for fpga hardware: ./knn.fpga

A question on your usage: with something like a NaN, we are usually looking at uninitialized memory (or a divide by 0, which you don't have). Is it possible that the ranges are somehow off on the FPGA, or that the values aren't properly initialized for the array indices?
Sorry, I know that's pretty basic, but without your dataset I'm not 100% sure I can reproduce it.
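One candidate worth ruling out, assuming the kernel is exactly as posted: dif is created with write_only and no_init, yet dif[i] += ... reads the element before writing it. With no_init the runtime is not required to copy the host contents to the device, so the starting value is unspecified and may differ between the emulator and the hardware. Below is a minimal sketch of the usual pattern, written as plain host C++ so that it runs without a SYCL toolchain. The name squared_distances and the dims parameter are illustrative and not part of the original code. The idea is to accumulate into a local variable and store the result once.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side equivalent of the kernel body: one output per row,
// accumulated in a local variable and stored exactly once.
std::vector<double> squared_distances(const std::vector<double>& flat,
                                      const std::vector<double>& query,
                                      std::size_t dims) {
    const std::size_t rows = flat.size() / dims;
    std::vector<double> out(rows);
    for (std::size_t i = 0; i < rows; ++i) {      // in SYCL: one work-item per i
        double sum = 0.0;                         // starts from a known value
        for (std::size_t j = 0; j < dims; ++j) {
            const double d = query[j] - flat[i * dims + j];
            sum += d * d;
        }
        out[i] = sum;                             // plain store, out[i] is never read
    }
    return out;
}
```

In the kernel, the same shape would be a local double sum inside the parallel_for lambda followed by a single dif[i] = sum, which matches what a write_only accessor promises.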

Related

Writing to Global Memory Causing Crash in OpenCL in For Loop

One of my OpenCL helper functions writes to global memory in one place without any trouble, and the kernel executes normally. But when the same write is performed directly after the following line, the kernel freezes or crashes and my program can't function.
The values in this function change (different values for an NDRange of 2^16), so the loops change as well, and not all threads can execute the same code because of the conditionals.
Why exactly is this an issue? Am I missing some kind of memory blocking or something?
void add_world_seeds(yada yada yada...., const uint global_id, __global long* world_seeds)
for (; indexer < (1 << 16); indexer += increment) {
long k = (indexer << 16) + c;
long target2 = (k ^ e) >> 16;
long second_addend = get_partial_addend(k, x, z) & MASK_16;
if (ctz(target2 - second_addend) < mult_trailing_zeroes) { continue; }
long a = (((first_mult_inv * (target2 - second_addend)) >> mult_trailing_zeroes) ^ (J1_MUL >> 32)) & mask;
for (; a < (1 << 16); a += increment) {
world_seeds[global_id] = (a << 32) + k; //WORKS HERE
if (get_population_seed((a << 32) + k, x, z) != population_seed_state) { continue; }
world_seeds[global_id] = (a << 32) + k; //DOES NOT WORK HERE
}
}
for (; a < (1 << 16); a += increment) {
world_seeds[global_id] = (a << 32) + k; //WORKS HERE
if (get_population_seed((a << 32) + k, x, z) != population_seed_state) { continue; }
world_seeds[global_id] = (a << 32) + k; //DOES NOT WORK HERE
}
There was in fact a bug causing undefined behavior in the code. The main reversal kernel had an argument called "increment", and inside that same kernel I defined another variable called increment. It compiled fine but led to wildly wrong results and memory crashes.
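As an illustration of how this kind of name clash goes unnoticed, here is a small sketch in plain C++ rather than OpenCL C. The function count_steps is hypothetical; it only shows that a variable declared in a nested scope silently hides a parameter of the same name.

```cpp
// Hypothetical example: the inner declaration hides the parameter.
long count_steps(long increment) {
    long steps = 0;
    {
        long increment = 1;   // shadows the parameter; legal in a nested scope
        for (long i = 0; i < 8; i += increment) {
            ++steps;
        }
    }
    return steps;             // the caller's argument is never used
}
```

For host code, GCC and Clang can report this with -Wshadow. Whether a given OpenCL kernel compiler offers an equivalent diagnostic is something to check in its documentation.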

Inverting a ZZ_p matrix in NTL

I am trying to generate a random binary matrix and its inverse mod q where q is a power of 2. Sometimes when the determinant of my matrix is invertible modulo q (so the matrix over Z_q is invertible), I am getting the error "InvMod:inverse undefined Aborted (core dumped)" and other times the inverse is computed. What am I doing incorrectly?
#include <iostream>
//NTL files
#include <NTL/ZZ_p.h>
#include <NTL/vec_vec_ZZ_p.h>
#include <NTL/LLL.h>
#include <NTL/matrix.h>
#include <NTL/vector.h>
#include <NTL/tools.h>
#include <NTL/ZZ.h>
#include <NTL/vec_vec_ZZ.h>
using namespace std;
using namespace NTL;
int main(){//task generate a random matrix S with 0/1 entries stored as a ZZ_p matrix, then generate a random, invertible S
    int nn = 8;
    ZZ n = ZZ(nn);
    ZZ N = ZZ(0);
    ZZ q; power2(q, 4);
    ZZ_p::init(q);
    mat_ZZ S; S.SetDims(nn,nn);
    for(int i = 0; i<nn; i++){
        for(int j = 0; j<nn; j++){
            S[i][j] = RandomBits_ZZ(1);
        }
    }
    mat_ZZ_p S1; S1.SetDims(nn,nn);//copy to ZZ_P
    mat_ZZ_p R; R.SetDims(nn,nn);//will set to inverse if
    cout<<"The random matrix is S = "<<endl; //print S
    for(int i = 0; i<nn; i++){
        for(int j=0; j<n;j++){
            cout<<S[i][j]<<", ";
        } cout<<endl;
    }
    ZZ d; determinant(d,S); ZZ_p d1; conv(d1, d % q);
    if(GCD(q,d) == 1){//convert to mod q datatype
        for(int i = 0; i<nn; i++){
            for(int j = 0; j<nn; j++){
                conv(S1[i][j], S[i][j]);
            }
        }
        //let's invert the matrix and print it!
        cout<<"The random matrix is R = "<<endl; //print R
        R = inv(S1); //mul(R,R,S1);
        for(int i = 0; i<nn; i++){
            for(int j=0; j<n;j++){
                cout<<R[i][j]<<", ";
            } cout<<endl;
        }
    }
    cout<<endl<<"det of S is "<<d<<" and this mod q is "<<d1<<endl;
    cout<<"Our modulus is "<< q <<endl;
    return 0;
}
If the determinant is invertible mod q, this only means that an inverse matrix exists. The algorithm that computes this matrix can still reach a point where it needs the inverse of an element that doesn't have one.
You don't have this problem if q is prime.
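To make the failure mode concrete: modulo 16 the matrix [[2, 1], [1, 1]] has determinant 1, so it is invertible, yet its top-left entry 2 has no inverse, and an elimination that takes it as the pivot has nothing to divide by. The sketch below is plain C++, not NTL, and inv_odd and inverse_mod_pow2 are hypothetical helpers. It assumes the modulus is 2^k with 1 <= k <= 63 and works around the problem by accepting only odd pivots. When the determinant is odd, the matrix is invertible mod 2, so an odd entry is always available in the current column.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<std::uint64_t>>;

// Inverse of an odd a modulo 2^k by Newton iteration (arithmetic wraps mod 2^64).
std::uint64_t inv_odd(std::uint64_t a, std::uint64_t mask) {
    std::uint64_t x = a;                          // correct to 3 bits: a*a == 1 (mod 8)
    for (int i = 0; i < 5; ++i) x *= 2 - a * x;   // 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits
    return x & mask;
}

// Gauss-Jordan inversion modulo 2^k (assumes 1 <= k <= 63).
// Returns false if no odd pivot exists, i.e. the determinant is even.
bool inverse_mod_pow2(Matrix a, unsigned k, Matrix& out) {
    const std::uint64_t mask = (std::uint64_t{1} << k) - 1;
    const std::size_t n = a.size();
    Matrix inv(n, std::vector<std::uint64_t>(n, 0));
    for (std::size_t i = 0; i < n; ++i) inv[i][i] = 1;

    for (std::size_t c = 0; c < n; ++c) {
        // A nonzero pivot is not enough: it must be a unit, i.e. odd.
        std::size_t p = c;
        while (p < n && (a[p][c] & 1) == 0) ++p;
        if (p == n) return false;
        std::swap(a[p], a[c]);
        std::swap(inv[p], inv[c]);

        // Scale the pivot row so the pivot becomes 1.
        const std::uint64_t s = inv_odd(a[c][c] & mask, mask);
        for (std::size_t j = 0; j < n; ++j) {
            a[c][j] = (a[c][j] * s) & mask;
            inv[c][j] = (inv[c][j] * s) & mask;
        }
        // Clear column c in every other row.
        for (std::size_t r = 0; r < n; ++r) {
            if (r == c) continue;
            const std::uint64_t f = a[r][c];
            for (std::size_t j = 0; j < n; ++j) {
                a[r][j] = (a[r][j] - f * a[c][j]) & mask;
                inv[r][j] = (inv[r][j] - f * inv[c][j]) & mask;
            }
        }
    }
    out = inv;
    return true;
}
```

Calling inverse_mod_pow2 on [[2, 1], [1, 1]] with k = 4 skips the even entry 2, pivots on the 1 below it, and produces an inverse, whereas a routine that commits to the first nonzero pivot would stop on the 2.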
By the way, here is a simplified version of your code.
#include <iostream>
//NTL files
#include <NTL/mat_ZZ_p.h>
using namespace std;
using namespace NTL;
int main()
{//task generate a random matrix S with 0/1 entries stored as a ZZ_p matrix, then generate a random, invertible S
    int nn = 8;
    ZZ q;
    power2(q, 4);
    ZZ_p::init(q);
    mat_ZZ_p S;
    S.SetDims(nn, nn);
    for(int i = 0; i < nn; i++)
    {
        for(int j = 0; j < nn; j++)
        {
            S[i][j] = conv<ZZ_p>(RandomBits_ZZ(1));
        }
    }
    mat_ZZ_p R;
    R.SetDims(nn, nn);//will set to inverse if
    cout << "The random matrix is S = " << endl << S;
    ZZ_p d;
    determinant(d, S);
    cout << endl << "det(S) = " << d << endl;
    cout << "q = " << q << endl;
    if(GCD(conv<ZZ>(d), q) == 1)
    {
        // let's invert the matrix and print it!
        R = inv(S);
        cout << "The random matrix is R = " << R << endl;
    }
    return 0;
}

Global and local Ids usage in an Algorithm in OpenCL

Well, I posted some weeks ago about an error I had in my OpenCL implementation, but it seems I have to start from the beginning. So, how should the following algorithm be implemented in OpenCL?
int m = 10;
int n = 10;
//arrA[] has m elements
//arrB[] has n elements
//arrC[] has m * n elements
for(int i = 0; i < m; i++)
{
    for(int j = 0; j < n; j++)
    {
        arrC[i * n + j] = arrA[i] * arrB[j];
    }
}
For this case I just need to know how to handle this with the global and local ids, because that is where I am a little lost. Thank you so much.
This is the code I currently have (it is an extract of the real code, which will perform a reduction because I need to get a maximum value).
"sampleKernel(__global const double *bufferX,"
" __global const double *bufferY,"
" __global double* result,"
" __const int lengthX,"
" __const int lengthY){"
" const int index_a = get_global_id(0);"//Get the global indexes for 2D reference
" const int index_b = get_global_id(1);"
" const int local_index = get_local_id(0);"//Current thread id -> Should be the same as index_a * lengthY + index_b;
" if (local_index < (lengthX * lengthY)) {"// Load data into local memory
" if(index_a < lengthX && index_b < lengthY)"
" {"
" result[local_index] = bufferX[index_a] * bufferY[index_b];"
" }"
" } "
"}";
Maybe I should use a get_local_id(1) too, and use the thread Id as local_id_1 * N + local_id_2 where N is the maximum local_id_2 value.
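For a 2D launch, a common approach (not taken from any answer in this thread) is to use one work-item per output element, with the two global ids acting as i and j and the flat output index computed as i * n + j. The local id only identifies a work-item within its work-group, so it is not a suitable index into the full result. Here is a sketch in plain C++, where the two loops stand in for the NDRange and outer_product is a hypothetical name:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the kernel: each (i, j) pair corresponds to
// one work-item of a 2D NDRange of size m x n.
std::vector<double> outer_product(const std::vector<double>& a,
                                  const std::vector<double>& b) {
    const std::size_t m = a.size();
    const std::size_t n = b.size();
    std::vector<double> c(m * n);
    for (std::size_t i = 0; i < m; ++i) {         // plays the role of get_global_id(0)
        for (std::size_t j = 0; j < n; ++j) {     // plays the role of get_global_id(1)
            c[i * n + j] = a[i] * b[j];           // row-major flat index
        }
    }
    return c;
}
```

In the kernel itself, the loop bodies would remain and the loops would be replaced by the bounds check on index_a and index_b that the posted code already has.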

Converting 1d vector into 2d vector

#include <iostream>
#include <vector>
int main()
{
    std::vector<std::vector<double> > DV; //2d vector
    std::vector<double>temp(8,0.0); //1d vector
    temp[0] = 1;
    temp[1] = 2;
    temp[2] = 3;
    temp[3] = 4;
    temp[4] = 5;
    temp[5] = 6;
    temp[6] = 7;
    temp[7] = 8;
    DV.resize(3, temp);
    for (int i = 0; i < DV.size(); i++)
    {
        for (int j = 0; j < DV.size(); j++)
        {
            std::cout << DV[i][j];
        }
    }
    std::cin.get();
}
The conversion actually works, but it does not give the expected result. The output should be:
1 2 3
4 5 6
7 8
and it outputs:
123123123
Thanks in advance
I'm not aware of a method to automagically turn a 1D vector into a 2D one. It's not too hard to do manually, though...
typedef std::vector<std::vector<double>> DoubleVector2D;
DoubleVector2D boxed(size_t cols, std::vector<double> values) {
    DoubleVector2D result;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i % cols == 0) result.resize(result.size() + 1);
        result[i / cols].push_back(values[i]);
    }
    return result;
}
With that done, you can call boxed(3, temp) to get back a vector of vectors of doubles. At that point, you just have to loop over them.
for (auto row : DV) {
    for (auto value : row) {
        std::cout << value << ' ';
    }
    std::cout << "\n";
}
If you're stuck without decent C++11 support, you may need to use counters or iterators.
for (int row = 0; row < DV.size(); ++row) {
    for (int col = 0; col < DV[row].size(); ++col) {
        std::cout << DV[row][col] << ' ';
    }
    std::cout << '\n';
}
Change these lines:
for (int i = 0; i < DV.size(); i++){
    for (int j = 0; j < DV.size(); j++){
        std::cout << DV[i][j] << ", ";
    }
    std::cout << std::endl;
}
Your issue is how you are printing your values to the standard output.

OpenCV populating CvMat with data and verifying it

I have a vector of TrainingSets(struct below) called data
class TrainingSet
{
public:
    int time;
    float input[2];
    float output[3*NUM_TRACKING_POINTS];
    TrainingSet(int t, float in[2], float out[3*NUM_TRACKING_POINTS])
    {
        time = t;
        for (int i = 0; i < 2; i++)
            input[i] = in[i];
        for (int i = 0; i < 3*NUM_TRACKING_POINTS; i++)
            output[i] = out[i];
    }
    TrainingSet()
    {
    }
};
And then I try to take the contents of this Vector, and put them into CvMats for the purpose of training a Neural Network.
int datasize = data.size();
float** in = new float*[datasize];
float** out = new float*[datasize];
for (int i = 0; i < datasize; i++) {
    in[i] = new float[2*TIME_STEPS];
    out[i] = new float[3*NUM_TRACKING_POINTS];
}
for ( int i = 0 ; i < datasize; i ++)
{
    // get the first set in the sequence.
    TrainingSet tset = data.front();
    data.pop();
    // get the inputs
    in[i] = new float[2*TIME_STEPS];
    in[i][0] = tset.input[0];
    in[i][1] = tset.input[1];
    // get the outputs
    out[i] = new float[3*NUM_TRACKING_POINTS];
    for (int j = 0; j < 3*NUM_TRACKING_POINTS; j++)
        out[i][j] = tset.output[j];
    for (int j = 2; j < 2*TIME_STEPS; j++)
    {
        if (i == 0)
            in[i][j] = 0.0f;
        else
            in[i][j] = in[i - 1][j - 2];
    }
}
// make matrices from data.
CvMat *trainInput = cvCreateMat(datasize, 2*TIME_STEPS, CV_32FC1);
cvInitMatHeader(trainInput, datasize, 2*TIME_STEPS, CV_32FC1, in);
CvMat *trainOutput = cvCreateMat(datasize, 3*NUM_TRACKING_POINTS, CV_32FC1);
cvInitMatHeader(trainOutput, datasize, 3*NUM_TRACKING_POINTS, CV_32FC1, out);
for (int x = 0; x < datasize; x++)
{
    cout << "IN: ";
    for (int y = 0; y < 2*TIME_STEPS; y++)
        cout << cvmGet(trainInput, x, y) << " ";
    cout << endl << "IN: ";
    for (int y = 0; y < 2*TIME_STEPS; y++)
        cout << in[x][y] << " ";
    cout << endl << "OUT: ";
    for (int y = 0; y < 3 * NUM_TRACKING_POINTS; y++)
        cout << cvmGet(trainOutput, x, y) << " ";
    cout << endl << "OUT: ";
    for (int y = 0; y < 3 * NUM_TRACKING_POINTS; y++)
        cout << out[x][y] << " ";
    cout << endl << endl;
}
That last for loop checks whether the matrices' contents are the data I just fed them, but they don't match. The matrices seem to hold completely different data.
Any thoughts on what is going wrong?
It seems to me that in and out are not contiguous arrays but arrays of pointers.
I think a CvMat needs a contiguous block of memory to be able to operate on it.
Also, once you have created the array, you don't need to create a new CvMat from it; just use cvSetData(header, data, step).
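A minimal sketch of the contiguous layout this answer is asking for, in plain C++ without OpenCV. The helper flatten_rows is hypothetical; it copies separately allocated rows, such as in[i] and out[i] from the question, into one row-major buffer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: copy n_rows separately allocated rows of n_cols floats
// into one contiguous, row-major buffer.
std::vector<float> flatten_rows(const float* const* rows,
                                std::size_t n_rows, std::size_t n_cols) {
    std::vector<float> flat;
    flat.reserve(n_rows * n_cols);
    for (std::size_t r = 0; r < n_rows; ++r) {
        flat.insert(flat.end(), rows[r], rows[r] + n_cols);
    }
    return flat;
}
```

Element (r, c) then lives at flat[r * n_cols + c], and flat.data() is the kind of pointer a matrix header with a row step of n_cols * sizeof(float) can wrap, provided the vector outlives the header.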
