I am trying to use cblas_dtrsv properly, but I do not get the right output and I do not know why. Here is my example (dtrsv_example.c):
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

int main()
{
    double *A, *b, *x;
    int m, n, k, i, j;
    m = 4, k = 4, n = 4;
    printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n"
            " performance \n\n");
    A = (double *)mkl_malloc( m*k*sizeof( double ), 64 );
    b = (double *)mkl_malloc( n*sizeof( double ), 64 );
    x = (double *)mkl_malloc( n*sizeof( double ), 64 );
    if (A == NULL || b == NULL || x == NULL) {
        printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
        mkl_free(A);
        mkl_free(b);
        mkl_free(x);
        return 1;
    }
    A[0] = 11;
    for (i = 0; i < m; i++) {
        for (j = 0; j <= i; j++) {
            A[j + i*m] = (double)(j+i*m);
        }
    }
    for (i = 0; i < n; i++) {
        x[i] = (i+1)*5.0;
    }
    printf ("\n Computations completed.\n\n");
    printf ("\n Result x: \n");
    for (j = 0; j < n; j++) {
        printf ("%f\n", x[i]);
    }
    printf ("\n Deallocating memory \n\n");
    mkl_free(A);
    mkl_free(b);
    mkl_free(x);
    printf (" Example completed. \n\n");
    return 0;
}
Compilation seems fine:
icc -c -Wall -c -o dtrsv_example.o dtrsv_example.c
icc dtrsv_example.o -o dtrsv_example -L/opt/intel/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm
However, I get the wrong result:
./dtrsv_example
Computations completed.
Result x:
0.000000
0.000000
0.000000
0.000000
Deallocating memory
Example completed.
Any ideas of what I might be doing wrong here?
Even though I thought I had carefully checked it, after a break I realized my beginner mistake:
for (j = 0; j < n; j++) {
    printf ("%f\n", x[i]);
}
it should be x[j] instead!
Hopefully other people can use my example to understand how the cblas_dtrsv interface is used.
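For anyone using this as a reference, here is a minimal self-contained sketch of the call itself, since the snippet above never shows the actual cblas_dtrsv line; the matrix values and the CblasRowMajor/CblasLower/CblasNonUnit flags are illustrative assumptions, not necessarily what the original code used:

#include <stdio.h>
#include "mkl.h"

int main(void)
{
    /* Illustrative system: solve L*x = b in place, where L is the
       lower triangle of A stored row-major. */
    double A[4*4] = { 2, 0, 0, 0,
                      1, 2, 0, 0,
                      1, 1, 2, 0,
                      1, 1, 1, 2 };
    double x[4]   = { 2, 3, 4, 5 };   /* b on input, solution on output */

    cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                4,      /* order of A */
                A, 4,   /* matrix and its leading dimension */
                x, 1);  /* right-hand side, unit stride */

    for (int j = 0; j < 4; j++)       /* note: index j, not i */
        printf("%f\n", x[j]);         /* prints 1.0 four times */
    return 0;
}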
I am using DPC++ to accelerate a kNN algorithm on an FPGA device. The following code is what I wrote for the Euclidean distance. The problem is that the fpga_emulation build works very well with no problems, while running it on FPGA hardware (Intel Arria 10, oneAPI) gives -nan for all values in the resulting buffer, which means something went wrong in the parallel_for loop. But I can't find anything wrong with it, and the emulation worked.
I am using the Intel DevCloud platform.
std::vector<double> distance_calculation_FPGA(queue& q, const std::vector<std::vector<double>>& dataset, const std::vector<double>& curr_test) {
    std::cout << "convert 2D to 1D" << std::endl;
    std::vector<double> linear_dataset;
    for (int i = 0; i < dataset.size(); ++i) {
        for (int j = 0; j < dataset[i].size(); ++j) {
            linear_dataset.push_back(dataset[i][j]);
        }
    }
    std::cout << "buffering" << std::endl;
    range<1> num_items{dataset.size()};
    std::vector<double> res;
    //std::cout << "im in" << std::endl;
    res.resize(dataset.size());
    buffer dataset_buf(linear_dataset);
    buffer curr_test_buf(curr_test);
    buffer res_buf(res.data(), num_items);
    std::cout << "submit a job" << std::endl;
    auto start = std::chrono::high_resolution_clock::now();
    {
        q.submit([&](handler& h) {
            accessor a(dataset_buf, h, read_only);
            accessor b(curr_test_buf, h, read_only);
            accessor dif(res_buf, h, write_only, no_init);
            h.parallel_for(num_items, [=](auto i) {
                for (int j = 0; j < 5; ++j) {
                    dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                }
                // out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
            });
        }).wait();
    }
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";
    /* Iterative distance calculation
    for (int i = 0; i < dataset.size(); ++i) {
        double dis = 0;
        for (int j = 0; j < dataset[i].size(); ++j) {
            dis += (curr_test[j] - dataset[i][j]) * (curr_test[j] - dataset[i][j]);
        }
        res.push_back(dis);
    }
    */
    return res;
}
results with fpga_emulation: ./knn.fpga_emu
results for fpga hardware: ./knn.fpga
A question on your usage: with something like a NaN, usually we are looking at uninitialized memory (or a divide by zero, which you don't have). Is it possible the ranges are somehow off on the FPGA and/or the values aren't properly initialized for the array indices?
Sorry, I know that's pretty basic, but without your dataset I'm not 100% sure I can reproduce it.
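If it is initialization: the res_buf accessor is created with no_init, so dif[i] += ... reads indeterminate memory on its first iteration; the emulator may happen to hand back zeroed pages, while real hardware need not. A minimal sketch of the kernel body using a local accumulator, under that assumption:

h.parallel_for(num_items, [=](auto i) {
    double sum = 0.0;                      // local accumulator: never read the
    for (int j = 0; j < 5; ++j) {          // no_init buffer before writing it
        double d = b[j] - a[i * 5 + j];
        sum += d * d;
    }
    dif[i] = sum;                          // single write per work-item
});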
Why do I get a segfault when I try to print the strlen of a string that is part of an array of strings? I can print each string - the printf works perfectly. But why does the strlen cause a segfault?
The program below first takes an input n, which is the number of strings I want to dynamically allocate. Then I allocate 32 bytes for each string:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    int i;
    int n;
    char **nums;
    scanf("%d", &n);
    printf("n = %d\n", n);
    nums = malloc(sizeof(char *) * n);
    printf("allocated nums\n");
    for (i = 0; i < n; i++) {
        nums[i] = malloc(sizeof(char) * 32);
        memset(nums[i], '\0', sizeof(char) * 32);
    }
    for (i = 0; i < n; i++) {
        scanf("%s", &nums[i]);
    }
    for (i = 0; i < n; i++) {
        // THIS PRINTS FINE
        printf("string = %s\n", &nums[i]);
        // SEGFAULT HERE IMMEDIATELY
        printf("length = %d\n", strlen(nums[i]));
    }
    return 0;
}
Here is the console output. As a test, I entered in n=3, followed by the numbers 45, 46, and 47:
3
n = 3
allocated nums
45
46
47
string = 45
Segmentation fault (core dumped)
Additionally, I get a segfault when I try to access an individual character in each string. Again, the first printf in the outer for loop prints the string, then I get a segfault accessing nums[i][k]:
int i, k = 0;
for (i = 0; i < n; i++) {
    printf("Printiiing: %s\n", &nums[i]);
    // WHY DOES THIS SEGFAULT???
    //printf("first char = %c\n", nums[i][0]);
    while (k < 32 && nums[i][k] != '\0') {
        // THIS CAUSES A SEG FAULT
        printf("char = %c", (nums[i])[k]);
        k++;
    }
}
This is fine:
for (i = 0; i < n; i++) {
    nums[i] = malloc(sizeof(char) * 32);
    memset(nums[i], '\0', sizeof(char) * 32);
}
After this loop, each nums[i] is a pointer to a 32-byte buffer.
But this corrupts (overwrites) all the pointers, instead of reading strings into the allocated buffers:
for (i = 0; i < n; i++) {
    scanf("%s", &nums[i]);
}
To fix the bug, use: scanf("%s", nums[i]);.
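For completeness, a minimal corrected version of the read-and-print loops (the %31s width specifier and the %zu length format are extra safeguards, not part of the original code):

for (i = 0; i < n; i++) {
    scanf("%31s", nums[i]);   /* nums[i], not &nums[i]; the width guards the 32-byte buffer */
}
for (i = 0; i < n; i++) {
    printf("string = %s\n", nums[i]);
    printf("length = %zu\n", strlen(nums[i]));   /* strlen returns size_t; print it with %zu */
}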
&nums[i] and nums[i] are different pointers: &nums[i] is the address of the pointer slot inside the array, while nums[i] is the address of the 32-byte buffer that slot points to. You can see the difference by comparing strlen((const char*)&nums[i]) with strlen(nums[i]).
The explanation by @Employee of Russia is exactly the reason.
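To make the difference concrete, a minimal illustration:

char **slot = &nums[i];  /* address of the pointer stored in the array   */
char  *buf  = nums[i];   /* the 32-byte buffer that pointer points to    */
/* scanf("%s", slot) writes characters over the pointer itself, so every */
/* later dereference of nums[i] (strlen, nums[i][k]) reads a garbage     */
/* address - hence the segfaults.                                        */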
I am looking for a simple example where vectorization and parallelization on a Xeon Phi gives better performance than a Xeon alone. Could you help me, please?
I am trying the following example. I comment out lines 14, 18 and 19 to run on the Xeon only and uncomment them for the Xeon Phi, but the Xeon alone still has better performance than the Xeon Phi.
1.void main(){
2.double *a, *b, *c;
3.int i,j,k, ok, n=100;
4.int nPadded = ( n%8 == 0 ? n : n + (8-n%8) );
5.ok = posix_memalign((void**)&a, 64, n*nPadded*sizeof(double));
6.ok = posix_memalign((void**)&b, 64, n*nPadded*sizeof(double));
7.ok = posix_memalign((void**)&c, 64, n*nPadded*sizeof(double));
8.for(i=0; i<n; i++)
9.{
10. a[i] = (int) rand();
11. b[i] = (int) rand();
12. c[i] = 0.0;
13.}
14.#pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded))
15.#pragma omp parallel for
16.for( i = 0; i < n; i++ )
17. for( k = 0; k < n; k++ )
18. #pragma vector aligned
19. #pragma ivdep
20. for( j = 0; j < n; j++ ){
21. c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];
22.}
First, a couple of words about autovectorization. The advantage of autovectorization is simplicity: you set some keywords, magic happens, and the compiler makes fast code for you. If you want to go this way, try this manual.
The disadvantage of this approach is that there is no easy way to understand how the compiler did its work. In the vectorization report you will see "LOOP WAS VECTORIZED" or "LOOP WAS NOT VECTORIZED". But if you want to truly understand how your code works, the only way is to look at your program's assembly. It is not a problem to get the assembly: compile the program with -fcode-asm. But I think that if you need to read assembly just to check how the "simple autovectorization" method worked, it is not so simple.
The alternative to autovectorization is intrinsics (actually, not the only alternative). Think of intrinsics as assembly wrapped in C functions; many intrinsics internally wrap a single assembly instruction.
I recommend using this intrinsics guide.
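For example, a minimal illustration of what a single intrinsic does (my own sketch, not taken from the guide):

#include <xmmintrin.h>

/* _mm_add_ps wraps the single SSE instruction addps:
   four single-precision additions in one instruction. */
__m128 add_four(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}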
So the steps of my simple approach:
Make a single-threaded reference implementation. You will use it to check the correctness of the intrinsics version.
Implement an SSE intrinsics version. SSE intrinsics are much simpler and can be tested on the Xeon.
Implement the AVX-512 version for the Xeon Phi.
Measure your speed.
Let's do it with your program.
There are several differences from your program:
I use float instead of double.
I use _mm_malloc instead of posix_memalign.
I assume n is divisible by 16 with no remainder (16 floats fit in a 512-bit vector register). I don't deal with loop peeling in this example.
I use native mode instead of offload mode. KNL is bootable, so it is no longer necessary to use offload mode.
Also, I think your program is not correct, because it modifies the c array from several threads at the same moment of time. But let's say that is not important, and we just need some computational job.
My code's run times:
Intel Xeon 5680
reference calc time: 97.677505 seconds
Intrinsics calc time: 6.189296 seconds
Intel Xeon Phi (KNC) SE10X
reference calc time: 199.0 seconds
Intrinsics calc time: 2.78 seconds
Code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
#include "immintrin.h"
#include <assert.h>

#define F_E_Q(X,Y,N) (round((X) * pow(10, N)-(Y) * pow(10, N)) == 0)

/* debug/debug_arr are logging helpers not shown in the original post;
   minimal stand-ins (an assumption) so the listing compiles: */
#define debug(fmt, ...) printf(fmt "\n", __VA_ARGS__)
#define debug_arr(name, fmt, arr, from, to, step) do { \
    printf("%s:", (name)); \
    for (int _i = (from); _i <= (to); _i += (step)) printf(" " fmt, (arr)[_i]); \
    printf("\n"); \
} while (0)

void reference(float* a, float* b, float* c, int n, int nPadded);
void intrinsics(float* a, float* b, float* c, int n, int nPadded);
char *test() {
    int n = 4800;
    int nPadded = n;
    assert(n % 16 == 0);
    float* a = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
    float* b = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
    float* cRef = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
    float* c = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
    assert(a != NULL);
    assert(b != NULL);
    assert(cRef != NULL);
    assert(c != NULL);
    for (int i = 0, max = n*nPadded; i < max; i++) {
        a[i] = (int) rand() / 1804289408.0;
        b[i] = (int) rand() / 1804289408.0;
        cRef[i] = 0.0;
        c[i] = 0.0;
    }
    debug_arr("a", "%f", a, 0, 9, 1);
    debug_arr("b", "%f", b, 0, 9, 1);
    debug_arr("cRef", "%f", cRef, 0, 9, 1);
    debug_arr("c", "%f", c, 0, 9, 1);
    double t1 = omp_get_wtime();
    reference(a, b, cRef, n, nPadded);
    double t2 = omp_get_wtime();
    debug("reference calc time: %f", t2-t1);
    t1 = omp_get_wtime();
    intrinsics(a, b, c, n, nPadded);
    t2 = omp_get_wtime();
    debug("Intrinsics calc time: %f", t2-t1);
    debug_arr("cRef", "%f", cRef, 0, 9, 1);
    debug_arr("c", "%f", c, 0, 9, 1);
    for (int i = 0, max = n*nPadded; i < max; i++) {
        assert(F_E_Q(cRef[i], c[i], 2));
    }
    _mm_free(a);
    _mm_free(b);
    _mm_free(cRef);
    _mm_free(c);
    return NULL;
}
void reference(float* a, float* b, float* c, int n, int nPadded) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];
}
#if __MIC__
void intrinsics(float* a, float* b, float* c, int n, int nPadded) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j += 16) {
                __m512 aPart = _mm512_extload_ps(a + i*nPadded+k, _MM_UPCONV_PS_NONE, _MM_BROADCAST_1X16, _MM_HINT_NONE);
                __m512 bPart = _mm512_load_ps(b + k*nPadded+j);
                __m512 cPart = _mm512_load_ps(c + i*nPadded+j);
                cPart = _mm512_add_ps(cPart, _mm512_mul_ps(aPart, bPart));
                _mm512_store_ps(c + i*nPadded+j, cPart);
            }
}
#else
void intrinsics(float* a, float* b, float* c, int n, int nPadded) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j += 4) {
                __m128 aPart = _mm_load_ps1(a + i*nPadded+k);
                __m128 bPart = _mm_load_ps(b + k*nPadded+j);
                __m128 cPart = _mm_load_ps(c + i*nPadded+j);
                cPart = _mm_add_ps(cPart, _mm_mul_ps(aPart, bPart));
                _mm_store_ps(c + i*nPadded+j, cPart);
            }
}
#endif
This code is for the problem DIGJUMP.
It gives the correct output for all the inputs I have tried (and I have tried a lot of them). But the problem is that it gets TLE when submitted on CodeChef. I checked the editorial, and the same solution (concept-wise) gets accepted, so the algorithmic approach is correct; I must have something wrong in the implementation.
I tried for a long time, but could not figure out what is wrong.
#include <string.h>
#include <vector>
#include <queue>
#include <stdio.h>
using namespace std;

class Node
{
public:
    int idx, steps;
};

int main()
{
    char str[100001];
    scanf("%s", str);
    int len = strlen(str);
    vector<int> adj[10];
    for (int i = 0; i < len; i++)
        adj[str[i] - '0'].push_back(i);
    int idx, chi, size, steps;
    Node tmpn;
    tmpn.idx = 0;
    tmpn.steps = 0;
    queue<Node> que;
    que.push(tmpn);
    bool *visited = new bool[len];
    for (int i = 0; i < len; i++)
        visited[i] = false;
    while (!que.empty())
    {
        tmpn = que.front();
        que.pop();
        idx = tmpn.idx;
        steps = tmpn.steps;
        chi = str[idx] - '0';
        if (visited[idx])
            continue;
        visited[idx] = true;
        if (idx == len - 1)
        {
            printf("%d\n", tmpn.steps);
            return 0;
        }
        if (visited[idx + 1] == false)
        {
            tmpn.idx = idx + 1;
            tmpn.steps = steps + 1;
            que.push(tmpn);
        }
        if (idx > 0 && visited[idx - 1] == false)
        {
            tmpn.idx = idx - 1;
            tmpn.steps = steps + 1;
            que.push(tmpn);
        }
        size = adj[chi].size();
        for (int j = 0; j < size; j++)
        {
            if (visited[adj[chi][j]] == false)
            {
                tmpn.idx = adj[chi][j];
                tmpn.steps = steps + 1;
                que.push(tmpn);
            }
        }
    }
    return 0;
}
This solution won't finish in acceptable time for the problem. Remember that BFS is O(E). In a string with O(n) occurrences of some digit there are O(n^2) edges between those occurrences, and for n = 10^5, O(n^2) is too much.
This needs an optimization: once we have jumped from a node to all other nodes carrying the same digit, we should never follow same-digit edges for that digit again.
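A minimal sketch of that optimization applied to the same-digit loop above (clearing the adjacency list after its first expansion is one common way to do it; assumption: the rest of the BFS stays exactly as posted):

size = adj[chi].size();
for (int j = 0; j < size; j++)
{
    if (visited[adj[chi][j]] == false)
    {
        tmpn.idx = adj[chi][j];
        tmpn.steps = steps + 1;
        que.push(tmpn);
    }
}
adj[chi].clear();  // key line: each digit's edge list is expanded at most once,
                   // so the BFS does O(n) total work on the digit edges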
I can't find the error in this code; I've been looking at it for hours... Valgrind says:
==23114== Invalid read of size 1
==23114== Invalid write of size 1
I tried debugging with some printfs, and I think the error is in this function:
void rdm_hide(char *name, Byte* img, Byte* bits, int msg, int n, int size)
{
    FILE *fp;
    int r;
    Byte* used;
    int i = 0, j = 0;
    int p;
    fp = fopen(name, "wb");
    used = malloc(sizeof(Byte) * msg);
    for (i = 0; i < msg; i++)
        used[i] = -1;
    while (i < 3)
    {
        if (img[j] == '\n')
            i++;
        j++;
    }
    for (i = 0; i < msg; i++)
    {
        r = genrand_int32();
        p = r % n;
        if (!search(p, used, msg))
        {
            used[i] = (Byte)p;
            if (bits[i] == (Byte)0)
                img[j + p] = img[j + p] & (~1);
            else if (bits[i] == (Byte)1)
                img[j + p] = img[j + p] | 1;
        }
        else
            i--;
    }
    for (i = 0; i < size; i++)
        fputc((char) img[i], fp);
    fclose(fp);
    free(used);
}
Thanks for the help!
==23114== Invalid read of size 1
==23114== Invalid write of size 1
I am pretty sure that's not all valgrind says.
You should
Build your program with debug info (most likely the -g flag). This will let valgrind tell you exactly which line triggers the invalid read and write.
If the problem doesn't become obvious, edit your question and include the entire valgrind output.
Re-running valgrind --track-origins=yes your-exe may provide additional useful info.
Lastly, your algorithm appears to be totally bogus. As far as I can tell, j is only advanced in the first while loop (it ends up just past the third '\n' in img) and never changes after that, in which case you could compute it once and do away with j++. Also, you reference img[j + p], where p is between 0 and n-1. If n is indeed the size of img, then it's little surprise that j + p indexes outside of the img limits and triggers both errors.
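If that is the case, a minimal sketch of one possible guard (assumption: size is the full length of img and the payload must land within it), reusing the retry pattern the loop already uses for collisions:

r = genrand_int32();
p = r % n;
if (j + p >= size)   /* would read/write past the end of img */
{
    i--;             /* retry this bit with a new random offset */
    continue;
}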