Code analysis ignores OpenMP constructs; analyzing single-threaded code [duplicate] - recursion

This question already exists:
traversal binary search tree with openMP
Closed 2 years ago.
void inorderTraversal(struct node* node)
{
    double start;
    double end;
    start = omp_get_wtime();
    printf(" %d Thread : %d \n", node->data, omp_get_thread_num());
    #pragma omp parallel sections
    {
        #pragma omp section
        if (node->left)
            inorderTraversal(node->left);
        #pragma omp section
        if (node->right)
            inorderTraversal(node->right);
    }
    end = omp_get_wtime();
    printf("Elapsed time: %f s\n", end - start);
    /* I get this warning: "Analyzing single-threaded code". Why, and how can I fix it?
       The binary search tree traversal itself works, but there is no parallelism.
       I need to parallelize it with OpenMP. */
}
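A common alternative to nesting "parallel sections" inside every recursive call is to create one parallel region at the top and spawn OpenMP tasks inside the recursion. A minimal sketch, assuming the same struct node with data, left and right members (this is not the poster's original code):

#include <omp.h>
#include <stdio.h>

struct node { int data; struct node *left, *right; };

/* Walk the tree, spawning one task per subtree. */
static void traverse(struct node *n)
{
    if (!n) return;
    printf(" %d Thread : %d \n", n->data, omp_get_thread_num());
    struct node *l = n->left, *r = n->right;   /* firstprivate in the tasks below */
    #pragma omp task
    traverse(l);
    #pragma omp task
    traverse(r);
    #pragma omp taskwait   /* wait for both subtree tasks before returning */
}

void parallelTraversal(struct node *root)
{
    /* One team of threads; a single thread seeds the recursion,
       and the generated tasks are executed by the whole team. */
    #pragma omp parallel
    #pragma omp single nowait
    traverse(root);
}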

Related

Using OpenMP with GPU

Good day, everyone!
I would like to ask the community's advice about using GPU computing power instead of, or together with, the CPU.
I have a well-functioning program based on a recursive search over all possible combinations of certain events, parallelized with OpenMP to run on all available processor cores.
The C++ pseudocode is as follows:
// #includes
// function declarations

// a global variable:
QVector<QVector<QVector<float>>> variant; // (or "std::vector")

int main() {
    // read data from a file
    // convert and analyze the data
    // fill the variant variable with the current best result (here, from a pre-analysis)
    #pragma omp parallel shared(variant)
    #pragma omp master
    // call the recursive search over all variants:
    PEREBOR(Tabl_1, a, i_a, ..., rec_depth);
    return 0;
}
void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t rec_depth)
{
    // determine the bounds of the outer loop
    for (int i = quantity; i < another_quantity; i++) {
        // Tabl_1 is processed and modified to determine the number of steps of the inner loop
        for (int k = 0; k < the_quantity_just_found; k++) {
            if (rec_depth != 1) { // not yet at the lowest level: descend further
                // add the descent to the next recursion level as an OpenMP task:
                #pragma omp task
                PEREBOR(Tabl_1_COPY, a, i_a, ..., rec_depth - 1);
            }
            else { // we have reached the lowest level
                if (condition fulfilled)   // condition check - READS the variant variable
                    variant = it_is_equal_to_that_,_to_that...;
                else
                    continue;
            }
        }
    }
}
Unfortunately, I don't have a CPU with a thousand cores at my disposal, and without that the algorithm runs for a very long time. At my workplace I was advised to think about using a GPU to speed up the calculations. I have learned that OpenMP can target GPUs (especially NVIDIA ones), and that OpenACC also does this well.
In this regard, my main question is whether it is possible to run a recursive algorithm on a GPU simply and, at the same time, effectively. Can this give a noticeable speed-up relative to the CPU? If so, would OpenACC perhaps do the job better? Is it possible to hand work to the video card through "#pragma omp task", or are other directives required? And how could computations on the CPU and GPU be combined?
Thank you so much for any help!
P.S. I apologize for my English, which is not my native language :)
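For reference on the directives involved: "#pragma omp task" only distributes work across host CPU threads; GPU offload in OpenMP goes through the separate target family of directives, which favour flat, data-parallel loops over deep task recursion. Below is a minimal, illustrative sketch of the offload syntax only (a simple reduction, not the search algorithm above; it assumes a compiler built with OpenMP 4.5+ offload support):

#include <stdio.h>

int main(void)
{
    enum { N = 1 << 20 };
    static float data[N];
    for (int i = 0; i < N; i++)
        data[i] = 1.0f;

    float sum = 0.0f;
    /* Map the array to the device, spread the loop over GPU teams/threads,
       and combine the partial sums back into "sum" on the host. */
    #pragma omp target teams distribute parallel for \
        map(to: data[0:N]) map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %f\n", sum);
    return 0;
}

If no device is available, the same construct still runs correctly on the host. OpenACC offers an analogous "#pragma acc parallel loop"; in both models the recursive, task-based search would first have to be reshaped into such loops.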

Performance degrades a lot when using mpirun to execute my program

I am new to the field of MPI. I wrote my program using the Intel Math Kernel Library, and I want to compute a matrix-matrix multiplication in blocks, which means that I split the large matrix X into many small matrices along the columns as follows. My matrix is large, so each time I only compute (N, M) x (M, N), where I can set M manually.
XX^T = X_1X_1^T + X_2X_2^T + ... + X_nX_n^T
I first set the total number of threads to 16 and M to 1024, then run my program directly as follows. Checking the CPU state, I find that CPU usage is 1600%, which is normal.
./MMNET_MPI --block 1024 --numThreads 16
However, when I run my program through MPI as follows, the CPU usage is only 200-300%. Strangely, if I change the block size to 64, I get a small improvement, up to about 1200% CPU usage.
mpirun -n 1 --bind-to none ./MMNET_MPI --block 1024 --numThreads 16
I do not know what the problem is. It seems that mpirun applies some default setting that affects my program. The following is part of my matrix multiplication code. The #pragma omp parallel for directive is meant to extract the small N-by-M block from the compressed format in parallel; after that I use cblas_dgemv to compute the matrix-matrix multiplication.
#include "MemoryUtils.h"
#include "Timer.h"
#include "omp.h"
#include <mpi.h>
#include <mkl.h>
#include <iostream>
using namespace std;
int main(int argc, char** argv) {
    omp_set_num_threads(16);
    Timer timer;
    double start_time = timer.get_time();
    MPI_Init(&argc, &argv);
    int total_process;
    int id;
    MPI_Comm_size(MPI_COMM_WORLD, &total_process);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    if (id == 0) {
        cout << "========== Testing MPI properties for MMNET ==========" << endl;
    }
    cout << "Initialize the random matrix ..." << endl;
    unsigned long N = 30000;
    unsigned long M = 500000;
    unsigned long snpsPerBlock = 1024;
    auto* matrix = ALIGN_ALLOCATE_DOUBLES(N*M);
    auto* vector = ALIGN_ALLOCATE_DOUBLES(N);
    auto* result = ALIGN_ALLOCATE_DOUBLES(M);
    auto* temp1 = ALIGN_ALLOCATE_DOUBLES(snpsPerBlock);
    memset(result, 0, sizeof(double) * M);
    cout << "Time for allocating is " << timer.update_time() << " sec" << endl;
    memset(matrix, 1.1234, sizeof(double) * N * M);
    memset(vector, 1.5678, sizeof(double) * N);
    // #pragma omp parallel for
    // for (unsigned long row = 0; row < N * M; row++) {
    //     matrix[row] = (double)rand() / RAND_MAX;
    // }
    // #pragma omp parallel for
    // for (unsigned long row = 0; row < N; row++) {
    //     vector[row] = (double)rand() / RAND_MAX;
    // }
    cout << "Time for generating data is " << timer.update_time() << " sec" << endl;
    cout << "Starting calculating..." << endl;
    for (unsigned long m0 = 0; m0 < M; m0 += snpsPerBlock) {
        uint64 snpsPerBLockCrop = std::min(M, m0 + snpsPerBlock) - m0;
        auto* snpBlock = matrix + m0 * N;
        MKL_INT row = N;
        MKL_INT col = snpsPerBLockCrop;
        double alpha = 1.0;
        MKL_INT lda = N;
        MKL_INT incx = 1;
        double beta = 0.0;
        MKL_INT incy = 1;
        cblas_dgemv(CblasColMajor, CblasTrans, row, col, alpha, snpBlock, lda, vector, incx, beta, temp1, incy);
        // compute XA
        double beta1 = 1.0;
        cblas_dgemv(CblasColMajor, CblasNoTrans, row, col, alpha, snpBlock, lda, temp1, incx, beta1, result, incy);
    }
    cout << "Time for computation is " << timer.update_time() << " sec" << endl;
    ALIGN_FREE(matrix);
    ALIGN_FREE(vector);
    ALIGN_FREE(result);
    ALIGN_FREE(temp1);
    return 0;
}
My CPU information is as follows.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 44
On-line CPU(s) list: 0-43
Thread(s) per core: 1
Core(s) per socket: 22
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
Stepping: 4
CPU MHz: 1252.786
CPU max MHz: 2101.0000
CPU min MHz: 1000.0000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 30976K
NUMA node0 CPU(s): 0-21
NUMA node1 CPU(s): 22-43
MKL by default implements some intelligent dynamic selection of the number of threads to use. This is controlled by the variable MKL_DYNAMIC, which is set to TRUE by default. The documentation for MKL states:
If you [sic] are able to detect the presence of MPI, but cannot determine if it has been called in a thread-safe mode (it is impossible to detect this with MPICH 1.2.x, for instance), and MKL_DYNAMIC has not been changed from its default value of TRUE, Intel MKL will run one thread.
Since you call MPI_Init() and not MPI_Init_thread() to initialise MPI, you are effectively asking for the single-threaded MPI level (MPI_THREAD_SINGLE). The library is free to provide you with any threading level, and it will conservatively stick to MPI_THREAD_SINGLE. You can check that by calling MPI_Query_thread(&provided) after the initialisation and seeing whether the output value is greater than MPI_THREAD_SINGLE.
Since you are mixing OpenMP and threaded MKL with MPI, you should really tell MPI to initialise at a higher threading support level by calling MPI_Init_thread():
int provided;

MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
// This ensures that MPI actually provides MPI_THREAD_MULTIPLE
if (provided < MPI_THREAD_MULTIPLE) {
    // Complain
}
(Technically, you need MPI_THREAD_FUNNELED if you do not make MPI calls outside the main thread, but that is not the thread-safe mode as MKL understands it.)
Even if you request a certain thread support level from MPI, there is no guarantee that you will get it, which is why you have to examine the provided level. Also, older Open MPI versions must explicitly be built with such support - the default is to not build with support for MPI_THREAD_MULTIPLE, as some network modules are not thread-safe. You can check whether that is the case by running ompi_info and looking for a line similar to this one:
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
Now, the reality is that most threaded software that does not make MPI calls outside the main thread runs perfectly fine even if MPI does not provide a higher level of thread support than MPI_THREAD_SINGLE, i.e., with most MPI implementations MPI_THREAD_SINGLE is equivalent to MPI_THREAD_FUNNELED. In that case, setting MKL_DYNAMIC to FALSE should make MKL behave as when run without mpirun:
mpirun -x MKL_DYNAMIC=FALSE ...
In any case, since your program accepts the number of threads as an argument, simply call both mkl_set_num_threads() and omp_set_num_threads() and do not rely on magical default mechanisms.
Edit: Enabling full thread support has consequences - increased latency and some network modules may refuse to work, for example the InfiniBand module in older Open MPI versions, resulting in the library quietly switching to slower transports such as TCP/IP. Better request MPI_THREAD_FUNNELED and explicitly set the number of MKL and OpenMP threads.
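Putting those last recommendations together, a minimal initialisation sketch (the value 16 stands in for whatever --numThreads passes in; mkl_set_num_threads() and mkl_set_dynamic() are declared in mkl.h, omp_set_num_threads() in omp.h):

#include <mpi.h>
#include <mkl.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int provided;
    /* Request FUNNELED: MPI calls happen only on the main thread,
       but OpenMP/MKL threads may run in between them. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI thread support too low: %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int nthreads = 16;               /* e.g. taken from --numThreads */
    omp_set_num_threads(nthreads);   /* explicit OpenMP thread count */
    mkl_set_num_threads(nthreads);   /* explicit MKL thread count */
    mkl_set_dynamic(0);              /* disable MKL's dynamic thread adjustment */

    /* ... blocked dgemv computation goes here ... */

    MPI_Finalize();
    return 0;
}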

OpenCL kernel won't compile if index of array member is calculated by a function or expression

I'm writing an N-body (galaxy) simulator using OpenCL and OpenGL. In the OpenCL kernel code there is a line where the index of an array member is calculated by a function, as follows: array[indx(i, j)]. The kernel compiles fine on an integrated Intel graphics card, but it fails to compile on a dedicated NVIDIA graphics card. Error code -11 is returned by clBuildProgram(), but the build log is empty, although the log size is reported to be 16 characters. Possibly a bug in NVIDIA's driver? I have also tried to run the code on a third (integrated Intel) graphics card, and it says "Invalid operation."; again error code -11 is returned. Why does this kernel compile on one card but fail to compile on the other two?
I have tried to drop the indx() function and replace array[indx(i, j)] with array[(j * (j - 1) / 2 + i)]. It still does not compile on the other two cards.
unsigned long long indx(unsigned long long i, unsigned long long j)
{
    return (j * (j - 1) / 2 + i);
}

kernel void propagate(global float4* pos, global float4* vel, global float* acc_matr)
{
    constant float mass = 0.00009f;
    unsigned long long i = get_global_id(0); // this work-item's particle index
    for (unsigned long long j = i + 1; j < get_global_size(0); j++)
    {
        float r = distance(pos[i], pos[j]);
        if (r > 0.05f)
        {
            acc_matr[indx(i, j)] = mass / r / r / r;
        }
        else
        {
            acc_matr[indx(i, j)] = -0.5f;
        }
    }
}
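One debugging aside, since the empty 16-character build log is suspicious: the log has to be queried in two steps, first for its size and then for its contents. A minimal host-side sketch (error checking omitted; program and device are assumed to be the already-created program and device handles):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Print the build log for `program` on `device` after clBuildProgram() fails. */
void print_build_log(cl_program program, cl_device_id device)
{
    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &log_size);          /* query the size first */

    char *log = malloc(log_size + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          log_size, log, NULL);         /* then fetch the text */
    log[log_size] = '\0';

    fprintf(stderr, "Build log (%zu bytes):\n%s\n", log_size, log);
    free(log);
}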

AMD OpenCL C compiler reports loops as dead and deleted that shouldn't be dead and deleted

I have the following loop in my OpenCL kernel:
__kernel void kernelA(/* many parameters */)
{
    /* Prefetching code and other stuff
     * ...
     * ...
     */
    float2 valueA = 0.0f;
    #pragma unroll //<----- line X
    for (unsigned int i = 0; i < MAX_A; i++) // MAX_A > 0
    {
        #pragma unroll
        for (unsigned int j = 0; j < MAX_B; j++) // MAX_B > 0
            valueA += arrayA[(i * MAX_A) + j];
    }
    /*
     * Code that uses the result saved to valueA
     */
}
As can be seen, the loops are supposed to sum up the values contained in arrayA. I wanted to try #pragma unroll to see whether there is any performance difference between the looped and the unrolled execution.
But when I compile the kernel, the compiler notes LOOP UNROLL: pragma unroll (line X) ignored because this loop is dead and deleted. I don't understand that message, because the code in the loop is surely executed: MAX_A and MAX_B are definitely greater than zero, and the sum saved to valueA is also used after the loop.
I have the same structure elsewhere in the code, and that position is flagged with the same note.
The compiler I use is the AMD OpenCL C compiler delivered by the APP SDK.
The comment by @DarkZeroes is the solution to this question: there was no instruction putting the result into an output buffer of the kernel, so the code above, and everything that depended on it, was optimized away by the compiler.
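In other words, once the kernel stores valueA into a global output buffer, the loops are no longer dead. A minimal sketch of that fix (the result parameter and the indexing by get_global_id(0) are illustrative additions, not part of the original kernel; MAX_A and MAX_B are assumed to be compile-time constants, e.g. passed with -D):

// kernelA with an output buffer added so the summed value is observable
__kernel void kernelA(__global const float2* arrayA,
                      __global float2* result /*, other parameters */)
{
    float2 valueA = 0.0f;

    #pragma unroll
    for (unsigned int i = 0; i < MAX_A; i++)
    {
        #pragma unroll
        for (unsigned int j = 0; j < MAX_B; j++)
            valueA += arrayA[(i * MAX_A) + j];
    }

    // Storing to global memory makes the result visible outside the kernel,
    // so the compiler can no longer delete the loops as dead code.
    result[get_global_id(0)] = valueA;
}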

OpenCL function call stack size

Can I find out OpenCL's function call stack size?
I'm using NVIDIA OpenCL 1.2 on Ubuntu (NVIDIA compute capability 5.2).
I found an unexpected result in my test code:
when a function has been invoked 64 times, the next invoked function seemingly cannot access its arguments.
My guess is that a call-stack overflow causes this problem.
Below is my example code and result:
void testfunc(int count, int run)
{
    if (run == 0) return;
    count++;
    printf("count=%d run=%d\n", count, run);
    run = run - 1;
    testfunc(count, run);
}

__kernel void hello(__global int* in_img, __global int* out_img)
{
    int run;
    int count = 0;
    run = 70;
    testfunc(count, run);
}
And this is the result :
count=1 run=70
count=2 run=69
count=3 run=68
count=4 run=67
count=5 run=66
count=6 run=65
count=7 run=64
.....
count=59 run=12
count=60 run=11
count=61 run=10
count=62 run=9
count=63 run=8
count=64 run=7
count=0 run=0 // <--- Why are count and run zero here?
count=0 run=0
count=0 run=0
count=0 run=0
count=0 run=0
count=0 run=0
Recursion is not supported in OpenCL 1.x. From AMD's Introduction to OpenCL:
Key restrictions in the OpenCL C language are:
Function pointers are not supported.
Bit-fields are not supported.
Variable length arrays are not supported.
Recursion is not supported.
No C99 standard headers such as ctype.h, errno.h, stdlib.h, etc. can be included.
AFAIK, not all implementations have a call-stack-like feature at all. In those cases, and possibly in yours, every function call is inlined into the calling scope.
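Since recursion is unavailable in OpenCL 1.x, the usual workaround is to rewrite the function iteratively; the test function above is tail-recursive, so it maps directly onto a loop. A minimal sketch of that rewrite (same printf-based output as the original test):

// Iterative equivalent of testfunc(): the tail recursion becomes a loop,
// so no call stack is needed.
void testfunc_iter(int count, int run)
{
    while (run != 0)
    {
        count++;
        printf("count=%d run=%d\n", count, run);
        run = run - 1;
    }
}

__kernel void hello(__global int* in_img, __global int* out_img)
{
    testfunc_iter(0, 70);
}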
