In the solver, I use MPI to run the simulation, and I use clock() to measure the CPU time of each part of the solver. The following code is used to time one section:
MPI_Barrier(SlaveComm);
clock_t pre_dot1 = clock();
newrho = symmat->Lddot_MPI(n,r,z,false);
clock_t after_dot1 = clock();
time_dot3 += (after_dot1 - pre_dot1)/(double) CLOCKS_PER_SEC;
Lddot_MPI is a very simple function that calculates a dot product, as shown in the following code:
double BaseMatrix::Lddot_MPI(int n,double x[], double y[], bool forall){
// the dot product of two vectors.
double dsum(0.0);
for (int ii=0; ii<n; ii++)
{
dsum = dsum + x[ii]*y[ii];
}
double dsum_total(0.0);
if (forall)
MPI_Allreduce(&dsum, &dsum_total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
else
MPI_Allreduce(&dsum, &dsum_total, 1, MPI_DOUBLE, MPI_SUM, SlaveComm);
return dsum_total;
}
I call this function at three different locations inside the same for loop, like this:
for (int i=0; i<10000; i++)
{
newrho = symmat->Lddot_MPI(n,r,z,false);
.
.
.
newrho1 = symmat->Lddot_MPI(n,r,z,false);
.
.
.
newrho2 = symmat->Lddot_MPI(n,r,z,false);
.
.
.
}
Before each call to this function, MPI_Barrier is used to synchronize all the tasks (this barrier was added only for testing purposes), so the measured time should cover only this function. Since the function is called repeatedly inside the for loop, the measured time is the accumulated time over all of the repeated calls. I found that the times of these three calls (the same function at three locations) differ substantially; one can be 20 times longer than another. I also wrote out the time of each individual call, and one call site is consistently longer than the others.
I think the computation and communication are the same for this function at all three locations, so I wonder why the measured times are so different. Is it possible that the code preceding the call affects the measurement?
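For reference, the same call site could also be timed with MPI_Wtime(), which returns elapsed wall-clock seconds, whereas clock() counts only this process's CPU time. A minimal sketch (the variable names here are illustrative, not from the actual solver):

MPI_Barrier(SlaveComm);                          // line the ranks up before timing
double t0_wall = MPI_Wtime();                    // wall-clock start
clock_t t0_cpu = clock();                        // CPU-time start
newrho = symmat->Lddot_MPI(n, r, z, false);
double wall_sec = MPI_Wtime() - t0_wall;         // elapsed time, including any wait inside MPI_Allreduce
double cpu_sec  = (clock() - t0_cpu) / (double) CLOCKS_PER_SEC;

If the wall-clock time at one call site is much larger than at the others while the local loop is identical, the extra time is being spent inside MPI_Allreduce (typically waiting for the slowest rank to arrive). Note that many MPI implementations busy-wait, in which case clock() counts that waiting as CPU time as well.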
Good day, everyone!
I would like to ask the respected community for advice on using GPU computing power instead of, or together with, the CPU.
I have a working program based on a recursive search over all possible combinations of certain events, parallelized with OpenMP to run on all available processor cores.
The C++ pseudocode is as follows:
// #includes
// function declarations
// declaring a global variable:
QVector<QVector<QVector<float>>> variant; // (or "std::vector")
int main() {
// reads data from file
// data are converted and analyzed
// the variant variable containing the current best result is filled in (here - by pre-analysis)
#pragma omp parallel shared(variant)
#pragma omp master
// call the recursive search over all variants:
PEREBOR(Tabl_1, a, i_a, ..., rec_depth);
return 0;
}
void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t rec_depth)
{
// determine the bounds of the first loop, for various reasons
for (int i = quantity; i < another_quantity; i++) {
// Tabl_1 is processed and modified to determine the number of steps in the subsequent for loop
for (int k = 0; k < the_quantity_just_found; k++) {
if (rec_depth != 1) { // not yet at the lowest level, so go down further
// add descent to the next recursion level to the call stack:
#pragma omp task
PEREBOR(Tabl_1_COPY, a, i_a, ..., rec_depth-1);
}
else { // we went down to the lowest level
if (condition fulfilled) // condition check - READ the variant variable
variant = it_is_equal_to_that_,_to_that...;
else
continue;
}
}
}
}
Unfortunately, I don't have a CPU with a thousand cores at my disposal, and without that the algorithm runs for a very long time. At work I was advised to consider using a GPU to speed up the calculations. I learned that OpenMP can target video cards (especially NVidia ones), and that OpenACC does this well too.
My main question, then, is whether it is possible to run a recursive algorithm on a GPU simply and, at the same time, effectively. Can this give a noticeable speedup relative to the CPU? If so, would OpenACC perhaps do better? Is it possible to send work to the video card through "#pragma omp task", or are other directives required? And how could calculations on the CPU and GPU be combined?
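From what I have read so far, offloading uses different directives than "#pragma omp task". The following is only a rough sketch of what an OpenMP target region looks like for a flat (non-recursive) loop, assuming a compiler built with offload support; it is not my actual algorithm:

#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<float> a(n, 1.0f);
    float sum = 0.0f;
    float* pa = a.data();
    // map the input array to the device, run the loop there, and copy the reduction result back
    #pragma omp target teams distribute parallel for map(to: pa[0:n]) map(tofrom: sum) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += pa[i];
    std::printf("sum = %f\n", sum);
    return 0;
}

Deep recursion with tasks, as in PEREBOR, does not map onto this model directly; the usual approach seems to be to restructure the work into large flat loops (or batches of subproblems) before offloading.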
Thank you so much for any help!
P.S. I apologize for my English, which is not my native language :)
I am trying to capture the average total CPU usage over a 5-minute window on Windows. My thought was to sample the CPU usage every second and average the values at the end.
Before doing that, I wanted to make sure I was getting the correct current value, so I started with the example from here:
How to get CPU Usage and Virtual Memory for a process in .Net Core?
and modified it to output the CPU usage each second so I can compare the value it picks up with what Task Manager shows:
// cpuCounter is a PerformanceCounter for total CPU,
// e.g. new PerformanceCounter("Processor", "% Processor Time", "_Total")
var value = cpuCounter.NextValue();
// the first call to .NextValue() usually returns 0, so in most cases you need to call it twice
if (Math.Abs(value) <= 0.00)
    value = cpuCounter.NextValue();
for (int i64 = 0; i64 < 10000; ++i64)
{
    await Task.Delay(1000);                  // sample once per second
    value = cpuCounter.NextValue();
    Debug.WriteLine("CPU Counter Value: " + value);
}
My issue with the output of the above code is that each value is about 10% lower than what Task Manager shows. What am I doing wrong?
I have the following OpenCL kernel function to compute the column sums of an image.
__kernel void columnSum(__global float* src,__global float* dst,int srcCols,
int srcRows,int srcStep,int dstStep)
{
const int x = get_global_id(0);
srcStep >>= 2;
dstStep >>= 2;
if (x < srcCols)
{
int srcIdx = x ;
int dstIdx = x ;
float sum = 0;
for (int y = 0; y < srcRows; ++y)
{
sum += src[srcIdx];
dst[dstIdx] = sum;
srcIdx += srcStep;
dstIdx += dstStep;
}
}
}
I assign each thread to process one column here, so that many threads can compute the column sums in parallel.
I also rewrote the above kernel using float4, so that each thread reads 4 consecutive elements of a row at a time from the source image, as shown below.
__kernel void columnSum(__global float* src,__global float* dst,int srcCols,
int srcRows,int srcStep,int dstStep)
{
const int x = get_global_id(0);
srcStep >>= 2;
dstStep >>= 2;
if (x < srcCols/4)
{
int srcIdx = x ;
int dstIdx = x ;
float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
for (int y = 0; y < srcRows; ++y)
{
float4 temp2;
temp2 = vload4(0, &src[4 * srcIdx]);
sum = sum + temp2;
vstore4(sum, 0, &dst[4 * dstIdx]);
srcIdx += (srcStep/4);
dstIdx += (dstStep/4);
}
}
}
In this case, I think the second kernel should theoretically take about 1/4 of the time the first kernel needs to process an image. However, no matter how large the image is, the two kernels take almost the same time. I don't know why. Can you give me some ideas?
OpenCL vector data types like float4 were a better fit for older GPU architectures, especially AMD's GPUs. Modern GPUs don't have SIMD registers available to individual work-items; they are scalar in that respect. CL_DEVICE_PREFERRED_VECTOR_WIDTH_* equals 1 for the OpenCL driver on an NVIDIA Kepler GPU and on Intel HD integrated graphics, so adding float4 vectors on a modern GPU should require 4 operations. On the other hand, the OpenCL driver on an Intel Core CPU reports CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT equal to 4, so such vectors can be added in a single step.
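You can check what your own device reports; a minimal host-side query using the standard OpenCL API (error handling omitted, first platform and device only) looks like this:

#include <CL/cl.h>
#include <stdio.h>

/* Print the preferred float vector width of the first device of the first platform. */
int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    cl_uint width = 0;
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(width), &width, NULL);
    printf("preferred float vector width: %u\n", width);
    return 0;
}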
You are reading the values directly from the "src" array (global memory), which is typically on the order of 400 times slower than private memory. Your bottleneck is definitely the memory access, not the "add" operation itself.
When you move from float to float4, the vector operation (add/multiply/...) is more efficient thanks to the GPU's ability to operate on vectors. However, the reads and writes to global memory remain the same.
And since that is the main bottleneck, you will not see any speedup at all.
If you want to speed up your algorithm, you should move the data to local memory. However, you then have to manage that memory yourself and choose a proper block size.
Which architecture do you use?
Using float4 gives higher instruction-level parallelism (and then requires 4 times fewer threads), so it should theoretically be faster (see http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf).
However, did I understand correctly that in your kernel you are doing a prefix sum (you store the partial sum at every iteration of y)? If so, because of those stores, the bottleneck is the memory writes.
I think that on the GPU, float4 is not a SIMD operation in OpenCL. In other words, if you add two float4 values, the sum is done in four steps rather than all at once. floatn is really designed for the CPU; on the GPU, floatn serves only as a convenient syntax, at least on Nvidia cards. Each thread on the GPU acts as if it were a scalar processor without SIMD, but the threads in a warp are not independent the way they are on the CPU. The right way to think of the GPGPU model is Single Instruction Multiple Threads (SIMT).
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Have you tried running your code on the CPU? I think the float4 code should run faster (potentially four times faster) than the scalar code on the CPU. Also, if you have a CPU with AVX, you should try float8; if the float4 code is faster on the CPU, then float8 should be faster still on a CPU with AVX.
Try defining an __attribute__ on the kernel and see whether the run time changes. For example:
__kernel __attribute__((vec_type_hint(int))) void columnSum(...)
or
__kernel __attribute__((vec_type_hint(int4))) void columnSum(...)
or some floatn, as you want.
read more:
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/functionQualifiers.html
The following kernel computes an acoustic pressure field, with each thread computing its own private instance of the pressure vector, which then needs to be summed into global memory.
I'm pretty sure the code which computes the pressure vector is correct, but I'm still having trouble making this produce the expected result.
int gid = get_global_id(0);
int lid = get_local_id(0);
int nGroups = get_num_groups(0);
int groupSize = get_local_size(0);
int groupID = get_group_id(0);
/* Each workitem gets private storage for the pressure field.
* The private instances are then summed into local storage at the end.*/
private float2 pressure[HYD_DIM_TOTAL];
local float2 pressure_local[HYD_DIM_TOTAL];
/* Code which computes value of 'pressure' */
//wait for all workgroups to finish accessing any memory
barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
/// sum all results in a workgroup into local buffer:
for(i=0; i<groupSize; i++){
//each thread sums its own private instance into the local buffer
if (i == lid){
for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
pressure_local[iHyd] += pressure[iHyd];
}
}
//make sure all threads in workgroup get updated values of the local buffer
barrier(CLK_LOCAL_MEM_FENCE);
}
/// copy all the results into global storage
//1st thread in each workgroup writes the group's local buffer to global memory
if(lid == 0){
for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
pressure_global[groupID +nGroups*iHyd] = pressure_local[iHyd];
}
}
barrier(CLK_GLOBAL_MEM_FENCE);
/// sum the various instances in global memory into a single one
// 1st thread sums global instances
if(gid == 0){
for(iGroup=1; iGroup<nGroups; iGroup++){
//we only need to sum the results from the 1st group onward
for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
pressure_global[iHyd] += pressure_global[iGroup*HYD_DIM_TOTAL +iHyd];
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
}
Some notes on data dimensions:
The total number of threads will vary between 100 and 2000, but may on occasion lie outside this interval.
groupSize will depend on hardware, but I'm currently using values between 1 (CPU) and 32 (GPU).
HYD_DIM_TOTAL is known at compile time and varies between 4 and 32 (will generally, but not necessarily, be a power of 2).
Is there anything blatantly wrong with this reduction code?
PS: I run this on an i7 3930k with AMD APP SDK 2.8 and on an NVIDIA GTX580.
I notice two issues here, one big, one smaller:
This code suggests that you have a misunderstanding of what a barrier does. A barrier never synchronizes across multiple workgroups. It only synchronizes within a workgroup. The CLK_GLOBAL_MEM_FENCE makes it look like it is global synchronization, but it really isn't. That flag just fences all of the current work item's accesses to global memory. So outstanding writes will be globally observable after a barrier with this flag. But it does not change the barrier's synchronization behavior, which is only at the scope of a workgroup. There is no global synchronization in OpenCL, beyond launching another NDRange or Task.
The first for loop causes multiple work items to overwrite each other's computation. The indexing of pressure_local with iHyd will be done by each work item with the same iHyd. This will produce undefined results.
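As a reference for the workgroup-level part, a common pattern is a tree reduction in local memory. The sketch below reduces one float per work item (not the full HYD_DIM_TOTAL vector, so it is an illustration rather than a drop-in replacement) and assumes the workgroup size is a power of two:

__kernel void reduce(__global const float* in,
                     __global float* partial,   // one partial sum per workgroup
                     __local float* scratch)
{
    int lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // halve the number of active work items each step
    for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}

The per-workgroup partial sums are then combined either by a second kernel launch or on the host, since a single launch cannot synchronize across workgroups.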
Hope this helps.
I'm trying to work on a demonstration about multithreading. I need an example of a computationally-intensive function/method. But at the same time, the code that does the computing should be simple.
For example, I'm looking for a function that maybe does something like calculate the nth digit of pi or e:
function calculatePiToNthDecimalDigit(digits) {
var pi = "3.";
for (var i = 1; i < digits; i++) {
pi += digitOfPiAtDecimalPlace(i);
}
return pi;
}
function digitOfPiAtDecimalPlace(decimalPlace) {
...
}
Can anyone give me an example of a function that is relatively simple but can be used in succession (e.g. tight loop) to generate a very hard-to-compute (takes a long time) value?
The simplest I can think of is summing a huge list of numbers. Addition is obviously easy, but if the list is huge, that will make it computationally-intensive, and the problem lends itself well to multi-threading.
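For instance, a minimal C++ sketch of that idea (illustrative only, not from the question): each thread sums its own slice of the list into its own slot, so there is no sharing and no locking, and the partial sums are combined at the end.

#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<double> data(50'000'000, 1.0);             // a huge list of numbers
    unsigned nThreads = std::thread::hardware_concurrency();
    if (nThreads == 0) nThreads = 4;

    std::vector<double> partial(nThreads, 0.0);
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / nThreads;

    for (unsigned t = 0; t < nThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t == nThreads - 1) ? data.size() : begin + chunk;
        // each thread sums its own slice into its own slot
        workers.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();

    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    std::printf("total = %f\n", total);
    return 0;
}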
Real tests come from real problems. How about the numerical integration of a function using a simple formula such as the trapezoidal rule:
integral from a to b of f(x) dx ≈ h * [ (f(a) + f(b))/2 + sum over j = 1..N-1 of f(a + j*h) ], where h = (b - a)/N.
For example, the integral of sin(x) from 0 to pi is exactly 2. Let's try to confirm that using C#:
static void Main(string[] args)
{
    int N = 2097153;
    double two = Integral(0, Math.PI, N);      // exact value is 2
    double err = (2.0 - two) / 2.0;            // relative error
    Console.WriteLine("N={0} err={1}", N, err);
}

static double f(double x) { return Math.Sin(x); }

// composite trapezoidal rule on [a, b] with N intervals
static double Integral(double a, double b, int N)
{
    double h = (b - a) / N;
    double res = (f(a) + f(b)) / 2;
    for (int j = 1; j < N; j++)
    {
        double x = a + j * h;
        res += f(x);
    }
    return h * res;
}
At this point I get N=2097153 and err=2.1183055309848E-13 after several milliseconds. If you push for much higher accuracy, the error starts to go up again as round-off errors creep in. I think something similar would happen with a calculation of pi, where you reach machine accuracy within a few milliseconds and beyond that you are really calculating garbage. You could simply repeat the integral several times for a longer overall run.
So you might be content to show a drop in time from, let's say, 140 ms down to 90 ms and count that as a victory.
The multiplication of two NxN matrices has complexity proportional to N^3, so it is relatively easy to create a "computationally intensive" task, just by squaring a sufficiently large matrix. For example, as size goes from N=10 to N=100 to N=1000, the number of (scalar) multiplications required by the classic algorithm for matrix multiplication goes from one thousand to one million to one billion.
Also such a task has plenty of opportunities for parallel processing, if your multi-threading demonstration is meant to take advantage of such opportunities. E.g. the same row can be multiplied by more than one column in parallel.
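A minimal C++ sketch of that idea (naive triple loop, rows distributed across threads with OpenMP; illustrative, not tuned):

#include <cstdio>
#include <vector>

// C = A * B for N x N matrices stored row-major in flat vectors
void matmul(const std::vector<double>& A, const std::vector<double>& B,
            std::vector<double>& C, int N)
{
    #pragma omp parallel for                   // each thread handles a block of rows
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double s = 0.0;
            for (int k = 0; k < N; ++k)
                s += A[i * N + k] * B[k * N + j];
            C[i * N + j] = s;
        }
}

int main() {
    const int N = 1000;                        // roughly 10^9 scalar multiplications
    std::vector<double> A(N * N, 1.0), B(N * N, 1.0), C(N * N, 0.0);
    matmul(A, B, C, N);
    std::printf("C[0] = %f\n", C[0]);          // expect N = 1000
    return 0;
}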