Understanding the efficiency of different layers/stages of pipelines when calculating a sum [duplicate]

This question already has answers here:
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
(1 answer)
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)
(1 answer)
How to optimize these loops (with compiler optimization disabled)?
(3 answers)
When, if ever, is loop unrolling still useful?
(9 answers)
how will unrolling affect the cycles per element count CPE
(1 answer)
Closed 1 year ago.
So I have the results for some experiments and I need to write about the efficiency of pipelining.
I don't have the source code but I do have the time it took for a 4 layer pipeline and an 8 layer pipeline to sum an array of 100,000,000 doubles.
The sum was performed the following way.
For the 4-layer pipeline:
d0 = 0.0; d1 = 0.0; d2 = 0.0; d3 = 0.0;
for (int i = 0; i < N; i = i + 4) {
    d0 = d0 + a[i + 0];
    d1 = d1 + a[i + 1];
    d2 = d2 + a[i + 2];
    d3 = d3 + a[i + 3];
}
c = d0 + d1 + d2 + d3;
For the 8-layer pipeline:
d0 = 0.0; d1 = 0.0; d2 = 0.0; d3 = 0.0;
d4 = 0.0; d5 = 0.0; d6 = 0.0; d7 = 0.0;
for (int i = 0; i < N; i = i + 8) {
    d0 = d0 + a[i + 0];
    d1 = d1 + a[i + 1];
    d2 = d2 + a[i + 2];
    d3 = d3 + a[i + 3];
    d4 = d4 + a[i + 4];
    d5 = d5 + a[i + 5];
    d6 = d6 + a[i + 6];
    d7 = d7 + a[i + 7];
}
c = d0 + d1 + d2 + d3 + d4 + d5 + d6 + d7;
The results I have show the following times for no pipeline, a 2-layer pipeline, a 4-layer pipeline and an 8-layer pipeline. The code for the no-pipeline and 2-layer versions is similar to the code shown above. The results are averaged over 10 runs and are as follows. The experiment was run on an Intel Core i7-9750H processor.
No Pipeline : 0.106 secs
2-Layer-Pipeline: 0.064 secs
4-Layer-Pipeline: 0.046 secs
8-Layer-Pipeline: 0.048 secs
It is evident that efficiency improves from no pipeline up to the 4-layer pipeline, but I'm trying to work out why it actually got worse from the 4-layer pipeline to the 8-layer pipeline. Since each partial sum lives in a different register, there shouldn't be any dependency hazard between them. One idea I had is that there may not be enough ALUs to process more than 4 floating-point numbers at a time, which would cause stalls, but then wouldn't the 8-layer version at least match the 4-layer one? I have plotted the pipeline stages in Excel to try to find where the stalls/bubbles occur, but I can't see any.
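As a rough aid for reading the numbers above, here is a small Python sketch that turns the four timings into speedups and into the memory read rate they imply. The only inputs are the figures from the question; the function names are made up for illustration. The 4- and 8-accumulator runs both come out at roughly 17 GB/s, so one possible reading (an assumption, not something these measurements confirm) is that the loop is no longer limited by add latency but by how fast the array streams from memory. The 0.002 s gap may also simply be run-to-run noise.

```python
# Timings copied from the question (seconds, averaged over 10 runs),
# keyed by the number of independent accumulators.
timings = {1: 0.106, 2: 0.064, 4: 0.046, 8: 0.048}
N = 100_000_000        # doubles summed
BYTES_PER_DOUBLE = 8

def speedup(accumulators):
    """Speedup relative to the single-accumulator ("no pipeline") run."""
    return timings[1] / timings[accumulators]

def bandwidth_gb_per_s(accumulators):
    """Bytes of array data read per second, in GB/s (1 GB = 1e9 bytes)."""
    return N * BYTES_PER_DOUBLE / timings[accumulators] / 1e9

for k in sorted(timings):
    print(f"{k} accumulator(s): speedup {speedup(k):.2f}x, "
          f"{bandwidth_gb_per_s(k):.1f} GB/s read")
```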

Related

What are the different versions of arithmetic swap and why do they work?

I think we should all be familiar with the arithmetic swap algorithm, which swaps two variables without using a third variable. I have now found out that there are two variations of the arithmetic swap. Please consider the following:
Variation 1.
int a = 2;
int b = 3;
a = a + b;
b = a - b;
a = a - b;
Variation 2.
int a = 2;
int b = 3;
b = b - a;
a = a + b;
b = a - b;
I want to know why there are two distinct variations of the arithmetic swap and why they work. Are there other variations that achieve the same result? How are they related? Is there an elegant mathematical formula that justifies why the arithmetic swap works, for all variations? Is there some underlying truth connecting these two variations?
Break each variable out as what it represents:
Variation 1:
a = 2
b = 3
a1 = a + b
b1 = a1 - b = (a + b) - b = a
a2 = a1 - b1 = (a + b) - a = b
Variation 2:
a = 2
b = 3
b1 = b - a
a1 = a + b1 = a + (b - a) = b
b2 = a1 - b1 = b - (b - a) = a
There's no underlying truth other than the fact that the math works out. Remember that each time you do an assignment, it's effectively a new "variable" from the math side.
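To see the algebra hold for more than one pair of values, here is a minimal Python sketch that runs both variations over a small grid of integers. The function names are illustrative only. Note that Python integers never overflow; in C, `a + b` or `b - a` on `int` can overflow, which is undefined behavior for signed types, so this check says nothing about that case.

```python
def swap_variation_1(a, b):
    a = a + b   # a1 = a + b
    b = a - b   # b1 = (a + b) - b = a
    a = a - b   # a2 = (a + b) - a = b
    return a, b

def swap_variation_2(a, b):
    b = b - a   # b1 = b - a
    a = a + b   # a1 = a + (b - a) = b
    b = a - b   # b2 = b - (b - a) = a
    return a, b

# Both variations should return the inputs in swapped order.
for a in range(-5, 6):
    for b in range(-5, 6):
        assert swap_variation_1(a, b) == (b, a)
        assert swap_variation_2(a, b) == (b, a)

print(swap_variation_1(2, 3), swap_variation_2(2, 3))  # (3, 2) (3, 2)
```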

Dynamic forecasting the data.table way?

I need to create a dynamic forecast using
v(t) = b0 + b1*v(t-1) + b2*v(t-2) + b3*x(t-1) + b4*x(t-2)
The dataset looks like this at time 0. In actual data, there are 80 different x's and 100K "dates".
date        v   vLag1  vLag2  x   xLag1  xLag2  b1   b2   b3    b4
2016-06-30  NA  105    95     33  11     23     0.2  3.2  -1.2  0.4
2016-07-01  NA  NA     NA     43  33     11     0.2  3.2  -1.2  0.4
2016-07-02  NA  NA     NA     52  43     33     0.2  3.2  -1.2  0.4
The goal is to predict v's, replacing all NA's with values. I created vLag1, vLag2, xLag1, xLag2 so that I have all I need to calculate v in one row.
All x's and b's are known ahead of time, so I created lags of x shown above. The b's are the coefficients.
For each date, the v(t) would be predicted, and the predicted v(t)'s will feed into the next date's v prediction as the lagged regressors.
To avoid looping over rows like this:
for (i in 2:nrows){
df$v[i] <- df$v[i-1] * df$coeff[i]
}
I have tried to use repeated substitution, so that all the future v's only reference v1, which is easy to calculate because v1's calculation involves other values in the same row.
v2 = b0 + b1*v1 + b2*v0 + b3*x1 + b4*x0
v3 = b0 + b1*v2 + b2*v1 + b3*x2 + b4*x1
(substitute v2) v3 = b0 + b1*(b0 + b1*v1 + b2*v0 + b3*x1 + b4*x0) + b2*v1 + b3*x2 + b4*x1
v4 = ...
But with so many lags of v's and x's to keep track of, this also got out of control.
I have been browsing questions about data.table's shift function on SO. But in my case, where the values need to be dynamically obtained and then shifted, is there any way to predict dynamically with data.table's functions?
Instead of data.table (where you can't do this easily), this looks like an easy job for Rcpp:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector dyn_fore(const NumericVector x,
                       const double v1, const double v2,
                       const double x1, const double x2,
                       const double b0, const double b1, const double b2,
                       const double b3, const double b4) {
  int n = x.size();  // assumes n >= 2
  NumericVector v(n);
  // The first two forecasts still depend on the supplied lags.
  v(0) = b0 + b1*v1 + b2*v2 + b3*x1 + b4*x2;
  v(1) = b0 + b1*v(0) + b2*v1 + b3*x(0) + b4*x1;
  for (int i = 2; i < n; i++) {
    // b4 multiplies the second lag of x, i.e. x(i-2)
    v(i) = b0 + b1*v(i-1) + b2*v(i-2) + b3*x(i-1) + b4*x(i-2);
  }
  return v;
}
(If you use Windows, make sure you have a working Rtools installation, put this in a C++ file in RStudio and source it. Check if I got coefficients and indices right.)
Then in R:
x <- c(33, 43, 52, 67)
dyn_fore(x, 105, 95, 11, 23, 0, 0.2, 3.2, -1.2, 0.4)
#[1] 321.00 365.00 1061.80 1335.16
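For cross-checking the indices without compiling anything, here is a plain-Python sketch of the same recursion. It follows the recurrence as stated in the question, v(t) = b0 + b1*v(t-1) + b2*v(t-2) + b3*x(t-1) + b4*x(t-2), so the b4 term multiplies the second lag of x. The argument order mirrors the Rcpp function; the helper lists `v_hist` and `x_hist` are my own naming.

```python
def dyn_fore(x, v1, v2, x1, x2, b0, b1, b2, b3, b4):
    """v1, v2: first and second known lags of v; x1, x2: same for x."""
    v_hist = [v2, v1]              # oldest first
    x_hist = [x2, x1] + list(x)    # x_hist[t + 1] is x(t-1), x_hist[t] is x(t-2)
    out = []
    for t in range(len(x)):
        v = (b0 + b1 * v_hist[-1] + b2 * v_hist[-2]
             + b3 * x_hist[t + 1] + b4 * x_hist[t])
        v_hist.append(v)           # the forecast becomes the next row's lag
        out.append(v)
    return out

print(dyn_fore([33, 43, 52, 67], 105, 95, 11, 23, 0, 0.2, 3.2, -1.2, 0.4))
```

The first value can be checked by hand: 0.2*105 + 3.2*95 - 1.2*11 + 0.4*23 = 321.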

Moving slowly an initial angle until it reach a final angle

I will try to be very descriptive with this. I'm editing a game right now and the scenario is a 3D area.
I have an initial angle, written as a direction vector, and another vector with different coordinates. As we know, the angle between 2 vectors is given by the formula: Theta = ACos( DotProduct( vec1, vec2 ) / ( VectorLength( vec1 ) * VectorLength( vec2 ) ) )
So let's describe the scenario: I'm currently programming a kind of stationary weapon, a sentry gun, which slowly moves its "head" while shooting bullets at enemies. The angle rotation is my problem.
Let's imagine this: I have my sentry gun in an empty 3D area, and an "enemy" spawns there. I can currently get the direction vector of my sentry's view angle, and the direction vector between my sentry and the player. Let's suppose, using the formula above, that the angle between them is 45 degrees. My sentry gun thinks (calls a function) every 0.1 seconds, and I want to move its head 5 degrees on every think call until it reaches the player (i.e., both vectors are nearly equal). That means it will reach the player (if the player stays in place) in 0.9 seconds (45 degrees at 5 degrees per call is 9 calls).
How can I move the sentry's view angle slowly until it reaches a target? In 2D it is easy, but now I'm struggling with a 3D scenario and I'm currently lost.
Any help would be appreciated, and as for code, I would be grateful for pseudocode. Thanks! (and sorry for my English)
What you need is called SLERP (spherical linear interpolation).
In the standard SLERP formula, your starting direction vector is p0, the goal direction is p1, Omega is your Theta, and the parameter t varies over the range 0..1 with whatever step you need.
Delphi example for the 2D case (it is easy to check visually):
var
  p0, p1: TPoint;
  i, xx, yy: Integer;
  omega, InvSinOmega, t, a0, a1: Double;
begin
  P0 := Point(0, 200);
  P1 := Point(200, 0);
  omega := -Pi / 2;  // angle between P0 and P1
  InvSinOmega := 1.0 / Sin(omega);
  // Draw the start and goal points in red
  Canvas.Brush.Color := clRed;
  Canvas.Ellipse(120 + P0.X, 120 + P0.Y, 120 + P0.X + 7, 120 + P0.Y + 7);
  Canvas.Ellipse(120 + P1.X, 120 + P1.Y, 120 + P1.X + 7, 120 + P1.Y + 7);
  // Draw nine interpolated points, t = 0.1 .. 0.9
  for i := 1 to 9 do begin
    t := i / 10;
    a0 := sin((1 - t) * omega) * InvSinOmega;
    a1 := sin(t * omega) * InvSinOmega;
    xx := Round(P0.X * a0 + P1.X * a1);
    yy := Round(P0.Y * a0 + P1.Y * a1);
    Canvas.Brush.Color := RGB(25 * i, 25 * i, 25 * i);
    Canvas.Ellipse(120 + xx, 120 + yy, 120 + xx + 9, 120 + yy + 9);
  end;
end;
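The same weights work unchanged in 3D. Below is a hedged Python sketch of a "think" step that turns the view direction toward the target by at most a fixed angle per call, which is what the question describes. The names `rotate_towards` and `max_step_deg` are my own, not from any engine API, and the sketch deliberately gives up when the two directions are exactly opposite, because the rotation plane is then ambiguous.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def rotate_towards(current, target, max_step_deg):
    """Turn `current` toward `target` by at most max_step_deg degrees (SLERP)."""
    p0, p1 = normalize(current), normalize(target)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p0, p1))))
    omega = math.acos(dot)              # Theta from the question
    step = math.radians(max_step_deg)
    if omega <= step:                   # close enough: snap to the target
        return p1
    sin_omega = math.sin(omega)
    if sin_omega < 1e-9:                # opposite directions, axis undefined
        raise ValueError("directions are opposite; pick a rotation axis")
    t = step / omega                    # fraction of the arc covered this call
    a0 = math.sin((1.0 - t) * omega) / sin_omega
    a1 = math.sin(t * omega) / sin_omega
    return normalize(tuple(a0 * u + a1 * w for u, w in zip(p0, p1)))

# 45 degrees apart, 5 degrees per call: about nine calls to arrive.
view, enemy = (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)
for _ in range(10):
    view = rotate_towards(view, enemy, 5.0)
```

Calling it once per think tick keeps the turn rate constant regardless of how far away the target direction is, and calling it again after arrival is harmless.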

Curve Fit 5 points

I am trying to curve fit 5 points in C. I have used this code from a previous post (Can sombody simplify this equation for me?) to do 4 points, but now I need to add another point.
// Input data: arrays x[] and y[]
// x[1],x[2],x[3],x[4] - X values
// y[1],y[2],y[3],y[4] - Y values
// Calculations
A = 0
B = 0
C = 0
D = 0
S1 = x[1] + x[2] + x[3] + x[4]
S2 = x[1]*x[2] + x[1]*x[3] + x[1]*x[4] + x[2]*x[3] + x[2]*x[4] + x[3]*x[4]
S3 = x[1]*x[2]*x[3] + x[1]*x[2]*x[4] + x[1]*x[3]*x[4] + x[2]*x[3]*x[4]
for i = 1 to 4 loop
C0 = y[i]/(((4*x[i]-3*S1)*x[i]+2*S2)*x[i]-S3)
C1 = C0*(S1 - x[i])
C2 = S2*C0 - C1*x[i]
C3 = S3*C0 - C2*x[i]
A = A + C0
B = B - C1
C = C + C2
D = D - C3
end-loop
// Result: A, B, C, D
I have been trying to convert this to a 5-point curve fit, but am having trouble figuring out what goes inside the loop:
// Input data: arrays x[] and y[]
// x[1],x[2],x[3],x[4],x[5] - X values
// y[1],y[2],y[3],y[4],y[5] - Y values
// Calculations
A = 0
B = 0
C = 0
D = 0
E = 0
S1 = x[1] + x[2] + x[3] + x[4]
S2 = x[1]*x[2] + x[1]*x[3] + x[1]*x[4] + x[2]*x[3] + x[2]*x[4] + x[3]*x[4]
S3 = x[1]*x[2]*x[3] + x[1]*x[2]*x[4] + x[1]*x[3]*x[4] + x[2]*x[3]*x[4]
S4 = x[1]*x[2]*x[3]*x[4] + x[1]*x[2]*x[3]*x[5] + x[1]*x[2]*x[4]*x[5] + x[1]*x[3]*x[4]*x[5] + x[2]*x[3]*x[4]*x[5]
for i = 1 to 4 loop
C0 = ??
C1 = ??
C2 = ??
C3 = ??
C4 = ??
A = A + C0
B = B - C1
C = C + C2
D = D - C3
E = E + C4
end-loop
// Result: A, B, C, D, E
Any help in filling out C0...C4 would be appreciated. I know this has to do with matrices, but I have not been able to figure it out. Examples with pseudocode or real code would be most helpful.
Thanks
I refuse to miss this opportunity to generalize. :)
Instead, we're going to learn a little bit about Lagrange polynomials and the Newton Divided Difference Method of their computation.
Lagrange Polynomials
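Given n+1 data points (x_0, y_0), ..., (x_n, y_n), the interpolating polynomial is
$$L(x) = \sum_{j=0}^{n} y_j \, l_j(x)$$
where l_j(x) is
$$l_j(x) = \prod_{\substack{0 \le m \le n \\ m \ne j}} \frac{x - x_m}{x_j - x_m}.$$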
What this means is that we can find the polynomial approximating the n+1 points, regardless of spacing, etc, by just summing these polynomials. However, this is a bit of a pain and I wouldn't want to do it in C. Let's take a look at Newton Polynomials.
Newton Polynomials
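Same start: given n+1 data points, the approximating polynomial is going to be
$$N(x) = \sum_{j=0}^{n} a_j \, n_j(x)$$
where each n_j(x) is
$$n_j(x) = \prod_{i=0}^{j-1} (x - x_i)$$
with a coefficient of
$$a_j = [y_0, \ldots, y_j],$$
being the divided difference.
The final form ends up looking like
$$N(x) = [y_0] + [y_0, y_1](x - x_0) + \cdots + [y_0, \ldots, y_n](x - x_0)(x - x_1) \cdots (x - x_{n-1}).$$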
As you can see, the formula is pretty easy given the divided difference values. You just do each new divided difference and multiply by each point so far. It should be noted that you'll end up with a polynomial of degree n from n+1 points.
Divided Difference
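All that's left is to define the divided difference, which is really best explained by these two definitions:
$$[y_i] = y_i$$
and
$$[y_i, \ldots, y_{i+j}] = \frac{[y_{i+1}, \ldots, y_{i+j}] - [y_i, \ldots, y_{i+j-1}]}{x_{i+j} - x_i}.$$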
With this information, a C implementation should be reasonable to do. I hope this helps and I hope you learned something! :)
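To make the divided-difference recipe concrete, here is a minimal Python sketch (not the C version the question asks for, but the loops translate directly). It assumes the x values are distinct; the function names are my own. `divided_differences` returns the Newton coefficients [y0], [y0,y1], ..., [y0..yn], and `newton_eval` evaluates the Newton form with nested multiplication. Note that these are coefficients of the Newton basis, not the A..E monomial coefficients from the question.

```python
def divided_differences(xs, ys):
    """Return the Newton coefficients [y0], [y0,y1], ..., [y0..yn]."""
    coef = [float(y) for y in ys]
    n = len(xs)
    for j in range(1, n):
        # Go from the bottom up so coef[i - 1] still holds the previous column.
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef

def newton_eval(xs, coef, x):
    """Evaluate c0 + c1*(x-x0) + c2*(x-x0)*(x-x1) + ... by nesting."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result

# Five unevenly spaced samples of y = x^4 - 2x + 1.
xs = [0.0, 1.0, 2.0, 4.0, 5.0]
ys = [x**4 - 2*x + 1 for x in xs]
coef = divided_differences(xs, ys)
print(newton_eval(xs, coef, 3.0))   # should be close to 3^4 - 2*3 + 1 = 76
```

Five points determine a quartic exactly, so the interpolant reproduces the sampled polynomial up to rounding.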
If the x values are equally spaced with x2-x1=h, x3-x2=h, x4-x3=h and x5-x4=h then
C0 = y1;
C1 = -(25*y1-48*y2+36*y3-16*y4+3*y5)/(12*h);
C2 = (35*y1-104*y2+114*y3-56*y4+11*y5)/(24*h*h);
C3 = -(5*y1-18*y2+24*y3-14*y4+3*y5)/(12*h*h*h);
C4 = (y1-4*y2+6*y3-4*y4+y5)/(24*h*h*h*h);
y(x) = C0+C1*(x-x1)+C2*(x-x1)^2+C3*(x-x1)^3+C4*(x-x1)^4
// where `^` denotes exponentiation (and not XOR).
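As a sanity check on the equally spaced formulas, here is a short Python sketch that plugs in samples of a known quartic and recovers its coefficients. The function name is illustrative; the expressions inside are copied from the formulas above.

```python
def equispaced_quartic_coeffs(y1, y2, y3, y4, y5, h):
    """Coefficients of y(x) = C0 + C1*(x-x1) + ... + C4*(x-x1)**4."""
    c0 = y1
    c1 = -(25*y1 - 48*y2 + 36*y3 - 16*y4 + 3*y5) / (12*h)
    c2 = (35*y1 - 104*y2 + 114*y3 - 56*y4 + 11*y5) / (24*h*h)
    c3 = -(5*y1 - 18*y2 + 24*y3 - 14*y4 + 3*y5) / (12*h*h*h)
    c4 = (y1 - 4*y2 + 6*y3 - 4*y4 + y5) / (24*h*h*h*h)
    return c0, c1, c2, c3, c4

def poly(x):
    # Test polynomial with known coefficients 1, 2, 3, 4, 5.
    return 1 + 2*x + 3*x**2 + 4*x**3 + 5*x**4

# Sample at x1 = 0 with spacing h = 1, i.e. x = 0, 1, 2, 3, 4.
samples = [poly(x) for x in range(5)]
print(equispaced_quartic_coeffs(*samples, h=1))
```

With x1 = 0 the shifted powers (x - x1)^k are plain powers of x, so the returned values can be compared directly with 1, 2, 3, 4, 5.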

Enumerated values with a single array of numbers without looping

I am writing some functionality for a visual, node-based CAD program that does not allow me to loop, so I need a workaround to enumerate a list of numbers. I am an architect with very little programming experience, so any help would be great.
I have an array of numbers (numArray) coming in as 0, 1, 2, 3, 4, ... (the first column below). I need to convert each number into its counterparts in columns 1, 2, 3 and 4 without using any loops or nested loops.
numArray  1 2 3 4
-----------------
 0 = 0|0|0|0
 1 = 0|0|0|1
 2 = 0|0|0|2
 3 = 0|0|0|3
 4 = 0|0|1|0
 5 = 0|0|1|1
 6 = 0|0|1|2
 7 = 0|0|1|3
 8 = 0|0|2|0
 9 = 0|0|2|1
10 = 0|0|2|2
11 = 0|0|2|3
12 = 0|0|3|0
13 = 0|0|3|1
14 = 0|0|3|2
15 = 0|0|3|3
16 = 0|1|0|0
17 = 0|1|0|1
18 = 0|1|0|2
19 = 0|1|0|3
20 = 0|1|1|0
21 = 0|1|1|1
22 = 0|1|1|2
23 = 0|1|1|3
I have figured out column 4 by implementing the following:
int column4 = numArray % 4;
This works and produces the numbers 0, 1, 2, 3, 0, 1, 2, 3, ..., which is great. However, I am not sure how to use the incoming numArray to produce columns 3, 2 and 1. Again, I have very little programming experience, so any help would be great.
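You're converting the input to base 4 notation, so this will do the job (the loop is only there to print every row; each column is a single expression per value):
#include <stdio.h>

int main(void)
{
    int input[] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23};
    size_t count = sizeof(input) / sizeof(input[0]);
    for (size_t i = 0; i < count; i++)
    {
        /* Peel off base-4 digits, least significant (column 4) first. */
        int c4 = input[i] % 4;
        int c3 = (input[i] / 4) % 4;
        int c2 = (input[i] / 16) % 4;
        int c1 = (input[i] / 64) % 4;
        printf("%d = %d\t%d\t%d\t%d\n", input[i], c1, c2, c3, c4);
    }
    return 0;
}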
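Since the CAD environment does not allow loops, it may help to see the conversion as four independent expressions on a single value, which a node graph can apply to each element. This is a small Python sketch with a made-up function name; it assumes a non-negative input below 256, the largest range four base-4 digits can hold.

```python
def base4_columns(n):
    """Split 0 <= n < 256 into four base-4 digits, column 1 first."""
    column1 = (n // 64) % 4   # 4^3 place
    column2 = (n // 16) % 4   # 4^2 place
    column3 = (n // 4) % 4    # 4^1 place
    column4 = n % 4           # 4^0 place
    return column1, column2, column3, column4

print(base4_columns(27))   # 27 = 1*16 + 2*4 + 3 -> (0, 1, 2, 3)
```

Each column depends only on the input value, never on another column, so the four expressions can be wired up as four separate nodes.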
