Dynamic forecasting the data.table way? - r

I need to create a dynamic forecast using
v(t) = b0 + b1*v(t-1) + b2*v(t-2) + b3*x(t-1) + b4*x(t-2)
The dataset looks like this at time 0. In the actual data, there are 80 different x's and 100K "dates".
date v vLag1 vLag2 x xLag1 xLag2 b1 b2 b3 b4
2016-06-30 NA 105 95 33 11 23 0.2 3.2 -1.2 0.4
2016-07-01 NA NA NA 43 33 11 0.2 3.2 -1.2 0.4
2016-07-02 NA NA NA 52 43 33 0.2 3.2 -1.2 0.4
The goal is to predict v's, replacing all NA's with values. I created vLag1, vLag2, xLag1, xLag2 so that I have all I need to calculate v in one row.
All x's and b's are known ahead of time, so I created lags of x shown above. The b's are the coefficients.
For each date, the v(t) would be predicted, and the predicted v(t)'s will feed into the next date's v prediction as the lagged regressors.
To avoid looping over rows like this:
for (i in 2:nrows){
df$v[i] <- df$v[i-1] * df$coeff[i]
}
I have tried to use repeated substitution, so that all the future v's only reference v1, which is easy to calculate because v1's calculation involves other values in the same row.
v2 = b0 + b1*v1 + b2*v0 + b3*x1 + b4*x0
v3 = b0 + b1*v2 + b2*v1 + b3*x2 + b4*x1
(substitute v2) v3 = b0 + b1*(b0 + b1*v1 + b2*v0 + b3*x1 + b4*x0) + b2*v1 + b3*x2 + b4*x1
v4 = ...
But with so many lags of v's and x's to keep track of, this also got out of control.
I have been browsing questions about data.table's shift function on SO. But in my case, where the values need to be obtained dynamically and then shifted, is there any way to predict dynamically with data.table's functions?

Instead of data.table (where you can't do this easily) this looks like an easy job for Rcpp:
#include <Rcpp.h>
using namespace Rcpp;

// v1, v2: the two most recent known values of v (lags 1 and 2)
// x1, x2: the matching lags of x; x holds x(t) from the first forecast date on
// [[Rcpp::export]]
NumericVector dyn_fore(const NumericVector x,
                       const double v1, const double v2,
                       const double x1, const double x2,
                       const double b0, const double b1, const double b2,
                       const double b3, const double b4) {
  int n = x.size();
  if (n < 2) stop("x must have at least two elements");
  NumericVector v(n);
  // The first two steps still depend on the known lags passed in
  v(0) = b0 + b1*v1 + b2*v2 + b3*x1 + b4*x2;
  v(1) = b0 + b1*v(0) + b2*v1 + b3*x(0) + b4*x1;
  // From here on every regressor comes from v and x themselves
  for (int i = 2; i < n; i++) {
    v(i) = b0 + b1*v(i-1) + b2*v(i-2) + b3*x(i-1) + b4*x(i-2);
  }
  return v;
}
(If you use Windows, make sure you have a working Rtools installation. Put this in a C++ file in RStudio and source it. Check whether I got the coefficients and indices right.)
Then in R:
x <- c(33, 43, 52, 67)
dyn_fore(x, 105, 95, 11, 23, 0, 0.2, 3.2, -1.2, 0.4)
#[1]  321.00  365.00 1061.80 1335.16
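For readers who want to check the indices without compiling anything, the same recursion can be sketched in a few lines of Python. This is only a cross-check under the assumptions of the call above (b0 = 0, and the function name `dyn_fore` and its argument order simply mirror the Rcpp version); it is not meant as a replacement for the compiled loop on 100K rows.

```python
def dyn_fore(x, v1, v2, x1, x2, b0, b1, b2, b3, b4):
    """Iterate v(t) = b0 + b1*v(t-1) + b2*v(t-2) + b3*x(t-1) + b4*x(t-2)."""
    v_hist = [v2, v1]               # oldest first; forecasts get appended
    x_hist = [x2, x1] + list(x)     # known lags followed by the x series
    out = []
    for t in range(len(x)):
        # Relative to step t, x_hist[t + 1] is x(t-1) and x_hist[t] is x(t-2)
        v_t = (b0 + b1 * v_hist[-1] + b2 * v_hist[-2]
               + b3 * x_hist[t + 1] + b4 * x_hist[t])
        v_hist.append(v_t)
        out.append(v_t)
    return out


forecast = dyn_fore([33, 43, 52, 67], 105, 95, 11, 23, 0, 0.2, 3.2, -1.2, 0.4)
print([round(v, 2) for v in forecast])
```

Each forecast is pushed onto `v_hist` before the next step, which is exactly the "predicted v feeds the next prediction" behaviour the question asks for.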

Related

Understanding the efficiency of different layer/stages of pipelines when calculating a sum [duplicate]

This question already has answers here:
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
(1 answer)
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)
(1 answer)
How to optimize these loops (with compiler optimization disabled)?
(3 answers)
When, if ever, is loop unrolling still useful?
(9 answers)
how will unrolling affect the cycles per element count CPE
(1 answer)
Closed 1 year ago.
So I have the results for some experiments and I need to write about the efficiency of pipelining.
I don't have the source code but I do have the time it took for a 4 layer pipeline and an 8 layer pipeline to sum an array of 100,000,000 doubles.
The sum was performed the following way.
For the 4-layer pipeline
d0 = 0.0; d1 = 0.0; d2 = 0.0; d3 = 0.0;
for (int i = 0; i < N; i = i + 4) {
    d0 = d0 + a[i + 0];
    d1 = d1 + a[i + 1];
    d2 = d2 + a[i + 2];
    d3 = d3 + a[i + 3];
}
c = d0 + d1 + d2 + d3;
For the 8-layer pipeline
d0 = 0.0; d1 = 0.0; d2 = 0.0; d3 = 0.0;
d4 = 0.0; d5 = 0.0; d6 = 0.0; d7 = 0.0;
for (int i = 0; i < N; i = i + 8) {
    d0 = d0 + a[i + 0];
    d1 = d1 + a[i + 1];
    d2 = d2 + a[i + 2];
    d3 = d3 + a[i + 3];
    d4 = d4 + a[i + 4];
    d5 = d5 + a[i + 5];
    d6 = d6 + a[i + 6];
    d7 = d7 + a[i + 7];
}
c = d0 + d1 + d2 + d3 + d4 + d5 + d6 + d7;
The results I have show the following times for no pipeline, the 2-layer pipeline, the 4-layer pipeline and the 8-layer pipeline. The code for the no-pipeline and 2-layer versions is similar to what I showed above. The results are averaged over 10 runs and are as follows. The experiment was run on an Intel Core i7-9750H processor.
No Pipeline : 0.106 secs
2-Layer-Pipeline: 0.064 secs
4-Layer-Pipeline: 0.046 secs
8-Layer-Pipeline: 0.048 secs
It is evident that efficiency improves from no pipeline up to the 4-layer pipeline, but I'm trying to work out why it actually got slightly worse from the 4-layer to the 8-layer pipeline. Since each partial sum is kept in a different register, there shouldn't be any dependency hazard between them. One idea I had is that there may not be enough ALUs to process more than 4 floating-point additions at a time, which would cause stalls, but then wouldn't it perform at least as well as the 4-layer pipeline? I have plotted the pipeline stages in Excel to try to find where the stalls/bubbles happen, but I can't see any.
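One way to read these timings is to convert them into speedups and an effective data rate. The sketch below assumes that each of the 100,000,000 elements is read exactly once and that a double occupies 8 bytes; both are assumptions, since the source code and memory configuration are not available. If the 4- and 8-accumulator runs land at roughly the same throughput, that would be consistent with something other than the addition dependency chain limiting the loop (memory bandwidth being one candidate), although the numbers alone do not prove it, and a difference of 0.002 s may well be within run-to-run noise.

```python
N = 100_000_000           # number of doubles summed (from the question)
BYTES_PER_DOUBLE = 8      # assumption: IEEE 754 double precision

# Number of accumulators -> measured time in seconds (from the question)
timings = {1: 0.106, 2: 0.064, 4: 0.046, 8: 0.048}


def speedup(k):
    """Speedup of the k-accumulator loop over the single-accumulator loop."""
    return timings[1] / timings[k]


def bandwidth_gb_s(k):
    """Effective read rate in GB/s if every element is streamed once."""
    return N * BYTES_PER_DOUBLE / timings[k] / 1e9


for k in timings:
    print(f"{k} accumulator(s): speedup {speedup(k):.2f}x, "
          f"about {bandwidth_gb_s(k):.1f} GB/s")
```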

Automating a function to return an expression with math constants and unknowns

I am trying to build a transitions matrix from Panel data observations in order to obtain the ML estimators of a weighted transitions matrix. A key step is obtaining the individual likelihood function for individuals. Say you have the following data frame:
ID Feature1 Feature2 Transition
120421006 10000 1 ab
120421006 12000 0 ba
120421006 10000 1 ab
123884392 3000 1 ab
123884392 2000 0 ba
908747738 1000 1 ab
The idea is to return, for each agent, the log-likelihood of his path. For agent 120421006, for instance, this boils down to (ignoring the initial term)
LL = log(exp(Yab)/1 + exp(Yab)) + log(exp(Yba) /(1 + exp(Yba))) +
log(exp(Yab)/1 + exp(Yab))
i.e,
log(exp(Y_transition)/(1 + exp(Y_transition)))
where Y_transition = xFeature1 + yFeature2 for that transition, and x and y are unknowns.
For example, for individual 120421006, this would boil down to an expression with three elements, since he transitions thrice, and the function would return
LL = log(exp(10000x + 1y)/ 1 + exp(10000x + 1y)) +
log(exp(12000x + 0y)/ 1 + exp(12000x + 0y)) +
log(exp(10000x + 1y)/ 1 + exp(10000x + 1y))
And here's the catch: I need x and y to return as unknowns, since the objective is to obtain a sum over the likelihoods of all individuals in order to pass it to an ML estimator. How would you automate a function that returns this output for all IDs?
Many thanks in advance
First you have to decide how flexible your function has to be. I am leaving it fairly rigid, but you can adapt it to your taste.
First, you have to input the initial guess parameters, which you will supply in the optimizer. Then, declare your data and variables to be used in your estimation.
Assuming you will always have only 2 variables (you can change it later)
y <- function(initial_param, data, features){
  x = initial_param[1]
  y = initial_param[2]
  F1 = data[, features[1]]
  F2 = data[, features[2]]
  LL = log(exp(F1 * x + F2 * y) / (1 + exp(F1 * x + F2 * y)))
  return(-sum(LL))
}
This function returns minus the summed log-likelihood, because most optimizers look for a minimum by default.
To find your parameters, pass the likelihood function y, the starting values, the data set and a vector with the names of your variables to nlm. Note that nlm takes the starting values through its argument p; the remaining named arguments are forwarded to y:
nlm(f = y, p = your_starting_guess, data = your_data,
    features = c("name_of_first_feature", "name_of_second_feature"),
    iterlim = 1000, hessian = FALSE)
Create the function:
fun=function(x){
a=paste0("exp(",x[1],"*x","+",x[2],"*y)")
parse(text=paste("sum(",paste0("log(",a,"/(1+",a,"))"),")"))
}
by(test[2:3],test[,1],fun)
sum(log(exp(c(10000, 12000, 10000) * x + c(1, 0, 1) * y)/(1 +
exp(c(10000, 12000, 10000) * x + c(1, 0, 1) * y))))
--------------------------------------------------------------------
sum(log(exp(c(3000, 2000) * x + c(1, 0) * y)/(1 + exp(c(3000,
2000) * x + c(1, 0) * y))))
--------------------------------------------------------------------
sum(log(exp(1000 * x + 1 * y)/(1 + exp(1000 * x + 1 * y))))
Taking x=0 and y=3 as an example, we can evaluate this:
x=0
y=3
sapply(by(test[2:3],test[,1],fun),eval)
[1] -0.79032188 -0.74173453 -0.04858735
In your example above:
x=0
y=3
log(exp(10000*x + 1*y)/ (1 + exp(10000*x + 1*y))) +# There should be parentheses
log(exp(12000*x + 0*y)/ (1 + exp(12000*x + 0*y))) +
log(exp(10000*x + 1*y)/( 1 + exp(10000*x + 1*y)))
[1] -0.7903219
To get what you asked for in the comments:
fun1=function(x){
a=paste0("exp(",x[1],"*x","+",x[2],"*y)")
paste("sum(",paste0("log(",a,"/(1+",a,"))"),")")
}
paste(by(test[2:3],test[,1],fun1),collapse = "+")
[1] "sum( log(exp(c(10000, 12000, 10000)*x+c(1, 0, 1)*y)/(1+exp(c(10000, 12000, 10000)*x+c(1, 0, 1)*y))) )+sum( log(exp(c(3000, 2000)*x+c(1, 0)*y)/(1+exp(c(3000, 2000)*x+c(1, 0)*y))) )+sum( log(exp(1000*x+1*y)/(1+exp(1000*x+1*y))) )"
But it doesn't make sense to group them and then sum everything: that gives the same result as summing over all rows without grouping by ID, which is simpler and faster.
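The per-ID sums and their total can be checked numerically with a short Python sketch. The rows are the ones from the question; the helper names (`log_sigmoid`, `loglik_by_id`, `neg_loglik`) are made up for this illustration. One deliberate difference from the R code above: the log of the logistic function is computed in a rearranged form, on the assumption that with features as large as 10000 a direct `exp()` could overflow for moderate values of x.

```python
import math
from collections import defaultdict

# Rows from the question: (ID, Feature1, Feature2)
rows = [
    (120421006, 10000, 1),
    (120421006, 12000, 0),
    (120421006, 10000, 1),
    (123884392, 3000, 1),
    (123884392, 2000, 0),
    (908747738, 1000, 1),
]


def log_sigmoid(z):
    """log(exp(z) / (1 + exp(z))), arranged so exp() never sees a large positive argument."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def loglik_by_id(x, y):
    """Log-likelihood of each individual's path for given x and y."""
    out = defaultdict(float)
    for id_, f1, f2 in rows:
        out[id_] += log_sigmoid(f1 * x + f2 * y)
    return dict(out)


def neg_loglik(params):
    """Objective for a minimizer: minus the sum over all individuals."""
    x, y = params
    return -sum(loglik_by_id(x, y).values())


print(loglik_by_id(0, 3))
print(neg_loglik((0, 3)))
```

Because `neg_loglik` is just the negated sum of the per-ID values, it also illustrates the last remark above: grouping by ID does not change the total.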

Is there a way to prove that this example of a hash code algorithm will give unique values

I was watching https://www.youtube.com/watch?v=UPo-M8bzRrc&index=21&list=PL4BBB74C7D2A1049C (CS 61B Lecture 21: Hash Tables), and the example the professor gave was:
You have a two-letter word, where each letter falls between a and z
public class Word{
public static final int LETTERS = 26, WORDS = LETTERS * LETTERS;
private String word;
public int hashCode(){
return LETTERS * (word.charAt(0)-'a') + word.charAt(1) - 'a';
}
}
Is there a (mathematical?) way to prove that each possible word will map to a different value between 0 and 675?
I've shown that the range is 0 to 675 (given by "aa" and "zz"), but I'm unsure how to prove uniqueness.
The formula to get the hash code is:
hash = 26 * (A - 'a') + (B - 'a')
= 26 * (A - 97) + (B - 97) // 'a' == 97 in ASCII
= 26A + B - 27*97
So what we need to prove is that 26A + B takes distinct values for any A and B in the range <97; 122> (the decimal values for <'a'; 'z'>). We can ignore the constant 27 * 97, as it does not change the reasoning.
Let's look at the opposite statement: when would the hash code not be distinct? Only when a change in A is compensated by a change in B. So the following would need to be true for two different words:
26 * A1 + B1 = 26 * A2 + B2
Let's assume that A2 = A1 + 1:
26 * A1 + B1 = 26 * (A1 + 1) + B2
= 26 * A1 + B2 + 26
Which means:
B1 = B2 + 26
B1 - B2 = 26
Which is impossible, because B is the char code of a letter in the range <'a'; 'z'>, and two values in that range differ by at most 25 (122 - 97). For any other difference A2 - A1 = k with k != 0, the required compensation is |B1 - B2| = 26 * |k| >= 26, which is even larger, and if A1 = A2 the equation forces B1 = B2.
So, by proving the opposite is impossible, we've proved that the hash code is unique for those characters.
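The argument can also be checked by brute force, and phrased slightly differently: the hash is a two-digit base-26 number, so it can be decoded back into the word, and a function with an inverse cannot map two inputs to the same value. The Python sketch below is only an illustration; `hash_code` re-implements the Java formula and `decode` is a hypothetical helper that is not part of the lecture's code.

```python
from itertools import product
from string import ascii_lowercase

LETTERS = 26


def hash_code(word):
    """Same formula as the Java hashCode() for a two-letter lowercase word."""
    return LETTERS * (ord(word[0]) - ord('a')) + (ord(word[1]) - ord('a'))


def decode(h):
    """Inverse of hash_code: split h into its two base-26 digits."""
    first, second = divmod(h, LETTERS)
    return chr(first + ord('a')) + chr(second + ord('a'))


words = ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]
hashes = [hash_code(w) for w in words]

print(len(words), len(set(hashes)), min(hashes), max(hashes))
```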

Curve Fit 5 points

I am trying to curve fit 5 points in C. I have used this code from a previous post (Can sombody simplify this equation for me?) to do 4 points, but now I need to add another point.
// Input data: arrays x[] and y[]
// x[1],x[2],x[3],x[4] - X values
// y[1],y[2],y[3],y[4] - Y values
// Calculations
A = 0
B = 0
C = 0
D = 0
S1 = x[1] + x[2] + x[3] + x[4]
S2 = x[1]*x[2] + x[1]*x[3] + x[1]*x[4] + x[2]*x[3] + x[2]*x[4] + x[3]*x[4]
S3 = x[1]*x[2]*x[3] + x[1]*x[2]*x[4] + x[1]*x[3]*x[4] + x[2]*x[3]*x[4]
for i = 1 to 4 loop
C0 = y[i]/(((4*x[i]-3*S1)*x[i]+2*S2)*x[i]-S3)
C1 = C0*(S1 - x[i])
C2 = S2*C0 - C1*x[i]
C3 = S3*C0 - C2*x[i]
A = A + C0
B = B - C1
C = C + C2
D = D - C3
end-loop
// Result: A, B, C, D
I have been trying to convert this to a 5 point curve fit, but am having trouble figuring out what goes inside the loop:
// Input data: arrays x[] and y[]
// x[1],x[2],x[3],x[4],x[5] - X values
// y[1],y[2],y[3],y[4],y[5] - Y values
// Calculations
A = 0
B = 0
C = 0
D = 0
E = 0
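S1 = x[1] + x[2] + x[3] + x[4] + x[5]
S2 = x[1]*x[2] + x[1]*x[3] + x[1]*x[4] + x[1]*x[5] + x[2]*x[3] + x[2]*x[4] + x[2]*x[5] + x[3]*x[4] + x[3]*x[5] + x[4]*x[5]
S3 = x[1]*x[2]*x[3] + x[1]*x[2]*x[4] + x[1]*x[2]*x[5] + x[1]*x[3]*x[4] + x[1]*x[3]*x[5] + x[1]*x[4]*x[5] + x[2]*x[3]*x[4] + x[2]*x[3]*x[5] + x[2]*x[4]*x[5] + x[3]*x[4]*x[5]
S4 = x[1]*x[2]*x[3]*x[4] + x[1]*x[2]*x[3]*x[5] + x[1]*x[2]*x[4]*x[5] + x[1]*x[3]*x[4]*x[5] + x[2]*x[3]*x[4]*x[5]
for i = 1 to 5 loop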
C0 = ??
C1 = ??
C2 = ??
C3 = ??
C4 = ??
A = A + C0
B = B - C1
C = C + C2
D = D - C3
E = E + C4
end-loop
// Result: A, B, C, D, E
Any help in filling out C0...C4 would be appreciated. I know this has to do with matrices, but I have not been able to figure it out. Examples with pseudocode or real code would be most helpful.
Thanks
I refuse to miss this opportunity to generalize. :)
Instead, we're going to learn a little bit about Lagrange polynomials and the Newton Divided Difference Method of their computation.
Lagrange Polynomials
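Given n+1 data points (x_0, y_0), ..., (x_n, y_n), the interpolating polynomial is
$$L(x) = \sum_{j=0}^{n} y_j \, l_j(x)$$
where l_j(x) is
$$l_j(x) = \prod_{0 \le m \le n,\; m \ne j} \frac{x - x_m}{x_j - x_m}.$$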
What this means is that we can find the polynomial approximating the n+1 points, regardless of spacing, etc, by just summing these polynomials. However, this is a bit of a pain and I wouldn't want to do it in C. Let's take a look at Newton Polynomials.
Newton Polynomials
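Same start: given n+1 data points, the approximating polynomial is going to be
$$N(x) = \sum_{j=0}^{n} a_j \, n_j(x)$$
where each n_j(x) is
$$n_j(x) = \prod_{i=0}^{j-1} (x - x_i)$$
with a coefficient of
$$a_j = [y_0, \ldots, y_j],$$
being the divided difference.
The final form ends up looking like
$$N(x) = [y_0] + [y_0, y_1](x - x_0) + \cdots + [y_0, \ldots, y_n](x - x_0)(x - x_1)\cdots(x - x_{n-1}).$$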
As you can see, the formula is pretty easy given the divided difference values. You just do each new divided difference and multiply by each point so far. It should be noted that you'll end up with a polynomial of degree n from n+1 points.
Divided Difference
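All that's left is to define the divided difference, which is built up recursively:
$$[y_\nu] = y_\nu$$
and
$$[y_\nu, \ldots, y_{\nu+j}] = \frac{[y_{\nu+1}, \ldots, y_{\nu+j}] - [y_\nu, \ldots, y_{\nu+j-1}]}{x_{\nu+j} - x_\nu}.$$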
With this information, a C implementation should be reasonable to do. I hope this helps and I hope you learned something! :)
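To make the Newton approach concrete, here is a minimal Python sketch of the divided-difference table and the evaluation of the resulting polynomial. It assumes the x values are distinct; the function names `divided_differences` and `newton_eval` are chosen for this illustration. The two loops are plain array arithmetic, so they should carry over to C almost line by line.

```python
def divided_differences(xs, ys):
    """Return the Newton coefficients [y0], [y0,y1], ..., [y0..yn]."""
    coef = list(ys)
    n = len(xs)
    for j in range(1, n):
        # Walk backwards so coef[i - 1] still holds the previous column
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef


def newton_eval(xs, coef, x):
    """Evaluate the Newton form at x using nested multiplication."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result


# Five unevenly spaced points (illustrative values)
xs = [0.0, 1.0, 2.0, 3.0, 5.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
coef = divided_differences(xs, ys)
print(newton_eval(xs, coef, 2.5))
```

Five points give a polynomial of degree at most 4, and it passes through every input point, which is an easy property to test.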
If the x values are equally spaced with x2-x1=h, x3-x2=h, x4-x3=h and x5-x4=h then
C0 = y1;
C1 = -(25*y1-48*y2+36*y3-16*y4+3*y5)/(12*h);
C2 = (35*y1-104*y2+114*y3-56*y4+11*y5)/(24*h*h);
C3 = -(5*y1-18*y2+24*y3-14*y4+3*y5)/(12*h*h*h);
C4 = (y1-4*y2+6*y3-4*y4+y5)/(24*h*h*h*h);
y(x) = C0+C1*(x-x1)+C2*(x-x1)^2+C3*(x-x1)^3+C4*(x-x1)^4
// where `^` denotes exponentiation (and not XOR).
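For points that are not equally spaced, the 4-point loop from the question can be extended to 5 points by following the same pattern. This is a sketch of one way to fill in C0...C4, derived from the Lagrange form rather than taken from the original post: the denominator is the derivative of the product (t - x[1])...(t - x[5]) at x[i], and each further C follows the same recurrence as in the 4-point code. The function name `fit5` is hypothetical, and the x values are assumed to be distinct.

```python
from itertools import combinations
from math import prod


def fit5(x, y):
    """Coefficients (A, B, C, D, E) of A*t^4 + B*t^3 + C*t^2 + D*t + E
    through five points with distinct x values."""
    # S[k] = sum of all products of k distinct x values (S[0] = 1)
    S = [sum(prod(c) for c in combinations(x, k)) for k in range(5)]
    coeffs = [0.0] * 5
    for xi, yi in zip(x, y):
        # Derivative of prod(t - x_j) at t = xi, written in Horner form
        denom = (((5 * xi - 4 * S[1]) * xi + 3 * S[2]) * xi - 2 * S[3]) * xi + S[4]
        c = [yi / denom]                              # C0
        for k in range(1, 5):
            c.append(S[k] * c[0] - c[k - 1] * xi)     # C1..C4
        for k in range(5):
            # Signs alternate: A + C0, B - C1, C + C2, D - C3, E + C4
            coeffs[k] += c[k] if k % 2 == 0 else -c[k]
    return tuple(coeffs)


A, B, C, D, E = fit5([0.0, 1.0, 2.0, 3.0, 5.0], [1.0, 3.0, 2.0, 5.0, 4.0])
print(A, B, C, D, E)
```

Note that S1 to S4 are sums over all five x values here, and the loop runs over all five points.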

Solve Linear System Over Finite Field with Module

Is there any command in Sage to solve a system of linear equations
modulo p(x) (a polynomial over a finite field), where the system coefficients are polynomials over a finite field in some indeterminate? I know that something like this exists for integers, for example
sage: I6 = IntegerModRing(6)
sage: M = random_matrix(I6, 4, 4)
sage: v = random_vector(I6, 4)
sage: M \ v
(4, 0, 2, 1)
Here is my code
F.<a> = GF(2^4)
PR = PolynomialRing(F,'X')
X = PR.gen()
a11 = (a^2)*(X^3)+(a^11)*(X^2)+1
a12 = (a)*(X^4)+(a^13)*(X^3)+X+1
a13 = X^2+(a^13)*(X^3)+a*(X^2)+1
a21 = X^3
a22 = X+a
a23 = X^2+X^3+a*X
a31 = (a^12)*X+a*(X^2)
a32 = (a^8)*(X^2)+X^2+X^3
a33 = a*X + (a^2)*(X^3)
M = matrix([[a11,a12,a13],[a21,a22,a23],[a31,a32,a33]])
v = vector([(a^6)*(X^14)+X^13+X,a*(X^2)+(X^3)*(a^11)+X^2+X+a^12,(a^8)*(X^7)+a*(X^2)+(a^12)* (X^13)+X^3+X^2+X+1])
p = (a^2 + a)*X^3 + (a + 1)*X^2 + (a^2 + 1)*X + 1 # plays the role of 6 in the first code
I'm trying
matrix(PolynomialModRing(p),M)\vector(PolynomialModRing(p),v)
but PolynomialModRing does not exist ...
EDIT
Someone else told me to do the following
R.<Xbar> = PR.quotient(PR.ideal(p))
# change your formulas to Xbar instead of X
A \ b
# ==> (a^3 + a, a^2, (a^3 + a^2)*Xbar^2 + (a + 1)*Xbar + a^3 + a)
This works fine, but now I'm trying to apply the Chinese Remainder Theorem after this code, so I defined
q = X^18 + a*X^15 + a*X^12 + X^11 + (a + 1)*X^2 + a
r = a^3*X^3 + (a^3 + a^2 + a)*X^2 + (a^2 + 1)*X + a^3 + a^2 + a
#p,q and r are relatively prime
and I'm trying ...
crt([(A\b)[0],(A\b)[1],(A\b)[2]],[p,q,r])
but I get
File "element.pyx", line 344, in sage.structure.element.Element.getattr (sage/structure/element.c:3871)
File "misc.pyx", line 251, in sage.structure.misc.getattr_from_other_class (sage/structure/misc.c:1606)
AttributeError: 'PolynomialQuotientRing_field_with_category.element_class' object has no attribute 'quo_rem'
I think the problem is the switch between Xbar and X.
Here is my complete example for integers
from numpy import arange, eye, linalg
#2x-3y+2z=21
#x+4y-z=1
#-x+2y+z=17
A = matrix([[2,-3,2],[1,4,-1],[-1,2,1]])
b=vector([21,1,17])
p=[17,11,13]
d=det(A)
dlist=[0,0,0]
ylist=matrix(IntegerModRing(p[i]),[[2,-3,2],[1,4,-1], [-1,2,1]])\vector(IntegerModRing(p[i]),[21,1,17])
p1=[int(ylist[0]),int(ylist[1]),int(ylist[2])]
CRT(p1,p)
Maybe... this is what you want? Continuing your example:
G = F.extension(p) # This is what you want for "PolynomialModRing(p)"
matrix(G,M)\vector(G,v)
which outputs
(a^3 + a, a^2, (a^3 + a^2)*X^2 + (a + 1)*X + a^3 + a)
In your question you ask "where the system coefficients are polynomials over finite field in any indeterminate" so what I'm doing above is NOT what you have actually asked, which would be a weird question to ask given your example. So, I'm going to just try to read your mind... :-)
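For the integer example in the question, the "solve modulo each prime, then recombine" idea can be sketched in plain Python without Sage. This reading is an assumption about what was intended: the system is solved modulo each of 17, 11 and 13, and the Chinese Remainder Theorem is then applied component by component across the three primes. The helper names `solve_mod` and `crt` are written for this illustration and are not Sage functions.

```python
from math import prod


def solve_mod(A, b, p):
    """Solve A x = b modulo a prime p by Gauss-Jordan elimination.
    Assumes det(A) is not divisible by p."""
    n = len(A)
    M = [[a % p for a in row] + [bi % p] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, p)     # modular inverse (Python 3.8+)
        M[col] = [(v * inv) % p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(vr - f * vc) % p for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]


def crt(residues, moduli):
    """Combine residues for pairwise coprime moduli into one residue."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)
    return x % total


A = [[2, -3, 2], [1, 4, -1], [-1, 2, 1]]
b = [21, 1, 17]
primes = [17, 11, 13]

per_prime = [solve_mod(A, b, p) for p in primes]
solution = [crt([sol[k] for sol in per_prime], primes) for k in range(3)]
print(solution)
```

The recombined values are residues modulo 17 * 11 * 13 = 2431, so they coincide with the integer solution only when that solution is non-negative and smaller than the product of the primes.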

Resources