I am working on an estimation module in which we compute seasonal variation and forecasts. Previously, we used a fixed set of five sinusoidal orders for the estimate. The formula was as follows:
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(4*pi*doy/365)+ z[4]*cos(4*pi*doy/365)
+z[5]*sin(6*pi*doy/365)+ z[6]*cos(6*pi*doy/365)
+z[7]*sin(8*pi*doy/365)+ z[8]*cos(8*pi*doy/365)
+ z[9]*sin(10*pi*doy/365)+ z[10]*cos(10*pi*doy/365))
Now we have tried some modifications to our model. Using the Fast Fourier Transform, we are able to generate the orders for the trigonometric functions automatically.
For example, on my current dataset, I have the following array of orders.
order_FFT = [2, 6, 10, 24], such that
order_FFT[0] = 2
order_FFT[1] = 6
order_FFT[2] = 10
order_FFT[3] = 24
There are 4 orders here. With another dataset, there could be more or fewer orders. Therefore, I need to define a for loop so that the formula adapts to the number of orders.
With my current dataset and the corresponding order_FFT array, the for loop should evaluate the following formula:
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(6*pi*doy/365)+ z[4]*cos(6*pi*doy/365)
+z[5]*sin(10*pi*doy/365)+ z[6]*cos(10*pi*doy/365)
+z[7]*sin(24*pi*doy/365)+ z[8]*cos(24*pi*doy/365))
which basically means
doy_seasonality = exp(z[1]*sin(order_FFT[0]*pi*doy/365)+z[2]*cos(order_FFT[0]*pi*doy/365)
+z[3]*sin(order_FFT[1]*pi*doy/365)+ z[4]*cos(order_FFT[1]*pi*doy/365)
+z[5]*sin(order_FFT[2]*pi*doy/365)+ z[6]*cos(order_FFT[2]*pi*doy/365)
+z[7]*sin(order_FFT[3]*pi*doy/365)+ z[8]*cos(order_FFT[3]*pi*doy/365))
I am at a loss trying to figure out a for loop code for this. Sorry that I am not able to show my own efforts here.
I would not use a loop. Here is an R approach:
#Some test data
set.seed(42)
z <- rnorm(8)
doy <- 1:365
order_FFT <- c(2, 6, 10, 24)
#separate coefficients for sin and cos in two rows:
z <- matrix(z, nrow = 2)
#calculate the sines and cosines:
sins <- outer(doy, order_FFT, function(x, y) sin(x * pi * y / 365))
cosins <- outer(doy, order_FFT, function(x, y) cos(x * pi * y / 365))
#use matrix products to multiply and sum
doy_seasonality2 <- c(exp(sins %*% z[1,] + cosins %*% z[2,]))
Does it produce the same result?
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(6*pi*doy/365)+ z[4]*cos(6*pi*doy/365)
+z[5]*sin(10*pi*doy/365)+ z[6]*cos(10*pi*doy/365)
+z[7]*sin(24*pi*doy/365)+ z[8]*cos(24*pi*doy/365))
all.equal(doy_seasonality, doy_seasonality2)
#[1] TRUE
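Since the question asked for a for loop specifically, the same sum can also be accumulated one order at a time. This is only a minimal sketch and not part of the answer above: the function name seasonality_loop and the period argument are made up for illustration, and it assumes z stores the coefficients in the order sin, cos, sin, cos, ... as in the question.

```r
# Hypothetical helper: builds the exponent term by term, one order per iteration.
seasonality_loop <- function(z, doy, order_FFT, period = 365) {
  # two coefficients (sin and cos) are expected per order
  stopifnot(length(z) == 2 * length(order_FFT))
  lin <- numeric(length(doy))
  for (j in seq_along(order_FFT)) {
    w <- order_FFT[j] * pi * doy / period
    lin <- lin + z[2 * j - 1] * sin(w) + z[2 * j] * cos(w)
  }
  exp(lin)
}

# Usage with the same test data as above
set.seed(42)
z_vec <- rnorm(8)
doy <- 1:365
order_FFT <- c(2, 6, 10, 24)
doy_seasonality3 <- seasonality_loop(z_vec, doy, order_FFT)
```

The loop version works for any length of order_FFT, but the outer() version above is likely to be faster when doy is long.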
Hey everyone, I have a large matrix X with dimensions 645 x 7095. I want to take a subset of this matrix and replace its values with another matrix that I have created. The R code is as follows:
install.packages("Matrix")
library(Matrix)
# kronecker() is part of base R, which is always attached, so nothing else needs to be installed or loaded for it.
T = 215
n = 3
k = 33
X = matrix(0,T*n,T*k)
IN = diag(n)
K1 = Matrix(0, n*n, n*(n-1)/2, sparse = TRUE)
for(i in 1:(n-1)){
K1[(2+(i-1)*(n+1)):(i*n), (1+(i-1)*(n-i/2)):(i*(n-i)*(i+1)/2)] <- diag(n-i)
}
yin = matrix(rnorm(645), ncol = 3)
Xu = matrix(rnorm(2150), ncol = 10)
#So far I have defined the variables and matrices that will be used in the subsetting.
The code above runs fine; however, the code below throws an error:
#Loop for X subsetting
for(i in 1:T){
X[(((i-1)*n)+1):(i*n), (((i-1)*k)+1):(i*k)] <- cbind( (t(kronecker(yin[i,],IN))%*%K1) , (t(kronecker(Xu[i,],IN))))
}
# Here kronecker() computes the Kronecker tensor product of two matrices A and B. It is part of base R.
When I run the code above, the following error is shown:
Error in X[(((i - 1) * n) + 1):(i * n), ] <- cbind((t(kronecker(yin[i, :
number of items to replace is not a multiple of replacement length
However, when I run the same command in MATLAB, it works perfectly fine. MATLAB code:
X = zeros(T*n,T*k);
for i = 1:T
X((i-1)*n+1:i*n,(i-1)*k+1:i*k) = [kron(yin(i,:),IN)*K1, kron(Xu(i,:),IN)];
end
MATLAB fills in the values in the rows and columns specified by the loop that subsets X. I have attached a snapshot of the desired output that MATLAB gives. However, R shows an error for the same operation.
Can someone enlighten me as to where I am going wrong with the R code?
I appreciate the help, many thanks.
I think the problem is how the class 'dgeMatrix' is handled. Try
for (i in 1:T) {
X[(((i-1)*n)+1):(i*n), (((i-1)*k)+1):(i*k)] <- as.matrix(cbind((t(kronecker(yin[i,],IN))%*%K1) , (t(kronecker(Xu[i,],IN)))))
}
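To see why the conversion may matter, it can help to build a single block outside the loop and inspect it. The following is a small sketch with made-up inputs (y_row and xu_row stand in for yin[i,] and Xu[i,]; block and block_base are illustrative names). It assumes the Matrix package is installed and that cbind() of a Matrix object and a base matrix returns an S4 object from the Matrix package, which is what the error suggests.

```r
library(Matrix)

n  <- 3
IN <- diag(n)

# K1 built exactly as in the question (sparse, n^2 rows, n(n-1)/2 columns)
K1 <- Matrix(0, n * n, n * (n - 1) / 2, sparse = TRUE)
for (i in 1:(n - 1)) {
  K1[(2 + (i - 1) * (n + 1)):(i * n),
     (1 + (i - 1) * (n - i / 2)):(i * (n - i) * (i + 1) / 2)] <- diag(n - i)
}

set.seed(1)
y_row  <- rnorm(3)    # stands in for yin[i, ]
xu_row <- rnorm(10)   # stands in for Xu[i, ]

block <- cbind(t(kronecker(y_row, IN)) %*% K1, t(kronecker(xu_row, IN)))
print(class(block))   # expected to be a Matrix-package class, not base "matrix"

# Converting to a base matrix gives something a base matrix can accept
block_base <- as.matrix(block)
print(dim(block_base))

X <- matrix(0, 2 * n, 2 * 33)
X[1:n, 1:33] <- block_base
```

If K1 is small, as it is for n = 3, another option would be to create K1 with matrix() instead of Matrix(), so that every intermediate result stays a base matrix.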
I have a data frame with 10,000 rows and 40 columns. I am trying to apply a function to each of these rows. For each row, I am expecting to return a scalar which is the value of the statistic I am calculating in this function. Below is what I have done so far;
library(sandwich)
# Creating example data #
nrows=10000
ncols=40
n1=20
n2=20
df=data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
cov=data.frame(group=as.factor(rep(c(1,2),c(n1,n2))))
# Function to evaluate on each row of df #
get_est= function(x){
mod = lm(x~cov$group)
vcov = vcovHC(mod)
coef = as.numeric(mod$coefficients[2])
se = sqrt(as.numeric(diag(vcov)[2]))
stats = coef/se
return(stats)
}
# Applying above function to full data #
t1=Sys.time()
estimates=apply(df, 1, function(x) get_est(x))
t2=Sys.time()-t1
# Time taken by apply function
Time difference of 32.10623 secs
Is there a way to significantly decrease the time taken to run get_est() on the full data? The main reason I need to speed up the computation on a single df is that I have 1000 more data frames with the same dimensions, and I have to apply this function to each row of each of these data frames as well. To illustrate, below is the broader situation I am dealing with:
# Creating example data
set.seed(1234)
nrows = 10000
ncols = 40
n1 = 20
n2 = 20
df.list = list()
for(i in 1:1000){
df.list[[i]] = data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
}
# Applying get_est() to each row of each data frame in df.list #
# (%dopar% needs library(foreach) and a registered parallel backend)
all.est = foreach(j = 1:length(df.list), .combine = cbind, .packages = 'sandwich') %dopar% {
cov=data.frame(group=as.factor(rep(c(1,2),c(n1,n2))))
est = apply(df.list[[j]], 1, function(x) get_est(x))
return(est)
}
Even after parallelizing it is taking hours to finish. My ultimate objective is to significantly cut down the time to obtain "all.est" which will contain 10000 rows and 1000 columns where each column has the stats estimates for the respective data set. Any help is much appreciated!! Thanks in advance!
Your function get_est uses some "expensive" functions, such as lm, vcovHC, and so on. If you think of the OLS equation,
$$
\hat{\beta} = (X^TX)^{-1}X^Ty,
$$
then you can see that the first part $(X^TX)^{-1}X^T$ doesn't change in your simulation, so the design matrix is constant. To make use of this, I compute $(X^TX)^{-1}X^T$ before starting the simulation. This approach then also requires computing the HC3 standard errors manually using the formula (see e.g. here)
$$
\widehat{\text{Cov}}_{\text{HC3}}(\hat{\beta}) = (X^TX)^{-1}X^T \text{diag} \left[ \frac{e_i^2}{(1-h_{ii})^2} \right] X(X^TX)^{-1}.
$$
Everything except for the residuals is constant across your simulation iterations, so it can be precomputed. Once I implement these tricks, I achieve a speedup of roughly a factor of 50.
(Note: lm uses a QR decomposition, which could also be implemented similarly here. You may get an even bigger speedup by parallelizing the code.)
nrows = 10000
ncols = 40
n1 = 20
n2 = 20
df = data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
cov = data.frame(group=as.factor(rep(c(1,2),c(n1,n2))))
# old function
get_est_old = function(x){
mod = lm(x~cov$group)
vcov = sandwich::vcovHC(mod)
coef = as.numeric(mod$coefficients[2])
se = sqrt(as.numeric(diag(vcov)[2]))
stats = coef/se
return(stats)
}
# new function
# first construct design matrix
X = matrix(c(rep(1, ncols), rep(0, ncols / 2), rep(1, ncols / 2)), ncol = 2)
# these quantities will be used below
inv = solve(crossprod(X)) %*% t(X)
h = diag(X %*% inv)
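get_est_new = function(x){
# OLS coefficients and residuals from the precomputed projection
coef = (inv %*% x)
resid = x - (X %*% coef)
# HC3 weights e_i^2 / (1 - h_ii)^2 (the "meat" of the sandwich)
meat = (resid^2 / (1 - h)^2)[,1]
hc3 = inv %*% diag(meat) %*% t(inv)
se = sqrt(hc3[2,2])
stats = coef[2,1]/se
return(stats)
}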
# Applying above function to full data #
system.time({
estimates_old = apply(df, 1, function(x) get_est_old(x))
})
#> user system elapsed
#> 7.876 0.042 7.929
system.time({
estimates_new = apply(df, 1, function(x) get_est_new(x))
})
#> user system elapsed
#> 0.141 0.016 0.158
# check
all.equal(estimates_old, estimates_new)
#> [1] TRUE
Created on 2021-09-04 by the reprex package (v2.0.1)
These posts could be of interest:
https://grantmcdermott.com/efficient-simulations-in-r/
https://grantmcdermott.com/simulations-remix-turn-up-the-base/
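Because the design matrix is shared by all rows, the apply() call itself can in principle be replaced by a few matrix operations. The following is a rough sketch rather than a benchmarked solution. It relies on the fact that the [2,2] element of inv %*% diag(w) %*% t(inv) equals sum(inv[2,]^2 * w), so the full sandwich never has to be formed. The names Y, B, E and w are introduced here for illustration, and the data are smaller than in the question to keep it quick.

```r
set.seed(1)
nrows <- 200
ncols <- 40

# Data as a plain numeric matrix (one series per row)
Y <- t(replicate(nrows, rnorm(ncols, 100, 3)))

# Same design matrix and precomputed quantities as in the answer above
X   <- matrix(c(rep(1, ncols), rep(0, ncols / 2), rep(1, ncols / 2)), ncol = 2)
inv <- solve(crossprod(X)) %*% t(X)   # 2 x ncols
h   <- diag(X %*% inv)                # leverages h_ii

B <- inv %*% t(Y)                     # 2 x nrows: coefficients for every row at once
E <- t(Y) - X %*% B                   # ncols x nrows: residuals, one column per series

# HC3 variance of the second coefficient, for all series at once
w  <- inv[2, ]^2 / (1 - h)^2          # length ncols, recycled down each column of E
se <- sqrt(colSums(w * E^2))

estimates_vec <- B[2, ] / se
```

For a list of data frames, each one would first need to be converted with as.matrix(); whether this beats a parallel apply() on your machine is something to measure rather than assume.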
I have been trying to implement a random walk on the n-cycle algorithm in R.
By n-cycle I mean the set of integers Zn, i.e., the integers modulo n. Basically, it’s Example 5.3.1 from the book “Markov Chains and Mixing Times”, by Levin, Peres and Wilmer. The intention is as follows: consider two chains modeling the movement of two particles X and Y on Zn with starting points X1 and Y1. The flip of a fair coin decides which particle moves (the particles cannot move simultaneously unless they have coupled); the direction is decided by another flip of a fair coin.
Once the two particles collide, they move together thereafter. This is part of a study project to implement a CFTP algorithm, so the length of the chains should have a pre-defined value, say T.
The code does not run and an error message appears. The error is “object ‘res’ not found”. However, I had previously defined “res” as a list to store the output of the function. Why does this happen and how could it be fixed?
I have two scripts: in the first one the code is split in smaller helper functions; the second one may be messier, as I tried to put all the helper functions within one single function.
Any help will be much appreciated.
This one is script 2.
# X1 - initial state of chain X
# Y1 - initial state of chain Y
# T - "length" of a chain, number of steps the chains will run for.
# n - length of the n-cycle, i.e., Zn.
Main_Function <- function (X1 = 8, Y1 = 4 , T = 20, n = 6){
X <- rep( X1, T) %% n # X, Y and res will store the results
Y <- rep( Y1, T) %% n
res <- list(X,Y) # Here I defined the object res. Later on R encounters an error "object 'res' not found".
ps <- TakeOneStep() # TakeOneStep is a function defined below
return(ps)
}
TakeOneStep <- function(){
incr_same <- sample(c(-1, 0, 1), size = 1, prob = c(1/4, 1/2, 1/4)) #direction of the particles after they have coupled
incr_dif <- sample(c(-1,1), size = 1, prob = c(1/2, 1/2)) # direction of the particles before coupling occurred.
choice <- runif(T) # determines which chain moves, before coupling occurred.
for(t in 2:T){
if(res[[1]][t-1]%%n == res[[2]][t-1]%%n){
res[[1]][t] <- (res[[1]][t-1] + incr_same) %% n
res[[2]][t] <- (res[[2]][t-1] + incr_same) %% n
}else{ if(choice[t] < 0.5) {
res[[1]][t] <- (res[[1]][t-1] + incr_dif) %% n
}else{res[[2]][t] <- (res[[2]][t-1] + incr_dif)%%n}
}
}
return(res)
}
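The error most likely comes from R's lexical scoping: res, T and n are local to Main_Function, so TakeOneStep, which is defined at top level, cannot see them. One way around this is to pass them in as arguments. Below is a minimal sketch under a few assumptions of mine: the names run_chains and take_steps are made up, the increments are drawn afresh at every step rather than once before the loop, and the particle that does not move keeps its previous position.

```r
# Hypothetical rewrite: everything the stepping function needs is passed in.
take_steps <- function(res, T, n) {
  for (t in 2:T) {
    if (res[[1]][t - 1] == res[[2]][t - 1]) {
      # coupled: both particles take the same step
      step <- sample(c(-1, 0, 1), size = 1, prob = c(1/4, 1/2, 1/4))
      res[[1]][t] <- (res[[1]][t - 1] + step) %% n
      res[[2]][t] <- (res[[2]][t - 1] + step) %% n
    } else {
      # not yet coupled: one coin picks the particle, another the direction
      step <- sample(c(-1, 1), size = 1)
      if (runif(1) < 0.5) {
        res[[1]][t] <- (res[[1]][t - 1] + step) %% n
        res[[2]][t] <- res[[2]][t - 1]
      } else {
        res[[1]][t] <- res[[1]][t - 1]
        res[[2]][t] <- (res[[2]][t - 1] + step) %% n
      }
    }
  }
  res
}

run_chains <- function(X1 = 8, Y1 = 4, T = 20, n = 6) {
  res <- list(X = rep(X1 %% n, T), Y = rep(Y1 %% n, T))
  take_steps(res, T, n)
}

set.seed(1)
out <- run_chains()
```

Defining the stepping function inside the main function would also make res visible, but assignments to it would then only change a local copy unless the result is returned, so passing and returning res explicitly seems the simpler design.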
I have seen a lot of questions close to this one, but nothing has actually solved the issue I am running into. My question is: how do you refer to the i-th vector inside a function such as mean(), and how could you place its index into a title? I recently switched from SAS to R, so I'm a little confused.
###### parameters #####
nphase1=50
nphase2=1000
varcount=5
meanshift= 0
sigmashift= 1
##### phase1 dataset/ control limits #####
for (i in 1:varcount)
{
assign (paste("x",i, sep=""), (rnorm(nphase1,0,1)))
mean_var[i]=mean(x[i])
std_var[i]=sd(x[i])
Upper_SPC_Limit_Method1_var[i]= mean_var[i] + (3 * std_var[i])
Lower_SPC_Limit_Method1_var[i]= mean_var[i] - (3 * std_var[i])
moving_range_var[i]= abs(diff(x[i]))
MR_mean[i]= mean(moving_range_var[i])
Upper_SPC_Limit_Method2_var[i] =mean_var[i] + (3 * MR_mean[i])
Lower_SPC_Limit_Method2_var[i] =mean_var[i] - (3 * MR_mean[i])
}
I am sure I will have to do something similar to assign(paste("x", i, sep="")) for labeling the individual limits, but I can't get to that step without being able to calculate the mean of each variable inside the for loop.
What I am trying to do is create 5 variables that have 50 observations each (normal random distribution). I want to take the mean and SD of each variable to construct control limits from these numbers.
Thanks for your insight!
I believe the code below accomplishes what you desire. I make use of matrix(), with() and apply(), and strongly recommend reading up on them for this kind of work.
Apply() Tutorial
With() Primer
###### parameters #####
nphase1=50
nphase2=1000
varcount=5
meanshift= 0
sigmashift= 1
##### phase1 dataset/ control limits #####
x <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol = varcount)
mean_var <- apply(x, 2, mean)
std_var <- apply(x, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1_var <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1_var <- with(df_var, mean_var - 3 * std_var)
moving_range_var <- apply(x, 2, function(z) abs(diff(z)))
MR_mean <- apply(moving_range_var, 2, mean)
Upper_SPC_Limit_Method2_var <- with(df_var, mean_var + 3 * MR_mean)
Lower_SPC_Limit_Method2_var <- with(df_var, mean_var - 3 * MR_mean)
Your variables aren't named x[1], x[2], etc. They would be x1, x2, and so on. You should probably create a list if that's what you want to do, i.e. x[[i]] <- rnorm(nphase1, 0, 1), but still your code is inefficient. You should look into vectorizing it, making x a matrix, etc.
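To make the list suggestion concrete, here is a short sketch of how it could look. It is only an illustration: the names titles, upper_limit and lower_limit are made up, and only the Method 1 limits are shown.

```r
set.seed(1)
nphase1  <- 50
varcount <- 5

# One list element per variable instead of separate objects x1, x2, ...
x <- vector("list", varcount)
for (i in seq_len(varcount)) {
  x[[i]] <- rnorm(nphase1, 0, 1)
}
names(x) <- paste0("x", seq_len(varcount))

# x[[i]] is the i-th vector, so mean(x[[i]]) works; vapply does it for all i
mean_var <- vapply(x, mean, numeric(1))
std_var  <- vapply(x, sd, numeric(1))

upper_limit <- mean_var + 3 * std_var
lower_limit <- mean_var - 3 * std_var

# The element names can be pasted into titles, e.g. for plots
titles <- paste("Control limits for", names(x))
```

The results are named vectors, so upper_limit["x3"] picks out the limit for the third variable without any assign() or get() calls.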
Here is the formula which I am trying to calculate in R.
So far, this is my approach using a simplified example
t <- seq(1, 2, 0.1)
expk <- function(k){exp(-2*pi*1i*t*k)}
set.seed(123)
dat <- ts(rnorm(100), start = c(1994,3), frequency = 12)
arfit <- ar(dat, order = 4, aic = FALSE) # represent \phi in the formula
tmp1 <- numeric(4)
for (i in seq_along(arfit$ar)){
ek <- expk(i)
arphi <- arfit$ar[i]
tmp1[i] <- ek * arphi
}
tmp2 <- sum(tmp1)
denom = abs(1-tmp2)^2
s2 <- t/denom
Error : Warning message:
In tmp1[i] <- ek * arphi :
number of items to replace is not a multiple of replacement length
I was trying to avoid a for loop and tried sapply, as in the solutions to this question:
denom2 <- abs(1- sapply(seq_along(arfit$ar), function(x)sum(arfit$ar[x]*expk(x))))^2
but it doesn't seem to be correct. The problem is how to sum the series (over the index k) when it also takes values from another vector, in this case t, which is in the numerator.
Any solutions?
Any suggestions for a test dataset, maybe using 0 and 1, to check whether the calculation in this loop is done correctly?
Typing up the answer determined in chat. Here's a solution involving vapply.
First correct expk to:
expk <- function(k){sum(exp(-2*pi*1i*t*k))}
Then you can create this function and vapply it:
myFun <- function(i) return(expk(i) * arfit$ar[i])
tmp2 <- sum(vapply(seq_along(arfit$ar), myFun, complex(1)))
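If what is wanted is instead one value of the sum per element of t (this is an assumption on my part, since the formula itself is not reproduced here), the sum over k can be written as a matrix product. In this sketch outer() builds all t*k combinations at once; the names phi, k and E are introduced for illustration.

```r
set.seed(123)
dat   <- ts(rnorm(100), start = c(1994, 3), frequency = 12)
arfit <- ar(dat, order.max = 4, aic = FALSE)

t   <- seq(1, 2, 0.1)
phi <- arfit$ar
k   <- seq_along(phi)

# E[j, k] = exp(-2 * pi * i * t[j] * k): one row per t, one column per k
E <- exp(-2 * pi * 1i * outer(t, k))

# Sum over k for every t at once, then the squared modulus
tmp2  <- drop(E %*% phi)
denom <- Mod(1 - tmp2)^2
```

Here denom has the same length as t, so an elementwise division such as t / denom lines up without recycling warnings.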