First post here. I am trying to run a rolling linear regression on a time series; the code is as follows:
MX_data <- merge.zoo(as.zoo(MX_tr),as.zoo(MX_RER_2))
MX_tr and MX_RER_2 are ts objects with 37 rows. The value for width below, w, is a vector that defines the lengths of the windows on which the regressions will be calculated:
w<- c(4,4,4,4,4,4,4,4,5)
rollingbeta <- rollapply(MX_data,
                         width = w,
                         FUN = function(Z) coef(lm(formula = MX_tr ~ MX_RER_2,
                                                   data = as.data.frame(MX_data))),
                         by.column = FALSE, align = "right")
The result I get is a matrix-like object with two columns, one for the intercept and one for the beta term. The problem is that every value in each column is the same: -1.14 for the beta term and 0.0866 for the intercept. These are the values I get from running the regression on the whole series.
Help is much appreciated, thanks.
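One likely cause, for reference: FUN above never uses its window argument Z, so every window refits the regression on the full MX_data, which would explain identical coefficients in every row. A minimal sketch of a per-window fit, assuming the merged columns are renamed so the formula can find them (the single width is just for the sketch; zoo also accepts a per-observation width vector):
colnames(MX_data) <- c("MX_tr", "MX_RER_2")   # assumed renaming so the formula matches
rollingbeta <- rollapply(MX_data,
                         width = 4,            # single window length, for the sketch only
                         FUN = function(Z) coef(lm(MX_tr ~ MX_RER_2,
                                                   data = as.data.frame(Z))),
                         by.column = FALSE, align = "right")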
I am working with time series 551 of the monthly data from the M3 competition.
So, my data is:
library(forecast)
library(Mcomp)
# Time Series
# Subset the M3 data to contain the relevant series
ts.data<- subset(M3, 12)[[551]]
print(ts.data)
I want to implement time series cross-validation for the last 18 observations of the in-sample interval.
Some people would normally call this “forecast evaluation with a rolling origin” or something similar.
How can I achieve that? What does the in-sample interval mean? Which is the time series I must evaluate?
I'm quite confused; any help to clear this up would be welcome.
The tsCV function of the forecast package is a good place to start.
From its documentation,
tsCV(y, forecastfunction, h = 1, window = NULL, xreg = NULL, initial = 0, ...)
Let ‘y’ contain the time series y[1:T]. Then ‘forecastfunction’ is
applied successively to the time series y[1:t], for t=1,...,T-h,
making predictions f[t+h]. The errors are given by e[t+h] =
y[t+h]-f[t+h].
That is, tsCV first fits a model to y[1] and forecasts y[1 + h], then fits a model to y[1:2] and forecasts y[2 + h], and so on, for T - h steps.
The tsCV function returns the forecast errors.
Applying this to the training data of ts.data:
# function to fit a model and forecast
fmodel <- function(x, h){
forecast(Arima(x, order=c(1,1,1), seasonal = c(0, 0, 2)), h=h)
}
# time-series CV
cv_errs <- tsCV(ts.data$x, fmodel, h = 1)
# RMSE of the time-series CV
sqrt(mean(cv_errs^2, na.rm=TRUE))
# [1] 778.7898
In your case, it may be that you are supposed to:
fit a model to ts.data$x and forecast ts.data$xx[1],
then fit a model to c(ts.data$x, ts.data$xx[1]) and forecast ts.data$xx[2],
and so on, as sketched below.
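A rough sketch of that expanding-window loop, reusing the fmodel defined above and the x/xx split that Mcomp provides (treat this as an illustration of the idea rather than the exact evaluation the exercise asks for):
h_errs <- numeric(length(ts.data$xx))
for (i in seq_along(ts.data$xx)) {
  # training data: the in-sample series plus the out-of-sample points seen so far
  train <- ts(c(ts.data$x, head(ts.data$xx, i - 1)),
              start = start(ts.data$x), frequency = frequency(ts.data$x))
  fc <- fmodel(train, h = 1)                 # one-step-ahead forecast
  h_errs[i] <- ts.data$xx[i] - fc$mean[1]    # forecast error
}
sqrt(mean(h_errs^2))                         # RMSE over the hold-out points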
I am new to coding, so I still struggle with simple things such as loops, subsetting, and data frames vs. matrices.
I am trying to fit a ridge regression for a multivariable X (X1 = Marker 1, X2 = Marker 2, X3 = Marker 3, ..., X1333 = Marker 1333), shown in the first image, as predictors of Y, shown in the second image.
I want to compute the sum of the squared errors (SSE) for varying tuning parameter λ (between 1 and 20). My code is the following:
#install.packages("MASS")
library(MASS)
fitridge <- function(x, y) {
  fridge <- lm.ridge(y ~ x, lambda = seq(0, 20, 2)) # fit a ridge regression for varying λ values
  sum(residuals(fridge)^2)                          # this gives the SSE
}
all_gcv <- apply(as.matrix(genmark_new), 2, fitridge, y = as.matrix(coleslev_new))
However, it returns this error, and I don't know what to do anymore. I have tried converting the data set into a matrix, a data frame, changing the order of rows and columns...
Error in colMeans(X[, -Inter]) : 'x' must be an array of at least two dimensions.
I would just like to take each marker value from a single row (first picture), pass them into my fitridge function, which fits a ridge regression against the Y from the second data set (in the second picture), and then subset the SSE and their corresponding lambda values.
You cannot fit a ridge regression with only one independent variable; it is not meant for that. In your case, most likely you have to do:
genmark_new = data.frame(matrix(sample(0:1,1333*100,replace=TRUE),ncol=1333))
colnames(genmark_new) = paste0("Marker_",1:ncol(genmark_new))
coleslev_new = data.frame(NormalizedCholesterol=rnorm(100))
Y = coleslev_new$NormalizedCholesterol
library(MASS)
fit = lm.ridge(y ~ ., data = data.frame(genmark_new, y = Y), lambda = seq(0, 20, 2))
And calculate residuals for each lambda:
apply(fit$coef,2,function(i)sum((Y-as.matrix(genmark_new) %*% i)^2))
       0        2        4        6        8       10       12       14 
26.41866 27.88029 27.96360 28.04675 28.12975 28.21260 28.29530 28.37785 
      16       18       20 
28.46025 28.54250 28.62459
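Since the question also asks to subset the SSE with the corresponding lambda values, a small follow-up sketch using the same objects and the same residual computation as above (fit$lambda holds the grid passed to lm.ridge):
sse <- apply(fit$coef, 2, function(i) sum((Y - as.matrix(genmark_new) %*% i)^2))
data.frame(lambda = fit$lambda, SSE = sse)   # SSE paired with each lambda
fit$lambda[which.min(sse)]                   # lambda with the smallest SSE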
If you need to fit each variable separately, you can consider using a linear model:
fitlm <- function(x,y){
fridge=lm(y ~ x)
sum(residuals(fridge)^2)
}
all_gcv= apply(genmark_new,2,fitlm,y=Y)
Suggestion: check out some notes or introductions to ridge regression; it is meant for multivariate regression, that is, more than one independent variable.
In R, I have a dataset of (x, y) points that is constantly being updated via simulation (values are appended to the end of the dataset).
I would like to compute the slope (via a linear model) of the line created by the data using only the last 10 listed datapoints.
The confusion here arises from the fact that the data are changing, and so I suspect a loop may be needed to iterate over the indices of the datapoints.
In R, one usually does something like
linreg <- lm(y ~ x, data = d) # set up linear model
summary.linreg <- summary(linreg) # output summary of model
beta1 <- coef(summary.linreg)[2] # extract slope
The change that is needed in my case is in linreg, specifically
linreg <- lm(y[?] ~ x[?], data = d) # subset response and predictor
For a non-changing dataset of 10 x-y points, one simply does [?] = [1:10] and the problem is solved. In my case though, I am at a standstill as to the best way to proceed efficiently.
Any thoughts?
No, don't subset inside the formula. Subset the data.frame. Inside your loop, after each dataset update, do this:
linreg <- lm(y ~ x, data = tail(d, 10))
If you want to loop over a data.frame rows, do this:
linreg <- lm(y ~ x, data = d[i:(i+9),])
If your data.frame is large and you only need the slope, you should use the more low-level function lm.fit for better performance. There might also be packages that provide functions for rolling regression.
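As a rough sketch of that lower-level route, assuming d has numeric columns x and y as in the question:
last10 <- tail(d, 10)                           # the most recent 10 points
fit    <- lm.fit(cbind(1, last10$x), last10$y)  # design matrix: intercept + x
beta1  <- fit$coefficients[2]                   # slope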
I want to run a customized function based on an ARIMA model. The function takes the ma3 coefficient from an ARIMA(2, 0, 3) model fitted to a year of daily data and subtracts it from 2, for every firm. I have five years of daily data for five firms, so each firm should have five year-wise values. My code:
Stressy <- function(x) 2 - summary(arima(x, order = c(2, 0, 3)))$coefficients[1, "ma3"]
Funny <- aggregate(cbind(QQ) ~ Year + Firm, df, FUN = Stressy)
Running my code gives the following error:
Error in summary(arima(x, order = c(2, 0, 3)))$coefficients : $ operator is invalid for atomic vectors
I know the result can be estimated manually but my data set is large enough to be confusing when handled manually. Please suggest an edit to fix this.
There are two ways you could get the ma3 coefficient:
Stressy <- function(x) 2-coef(arima(x, order=c(2,0,3)))["ma3"]
or
Stressy <- function(x) 2-arima(x, order=c(2,0,3))$coef["ma3"]
Your original custom function didn't work because summary(arima_object) gives you a table, to which you cannot apply the $ operator:
x <- arima(df, c(2,0,3))
class(summary(x))
[1] "summaryDefault" "table"
I'm using Amelia to impute the missing values.
While I'm able to use Zelig and Amelia to do some calculations...
How do I use these packages to find the pooled means and standard deviations of the newly imputed data?
library(Amelia)
library(Zelig)
n= 100
x1= rnorm(n,0,1) #random normal distribution
x2= .4*x1+rnorm(n,0,sqrt(1-.4)^2) #x2 is correlated with x1, r=.4
x1= ifelse(rbinom(n,1,.2)==1,NA,x1) #randomly creating missing values
d= data.frame(cbind(x1,x2))
m=5 #set 5 imputed data frames
d.imp=amelia(d,m=m) #imputed data
summary(d.imp) #provides summary of imputation process
I couldn't figure out how to format the code in a comment so here it is.
foo <- function(x, fcn) apply(x, 2, fcn)
lapply(d.imp$imputations, foo, fcn = mean)
lapply(d.imp$imputations, foo, fcn = sd)
d.imp$imputations gives a list of all the imputed data sets. You can work with that list however you like to get the means and SDs by column, and then pool as you see fit. The same goes for correlations:
lapply(d.imp$imputations, cor)
Edit: After some discussion in the comments, I see that what you are looking for is how to combine results using Rubin's rules, for example for the mean of imputed data sets generated by Amelia. I think you should clarify in the title and body of your post that you are looking for how to combine results over imputations to get appropriate standard errors with Rubin's rules after imputing with the Amelia package. This was not clear from the title or original description. "Pooling" can mean different things, particularly with respect to variances.
The mi.meld function expects a q matrix of estimates from each imputation, an se matrix of the corresponding standard error estimates, and a logical byrow argument. See ?mi.meld for an example. In your case, you want the sample means and the estimated standard errors of those sample means for each of your imputed data sets in the q and se matrices passed to mi.meld, respectively.
q <- t(sapply(d.imp$imputations, foo, fcn = mean))
se <- t(sapply(d.imp$imputations, foo, fcn = sd)) / sqrt(100)
output <- mi.meld(q = q, se = se, byrow = TRUE)
should get you what you're looking for. For statistics other than the mean, you will need to get an SE either analytically, if available, or, say, by bootstrapping if not.
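As a usage note, mi.meld() returns the pooled results as a list; something along these lines should expose them (component names q.mi and se.mi as in the Amelia documentation):
output$q.mi    # pooled (melded) means for x1 and x2
output$se.mi   # corresponding pooled standard errors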