I'm trying to get the residual standard deviation (the residual standard error in R summaries) for rolling regressions on 20-day windows of stock returns over a total of 4000 days. I can run the rolling regressions, and I can get the residual standard deviation from a regular lm regression, but not from the rolling regression.
My data is similar to the following, where the data frame has the returns of multiple stocks and the vector is an index return:
data<-as.data.frame(matrix(rexp(20000, rate=.1), ncol=20))
vector<-rexp(1000,rate=0.1)
I can produce a sigma for an lm regression: sigma(lm(data$V1~vector))
I can produce a rolling regression with library(roll) and roll_lm(vector, data$V1, width=20), and with library(rollRegres) and roll_regres(data$V1 ~ vector, width=20).
Is there a way to get the residual standard error / residual standard deviation / sigma from such rolling regressions?
I would like to end up with a data frame containing only the residual standard deviations.
Thanks!
If you read the code for summary.lm, the residual standard error is the square root of the residual sum of squares (rss) divided by the residual degrees of freedom (rdf). Since roll_lm doesn't retain this, you need to use the coefficients to get the predictions and calculate it again:
data<-as.data.frame(matrix(rexp(20000, rate=.1), ncol=20))
vector<-rexp(1000,rate=0.1)
library(roll)
WI = 20
rlm = roll_lm(vector,data$V1,width=WI)
rdf = WI - ncol(rlm$coefficients)
Below we go through every window, get the predictions, calculate the rss, and from there get the sigma:
sigma = sapply(1:(nrow(data)-WI+1), function(i){
  # basically intercept + predictor * coef
  pred = cbind(rep(1, WI), vector[i:(i+WI-1)]) %*% rlm$coefficients[WI+i-1,]
  rss = sum((data$V1[i:(i+WI-1)] - pred)^2)
  sqrt(rss/rdf)
})
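As a quick sanity check (not part of the original answer, and assuming roll_lm's default equal weights), the first element of sigma should match sigma() from an ordinary lm fit on the first 20-day window:
# Window 1 uses observations 1:WI; compare against a plain lm fit
fit1 <- lm(data$V1[1:WI] ~ vector[1:WI])
all.equal(stats::sigma(fit1), sigma[1])  # should be TRUE up to numerical tolerance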
We can wrap this up in a function that takes x and y as input:
roll_w_sigm = function(x, y, WI = 20){
  rlm = roll_lm(x = x, y = y, width = WI)
  rdf = WI - ncol(rlm$coefficients)
  rlm$sigma = sapply(1:(length(y)-WI+1), function(i){
    # use x (not the global vector) so the function works for any predictor
    pred = cbind(rep(1, WI), x[i:(i+WI-1)]) %*% rlm$coefficients[WI+i-1,]
    rss = sum((y[i:(i+WI-1)] - pred)^2)
    sqrt(rss/rdf)
  })
  rlm
}
For 1 column:
res = roll_w_sigm(vector,data$V1)
head(res$sigma)
[1] 9.102188 9.297425 9.324338 9.509460 7.849201 7.993087
For all columns:
lapply(data,function(i)roll_w_sigm(vector,i))
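And since the goal was a data frame containing only the residual standard deviations, the sigma components can be pulled out and bound together; a small sketch reusing roll_w_sigm:
# Each column holds the rolling sigmas for one stock (length nrow(data) - WI + 1)
sigma_df <- as.data.frame(lapply(data, function(col) roll_w_sigm(vector, col)$sigma))
head(sigma_df)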
I am currently working on a non-linear analysis of various datasets using an nls model. I also want to calculate the standard error of the regression for the nls model.
The formula for the standard error of the regression is:
n <- nrow(na.omit(data))
SE <- sqrt(sum((pv - av)^2) / (n - 2))
where pv is the predicted value and av is the actual value.
I'm having trouble calculating the standard error. Should I calculate the predicted and actual values first? Are the values based on the dataset? Any help is highly appreciated. Thank you.
R provides this via sigma:
fm <- nls(demand ~ a + b * Time, BOD, start = list(a = 1, b = 1))
sigma(fm)
## [1] 3.085016
This would also work, since deviance gives the residual sum of squares:
sqrt(deviance(fm) / (nobs(fm) - length(coef(fm))))
## [1] 3.085016
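The same value can also be reproduced directly from the question's pv/av formulation, where pv are the fitted values and av the observed responses; a short sketch (the model has two parameters, a and b, hence n - 2):
pv <- fitted(fm)   # predicted values
av <- BOD$demand   # actual values
sqrt(sum((pv - av)^2) / (nrow(BOD) - 2))
## [1] 3.085016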
I am very confused about the package Zelig and in particular the function sim.
What I want to do is estimate a logistic regression using a subset of my data and then estimate the fitted values of the remaining data to see how well the estimation performs. Some sample code follows:
library(Zelig)
library(data.table)
data(turnout)
turnout <- data.table(turnout)
# Shuffle the data
turnout <- turnout[sample(.N, 2000)]
# Create a sample for the regression
turnout_sample <- turnout[1:1800,]
# Create a sample for out-of-sample testing
turnout_sample2 <- turnout[1801:2000,]
# Run the regression
z.out1 <- zelig(vote ~ age + race, model = "logit", data = turnout_sample)
summary(z.out1)
Model:
Call:
z5$zelig(formula = vote ~ age + race, data = turnout_sample)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9394 -1.2933 0.7049 0.7777 1.0718
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.028874 0.186446 0.155 0.876927
age 0.011830 0.003251 3.639 0.000274
racewhite 0.633472 0.142994 4.430 0.00000942
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2037.5 on 1799 degrees of freedom
Residual deviance: 2002.9 on 1797 degrees of freedom
AIC: 2008.9
Number of Fisher Scoring iterations: 4
Next step: Use 'setx' method
# Set the x values to the remaining 200 observations
x.out1 <- setx(z.out1, fn = NULL, data = turnout_sample2)
# Simulate
s.out1 <- sim(z.out1, x = x.out1)
# Get the fitted values
fitted <- s.out1$getqi("ev")
What I don't understand is that the list fitted now contains 1000 values, and all the values are between 0.728 and 0.799.
1. Why are there 1000 values when what I am trying to estimate is the fitted value of 200 observations?
2. And why are the observations so closely grouped?
I hope someone can help me with this.
Best regards
The first question: from the signature of sim (sim(obj, x = NULL, x1 = NULL, y = NULL, num = 1000, ...)) you can see that the default number of simulations is 1000. If you want 200, set num = 200.
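For example, a minimal sketch based on the signature quoted above (num only changes how many simulations are drawn):
# Draw 200 simulations instead of the default 1000
s.out1 <- sim(z.out1, x = x.out1, num = 200)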
However, sim in the documentation example you are following actually generates (simulates) the probability that a person will vote given certain values (either computed by setx, or fixed at some value, as in setx(z.out, race = "white")).
So in your case you have 1000 simulated probability values between 0.728 and 0.799, which is what you are supposed to get.
So I'm using the quantreg package in R to conduct quantile regression analyses to test how the effects of my predictors vary across the distribution of my outcome.
FML <- as.formula(outcome ~ VAR + c1 + c2 + c3)
quantiles <- c(0.25, 0.5, 0.75)
q.Result <- list()
for (i in quantiles){
  i.no <- which(quantiles==i)
  q.Result[[i.no]] <- rq(FML, tau=i, data, method="fn", na.action=na.omit)
}
Then I call anova.rq, which runs a Wald test on all the models and outputs a p-value for each covariate, telling me whether the effect of each covariate varies significantly across the distribution of my outcome.
anova.Result <- anova(q.Result[[1]], q.Result[[2]], q.Result[[3]], joint=FALSE)
That works just fine. However, for my particular data (and in general?), bootstrapping my estimates and their errors is preferable, which I do with a slight modification of the code above.
Q.mod <- rq(FML, tau=quantiles, data, method="fn", na.action=na.omit)
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb",
covariance=TRUE)
Here's where I get stuck. The quantreg package currently cannot perform the anova (Wald) test on bootstrapped estimates. The quantreg documentation specifically states that "extensions of the methods to be used in anova.rq should be made" regarding the bootstrapping method.
Looking at the details of the anova.rq method, I can see that it requires two components not present in the quantile model when bootstrapping:
1) Hinv (the inverse Hessian matrix). The package documentation specifically states: "note that for se = "boot" there is no way to split the estimated covariance matrix into its sandwich constituent parts."
2) J, which, according to the documentation, is the "Unscaled Outer product of gradient matrix returned if cov=TRUE and se != "iid". The Huber sandwich is cov = tau (1-tau) Hinv %*% J %*% Hinv. As for the Hinv component, there is no J component when se == "boot". (Note that to make the Huber sandwich you need to add the tau (1-tau) mayonnaise yourself.)" (See the sketch below.)
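For reference, when a non-bootstrap standard error is used both pieces are returned and the sandwich can be assembled by hand. This is only an illustrative sketch, assuming summary.rq exposes Hinv and J as described in the quoted documentation:
s.nid <- summary(rq(FML, tau = 0.5, data, method = "fn"), se = "nid", covariance = TRUE)
tau <- 0.5
# Huber sandwich; the tau*(1-tau) "mayonnaise" is added by hand
V <- tau * (1 - tau) * s.nid$Hinv %*% s.nid$J %*% s.nid$Hinv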
Can I calculate or estimate Hinv and J from the bootstrapped estimates? If not, what is the best way to proceed?
Any help on this is much appreciated. This is my first time posting a question here, though I've greatly benefited from the answers to other people's questions in the past.
For the second question: you can use the R argument for resampling. For example:
anova(object, ..., test = "Wald", joint = TRUE, score =
"tau", se = "nid", R = 10000, trim = NULL)
Here R is the number of resampling replications for the anowar form of the test, used to estimate the reference distribution for the test statistic.
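Applied to the separately fitted models from your loop, that could look something like the sketch below; whether test = "anowar" is the right value to request the resampling form is an assumption here (based on the "anowar" wording above) and should be checked against ?anova.rq:
# Resampling form of the test; R sets the number of replications
anova(q.Result[[1]], q.Result[[2]], q.Result[[3]], joint = FALSE,
      test = "anowar", R = 10000)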
Just a heads up: you'll probably get a better response to your questions if you only include one question per post.
I consulted with a colleague, and he confirmed that it was unlikely that Hinv and J could be 'reverse' computed from bootstrapped estimates. However, we resolved that estimates from different taus could be compared using a Wald test as follows.
From the object produced by
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb", covariance=TRUE)
you extract the bootstrapped beta values for the variable of interest (in this case VAR, the first covariate in FML) for each tau:
boot.Bs <- sapply(q.Summary, function (x) x[["B"]][,2])
B0 <- coef(summary(lm(FML, data)))[2,1] # extract the linear (lm) estimate for VAR
Then compute the Wald statistic and get the p-value, using the number of quantiles as the degrees of freedom:
Wald <- sum(apply(boot.Bs, 2, function (x) ((mean(x)-B0)^2)/var(x)))
Pvalue <- pchisq(Wald, ncol(boot.Bs), lower=FALSE)
You also want to verify that the bootstrapped betas are normally distributed. If you're running many taus it can be cumbersome to check all those QQ plots, so just sum them by row:
qqnorm(apply(boot.Bs, 1, sum))
qqline(apply(boot.Bs, 1, sum), col = 2)
This seems to be working; if anyone can think of anything wrong with my solution, please share.
I've performed multiple regression (specifically quantile regression with multiple predictors, using quantreg in R). I have estimated the standard errors and confidence intervals by bootstrapping the estimates. Now I want to test whether the estimates at different quantiles differ significantly from one another (a Wald test would be preferable). How can I do this?
FML <- as.formula(outcome ~ VAR + c1 + c2 + c3)
quantiles <- c(0.25, 0.5, 0.75)
Q.mod <- rq(FML, tau=quantiles, data, method="fn", na.action=na.omit)
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb",
covariance=TRUE)
From q.Summary I've extracted the bootstrapped estimates (i.e. a vector of 10000 bootstrapped B values).
Note: in reality I'm not especially interested in comparing the estimates for all my covariates (in FML); I'm primarily interested in comparing the estimates for VAR. What is the best way to proceed?
I consulted with a colleague, and we resolved that estimates from different taus could be compared using a Wald test as follows.
From the object produced by
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb", covariance=TRUE)
you extract the bootstrapped beta values for the variable of interest (in this case VAR, the first covariate in FML) for each tau:
boot.Bs <- sapply(q.Summary, function (x) x[["B"]][,2])
B0 <- coef(summary(lm(FML, data)))[2,1] # extract the linear (lm) estimate for VAR
Then compute the Wald statistic and get the p-value, using the number of quantiles as the degrees of freedom:
Wald <- sum(apply(boot.Bs, 2, function (x) ((mean(x)-B0)^2)/var(x)))
Pvalue <- pchisq(Wald, ncol(boot.Bs), lower=FALSE)
You also want to verify that the bootstrapped betas are normally distributed. If you're running many taus it can be cumbersome to check all those QQ plots, so just sum them by row:
qqnorm(apply(boot.Bs, 1, sum))
qqline(apply(boot.Bs, 1, sum), col = 2)
This seems to be working; if anyone can think of anything wrong with my solution, please share.
I have split the Boston dataset into training and test sets as below:
library(MASS)
smp_size <- floor(.7 * nrow(Boston))
set.seed(133)
train_boston <- sample(seq_len(nrow(Boston)), size = smp_size)
train_ind <- sample(seq_len(nrow(Boston)), size = smp_size)
train_boston <- Boston[train_ind, ]
test_boston <- Boston[-train_ind,]
nrow(train_boston)
# [1] 354
nrow(test_boston)
# [1] 152
Now I get the RSE using lm function as below:
train_boston.lm <- lm(lstat~medv, train_boston)
summary(train_boston.lm)
summary(train_boston.lm)$sigma
How can I calculate the residual standard error for the test data set? I can't use the lm function on the test data set. Is there any method to calculate the RSE on the test data set?
Here your residual standard error is the same as
summary(train_boston.lm)$sigma
# [1] 4.73988
sqrt(sum((fitted(train_boston.lm)-train_boston$lstat)^2)/
(nrow(train_boston)-2))
# [1] 4.73988
You lose two degrees of freedom because you are estimating two parameters, so your degrees of freedom is n - 2.
With your test data you're not really doing the same estimation, but if you want the same type of calculation, substituting the model's predictions for your new data in place of the fitted values from the original model, you can do:
sqrt(sum((predict(train_boston.lm, test_boston)-test_boston$lstat)^2)/
(nrow(test_boston)-2))
Although it may make more sense just to calculate the standard deviation of the predicted residuals:
sd(predict(train_boston.lm, test_boston)-test_boston$lstat)
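If you want the training and test quantities side by side, a small helper along these lines may be convenient (a sketch; rse is a hypothetical name, and p is the number of predictors, so the denominator is n - p - 1):
# Residual-standard-error-style estimate: sqrt(RSS / (n - p - 1))
rse <- function(actual, predicted, p = 1) {
  sqrt(sum((actual - predicted)^2) / (length(actual) - p - 1))
}
rse(train_boston$lstat, fitted(train_boston.lm))               # same as summary(train_boston.lm)$sigma
rse(test_boston$lstat, predict(train_boston.lm, test_boston))  # test-set analogue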