Using coeftest results in predict.lm() - r

I am analyzing a dataset in which the variance of the regression error term is not constant across observations. To address this, I re-estimated the model with heteroskedasticity-robust (Huber-White) standard errors using the coeftest function. Now I want to use these robust results for prediction with the predict() function.
The dataset looks like the following but with multiple X:
set.seed(123)
x <- rep(c(10, 15, 20, 25), each = 25)
e <- c()
e[1:25] <- rnorm(25, sd = 10)
e[26:50] <- rnorm(25, sd = 15)
e[51:75] <- rnorm(25, sd = 20)
e[76:100] <- rnorm(25, sd = 25)
y <- 720 - 3.3 * x + e
model <- lm(y ~ x)
library(lmtest)
library(sandwich)
coeftest(model, vcov=vcovHC(model, "HC1"))
I found the following solution for the issue on the internet:
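predict.rob <- function(x, vcov, newdata) {
  # default to the data the model was fitted on
  if (missing(newdata)) { newdata <- x$model }
  # build the design matrix from the model terms, without the response
  tt <- terms(x)
  Terms <- delete.response(tt)
  m.mat <- model.matrix(Terms, data = newdata)
  m.coef <- coef(x)
  fit <- as.vector(m.mat %*% m.coef)
  # standard errors of the fitted values under the supplied covariance matrix
  se.fit <- sqrt(diag(m.mat %*% vcov %*% t(m.mat)))
  return(list(fit = fit, se.fit = se.fit))
}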
The remaining problem is that my regression has more than one regressor.
Is there any way to adapt this solution to multiple (7) explanatory variables?
Thanks in advance!

I'm not sure, but the coeftest function only performs a test, so you can't use its result directly for your prediction. Maybe you can, in some way, supply the covariance matrix from vcovHC(model, "HC1") to the prediction instead. I hope this helps a bit.
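As a hedged sketch of that idea: the helper from the question builds its design matrix with model.matrix(), so it should already handle any number of regressors. The example below uses a hypothetical variant named predict_rob and made-up data with three regressors (x1, x2, x3); only the robust covariance matrix from vcovHC() is passed in, not the coeftest output.

```r
library(sandwich)

# Illustrative data with three regressors and heteroskedastic errors (assumed, not from the question)
set.seed(123)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * d$x3 + rnorm(n, sd = 1 + abs(d$x1))
model <- lm(y ~ x1 + x2 + x3, data = d)

# Hypothetical helper: fitted values plus standard errors under a user-supplied covariance matrix
predict_rob <- function(object, vcov, newdata) {
  if (missing(newdata)) newdata <- object$model
  X <- model.matrix(delete.response(terms(object)), data = newdata)
  fit <- as.vector(X %*% coef(object))
  # row-wise x' V x, i.e. the diagonal of X V X' without forming the full matrix
  se.fit <- unname(sqrt(rowSums((X %*% vcov) * X)))
  list(fit = fit, se.fit = se.fit)
}

new <- data.frame(x1 = c(0, 1), x2 = c(0, 1), x3 = c(0, 1))
p <- predict_rob(model, vcovHC(model, type = "HC1"), new)
p
```

With vcov(model) in place of the robust matrix, the standard errors should match those of predict(model, new, se.fit = TRUE), which is a quick way to check the helper.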

Related

CausalImpact package in R doesn't work for Poisson bsts model

I'd like to use the CausalImpact package in R to estimate the impact of an intervention on infectious disease case counts. We typically characterize the distributions of case counts as either Poisson or negative binomial. The bsts() function allows us to specify the Poisson family. However, this produces an error in CausalImpact():
library(bsts)
library(CausalImpact)
set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- rpois(100, 1.2 * x1)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x1)
pre.period <- c(1, 70)
post.period <- c(71, 100)
post.period.response <- y[post.period[1] : post.period[2]]
y[post.period[1] : post.period[2]] <- NA
ss <- AddLocalLevel(list(), y)
bsts.model <- bsts(y ~ x1, ss, family="poisson", niter = 1000)
impact <- CausalImpact(bsts.model = bsts.model,
post.period.response = post.period.response)
Error in rnorm(prod(dim(state.samples)), 0, sigma.obs) : invalid arguments
This is due to the fact that bsts.model has no sigma.obs slot when generated using family="poisson".
Am I doing this correctly or is there another way to use CausalImpact with Poisson data? (I'd also love to be able to use negative binomial data, but I won't get too greedy).
Last, is this the best place for coding issues for CausalImpact? I didn't see an Issues tab on the GitHub page.

Non linear regression with nls2 package

I work on agglomeration effects and I want to run a non-linear regression with the nls2 package.
I am trying to run this model in R:
dens=runif(100)
surf=rnorm(100, 10, 2)
zone=seq(1,100,1)
donnees<-data.frame(dens,surf,zone)
attach(donnees)
donnees$salaire<-rnorm(100, 1000,3)
mp<-rep(0,100)
MP<-rep(0,100)
MPfonc<-function(alpha){
  for (i in 1:100){
    for (j in 1:100){
      if(j!=i){
        mp[j]<- dens[j]/(surf[i]-surf[j])^alpha
      }
    }
    MP[i]<-sum(mp)
  }
  return(MP)
}
fo <- salaire ~ const+ gamma1*dens+gamma2*surf+gamma3*MPfonc(alpha)
gstart <- data.frame(const = c(-100, 100), gamma1 = c(-10, 10),
gamma2 = c(-10, 10),gamma3 = c(-10, 10), alpha=c(-10, 10))
fm <- nls2(fo, start = gstart, alg = "plinear-random")
It does not run, and I think the problem lies with alpha.
Can the nls2 function accept a function (MP(alpha)) as an input?
Here is the specification of my model:
The problems are:
set.seed should be set to make the code reproducible
salaire is not defined -- it is defined in the data frame donnees but donnees is never used after that.
the elements summed in the sum call in MPfonc include NaN or NA elements so the sum becomes similarly undefined
For the plinear algorithms the RHS of the formula must evaluate to a matrix with one column per linear parameter; each column is the term that the corresponding linear parameter multiplies.
For the plinear algorithms provide starting values only for the non-linear parameters (i.e. only for alpha).
the nls2 package is never loaded. A library statement is needed.
code posted to SO should be indented 4 spaces to cause it to format properly (this was addressed in an edit to the question)
the mathematical formulas in the question are not clear in how they relate to the problem and are missing significant elements, e.g. alpha. This needs to be cleaned up. We have assumed that MPfonc gives the desired result and just simplified it.
The following corrects all the points and adds some minor improvements.
library(nls2)
set.seed(123) # for reproducibility
dens <- runif(100)
surf <- rnorm(100, 10, 2)
zone <- seq(1, 100, 1)
salaire <- rnorm(100, 1000, 3)
MPfonc <- function(alpha) {
sapply(1:100, function(i) sum( (dens / (surf[i] - surf) ^ alpha)[-i], na.rm = TRUE ))
}
fo <- salaire ~ cbind(1, dens, surf, MPfonc(alpha))
gstart <- data.frame(alpha = c(-10, 10))
fm <- nls2(fo, start = gstart, alg = "plinear-random")
giving:
> fm
Nonlinear regression model
model: salaire ~ cbind(1, dens, surf, MPfonc(alpha))
data: parent.frame()
     alpha      .lin1  .lin.dens  .lin.surf      .lin4
   0.90477 1001.20905   -0.50642   -0.12269    0.00681
residual sum-of-squares: 757.6
Number of iterations to convergence: 50
Achieved convergence tolerance: NA
Note: Now that we have the starting value we can use it with nls like this:
nls(fo, start = coef(fm)["alpha"], alg = "plinear")
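The plinear convention described in the list above (matrix-valued right-hand side, starting values only for the non-linear parameters) can be seen in a smaller setting. The following is a minimal sketch using base R's nls on made-up exponential-decay data; the variable names and the true parameter values are assumptions chosen for illustration.

```r
# Simulated data: y = a + b * exp(-k * x) + noise, with a and b linear and k non-linear
set.seed(1)
x <- seq(0.1, 5, length.out = 60)
y <- 2 + 3 * exp(-0.7 * x) + rnorm(60, sd = 0.05)

# RHS is a matrix with one column per linear parameter (intercept and decay term);
# a starting value is given only for the non-linear parameter k
fit <- nls(y ~ cbind(1, exp(-k * x)), start = list(k = 0.5), algorithm = "plinear")

# Coefficients: the non-linear parameter first, then the linear ones with a .lin prefix
coef(fit)
```

The linear coefficients are solved for internally at each step, which is why no starting values are supplied for them.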
Update: Some code improvements, corrections and clarifications.

Extracting model reliability from multiple GAMs applied across a dataframe

I have been able to apply a generalized additive model (GAM) iteratively across a dataframe, where sp_a is the response variable:
sp_a <- rnorm (100, mean = 3, sd = 0.9)
var_env_1 <- rnorm (100, mean = 1, sd = 0.3)
var_env_2 <- rnorm (100, mean = 5, sd = 1.6)
var_env_3 <- rnorm (100, mean = 10, sd = 1.2)
data <- data.frame (sp_a, var_env_1, var_env_2,var_env_3)
library(mgcv)
Gam <- lapply(data[,-1], function(x) summary(gam(data$sp_a ~ s(x))))
This fits a GAM between the response variable and each explanatory variable in turn. However, how would I then extract the p-values (s.pv) from each model? Does anybody know how to do this? Also, it would be great to rank the models by their AIC score, like this:
Gam1 <- gam(sp_a ~ s(var_env_1))
Gam2 <- gam(sp_a ~ s(var_env_2))
Gam3 <- gam(sp_a ~ s(var_env_3))
AIC(Gam1,Gam2,Gam3)
But selecting this from the original 'Gam' output instead. Thank you for any help in advance.
In the end, it turned out that I had to remove the summary() call, which then allowed me to calculate the AIC score for all models. Other useful approaches can be found in the question "Using lapply on a list of models"; those functions work for different kinds of models (e.g. lm, glm).
Gam <- lapply(data[,-1], function(x) gam(data$sp_a ~ s(x)))
sapply(X = Gam, FUN = AIC)
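Keeping the fitted models rather than their summaries also leaves the p-values reachable, since summary() can still be called on each model afterwards. The sketch below regenerates the question's data with an added seed (an assumption, for reproducibility) and collects the smooth-term p-value (s.pv) and the AIC into one table ranked by AIC; the names models and res are illustrative.

```r
library(mgcv)

set.seed(42)  # seed added here only so the sketch is reproducible
data <- data.frame(
  sp_a      = rnorm(100, mean = 3, sd = 0.9),
  var_env_1 = rnorm(100, mean = 1, sd = 0.3),
  var_env_2 = rnorm(100, mean = 5, sd = 1.6),
  var_env_3 = rnorm(100, mean = 10, sd = 1.2)
)

# Keep the fitted models themselves, one per explanatory variable
models <- lapply(data[, -1], function(x) gam(data$sp_a ~ s(x)))

# One row per model: approximate p-value of the smooth term and the AIC
res <- data.frame(
  variable = names(models),
  p_value  = sapply(models, function(m) summary(m)$s.pv),
  AIC      = sapply(models, AIC)
)

# Rank from lowest (best) to highest AIC
res <- res[order(res$AIC), ]
res
```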

Why are the parametric bootstrap bias and standard error zero here?

I'm performing parametric bootstrapping in R for a simple problem and getting Bias and Standard Error zero always. What am I doing wrong?
set.seed(12345)
df <- rnorm(n=10, mean = 0, sd = 1)
Boot.fun <-
  function(data) {
    m1 <- mean(data)
    return(m1)
  }
Boot.fun(data = df)
library(boot)
out <- boot(df, Boot.fun, R = 20, sim = "parametric")
out
PARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = Boot.fun, R = 20, sim = "parametric")
Bootstrap Statistics :
      original  bias    std. error
t1* -0.1329441       0           0
You need to add a line of code to do the sampling, i.e.
Boot.fun <-
  function(data) {
    data <- sample(data, replace = TRUE)
    m1 <- ...
This is because you didn't supply a function to the ran.gen argument to generate random values. This is discussed in the documentation (?boot). If sim = "parametric" and you don't supply a generating function, the original data is passed to statistic unchanged, so you need to sample inside that function. Since every replicate was computed on the same data, the bias and standard error are zero.
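The alternative mentioned above, supplying a generating function, might look like the following sketch. It assumes a normal model for the data; the generator name gen and the contents of the mle list are choices made for illustration, while ran.gen and mle are the actual boot() arguments.

```r
library(boot)

set.seed(12345)
df <- rnorm(n = 10, mean = 0, sd = 1)

# Statistic: for sim = "parametric" it receives a simulated data set
Boot.fun <- function(data) mean(data)

# Generator: draws a new sample from a normal model with the estimated parameters
gen <- function(data, mle) {
  rnorm(length(data), mean = mle$mean, sd = mle$sd)
}

out <- boot(df, statistic = Boot.fun, R = 200, sim = "parametric",
            ran.gen = gen, mle = list(mean = mean(df), sd = sd(df)))
out
```

Because each replicate is now computed on freshly simulated data, the reported standard error is no longer zero, and the original statistic is still the mean of the observed data.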

How to simulate quantities of interest using arm or rstanarm packages in R?

I would like to know how to simulate quantities of interest out of a regression model estimated using either the arm or the rstanarm packages in R. I am a newbie in Bayesian methods and R and have been using the Zelig package for some time. I asked a similar question before, but I would like to know if it is possible to simulate those quantities using the posterior distribution estimated by those packages.
In Zelig you can set the values you want for the independent values and it calculates the results for the outcome variable (expected value, probability, etc). An example:
# Creating a dataset:
set.seed(10)
x <- rnorm(100,20,10)
z <- rnorm(100,10,5)
e <- rnorm(100,0,1)
y <- 2*x+3*z+e
df <- data.frame(x,z,e,y)
# Loading Zelig
require(Zelig)
# Model
m1.zelig <- zelig(y ~ x + z, model="ls", data=df)
summary(m1.zelig)
# Simulating z = 10
s1 <- setx(m1.zelig, z = 10)
simulation <- sim(m1.zelig, x = s1)
summary(simulation)
So Zelig keeps x at its mean (20.56), and simulates the quantity of interest with z = 10. In this case, y is approximately 71.
The same model using arm:
# Model
require(arm)
m1.arm <- bayesglm(y ~ x + z, data=df)
summary(m1.arm)
And using rstanarm:
# Model
require(rstanarm)
m1.stan <- stanlm(y ~ x + z, data=df)
print(m1.stan)
Is there any way to simulate z = 10 with x equal to its mean, using the posterior distribution estimated by those two packages, and get the expected value of y? Thank you very much!
In the case of bayesglm, you could do
sims <- arm::sim(m1.arm, n = 1000)
y_sim <- rnorm(n = 1000, mean = sims@coef %*% t(as.matrix(s1)), sd = sims@sigma)
mean(y_sim)
For the (unreleased) rstanarm, it would be similar
sims <- as.matrix(m1.stan)
y_sim <- rnorm(n = nrow(sims), mean = sims[,1:(ncol(sims)-1)] %*% t(as.matrix(s1)),
sd = sims[,ncol(sims)])
mean(y_sim)
In general for Stan, you could pass s1 as a row_vector and utilize it in a generated quantities block of a .stan file like
generated quantities {
  real y_sim;
  y_sim = normal_rng(s1 * beta, sigma);  // older Stan versions wrote this assignment with <-
}
in which case the posterior distribution of y_sim would appear when you print the posterior summary.
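The bayesglm recipe above can be illustrated without arm or rstanarm. The sketch below is an assumption-laden stand-in: it fits the question's model with lm, draws coefficients and sigma using the standard noninformative-prior simulation for a linear model (MASS::mvrnorm replaces arm::sim), and builds the covariate vector x_new by hand (intercept, x at its mean, z = 10) instead of reusing the Zelig object s1.

```r
library(MASS)  # for mvrnorm

# Data as in the question
set.seed(10)
x <- rnorm(100, 20, 10)
z <- rnorm(100, 10, 5)
e <- rnorm(100, 0, 1)
y <- 2 * x + 3 * z + e
df <- data.frame(x, z, y)

fit <- lm(y ~ x + z, data = df)

n_sims <- 1000
V <- summary(fit)$cov.unscaled   # (X'X)^{-1}
s <- summary(fit)$sigma          # residual standard error
nu <- df.residual(fit)

# Draw sigma, then coefficients conditional on each sigma draw
sigma_sim <- s * sqrt(nu / rchisq(n_sims, nu))
beta_sim <- t(sapply(sigma_sim, function(sg) mvrnorm(1, coef(fit), sg^2 * V)))

# Scenario of interest: intercept, x at its mean, z = 10
x_new <- c(1, mean(df$x), 10)

ev_sim <- as.vector(beta_sim %*% x_new)                 # simulated expected values of y
y_sim <- rnorm(n_sims, mean = ev_sim, sd = sigma_sim)   # simulated predicted values of y

mean(ev_sim)
quantile(ev_sim, c(0.025, 0.975))
```

The mean of ev_sim should sit close to the point prediction from predict(fit, data.frame(x = mean(df$x), z = 10)), while y_sim additionally carries the residual noise.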
