I am writing a custom script to bootstrap standard errors in a GLM in R and receive the following error:
Error in eval(predvars, data, env) : numeric 'envir' arg not of length one
Can someone explain what I am doing wrong? My code:
#Number of simulations
sims<-numbersimsdesired
#Set up place to store data
saved.se<-matrix(NA,sims,numberofcolumnsdesired)
y<-matrix(NA,realdata.rownumber)
x1<-matrix(NA,realdata.rownumber)
x2<-matrix(NA,realdata.rownumber)
#Resample entire dataset with replacement
for (sim in 1:sims) {
fake.data<-sample(1:nrow(data5),nrow(data5),replace=TRUE)
#Define variables for GLM using fake data
y<-realdata$y[fake.data]
x1<-realdata$x1[fake.data]
x2<-realdata$x2[fake.data]
#Run GLM on fake data, extract SEs, save SE into matrix
glm.output<-glm(y ~ x1 + x2, family = "poisson", data = fake.data)
saved.se[sim,]<-summary(glm.output)$coefficients[0,2]
}
An example: if we suppose sims = 1000 and we want 10 columns (suppose instead of x1 and x2, we have x1...x10) the goal is a dataset with 1,000 rows and 10 columns containing each explanatory variable's SEs.
There isn't a reason to reinvent the wheel. Here is an example of bootstrapping the standard error of the intercept with the boot package:
set.seed(42)
counts <- c(18,17,15,20,10,20,25,13,12)
x1 <- 1:9
x2 <- sample(9)
DF <- data.frame(counts, x1, x2)
glm1 <- glm(counts ~ x1 + x2, family = poisson(), data=DF)
summary(glm1)$coef
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) 2.08416378 0.42561333 4.896848 9.738611e-07
#x1 0.04838210 0.04370521 1.107010 2.682897e-01
#x2 0.09418791 0.04446747 2.118131 3.416400e-02
library(boot)
intercept.se <- function(d, i) {
glm1.b <- glm(counts ~ x1 + x2, family = poisson(), data=d[i,])
summary(glm1.b)$coef[1,2]
}
set.seed(42)
boot.intercept.se <- boot(DF, intercept.se, R=999)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = DF, statistic = intercept.se, R = 999)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* 0.4256133 0.103114 0.2994377
Edit:
If you prefer doing it without a package:
n <- 999
set.seed(42)
ind <- matrix(sample(nrow(DF), nrow(DF)*n, replace=TRUE), nrow=n)
boot.values <- apply(ind, 1, function(...) {
i <- c(...)
intercept.se(DF, i)
})
sd(boot.values)
#[1] 0.2994377
Related
I want to run a regression model using creatFolds function and specific set.seed.
I can run the regression model without creatFolds function with no Error.
The friedman dataset can be found here
wdbc <- read_csv("friedman.dat", col_names = FALSE, skip = 10)
wdbc[] <- lapply(wdbc, scale)
The regression model without createFolds Function
lm(X6 ~ ., data = wdbc)
Call:
lm(formula = X6 ~ ., data = wdbc)
Coefficients:
(Intercept) X1 X2 X3 X4 X5
2.553e-16 3.892e-01 3.563e-01 2.834e-04 5.739e-01 2.767e-01
The regression model with createFolds Function
FX <- function(seed, data, k=30) {
set.seed(seed)
folds <- createFolds(data[, "X6"], k=k, list=TRUE, returnTrain=TRUE)
lm1 <- lapply(folds, function(folds) lm(X6 ~ ., data=data[folds, ]))
}
seeds <- 1:30
res <- lapply(seeds, FX, data=wdbc)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
The createFolds function is the problematic part:
createFolds(wdbc[, "X6"], k=1, list=TRUE, returnTrain=TRUE)
$Fold1
integer(0)
This gives you an empty vector, hence you are regressing blank. So there is a difference when you do the different calls below:
typeof(wdbc[["X6"]])
[1] "double"
typeof(wdbc[,"X6"])
[1] "list"
typeof(as.matrix(wdbc)[,"X6"])
[1] "double"
Maybe someone else more well versed in tibble etc can chip in.
You can either convert your data into a data.frame or use data[["X6"]]:
FX <- function(seed, data, k=30) {
set.seed(seed)
folds <- createFolds(data[["X6"]], k=k, list=TRUE, returnTrain=TRUE)
lm1 <- lapply(folds, function(folds) lm(X6 ~ ., data=data[folds, ]))
}
seeds <- 1:30
res <- lapply(seeds, FX, data=wdbc)
res[[1]][[1]]
Call:
lm(formula = X6 ~ ., data = data[folds, ])
Coefficients:
(Intercept) X1 X2 X3 X4 X5
0.0007255 0.3933912 0.3600078 -0.0008834 0.5733893 0.2729162
I am trying to create a function that passes a parameter in as the dependent variable with the independent variables staying the same.
I have tried to use {{}} but see the problem as something like the below if select contains was possible.
test_func <- function(dataframe, dependent){
model <- tidy(lm({{ dependent }} ~ . - select(contains("x")), data = dataframe))
return(model)
}
test_func(datasets::anscombe, x1)
The function should pass as function(dataframe, dependent) with a single model.
Use reformulate().
f <- function(d, y) lm(reformulate(names(d)[grep("x", names(d))], response=y), data=d)
f(datasets::anscombe, "y1")
# Call:
# lm(formula = reformulate(names(d)[grep("x", names(d))], response = y),
# data = d)
#
# Coefficients:
# (Intercept) x1 x2 x3 x4
# 4.33291 0.45073 NA NA -0.09873
This question already has answers here:
Linear Regression and storing results in data frame [duplicate]
(5 answers)
Closed 7 years ago.
I need to store lm fit object in a data frame for further processing (This is needed as I will have around 200+ regressions to be stored in the data frame). I am not able to store the fit object in the data frame. Following code produces the error message:
x = runif(100)
y = 2*x+runif(100)
fit = lm(y ~x)
df = data.frame()
df = rbind(df, c(id="xx1", fitObj=fit))
Error in rbind(deparse.level, ...) :
invalid list argument: all variables should have the same length
I would like to get the data frame as returned by "do" call of dplyr, example below:
> tacrSECOutput
Source: local data frame [24 x 5]
Groups: <by row>
sector control id1 fit count
1 Chemicals and Chemical Products S tSector <S3:lm> 2515
2 Construation and Real Estate S tSector <S3:lm> 985
Please note that this is a sample output only. I would like to create the data frame (fit column for the lm object) in the above format so that my rest of the code can work on the added models.
What am I doing wrong? Appreciate the help very much.
The list approach:
Clearly based on #Pascal 's idea. Not a fan of lists, but in some cases they are extremely helpful.
set.seed(42)
x <- runif(100)
y <- 2*x+runif(100)
fit1 <- lm(y ~x)
set.seed(123)
x <- runif(100)
y <- 2*x+runif(100)
fit2 <- lm(y ~x)
# manually select model names
model_names = c("fit1","fit2")
# create a list based on models names provided
list_models = lapply(model_names, get)
# set names
names(list_models) = model_names
# check the output
list_models
# $fit1
#
# Call:
# lm(formula = y ~ x)
#
# Coefficients:
# (Intercept) x
# 0.5368 1.9678
#
#
# $fit2
#
# Call:
# lm(formula = y ~ x)
#
# Coefficients:
# (Intercept) x
# 0.5545 1.9192
Given that you have lots of models in your work space, the only "manual" thing you have to do is provide a vector of your models names (how are they stored) and then using the get function you can obtain the actual model objects with those names and save them in a list.
Store model objects in a dataset when you create them:
The data frame can be created using dplyr and do if you are planning to store the model objects when they are created.
library(dplyr)
set.seed(42)
x1 = runif(100)
y1 = 2*x+runif(100)
set.seed(123)
x2 <- runif(100)
y2 <- 2*x+runif(100)
model_formulas = c("y1~x1", "y2~x2")
data.frame(model_formulas, stringsAsFactors = F) %>%
group_by(model_formulas) %>%
do(model = lm(.$model_formulas))
# model_formulas model
# (chr) (chr)
# 1 y1~x1 <S3:lm>
# 2 y2~x2 <S3:lm>
It REALLY depends on how "organised" is the process that allows you to built those 200+ models you mentioned. You can build your models this way if they depend on columns of a specific dataset. It will not work if you want to build models based on various columns of different datasets, maybe of different work spaces or different model types (linear/logistic regression).
Store existing model objects in a dataset:
Actually I think you can still use dplyr using the same philosophy as in the list approach. If the models are already built you can use their names like this
library(dplyr)
set.seed(42)
x <- runif(100)
y <- 2*x+runif(100)
fit1 <- lm(y ~x)
set.seed(123)
x <- runif(100)
y <- 2*x+runif(100)
fit2 <- lm(y ~x)
# manually select model names
model_names = c("fit1","fit2")
data.frame(model_names, stringsAsFactors = F) %>%
group_by(model_names) %>%
do(model = get(.$model_names))
# model_names model
# (chr) (chr)
# 1 fit1 <S3:lm>
# 2 fit2 <S3:lm>
This seems to work:
x = runif(100)
y = 2*x+runif(100)
fit = lm(y ~x)
df <- data.frame()
fitvec <- serialize(fit,NULL)
df <- rbind(df, data.frame(id="xx1", fitObj=fitvec))
fit1 <- unserialize( df$fitObj )
print(fit1)
yields:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
0.529 1.936
Update Okay, now more complex, so as to get one row per fit.
vdf <- data.frame()
fitlist <- list()
niter <- 5
for (i in 1:niter){
# Create a new model each time
a <- runif(1)
b <- runif(1)
n <- 50*runif(1) + 50
x <- runif(n)
y <- a*x + b + rnorm(n,0.1)
fit <- lm(x~y)
fitlist[[length(fitlist)+1]] <- serialize(fit,NULL)
}
vdf <- data.frame(id=1:niter)
vdf$fitlist <- fitlist
for (i in 1:niter){
print(unserialize(vdf$fitlist[[i]]))
}
yields:
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.45689 0.07766
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.44922 0.00658
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.41036 0.04522
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.40823 0.07189
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.40818 0.08141
When I run a cluster standard error panel specification with plm and lfe I get results that differ at the second significant figure. Does anyone know why they differ in their calculation of the SE's?
set.seed(572015)
library(lfe)
library(plm)
library(lmtest)
# clustering example
x <- c(sapply(sample(1:20), rep, times = 1000)) + rnorm(20*1000, sd = 1)
y <- 5 + 10*x + rnorm(20*1000, sd = 10) + c(sapply(rnorm(20, sd = 10), rep, times = 1000))
facX <- factor(sapply(1:20, rep, times = 1000))
mydata <- data.frame(y=y,x=x,facX=facX, state=rep(1:1000, 20))
model <- plm(y ~ x, data = mydata, index = c("facX", "state"), effect = "individual", model = "within")
plmTest <- coeftest(model,vcov=vcovHC(model,type = "HC1", cluster="group"))
lfeTest <- summary(felm(y ~ x | facX | 0 | facX))
data.frame(lfeClusterSE=lfeTest$coefficients[2],
plmClusterSE=plmTest[2])
lfeClusterSE plmClusterSE
1 0.06746538 0.06572588
The difference is in the degrees-of-freedom adjustment. This is the usual first guess when looking for differences in supposedly similar standard errors (see e.g., Different Robust Standard Errors of Logit Regression in Stata and R). Here, the problem can be illustrated when comparing the results from (1) plm+vcovHC, (2) felm, (3) lm+cluster.vcov (from package multiwayvcov).
First, I refit all models:
m1 <- plm(y ~ x, data = mydata, index = c("facX", "state"),
effect = "individual", model = "within")
m2 <- felm(y ~ x | facX | 0 | facX, data = mydata)
m3 <- lm(y ~ facX + x, data = mydata)
All lead to the same coefficient estimates. For m3 the fixed effects are explicitly reported while they are not for m1 and m2. Hence, for m3 only the last coefficient is extracted with tail(..., 1).
all.equal(coef(m1), coef(m2))
## [1] TRUE
all.equal(coef(m1), tail(coef(m3), 1))
## [1] TRUE
The non-robust standard errors also agree.
se <- function(object) tail(sqrt(diag(object)), 1)
se(vcov(m1))
## x
## 0.07002696
se(vcov(m2))
## x
## 0.07002696
se(vcov(m3))
## x
## 0.07002696
And when comparing the clustered standard errors we can now show that felm uses the degrees-of-freedom correction while plm does not:
se(vcovHC(m1))
## x
## 0.06572423
m2$cse
## x
## 0.06746538
se(cluster.vcov(m3, mydata$facX))
## x
## 0.06746538
se(cluster.vcov(m3, mydata$facX, df_correction = FALSE))
## x
## 0.06572423
I'm looking for suggestions on how to deal with NA's in linear regressions when all occurrences of an independent/explanatory variable are NA (i.e. x3 below).
I know the obvious solution would be to exclude the independent/explanatory variable in question from the model but I am looping through multiple regions and would prefer not to have a different functional forms for each region.
Below is some sample data:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
# Quick plot of data
library(ggplot2)
library(reshape2)
df.melt <-melt(df, id=c("time"))
p <- ggplot(df.melt, aes(x=time, y=value)) +
geom_line() + facet_grid(variable ~ .)
p
I have read the documentation for lm and tried various na.action settings without success:
lm(y~x1+x2+x3, data=df, singular.ok=TRUE)
lm(y~x1+x2+x3, data=df, na.action=na.omit)
lm(y~x1+x2+x3, data=df, na.action=na.exclude)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.omit)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.exclude)
Is there a way to get lm to run without error and simply return a coefficient for the explanatory reflective of the lack of explanatory power (i.e. either zero or NA) from the variable in question?
Here's one idea:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
replaceNA<-function(x){
if(all(is.na(x))){
rep(0,length(x))
} else x
}
lm(y~x1+x2+x3, data= data.frame(lapply(df,replaceNA)))
Call:
lm(formula = y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
Coefficients:
(Intercept) x1 x2 x3
0.05467 1.01133 -0.10613 NA
lm(y~x1+x2, data=df)
Call:
lm(formula = y ~ x1 + x2, data = df)
Coefficients:
(Intercept) x1 x2
0.05467 1.01133 -0.10613
So you replace the variables which contain only NA's with variable which contains only 0's. you get the coefficient value NA, but all the relevant parts of the model fits are same (expect qr decomposition, but if information about that is needed, it can be easily modified). Note that component summary(fit)$alias (see ?alias) might be useful.
This seems to relate your other question: Replace lm coefficients in [r]
You won't be able to include a column with all NA values. It does strange things to model.matrix
x1 <- 1:5
x2 <- rep(NA,5)
model.matrix(~x1+x2)
(Intercept) x1 x2TRUE
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$x2
[1] "contr.treatment"
So your alternative is to programatically create the model formula based on the data.
Something like...
make_formula <- function(variables, data, response = 'y'){
if(missing(data)){stop('data not specified')}
using <- Filter(variables,f= function(i) !all(is.na(data[[i]])))
deparse(reformulate(using, response))
}
variables <- c('x1','x2','x3')
make_formula(variables, data =df)
[1] "y ~ x1 + x2"
I've used deparse to return a character string so that there is no environment issues from creating the formula within the function. lm can happily take a character string which is a valid formula.