I have a data frame with response ratios for multiple locations, and each location is assigned to a group (region). I want to fit a regression for each group (region) that uses response ratio (RR) as the response, location as the unit of replication, and soil type as a predictor. I would like to use bootstrap resampling to generate confidence intervals around the coefficients for each soil type, but I am not sure how to do this.
# sample data
df <- data.frame(
  group = rep(c('region1', 'region2'), 100),
  subgroup = rep(c('location1', 'location2',
                   'location2', 'location1'), 25),
  predictor = rep(c('soil1', 'soil2', 'soil3', 'soil4'), 25),
  RR = rnorm(200)
)
Adding the script from @Rui below. I actually have a multiple regression, so I added an additional predictor. It is still unclear to me how to extract the coefficient CIs for both soil type and temperature.
library(boot)

bootfun <- function(data, i) {
  d <- data[i, ]
  fit <- lm(RR ~ soil_type + temperature, data = d)
  coef(fit)
}

set.seed(123)
df <- data.frame(
  group = rep(c('region1', 'region2'), 100),
  subgroup = rep(c('location1', 'location2',
                   'location2', 'location1'), 25),
  soil_type = rep(c('soil1', 'soil2', 'soil3', 'soil4'), 25),
  temperature = abs(rnorm(100, 2, 1.75)),
  RR = rnorm(200),
  stringsAsFactors = TRUE
)

R <- 1000
b_list <- by(df, df$group, \(X) {
  boot(X, bootfun, R, strata = X$subgroup)
})
b_list$region1
The boot() function in package boot has a strata argument. Split the data by group and apply a boot function, stratifying, for instance, by location.
library(boot)

bootfun <- function(data, i) {
  d <- data[i, ]
  fit <- lm(RR ~ predictor, data = d)
  coef(fit)
}

set.seed(2022)
df <- data.frame(
  group = rep(c('region1', 'region2'), 100),
  subgroup = rep(c('location1', 'location2',
                   'location2', 'location1'), 25),
  predictor = rep(c('soil1', 'soil2', 'soil3', 'soil4'), 25),
  RR = rnorm(200),
  stringsAsFactors = TRUE
)

R <- 1000
b_list <- by(df, df$group, \(X) {
  boot(X, bootfun, R, strata = X$subgroup)
})
b_list$region1
#>
#> STRATIFIED BOOTSTRAP
#>
#>
#> Call:
#> boot(data = X, statistic = bootfun, R = R, strata = X$subgroup)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* -0.2608885 0.000469295 0.1541482
#> t2* 0.3502007 -0.004239248 0.2083503
b_list$region2
#>
#> STRATIFIED BOOTSTRAP
#>
#>
#> Call:
#> boot(data = X, statistic = bootfun, R = R, strata = X$subgroup)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* -0.03727332 -0.0001557172 0.1422502
#> t2* 0.11987005 0.0016393125 0.1952310
lapply(b_list, boot.ci)
#> Warning in sqrt(tv[, 2L]): NaNs produced
#> Warning in sqrt(tv[, 2L]): NaNs produced
#> $region1
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> FUN(boot.out = X[[i]])
#>
#> Intervals :
#> Level Normal Basic Studentized
#> 95% (-0.5635, 0.0408 ) (-0.5611, 0.0545 ) (-0.8227, -0.0225 )
#>
#> Level Percentile BCa
#> 95% (-0.5762, 0.0393 ) (-0.5733, 0.0446 )
#> Calculations and Intervals on Original Scale
#>
#> $region2
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> FUN(boot.out = X[[i]])
#>
#> Intervals :
#> Level Normal Basic Studentized
#> 95% (-0.3159, 0.2417 ) (-0.3260, 0.2460 ) (-0.3493, 0.1757 )
#>
#> Level Percentile BCa
#> 95% (-0.3206, 0.2514 ) (-0.3321, 0.2352 )
#> Calculations and Intervals on Original Scale
Created on 2022-10-25 with reprex v2.0.2
Edit
To get the bootstrapped confidence intervals of each coefficient, the code below uses two nested loops. The outer loop runs over regions, following the original data partition. The inner loop runs over index, that is, over the columns of the matrix t returned by boot (see help("boot"), section Value). The indices are the column numbers in either of
b_list$region1$t
b_list$region2$t
each of them with 3 columns.
library(boot)

npars <- ncol(b_list$region1$t)
ci_list <- lapply(b_list, \(region) {
  ci <- lapply(seq.int(npars), \(index) {
    boot.ci(region, index = index, type = c("norm", "basic", "perc", "bca"))
  })
  setNames(ci, c("Intercept", "soil", "temperature"))
})
ci_list$region1$Intercept
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = region, type = c("norm", "basic", "perc",
#> "bca"), index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.2517, 0.6059 ) (-0.2423, 0.6043 )
#>
#> Level Percentile BCa
#> 95% (-0.2410, 0.6056 ) (-0.2414, 0.6048 )
#> Calculations and Intervals on Original Scale
ci_list$region2$temperature
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = region, type = c("norm", "basic", "perc",
#> "bca"), index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.2317, 0.0420 ) (-0.2416, 0.0404 )
#>
#> Level Percentile BCa
#> 95% (-0.2278, 0.0542 ) (-0.2265, 0.0570 )
#> Calculations and Intervals on Original Scale
Created on 2022-10-25 with reprex v2.0.2
Edit 2
As I said in a comment below, in the new data the soil type uniquely identifies the region/location pairs, as unique(df[1:3]) shows. It therefore becomes pointless to split by group and stratify within groups.
bootfun2 <- function(data, i) {
  d <- data[i, ]
  fit <- lm(RR ~ temperature + soil_type, data = d)
  coef(fit)
}
unique(df[1:3]) # soil type uniquely identifies region/location
#> group subgroup soil_type
#> 1 region1 location1 soil1
#> 2 region2 location2 soil2
#> 3 region1 location2 soil3
#> 4 region2 location1 soil4
fit <- lm(RR ~ temperature + soil_type, data = df)
coef(fit)
#> (Intercept) temperature soil_typesoil2 soil_typesoil3 soil_typesoil4
#> 0.25928498 -0.06352205 -0.17739104 -0.05243836 -0.20408527
set.seed(2022)
R <- 1000
b_3 <- boot(df, bootfun2, R)
b_3
#>
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#>
#>
#> Call:
#> boot(data = df, statistic = bootfun2, R = R)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* 0.25928498 0.005724634 0.18033509
#> t2* -0.06352205 -0.002910677 0.05161868
#> t3* -0.17739104 0.004932486 0.18665594
#> t4* -0.05243836 0.005796168 0.19602658
#> t5* -0.20408527 0.004914674 0.20355549
btype <- c("norm", "basic", "perc", "bca")
ci_list3 <- lapply(seq_len(ncol(b_3$t)), \(index) {
  boot.ci(b_3, type = btype, index = index)
})
names(ci_list3) <- names(coef(fit))
ci_list3
#> $`(Intercept)`
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = b_3, type = btype, index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.0999, 0.6070 ) (-0.0868, 0.6172 )
#>
#> Level Percentile BCa
#> 95% (-0.0986, 0.6054 ) (-0.0992, 0.6034 )
#> Calculations and Intervals on Original Scale
#>
#> $temperature
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = b_3, type = btype, index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.1618, 0.0406 ) (-0.1631, 0.0401 )
#>
#> Level Percentile BCa
#> 95% (-0.1672, 0.0360 ) (-0.1552, 0.0503 )
#> Calculations and Intervals on Original Scale
#>
#> $soil_typesoil2
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = b_3, type = btype, index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.5482, 0.1835 ) (-0.5541, 0.1955 )
#>
#> Level Percentile BCa
#> 95% (-0.5503, 0.1994 ) (-0.5542, 0.1927 )
#> Calculations and Intervals on Original Scale
#>
#> $soil_typesoil3
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = b_3, type = btype, index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.4424, 0.3260 ) (-0.4399, 0.3068 )
#>
#> Level Percentile BCa
#> 95% (-0.4117, 0.3350 ) (-0.4116, 0.3350 )
#> Calculations and Intervals on Original Scale
#>
#> $soil_typesoil4
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = b_3, type = btype, index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.6080, 0.1900 ) (-0.6116, 0.2127 )
#>
#> Level Percentile BCa
#> 95% (-0.6208, 0.2035 ) (-0.6284, 0.1801 )
#> Calculations and Intervals on Original Scale
Created on 2022-10-25 with reprex v2.0.2
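If you prefer all the intervals in a single table, the endpoints can be pulled out of each boot.ci object; in the percent component, the last two entries are the interval limits. A minimal sketch, reusing ci_list3 from above:
ci_df <- data.frame(
  term  = names(ci_list3),
  lower = sapply(ci_list3, \(ci) ci$percent[4]), # percentile lower bound
  upper = sapply(ci_list3, \(ci) ci$percent[5])  # percentile upper bound
)
ci_df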
I got the following error when running the stepcAIC function on a linear mixed-effects model (lmer):
Error in rep(no, length.out = len) :
  attempt to replicate an object of type 'language'
I don't understand what the error is saying. All the factors are declared as "factors" and all the other variables as "numeric". The storage.mode is "integer" for the factors and "double" for the other variables.
This is my model and the step function:
biom_FULLO <- lmer((above_bio) ~ MUM_germ_time + MUM_num_seed + MUM_av_seed_mass +
    MUM_num_inf + MUM_above_bio + MUM_total_bio + MUM_inf_size + MUM_root_bio +
    MUM_CV_seed_mass + MUM_CV_SEM_per_inflor + MUM_CV_inflor_size +
    MUM_seed_weight + germ_date_year + germ_time + flow_start_date +
    height_3m + height_flow + num_inf + seed_gen + Early +
    Late + seed_gen:Early + seed_gen:Late +
    Early:Late + seed_gen:Early:Late + (1 | ID_year), DT_gen_biom)
biom_step <- stepcAIC(biom_FULLO, direction = "backward",
                      trace = FALSE, data = DT_gen_biom)
Any ideas, anyone?
P.S.: traceback():
6: ifelse(wasGam, formula(modelInit$gam)[[2]], formula(modelInit)[[2]])
5: makeFormula(x, modelInit)
4: FUN(X[[i]], ...)
3: lapply(newSetup, function(x) makeFormula(x, modelInit))
2: calculateAllCAICs(newSetup = newSetup, modelInit = object, numCores = numCores,
data = data, calcNonOptimMod = calcNonOptimMod, nrmods = numberOfSavedModels,
...)
1: stepcAIC(biom_FULLO, direction = "backward", trace = FALSE, data = DT_gen_biom)
Just remove the parentheses around the response variable on the LHS of your formula:
library(cAIC4)
#> Loading required package: lme4
#> Loading required package: Matrix
#> Loading required package: stats4
#> Loading required package: nlme
#>
#> Attaching package: 'nlme'
#> The following object is masked from 'package:lme4':
#>
#> lmList
library(lme4)
m1 <- lmer((Sepal.Length) ~ Sepal.Width + (1|Species), data = iris)
stepcAIC(m1)
#> Warning in nobars(formula(modelInit)) == formula(modelInit)[[2]]: longer object
#> length is not a multiple of shorter object length
#> Error in rep(no, length.out = len): attempt to replicate an object of type 'language'
m2 <- lmer(Sepal.Length ~ Sepal.Width + (1|Species), data = iris)
stepcAIC(m2)
#> $finalModel
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: Sepal.Length ~ Sepal.Width + (1 | Species)
#> Data: iris
#> REML criterion at convergence: 194.6361
#> Random effects:
#> Groups Name Std.Dev.
#> Species (Intercept) 1.010
#> Residual 0.438
#> Number of obs: 150, groups: Species, 3
#> Fixed Effects:
#> (Intercept) Sepal.Width
#> 3.4062 0.7972
#>
#> $additionalModels
#> NULL
#>
#> $bestCAIC
#> [1] 184.0044
Created on 2022-02-11 by the reprex package (v2.0.0)
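If a transformed response is actually needed (a log, say), a workaround sketch is to precompute the transformation so the LHS of the formula stays a bare variable name (shown on iris; log_SL is a made-up column name):
library(cAIC4)
library(lme4)
# precompute the transformation instead of writing it in the formula
iris2 <- transform(iris, log_SL = log(Sepal.Length))
m3 <- lmer(log_SL ~ Sepal.Width + (1 | Species), data = iris2)
stepcAIC(m3)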
I am using lm_robust from package 'estimatr' for a fixed-effects model including HC3 robust standard errors. I had to switch from vcovHC() because my data sample was just too large for it to handle.
I am using the following line for the regression:
lm_robust(log(SPREAD) ~ PERIOD, data = dat, fixed_effects = ~ STOCKS + TIME, se_type = "HC3")
The code runs fine, and the coefficients are the same as with fixed effects from package plm. Since I cannot use coeftest to estimate HC3 standard errors with the plm output due to the large sample, I compared the HC3 estimator of lm_robust with the HC1 estimator from coeftest(model, vcov = vcovHC(model, type = "HC1")).
As a result, the HC3 standard error from lm_robust is much smaller than the HC1 from coeftest.
Does somebody have an explanation, given that HC3 should be more conservative than HC1? I appreciate any recommendations and solutions.
EDIT: model used for coeftest:
plm(log(SPREAD) ~ PERIOD, data = dat, index = c("STOCKS", "TIME"), effect = "twoway", method = "within")
It appears that the vcovHC() method for plm automatically estimates cluster-robust standard errors, while lm_robust() does not. Therefore, the HC1 estimate of the standard error from plm will appear inflated compared to lm_robust (or lm, for that matter).
Using some toy data:
library(sandwich)
library(tidyverse)
library(plm)
library(estimatr)
library(lmtest)
set.seed(1981)
x <- sin(1:1000)
y <- 1 + x + rnorm(1000)
f <- as.character(sort(rep(sample(1:100), 10)))
t <- as.character(rep(sort(sample(1:10)), 100))
dat <- tibble(y = y, x = x, f = f, t = t)
lm_fit <- lm(y ~ x + f + t, data = dat)
plm_fit <- plm(y ~ x, index = c("f", "t"), model = "within", effect = "twoways", data = dat)
rb_fit <- lm_robust(y ~ x, fixed_effects = ~ f + t, data = dat, se_type = "HC1", return_vcov = TRUE)
sqrt(vcovHC(lm_fit, type = "HC1")[2, 2])
#> [1] 0.04752337
sqrt(vcovHC(plm_fit, type = "HC1"))
#> x
#> x 0.05036414
#> attr(,"cluster")
#> [1] "group"
sqrt(rb_fit$vcov)
#> x
#> x 0.04752337
rb_fit <- lm_robust(y ~ x, fixed_effects = ~ f + t, data = dat, se_type = "HC3", return_vcov = TRUE)
sqrt(vcovHC(lm_fit, type = "HC3")[2, 2])
#> [1] 0.05041177
sqrt(vcovHC(plm_fit, type = "HC3"))
#> x
#> x 0.05042142
#> attr(,"cluster")
#> [1] "group"
sqrt(rb_fit$vcov)
#> x
#> x 0.05041177
There does not appear to be equivalent cluster-robust standard error types in the two packages. However, the SEs get closer when specifying cluster-robust SEs in lm_robust():
rb_fit <- lm_robust(y ~ x, fixed_effects = ~ f + t, clusters = f, data = dat, se_type = "CR0")
summary(rb_fit)
#>
#> Call:
#> lm_robust(formula = y ~ x, data = dat, clusters = f, fixed_effects = ~f +
#> t, se_type = "CR0")
#>
#> Standard error type: CR0
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
#> x 0.925 0.05034 18.38 1.133e-33 0.8251 1.025 99
#>
#> Multiple R-squared: 0.3664 , Adjusted R-squared: 0.2888
#> Multiple R-squared (proj. model): 0.3101 , Adjusted R-squared (proj. model): 0.2256
#> F-statistic (proj. model): 337.7 on 1 and 99 DF, p-value: < 2.2e-16
coeftest(plm_fit, vcov. = vcovHC(plm_fit, type = "HC1"))
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> x 0.925009 0.050364 18.366 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Created on 2020-04-16 by the reprex package (v0.3.0)
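As an aside (a sketch with the same toy data, output not shown): estimatr also offers a Stata-style clustered estimator, se_type = "stata" (CR1), which may line up even more closely with your plm workflow.
rb_fit_cr1 <- lm_robust(y ~ x, fixed_effects = ~ f + t, clusters = f,
                        data = dat, se_type = "stata")
summary(rb_fit_cr1)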
If I run lm with a formula like Y ~ X1 + X2:X1 + X3:X1, where X1 is continuous and X2 and X3 are categorical, I get a contrast for both levels of X2 but not of X3.
The pattern is that the first categorical interaction gets both levels but the second does not.
library(tidyverse)
library(magrittr)
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
#> The following object is masked from 'package:tidyr':
#>
#> extract
df = data.frame(Frivolousness = sample(1:100, 50, replace =T))
df %<>% mutate(
Personality=sample(c("Bad", "Good"), 50, replace = T),
Timing=ifelse(Frivolousness %% 2 == 0 & runif(50) > 0.2, "Early", "Late")
)
df %<>% mutate(
Enchantedness = 11 +
ifelse(Personality=="Good", 0.23, -0.052)*Frivolousness -
1.3*ifelse(Personality=="Good", 1, 0) +
10*rnorm(50)
)
df %<>% mutate(
Personality = factor(Personality, levels=c("Bad", "Good")),
Timing = factor(Timing, levels=c("Early", "Late"))
)
lm(Enchantedness ~ Personality + Timing + Timing:Frivolousness + Personality:Frivolousness, df)
#>
#> Call:
#> lm(formula = Enchantedness ~ Personality + Timing + Timing:Frivolousness +
#> Personality:Frivolousness, data = df)
#>
#> Coefficients:
#> (Intercept) PersonalityGood
#> 15.64118 -10.99518
#> TimingLate TimingEarly:Frivolousness
#> -1.41757 -0.05796
#> TimingLate:Frivolousness PersonalityGood:Frivolousness
#> -0.07433 0.33410
lm(Enchantedness ~ Personality + Timing + Personality:Frivolousness+ Timing:Frivolousness , df)
#>
#> Call:
#> lm(formula = Enchantedness ~ Personality + Timing + Personality:Frivolousness +
#> Timing:Frivolousness, data = df)
#>
#> Coefficients:
#> (Intercept) PersonalityGood
#> 15.64118 -10.99518
#> TimingLate PersonalityBad:Frivolousness
#> -1.41757 -0.05796
#> PersonalityGood:Frivolousness TimingLate:Frivolousness
#> 0.27614 -0.01636
Created on 2020-02-15 by the reprex package (v0.3.0)
I think the reason it is dropped is that there would be perfect collinearity if it were included. You really should have Frivolousness as a regressor on its own as well. Then you will see that R provides the result for just one level of each interaction.
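To see the collinearity concretely, a quick sketch using df from the question: the two Timing interaction columns of the model matrix partition Frivolousness between the levels, so together they reproduce Frivolousness exactly.
# columns are TimingEarly:Frivolousness and TimingLate:Frivolousness
mm <- model.matrix(~ Timing:Frivolousness - 1, df)
all.equal(unname(rowSums(mm)), as.numeric(df$Frivolousness)) # TRUE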
You get this kind of weird behavior because you are missing the main effect term, Frivolousness. If you do:
set.seed(111)
## run your data frame stuff
lm(Enchantedness ~ Personality + Timing + Timing:Frivolousness + Personality:Frivolousness, df)
Coefficients:
(Intercept) PersonalityGood
-1.74223 5.31189
TimingLate TimingEarly:Frivolousness
12.47243 0.19090
TimingLate:Frivolousness PersonalityGood:Frivolousness
-0.09496 0.17383
lm(Enchantedness ~ Personality + Timing + Frivolousness + Timing:Frivolousness + Personality:Frivolousness, df)
Coefficients:
(Intercept) PersonalityGood
-1.7422 5.3119
TimingLate Frivolousness
12.4724 0.1909
TimingLate:Frivolousness PersonalityGood:Frivolousness
-0.2859 0.1738
In your model, the interaction term TimingLate:Frivolousness means the change in the slope of Frivolousness when Timing is Late. Since the main effect is not estimated, R has to estimate the slope for TimingEarly (the reference level) instead; hence the coefficients for TimingEarly:Frivolousness and Frivolousness are the same.
As you can see, the TimingLate:Frivolousness coefficients are very different between the two models. In your case I don't think it makes sense to include only the interaction terms without the main effect, because that is hard to interpret or model.
You can roughly check the slope within each Timing group; the model with all the terms gives a good estimate:
library(broom) # for tidy()
df %>% group_by(Timing) %>% do(tidy(lm(Enchantedness ~ Frivolousness, data = .)))
# A tibble: 4 x 6
# Groups: Timing [2]
Timing term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 Early (Intercept) 6.13 6.29 0.975 0.341
2 Early Frivolousness 0.208 0.0932 2.23 0.0366
3 Late (Intercept) 11.5 5.35 2.14 0.0419
4 Late Frivolousness -0.00944 0.107 -0.0882 0.930
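A more direct route (a sketch, not from the original answer): emtrends() from the emmeans package recovers the per-group Frivolousness slopes from the full model, with appropriate standard errors.
library(emmeans)
fit_full <- lm(Enchantedness ~ Personality + Timing + Frivolousness +
                 Timing:Frivolousness + Personality:Frivolousness, df)
emtrends(fit_full, ~ Timing, var = "Frivolousness") # one slope per Timing level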
I got an exercise where I need to train a linear regression model and get some information about the model:
linear relationship between my chosen variable and the other variables
which variables are important for the model
significance
It's easy to create a model with the lm function, so that I can interpret it with
summary(mod).
mod <- lm(height ~ ., data = cars)
The summary() method returns everything: R-squared, coefficients, p-values, significance ...
But when I'm training my model like:
library(mlr)
lrn = makeLearner("regr.ksvm")
mod = train(learner = lrn, task = task)
pred = predict(object = mod, newdata = test)
performance(pred = pred, measures = list(mse, arsq))
I'm just getting the MSE and R-squared. How do I get the other information, like significance and important variables?
Is there a chance to get access to this from mod?
Thanks for the help
library(mlr)
#> Loading required package: ParamHelpers
#> 'mlr' is in maintenance mode since July 2019. Future development
#> efforts will go into its successor 'mlr3' (<https://mlr3.mlr-org.com>).
lrn = makeLearner("regr.lm")
mod = train(learner = lrn, task = bh.task)
getLearnerModel(mod)
#>
#> Call:
#> stats::lm(formula = f, data = d)
#>
#> Coefficients:
#> (Intercept) crim zn indus chas1 nox
#> 3.646e+01 -1.080e-01 4.642e-02 2.056e-02 2.687e+00 -1.777e+01
#> rm age dis rad tax ptratio
#> 3.810e+00 6.922e-04 -1.476e+00 3.060e-01 -1.233e-02 -9.527e-01
#> b lstat
#> 9.312e-03 -5.248e-01
Created on 2020-01-15 by the reprex package (v0.3.0.9001)
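Since the extracted model is an ordinary lm object, everything summary() reports (coefficients, standard errors, p-values, R-squared) is available; note this only makes sense for a parametric learner like regr.lm, not for regr.ksvm. A sketch:
fit <- getLearnerModel(mod)
summary(fit) # coefficient table with p-values, R-squared, F-statistic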