R: lme loop woes with "try" and lme confidence intervals

R users, I've still been hashing out bits and pieces related to my initial question, as seen here, and now I'm quite stuck.
http://stackoverflow.com/questions/12270578/skipping-over-an-error-warning-in-an-lme-loop-in-r
Here is the code, using mtcars as an example dataset. I want to save the lower and upper confidence intervals for every response variable listed below (but not for the intercept, though I still need the intercept's other coefficients from the lme; I've got those already), all in one go. My real dataset is very large, and I'm trying to automate this as much as possible.
library(nlme)
library(lattice)
responseVariables = c("mpg",
"disp",
"hp")
carModels <- list()
carModelNames <- list()
coint <- list()
coint2 <- list()
coint$fixed <- list()
lower <- list()
upper <- list()
carCIlower <- list()
carCIupper <- list()
for (i in responseVariables) {
  print("Doing: ")
  print(i)
  mtcars$tmp <- as.numeric(mtcars[, i])
  tmpLme <- lme(log10(tmp) ~ I(log10(wt)), random = ~1 | carb / gear / am,
                data = mtcars, na.action = na.omit)
  carModels <- append(carModels, list(tmpLme))
  carModelNames <- append(carModelNames, i)
  coint <- try(intervals(tmpLme))
  if (inherits(coint, "try-error")) {
    tmpLme <- lme(log10(tmp) ~ log10(wt), random = ~1 | carb / gear,
                  data = mtcars, na.action = na.omit)
    coint <- try(intervals(tmpLme))
  } else if (inherits(coint, "try-error")) {
    tmpLme <- lme(log10(tmp) ~ log10(wt), random = ~1 | carb / gear,
                  data = mtcars, na.action = na.omit, method = "ML")
    coint <- try(intervals(tmpLme))
  } else if (inherits(coint, "try-error")) {
    tmpLme <- lme(log10(tmp) ~ log10(wt), random = ~1 | carb,
                  data = mtcars, na.action = na.omit)
    coint <- try(intervals(tmpLme))
    #}
    coint2 <- append(coint, list(tmpLme))
    lower <- dim(coint2$fixed)[1]
    upper <- dim(coint2$fixed)[1]
    carCIlower <- append(carCIlower, coint2$fixed[2, 1], lower)
    carCIupper <- append(carCIupper, coint2$fixed[2, 3], upper)
    vs_wt <- cbind(carModelNames, carCIlower, carCIupper)
  }
}
Currently I can get the CI values if I run the commands manually for each response variable, but not as part of the loop: the loop runs, but it doesn't give me answers for coint2 and beyond, and if I rerun those lines afterwards I only get values for the last item in the loop (i = "hp"). Also, I see there's an unbalanced curly bracket (I've commented it out above to show you), but if I put it back in, I get the lower CI for "disp" as both the lower and upper CI for all response variables.
Can someone point out what's missing?
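For reference, here is a minimal sketch of the fallback pattern the loop seems to be aiming for (same formulas as above; the intent is my assumption): each simpler model should only be attempted if the previous intervals() call failed, so the checks need to be sequential if statements rather than else if branches hanging off the first test, and the CI extraction has to sit inside the loop after the last check.

carModelNames <- list(); carCIlower <- list(); carCIupper <- list()
for (i in responseVariables) {
  mtcars$tmp <- as.numeric(mtcars[, i])
  tmpLme <- lme(log10(tmp) ~ log10(wt), random = ~1 | carb / gear / am,
                data = mtcars, na.action = na.omit)
  coint <- try(intervals(tmpLme), silent = TRUE)
  if (inherits(coint, "try-error")) {   # fall back: drop am from the grouping
    tmpLme <- lme(log10(tmp) ~ log10(wt), random = ~1 | carb / gear,
                  data = mtcars, na.action = na.omit)
    coint <- try(intervals(tmpLme), silent = TRUE)
  }
  if (inherits(coint, "try-error")) {   # fall back again: carb only
    tmpLme <- lme(log10(tmp) ~ log10(wt), random = ~1 | carb,
                  data = mtcars, na.action = na.omit)
    coint <- try(intervals(tmpLme), silent = TRUE)
  }
  if (!inherits(coint, "try-error")) {
    # intervals()$fixed has columns lower / est. / upper; row 2 is the slope
    carCIlower <- append(carCIlower, coint$fixed[2, 1])
    carCIupper <- append(carCIupper, coint$fixed[2, 3])
    carModelNames <- append(carModelNames, i)
  }
}
vs_wt <- cbind(carModelNames, carCIlower, carCIupper)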

Related

Standard error for random effect predictions in lme4

I have an experiment with plants that have different growth habits (growth_type), genotypes nested within growth types (ge), and blocks also nested within growth types (block). The objective is to test the influence of growth type and genotype on plant performance. Here is a sample dataset and a reproducible example.
data1 <- read.csv(text = "
growth_type,ge,block,performance
dwarf,A,1,128.32
dwarf,A,2,97.01
dwarf,A,3,91.05
dwarf,B,1,108.51
dwarf,B,2,121.11
dwarf,B,3,84.15
dwarf,C,1,132.55
dwarf,C,2,129.45
dwarf,C,3,122.33
tall,D,1,79.68
tall,D,2,122.5
tall,D,3,143.42
tall,E,1,149.29
tall,E,2,162.13
tall,E,3,135.42
tall,F,1,90.45
tall,F,2,127.4
tall,F,3,78.99")
These are the libraries I used:
library(dplyr)
library(lme4)
library(lsmeans)
The first step was fitting a model:
model.fit <- lmer(performance ~ growth_type + (1 | block:growth_type) + (1 | ge:growth_type),
                  data = data1)
From this model, I can extract the fixed effect of growth type using lsmeans:
fixed.effect.estimates <- lsmeans::lsmeans(model.fit, "growth_type")
and this is the output: a table with the estimated marginal mean, standard error, df, and confidence limits for each growth_type (not reproduced here).
What I need to obtain is the same output for the random effect. I am able to get the prediction interval, but I cannot get the standard error. This is what I tried:
# RANDOM EFFECT ESTIMATES
data1$pred.performance <- predict(model.fit,
                                  newdata = data1,
                                  re.form = ~(1 | ge:growth_type))
pred.ge <- data1 %>%
  distinct(ge, growth_type, pred.performance)
This gives one predicted value per ge and growth_type combination (output not reproduced here). So far so good.
Then I used the bootMer function to build the prediction interval using bootstrapping.
mySumm <- function(.) {
  predict(., newdata = pred.ge, re.form = ~(1 | ge:growth_type))
}
#### Collapse bootstrap into median, 95% PI
sumBoot <- function(merBoot) {
  return(
    data.frame(fit = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = .5, na.rm = TRUE))),
               lwr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = .025, na.rm = TRUE))),
               upr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = .975, na.rm = TRUE))))
  )
}
## lme4::bootMer() method 1
PI.boot1.time <- system.time(
  boot1 <- lme4::bootMer(model.fit, mySumm, nsim = 250, use.u = TRUE, type = "parametric")
)
PI.boot1 <- sumBoot(boot1)
cbind(pred.ge, PI.boot1)
This gives, for each prediction, the bootstrap median plus a 95% interval (output not reproduced here).
In summary, my questions are:
How can I get standard errors for the random effect predictions, as I did for the fixed effect components?
Why are the random effect estimates from predict() different from those from lme4::bootMer()?
Sorry for the long explanation.
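On the first question, one option (a sketch, reusing the boot1 object already computed above) is to take the column-wise standard deviation of the bootstrap replicates stored in the merBoot$t matrix as a bootstrap standard error for each prediction:

# boot1$t is an nsim x n-predictions matrix of bootstrapped predictions;
# the SD across replicates is a bootstrap standard error for each one
pred.se <- apply(boot1$t, 2, sd, na.rm = TRUE)
cbind(pred.ge, PI.boot1, se = pred.se)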

Is there a function for substituting (or removing at all) explaining variables in a linear model (lm)?

I have a linear model with lots of explanatory (independent) variables:
model <- lm(y ~ x1 + x2 + x3 + ... + x100)
Some of these are linearly dependent on each other (multicollinearity).
I want the machine to find the name of the explanatory variable with the highest VIF coefficient (x2, for example), delete it from the formula, and then rerun lm with the new formula:
model <- lm(y ~ x1 + x3 + ... + x100)
I have already learned how to retrieve the name of the explanatory variable with the highest VIF coefficient:
library(car)  # for vif()
max_vif <- function(x) {
  vifac <- data.frame(vif(x))
  nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
  return(nameofmax)
}
But I still don't understand how to find that explanatory variable in the formula, delete it, and refit the model.
We can use the update function and paste in the variable that needs to be removed. We first fit a model, then use update to change that model's formula. The model formula can be expressed as a character string, which lets you concatenate the general formula .~. with whatever variable(s) you'd like removed (using the minus sign -).
Here is an example:
fit1 <- lm(wt ~ mpg + cyl + am, data = mtcars)
coef(fit1)
# (Intercept) mpg cyl am
# 4.83597190 -0.09470611 0.08015745 -0.52182463
rm_var <- "am"
fit2 <- update(fit1, paste0(".~. - ", rm_var))
coef(fit2)
# (Intercept) mpg cyl
# 5.07595833 -0.11908115 0.08625557
Using max_vif we can wrap this into a function:
rm_max_vif <- function(x) {
  # find variable(s) needing to be removed
  rm_var <- max_vif(x)
  # concatenate with "-" to remove variable(s) from the formula
  rm_var <- paste(paste0("-", rm_var), collapse = " ")
  # update the model
  update(x, paste0(".~.", rm_var))
}
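For example, applied to fit1 from above (assuming the car package is loaded so vif() is available):

fit3 <- rm_max_vif(fit1)  # drops whichever of mpg, cyl, am has the highest VIF
formula(fit3)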
Problem solved!
I put all the variables for the lm model into one data frame (a plain list won't carry the variable names that lm and the deletion step need):
Price <- data.frame(y, x1, ..., x100)
Then I used a different way of setting up the lm model:
model <- lm(y ~ ., data = Price)
So we can just delete the variable with the highest VIF from Price.
With the max_vif function I already came up with, the code becomes:
Price <- data.frame(y, x1, x2, ..., x100)
model <- lm(y ~ ., data = Price)

# Function for finding the name of the variable with the highest VIF
max_vif <- function(x) {
  vifac <- data.frame(vif(x))
  nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
  return(nameofmax)
}

# Delete the variable with the highest VIF from Price, one at a time,
# refitting the model after each deletion, until no VIF is 5 or higher
n <- max(data.frame(vif(model)))
while (n >= 5) {
  m <- max_vif(model)
  Price[[m]] <- NULL
  model <- lm(y ~ ., data = Price)
  n <- max(data.frame(vif(model)))
}

output table for multiple regressions

I ran several time-series regressions (one for every year) and now I'd like to generate a table similar to what coef() returns, but also with the level of significance (the stars), R-squared, and the F-statistic for each year, looking somewhat like this:
b0 b1 b2 b3 b4 R-sq. F-stat.
2010 ...*
2011 ...
2012 ...**
So far I've tried mtable() from the memisc package, which gives me years as columns and coefficients as rows, but I'd prefer the result to be transposed (like the layout above).
Since we don't have access to your data or the code you used to run your models, I created my own dummy models using the mtcars dataset:
data("mtcars")
model1 <- lm(mpg ~ wt + cyl, data = mtcars)
model2 <- lm(mpg ~ wt + cyl + hp, data = mtcars)
For future reference, you'll always want to supply some of your data using, for example, dput(head(my_dataframe, 20)). You should also post more of the code you used to get where you are: ideally, the minimum amount of code needed to reproduce your problem. You may want to read How to Create a Great R Reproducible Example for more information; it just helps others help you.
Then I rigged up the following (clumsy) function that I think does roughly what you're looking for. In any event, it should get you started in the right direction:
get_row <- function(x, coef_names) {
  coef_mat <- coef(summary(x))
  these_coef_names <- rownames(coef_mat)
  rows <- match(coef_names, these_coef_names)
  p <- coef_mat[rows, 4]
  stars <- c("", "*", "**", "***")[(p < 0.05) + (p < 0.01) + (p < 0.001) + 1]
  coefs <- round(coef_mat[rows, 1], 3)
  output <- paste0(coefs, stars)
  output <- ifelse(grepl("NA", output), NA, output)
  return(output)
}
get_table <- function(...) {
  models <- list(...)
  if (any(sapply(models, class) != "lm")) {
    stop("This function has only been tested with lm objects.")
  }
  coef_names <- unique(unlist(sapply(models, variable.names)))
  coef_table <- t(sapply(models, get_row, coef_names))
  colnames(coef_table) <- coef_names
  return(coef_table)
}
get_table(model1, model2)
# (Intercept) wt cyl hp
# [1,] "39.686***" "-3.191***" "-1.508**" NA
# [2,] "38.752***" "-3.167***" "-0.942" "-0.018"

Writing function to identify confounding variables

I am trying to model an outcome as a function of several exposures, adjusting the models for any covariates that may be confounders (a change of 10% or more in the outcome coefficient when added to the model). I am looking at many covariates as potential confounders, so I have created a dataframe with all of them and am using lapply (the outcome and exposures are in a separate dataframe, which has already been attached). To make sorting through all my outputs easier, I have tried to write a function that only displays the output if the covariate is a confounder. The exposures, and the number of them, differ in each model, so I find myself having to write code like the below each time I run my analyses, but I know there must be an easier way. Would there be a function I could write to just lapply with, using the model without confounders and the Covariates dataframe as arguments? Thanks!
lapply(Covariates, function(x) {
  model <- summary(lm(Outcome ~ Exposure1 + Exposure2 + ... + x))
  if ((model$coefficients[2, 1] - summary(lm(Outcome ~ Exposure))$coefficients[2, 1]) /
      model$coefficients[2, 1] >= .1)
    return(model)
})
I have written a function to solve this problem!

library(broom)  # for tidy()

confounder <- function(model) {
  model.sum <- summary(model)
  model.b <- model.sum$coefficients[2, 1]
  oldmodel <- update(model, . ~ . - x)
  oldmodel.sum <- summary(oldmodel)
  oldmodel.b <- oldmodel.sum$coefficients[2, 1]
  model.frame <- tidy(model.sum)
  model.sub <- subset(model.frame, term == "x")
  model.sub.b <- model.sub[, 5]  # the p-value column
  if ((model.b - oldmodel.b) / model.b >= .1 |
      model.sub.b < .05)
    return(model.sum)
}
I then lapply this function to the models:
lapply(Covariates, function(x) {
  confounder(lm(Outcome ~ Exposure1 + Exposure2 + ... + x))
})
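As a toy check of the function (my own example, not from the question: mtcars with wt as the exposure and each of a few other columns as the candidate covariate x):

Covariates <- mtcars[, c("hp", "qsec", "drat")]
lapply(Covariates, function(x) {
  confounder(lm(mpg ~ wt + x, data = mtcars))
})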

parallel computing in R , implementing bootstrap

I'm currently trying to compute model estimators using the BLB bootstrap, and would like to do so in parallel. My code works fine when I don't run it in parallel. The problem when computing in parallel is that the results I get from each core contain NA values. I don't understand how I get NA values when the iris dataset contains no NAs at all.
Here is the code that I'm using:
library(doParallel)
library(itertools)

num_of_cores <- detectCores()
cl <- makePSOCKcluster(num_of_cores)
registerDoParallel(cl)

attach(iris)
data <- iris
coeftmp <- data.frame()

system.time(
  r <- foreach(dat = isplitRows(data, chunks = num_of_cores),
               .combine = cbind) %dopar% {
    BLBsize <- round(nrow(dat)^0.6)
    for (i in 1:400) {
      set.seed(i)
      # sampling B(n) data points from the original data set without replacement
      sample_BOFN <- dat[sample(nrow(dat), size = BLBsize, replace = FALSE), ]
      # sampling from the subsample with replacement
      sample_bootstrap <- sample_BOFN[sample(nrow(sample_BOFN), size = nrow(sample_BOFN), replace = TRUE), ]
      bootstrapModel <- glm(sample_bootstrap$Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width,
                            data = sample_bootstrap)
      coeftmp <- rbind(coeftmp, bootstrapModel$coefficients)
    }
    # calculating the estimators of the model with mean
    colMeans(coeftmp)
  }
)
I think you're going to have to go through a few iterations of the debugger on this to solve it. But you're getting NAs from this line:
bootstrapModel <- glm(sample_bootstrap$Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width, data = sample_bootstrap)
I am guessing that you get a singularity from one of your sample_bootstraps, since a singularity would give you an NA coefficient. It's possible something else is causing this, though it's definitely coming from this line of code; you'll need to step through the debugger to isolate it.
So this is not a complete answer, but it should allow you to solve your own problem. You can see the issue by investigating:
r2 <- foreach(dat = isplitRows(data, chunks = 1)) %dopar% {
  BLBsize <- round(nrow(dat)^0.6)
  for (i in 1:400) {
    set.seed(i)
    # sampling B(n) data points from the original data set without replacement
    sample_BOFN <- dat[sample(nrow(dat), size = BLBsize, replace = FALSE), ]
    # sampling from the subsample with replacement
    sample_bootstrap <- sample_BOFN[sample(nrow(sample_BOFN), size = nrow(sample_BOFN), replace = TRUE), ]
    bootstrapModel <- glm(sample_bootstrap$Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width,
                          data = sample_bootstrap)
    coeftmp <- rbind(coeftmp, bootstrapModel$coefficients)
  }
  # return a list, not just the colMeans -- for debugging purposes
  return(list(coeftmp = coeftmp, result = colMeans(coeftmp)))
}
sum(is.na(r2[[1]][[1]])) # no missing coefficients with 1 core
r <- foreach(dat = isplitRows(data, chunks = num_of_cores)) %dopar% {
  BLBsize <- round(nrow(dat)^0.6)
  for (i in 1:400) {
    set.seed(i)
    # sampling B(n) data points from the original data set without replacement
    sample_BOFN <- dat[sample(nrow(dat), size = BLBsize, replace = FALSE), ]
    # sampling from the subsample with replacement
    sample_bootstrap <- sample_BOFN[sample(nrow(sample_BOFN), size = nrow(sample_BOFN), replace = TRUE), ]
    bootstrapModel <- glm(sample_bootstrap$Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width,
                          data = sample_bootstrap)
    coeftmp <- rbind(coeftmp, bootstrapModel$coefficients)
  }
  # return a list, not just the colMeans -- for debugging purposes
  return(list(coeftmp = coeftmp, result = colMeans(coeftmp)))
}
# lots of missing values in your coeftmp results.
lapply(r, function(l) {sum(is.na(l[[1]]))})
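A plausible mechanism (my guess, worth confirming in the debugger): splitting the 150 iris rows across all cores leaves each worker only a small chunk, so the subsample size round(nrow(dat)^0.6) lands close to the 4 coefficients being estimated, and resampling those few rows with replacement often produces rank-deficient fits, which glm reports as NA coefficients. Rough arithmetic, assuming an 8-core machine:

nrow(iris)            # 150 rows in total
150 / 8               # ~19 rows per worker
round((150 / 8)^0.6)  # a subsample of ~6 rows to estimate 4 coefficients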
