How can I extract coefficients from this model in caret? - r

I'm using the caret package with the leaps package to get the number of variables to use in a linear regression. How do I extract the model with the lowest RMSE that uses mdl$bestTune number of variables? If this can't be done are there functions in other packages you would recommend that allow for loocv of a stepwise linear regression and actually allow me to find the final model?
Below is reproducible code. From it, I can tell from mdl$bestTune that the number of variables should be 4 (even though I would have hoped for 3). It seems like I should be able to extract the variables from the third row of summary(mdl$finalModel) but I'm not sure how I would do this in a general case and not just this example.
library(caret)
set.seed(101)
x <- matrix(rnorm(36*5), nrow=36)
colnames(x) <- paste0("V", 1:5)
y <- 0.2*x[,1] + 0.3*x[,3] + 0.5*x[,4] + rnorm(36) * .0001
train.control <- trainControl(method="LOOCV")
mdl <- train(x=x, y=y, method="leapSeq", trControl = train.control, trace=FALSE)
coef(mdl$finalModel, as.double(mdl$bestTune))
mdl$bestTune
summary(mdl$finalModel)
mdl$results
Here's the context behind my question in case it's of interest. I have historical monthly returns hundreds of mutual fund. Each fund's returns will be a dependent variable that I'd like to regress against a set of returns on a handful (e.g. 5) factors. For each fund I want to run a stepwise regression. I expect only 1 to 3 of the five factors to be significant for any fund.

you can use:
coef(mdl$finalModel,unlist(mdl$bestTune))

Related

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps that I already developed, using a fictive example from pima data from faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply a multiple imputation by chained equation method using MICE package. For the sake of the example, I previously randomly assign missing values to pima dataset using the ampute function from the same package. A number of 20 imputated datasets were generated by setting "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputated datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputated datasets. Inspecting convergence, original and imputated data distributions is skipped for the sake of the example. "Test" variable is set as the binary dependent variable.
#run a logistic regression on each of the 20 imputated datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimations from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using prediction function from the margins package. This specific function allows to generate predicted values fixed at a specific level (for factors) or values (for continuous variables). In this example, I could chose to generate new predicted probabilites, i.e. P(Y=1), while setting pregnant variable (# of pregnancies) at 3. In other words, it would give me the distribution of the issue in the contra-factual situation where all the observations are set at 3 for this variable. Normally, I would just give my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model with MICE, the object class is a mipo and not a glm object.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would enjoy any advice that could help me getting closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to #Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
mod <- glm(test~pregnant+glucose+diastolic+
triceps+age+bmi,
data = dat, family = binomial)
out <- predictions(mod, newdata = datagrid(pregnant=3))
return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, .Random.seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)

How to plot multi-level meta-analysis by study (in contrast to the subgroup)?

I am doing a multi-level meta-analysis. Many studies have several subgroups. When I make a forest plot studies are presented as subgroups. There are 60 of them, however, I would like to plot studies according to the study, then it would be 25 studies and it would be more appropriate. Does anyone have an idea how to do this forest plot?
I did it this way:
full.model <- rma.mv(yi = yi,
V = vi,
slab = Author,
data = df,
random = ~ 1 | Author/Study,
test = "t",
method = "REML")
forest(full.model)
It is not clear to me if you want to aggregate to the Author level or to the Study level. If there are multiple rows of data for particular studies, then the model isn't really complete and you would want to add another random intercept for the level of the estimates within studies. Essentially, the lowest random effect should have as many values for nlvls in the output as there are estimates (k).
Let's first tackle the case where we have a multilevel structure with two levels, studies and multiple estimates within studies (for some technical reasons, some might call this a three-level model, but let's not get into this). I will use a fully reproducible example for illustration purposes, using the dat.konstantopoulos2011 dataset, where we have districts and schools within districts. We fit a multilevel model of the type as you have with:
library(metafor)
dat <- dat.konstantopoulos2011
res <- rma.mv(yi, vi, random = ~ 1 | district/school, data=dat)
res
We can aggregate the estimates to the district level using the aggregate() function, specifying the marginal var-cov matrix of the estimates from the model to account for their non-independence (note that this makes use of aggregate.escalc() which only works with escalc objects, so if it is not, you need to convert the dataset to one - see help(aggregate.escalc) for details):
agg <- aggregate(dat, cluster=dat$district, V=vcov(res, type="obs"))
agg
You will find that if you then fit an equal-effects model to these estimates based on the aggregated data that the results are identical to what you obtained from the multilevel model (we use an equal-effects model since the heterogeneity accounted for by the multilevel model is already encapsulated in vcov(res, type="obs")):
rma(yi, vi, method="EE", data=agg)
So, we can now use these aggregated values in a forest plot:
with(agg, forest(yi, vi, slab=district))
My guess based on your description is that you actually have an additional level that you should include in the model and that you want to aggregate to the intermediate level. This is a tad more complicated, since aggregate() isn't meant for that. Just for illustration purposes, say we use year as another (higher) level and I will mess a bit with the data so that all three variance components are non-zero (again, just for illustration purposes):
dat$yi[dat$year == 1976] <- dat$yi[dat$year == 1976] + 0.8
res <- rma.mv(yi, vi, random = ~ 1 | year/district/school, data=dat)
res
Now instead of aggregate(), we can accomplish the same thing by using a multivariate model, including the intermediate level as a factor and using again vcov(res, type="obs") as the var-cov matrix:
agg <- rma.mv(yi, V=vcov(res, type="obs"), mods = ~ 0 + factor(district), data=dat)
agg
Now the model coefficients of this model are the aggregated values and the var-cov matrix of the model coefficients is the var-cov matrix of these aggregated values:
coef(agg)
vcov(agg)
They are not all independent (since we haven't aggregated to the highest level), so if we want to check that we can obtain the same results as from the multilevel model, we must account for this dependency:
rma.mv(coef(agg), V=vcov(agg), method="EE")
Again, exactly the same results. So now we use these coefficients and the diagonal from vcov(agg) as their sampling variances in the forest plot:
forest(coef(agg), diag(vcov(agg)), slab=names(coef(agg)))
The forest plot cannot indicate the dependency that still remains in these values, so if one were to meta-analyze these aggregated values using only diag(vcov(agg)) as their sampling variances, the results would not be identical to what you get from the full multilevel model. But there isn't really a way around that and the plot is just a visualization of the aggregated estimates and the CIs shown are correct.
You need to specify your own grouping in a new column of data and use this as the new random effect:
df$study_group <- c(1,1,1,2,2,3,4,5,5,5) # example
full.model <- rma.mv(yi = yi,
V = vi,
slab = Author,
data = df,
random = ~ 1 | study_group,
test = "t",
method = "REML")
forest(full.model)

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the package glmnet, I need to run several LASSO analysis for the calibration of a large number of variables (%reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts on the procedure and on the results I wish to solve. I show my provisional code below:
First I split my data in training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwords, I run a cross-validated LASSO model for the training set and extract and writte the non-zero coefficients for lambdamin. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y=y.train, x= x.train, family="gaussian", nfolds =
5, standardize=TRUE, alpha=1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use the function “predict” and apply the object “cv.lasso1” (the model obtained previously) to the variables of the testing set (x.2) in order to get the prediction of the variable and I run the correlation between the predicted and the actual values of Y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx=x.2, type = "response", s =
"lambda.min")
cor.test(x=c(predict.1.2), y=c(y.2))
This is a simplified code and had no problem so far, the point is that I would like to make a loop (of one hundred repetitions) of the whole code and get the non-zero coefficients of the cross-validated model as well as the correlation coefficient of the predicted vs actual values (for the testing set) for each repetition. I've tried but couldn't get any clear results. Can someone give me some hint?
thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky. And in your case, may not be necessary the way in which you have outlined it.
If you are trying to find the variables most predictive, you can use PCA, Principal Component Analysis to select variables with the most variation within the a variable AND between variables, but it does not consider your outcome at all, so if you have poor model design it will pick the least correlated data in your repository but it may not be predictive. So you should be very aware of all variables in the set. This would be a way of reducing the dimensionality in your data for a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
center = TRUE,
scale. = TRUE)
Scaling and centering are essential to making these models work right, by removing the distance between your various variables setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those as they are. And if you have skewed or kurtotic data, you might need to address this prior to PCA. Run this ONLY on your predictors...keep your target/outcome variable out of the data set.
If you have a classification problem you are looking to resolve with much data, try an LDA, Linear Discriminant Analysis which looks to reduce variables by optimizing the variance of each predictor with respect to the OUTCOME variable...it specifically considers your outcome.
require(MASS)
yourLDA =r <- lda(formula = outcome ~ .,
data = yourdata)
You can also set the prior probabilities in LDA if you know what a global probability for each class is, or you can leave it out, and R/ lda will assign the probabilities of the actual classes from a training set. You can read about that here:
LDA from MASS package
So this gets you headed in the right direction for reducing the complexity of data via feature selection in a computationally solid method. In looking to build the most robust model via repeated model building, this is known as crossvalidation. There is a cv.glm method in boot package which can help you get this taken care of in a safe way.
You can use the following as a rough guide:
require(boot)
yourCVGLM<- cv.glmnet(y = outcomeVariable, x = allPredictorVariables, family="gaussian", K=100) .
Here K=100 specifies that you are creating 100 randomly sampled models from your current data OBSERVATIONS not variables.
So the process is two fold, reduce variables using one of the two methods above, then use cross validation to build a single model from repeated trials without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called booting and it is powerful and available in many different model types.
Not as much code and you might hope for, but pointing you in a decent direction.

R: glmrob can't predict models with dropped co-linear columns, while glm can?

I'm learning to implement robust glms in R, but can't figure out why I am unable to get glmrob to predict values from my regression models when I have a model where some columns are dropped due to co-linearity. Specifically when I use the predict function to predict values from a glmrob, it always gives NA for all values. I don't observe this when predicting values from the same data & model using glm. It doesn't seem to matter what data I use -- as long as there is a NA coefficient in the fitted model (and the NA isn't the last coefficient in the coefficient vector), the predict does not work.
This behavior holds for all datasets and models I have tried where an internal column is dropped due to co-linearity. I include a fake data set where two columns are dropped from the model, which gives two NAs in the coefficient list. Both glm and glmrob give nearly identical coefficients, yet predict only works with the glm model. So my question is: what don't I understand about robust regression that would prevent my glmrob models from generating predicted values?
library(robustbase)
#Make fake data with two categorial predictors
df <- data.frame("category" = rep(c("A","B","C"),each=6))
df$location <- rep(1:6,each=3)
val <- rep(c(500,50,5000),each=6)+rep(c(50,100,25,200,100,1),each=3)
df$value <- rpois(NROW(df),val)
#note that predict works if we omit the newdata parameter. However I need the newdata param
#so I use the original dataframe here as a stand-in.
mod <- glm(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) # works fine
mod <- glmrob(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) #predicts NA for all values
I've been digging into this and have concluded that the problem does not lie in my understanding of robust regression, but rather the problem lies with a bug in the robustbase package. The predict.lmrob function does not correctly pick the necessary coefficients from the model before the prediction. It needs to pick the first x non-NA coefficients (where x=rank of the model matrix). Instead it merely picks the first x coefficients without checking if they are NA. This explains why this problem only surfaces for models where the NA isn't the last coefficient in the coefficient vector.
To fix this, I copied the predict.lmrob source using:
getAnywhere(predict.lmrob)
and created my own replacement function. In this function I made a single modification to the code:
...
p <- object$rank
if (is.null(p)) {
df <- Inf
p <- sum(!is.na(coef(object)))
#piv <- seq_len(p) # old code
piv <- which(!is.na(coef(object))) # new code
}
else {
p1 <- seq_len(p)
piv <- if (p)
qr(object)$pivot[p1]
}
...
I've run a few hundred datasets using this change and it has worked well.

R: Regressions with group fixed effects and clustered standard errors with imputed dataset

I am trying to run regressions in R (multiple models - poisson, binomial and continuous) that include fixed effects of groups (e.g. schools) to adjust for general group-level differences (essentially demeaning by group) and that cluster standard errors to account for the nesting of participants in the groups. I am also running these over imputed data frames (created with mice). It seems that different disciplines use the phrase ‘fixed effects’ differently so I am having a hard time searching to troubleshoot.
I have fit random intercept models (with lme4) but they do not account for the school fixed effects (and the random effects are not of interest to my research questions). Putting the groups in as dummies slows the run down tremendously. I could also run a single level glm/lm with group dummies but I have not been able to find a strategy to cluster the standard errors with the imputed data (tried the clusterSE package). I could hand calculate the demeaning but there seems like there should be a more direct way to achieve this.

I have also looked at the lfe package but that does not seem to have glm options and the demeanlist function does not seem to be compatible with the imputed data frames.
In Stata, the command would be xtreg, fe vce (Cluster Variable), (fe = fixed effects, vce = clustered standard errors, with mi added to run over imputed dataframes). I could switch to Stata for the modeling but would definitely prefer to stay with R if possible!
Please let me know if this is better posted in cross-validated - I was on the fence but went with this one since it seemed to be more a coding question.
Thank you!
I would block bootstrap. The "block" handles the clustering and "bootstrap" handles the generated regressors.
There is probably a more elegant way to make this extensible to other estimators, but this should get you started.
# junk data
x <- rnorm(100)
y <- 1 + 2*x + rnorm(100)
dat1 <- data.frame(y, x, id=seq_along(y))
summary(lm(y ~ x, data=dat1))
# same point estimates, but lower SEs
dat2 <- dat1[rep(seq_along(y), each=10), ]
summary(lm(y ~ x, data=dat2))
# block boostrap helper function
require(boot)
myStatistic <- function(ids, i) {
myData <- do.call(rbind, lapply(i, function(i) dat2[dat2$id==ids[i], ]))
myLm <- lm(y ~ x, data=myData)
myLm$coefficients
}
# same point estimates from helper function if original data
myStatistic(unique(dat2$id), 1:100)
# block bootstrap recovers correct SEs
boot(unique(dat2$id), myStatistic, 500)

Resources