Cross-validation for comparing linear models in R

I have two linear models to compare: one is a reduced model and the other is the full model. I have already compared the two models with an F-test, but I do not know how to compare them using 5-fold cross-validation. The following is what I have in R.
library(carData)  # make sure this package is installed!
data("Anscombe")  # NOTE: the lowercase "anscombe" is a different built-in dataset; this one is capitalized (annoying)
spending = subset(Anscombe, !rownames(Anscombe) %in% c("HI", "AK", "DC"), c(1, 2, 4))  ## Data file.
mod_full = lm(education ~ urban + income, data = spending)   ## Full model.
mod_reduced = lm(education ~ income, data = spending)        ## Reduced model.
FTest = anova(mod_reduced, mod_full)
print(FTest)  ## Print out the F-test result.

You can use the following code to compare the two models using 5-fold cross-validation:
library(caret)
library(carData)
data("Anscombe")
spending = subset(Anscombe, !rownames(Anscombe) %in% c("HI", "AK", "DC"), c(1, 2, 4))  ## Data file.
# Settings for 5-fold cross-validation
trainControl <- trainControl(method = "cv", number = 5,
                             savePredictions = TRUE, classProbs = FALSE)
# Full model
set.seed(7)  # for reproducible fold assignments
fit.full <- train(education ~ urban + income, data = spending, method = "lm",
                  metric = "RMSE", preProc = c("center", "scale"),
                  trControl = trainControl)
# Reduced model
set.seed(7)  # same seed so both models are evaluated on the same folds
fit.reduced <- train(education ~ income, data = spending, method = "lm",
                     metric = "RMSE", preProc = c("center", "scale"),
                     trControl = trainControl)
# Compare the two models on their resampled metrics
results <- resamples(list(Full = fit.full, Reduced = fit.reduced))
summary(results)
dotplot(results, scales = "free")
# Correlation between the models' resampled results
modelCor(results)
splom(results)
# Differences in resampled performance between the models
diffs <- diff(results)
# Summarize p-values for the pairwise comparisons
summary(diffs)
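If you want to see what caret is doing under the hood, here is a minimal base-R sketch of the same comparison (my own illustration, with no centering/scaling; fold assignment is random, so numbers will differ slightly from caret's):
set.seed(7)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(spending)))  # random fold labels
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
cv_rmse <- sapply(1:k, function(i) {
  train_dat <- spending[folds != i, ]
  test_dat  <- spending[folds == i, ]
  full <- lm(education ~ urban + income, data = train_dat)
  red  <- lm(education ~ income, data = train_dat)
  c(Full    = rmse(test_dat$education, predict(full, test_dat)),
    Reduced = rmse(test_dat$education, predict(red, test_dat)))
})
rowMeans(cv_rmse)  # average held-out RMSE per model; lower is better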

Related

Getting estimated means after multiple imputation using the mitml, nlme & geepack R packages

I'm running multilevel multiple imputation through the package mitml (using the panImpute() function) and am fitting linear mixed models and marginal models through the packages nlme and geepack via mitml's with() function.
I can get the estimates, p-values, etc. for those through the testEstimates() function, but I'm also looking to get estimated means across my model predictors. I've tried the emmeans package, which I normally use for getting estimated means when running nlme & geepack without multiple imputation, but here emmeans tells me it "Can't handle an object of class “mitml.result”".
I'm wondering: is there a way to get pooled estimated means from the multiple imputation analyses I've run?
The data frames I'm analyzing are longitudinal/repeated measures and in long format. In the linear mixed model I want the estimated means for a 2x2 interaction effect, and in the marginal model I want the estimated means for the six levels of the 'time' variable. The outcome in all models is continuous.
Here's my code:
# mixed model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data=Data, formula=fml, n.burn=50000, n.iter=5000, m=100, group = "treatment")
summary(imp)
plot(imp, trace="all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, lme(Dep ~ time*treatment, random = ~ 1|id, method = "ML", na.action = na.exclude, control = list(opt = "optim")))
testEstimates(fit, var.comp = TRUE)
confint.mitml.testEstimates(testEstimates(fit, var.comp = TRUE))
# marginal model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data=Data, formula=fml, n.burn=50000, n.iter=5000, m=100)
summary(imp)
plot(imp, trace="all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, geeglm(Dep ~ time, id = id, corstr ="unstructured"))
testEstimates(fit, var.comp = TRUE)
confint.mitml.testEstimates(testEstimates(fit, var.comp = TRUE))
Without Data this is not a reprex, so I can't verify that it works for you. But emmeans does support mira-class (lists of) models from the mice package. So if you fit your model in with() using a mids object rather than a mitml.list object, you can use that to obtain marginal means of your outcome (and any contrasts or pairwise comparisons afterward).
Using example data found here, which (somewhat uncomfortably) loads an external workspace:
con <- url("https://www.gerkovink.com/mimp/popular.RData")
load(con)
## imputation
library(mice)
ini <- mice(popNCR, maxit = 0)
meth <- ini$meth
meth[c(3, 5, 6, 7)] <- "norm"
pred <- ini$pred
pred[, "pupil"] <- 0
imp <- mice(popNCR, meth = meth, pred = pred, print = FALSE)
## analysis
library(lme4) # fit multilevel model
mod <- with(imp, lmer(popular ~ sex + (1|class)))
library(emmeans) # obtain pooled estimates of means
(em <- emmeans(mod, specs = ~ sex) )
pairs(em) # test comparison
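If you also want pooled confidence intervals for the means or the comparison, confint() works on the resulting emmeans objects:
confint(em)         # pooled confidence intervals for the estimated means
confint(pairs(em))  # pooled confidence intervals for the pairwise comparison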

Get predictions from each tree in a random forest model in R (using train for training and predict for prediction)

I am using the train function to train a random forest model:
fitControl = trainControl(method = "oob")
tuneGrid = expand.grid(.mtry = c(15))
rfmod = train(p ~ x + y + z,
              method = "rf",
              data = train_df,
              tuneGrid = tuneGrid,
              trControl = fitControl,
              importance = TRUE,
              allowParallel = TRUE)
This is simplified code showing the structure of my model; I am training on the data train_df.
I want the prediction from each individual tree. After some searching I tried this:
preds <- predict(rfmod, newdata = test_df[1, ], predict.all = TRUE)
using just the first row of my test data test_df. But when I check preds, it still gives me only a single prediction value, while I want the prediction from every tree.
How can I get the predictions from all the trees?
Thanks in advance!!
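A hedged sketch of one way to get there (not verified against your data): caret's predict.train() does not forward predict.all to randomForest, but you can call predict() directly on the fitted randomForest object stored in rfmod$finalModel. With predict.all = TRUE, randomForest returns a list whose individual component has one column of predictions per tree:
library(randomForest)
# predict.train() drops predict.all, so use the underlying randomForest object.
# If x, y, z are numeric, passing the raw rows usually works; with factor
# predictors you may need to rebuild the design matrix that train() created
# from the formula.
preds <- predict(rfmod$finalModel, newdata = test_df[1, , drop = FALSE],
                 predict.all = TRUE)
preds$aggregate   # the usual ensemble prediction (average over trees)
preds$individual  # matrix with one column of predictions per tree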

How to calculate R-squared after using bagging function to develop CART decision trees?

I am using the following bagging function from ipred to bootstrap the sample 500 times in R and build decision trees:
baggedsample <- bagging(p ~ ., data, nbagg = 500, coob = TRUE,
                        control = list(minbucket = 5))
After this, I would like to know the R-squared.
I notice that if I do the bagging with caret's train function, R-squared is calculated automatically, as follows:
# Specify 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
# CV bagged model
baggedsample <- train(p ~ .,
                      data,
                      method = "treebag",
                      trControl = ctrl,
                      importance = TRUE)
# Assess results
baggedsample
##     RMSE  Rsquared      MAE
## 36477.25 0.7001783 24059.85
Appreciate any guidance on this issue, thanks.
Since you do not provide any data, I will illustrate using the built-in iris data.
You can simply compute R-squared from its definition, one minus the residual sum of squares divided by the total sum of squares:
library(ipred)
attach(iris)
BAG = bagging(Sepal.Length ~ ., data = iris)
R2 = 1 - sum((Sepal.Length - predict(BAG))^2) /
         sum((Sepal.Length - mean(Sepal.Length))^2)
R2
[1] 0.824782
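Note that predict(BAG) here scores the training data, so this R-squared is an in-sample (optimistic) estimate. If you fit with coob = TRUE as in your original call, my understanding is that ipred stores an out-of-bag root mean squared error in the err component (check str(BAG_oob) to confirm), which gives a less optimistic out-of-bag R-squared:
BAG_oob = bagging(Sepal.Length ~ ., data = iris, coob = TRUE)
# err should be the out-of-bag RMSE, so err^2 is the out-of-bag MSE
R2_oob = 1 - BAG_oob$err^2 / mean((Sepal.Length - mean(Sepal.Length))^2)
R2_oob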

Extract the Intercept from a Caret LASSO Model

I'm using the caret package in R to fit a LASSO regression model. My code runs fine; however, I would like to extract the intercept for the final model so I can build a scoring key using the selected predictors and coefficients.
For example, if "Extraversion" is the variable I am trying to model using survey items, I would like to produce the following scoring key:
Intercept + Survey_Item_1*Slope + Survey_Item_2*Slope + and so on
FWIW, I am able to extract the coefficients for the predictors.
My code for reference:
##Create training & test set
set.seed(9808)
ind <- sample(0:1, nrow(df), replace = TRUE, prob = c(.75, .25))
train <- df[ind == 0, ]
test <- df[ind == 1, ]
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
##Train LASSO model
fit.lasso <- train(Extraversion ~ ., data = train, method = "lasso",
                   preProc = c('scale', 'center', 'nzv'), trControl = ctrl)
fit.lasso
predict.enet(fit.lasso$finalModel, type = 'coefficients',
             s = fit.lasso$bestTune$fraction, mode = 'fraction')
##Fit the model to the test data
lasso_test <- predict(fit.lasso, newdata = test, na.action = "na.pass")
postResample(pred = lasso_test, obs = test[, 1])
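One possible route (a sketch under my assumptions, not a tested answer for your data): caret's "lasso" method wraps elasticnet::enet(), whose coefficient output does not include an intercept. Refitting the same model with method = "glmnet" gives a final model whose coef() reports the intercept alongside the slopes:
fit.glmnet <- train(Extraversion ~ ., data = train, method = "glmnet",
                    preProc = c('scale', 'center', 'nzv'), trControl = ctrl,
                    tuneGrid = expand.grid(alpha = 1,  # alpha = 1 is the pure LASSO penalty
                                           lambda = seq(0.001, 0.1, length.out = 20)))
coef(fit.glmnet$finalModel, s = fit.glmnet$bestTune$lambda)  # includes "(Intercept)"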

Obtaining predictions on test datasets for k-fold cross validation in caret

I'm a little confused about how caret scores the test folds in k-fold cross-validation.
I'd like to generate a data frame or matrix containing the scored records of the ten test folds from 10-fold cross-validation.
For example, using the iris dataset to train a decision tree model:
install.packages("caret", dependencies = TRUE)
library(caret)
data(iris)
train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
model <- train(Species ~ ., data = iris, trControl = train_control, method = "rpart")
model$pred
The model$pred command lists predictions for the ten folds, 450 records in total.
This doesn't seem right: shouldn't model$pred contain predictions for the 150 records in the ten test folds (1/10 * 150 = 15 records per test fold)? Where do the 450 records come from?
By default, train iterates over three values of the complexity parameter cp of rpart (see ?rpart.control), so each held-out record is scored once per candidate value: 150 records x 3 cp values = 450 rows:
library(caret)
data(iris)
train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
model <- train(Species ~ .,
               data = iris,
               trControl = train_control,
               method = "rpart")
nrow(model$pred)
# [1] 450
length(unique(model$pred$cp))
# [1] 3
You can change that, for example, by explicitly specifying cp = 0.05:
model <- train(Species ~ .,
               data = iris,
               trControl = train_control,
               method = "rpart",
               tuneGrid = data.frame(cp = 0.05))
nrow(model$pred)
# [1] 150
length(unique(model$pred$cp))
# [1] 1
or by using tuneLength=1 instead of the default 3:
model <- train(Species ~ .,
               data = iris,
               trControl = train_control,
               method = "rpart",
               tuneLength = 1)
nrow(model$pred)
# [1] 150
length(unique(model$pred$cp))
# [1] 1
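More generally, if you keep a tuning grid with several candidate values, you can restrict model$pred to the rows for the winning hyperparameters by matching against model$bestTune, for example:
best_preds <- merge(model$pred, model$bestTune)  # keep only rows for the best cp
nrow(best_preds)  # one row per held-out record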
