Pooled Regression Results using mice, caret, and glmnet - r

Not sure if this is more of a statistics question, but the closest similar problem I could find is here, although I couldn't get it to work for my case.
I am trying to develop a pooled, penalized logistic regression model. I used mice to create a mids object and then fit a model to each imputed dataset using caret with repeated cross-validation and elastic net regression (glmnet) to tune parameters. The fitted object is not of class "mira", but I think I fixed that by changing the object class and supplying the right list items. The major issue is that glmnet does not have an associated vcov method, which is required by pool().
I would like to use penalized regression because of the number of variables and the uncertainty over which ones are the best predictors. My data consists of 4 numeric variables and 9 categorical variables of varying levels, and I anticipate including interactions.
Does anyone know how I might be able to create my own vcov method or otherwise address this issue? I am not sure if this is possible.
Example data and code are enclosed, noting that I am not able to share the actual data.
library(mice)
library(caret)
dat <- as.data.frame(list(time = c(4,3,1,1,2,2,3,5,2,4,5,1,4,3,1,1,2,2,3,5,2,4,5,1),
                          status = c(1,1,1,0,2,2,0,0,NA,1,2,0,1,1,1,NA,2,2,0,0,1,NA,2,0),
                          x = c(0,2,1,1,NA,NA,0,1,1,2,0,1,0,2,1,1,NA,NA,0,1,1,2,0,1),
                          sex = c("M","M","M","M","F","F","F","F","M","F","F","M","F","M","M","M","F","F","M","F","M","F","M","F")))
imp <- mice(dat,m=5, seed=192)
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 3,
                        verboseIter = FALSE)
# fit an elastic-net logistic model to each completed (imputed) dataset
mod <- list(analyses = vector("list", imp$m))
for(i in 1:imp$m){
  mod$analyses[[i]] <- train(sex ~ .,
                             data = complete(imp, i),
                             method = "glmnet",
                             family = "binomial",
                             trControl = control,
                             tuneLength = 10,
                             metric = "Kappa")
}
# coerce to a mira object (also built manually below with the expected list components)
obj <- as.mira(mod)
obj <- list(call = mod$analyses[[1]]$call, call1 = imp$call, nmis = imp$nmis, analyses = mod$analyses)
oldClass(obj) <- "mira"
pool(obj)
Produces:
Error in pool(obj) : Object has no vcov() method.
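For context, pool() applies Rubin's rules, which need coef() and vcov() from each analysis, and glmnet provides point estimates only, so there is no off-the-shelf variance to pool. A hedged sketch of what can at least be pulled from the fitted objects above (plain averaging of the per-imputation coefficients, which is not Rubin's-rules pooling):
# extract the penalized coefficients from each imputation's final glmnet fit
coef_list <- lapply(mod$analyses, function(fit) {
  as.matrix(coef(fit$finalModel, s = fit$bestTune$lambda))
})
# simple average across imputations; no within-imputation variance is available
pooled_coef <- Reduce(`+`, coef_list) / length(coef_list)
pooled_coef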

Related

Save out a caret prediction model and apply to external data in R

I have run a caret prediction model
fit <- train(outcome ~ ., data = training,
             method = 'glmnet',
             metric = "ROC",
             tuneLength = 5,
             trControl = fitControl)
fit
Now I want to apply that model to an out-of-sample (external) validation set. However, I do not have access to this data; I am sending the final models to a collaborator for them to apply to their own data.
I originally saved out the final model by:
combined_coef<-as.matrix(exp(coef(fit$finalModel, fit$bestTune$lambda)))
So it could be read in and applied to the new data:
fitValidation <- predict(fit, newdata = validation, type = "prob")
It wouldn't work on a data frame or a matrix, and when read in as a list, the error message was:
"Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"
So does it have to be the whole model fit object?
Is there an easier way to do that than save out and send the whole (massive) fit object?
Is there a way of only saving out the 'final model' (as above) and then applying this in the 'predict' call?
Thanks
As Sirius says, the best way to do this would be to just save the model object. It shouldn't be that large.
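For that route, a minimal sketch using saveRDS()/readRDS() (the file name is a placeholder):
saveRDS(fit, "fit.rds")        # on your side: serialize the whole train object
# on the collaborator's side:
fit <- readRDS("fit.rds")
fitValidation <- predict(fit, newdata = validation, type = "prob")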
However, in a pinch, the other option would be for your collaborator to score the model by hand, by multiplying the validation matrix against the vector of coefficients. The code below assumes you have a matrix validation in the same format as your model matrix (including a column of ones for the intercept) and coefficients as a vector on the log-odds scale (i.e. coef(), not exp(coef())). This calculation is for logistic regression, and given you are using ROC as your fit metric, this should be what you need.
# do the scoring via matrix multiplication: each column of validation
# is scaled by its coefficient
scores <- t(t(validation) * coefficients)
# sum by row to get the linear predictor (log-odds), exponentiate to get odds,
# then convert odds to probabilities
odds <- exp(rowSums(scores, na.rm = TRUE))
final_scores <- odds / (1 + odds)
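A hedged aside on building that validation matrix: model.matrix() with the same formula used in training adds the intercept column of ones automatically (validation_df is a placeholder name for the collaborator's raw data frame, and the column order must match the coefficient vector).
validation <- model.matrix(~ ., data = validation_df)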

R feature selection with LASSO

I have a small data set (37 observations x 23 features) and want to perform feature selection with LASSO regression in order to reduce its dimensionality. To achieve this, I designed the code below based on online tutorials.
#Load the libraries
library(mlbench)
library(elasticnet)
library(caret)
#Initialize cross validation and train LASSO
cv_5 <- trainControl(method="cv", number=5)
lasso <- train( ColumnY ~., data=My_Data_Frame, method='lasso', trControl=cv_5)
#Filter out the variables whose coefficients have shrunk to 0
library(dplyr)   # for %>% and select()
drop <- predict.enet(lasso$finalModel, type = 'coefficients',
                     s = lasso$bestTune$fraction, mode = 'fraction')$coefficients
drop <- drop[drop == 0] %>% names()
My_Data_Frame <- My_Data_Frame %>% select(-all_of(drop))
In most cases the code runs without errors but it occasionally throws the following:
Warning messages:
1: model fit failed for Fold2: fraction=0.9 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
I suspect this happens because my data has few rows and some variables have low variance.
Is there a way I can bypass or fix this issue (e.g. by setting a parameter in the workflow)?
You have a low number of observations, so there is a good chance that in some training folds some of your columns will be all zero or have very low variance. For example:
library(caret)
set.seed(222)
df = data.frame(ColumnY = rnorm(37),matrix(rbinom(37*23,1,p=0.15),ncol=23))
cv_5 <- trainControl(method="cv", number=5)
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=cv_5)
Warning messages:
1: model fit failed for Fold4: fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = 0, ...) :
Some of the columns of x have zero variance
Before running the code below, check that none of your categorical columns has only a single positive label (see the sketch that follows).
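A quick way to check is caret's nearZeroVar(), sketched here:
nzv <- nearZeroVar(df, saveMetrics = TRUE)   # per-column frequency/variance diagnostics
nzv[nzv$nzv, ]                               # inspect the flagged near-zero-variance columns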
One way is to increase the number of CV folds: with 5 folds you train on 80% of the data in each resample; try 10 to use 90% of the data:
cv_10 <- trainControl(method="cv", number=10)
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=cv_10)
And as you might have seen, since the dataset is so small, cross-validation might not offer much of an advantage; you can also do leave-one-out cross-validation:
tr <- trainControl(method="LOOCV")
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=tr)
You can use the FSinR package to perform feature selection. It is available on CRAN. It has a wide variety of filter and wrapper methods that you can combine with search methods. The interface for generating the wrapper evaluator follows the caret interface. For example:
# Load the library
library(FSinR)
# Choose one of the search methods
searcher <- searchAlgorithm('sequentialForwardSelection')
# Choose one of the filter/wrapper evaluators. The resampling and fitting params
# are optional and map to caret's trainControl() and train() parameters.
resamplingParams <- list(method = "cv", number = 5)
fittingParams <- list(preProc = c("center", "scale"), metric="Accuracy", tuneGrid = expand.grid(k = c(1:20)))
evaluator <- wrapperEvaluator('knn', resamplingParams, fittingParams)
# Run the feature selection (returns the best features)
results <- featureSelection(My_Data_Frame, 'ColumnY', searcher, evaluator)

Automate variable selection based on varimp in R

In R, I have a logistic regression model as follows
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(result ~ ., data = df,
                     trControl = train_control,
                     method = "glm",
                     family = binomial(link = "logit"))
calculatedVarImp <- varImp(logit_Model, scale = FALSE)
I use multiple datasets that run through the same code, so the variable importance changes for each dataset. Is there a way to get the names of the variables whose overall importance is less than n (e.g. 1), so I can automate the removal of those variables and rerun the model?
I was unable to get the information from the 'calculatedVarImp' variable by subsetting the 'Overall' value:
lowVarImp <- subset(calculatedVarImp , importance$Overall <1)
Also, is there a better way of doing variable selection?
Thanks in advance
You're using the caret package. Not sure if you're aware of this, but caret has a method for stepwise logistic regression using the Akaike Information Criterion: glmStepAIC.
It iteratively adds and/or drops predictors, refitting at each step, and stops at the model with the lowest AIC.
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(y ~ ., data = train_data,
                     trControl = train_control,
                     method = "glmStepAIC",
                     family = binomial(link = "logit"),
                     na.action = na.omit)
logit_Model$finalModel
This is as automated as it gets, but it may be worth reading this answer about the downsides to this method.
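As for subsetting the varImp object directly, a minimal sketch assuming the usual structure of caret's varImp() result (an $importance data frame whose row names are the model terms):
# model terms with overall importance below 1
imp_df <- calculatedVarImp$importance
lowVarImp <- rownames(imp_df)[imp_df$Overall < 1]
# note: for factor predictors these are dummy-coded term names,
# which may not match the original column names exactly
df_reduced <- df[, !(names(df) %in% lowVarImp)]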

caret: different RMSE on the same data

I think that my problem is quite weird. When I use the RMSE metric for best-model selection by the train function, I obtain a different RMSE value from the one computed by my own function on the same data. Where is the problem? Does my function work incorrectly?
library(caret)
library(car)
library(nnet)
data(oil)
ztest <- fattyAcids[81:96, ]
fit <- list(r1 = 1:80)
pred <- list(r1 = 81:96)
ctrl <- trainControl(method = "LGOCV", index = fit, indexOut = pred)
model <- train(Palmitic ~ Stearic + Oleic + Linoleic + Linolenic + Eicosanoic,
               fattyAcids,
               method = 'nnet',
               linout = TRUE,
               trace = FALSE,
               maxit = 10000,
               skip = FALSE,
               metric = "RMSE",
               tuneGrid = expand.grid(.size = c(10, 11, 12, 9), .decay = c(0.005, 0.001, 0.01)),
               trControl = ctrl,
               preProcess = c("range"))
model
forecast <- predict(model, ztest)
Blad <- function(zmienna, prognoza){
  RMSE <- ((sum((zmienna - prognoza)^2)) / length(zmienna))^(1/2)
  estymatory <- c(RMSE)
  names(estymatory) <- c('RMSE')
  estymatory
}
Blad(ztest$Palmitic,forecast)
The resampled estimates shown in the output of train are calculated using rows 81:96. Once train figures out the right tuning parameter settings, it refits using all the data (1:96). The model from that data is used to make the new predictions.
For this reason, the model performance
> getTrainPerf(model)
TrainRMSE TrainRsquared method
1 0.9230175 0.8364212 nnet
is worse than the other predictions:
> Blad(ztest$Palmitic,forecast)
RMSE
0.3355387
The predictions in forecast are created from a model that included those same data points, which is why it looks better.
Max
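If you want to reproduce the resampled RMSE yourself, a hedged sketch: ask trainControl to keep the holdout predictions, refit with that control, and score them with your own function.
ctrl <- trainControl(method = "LGOCV", index = fit, indexOut = pred,
                     savePredictions = "final")
# ... refit model with this control, then:
holdout <- model$pred                        # holdout predictions (rows 81:96) for the best tune
Blad(holdout$obs, holdout$pred)              # should be close to getTrainPerf(model)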

How to view singularities in model fitted in caret train in R

I've got a dataset that is 161 x 151 and I applied the following to it:
> ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10, savePred = T)
> model <- train(RT..seconds.~., data = cadets, method = "lm", trControl = ctrl)
For which I get in return
Coefficients: (82 not defined because of singularities)
I know this means that a lot of my variables are collinear and therefore not independent. So I want to look at the correlation matrix of my data, so I did:
cor(cadets, use="complete.obs", method ="kendall")
but the results, as you can imagine, were too big to fit on my R screen. Is there a way of viewing the correlation matrix so I can see which variables are collinear with one another? Furthermore, what can I do from here to improve the model if my variables are collinear? How do I overcome that?
Thanks
It's described in the pre-processing section of the caret manual (about halfway down the page):
http://caret.r-forge.r-project.org/preprocess.html
so for your cadets data it's something like (not tested):
cadetsCor <- cor(cadets)                                      # pairwise correlations (numeric columns only)
highlyCorCadets <- findCorrelation(cadetsCor, cutoff = 0.75)  # indices of highly correlated columns
cadets <- cadets[, -highlyCorCadets]                          # drop them
The other alternative is dimension reduction, e.g. PCA, but then your model may gain in predictive power while losing interpretability.
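For example, a minimal sketch of the PCA route using caret's built-in pre-processing (same lm model and ctrl as above; by default the components retained cover 95% of the variance):
model <- train(RT..seconds. ~ ., data = cadets, method = "lm",
               trControl = ctrl, preProcess = "pca")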
