R feature selection with LASSO - r

I have a small data set (37 observations x 23 features) and want to perform feature selection with LASSO regression in order to its reduce dimensionality. To achieve this, I designed the below code based on online tutorials
#Load the libraries
library(mlbench)
library(elasticnet)
library(caret)
#Initialize cross validation and train LASSO
cv_5 <- trainControl(method="cv", number=5)
lasso <- train( ColumnY ~., data=My_Data_Frame, method='lasso', trControl=cv_5)
#Filter out the variables whose coefficients have squeezed to 0
drop <-predict.enet(lasso$finalModel, type='coefficients', s=lasso$bestTune$fraction, mode='fraction')$coefficients
drop<-drop[drop==0]%>%names()
My_Data_Frame<- My_Data_Frame%>%select(-drop)
In most cases the code runs without errors but it occasionally throws the following:
Warning messages:
1: model fit failed for Fold2: fraction=0.9 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
I sense this happens because my data has few rows and some variables have low variance.
Is there a way I can bypass or fix this issue (e.g. setting a parameter in the flow)?

You have a low number of observations, so there's a good chance in some training set, that some of your columns will be all zero, or very low variance. For example:
library(caret)
set.seed(222)
df = data.frame(ColumnY = rnorm(37),matrix(rbinom(37*23,1,p=0.15),ncol=23))
cv_5 <- trainControl(method="cv", number=5)
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=cv_5)
Warning messages:
1: model fit failed for Fold4: fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = 0, ...) :
Some of the columns of x have zero variance
Before running below, check that for categorical columns, all of them don't have only 1 positive label..
One way is to increase the cv fold, if you set 5, you are using 80% of the data. Try 10 to use 90% of the data:
cv_10 <- trainControl(method="cv", number=10)
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=cv_10)
And as you might have seen.. since the dataset is so small, cross-validation might not offer you that much advantage, you can also do leave one out cross-validation:
tr <- trainControl(method="LOOCV")
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=tr)

You can use the FSinR package to perform feature selection. It is in R and accessible from CRAN. It has a wide variety of filter and wrapper methods that you can combine with search methods. The interface to generate the wrapper evaluator follows the caret interface. For example:
# Load the library
library(FSinR)
# Choose one of the search methods
searcher <- searchAlgorithm('sequentialForwardSelection')
# Choose one of the filter/wrapper evaluators (You can remove the fitting and resampling params if you want to make it simpler)(These are the parameters of the train and trainControl of caret)
resamplingParams <- list(method = "cv", number = 5)
fittingParams <- list(preProc = c("center", "scale"), metric="Accuracy", tuneGrid = expand.grid(k = c(1:20)))
evaluator <- wrapperEvaluator('knn', resamplingParams, fittingParams)
# You make the feature selection (returns the best features)
results <- featureSelection(My_Data_Frame, 'ColumnY', searcher, evaluator)

Related

Pooled Regression Results using mice, caret, and glmnet

Not sure if this more of a statistics question but the closest similar problem I could find is here, although I couldn't get it to work for my case.
I am trying to develop a pooled, penalized logistic regression model. I used mice to create a mids object and then fit a model to each dataset using caret repeated cross-validation with elastic net regression (glmnet) to tune parameters. The fitted object is not of class "mira" but I think I fixed that by changing the object class with the right list items. The major issue is that glmnet does not have an associated vcov method, which is required by pool().
I would like to use penalized regression based on the amount of variables and uncertainty over which ones are the best predictors. My data consists of 4x numeric variables and 9x categorical variables of varying levels and I anticipate including interactions.
Does anyone know how I might be able to create my own vcov method or otherwise address this issue? I am not sure if this is possible.
Example data and code are enclosed, noting that I am not able to share the actual data.
library(mice)
library(caret)
dat <- as.data.frame(list(time=c(4,3,1,1,2,2,3,5,2,4,5,1,4,3,1,1,2,2,3,5,2,4,5,1),
status=c(1,1,1,0,2,2,0,0,NA,1,2,0,1,1,1,NA,2,2,0,0,1,NA,2,0),
x=c(0,2,1,1,NA,NA,0,1,1,2,0,1,0,2,1,1,NA,NA,0,1,1,2,0,1),
sex=c("M","M","M","M","F","F","F","F","M","F","F","M","F","M","M","M","F","F","M","F","M","F","M","F")))
imp <- mice(dat,m=5, seed=192)
control = trainControl(method = "repeatedcv",
number = 10,
repeats=3,
verboseIter = FALSE)
mod <- list(analyses=vector("list", imp$m))
for(i in 1:imp$m){
mod$analyses[[i]] <- train(sex ~ .,
data = complete(imp, i),
method = "glmnet",
family="binomial",
trControl = control,
tuneLength = 10,
metric="Kappa")
}
obj <- as.mira(mod)
obj <- list(call=mod$analyses[[1]]$call, call1=imp$call, nmis=imp$nmis, analyses=mod$analyses)
oldClass(obj) <- "mira"
pool(obj)
Produces:
Error in pool(obj) : Object has no vcov() method.

Caret - Factor issue in multi class classification

I want to perform a multi-class classification in the caretpackage. Below is a minimum example.
library(caret)
library(randomForest)
x <- data.frame("A"=seq(1,100), "B"=seq(1,100), "C"="class1")
x[,"C"] <- as.character(x[,"C"])
x[1,"C"] <- "class2"
x[2,"C"] <- "class3"
x[3,"C"] <- "class4"
x[4,"C"] <- "class5"
x[5,"C"] <- "class6"
x[6,"C"] <- "class7"
x[7,"C"] <- "class8"
x[8,"C"] <- "class9"
x[9,"C"] <- "class10"
x[10,"C"] <- "class11"
x[11,"C"] <- "class12"
x[,"C"] <- as.factor(x[,"C"])
control <- trainControl(method="repeatedcv", number=10, repeats=1, search="grid") set.seed(5) tunegrid <- expand.grid(.mtry=c(1:2)) fit <- train(x=x[,1:2], y=x$C, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(fit)
plot(fit)
When running the code I get an error stating 1: model fit failed for Fold2.Rep1: mtry=1 Error in randomForest.default(x, y, mtry = param$mtry, ...) :
Can't have empty classes in y.
Related posts suggest that this is due to unaccounted factors in the response variable - which is not taken account of in resampling. Typically, one runs into the problem, if there is a higher number of classes to be predicted (and little observations).
Is there any workaround to change the caret package such that the missing factors are removed in the resampling methods (e.g., by droplevels())?

R caret / Confusion matrix

I'd like to display the confusion matrix after a train() of the caret library, but I have some doubts. The "train()" should be on a train set ?(I'm not sure because of the "control" parameter). The "predict()" on the test set ? It seems weird to predict on the whole data set...
# df_corpus = Document Term Matrix + 1 column of Cos.code(class which are 203.2.2, 204.3.2 ...)
dataset <- df_corpus
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
seed <- 7
metric <- "Accuracy"
preProcess=c("center", "scale")
# Linear Discriminant Analysis
set.seed(seed)
fit.lda <- train(Cos.code~., data=dataset, method="lda", metric=metric,preProc=c("center", "scale"), trControl=control)
ldaClasses <- predict(fit.lda)
cm <- confusionMatrix(data = ldaClasses, dataset$Cos.code)
F1_score(cm$table, "lda")
Thank you for your help
You can get the confusion matrix like this:
confusionMatrix(predict(fit.lda,dataset$Cos.code),dataset$Cos.code)
You can calculate the confusion matrix in the same manner for your testing set, just switch the datasets.
But I believe your model should contain already the information that you want
Examine the information given when printing these two objects.
fit.lda
fit.lda$finalModel

R-Caret, caretList, The metric "Accuracy" was not in the result set

Trying to learn r-Caret and caretList.
I am trying to follow the tutorial caretEnsemble Classification example
I have encountered a few errors and searched how to fix some of the basic set up.
However, I am getting the error:
Warning messages:
1: In train.default(x, y, weights = w, ...) :
The metric "Accuracy" was not in the result set. ROC will be used instead.
2: In train.default(x, y, weights = w, ...) :
The metric "Accuracy" was not in the result set. ROC will be used instead.
My setup is:
#Libraries
library(caret)
library(devtools)
library(caretEnsemble)
#Data
library(mlbench)
dat <- mlbench.xor(500, 2)
X <- data.frame(dat$x)
Y <- factor(ifelse(dat$classes=='1', 'Yes', 'No'))
#Split train/test
train <- runif(nrow(X)) <= .66
#Setup CV Folds
#returnData=FALSE saves some space
folds=5
repeats=1
myControl <- trainControl(method='cv',
number=folds,
repeats=repeats,
returnResamp='none',
classProbs=TRUE,
returnData=FALSE,
savePredictions=TRUE,
verboseIter=TRUE,
allowParallel=TRUE,
summaryFunction=twoClassSummary,
index=createMultiFolds(Y[train],
k=folds,
times=repeats)
)
#Make list of all models
all.models<-caretList(Y~., data=X, trControl=myControl, methodList=c("blackboost", "parRF"))
I edited the section of "train all models" using caretList so that it will work with caretEnsemble and caretStack further down the code (link provided above).
How do I get the accuracies so that I can use them in caretEnsemble and caretStack?
I assume you would like to use 'Accuracy' as the summary metric that should be used to select the optimal base learner models across their resamples and the metalearner later on via caretEnsemble or caretStack.
In this case you must not set summaryFunction = twoClassSummary in trainControl because like this train will use 'ROC' as the performance metric and not 'Accuracy'. Instead you should go with the default setting for summaryFunction (That means you do not have to specify it explicitly in trainControl). Like this train which is called via caretList will automatically use 'Accuracy' as the performance metric because of the categorical response.
In addition, there a few other things to note:
You should not set returnResamp = FALSE in trainControl. Because when you do, you won't be able to compare the model's individual accuracies later via summary(resamples(model.list))
Even though you created an index to separate the data into a train and test set you don't use it when passing the data to caretList. The correct caretList call should begin like this caretList(Y[train] ~ ., data=X[train, ], ...
The tutorial you mentioned above is a bit outdated. You should also check out the package's current vignette and this tutorial from MachineLearningMastery. The latter also uses "Accuracy" as the performance metric in its example

Obtaining training Error using Caret package in R

I am using caret package in order to train a K-Nearest Neigbors algorithm. For this, I am running this code:
Control <- trainControl(method="cv", summaryFunction=twoClassSummary, classProb=T)
tGrid=data.frame(k=1:100)
trainingInfo <- train(Formula, data=trainData, method = "knn",tuneGrid=tGrid,
trControl=Control, metric = "ROC")
As you can see, I am interested in obtain the AUC parameter of the ROC. This code works good but returns the testing error (which the algorithm uses for tuning the k parameter of the model) as the mean of the error of the CrossValidation folds. I am interested in return, in addition of the testing error, the training error (the mean across each fold of the error obtained with the training data). ¿How can I do it?
Thank you
What you are asking is a bad idea on multiple levels. You will grossly over-estimate the area under the ROC curve. Consider the 1-NN model: you will have perfect predictions every time.
To do this, you will need to run train again and modify the index and indexOut objects:
library(caret)
set.seed(1)
dat <- twoClassSim(200)
set.seed(2)
folds <- createFolds(dat$Class, returnTrain = TRUE)
Control <- trainControl(method="cv",
summaryFunction=twoClassSummary,
classProb=T,
index = folds,
indexOut = folds)
tGrid=data.frame(k=1:100)
set.seed(3)
a_bad_idea <- train(Class ~ ., data=dat,
method = "knn",
tuneGrid=tGrid,
trControl=Control, metric = "ROC")
Max

Resources