Right now, I'm trying to use caret's rfe function to perform feature selection, because I'm in a situation with p >> n and most regression techniques that don't involve some sort of regularisation can't be used well. I have already tried a few techniques with regularisation (lasso), but what I want to try now is to reduce the number of features so that I can run, at least decently, any kind of regression algorithm on the data.
control <- rfeControl(functions=rfFuncs, method="cv", number=5)
model <- rfe(trainX, trainY, rfeControl=control)
predict(model, testX)
If I do it like this, a feature selection algorithm using random forest will be run, and then the model with the best set of features, according to the 5-fold cross-validation, will be used for the prediction, right?
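For reference, here is a minimal sketch of how the selected subset and the refit model could be inspected after such a call (model and testX are the objects from the code above; predictors() is caret's accessor for the winning variable set):

predictors(model)      # names of the variables in the best subset found by rfe
model$fit              # the final model refit on that subset
predict(model, testX)  # predictions from that final model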
I'm curious about two things here:
1) Is there an easy way to take the selected set of features and train a different model on it than the one used for the feature selection? For example, reducing the number of features from 500 to the 20 or so that seem most important and then applying k-nearest neighbours.
I'm imagining an easy way to do it that would look like this:
control <- rfeControl(functions=rfFuncs, method="cv", number=5)
model <- rfe(trainX, trainY, method = "knn", rfeControl=control)
predict(model, testX)
2) Is there a way to tune the parameters of the feature selection algorithm? I would like to have some control over the values of mtry, the same way you can pass a grid of values when you use caret's train function. Is there a way to do such a thing with rfe?
Here is a short example of how to perform rfe with a built-in model:
library(caret)
library(mlbench) # for the data
data(Sonar)

rctrl1 <- rfeControl(method = "cv",
                     number = 3,
                     returnResamp = "all",
                     functions = caretFuncs,
                     saveDetails = TRUE)

model <- rfe(Class ~ ., data = Sonar,
             sizes = c(1, 5, 10, 15),
             method = "knn",
             trControl = trainControl(method = "cv",
                                      classProbs = TRUE),
             tuneGrid = data.frame(k = 1:10),
             rfeControl = rctrl1)
model
#output
Recursive feature selection
Outer resampling method: Cross-Validated (3 fold)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
1 0.6006 0.1984 0.06783 0.14047
5 0.7113 0.4160 0.04034 0.08261
10 0.7357 0.4638 0.01989 0.03967
15 0.7741 0.5417 0.05981 0.12001 *
60 0.7696 0.5318 0.06405 0.13031
The top 5 variables (out of 15):
V11, V12, V10, V49, V9
model$fit$results
#output
k Accuracy Kappa AccuracySD KappaSD
1 1 0.8082684 0.6121666 0.07402575 0.1483508
2 2 0.8089610 0.6141450 0.10222599 0.2051025
3 3 0.8173377 0.6315411 0.07004865 0.1401424
4 4 0.7842208 0.5651094 0.08956707 0.1761045
5 5 0.7941775 0.5845479 0.07367886 0.1482536
6 6 0.7841775 0.5640338 0.06729946 0.1361090
7 7 0.7932468 0.5821317 0.07545889 0.1536220
8 8 0.7687229 0.5333385 0.05164023 0.1051902
9 9 0.7982468 0.5918922 0.07461116 0.1526814
10 10 0.8030087 0.6024680 0.06117471 0.1229467
For more customization, see:
https://topepo.github.io/caret/recursive-feature-elimination.html
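Regarding the second question: because caretFuncs hands rfe()'s extra arguments through to train(), a tuneGrid for mtry can be supplied in the same way when the inner model is a random forest. A minimal sketch (the grid values are illustrative, not recommendations):

rctrl2 <- rfeControl(method = "cv",
                     number = 3,
                     functions = caretFuncs)

rf_model <- rfe(Class ~ ., data = Sonar,
                sizes = c(1, 5, 10, 15),
                method = "rf",
                tuneGrid = data.frame(mtry = c(2, 5, 10)),  # grid for the inner random forest
                rfeControl = rctrl2)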
This is starting to confuse me a bit. Take, for example, the following code, which trains a GLM model:
glm_sens = train(
form = target ~ .,
data = ABT,
trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, classProbs = TRUE, summaryFunction = twoClassSummary, savePredictions = TRUE),
method = "glm",
family = "binomial",
metric = "Sens"
)
I expected that this would train a few models and then select the one that performs best on sensitivity. Yet when I read up on cross-validation, most of what I find is about how it is used to calculate average performance scores.
Was my assumption wrong?
caret does train different models, but normally this is done with different hyper-parameters. You can check out an explanation of the process. Hyper-parameters cannot be learned directly from the data, so you need the training process. These parameters decide how your model will behave; for example, lasso has lambda, which decides how much regularization is applied to the model.
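As a concrete check on the code in the question, the cross-validation results live on the fitted object itself (these are standard components of a caret train object):

glm_sens$results    # a single row: Sens, Spec and ROC averaged over the 5 x 10 resamples
glm_sens$resample   # the per-resample values those averages come from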
In a glm, there is no hyper-parameter to tune. I guess what you are looking for is something to select the best possible linear model out of the many potential variables. You can use step():
fit <- lm(mpg ~ ., data = mtcars)
step(fit, direction = "backward")
Another option is to use leaps with caret; for example, an equivalent of the above would be:
train(mpg ~ ., data = mtcars, method = 'leapBackward',
      trControl = trainControl(method = "cv", number = 10),
      tuneGrid = data.frame(nvmax = 2:6))
Linear Regression with Backwards Selection
32 samples
10 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 30, 28, 28, 28, 30, 28, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 3.299712 0.9169529 2.783068
3 3.124146 0.8895539 2.750305
4 3.249803 0.8849213 2.853777
5 3.258143 0.8939493 2.823721
6 3.123481 0.8917197 2.723475
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 6.
You can check out more about variable selection using leaps on this website.
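To see which variables end up in the selected subset, you can also query the underlying regsubsets object that leapBackward fits (a sketch; leap_fit is simply a name for the train object from the call above, and id = 6 matches the chosen nvmax):

leap_fit <- train(mpg ~ ., data = mtcars, method = 'leapBackward',
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = data.frame(nvmax = 2:6))

coef(leap_fit$finalModel, id = 6)  # coefficients of the best 6-variable subset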
Here's my question:
I have a medium-sized data set about the condition of a hydraulic system.
The data set consists of 68 variables plus the condition of the system (green, yellow, red).
I have to use several classifiers to predict the behaviour of the system, so I have divided my data set into a training and a test set as follows.
(Talking about the conditions, the colours mean: red = warning, yellow = pay attention, green = good.)
This is what I wrote:
Tab$Condition=factor(Tab$Condition, labels=c("Yellow","Green","Red"))
set.seed(32343)
reg_Control = trainControl("repeatedcv", number = 5, repeats=5, verboseIter = T, classProbs =T)
inTrain = createDataPartition(y=Tab$Condition,p=0.75, list=FALSE)
training = Tab[inTrain,]
testing = Tab[-inTrain,]
I'm using a SVM linear classifier to predict the behaviour of the system.
I started by using a random value for C to see what kind of results I should get.
svmLinear = train(Condition ~ ., data = training, method = "svmLinear",
                  trControl = reg_Control, tuneGrid = data.frame(C = seq(0.1, 1, 0.1)))
svmLPredictions = predict(svmLinear,newdata=training)
confusionMatrix(svmLPredictions,training$Condition)
#misclassification of 129/1655 accuracy of 92.21%
svmLPred = predict(svmLinear,newdata=testing)
confusionMatrix(svmLPred,testing$Condition)
#misclassification of 41/550 accuracy of 92.55%
As I said before, I started with a random value for C.
How do I then decide on the best value of C to use for the analysis?
Sorry if the question is basic, but I'm a beginner!
Answers will be helpful!
Thanks
caret calls other packages to run the actual modelling process; caret itself is only a (very powerful) convenience package in this regard. However, it does this automatically, so a user might not realize it unless an error is thrown.
Anyway, I have cobbled together an example to explain the process.
library(caret)
data("iris")

set.seed(1024)
tr <- createDataPartition(iris$Species, list = FALSE)
training <- iris[ tr, ]
testing  <- iris[-tr, ]
#head(training)

fitControl <- trainControl(method = "repeatedcv",  # smaller values for a quick run
                           number = 5,
                           repeats = 4)

set.seed(1024)
tunegrid <- data.frame(C = c(0.25, 0.5, 1, 5, 8, 12, 100))
tunegrid

svmfit <- train(Species ~ ., data = training,
                method = "svmLinear",
                trControl = fitControl,
                tuneGrid = tunegrid)
#print this, it will give model's accuracy (on train data) given various
# parameter values
svmfit
#C Accuracy Kappa
#0.25 0.9533333 0.930
#0.50 0.9666667 0.950
#1.00 0.9766667 0.965
#5.00 0.9800000 0.970
#8.00 0.9833333 0.975
#12.00 0.9833333 0.975
#100.00 0.9400000 0.910
#The final value used for the model was C = 8.
# it has already chosen the best model (as per train Accuracy )
# how well does it work on test data?
preds <-predict(svmfit, testing)
cmSVM <-confusionMatrix(preds, testing$Species)
print(cmSVM)
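Applied to the hydraulic-system model from the question, the same idea means looking at the resampled accuracy per value of C rather than only at a confusion matrix on the training data. A short sketch using standard train accessors (svmLinear is the object from the question's code):

svmLinear            # resampled accuracy for each value of C in the tuneGrid
svmLinear$bestTune   # the value of C that caret selected
plot(svmLinear)      # accuracy profile across the grid of C values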
I'm trying to implement some functions to compare five different machine learning models for predicting values in a regression problem.
My intention is to build a suite of functions that train the different models and organize the results. The models I selected are: lasso, random forest, SVM, linear model and neural network. To tune some of the models I intend to use the references of Max Kuhn: https://topepo.github.io/caret/available-models.html.
However, since each model requires different tuning parameters, I'm not sure how to set them:
First I set up the grid for tuning the 'nnet' model. Here I selected different numbers of nodes in the hidden layer and different decay coefficients:
my.grid <- expand.grid(size=seq(from = 1, to = 10, by = 1), decay = seq(from = 0.1, to = 0.5, by = 0.1))
Then I construct the function that will run each of the five models 5 times in a 6-fold cross-validation configuration:
my_list_model <- function(model) {
  set.seed(1)
  train.control <- trainControl(method = "repeatedcv",
                                number = 6,
                                repeats = 5,
                                returnResamp = "all",
                                savePredictions = "all")
  # The tuning configurations of the machine learning models:
  set.seed(1)
  fit_m <- train(ST1 ~ .,
                 data = train,        # my original dataframe, not shown in this code
                 method = model,
                 metric = "RMSE",
                 preProcess = "scale",
                 trControl = train.control,
                 linout = 1,          # linear activation function for the output
                 trace = FALSE,
                 maxit = 1000,
                 tuneGrid = my.grid)  # Here is how I pass the tuning grid of the 'nnet' parameters
  return(fit_m)
}
Lastly, I execute the five models:
lapply(list(
Lass = "lasso",
RF = "rf",
SVM = "svmLinear",
OLS = "lm",
NN = "nnet"),
my_list_model) -> model_list
However, when I run this, it shows:
Error: The tuning parameter grid should not have columns fraction
From what I understand, I didn't specify the tuning parameters correctly. If I throw away the 'nnet' model and change it, for example, to an XGBoost model in the penultimate line, it seems to work well and results are calculated. That is, the problem seems to be with the 'nnet' tuning parameters.
So I think my real question is: how do I configure these different model parameters, in particular those of the 'nnet' model? In addition, since I didn't need to set up the parameters of lasso, random forest, svmLinear and the linear model, how were they tuned by the caret package?
my_list_model <- function(model, grd = NULL) {
  train.control <- trainControl(method = "repeatedcv",
                                number = 6,
                                returnResamp = "all",
                                savePredictions = "all")
  # The tuning configurations of the machine learning models:
  set.seed(1)
  fit_m <- train(Y ~ .,
                 data = df,           # my original dataframe, not shown in this code
                 method = model,
                 metric = "RMSE",
                 preProcess = "scale",
                 trControl = train.control,
                 linout = 1,          # linear activation function for the output
                 trace = FALSE,
                 maxit = 1000,
                 tuneGrid = grd)      # Here is how the tuning grid is passed
  return(fit_m)
}
First, run the code below to see all the related tuning parameters:
modelLookup('rf')
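The same lookup works for the other models in the question and tells you which column names each tuneGrid must have (the 'fraction' parameter of caret's lasso model is what the error message above is complaining about):

modelLookup('svmLinear')  # tuning parameter: C
modelLookup('nnet')       # tuning parameters: size, decay
modelLookup('lasso')      # tuning parameter: fraction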
Now make a grid for each model based on the lookup above:
svmGrid <- expand.grid(C=c(3,2,1))
rfGrid <- expand.grid(mtry=c(5,10,15))
Create a list of all the models' grids and make sure the names in the list match the model names:
grd_all<-list(svmLinear=svmGrid
,rf=rfGrid)
model_list<-lapply(c("rf","svmLinear")
,function(x){my_list_model(x,grd_all[[x]])})
model_list
[[1]]
Random Forest
17 samples
3 predictor
Pre-processing: scaled (3)
Resampling: Cross-Validated (6 fold, repeated 1 times)
Summary of sample sizes: 14, 14, 15, 14, 14, 14, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
5 63.54864 0.5247415 55.72074
10 63.70247 0.5255311 55.35263
15 62.13805 0.5765130 54.53411
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 15.
[[2]]
Support Vector Machines with Linear Kernel
17 samples
3 predictor
Pre-processing: scaled (3)
Resampling: Cross-Validated (6 fold, repeated 1 times)
Summary of sample sizes: 14, 14, 15, 14, 14, 14, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
1 59.83309 0.5879396 52.26890
2 66.45247 0.5621379 58.74603
3 67.28742 0.5576000 59.55334
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was C = 1.
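To bring the 'nnet' model from the question into the same pattern, its size/decay grid can simply be added to the list under the model's name (a sketch reusing my.grid from the question; the nnet-specific arguments linout, trace and maxit are already passed inside my_list_model):

grd_all <- list(svmLinear = svmGrid,
                rf        = rfGrid,
                nnet      = my.grid)   # the size/decay grid defined in the question

model_list <- lapply(c("rf", "svmLinear", "nnet"),
                     function(x) my_list_model(x, grd_all[[x]]))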
Can someone provide a detailed example of using caret's rfe function with the glm or glmnet model? I tried something like this:
rfe_records <- Example_data_frame
rfe_ctrl <- rfeControl(functions = caretFuncs, method = "repeatedcv", repeats = 5, verbose = TRUE, classProbs = TRUE, summaryFunction = twoClassSummary)
number_predictors <- dim(rfe_records)[2]-1
x <- dplyr::select(rfe_records, -outcomeVariable)
y <- as.numeric(rfe_records$outcomeVariable)
glmProfile <- rfe(x, y, rfeControl = rfe_ctrl, sizes = c(1:number_predictors), method="glmnet", preProc = c("center", "scale"), metric = "Accuracy")
print(glmProfile)
But the results I'm getting are not what I need. I specified Accuracy as the metric, but I got:
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables RMSE Rsquared RMSESD RsquaredSD Selected
1 0.5047 0.10830 0.04056 0.11869 *
2 0.5058 0.09386 0.04728 0.11332
3 0.5117 0.08565 0.04999 0.10211
4 0.5139 0.07490 0.05042 0.10048
5 0.5166 0.07678 0.05456 0.09966
6 0.5202 0.08203 0.06174 0.10822
7 0.5187 0.08471 0.06207 0.10893
8 0.5168 0.07850 0.05939 0.09697
9 0.5175 0.08228 0.05966 0.10068
10 0.5176 0.08180 0.05980 0.10042
11 0.5179 0.08015 0.05950 0.09905
The top 1 variables (out of 1):
varName
According to this page, caret uses the class of the outcome variable to determine whether to run regression or classification with a function like glmnet that can do either. In your code, you made the outcome variable numeric with as.numeric(), so glmnet did regression, not classification as you intended. Specify your outcome variable as a two-level factor to get classification instead.
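A minimal sketch of that fix, reusing the objects from the question and assuming the outcome has exactly two classes (the level names "No"/"Yes" are illustrative placeholders; with a factor outcome, the default summary reports Accuracy and Kappa, so the Accuracy metric applies):

y <- factor(rfe_records$outcomeVariable, labels = c("No", "Yes"))  # two-level factor instead of numeric

glmProfile <- rfe(x, y,
                  rfeControl = rfe_ctrl,
                  sizes = 1:number_predictors,
                  method = "glmnet",
                  preProc = c("center", "scale"),
                  metric = "Accuracy")
print(glmProfile)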
I'm trying to use caret to fit a PLS model while optimizing the number of components 'ncomp':
library("caret")
set.seed(342)
train <- as.data.frame ( matrix( rnorm(1e4) , 100, 100 ) )
ctrl <- rfeControl(functions = caretFuncs,
                   method = "repeatedcv",
                   number = 2,
                   repeats = 1,
                   verbose = TRUE)

pls.fit.rfe <- rfe(V1 ~ .,
                   data = train,
                   method = "pls",
                   sizes = 6,
                   tuneGrid = data.frame(ncomp = 7),
                   rfeControl = ctrl)
Error in { :
task 1 failed - "final tuning parameters could not be determined"
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Invalid number of components, ncomp
Setting sizes to 6 fixes the problem. It makes sense that I get an error when min(sizes) < max(ncomp), but is there a way to vary ncomp depending on the number of features used in the RFE iteration, i.e. the sizes variable? I would simply like to optimize over a wide range of sizes and #components at the same time.
Try using tuneLength = 7 instead of tuneGrid. The former is more flexible and will use an appropriate ncomp given the size of the data set:
> pls.fit.rfe
Recursive feature selection
Outer resampling method: Cross-Validated (2 fold, repeated 1 times)
Resampling performance over subset size:
Variables RMSE Rsquared RMSESD RsquaredSD Selected
6 1.0229 0.01684 0.04192 0.0155092
99 0.9764 0.00746 0.01096 0.0008339 *
The top 5 variables (out of 99):
If you'd rather not do that, you can always write your own fit function too.
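For instance, a minimal sketch of such a custom fit function, assuming the goal is to cap ncomp at the number of predictors in the current RFE subset (the cap of 7 mirrors the tuneLength above):

customFuncs <- caretFuncs
customFuncs$fit <- function(x, y, first, last, ...) {
  # never ask pls for more components than there are predictors in this subset
  train(x, y, method = "pls",
        tuneGrid = data.frame(ncomp = 1:min(7, ncol(x))),
        ...)
}

ctrl2 <- rfeControl(functions = customFuncs,
                    method = "repeatedcv",
                    number = 2,
                    repeats = 1)

pls.fit.rfe2 <- rfe(V1 ~ ., data = train, sizes = 6, rfeControl = ctrl2)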
Max