How to deactivate embedded feature selection in caret package? - r

I am writing a machine learning code using caret package in R. A sample of code could be
weighted_fit <- train(outcome,
data = train,
method = 'glmnet',
trControl = ctrl)
As you know, some methods in caret package have built-in feature selection such as elastic net. My question is that is there any way to deactivate the built in feature selection in this code?
Thanks in advance for any comment.

#I will try to answer this question to the best of my ability:
#The train function in caret package comes with a parameter tuneGrid which can be used to create a data-frame of tuning parameters.
#The tuning parameter of elastic net regularization in glmnet() is alpha, so create the following:
glmgrid <- expand.grid(alpha = 0) will give ridge regularization.
glmgrid <- expand.grid(alpha = 1) will give lasso regularization.
#and then use
weighted_fit <- train(outcome,
data = train,
method = 'glmnet',
trControl = ctrl,
tuneGrid = glmgrid)
#In glmnet in r , the alpha values can be in the range [0,1] i.e. 0 to 1 including 0 and 1.
# GLMNET - https://www.rdocumentation.org/packages/glmnet/versions/2.0-18/topics/glmnet
# CARET - https://topepo.github.io/caret/index.html

Related

Making caret train rf faster when ranger is not an option

The website I am trying to run the code is using an old version of R and does not accept ranger as the library. I have to use the caret package. I am trying to process about 800,000 lines in my train data frame and here is the code I use
control <- trainControl(method = 'repeatedcv',
number = 3,
repeats = 1,
search = 'grid')
tunegrid <- expand.grid(.mtry = c(sqrt(ncol(train_1))))
fit <- train(value~.,
data = train_1,
method = 'rf',
ntree = 73,
tuneGrid = tunegrid,
trControl = control)
Looking at previous posts, I tried to tune my control parameters, is there any way I can make the model run faster? Am I able to specify a specific setting so that it just generates a model with the parameters I set, and not try multiple options?
This is my code from ranger which I optimized and currently having accurate model
fit <- ranger(value ~ .,
data = train_1,
num.trees = 73,
max.depth = 35,mtry = 7,importance='impurity',splitrule = "extratrees")
Thank you so much for your time
When you specify method='rf', caret is using the randomForest package to build the model. If you don't want to do all the cross-validation that caret is useful for, just build your model using the randomForest package directly. e.g.
library(randomForest)
fit <- randomForest(value ~ ., data=train_1)
You can specify values for ntree, mtry etc.
Note that the randomForest package is slow (or just won't work) for large datasets. If ranger is unavailable, have you tried the Rborist package?

cv.glmnet and Leave-one out CV

I'm trying to use the function cv.glmnet to find the best lambda (using the RIDGE regression) in order to predict the class of belonging of some objects.
So the code that I have used is:
CVGLM<-cv.glmnet(x,y,nfolds=34,type.measure = "class",alpha=0,grouped = FALSE)
actually I'm not using a K-fold cross validation because my size dataset is too small, in fact I have only 34 rows. So, I'm using in nfolds the number of my rows, to compute a Leave-one out CV.
Now, I have some questions:
1) First of all: Does cv.glmnet function tune the Hyperpameter lambda or also test the "final model"?
2)One time got the best lambda, what have I to do? Have I to use predict function?
If yes, which data I have to use if I use all data to find lambda since I have used LOO CV?
3)How can I calculate R^2 from cv.glmnet function?
Here is an attempt to answer your questions:
1) cv.glmnet tests the performance of each lambda by using the cross validation of your specification. Here is an example:
library(glmnet)
data(iris)
find best lambda for iris prediction:
CVGLM <- cv.glmnet(as.matrix(iris[,-5]),
iris[,5],
nfolds = nrow(iris),
type.measure = "class",
alpha = 0,
grouped = FALSE,
family = "multinomial")
the miss classification error of best lambda is in
CVGLM$cvm
#output
0.06
If you test this independently using LOOCV and best lambda:
z <- lapply(1:nrow(iris), function(x){
fit <- glmnet(as.matrix(iris[-x,-5]),
iris[-x,5],
alpha = 0,
lambda = CVGLM$lambda.min,
family="multinomial")
pred <- predict(fit, as.matrix(iris[x,-5]), type = "class")
return(data.frame(pred, true = iris[x,5]))
})
z <- do.call(rbind, z)
and check the error rate it is:
sum(z$pred != z$true)/150
#output
0.06
so it looks like there is no need to test the performance using the same method as in cv.glmnet since it will be the same.
2) when you have the optimal lambda you should fit a model on the whole data set using glmnet function. What you do after with the model is entirely up to you. Most people train a model to predict something.
3) what is R^2 for a classification problem? If you could explain that then you could calculate it.
R^2 = Explained variation / Total variation
what is this in terms of classes?
Anyhow R^2 is not used for classification but rather AUC, deviance, accuracy, balanced accuracy, kappa, joudens J and so on - most of these are used for binary classification but some are available for multinomial.
I suggest this as further reading

Activation Function NNET

I've created a neural network using caret and nnet.
Now, I need to deploy the NN in oracle for production.
I already have the weights for each input and hidden layers.
However, I'm not sure which was the activation function used. Is there any way to know? Or do you know an easier way to do this task?
This is code I used to develop the NN:
fitControl <- trainControl(
method = "cv",
number = 10)
nnet <- train(as.factor(TARGET) ~.-CONTRACT-MONTH-CUSTOMER,
data=stab_train_opt_norm,
method = "nnet",
trControl = fitControl)
From the nnet documentation in R:
One could use either a linear or a logistic activation/transfer fucntion in nnet. This could be specified by 'linout = TRUE' for linear and 'linout = FALSE' for logistic activation function. Default is logistic.

Pass PCA preprocessing arguments to train()

I'm trying to build a predictive model in caret using PCA as pre-processing. The pre-processing would be as follows:
preProc <- preProcess(IL_train[,-1], method="pca", thresh = 0.8)
Is it possible to pass the thresh argument directly to caret's train() function? I've tried the following, but it doesn't work:
modelFit_pp <- train(IL_train$diagnosis ~ . , preProcess="pca",
thresh= 0.8, method="glm", data=IL_train)
If not, how can I pass the separate preProc results to the train() function?
As per the documentation, you specify additional preprocessing arguments with trainControl
?trainControl
...
preProcOptions
A list of options to pass to preProcess. The type of pre-processing
(e.g. center, scaling etc) is passed in via the preProc option in train.
...
Since your dataset is not reproducible, let's look at an example. I will use the Sonar dataset from mlbench and use the pls algorithm just for fun.
library(caret)
library(mlbench)
data(Sonar)
ctrl <- trainControl(preProcOptions = list(thresh = 0.95))
mod <- train(Class ~ .,
data = Sonar,
method = "pls",
trControl = ctrl)
Although documentation isn't the most exciting read, definitely make sure to try to go through it. Package authors work hard to create documentation and there are many wonders to be found within.

caret::train: specify training data parameters

I am designing a neural network model that predicts estimation of van genuchten water retention parameters (theta_r, thera_s, alpha, n) using limited to more extended input data like texture, bulk density, and one or two water retention. Investigating neural networks in R project I found RSNNS package and I create and train multiple multi-layer perceptrons (MLPs) with tuning on the number of hidden units and the learning rate. The general performance characterized with training and testing RMSEs for these models is really poor and random, in fact, i used log-transformed values of alpha and n parameters to avoid bias and account for their approximately lognormal distributions but this does not help much :(. I was recommended to work with nnet and caret package but I've had trouble adapting the code i don't know what I'm doing wrong, any suggestion?
#input dataset
basic <- read.table(url("https://dl.dropboxusercontent.com/s/m8qe4k5swz1m3ij/basic.txt?dl=1&token_hash=AAH6Z3d6fWTLoQZYi04Ys72sdufdERE5gm4v7eF0cgMlkQ"), header=T, sep=" ")
#output dataset
fitted <- read.table(url("https://dl.dropboxusercontent.com/s/rjx745ej80osbbu/fitted.txt?dl=1&token_hash=AAHP1zcPQyw4uSe8rw8swVm3Buqe3TP7I1j-4_SOeeUTvw"), header=T, sep=" ")
# Use log-transformed values of alpha and n output parameters
fitted$alpha <- log(fitted$alpha)
fitted$n <- log(fitted$n)
#Fit model with caret package
library(caret)
model <- train(x = basic, y = fitted, method='nnet', linout=TRUE, trace = FALSE,
#Grid of tuning parameters to try:
tuneGrid=expand.grid(.size=c(1,5,10),.decay=c(0,0.001,0.1)))
caret is just a wrapper to the algorithms it is calling so you can specify any parameter in the algorith even if it is not an option in caret's tuning grid. This is accomplishing via the "..." in caret's train() function, which is basically saying that you can pass any extra parameters into the method you are calling. I'm not sure what parameters you want to adjust to your nnet call (and I'm getting errors accessing your dropbox data) so here is a trivial example passing in specific values to maxit and Hess:
> library(caret)
> m1 <- train(Species~.,data=iris, method='nnet', linout=TRUE, trace = FALSE,trControl=trainControl("cv"))
> #this time pass in values for maxint and Hess
> m2 <- train(Species~.,data=iris, method='nnet', linout=TRUE, trace = FALSE,trControl=trainControl("cv"),maxint=10,Hess=T)
> m1$finalModel$call
nnet.formula(formula = modFormula, data = data, size = tuneValue$.size,
decay = tuneValue$.decay, linout = TRUE, trace = FALSE)
> m2$finalModel$call
nnet.formula(formula = modFormula, data = data, size = tuneValue$.size,
decay = tuneValue$.decay, linout = TRUE, trace = FALSE, maxint = 10,
Hess = ..4)

Resources