feature selection function in caret package - r

I am posting this because the post "feature selection in caret" hasn't resolved my issue, and I have two questions about the feature selection function in the caret package.
When I run the code below on my gene expression matrix allsamplecombat, with 5 classes defined in y:
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(t(allsamplecombat[filter,]), y = factor(info$clust), sizes=c(300,400,500,600,700,800,1000,1200), rfeControl=control)
I get output like this:
So, I want to know whether I can extract the top features for each class, because predictors(results) just gives me the resulting features without indicating their importance for each class.
My second problem is that when I try to change the rfeControl functions to treebagFuncs and run the 'parRF' method
control <- rfeControl(functions=treebagFuncs, method="cv", number=5)
results <- rfe(t(allsamplecombat[filter,]), y = factor(info$clust), sizes=c(400,500,600,700,800), rfeControl=control, method="parRF")
I get the error: Error in { : task 1 failed - "subscript out of bounds".
What is wrong in my code?

For the importances, there is a sub-object called variables that contains this information for each step of the elimination.
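As a rough sketch of how that sub-object might be used: the example below runs rfe on iris (a stand-in for the expression matrix) and averages the importances from the full-size fits across resamples. The per-class columns are an assumption; they should be present when the ranking function keeps them, as rfFuncs appears to do for classification, so the code only uses whichever class columns it actually finds.

```r
library(caret)
library(randomForest)

set.seed(1)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
res  <- rfe(iris[, 1:4], iris$Species, sizes = 1:3, rfeControl = ctrl)

# One row per predictor, subset size and resample
vars <- res$variables

# Keep only the rows that come from the fits on the full predictor set
full <- vars[vars$Variables == max(vars$Variables), ]

# Use per-class importance columns only if they are actually present
class_cols <- intersect(levels(iris$Species), names(full))

# Average the importances over resamples, one row per predictor
avg <- aggregate(full[, c("Overall", class_cols), drop = FALSE],
                 by = list(var = full$var), FUN = mean)
avg[order(avg$Overall, decreasing = TRUE), ]
```

Sorting avg by one of the class columns instead of Overall would give a per-class ranking.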
treebagFuncs is designed to work with ipred's bagging function and isn't related to random forest.
You would probably use caretFuncs and pass method to that. However, if you are going to parallelize something, do it to the resampling loop and not the model function. This is generally more efficient. Note that if you parallelize at every level with M workers, you might actually get M^3 workers (one level each for rfe, train, and parRF). There are options in rfe and train to turn their parallelism off.
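A minimal sketch of that setup, assuming iris as stand-in data and the plain "rf" method instead of "parRF": caretFuncs fits each subset with train(), so method and trControl are forwarded through the dots of rfe. The allowParallel flags are the switches in rfeControl and trainControl; here the outer resampling loop is allowed to run in parallel while the inner train() loop is kept sequential. No parallel backend is registered in this snippet, so everything runs sequentially as written.

```r
library(caret)
library(randomForest)

set.seed(1)

# Outer loop (rfe resampling): allowed to use a parallel backend if one is registered
outer_ctrl <- rfeControl(functions = caretFuncs, method = "cv", number = 5,
                         allowParallel = TRUE)

# Inner loop (train's own resampling): kept sequential to avoid nested workers
inner_ctrl <- trainControl(method = "cv", number = 3, allowParallel = FALSE)

res <- rfe(iris[, 1:4], iris$Species,
           sizes = 2:3,
           rfeControl = outer_ctrl,
           method = "rf",          # forwarded to train() by caretFuncs
           trControl = inner_ctrl, # forwarded to train() by caretFuncs
           tuneLength = 2)

predictors(res)
```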

Related

Unexpected behavior in R using lapply() with glm() and cv.glm()

I am trying to apply cross validation to a list of linear models and getting an error.
Here is my code:
library(MASS)
library(boot)
glm.fits = lapply(1:10,function(d) glm(nox~poly(dis,d),data=Boston))
cvs = lapply(1:10,function(i) cv.glm(Boston,glm.fits[[i]],K=10)$delta[1])
I get the error:
Error in poly(dis, d) : object 'd' not found
I then tried the following code:
library(MASS)
library(boot)
cvs=rep(0,10)
for (d in 1:10){
glmfit = glm(nox~poly(dis,d),data=Boston)
cvs[d] = cv.glm(Boston,glmfit,K=10)$delta[1]
}
and this worked.
Can anyone explain why my first attempt did not work, and suggest a fix?
Also, assuming a fix to the first attempt can be obtained, which way of writing code is better practice? (assume that I want a list of the various fits and that I would edit the latter code to preserve them) To me, the first attempt is more elegant.
In order for your first attempt to work, cv.glm (and maybe glm) would have to be written differently to take much more care about where evaluations are taking place.
The function cv.glm basically re-evaluates the model formula a bunch of times. It takes that model formula directly from the fitted glm object. So put yourself in R's shoes (as it were), and consider you're deep in the function cv.glm and you've been instructed to refit this model:
glm(formula = nox ~ poly(dis, d), data = Boston)
The fitted glm object has Boston in it, and glm knows to look first in Boston for variables, so it finds nox and dis easily. But where is d? It's not in Boston. It's not in the fitted glm object (and glm wouldn't know to look there anyway). In fact, it isn't anywhere. That d value existed only in the context of the lapply iterations and then disappeared.
In the second case, since d is currently an active variable in your for loop, after R fails to find d in the data frame Boston, it looks in the parent frame, in this case the global environment and finds your for loop index d and merrily keeps going.
If you need to use glm and cv.glm in this way I would just use the for loop; it might be possible to work around the evaluation issues, but it probably wouldn't be worth the time and hassle.
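For readers who still want the lapply version, one possible workaround (a sketch, not part of the original answer) is to substitute the value of d into the call before glm sees it, using bquote. The stored call then contains a literal degree rather than the symbol d, so cv.glm has nothing to look up when it refits the model. Only degrees 1 to 3 are used here to keep it quick.

```r
library(MASS)   # Boston data
library(boot)   # cv.glm

set.seed(1)
degrees <- 1:3

# bquote() replaces .(d) with its current value before glm() is called,
# so the stored call reads poly(dis, 1L) rather than poly(dis, d).
glm.fits <- lapply(degrees, function(d) {
  eval(bquote(glm(nox ~ poly(dis, .(d)), data = Boston)))
})

# cv.glm re-evaluates each stored call; no free variable d remains in it
cvs <- sapply(glm.fits, function(fit) cv.glm(Boston, fit, K = 10)$delta[1])
cvs
```

Printing glm.fits[[2]]$call shows the substituted degree, which is a quick way to confirm the fix took effect.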

How to pass saved models to caretEnsemble

Reasonably new to this so sorry if I'm being thick.
Is there a way to pass existing models to caretEnsemble?
I have several models, run on the same training data, that I would like to ensemble with caretEnsemble.
Each model takes several hours to run, so I save them, then reload them when needed rather than re-run.
model_xgb <- train(oi_in_4_24_months~., method="xgbTree", data=training, trControl=train_control)
saveRDS(model_xgb, "model_xgb.rds")
model_logit <- train(oi_in_4_24_months~., method="LogitBoost", data=training, trControl=train_control)
saveRDS(model_logit, "model_logit.rds")
model_xgb <- readRDS("model_xgb.rds")
model_logit <- readRDS("model_logit.rds")
I want to pass these saved models to caretEnsemble, but as far as I can make out I can only pass a list of model types, e.g. "LogitBoost", "xgbTree", and caretEnsemble will both run the initial models, then ensemble them.
Is there a way to pass existing models, trained on the same data, to caretEnsemble?
The package author has an example script (https://gist.github.com/zachmayer/5152157) that suggests the following:
all_models <- list(model_xgb, model_logit)
names(all_models) <- sapply(all_models, function(x) x$method)
greedy <- caretEnsemble(all_models, iter=1000L)
But that produces an error
"Error: is(list_of_models, "caretList") is not TRUE".
I think that use of caretList previously wasn't compulsory, but now is.
I don't think you still need the solution to this but answering for anyone else that has the same question.
You can add models to be used by caretEnsemble or caretStack by using as.caretList(list(rpart2 = model_list1, gbm = model_list2))
But remember to use the same resampling indexes (cross-validation or bootstrap) for every model. If the indexes differ, or if something was not stored because trainControl was specified incorrectly, caretEnsemble or caretStack will throw an error, which is the expected behavior. This issue on GitHub has very clear and simple instructions.
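A small sketch of the whole round trip, under several assumptions: mtcars and the "lm" and "knn" methods are stand-ins for the asker's data and models, the fold indexes are created once with createFolds and shared through the index argument of trainControl, and savePredictions = "final" is set so the resampled predictions are stored. Details may differ between caretEnsemble versions.

```r
library(caret)
library(caretEnsemble)

set.seed(1)

# One shared set of training-row indexes, reused by every model
folds <- createFolds(mtcars$mpg, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds, savePredictions = "final")

model_lm  <- train(mpg ~ wt + hp, data = mtcars, method = "lm",  trControl = ctrl)
model_knn <- train(mpg ~ wt + hp, data = mtcars, method = "knn", trControl = ctrl)

# Round trip through disk, as in the question
path <- tempfile(fileext = ".rds")
saveRDS(model_lm, path)
model_lm <- readRDS(path)

# Wrap the already-trained models and ensemble them without refitting
all_models <- as.caretList(list(lm = model_lm, knn = model_knn))
ens <- caretEnsemble(all_models)
ens
```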

In R, caret package RFE function selects more features than allowed in size

I have a simple code that uses rfe to perform feature selection on different time periods of my data. I use the following rfeControl and rfe function calls:
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(feature_selection_data
, feature_selection_target$value
, sizes = c(1:12)
, rfeControl = control)
Each time this runs I insert the values into a list:
include <- predictors(results)
include_list[[row]] <- include
Somehow, although I set size to a maximum of 12, in 2 out of my 20 time periods, the feature selection results in 65 features (which is the total number of features in the initial dataset).
I am new to using this function, I do not know what I'm doing wrong here, any help is appreciated!
Thank you!
If you look at the description of the RFE algorithm (http://topepo.github.io/caret/recursive-feature-elimination.html), you'll see that the first iteration always fits the model on all features. The full set is therefore always one of the candidate sizes, and it is selected whenever it has the best resampled performance, regardless of what you pass in sizes.
Your next question will probably be how to then select the suboptimal models that have fewer features. One answer can be found here (although it's not too helpful):
Access all models produced by rfe in caret
I would suggest adjusting the ranking function to allow feature sets that are not optimal in terms of error, but that are smaller (see: http://topepo.github.io/caret/recursive-feature-elimination.html#the-selectsize-function).
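A minimal sketch of such an adjustment, assuming a hard cap is what is wanted: the custom selectSize below drops the row for the full predictor set and then defers to caret's pickSizeBest. The names cappedFuncs and max_size are made up for illustration, and the toy performance profile is invented data that only shows how the two selectors differ.

```r
library(caret)

max_size <- 12  # hypothetical cap, matching sizes = c(1:12) in the question

cappedFuncs <- rfFuncs
cappedFuncs$selectSize <- function(x, metric, maximize) {
  # Ignore any subset larger than the cap (this removes the full-set row)
  x <- x[x$Variables <= max_size, , drop = FALSE]
  pickSizeBest(x, metric = metric, maximize = maximize)
}

# Toy resampling profile in which the full set (65) has the lowest RMSE
profile <- data.frame(Variables = c(4, 8, 12, 65),
                      RMSE      = c(2.9, 2.5, 2.4, 2.3))

default_size <- pickSizeBest(profile, metric = "RMSE", maximize = FALSE)
capped_size  <- cappedFuncs$selectSize(profile, metric = "RMSE", maximize = FALSE)
c(default = default_size, capped = capped_size)  # 65 and 12
```

Passing rfeControl(functions = cappedFuncs, method = "cv", number = 10) to rfe would then apply the cap during selection. pickSizeTolerance could be swapped in for pickSizeBest if a tolerance-based choice is preferred over a hard limit.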

100-fold-cross-validation for Ridge Regression in R

I have a huge dataset, and I am quite new to R, so the only way I can think of implementing 100-fold CV by myself is through many for's and if's, which makes it extremely inefficient for my huge dataset and might even take several hours to run. I started looking for packages that do this instead and found quite a few topics related to CV on Stack Overflow. I have been trying to use the ones I found, but none of them are working for me, and I would like to know what I am doing wrong here.
For instance, this code from DAAG package:
cv.lm(data=Training_Points, form.lm=formula(t(alpha_cofficient_values)
%*% Training_Points), m=100, plotit=TRUE)
..gives me the following error:
Error in formula.default(t(alpha_cofficient_values)
%*% Training_Points) : invalid formula
I am trying to do Kernel Ridge Regression, therefore I have alpha coefficient values already computed. So for getting predictions, I only need to do either t(alpha_cofficient_values)%*% Test_Points or simply crossprod(alpha_cofficient_values,Test_Points) and this will give me all the predictions for unknown values. So I am assuming that in order to test my model, I should do the same thing but for KNOWN values, therefore I need to use my Training_Points dataset.
My Training_Points data set has 9000 columns and 9000 rows. I can write for's and if's and do 100-fold-CV each time take 100 rows as test_data and leave 8900 rows for training and do this until the whole data set is done, and then take averages and then compare with my known values. But isn't there a package to do the same? (and ideally also compare the predicted values with known values and plot them, if possible)
Please do excuse me for my elementary question, I am very new to both R and cross-validation, so I might be missing some basic points.
The CVST package implements fast cross-validation via sequential testing. This method significantly speeds up the computations while preserving full cross-validation capability. Additionally, the package developers also added default cross-validation functionality.
I haven't used the package before but it seems pretty flexible and straightforward to use. Additionally, KRR is readily available as a CVST.learner object through the constructKRRLearner() function.
To use the crossval functionality, you must first convert your data to a CVST.data object by using the constructData(x, y) function, with x the feature data and y the labels. Next, you can use one of the cross validation functions to optimize over a defined parameter space. You can tweak the settings of both the cv or fastcv methods to your liking.
After the cross validation spits out the optimal parameters you can create the model by using the learn function and subsequently predict new labels.
I puzzled together an example from the package documentation on CRAN.
library(CVST)
# Construct a CVST.data object from your own data with constructData(x, y).
# Here we load some example data instead.
ns = noisySinc(1000)
# Kernel ridge regression learner
krr = constructKRRLearner()
# Create the parameter space
params = constructParams(kernel="rbfdot", sigma=10^(-3:3),
lambda=c(0.05, 0.1, 0.2, 0.3)/getN(ns))
# Run fast cross-validation via sequential testing
opt = fastCV(ns, krr, params, constructCVSTModel())
# OR run conventional 100-fold cross-validation (much slower); this overwrites opt
opt = CV(ns, krr, params, fold=100)
# The selected parameters are the first element of opt, equivalent to
# p = list(kernel=opt[[1]]$kernel, sigma=opt[[1]]$sigma, lambda=opt[[1]]$lambda)
p = opt[[1]]
# Create the model with the selected parameters
m = krr$learn(ns, p)
# Predict with the model on fresh test data
nsTest = noisySinc(10000)
pred = krr$predict(m, nsTest)
# Evaluate: mean squared error on the test set
sum((pred - nsTest$y)^2) / getN(nsTest)
If further speedup is required, you can run the cross-validations in parallel. View this post for an example of the doParallel package.

Recursive feature elimination in 'caret' for 'randomForest': set different ntree parameter for the first forest

I am currently trying to optimize a random forest classifier for a very high-dimensional dataset (p > 200k) using recursive feature elimination (RFE). The caret package has a nice implementation for doing this (the rfe() function). However, I am also thinking about optimizing RAM and CPU usage. That's why I wonder whether it is possible to use a different (larger) number of trees to train the first forest (without feature elimination) and to use its importances to build the remaining ones (with RFE) using, for example, 500 trees with 10- or 5-fold cross-validation. I know that this option is available in varSelRF, but how about caret? I didn't manage to find anything regarding this in the manual.
You can do that. The rfFuncs list has an object called fit that defines how the model is fit. One argument to this function is called 'first' which is TRUE on the first fit (there is also a 'last' arg). You can set ntree based on this.
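A minimal sketch of that idea, assuming iris as stand-in data: the fit function in a copy of rfFuncs is replaced by one that chooses ntree from the first flag. The name ntreeFuncs and the tree counts (2000 and 500) are illustrative choices, not caret defaults. With this version, ntree should not also be passed through the dots of rfe, since it would then be supplied twice.

```r
library(caret)
library(randomForest)

ntreeFuncs <- rfFuncs
ntreeFuncs$fit <- function(x, y, first, last, ...) {
  # Hypothetical tree counts: a larger forest for the initial full-feature fit
  n_trees <- if (first) 2000 else 500
  randomForest::randomForest(x, y, ntree = n_trees, importance = TRUE, ...)
}

# Call the fit function directly to check that the flag is honoured
first_fit <- ntreeFuncs$fit(iris[, 1:4], iris$Species, first = TRUE,  last = FALSE)
later_fit <- ntreeFuncs$fit(iris[, 1:4], iris$Species, first = FALSE, last = FALSE)
c(first = first_fit$ntree, later = later_fit$ntree)
```

To use it for feature elimination, pass rfeControl(functions = ntreeFuncs, method = "cv", number = 5) as the rfeControl argument of rfe. As far as I know, rfeControl defaults to rerank = FALSE, in which case the importances from the first fit are the ones used to order the predictors.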
See the feature selection vignette for more details.
Max
