Sample size (sampsize) argument in cforest()? - r

I've been working with the RandomForest package in R, which is great for classification yet weak when it comes to calculating variable importance for datasets with highly correlated predictors (like mine)
I want to switch over to the party package (to use cforest) but I can't find the equivalent of the sampsize argument when using cforest(), this argument is particularly important for me because I have a highly imbalanced dataset and have to use sampling methods to cope with that problem.
Alternatively, is there a way to pass a randomForest forest to cforest (Can s3 objects be transformed into s4 objects?!) I can train the classifier in randomForest and use the party package to get the variable importance....
Many thanks...

Related

Factor Variable Importance (VIMP) with RandomForestSRC Package: cannot coerce to data.frame error

Good afternoon, all--thank you in advance for your help! I'm somewhat new to R, so my apologies if this is a trivial or otherwise inappropriate question.
TL;DR: I'm trying to determine Variable Importance (VIM) for factor variables with a random forest model built-in RandomForestSRC, which is not a built-in feature of that package. Using both the LIME and DALEX packages, I encounter the same error: cannot coerce class 'c("rfsrc, "predict", "class")' to a data.frame. Any assistance resolving this error, or alternate approaches, would be greatly appreciated!
I have a random forest model I've built in R, using the RandomForestSRC package. The model seems to work great--training and testing went fine, got the predicted output I needed, results seem in-line with what I would expect. Unfortunately, one of the requirements is that I need to be able to indicate how the model arrived at its conclusions (eg, I need to also include variable importance as a part of the output), for both continuous and factor variables.
This doesn't seem to be a built-in feature with the RandomForestSRC package, so I've looked into both the LIME and DALEX packages, both of which should be able to break out VIM from the existing RF model. Unfortunately, neither have native support for the RFSRC package, which means I've needed to build in the prediction functions myself, as recommended by this vignette:https://uc-r.github.io/dalex
model_type.rfsrc <- function (x, ...) {
return ('classification')
}
predict_model.rfsrc <- function (x, newdata, type, ...) {
as.data.frame(predict(x, newdata, ...)
}
Unfortunately, in running the VIM section of the model (in both LIME and DALEX), I'm asked to pass both the predicted output and the model that created that output. In doing so, it hits an error with the above predict_model function:
error in as.data.frame.default(predict(model, (newdata))):
cannot coerce class 'c("rfsrc, "predict", "class")' to a data.frame
And, like...of course, it can't; it's trying to turn the model itself into a data frame. Unfortunately, while I think I understand why R is giving me that error, that's about as far as I've been able to figure out on my own.
Additionally, I'm using the RandomForestSRC package for two reasons: it doesn't put a limit on the number of factor variables, and it can handle imbalanced data. I'm working with medical data, so both of these are necessary (eg, there are ~100,000 different medical codes that can be encoded in a single data variable, and the ratio of "people-who-don't-have-this-condition" vs "people-who-do-have-this-condition" is frequently 100 to 1). If anyone has any suggestions for alternative packages that handle these issues, though, and have built-in VIM functionality (or integrate with DALEX / LIME), that would be fantastic as well.
Thank you all very much for your help!

Decisional boundary SVM caret (R)

I have built an SVM-RBF model in R using Caret. Is there a way of plotting the decisional boundary?
I know it is possible to do so by using other R packages but unfortunately I’m forced to use the Caret package because this is the only package I found that allows me to calculate the variables importance.
In alternative, can you suggest a package that allows to plot the decision boundaries AND gives also the vars importance?
Thank you very much
First of all, unlike other methods, SVM does not produce feature importance. In your case, the importance score caret reports is calculated independent of the method itself: https://topepo.github.io/caret/variable-importance.html#model-independent-metrics
Second, the decision boundary (or hyperplane) you see in most textbook example is based on a toy problem with only two or three features. If you have more than three features, it is not trivial to visualize this hyperplane.

Using a 'gbm' model created in R package 'dismo' with functions in R package 'gbm'

This is a follow-up to a previous question I asked a while back that was recently answered.
I have built several gbm models with dismo::gbm.step, which relies on the gbm fitting functions found in R package gbm, as well as cross validation tools from R package splines.
As part of my analysis, I would like to use some of the graphical tools available in R (e. g. perspective plots) to visualize pairwise interactions in the data. Both the gbm and the dismo packages have functions for detecting and modelling interactions in the data.
The implementation in dismo is explained in Elith et. al (2008) and returns a statistic which indicates departures of the model predictions from a linear combination of the predictors, while holding all other predictors at their means.
The implementation in gbm uses Friedman`s H statistic (Friedman & Popescue, 2005), and returns a different metric, and also does NOT set the other variables at their means.
The interactions modelled and plotted with dismo::gbm.interactions are great and have been very informative. However, I would also like to use gbm::interact.gbm, partly for publication strength and also to compare the results from the two methods.
If I try to run gbm::interact.gbm in a gbm.object created with dismo, an error is returned…
"Error in is.factor(data[, x$var.names[j]]) :
argument "data" is missing, with no default"
I understand dismo::gmb.step adds extra data the authors thought would be useful to the gbm model.
I also understand that the answer to my question lies somewherein the source code.
My questions is...
Is it possible to modify a gbm object created in dismo to be used in gbm::gbm.interact? If so, would this be accomplished by...
a. Modifying the gbm object created in dismo::gbm.step?
b. Modifying the source code for gbm::interact.gbm?
c. Doing something else?
I will be going through the source code trying to solve this myself, if I come up with a solution before anyone answers I will answer my own question.
The gbm::interact.gbm function requires data as an argument interact.gbm <- function(x, data, i.var = 1, n.trees = x$n.trees).
The dismo gbm.object is essentially the same as the gbm gbm.object, but with extra information attached so I don't imagine changing the gbm.object would help.

parameter C. epsilon as vector in kernlab's ksvm in R

I am trying to use ksvm function of kernlab package in R for epsilon-SVM regression. I want to put parameters C(regularization constant) and epsilon (insensitivity) as vectors(length of vector = training data length). But I am not able to figure out how to do this. Please suggest some way.
Why do you assume that you can do it? According to documentation of ksvm you can only weight classes, not particular samples. Such modification is accessible in for example sklearn python library (as samples' weights).
To artificialy implement per samples C-weights you could oversample your data. It will be very inefficient (especially if you have large differences in C values), but it can be applied to almost any SVM library.

How to do feature selection with randomForest package?

I'm using randomForest in order to find out the most significant variables. I was expecting some output that defines the accuracy of the model and also ranks the variables based on their importance. But I am a bit confused now. I tried randomForest and then ran importance() to extract the importance of variables.
But then I saw another command rfcv (Random Forest Cross-Valdidation for feature selection), which should be the most appropriate for this purpose I suppose, but the question I have regarding this is: how to get the list of the most important variables? How to see the output after running it? Which command to use?
Another thing: What is the difference between randomForest and predict.randomForest?
I am not very familiar with randomforest and R therefore any help would be appreciated.
Thank you in advance!
After you have made a randomForest model you use predict.randomForest to use the model you created on new data e.g. build a random forest with training data then run your validation data through that model with predict.randomForest.
As for the rfcv there is an option recursive which (from the help):
whether variable importance is (re-)assessed at each step of variable
reduction
Its all in the help file

Resources