Up-sampling in R - randomForest

I have highly imbalanced data and want to up-sample the minority class to improve accuracy (the minority class is the object of interest).
I tried using the "sampsize" option in the "randomForest" function, but it only allows for down-sampling. I read somewhere that the "classwt" option can be used, but I am not sure how to use it.
Can anyone suggest a way to run Random Forest in R while up-sampling the minority class (using the "randomForest" library or another such library)?
Thanks.

The simplest approach is to duplicate the minority-class rows enough times, but then you lose the OOB estimates.
What you want to do does not appear to be implemented directly; see also this question.
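A minimal sketch of that duplication approach (the data frame df, the response y and all values are made up for illustration; substitute your own data):

```r
# Hypothetical example: up-sample the minority class by duplicating rows,
# then train as usual. Requires the randomForest package.
library(randomForest)

set.seed(42)
df <- data.frame(x1 = rnorm(120), x2 = rnorm(120),
                 y  = factor(c(rep("majority", 100), rep("minority", 20))))

minority <- df[df$y == "minority", ]
n_extra  <- sum(df$y == "majority") - nrow(minority)
extra    <- minority[sample(nrow(minority), n_extra, replace = TRUE), ]
up       <- rbind(df, extra)
table(up$y)   # both classes now have 100 rows

fit <- randomForest(y ~ ., data = up)
```

Bear in mind the caveat above: a duplicated row can land both in-bag and out-of-bag for the same tree, so the OOB error becomes optimistically biased. An alternative that keeps valid OOB estimates is stratified down-sampling via the strata and sampsize arguments.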

Related

Dealing with class imbalance with mlr3

Lately I have been advised to change my machine learning framework to mlr3, but I am finding the transition more difficult than I expected. In my current project I am dealing with highly imbalanced data, which I would like to balance before training my model. I have found this tutorial, which explains how to deal with imbalance via pipelines and a graph learner:
https://mlr3gallery.mlr-org.com/posts/2020-03-30-imbalanced-data/
I am afraid that this approach will also perform class balancing when predicting on new data. Why would I want to do that and shrink my testing sample?
So the two questions that arise are:
Am I correct not to balance classes in testing data?
If so, is there a way of doing this in mlr3?
Of course I could just subset the training data manually and deal with imbalance myself but that's just not fun anymore! :)
Anyway, thanks for any answers,
Cheers!
To answer your questions:
I am afraid that this approach will also perform class balancing when predicting on new data.
This is not correct; where did you get this?
Am I correct not to balance classes in testing data?
Class balancing usually works by adding or removing rows (or adjusting weights). None of those steps should be applied during prediction, since we want exactly one predicted value for each row in the data. Weights, on the other hand, usually have no effect during the prediction phase.
Your assumption is correct.
If so, is there a way of doing this in mlr3?
Just use the PipeOp as described in the blog post.
During training it performs the specified over- or under-sampling, while it does nothing during prediction.
Cheers,
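To make that concrete, here is a minimal sketch using po("classbalancing") from mlr3pipelines, along the lines of the linked tutorial (the german_credit task and rpart learner are placeholders; check the current mlr3 documentation, as the API evolves):

```r
library(mlr3)
library(mlr3pipelines)

task <- tsk("german_credit")   # built-in imbalanced binary task

# Oversample the minority class to twice its size during training;
# the PipeOp is a no-op at prediction time, so test data stays untouched.
po_bal <- po("classbalancing", adjust = "minor", reference = "minor",
             shuffle = FALSE, ratio = 2)

graph_learner <- as_learner(po_bal %>>% lrn("classif.rpart"))

rr <- resample(task, graph_learner, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.ce"))
```

During resampling, each training split is balanced before the tree is fit, while every held-out row receives exactly one prediction.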

Difference between "mlp" and "mlpML"

I'm using the caret package in R to create prediction models for maximum energy demand. What I need is a multilayer perceptron neural network, but in caret I found two mlp methods, "mlp" and "mlpML". What is the difference between the two?
I have read the description in a book (Advanced R Statistical Programming and Data Models: Analysis, Machine Learning, and Visualization), but it still doesn't answer my question.
Caret has 238 different models available! However, many of them are just different ways of calling the same basic algorithm.
Besides mlp there are nine other methods for calling a multi-layer perceptron, one of which is mlpML. The real difference lies only in the parameters of the function call, and which method you need depends on your use case and on what you want to adapt about the basic model.
Chances are, if you don't know what mlpML, mlpWeightDecay, etc. do, you are fine just using the basic mlp.
Looking at the official documentation, we can see that mlp exposes a single tuning parameter, size, while mlpML exposes three, layer1, layer2 and layer3: in the first call you can only tune the overall size of the multi-layer perceptron, while in the second you can tune each layer individually.
Looking at the source code here:
https://github.com/topepo/caret/blob/master/models/files/mlp.R
and here:
https://github.com/topepo/caret/blob/master/models/files/mlpML.R
It seems that the difference is that mlpML allows several hidden layers:
modelInfo <- list(label = "Multi-Layer Perceptron, with multiple layers",
while mlp has a single layer of hidden units.
The official documentation also hints at this difference. In my opinion, it is not particularly useful to have many different models that differ only very slightly, and the documentation does not explain those slight differences well.
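One quick way to see the difference yourself is to inspect caret's model registry and then pass a per-layer tuning grid to mlpML (the iris example is purely illustrative; mlpML additionally requires the RSNNS package):

```r
library(caret)

# mlp exposes a single tuning parameter, mlpML one per hidden layer
getModelInfo("mlp",   regex = FALSE)$mlp$parameters$parameter
# "size"
getModelInfo("mlpML", regex = FALSE)$mlpML$parameters$parameter
# "layer1" "layer2" "layer3"

# Tuning each hidden layer individually with mlpML
grid <- expand.grid(layer1 = c(4, 8), layer2 = c(0, 4), layer3 = 0)
fit  <- train(Species ~ ., data = iris, method = "mlpML",
              tuneGrid = grid,
              trControl = trainControl(method = "cv", number = 3))
```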

Implement random forest without bootstrap

I want to implement the random forest algorithm of Breiman (2001) using all my training set to grow the trees. In other words, I want to keep the random selection of inputs at each node and remove the bootstrap stage. This is motivated by the fact that I'm working with few observations that exhibit auto-correlation.
I've gone through the documentation of the packages randomForest, ranger and Rborist, but I didn't find an answer. I've also tried to look at the source code of randomForest using getAnywhere(randomForest.default), but I have to admit that my R level is too low to get anything out of it.
Thank you in advance.
Edit. Note to future readers: if you modify the bootstrap step, make sure to set keep.inbag=TRUE when using randomForest.
The sampsize argument in randomForest controls the number of samples used for each tree, and the replace argument controls whether you are sampling with replacement (i.e., bootstrapping). So in your case, set sampsize=N (the number of observations) and replace=FALSE.
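Putting the two arguments together, a sketch on iris (substitute your own data; the random selection of mtry inputs at each node is unchanged):

```r
library(randomForest)

N   <- nrow(iris)
fit <- randomForest(Species ~ ., data = iris,
                    sampsize = N, replace = FALSE,
                    keep.inbag = TRUE)

# Every row is used exactly once per tree, i.e. no bootstrap:
all(fit$inbag == 1)
# Note: with no out-of-bag rows left, the OOB error estimates
# reported by the fit are no longer meaningful.
```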

Unsupervised discretization to convert continuous into categorical for frequent item set mining

I am using the package 'arules' to mine frequent itemsets in my big data, but I cannot find a suitable discretization method.
As the examples in 'arules' show, several basic unsupervised methods are available in the discretize() function, but I want to estimate the optimal number of categories from my large dataset; that seems more reasonable than assigning the number of categories by hand.
Can you give me good advice on this? Thanks.
#Michael Hahsler
I think there is little guidance on this for unsupervised discretization. Look at the histogram of each variable and decide manually. For k-means you could use strategies for finding k based on internal validation techniques (e.g., the elbow method). For supervised discretization there exist methods that help you decide. Maybe someone else can add more here.
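A sketch of the elbow idea for choosing k before k-means discretization (the data is synthetic; discretize() is the actual arules function, the rest is illustrative):

```r
library(arules)

set.seed(1)
x <- c(rnorm(100, 0), rnorm(100, 5), rnorm(100, 10))

# Elbow method: total within-cluster SS for k = 1..8
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")

# Suppose the elbow suggests k = 3; discretize with k-means breaks
d <- discretize(x, method = "cluster", breaks = 3)
table(d)
```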

Class Weight Syntax in Kernlab?

Hi, I am trying out classification on an imbalanced dataset in R using the kernlab package. As the class distribution is not 1:1, I am using the class.weights option in the ksvm() function call; however, I do not see any difference in the classification results whether I add the weights or remove them. So the question is: what is the correct syntax for declaring the class weights?
I am using the following function calls:
# with class weights
model <- ksvm(dummy[1:466], lab_tr, type = "C-svc", kernel = pre, cross = 10,
              C = 10, prob.model = FALSE,
              class.weights = c("Negative" = 0.7, "Positive" = 0.3))
# without class weights
model <- ksvm(dummy[1:466], lab_tr, type = "C-svc", kernel = pre, cross = 10,
              C = 10, prob.model = FALSE)
Can anyone comment on this? Am I following the right syntax for adding weights? I also discovered that if I use the weights with prob.model=T, the ksvm function returns an error!
Your syntax is OK, but the problem of class balancing not working is fairly common in machine learning; in a way, removing some objects from the bigger class is the only method guaranteed to work. Still, it may increase the error, and one must be careful to do it in an intelligent way (in an SVM the potential support vectors should have priority; of course, there is then the question of how to locate them).
You may also try boosting the weights beyond the simple length ratio, say ten-fold, and check whether that helps even a little, or whether it luckily overshoots the imbalance to the other side.
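To check whether the weights bite at all, a toy experiment along those lines (the data and class names are made up; only the class.weights syntax matches the question):

```r
library(kernlab)

set.seed(1)
# Imbalanced toy data: 180 negatives, 20 positives, overlapping clouds
x <- rbind(matrix(rnorm(360, mean = 0),   ncol = 2),
           matrix(rnorm(40,  mean = 1.5), ncol = 2))
y <- factor(c(rep("Negative", 180), rep("Positive", 20)))

# Weight the minority class 10x, well beyond the 9:1 length ratio
fit_w <- ksvm(x, y, type = "C-svc", kernel = "rbfdot", C = 10,
              class.weights = c(Negative = 1, Positive = 10))
fit_u <- ksvm(x, y, type = "C-svc", kernel = "rbfdot", C = 10)

# Compare how many positives each model recovers on the training data
table(predict(fit_w, x), y)
table(predict(fit_u, x), y)
```

If the two confusion tables are identical even at such an extreme weight, something other than the syntax is off.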
