How to balance unbalanced classification 1:1 with SMOTE in R

I am doing binary classification and my current target class is composed of:
Bad: 3126, Good: 25038
So I want the number of Bad (minority) examples to equal the number of Good examples (1:1).
So Bad needs to increase roughly 8x (an extra 21912 SMOTEd instances) without increasing the majority (Good). The code I have tried does not keep the number of Good examples constant, as shown below.
Code I have tried:
Example 1:
library(DMwR)
smoted_data <- SMOTE(targetclass~., data, perc.over=700, perc.under=0, k=5, learner=NULL)
Example 1 output:
Bad: 25008, Good: 0
Example 2:
smoted_data <- SMOTE(targetclass~., data, perc.over=700, k=5, learner=NULL)
Example 2 output:
Bad: 25008, Good: 43764
Example 3:
smoted_data <- SMOTE(targetclass~., data, perc.over=700, perc.under=100, k=5, learner=NULL)
Example 3 output:
Bad: 25008 Good: 21882

To achieve a 1:1 balance using SMOTE, you want to do this:
library(DMwR)
smoted_data <- SMOTE(targetclass~., data, perc.over=100)
I have to admit it doesn't seem obvious from the built-in documentation, but if you read the original documentation, it states:
The parameters perc.over and perc.under control the amount of
over-sampling of the minority class and under-sampling of the majority
classes, respectively.
perc.over will typically be a number above 100. For each case in the original data set belonging to the minority class, perc.over/100 new examples of that class will be created. If perc.over is a value below 100, then a single case will be generated for a randomly selected proportion (given by perc.over/100) of the cases belonging to the minority class in the original data set.
So when perc.over is 100, you are essentially creating 1 new example per original minority case (100/100 = 1).
The default of perc.under is 200, and that is what you want to keep.
The parameter perc.under controls the proportion of
cases of the majority class that will be randomly selected for the
final "balanced" data set. This proportion is calculated with respect
to the number of newly generated minority class cases.
prop.table(table(smoted_data$targetclass))
# returns 0.5 0.5
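To see why this yields a 1:1 split with the asker's counts, here is the arithmetic spelled out (illustrative only, following the rules quoted above):
n_bad  <- 3126                   # original minority count
n_new  <- n_bad * 100 / 100      # perc.over = 100 -> one synthetic Bad per original Bad
n_good <- n_new * 200 / 100      # perc.under = 200 -> Good cases kept, counted relative to n_new
c(Bad = n_bad + n_new, Good = n_good)
#  Bad  Good
# 6252  6252   -- 1:1, though with far fewer Good than the original 25038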

You can try the ROSE package in R.
A research article with examples is available here
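For example, a minimal sketch with ROSE's ovun.sample(), assuming the same data and targetclass names as in the question. method = "over" randomly oversamples the minority with replacement (it does not synthesize SMOTE-style points) until the data reaches N rows, leaving the Good rows untouched:
library(ROSE)
# N is the desired total size: 25038 Good + 25038 Bad
balanced <- ovun.sample(targetclass ~ ., data = data,
                        method = "over", N = 2 * 25038, seed = 1)$data
table(balanced$targetclass)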

You should use a perc.under of 114.423, since (700/100) × 3126 × (114.423/100) = 25038.04.
But note that since SMOTE randomly under-samples the majority class, this way you would get new data with duplicates in the majority class. That is to say, your new data will have 25038 Good samples, but they are not the same 25038 Good samples as in the original data: some Good samples will be left out and some will be duplicated in the newly generated data.
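In code, the call that this arithmetic implies would look like this (same data and targetclass names as in the question):
library(DMwR)
smoted_data <- SMOTE(targetclass ~ ., data,
                     perc.over = 700, perc.under = 114.423, k = 5)
table(smoted_data$targetclass)
# roughly Bad: 25008, Good: ~25038 (the Good rows are resampled, so expect duplicates)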

I recommend the bimba package, which I am developing. It is not yet available on CRAN, but you can easily install it from GitHub.
You can find instructions on how to install it on its github page:
https://github.com/RomeroBarata/bimba
The only restrictions on the data for the SMOTE function implemented in bimba are that the predictors must be numeric and that the target must be the last column of the data frame holding your data and have only two values.
As long as your data abides by these restrictions, using the SMOTE function is easy:
library(bimba)
smoted_data <- SMOTE(data, perc_min = 50, k = 5)
Here perc_min specifies the desired percentage of the minority class after over-sampling (in this case perc_min = 50 balances the classes). Note that the majority class is not under-sampled, unlike in the DMwR package.


SMOTE generates duplicate samples for majority sample

I'm quite new to R and was trying out some examples using existing SMOTE packages. I tried the performanceEstimation package and followed its sample code for SMOTE. Below is the code for reference:
library(performanceEstimation)

## A small example with a data set created artificially from the IRIS data
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))
## checking the class distribution of this artificial data set
table(data$Species)
## now using SMOTE to create a more "balanced problem"
newData <- smote(Species ~ ., data, perc.over = 6, perc.under = 1)
table(newData$Species)
## Checking visually the created data
## Not run:
par(mfrow = c(1, 2))
plot(data[, 1], data[, 2], pch = 19 + as.integer(data[, 3]),
     main = "Original Data")
plot(newData[, 1], newData[, 2], pch = 19 + as.integer(newData[, 3]),
     main = "SMOTE'd Data")
## End(Not run)
Link: https://cran.r-project.org/web/packages/performanceEstimation/performanceEstimation.pdf
In the result for newData, I noticed that the majority samples are generated as duplicates. Below is how the results look:
      Sepal.Length Sepal.Width Species
146            6.7         3.0  common
146.1          6.7         3.0  common
106            7.6         3.0  common
60             5.2         2.7  common
107            4.9         2.5  common
107.1          4.9         2.5  common
107.2          4.9         2.5  common
The first column here is the row index that you see when you print newData or click on the newData variable in the Environment tab in RStudio. Just a note: the table above is only a snippet I picked from the result; common is the majority class in this iris-derived data set.
So the questions here are:
Why does SMOTE generate duplicate samples for the majority class (common)?
Will these duplicates affect the accuracy of the classification model?
As I understand it, SMOTE undersamples the majority class and oversamples the minority class, and the oversampling part generates synthetic samples; yet the last 3 rows in the table above are duplicates.
If you run the code, you will see rows indexed with decimals. I've tried searching around, but I couldn't find a similar question in any of the forums. Another point: I've tried other packages and obtained similar results.
Thank you so much in advance!
To the first question: if your goal is to achieve a balanced dataset, you do not have to oversample the majority class. E.g. class 1 has 100 cases and class 2 has 50 (like in your iris example): you can oversample class 2 from 50 to 100 and leave class 1 unchanged. Using performanceEstimation::smote you can do:
newData <- performanceEstimation::smote(Species ~ ., data, perc.over = 1, perc.under = 2, k = 10)
table(newData$Species)
This results in 100 rare class and 100 common class.
No matter whether you balance by undersampling, oversampling, or both: balancing will generally lead to better results if you are especially interested in the minority class (as in fraud detection, outlier detection, side effects of drugs in patients, etc.). Otherwise your model is biased towards the majority class, leading to low accuracy and precision for the minority class you are interested in.
To your second question: note the k parameter I used, from the package documentation. k determines how many nearest neighbours are used for a kind of interpolation. If you look into newData, you will notice that many of the oversampled cases of the MINORITY (rare) class are interpolated; the majority (common) cases you showed in your table are duplicates, since they are sampled with replacement. The neighbourhood used for this interpolation is determined by k:
301 4.660510 3.278980 rare
311 4.757001 3.142999 rare
321 5.432725 3.432725 rare
You can see the interpolation in those rows: the values have many decimal places, unlike the one-decimal values in the original data set.
I suggest you play around with the parameter k and inspect the plot (as you also did) before and after smote.
One note: the other SMOTE implementation, DMwR::SMOTE, has the same parameters and functional structure, shares the same logic for the perc.over and perc.under parameters, and leads to almost the same result. Here is the same data example, which should again lead to a similarly balanced 100/100 iris data set:
newData <- DMwR::SMOTE(Species ~ ., data, perc.over = 100, perc.under = 200, k = 10)
table(newData$Species)
Note that perc.over and perc.under follow the same logic, but should be interpreted as 100% and 200% (whereas in the performanceEstimation package the values 1 and 2 mean the same thing).
One final note on the smotefamily package: I can only recommend it if you want to do DBSMOTE or ADASYN; those work without hassle. But I cannot recommend smotefamily::SMOTE, because the syntax is a real pain: it requires a purely numeric data frame or matrix. The example in the documentation uses a data generator that "magically" produces the correct object, but leaves the reader alone in reproducing it. That is just a side note, to cover all three packages (DMwR, performanceEstimation, and smotefamily) in one answer :)
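For completeness, a rough sketch of smotefamily::SMOTE on the same artificial iris data, so you can judge the syntax yourself; note the numeric-only predictor matrix, and that dup_size plays roughly the role of perc.over (check ?smotefamily::SMOTE before relying on this):
library(smotefamily)
gen <- SMOTE(X = data[, 1:2], target = data$Species, K = 5, dup_size = 1)
table(gen$data$class)   # result is in gen$data; the target column is renamed to 'class'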

Create balanced dataset 1:1 using SMOTE without modifying the observations of the majority class in R

I am working on a binary classification problem for which I have an unbalanced dataset. I want to create a new, more balanced dataset with 50% of the observations in each class. For this, I am using the SMOTE algorithm provided by the DMwR library in R.
In the new dataset, I want to keep constant the observations of the majority class.
However, I run into two problems:
SMOTE reduces or increases the number of observations of the majority class (I only want to increase the number in the minority class).
Some observations generated by SMOTE contain NA values.
Let's assume that I have 20 observations: 17 in the majority class and only 3 in the minority class. Here is my code:
library(DMwR)
library(dplyr)
sample_data <- data.frame(matrix(rnorm(200), nrow = 20))
sample_data[1:17, "X10"] <- 0
sample_data[18:20, "X10"] <- 1
sample_data[, ncol(sample_data)] <- factor(sample_data[, ncol(sample_data)],
                                           levels = c('1', '0'), labels = c('Yes', 'No'))
newDataSet <- SMOTE(X10 ~ ., sample_data, perc.over = 400, perc.under = 100)
In my code, I set perc.over = 400 to create 12 new observations of the minority class, and perc.under = 100 expecting it to leave the majority class unchanged.
However, when I check newDataSet, I observe that SMOTE reduces the number of majority-class observations from 17 to 12. In addition, some generated observations have NA values.
The following image shows the obtained result:
According to ?SMOTE:
for each case in the original data set belonging to the minority class,
perc.over/100 new examples of that class will be created.
Moreover:
For instance, if 200 new examples were generated for the minority
class, a value of perc.under of 100 will randomly select exactly 200
cases belonging to the majority classes from the original data set to
belong to the final data set.
Therefore, in your case you are:
creating 12 new Yes (besides the original ones).
randomly selecting 12 No.
The new Yes rows containing NA are likely related to the k parameter of SMOTE. According to ?SMOTE:
k: A number indicating the number of nearest neighbours that are used
to generate the new examples of the minority class.
Its default value is 5, but in your original data you have only 3 Yes. Setting k = 2 seems to solve this issue.
A final comment: to achieve your goal I would use SMOTE only to increase the number of observations from the minority class (with perc.over = 400 or 500). Then, you can combine them with the original observations from the majority class.
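A minimal sketch of that final suggestion, reusing the sample_data from the question; perc.under = 0 keeps none of the majority rows (consistent with Example 1 in the first question above), and k = 2 avoids the NA issue:
library(DMwR)
# Over-sample only: 3 original Yes + 15 synthetic Yes, no No rows kept
only_minority <- SMOTE(X10 ~ ., sample_data, perc.over = 500, perc.under = 0, k = 2)
only_minority <- only_minority[only_minority$X10 == "Yes", ]   # defensive filter
# Re-attach the untouched majority class from the original data
balanced <- rbind(only_minority, sample_data[sample_data$X10 == "No", ])
table(balanced$X10)
# Yes: 18, No: 17 -- the 17 majority observations are exactly the originals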

How to calculate class weights for Random forests

I have datasets for 2 classes on which I have to perform binary classification. I chose random forest as the classifier, as it gives me the best accuracy among the models I tried.
The number of datapoints in dataset-1 is 462, and dataset-2 contains 735 datapoints. I noticed that my data has a minor class imbalance, so I tried to optimise my training model by retraining it with class weights. I provided the following class weights:
cwt <- c(0.385,0.614) # Class weights
ss <- c(300,300) # Sample size
I trained the model using the following code:
library(randomForest)
tr_forest <- randomForest(output ~ ., data = train,
                          ntree = nt, mtry = mt, importance = TRUE, proximity = TRUE,
                          maxnodes = mn, sampsize = ss, classwt = cwt,
                          keep.forest = TRUE, oob.prox = TRUE, oob.times = oobt,
                          replace = TRUE, nodesize = ns, do.trace = 1)
Using the chosen class weights has increased the accuracy of my model, but I am still doubtful whether my approach is correct or whether it is just a coincidence. How can I make sure my class weight choice is perfect?
I calculated the class weights using the following formula:
Class weight for positive class = (No. of datapoints in dataset-1) / (Total datapoints)
Class weight for negative class = (No. of datapoints in dataset-2) / (Total datapoints)
For dataset-1: 462/1197 = 0.385
For dataset-2: 735/1197 = 0.614
Is this an acceptable method? If not, why is it improving the accuracy of my model? Please help me understand the nuances of class weights.
How can I make sure my class weight choice is perfect?
Well, you certainly cannot; perfect is absolutely the wrong word here. We are looking for useful heuristics, which both improve performance and make sense (i.e. they don't feel like magic).
Given that, we do have an independent way of cross-checking your choice (which does seem sound), albeit in Python rather than R: the scikit-learn function compute_class_weight. We don't even need the exact data, only the sample counts for each class, which you have already provided:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
y_1 = np.ones(462) # dataset-1
y_2 = np.ones(735) + 1 # dataset-2
y = np.concatenate([y_1, y_2])
len(y)
# 1197
classes=[1,2]
cw = compute_class_weight('balanced', classes, y)
cw
# array([ 1.29545455, 0.81428571])
Actually, these are your numbers multiplied by ~ 2.11, i.e.:
cw/2.11
# array([ 0.6139595, 0.3859174])
Looks good (multiplications by a constant do not affect the outcome), save one detail: seems that scikit-learn advises us to use your numbers transposed, i.e. a 0.614 weight for class 1 and 0.386 for class 2, instead of vice versa as per your computation.
We have just entered the subtleties of the exact definitions of what a class weight actually is, which are not necessarily the same across frameworks and libraries. scikit-learn uses these weights to weight differently the misclassification cost, so it makes sense to assign a greater weight to the minority class; this was the very idea in a draft paper by Breiman (inventor of RF) and Andy Liaw (maintainer of the randomForest R package):
We assign a weight to each class, with the minority class given larger weight (i.e., higher misclassification cost).
Nevertheless, this is not what the classwt argument in the randomForest R method seems to be; from the docs:
classwt Priors of the classes. Need not add up to one. Ignored for regression.
"Priors of the classes" is in fact the analogy of the class presence, i.e. exactly what you have computed here; this usage seems to be the consensus of a related (and highly voted) SO thread, What does the parameter 'classwt' in RandomForest function in RandomForest package in R stand for?; additionally, Andy Liaw himself has stated that (emphasis mine):
The current "classwt" option in the randomForest package [...] is different from how the official Fortran code (version 4 and later) implements class weights.
where the official Fortran implementation I guess was as described in the previous quotation from the draft paper (i.e. scikit-learn-like).
I used RF for imbalanced data myself during my MSc thesis ~6 years ago and, as far as I can remember, I found the sampsize parameter much more useful than classwt, against which Andy Liaw (again...) has advised (emphasis mine):
Search in the R-help archive to see other options and why you probably shouldn't use classwt.
What's more, in an already rather "dark" context regarding detailed explanations, it is not at all clear what exactly is the effect of using both sampsize and classwt arguments together, as you have done here...
To wrap up:
What you have done seems indeed correct and logical
You should try using the classwt and sampsize arguments in isolation (and not together), in order to be sure where your improved accuracy should be attributed
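A sketch of that isolation experiment, reusing the train data and the values from the question (other hyperparameters left at their defaults for brevity; the weight/sample order should follow levels(train$output)):
library(randomForest)
# classwt only
rf_cw <- randomForest(output ~ ., data = train, classwt = c(0.385, 0.614))
# sampsize only (stratified on the response, 300 cases drawn per class)
rf_ss <- randomForest(output ~ ., data = train,
                      strata = train$output, sampsize = c(300, 300))
# compare the OOB confusion matrices to see which change drives the gain
rf_cw$confusion
rf_ss$confusion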

Reducing "treatment" sample size through MatchIt (or another package) to increase sample similarity

I am trying to match two samples on several covariates using MatchIt, but I am having difficulty creating samples that are similar enough. Both of my samples are fairly large (~1000 in the control group, ~5000 in the comparison group).
I want a matched sample with participants matched as closely as possible, and I am fine with losing sample size in the control group. Right now, MatchIt returns two groups of 1000 each, whereas I want two very closely matched groups and would be fine with smaller ones (e.g., 500 instead of 1000).
Is there a way to do this through either MatchIt or another package? I would rather avoid randomly sampling and then matching, if possible, because I want as close a match between the groups as possible.
Apologies for not having a reproducible example; I am still pretty new to R and couldn't figure out how to create one for this issue...
Below is the code I have for matching the two groups.
library(MatchIt)
library(car)   # recode() with this string syntax comes from car

data <- na.omit(data)
data$Group <- as.numeric(data$Group)
data$Group <- recode(data$Group, '1 = 1; 2 = 0')

m.out <- matchit(Group ~ Age + YearsEdu + Income + Gender, data = data, ratio = 1)
s.out <- summary(m.out, standardize = TRUE)
plot(s.out)
matched.data <- match.data(m.out)
MatchIt, like other similar packages, offers several matching routines whose settings you can play with. Check out the method argument, which is set to method = 'nearest' by default. This means that unless you specify otherwise, it will look for the best match for each of the treatment observations. In your case, you will always end up with 1000 matched pairs with this setting.
You can instead set method = 'exact', which is much more restrictive. In the documentation you will find:
This technique matches each treated unit to all
possible control units with exactly the same values on all the covariates, forming subclasses
such that within each subclass all units (treatment and control) have the same covariate values.
On the lalonde dataset, you can run:
m.out <- matchit(treat ~ educ + black + hispan, data = lalonde, method = 'exact')
summary(m.out)
As a consequence, it discards the treatment observations that could not be matched exactly. Have a look at the other possibilities for method; maybe you will find something you like better.
That being said, be mindful not to discard too many treatment observations. If you do, you will make the treatment group look like the control group (instead of the opposite), which might lead to unwanted results.
You should look into the package designmatch, which implements a form of matching called cardinality matching that does what you want (i.e., find the largest matched set that yields the desired balance). Unlike MatchIt, designmatch doesn't use a distance variable; instead, it uses optimization to solve the matching problem. You select exactly how balanced you want each covariate to be, and it will do its best to solve the problem while retaining as many matches as possible. The methodology is described in Zubizarreta, Paredes, & Rosenbaum (2014).
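A rough sketch of cardinality matching with designmatch::bmatch, using the variable names from the question; it assumes the covariates are numeric (a factor like Gender would first need dummy coding), and the tolerances and solver settings are illustrative, so check the package documentation before relying on them:
library(designmatch)

d <- data[order(-data$Group), ]                    # designmatch expects treated units first
t_ind <- d$Group                                   # 1 = treatment, 0 = control
mom_covs <- as.matrix(d[, c("Age", "YearsEdu", "Income", "Gender")])
mom_tols <- 0.05 * apply(mom_covs, 2, sd)          # balance each covariate to within 0.05 SD

out <- bmatch(t_ind = t_ind,
              mom = list(covs = mom_covs, tols = mom_tols),
              solver = list(name = "glpk", approximate = 1,
                            t_max = 60, round_cplex = 0, trace = 0))

matched <- d[c(out$t_id, out$c_id), ]              # rows of the matched treated and control units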

Random Forest with classes that are very unbalanced

I am using random forests on a big-data problem, which has a very unbalanced response class, so I read the documentation and found the following parameters:
strata
sampsize
The documentation for these parameters is sparse (or I didn't have the luck to find it) and I really don't understand how to implement them. I am using the following code:
library(randomForest)

randomForest(x = predictors,
             y = response,
             data = train.data,
             mtry = lista.params[1],
             ntree = lista.params[2],
             na.action = na.omit,
             nodesize = lista.params[3],
             maxnodes = lista.params[4],
             sampsize = c(250000, 2000),
             do.trace = 100,
             importance = TRUE)
The response is a class with two possible values; the first one appears more frequently than the second (10000:1 or more).
lista.params is a list with different parameters (duh! I know...).
Well, the question (again) is: how can I use the 'strata' parameter? Am I using sampsize correctly?
And finally, sometimes I get the following error:
Error in randomForest.default(x = predictors, y = response, data = train.data, :
Still have fewer than two classes in the in-bag sample after 10 attempts.
Sorry if I am asking so many (and maybe stupid) questions ...
You should try using sampling methods that reduce the degree of imbalance from 1:10,000 down to 1:100 or 1:10. You should also reduce the size of the trees that are generated. (At the moment these are recommendations that I am repeating only from memory, but I will see if I can track down more authority than my spongy cortex.)
One way of reducing the size of trees is to set "nodesize" larger. With that degree of imbalance you might need the node size to be really large, say 5,000-10,000. Here's a thread on R-help:
https://stat.ethz.ch/pipermail/r-help/2011-September/289288.html
In the current state of the question you have sampsize = c(250000, 2000), whereas I would have thought something like sampsize = c(8000, 2000) was more in line with my suggestions. I think you are creating bootstrap samples that contain none of the class that has only 2000 cases.
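A minimal sketch along those lines, reusing the predictors/response names from the question; the exact counts and nodesize are illustrative rather than recommendations, and the sampsize order follows levels(response):
library(randomForest)

rf_fit <- randomForest(x = predictors, y = response,
                       strata = response,          # stratify the bootstrap on the class
                       sampsize = c(8000, 2000),   # per-class draws, roughly 4:1
                       nodesize = 5000,            # large terminal nodes -> smaller trees
                       ntree = 500, importance = TRUE, do.trace = 100)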
There are a few options.
If you have a lot of data, set aside a random sample of the data. Build your model on one set, then use the other to determine a proper cutoff for the class probabilities using an ROC curve.
You can also upsample the data in the minority class. The SMOTE algorithm might help (see the reference below and the DMwR package for a function).
You can also use other techniques. rpart() and a few other functions can allow different costs on the errors, so you could favor the minority class more. You can bag this type of rpart() model to approximate what random forest is doing.
ksvm() in the kernlab package can also use unbalanced costs (but the probability estimates are no longer good when you do this). Many other packages have arguments for setting the priors. You can also adjust this to put more emphasis on the minority class.
One last thought: maximizing models based on accuracy isn't going to get you anywhere (you can get 99.99% off the bat). The caret package can tune models based on the Kappa statistic, which is a much better choice in your case.
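For instance, a minimal caret sketch, assuming a data frame train.data whose two-level factor response is named response (these names are illustrative):
library(caret)

ctrl <- trainControl(method = "cv", number = 5)
rf_kappa <- train(response ~ ., data = train.data,
                  method = "rf",
                  metric = "Kappa",      # select mtry by Kappa rather than Accuracy
                  trControl = ctrl)
rf_kappa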
Sorry, I don't know how to post a comment on the earlier answer, so I'll create a separate answer.
I suppose the problem is caused by the high imbalance of your dataset (too few cases of one of the classes are present). For each tree in RF, the algorithm creates a bootstrap sample, which is the training set for that tree. If you have too few examples of one of the classes, the bootstrap sampling may select examples of only one class (the majority class), and a tree cannot be grown from examples of only one class. It seems there is a limit of 10 unsuccessful sampling attempts.
So DWin's proposition to reduce the degree of imbalance to lower values (1:100 or 1:10) is the most reasonable one.
I'm pretty sure I disagree with the idea of removing observations from your sample.
Instead, you might consider using a stratified sample to set a fixed percentage of each class each time it is resampled. This can be done with the caret package. This way you will not be omitting observations by reducing the size of your training sample. It will not allow you to over-represent your classes, but will make sure that each subsample is representative.
Here is an example I found:
library(caret)   # caret will call randomForest for method = "rf"

len_pos <- nrow(example_dataset[example_dataset$target == 1, ])
len_neg <- nrow(example_dataset[example_dataset$target == 0, ])

train_model <- function(training_data, labels, model_type, ...) {
  experiment_control <- trainControl(method = "repeatedcv",
                                     number = 10,
                                     repeats = 2,
                                     classProbs = TRUE,
                                     summaryFunction = custom_summary_function)  # user-defined summary function
  train(x = training_data,
        y = labels,
        method = model_type,
        metric = "custom_score",  # metric reported by the user-defined summary function
        trControl = experiment_control,
        verbose = FALSE,
        ...)
}

# strata refers to which feature to do stratified sampling on.
# sampsize refers to the size of the bootstrap samples to be taken from each class.
# These samples will be taken as input for each tree.
fit_results <- train_model(example_dataset,
                           as.factor(sprintf("c%d", as.numeric(example_dataset$target))),
                           "rf",
                           tuneGrid = expand.grid(mtry = c(3, 5, 10)),
                           ntree = 500,
                           strata = as.factor(example_dataset$target),
                           sampsize = c('1' = as.integer(len_pos * 0.25),
                                        '0' = as.integer(len_neg * 0.8)))
