SMOTE generates duplicate samples for majority sample - r

I'm quite new to R and was trying out some examples using existing SMOTE packages. I started with the performanceEstimation package and followed its sample code for SMOTE. Below is the code for reference:
## A small example with a data set created artificially from the IRIS
## data
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa","rare","common"))
## checking the class distribution of this artificial data set
table(data$Species)
## now using SMOTE to create a more "balanced problem"
newData <- smote(Species ~ ., data, perc.over = 6,perc.under=1)
table(newData$Species)
## Checking visually the created data
## Not run:
par(mfrow = c(1, 2))
plot(data[, 1], data[, 2], pch = 19 + as.integer(data[, 3]),
main = "Original Data")
plot(newData[, 1], newData[, 2], pch = 19 + as.integer(newData[,3]),
main = "SMOTE'd Data")
## End(Not run)
Link: https://cran.r-project.org/web/packages/performanceEstimation/performanceEstimation.pdf
In the result for newData, I noticed that some majority-class samples are generated as duplicates. Below is how the results look:
..
      Sepal.Length Sepal.Width Species
146            6.7         3.0  common
146.1          6.7         3.0  common
106            7.6         3.0  common
60             5.2         2.7  common
107            4.9         2.5  common
107.1          4.9         2.5  common
107.2          4.9         2.5  common
The first column here is the row index that you see when you print newData or click on the newData variable in the Environment tab in RStudio. Just a note: the table above is only a snippet picked from the result, and common is the recoded majority class in this artificial data set.
So my questions are:
Why does SMOTE generate duplicate samples for the majority class (common)?
Will these duplicate samples affect the accuracy of the classification model?
As I understand it, SMOTE undersamples the majority class and oversamples the minority class, and the oversampling step generates synthetic samples. The last three rows in the table above are duplicates.
If you run the code, you will see the duplicated rows indexed with decimals. I've tried to search around, but I couldn't find a similar question in any of the forums. Another point: I've tried other packages and obtained similar results.
Thank you so much in advance!

To the first question: if your goal is to achieve a balanced dataset, you do not have to oversample the majority class. For example, if class 1 has 100 cases and class 2 has 50 (as in your iris example), you can oversample class 2 from 50 to 100 and leave class 1 unchanged. Using performanceEstimation::smote you can do:
newData <- performanceEstimation::smote(Species ~ ., data, perc.over = 1, perc.under = 2, k = 10)
table(newData$Species)
This results in 100 rare class and 100 common class.
No matter whether you balance by undersampling, oversampling, or both: balancing will generally lead to better results if you are especially interested in the minority class (as in fraud detection, outlier detection, side effects of drugs in patients, etc.). Otherwise your model is biased towards the majority class, leading to low accuracy and precision for the minority class you are interested in.
To your second question: note the k parameter I used, taken from the package documentation. k determines how many nearest neighbours are used for a kind of interpolation. If you look into newData, you will notice that many of the oversampled cases of the MINORITY (rare) class are interpolated, unlike the majority (common) class cases you showed in your table, which are duplicates because they are sampled with replacement. The neighbourhood used for this interpolation is determined by k:
301 4.660510 3.278980 rare
311 4.757001 3.142999 rare
321 5.432725 3.432725 rare
You can see the interpolation from the fact that those values have many decimal places, unlike the one-decimal values in the original data set.
I suggest you play around with the parameter k and inspect the plot (as you also did) before and after smote.
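To make the interpolation idea concrete, here is an illustrative sketch (my own toy code, not the internals of performanceEstimation::smote) of how a single synthetic minority case is typically built, reusing the data object from the question: pick a minority case, pick one of its k nearest minority neighbours, and place a new point a random fraction of the way between them.
# Illustrative sketch of the SMOTE interpolation step (not the package internals).
set.seed(1)
rare <- as.matrix(data[data$Species == "rare", 1:2])   # minority cases, numeric columns only

x <- rare[1, ]                                          # one minority case
k <- 5
d <- sqrt(rowSums((rare[-1, , drop = FALSE] -
                   matrix(x, nrow(rare) - 1, 2, byrow = TRUE))^2))
neighbour <- rare[-1, , drop = FALSE][order(d)[sample(k, 1)], ]  # one of the k nearest minority cases

synthetic <- x + runif(1) * (neighbour - x)             # point on the segment between the two
synthetic                                               # many decimal places, like the interpolated rare rows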
One note: the other SMOTE implementation, DMwR::SMOTE, has the same parameters and functional structure, shares the same logic for the perc.over and perc.under parameters, and leads to almost the same result. Here is the same data example, which should again lead to a similarly balanced 100/100 iris data set:
newData <- DMwR::SMOTE(Species ~ ., data, perc.over = 100, perc.under = 200, k = 10)
table(newData$Species)
Note that perc.over and perc.under follow the same logic, but in DMwR they should be read as 100% and 200% (whereas in the performanceEstimation package 1 and 2 mean the same thing).
One final note on the smotefamily package: I can only recommend it if you want to do DBSMOTE or ADASYN, which it handles without hassle. But I cannot recommend smotefamily::SMOTE, because the syntax is really a pain: it requires a numeric-attributed data frame or matrix. The example in its documentation uses a data generator that "magically" produces the correct object, but leaves the reader alone in reproducing it with their own data. That is just a side note, to cover all three packages DMwR, performanceEstimation, and smotefamily in one answer :)
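If you do want to try it anyway, here is a rough sketch of how smotefamily::SMOTE can be called on the same artificial data; it assumes the current smotefamily interface, where the numeric predictors and the target are passed separately and the combined result lives in the $data element:
library(smotefamily)

# Same artificial data as above: numeric predictors only, target as a separate vector.
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))

# dup_size controls how many synthetic cases are generated per minority case;
# 0 (the default) asks the function to roughly balance the classes.
sm <- SMOTE(X = data[, 1:2], target = as.character(data$Species), K = 5, dup_size = 0)
table(sm$data$class)   # original plus synthetic rows; the class column is named "class"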

Related

Why does the importance parameter influence performance of Random Forest in R?

When using random forests in R I came across the following situation:
library(randomForest)
set.seed(42)
data(iris)
rf_noImportance <- randomForest(Species~.,data=iris,ntree=100,importance=F)
print(table(predict(rf_noImportance),iris$Species))
Output:
setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47
and
library(randomForest)
set.seed(42)
data(iris)
rf_importance <- randomForest(Species~.,data=iris,ntree=100,importance=T)
print(table(predict(rf_importance),iris$Species))
Output:
setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 4
virginica 0 3 46
In the first example I set importance = FALSE and in the second example TRUE. From my understanding this should not affect the resulting prediction. There's also no indication for that behavior in the documentation.
According to the Cross Validated thread Do proximity or importance influence predictions by a random forest?, the importance flag should not influence the predictions, but it clearly does in the example above.
So why is the importance parameter of the randomForest method influencing the performance of the model?
This is a nice example for demonstrating the limits of reproducibility; in short:
Fixing the random seed does not make results comparable if the experiments invoke the random number generator in different ways.
Let's see why this is so here...
All reproducibility studies are based on a strong implicit assumption: all other things being equal. If a change between two experiments invalidates this assumption, we cannot expect reproducibility in the deterministic sense you are seeking here (we may of course still expect reproducibility in the statistical sense, but that is not the issue here).
The difference between the two cases you present here (calculating feature importance or not) is really subtle; to see why it actually violates the principle stated above, we have to dig a little, both in the documentation as well as in the source code.
The documentation of the RF importance function already provides a strong hint (emphasis mine):
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data [...] Then the same is done after permuting each predictor variable.
You may have already started becoming suspicious; such data permutations are normally performed in a random fashion, hence potentially invoking the random number generator (RNG) for one extra process when we use importance=TRUE, a process that is absent in the importance=FALSE case.
In other words: if in the importance=TRUE case the RNG is involved in a way that is absent from the importance=FALSE case, then from the first time this happens in the program the two cases stop being deterministically comparable, irrespective of the common random seed.
At this point, this may be a strong hint, but it is still only a speculation; after all, and in principle, permutations can be performed deterministically, i.e. without involving any random processes (and hence the RNG). Where is the smoking gun?
Turns out that the smoking gun exists indeed, buried in the underlying C source code used by the randomForest R package; here is the relevant part of the C function permuteOOB:
/* Permute tp */
last = nOOB;
for (i = 0; i < nOOB; ++i) {
    k = (int) last * unif_rand();
    tmp = tp[last - 1];
    tp[last - 1] = tp[k];
    tp[k] = tmp;
    last--;
}
where we can clearly see the function unif_rand() (i.e. the RNG) being invoked at line #4 of the snippet (here in the source code), in a method called only when we ask for importance=TRUE, and not in the opposite case.
Arguably, and given that RF is an algorithm in which randomness (and hence RNG use) enters at many points, this should be enough evidence as to why the two cases you present are indeed not expected to be identical: the different use of the RNG makes the actual results diverge. On the other hand, a difference of a single misclassification among 150 samples should provide enough reassurance that the two cases are still statistically similar. It should also be apparent that this subtle implementation issue (the RNG involvement) does not contradict the expectation that the two results should be equal in theory.
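The general mechanism is easy to see in isolation with a minimal R sketch (nothing randomForest-specific here): a single extra RNG call after set.seed() changes everything drawn afterwards, even though the seed is identical.
set.seed(42)
a <- rnorm(3)        # analogue of importance = FALSE: no extra RNG use

set.seed(42)
invisible(runif(1))  # analogue of importance = TRUE: one extra RNG call (the permutation)
b <- rnorm(3)

identical(a, b)      # FALSE: same seed, but the RNG streams have diverged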

Create balanced dataset 1:1 using SMOTE without modifying the observations of the majority class in R

I am working on a binary classification problem for which I have an unbalanced dataset. I want to create a new, more balanced dataset with 50% of observations in each class. For this, I am using the SMOTE algorithm provided by the DMwR library in R.
In the new dataset, I want to keep constant the observations of the majority class.
However, I meet two problems:
SMOTE reduces or increases the number of observations of the majority class (I only want to increase the number of observations of the minority class).
Some observations generated by SMOTE contain NA values.
Let's assume that I have 20 observations: 17 in the majority class and only 3 in the minority class. Here is my code:
library(DMwR)
library(dplyr)
sample_data <- data.frame(matrix(rnorm(200), nrow=20))
sample_data[1:17,"X10"] <- 0
sample_data[18:20,"X10"] <- 1
sample_data[,ncol(sample_data)] <- factor(sample_data[,ncol(sample_data)], levels = c('1','0'), labels = c('Yes','No'))
newDataSet <- SMOTE(X10 ~., sample_data, perc.over = 400, perc.under = 100)
In my code, I set perc.over = 400 to create 12 new observations of the minority class, and perc.under = 100 expecting the majority class to remain unchanged.
However, when I check newDataSet, I see that SMOTE has reduced the number of majority-class observations from 17 to 12. In addition, some generated observations have NA values.
According to ?SMOTE:
for each case in the original data set belonging to the minority class,
perc.over/100 new examples of that class will be created.
Moreover:
For instance, if 200 new examples were generated for the minority
class, a value of perc.under of 100 will randomly select exactly 200
cases belonging to the majority classes from the original data set to
belong to the final data set.
Therefore, in your case you are:
creating 12 new Yes (besides the original ones).
randomly selecting 12 No.
The new Yes rows containing NA might be related to the k parameter of SMOTE. According to ?SMOTE:
k: A number indicating the number of nearest neighbours that are used
to generate the new examples of the minority class.
Its default value is 5, but in your original data you have only 3 Yes. Setting k = 2 seems to solve this issue.
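As a concrete illustration of that fix, reusing sample_data from the question (with only 3 minority cases, each one has just 2 other minority neighbours, so k cannot exceed 2):
# Same call as in the question, with k lowered so that enough minority
# neighbours exist and no NA rows are produced.
newDataSet <- SMOTE(X10 ~ ., sample_data, perc.over = 400, perc.under = 100, k = 2)
table(newDataSet$X10)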
A final comment: to achieve your goal I would use SMOTE only to increase the number of observations of the minority class (with perc.over = 400 or 500). Then you can combine them with the original observations of the majority class, as sketched below.
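A minimal sketch of that approach, reusing sample_data from the question; perc.under = 0 drops all majority rows from SMOTE's output, so the original majority observations can be added back untouched:
library(DMwR)

# Step 1: oversample the minority class only (original + synthetic Yes rows);
# with perc.under = 0, SMOTE carries over no majority rows.
minority_boosted <- SMOTE(X10 ~ ., sample_data, perc.over = 400, perc.under = 0, k = 2)

# Step 2: add back the untouched majority class from the original data.
majority_orig <- sample_data[sample_data$X10 == "No", ]
balanced <- rbind(majority_orig, minority_boosted)

table(balanced$X10)  # 17 No vs. 3 + 12 = 15 Yes with perc.over = 400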

Fuzzy C-Means Clustering in R

I am performing Fuzzy Clustering on some data. I first scaled the data frame so each variable has a mean of 0 and sd of 1. Then I ran the clValid function from the package clValid as follows:
library(dplyr)
library(clValid)   # clValid() lives here
df <- iris[, -5]   # I do not use iris, but to make reproducible
clust <- sapply(df, scale)
intvalid <- clValid(clust, 2:10, clMethods = c("fanny"),
                    validation = "internal", maxitems = 1000)
The results told me 4 would be the best number of clusters. Therefore I ran the fanny function from the cluster package as follows:
library(cluster)   # fanny() lives here
library(plyr)      # ddply() is in plyr, not dplyr
res.fanny <- fanny(clust, 4, metric = 'SqEuclidean')
res.fanny$coeff
res.fanny$k.crisp
df$fuzzy <- res.fanny$clustering
profile <- ddply(df, .(fuzzy), summarize,
                 count = length(fuzzy))
However, looking at the profile, I only have 3 clusters instead of 4. How is this possible? Should I go with 3 clusters instead of 4? How do I explain this? I do not know how to recreate my data here because it is quite large. Has anybody else encountered this before?
This is an attempt at an answer, based on limited information, and it may not fully address the questioner's situation; it sounds like there may be other issues. In chat they indicated that they had encountered additional errors that I cannot reproduce. fanny will calculate and assign items to "crisp" clusters based on a metric. It will also produce a matrix showing the fuzzy clustering assignment, which may be accessed via membership.
The issue the questioner described can be recreated by increasing the memb.exp parameter using the iris data set. Here is an example:
library(plyr)
library(clValid)
library(cluster)
df<-iris[,-5] # I do not use iris, but to make reproducible
clust<-sapply(df,scale)
res.fanny <- fanny(clust, 4, metric='SqEuclidean', memb.exp = 2)
Calling res.fanny$k.crisp shows that this produces 4 crisp clusters.
res.fanny14 <- fanny(clust, 4, metric='SqEuclidean', memb.exp = 14)
Calling res.fanny14$k.crisp shows that this produces 3 crisp clusters.
One can still access the membership of each of the 4 clusters using res.fanny14$membership.
If you have a good reason to think there should be 4 crisp clusters, you could reduce the memb.exp parameter, which would tighten up the cluster assignments. Alternatively, if you are doing some sort of supervised learning, one procedure to tune this parameter would be to reserve some test data, do a hyperparameter grid search, and then select the value that produces the best result on your preferred metric. Without knowing more about the task, the data, or what the questioner is trying to accomplish, it is hard to suggest much more than this.
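A small sketch of that kind of sweep, using the scaled iris data from above and recording how many crisp clusters survive at each memb.exp value (the grid below is purely illustrative):
library(cluster)
clust <- sapply(iris[, -5], scale)

# For each candidate memb.exp, fit fanny with 4 requested clusters and record
# how many crisp clusters remain.
memb_grid <- c(1.5, 2, 4, 8, 14)
sapply(memb_grid, function(m)
  fanny(clust, 4, metric = "SqEuclidean", memb.exp = m)$k.crisp)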
First of all I encourage you to read the nice vignette of the clValid package.
The R package clValid contains functions for validating the results of a cluster analysis. There are three main types of cluster validation measures available. One of these measures is the Dunn index, the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. I focus on the Dunn index for simplicity. In general, connectivity should be minimized, while both the Dunn index and the silhouette width should be maximized.
clValid creators explicitly refer to the fanny function of the cluster package in their documentation.
The clValid package is useful for running several algorithms/metrics across a prespecified set of cluster numbers.
library(dplyr)
library(clValid)
iris
table(iris$Species)
clust <- sapply(iris[, -5], scale)
In my code I needed to increase the number of iterations to reach convergence (maxit = 1500).
Results are obtained with the summary function applied to the clValid object intvalid.
It seems that the optimal number of clusters is 2 (but that is not the main point here).
intvalid <- clValid(clust, 2:5, clMethods=c("fanny"),
maxit = 1500,
validation="internal",
metric="euclidean")
summary(intvalid)
The results for any method can be extracted from a clValid object for further analysis using the clusters method. Here the results of the 2-cluster solution are extracted (hc$`2`), with emphasis on Dunn's partition coefficient (hc$`2`$coeff). Of course these results relate to the "euclidean" metric of the clValid call.
hc <- clusters(intvalid, "fanny")
hc$`2`$coeff
Now, I simply call fanny from the cluster package using the euclidean metric and 2 clusters. The results are identical to those of the previous step.
res.fanny <- fanny(clust, 2, metric='euclidean', maxit = 1500)
res.fanny$coeff
Now, we can look at the classification table
table(hc$`2`$clustering, iris[,5])
setosa versicolor virginica
1 50 0 0
2 0 50 50
and at the profile (df is the data frame from the question, df <- iris[, -5], and ddply() comes from plyr):
df$fuzzy <- hc$`2`$clustering
profile <- ddply(df, .(fuzzy), summarize,
                 count = length(fuzzy))
profile
fuzzy count
1 1 50
2 2 100

How to balance unbalanced classification 1:1 with SMOTE in R

I am doing binary classification and my current target class is composed of:
Bad: 3126 Good:25038
So I want the number of Bad (minority) examples to equal the number of Good examples (1:1).
So Bad needs to increase roughly 8x (an extra 21912 SMOTEd instances) without increasing the majority class (Good). The code I have tried so far does not keep the number of Good constant.
Code I have tried:
Example 1:
library(DMwR)
smoted_data <- SMOTE(targetclass~., data, perc.over=700, perc.under=0, k=5, learner=NULL)
Example 1 output:
Bad:25008 Good:0
Example 2:
smoted_data <- SMOTE(targetclass~., data, perc.over=700, k=5, learner=NULL)
Example 2 output:
Bad: 25008 Good:43764
Example 3:
smoted_data <- SMOTE(targetclass~., data, perc.over=700, perc.under=100, k=5, learner=NULL)
Example 3 output:
Bad: 25008 Good: 21882
To achieve a 1:1 balance using SMOTE, you want to do this:
library(DMwR)
smoted_data <- SMOTE(targetclass~., data, perc.over=100)
I have to admit it doesn't seem obvious from the built-in documentation, but if you read the original documentation, it states:
The parameters perc.over and perc.under control the amount of
over-sampling of the minority class and under-sampling of the majority
classes, respectively.
perc.over will typically be a number above 100. For each case in the orginal data set belonging to the minority class, perc.over/100 new examples of that
class will be created. If perc.over is a value below 100 than a single
case will be generated for a randomly selected proportion (given by
perc.over/100) of the cases belonging to the minority class on the
original data set.
So when perc.over is 100, you are essentially creating 1 new example per existing minority case (100/100 = 1).
The default of perc.under is 200, and that is what you want to keep.
The parameter perc.under controls the proportion of
cases of the majority class that will be randomly selected for the
final "balanced" data set. This proportion is calculated with respect
to the number of newly generated minority class cases.
prop.table(table(smoted_data$targetclass))
# returns 0.5 0.5
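For what it's worth, the arithmetic behind that 1:1 result can be checked with the counts from the question (a plain calculation, not a SMOTE call):
n_bad <- 3126                               # minority class size from the question
n_synthetic <- (100 / 100) * n_bad          # perc.over = 100: one new case per minority case
n_bad_total <- n_bad + n_synthetic          # 6252 Bad in the result
n_good_kept <- (200 / 100) * n_synthetic    # perc.under = 200: two Good kept per synthetic case
c(Bad = n_bad_total, Good = n_good_kept)    # 6252 / 6252, i.e. 1:1
# Note: this keeps only 6252 of the original 25038 Good cases.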
You can try using the ROSE package in R.
A research article with example is available here
You should use a perc.under of 114.423, since (700/100) x 3126 x (114.423/100) = 25038.04.
But note that since SMOTE does random undersampling of the majority class, this way you would get new data with duplicates in the majority class. That is to say, your new data will have 25038 Good samples, but they are not the same 25038 Good samples as in the original data: some Good samples will not be included and some will be duplicated in the newly generated data.
I recommend the bimba package, which I am developing. It is not yet available on CRAN but you can easily install it from GitHub.
You can find instructions on how to install it on its github page:
https://github.com/RomeroBarata/bimba
The only restriction on the data for the SMOTE function implemented in bimba is that the predictors must be numeric and the target must be the last column of the data frame that holds your data and have only two values.
As long as your data abide by these restrictions, using the SMOTE function is easy:
library(bimba)
smoted_data <- SMOTE(data, perc_min = 50, k = 5)
Where perc_min specifies the desired percentage of the minority class after over-sampling (in that case perc_min = 50 balance the classes). Note that the majority class is not under-sampled as in the DMwR package.

how to use classwt in randomForest of R?

I have a highly imbalanced data set with target class instances in the following ratio 60000:1000:1000:50 (i.e. a total of 4 classes). I want to use randomForest for making predictions of the target class.
So, to reduce the class imbalance, I played with sampsize parameter, setting it to c(5000, 1000, 1000, 50) and some other values, but there was not much use of it. Actually, the accuracy of the 1st class decreased while I played with sampsize, though the improvement in other class predictions was very minute.
While digging through the archives, I came across two more features of randomForest(), strata and classwt, that are used to offset the class imbalance issue.
All the documents on classwt were old (generally from around 2007-2008), and they all suggested not to use the classwt feature of the randomForest R package because it does not completely implement the functionality of the original Fortran code. So the first question is:
Is classwt completely implemented now in randomForest package of R? If yes, what does passing c(1, 10, 10, 10) to the classwt argument represent? (Assuming the above case of 4 classes in the target variable)
Another argument that is said to offset the class imbalance issue is stratified sampling, which is always used in conjunction with sampsize. I understand what sampsize is from the documentation, but there is not enough documentation or examples giving clear insight into using strata to overcome the class imbalance issue. So the second question is:
What type of argument has to be passed to strata in randomForest, and what does it represent?
I guess the word weight, which I have not explicitly mentioned in the question, should play a major role in the answer.
classwt is correctly passed on to randomForest, check this example:
library(randomForest)
rf = randomForest(Species~., data = iris, classwt = c(1E-5,1E-5,1E5))
rf
#Call:
# randomForest(formula = Species ~ ., data = iris, classwt = c(1e-05, 1e-05, 1e+05))
# Type of random forest: classification
# Number of trees: 500
#No. of variables tried at each split: 2
#
# OOB estimate of error rate: 66.67%
#Confusion matrix:
# setosa versicolor virginica class.error
#setosa 0 0 50 1
#versicolor 0 0 50 1
#virginica 0 0 50 0
Class weights are the priors on the outcomes. You need to balance them to achieve the results you want.
On strata and sampsize this answer might be of help: https://stackoverflow.com/a/20151341/2874779
In general, sampsize with the same size for all classes seems reasonable. strata is a factor that is going to be used for stratified resampling; in your case you don't need to input anything.
You can pass a named vector to classwt.
But how the weights are calculated is very tricky.
For example, if your target variable y has two classes "Y" and "N", and you want to set balanced weights, you should do:
wn <- sum(y == "N") / length(y)
wy <- 1
Then set classwt = c("N" = wn, "Y" = wy).
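As an illustration (not the only possible convention), here is a rough sketch that builds a named classwt vector from inverse class frequencies on an artificially unbalanced iris subset; the exact weighting scheme is a modelling choice:
library(randomForest)

# Artificially unbalanced data: 50 setosa, 10 versicolor, 5 virginica.
unbalanced <- iris[c(1:50, 51:60, 101:105), ]

# One common convention: weights inversely proportional to class frequency.
freq <- table(unbalanced$Species) / nrow(unbalanced)
wts <- setNames(as.numeric(1 / freq), names(freq))  # named numeric vector, one entry per level

set.seed(1)
rf <- randomForest(Species ~ ., data = unbalanced, classwt = wts)
rf$confusion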
Alternatively, you may want to use the ranger package. It offers flexible construction of random forests, and specifying class/sample weights is easy. ranger is also supported by the caret package.
Random forests are probably not the right classifier for your problem as they are extremely sensitive to class imbalance.
When I have an unbalanced problem I usually deal with it using sampsize like you tried. However I make all the strata equal size and I use sampling without replacement.
Sampling without replacement is important here, as otherwise samples from the smaller classes will contain many more repetitions, and the class will still be underrepresented. It may be necessary to increase mtry if this approach leads to small samples, sometimes even setting it to the total number of features.
This works quite well when there are enough items in the smallest class. However, your smallest class has only 50 items; I doubt you would get useful results with sampsize=c(50,50,50,50).
Also classwt has never worked for me.
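For reference, here is a rough sketch of the equal-strata, no-replacement sampling described above, again on an artificially unbalanced iris subset (the class sizes and sampsize values are purely illustrative):
library(randomForest)

# Artificially unbalanced data: 50 / 10 / 5 cases per class.
unbalanced <- iris[c(1:50, 51:60, 101:105), ]

set.seed(1)
rf <- randomForest(Species ~ ., data = unbalanced,
                   strata = unbalanced$Species,  # stratify the bootstrap by class
                   sampsize = c(5, 5, 5),        # equal strata, limited by the smallest class
                   replace = FALSE)              # sampling without replacement
rf$confusion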
