I have just done oversampling on my dataset using SMOTE, from the DMwR package.
My dataset has two classes. The original distribution is 12 vs 62, so I coded the oversampling like this:
newData <- SMOTE(Score ~ ., data, k = 3, perc.over = 400, perc.under = 150)
Now, the distribution is 60 vs 72.
However, when I display the 'newData' dataset, I see that SMOTE has performed the oversampling and some samples are repeated.
For example, sample number 24 appears as 24.1, 24.2 and 24.3.
Is this correct? It directly affects classification, because the classifier will learn a model from data that will also be present in the test set, which is not legitimate in classification.
Edit:
I think I didn't explain my issue correctly:
As you know, SMOTE is an oversampling technique. It creates new samples from the original ones by modifying the feature values. However, when I display the new data generated by SMOTE, I get this:
(these are the feature values)
Sample 50: 1.8787547 0.19847987 -0.0105946940 4.420207 4.660536 1.0936388 0.5312777 0.07171645 0.008043167
Sample 50.1: 1.8787547 0.19847987 -0.0105946940 4.420207 4.660536 1.0936388 0.5312777 0.07171645
Sample 50 belongs to the original dataset. Sample 50.1 is the 'artificial' sample generated by SMOTE. However (and this is my issue), SMOTE has created a repeated sample instead of creating an artificial one by slightly modifying the feature values.
I hope you can understand me.
Thanks!
SMOTE is an algorithm that generates synthetic examples of a given class (the minority class) to handle imbalanced distributions. This strategy for generating new data is then combined with random under-sampling of the majority class. When you use SMOTE in the DMwR package you need to specify both an over-sampling percentage and an under-sampling percentage. These values must be set carefully because the resulting class distribution may remain imbalanced.
In your case, given the parameters you set (the under- and over-sampling percentages), SMOTE will end up introducing replicas of some of your examples.
Your initial class distribution is 12 to 62, and after applying SMOTE you end up with 60 to 72. This means that the minority class was over-sampled and new synthetic examples of this class were produced.
However, your majority class, which had 62 examples, now contains 72! The under-sampling percentage was applied to this class but it actually increased the number of examples. Since the number of examples to select from the majority class is determined by the number of new minority examples, the number sampled from this class was larger than the number already there.
Therefore, you had 62 examples and the algorithm tried to randomly select 72! This means that some replicas of the majority class examples were introduced.
So, to explain the over-sampling and under-sampling you selected:
12 examples in the minority class with 400% over-sampling gives 12*400/100 = 48. So 48 new synthetic examples were added to the minority class (12+48 = 60, the final number of minority examples).
The number of examples to select from the majority class is 48*150/100 = 72.
But the majority class only has 62, so replicas are necessarily introduced.
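To make the arithmetic concrete, here is the same calculation in R (just the numbers from your question, not DMwR internals):

n_min <- 12                    # minority class size
n_maj <- 62                    # majority class size
n_new <- n_min * 400 / 100     # perc.over = 400 -> 48 new synthetic minority examples
n_min + n_new                  # 60 minority examples after SMOTE
n_new * 150 / 100              # perc.under = 150 -> 72 examples drawn from the 62 majority examples, hence replicas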
I'm not sure about the implementation of SMOTE in DMwR, but it should be safe for you to round the new data to the nearest integer value. One guess is that this is left for you to do in the off chance that you want to do regression instead of classification; otherwise, if you wanted regression and SMOTE returned integers, you would have unintentionally lost information by going in the opposite direction (SMOTE -> integers -> reals).
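A minimal sketch of that rounding step, assuming 'Score' is the class column from the question and every other column should go back to integers (adjust the column selection to your data):

int_cols <- setdiff(names(newData), "Score")            # hypothetical: assumes all non-class columns were integer-valued
newData[int_cols] <- lapply(newData[int_cols], round)   # round the synthetic values to the nearest integer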
If you are not familiar with what SMOTE does: it creates 'new data' by looking at nearest neighbors to establish a neighborhood and then sampling from within that neighborhood. It is usually done when there is insufficient data for a given class in a classification problem. It operates on the assumption that data near your data is similar because of proximity.
Alternatively, you can use Weka's implementation of SMOTE, which does not make you do this additional work.
SMOTE is a very simple algorithm for generating synthetic samples. However, before you go ahead and start using it, you have to understand your features; for example, whether each of your features should vary in the same way, and so on.
Simply put: before you do, try to understand your data!
In a paper under review, I have a very large dataset with a relatively small number of imputations. The reviewer asked me to report how many nodes were in the tree I generated using the CART method within MICE. I don't know why this is important, but after hunting around for a while, my own interest is piqued.
Below is a simple example using this method to impute a single value. How many nodes are in the tree that the missing value is being chosen from? And how many members are in each node?
data(whiteside, package = "MASS")
data <- whiteside
data[1, 2] <- NA                                # make the Temp value of the first row missing
library(mice)
impute <- mice(data, m = 100, method = "cart")  # 100 imputations using CART
impute2 <- complete(impute, "long")
I guess whiteside is only used as an example here, so your actual data looks different.
I can't easily get the number of nodes for the trees generated in mice. The first problem is that it isn't just one tree ... as the package name says, mice stands for Multivariate Imputation by Chained Equations. That means you are sequentially creating multiple CART trees, and each incomplete variable is imputed by a separate model.
From the mice documentation:
The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model.
If you really want to get numbers of nodes for each used model, you probably would have to adjust the mice package itself and add some logging there.
Here is how you might approach this:
Calling impute <- mice(data, m = 100, method = "cart") gives you an S3 object of class mids that contains information about the imputation (but not the number of nodes of each tree).
But you can call impute$formulas, impute$method and impute$nmis to get some more information about which formulas were used and which variables actually had missing values.
From the mice.impute.cart documentation you can see, that mice uses rpart internally for creating the classification and regression trees.
Since the mids object does not contain information about the fitted trees, I'd suggest you use rpart manually with the formula from impute$formulas.
Like this:
library("rpart")
rpart(Temp ~ 0 + Insul + Gas, data = data)
This will print / give you the nodes/tree. It wouldn't really be the tree used in mice. As I said, mice means multiple chained equations / multiple models after each other - meaning multiple, possibly different, trees after each other (take a look at the algorithm description at https://stefvanbuuren.name/fimd/sec-cart.html for the univariate missingness case with CART). But it could at least be an indicator of whether applying rpart to your specific data provides a useful model and thus leads to good imputation results.
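If all you need are node counts from such a fit, the rpart object already stores them in its frame component (one row per node), so something along these lines should work:

fit <- rpart(Temp ~ 0 + Insul + Gas, data = data)
nrow(fit$frame)                   # total number of nodes in the tree
sum(fit$frame$var == "<leaf>")    # number of terminal nodes (leaves)
fit$frame$n                       # number of observations falling into each node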
I have been working on a survey of 10K customers who have been segmented into several customer segments. Due to the nature of the respondents who actually completed the survey, the researcher who did the qualitative work applied case weights (also known as probability weights) and supplied the data to me with every customer assigned one of 8 class labels. So we have a multi-class problem which, of course, is highly imbalanced.
One approach I have taken is to decompose these classes into pairwise models which all contribute to a final vote. Now my question is twofold:
I am using the wonderful SMOTE algorithm to balance each model and address the class imbalance problem. However, as each customer record has an associated case weight, SMOTE treats each customer equally. After applying SMOTE the classes now appear to be balanced, but if you take the respective case weights into account they actually aren't.
My second question relates to my strategy. Should I simply not worry about my case weights and build my classification model on the raw unweighted data, even though it doesn't represent the total customer base that I want to classify into each segment?
I have been using the R caret package to build these multiple binary classifiers.
Regards
I'm trying out a machine learning task (binary classification) using caret and was wondering if there is a way to incorporate information about "uncertain" class, or to weight the classes differently.
As an illustration, I've cut and pasted some of the code from the caret homepage working with the Sonar dataset (placeholder code - could be anything):
library(mlbench)
testdat <- get(data(Sonar))
set.seed(946)
testdat$Source <- as.factor(sample(c(LETTERS[1:6], LETTERS[1:3]), nrow(testdat), replace = TRUE))
yielding:
summary(testdat$Source)
A B C D E F
49 51 44 17 28 19
after which I would continue with a typical train, tune and test routine once I decide on a model.
What I've added here is another factor column giving a Source, i.e. where the corresponding "Class" came from. As an arbitrary example, say these were 6 different people who made their designation of "Class" using slightly different methods, and I want to put greater importance on A's classification method than B's, but less than C's, and so forth.
The actual data are something like this, where there are class imbalances both among the true/false, M/R, or whatever class, and among these Sources. From the vignettes and examples I have found, I would address at least the former by using a metric like ROC during tuning, but I'm not sure how to even incorporate the latter. Two ideas I've had:
- separating the original data by Source and cycling through the factor levels one at a time, using the current level to build a model and the rest of the data to test it
- instead of classification, turning it into a hybrid classification/regression problem, where I use the ranks of the sources as what I want to model. If A is considered best, then an "A positive" would get a score of +6, an "A negative" a score of -6, and so on. Then perform a regression fit on these values, ignoring the Class column.
Any thoughts? Every search I conduct on classes and weights seems to reference the class imbalance issue, but assumes that the classification itself is perfect (or a standard on which to model). Or is it simply inappropriate to try to incorporate that information, and I should just include everything and ignore the source? A potential issue with the first plan is that the smaller sources account for only a few hundred instances, versus over 10,000 for the larger sources, so I might also be concerned that a model built on a smaller set wouldn't generalize as well as one based on more data. Any thoughts would be appreciated.
There is no difference between weighting "because of importance" and weighting "because of imbalance". These are exactly the same setting; they both refer to "how strongly should I penalize the model for misclassifying a sample from a particular class". Thus you do not need any regression (and should not do it! this is a perfectly well stated classification problem, and you are simply overthinking it), just provide sample weights, that's all. There are many models in caret accepting this kind of setting, including glmnet, glm, cforest etc. If you want to use an SVM you should change package (as ksvm does not support such things), for example to https://cran.r-project.org/web/packages/gmum.r/gmum.r.pdf (for sample or class weighting) or https://cran.r-project.org/web/packages/e1071/e1071.pdf (for class weighting).
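For example, with the Sonar toy data from the question, passing per-observation weights that encode how much you trust each Source could look roughly like this (the weight values are arbitrary illustrations, and glmnet is just one of the caret models mentioned above that accepts case weights):

library(caret)
source_wts <- c(A = 3, B = 1, C = 5, D = 2, E = 2, F = 1)   # hypothetical trust level per Source
w <- source_wts[as.character(testdat$Source)]               # one weight per observation
fit <- train(Class ~ . - Source, data = testdat,
             method = "glmnet",
             weights = w,                                    # used only by models that accept case weights
             metric = "ROC",
             trControl = trainControl(method = "cv", classProbs = TRUE,
                                      summaryFunction = twoClassSummary))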
I'm working with a large data set, so I hope to remove extraneous variables and tune for an optimal m variables per split. In R, there are two functions, rfcv and tuneRF, that help with these two tasks. I'm attempting to combine them to optimize parameters.
rfcv works roughly as follows:
create random forest and extract each variable's importance;
while (nvar > 1) {
    remove the k (or k%) least important variables;
    run random forest with remaining variables, reporting cverror and predictions
}
Presently, I've recoded rfcv to work as follows:
create random forest and extract each variable's importance;
while (nvar > 1) {
    remove the k (or k%) least important variables;
    tune for the best m for the reduced variable set;
    run random forest with remaining variables, reporting cverror and predictions;
}
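In R, the modified loop might look roughly like this (a sketch, not the actual rfcv internals; it assumes a classification problem with a placeholder predictor data frame x and factor response y):

library(randomForest)
vars <- colnames(x)
while (length(vars) > 1) {
  # drop the least important half of the remaining variables
  rf   <- randomForest(x[, vars, drop = FALSE], y, ntree = 500)
  imp  <- importance(rf)[, "MeanDecreaseGini"]
  vars <- names(sort(imp, decreasing = TRUE))[seq_len(ceiling(length(vars) / 2))]
  if (length(vars) < 2) break
  # tune mtry on the reduced variable set
  res       <- tuneRF(x[, vars, drop = FALSE], y, ntreeTry = 200,
                      stepFactor = 1.5, improve = 0.01, trace = FALSE, plot = FALSE)
  mtry_best <- res[which.min(res[, "OOBError"]), "mtry"]
  # refit with the tuned mtry; record the OOB / cross-validated error here
  rf_tuned <- randomForest(x[, vars, drop = FALSE], y, mtry = mtry_best, ntree = 500)
}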
This, of course, increases the run time by an order of magnitude. My question is how necessary this is (it's been hard to get an idea using toy datasets), and whether any other way could be expected to work roughly as well in far less time.
As always, the answer is it depends on the data. On one hand, if there aren't any irrelevant features, then you can just totally skip feature elimination. The tree building process in the random forest implementation already tries to select predictive features, which gives you some protection against irrelevant ones.
Leo Breiman gave a talk where he introduced 1000 irrelevant features into some medical prediction task that had only a handful of real features from the input domain. When he eliminated 90% of the features using a single filter on variable importance, the next iteration of random forest didn't pick any irrelevant features as predictors in its trees.
I am new to random forest classifier. I am using it to classify a dataset that has two classes.
- The number of features is 512.
- The proportion of the data is 1:4; i.e., 75% of the data is from the first class and 25% from the second.
- I am using 500 trees.
The classifier produces an out-of-bag error of 21.52%.
The per-class error for the first class (which is represented by 75% of the training data) is 0.0059, while the classification error for the second class is really high: 0.965.
I am looking for an explanation of this behaviour, and for suggestions on how to improve the accuracy for the second class.
I am looking forward to your help.
Thanks
I forgot to say that I'm using R and that I used a nodesize of 1000 in the above test.
Here I repeated the training with only 10 trees and nodesize = 1 (just to give an idea); below is the function call in R and the confusion matrix:
randomForest(formula = Label ~ ., data = chData30PixG12, ntree = 10, importance = TRUE, nodesize = 1, keep.forest = FALSE, do.trace = 50)
Type of random forest: classification
Number of trees: 10
No. of variables tried at each split: 22
OOB estimate of error rate: 24.46%
Confusion matrix:
           Irrelevant Relevant class.error
Irrelevant      37954     4510   0.1062076
Relevant         8775     3068   0.7409440
I agree with #usr that, generally speaking, when you see a Random Forest simply classifying (nearly) every observation as the majority class, it means that your features don't provide much information to distinguish the two classes.
One option is to run the Random Forest such that you over-sample observations from the minority class (rather than sampling with replacement from the entire data set). So you might specify that each tree is built on a sample of size N where you force N/2 of the observations to come from each class (or some other ratio of your choosing).
While that might help some, it is by no means a cure-all. It might be more likely that you'll get more mileage out of finding better features that do a good job of distinguishing the classes than out of tweaking the RF settings.
I'm surprised nobody has mentioned using the 'classwt' parameter. The weighted random forest (WRF) is specifically designed to fix this kind of problem.
See here: Stack Exchange question #1
And here: Stack Exchange question #2
Article on weighted random forests: PDF
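As a rough illustration on the data from the question (the weights are arbitrary, not tuned; classwt is documented as class priors in the randomForest package):

library(randomForest)
# order of classwt follows levels(chData30PixG12$Label): Irrelevant, Relevant
rf_wrf <- randomForest(Label ~ ., data = chData30PixG12, ntree = 500,
                       classwt = c(1, 4))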
Well, this is the typical class imbalance problem. Random forest is the type of classifier that aims at maximising the accuracy of the model. When one class accounts for the majority of the data, the easiest way for the classifier to achieve high accuracy is to classify all observations into the majority class. This gives a very high accuracy, 0.75 in your case, but a bad model - almost no correct classifications for the minority class. There are many ways of handling this. The easiest is to under-sample the majority class to balance the data and then train the model on this balanced data. Hope this helps.
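A minimal sketch of that under-sampling idea, using the class labels from the confusion matrix above:

library(randomForest)
set.seed(1)
maj <- chData30PixG12[chData30PixG12$Label == "Irrelevant", ]
rel <- chData30PixG12[chData30PixG12$Label == "Relevant", ]
balanced <- rbind(rel, maj[sample(nrow(maj), nrow(rel)), ])     # keep all minority rows, draw an equal number of majority rows
rf_bal <- randomForest(Label ~ ., data = balanced, ntree = 500)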
You can try to balance the error rates using sampsize = c(500, 500) (i.e. each tree will be built from 500 observations of each class, avoiding the problem of unbalanced errors; you can change the numbers, of course). Also, a node size as big as 1000 will probably make the trees really small (using only a few variables in each one). You don't want to overtrain on the training set too much (even though the RF model takes care of that), but you do want each tree to use at least some of the variables.
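In R, that suggestion looks roughly like this (500 per class as suggested above; adjust to your data):

library(randomForest)
rf_strat <- randomForest(Label ~ ., data = chData30PixG12, ntree = 500,
                         strata = chData30PixG12$Label,
                         sampsize = c(500, 500))   # each tree is grown from 500 observations of each class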
If you show the code that produced such a bad classification, it would be useful. Right now I see one reason for the bad performance: nodesize = 1000 is too big a value. How many observations are in your dataset? Try the default value of nodesize, or set it to a much smaller value.
Looks like the classifier failed completely to find structure in the data. The best it could do was to classify everything as class 1 because that is the most frequent class.