SMOTE in R reducing sample size significantly

I have a data set with around 130,000 records. The records are divided into two classes of the target variable, 0 and 1, with class 1 making up only 0.09% of the total.
I'm running my analysis in R 3.5.1 on Windows 10. I used the SMOTE algorithm to work with this imbalanced data set.
I used the following code to handle the imbalanced data set:
library(DMwR)
# Converted to factor, as SMOTE works with the factor data type
data_code$target <- as.factor(data_code$target)
smoted_data <- SMOTE(target ~ ., data_code, perc.over = 100)
But after executing the code, I see that the count for class 0 is 212 and for class 1 is also 212, which is a significant reduction of my sample size. Can you suggest how I can handle this imbalanced data set with SMOTE without reducing my data size?

You need to play a bit with the two parameters available in the function: perc.over and perc.under.
As per the SMOTE documentation:
The parameters perc.over and perc.under control the amount of
over-sampling of the minority class and under-sampling of the majority
classes, respectively.
So:
perc.over will typically be a number above 100. With this type of
values, for each case in the original data set belonging to the
minority class, perc.over/100 new examples of that class will be
created
I can't see your data but, if your minority class has 100 cases and perc.over=100, the algorithm will generate perc.over/100 = 1 new case for each existing minority case, i.e. 100 new synthetic cases in total.
The parameter perc.under controls the proportion of cases of the
majority class that will be randomly selected for the final "balanced"
data set. This proportion is calculated with respect to the number of
newly generated minority class cases.
So, for example, a value of perc.under=100 will select from the majority class of the original data the same number of observations as have been generated for the minority class.
In our example, 100 new minority cases were generated, so perc.under=100 would also select 100 majority cases, resulting in a data set of 100 original plus 100 synthetic minority cases and 100 majority cases (300 rows in total).
I suggest using values above 100 for perc.over, and an even higher value for perc.under (the defaults are 100 and 200).
Keep in mind that you're adding new observations to your minority class that are not real; I'd try to keep these under control.
Numeric example:
set.seed(123)
data <- data.frame(var1 = sample(50),
                   var2 = sample(50),
                   out = as.factor(rbinom(50, 1, prob = 0.1)))
table(data$out)
# 0 1
# 43 7 # 50 rows total (original data)
smote_data <- DMwR::SMOTE(out ~ var1, data, perc.over = 200, perc.under = 400)
table(smote_data$out)
# 0 1
# 56 21 # 77 rows total (smote data)
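Applying the same bookkeeping to the question's data, here is a rough sketch (not from the original answer) of how one might pick the two parameters so that the SMOTEd data stays close to the original 130,000 rows. It assumes the outcome column is called target with levels "0" and "1", as in the question; perc_over, n_min, n_maj and n_new are illustrative helper names, and the majority rows are resampled, so the final counts are approximate:
library(DMwR)
n_min <- sum(data_code$target == "1")      # original minority count
n_maj <- sum(data_code$target == "0")      # original majority count
perc_over <- 1000                          # 10 synthetic cases per existing minority case
n_new <- n_min * perc_over / 100           # synthetic minority cases that will be created
perc_under <- round(100 * n_maj / n_new)   # sample roughly as many majority cases as exist
smoted_data <- SMOTE(target ~ ., data_code, perc.over = perc_over, perc.under = perc_under)
table(smoted_data$target)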

An alternative to the DMwR package is the smotefamily package, which does not reduce the sample size.
Instead, it creates additional (synthesized) data from the minority class and adds it to the original data, so the output in the $data element of the result is ready for training. To tune the amount of synthesized data, you can modify the parameter dup_size. However, the default dup_size = 0 already sizes the synthetic data to achieve approximately balanced classes, so you don't need to tune it.
This is explained well in a blog post by Richard Richard.
Example code (with the features in the first two columns):
library(magrittr)  # for the %>% pipe
library(caret)     # for train() and confusionMatrix()
# 'features': data frame of numeric predictors (here the first two columns)
# 'target':   the corresponding vector of class labels
smote1 <- smotefamily::SMOTE(features, target, K = 4, dup_size = 0)
formula1 <- "class ~ ." %>% as.formula
model.smote <- caret::train(formula1, method = "rpart", smote1$data)
predictions.smote <- predict(model.smote, smote1$data[,1:2]) %>% print
cv2 <- confusionMatrix(smote1$data$class %>% as.factor, predictions.smote)
I find smotefamily::SMOTE more convenient because you don't have to tune the two parameters perc.over and perc.under until you get an acceptable sample size, and DMwR::SMOTE often generates NA values.
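For the question's data, a minimal sketch (not from the original answer) could look like the following. It assumes all predictors in data_code are numeric, since smotefamily computes nearest neighbours on the feature matrix, and that the outcome column is called target:
library(smotefamily)
X <- data_code[, setdiff(names(data_code), "target")]  # numeric predictors only
sm <- SMOTE(X, data_code$target, K = 5, dup_size = 0)   # dup_size = 0: balance automatically
table(sm$data$class)   # original rows plus synthetic minority rows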

I know I'm a little late to answer your question, but I hope this answer helps others! The package you're using, DMwR, uses a combination of SMOTE and under-sampling of the majority class.
I'd suggest you use smotefamily::SMOTE instead, as it only over-samples the minority class, so you wouldn't lose any majority-class observations.

Related

Error in generalized linear mixed model cross-validation: The value in 'data[[cat_col]]' must be constant within each ID

I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:
data <- groupdata2::fold(detect, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)
cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  control = NULL,
  REML = FALSE)
This is the error I receive:
Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>% :
1 assertions failed:
* The value in 'data[[cat_col]]' must be constant within each ID.
In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?
Reproducible example:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
data <- groupdata2::fold(df, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)
The full documentation says:
cat_col: Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually
want both classes represented in every fold.
N.B. If also passing an ‘id_col’, ‘cat_col’ should be
constant within each ID.
So in this case, where outcome varies within individual birds (id_col), you simply can't specify that the folds be balanced with respect to the outcome. (I don't 100% understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder.)
In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance would mean that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, aic) should be fine.
A potentially greater problem is that you appear to have (potentially) crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds and some birds were detected (or missed) by multiple observers, then there may be no way to define folds that are actually independent, or at least it may be very difficult.
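One workaround consistent with the first point (a small sketch, not part of the original answer): drop cat_col and group the folds by bird only, accepting that the outcome balance across folds is then left to chance. Using the reproducible example's df with its three birds:
data <- groupdata2::fold(df, k = 3, id_col = 'bird') %>%
  arrange(.folds)
table(data$.folds, data$outcome)   # check how the outcomes happened to fall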
You may be able to utilize the new collapse_groups() function in groupdata2 v2.0.0 instead of fold() for this. It allows you to take existing groups (e.g. bird) and collapse them to fewer groups (e.g. folds) with attempted balancing of multiple categorical columns, numeric columns, and factor columns (number of unique levels - though the same levels might be in multiple groups).
It does not have the constraints that fold() has with regard to changing outcomes, but on the other hand it does not come with the same "guarantees" in the "non-changing outcome" context. E.g. it does not guarantee at least one of each outcome level in every fold.
You need more birds than the number of folds though, so I've added a few to the test data:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,
          4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0,1,
             0,1,1,0,1,1,0,0,1,1,0,0,1,0,0,1,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)

# Combine 'bird' groups to folds
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_col = "outcome",
  col_name = ".folds"
) %>%
  arrange(.folds)

# Check the balance of the relevant columns
groupdata2::summarize_balances(
  data = data,
  group_cols = ".folds",
  cat_cols = "outcome"
)$Groups
> # A tibble: 3 × 6
> .group_col .group `# rows` `# bird` `# outc_0` `# outc_1`
> <fct> <fct> <int> <int> <dbl> <dbl>
> 1 .folds 1 14 3 7 7
> 2 .folds 2 10 2 6 4
> 3 .folds 3 10 2 4 6
summarize_balances() shows us that we created 3 folds with 14 rows in the first fold and 10 in the other folds. There are 3 unique bird levels in the first fold and 2 in the others (normally only unique within the group, but here we know that birds are only in one group, as that is how collapse_groups() works with its group_cols argument).
The outcome variable (here # outc_0 and # outc_1) is fairly well balanced.
With larger datasets, you might want to run multiple collapsings and choose the one with the best balance from the summary. That can be done by adding num_new_group_cols = 10 to collapse_groups() (for even better results, enable the auto_tune setting) and then listing all the created group columns when running summarize_balances().
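A rough sketch of that approach (not from the original answer; it assumes the ten new columns are named ".folds_1" to ".folds_10", which may differ depending on the groupdata2 version):
data10 <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_cols = "outcome",
  num_new_group_cols = 10,   # create 10 candidate fold columns
  auto_tune = TRUE,          # spend more time searching for well-balanced collapsings
  col_name = ".folds"
)

# Compare the balance of all candidate columns and pick the best one
groupdata2::summarize_balances(
  data = data10,
  group_cols = paste0(".folds_", 1:10),
  cat_cols = "outcome"
)$Groups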
Hope this helps you or others in a similar position. The constraint in fold() is hard to get around with its current internal approach, but collapse_groups hopefully does the trick in those cases.
See more at https://rdrr.io/cran/groupdata2/man/collapse_groups.html

different output for PR AUC for different R packages

I find different numeric values for the Area Under the Precision-Recall Curve (PR AUC) on the dataset I am working with when it is computed by two different R packages: yardstick and caret.
I am afraid I was not able to reproduce this mismatch with synthetic data, only with my dataset (which is strange as well).
In order to make this reproducible, I am sharing the prediction output of my model, you can download it here https://drive.google.com/open?id=1LuCcEw-RNRcdz6cg0X5bIEblatxH4Rdz (don't worry, it's a small csv).
The csv contains a dataframe with 4 columns:
yes: probability estimate of being in class yes
no: 1 - yes
obs: actual class label
pred: predicted class label (with a 0.5 threshold)
Here is the code that produces the two PR AUC values:
require(data.table)
require(yardstick)
require(caret)
pr <- fread('pred_sample.csv')
# transform to factors
# put the positive class in the first level
pr[, obs := factor(obs, levels = c('yes', 'no'))]
pr[, pred := factor(pred, levels = c('yes', 'no'))] # this is actually not needed
# compute yardstick PRAUC
pr_auc(pr, obs, yes) # 0.315
# compute caret PRAUC
prSummary(pr, lev = c('yes', 'no')) # 0.2373
I could understand a small difference due to the approximation when computing the area (interpolating the curve), but this seems way too large.
I even tried a third package, PRROC, and the result is different again, namely around 0.26.
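For completeness, the PRROC computation referred to above would look roughly like this (a sketch, not part of the original question's code, assuming the same pr table; PRROC::pr.curve takes the positive-class scores split by true class):
library(PRROC)
pr.curve(scores.class0 = pr[obs == 'yes', yes],   # scores of the actual positives
         scores.class1 = pr[obs == 'no', yes])$auc.integral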

cforest party unbalanced classes

I want to measure the features importance with the cforest function from the party library.
My output variable has something like 2000 samples in class 0 and 100 samples in class 1.
I think a good way to avoid bias due to class imbalance is to train each tree of the forest using a subsample such that the number of elements of class 1 is the same as the number of elements of class 0.
Is there any way to do that? I am thinking of an option like n_samples = c(20, 20).
EDIT:
An example of code
> iris.cf <- cforest(Species ~ ., data = iris,
+ control = cforest_unbiased(mtry = 2)) #<--- Here I would like to train the forest using a balanced subsample of the data
> varimp(object = iris.cf)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.048981818 0.002254545 0.305818182 0.271163636
>
EDIT:
Maybe my question is not clear enough.
A random forest is a set of decision trees. In general, each decision tree is constructed using only a random subsample of the data. I would like the subsample used for each tree to contain the same number of elements of class 1 and of class 0.
EDIT:
The feature I am looking for is definitely available in the randomForest package:
sampsize
Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
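For reference, the randomForest usage being described would look roughly like this (a sketch, not from the question; mydata, target and the per-class count of 100 are illustrative):
library(randomForest)
rf <- randomForest(target ~ ., data = mydata,
                   strata = mydata$target,   # stratify the per-tree sampling by class
                   sampsize = c(100, 100))   # draw 100 cases from each class for every tree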
I need the same for the party package. Is there any way to get it?
I will assume you know what you want to accomplish, but don't know enough R to do that.
I'm not sure whether the function provides balancing of the data via an argument, but you can do it manually. Below is code I quickly threw together; a more elegant solution might exist.
# just in case
myData <- iris
library(party)

# replicate everything *10* times (replicate() is just a "loop 10 times")
replicate(10,
{
  # split the dataset by class into a list of data frames
  splitList <- split(myData, myData$Species)
  # sample *20* random rows from each data frame in the list
  sampledList <- lapply(splitList, function(dat) { dat[sample(nrow(dat), 20), ] })
  # combine the sampled rows into a single data.frame
  sampledData <- do.call(rbind, sampledList)
  # your code below
  res.cf <- cforest(Species ~ ., data = sampledData,
                    control = cforest_unbiased(mtry = 2))
  varimp(object = res.cf)
})
Hope you can take it from here.

What does monmlp.predict method return?

I use the monmlp package in R (monotone multi-layer perceptron neural network) as follows:
library(monmlp)
library(ROCR)  # for prediction() and performance()
model = monmlp.fit(trainData, trainLabs, hidden1=3, n.ensemble=1, bag=F, silent=T)
pred = monmlp.predict(testData, model)
preds = as.numeric(pred)
labs = as.numeric(testLabs)
pr = prediction(preds, labs)
pf = performance(pr, "auc")
pf@y.values[[1]]
I want to predict some new data using the trained model and keep the instances whose predicted value is higher than a threshold such as 0.9.
In brief, I want to select the instances that are more likely to be in class 1, using a threshold.
The classes are 0 and 1, and
pred = monmlp.predict(testData,model)
head(pred)
returns
[,1]
311694 0.005271582
129347 0.005271582
15637 0.005271582
125458 0.005271582
315130 0.010411831
272375 0.010411831
What are these values? Probability values?
If yes, what do these values mean?
pred[which(pred>1)]
[1] 1023.839 1023.839 1023.839
Thanks.
Regarding the output: "a matrix with number of rows equal to the number of samples and number of columns equal to the number of predictand variables. If weights is from an ensemble of models, the matrix is the ensemble mean and the attribute ensemble contains a list with predictions for each ensemble member."
Source:
http://cran.r-project.org/web/packages/monmlp/monmlp.pdf
I've never used the package nor the technique, but maybe the quoted documentation means something to you.
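A small sketch based only on the quoted documentation, using the objects from the question: the returned matrix has one column per predictand variable, and with an ensemble the per-member predictions live in the "ensemble" attribute.
pred <- monmlp.predict(testData, model)
dim(pred)                           # samples x predictand variables
members <- attr(pred, "ensemble")   # list of per-member predictions (when n.ensemble > 1)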

D-efficient balanced design with R

I was wondering if there is any package for generating a D-efficient balanced design with R.
I tried the AlgDesign package, but I did not manage to get a balanced design.
I am not sure if this is because of the small full factorial of the attributes and levels I am considering (32), or because balanced designs are outside the scope of AlgDesign.
I have 3 attributes: two attributes have four levels, and one has two levels:
att 1: 4 levels
att 2: 4 levels
att 3: 2 levels
I use the following R code:
library(AlgDesign)
#-----------------------------
# define attributes and levels
#-----------------------------
desVarNames <- c("esource", "certified", "cost")
desLevels <- c(4,2, 4)
n <- 6 #number of choice sets
desOpt <- 4 #num option per choice set
set.seed(123456)
#generate full factorial
dat<-gen.factorial(desLevels,length(desLevels),varNames=desVarNames, center=TRUE)
destT <- optFederov(~., dat, nTrials = (n*(desOpt)), criterion="D")
destT
First, when you say "balanced" design, I suspect you mean that each factor level should appear the same number of times in the final design. (Actually, your variable "dat" above is a balanced design, too, since each possible combination appears only once).
The way optFederov works is by randomly selecting and replacing trials using Fedorov's exchange algorithm. As such, every time a trial is exchanged with another candidate trial, an initially balanced design will become unbalanced, since if a trial "balances" a design, replacing it with any other trial will unbalance the design. If the algorithm were forced to keep a balanced design at each step, no replacement would be possible and it would get stuck.
Not only is a balanced design incompatible with the way Fedorov's exchange algorithm works, it is actually not desirable from a D-efficiency point of view.
For instance, if you have 4 factors with 2, 3, 5 and 7 levels respectively, the only way to have a balanced design is to include all 2*3*5*7 = 210 trials, whereas AlgDesign only suggests 19.
data = gen.factorial(c(2,3,5,7), factors = "all")
trials = optFederov(data = data, center = FALSE, criterion = "D")
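To see how unbalanced the selected trials end up being, one can tabulate each factor's levels in the returned design (a small sketch, not part of the original answer):
# frequency of each level of each factor among the selected trials
lapply(trials$design, table)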
In short, the notions of a "balanced" and a "D-efficient" design are largely at odds with each other.
