I have a dataset of some 20000 training examples, on which I want to do binary classification.
The problem is that the dataset is heavily imbalanced, with only around 1000 examples in the positive class. I am trying to use xgboost (in R) for my predictions.
I have tried oversampling and undersampling, but no matter what I do, the predictions always end up classifying everything as the majority class.
I tried reading this article on how to tune parameters in xgboost.
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
However, it only mentions which parameters help with imbalanced datasets, not how to tune them.
I would appreciate any advice on tuning the learning parameters of xgboost to handle imbalanced datasets, and also on how to generate the validation set in such cases.
According to the XGBoost documentation, the scale_pos_weight parameter is the one that deals with imbalanced classes. The documentation says:
scale_pos_weight, [default=1] Control the balance of positive and
negative weights, useful for unbalanced classes. A typical value to
consider: sum(negative cases) / sum(positive cases) See Parameters
Tuning for more discussion. Also see Higgs Kaggle competition demo for
examples: R, py1, py2, py3
Try something like this in R
bstSparse <- xgboost(data = xgbTrain, max_depth = 4, eta = 0.2, nthread = 2, nrounds = 200,
                     eval_metric = "auc", eval_metric = "logloss",
                     scale_pos_weight = 48, colsample_bytree = 0.7, gamma = 2.5,
                     objective = "binary:logistic")
Here scale_pos_weight reflects the class imbalance; my baseline incidence rate is ~4%. Use hyperparameter optimization to tune the other parameters. You can try that on scale_pos_weight too.
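As a rough starting point, the ratio quoted from the documentation can be computed directly from the label vector. This is only a minimal sketch in base R: the vector `y` is made up to mirror the question's 20000 rows with about 1000 positives, and it is not the asker's data.

```r
# Hypothetical 0/1 label vector: 19000 negatives, 1000 positives
y <- c(rep(0, 19000), rep(1, 1000))

# sum(negative cases) / sum(positive cases), as suggested in the documentation
scale_pos_weight <- sum(y == 0) / sum(y == 1)
print(scale_pos_weight)  # 19
```

The resulting value is a reasonable first guess, which can then be treated as one more parameter in the search.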
A technique useful with neural networks is to introduce some noise into the observations.
In R there is the 'jitter' function to do this.
For your 1000 rare cases only apply a small amount of jitter to their features to give you another 1000 cases.
Run your code again and see if the predictions are now picking up any of the positive class.
You can experiment with more added cases and/or varying the amount of jitter.
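A minimal sketch of this idea in base R might look like the following. The feature matrix `X` and labels `y` are invented for illustration, and the `factor` passed to `jitter` is just a value to experiment with.

```r
set.seed(42)
# Hypothetical data: 3 numeric features, 950 negatives and 50 positives
X <- matrix(rnorm(1000 * 3), ncol = 3)
y <- c(rep(0, 950), rep(1, 50))

# Take the rare cases and jitter each feature column separately
pos <- X[y == 1, , drop = FALSE]
pos_jittered <- apply(pos, 2, jitter, factor = 1)

# Append the jittered copies as extra positive cases
X_aug <- rbind(X, pos_jittered)
y_aug <- c(y, rep(1, nrow(pos_jittered)))
print(table(y_aug))
```

Only the training data should be augmented this way; the validation and test sets should keep the original cases.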
HTH, cousin_pete
Related
I have a multiclass classification problem with 4 classes and am training various learners on the data in mlr. I am using the multiclass wrapper with the default "onevsrest". As I understand it, this method creates 4 binary classifiers, one for each class, with that class as the positive case and all other classes combined as the negative case. I am using resample to train the multiclass classifier using 5-fold cross-validation.
If that is correct, is there some way I can access the individual binary classifiers and calculate the various ROC measures (sensitivity, specificity, F1, AUC) for each binary classifier? The function calculateROCMeasures insists on binary results, but the methods it uses (e.g. getPredictionTruth, measureTPR etc.) could be called directly.
I am aware of the measures multiclass.au1p etc., but these give results across the 4 classes and don't provide values for TP, TN, F1 etc.
This is how I have set up the problem (I am still using mlr 2, sorry, but I imagine the same general question also applies to mlr 3):
library(mlr)
cv.resamp = makeResampleDesc("CV", iters=5)
lrn = makeLearner(cl = "classif.cvglmnet", predict.type = "response")
mclrn = makeMulticlassWrapper(lrn, mcw.method = "onevsrest")
res = resample(learner = mclrn, task = iris.task, resampling = cv.resamp, models = TRUE, show.info = FALSE, extract = getFilteredFeatures)
but then I am not sure how to proceed (possibly there is no way to proceed?). I have examined the structure of res, returned from resample, and of res$pred. In res$pred I can see no evidence of the 4 binary models.
Any advice on how to obtain these scores for a multiclass problem in mlr would be appreciated.
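One possible workaround, sketched here in base R rather than through mlr's own API, is to compute the one-vs-rest counts directly from the true and predicted labels (for instance vectors such as those returned by getPredictionTruth and getPredictionResponse). This does not access the internal binary classifiers, and AUC is not covered because it needs per-class probabilities. The function name `ovr_measures` and the toy labels are hypothetical.

```r
# One-vs-rest confusion counts and derived measures for a single class
ovr_measures <- function(truth, pred, positive) {
  t_pos <- truth == positive
  p_pos <- pred == positive
  tp <- sum(t_pos & p_pos)
  fp <- sum(!t_pos & p_pos)
  fn <- sum(t_pos & !p_pos)
  tn <- sum(!t_pos & !p_pos)
  c(TP = tp, FP = fp, FN = fn, TN = tn,
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    F1 = 2 * tp / (2 * tp + fp + fn))
}

# Hypothetical true and predicted labels for three classes
truth <- factor(c("a", "a", "b", "b", "c", "c"))
pred  <- factor(c("a", "b", "b", "b", "c", "a"), levels = levels(truth))

# One row of measures per class, each class treated as the positive case in turn
res_by_class <- t(sapply(levels(truth), function(cl) ovr_measures(truth, pred, cl)))
print(res_by_class)
```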
I need to use the xgb.train function from the xgboost R package, and I was looking at the various parameters. Among those, subsample sounds like it could be used in place of splitting the dataset into a train and validation set (I am not 100% sure about this though, please if anybody can confirm I would be grateful).
Subsample ratio of the training instances. Setting it to 0.5 means
that XGBoost would randomly sample half of the training data prior to
growing trees. and this will prevent overfitting. Subsampling will
occur once in every boosting iteration.
Now, I like that xgb.train has
the capacity to follow the progress of the learning after each round (source)
using the watchlist parameter. However, how am I supposed to follow the progress of the learning if I don't explicitly pass a validation set to the watchlist? Will just passing the data that I use as my data argument to the xgb.train function suffice if I am also setting subsample to 0.7?
To clarify, here is a code example:
my_mod <- xgb.train(
  params = my_params,
  data = my_data,
  eval_metric = "rmse",
  watchlist = list(train = my_data),
  early_stopping_rounds = 100,
  nrounds = 1000,
  subsample = 0.7)
I have a question regarding rpart and overfitting. My goal is only to do well on prediction. My dataset is large, almost 20000 points. Using around 2.5% of these points as training I get a prediction error around 50%. But using 97.5% of the data as training I get around 30%. Since I am using so much data for training I guess there is a risk for overfitting.
I run this 1000 times with random training/test data + pruning the tree which is some sort of cross validation if I have understood it correctly, and I get pretty much stable results (same prediction error and importance of variables).
Can overfitting still be a problem, even though I have run this 1000 times and the prediction error is stable?
I also have a question regarding correlation between my explanatory variables. Can that be a problem in CART (as with regression)? In regression I would maybe use Lasso to try to fix the correlation. How can I fix the correlation with my classification tree?
When I plot the cptree I get this graph:
cptree plot
Here is the code I am running (I have repeated this 1000 times with different random splits each time).
library(rpart)

set.seed(1) # For reproducibility
train_frac = 0.975
n = dim(beijing_data)[1]
# Split into training and testing data
ii = sample(seq_len(n), floor(n * train_frac))
data_train = beijing_data[ii,]
data_test = beijing_data[-ii,]
# Grow a full tree (no stopping by complexity), to be pruned afterwards
fit = rpart(as.factor(PM_Dongsi_levels) ~ DEWP + HUMI + PRES + TEMP + Iws +
              precipitation + Iprec + wind_dir + tod + pom + weekend + month +
              season + year + day,
            data = data_train, minsplit = 0, cp = 0)
plotcp(fit)
# Find the CP value with the minimum cross-validated error and prune the tree
cp_fit = fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
pfit = prune(fit, cp = cp_fit)
# Misclassification rate on the test data
pp <- predict(pfit, newdata = data_test, type = "class")
err = sum(data_test[,"PM_Dongsi_levels"] != pp)/length(pp)
print(err)
Link to beijing_data (as a RData-file so you can reproduce my example)
https://www.dropbox.com/s/6t3lcj7f7bqfjnt/beijing_data.RData?dl=0
The question is quite complex and it will be very hard to comprehensively answer. I will try to provide some insights and references for further reading.
Correlated features do not pose as severe a problem for tree-based methods as they do for models that use a hyperplane as the classification boundary. When there are multiple correlated features, the tree will just pick one and the rest will be ignored. However, correlated features often cloud the interpretability of such a model, mask interactions and so on. Tree-based models can also benefit from the removal of such variables, since they will have to search a smaller space. Here is a decent resource on trees. Also check these videos 1, 2 and 3 and the ISLR book.
Models based on a single tree tend not to perform as well as hyperplane-based methods. So if you are interested mainly in the quality of prediction, you should explore models based on many trees, such as bagging and boosting models. Popular implementations of bagging and boosting in R are randomForest and xgboost. Both can be used with little to no experience and can give good predictions. Here is a resource on how to use the popular R machine learning library caret to tune a random forest. Another resource is the R mlr library, which provides great wrappers for many things related to ML; for instance, here is a short blog post on model-based optimization of xgboost.
The re-sampling strategy for model validation varies with the task and the available data. With 20k rows I would probably use 50-60 % for training, 20 % for validation and 20-30 % as a test set. The 50-60 % training set I would use to select a suitable ML method, features, hyperparameters and so on by repeated K-fold cross-validation (4-5 fold, repeated 2-3 times, or similar). The 20 % validation set I would use to fine-tune things and to get a feel for how well my cross-validation on the training set generalizes. When I am satisfied with everything, I would use the test set as final proof that I have a good model. Here are some resources on re-sampling: 1, 2, 3 and nested resampling.
In your situation I would use
z <- caret::createDataPartition(data$y, p = 0.6, list = FALSE)
train <- data[z,]
test <- data[-z,]
to split the data into train and test sets. I would then repeat the process to split the test set again with p = 0.5.
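The same 60/20/20 partition can also be sketched in base R without caret. Note that this uses plain random sampling rather than the outcome-stratified sampling that createDataPartition performs, and the data frame below is a made-up stand-in for the real data.

```r
set.seed(1)
# Hypothetical data frame standing in for the 20k-row data set
data <- data.frame(x = rnorm(20000), y = rbinom(20000, 1, 0.3))

n <- nrow(data)
idx <- sample(n)              # random permutation of the row indices
n_train <- round(0.6 * n)
n_valid <- round(0.2 * n)

train <- data[idx[1:n_train], ]
valid <- data[idx[(n_train + 1):(n_train + n_valid)], ]
test  <- data[idx[(n_train + n_valid + 1):n], ]

print(sapply(list(train = train, valid = valid, test = test), nrow))
```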
On the train data I would use this tutorial on random forests to tune the mtry and ntree parameters (Extend Caret section) using 5 fold repeated cross validation in caret and a grid search.
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
tunegrid <- expand.grid(.mtry = c(1:15), .ntree = c(200, 500, 700, 1000, 1200, 1500))
and so on, as detailed in the mentioned link.
On a final note, the more data you have to train on, the less likely you are to over-fit.
I have been building a couple different regression models using the caret package in R in order to make predictions about how fluorescent certain genetic sequences will become under certain experimental conditions.
I have followed the basic protocol of splitting my data into two sets: one "training-testing set" (80%) and one "hold-out set" (20%), the former of which would be utilized to build the models, and the latter would be used to test them in order to compare and pick the final model, based on metrics such as their R-squared and RMSE values. One such guide of the many I followed can be found here (http://www.kimberlycoffey.com/blog/2016/7/16/compare-multiple-caret-run-machine-learning-models).
However, I have hit a block in that I do not know how to test and compare the different models based on how well they predict the scores in the hold-out set. In the guide I linked to above, the author uses confusionMatrix to calculate the specificity and accuracy for each model, after building a predict.train object that applies the recently built models to the hold-out set of data (which is referred to as test in the link). However, as far as my research has indicated, confusionMatrix can only be applied to classification models, where the outcome (or response) is a categorical value. Please correct me if this is wrong, as I have not been able to confirm it beyond doubt.
I have found that the resamples method is capable of comparing multiple models against each other (source: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/resamples), but it cannot take into account how the new models fit with the data that I excluded from the training-testing sessions.
I tried to create predict objects using the recently built models and the hold-out data, then calculate R-squared and RMSE values using caret's R2 and RMSE functions. But I'm not sure if such an approach is the best possible way of comparing and picking the best model.
At this point, I should note that all the model building methods I am using are based on linear regression, since I need to be able to extract the coefficients and apply them in a separate Python script.
Another option I considered was setting a threshold on my outcome, so that any genetic sequence with a fluorescence value over 100 is considered useful, while sequences scoring under 100 are not. This would allow me to use confusionMatrix. But I'm not sure how I should implement this in my R code to create these two classes in my outcome variable. I'm also concerned that this approach might make it difficult to apply my regression models to other data and make predictions.
For what it's worth, each of the predictors is either an integer or a float, and their values are not normally distributed.
Here is the code I have been using so far:
library(caret)
data <- read.table("mydata.csv")
sorted_Data<- data[order(data$fluorescence, decreasing= TRUE),]
splitprob <- 0.8
traintestindex <- createDataPartition(sorted_Data$fluorescence, p=splitprob, list=F)
holdoutset <- sorted_Data[-traintestindex,]
trainingset <- sorted_Data[traintestindex,]
traindata<- trainingset[c('x1', 'x2', 'x3', 'x4', 'x5', 'fluorescence')]
cvCtrl <- trainControl(method = "repeatedcv", number= 20, repeats = 20, verboseIter = FALSE)
modelglmStepAIC <- train(fluorescence~., traindata, method = "glmStepAIC", preProc = c("center","scale"), trControl = cvCtrl)
model_rlm <- train(fluorescence~., traindata, method = "rlm", preProc = c("center","scale"), trControl = cvCtrl)
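# predict() on the train object applies the same centering/scaling used in training;
# calling predict.lm on $finalModel with raw hold-out data would skip that step
pred_glmStepAIC <- predict(modelglmStepAIC, newdata = holdoutset)
pred_rlm <- predict(model_rlm, newdata = holdoutset)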
glmStepAIC_r2<- R2(pred_glmStepAIC, holdoutset$fluorescence)
glmStepAIC_rmse<- RMSE(pred_glmStepAIC, holdoutset$fluorescence)
rlm_r2<- R2(pred_rlm, holdoutset$fluorescence)
rlm_rmse<- RMSE(pred_rlm, holdoutset$fluorescence)
The out-of-sample performance measures offered by Caret are RMSE, MAE and squared correlation between fitted and observed values (called R2). See more info here https://topepo.github.io/caret/measuring-performance.html
At least in time series regression context, RMSE is the standard measure for out-of-sample performance of regression models.
I would advise against discretising a continuous outcome variable, because by discretising you are essentially throwing away information.
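For reference, the three measures mentioned above are simple to compute by hand on the hold-out predictions. This is a minimal base R sketch with made-up numbers; `obs` and `pred` are hypothetical vectors of observed values and predictions.

```r
# Hypothetical observed values and hold-out predictions
obs  <- c(3.0, 5.0, 2.5, 7.0)
pred <- c(2.5, 5.0, 4.0, 8.0)

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
mae  <- mean(abs(pred - obs))       # mean absolute error
r2   <- cor(pred, obs)^2            # squared correlation between predicted and observed

print(c(RMSE = rmse, MAE = mae, R2 = r2))
```

Computing the same numbers for every candidate model on the same hold-out set gives a like-for-like comparison.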
I am quite new to the neural network world so I ask for your understanding. I am generating some tests and thus I have a question about the parameters size and decay. I use the caret package and the method nnet. Example dataset:
require(mlbench)
require(caret)
require (nnet)
data(Sonar)
mydata=Sonar[,1:12]
set.seed(54878)
ctrl = trainControl(method="cv", number=10,returnResamp = "all")
for_train= createDataPartition(mydata$V12, p=.70, list=FALSE)
my_train=mydata[for_train,]
my_test=mydata[-for_train,]
t.grid=expand.grid(size=5,decay=0.2)
mymodel = train(V12~ .,data=my_train,method="nnet",metric="Rsquared",trControl=ctrl,tuneGrid=t.grid)
So I have two questions. First, is this the best way to use the nnet method with caret? Second, I have read about size and decay (e.g. Purpose of decay parameter in nnet function in R?) but I cannot understand how to use them in practice here. Can anyone help?
Brief Caret explanation
The Caret package lets you train different models and tune hyper-parameters using cross-validation (hold-out or k-fold) or bootstrap.
There are two different ways to tune the hyper-parameters using Caret: grid search and random search. If you use grid search (brute force), you need to define the grid for every parameter according to your prior knowledge, or you can fix some parameters and iterate over the remaining ones. If you use random search, you need to specify a tuning length (maximum number of iterations), and Caret will try random values for the hyper-parameters until the stopping criterion is met.
No matter which method you choose, Caret uses each combination of hyper-parameters to train the model and compute performance metrics as follows:
Split the initial training samples into two sets, training and validation (for hold-out validation or bootstrap), or into k sets (for k-fold cross-validation).
Train the model on the training set and predict on the validation set (for hold-out validation and bootstrap), or train on k-1 sets and predict on the k-th set (for k-fold cross-validation).
On the validation set, Caret computes performance metrics such as ROC, Accuracy...
Once the grid search has finished or the tune length is reached, Caret uses the performance metrics to select the best model according to the previously defined criterion (you can use ROC, Accuracy, Sensitivity, RSquared, RMSE...).
You can create some plots to understand the resampling profile and to pick the best model (keep in mind both performance and complexity).
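The steps above can be sketched by hand in base R. This is not Caret's actual implementation, only an illustration of a grid search with k-fold cross-validation; the data, the polynomial-degree hyper-parameter and the choice of RMSE as the metric are all assumptions made for the example.

```r
set.seed(123)
# Hypothetical regression data
n <- 200
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

k <- 5
folds <- sample(rep(1:k, length.out = n))  # assign each row to one of k folds
grid <- data.frame(degree = 1:6)           # grid of hyper-parameter values

grid$cv_rmse <- sapply(grid$degree, function(d) {
  fold_rmse <- sapply(1:k, function(i) {
    fit <- lm(y ~ poly(x, d), data = dat[folds != i, ])  # train on k-1 folds
    p <- predict(fit, newdata = dat[folds == i, ])       # predict the held-out fold
    sqrt(mean((p - dat$y[folds == i])^2))                # metric on the held-out fold
  })
  mean(fold_rmse)                          # average the metric over the folds
})

# Select the hyper-parameter value with the best resampled metric
best <- grid[which.min(grid$cv_rmse), ]
print(grid)
print(best)
```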
If you need more information about Caret, you can check the Caret web page.
Neural Network Training Process using Caret
When you train a neural network (nnet) using Caret you need to specify two hyper-parameters: size and decay. Size is the number of units in the hidden layer (nnet fits a single-hidden-layer neural network) and decay is the regularization parameter used to avoid over-fitting. Keep in mind that the names of the hyper-parameters can differ from one R package to another.
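To get a feel for what the two hyper-parameters control, here is a small base R sketch. It assumes a single-hidden-layer network without skip-layer connections, and it assumes that weight decay adds a penalty of decay times the sum of squared weights to the fitting criterion. The function names `n_weights` and `decay_penalty` are hypothetical helpers, not part of nnet or Caret.

```r
# Number of weights in a single-hidden-layer network without skip connections:
# (inputs + 1 bias) * size for the hidden layer, plus (size + 1 bias) * outputs
n_weights <- function(n_inputs, size, n_outputs = 1) {
  (n_inputs + 1) * size + (size + 1) * n_outputs
}

# Weight-decay penalty added to the fitting criterion: decay * sum(w^2)
decay_penalty <- function(w, decay) decay * sum(w^2)

# 11 predictors and size = 5, as in the question's example
print(n_weights(11, size = 5))                    # 66
print(decay_penalty(c(0.5, -1, 2), decay = 0.2))  # 1.05
```

A larger size means more weights and a more flexible model, while a larger decay shrinks the weights more strongly, so the two are usually tuned together.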
An example of training a Neural Network using Caret for classification:
# Repeated 10-fold cross-validation, with class probabilities so ROC can be computed
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 5,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
# Grid of candidate values for size and decay
nnetGrid <- expand.grid(size = seq(from = 1, to = 10, by = 1),
                        decay = seq(from = 0.1, to = 0.5, by = 0.1))
nnetFit <- train(Label ~ .,
                 data = Training[, ],
                 method = "nnet",
                 metric = "ROC",
                 trControl = fitControl,
                 tuneGrid = nnetGrid,
                 trace = FALSE)  # nnet's argument for silencing the optimizer output
Finally, you can make some plots to understand the resampling results. The following plot was generated from a GBM training process
GBM Training Process using Caret