R: tuneRF unstable, how to optimize?

The Short
I'm trying to use tuneRF to find the optimal mtry value for my randomForest model, but I'm finding that the answer is extremely unstable and changes from run to run / with different seeds. I would run a loop to see how it changes over a large number of runs, but I am unable to extract which mtry has the lowest OOB error from each run.
The Long
I have a data.frame that has eight features, but two of the features are inclusive, meaning all the information in one is a subset of the other. As an example, one feature could be a factor A ~ c("animal", "fish") and another feature a factor B ~ c("dog", "cat", "salmon", "trout"). Hence all dogs and cats are animals and all salmon and trout are fish. These two variables are by far more significant than any of the other six. Hence if I run 3 forests, one that uses A, one that uses B and one that uses A & B, the last one seems to do the best. I suspect this is because A and/or B are so significant that by including both I have double the chance of them being selected randomly as the initial feature. I further suspect that I shouldn't allow this to happen and that I should throw out A as a factor, but I cannot find any literature that actually says that.
Anyway, getting back on track: I have two datasets, tRFx and tRFx2, the first of which contains 7 features including B but not A, and the second of which contains 8 features with both A and B. I'm trying to see what the optimal mtry is for these two separate models, and then how they perform relative to each other. The problem is that tuneRF seems, at least in this case, to be very unstable.
For the first dataset (includes feature B but not A):
> set.seed(1)
> tuneRF(x = tRFx, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 2 OOB error = 17.73%
Searching left ...
Searching right ...
mtry = 3 OOB error = 17.28%
0.02531646 0.01
mtry = 4 OOB error = 18.41%
-0.06493506 0.01
mtry OOBError
2.OOB 2 0.1773288
3.OOB 3 0.1728395
4.OOB 4 0.1840629
> set.seed(3)
> tuneRF(x = tRFx, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 2 OOB error = 18.07%
Searching left ...
Searching right ...
mtry = 3 OOB error = 18.18%
-0.00621118 0.01
mtry OOBError
2.OOB 2 0.1806958
3.OOB 3 0.1818182
i.e. for seed 1 the optimum is mtry=3, but for seed 3 it is mtry=2.
And for the second dataset (includes both features A and B):
> set.seed(1)
> tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 3 OOB error = 17.51%
Searching left ...
mtry = 2 OOB error = 16.61%
0.05128205 0.01
Searching right ...
mtry = 4 OOB error = 16.72%
-0.006756757 0.01
mtry OOBError
2.OOB 2 0.1661055
3.OOB 3 0.1750842
4.OOB 4 0.1672278
> set.seed(3)
> tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 3 OOB error = 17.4%
Searching left ...
mtry = 2 OOB error = 18.74%
-0.07741935 0.01
Searching right ...
mtry = 4 OOB error = 17.51%
-0.006451613 0.01
mtry OOBError
2.OOB 2 0.1874299
3.OOB 3 0.1739618
4.OOB 4 0.1750842
i.e. for seed 1 the optimum is mtry=2, but for seed 3 it is mtry=3.
I was going to run a loop to see which mtry is optimal over a large number of simulations but don't know how to capture the optimal mtry from each iteration.
I know that I can use
> set.seed(3)
> min(tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01))
mtry = 3 OOB error = 17.4%
Searching left ...
mtry = 2 OOB error = 18.74%
-0.07741935 0.01
Searching right ...
mtry = 4 OOB error = 17.51%
-0.006451613 0.01
[1] 0.1739618
but I don't want to capture the OOB error (0.1739618); I want the optimal mtry (3).
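One way I could capture it, assuming tuneRF() is left at its default doBest = FALSE so that it returns a matrix with mtry and OOBError columns, would be a sketch like this (untested):
# collect the winning mtry over repeated tuneRF() runs
# (note: the documented argument name is ntreeTry)
best.mtry <- replicate(20, {
  res <- tuneRF(x = tRFx2, y = tRFy, ntreeTry = 250, stepFactor = 1.5,
                improve = 0.01, trace = FALSE, plot = FALSE)
  res[which.min(res[, "OOBError"]), "mtry"]
})
table(best.mtry)  # how often each mtry wins across the 20 runs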
Any help (or even general comments on anything related to tuneRF) is greatly appreciated. For anybody else who happens to stumble upon this looking for tuneRF help, I also found this post helpful:
R: unclear behaviour of tuneRF function (randomForest package)
For what it's worth, it seems that the optimal mtry for the smaller feature set (with non-inclusive features) is 3 and for the larger feature set is only 2, which is initially counter-intuitive but may make sense when you consider the inclusive nature of A and B.

There's not a big difference in performance in this case (and others) depending on which mtry you choose. It only matters if you want to win Kaggle contests where the winner takes all, and then you would probably also be blending many other learning algorithms into one huge ensemble. In practice you get almost the same predictions.
You don't need stepwise optimization when you test so few combinations of parameters. Just try them all and repeat many times to figure out which mtry is slightly better.
All the times I have used tuneRF, I have been disappointed. Every time I ended up writing my own stepwise optimization or simply trying all combinations many times.
The mtry vs. OOB error curve does not have to be smooth with a single minimum, though a general trend should be visible. It can be difficult to tell whether a minimum is due to noise or a general tendency.
I wrote an example of how to do a solid mtry screening. The conclusion from this screening would be that there's not much difference: mtry=2 seems best and it would be slightly faster to compute. The default here is mtry=2 anyway (floor(sqrt(ncol(X))) for classification).
library(mlbench)
library(randomForest)
data(PimaIndiansDiabetes)
y = PimaIndiansDiabetes$diabetes
X = PimaIndiansDiabetes
X = X[,!names(X)%in%"diabetes"]
nvar = ncol(X)
nrep = 25
# for each candidate mtry, grow nrep forests and keep each forest's final OOB error
rf.list = lapply(1:nvar, function(i.mtry) {
  replicate(nrep, tail(randomForest(X, y, mtry = i.mtry, ntree = 2000)$err.rate[, 1], 1))
})
# plot every OOB error per mtry, plus mean +/- one standard deviation across repeats
plot(replicate(nrep, 1:nvar), do.call(rbind, rf.list), col = "#12345678",
     xlab = "mtry", ylab = "oob.err", main = "tuning mtry by oob.err")
rep.mean = sapply(rf.list, mean)
rep.sd = sapply(rf.list, sd)
points(1:nvar, rep.mean, type = "l", col = 3)
points(1:nvar, rep.mean + rep.sd, type = "l", col = 2)
points(1:nvar, rep.mean - rep.sd, type = "l", col = 2)

Related

How do I tune a posterior probability threshold value for a binary classifier using more than one performance measure with the mlr package in R?

The following link provided me with a greater understanding of incorporating ordinary cost in my binary classification model: https://mlr.mlr-org.com/articles/tutorial/cost_sensitive_classif.html
With a standard classifier, the default threshold is usually 0.5, and the aim is to minimize the total number of misclassification errors as much as possible (i.e. obtain the maximum accuracy). However, all misclassification errors are treated equally. This is not typically the case in a real-world setting, since the cost of a false negative may be much greater than that of a false positive.
Using empirical thresholding, I was able to obtain the optimal threshold value for classifying the instance into good or bad while minimizing the average cost. On the other hand, this comes at the price of reducing the accuracy and other performance measures. This is illustrated in the following figure:
In the figure above, the red line denotes the standard threshold of 0.5 which maximizes accuracy but gives a sub-optimal average credit cost. The blue line denotes the desired threshold that minimizes the cost, but now the accuracy is drastically reduced.
Generally, I would not be concerned about the reduced accuracy. Suppose, however, there is also an incentive not only to minimize the cost but also to maximize the precision. Note that precision is the positive predictive value, ppv = TP/(TP+FP). Then the green line might be a good trade-off that gives a relatively low cost and a relatively high ppv. Here, I plotted the green line as the average of the red and blue lines (both the credit-cost and ppv curves seem to have about the same gradient between these regions, so calculating the optimal threshold this way probably provides a good estimate), but is there a way to calculate this threshold exactly?
My thoughts are to create a new performance measure as a function of both the costs and the ppv, and then minimize this performance measure.
Example: measure = credit.costs*(-ppv)
But I'm not sure how to code this in R. Any advice on what should be done would be greatly appreciated.
My R code is as follows:
library(mlr)
## Load dataset
data(GermanCredit, package = "caret")
credit.task = makeClassifTask(data = GermanCredit, target = "Class")
## Removing 2 columns: Purpose.Vacation,Personal.Female.Single
credit.task = removeConstantFeatures(credit.task)
## Generate cost matrix
costs = matrix(c(0, 1, 5, 0), 2)
colnames(costs) = rownames(costs) = getTaskClassLevels(credit.task)
## Make cost measure
credit.costs = makeCostMeasure(id = "credit.costs", name = "Credit costs", costs = costs, best = 0, worst = 5)
## Set training scheme with repeated 10-fold cross-validation
set.seed(100)
rin = makeResampleInstance("RepCV", folds = 10, reps = 3, task = credit.task)
## Fit a logistic regression model (nnet::multinom())
lrn = makeLearner("classif.multinom", predict.type = "prob", trace = FALSE)
r = resample(lrn, credit.task, resampling = rin, measures = list(credit.costs, mmce), show.info = FALSE)
r
# Tune the threshold using average costs based on the predicted probabilities on the 3 test data sets
cost_tune.res = tuneThreshold(pred = r$pred, measure = credit.costs)
# Tune the threshold using precision based on the predicted probabilities on the 3 test data sets
ppv_tune.res = tuneThreshold(pred = r$pred, measure = ppv)
d = generateThreshVsPerfData(r, measures = list(credit.costs, ppv, acc))
plt = plotThreshVsPerf(d)
plt + geom_vline(xintercept=cost_tune.res$th, colour = "blue") + geom_vline(xintercept=0.5, colour = "red") +
geom_vline(xintercept=1/2*(cost_tune.res$th + 0.5), colour = "green")
calculateConfusionMatrix(r$pred)
performance(r$pred, measures = list(acc, ppv, credit.costs))
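A rough sketch of the combined measure I have in mind, built on mlr's makeMeasure() and reusing the credit.costs and ppv objects defined above (untested, so treat the exact arguments as my assumptions rather than a verified solution):
## custom measure: credit costs times negative precision, to be minimized
cost.ppv = makeMeasure(
  id = "cost.ppv", name = "Credit costs times negative precision",
  minimize = TRUE, properties = c("classif", "req.pred", "req.truth"),
  fun = function(task, model, pred, feats, extra.args) {
    perf = performance(pred, measures = list(credit.costs, ppv))
    unname(perf["credit.costs"] * (-perf["ppv"]))
  })
combined_tune.res = tuneThreshold(pred = r$pred, measure = cost.ppv)
combined_tune.res$th
If that works, the resulting threshold could be drawn as another vertical line in the plot above.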
Finally, I'm also a bit confused about my ppv value. From my confusion matrix I calculate ppv as 442/(442+289) = 0.6046512, but the reported value is slightly different (0.6053531). Is there something wrong with my calculation?

Lambda Issue, or cross validation

I am doing double cross-validation with the LASSO of the glmnet package. However, when I plot the results I am getting lambda values of 0 - 150000, which is unrealistic in my case. I am not sure what I am doing wrong; can someone point me in the right direction? Thanks in advance!
calcium = read.csv("calciumgood.csv", header=TRUE)
dim(calcium)
n = dim(calcium)[1]
calcium = na.omit(calcium)
names(calcium)
library(glmnet) # use LASSO model from package glmnet
lambdalist = exp((-1200:1200)/100) # defines models to consider
fulldata.in = calcium
x.in = model.matrix(CAMMOL~. - CAMLEVEL - AGE,data=fulldata.in)
y.in = fulldata.in[,2]
k.in = 10
n.in = dim(fulldata.in)[1]
groups.in = c(rep(1:k.in,floor(n.in/k.in)),1:(n.in%%k.in))
set.seed(8)
cvgroups.in = sample(groups.in,n.in) #orders randomly, with seed (8)
#LASSO cross-validation
cvLASSOglm.in = cv.glmnet(x.in, y.in, lambda=lambdalist, alpha = 1, nfolds=k.in, foldid=cvgroups.in)
plot(cvLASSOglm.in$lambda,cvLASSOglm.in$cvm,type="l",lwd=2,col="red",xlab="lambda",ylab="CV(10)")
whichlowestcvLASSO.in = order(cvLASSOglm.in$cvm)[1]; min(cvLASSOglm.in$cvm)
bestlambdaLASSO = (cvLASSOglm.in$lambda)[whichlowestcvLASSO.in]; bestlambdaLASSO
abline(v=bestlambdaLASSO)
bestlambdaLASSO # this is the lambda for the best LASSO model
LASSOfit.in = glmnet(x.in, y.in, alpha = 1,lambda=lambdalist) # fit the model across possible lambda
LASSObestcoef = coef(LASSOfit.in, s = bestlambdaLASSO); LASSObestcoef # coefficients for the best model fit
I found the dataset you are referring to:
Calcium, inorganic phosphorus and alkaline phosphatase levels in elderly patients.
Basically the data are "dirty", and that is a possible reason why the algorithm does not converge properly. E.g. there are 771-year-old patients, and besides 1 and 2 for male and female there is also a 22 in the sex encoding, etc.
In your case you only removed NAs.
You also need to check the types the data.frame columns were imported with. E.g. instead of factors they could be imported as integers (SEX, Lab and Age group), which will affect the model.
I think you need to:
1) cleanse the data;
2) if that does not work, submit the *.csv file.
A quick check along these lines is sketched below.
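For example (a rough sketch; the column names SEX and AGE and the valid codes are my assumptions, so adjust them to whatever names(calcium) and summary(calcium) actually show):
# inspect how the columns were imported and look for impossible values
str(calcium)
summary(calcium)
# assumed cleanup: keep plausible ages and valid sex codes, then recode as factor
calcium = subset(calcium, SEX %in% c(1, 2) & AGE <= 110)
calcium$SEX = factor(calcium$SEX)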

Krippendorff's alpha confusion - "irr" package

I have been trying to compute Krippendorff's alpha for a small dataset, but I am getting very different alpha scores for what is essentially the same amount of agreement in my data.
On a rating scale of 1 to 5, two ratings of 4.5 vs 5 would be considered the same distance, and hence the same amount of agreement, as ratings of 4.5 vs 4. However, I am getting drastically different results from both Cohen's kappa and Krippendorff's alpha in R using the irr package.
Data and code:
x<-matrix(c(4.5,5,5,5,5,5,5,5),nrow=2)
y<-matrix(c(4.5,4,5,5,5,5,5,5),nrow=2)
kripp.alpha(x,"ordinal")
kripp.alpha(y,"ordinal")
Results:
> kripp.alpha(x,"ordinal")
Subjects = 4
Raters = 2
alpha = 0
> kripp.alpha(y,"ordinal")
Krippendorff's alpha
Subjects = 4
Raters = 2
alpha = 0.964
I am ultimately hoping to use Krippendorff's alpha, as I would like to compare the performance of 2 raters against that of a larger group of raters.
Any suggestions, guidance, or references would be greatly appreciated.
Double check your object names, or typos in your call. Both of these produce the same alpha value. kripp.alpha is not stochastic (the same input will produce the same result).
x<-matrix(c(4.5,4,5,5,5,5,5,5),nrow=2)
y<-matrix(c(4.5,4,5,5,5,5,5,5),nrow=2)
y_kripp_alpha <- irr:::kripp.alpha(y, "ordinal")
x_kripp_alpha <- irr:::kripp.alpha(x, "ordinal")
y_kripp_alpha$value == x_kripp_alpha$value
# [1] TRUE
And in fact, each of the components is the same.
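For instance (a small check; since both objects above were built from identical input and kripp.alpha is deterministic, this should return TRUE):
all.equal(x_kripp_alpha, y_kripp_alpha)
# [1] TRUE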

Using h2o.grid to fine-tune a GBM model: weight column issue

I am using the h2o.grid hyperparameter search function to fine-tune a GBM model. h2o GBM allows adding a weights column to specify the weight of each observation. However, when I try to add that in h2o.grid, it always errors out with an illegal argument / missing value message, even though the weights column is populated.
Has anyone had a similar experience? Thanks.
Hyper-parameter: max_depth, 20
[2017-04-12 13:10:05] failure_details: Illegal argument(s) for GBM model: depth_grid_model_11. Details: ERRR on field: _weights_columns: Weights cannot have missing values.
ERRR on field: _weights_columns: Weights cannot have missing values.
============================
hyper_params = list( max_depth = c(4,6,8,12,16,20) ) ##faster for larger datasets
grid <- h2o.grid(
## hyper parameters
hyper_params = hyper_params,
## full Cartesian hyper-parameter search
search_criteria = list(strategy = "Cartesian"), ## default is Cartesian
## which algorithm to run
algorithm="gbm",
## identifier for the grid, to later retrieve it
grid_id="depth_grid",
## standard model parameters
x = X, #predictors,
y = Y, #response,
training_frame = datadev, #train,
validation_frame = dataval, #valid,
## the weights column that triggers the error
weights_column = "Adj_Bias_correction",
## more trees is better if the learning rate is small enough
## here, use "more than enough" trees - we have early stopping
ntrees = 10000,
## smaller learning rate is better
## since we have learning_rate_annealing, we can afford to start with a bigger learning rate
learn_rate = 0.05,
## learning rate annealing: learning_rate shrinks by 1% after every tree
## (use 1.00 to disable, but then lower the learning_rate)
learn_rate_annealing = 0.99,
## sample 80% of rows per tree
sample_rate = 0.8,
## sample 80% of columns per split
col_sample_rate = 0.8,
## fix a random number generator seed for reproducibility
seed = 1234,
## early stopping once the validation AUC doesn't improve by at least 0.01% for 5 consecutive scoring events
stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC",
## score every 10 trees to make early stopping reproducible (it depends on the scoring interval)
score_tree_interval = 10
)
## by default, display the grid search results sorted by increasing logloss (since this is a classification task)
grid
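For reference, here is how the weights column can be checked for missing values in the exact frames the grid sees (a base-R sketch; datadev, dataval and the column name are the objects used above):
## pull the weights column back into R and check it there
w_dev = as.data.frame(datadev[, "Adj_Bias_correction"])[[1]]
w_val = as.data.frame(dataval[, "Adj_Bias_correction"])[[1]]
sum(is.na(w_dev))  # should be 0
sum(is.na(w_val))  # should be 0
range(w_dev)       # weights should also be non-negative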

R e1071: Balanced Error Rate (BER) as error criterion in tune function

I'm kind of new to R and machine learning in general, so apologies if this seems stupid!
I'm using the e1071 package to tune the parameters of various models. My dataset is very unbalanced, and I would like the error criterion to be the Balanced Error Rate (BER), NOT the overall classification error. However, I'm stumped as to how to achieve this.
Here is my code:
#Find optimal value 'k' value for k-NN model (feature subset).
c <- data_train_sub[1:13]
d <- data_train_sub[,14]
knn2 <- tune.knn(c, d, k = 1:10,
                 tunecontrol = tune.control(sampling = "cross", performances = TRUE,
                                            sampling.aggregate = mean))
summary(knn2)
plot(knn2)
Which returns this:
Parameter tuning of ‘knn.wrapper’:
- sampling method: 10-fold cross validation
- best parameters:
k
1
- best performance: 0.001190476
- Detailed performance results:
k error dispersion
1 1 0.001190476 0.003764616
2 2 0.005952381 0.006274360
3 3 0.003557423 0.005728122
4 4 0.005924370 0.008352124
5 5 0.005938375 0.008407043
6 6 0.005938375 0.008407043
7 7 0.007128852 0.008315090
8 8 0.009495798 0.009343555
9 9 0.008305322 0.009751997
10 10 0.008319328 0.009795292
Does anyone have experience of altering the error measure assessed by this function?
Look at the class.weights argument of the svm() function:
a named vector of weights for the different classes, used for asymmetric class sizes...
The coefficients can easily be calculated like this:
class.weights = table(Xcal$species)/sum(table(Xcal$species))
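Alternatively, since the question is specifically about the error criterion used during tuning: tune.control() also has an error.fun argument, a function of the true and predicted values that returns the error to be minimized. A rough sketch of plugging in the balanced error rate (untested on this data; c and d are the objects from the question):
library(e1071)
# balanced error rate: mean of the per-class error rates
ber <- function(true, pred) {
  per.class.err <- sapply(levels(factor(true)), function(cl) {
    mean(pred[true == cl] != cl)
  })
  mean(per.class.err, na.rm = TRUE)  # na.rm in case a class is missing from a fold
}
knn.ber <- tune.knn(c, d, k = 1:10,
                    tunecontrol = tune.control(sampling = "cross", error.fun = ber))
summary(knn.ber)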
