High OOB error rate for random forest - r

I am trying to develop a model to predict the WaitingTime variable. I am running a random forest on the following dataset.
$ BookingId : Factor w/ 589855 levels "00002100-1E20-E411-BEB6-0050568C445E",..: 223781 471484 372126 141550 246376 512394 566217 38486 560536 485266 ...
$ PickupLocality : int 1 67 77 -1 33 69 67 67 67 67 ...
$ ExZone : int 0 0 0 0 1 1 0 0 0 0 ...
$ BookingSource : int 2 2 2 2 2 2 7 7 7 7 ...
$ StarCustomer : int 1 1 1 1 1 1 1 1 1 1 ...
$ PickupZone : int 24 0 0 0 6 11 0 0 0 0 ...
$ ScheduledStart_Day : int 14 20 22 24 24 24 31 31 31 31 ...
$ ScheduledStart_Month : int 6 6 6 6 6 6 7 7 7 7 ...
$ ScheduledStart_Hour : int 14 17 7 2 8 8 1 2 2 2 ...
$ ScheduledStart_Minute : int 6 0 58 55 53 54 54 0 12 19 ...
$ ScheduledStart_WeekDay: int 1 7 2 4 4 4 6 6 6 6 ...
$ Season : int 1 1 1 1 1 1 1 1 1 1 ...
$ Pax : int 1 3 2 4 2 2 2 4 1 4 ...
$ WaitingTime : int 45 10 25 5 15 25 40 15 40 30 ...
I am splitting the dataset into 80%/20% training/test subsets using sample() and then running a random forest, excluding the BookingId factor, which is only used to validate the predictions.
set.seed(1)
index <- sample(1:nrow(data),round(0.8*nrow(data)))
train <- data[index,]
test <- data[-index,]
library(randomForest)
extractFeatures <- function(data) {
  features <- c("PickupLocality",
                "BookingSource",
                "StarCustomer",
                "ScheduledStart_Month",
                "ScheduledStart_Day",
                "ScheduledStart_WeekDay",
                "ScheduledStart_Hour",
                "Season",
                "Pax")
  fea <- data[, features]
  return(fea)
}
rf <- randomForest(extractFeatures(train), as.factor(train$WaitingTime), ntree=600, mtry=2, importance=TRUE)
The problem is that all attempts to try and decrease OOB error rate and increase the accuracy failed. The maximum accuracy that I managed to achieve was ~23%.
I tried changing the number of features used, different ntree and mtry values, different training/test ratios, and also considering only data with WaitingTime <= 40. My last attempt was to follow MrFlick's suggestion and draw the same sample size for every class of my predicted variable (WaitingTime).
tempdata <- subset(tempdata, WaitingTime <= 40)
rndid <- with(tempdata, ave(tempdata$Season, tempdata$WaitingTime, FUN=function(x) {sample.int(length(x))}))
data <- tempdata[rndid<=27780,]
Do you know of any other way I can achieve an accuracy of at least 50%?
Records by WaitingTime class: (table not shown)
Thanks in advance!

Messing with the randomForest hyperparameters will almost assuredly not significantly increase your performance.
I would suggest using a regression approach for your data. Since waiting time isn't really categorical, a classification approach may not work very well: your classification model loses the ordering information that 5 < 10 < 15, etc.
One thing to try first is a simple linear regression. Bin the predicted values from the test set and recalculate the accuracy. Better? Worse? If it's better, then go ahead and try a randomForest regression model (or, as I would prefer, gradient boosted machines).
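A minimal sketch of that comparison, reusing the train/test split and extractFeatures() from the question and assuming WaitingTime is recorded in 5-minute steps (the binning rule is an assumption):
# linear regression with WaitingTime kept numeric
lm_train <- cbind(extractFeatures(train), WaitingTime = train$WaitingTime)
lm_fit   <- lm(WaitingTime ~ ., data = lm_train)
# bin predictions back to 5-minute classes and compute accuracy on the test set
lm_pred <- round(predict(lm_fit, newdata = extractFeatures(test)) / 5) * 5
mean(lm_pred == test$WaitingTime)
# if that helps, try a random forest in regression mode (numeric response, not a factor)
rf_reg  <- randomForest(extractFeatures(train), train$WaitingTime, ntree = 600)
rf_pred <- round(predict(rf_reg, extractFeatures(test)) / 5) * 5
mean(rf_pred == test$WaitingTime)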
Secondly, it's possible that your data is just not predictive of the variable that you're interested in. Maybe the data got messed up somehow upstream. It might be a good diagnostic to first calculate correlation and/or mutual information of the predictors with the outcome.
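A rough sketch of that check, using the training set from the question (mutual information would need an extra package such as infotheo, so only a rank correlation is shown here):
# Spearman correlation of each candidate predictor with the outcome
sapply(extractFeatures(train), function(x) cor(x, train$WaitingTime, method = "spearman"))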
Also, with so many class labels, 23% might actually not be that bad. The probability that a particular data point is correctly labeled by a random guess is N_class/N (the relative frequency of its class), so the accuracy of a random-guess model is not 50%. You can calculate the adjusted Rand index to show that your model does better than random guessing.
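For example, two quick baseline checks (a sketch; adjustedRandIndex() comes from the mclust package, which you would have to install):
# accuracy of always predicting the most frequent class
max(table(train$WaitingTime)) / nrow(train)
# adjusted Rand index of the random forest predictions vs. the truth
library(mclust)
pred <- predict(rf, extractFeatures(test))
adjustedRandIndex(pred, test$WaitingTime) # values near 0 mean "no better than chance"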

Related

How to use knn classification (class package) using training and test datasets

Df_census is the original data frame. I am trying to use Sex, EducYears and Age to predict whether a person's Income is "<=50K" or ">50K".
There are 20,000 rows in x_train_auto (training set) and 12,561 in x_test_auto (test set).
My classification variable (training set) has 15,124 <=50k and 4876 >50k.
Here is my code:
predictions = knn(train = x_train_auto, # response
                  test = x_test_auto,   # response
                  cl = Df_census$Income[in_train_census], # prediction
                  k = 25)
table(predictions)
#<=50K
#12561
As you can see, all 12,561 test samples were predicted to have an Income of "<=50K".
This doesn't make sense. I am not sure where I am going wrong.
P.S.: I have Sex one-hot encoded as 0 for male and 1 for female, and I have scaled EducYears and Age. I then added the one-hot encoded Sex variable back into the scaled test and train data.
identifying the problem
Your provided x_test_auto.csv data suggests that you passed logical vectors of TRUEs and FALSEs (which define the indices of training and test samples rather than the actual data) to the train and test arguments of class::knn.
the solution
Rather, use the logical vector in x_train_auto (which I believe corresponds to in_train_census in your example) to define two separate data.frames, each containing all your desired predictors. These are then the training and the test set.
p <- c("Age","EducYears","Sex")
Df_train <- Df_census[in_train_census,p]
Df_test <- Df_census[!in_train_census,p]
In the knn function, pass the training set to the train argument, and the test set to the test argument, and further pass the outcome / target variable of the training set (as a factor) to cl.
The output (see ?class::knn) will be the predicted outcome for the test set.
Here is a complete and reproducible workflow using your data.
the data
library(class)
# read data from Dropbox
x_train_auto <- read.csv("https://dropbox.com/s/6kupkp4u4qyizy7/x_test_auto.csv?dl=1", row.names = 1)
Df_census <- read.csv("https://dropbox.com/s/ccvck8ajnatmpv0/Df_census.csv?dl=1", row.names = 1, stringsAsFactors = TRUE)
table(x_train_auto) # TRUE are training, FALSE are test set
#> x_train_auto
#> FALSE TRUE
#> 12561 20000
str(Df_census) # Income as factor, Sex is binary, Age and EducYears are numeric
#> 'data.frame': 32561 obs. of 15 variables:
#> $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
#> $ Work : Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
#> $ Fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
#> $ Education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
#> $ EducYears : int 13 13 9 7 13 14 5 9 14 13 ...
#> $ MaritalStatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
#> $ Occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
#> $ Relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
#> $ Race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
#> $ Sex : int 1 1 1 1 0 0 0 1 0 1 ...
#> $ CapitalGain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
#> $ CapitalLoss : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ HoursPerWeek : int 40 13 40 40 40 40 16 45 50 40 ...
#> $ NativeCountry: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
#> $ Income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
# predictors and response
p <- c("Age","EducYears","Sex")
y <- "Income"
# create data partition
in_train_census <- x_train_auto$x
Df_train <- Df_census[in_train_census,]
Df_test <- Df_census[!in_train_census,]
# check
dim(Df_train)
#> [1] 20000 15
dim(Df_test)
#> [1] 12561 15
table(Df_train$Income)
#>
#> <=50K >50K
#> 15124 4876
using class::knn
The knn (k-nearest-neighbors) algorithm can perform better or worse depending on the choice of the hyperparameter k. It's often difficult to know which k value is best for the classification of a particular dataset. In a machine learning setting, you'd want to try out different values of k to find a value that gives the highest performance on your test dataset (i.e., data which was not used for model fitting).
It's always important to strike a good balance between overfitting (model is too complex, and will give good results on the training data, but less accurate or even rubbish results on new test data) and underfitting (model is too trivial to explain the actual patterns in the data). In the case of knn, using a larger k value would probably better safeguard against overfitting, according to the explanations here.
# apply knn for various k using the given training / test set
r <- data.frame(array(NA, dim = c(0, 2), dimnames = list(NULL, c("k","accuracy"))))
for (k in 1:30) {
  # cat("k =", k, "\n")
  # fit model on training set, predict test set data
  set.seed(60402) # to be reproducible
  predictions <- knn(train = Df_train[, p],
                     test = Df_test[, p],
                     cl = Df_train[, y],
                     k = k)
  # confusion matrix on test set
  t <- table(pred = predictions, ref = Df_test[, y])
  # accuracy
  a <- sum(diag(t)) / sum(t)
  # bind
  r <- rbind(r, data.frame(k = k, accuracy = a))
}
visualize model assessment
# find best k
r[which.max(r$accuracy),]
#> k accuracy
#> 17 17 0.8007324
(k.best <- r[which.max(r$accuracy),"k"])
#> [1] 17
# plot
with(r, plot(k, accuracy, type = "l"))
abline(v = k.best, lty = 2)
Created on 2021-09-23 by the reprex package (v2.0.1)
interpretation
The loop results suggest that your optimal value of k for this particular training and test set is between 12 and 17 (see plot above), but the accuracy gain is very small compared to using k = 1 (it's at around 80% regardless of k).
additional thoughts
Given that high income is rarer compared to lower income, accuracy might not be the desired performance metric. Sensitivity might be equally or more important, and you could modify the example code to calculate and assess other performance metrics instead.
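For instance, a small sketch of pulling the sensitivity for the >50K class out of the confusion matrix computed in the loop above (this reuses the predictions from the last loop iteration; caret's confusionMatrix() would report such metrics directly):
t <- table(pred = predictions, ref = Df_test[, y])
# sensitivity (recall) for the >50K class: true positives / all actual >50K
t[">50K", ">50K"] / sum(t[, ">50K"])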
In addition to pure prediction, you might want to explore whether other variables could be informative predictors of the Income class, by adding them to the p vector and comparing the resulting accuracies.
Here, we base our conclusions on a particular realization of training and test data. Better machine learning practice would be to split your data into 2 (as here), but then repeatedly split the training set again to fit and assess many more models, using e.g. (repeated) k-fold cross validation. A good package to do this in R is e.g. caret or tidymodels.
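A minimal sketch of that idea with caret, assuming the Df_train, p and y objects from above (the tuning grid and number of repeats are arbitrary choices):
library(caret)
set.seed(60402)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_cv <- train(x = Df_train[, p], y = Df_train[, y],
                method = "knn", trControl = ctrl,
                tuneGrid = data.frame(k = seq(1, 29, by = 2)))
knn_cv$bestTune # k selected by repeated cross-validation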
To gain a better understanding regarding which variables are the best predictors of Income class, I would also carry out a logistic regression on various uncorrelated predictors.
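As a sketch, such a logistic regression on the training set could look like this; the coefficient estimates and p-values then indicate which predictors carry information about Income:
logit_fit <- glm(Income ~ Age + EducYears + Sex, data = Df_train, family = binomial)
summary(logit_fit)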

Adding random term into glmer mixed-effect model; error message: failure to converge

I'm analyzing data from an experiment, replicated in time, where I measured plant emergence at the soil surface. I had 3 experimental runs, represented by the term trialnum, and would like to include trialnum as a random effect.
Here is a summary of variables involved:
data.frame: 768 obs. of 9 variables:
$ trialnum : Factor w/ 2 levels "2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ Flood : Factor w/ 4 levels "0","5","10","15": 2 2 2 2 2 2 1 1 1 1 ...
$ Burial : Factor w/ 4 levels "1.3","2.5","5",..: 3 3 3 3 3 3 4 4 4 4 ...
$ biotype : Factor w/ 6 levels "0","1","2","3",..: 1 2 3 4 5 6 1 2 3 4 ...
$ soil : int 0 0 0 0 0 0 0 0 0 0 ...
$ n : num 15 15 15 15 15 15 15 15 15 15 ...
Where trialnum is the experimental run, Flood, Burial, and biotype are input/independent variables, and soil is the response/dependent variable.
I previously created this model with all input variables:
glmfitALL <- glm(cbind(soil, n) ~ trialnum*Flood*Burial*biotype, family = binomial(logit), data = total)
From this model I found that by running
anova(glmfitALL, test = "Chisq")
that trialnum is significant. There were 3 experimental runs; I'm only including 2 of those in my analysis. I have been advised to incorporate trialnum as a random effect so that I do not have to report the experimental runs separately.
To do this, I created the following model:
glmerfitALL <- glmer(cbind(soil, n) ~ Flood*Burial*biotype + (1 | trialnum),
                     data = total,
                     family = binomial(logit),
                     control = glmerControl(optimizer = "bobyqa"))
From this I get the following error message:
maxfun < 10 * length(par)^2 is not recommended.
Unable to evaluate scaled gradient
Model failed to converge: degenerate Hessian with 9 negative eigenvalues
I have tried running this model in a variety of ways including:
glmerfitALL <- glmer(cbind(soil, n) ~ Flood*Burial*biotype*(1 | trialnum),
                     data = total,
                     family = binomial(logit),
                     control = glmerControl(optimizer = "bobyqa"))
as well as setting REML = FALSE and using optimx in place of bobyqa, but every variation resulted in a similar error message.
Because this is an "eigenvalue" error, does that mean there is a problem with my source file/original data?
I also found previous threads about these lme4 error messages (sorry, I did not save the link) and saw some comments raising issues with the lack of replicates of the random effect. Because I only have 2 replicates (trialnum2 and trialnum3), can I even use trialnum as a random effect?
Regarding the eigenvalue warning, the chief recommendation is centring and/or scaling the predictors.
Regarding the number of random-effect groups, around five is the approximate minimum.
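A hedged sketch of the centring/scaling step: it only affects numeric predictors, so with the all-factor design shown above it mainly matters once numeric covariates enter the model.
# centre and scale any numeric predictor columns, then refit
num_cols <- sapply(total, is.numeric)
num_cols[c("soil", "n")] <- FALSE # leave the response components untouched
total[num_cols] <- lapply(total[num_cols], function(x) as.numeric(scale(x)))
glmerfitALL <- glmer(cbind(soil, n) ~ Flood*Burial*biotype + (1 | trialnum),
                     data = total, family = binomial(logit),
                     control = glmerControl(optimizer = "bobyqa"))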

predict() error: what can I do if one variable exists in training data but not in prediction data?

I have a training data set with the below variables
str(PairsTrain)
'data.frame': 1495698 obs. of 4 variables:
$ itemID_1 : int 1 4 8 12 15 19 20 20 22 26 ...
$ itemID_2 : int 4112648 1223296 2161930 5637025 113701
$ isDuplicate : int 1 0 1 0 0 0 0 0 1 0 ...
$ generationMethod: int 1 1 1 1 1 1 1 1 1 1 ...
I have fitted a logistic regression to this dataset using the glm() function:
mod1 <- glm(isDuplicate ~., data = PairsTrain, family = binomial)
Below is the structure of my test dataset:
str(Test)
'data.frame': 1044196 obs. of 3 variables:
$ id : int 0 1 2 3 4 5 6 7 8 9 ...
$ itemID_1: int 5 5 6 11 23 23 30 31 36 47 ...
$ itemID_2: int 4670875 787210 1705280 3020777 5316130 3394969 2922567
I am trying to make predictions on my test data set like below
PredTest <- predict(mod1, newdata = Test, type = "response")
Error in eval(expr, envir, enclos) : object 'generationMethod' not found
I get the above error. I think the reason is that the features in my test dataset don't match those in the training dataset.
I am not sure if I am correct, and I am stuck here and don't know how to deal with this situation.
OK, this is all you need:
test$generationMethod <- 0
You must have the variable generationMethod in your test set! It was used when building the model, so it is required by predict when you make predictions. Since, as you said, you don't have this variable in Test, use the line above to create it. This will have no effect on the predictions, as the column is all 0, but it will get you past the variable checking done by predict.
Alternatively, you might consider removing variable generationMethod from your model development. Try:
mod2 <- glm(isDuplicate ~ itemID_1 + itemID_2, data = PairsTrain,
family = binomial)
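Either way, the prediction step should then run; a quick sketch with the refitted model:
PredTest2 <- predict(mod2, newdata = Test, type = "response")
summary(PredTest2) # predicted probabilities of isDuplicate == 1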

Can't use glht post-hoc with repeated measures ANOVA in R?

I have a data frame with this structure:
'data.frame': 39 obs. of 3 variables:
$ topic : Factor w/ 13 levels "Acido Folico",..: 1 2 3 4 5 6 7 8 9 10 ...
$ variable: Factor w/ 3 levels "Both","Preconception",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : int 14 1 36 17 5 9 19 9 19 25 ...
and I want to test the effect value ~ variable, considering that observations are grouped in topics. So I thought to use a repeated measures ANOVA, where variable is treated as a repeated measure on every topic.
the call is aov(value ~ variable + Error(topic/variable)).
Up to this everything's ok.
Then I wanted to perform a post-hoc test with glht(model, linfct= mcp(variable = 'Tukey')), but I receive two errors:
‘glht’ does not support objects of class ‘aovlist’
no ‘model.matrix’ method for ‘model’ found!
Since taking out the Error term makes the errors go away, I suppose that is the problem.
So, how can I perform a post-hoc test over a repeated measure anova?
Thanks!

R, lme: specifying random effects for mixed model of before-after-gradient analysis

I'm trying to measure the biological impacts of an industrial development using a Before-After-Gradient approach. I am using a linear mixed model approach in R, and am having trouble specifying an appropriate model, especially the random effects. I've spent a lot of time researching this, but so far haven't come up with a clear solution--at least not one that I understand. I am new to LMM (and R for that matter) so would welcome any advice.
The response variables (for example, changes in abundance of key species) will be measured as a function of distance from the edge of impact, using plots established at fixed distances along multiple transects ("gradients") radiating out from the edge of the disturbance. Ideally, each plot would be sampled at multiple times both before and after the impact; however, for simplicity I'm starting by assuming the simplest case, where each plot is sampled once before and once after the impact. Assume also that the individual gradients are far enough apart that they can be considered spatially independent.
First, some simulated data. The effect here is linear instead of curvilinear, but you get the idea.
> str(bag)
'data.frame': 30 obs. of 5 variables:
$ Plot : Factor w/ 15 levels "G1-D0","G1-D100",..: 1 2 4 5 3 6 7 9 10 8 ...
$ Gradient: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
$ Distance: Factor w/ 5 levels "0","100","300",..: 1 2 3 4 5 1 2 3 4 5 ...
$ Period : Factor w/ 2 levels "After","Before": 2 2 2 2 2 2 2 2 2 2 ...
$ response: num 0.633 0.864 0.703 0.911 0.676 ...
> bag
Plot Gradient Distance Period response
1 G1-D0 1 0 Before 0.63258749
2 G1-D100 1 100 Before 0.86422356
3 G1-D300 1 300 Before 0.70262745
4 G1-D700 1 700 Before 0.91056851
5 G1-D1500 1 1500 Before 0.67637353
6 G2-D0 2 0 Before 0.75879579
7 G2-D100 2 100 Before 0.77981992
8 G2-D300 2 300 Before 0.87714158
9 G2-D700 2 700 Before 0.62888739
10 G2-D1500 2 1500 Before 0.83217617
11 G3-D0 3 0 Before 0.87931801
12 G3-D100 3 100 Before 0.81931761
13 G3-D300 3 300 Before 0.74489963
14 G3-D700 3 700 Before 0.68984485
15 G3-D1500 3 1500 Before 0.94942006
16 G1-D0 1 0 After 0.00010000
17 G1-D100 1 100 After 0.05338171
18 G1-D300 1 300 After 0.15846741
19 G1-D700 1 700 After 0.34909588
20 G1-D1500 1 1500 After 0.77138824
21 G2-D0 2 0 After 0.00010000
22 G2-D100 2 100 After 0.05801157
23 G2-D300 2 300 After 0.11422562
24 G2-D700 2 700 After 0.34208601
25 G2-D1500 2 1500 After 0.52606733
26 G3-D0 3 0 After 0.00010000
27 G3-D100 3 100 After 0.05418663
28 G3-D300 3 300 After 0.19295391
29 G3-D700 3 700 After 0.46279103
30 G3-D1500 3 1500 After 0.58556186
As far as I can tell, the fixed effects should be Period (Before,After) and Distance, treating distance as continuous (not a factor) so we can estimate the slope. The interaction between Period and Distance (equivalent to the difference in slopes, before vs. after) measures the impact. I'm still scratching my head over how to specify the random effects. I assume I should control for variation among gradients, as follows:
result <- lme(response ~ Distance + Period + Distance:Period, random=~ 1 | Gradient, data=bag)
However, I suspect I may be missing some source of variation. For example, I'm not sure the above model controls for the re-sampling of individual plots before and after. Any suggestions?
With one sample per gradient, as you have, there's no need to specify random effects or anything about the gradients: you can do this with a straight multiple regression. Once you have multiple measures within each gradient, you can use the model you've specified, which says there is a main effect of gradient on the intercept of the model while the effects (slopes) of Distance, Period, and their interaction are fixed.
You could specify additional random effects if you expect there to be an appreciable amount of variability among gradients in your other predictors. I'm not sure how you do it in lme, or even if you can, but in lmer an example might be:
lmer(response ~ Distance * Period + (1 + Distance | Gradient), data = bag)
That would allow the Distance slope to have a fixed effect component and one that varied with gradient. You can look up further specification of random effects but hopefully you see the general idea and then you can decide how complex to make your model.
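For what it's worth, a sketch of the equivalent random-intercept-and-slope specification in nlme::lme, converting Distance to numeric as the question intends (with only three gradients the variance components will be estimated very imprecisely):
library(nlme)
bag$Distance <- as.numeric(as.character(bag$Distance)) # factor levels -> metres
result2 <- lme(response ~ Distance * Period,
               random = ~ 1 + Distance | Gradient,
               data = bag)
summary(result2)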

Resources