R ranger treeInfo final nodes have the same class

When I use ranger for a classification model and treeInfo() to extract a tree, I see that sometimes a split results in two identical terminal nodes.
Is this expected behaviour? Why does it make sense to introduce a split where the final nodes are the same?
From this question, I gather that the prediction could be the majority class (albeit for Python and another random forest implementation). The ranger ?treeInfo documentation says it should be the predicted class.
MWE
library(ranger)
data <- iris
data$is_versicolor <- factor(data$Species == "versicolor")
data$Species <- NULL
rf <- ranger(is_versicolor ~ ., data = data,
             num.trees = 1, # no need for many trees in this example
             max.depth = 3, # keep depth at an understandable level
             seed = 1351, replace = FALSE)
treeInfo(rf, 1)
#>   nodeID leftChild rightChild splitvarID splitvarName splitval terminal prediction
#> 1      0         1          2          2 Petal.Length     2.60    FALSE       <NA>
#> 2      1        NA         NA         NA         <NA>       NA     TRUE      FALSE
#> 3      2         3          4          3  Petal.Width     1.75    FALSE       <NA>
#> 4      3         5          6          2 Petal.Length     4.95    FALSE       <NA>
#> 5      4         7          8          0 Sepal.Length     5.95    FALSE       <NA>
#> 6      5        NA         NA         NA         <NA>       NA     TRUE       TRUE
#> 7      6        NA         NA         NA         <NA>       NA     TRUE       TRUE
#> 8      7        NA         NA         NA         <NA>       NA     TRUE      FALSE
#> 9      8        NA         NA         NA         <NA>       NA     TRUE      FALSE
In this example, both splits at the bottom of the tree end in sibling terminal nodes with identical predictions: nodeID 5 and 6 both predict TRUE, and nodeID 7 and 8 both predict FALSE.
Graphically this would look like this

I think I found a (partial) answer to the issue, namely the mtry and min.node.size arguments and their functionality.
Because the random forest considers only mtry randomly chosen variables at each split, the final split may be restricted to variables that cannot separate the classes well. The best available split still reduces the impurity (Gini or whichever metric was chosen), but the same class can remain the majority in both resulting terminal nodes.
Playing around with mtry and min.node.size can change this. But we still might get splits with the same results.
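To illustrate how a split can be worthwhile even though both children predict the same class, here is a minimal base-R sketch. The class counts are made up for illustration (they are not taken from the iris tree above), and gini is a small helper defined here, not a ranger function.

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Hypothetical parent node: 15 TRUE, 5 FALSE
parent <- c(rep(TRUE, 15), rep(FALSE, 5))
# Hypothetical children after a split: both still have TRUE as majority
left  <- c(rep(TRUE, 10), rep(FALSE, 1))
right <- c(rep(TRUE, 5),  rep(FALSE, 4))

# Weighted impurity of the children
child_impurity <- (length(left) * gini(left) + length(right) * gini(right)) /
  length(parent)

# Impurity decrease achieved by the split
decrease <- gini(parent) - child_impurity
decrease

# Majority class in each child
majority_left  <- names(which.max(table(left)))
majority_right <- names(which.max(table(right)))
c(left = majority_left, right = majority_right)
```

The decrease is positive, so a splitting rule based on impurity would accept this split, yet both children end up with the same majority class. The split still changes the class proportions in each node, which matters when class probabilities are of interest rather than only the predicted label.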


How to use knn classification (class package) using training and test datasets

Df_census is the original data frame. I am trying to use Sex, EducYears and Age to predict whether a person's Income is "<=50K" or ">50K".
There are 20,000 rows in x_train_auto (training set) and 12,561 in x_test_auto (test set).
My classification variable (training set) has 15,124 <=50k and 4876 >50k.
Here is my code:
predictions = knn(train = x_train_auto, # response
test = x_test_auto, # response
cl = Df_census$Income[in_train_census], # prediction
k = 25)
table(predictions)
#<=50K
#12561
As you can see, all 12,561 test samples were predicted to have an Income of "<=50K".
This doesn't make sense. I am not sure where I am going wrong.
P.S.: I have Sex one-hot encoded as 0 for male and 1 for female. I scaled EducYears and Age, and then added the one-hot encoded Sex variable back into the scaled test and train data.
identifying the problem
Your provided x_test_auto.csv data suggests that you passed logical vectors with TRUEs and FALSEs (which define the indices of training and test samples rather than the actual data) to the train and test arguments of class::knn.
the solution
Rather, use the logical vector in x_train_auto (which I believe corresponds to in_train_census in your example) to define two separate data.frames, each containing all your desired predictors. These are then the training and the test set.
p <- c("Age","EducYears","Sex")
Df_train <- Df_census[in_train_census,p]
Df_test <- Df_census[!in_train_census,p]
In the knn function, pass the training set to the train argument, and the test set to the test argument, and further pass the outcome / target variable of the training set (as a factor) to cl.
The output (see ?class::knn) will be the predicted outcome for the test set.
Here is a complete and reproducible workflow using your data.
the data
library(class)
# read data from Dropbox
x_train_auto <- read.csv("https://dropbox.com/s/6kupkp4u4qyizy7/x_test_auto.csv?dl=1", row.names = 1)
Df_census <- read.csv("https://dropbox.com/s/ccvck8ajnatmpv0/Df_census.csv?dl=1", row.names = 1, stringsAsFactors = TRUE)
table(x_train_auto) # TRUE are training, FALSE are test set
#> x_train_auto
#> FALSE TRUE
#> 12561 20000
str(Df_census) # Income as factor, Sex is binary, Age and EducYears are numeric
#> 'data.frame': 32561 obs. of 15 variables:
#> $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
#> $ Work : Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
#> $ Fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
#> $ Education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
#> $ EducYears : int 13 13 9 7 13 14 5 9 14 13 ...
#> $ MaritalStatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
#> $ Occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
#> $ Relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
#> $ Race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
#> $ Sex : int 1 1 1 1 0 0 0 1 0 1 ...
#> $ CapitalGain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
#> $ CapitalLoss : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ HoursPerWeek : int 40 13 40 40 40 40 16 45 50 40 ...
#> $ NativeCountry: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
#> $ Income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
# predictors and response
p <- c("Age","EducYears","Sex")
y <- "Income"
# create data partition
in_train_census <- x_train_auto$x
Df_train <- Df_census[in_train_census,]
Df_test <- Df_census[!in_train_census,]
# check
dim(Df_train)
#> [1] 20000 15
dim(Df_test)
#> [1] 12561 15
table(Df_train$Income)
#>
#> <=50K >50K
#> 15124 4876
using class::knn
The knn (k-nearest-neighbors) algorithm can perform better or worse depending on the choice of the hyperparameter k. It's often difficult to know which k value is best for the classification of a particular dataset. In a machine learning setting, you'd want to try out different values of k to find a value that gives the highest performance on your test dataset (i.e., data which was not used for model fitting).
It's always important to strike a good balance between overfitting (model is too complex, and will give good results on the training data, but less accurate or even rubbish results on new test data) and underfitting (model is too trivial to explain the actual patterns in the data). In the case of knn, using a larger k value would probably better safeguard against overfitting, according to the explanations here.
# apply knn for various k using the given training / test set
r <- data.frame(array(NA, dim = c(0, 2), dimnames = list(NULL, c("k","accuracy"))))
for (k in 1:30) {
  #cat("k =", k, "\n")
  # fit model on training set, predict test set data
  set.seed(60402) # to be reproducible
  predictions <- knn(train = Df_train[,p],
                     test = Df_test[,p],
                     cl = Df_train[,y],
                     k = k)
  # confusion matrix on test set
  t <- table(pred = predictions, ref = Df_test[,y])
  # accuracy
  a <- sum(diag(t)) / sum(t)
  # bind
  r <- rbind(r, data.frame(k = k, accuracy = a))
}
visualize model assessment
# find best k
r[which.max(r$accuracy),]
#> k accuracy
#> 17 17 0.8007324
(k.best <- r[which.max(r$accuracy),"k"])
#> [1] 17
# plot
with(r, plot(k, accuracy, type = "l"))
abline(v = k.best, lty = 2)
Created on 2021-09-23 by the reprex package (v2.0.1)
interpretation
The loop results suggest that your optimal value of k for this particular training and test set is between 12 and 17 (see plot above), but the accuracy gain is very small compared to using k = 1 (it's at around 80% regardless of k).
additional thoughts
Given that high income is rarer compared to lower income, accuracy might not be the desired performance metric. Sensitivity might be equally or more important, and you could modify the example code to calculate and assess other performance metrics instead.
In addition to pure prediction, you might want to explore whether other variables could be informative predictors of the Income class, by adding them to the p vector and comparing the resulting accuracies.
Here, we base our conclusions on a particular realization of training and test data. Better machine learning practice would be to split your data into two parts (as here), but then repeatedly split the training set again to fit and assess many more models, using e.g. (repeated) k-fold cross-validation. Good packages for this in R are caret and tidymodels.
To gain a better understanding regarding which variables are the best predictors of Income class, I would also carry out a logistic regression on various uncorrelated predictors.
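As a minimal sketch of the point about other performance metrics, sensitivity and specificity can be read off the same kind of confusion matrix used in the loop above. The labels below are made up for illustration; in the real workflow, pred and ref would be predictions and Df_test[,y], and treating ">50K" as the positive class is an assumption.

```r
# Toy reference labels and predictions (hypothetical, for illustration only)
lv   <- c("<=50K", ">50K")
ref  <- factor(c("<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K", "<=50K"),
               levels = lv)
pred <- factor(c("<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"),
               levels = lv)

# Confusion matrix: rows are predictions, columns are the reference
t <- table(pred = pred, ref = ref)

positive <- ">50K"   # assumed class of interest
negative <- "<=50K"

# Share of true positives among all actual positives
sensitivity <- t[positive, positive] / sum(t[, positive])
# Share of true negatives among all actual negatives
specificity <- t[negative, negative] / sum(t[, negative])
# Overall share of correct predictions
accuracy    <- sum(diag(t)) / sum(t)

c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```

Inside the loop over k, these values could be stored alongside accuracy, and the best k could then be picked on whichever metric matters most for the application.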

Is there a way to calculate feature importance at observation level in isolation forest?

I am using Isolation Forest in R to perform Anomaly Detection on multivariate data.
I tried calculating the anomaly scores along with the contribution of each individual metric to that score. I am able to get the anomaly score, but I am having trouble calculating the importance of the metrics.
I am able to get the desired result through BigML (an online platform) but not through R.
R code:
> library(solitude) # tried 'IsolationForest' and 'h2o' but not getting desired result
> mo = isolation_forest(data)
> final_scores <- predict(mo,data)
> summary(mo)
Length Class Mode
forest 14 ranger list
> head(final_scores,5)
[1] 0.4156554 0.3923926 0.4262782 0.4595296 0.4174865
Output from BigMl :
I want to get the importance values for every metric (a, b, c, d) through R code, just like what I am getting in BigML.
I think I am missing some basic parameters. I am new to R, so I am not able to figure it out.
I have thought of a way to get the feature importance at observation level, but I am having trouble implementing it.
Here is the snippet of what I am planning.
The dots in the metric are individual observations while the lines are splits based on specific variables.
I am able to trace individual trees of forest but the problem is that there are 500 trees in the forest and tracing individual tree and accessing their importance values is impractical. The below example is purely based on dummy data.
Output of individual tree:
> x = treeInfo(mo$forest,tree=3)
> x
   nodeID leftChild rightChild splitvarID splitvarName  splitval terminal prediction
 1      0         1          2          2            c 0.6975663    FALSE         NA
 2      1         3          4          1            b 0.3455875    FALSE         NA
 3      2         5          6          0            a 0.2620023    FALSE         NA
 4      3         7          8          0            a 0.1425075    FALSE         NA
 5      4         9         10          0            a 0.6611566    FALSE         NA
 6      5        NA         NA         NA         <NA>        NA     TRUE         10
 7      6        NA         NA         NA         <NA>        NA     TRUE          2
 8      7        NA         NA         NA         <NA>        NA     TRUE          6
 9      8        NA         NA         NA         <NA>        NA     TRUE          1
10      9        NA         NA         NA         <NA>        NA     TRUE          3
11     10        NA         NA         NA         <NA>        NA     TRUE          5
Any kind of help is appreciated.
Local feature importance can be estimated with the package lime.
library(solitude)
library(lime)
First, some toy data:
set.seed(1234)
data<-data.frame(rnorm(20,0,1),rnorm(20,0,0.5))
colnames(data)<-c("x","y")
row.names(data)<-seq(1,nrow(data),1)
Have a look at the toy data:
plot(data)
text(data-0.05,row.names(data))
These cases appear to be outliers:
outliers<-c(4,20)
Grow isolation forest:
model<-isolation_forest(data, importance="impurity")
As solitude is not supported in lime, we need to build two functions so that lime can handle solitude objects. The model_type function tells lime what kind of model we have. The predict_model function enables lime to predict with solitude objects.
model_type.solitude <- function(x, ...) {
  return("regression")
}
predict_model.solitude <- function(x, newdata, ...) {
  pred <- predict(x, newdata)
  return(as.data.frame(pred))
}
Then we can generate the lime object and estimate observation-level feature importance (the number of permutations could be set higher for more reliable results):
lime1 <- lime(data, model)
importance <- data.frame(explain(data, lime1,
                                 n_features = 2, n_permutations = 500))
Feature importance is in importance$feature_weight.
Casewise inspection of results:
importance[importance$case %in% outliers,c("case","feature","feature_weight")]
Plot:
plot_features(importance[importance$case %in% outliers,] , ncol = 2)
Hope that's helpful!
Of course, read up on lime as it is based on certain assumptions.
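The tree-tracing idea from the question can also be automated instead of inspecting 500 trees by hand. The following is only a rough sketch under several assumptions: it uses a plain ranger forest fitted to a random response as a stand-in (for a solitude model, the question suggests the ranger object is available as mo$forest), path_vars is a hypothetical helper written for this example, and counting how often a variable appears on an observation's path is a crude heuristic rather than an established importance measure.

```r
library(ranger)

set.seed(1234)
# Toy data with the same shape as in the answer above
data <- data.frame(x = rnorm(20, 0, 1), y = rnorm(20, 0, 0.5))

# Stand-in forest: regression on a random response (assumption for illustration)
data$id <- sample(nrow(data))
forest <- ranger(id ~ x + y, data = data, num.trees = 50, seed = 1)

# Hypothetical helper: variables used on the path of one observation in one tree
path_vars <- function(forest, obs, tree) {
  info <- treeInfo(forest, tree)
  node <- 0                      # treeInfo node IDs are 0-indexed, root is 0
  used <- character(0)
  repeat {
    row <- info[info$nodeID == node, ]
    if (row$terminal) break
    v <- as.character(row$splitvarName)
    used <- c(used, v)
    # ranger sends values <= splitval to the left child
    node <- if (obs[[v]] <= row$splitval) row$leftChild else row$rightChild
  }
  used
}

# Count split variables along the paths of observation 4 across all trees
obs <- data[4, ]
all_vars <- unlist(lapply(seq_len(forest$num.trees),
                          function(t) path_vars(forest, obs, t)))
counts <- table(factor(all_vars, levels = c("x", "y")))
counts

# Average path length, the quantity isolation forests base their score on
mean_depth <- length(all_vars) / forest$num.trees
mean_depth
```

Comparing counts (or the share of each variable) between suspected outliers and ordinary observations gives a rough per-observation picture, but it should be read with more caution than the lime weights above.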

R: Error in contrasts when fitting linear models with `lm`

I've found Error in contrasts when defining a linear model in R and have followed the suggestions there, but none of my factor variables take on only one value and I am still experiencing the same issue.
This is the dataset I'm using: https://www.dropbox.com/s/em7xphbeaxykgla/train.csv?dl=0.
This is the code I'm trying to run:
simplelm <- lm(log_SalePrice ~ ., data = train)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
What is the issue?
Thanks for providing your dataset (I hope that link will forever be valid so that everyone can access it). I read it into a data frame train.
Using the debug_contr_error, debug_contr_error2 and NA_preproc helper functions provided by How to debug "contrasts can be applied only to factors with 2 or more levels" error?, we can easily analyze the problem.
info <- debug_contr_error2(log_SalePrice ~ ., train)
## the data frame that is actually used by `lm`
dat <- info$mf
## number of cases in your dataset
nrow(train)
#[1] 1460
## number of complete cases used by `lm`
nrow(dat)
#[1] 1112
## number of levels for all factor variables in `dat`
info$nlevels
# MSZoning Street Alley LotShape LandContour
# 4 2 3 4 4
# Utilities LotConfig LandSlope Neighborhood Condition1
# 1 5 3 25 9
# Condition2 BldgType HouseStyle RoofStyle RoofMatl
# 6 5 8 5 7
# Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
# 14 16 4 4 4
# Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
# 6 5 5 5 7
# BsmtFinType2 Heating HeatingQC CentralAir Electrical
# 7 5 5 2 5
# KitchenQual Functional FireplaceQu GarageType GarageFinish
# 4 6 6 6 3
# GarageQual GarageCond PavedDrive PoolQC Fence
# 5 5 3 4 5
# MiscFeature SaleType SaleCondition MiscVal_bool MoYrSold
# 4 9 6 2 55
As you can see, Utilities is the offending variable here as it has only 1 level.
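The mechanism can be reproduced on a tiny made-up data frame (the column names y, f and g are hypothetical): a factor has two levels in the raw data, but only one level is left once rows with NA are dropped, which is what lm does before building the design matrix.

```r
# Toy data: factor f has levels "a" and "b", but "b" occurs only in a row with NA in y
df <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8, NA),
  f = factor(c("a", "a", "a", "a", "b")),
  g = factor(c("u", "v", "u", "v", "u"))
)

# Levels in the raw data: both factors look fine
sapply(df[sapply(df, is.factor)], nlevels)

# Levels among complete cases, after dropping unused levels
cc   <- na.omit(df)
nlev <- sapply(cc[sapply(cc, is.factor)], function(x) nlevels(droplevels(x)))
nlev

# Fitting the model fails because f effectively has a single level
res <- tryCatch(lm(y ~ ., data = df), error = function(e) conditionMessage(e))
res
```

This is why checking the number of levels on the raw data frame is not enough: the check has to be done on the rows that actually enter the model.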
Since you have many character / factor variables in train, I wonder whether you have NA for them. If we add NA as a valid level, we could possibly get more complete cases.
new_train <- NA_preproc(train)
new_info <- debug_contr_error2(log_SalePrice ~ ., new_train)
new_dat <- new_info$mf
nrow(new_dat)
#[1] 1121
new_info$nlevels
# MSZoning Street Alley LotShape LandContour
# 5 2 3 4 4
# Utilities LotConfig LandSlope Neighborhood Condition1
# 1 5 3 25 9
# Condition2 BldgType HouseStyle RoofStyle RoofMatl
# 6 5 8 5 7
# Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
# 14 16 4 4 4
# Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
# 6 5 5 5 7
# BsmtFinType2 Heating HeatingQC CentralAir Electrical
# 7 5 5 2 6
# KitchenQual Functional FireplaceQu GarageType GarageFinish
# 4 6 6 6 3
# GarageQual GarageCond PavedDrive PoolQC Fence
# 5 5 3 4 5
# MiscFeature SaleType SaleCondition MiscVal_bool MoYrSold
# 4 9 6 2 55
We do get more complete cases, but Utilities still has one level. This means that most incomplete cases are actually caused by NA in your numerical variables, about which we can do nothing (unless you have a statistically valid way to impute those missing values).
As you only have one single-level factor variable, the same method as given in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"? will work.
new_dat$Utilities <- 1
simplelm <- lm(log_SalePrice ~ 0 + ., data = new_dat)
The model now runs successfully. However, it is rank-deficient. You probably want to do something to address it, but leaving it as it is is fine.
b <- coef(simplelm)
length(b)
#[1] 301
sum(is.na(b))
#[1] 9
simplelm$rank
#[1] 292
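To see which terms are behind a rank deficiency, you can list the coefficients that lm could not estimate. The following is a minimal sketch on made-up data, where x2 is deliberately an exact multiple of x1; the variable names are hypothetical and unrelated to the housing dataset.

```r
set.seed(1)
# Toy data with a perfectly collinear predictor (x2 = 2 * x1)
d <- data.frame(x1 = rnorm(10))
d$x2 <- 2 * d$x1
d$y  <- d$x1 + rnorm(10)

fit <- lm(y ~ x1 + x2, data = d)

# Coefficients that could not be estimated are returned as NA
b <- coef(fit)
aliased <- names(b)[is.na(b)]
aliased

# The rank is smaller than the number of coefficients
fit$rank
length(b)
```

On the real model, names(b)[is.na(b)] would give the nine aliased terms, which is a starting point for deciding which predictors to drop or recode.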

How to prevent extrapolation using na.spline()

I'm having trouble with the na.spline() function in the zoo package. Although the documentation explicitly states that this is an interpolation function, the behaviour I'm getting includes extrapolation.
The following code reproduces the problem:
require(zoo)
vector <- c(NA,NA,NA,NA,NA,NA,5,NA,7,8,NA,NA)
na.spline(vector)
The output of this should be:
NA NA NA NA NA NA 5 6 7 8 NA NA
This would be interpolation of the internal NA, leaving the trailing NAs in place. But, instead I get:
-1 0 1 2 3 4 5 6 7 8 9 10
According to the documentation, this shouldn't happen. Is there some way to avoid extrapolation?
I recognise that in my example, I could use linear interpolation, but this is a MWE. Although I'm not necessarily wed to the na.spline() function, I need some way to interpolate using cubic splines.
This behavior appears to be coming from the stats::spline function, e.g.,
spline(seq_along(vector), vector, xout=seq_along(vector))$y
# [1] -1 0 1 2 3 4 5 6 7 8 9 10
Here is a work around, using the fact that na.approx strictly interpolates.
replace(na.spline(vector), is.na(na.approx(vector, na.rm=FALSE)), NA)
# [1] NA NA NA NA NA NA 5 6 7 8 NA NA
Edit
As @G.Grothendieck suggests in the comments below, another, no doubt more performant, way is:
na.spline(vector) + 0*na.approx(vector, na.rm = FALSE)
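Both workarounds rely on the same idea: compute the spline everywhere, then blank out every position that lies outside the range of the observed values. Here is a sketch of that idea in base R only, without zoo, in case it helps to see the masking step spelled out (the names idx, ok and inside are just illustrative).

```r
v   <- c(NA, NA, NA, NA, NA, NA, 5, NA, 7, 8, NA, NA)
idx <- seq_along(v)
ok  <- !is.na(v)

# Cubic spline through the observed points, evaluated at every position
s <- spline(idx[ok], v[ok], xout = idx)$y

# Keep only positions between the first and last observed value
inside <- idx >= min(idx[ok]) & idx <= max(idx[ok])
s[!inside] <- NA
s
```

In the one-liner above, 0 * na.approx(vector, na.rm = FALSE) plays the role of the mask: it is 0 where linear interpolation is defined and NA elsewhere, so adding it leaves interior values unchanged and turns extrapolated ones into NA.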

Multiple comparisons for GLMM dataset (proportion/binomial response) - lsmeans?

I have a GLMM that runs fine and produces results that make biological sense. I want to do multiple comparisons among the levels of the predictor variable I'm interested in (a factor with 6 levels, labeled Body in the diagram). This factor and its interaction with Class were significant in the GLMM (as expected).
I have tried using lsmeans using this code:
lsmc <- lsmeans(modelc, ~ Class*Body)
plot(lsmc, by = "Class", intervals = TRUE, type = "response")
cld(lsmc)
The result is a confusing mishmash of grouping codes:
> cld(lsmc)
Class Body lsmean SE df asymp.LCL asymp.UCL .group
a 6 -4.134310 0.2707025 NA -4.664878 -3.603743 123
a 3 -3.970351 0.2728055 NA -4.505040 -3.435662 123
a 4 -3.928422 0.2704543 NA -4.458502 -3.398341 123
a 5 -3.882009 0.2692264 NA -4.409683 -3.354335 123456
b 6 -3.736560 0.4111311 NA -4.542362 -2.930758 1 4 7
a 1 -3.526359 0.2772493 NA -4.069757 -2.982960 456789
a 2 -3.343117 0.2711772 NA -3.874614 -2.811619 789
b 5 -3.200230 0.4107996 NA -4.005383 -2.395078 2 5 8
b 1 -2.879111 0.4122133 NA -3.687034 -2.071187 23 56 89
b 2 -2.840026 0.4110968 NA -3.645761 -2.034291 3 6 9
b 3 -2.818114 0.4102995 NA -3.622287 -2.013942 3 6 9
b 4 -2.649563 0.4096440 NA -3.452450 -1.846675 3 6 9
As far as I am aware, non-contiguous grouping codes, as seen in all of Class b, aren't a good sign.
Is there another way to go about multiple and/or pairwise comparisons with output from GLMMs?
