Side effect of nodesize in R randomForest? - r

I'm currently working on a randomForest model. In my setup, cross-validation shows that high nodesize values perform best.
But then I noticed something strange.
Here is a reproducible piece of code:
For nodesize = nrow(data)+4:
library(randomForest)
library(data.table)
set.seed(1)
n = 10
sigma = 0.4
X = runif(n)
Y = runif(n)
Z = X^2+X*Y+Y+sigma*rnorm(n)
Data = data.table(X,Y,Z)
model = randomForest(formula = as.formula('Z ~ X + Y'),data = Data,mtry = 1,ntree = 500,nodesize = n+4,do.trace = TRUE)
pred = predict(model,Data)
print(pred)
1.041549 1.036075 1.266310 1.324197 1.308377 1.480041 1.691081 1.752463 1.203856 1.306943
For nodesize = nrow(data)+5:
library(randomForest)
library(data.table)
set.seed(1)
n = 10
sigma = 0.4
X = runif(n)
Y = runif(n)
Z = X^2+X*Y+Y+sigma*rnorm(n)
Data = data.table(X,Y,Z)
model = randomForest(formula = as.formula('Z ~ X + Y'),data = Data,mtry = 1,ntree = 500,nodesize = n+5,do.trace = TRUE)
pred = predict(model,Data)
print(pred)
1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427 1.330427
The predictions are identical for every observation once nodesize >= n+5.
This holds for any value of n (n = 20000 as well).
As I understand it, nodesize is the minimum number of records a node must contain for a split to be performed. That would mean that, with n records, the trees are grown on a sample of n+4 observations drawn with replacement. So if nodesize > n+4, no split is performed and each tree returns the global mean, which is why every observation gets the same prediction. Does that make sense? Is there a parameter that controls how many samples are drawn from the original dataset?
Thanks in advance

@Morgan is right that there was a misunderstanding about what nodesize does: it is the minimum size of any terminal node. Still, the behavior of randomForest seems unexpected (a bug?). It does create branches if n <= nodesize < n+5, which is what you've found.
If nodesize is 10 (the size of the sample), there should not be any splits, but randomForest still makes a split that separates several observations from the rest. It does the same when nodesize is any of 11:14 (not shown here):
library(randomForest)
library(ggplot2)
n = 10
sigma = 0.4
set.seed(100)
X = runif(n)
set.seed(200)
Y = runif(n)
set.seed(1)
Z = X^2+X*Y+Y+sigma*rnorm(n)
Data = data.frame(X,Y,Z)
#
# mtry = p, replace = FALSE and sampsize = n to eliminate randomness
#
model = randomForest(formula = Z ~ X + Y, data = Data,
mtry = 2, ntree = 50, nodesize = 10, replace = FALSE, sampsize = n)
# grid of points covering the observed X/Y range
grid <- expand.grid(X = seq(from = min(Data$X), to = max(Data$X), length.out = 100),
Y = seq(from = min(Data$Y), to = max(Data$Y), length.out = 100))
grid$grid_preds <- predict(model, grid)
# predictions as colored background, training points in blue on top
ggplot(grid, aes(x = X, y = Y)) + geom_point(aes(color = grid_preds)) +
geom_point(data = Data, aes(x = X, y = Y), size = 4, color = "blue") +
theme(legend.position = "none")
If you set nodesize to n + 5 or higher, randomForest no longer makes any splits, as expected:
model = randomForest(formula = Z ~ X + Y, data = Data,
mtry = 2, ntree = 50, nodesize = 15, replace = F, sampsize = n)
grid <- expand.grid(X = seq(from = min(Data$X), to = max(Data$X), length.out = 100),
Y = seq(from = min(Data$Y), to = max(Data$Y), length.out = 100))
grid$grid_preds <- predict(model, grid)
ggplot(grid, aes(x = X, y = Y, color = grid_preds)) + geom_point()
As a comparison, ranger shows the expected behavior and does not attempt any splits if min.node.size >= n:
library(ranger)
rang = ranger(Z ~ X + Y, data = Data, write.forest = T,
replace = F, sample.fraction = 1,
mtry = 2, num.trees = 50, min.node.size = 10)
grid$grid_preds <- predict(rang, grid)$prediction
ggplot(grid, aes(x = X, y = Y, color = grid_preds)) + geom_point()
By the way, the n+5 rule in randomForest also holds for n other than 10. I'm wondering too what's going on there.
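The n+5 threshold can also be checked without plots by counting terminal nodes per tree. The following is a minimal sketch, not part of the original answer: it assumes the treesize() helper from the randomForest package, uses simulated data like the question's, and max_leaves is a hypothetical helper name introduced only for illustration.

```r
library(randomForest)

set.seed(1)
n <- 10
Data <- data.frame(X = runif(n), Y = runif(n))
Data$Z <- Data$X^2 + Data$X * Data$Y + Data$Y + 0.4 * rnorm(n)

# Hypothetical helper: largest number of terminal nodes over all trees
# in a forest fitted with the given nodesize
max_leaves <- function(ns) {
  fit <- randomForest(Z ~ X + Y, data = Data, mtry = 2, ntree = 50,
                      nodesize = ns, replace = FALSE, sampsize = n)
  max(treesize(fit))
}

nodesizes <- n:(n + 6)
leaves <- sapply(nodesizes, max_leaves)
print(data.frame(nodesize = nodesizes, max_terminal_nodes = leaves))
```

A value of 1 in the second column means no tree in that forest made a split; anything larger means at least one tree did.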

Related

Adaptive LASSO in bayesQR

I have been playing with the bayesQR package, and want to apply it to an application that calls for variable selection using LASSO. As far as I understand, this is possible in bayesQR, but I haven't been able to get any variables dropped. My toy example is below, where the extraneous variables (c and d) are dropped by glmnet, but not by bayesQR.
Is there something fundamental that I am missing? How can I get model4 below to have eliminated variables?
library(data.table)
library(bayestestR)
library(bayesQR)
library(glmnet)
# Generate data
n = 250
seed = 22
noise_sd = 5
set.seed(seed)
dt = data.table(a = runif(n, min = 0, max = 10),
b = runif(n, min = 0, max = 10),
c = runif(n, min = 0, max = 10),
d = rnorm(n, sd = .01 * noise_sd))
dt[, y := (a + rnorm(n, sd = 1 * noise_sd)) * 2 + (b + rnorm(n, sd = 2 * noise_sd)) + rnorm(n, sd = 2 * noise_sd)]
formula = y ~ a + b + c + d
# Just with GLM
model1 = glm(data = dt,
formula = formula)
# LASSO with glmnet
x = as.matrix(dt[, .(a,b,c,d)])
y = dt$y
cv_model <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_model$lambda.min
model2 <- glmnet(x, y, alpha = 1, lambda = best_lambda)
print(coef(model2))
# Quantile regression with bayesQR
model3 = bayesQR(data = dt,
formula = formula,
ndraw = 5000,
seed = seed)
# Quantile regression with bayesQR using adaptive lasso
model4 = bayesQR(data = dt,
formula = formula,
ndraw = 5000,
alasso = TRUE,
seed = seed)
message('GLM')
print(summary(model1))
message('glmnet with LASSO')
print(coef(model2))
message('bayesQR')
print(summary(model3))
message('bayesQR with ALASSO')
print(summary(model4))

Plotting classification prediction (K-nearest neighbor)

I'm trying to plot my predictions using the k-nearest neighbor method but am unable to do so; I get the error message shown below. I'm sure it has something to do with how I've set up my plot, but I'm unsure how I need to change it. The dataset is here: https://drive.google.com/file/d/1GYnlsXgT2GS9ubeXq8Pm7iNUWDRGogU_/view?usp=sharing
library(caret)
set.seed(20220719)
#splitting training and testing data
ii = createDataPartition(classification[,3], p = .75, list = F)
#split the data using the indices returned by createDataPartition
xTrain = classification[ii, 1:2] #predictors for training
yTrain = classification[ii, 3] #class label for training
xTest = classification[-ii, 1:2] #predictors for testing
yTest = classification[-ii, 3] #class label for testing
#set training options
#repeat 10 fold cross-validation, 5 times
opts = trainControl(method = 'repeatedcv', number = 10, repeats = 5)
#find optimal k (model)
kmeans_mod = train(x = xTrain, y = as.factor(yTrain),
method ='knn',
trControl = opts,
tuneGrid = data.frame(k = seq(3, 10)))
#test model on testing data
yTestPred = predict(kmeans_mod, newdata = xTest)
confusionMatrix(as.factor(yTestPred), as.factor(yTest))
#plot
plot(kmeans_mod, xTrain)
Gives the error message
Error in if (!(plotType %in% c("level", "scatter", "line"))) stop("plotType must be either level, scatter or line") :
the condition has length > 1
I'm looking for an output like this:
To get a plot similar to the one in the question, you can create a grid of prediction points to produce the background classification map, then plot the test data on top using ggplot.
# Create prediction data frame for test data
preds <- data.frame(X1 = xTest[,1], X2 = xTest[,2], Group = yTestPred)
# Create classification grid
gr <- expand.grid(X1 = seq(min(classification[,1]), max(classification[,1]),
length.out = 100),
X2 = seq(min(classification[,2]), max(classification[,2]),
length.out = 100))
gr$Group <- predict(kmeans_mod, newdata = gr)
# Plot the result
library(ggplot2)
ggplot(gr, aes(X1, X2, col = Group)) +
geom_point(size = 0.6) +
geom_point(data = preds, shape = 21, aes(fill = Group),
col = "black", size = 3) +
theme_minimal(base_size = 16)
Though you may prefer a raster:
library(ggplot2)
ggplot(gr, aes(X1, X2, fill = Group)) +
geom_raster(alpha = 0.3) +
geom_point(data = preds, shape = 21, col = "black", size = 3) +
theme_minimal(base_size = 16)
And you may wish to color the test data points with their actual level rather than their predicted level to get a visual impression of the model accuracy:
library(ggplot2)
ggplot(gr, aes(X1, X2, fill = Group)) +
geom_raster(alpha = 0.3) +
geom_point(data = within(preds, Group <- factor(yTest)),
col = "black", size = 3, shape = 21) +
theme_minimal(base_size = 16)
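Since the dataset from the question is not reproduced here, the same grid idea can be tried on built-in data. This is only a sketch under stated assumptions: it swaps caret's train() for class::knn (the class package ships with R), uses iris as a stand-in dataset, picks k = 5 arbitrarily, and draws with base graphics instead of ggplot.

```r
library(class)  # recommended package bundled with R

# Stand-in data: two predictors and a class label from iris
train_x <- iris[, c("Petal.Length", "Petal.Width")]
train_y <- iris$Species

# Grid of points covering the range of both predictors
gr <- expand.grid(
  Petal.Length = seq(min(train_x$Petal.Length), max(train_x$Petal.Length),
                     length.out = 100),
  Petal.Width  = seq(min(train_x$Petal.Width), max(train_x$Petal.Width),
                     length.out = 100)
)

# Classify every grid point with k-nearest neighbours (k chosen arbitrarily)
gr$Group <- knn(train = train_x, test = gr, cl = train_y, k = 5)

# Small squares show the predicted class regions; larger circles are the data
plot(gr$Petal.Length, gr$Petal.Width, col = as.integer(gr$Group) + 1,
     pch = 15, cex = 0.3, xlab = "Petal.Length", ylab = "Petal.Width")
points(train_x$Petal.Length, train_x$Petal.Width, pch = 21,
       bg = as.integer(train_y) + 1)
```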

Logistic regression for non-linear data

I have data with a continuous independent variable and a binary dependent variable, so I was trying to apply logistic regression to analyse it. However, in contrast to the classical case with a single S-shaped transition, my data has two transitions.
Here is an example of what I mean:
library(ggplot2)
library(visreg)
classic.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
y = c(rep(0, times = 14), 1, 0, rep(1, times = 14)))
model.classic = glm(formula = y ~ x,
data = classic.data,
family = "binomial")
summary(model.classic)
visreg(model.classic,
partial = FALSE,
scale = "response",
alpha = 0)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
model.my = glm(formula = y ~ x,
data = my.data,
family = "binomial")
summary(model.my)
visreg(model.my,
partial = FALSE,
scale = "response",
alpha = 0)
The blue line on each plot is the outcome of glm, while the red line is what I want to have.
Is there any way to apply logistic regression to such data? Or should I apply some other type of regression analysis?
In your second model, y is not a linear function of x. When you write y ~ x you assume that as x increases, y will only increase or only decrease, depending on whether the coefficient is positive or negative. That is not the case here: y increases and then decreases, making the average effect of x roughly zero (hence the straight line). You therefore need a non-linear function. You could do that with a gam from the mgcv package, where the effect of x is modelled as a smooth function:
library(mgcv)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
m = gam(y ~ s(x), data = my.data, family = binomial)
plot(m)
That would lead to the following fit on the original scale:
my.data$prediction = predict(m, type = "response")
plot(my.data$x, my.data$y)
lines(my.data$x, my.data$prediction, col = "red")
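If a parametric model is preferred over a smooth, a quadratic term in an ordinary logistic regression can also capture a rise followed by a fall. The sketch below swaps the GAM for glm with poly(x, 2); it is an alternative, not the method from the answer above. The simulated data (sim, p) are hypothetical and deliberately noisy, because the clean blocks of 0s and 1s in the question are perfectly separable by a quadratic, which would make the glm coefficients diverge.

```r
set.seed(42)

# Hypothetical noisy data: the probability of y = 1 peaks mid-range
x <- seq(from = 0, by = 0.5, length.out = 60)
p <- plogis(2 - 0.05 * (x - 15)^2)
y <- rbinom(length(x), size = 1, prob = p)
sim <- data.frame(x, y)

# Linear logit (one transition only) versus quadratic logit (up, then down)
fit_linear <- glm(y ~ x, data = sim, family = binomial)
fit_quad <- glm(y ~ poly(x, 2), data = sim, family = binomial)

# Lower AIC indicates the better trade-off between fit and complexity
print(AIC(fit_linear, fit_quad))

# Fitted probabilities on the response scale
sim$prediction <- predict(fit_quad, type = "response")
print(head(sim))
```

The fitted curve can then be drawn over the points with plot() and lines(), exactly as in the GAM example.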

What is the shift parameter doing in R plot function?

While reading this page, I saw this code (I generated fake data):
y = rbinom(n = 1000,size = 1,prob = .5)
x = abs(rnorm(1000))
mydata = data.frame(x,y)
z <- gam(y ~ s(x), data = mydata, family = binomial(link = "logit"),
method = "GCV.Cp")
plot(z, se = 1, seWithMean = TRUE, rug = FALSE, shift = mean(predict(z)),
trans = function(x){exp(x)/(1+exp(x))}) # binomial data
What is the role of shift = mean(predict(z))?

How to graph my multiple linear regression model (caret)?

I have created a multiple linear regression model and would now like to plot it, but I can't seem to figure it out. Any help would be greatly appreciated! I used Boruta to find the feature attributes and then used train() to get the model. When I try to plot model_lm I get the error:
There are no tuning parameters with more than 1 value.
Here is the code for what I have attempted so far:
rt_train <- rttotal2
rt_train$year <- NULL
#rt_train$box_office <- NULL
#impute NA and address multicollinearity
preproc <- preProcess(rt_train, method = c("knnImpute","center",
"scale"))
rt_proc <- predict(preproc, rt_train)
rt_proc$box_office <- rt_train$box_office
sum(is.na(rt_proc))
titles <- rt_proc$titles
rt_proc$titles <- NULL
#rt_train$interval <- as.factor(rt_train$interval)
dmy <- dummyVars(" ~ .", data = rt_proc,fullRank = T)
rt_transform <- data.frame(predict(dmy, newdata = rt_proc))
index <- createDataPartition(rt_transform$interval, p =.75, list = FALSE)
train_m <- rt_transform[index, ]
rt_test <- rt_transform[-index, ]
str(rt_train)
y_train <- train_m$box_office
y_test <-rt_test$box_office
train_m$box_office <- NULL
rt_test$box_office <- NULL
#selected feature attributes
boruta.train <- Boruta(interval~., train_m, doTrace =1)
#graph to see most important var to interval
lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
plot(boruta.train, xlab = "", xaxt = "n")
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
#get most important attributes
final.boruta <- TentativeRoughFix(boruta.train)
print(final.boruta)
getSelectedAttributes(final.boruta, withTentative = F)
boruta.rt_df <- attStats(final.boruta)
boruta.rt_df
boruta.rt_df <- setDT(boruta.rt_df, keep.rownames = TRUE)[]
predictors <- boruta.rt_df %>%
filter(., decision =="Confirmed") %>%
select(., rn)
predictors <- unlist(predictors)
control <- trainControl(method="repeatedcv",
number=10,
repeats=6)
#look at residuals
#p-value is very small so reject H0 that predictors have no effect so
#we can use rotten tomatoes to predict box_office ranges
train_m$interval <- NULL
model_lm <- train(train_m[,predictors],
y_train, method='lm',
trControl = control, tuneLength = 10)
model_lm #.568
#
plot(model_lm)
plot(model_lm)
z <- varImp(object=model_lm)
z <- setDT(z, keep.rownames = TRUE)
z$model <- NULL
z$calledFrom <- NULL
row.names(z)
plot(varImp(object=model_lm),main="Linear Model Variable Importance")
predictions<-predict.train(object=model_lm,rt_test[,predictors],type="raw")
table(predictions)
#get coeff
interc <- coef(model_lm$finalModel)
slope <- coef(model_lm$finalModel)
ggplot(data = rt_train, aes(y = box_office)) +
geom_point() +
geom_abline(slope = slope, intercept = interc, color = 'red')
This is what some of my input looks like. Thank you!!
Here is an example using the inbuilt data set cars:
data(cars, package = "datasets")
library(caret)
Build the model:
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 6)
model_lm <- train(dist ~ speed, data = cars, method='lm',
trControl = control, tuneLength = 10)
I will assume you would like to plot the final model.
You can use the caret predict.train function to get the predictions from the model and plot them:
pred <- predict(model_lm, cars)
pred <- data.frame(pred = pred, speed = cars$speed)
Additionally, you can pass the cars data set to geom_point to plot the observations:
library(ggplot2)
ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x=speed, y = dist))
If you would like to obtain the confidence or prediction interval, you can use the predict.lm function on model_lm$finalModel.
Here is an example for the prediction interval:
pred <- predict(model_lm$finalModel, cars, se.fit = TRUE, interval = "prediction")
pred <- data.frame(pred = pred$fit[,1], speed = cars$speed, lwr = pred$fit[,2], upr = pred$fit[,3])
pred_int <- ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x = speed, y = dist)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = speed), alpha = 0.2)
Or the confidence interval:
pred <- predict(model_lm$finalModel, cars, se.fit = TRUE, interval = "confidence")
pred <- data.frame(pred = pred$fit[,1], speed = cars$speed, lwr = pred$fit[,2], upr = pred$fit[,3])
pred_conf <- ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x = speed, y = dist)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = speed), alpha = 0.2)
Plotting them side by side:
library(cowplot)
plot_grid(pred_int, pred_conf)
To plot the linear dependence on two variables you can use a 3D plot; with more predictors than that, a single plot becomes a problem.
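A minimal sketch of such a 3D plot, using only base R. The mtcars data set and the two predictors wt and hp are stand-ins chosen for illustration, and a plain lm fit is used instead of a caret model to keep the example self-contained.

```r
# Linear model with two predictors on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Regular grid over the observed range of both predictors
wt_seq <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 25)
hp_seq <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 25)
grid <- expand.grid(wt = wt_seq, hp = hp_seq)

# expand.grid varies wt fastest, so filling the matrix column-wise
# puts wt along the rows and hp along the columns, as persp expects
z <- matrix(predict(fit, newdata = grid),
            nrow = length(wt_seq), ncol = length(hp_seq))

# Fitted regression plane as a perspective plot
persp(x = wt_seq, y = hp_seq, z = z,
      xlab = "wt", ylab = "hp", zlab = "predicted mpg",
      theta = 40, phi = 20, ticktype = "detailed")
```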
