Running random forest algorithm with one variable - r

I'm using the random forest algorithm by using one predictor.
RF_MODEL <- randomForest(x=Data_[,my_preds], y=as.factor(Data_$P_A), data=Data_, ntree=1000, importance =T)
But I got this error message:
Error in if (n == 0) stop("data (x) has 0 rows") :
l'argument est de longueur nulle
Does this mean that we can't use RF with one variable?

The issue here is that when you specify x in randomForest, x should be "a data frame or a matrix of predictors, or a formula describing the model to be fitted". You are specifying a vector, Data_[, my_preds] where I assume my_preds is a string describing the column name. You get a vector by default when specifying one column of a data frame.
You can use drop = FALSE to ensure that x stays as a data frame column.
RF_MODEL <- randomForest(x = Data_[,my_preds, drop = FALSE],
y = as.factor(Data_$P_A),
data = Data_,
ntree = 1000, importance = TRUE)
We can demonstrate using the iris dataset.
library(randomForest)
randomForest(x = iris[, "Sepal.Width"], y = iris$Species, data = iris)
Error in if (n == 0) stop("data (x) has 0 rows") :
argument is of length zero
Using drop = FALSE:
randomForest(x = iris[, "Sepal.Width", drop = FALSE], y = iris$Species, data = iris)
Call:
randomForest(x = iris[, "Sepal.Width", drop = FALSE], y = iris$Species, data = iris)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 52.67%
Confusion matrix:
setosa versicolor virginica class.error
setosa 31 2 17 0.38
versicolor 3 20 27 0.60
virginica 17 13 20 0.60
You might also consider using a formula to avoid this issue:
randomForest(Species ~ Sepal.Width, data = iris)

Related

Approximated SHAP values for multi-classification problem using randomForest

I would like to use the fastshap package to obtain SHAP values plots for every category of my outcome in a multi-classification problem using a random forest classifier. I could only found chunks of the code around, but no explanation on how to procede from the beginning in obtaining the SHAP values in this case. Here is the code I have so far (my y has 5 classes, here I am trying to obtain SHAP values for class 3):
library(randomForest)
library(fastshap)
set.seed(42)
sample <- sample.int(n = nrow(ITA), size = floor(.75*nrow(ITA)), replace=F)
train <- ITA [sample,]
test <- ITA [-sample,]
set.seed(42)
rftrain <-randomForest(y ~ ., data=train, ntree=500, importance = TRUE)
p_function_3<- function(object, newdata)
caret::predict.train(object,
newdata = newdata,
type = "prob")[,3]
shap_values_G <- fastshap::explain(rftrain,
X = train,
pred_wrapper = p_function_3,
nsim = 50,
newdata=train[which(y==3),])
Now, I took the code largely from an example I found online, and I tried to adapt it (I am not an expert R user), but it does not work.. Can you please help me in correcting it? Thanks!
Here is a working example (with a different dataset), but I think the logic is the same.
library(randomForest)
library(fastshap)
set.seed(42)
ix <- sample(nrow(iris), 0.75 * nrow(iris))
train <- iris[ix, ]
test <- iris[-ix, ]
xvars <- c("Sepal.Width", "Sepal.Length")
yvar <- "Species"
fit <- randomForest(reformulate(xvars, yvar), data = train, ntree = 500)
pred_3 <- function(model, newdata) {
predict(model, newdata = newdata, type = "prob")[, "virginica"]
}
shap_values_3 <- fastshap::explain(
fit,
X = train, # Reference data
feature_names = xvars,
pred_wrapper = pred_3,
nsim = 50,
newdata = train[train$Species == "virginica", ] # For these rows, you will calculate explanations
)
head(shap_values_3)
# Sepal.Width Sepal.Length
# <dbl> <dbl>
# 1 0.101 0.381
# 2 0.159 -0.0109
# 3 0.0736 -0.0285
# 4 0.0564 0.161
# 5 0.0649 0.594
# 6 0.232 0.0305

Using nls or nlsLM to fit global and group-specific parameters

I would like to use nls to fit a global parameter and group-specific parameters. The closest I have found to a minimum reproducible example is below (found here: https://stat.ethz.ch/pipermail/r-help/2015-September/432020.html)
#Generate some data
d <- transform(data.frame(x=seq(0,1,len=17),
group=rep(c("A","B","B","C"),len=17)), y =
round(1/(1.4+x^ifelse(group=="A", 2.3, ifelse(group=="B",3.1, 3.5))),2))
#Fit to model using nls
nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=rep(3,length(levels(d$group)))))
This gives me an error:
Error in numericDeriv(form[[3L]], names(ind), env, central = nDcentral) :
Missing value or an infinity produced when evaluating the model
I have not been able to figure out if the error is coming from bad guesses for the starting values, or the way this code is dealing with group-specific parameters. It seems the line with p=rep(3,length(levels(d$group))) is for generating c(3,3,3), but switching this part of the code does not remove the problem (same error obtained as above):
#Fit to model using nls
nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3, 3, 3)))
Switching to nlsLM gives a different error which leads be to believe I am having an issue with the group-specific parameters:
#Generate some data
library(minpack.lm)
d <- transform(data.frame(x=seq(0,1,len=17),
group=rep(c("A","B","B","C"),len=17)), y =
round(1/(1.4+x^ifelse(group=="A", 2.3, ifelse(group=="B",3.1, 3.5))),2))
#Fit to model using nlsLM
nlsLM(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3,3,3)))
Error:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
Any ideas?
I think you can do this much more easily with nlme::gnls:
fit2 <- nlme::gnls(y~1/(b+x^p),
params = list(p~group-1, b~1),
data=d,
start = list(b=1, p = rep(3,3)))
Results:
Generalized nonlinear least squares fit
Model: y ~ 1/(b + x^p)
Data: d
Log-likelihood: 62.05887
Coefficients:
p.groupA p.groupB p.groupC b
2.262383 2.895903 3.475324 1.407561
Degrees of freedom: 17 total; 13 residual
Residual standard error: 0.007188101
The params argument allows you to specify fixed-effect submodels for each nonlinear parameter. Using p ~ b-1 parameterizes the model with a separate estimate for each group, rather than fitting a baseline (intercept) value for the first group and the differences between successive groups. (In R's formula language, -1 or +0 signify "fit a model without intercept/set the intercept to 0", which in this case corresponds to fitting all three groups separately.)
I'm quite surprised that gnls and nls don't give identical results (although both give reasonable results); would like to dig in further ...
Parameter estimates (code below):
term nls gnls
1 b 1.41 1.40
2 pA 2.28 2.28
3 pB 3.19 3.14
4 pC 3.60 3.51
par(las = 1, bty = "l")
plot(y~x, data = d, col = d$group, pch = 16)
xvec <- seq(0, 1, length = 21)
f <- function(x) factor(x, levels = c("A","B","C"))
## fit1 is nls() fit
ll <- function(g, c = 1) {
lines(xvec, predict(fit1, newdata = data.frame(group=f(g), x = xvec)), col = c)
}
Map(ll, LETTERS[1:3], 1:3)
d2 <- expand.grid(x = xvec, group = f(c("A","B","C")))
pp <- predict(fit2, newdata = d2)
ll2 <- function(g, c = 1) {
lines(xvec, pp[d2$group == g], lty = 2, col = c)
}
Map(ll2, LETTERS[1:3], 1:3)
legend("bottomleft", lty = 1:2, col = 1, legend = c("nls", "gnls"))
library(tidyverse)
library(broom)
library(broom.mixed)
(purrr::map_dfr(list(nls=fit1, gnls=fit2), tidy, .id = "pkg")
%>% select(pkg, term, estimate)
%>% group_by(pkg)
## force common parameter names
%>% mutate(across(term, ~ c("b", paste0("p", LETTERS[1:3]))))
%>% pivot_wider(names_from = pkg, values_from = estimate)
)
I was able to get this by switching the class of the group from chr to factor. Note the addition of factor() when generating the dataset.
> d <- transform(data.frame(
+ x=seq(0,1,len=17),
+ group=rep(factor(c("A","B","B","C")),len=17)),
+ y=round(1/(1.4+x^ifelse(group=="A", 2.3, ifelse(group=="B",3.1, 3.5))),2)
+ )
> str(d)
'data.frame': 17 obs. of 3 variables:
$ x : num 0 0.0625 0.125 0.1875 0.25 ...
$ group: Factor w/ 3 levels "A","B","C": 1 2 2 3 1 2 2 3 1 2 ...
$ y : num 0.71 0.71 0.71 0.71 0.69 0.7 0.69 0.69 0.62 0.64 ...
> nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3,3,3)))
Nonlinear regression model
model: y ~ 1/(b + x^p[group])
data: d
b p1 p2 p3
1.406 2.276 3.186 3.601
residual sum-of-squares: 9.537e-05
Number of iterations to convergence: 5
Achieved convergence tolerance: 4.536e-06

Linear regression with `lm()`: prediction interval for aggregated predicted values

I'm using predict.lm(fit, newdata=newdata, interval="prediction") to get predictions and their prediction intervals (PI) for new observations. Now I would like to aggregate (sum and mean) these predictions and their PI's based on an additional variable (i.e. a spatial aggregation on the zip code level of predictions for single households).
I learned from StackExchange, that you cannot aggregate the prediction intervals of single predictions just by aggregating the limits of the prediction intervals. The post is very helpful to understand why this can't be done, but I have a hard time translating this bit into actual code. The answer reads:
Here's a reproducible example:
library(dplyr)
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit regression model
fit1 <- lm(Petal.Width ~ Petal.Length, data=train)
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
#Predict Pedal.Width for new data incl prediction intervals for each prediction
predictions1<-predict(fit1, newdata=pred, interval="prediction")
predictions2<-predict(fit2, newdata=pred, interval="prediction")
# Aggregate data by summing predictions for species
#NOT correct for prediction intervals
predictions_agg1<-data.frame(predictions1,Species=pred$Species) %>%
group_by(Species) %>%
summarise_all(funs(sum,mean))
predictions_agg2<-data.frame(predictions2,Species=pred$Species) %>%
group_by(Species) %>%
summarise_all(funs(sum,mean))
I couldn't find a good tutorial or package which describes how to properly aggregate predictions and their PI's in R when using predict.lm(). Is there something out there? Would highly appreciate if you could point me in the right direction on how to do this in R.
Your question is closely related to a thread I answered 2 years ago: linear model with `lm`: how to get prediction variance of sum of predicted values. It provides an R implementation of Glen_b's answer on Cross Validated. Thanks for quoting that Cross Validated thread; I didn't know it; perhaps I can leave a comment there linking the Stack Overflow thread.
I have polished my original answer, wrapping up line-by-line code cleanly into easy-to-use functions lm_predict and agg_pred. Solving your question is then simplified to applying those functions by group.
Consider the iris example in your question, and the second model fit2 for demonstration.
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
We split pred by group Species, then apply lm_predict (with diag = FALSE) on all sub data frames.
oo <- lapply(split(pred, pred$Species), lm_predict, lmObject = fit2, diag = FALSE)
To use agg_pred we need to specify a weight vector, whose length equals to the number of data. We can determine this by consulting the length of fit in each oo[[i]]:
n <- lengths(lapply(oo, "[[", 1))
#setosa versicolor virginica
# 11 13 14
If aggregation operation is sum, we do
w <- lapply(n, rep.int, x = 1)
#List of 3
# $ setosa : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
# $ versicolor: num [1:13] 1 1 1 1 1 1 1 1 1 1 ...
# $ virginica : num [1:14] 1 1 1 1 1 1 1 1 1 1 ...
SUM <- Map(agg_pred, w, oo)
SUM[[1]] ## result for the first group, for example
#$mean
#[1] 2.499728
#
#$var
#[1] 0.1271554
#
#$CI
# lower upper
#1.792908 3.206549
#
#$PI
# lower upper
#0.999764 3.999693
sapply(SUM, "[[", "CI") ## some nice presentation for CI, for example
# setosa versicolor virginica
#lower 1.792908 16.41526 26.55839
#upper 3.206549 17.63953 28.10812
If aggregation operation is average, we rescale w by n and call agg_pred.
w <- mapply("/", w, n)
#List of 3
# $ setosa : num [1:11] 0.0909 0.0909 0.0909 0.0909 0.0909 ...
# $ versicolor: num [1:13] 0.0769 0.0769 0.0769 0.0769 0.0769 ...
# $ virginica : num [1:14] 0.0714 0.0714 0.0714 0.0714 0.0714 ...
AVE <- Map(agg_pred, w, oo)
AVE[[2]] ## result for the second group, for example
#$mean
#[1] 1.3098
#
#$var
#[1] 0.0005643196
#
#$CI
# lower upper
#1.262712 1.356887
#
#$PI
# lower upper
#1.189562 1.430037
sapply(AVE, "[[", "PI") ## some nice presentation for CI, for example
# setosa versicolor virginica
#lower 0.09088764 1.189562 1.832255
#upper 0.36360845 1.430037 2.072496
This is great! Thank you so much! There is one thing I forgot to mention: in my actual application I need to sum ~300,000 predictions which would create a full variance-covariance matrix which is about ~700GB in size. Do you have any idea if there is a computationally more efficient way to directly get to the sum of the variance-covariance matrix?
Use the fast_agg_pred function provided in the revision of the original Q & A. Let's start it all over.
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
## list of new data
newdatlist <- split(pred, pred$Species)
n <- sapply(newdatlist, nrow)
#setosa versicolor virginica
# 11 13 14
If aggregation operation is sum, we do
w <- lapply(n, rep.int, x = 1)
SUM <- mapply(fast_agg_pred, w, newdatlist,
MoreArgs = list(lmObject = fit2, alpha = 0.95),
SIMPLIFY = FALSE)
If aggregation operation is average, we do
w <- mapply("/", w, n)
AVE <- mapply(fast_agg_pred, w, newdatlist,
MoreArgs = list(lmObject = fit2, alpha = 0.95),
SIMPLIFY = FALSE)
Note that we can't use Map in this case as we need to provide more arguments to fast_agg_pred. Use mapply in this situation, with MoreArgs and SIMPLIFY.

Comparing percent change of model coefficients

I am working through step 3 of purposeful model-building from Hosmer-Lemeshow and it suggests to compare the percent change in coefficients between a full model [Iris.mod1] and a reduced model [Iris.mod2]. I would like to automate this step if possible.
Right now I have the following code:
#Make species a binomial DV
iris = subset(iris, iris$Species != 'virginica')
iris$Species = as.numeric(ifelse(iris$Species == 'setosa', 1, 0))
#Build models
Iris.mod1 = glm(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,
data = iris, family = binomial())
Iris.mod2 = glm(Species~Sepal.Length+Petal.Length, data = iris, family =
binomial())
The dataset I am actually using has about 93 variables and 1.7 million rows. But I am using the iris data just for this example.
#Try to see if any coefficients changed by > 20%
paste(names(which((summary(Iris.mod1)$coefficients[2:
(nrow(summary(Iris.mod1)$coefficients)),1] -
(summary(Iris.mod2)$coefficients[2:
(nrow(summary(Iris.mod2)$coefficients)),1]/
(summary(Iris.mod1)$coefficients[2:nrow(summary(Iris.mod1)$coefficients)),1]
> 0.2 == TRUE)))))
However, this code is full of errors and I am lost in a sea of parenthesis.
Is there an efficient way to determine which variables coefficient changed by more than 20%?
Thank you in advance.
The broom package is really nice for making data frames of model coefficients and terms. We can use that to get things in a workable format:
library(broom)
m_list = list(m1 = Iris.mod1, m2 = Iris.mod2)
t_list = lapply(m_list, tidy)
library(dplyr)
library(tidyr)
bind_rows(t_list, .id = "mod") %>%
select(term, estimate, mod) %>%
spread(key = mod, value = estimate) %>%
mutate(p_change = (m2 - m1) / m1 * 100,
p_change_gt_20 = p_change > 20)
# term m1 m2 p_change p_change_gt_20
# 1 (Intercept) -6.556265 -65.84266 904.2709 TRUE
# 2 Petal.Length -19.053588 -49.04616 157.4117 TRUE
# 3 Petal.Width -25.032928 NA NA NA
# 4 Sepal.Length 9.878866 37.56141 280.2199 TRUE
# 5 Sepal.Width 7.417640 NA NA NA

Using neural networks neuralnet in R to predict factor values

I am using neuralnet package, use several inputs to predict an output.
Originally, my output is a factor variable, and I saw the error:
Error in neurons[[i]] %*% weights[[i]] :
requires numeric/complex matrix/vector arguments
When I converted the output to numeric variable, the error disappeared. Is there a way to neural network with factor output?
I adapted code that I found at this site, which uses the iris dataset with the neuralnet package to predict iris species from the morphological data.
Without a reproducible example, I'm not sure if this applies to your case. The key here was to convert the factorial response level to its own binary variable. The prediction is a bit different than other models in R - you choose the factor level with the highest score.
Example code:
library(neuralnet)
# Make training and validation data
set.seed(1)
train <- sample(nrow(iris), nrow(iris)*0.5)
valid <- seq(nrow(iris))[-train]
iristrain <- iris[train,]
irisvalid <- iris[valid,]
# Binarize the categorical output
iristrain <- cbind(iristrain, iristrain$Species == 'setosa')
iristrain <- cbind(iristrain, iristrain$Species == 'versicolor')
iristrain <- cbind(iristrain, iristrain$Species == 'virginica')
names(iristrain)[6:8] <- c('setosa', 'versicolor', 'virginica')
# Fit model
nn <- neuralnet(
setosa+versicolor+virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=iristrain,
hidden=c(3)
)
plot(nn)
# Predict
comp <- compute(nn, irisvalid[-5])
pred.weights <- comp$net.result
idx <- apply(pred.weights, 1, which.max)
pred <- c('setosa', 'versicolor', 'virginica')[idx]
table(pred, irisvalid$Species)
#pred setosa versicolor virginica
# setosa 23 0 0
# versicolor 1 21 7
# virginica 0 1 22
This might raise warnings:
nn <- neuralnet(
setosa+versicolor+virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=iristrain,
hidden=c(3)
)
So replace it with:
nn <- neuralnet(
setosa+versicolor+virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=iristrain, hidden = 3,lifesign = "full")
If this does not work:
comp <- compute(nn, irisvalid[-5])
then use
comp <- neuralnet::compute(nn, irisvalid[,1:4])

Resources