randomForest model doesn't run - R

I am trying to run a randomForest model on the iris data without the variable Petal.Length. The code gives me errors on prediction. How can I code this properly? Thanks for the help.
Richard
data(iris)
attach(iris)
iris$id <- 1:nrow(iris)
library(dplyr)
train <- iris %>%
  sample_frac(0.8)
test <- iris %>%
  anti_join(train, by = "id")
library(randomForest)
library(caret)
fit <- randomForest(Species ~ Sepal.Length + Sepal.Width + Petal.Width, data = train)
prediction <- predict(fit, test[1:2, 4])
confusionMatrix(test$Species, prediction)

Your subsetting of the test dataset is wrong. Just use
prediction <- predict(fit, newdata = test)
in place of
predict(fit, test[1:2, 4])
It will automatically pick up the required independent variables by name.
Or you can use
prediction <- predict(fit, subset(test, select = -c(Petal.Length)))

In the predict function, you have to supply all the numeric data used for training. Try this instead:
prediction <- predict(fit, test[, 1:4])
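For completeness, here is a minimal end-to-end sketch of the corrected workflow (a set.seed() call is added here for reproducibility; it was not in the original code):
library(dplyr)
library(randomForest)
library(caret)
set.seed(42)  # added for reproducibility; not in the original code
data(iris)
iris$id <- 1:nrow(iris)
train <- iris %>% sample_frac(0.8)
test <- iris %>% anti_join(train, by = "id")
fit <- randomForest(Species ~ Sepal.Length + Sepal.Width + Petal.Width, data = train)
# predict() matches the model's predictors by column name, so passing
# the whole test data frame is enough; extra columns like id are ignored
prediction <- predict(fit, newdata = test)
confusionMatrix(test$Species, prediction)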

Related

How can I calculate the mean square error in R of a regression tree?

I am working with the wine quality database.
I am studying regression trees that depend on different variables, as follows:
library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)
vinos <- read.csv(file = 'Wine.csv', header = TRUE)
arbol0 <- rpart(formula = quality ~ chlorides, data = vinos, method = "anova")
fancyRpartPlot(arbol0)
arbol1 <- rpart(formula = quality ~ chlorides + density, data = vinos, method = "anova")
fancyRpartPlot(arbol1)
I want to calculate the mean squared error to see if arbol1 is better than arbol0. I will use my own dataset, since no more data is available. I have tried to do it as
aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")
and then manually subtract the last column of the dataframe from aaa and bbb. However, I am getting an error. Can someone please help me?
This website could be useful for you. It's very important to split your dataset into train and test subsets before training your models. In the following code I've done it with base functions, but there is also a function called sample.split, from the caTools package, that does the same procedure. I'm also attaching this website, where you can see all the ways to split data in R.
Remember that the Mean Squared Error (MSE) is defined as:
MSE = (1/n) * Σ (y_i - ŷ_i)^2
So it's very simple to apply in R: you just compute the mean of the squared differences between the observed values (i.e., the response variable from your test subset) and the predicted values (i.e., the values you predicted from the model with the predict function).
A solution for your wine dataset, based on the previous website, could be the following.
library(rpart)
library(dplyr)
library(data.table)
vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
# Split data into train and test subsets
sample_index <- sample(nrow(vinos), size = nrow(vinos)*0.75)
train <- vinos[sample_index, ]
test <- vinos[-sample_index, ]
# Train regression trees models
arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
# Make predictions for each model
pred0 <- predict(arbol0, newdata = test)
pred1 <- predict(arbol1, newdata = test)
# Calculate MSE for each model
mean((pred0 - test$quality)^2)
mean((pred1 - test$quality)^2)
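If you compute this more than once, a small helper keeps it tidy (a sketch: mse() is a name introduced here, not a function from any package):
# Hypothetical helper for the mean squared error
mse <- function(observed, predicted) mean((observed - predicted)^2)
mse(test$quality, pred0)
mse(test$quality, pred1)
The model with the lower test MSE fits better; taking sqrt() of the result gives the RMSE, which is on the same scale as quality.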

How to extract random intercepts from mixed effects Tidymodels

I am trying to extract random intercepts from tidymodels using lme4 and multilevelmod. I am able to do this using lme4, as shown below:
Using R and lme4:
library("tidyverse")
library("lme4")
# set up model
mod <- lmer(Reaction ~ Days + (1|Subject),data=sleepstudy)
# create expanded df
expanded_df <- with(sleepstudy,
                    data.frame(
                      expand.grid(Subject = levels(Subject),
                                  Days = seq(min(Days), max(Days), length = 51))))
# create predicted df with **random intercepts**
predicted_df <- data.frame(expanded_df,resp=predict(mod,newdata=expanded_df))
predicted_df
# plot intercepts
ggplot(predicted_df,aes(x=Days,y=resp,colour=Subject))+
geom_line()
Using tidymodels:
# example from
# https://github.com/tidymodels/multilevelmod
library("multilevelmod")
library("tidymodels")
library("tidyverse")
library("lme4")
#> Loading required package: parsnip
data(sleepstudy, package = "lme4")
# set engine to lme4
mixed_model_spec <- linear_reg() %>% set_engine("lmer")
# create model
mixed_model_fit_tidy <-
mixed_model_spec %>%
fit(Reaction ~ Days + (1 | Subject), data = sleepstudy)
expanded_df_tidy <- with(sleepstudy,
                         data.frame(
                           expand.grid(Subject = levels(Subject),
                                       Days = seq(min(Days), max(Days), length = 51))))
predicted_df_tidy <- data.frame(expanded_df_tidy,resp=predict(mixed_model_fit_tidy,new_data=expanded_df_tidy))
ggplot(predicted_df_tidy,aes(x=Days,y=.pred,colour=Subject))+
geom_line()
Using the predict() function seems to give only the fixed-effect predictions.
Is there a way to extract the random intercepts from tidymodels and multilevelmod? I know the package is still in development, so it might not be possible at this stage.
I think you can work around this as follows:
predicted_df_tidy <- mutate(expanded_df_tidy,
                            .pred = predict(mixed_model_fit_tidy,
                                            new_data = expanded_df_tidy,
                                            type = "raw", opts = list(re.form = NULL)))
bind_cols() instead of mutate() might be useful in some circumstances.
The issue is that multilevelmod internally sets the default for prediction to re.form = NA; the code above resets it to re.form = NULL (which is the lme4 default, i.e. include all random effects in the prediction).
If you actually want the random intercepts (only), I guess you could use predicted_df_tidy %>% filter(Days == 0).
PS If you want to be more 'tidy' about this, I think you can use purrr::cross_df() in place of expand.grid() and pipe the results directly to mutate() ...
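If what you want is the estimated random intercepts themselves rather than predictions, another route is to pull the underlying lmer fit out of the parsnip object and query it with lme4 directly. A sketch (extract_fit_engine() comes from parsnip; on older versions, mixed_model_fit_tidy$fit holds the same object):
# Get the underlying lme4 fit and read off the random intercepts
lmer_fit <- extract_fit_engine(mixed_model_fit_tidy)
lme4::ranef(lmer_fit)$Subject  # per-Subject deviations from the fixed intercept
coef(lmer_fit)$Subject         # fixed + random, i.e. the per-Subject intercepts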

Missing object in randomForest model when predicting on test dataset

Sorry if this was already asked, but I couldn't find it in half an hour of looking, so I would appreciate it if you could point me in some direction.
I am having trouble with a missing object in the model, even though I don't actually use this object when building the model; it's just present in the dataset (as you can see in the example below).
It is a problem because I have already trained some rf models; I am loading the models into the environment and reusing them as they are. The test dataset doesn't contain some variables that are present in the dataset the model was built on, but they are not used in the model itself!
library(randomForest)
data(iris)
smp_size <- floor(0.75*nrow(iris))
set.seed(123)
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
test <- iris[-train_ind, ]
test$Sepal.Length <- NULL # for the sake of example I drop this column
rf_model <- randomForest(Species ~ . - Sepal.Length, # I don't use the column in training model
data = train)
rf_prediction <- predict(rf_model, newdata = test)
When I try to predict on test dataset, I get an error:
Error in eval(expr, envir, enclos) : object 'Sepal.Length' not found
What I hope to achieve, is use the models I have already built, as redoing them without missing variables would be costly.
Thanks for advice!
As your models are already built, you will want to add the missing columns back onto the test set before running the model. Just add the missing columns with a value of 0, as in the following example.
library(randomForest)
library(dplyr)
data(iris)
smp_size <- floor(0.75*nrow(iris))
set.seed(123)
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
test <- iris[-train_ind, ]
test$Sepal.Length <- NULL
rf_model <- randomForest(Species ~ . - Sepal.Length,
data = train)
# adding the missing column(s) back to your test set
missingColumns <- setdiff(colnames(train), colnames(test))
test[, missingColumns] <- 0
rf_prediction <- predict(rf_model, newdata = test)
rf_prediction
# showing this produces the same results
train2 <- iris[train_ind, ]
test2 <- iris[-train_ind, ]
test2$Sepal.Length <- NULL
train2$Sepal.Length <- NULL
rf_model2 <- randomForest(Species ~ .,
data = train2)
rf_prediction2 <- predict(rf_model2, newdata = test2)
rf_prediction2 == rf_prediction
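If you have to do this for many stored models, a small wrapper saves repetition (a sketch: add_missing_cols() is a name introduced here, and it assumes a constant fill value is acceptable because the model never splits on those columns):
# Hypothetical helper: pad new data with any columns it is missing
add_missing_cols <- function(newdata, train_cols, fill = 0) {
  missing <- setdiff(train_cols, colnames(newdata))
  if (length(missing) > 0) newdata[, missing] <- fill
  newdata
}
rf_prediction <- predict(rf_model, newdata = add_missing_cols(test, colnames(train)))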

Predict using randomForest package in R

How can I use the result of a randomForest call in R to predict labels on some unlabeled data (e.g. real-world input to be classified)?
Code:
train_data = read.csv("train.csv")
input_data = read.csv("input.csv")
result_forest = randomForest(Label ~ ., data=train_data)
labeled_input = result_forest.predict(input_data) # I need something like this
train.csv:
a;b;c;label;
1;1;1;a;
2;2;2;b;
1;2;1;c;
input.csv:
a;b;c;
1;1;1;
2;1;2;
I need to get something like this
a;b;c;label;
1;1;1;a;
2;1;2;b;
Let me know if this is what you are getting at.
You train your random forest with your training data:
# Training dataset; the sample files are semicolon-delimited, hence sep = ";",
# and stringsAsFactors = TRUE makes label a factor so randomForest classifies
train_data <- read.csv("train.csv", sep = ";", stringsAsFactors = TRUE)
# Train the randomForest
library(randomForest)
forest_model <- randomForest(label ~ ., data = train_data)
Now that the random forest is trained, you want to give it new data so it can predict the labels:
input_data <- read.csv("input.csv", sep = ";")
input_data$predictedlabel <- predict(forest_model, newdata = input_data)
The above code adds a new column to input_data showing the predicted label.
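If you then want a file in the same semicolon-delimited shape as the desired output, something along these lines should work (a sketch, assuming the column layout shown in the question):
# Write the labeled data back out, semicolon-delimited, without row names
write.table(input_data, file = "labeled_input.csv", sep = ";", row.names = FALSE, quote = FALSE)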
You can use the predict function.
For example:
library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data = iris[ind == 1, ])
iris.pred <- predict(iris.rf, iris[ind == 2, ])
This is from http://ugrad.stat.ubc.ca/R/library/randomForest/html/predict.randomForest.html

Predict probabilities with bigrf

I am able to build a model with the bigrf package, but is there a way to predict probabilities instead of classes? For class prediction I use
predictions <- predict(forest, test, testset$y)
where forest is a model. I tried type = "prob" but it does not do anything. Is there a way to do this?
I have big data, so I need to use this package in order to be able to process it.
UPD:
library(bigrf)
library(randomForest)
library(doParallel)  # provides registerDoParallel() and detectCores()
data("iris")
iris <- iris[iris$Species != "virginica", ]
x <- iris[, 1:4]
y <- iris$Species
vars <- c(1:4)
s <- sample(1:nrow(x), 60)
registerDoParallel(cores = detectCores(all.tests = TRUE))
forest <- bigrfc(x[s, ], y[s], ntree = 5L, varselect = vars)
predictions <- predict(forest, x[-s, ])
So, the question is: how do I get probabilities in predictions instead of classes from an object of class bigrfc?
According to this post, it should be possible to obtain the class probabilities with
predictions_probs <- predictions@testvotes / rowSums(predictions@testvotes)
I haven't tested it though. HTH.
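If that slot is present on your prediction object, a quick sanity check (a sketch, building on the same untested assumption that testvotes holds per-class vote counts) is that every row of the normalized matrix sums to 1:
# Normalize vote counts to per-class probabilities and check each row sums to 1
prob <- predictions@testvotes / rowSums(predictions@testvotes)
stopifnot(all(abs(rowSums(prob) - 1) < 1e-8))
head(prob)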
