R random forest inconsistent predictions

I recently built a random forest model using the ranger package in R. However, I noticed that the predictions stored in the ranger object during training (accessible with model$predictions) do not match the predictions I get if I run the predict command on the same dataset using the fitted model. The following code reproduces the problem on the mtcars dataset. I created a binary variable just for the sake of converting this to a classification problem, though I saw similar results with regression trees as well.
library(datasets)
library(ranger)
mtcars <- mtcars
mtcars$mpg2 <- ifelse(mtcars$mpg > 19.2 , 1, 0)
mtcars <- mtcars[,-1]
mtcars$mpg2 <- as.factor(mtcars$mpg2)
set.seed(123)
mod <- ranger(mpg2 ~ ., mtcars, num.trees = 20, probability = T)
mod$predictions[1,] # Probability of 1 = 0.905
predict(mod, mtcars[1,])$predictions # Probability of 1 = 0.967
The same issue appears with the randomForest package, where I observed similar behavior reproducible with the following code.
library(randomForest)
set.seed(123)
mod <- randomForest(mpg2 ~ ., mtcars, ntree = 20)
mod$votes[1,]
predict(mod, mtcars[1,], type = "prob")
Can someone please tell me why this is happening? I would expect the results to be the same. Am I doing something wrong or is there an error in my understanding of some inherent property of random forest that leads to this scenario?

I think you may want to look a little more deeply into how a random forest works. I really recommend Introduction to Statistical Learning in R (ISLR), which is available for free online here.
That said, I believe the main issue here is that you are treating the mod$votes value and the predict() value as the same thing, when they are not. If you look at the documentation of the randomForest function, the mod$votes and mod$predicted values are out-of-bag ("OOB") predictions for the input data: each observation is predicted using only the trees whose bootstrap sample did not include it. This is different from what the predict() function produces when you pass the training data back in, which runs each observation through every tree in the forest, including the trees that were fit on it. Typically, you would want to train the model on one set of data and use the predict() function on the test set.
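To see the distinction with the question's own randomForest fit (mod from the code above), here is a small sketch; if I am reading predict.randomForest correctly, calling it without newdata should also return the OOB predictions:
# OOB class probabilities stored in the fitted object
mod$votes[1, ]
# predict() with no newdata should return the same OOB predictions
predict(mod, type = "prob")[1, ]
# passing the training data back in runs all trees on row 1, including the
# trees that had row 1 in their bootstrap sample, so the numbers differ
predict(mod, mtcars[1, ], type = "prob")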
Finally, you may need to re-run your set.seed() call every time you build the random forest if you want to get the same results for the mod object. I think there is a way to set the seed for an entire session, but I am not sure. This looks like a useful post: Fixing set.seed for an entire session
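For example, calling set.seed() immediately before each fit should give identical forests; a small sketch reusing the question's data (mod_a and mod_b are just illustrative names):
set.seed(123)
mod_a <- randomForest(mpg2 ~ ., mtcars, ntree = 20)
set.seed(123)
mod_b <- randomForest(mpg2 ~ ., mtcars, ntree = 20)
# the same seed right before each call gives the same bootstrap samples,
# so the stored OOB votes should match
all.equal(mod_a$votes, mod_b$votes)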
Side note: here you are not specifying the number of variables tried at each split (mtry), but the default is good enough in most cases (check the documentation of the random forest function you are using for its default). Maybe you are doing that in your actual code and didn't include it in your example, but I thought it was worth mentioning.
Hope this helps!
Edit:
I tried training the random forest using all of the data except for the first observation (Mazda RX4) and then used the predict function on just that observation, which I think illustrates my point a bit better. Try running something like this:
library(randomForest)
set.seed(123)
mod <- randomForest(mpg2 ~ ., mtcars[-1,], ntree = 200)
predict(mod, mtcars[1,], type = "prob")

Since you have converted mpg to mpg2, I was expecting that you want to build a classification model. Either way, mod$predictions gives you the out-of-bag probabilities computed while the model is learning from your data points, whereas the predict(mod, mtcars[,1:10])$predictions option gives probabilities from the fully trained model. I have run the same code with probability = F and got the result below; you can see that the predictions from the trained model are perfect, whereas from the mod$predictions option we have 3 misclassifications.
mod <- ranger(mpg2 ~ ., mtcars, num.trees = 20, probability = F)
> table(mtcars$mpg2, predict(mod, mtcars[, 1:10])$predictions)

     0  1
  0 17  0
  1  0 15

> table(mtcars$mpg2, mod$predictions)

     0  1
  0 15  2
  1  1 14
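The same point can be seen from the error rates ranger itself reports; a small sketch reusing mod and mtcars from above (the OOB error should correspond to the 3 misclassifications in the second table):
# out-of-bag misclassification rate stored by ranger
mod$prediction.error
# in-sample error from re-predicting the training rows with the full forest
mean(predict(mod, mtcars)$predictions != mtcars$mpg2)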

Related

Why does ranger predict give different numbers when re-applied to training data?

I am very new to machine learning. I am trying to explore fitting random forests with the ranger library in R. My dependent variable is continuous, so it would be a regression forest (and not just classification). While trying out the functions, I have noticed that there seems to be a discrepancy between the predictions stored by ranger and those returned by predict. The following lines result in different predictions in results and results_alternative:
rf_reg <- ranger(formula = y ~ ., data = training_df)
results <- rf_reg$predictions
results_alternative <- predict(rf_reg, data = training_df)$predictions
Could anybody please explain why there is a discrepancy and what is causing it? Which one is correct? I have tried it with classification on iris data and that seemed to give the same results. Many thanks!
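For what it's worth, this appears to be the same out-of-bag versus full-forest distinction discussed above; a minimal sketch with simulated data (all names here are illustrative) that reproduces it for regression:
library(ranger)

set.seed(42)
training_df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
training_df$y <- training_df$x1 + rnorm(100)

rf_reg <- ranger(y ~ ., data = training_df, num.trees = 100)

# OOB predictions: each row is predicted only by trees that did not see it
head(rf_reg$predictions)
# full-forest predictions on the training rows: every tree contributes, so
# the values sit closer to the observed y and differ from the OOB ones
head(predict(rf_reg, data = training_df)$predictions)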

Most important variables for finding group membership

I have a dataset of 8,100 observations on 118 variables that are used to determine which of 4 groups each respondent falls into. I am interested in which variables are the most important for predicting group membership. My data is a combination of ordinal and binary variables. I initially did a discriminant function analysis, but then read that this does not handle binary data well. Next I tried a multinomial logistic regression; however, from there I am struggling to work out which variables are the most important. I had tried an rpart decision tree, but then I read that these are not very stable, and indeed, when I ran it on a random half of my data I got different results every time. Now I am trying a dominance analysis. I can get it working for a linear model (lm), but for both the multinomial logistic regression and the discriminant function analysis I get the error:
Error in daRawResults(x = x, constants = constants, terms = terms, fit.functions = fit.functions, :
Not implemented method to retrieve data from model
Does anyone have any advice for what else I can try? Only 4 of the 118 variables are binary, so I can remove them if needed and will still have a good analysis.
Here is a reproducible example including a much smaller example dataset:
set.seed(1) ## for reproducibility
remotes::install_github("clbustos/dominanceAnalysis") # If you don't have the dominance analysis package
library(dominanceanalysis)
library(MASS)
library(nnet)
mydata <- data.frame(Segments = sample(1:4, 15, replace = TRUE),
                     var1 = sample(1:7, 15, replace = TRUE),
                     var2 = sample(1:7, 15, replace = TRUE),
                     var3 = sample(1:6, 15, replace = TRUE),
                     var4 = sample(1:2, 15, replace = TRUE))
# Show that it works for a linear model
LM<-lm(Segments ~., mydata)
da.LM<-dominanceAnalysis(LM);da.LM
#var1 is the most important, followed by var4
# Try the discriminant function analysis
DFA <- lda(Segments~., data=mydata)
da.DFA <- dominanceAnalysis(DFA)
# Error
# Try multinomial logistic regression
MLR <- multinom(Segments ~ ., data = mydata, maxit=500)
da.MLR <- dominanceAnalysis(MLR)
# Error
I've discovered a partial answer.
The dominanceanalysis package can only be used on these models: Ordinary Least Squares, Generalized Linear Models, Dynamic Linear Models and Hierarchical Linear Models.
Source: https://github.com/clbustos/dominanceAnalysis
This explains why it didn't work for my data - I wasn't using those models.
I have decided to pursue the decision tree option of variable selection by using a random forest.
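For reference, a minimal sketch of that route using the small example data above (ntree and the importance measure shown are illustrative choices):
library(randomForest)

mydata$Segments <- as.factor(mydata$Segments)  # treat group membership as a class label

set.seed(1)
rf <- randomForest(Segments ~ ., data = mydata, ntree = 500, importance = TRUE)

# permutation importance (mean decrease in accuracy) and Gini importance per variable
importance(rf)
varImpPlot(rf)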

Questions about the predict function in the randomForestSRC package

As is common practice in machine learning, I divided my original data set into a training set and a test set (70:30).
Here is my code.
install.packages("randomForestSRC")
library(randomForestSRC)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
train <- sample(1:nrow(data), round(nrow(data) * 0.70))
data.grow <- rfsrc(Surv(days, status) ~ .,
                   data[train, ],
                   ntree = 100,
                   tree.err = TRUE,
                   importance = TRUE,
                   nsplit = 1,
                   proximity = TRUE)
data.pred <- predict(data.grow,
                     data[-train, ],
                     importance = TRUE,
                     tree.err = TRUE)
My question is about the predict function in this code.
Originally, I wanted to build a prediction model based on a random survival forest to predict disease development.
For example, after building the prediction model with the training data set, I wanted to estimate the probability of disease development for test data that has no information about disease incidence for each individual, because I would like to estimate that probability from each subject's general characteristics such as age, BMI, sex, and so on.
However, contrary to my intention, the predict function in this package did not seem to work on data that has no status information (event/censored); it appears to require the outcome information.
Therefore, I cannot understand what the predict function means here.
If predict works only with outcome information, how can I make a prediction of disease development from a subject's general characteristics in the future?
In addition, if the prediction in this model is constructed using the outcome information, what is the meaning of "predict" in a random survival forest model?
Please let me know what the predict function in this package actually does.
Thank you for reading my long question.
The predict for this type of model, i.e. predict.rfsrc, works much like you'd expect it to if you've used predict with glm, lm, RRF or other models.
The predict statement does not require you to know the outcome for the prediction data set. I am trying to understand why you thought that it did.
Your example rfsrc statement does not work because it refers to columns that are not in the example data set.
I think the best plan is that I will show you using a reproducible example, below. If you have further questions you can ask me in a comment.
# Train a RFSRC model
mtcars.mreg <- rfsrc(Surv(mpg, cyl) ~ ., data = mtcars[1:30, ],
                     tree.err = TRUE, importance = TRUE)
# Simulate new data
new_data <- mtcars[31:32,]
# predict
predicted <-predict(mtcars.mreg, new_data)
predicted
Sample size of test (predict) data: 2
Number of grow trees: 1000
Average no. of grow terminal nodes: 4.898
Total no. of grow variables: 9
Analysis: RSF
Family: surv-CR
Test set error rate: NA
predicted$predicted
event.1 event.2 event.3
[1,] 0.4781338 2.399299 14.71493
[2,] 3.2185606 4.720809 2.15895
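To address the original concern directly: as far as I can tell, the outcome columns do not need to be present in the test data; if they are missing, predict.rfsrc should still return predictions and simply report the test-set error rate as NA. A hedged sketch continuing the example above (new_data_no_outcome is just an illustrative name):
# drop the outcome columns (here mpg and cyl play the role of time and status)
new_data_no_outcome <- mtcars[31:32, !(names(mtcars) %in% c("mpg", "cyl"))]

# predictions are still produced from the subjects' characteristics alone
predict(mtcars.mreg, new_data_no_outcome)$predicted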

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the glmnet package. I need to run several LASSO analyses to calibrate a large number of variables (% reflectance at each wavelength across the spectrum) against one dependent variable. I have a couple of doubts about the procedure and the results that I would like to resolve. My provisional code is below:
First I split my data in training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwards, I run a cross-validated LASSO model on the training set and extract and write out the non-zero coefficients for lambda.min, because one of my goals is to see which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y = y.train, x = x.train, family = "gaussian",
                        nfolds = 5, standardize = TRUE, alpha = 1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use the predict function to apply the object cv.lasso.1 (the model obtained previously) to the predictors of the testing set (x.test) in order to get predictions, and I run the correlation between the predicted and the actual values of y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx = x.test, type = "response",
                       s = "lambda.min")
cor.test(x = c(predict.1.2), y = c(y.test))
This is simplified code and it has worked fine so far. The point is that I would like to loop the whole procedure (one hundred repetitions) and, for each repetition, collect the non-zero coefficients of the cross-validated model as well as the correlation coefficient between the predicted and actual values for the testing set. I've tried but couldn't get any clear results. Can someone give me a hint?
thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky, and in your case it may not be necessary in the way you have outlined.
If you are trying to find the most predictive variables, you can use PCA (Principal Component Analysis) to select the variables with the most variation within and between variables. However, PCA does not consider your outcome at all, so with a poor model design it will pick the least correlated variables in your data, which may not be predictive; you should be very aware of all the variables in the set. This is a way of reducing the dimensionality of your data before a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
                  center = TRUE,
                  scale. = TRUE)
Scaling and centering are essential to making these models work right: they put your variables on a common scale by setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those settings as they are. If you have skewed or kurtotic data, you might need to address that prior to PCA. Run this ONLY on your predictors; keep your target/outcome variable out of the data set.
If you have a classification problem with a lot of data, try LDA (Linear Discriminant Analysis), which reduces variables by optimizing the variance of each predictor with respect to the OUTCOME variable; it specifically considers your outcome.
require(MASS)
yourLDA <- lda(formula = outcome ~ .,
               data = yourdata)
You can also set the prior probabilities in LDA if you know the global probability for each class, or you can leave them out and lda() will use the class proportions of the training set. You can read about that here:
LDA from MASS package
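For example, a prior can be passed directly to lda(); the class probabilities below are purely illustrative (one per class, summing to 1):
yourLDA <- lda(outcome ~ ., data = yourdata,
               prior = c(0.5, 0.3, 0.2))  # illustrative priors for a 3-class outcome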
So this gets you headed in the right direction for reducing the complexity of your data via feature selection in a computationally solid way. As for building the most robust model via repeated model building, this is known as cross-validation. There is a cv.glm function in the boot package which can help you get this taken care of in a safe way.
You can use the following as a rough guide:
require(boot)
# fit the model first, then cross-validate it with cv.glm from the boot package
yourGLM   <- glm(outcomeVariable ~ ., data = yourData, family = "gaussian")
yourCVGLM <- cv.glm(data = yourData, glmfit = yourGLM, K = 100)
Here K = 100 specifies 100-fold cross-validation: the model is refit on random subsets of your OBSERVATIONS (not variables), holding out a different fold each time.
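The cross-validated estimate of prediction error can then be read from the returned object, for example:
# raw and bias-adjusted cross-validation estimates of prediction error
yourCVGLM$delta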
So the process is twofold: reduce variables using one of the two methods above, then use cross-validation to build a single model from repeated trials without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called bootstrapping, and it is powerful and available for many different model types.
Not as much code as you might hope for, but it should point you in a decent direction.

The Effect of Specifying Training Data as New Data when Making Random Forest Predictions in R

While using the predict function in R to get the predictions from a Random Forest model, I misspecified the training data as newdata as follows:
RF1pred <- predict(RF1, newdata=TrainS1, type = "class")
Used like this, I get extremely high accuracy and AUC, which I am sure is not right, but I couldn't find a good explanation for it. This thread is the closest I got, but I can't say I fully understand the explanation there.
If someone could elaborate, I will be grateful.
Thank you!
EDIT: Important to note: I get sensible accuracy and AUC if I run the prediction without specifying a dataset at all, like so:
RF1pred <- predict(RF1, type = "class")
If a new dataset is not explicitly specified, isn't the training data used for the prediction? Hence, shouldn't I get the same results from both lines of code?
EDIT2: Here is a sample code with random data that illustrates the point. When predicting without specifying newdata, the AUC is 0.4893. When newdata=train is explicitly specified, the AUC is 0.7125.
# Generate sample data
set.seed(15)
train <- data.frame(x1=sample(0:1, 100, replace=T), x2=rpois(100,10), y=sample(0:1, 100, replace=T))
# Build random forest
library(randomForest)
model <- randomForest(x1 ~ x2, data=train)
pred1 <- predict(model)
pred2 <- predict(model, newdata = train)
# Calculate AUC
library(ROCR)
ROCRpred1 <- prediction(pred1, train$x1)
AUC <- as.numeric(performance(ROCRpred1, "auc")@y.values)
AUC # 0.4893
ROCRpred2 <- prediction(pred2, train$x1)
AUC <- as.numeric(performance(ROCRpred2, "auc")@y.values)
AUC # 0.7125
If you look at the documentation for predict.randomForest you will see that if you do not supply a new data set you will get the out-of-bag (OOB) performance of the model. Since the OOB performance is theoretically related to the performance of your model on a different data set, the results will be much more realistic (although still not a substitute for a real, independently collected, validation set).
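A quick way to convince yourself of this with the example above: the no-newdata predictions should line up with the OOB predictions stored in the fitted object (a small sketch reusing model, pred1, and pred2 from the question):
# predict() without newdata should return the stored OOB predictions
all.equal(pred1, model$predicted)   # should be TRUE

# passing the training data back in re-runs every tree on every row,
# which is why pred2 looks over-optimistic
head(cbind(oob = pred1, refit = pred2))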
