I am sorry for posting this question again but I really need help on this now.
I am trying to calculate the AUC on the training set of a randomForest model in R. There are two ways to calculate it, but they give different results. The following is a reproducible example of my question. I would really appreciate it if someone could help!
library(randomForest)
library(pROC)
library(ROCR)
# prep training to binary outcome
train <- iris[iris$Species %in% c('virginica', 'versicolor'),]
train$Species <- droplevels(train$Species)
# build model
rfmodel <- randomForest(Species~., data=train, importance=TRUE, ntree=2)
#the first way to calculate training auc
rf_p_train <- predict(rfmodel, type="prob",newdata = train)[,2]
rf_pr_train <- prediction(rf_p_train, train$Species)
r_auc_train1 <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
r_auc_train1 #0.9888
#the second way to calculate training auc
rf_p_train <- as.vector(rfmodel$votes[,2])
rf_pr_train <- prediction(rf_p_train, train$Species);
r_auc_train2 <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
r_auc_train2 #0.9175
To get the same result from both approaches, you should exclude the newdata argument from the first predict() call (this is explained in the randomForest package documentation for the predict method),
rf_p_train <- predict(rfmodel, type="prob")[,2]
rf_pr_train <- prediction(rf_p_train, train$Species)
r_auc_train1 <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
r_auc_train1
returns,
[1] 0.8655172
The second approach uses the out-of-bag (OOB) votes, as explained in the package documentation of the randomForest function,
rf_p_train <- as.vector(rfmodel$votes[,2])
rf_pr_train <- prediction(rf_p_train, train$Species);
r_auc_train2 <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
r_auc_train2
returns (the same result),
[1] 0.8655172
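As a quick sanity check (my own addition, reusing the rfmodel and train objects built above), the OOB probabilities returned by predict() without newdata should match the votes matrix stored in the fitted model:
oob_pred <- predict(rfmodel, type = "prob")[, 2]   # OOB probabilities, no newdata
all.equal(oob_pred, rfmodel$votes[, 2])            # should return TRUE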
I am working with the wine quality dataset.
I am studying regression trees that depend on different variables, as follows:
library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)
vinos <- read.csv(file = 'Wine.csv', header = T)
arbol0<-rpart(formula=quality~chlorides, data=vinos, method="anova")
fancyRpartPlot(arbol0)
arbol1<-rpart(formula=quality~chlorides+density, data=vinos, method="anova")
fancyRpartPlot(arbol1)
I want to calculate the mean squared error to see whether arbol1 is better than arbol0. I will use my own dataset, since no more data is available. I have tried to do it as
aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")
and then manually subtract the last column of the data frame from aaa and bbb. However, I am getting an error. Can someone please help me?
This website could be useful for you. It's very important to split your dataset into train and test subsets before training your models. In the following code I've done it with base functions, but there is another function, sample.split from the caTools package, that does the same procedure. I'm also attaching this website, where you can see all the ways to split data in R.
Remember that the formula for the Mean Squared Error (MSE) is the following:
MSE = (1/n) * sum_{i=1}^{n} (y_i - y_hat_i)^2
So it's very simple to apply in R: you just compute the mean of the squared differences between the observed values (i.e., the response variable from your test subset) and the predicted values (i.e., the values predicted by the model with the predict function).
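In R that is essentially one line; here observed and predicted are placeholder names for your test responses and the model predictions:
# generic MSE helper: mean of the squared differences between observed and predicted
mse <- function(observed, predicted) mean((observed - predicted)^2)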
A solution for your wine dataset could be this one, based on the previous website.
library(rpart)
library(dplyr)
library(data.table)
vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
# Split data into train and test subsets
sample_index <- sample(nrow(vinos), size = nrow(vinos)*0.75)
train <- vinos[sample_index, ]
test <- vinos[-sample_index, ]
# Train regression trees models
arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
# Make predictions for each model
pred0 <- predict(arbol0, newdata = test)
pred1 <- predict(arbol1, newdata = test)
# Calculate MSE for each model
mean((pred0 - test$quality)^2)
mean((pred1 - test$quality)^2)
I was wondering how to measure the prediction performance (on the test dataset) of an mlr3 model. For example, if I create a kNN model using mlr3 like so:
library("mlr3")
library("mlr3learners")
# get data and split into training and test
aq <- na.omit(airquality)
train <- sample(nrow(aq), round(.7*nrow(aq))) # split 70-30
aqTrain <- aq[train, ]
aqTest <- aq[-train, ]
# create model
aqT <- TaskRegr$new(id = "knn", backend = aqTrain, target = "Ozone")
aqL <- lrn("regr.kknn")
aqMod <- aqL$train(aqT)
I can measure the mean squared error of the model predictions by doing something like:
prediction <- aqL$predict(aqT)
measure <- msr("regr.mse")
prediction$score(measure)
But how do I incorporate the test data into this? That is, how do I measure the performance of predictions on the test data?
In the previous version of mlr I could do something like this: get the predictions on the test dataset and measure the performance of, say, the MSE or R-squared values like so:
pred <- predict(aqMod, newdata = aqTest)
performance(pred, measures = list(mse, rsq))
Any suggestions as to how I can do this in mlr3?
You should try this code:
pred <- aqMod$predict_newdata(aqTest)
pred$score(list(msr("regr.mse"),
msr("regr.rmse")))
I would like to analyze my data with a gradient boosted model.
However, as my data is a kind of cohort data, I have trouble understanding the results of this model.
Here's my code. The analysis was performed on the example data.
install.packages("randomForestSRC")
install.packages("gbm")
install.packages("survival")
library(randomForestSRC)
library(gbm)
library(survival)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
set.seed(9512)
train <- sample(1:nrow(data), round(nrow(data)*0.7))
data.train <- data[train, ]
data.test <- data[-train, ]
set.seed(9741)
gbm <- gbm(Surv(days, status)~.,
data.train,
interaction.depth=2,
shrinkage=0.01,
n.trees=500,
distribution="coxph")
summary(gbm)
set.seed(9741)
gbm.pred <- predict.gbm(gbm,
n.trees=500,
newdata=data.test,
type="response")
As I read in the package documentation, "gbm.pred" is the result of Cox's partial likelihood.
set.seed(9741)
lambda0 = basehaz.gbm(t=data.test$days,
delta=data.test$status,
t.eval=sort(data.test$days),
cumulative = FALSE,
f.x=gbm.pred,
smooth=T)
hazard=lambda0*exp(gbm.pred)
In this code, lambda0 is the baseline hazard function.
So, according to the formula h(t|x) = lambda0(t) * exp(f(x)),
"hazard" is the hazard function.
However, what I want to calculate is the survival function,
because I would like to compare the outcome in the original data (data$status) to the prediction result (the survival function).
Please let me know how to calculate the survival function.
Thank you.
Actually, what you need is the cumulative baseline hazard function, i.e. the integral part Lambda0(t) = integral from 0 to t of lambda0(z) dz, and the survival function can then be computed as
S(t|X) = exp(-exp(f(X)) * Lambda0(t))
where f(X) is the gbm prediction, which is equal to the log hazard ratio.
I think this tutorial about gbm-based survival analysis will help you:
https://github.com/liupei101/Tutorial-Machine-Learning-Based-Survival-Analysis/blob/master/Tutorial_Survival_GBM.ipynb
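Putting that formula into code (a sketch on my part, reusing gbm.pred and data.test from the question, and using cumulative = TRUE so that basehaz.gbm returns the cumulative baseline hazard):
library(gbm)
t.eval <- sort(data.test$days)
# cumulative baseline hazard: Lambda0(t) = integral of lambda0(z) dz from 0 to t
Lambda0 <- basehaz.gbm(t = data.test$days,
                       delta = data.test$status,
                       f.x = gbm.pred,
                       t.eval = t.eval,
                       cumulative = TRUE,
                       smooth = TRUE)
# survival matrix: one row per test observation, one column per evaluation time
# S(t|X) = exp(-exp(f(X)) * Lambda0(t))
surv <- exp(-exp(gbm.pred) %o% Lambda0)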
I used two ways to calculate the AUC on the training set of a randomForest model, but I get very different results. The two ways are as follows:
rfmodel <- randomForest(y~., data=train, importance=TRUE, ntree=1000)
Way 1 of calculating AUC of train set:
rf_p_train <- predict(rfmodel, type="prob", newdata = train)[,'yes']
rf_pr_train <- prediction(rf_p_train, train$y)
r_auc_train[i] <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
Way 2 of calculating AUC of train set:
rf_p_train <- as.vector(rfmodel$votes[,2])
rf_pr_train <- prediction(rf_p_train, train$y)
r_auc_train[i] <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
Way 1 gives me an AUC around 1, but way 2 gives me an AUC around 0.65. I am wondering why these two results differ so much. Could anyone help me with this? I really appreciate it. As for the data, I am sorry that I am not allowed to share it here. This is the first time I have asked a question here, so please forgive me if anything is unclear. Thanks a lot!
OK. The second way is correct. Why? Because in the first way, you treat the training data as a new dataset and predict it with a model that has already seen it. In the second way, what you get is the so-called out-of-bag (OOB) estimate, and that is the way to calculate the AUC.
I am not sure what data you are using. It is best if you provide a reproducible example, but I think I was able to piece one together:
library(randomForest)
#install.packages("ModelMetrics")
library(ModelMetrics)
# prep training to binary outcome
train <- iris[iris$Species %in% c('virginica', 'versicolor'),]
train$Species <- droplevels(train$Species)
# build model
rfmodel <- randomForest(Species~., data=train, importance=TRUE, ntree=2)
# generate predictions
preds <- predict(rfmodel, type="prob",newdata = train)[,2]
# Calculate AUC
auc(train$Species, preds)
# Calculate LogLoss
logLoss(train$Species, preds)
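If you also want the out-of-bag version of the AUC that the other answers discuss (my addition, not part of the example above), ModelMetrics can score the stored OOB votes directly:
# OOB AUC: score the out-of-bag votes instead of re-predicting the training data
# note: with a very small ntree (like the 2 used above) some rows may never be out of bag,
# so in practice use a larger ntree
auc(train$Species, rfmodel$votes[,2])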
I am looking for some guidance on a homework assignment I am working on for a class. We are given a dataset with 14K observations and asked to build a prediction model. Using the caret package, I split the dataset into training and testing sets (4909 testing observations); the model predicts the last variable, "classe". I pulled out the near-zero-variance variables and built the model, but when I try to make predictions I only get 97 predictions back. I reviewed the help files but still can't figure out where I am going wrong. Any hints would be appreciated.
Here is the Code:
set.seed(1234)
pml.training <- read.csv("./data/pml-training.csv")
#
library(caret)
inTrain <- createDataPartition(y=pml.training$classe, p=0.75, list=FALSE)
training <- pml.training[inTrain,]
testing <- pml.training[-inTrain,]
# Pull out the Near Zero Value (NZV)
nzv <- nearZeroVar(training, saveMetrics=TRUE)
omit <- which(nzv$nzv==TRUE)
training <- training[,-omit]
testing <- testing[,-omit]
# Fit the model
modFit <- train(classe ~., method="rf", data=training)
modFit
print(modFit$finalModel)
plot(modFit)
# Try and predict on the testing model
pred <- predict(modFit, newdata=testing)
testing$predRight <- pred==testing$classe
print(table(pred, testing$classe))
Thanks, Pat C.
Have you checked
sum(complete.cases(subset(testing, select = -classe)))
?
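Expanding on that hint (a sketch of my own, not part of the original answer): predict() drops rows with missing values, so if only 97 test rows are complete you only get 97 predictions back. A common fix for this dataset is to drop the columns that are almost entirely NA before training; the 0.9 threshold below is just an illustrative choice:
# how many test rows have no missing values? (likely the 97 predictions you see)
sum(complete.cases(subset(testing, select = -classe)))
# drop columns that are mostly NA, using the same columns for training and testing
keep <- colMeans(is.na(training)) <= 0.9
training <- training[, keep]
testing  <- testing[, keep]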