I was wondering how to measure the prediction performance (on the test dataset) of an mlr3 model. For example, if I create a kNN model using mlr3 like so:
library("mlr3")
library("mlr3learners")
# get data and split into training and test
aq <- na.omit(airquality)
train <- sample(nrow(aq), round(.7*nrow(aq))) # split 70-30
aqTrain <- aq[train, ]
aqTest <- aq[-train, ]
# create model
aqT <- TaskRegr$new(id = "knn", backend = aqTrain, target = "Ozone")
aqL <- lrn("regr.kknn")
aqMod <- aqL$train(aqT)
I can measure the mean squared error of the model's predictions on the training data by doing something like:
prediction <- aqL$predict(aqT)
measure <- msr("regr.mse")
prediction$score(measure)
But how do I incorporate the test data into this? That is, how do I measure the performance of predictions on the test data?
In the previous version of mlr I could do something like this: get the predictions on the test dataset and measure the performance with, say, MSE or R-squared, like so:
pred <- predict(aqMod, newdata = aqTest)
performance(pred, measures = list(mse, rsq))
Any suggestions as to how I can do this in mlr3?
You can try this code:
pred <- aqMod$predict_newdata(aqTest)
pred$score(list(msr("regr.mse"), msr("regr.rmse")))
I am working with the wine quality dataset.
I am fitting regression trees on different predictor variables, as follows:
library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)
vinos <- read.csv(file = 'Wine.csv', header = T)
arbol0<-rpart(formula=quality~chlorides, data=vinos, method="anova")
fancyRpartPlot(arbol0)
arbol1<-rpart(formula=quality~chlorides+density, data=vinos, method="anova")
fancyRpartPlot(arbol1)
I want to calculate the mean squared error to see whether arbol1 is better than arbol0. I will use my own dataset, since no more data is available. I have tried to do it as
aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")
and then manually subtract the last column of the data frame (quality) from aaa and bbb. However, I am getting an error. Can someone please help me?
It's very important to split your dataset into train and test subsets before training your models. In the following code I've done it with base functions, but the sample.split function from the caTools package does the same job, and there are several other ways to split data in R.
Remember that the mean squared error (MSE) is defined as
MSE = (1/n) * sum((observed_i - predicted_i)^2)
So it's very simple to compute it in R: just take the mean of the squared differences between the observed values (the response variable from your test subset) and the predicted values (the values you obtain from the model with the predict function).
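In code that is a single line; here observed and predicted are placeholders for your test response and the output of predict():
# MSE: mean of the squared differences between observed and predicted values
mse <- mean((observed - predicted)^2)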
A solution for your wine dataset could be the following.
library(rpart)
library(dplyr)
library(data.table)
vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
# Split data into train and test subsets
sample_index <- sample(nrow(vinos), size = nrow(vinos)*0.75)
train <- vinos[sample_index, ]
test <- vinos[-sample_index, ]
# Train regression trees models
arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
# Make predictions for each model
pred0 <- predict(arbol0, newdata = test)
pred1 <- predict(arbol1, newdata = test)
# Calculate MSE for each model
mean((pred0 - test$quality)^2)
mean((pred1 - test$quality)^2)
I have a response variable containing 100 observations and I wish to estimate it from 8 independent variables using Support Vector Regression (SVR).
I have searched a lot for a template for implementing SVR with training and test sets in R, but could not find what I wanted.
I have used the following code to fit the model and calculate the RMSE, but I want to check my model on unseen data and I do not know how to do this in R.
My code is as follows:
library(e1071)  # provides svm()
data <- read.csv("Enzyme.csv", header = TRUE)
Testset <- data[c(11:30),]
Trainset <- data[-c(11:30), ]
# extract the dependent variable
Y<-Trainset$Urease
Trainset<-Trainset[,-c(1)]
SVMUr <- svm(Urease ~ ., data = Trainset, kernel = "radial", gamma = 1, epsilon = seq(0, 1, 0.1), cost = 10)
summary(SVMUr)
################### RMSE SVMUr ##########################
RMSE <- function(observed, predicted){
sqrt(mean((predicted - observed)^2, na.rm=TRUE))
}
predSVMUr <- predict(SVMUr, Trainset)  # training-set predictions (this object was missing from the snippet)
RMSE(observed = Y, predicted = predSVMUr)
######## Check the model for unseen data via using testset ######
predicted_test <- predict(SVMUr, Testset[,-1])
RMSE(Testset$Urease, predicted_test)
The way you want to go about testing your model is to:
First apply your model on unseen data using predict(SVMUr, Testset[,-1]) assuming the first variable is your target response Y. If it is the 15th variable for example, replace -1 with -15.
Now use the RMSE() function to get the RMSE of the model on your test dataset.
Additional Recommendation:
I would not split the data the way you do: taking a fixed block of rows gives you a non-random test set. If you want to split it 80%-20%, you can adjust my code below:
data<-read.csv("Enzyme.csv",header = T)
split_data <- sample(nrow(data), nrow(data)*0.8)
Trainset <- data[split_data, ]
Testset <- data[-split_data, ]
That would put 80% of your data in the train set and 20% in the test set.
The rest of the code:
SVMUr <- svm(Urease ~ ., data = Trainset, kernel = "radial", gamma = 1, epsilon = seq(0, 1, 0.1), cost = 10)
summary(SVMUr)
################### RMSE SVMUr ##########################
RMSE <- function(observed, predicted){
sqrt(mean((predicted - observed)^2, na.rm=TRUE))
}
predSVMUr <- predict(SVMUr, Trainset)  # training-set predictions
RMSE(observed = Trainset$Urease, predicted = predSVMUr)
######## Check the model for unseen data via using testset ######
predicted_test <- predict(SVMUr, Testset[,-1])
RMSE(Testset$Urease, predicted_test)
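One more note: svm() expects a single epsilon value, so passing epsilon = seq(0, 1, 0.1) does not search over that grid. If the intent was to tune epsilon, a sketch using tune.svm() from e1071 could look like the following (the grid values are only examples):
library(e1071)
# grid search with internal cross-validation over epsilon and cost
tuned <- tune.svm(Urease ~ ., data = Trainset, kernel = "radial",
                  gamma = 1, epsilon = seq(0, 1, 0.1), cost = c(1, 10, 100))
SVMUr <- tuned$best.model                      # model refit with the best parameters found
predicted_test <- predict(SVMUr, Testset[, -1])
RMSE(Testset$Urease, predicted_test)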
I would like to develop a Cox proportional hazards model in R, use it to make predictions, and evaluate the accuracy of the model. For the evaluation I would like to use the Brier score.
# import various packages, needed at some point of the script
library("survival")
library("survminer")
library("prodlim")
library("randomForestSRC")
library("pec")
library("rpart")
library("mlr")
library("Hmisc")
library("ipred")
# load lung cancer data
data("lung")
head(lung)
# recode status variable
lung$status <- lung$status-1
# Delete rows with missing values
lung <- na.omit(lung)
# split data into training and testing
## 80% of the sample size
smp_size <- floor(0.8 * nrow(lung))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(lung)), size = smp_size)
# training and testing data
train.lung <- lung[train_ind, ]
test.lung <- lung[-train_ind, ]
# time and failure event
s <- Surv(train.lung$time, train.lung$status)
# create model
cox.ph2 <- coxph(s~age+meal.cal+wt.loss, data=train.lung)
# predict
pred <- predict(cox.ph2, newdata = train.lung)
# evaluate
sbrier(s, pred)
As the outcome of the prediction I would expect a time (as in "when does this individual experience failure?"). Instead I get values like this:
[1] 0.017576359 -0.135928959 -0.347553969 0.112509137 -0.229301199 -0.131861582 0.044589175 0.002634008
[9] 0.345966978 0.209488560 0.002418358
What does that mean?
Furthermore, sbrier does not work; apparently it cannot handle the prediction pred (no surprise there).
How do I solve this? How do I make a prediction with cox.ph2? How can I evaluate the model afterwards?
The predict() function won't return a time value; by default it returns the linear predictor. You choose what it returns via the type argument of predict(), which can be one of "lp", "risk", "expected", "terms", or "survival".
If you want to get the hazard ratios :
predict(cox.ph2, newdata = test.lung, type = "risk")
Note that you want to predict values on the test set, not the training set.
I have read that you can use AFT models in your case:
https://stats.stackexchange.com/questions/79362/how-to-get-predictions-in-terms-of-survival-time-from-a-cox-ph-model
You can also read this post:
Calculate the Survival prediction using Cox Proportional Hazard model in R
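To actually get a Brier score for the Cox model, one option is the pec package. A minimal sketch (assuming pec is installed; the model is refit with x = TRUE, y = TRUE so pec can extract what it needs):
library(pec)   # prediction error curves / Brier score
cox.ph2 <- coxph(Surv(time, status) ~ age + meal.cal + wt.loss,
                 data = train.lung, x = TRUE, y = TRUE)
# time-dependent Brier score, evaluated on the held-out test data
pe <- pec(object = list(Cox = cox.ph2),
          formula = Surv(time, status) ~ 1,
          data = test.lung)
print(pe)   # Brier score over time for the Cox model and the reference model
crps(pe)    # integrated Brier score (lower is better)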
Hope it helps.
I am new to the caret library. I would like to use the train function to run cross-validation on my dataset (using the rpart method for classification). My goal is to produce learning curves using the data returned from my call to train. The learning curve would plot the dataset size on the x-axis, with the error of the predictions on the training and cross-validation sets plotted as a function of dataset size.
My question is, does caret make predictions on both the training and cv folds? If the answer is yes, how would I go about extracting that data?
Assuming the answer is yes, here is a simple code sample that you could extend to illustrate:
library(caret)
library(MASS)
data(biopsy)
biopsy <- biopsy[, -1]
names(biopsy) <- c("thick", "u.size", "u.shape", "adhsn", "s.size", "nucl", "chrom", "n.nuc", "mit", "class")
biopsy.v2 <- na.omit(biopsy)
set.seed(1)
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
biop.train <- biopsy.v2[ind == 1, ]
tr.model <- caret::train(class ~ ., data= biop.train, trControl = trainControl(method="cv", number=4, verboseIter = FALSE, savePredictions = "final"), method='rpart')
#Can I extract train and cv accuracies from tr.model?
Thanks.
Note: I realize that I may need to call train repeatedly with different samples of my dataset (assuming caret doesn't also support this), and that is not reflected in the code sample here.
You can try this:
A data frame with predictions for each resample:
tr.model$pred
A data frame with columns for each performance metric. Each row corresponds to each resample:
tr.model$resample
A data frame with the final parameters:
tr.model$bestTune
A data frame with the cross-validated performance estimates for each value of the tuning parameters:
tr.model$results
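For example, a rough way to get the overall cross-validated accuracy from the saved hold-out predictions (this assumes savePredictions = "final", as in your trainControl) is:
# accuracy over all held-out (CV) predictions of the final model
mean(tr.model$pred$pred == tr.model$pred$obs)
# or per fold
tapply(tr.model$pred$pred == tr.model$pred$obs, tr.model$pred$Resample, mean)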
To specify repeated CV:
trainControl(method = "repeatedcv", number = k, repeats = n)
where k is the number of folds and n is an integer (the number of complete sets of folds to compute).
EDIT: to determine which observations were in the held-out (test) folds, the relevant information is in the tr.model$pred data frame:
tr.model$pred[tr.model$pred$Resample=="Fold1",4:5]
tr.model$pred[tr.model$pred$Resample=="Fold2",4:5]
tr.model$pred[tr.model$pred$Resample=="Fold3",4:5]
tr.model$pred[tr.model$pred$Resample=="Fold4",4:5]
The observations that were not in the held-out fold of a given resample were in its training fold.
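Since the column positions in tr.model$pred can shift depending on the tuning parameters, it may be safer to index by column name; a small sketch:
# held-out row indices per fold, by column name rather than position
split(tr.model$pred$rowIndex, tr.model$pred$Resample)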
I am looking for some guidance on a homework assignment I am working on for a class. We are given a dataset with 14K observations and asked to build a prediction model. Using the caret package, I split the dataset into training and testing sets (4,909 test observations) to predict the last variable, "classe". I removed the near-zero-variance variables and built the model, but when I try to make predictions I only get 97 predictions back. I reviewed the help files but still can't figure out where I am going wrong. Any hints would be appreciated.
Here is the Code:
set.seed(1234)
pml.training <- read.csv("./data/pml-training.csv")
#
library(caret)
inTrain <- createDataPartition(y=pml.training$classe, p=0.75, list=FALSE)
training <- pml.training[inTrain,]
testing <- pml.training[-inTrain,]
# Pull out the Near Zero Value (NZV)
nzv <- nearZeroVar(training, saveMetrics=TRUE)
omit <- which(nzv$nzv==TRUE)
training <- training[,-omit]
testing <- testing[,-omit]
# Fit the model
modFit <- train(classe ~., method="rf", data=training)
modFit
print(modFit$finalModel)
plot(modFit)
# Try and predict on the testing model
pred <- predict(modFit, newdata=testing)
testing$predRight <- pred==testing$classe
print(table(pred, testing$classe))
Thanks, Pat C.
Have you checked how many rows of your test set are complete?
sum(complete.cases(subset(testing, select = -classe)))
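If that number is 97, the likely explanation is that predict() drops test rows with missing predictor values (na.action = na.omit by default), so you only get predictions for the complete rows. A sketch of how to line the predictions up with the observed classes, assuming that is the cause:
# keep only the test rows with no missing predictor values, so the
# predictions and the observed classes have matching lengths
ok <- complete.cases(subset(testing, select = -classe))
pred <- predict(modFit, newdata = testing[ok, ])
table(pred, testing$classe[ok])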