Get row number for prediction with caret - r

I use caret a lot for my machine learning tasks in R and I like it a lot.
But I face the following problem:
I train a model in caret, say a linear regression with lm().
When I want to score new data, I do: predict(model, new_data)
When new_data contains missing values in my predictors, predict returns no prediction for those rows, instead of, say, NA.
Is it possible to either:
return a prediction for all rows in new_data with a prediction of NA when it is not possible or
return predictions + the row number of the dataframe the prediction corresponds to?
E.g. like the mlr-package does with an id-column that shows which row the prediction corresponds to:
Here is the link to the mlr-predict page with more details:
mlr-package: predict with row-id
Any help greatly appreciated!

You can identify the cases with missing values prior to running caret::train() by creating a new column with the row names in your data set, since these default to the row numbers in the data frame.
Using the Sonar data set from the mlbench package as an illustration:
library(mlbench)
data(Sonar)
library(caret)
set.seed(95014)
# add row numbers
Sonar$rowId <- rownames(Sonar)
# create training & testing data sets
inTraining <- createDataPartition(Sonar$Class, p = .75, list=FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]
# set column 60 to NA for some values in test data
testing[48:51,60] <- NA
testing[!complete.cases(testing),"rowId"]
...and the output:
> testing[!complete.cases(testing),"rowId"]
[1] "193" "194" "200" "206"
You can then run predict() on the rows in the test data set that have complete cases. Again using the Sonar dataset with a random forest model and 3 fold cross validation to expedite processing:
fitControl <- trainControl(method = "cv",number = 3)
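# x and y are not defined above; assumed here to be the 60 predictors
# and the Class outcome of the training set
x <- training[, 1:60]
y <- training$Class
fit <- train(x, y, method = "rf", trControl = fitControl)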
predicted <- predict(fit,testing[complete.cases(testing),])
Another way to handle this situation is to use an imputation strategy to eliminate the missing values in the independent variables of your model. My article on GitHub, Strategies for Handling Missing Values, links to a number of research papers on this topic.
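To get one prediction per row of new_data, with NA where a predictor is missing, the complete-case idea can be combined with a pre-filled NA vector. The following is a minimal base-R sketch under stated assumptions: it uses lm() on the built-in mtcars data as a stand-in for a caret model, and the names new_data, ok, and result are illustrative only.

```r
# Stand-in model: plain lm() on built-in data (assumption, not a caret fit)
fit <- lm(mpg ~ wt + hp, data = mtcars[1:20, ])

# New data with missing predictor values in two rows
new_data <- mtcars[21:32, ]
new_data$wt[c(2, 5)] <- NA

# Rows where every predictor used by the model is present
ok <- complete.cases(new_data[, c("wt", "hp")])

# Start with NA everywhere, then fill in predictions for complete rows only
pred <- rep(NA_real_, nrow(new_data))
pred[ok] <- predict(fit, newdata = new_data[ok, ])

# One row per input row, keyed by the original row name
result <- data.frame(rowId = rownames(new_data), prediction = pred)
print(result)
```

The same pattern should carry over to a model returned by caret::train(). caret's predict() for train objects also takes an na.action argument; whether na.action = na.pass yields NA rows depends on the underlying model, so that is worth verifying for your case.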

Related

How can I calculate the mean square error in R of a regression tree?

I am working with the wine quality database.
I am studying regression trees depending on different variables as:
library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)
vinos <- read.csv(file = 'Wine.csv', header = T)
arbol0<-rpart(formula=quality~chlorides, data=vinos, method="anova")
fancyRpartPlot(arbol0)
arbol1<-rpart(formula=quality~chlorides+density, data=vinos, method="anova")
fancyRpartPlot(arbol1)
I want to calculate the mean square error to see if arbol1 is better than arbol0. I will use my own dataset since no more data is available. I have tried to do it as
aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")
and then subtract the last column of the dataframe from aaa and bbb manually. However, I am getting an error. Can someone please help me?
This website could be useful for you. It's very important to split your dataset into train and test subsets before training your models. In the following code, I've done it with base functions, but the sample.split function from the caTools package does the same thing. This website shows other ways to split data in R.
Remember that the Mean Squared Error (MSE) is defined as: MSE = (1/n) * sum_{i=1..n} (y_i - yhat_i)^2, where y_i is the observed value and yhat_i is the predicted value.
So, it's very simple to apply it with R. You just have to compute the mean of the squared difference between the observed (i.e, the response variable from your test subset) and predicted values (i.e, the values you have predicted from the model with the predict function).
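As a small worked example of that computation, here is a sketch with made-up numbers (the vectors observed and predicted are purely illustrative and have nothing to do with the wine data):

```r
# Toy values, made up for illustration only
observed  <- c(5, 6, 7)
predicted <- c(5.5, 6, 6)

# Squared errors are 0.25, 0 and 1, so the MSE is their mean
mse <- mean((observed - predicted)^2)
print(mse)
```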
A solution for your wine dataset could be this one, based on the previous website.
library(rpart)
library(dplyr)
library(data.table)
vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
# Split data into train and test subsets (seed set so the split is reproducible)
set.seed(123)
sample_index <- sample(nrow(vinos), size = nrow(vinos)*0.75)
train <- vinos[sample_index, ]
test <- vinos[-sample_index, ]
# Train regression trees models
arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
# Make predictions for each model
pred0 <- predict(arbol0, newdata = test)
pred1 <- predict(arbol1, newdata = test)
# Calculate MSE for each model
mean((pred0 - test$quality)^2)
mean((pred1 - test$quality)^2)

Estimate prediction accuracy of Cox PH

I would like to develop a Cox proportional hazards model in R, use it to make predictions, and evaluate the accuracy of the model. For the evaluation I would like to use the Brier score.
# import various packages, needed at some point of the script
library("survival")
library("survminer")
library("prodlim")
library("randomForestSRC")
library("pec")
library("rpart")
library("mlr")
library("Hmisc")
library("ipred")
# load lung cancer data
data("lung")
head(lung)
# recode status variable
lung$status <- lung$status-1
# Delete rows with missing values
lung <- na.omit(lung)
# split data into training and testing
## 80% of the sample size
smp_size <- floor(0.8 * nrow(lung))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(lung)), size = smp_size)
# training and testing data
train.lung <- lung[train_ind, ]
test.lung <- lung[-train_ind, ]
# time and failure event
s <- Surv(train.lung$time, train.lung$status)
# create model
cox.ph2 <- coxph(s~age+meal.cal+wt.loss, data=train.lung)
# predict
pred <- predict(cox.ph2, newdata = train.lung)
# evaluate
sbrier(s, pred)
As the outcome of the prediction I would expect a time (as in "when does this individual experience failure"). Instead I get values like this:
[1] 0.017576359 -0.135928959 -0.347553969 0.112509137 -0.229301199 -0.131861582 0.044589175 0.002634008
[9] 0.345966978 0.209488560 0.002418358
What does that mean?
Furthermore, sbrier does not work. Apparently it cannot work with the prediction pred (no surprise there).
How do I solve this? How do I make a prediction with cox.ph2? How can I evaluate the model afterwards?
By default, predict() on a coxph model returns the linear predictor (type = "lp"), not a survival time; these are the values you are seeing. Use the type argument to choose one of "lp", "risk", "expected", "terms", or "survival".
If you want the risk scores, exp(lp), which are hazard ratios relative to the model's reference:
predict(cox.ph2, newdata = test.lung, type = "risk")
Note that you should predict on the test set, not the training set.
I have read that you can use AFT models in your case :
https://stats.stackexchange.com/questions/79362/how-to-get-predictions-in-terms-of-survival-time-from-a-cox-ph-model
You also can read this post :
Calculate the Survival prediction using Cox Proportional Hazard model in R
Hope it will help
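To make the different prediction types concrete, here is a minimal sketch using only the survival package. The object names (lung2, train, test, median_time) are illustrative, and reading a median survival time off the predicted curves is one possible way to get a time-like prediction, not the only one. It does not compute a Brier score.

```r
library(survival)

# Complete cases of the lung data, restricted to the variables used here
lung2 <- na.omit(survival::lung[, c("time", "status", "age", "meal.cal", "wt.loss")])

# 80/20 split, as in the question
set.seed(123)
train_ind <- sample(seq_len(nrow(lung2)), size = floor(0.8 * nrow(lung2)))
train <- lung2[train_ind, ]
test  <- lung2[-train_ind, ]

# In the lung data, status == 2 marks a death
fit <- coxph(Surv(time, status == 2) ~ age + meal.cal + wt.loss, data = train)

# Linear predictor (the default) and risk score, which is exp(lp)
lp   <- predict(fit, newdata = test, type = "lp")
risk <- predict(fit, newdata = test, type = "risk")

# Predicted survival curves: one column of sf$surv per test row
sf <- survfit(fit, newdata = test)

# First time at which each curve drops to 0.5 or below (NA if it never does)
median_time <- apply(sf$surv, 2, function(s) sf$time[which(s <= 0.5)[1]])
head(data.frame(lp = lp, risk = risk, median_time = median_time))
```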

Different no of tuples for the prediction model and test set data in SVM

I have a dataset with two columns as shown below, where Column 1, timestamp is a particular value for time for which Column.10 gives the total power usage at that instance of time. There are totally 81502 instances for this data.
I'm doing support vector regression on this data in R using the e1071 package to predict future power usage. The code is given below. I first divide the dataset into training and test data, then fit a model on the training data with the svm function, and then predict the power usage for the test set.
library(e1071)
attach(data.csv)
index <- 1:nrow(data.csv)
testindex <- sample(index,trunc(length(index)/3))
testset <- na.omit(data.csv[testindex, ])
trainingset <- na.omit(data.csv[-testindex, ])
model <- svm(Column.10 ~ timestamp, data=trainingset)
prediction <- predict(model, testset[,-2])
tab <- table(pred = prediction, true = testset[,2])
However, when I try to make a confusion matrix from the prediction, I'm getting the error:
Error in table(pred = prediction, true = testset[, 2]) : all arguments must have the same length
So I tried to find the length of the two arguments and found that
the length(prediction) to be 81502
and the length(testset[,2]) to be 27167
Since I had done the prediction only for the testset, I don't know how a prediction was made for 81502 values. Why is the number of values different for the prediction and the testset? How is the power value for the entire dataset getting predicted even though I gave it only the testset?
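Change
prediction <- predict(model, testset[,-2])
to
prediction <- predict(model, testset)
With only two columns, testset[,-2] collapses to a plain vector and loses the timestamp column name, so predict() most likely falls back on the attached data.csv (all 81502 rows) to find timestamp.
Also, you should not use table() when doing regression; compare the predictions with the true values using the MSE instead.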
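The following base-R sketch illustrates both points on made-up data: how dropping one of two columns turns a data frame into a vector, and how to score a regression with the MSE. It uses lm() as a stand-in for svm() (an assumption made so the example runs without e1071), and the values in df are invented.

```r
# Toy two-column data frame mirroring the question's layout (values are made up)
df <- data.frame(timestamp = 1:6,
                 Column.10 = c(2.1, 2.4, 2.2, 2.8, 3.0, 3.1))

# Dropping one of two columns collapses the result to a plain vector
x_vec <- df[, -2]
# drop = FALSE keeps a one-column data frame with its column name
x_df <- df[, -2, drop = FALSE]
print(is.data.frame(x_vec))
print(is.data.frame(x_df))

# lm() as a stand-in for svm() (assumption); train on the first four rows
train <- df[1:4, ]
test  <- df[5:6, ]
model <- lm(Column.10 ~ timestamp, data = train)

# Pass the whole test data frame; the response column is simply not used
prediction <- predict(model, newdata = test)

# Regression error: mean squared error instead of a confusion table
mse <- mean((prediction - test$Column.10)^2)
print(mse)
```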

R - factor examcard has new levels

I built a classification model in R using C5.0 given below:
library(C50)
library(caret)
a = read.csv("All_SRN.csv")
set.seed(123)
inTrain <- createDataPartition(a$anatomy, p = .70, list = FALSE)
training <- a[ inTrain,]
test <- a[-inTrain,]
Tree <- C5.0(anatomy ~ ., data = training,
trControl = trainControl(method = "repeatedcv", repeats = 10,
classProb = TRUE))
TreePred <- predict(Tree, test)
The training set has features like examcard, coil_used, anatomy_region, bodypart_anatomy and anatomy (the target class). All the features are categorical variables. There are roughly 10k rows in total, which I divided into training and test data. The learner worked great with this training and test set partitioned in a 70:30 ratio, but the problem comes when I provide a test set with new values, as below:
TreePred <- predict(Tree, test_add)
Here, test_add contains the existing test set plus a set of new values. On execution, the learner fails to classify the new values and throws the following error:
Error in model.frame.default(object$Terms, newdata, na.action = na.action, : factor examcard has new levels
I tried to merge the new factor levels with the existing one using:
Tree$xlevels[["examcard"]] <- union(Tree$xlevels[["examcard"]], levels(test_add$examcard))
But, this wasn't of much help since the code executed with the following message and didn't yield any fruitful result:
predict code called exit with value 1
The feature examcard is very important for the classification and hence can't be ignored. How can this set of values be classified?
You cannot create a prediction for factor levels in your test set that are absent in your training set. Your model will not have coefficients for these new factor levels.
If you are doing a 70/30 split, you need to repartition your data using caret::createDataPartition...
... or your own stratified sample function to ensure that all levels are represented in the training set: use the "split-apply-combine" approach: split the data set by examcard, and for each subset, apply the split, then combine the training subsets and the testing subsets.
See this question for more details.
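Before calling predict(), it can help to check which levels in the new data were never seen during training. Below is a minimal base-R sketch with made-up data; the column name examcard follows the question, while the level names and the objects new_levels, unseen, and scorable are illustrative.

```r
# Toy data; the level names are made up
training <- data.frame(examcard = factor(c("A", "B", "A", "C")))
test_add <- data.frame(examcard = factor(c("A", "D", "C", "E")))

# Levels present in the new data but never seen during training
new_levels <- setdiff(levels(test_add$examcard), levels(training$examcard))
print(new_levels)

# Flag rows that the model cannot score
unseen <- test_add$examcard %in% new_levels

# Keep only scorable rows and align the factor levels with the training data
scorable <- test_add[!unseen, , drop = FALSE]
scorable$examcard <- factor(scorable$examcard, levels = levels(training$examcard))
print(scorable)
```

Rows flagged in unseen then need a separate decision, for example retraining once labelled examples of those levels are available.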

Can I do predict.glmnet on test data with different number of predictor variables?

I used glmnet to build a predictive model on a training set with ~200 predictors and 100 samples, for a binomial regression/classification problem.
I selected the best model (16 predictors) that gave me the max AUC. I have an independent test set with only those variables (16 predictors) which made it into the final model from the training set.
Is there any way to use the predict.glmnet based on the optimal model from the training set with new test set which has data for only those variables that made it into the final model from the training set?
glmnet requires the exact same number/names of variables from the training dataset to be in the validation/test set. For example:
library(caret)
library(glmnet)
df <- ... # a dataframe with 200 variables, some of which you want to predict on
# & some of which you don't care about.
# Variable 13 ('Response.Variable') is the dependent variable.
# Variables 1-12 & 14-113 are the predictor variables
# All training/testing & validation datasets are derived from this single df.
# Split dataframe into training & testing sets
inTrain <- createDataPartition(df$Response.Variable, p = .75, list = FALSE)
Train <- df[ inTrain, ] # Training dataset for all model development
Test <- df[ -inTrain, ] # Final sample for model validation
# Run logistic regression , using only specified predictor variables
logCV <- cv.glmnet(x = data.matrix(Train[, c(1:12,14:113)]), y = Train[,13],
family = 'binomial', type.measure = 'auc')
# Test model over final test set, using specified predictor variables
# Create field in dataset that contains predicted values
Test$prob <- predict(logCV,type="response", newx = data.matrix(Test[,
c(1:12,14:113) ]), s = 'lambda.min')
For a completely new set of data, you could constrain the new df to the necessary variables using some variant of the following method:
new.df <- ... # new df w/ 1,000 variables, which include all predictor variables used
# in developing the model
# Create object with requisite predictor variable names that we specified in the model
predictvars <- c('PredictorVar1', 'PredictorVar2', 'PredictorVar3',
... 'PredictorVarK')
new.df$prob <- predict(logCV,type="response", newx = data.matrix(new.df[names(new.df)
%in% predictvars ]), s = 'lambda.min')
# the above method limits the new df of 1,000 variables to
# whatever the requisite variable names or indices go into the
# model.
Additionally, glmnet only deals with matrices. This is probably why you're getting the error you post in the comment to your question. Some users (myself included) have found that as.matrix() doesn't resolve the issue; data.matrix() seems to work though (hence why it's in the above code). This issue is addressed in a thread or two on SO.
I assume that all variables in the new dataset to be predicted also need to be formatted the same as they were in the dataset used for model development. I usually pull all of my data from the same source so I haven't encountered what glmnet will do in cases where formatting is different.
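One detail worth checking when restricting a new data frame to the model's predictors is column order: to my understanding, predict() for glmnet matches the columns of newx by position rather than by name. Indexing by the vector of predictor names keeps them in the training order. The sketch below uses base R only; the predictor names and values are hypothetical.

```r
# Hypothetical predictor names, in the order used when the model was fit
predictvars <- c("PredictorVar1", "PredictorVar2", "PredictorVar3")

# New data with an extra column and a different column order (values are made up)
new.df <- data.frame(Other = c(9, 9),
                     PredictorVar3 = c(0.3, 0.6),
                     PredictorVar1 = c(1, 2),
                     PredictorVar2 = c(10, 20))

# Fail early if a required predictor is missing
missing_vars <- setdiff(predictvars, names(new.df))
stopifnot(length(missing_vars) == 0)

# Index by name: keeps only the model's predictors, in training order
newx <- data.matrix(new.df[, predictvars, drop = FALSE])
print(colnames(newx))
```

The resulting newx is what would be passed as the newx argument of predict().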
