How to Create a loop (when levels do not overlap the reference) - r

I have written some code in R. This code takes some data and splits it into a training set and a test set. Then, I fit a "survival random forest" model on the training set. After, I use the model to predict observations within the test set.
Due to the type of problem I am dealing with ("survival analysis"), a confusion matrix has to be made for each "unique time" (inside the file "unique.death.time"). For each confusion matrix made for each unique time, I am interested in the corresponding "sensitivity" value (e.g. sensitivity_1001, sensitivity_2005, etc.). I am trying to get all these sensitivity values : I would like to make a plot with them (vs unique death times) and determine the average sensitivity value.
In order to do this, I need to repeatedly calculate the sensitivity for each time point in "unique.death.times". I tried doing this manually and it is taking a long time.
Could someone please show me how to do this with a "loop"?
I have posted my code below:
#load libraries
library(survival)
library(data.table)
library(pec)
library(ranger)
library(caret)
#load data
data(cost)
#split data into train and test
ind <- sample(1:nrow(cost),round(nrow(cost) * 0.7,0))
cost_train <- cost[ind,]
cost_test <- cost[-ind,]
#fit survival random forest model
ranger_fit <- ranger(Surv(time, status) ~ .,
data = cost_train,
mtry = 3,
verbose = TRUE,
write.forest=TRUE,
num.trees= 1000,
importance = 'permutation')
#optional: plot training results
plot(ranger_fit$unique.death.times, ranger_fit$survival[1,], type = 'l', col = 'red') # for first observation
lines(ranger_fit$unique.death.times, ranger_fit$survival[21,], type = 'l', col = 'blue') # for twenty first observation
#predict observations test set using the survival random forest model
ranger_preds <- predict(ranger_fit, cost_test, type = 'response')$survival
ranger_preds <- data.table(ranger_preds)
colnames(ranger_preds) <- as.character(ranger_fit$unique.death.times)
From here, another user (Justin Singh) from a previous post (R: how to repeatedly "loop" the results from a function?) suggested how to create a loop:
sensitivity <- list()
for (time in names(ranger_preds)) {
prediction <- ranger_preds[which(names(ranger_preds) == time)] > 0.5
real <- cost_test$time >= as.numeric(time)
confusion <- confusionMatrix(as.factor(prediction), as.factor(real), positive = 'TRUE')
sensitivity[as.character(i)] <- confusion$byclass[1]
}
But due to some of the observations used in this loop, I get the following error:
Error in confusionMatrix.default(as.factor(prediction), as.factor(real), :
The data must contain some levels that overlap the reference.
Does anyone know how to fix this?
Thanks

Certain values in prediction and/or real have only 1 unique value in them. Make sure the levels of the factors are the same.
sapply(names(ranger_preds), function(x) {
prediction <- factor(ranger_preds[[x]] > 0.5, levels = c(TRUE, FALSE))
real <- factor(cost_test$time >= as.numeric(x), levels = c(TRUE, FALSE))
confusion <- caret::confusionMatrix(prediction, real, positive = 'TRUE')
confusion$byClass[1]
}, USE.NAMES = FALSE) -> result
result

Related

How to get around error "factor has new levels" in cross-validation glm?

My goal is to use cross-validation to evaluate the performance of a linear model.
My problem is that my training and testing sets might not always have the same variable levels.
Here is a reproducible data example:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)
summary(data)
Now let's make a glm model:
model_glm <- glm(x~., data = data)
And let's use cross-validation on this model:
library(boot)
cross_validation_glm <- cv.glm(data = data, glmfit = model_glm, K = 10)
And this is the kind of error output that you will get:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor z has new levels F
if you don't get this error, re-run the cross validation and at some point you will get a similar error.
The nature of the problem here is that when you do cross-validation, the train and test subsets might not have the exact same variable levels. Here our variable z has three levels (D,E,F).
In the total amount of our data there is much more D's than E's and F's.
Thus whenever you take a small subset of the whole data (to do cross-validation).
There is a very good chance that your z variable are all going to be set at the D's level.
Thus Eand F levels gets dropped, thus we get the error (This answer is helpful to understand the problem: https://stackoverflow.com/a/51555998/10972294).
My question is: how to avoid the drop in the first place?
If it is not possible, what are the alternatives?
(Keep in mind that this a reproducible example, the actual data I am using has many variables like z, I would like to avoid deleting them.)
To answer your question in the comment, I don't know if there is a function or not. Most likely there is one, but I have no idea on which package would contain it. For this example, this function should work:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)
#optional tag row for later identification:
#data$rowid<-1:nrow(data)
stratified <- function(df, column, percent){
#split dataframe into groups based on column
listdf<-split(df, df[[column]])
testsubgroups<-lapply(listdf, function(x){
#pick the number of samples per group, round up.
numsamples <- ceiling(percent*nrow(x))
#selects the rows
whichones <-sample(1:nrow(x), numsamples, replace = FALSE)
testsubgroup <-x[whichones,]
})
#combine the subgroups into one data frame
testgroup<-do.call(rbind, testsubgroups)
testgroup
}
testgroup<-stratified(data, "z", 0.8)
This will just split the initial data by column z, if you are interested is grouping by multiple columns then this could be extended by using the group_by function from the dplyr package, but that would be another question.
Comment on the statistics: If you just have a few examples for any particular factor, what type of fit do you expect? A poor fit with wide confidence limits.

MLR, example dependent cost of misclassification, makeCostSensWeightedPairsWrapper

This question has been seen 74 times and has received only one response (as of noon (PDT) Wed, Aug-14).
I've rewritten the question to make it as clear as possible and I'll appreciate any help.
As a summary, I need a small but complete example on a dataset with binary response on how to use MLR's makeCostSensWeightedPairsWrapper to obtain prediction probabilities on a test set.
In the MLR tutorial in the part on cost-sensitive classification
https://mlr.mlr-org.com/articles/tutorial/cost_sensitive_classif.html
there is a paragraph on "Example-dependent misclassification costs" and an example is given based on the iris dataset.
In the code snippet below, I modified the iris data set so as to contain only two classes as I'm interested in binary classification only.
library( mlr )
set.seed( 12347 )
n1 = 100; ntrain = 70
df = iris[ 1:n1, ] # 100 points in df so as to have two classes only (setosa and versicolor)
df$Species = factor( df$Species ) # refactor the response
# partition df into a training set (70 points) and test set (30 points)
#
ix = sample( 1:n1, ntrain, replace=FALSE )
xtest = df[ setdiff( 1:n1, ix ), ] ## test set
ntest = nrow( xtest )
xtrain = df[ ix, ] # this is the training set
# create cost matrix, same as in the MLR example
#
cost = matrix(runif(ntrain * 2, 0, 2000), ntrain) * (1 - diag(2))[xtrain$Species,] + runif(ntrain, 0, 10)
colnames(cost) = levels(xtrain$Species)
rownames(cost) = rownames(xtrain)
xtrain$Species = NULL # this is done according to the MLR example
# cost-sensitive task
#
costsens.task = makeCostSensTask(id = "xtrain", data = xtrain, cost = cost )
costsens.task
##lrn = makeLearner("classif.multinom", trace = FALSE, predict.type="prob" )
lrn = makeLearner( "classif.gbm", predict.type="prob" )
lrn = makeCostSensWeightedPairsWrapper( lrn ); lrn
mod = train(lrn, costsens.task ); mod
pred = predict( mod, newdata = xtest, pred.type="prob" );
perf = performance( pred, measures = list(auc), task = costsens.task)
# I get the following error:
# Error in FUN(X[[i]], ...) :
# You need to have a 'truth' column in your pred object for measure auc!
My original project is to do a binary classification which incorporates example-dependent misclassification costs.
The goal is to do a prediction on a test dataset, obtain the probabilities and show the performance (using ROCR, for which there are MLR-mapping functions).
NOTE: The learners I've tried are 'classif.multinom' and (I guessed) 'classif.gbm' as the two that might be compatible with the weighted pairs wrapper.
My questions are:
Q1: Where in the code snippet and how to specify that I want probabilities as an output of the cost-sensitive classifier?
Q2: Which learner can be used so as to produce classification probabilities?
Q3: How to avoid the error above and get the class probabilities?
Once again, I'd really appreciate any help, even more so if there is anyone who can answer promptly.
OK, after a number of days and almost a hundred views of this question (about 20 are mine :) there has been only one comment and no answers.
From what I could understand exploring some of the available MLR documentation, it seems that the output of example-based cost-sensitive method (makeCostSensWeightedPairsWrapper) is labels only and no prediction probabilities.
In other words, no probabilities are available from a cost-sensitive task, only the new labels are given, which are in turn computed based on the probabilities of the base classifier.
So, this the answer which I'll accept.
As for MLR errors, in this case at least, it would be helpful to get an explicit error message instead of a spurious one, or simply to note this in the documentation.

Calculating prediction accuracy of a tree using rpart's predict method

I have constructed a decision tree using rpart for a dataset.
I have then divided the data into 2 parts - a training dataset and a test dataset. A tree has been constructed for the dataset using the training data. I want to calculate the accuracy of the predictions based on the model that was created.
My code is shown below:
library(rpart)
#reading the data
data = read.table("source")
names(data) <- c("a", "b", "c", "d", "class")
#generating test and train data - Data selected randomly with a 80/20 split
trainIndex <- sample(1:nrow(x), 0.8 * nrow(x))
train <- data[trainIndex,]
test <- data[-trainIndex,]
#tree construction based on information gain
tree = rpart(class ~ a + b + c + d, data = train, method = 'class', parms = list(split = "information"))
I now want to calculate the accuracy of the predictions generated by the model by comparing the results with the actual values train and test data however I am facing an error while doing so.
My code is shown below:
t_pred = predict(tree,test,type="class")
t = test['class']
accuracy = sum(t_pred == t)/length(t)
print(accuracy)
I get an error message that states -
Error in t_pred == t : comparison of these types is not implemented In
addition: Warning message: Incompatible methods ("Ops.factor",
"Ops.data.frame") for "=="
On checking the type of t_pred, I found out that it is of type integer however the documentation
(https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/predict.rpart.html)
states that the predict() method must return a vector.
I am unable to understand why is the type of the variable is an integer and not a list. Where have I made the mistake and how can I fix it?
Try calculating the confusion matrix first:
confMat <- table(test$class,t_pred)
Now you can calculate the accuracy by dividing the sum diagonal of the matrix - which are the correct predictions - by the total sum of the matrix:
accuracy <- sum(diag(confMat))/sum(confMat)
My response is very similar to #mtoto's one but a bit more simply... I hope it also helps.
mean(test$class == t_pred)

Variable importance for support vector machine and naive Bayes classifiers in R

I’m working on building predictive classifiers in R on a cancer dataset.
I’m using random forest, support vector machine and naive Bayes classifiers. I’m unable to calculate variable importance on SVM and NB models
I end up receiving the following error.
Error in UseMethod("varImp") :
no applicable method for 'varImp' applied to an object of class "c('svm.formula', 'svm')"
I would greatly appreciate it if anyone could help me.
Given
library(e1071)
model <- svm(Species ~ ., data = iris)
class(model)
# [1] "svm.formula" "svm"
library(caret)
varImp(model)
# Error in UseMethod("varImp") :
# no applicable method for 'varImp' applied to an object of class "c('svm.formula', 'svm')"
methods(varImp)
# [1] varImp.bagEarth varImp.bagFDA varImp.C5.0* varImp.classbagg*
# [5] varImp.cubist* varImp.dsa* varImp.earth* varImp.fda*
# [9] varImp.gafs* varImp.gam* varImp.gbm* varImp.glm*
# [13] varImp.glmnet* varImp.JRip* varImp.lm* varImp.multinom*
# [17] varImp.mvr* varImp.nnet* varImp.pamrtrained* varImp.PART*
# [21] varImp.plsda varImp.randomForest* varImp.RandomForest* varImp.regbagg*
# [25] varImp.rfe* varImp.rpart* varImp.RRF* varImp.safs*
# [29] varImp.sbf* varImp.train*
There is no function varImp.svm in methods(varImp), therefore the error. You might want to have a look at this post on Cross Validated, too.
If you use R, the variable importance can be calculated with Importance method in rminer package. This is my sample code:
library(rminer)
M <- fit(y~., data=train, model="svm", kpar=list(sigma=0.10), C=2)
svm.imp <- Importance(M, data=train)
In detail, refer to the following link https://cran.r-project.org/web/packages/rminer/rminer.pdf
I have created a loop that iteratively removes one predictor at a time and captures in a data frame various performance measures derived from the confusion matrix. This is not supposed to be a one size fits all solution, I don't have the time for it, but it should not be difficult to apply modifications.
Make sure that the predicted variable is last in the data frame.
I mainly needed specificity values from the models and by removing one predictor at a time, I can evaluate the importance of each predictor, i.e. by removing a predictor, the smallest specificity of the model(less predictor number i) means that the predictor has the most importance. You need to know on what indicator you will attribute importance.
You can also add another for loop inside to change between kernels, i.e. linear, polynomial, radial, but you might have to account for the other parameters,e.g. gamma. Change "label_fake" with your target variable and df_final with your data frame.
SVM version:
set.seed(1)
varimp_df <- NULL # df with results
ptm1 <- proc.time() # Start the clock!
for(i in 1:(ncol(df_final)-1)) { # the last var is the dep var, hence the -1
smp_size <- floor(0.70 * nrow(df_final)) # 70/30 split
train_ind <- sample(seq_len(nrow(df_final)), size = smp_size)
training <- df_final[train_ind, -c(i)] # receives all the df less 1 var
testing <- df_final[-train_ind, -c(i)]
tune.out.linear <- tune(svm, label_fake ~ .,
data = training,
kernel = "linear",
ranges = list(cost =10^seq(1, 3, by = 0.5))) # you can choose any range you see fit
svm.linear <- svm(label_fake ~ .,
kernel = "linear",
data = training,
cost = tune.out.linear[["best.parameters"]][["cost"]])
train.pred.linear <- predict(svm.linear, testing)
testing_y <- as.factor(testing$label_fake)
conf.matrix.svm.linear <- caret::confusionMatrix(train.pred.linear, testing_y)
varimp_df <- rbind(varimp_df,data.frame(
var_no=i,
variable=colnames(df_final[,i]),
cost_param=tune.out.linear[["best.parameters"]][["cost"]],
accuracy=conf.matrix.svm.linear[["overall"]][["Accuracy"]],
kappa=conf.matrix.svm.linear[["overall"]][["Kappa"]],
sensitivity=conf.matrix.svm.linear[["byClass"]][["Sensitivity"]],
specificity=conf.matrix.svm.linear[["byClass"]][["Specificity"]]))
runtime1 <- as.data.frame(t(data.matrix(proc.time() - ptm1)))$elapsed # time for running this loop
runtime1 # divide by 60 and you get minutes, /3600 you get hours
}
Naive Bayes version:
varimp_nb_df <- NULL
ptm1 <- proc.time() # Start the clock!
for(i in 1:(ncol(df_final)-1)) {
smp_size <- floor(0.70 * nrow(df_final))
train_ind <- sample(seq_len(nrow(df_final)), size = smp_size)
training <- df_final[train_ind, -c(i)]
testing <- df_final[-train_ind, -c(i)]
x = training[, names(training) != "label_fake"]
y = training$label_fake
model_nb_var = train(x,y,'nb', trControl=ctrl)
predict_nb_var <- predict(model_nb_var, newdata = testing )
confusion_matrix_nb_1 <- caret::confusionMatrix(predict_nb_var, testing$label_fake)
varimp_nb_df <- rbind(varimp_nb_df, data.frame(
var_no=i,
variable=colnames(df_final[,i]),
accuracy=confusion_matrix_nb_1[["overall"]][["Accuracy"]],
kappa=confusion_matrix_nb_1[["overall"]][["Kappa"]],
sensitivity=confusion_matrix_nb_1[["byClass"]][["Sensitivity"]],
specificity=confusion_matrix_nb_1[["byClass"]][["Specificity"]]))
runtime1 <- as.data.frame(t(data.matrix(proc.time() - ptm1)))$elapsed # time for running this loop
runtime1 # divide by 60 and you get minutes, /3600 you get hours
}
Have fun!

PLS in R: Predicting new observations returns Fitted values instead

In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises of 56. The first 28 (training set) have been used for model calibration, now all I want to do is to predict the response values for the remaining 28 observations in the tesset. For some reason, however, R keeps on the returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10, validation =
"LOO", jackknife = TRUE)
It was then identified that a model comprising of 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the testing set using the above calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that all this piece of code generates are the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the below code, in which newdata = is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot seem to find out what specifically is. Is there someone out there who can, and is willing to help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn) that goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
height = rpois(56,10),
fbm = rpois(56,10),
nitrogen = rpois(56,10),
carbon = rpois(56,10),
chl = rpois(56,10),
ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10, validation =
"LOO", jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
Note that I got the spectral data into the data frame as a matrix by protecting it with I which equates to AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside of a data frame is not completely intuitive or easy to grok.
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.

Resources