R: Understanding K-Fold Validation Correctly?

Good Afternoon.
I wanted a sanity check after doing some research on k-fold cross-validation. I'll describe my understanding and then give an example of how I would carry it out in R.
I would really appreciate any feedback on whether I'm thinking about this incorrectly, or whether my code doesn't reflect my thought process / the correct procedure. Take a basic predictive modeling scenario with a continuous response variable:
Have a population dataset (xDF)
I want to split the dataset into k=10 separate parts, train a model on 9 of them (combined), and then validate on the remaining part (the validation set)
I then want to loop through each of the k validation sets to observe how the model performs on segments of the data it was not trained on
If the model performance measure (RMSE for this example) on one fold's validation set is similar to the measures on the other nine validation sets, that indicates the model is well generalized
R Code:
# Libraries used below: randomForest for the model, caret for RMSE()
library(randomForest)
library(caret)
# Declaring randomly sampled validation indices
ind <- sample(seq_len(nrow(xDF)), size = nrow(xDF))   # shuffle all row indices
nr  <- nrow(xDF)
n   <- ceiling(nr / 10)                               # fold size (rounded up)
validation_ind <- split(ind, rep(1:10, each = n, length.out = nr))
# Looping through the validation sets to obtain model performance measures for each fold
RMSEsF  <- double(10)   # validation RMSE per fold
RMSEsFT <- double(10)   # training RMSE per fold
R2F     <- double(10)   # validation R^2 per fold
R2FT    <- double(10)   # training R^2 per fold
rsq <- function(x, y) cor(x, y)^2
for (i in 1:10) {
  validate <- as.data.frame(xDF[validation_ind[[i]], ])          # held-out fold
  train    <- as.data.frame(xDF[unlist(validation_ind[-i]), ])   # remaining 9 folds combined
  rf_train <- randomForest(y ~ ., data = train, mtry = 3)
  predictions_rf  <- predict(rf_train, validate)
  predictions_rft <- predict(rf_train, train)
  RMSEsF[i]  <- RMSE(predictions_rf,  validate$y)   # RMSE() from caret
  RMSEsFT[i] <- RMSE(predictions_rft, train$y)
  R2F[i]     <- rsq(predictions_rf,  validate$y)
  R2FT[i]    <- rsq(predictions_rft, train$y)
  cat(".")   # progress indicator
}
RMSEsF
RMSEsFT
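A minimal way to summarize these per-fold results (just tabulating the vectors computed above; the idea being that similar validation RMSEs across folds, and a small gap to the training RMSE, suggest good generalization):
data.frame(fold = 1:10,
           rmse_valid = RMSEsF, rmse_train = RMSEsFT,
           r2_valid = R2F, r2_train = R2FT)
mean(RMSEsF); sd(RMSEsF)   # average validation RMSE and its spread across the 10 folds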
Am I going about this correctly?
Many thanks in advance.

Related

How to predict in kknn function? library(kknn)

I'm trying to use kknn plus a loop to create leave-one-out cross-validation for a model, and to compare that with train.kknn.
I have split the data into two parts: training (80% of the data) and test (20% of the data). Within the training data, I exclude one point at a time in the loop to manually create the LOOCV.
I think something goes wrong in predict(knn.fit, data.test). I have tried to find out how to predict with kknn through the kknn package documentation and online, but all the examples use "summary(model)" and "table(validation...)" rather than prediction on a separate test set. The code predict(model, dataset) works successfully with the train.kknn function, so I thought I could use similar arguments with kknn.
I am not sure whether there is such a prediction function in kknn. If so, what arguments should I give?
I look forward to your suggestions. Thank you.
library(kknn)
for (i in 1:nrow(data.train)) {
  train.data      <- data.train[-i, ]
  validation.data <- data.train[i, ]
  knn.fit <- kknn(as.factor(R1) ~ ., train.data, validation.data, k = 40,
                  kernel = "rectangular", scale = TRUE)
  # train.data + validation.data is the 80% data I split.
}
pred.knn <- predict(knn.fit, data.test) # data.test is 20% data.
Here is the error message:
Error in switch(type, raw = object$fit, prob = object$prob,
  stop("invalid type for prediction")) :
  EXPR must be a length 1 vector
Actually I'm trying to compare train.kknn and kknn + loop to compare the results of the leave-one-out CV. I have two more questions:
1) In kknn: is it possible to use another set of data as test data to see the knn.fit prediction?
2) In train.kknn: I split the data, use 80% of the whole dataset, and intend to use the remaining 20% for prediction. Is that a correct/common practice? Or should I just use the original data (the whole dataset) for train.kknn, and create a loop with data[-i,] for training and data[i,] for validation in kknn, so that they are proper counterparts?
I find that if I use the training data in the train.kknn function and then predict on the test dataset, the best k and kernel are selected and used directly to generate the predicted values for the test dataset.
In contrast, if I use the kknn function and build a loop over different k values, the model generates the corresponding predictions for the test dataset each time the k value changes. In kknn + loop, the best k is then selected based on the best actual prediction accuracy on the test data. In short, the best k that train.kknn selects may not work best on the test data.
Thank you.
For objects returned by kknn, predict gives the predicted value or the predicted probabilities of R1 for the single row contained in validation.data:
predict(knn.fit)
predict(knn.fit, type="prob")
The predict command also works on objects returned by train.kknn.
For example:
train.kknn.fit <- train.kknn(as.factor(R1) ~ ., data.train, ks = 10,
                             kernel = "rectangular", scale = TRUE)
class(train.kknn.fit)
# [1] "train.kknn" "kknn"
pred.train.kknn <- predict(train.kknn.fit, data.test)
table(pred.train.kknn, as.factor(data.test$R1))
The train.kknn command implements a leave-one-out method very close to the loop developed by @vcai01. See the following example:
set.seed(43210)
n <- 500
data.train <- data.frame(R1 = rbinom(n, 1, 0.5), matrix(rnorm(n * 10), ncol = 10))
library(kknn)
pred.kknn <- array(0, nrow(data.train))
for (i in 1:nrow(data.train)) {
  train.data      <- data.train[-i, ]
  validation.data <- data.train[i, ]
  knn.fit <- kknn(as.factor(R1) ~ ., train.data, validation.data, k = 40,
                  kernel = "rectangular", scale = TRUE)
  pred.kknn[i] <- predict(knn.fit)
}
knn.fit <- train.kknn(as.factor(R1) ~ ., data.train, ks = 40,
                      kernel = "rectangular", scale = TRUE)
pred.train.kknn <- predict(knn.fit, data.train)
table(pred.train.kknn, pred.kknn)
#                pred.kknn
# pred.train.kknn   1   2
#               0 374  14
#               1   9 103
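Regarding question 1): kknn takes the test data directly as its third argument when the model is built, so there is no separate predict step for new data; a minimal sketch reusing data.train and data.test from the question:
knn.fit.test <- kknn(as.factor(R1) ~ ., data.train, data.test, k = 40,
                     kernel = "rectangular", scale = TRUE)
pred.test <- predict(knn.fit.test)                  # predicted classes for data.test
prob.test <- predict(knn.fit.test, type = "prob")   # class probabilities for data.test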

leave-one-out cross validation with knn in R

I have defined my training and test sets as follows:
library(class)   # provides knn()
colon_samp  <- sample(62, 40)
colon_train <- colon_data[colon_samp, ]
colon_test  <- colon_data[-colon_samp, ]
And the KNN function:
knn_colon <- knn(train = colon_train[1:12533], test = colon_test[1:12533], cl = colon_train$class, k=2)
Here is my LOOCV loop for KNN:
newColon_train <- data.frame(colon_train, id = 1:nrow(colon_train))
id <- unique(newColon_train$id)
loo_colonKNN <- NULL
for (i in id) {
  knn_colon <- knn(train = newColon_train[newColon_train$id != i, ],
                   test  = newColon_train[newColon_train$id == i, ],
                   cl    = newColon_train[newColon_train$id != i, ]$Y)
  loo_colonKNN[[i]] <- knn_colon
}
print(loo_colonKNN)
When I print loo_colonKNN it gives me 40 predictions (i.e. the 40 training-set predictions); however, I would like it to give me 62 predictions (all n samples in the original dataset). How might I go about doing this?
Thank you.
You would simply call the knn function again, using a different test parameter:
[...]
knn_colon2 <- knn(train = newColon_train[newColon_train$id != i, ],
                  test  = newColon_test[newColon_test$id == i, ],
                  cl    = newColon_train[newColon_train$id != i, ]$Y)
This works because KNN is a non-parametric, instance-based model: the data itself is the model, so "training" just means holding on to the data for later prediction and does not require any computationally intensive model-fitting procedure. Consequently, it is unproblematic to call the training procedure multiple times and apply it to multiple test sets.
But be aware that the idea of CV is to evaluate only on the left-out partition each time, so looking at all samples is probably not what you want to do. Instead of coding this yourself, you might be better off using e.g. the knn.cv function or the caret framework, which provide APIs for partitioning, resampling, etc. all in one place and are therefore quite convenient for such tasks.
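For example, a minimal sketch of the knn.cv route (from the class package, reusing the column range, class label, and k from the question; note that knn.cv performs leave-one-out on whatever set you pass it, here the 40 training rows):
library(class)
loo_pred <- knn.cv(train = colon_train[1:12533],
                   cl    = colon_train$class,
                   k     = 2)               # leave-one-out prediction for every row in one call
table(loo_pred, colon_train$class)          # LOOCV confusion matrix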

PLS in R: Predicting new observations returns Fitted values instead

In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (the training set) were used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10,
                 validation = "LOO", jackknife = TRUE)
It was then identified that a model comprising 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the testing set using the above calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that all this piece of code generates is the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the code below, in which newdata is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot figure out what, specifically. Is there someone out there who can, and is willing to, help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn) that goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
height = rpois(56,10),
fbm = rpois(56,10),
nitrogen = rpois(56,10),
carbon = rpois(56,10),
chl = rpois(56,10),
ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10, validation =
"LOO", jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
Note that I got the spectral data into the data frame as a matrix by protecting it with I(), which equates to AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside a data frame is not completely intuitive or easy to grok.
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.
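As a quick check that res really contains test-set predictions rather than fitted values, you can compare it against the held-out responses (a minimal sketch using the objects above; predict() on an mvr model returns a small three-dimensional array, so drop() turns it into a plain vector):
obs  <- DF$height[!DF$train]     # responses of the 36 held-out rows
pred <- drop(res)                # predictions with 3 components as a plain vector
sqrt(mean((obs - pred)^2))       # test-set RMSE for the 3-component model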

Setting random seeds does not affect classification methods C5.0 and ctree

I want to compare two different classification methods, namely ctree and C5.0 from the libraries party and C50 respectively. The comparison is meant to test their sensitivity to the initial starting points. The test should be carried out 30 times; each time, the number of misclassified items is calculated and stored in a vector, and then, using a t-test, I hope to see whether they are really different or not.
library("foreign"); # for read.arff
library("party") # for ctree
library("C50") # for C5.0
trainTestSplit <- function(data, trainPercentage){
  newData <- list();
  all <- nrow(data);
  splitPoint <- floor(all * trainPercentage);
  newData$train <- data[1:splitPoint, ];
  newData$test  <- data[(splitPoint + 1):all, ];   # start after splitPoint so the sets do not overlap
  return(newData);
}
ctreeErrorCount <- function(st, ss){
  set.seed(ss);
  model <- ctree(Class ~ ., data = st$train);
  class <- st$test$Class;
  st$test$Class <- NULL;
  pre <- predict(model, newdata = st$test, type = "response");
  errors <- length(which(class != pre)); # counting number of misclassified items
  return(errors);
}
C50ErrorCount <- function(st, ss){
  model <- C5.0(Class ~ ., data = st$train, seed = ss);
  class <- st$test$Class;
  pre <- predict(model, newdata = st$test, type = "class");
  errors <- length(which(class != pre)); # counting number of misclassified items
  return(errors);
}
compare <- function(n = 30){
  data <- read.arff(file.choose());
  set.seed(100);
  errors <- list(ctree = c(), c50 = c());
  seeds <- floor(abs(rnorm(n) * 10000));
  for(i in 1:n){
    splitData <- trainTestSplit(data, 0.66);
    errors$ctree[i] <- ctreeErrorCount(splitData, seeds[i]);
    errors$c50[i]   <- C50ErrorCount(splitData, seeds[i]);
  }
  cat("\n\n");
  cat("============= ctree Vs C5.0 =================\n");
  cat(paste(errors$ctree, " ", errors$c50, "\n"));
  tt <- t.test(errors$ctree, errors$c50);
  print(tt);
}
The program shown is supposed to do the comparison, but because the number of errors does not change within the vectors, the t.test function produces an error. I used the iris dataset built into R (renaming the class column to Class) and the Wisconsin breast cancer data, which can be downloaded here, to test it, but any data can be used as long as it has a Class attribute.
The problem I run into is that the results of both methods remain constant and do not change while I am changing the random seed. Theoretically, as described in their documentation, both of the functions use random seeds: ctree uses set.seed(x), while C5.0 uses an argument called seed to set the seed. Unfortunately, I cannot see any effect.
Could you please tell me how to control the initial conditions of these functions?
ctree only depends on a random seed in the case where you configure it to use a random selection of input variables (i.e. mtry > 0 within ctree_control). See http://cran.r-project.org/web/packages/party/party.pdf (p. 11).
As regards C5.0 trees, the seed is used this way:
ctrl = C5.0Control(sample=0.5, seed=ss);
model <- C5.0(Class ~ ., data=st$train, control = ctrl);
Notice that the seed is used to select a sample of the data, not within the algorithm itself. See http://cran.r-project.org/web/packages/C50/C50.pdf (p. 5).
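So, to make the ctree fit in the question actually respond to the seed, something along these lines should work (mtry = 2 is only an illustrative value; the relevant argument of party's ctree is controls):
set.seed(ss);
ctrl.ctree <- ctree_control(mtry = 2);   # randomly sample 2 input variables at each split
model <- ctree(Class ~ ., data = st$train, controls = ctrl.ctree);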

How to perform 10 fold cross validation with LibSVM in R?

I know that in MATLAB this is really easy ('-v 10').
But I need to do it in R. I did find one comment saying that adding cross = 10 as a parameter would do it, but this is not confirmed in the help file, so I am sceptical about it.
svm(Outcome ~ ., data = source, cost = 100, gamma = 1, cross = 10)
Any examples of a successful SVM script for R would also be appreciated, as I am still running into some dead ends.
Edit: I forgot to mention outside of the tags that I use the libsvm package for this.
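For reference, the cross argument is part of e1071's svm() (the usual R interface to libsvm), so a call like the one above should run a 10-fold cross-validation on the training data; a minimal sketch of how the results can then be inspected, reusing Outcome and source from the question:
library(e1071)
svm.model <- svm(Outcome ~ ., data = source, cost = 100, gamma = 1, cross = 10)
summary(svm.model)      # prints the 10-fold cross-validation results
svm.model$accuracies    # per-fold accuracies for classification (regression stores MSE instead)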
I am also trying to perform a 10-fold cross-validation. I think that using tune is not the right way to do it, since that function is used to optimize the parameters, not to train and test the model.
I have the following code to perform a leave-one-out cross-validation. Suppose that dataset is a data.frame with your data. In each LOO step, the observed vs. predicted table is added to result, so that at the end, result contains the global observed vs. predicted matrix.
# LOO validation
library(e1071)    # provides svm() and classAgreement()
result <- 0       # accumulates the observed vs. predicted table
for (i in 1:nrow(dataset)){    # one row held out per iteration
  fit  <- svm(classes ~ ., data = dataset[-i, ], type = 'C-classification', kernel = 'linear')
  pred <- predict(fit, dataset[i, ])
  result <- result + table(true = dataset[i, ]$classes, pred = pred);
}
classAgreement(result)
So in order to perform a 10-fold cross-validation, I guess we should manually partition the dataset and use the folds to train and test the model.
results <- list()   # stores the observed vs. predicted table of each fold
for (i in 1:10) {
  # getFoldTrainSet / getFoldTestSet are placeholders for however you partition the data
  train <- getFoldTrainSet(dataset, i)
  test  <- getFoldTestSet(dataset, i)
  fit  <- svm(classes ~ ., train, type = 'C-classification', kernel = 'linear')
  pred <- predict(fit, test)
  results[[i]] <- table(true = test$classes, pred = pred)
}
# compute mean accuracies and kappas using results, which stores the result of each fold
I hope this helps you.
Here is a simple way to create 10 test and training folds using no packages:
# Randomly shuffle the data
yourData <- yourData[sample(nrow(yourData)), ]
# Create 10 equally sized folds
folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)
# Perform 10-fold cross validation
for(i in 1:10){
  # Segment your data by fold using the which() function
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData  <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  # Use the test and train data however you desire...
}
Here is my generic code to run a k-fold cross-validation, aided by cvsegments (from the pls package) to generate the index folds.
# k-fold cross validation
set.seed(1)
k <- 80;
result <- 0;
library('pls');     # for cvsegments()
library('e1071');   # for svm() and classAgreement()
folds <- cvsegments(nrow(imDF), k);
for (fold in 1:k){
  currentFold <- folds[fold][[1]];
  fit  <- svm(classes ~ ., data = imDF[-currentFold, ], type = 'C-classification', kernel = 'linear')
  pred <- predict(fit, imDF[currentFold, ])
  result <- result + table(true = imDF[currentFold, ]$classes, pred = pred);
}
classAgreement(result)
