How to predict with the kknn function? library(kknn) - r

I am trying to use kknn plus a loop to create leave-one-out cross-validation for a model, and to compare that with train.kknn.
I have split the data into two parts: training (80% of the data) and test (20% of the data). In the training data, I exclude one point in each iteration of the loop to manually create LOOCV.
I think something goes wrong in predict(knn.fit, data.test). I have tried to find out how to predict with kknn through the kknn package documentation and online, but all the examples use summary(model) and table(validation...) rather than prediction on a separate test data set. The call predict(model, dataset) works successfully with the train.kknn function, so I thought I could use similar arguments with kknn.
I am not sure whether such a prediction function exists for kknn. If it does, what arguments should I give it?
I look forward to your suggestions. Thank you.
library(kknn)

for (i in 1:nrow(data.train)) {
  # train.data + validation.data together are the 80% split
  train.data <- data.train[-i, ]
  validation.data <- data.train[i, ]
  knn.fit <- kknn(as.factor(R1) ~ ., train.data, validation.data, k = 40,
                  kernel = "rectangular", scale = TRUE)
}
pred.knn <- predict(knn.fit, data.test)  # data.test is the 20% split
Here is the error message:
Error in switch(type, raw = object$fit, prob = object$prob,
  stop("invalid type for prediction")) :
  EXPR must be a length 1 vector
Actually, I am trying to compare train.kknn and kknn plus a loop on the results of the leave-one-out CV. I have two more questions:
1) In kknn: is it possible to use another set of data as test data to see the knn.fit prediction?
2) In train.kknn: I split the data and use 80% of the whole data set for training, intending to use the remaining 20% for prediction. Is that a correct, common practice? Or should I instead use the original data (the whole data set) for train.kknn, and create a loop with data[-i,] for training and data[i,] for validation in kknn, so that the two are proper counterparts?
I find that if I use the training data in train.kknn and then predict on the test data set, the best k and kernel are selected first and then used directly to generate the predicted values on the test data set.
In contrast, if I use the kknn function and build a loop over different k values, the model generates the corresponding predictions on the test data set each time the k value changes. So with kknn plus a loop, the best k is finally selected based on the best actual prediction accuracy on the test data. In short, the best k selected by train.kknn may not work best on the test data.
Thank you.

For objects returned by kknn, predict gives the predicted value, or the predicted probabilities of R1, for the single row contained in validation.data:
predict(knn.fit)
predict(knn.fit, type = "prob")
The predict command also works on objects returned by train.kknn.
For example:
train.kknn.fit <- train.kknn(as.factor(R1) ~ ., data.train, ks = 10,
                             kernel = "rectangular", scale = TRUE)
class(train.kknn.fit)
# [1] "train.kknn" "kknn"
pred.train.kknn <- predict(train.kknn.fit, data.test)
table(pred.train.kknn, as.factor(data.test$R1))
The train.kknn command implements a leave-one-out method very close to the loop developed by @vcai01. See the following example:
set.seed(43210)
n <- 500
data.train <- data.frame(R1 = rbinom(n, 1, 0.5), matrix(rnorm(n * 10), ncol = 10))

library(kknn)

pred.kknn <- array(0, nrow(data.train))
for (i in 1:nrow(data.train)) {
  train.data <- data.train[-i, ]
  validation.data <- data.train[i, ]
  knn.fit <- kknn(as.factor(R1) ~ ., train.data, validation.data, k = 40,
                  kernel = "rectangular", scale = TRUE)
  # assigning the factor prediction into a numeric array stores its integer
  # codes, so pred.kknn holds 1/2 rather than the labels 0/1
  pred.kknn[i] <- predict(knn.fit)
}

knn.fit <- train.kknn(as.factor(R1) ~ ., data.train, ks = 40,
                      kernel = "rectangular", scale = TRUE)
pred.train.kknn <- predict(knn.fit, data.train)
table(pred.train.kknn, pred.kknn)
#                pred.kknn
# pred.train.kknn   1   2
#               0 374  14
#               1   9 103
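As for question 1): kknn does not take new data through a separate predict step. Instead, the test data frame you pass as the third argument is exactly the data the fitted values refer to, so you can score the held-out split directly. A minimal sketch, assuming data.train and data.test are the 80%/20% split from the question:

library(kknn)

# fit on the training split and score the test split in one call;
# predict() then returns the fitted values for the rows of data.test
knn.fit.test <- kknn(as.factor(R1) ~ ., data.train, data.test, k = 40,
                     kernel = "rectangular", scale = TRUE)
pred.test <- predict(knn.fit.test)
table(pred.test, as.factor(data.test$R1))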

Related

Output is lagging when trying to get lambda and alpha values after running Elastic-Net Regression Model

I am new to R and to elastic-net regression models. I am running an elastic-net regression model on the built-in dataset Titanic. I am trying to obtain the alpha and lambda values after running the train function. However, when I run train, the output keeps lagging: I wait for the output, but nothing appears at all; it is empty. I am trying to tune the parameters.
library(caret)  # for createDataPartition, train, trainControl

data(Titanic)
example <- as.data.frame(Titanic)
example['Country'] <- NA
countryunique <- array(c("Africa", "USA", "Japan", "Australia", "Sweden", "UK", "France"))
new_country <- c()
# Loop through the Country column
for (loopitem in example$Country) {
  # Randomly select an element of the array countryunique
  loopitem <- sample(countryunique, 1)
  # Append the new value to the vector
  new_country <- c(new_country, loopitem)
}
# Overwrite the Country column with the new data
example$Country <- new_country
example$Class <- as.factor(example$Class)
example$Sex <- as.factor(example$Sex)
example$Age <- as.factor(example$Age)
example$Survived <- as.factor(example$Survived)
example$Country <- as.factor(example$Country)
example$Freq <- as.numeric(example$Freq)

set.seed(12345678)
# Step 1: Randomly select row numbers for the training set
trainRowNum <- createDataPartition(example$Survived,  # the outcome variable
                                   p = 0.3,           # proportion used for training
                                   list = FALSE)      # don't return a list
# Step 2: Create the training dataset
trainData <- example[trainRowNum, ]
# Step 3: Create the test dataset
testData <- example[-trainRowNum, ]

alphas <- seq(0.1, 0.9, by = 0.1)
lambdas <- 10^seq(-3, 3, length = 100)

# Logistic elastic-net regression
en <- train(Survived ~ .,
            data = trainData,
            method = "glmnet",
            preProcess = NULL,
            trControl = trainControl("repeatedcv",
                                     number = 10,
                                     repeats = 5),
            tuneGrid = expand.grid(alpha = alphas,
                                   lambda = lambdas))
Could you please advise what values are recommended for alpha and lambda?
Thank you
I'm not quite sure what the problem is. Your code runs fine for me. If I look at the en object, it says:
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.1 and lambda = 0.1.
It didn't take long to run for me. Do you have a lot stored in your R session memory that could be slowing down your system and causing the lag? Maybe try restarting RStudio and running the above code from scratch.
To see the full results table, with accuracy for all combinations of alpha and lambda, look at en$results.
As a side note, you can easily carry out cross-validation directly in the glmnet package, using the cv.glmnet function. A helper package called glmnetUtils is also available, which lets you select the optimal alpha and lambda values simultaneously using the cva.glmnet function. This allows parallelisation, so it may be quicker than doing the cross-validation via caret.
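As a rough sketch of that glmnetUtils route, assuming the trainData frame built above (the alpha grid searched is just cva.glmnet's default):

library(glmnetUtils)

# cva.glmnet cross-validates over a grid of alpha values and, for each
# alpha, runs cv.glmnet internally to choose lambda
cva_fit <- cva.glmnet(Survived ~ ., data = trainData, family = "binomial")

cva_fit$alpha          # the alpha grid that was searched
cva_fit$modlist[[1]]   # the cv.glmnet fit for the first alpha value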

R: Understanding K-Fold Validation Correctly?

Good Afternoon.
I wanted a sanity check after doing research on k-fold cross-validation. I will lay out my understanding, and then provide an example of how to execute it in R.
I would really appreciate any help if I'm thinking about this incorrectly, or if my code does not reflect my thought process / the correct procedure. Take a basic predictive-modeling scenario with a continuous response variable:
Have a population dataset (xDF)
I want to split the dataset into k = 10 separate parts, train a model on 9 of them (bound together), and then validate on the remaining validation set
I then want to loop through each validation set to observe how the model performs on segments of the data it was not trained on
Model performance measures (RMSE for this example) that are similar across all k validation folds reveal that the model is well generalized
R Code:
library(randomForest)  # for randomForest
library(caret)         # for RMSE

# Declare randomly sampled validation indices
ind <- sample(seq_len(nrow(xDF)), size = nrow(xDF))
n <- nrow(xDF) / 10
nr <- nrow(xDF)
validation_ind <- split(ind, rep(1:ceiling(nr/n), each = n, length.out = nr))
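# (Aside: assuming the caret package is available, an equivalent fold list
# could be built in one line, avoiding the rep()/split() bookkeeping:
#   validation_ind <- caret::createFolds(seq_len(nrow(xDF)), k = 10)
# createFolds returns a list of hold-out index vectors, one per fold.)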
# Loop through the validation sets to obtain the model performance measures for each set
RMSEsF <- double(10)
RMSEsFT <- double(10)
R2F <- double(10)
R2FT <- double(10)
rsq <- function(x, y) cor(x, y)^2
for (i in 1:10) {
  validate <- as.data.frame(xDF[unlist(validation_ind[i]), ])
  train <- as.data.frame(xDF[unlist(validation_ind[-i]), ])
  rf_train <- randomForest(y ~ ., data = train, mtry = 3)
  predictions_rf <- predict(rf_train, validate)  # predictions on the held-out fold
  predictions_rft <- predict(rf_train, train)    # predictions on the training folds
  RMSEsF[i] <- RMSE(predictions_rf, validate$y)
  RMSEsFT[i] <- RMSE(predictions_rft, train$y)
  R2F[i] <- rsq(predictions_rf, validate$y)
  R2FT[i] <- rsq(predictions_rft, train$y)
  print(".")
}
RMSEsF
RMSEsFT
Am I going about this correctly?
Many thanks in advance.

R Neural Net Issues

Here is the updated code. My issue is with the output of "results"; I'll post the code below, formatted for readability.
library("neuralnet")
library("ggplot2")
setwd("C:/Users/Aaron/Documents/UMUC/R/Data For Assignments")
trainset <- read.csv("SOTS.csv")
head(trainset)
## val data classification
str(trainset)
## building the neural network
risknet <- neuralnet(Overall.Risk.Value ~ Finance + Personnel + Information.Dissemenation.C, trainset, hidden = 10, lifesign = "minimal", linear.output = FALSE, threshold = 0.1)
##plot nn
plot(risknet, rep="best")
##import scoring set
score_set <- read.csv("SOSS.csv")
##select subsets-training and scoring match
score_test <- subset(score_set, select = c("Finance", "Personnel", "Information.Dissemenation.C"))
##display values of score_test
head(score_test)
##neural network compute function score_test and the neural net "risknet"
risknet.results <- compute(risknet, score_test)
##Actual value of Overall.Risk.Value variable wanting to predict. net.result = a matrix containing the overall result of the neural network
results <- data.frame(Actual = score_set$Overall.Risk.Value, Prediction = risknet.results$net.result)
results[1:14, ]
The output of results is not as expected. For instance, the actual data are numbers between 5 and 8, whereas "Prediction" displays outputs of 0.9995... for each result.
Thanks again for the help.
This is how you train and predict:
Use the training data to learn the model parameters (the variable risknet in your case)
Use the parameters to predict scores on the test data
Here is an example very similar to yours that explains how this is done.
The default activation function in neuralnet is "logistic". When linear.output is set to FALSE, the output is mapped by the activation function to the interval [0, 1] (R Journal, neuralnet, Frauke Günther).
I just updated linear.output = TRUE in your code, and the final result looks much better.
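Concretely, the suggested change is just the one argument in the neuralnet call; everything else in the script above stays the same:

## linear.output = TRUE leaves the output neuron unsquashed, so predictions
## can fall on the scale of Overall.Risk.Value (roughly 5 to 8 here)
risknet <- neuralnet(Overall.Risk.Value ~ Finance + Personnel + Information.Dissemenation.C,
                     trainset, hidden = 10, lifesign = "minimal",
                     linear.output = TRUE, threshold = 0.1)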
Thanks for the help!

Setting random seeds does not affect the classification methods C5.0 and ctree

I want to compare two different classification methods, namely ctree and C5.0 from the libraries party and C50 respectively. The comparison tests their sensitivity to the initial starting points. The test should be carried out 30 times; each time, the number of wrongly classified items is calculated and stored in a vector, and then a t-test is used to check whether the two methods really differ.
library("foreign"); # for read.arff
library("party") # for ctree
library("C50") # for C5.0
trainTestSplit <- function(data, trainPercentage){
newData <- list();
all <- nrow(data);
splitPoint <- floor(all * trainPercentage);
newData$train <- data[1:splitPoint, ];
newData$test <- data[splitPoint:all, ];
return (newData);
}
ctreeErrorCount <- function(st,ss){
set.seed(ss);
model <- ctree(Class ~ ., data=st$train);
class <- st$test$Class;
st$test$Class <- NULL;
pre = predict(model, newdata=st$test, type="response");
errors <- length(which(class != pre)); # counting number of miss classified items
return(errors);
}
C50ErrorCount <- function(st,ss){
model <- C5.0(Class ~ ., data=st$train, seed=ss);
class <- st$test$Class;
pre = predict(model, newdata=st$test, type="class");
errors <- length(which(class != pre)); # counting number of miss classified items
return(errors);
}
compare <- function(n = 30){
data <- read.arff(file.choose());
set.seed(100);
errors = list(ctree = c(), c50 = c());
seeds <- floor(abs(rnorm(n) * 10000));
for(i in 1:n){
splitData <- trainTestSplit(data, 0.66);
errors$ctree[i] <- ctreeErrorCount(splitData, seeds[i]);
errors$c50[i] <- C50ErrorCount(splitData, seeds[i]);
}
cat("\n\n");
cat("============= ctree Vs C5.0 =================\n");
cat(paste(errors$ctree, " ", errors$c50, "\n"))
tt <- t.test(errors$ctree, errors$c50);
print(tt);
}
The program shown is supposed to do the comparison, but because the number of errors never changes in the vectors, the t.test function produces an error. I used iris from inside R (changing class to Class) and the Wisconsin breast cancer data, which can be downloaded here, to test it, but any data can be used as long as it has a Class attribute.
The problem I run into is that the results of both methods remain constant and do not change while I change the random seed. Theoretically, as described in their documentation, both functions use random seeds: ctree uses set.seed(x) while C5.0 uses an argument called seed. Unfortunately, I cannot see any effect.
Could you please tell me how to control the initialization of these functions?
ctree only depends on a random seed in the case where you configure it to use a random selection of input variables (i.e. mtry > 0 within ctree_control). See http://cran.r-project.org/web/packages/party/party.pdf (p. 11).
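A minimal sketch of that ctree variant, reusing the st and ss variables from the question (ctreeErrorCountRandom is a hypothetical name, and mtry = 2 is just an illustrative choice):

library("party")

ctreeErrorCountRandom <- function(st, ss){
  set.seed(ss);
  # mtry > 0 makes ctree consider a random subset of input variables at
  # each split, so the fitted tree now depends on the seed
  ctrl <- ctree_control(mtry = 2);
  model <- ctree(Class ~ ., data = st$train, controls = ctrl);
  pre <- predict(model, newdata = st$test, type = "response");
  return(length(which(st$test$Class != pre)));
}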
In regard to C5.0 trees, the seed is used this way:
ctrl <- C5.0Control(sample = 0.5, seed = ss);
model <- C5.0(Class ~ ., data = st$train, control = ctrl);
Notice that the seed is used to select a sample of the data, not within the algorithm itself. See http://cran.r-project.org/web/packages/C50/C50.pdf (p. 5).
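Folding that into the question's helper, a sketch might look like this (C50ErrorCountSampled is a hypothetical name, and sample = 0.5 is just an illustrative choice):

library("C50")

C50ErrorCountSampled <- function(st, ss){
  # the seed only has an effect when C5.0 trains on a random sample of the data
  ctrl <- C5.0Control(sample = 0.5, seed = ss);
  model <- C5.0(Class ~ ., data = st$train, control = ctrl);
  pre <- predict(model, newdata = st$test, type = "class");
  return(length(which(st$test$Class != pre)));
}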

How to perform 10 fold cross validation with LibSVM in R?

I know that in MATLAB this is really easy ('-v 10').
But I need to do it in R. I did find one comment saying that adding cross = 10 as a parameter would do it, but this is not confirmed in the help file, so I am sceptical about it.
svm(Outcome ~ ., data = source, cost = 100, gamma = 1, cross = 10)
Any examples of a successful SVM script for R would also be appreciated, as I am still running into some dead ends.
Edit: I forgot to mention outside of the tags that I use the libsvm package for this.
I am also trying to perform 10-fold cross-validation. I think that using tune is not the right way to do it, since that function is used to optimize the parameters, not to train and test the model.
I have the following code to perform leave-one-out cross-validation. Suppose that dataset is a data.frame with your data stored in it. In each LOO step, the observed-vs-predicted matrix is added, so that at the end, result contains the global observed-vs-predicted matrix.
library(e1071)  # for svm and classAgreement

# LOO validation; result accumulates the observed-vs-predicted table
result <- 0
for (i in 1:nrow(dataset)) {  # one LOO step per row (not per column)
  fit <- svm(classes ~ ., data = dataset[-i, ], type = 'C-classification', kernel = 'linear')
  pred <- predict(fit, dataset[i, ])
  result <- result + table(true = dataset[i, ]$classes, pred = pred)
}
classAgreement(result)
So in order to perform 10-fold cross-validation, I guess we should manually partition the dataset and use the folds to train and test the model.
# getFoldTrainSet and getFoldTestSet are placeholders for your own partitioning helpers
results <- list()
for (i in 1:10) {
  train <- getFoldTrainSet(dataset, i)
  test <- getFoldTestSet(dataset, i)
  fit <- svm(classes ~ ., train, type = 'C-classification', kernel = 'linear')
  pred <- predict(fit, test)
  results[[i]] <- table(true = test$classes, pred = pred)
}
# compute mean accuracies and kappas using results, which stores the result of each fold
I hope this helps you.
Here is a simple way to create 10 test and training folds using no packages:
# Randomly shuffle the data
yourData <- yourData[sample(nrow(yourData)), ]
# Create 10 equally sized folds
folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)
# Perform 10-fold cross-validation
for (i in 1:10) {
  # Segment your data by fold using the which() function
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  # Use the test and train data however you desire...
}
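The loop body above can be filled in with the same e1071 SVM used in the other answers; a minimal sketch, assuming yourData has a factor column classes:

library(e1071)

result <- 0
for (i in 1:10) {
  testIndexes <- which(folds == i)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  fit <- svm(classes ~ ., data = trainData, type = 'C-classification', kernel = 'linear')
  pred <- predict(fit, testData)
  # accumulate the observed-vs-predicted table across folds
  result <- result + table(true = testData$classes, pred = pred)
}
classAgreement(result)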
Here is my generic code to run k-fold cross-validation, aided by cvsegments (from the pls package) to generate the index folds.
library('pls')    # for cvsegments
library('e1071')  # for svm and classAgreement

# k-fold cross-validation
set.seed(1)
k <- 80
result <- 0
folds <- cvsegments(nrow(imDF), k)
for (fold in 1:k) {
  currentFold <- folds[fold][[1]]
  fit <- svm(classes ~ ., data = imDF[-currentFold, ], type = 'C-classification', kernel = 'linear')
  pred <- predict(fit, imDF[currentFold, ])
  result <- result + table(true = imDF[currentFold, ]$classes, pred = pred)
}
classAgreement(result)
