Perform kNN with clustered input data - R

How do I perform kNN cross-validation with input data that has been clustered using k-means?
I seem unable to find a function that can do this.
prediction.strength from fpc seems to compute some form of prediction rate for a given classifier method, but it appears to test against the training set, which doesn't seem that beneficial to me.
Isn't there any function that can perform cross-validation?
Example:
library("datasets")
library("stats")
iris_c3 = kmeans(iris$Sepal.Length,center= 10, iter.max = 30)
How do I provide iris_c3 as training data to some form of kNN that also performs cross-validation, assuming a test set is provided in the same manner?
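For what it's worth, here is one possible sketch (an assumption on my part, not a canonical recipe): treat the k-means cluster assignments as class labels and run leave-one-out cross-validation with knn.cv from the class package. The package and parameter choices below are illustrative.
library("class")      # provides knn.cv (leave-one-out CV for kNN)
set.seed(1)
iris_c3 <- kmeans(iris$Sepal.Length, centers = 10, iter.max = 30)
# use the cluster assignments as class labels
train_x  <- data.frame(Sepal.Length = iris$Sepal.Length)
train_cl <- factor(iris_c3$cluster)
# leave-one-out cross-validated kNN predictions on the training data
cv_pred <- knn.cv(train = train_x, cl = train_cl, k = 3)
mean(cv_pred == train_cl)   # agreement with the k-means labels
If you need k-fold rather than leave-one-out CV, caret's train(..., method = "knn", trControl = trainControl(method = "cv")) is another option.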

Related

Missing value handling with imputation in a nested resampling procedure such that there is no information bleed from train to test

I am looking through the documentation for the nested resampling procedure in the mlr3tuning package, and I do not see any way for the package to handle NA values such that information bleed between the training and hold-out sets is avoided; such bleed would result in overly optimistic performance statistics. Ideally, I would like a way to split my data in a nested resampling procedure such that:
full_data = N
train = N - holdout
test = holdout
Then I could impute the train and test datasets separately, fit the model on train, predict on test, select a new holdout and train split from the full dataset, impute them separately again, and repeat (train, predict) for the number of outer loops.
Is there a way of doing this? Am I missing something obvious?
mlr3 handles all of this for you if you use pipelines (see the relevant part of the mlr3 book). If you make imputation part of such a pipeline, it is trained and applied appropriately on each split, just like the model itself.
Briefly, as an explanation: just as with the machine learning model, you don't want to make any adjustments based on the test set; in particular, you shouldn't impute based on test data. Doing so causes the same kind of problem as fitting the model on test data, i.e. biased evaluation results that may not be representative of the true generalization error.
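A minimal sketch of such a pipeline (the task, imputation operator, and learner below are illustrative choices, not prescribed by the answer above):
library(mlr3)
library(mlr3pipelines)
# Imputation becomes part of the learner, so it is fitted on each training
# split and only applied to the corresponding test split.
graph   <- po("imputemedian") %>>% lrn("classif.rpart")
learner <- as_learner(graph)
task <- tsk("pima")                      # built-in task with missing values
rr   <- resample(task, learner, rsmp("cv", folds = 5))
rr$aggregate(msr("classif.ce"))          # evaluation without train/test bleed
The same GraphLearner can then be wrapped in an AutoTuner from mlr3tuning for the full nested-resampling setup described in the question.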

It's normal to test a model on independent data after cross-validation

I want to fit a random forest model, so I split my data into 70% for training and 30% for testing. I applied a cross-validation procedure on my training data (70%) and obtained a precision for the cross-validation. After that, I tested my model on the test data (30%) and obtained another precision value.
So, I want to know whether this is a good approach for testing the robustness of my model, and what the interpretation of these two precision values is.
Thanks in advance.
You do not need to perform cross-validation when building an RF model, as RF calculates its own internal error estimate known as the OOB (out-of-bag) score. In fact, the results that you get from the model (the confusion matrix at model_name$confusion) are based on the OOB predictions.
You can use the OOB scores (and the various metrics derived from them, such as precision, recall, etc.) to select a model from a list of models (for example, models with different parameters / arguments) and then use the test data to check whether the selected model generalises well.
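As a hedged illustration (the object names below are placeholders, not from the question), the OOB-based results are already available from a fitted randomForest object:
library(randomForest)
set.seed(123)
# 'train70' and 'target' stand in for the 70% split and a factor response
rf <- randomForest(target ~ ., data = train70, ntree = 500)
rf$confusion              # confusion matrix built from OOB predictions
rf$err.rate[rf$ntree, ]   # OOB error (plus per-class errors) after all trees
The test-set precision can then be compared against the OOB-based one to judge how well the model generalises.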

How to use a train, test and validation set to predict

I have a really large dataset and I'm trying to build a classification model using R.
However, I need to use a train, test and validation set, and I'm a bit confused about how to do this. For example, I built a tree using a train set and then computed the predictions using a test set. But I believe I should be using the train and test sets to tune the tree, and after that use the validation set to validate it. How can I do this?
library(rpart)
# fit the classification tree on the training set
part.installed <- rpart(TARGET ~ RS_DESC + SAP_STATUS + ACTIVATION_STATUS +
                          ROTUL_STATUS + SIM_STATUS + RATE_PLAN_SEGMENT_NORM,
                        data = trainSet, method = "class")
# predict class labels on the test set
part.predictions <- predict(part.installed, testSet, type = "class")
(P.S. the tree is only an example; it could be another classification algorithm.)
Usually the terminology is as follows:
The training set is used to build the classifier.
The validation set is used to tune the algorithm's hyperparameters repeatedly. There will be some overfitting here, which is why there is another stage:
The test set must not be touched until the classifier is final, to prevent overfitting. It serves to estimate the true accuracy you would see if you put the model into production.
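A minimal sketch of such a three-way split (the 60/20/20 proportions and the object names are assumptions):
set.seed(42)
idx <- sample(c("train", "valid", "test"), nrow(mydata),
              replace = TRUE, prob = c(0.6, 0.2, 0.2))
trainSet <- mydata[idx == "train", ]   # build the classifier here
validSet <- mydata[idx == "valid", ]   # tune hyperparameters here
testSet  <- mydata[idx == "test", ]    # touch only once, at the very end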

Applying a k-fold cross-validation model using the caret package

Let me start by saying that I have read many posts on cross-validation and it seems there is much confusion out there. My understanding is simply this:
Perform k-fold cross-validation, e.g. 10 folds, to understand the average error across the 10 folds.
If that is acceptable, then train the model on the complete data set.
I am attempting to build a decision tree using rpart in R and taking advantage of the caret package. Below is the code I am using.
# load libraries
library(caret)
library(rpart)
# define training control
train_control <- trainControl(method = "cv", number = 10)
# train the model
model <- train(resp ~ ., data = mydat, trControl = train_control, method = "rpart")
# make predictions
predictions <- predict(model, mydat)
# append predictions
mydat <- cbind(mydat, predictions)
# summarize results
confusionMatrix <- confusionMatrix(mydat$predictions, mydat$resp)
I have one question regarding the caret train call. I have read the train section of A Short Introduction to the caret Package, which states that during the resampling process the "optimal parameter set" is determined.
In my example, have I coded it up correctly? Do I need to define the rpart parameters within my code, or is my code sufficient?
When you perform k-fold cross-validation, you are already making a prediction for each sample, just over 10 different models (presuming k = 10).
There is no need to make a prediction on the complete data, as you already have predictions from the k different models.
What you can do is the following:
train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
Then
model <- train(resp ~ ., data = mydat, trControl = train_control, method = "rpart")
If you want to see the observed values and the predictions in a nice format, you simply type:
model$pred
Also, for the second part of your question, caret should handle all the parameter tuning for you. You can manually tune parameters if you desire.
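Putting that together, here is a sketch reusing the question's own mydat / resp names (savePredictions = "final" keeps only the predictions of the selected model):
library(caret)
library(rpart)
train_control <- trainControl(method = "cv", number = 10, savePredictions = "final")
model <- train(resp ~ ., data = mydat, trControl = train_control, method = "rpart")
head(model$pred)                                   # out-of-fold predictions and observed values
confusionMatrix(model$pred$pred, model$pred$obs)   # confusion matrix on the CV predictions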
An important thing to note here is not to confuse model selection with model error estimation.
You can use cross-validation to estimate the model hyperparameters (the regularization parameter, for example).
Usually that is done with 10-fold cross-validation, because it is a good choice for the bias-variance trade-off (2-fold could produce models with high bias; leave-one-out CV can produce models with high variance / over-fitting).
After that, if you don't have an independent test set, you can estimate an empirical distribution of some performance metric using cross-validation: once you have found the best hyperparameters, you can use them to estimate the CV error.
Note that in this step the hyperparameters are fixed, but the model parameters may differ across the cross-validation models.
On the first page of the short introduction document for the caret package, it is mentioned that the optimal model is chosen across the parameters.
As a starting point, one must understand that cross-validation is a procedure for selecting the best modeling approach rather than the final model itself (see CV - Final model selection). caret provides a grid-search option via tuneGrid, where you can provide a list of parameter values to test. The final model will have the optimized parameters after training is done.
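For example, a hedged sketch of a manual grid for rpart's complexity parameter (the cp values are arbitrary):
grid  <- expand.grid(cp = c(0.001, 0.01, 0.05, 0.1))
model <- train(resp ~ ., data = mydat, method = "rpart",
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = grid)
model$bestTune   # the cp value selected by cross-validation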

Random Forest Cross-Validation in R

I am working on a random forest in R and I would like to add 10-fold cross-validation to my model, but I am quite stuck there.
This is sample of my code.
install.packages('randomForest')
library(randomForest)
set.seed(123)
fit <- randomForest(as.factor(sickrabbit) ~ Feature1 + ... + FeatureN, data = training1, importance = TRUE, sampsize = c(200, 300), ntree = 500)
I found the function rfcv online, but I am not sure I understand how it works. Can anyone help with this function or propose an easier way to implement cross-validation? Can you do it using the randomForest package instead of caret?
You don't need to cross-validate a random forest model. You are getting stuck with the randomForest package because it wasn't designed to do this.
Here is a snippet from Breiman's official documentation:
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests.
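That said, if you still want an explicit 10-fold CV estimate to compare with the OOB error, one option (a sketch reusing the asker's training1 / sickrabbit names) is to let caret wrap randomForest:
library(caret)
ctrl   <- trainControl(method = "cv", number = 10)
fit_cv <- train(as.factor(sickrabbit) ~ ., data = training1,
                method = "rf", trControl = ctrl, ntree = 500)
fit_cv$results   # CV accuracy for each mtry value tried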
