Adjusting the k in the knn train() command in R

I'm trying to classify with the knn algorithm. My question is: how do I adjust the number of neighbors the algorithm uses? For example, I want to try 3, 9 and 12.
How do I adjust this in the command?
species_knn = train(species ~ ., method = "knn", data = species, trControl = trainControl(method = 'cv', number = 3))

Here is an example of a grid search using the iris data:
library(caret)
Construct a grid of the hyperparameter values you would like to tune:
grid = expand.grid(k = c(3, 9, 12)) # in this case data.frame(k = c(3, 9, 12)) will do
Provide the grid via the tuneGrid argument:
species_knn = train(Species ~ ., method = "knn",
                    data = iris,
                    trControl = trainControl(method = 'cv',
                                             number = 3,
                                             search = "grid"),
                    tuneGrid = grid)
species_knn$results
#output
k Accuracy Kappa AccuracySD KappaSD
1 3 0.9666667 0.9499560 0.02309401 0.0346808964
2 9 0.9600000 0.9399519 0.00000000 0.0000416525
3 12 0.9533333 0.9299479 0.01154701 0.0173066504
Here is a list of all available models and their hyperparameters.
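If you are not sure which hyperparameters a given method exposes, caret can also tell you programmatically; a minimal sketch of my own (using caret's modelLookup helper), not part of the original answer:
library(caret)
modelLookup("knn")    # the only tunable parameter for method = "knn" is k
modelLookup("ranger") # mtry, splitrule and min.node.size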

Related

Hyperparameters not changing results from random forest regression trees

I am trying to tune the hyperparameters of a random forest regression model and all of the accuracy measures are exactly the same, regardless of changes to hyperparameters. I've tested the same code on the "diamonds" dataset and have been able to reproduce the problem. Here is my code:
library(caret)
library(ggplot2) # the diamonds data ships with ggplot2
train = diamonds[,c(1, 5, 8:10)]
x = c(1:6)
folds = sample(x, size = nrow(diamonds), replace = T)
rf_grid = expand.grid(.mtry = c(2:4),
                      .splitrule = "variance",
                      .min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
               train$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv",
                                        index = folds,
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model.rds")
write.csv(results1, "sample_model.csv", row.names = FALSE)
Here's what I get for results: all three values of mtry produce exactly the same RMSE, R-squared and MAE.
What the heck?
UPDATE:
I reduced the sample size to 1000 to allow for faster processing and got different results, still all identical to each other. Code:
train = diamonds[,c(1, 5, 8:10)]
train = train[c(1:1000),]
x = c(1:6)
folds = sample(x, size = nrow(train), replace = T)
rf_grid = expand.grid(.mtry = c(2:4),
                      .splitrule = "variance",
                      .min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
               train$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv",
                                        index = folds,
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model2.rds")
write.csv(results1, "sample_model2.csv", row.names = FALSE)
Results:
This seems to be an issue with your cross-validation folds. When I run your code and look at the results of model it says:
Summary of sample sizes: 1, 1, 1, 1, 1, 1, ...
indicating that each fold only has a training-set size of 1. That happens because the index argument of trainControl expects a list of row indices to train on in each resample, not a vector of fold labels, so each element of your folds vector is treated as a one-row training set. I think if you define folds like this, it will work more like you're expecting it to:
folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)
The results then look like this:
Random Forest
1000 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 832, 833, 835, 834, 834, 832, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 0.01582362 0.9933839 0.00985451
3 0.01601980 0.9932625 0.00994588
4 0.01567161 0.9935624 0.01018242
Tuning parameter 'splitrule' was held constant at a value of variance
Tuning parameter 'min.node.size' was held constant at a value of 20
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 4, splitrule = variance and min.node.size = 20.
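As a side note (my own check, not part of the original answer), you can inspect the two kinds of folds objects directly to see why one works and the other does not:
library(caret)
library(ggplot2)                                                # the diamonds data
dat <- diamonds[1:1000, c(1, 5, 8:10)]
bad_folds  <- sample(1:6, size = nrow(dat), replace = TRUE)     # fold labels, one per row
good_folds <- createFolds(dat$carat, k = 6, returnTrain = TRUE) # list of training-row indices
str(bad_folds)   # integer vector: caret ends up training on one row per resample
str(good_folds)  # list of 6 vectors with roughly 833 row numbers each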

PCA preprocess parameter in caret's train function

I am conducting knn regression on my data, and would like to:
a) cross-validate through repeatedcv to find an optimal k;
b) when building the knn model, use PCA at a 90% variance threshold to reduce dimensionality.
library(caret)
library(dplyr)
set.seed(0)
data = cbind(rnorm(20, 100, 10), matrix(rnorm(400, 10, 5), ncol = 20)) %>%
  data.frame()
colnames(data) = c('True', paste0('Day', 1:20))
tr = data[1:15, ] # training set
tt = data[16:20,] # test set
train.control = trainControl(method = "repeatedcv", number = 5, repeats = 3)
k = train(True ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:10),
          # trying to find the optimal k from 1:10
          trControl = train.control,
          preProcess = c('scale','pca'),
          metric = "RMSE",
          data = tr)
My questions:
(1) I notice that someone suggested to change the pca parameter in trainControl:
ctrl <- trainControl(preProcOptions = list(thresh = 0.8))
mod <- train(Class ~ ., data = Sonar, method = "pls",
trControl = ctrl)
If I change the parameter in trainControl, does it mean the PCA is still conducted during the KNN? This is a similar concern to this question.
(2) I found another example which fits my situation. I would like to change the threshold to 90%, but I don't know where to change it in caret's train function, especially since I still need the scale option.
I apologize for the tediously long description and random references. Thank you in advance!
(Thank you Camille for the suggestions to make the code work!)
To answer your questions:
I notice that someone suggested to change the pca parameter in
trainControl:
mod <- train(Class ~ ., data = Sonar, method = "pls",trControl = ctrl)
If I change the parameter in the trainControl, does it mean the PCA is
still conducted during the KNN?
Yes, if you do it with:
train.control = trainControl(method = "repeatedcv", number = 5, repeats = 3, preProcOptions = list(thresh = 0.9))
k = train(True ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:10),
          trControl = train.control,
          preProcess = c('scale','pca'),
          metric = "RMSE",
          data = tr)
You can check under preProcess:
k$preProcess
Created from 15 samples and 20 variables
Pre-processing:
- centered (20)
- ignored (0)
- principal component signal extraction (20)
- scaled (20)
PCA needed 9 components to capture 90 percent of the variance
For (2), you can instead run preProcess separately and then train on the transformed data:
mdl = preProcess(tr[,-1], method = c("scale","pca"), thresh = 0.9)
mdl
Created from 15 samples and 20 variables
Pre-processing:
- centered (20)
- ignored (0)
- principal component signal extraction (20)
- scaled (20)
PCA needed 9 components to capture 90 percent of the variance
train.control = trainControl(method = "repeatedcv", number = 5, repeats = 3)
k = train(True ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:10),
          trControl = train.control,
          metric = "RMSE",
          data = predict(mdl, tr))
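One follow-up worth noting (my addition, using the tt test set defined in the question): with this separate preProcess approach, the same fitted transformation has to be applied to new data before predicting.
tt_pca <- predict(mdl, tt)   # scale the Day columns and project them onto the training PCs
predict(k, newdata = tt_pca) # knn predictions for the held-out rows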

Ensemble different datasets in R

I am trying to combine signals from different models using the example described here. I have different datasets which predict the same output. However, when I combine the model outputs in caretList and ensemble the signals, I get an error:
Error in check_bestpreds_resamples(modelLibrary) :
Component models do not have the same re-sampling strategies
Here is the reproducible example
library(caret)
library(caretEnsemble)
df1 <-
  data.frame(x1 = rnorm(200),
             x2 = rnorm(200),
             y = as.factor(sample(c("Jack", "Jill"), 200, replace = T)))
df2 <-
  data.frame(z1 = rnorm(400),
             z2 = rnorm(400),
             y = as.factor(sample(c("Jack", "Jill"), 400, replace = T)))
library(caret)
check_1 <- train(x = df1[,1:2], y = df1[,3],
                 method = "nnet",
                 tuneLength = 10,
                 trControl = trainControl(method = "cv",
                                          classProbs = TRUE,
                                          savePredictions = T))
check_2 <- train(x = df2[,1:2], y = df2[,3],
                 method = "nnet",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trainControl(method = "cv",
                                          classProbs = TRUE,
                                          savePredictions = T))
combine <- c(check_1, check_2)
ens <- caretEnsemble(combine)
First of all, you are trying to combine two models trained on different training data sets. That is not going to work: all ensemble members need to be based on the same training set, otherwise each trained model has a different set of resamples. Hence your current error.
Also, building your models without using caretList is risky because there is a good chance of ending up with different resampling strategies. You can control that a bit better by using the index argument in trainControl (see the vignette), as in the sketch below.
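For illustration, here is a hedged sketch of that index approach (my own example, kept on the single df1 data set since the component models must share training data): fixing the resampling indexes once makes the two separately trained models share exactly the same folds.
set.seed(1324)
idx  <- createFolds(df1$y, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = idx,
                     classProbs = TRUE, savePredictions = "final")
m1 <- train(x = df1[, 1:2], y = df1[, 3], method = "nnet",
            tuneLength = 10, trControl = ctrl, trace = FALSE)
m2 <- train(x = df1[, 1:2], y = df1[, 3], method = "nnet",
            preProcess = c("center", "scale"),
            tuneLength = 10, trControl = ctrl, trace = FALSE)
# identical resamples, so the two fits can be compared fold by fold
summary(resamples(list(check_1 = m1, check_2 = m2)))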
If you use a single dataset, you can use the following caretList-based code:
ctrl <- trainControl(method = "cv",
                     number = 5,
                     classProbs = TRUE,
                     savePredictions = "final")
set.seed(1324)
# will generate the following warning:
# indexes not defined in trControl. Attempting to set them ourselves, so
# each model in the ensemble will have the same resampling indexes.
models <- caretList(x = df1[,1:2],
                    y = df1[,3],
                    trControl = ctrl,
                    tuneList = list(
                      check_1 = caretModelSpec(method = "nnet", tuneLength = 10),
                      check_2 = caretModelSpec(method = "nnet", tuneLength = 10, preProcess = c("center", "scale"))
                    ))
ens <- caretEnsemble(models)
A glm ensemble of 2 base models: nnet, nnet
Ensemble results:
Generalized Linear Model
200 samples
2 predictor
2 classes: 'Jack', 'Jill'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results:
Accuracy Kappa
0.5249231 0.04164767
Also read this guide on different ensemble strategies.

R: using ranger with caret, tuneGrid argument

I'm using the caret package to analyse Random Forest models built using ranger. I can't figure out how to call the train function using the tuneGrid argument to tune the model parameters.
I think I'm calling the tuneGrid argument wrong, but can't figure out why it's wrong. Any help would be appreciated.
data(iris)
library(ranger)
model_ranger <- ranger(Species ~ ., data = iris, num.trees = 500, mtry = 4,
                       importance = 'impurity')
library(caret)
# my tuneGrid object:
tgrid <- expand.grid(
  num.trees = c(200, 500, 1000),
  mtry = 2:4
)
model_caret <- train(Species ~ ., data = iris,
                     method = "ranger",
                     trControl = trainControl(method = "cv", number = 5, verboseIter = T, classProbs = T),
                     tuneGrid = tgrid,
                     importance = 'impurity'
)
Here is the syntax for ranger in caret:
library(caret)
Add a . prefix to the tuning parameter names:
tgrid <- expand.grid(
  .mtry = 2:4,
  .splitrule = "gini",
  .min.node.size = c(10, 20)
)
Only these three parameters are tunable through caret; the number of trees is not. You can still pass num.trees and importance directly to train:
model_caret <- train(Species ~ ., data = iris,
                     method = "ranger",
                     trControl = trainControl(method = "cv", number = 5, verboseIter = T, classProbs = T),
                     tuneGrid = tgrid,
                     num.trees = 100,
                     importance = "permutation")
To get the variable importance:
varImp(model_caret)
#output
Overall
Petal.Length 100.0000
Petal.Width 84.4298
Sepal.Length 0.9855
Sepal.Width 0.0000
To check that this works, set the number of trees to 1000 or more; the fit will be much slower. After changing to importance = "impurity":
#output:
Overall
Petal.Length 100.00
Petal.Width 81.67
Sepal.Length 16.19
Sepal.Width 0.00
If it does not work, I recommend installing the latest ranger from CRAN and caret from GitHub:
devtools::install_github('topepo/caret/pkg/caret')
To tune the number of trees you can use lapply with fixed folds created by createMultiFolds or createFolds, as sketched below.
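A minimal sketch of that idea (my own example on iris, not from the original answer), using fixed folds so the resamples stay comparable across the different forests:
library(caret)
set.seed(42)
folds <- createFolds(iris$Species, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)
fits <- lapply(c(100, 500, 1000), function(nt) {
  train(Species ~ ., data = iris,
        method = "ranger",
        trControl = ctrl,
        tuneGrid = data.frame(mtry = 2, splitrule = "gini", min.node.size = 10),
        num.trees = nt)
})
# cross-validated accuracy for 100, 500 and 1000 trees
sapply(fits, function(f) f$results$Accuracy)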
EDIT: while the dotted-name grid above works with caret package version 6.0-84, using the hyperparameter names without dots works just as well:
tgrid <- expand.grid(
  mtry = 2:4,
  splitrule = "gini",
  min.node.size = c(10, 20)
)

Keeping one parameter fixed while searching randomly over the other in caret

I would like to keep the parameter alpha fixed at 1 and use random search for lambda. Is this possible?
library(caret)
X <- iris[, 1:4]
Y <- iris[, 5]
fit_glmnet <- train(X, Y, method = "glmnet", tuneLength = 2, trControl = trainControl(search = "random"))
I do not think this can be achieved by specifying it directly in caret's train, but here is how to emulate the desired behavior.
From this link one can see that random search for lambda is achieved by:
lambda = 2^runif(len, min = -10, 3)
where len is the tune length.
To emulate random search over one parameter:
len <- 2
fit_glmnet <- train(X, Y,
                    method = "glmnet",
                    tuneLength = len,
                    trControl = trainControl(search = "grid"),
                    tuneGrid = data.frame(alpha = 1, lambda = 2^runif(len, min = -10, 3)))
First off, I'm not sure you can combine a random search with fixed tuning parameters.
However, as an alternative you could use a grid search instead of a random search, and then fix tuning parameters via tuneGrid:
fit <- train(
  X,
  Y,
  method = "glmnet",
  tuneLength = 2,
  trControl = trainControl(search = "grid"),
  tuneGrid = data.frame(alpha = 1, lambda = 10^seq(-4, -1, by = 0.5)))
fit
#glmnet
#
#150 samples
# 4 predictor
# 3 classes: 'setosa', 'versicolor', 'virginica'
#
#No pre-processing
#Resampling: Bootstrapped (25 reps)
#Summary of sample sizes: 150, 150, 150, 150, 150, 150, ...
#Resampling results across tuning parameters:
#
# lambda Accuracy Kappa
# 0.0001000000 0.9398036 0.9093246
# 0.0003162278 0.9560817 0.9336278
# 0.0010000000 0.9581838 0.9368050
# 0.0031622777 0.9589165 0.9379580
# 0.0100000000 0.9528997 0.9288533
# 0.0316227766 0.9477923 0.9212374
# 0.1000000000 0.9141015 0.8709753
#
#Tuning parameter 'alpha' was held constant at a value of 1
#Accuracy was used to select the optimal model using the largest value.
#The final values used for the model were alpha = 1 and lambda = 0.003162278.
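As a small follow-up (my addition, not part of the original answer), you could then pull the coefficients of the final glmnet model at the selected lambda:
fit$bestTune                                  # alpha = 1 and the winning lambda
coef(fit$finalModel, s = fit$bestTune$lambda) # one sparse coefficient matrix per class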
