PCA as preprocessing for kNN model in caret not working (R)

I am running a kNN classification with PCA as preprocessing (threshold = 0.8). However, after training finishes and I call varImp(), it does not report the importance of the principal components (as it should). When I inspect the fitted model object, the preProcess element is also NULL.
It seems that the PCA is not being run at all.
The code for training control and training that I am using is:
ctrl <- trainControl(method = "cv",
                     number = 10,
                     summaryFunction = defaultSummary,
                     preProcOptions = list(thresh = 0.8),
                     classProbs = TRUE)
set.seed(150)
knn.fit <- train(defective ~ .,
                 data = fTR,
                 method = "knn",
                 preProcess = "pca",
                 tuneGrid = data.frame(k = 4),
                 #tuneGrid = data.frame(k = seq(3, 5, 1)),
                 #tuneLength = 10,
                 trControl = ctrl,
                 metric = "Kappa")
What am I doing wrong?
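For reference, in a minimal run on a stand-in dataset (iris here, in place of fTR and the defective outcome), passing preProcess = "pca" to train() does populate fit$preProcess; comparing against a reduced example like this can help narrow down whether the problem is in the call or in the data:

```r
library(caret)

ctrl <- trainControl(method = "cv",
                     number = 3,
                     preProcOptions = list(thresh = 0.8),
                     classProbs = TRUE)

set.seed(150)
fit <- train(Species ~ .,
             data = iris,          # stand-in for fTR
             method = "knn",
             preProcess = "pca",
             tuneGrid = data.frame(k = 4),
             trControl = ctrl,
             metric = "Kappa")

# if PCA ran, this is non-NULL and lists "pca" among the methods
fit$preProcess
```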

Related

Not seeing decision boundary of SVM plot in R

I am new to using SVM in R. I am analysing a breast cancer dataset and would like to plot the decision boundary for two of the features, but the boundary does not show where it is supposed to. The following is the code:
library(caret)
library(e1071)
set.seed(134)
fitc <- trainControl(method = "cv", number = 10, search = "random", savePredictions = TRUE)
modfitsvm <- train(diagnosis ~ ., data = trainset, method = "svmLinear",
                   trControl = fitc, tuneLength = 10)
plot(modfitsvm)
bestmod <- svm(diagnosis ~ ., data = trainset, kernel = "linear", cost = 0.18)
# train data error
svm.ytrain <- trainset$diagnosis
svm.predy <- predict(bestmod, trainset)
mean(svm.ytrain != svm.predy)
# test data error
svm.ytest <- testset$diagnosis
svm.predtesty <- predict(bestmod, testset)
mean(svm.ytest != svm.predtesty)
svmtab <- confusionMatrix(svm.predtesty, testset$diagnosis)
fourfoldplot(svmtab$table, conf.level = 0, margin = 1,
             main = paste0("SVM Testing(", round(svmtab$overall[1] * 100, 3), "%)"))
I am not seeing the boundary, even though the model achieved 94% accuracy.
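For context, none of the calls above actually draw a boundary: plot() on a train object shows tuning results, and fourfoldplot() shows the confusion matrix. The usual way to draw a two-feature decision boundary is to predict over a dense grid and draw the class contour. A sketch with e1071 and iris (two classes, two features) standing in for the breast-cancer data:

```r
library(e1071)

# two classes, two features, standing in for the breast-cancer data
d <- droplevels(subset(iris, Species != "virginica",
                       select = c(Sepal.Length, Petal.Length, Species)))
fit <- svm(Species ~ ., data = d, kernel = "linear", cost = 0.18)

# predict over a dense grid spanning the feature ranges
grid <- expand.grid(
  Sepal.Length = seq(min(d$Sepal.Length), max(d$Sepal.Length), length.out = 200),
  Petal.Length = seq(min(d$Petal.Length), max(d$Petal.Length), length.out = 200))
grid$pred <- predict(fit, grid)

# points coloured by class, boundary drawn as the contour between classes
plot(d$Sepal.Length, d$Petal.Length, col = d$Species, pch = 19,
     xlab = "Sepal.Length", ylab = "Petal.Length")
contour(unique(grid$Sepal.Length), unique(grid$Petal.Length),
        matrix(as.numeric(grid$pred), nrow = 200),
        levels = 1.5, add = TRUE, drawlabels = FALSE)
```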

I don't know if my CARET model is classification or regression?

I am using the caret package in R for predictive modelling. I don't know whether my model is classification or regression, but whenever I use regression-only models, I get an error.
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 3,
                        classProbs = TRUE)
set.seed(1)
scaled_t$Training <- factor(scaled_t$Training)
knn_grid1 <- train(Training ~ .,
                   data = scaled_t,
                   method = "knn",
                   trControl = control,
                   tuneGrid = expand.grid(k = seq(1, 31, by = 2)))
knn_grid1
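For reference, caret decides between classification and regression from the class of the outcome: a factor outcome gives a classification model, a numeric outcome a regression model. That is why regression-only methods error out once Training is converted to a factor. A minimal illustration on iris (a stand-in dataset):

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 3)

set.seed(1)
# factor outcome -> classification
fit_class <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)

set.seed(1)
# numeric outcome -> regression
fit_reg <- train(Sepal.Length ~ ., data = iris[, 1:4], method = "knn",
                 trControl = ctrl)

fit_class$modelType  # "Classification"
fit_reg$modelType    # "Regression"
```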

Identical results from random forest hyperparameter tuning in R

I am trying out hyperparameter tuning of random forests in R using library(randomForest). Why am I getting identical results for different values of the hyperparameter maxnodes?
I am using the Titanic dataset obtained from Kaggle. I applied a grid search only on the hyperparameter mtry, and the results gave me different values of Accuracy and Kappa for each mtry.
However, when I tried to search for the best maxnodes value, every run returned the same Accuracy and Kappa.
This is for tuning 'mtry'
library(caret)
library(randomForest)
control <- trainControl(method = "cv",
                        number = 10,
                        search = "grid")
tuneGrid <- expand.grid(.mtry = c(1:10))
rf_mtry <- train(train_X,
                 train_Y,
                 method = "rf",
                 metric = "Accuracy",
                 tuneGrid = tuneGrid,
                 trControl = control,
                 importance = TRUE,
                 nodesize = 14,
                 ntree = 300)
This is for tuning 'maxnodes'
mtry_best <- rf_mtry$bestTune$mtry
store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = mtry_best)
for (maxnodes in c(2:20)) {
  set.seed(1234)
  rf_maxnode <- train(train_X,
                      train_Y,
                      method = "rf",
                      metric = "Accuracy",
                      tuneGrid = tuneGrid,
                      trControl = control,
                      importance = TRUE,
                      nodesize = 5,
                      maxnodes = maxnodes,
                      ntree = 300)
  current_iteration <- toString(maxnodes)
  store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry)
I expected to see different values of Accuracy and Kappa for each maxnodes setting, but they were identical.
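One way to check whether maxnodes is actually reaching the underlying randomForest call is to inspect the fitted trees: randomForest::treesize() reports the number of terminal nodes per tree, which should never exceed maxnodes. A small sketch on iris (standing in for the Titanic data):

```r
library(randomForest)

set.seed(1234)
# iris standing in for the Titanic training data
rf_small <- randomForest(Species ~ ., data = iris, ntree = 50, maxnodes = 3)

# number of terminal nodes per tree; none should exceed maxnodes
max(treesize(rf_small))
```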

PCA preprocess parameter in caret's train function

I am conducting kNN regression on my data and would like to:
a) cross-validate via repeatedcv to find an optimal k;
b) when building the kNN model, apply PCA with a 90% variance threshold to reduce dimensionality.
library(caret)
library(dplyr)
set.seed(0)
data = cbind(rnorm(20, 100, 10), matrix(rnorm(400, 10, 5), ncol = 20)) %>%
  data.frame()
colnames(data) = c('True', paste0('Day', 1:20))
tr = data[1:15, ]  # training set
tt = data[16:20, ] # test set
train.control = trainControl(method = "repeatedcv", number = 5, repeats = 3)
k = train(True ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:10), # trying to find the optimal k from 1:10
          trControl = train.control,
          preProcess = c('scale', 'pca'),
          metric = "RMSE",
          data = tr)
My questions:
(1) I notice that someone suggested changing the PCA options in trainControl:
ctrl <- trainControl(preProcOptions = list(thresh = 0.8))
mod <- train(Class ~ ., data = Sonar, method = "pls",
             trControl = ctrl)
If I change the option in trainControl, does that mean the PCA is still conducted during the kNN training? Similar concern as this question.
(2) I found another example that fits my situation. I would like to change the threshold to 90%, but I don't know where to set it in caret's train function, especially since I still need the scale option.
I apologize for the long description and random references. Thank you in advance!
(Thank you Camille for the suggestions to make the code work!)
To answer your questions:
I notice that someone suggested to change the pca parameter in
trainControl:
mod <- train(Class ~ ., data = Sonar, method = "pls",trControl = ctrl)
If I change the parameter in the trainControl, does it mean the PCA is
still conducted during the KNN?
Yes, if you do it with:
train.control = trainControl(method = "repeatedcv", number = 5, repeats = 3,
                             preProcOptions = list(thresh = 0.9))
k = train(True ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:10),
          trControl = train.control,
          preProcess = c('scale', 'pca'),
          metric = "RMSE",
          data = tr)
You can check under preProcess:
k$preProcess
Created from 15 samples and 20 variables
Pre-processing:
- centered (20)
- ignored (0)
- principal component signal extraction (20)
- scaled (20)
PCA needed 9 components to capture 90 percent of the variance
This also answers (2), which is to run preProcess separately:
mdl = preProcess(tr[, -1], method = c("scale", "pca"), thresh = 0.9)
mdl
Created from 15 samples and 20 variables
Pre-processing:
- centered (20)
- ignored (0)
- principal component signal extraction (20)
- scaled (20)
PCA needed 9 components to capture 90 percent of the variance
train.control = trainControl(method = "repeatedcv", number = 5, repeats = 3)
k = train(True ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:10),
          trControl = train.control,
          metric = "RMSE",
          data = predict(mdl, tr))
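One step worth making explicit in this second approach: because preProcess() was fitted separately, the same object must also transform any new data before predicting, otherwise the kNN model would see raw columns instead of principal components. A self-contained sketch reproducing the question's setup:

```r
library(caret)

set.seed(0)
dat <- data.frame(cbind(rnorm(20, 100, 10), matrix(rnorm(400, 10, 5), ncol = 20)))
colnames(dat) <- c("True", paste0("Day", 1:20))
tr <- dat[1:15, ]   # training set
tt <- dat[16:20, ]  # held-out set

# fit the transformation on the training predictors only
mdl <- preProcess(tr[, -1], method = c("scale", "pca"), thresh = 0.9)

k <- train(True ~ .,
           method = "knn",
           tuneGrid = expand.grid(k = 1:5),
           trControl = trainControl(method = "cv", number = 5),
           metric = "RMSE",
           data = predict(mdl, tr))  # model trained on the PC scores

# the key step: project the held-out rows with the *same* preProcess object
tt_pc <- predict(mdl, tt)
preds <- predict(k, newdata = tt_pc)
preds
```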

Ensemble different datasets in R

I am trying to combine signals from different models using the example described here. I have different datasets that predict the same output. However, when I combine the model outputs in caretList and ensemble the signals, I get an error:
Error in check_bestpreds_resamples(modelLibrary) :
Component models do not have the same re-sampling strategies
Here is the reproducible example
library(caret)
library(caretEnsemble)
df1 <- data.frame(x1 = rnorm(200),
                  x2 = rnorm(200),
                  y = as.factor(sample(c("Jack", "Jill"), 200, replace = TRUE)))
df2 <- data.frame(z1 = rnorm(400),
                  z2 = rnorm(400),
                  y = as.factor(sample(c("Jack", "Jill"), 400, replace = TRUE)))
check_1 <- train(x = df1[, 1:2], y = df1[, 3],
                 method = "nnet",
                 tuneLength = 10,
                 trControl = trainControl(method = "cv",
                                          classProbs = TRUE,
                                          savePredictions = TRUE))
check_2 <- train(x = df2[, 1:2], y = df2[, 3],
                 method = "nnet",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trainControl(method = "cv",
                                          classProbs = TRUE,
                                          savePredictions = TRUE))
combine <- c(check_1, check_2)
ens <- caretEnsemble(combine)
First of all, you are trying to combine two models trained on different training data sets. That is not going to work: all models in an ensemble need to be based on the same training set, otherwise each trained model has a different set of resamples. Hence your current error.
Also, building your models without caretList is risky, because there is a big chance of ending up with different resampling strategies. You can control that somewhat better by setting the index in trainControl (see the vignette).
If you use 1 dataset you can use the following code:
ctrl <- trainControl(method = "cv",
                     number = 5,
                     classProbs = TRUE,
                     savePredictions = "final")
set.seed(1324)
# will generate the following warning:
# indexes not defined in trControl. Attempting to set them ourselves, so
# each model in the ensemble will have the same resampling indexes.
models <- caretList(x = df1[, 1:2],
                    y = df1[, 3],
                    trControl = ctrl,
                    tuneList = list(
                      check_1 = caretModelSpec(method = "nnet", tuneLength = 10),
                      check_2 = caretModelSpec(method = "nnet", tuneLength = 10,
                                               preProcess = c("center", "scale"))
                    ))
ens <- caretEnsemble(models)
A glm ensemble of 2 base models: nnet, nnet
Ensemble results:
Generalized Linear Model
200 samples
2 predictor
2 classes: 'Jack', 'Jill'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results:
Accuracy Kappa
0.5249231 0.04164767
Also read this guide on different ensemble strategies.
