I am running kNN regression on my data and would like to:
a) cross-validate with repeatedcv to find the optimal k;
b) apply PCA with a 90% variance threshold to reduce dimensionality when building the kNN model.
library(caret)
library(dplyr)
set.seed(0)
data = cbind(rnorm(20, 100, 10), matrix(rnorm(400, 10, 5), ncol = 20)) %>%
data.frame()
colnames(data) = c('True', paste0('Day',1:20))
tr = data[1:15, ] #training set
tt = data[16:20,] #test set
train.control = trainControl(method = "repeatedcv", number = 5, repeats=3)
k = train(True ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
#trying to find the optimal k from 1:10
trControl = train.control,
preProcess = c('scale','pca'),
metric = "RMSE",
data = tr)
My questions:
(1) I noticed a suggestion to change the PCA parameter through trainControl:
ctrl <- trainControl(preProcOptions = list(thresh = 0.8))
mod <- train(Class ~ ., data = Sonar, method = "pls",
trControl = ctrl)
If I set the parameter in trainControl, is the PCA still performed as part of the kNN fit? My concern is similar to the one in this question.
(2) I found another example that fits my situation. I would like to change the threshold to 90%, but I don't know where to set it in caret's train function, especially since I still need the scale option.
I apologize for the long description and scattered references. Thank you in advance!
(Thank you Camille for the suggestions to make the code work!)
To answer your questions:
I notice that someone suggested to change the pca parameter in
trainControl:
mod <- train(Class ~ ., data = Sonar, method = "pls",trControl = ctrl)
If I change the parameter in the trainControl, does it mean the PCA is
still conducted during the KNN?
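Yes, as long as 'pca' is still listed in preProcess. preProcOptions only sets the options; the preProcess argument decides which steps are run: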
train.control = trainControl(method = "repeatedcv", number = 5, repeats=3,preProcOptions = list(thresh = 0.9))
k = train(True ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = train.control,
preProcess = c('scale','pca'),
metric = "RMSE",
data = tr)
You can check under preProcess:
k$preProcess
Created from 15 samples and 20 variables
Pre-processing:
- centered (20)
- ignored (0)
- principal component signal extraction (20)
- scaled (20)
PCA needed 9 components to capture 90 percent of the variance
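For (2), you can instead run preProcess separately with thresh = 0.9 and train on the transformed data. Note that in this case the PCA is estimated once on the whole training set rather than re-estimated within each resample: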
mdl = preProcess(tr[,-1],method=c("scale","pca"),thresh=0.9)
mdl
Created from 15 samples and 20 variables
Pre-processing:
- centered (20)
- ignored (0)
- principal component signal extraction (20)
- scaled (20)
PCA needed 9 components to capture 90 percent of the variance
train.control = trainControl(method = "repeatedcv", number = 5, repeats=3)
k = train(True ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = train.control,
metric = "RMSE",
data = predict(mdl,tr))
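With the separate preProcess approach, new data has to pass through the same fitted transformation before it is handed to the model. A minimal sketch, assuming the simulated data from the question; the names mdl_pca, fit and pred are illustrative only:

```r
library(caret)

set.seed(0)
# Simulated data as in the question: outcome 'True', predictors Day1..Day20
data <- data.frame(cbind(rnorm(20, 100, 10), matrix(rnorm(400, 10, 5), ncol = 20)))
colnames(data) <- c("True", paste0("Day", 1:20))
tr <- data[1:15, ]   # training set
tt <- data[16:20, ]  # test set

# Estimate scaling and PCA (90% variance threshold) on the training predictors only
mdl_pca <- preProcess(tr[, -1], method = c("scale", "pca"), thresh = 0.9)

# Train kNN on the principal-component scores
fit <- train(True ~ .,
             data = predict(mdl_pca, tr),
             method = "knn",
             tuneGrid = expand.grid(k = 1:10),
             trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
             metric = "RMSE")

# Apply the same fitted transformation to the test rows before predicting
pred <- predict(fit, newdata = predict(mdl_pca, tt))
pred
```

When 'pca' is instead passed through the preProcess argument of train, predict on the train object should apply the stored transformation to new data automatically, so this manual step is only needed for the separate approach.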
Related
I am new to using SVMs in R. I am analysing a breast cancer dataset and would like to plot the decision boundary for two of the features, but the boundary does not show up where it is supposed to. The following is the code:
library(caret)
library(e1071) # provides svm()
set.seed(134)
fitc = trainControl(method = "cv", number = 10, search = "random", savePredictions = T)
modfitsvm = train(diagnosis~., data = trainset, method = "svmLinear", trControl = fitc, tuneLength = 10)
plot(modfitsvm)
bestmod = svm(diagnosis~., data = trainset, kernel = "linear", cost = 0.18)
# train data error
svm.ytrain<-trainset$diagnosis
svm.predy<-predict(bestmod, trainset)
mean(svm.ytrain!=svm.predy)
# test data error
svm.ytest<-testset$diagnosis
svm.predtesty<-predict(bestmod, testset)
mean(svm.ytest!=svm.predtesty)
svmtab = confusionMatrix(svm.predtesty,testset$diagnosis)
fourfoldplot(svmtab$table, conf.level = 0, margin = 1, main=paste("SVM Testing(",round(svmtab$overall[1]*100,3),"%)",sep=""))
I do not see a boundary, although the model achieves 94% accuracy.
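Note that none of the calls above draws a decision boundary: plot(modfitsvm) shows the tuning profile and fourfoldplot shows the confusion matrix. A minimal sketch of one way to draw the boundary with the plot method that e1071 provides for svm objects. Since trainset is not shown, a two-class subset of iris is used as a stand-in (an assumption), and the feature names are those of iris, not of the breast cancer data:

```r
library(e1071)

# Stand-in data (assumption): two iris species instead of the breast cancer set
two_class <- droplevels(subset(iris, Species != "setosa"))

# Linear SVM on all four features, mirroring bestmod above
fit <- svm(Species ~ ., data = two_class, kernel = "linear", cost = 0.18)

# Draw the boundary in the plane of two features. The remaining features
# are held at fixed values via `slice`; unspecified numeric ones default to 0,
# which can push the boundary outside the plotted region.
plot(fit, two_class, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Length = mean(two_class$Sepal.Length),
                  Sepal.Width  = mean(two_class$Sepal.Width)))
```

If the boundary is wanted only in two dimensions, an alternative is to fit the SVM on just those two features, in which case no slice is needed.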
I am running kNN regression on my data and would like to:
a) cross-validate with repeatedcv to find the optimal k;
b) apply PCA with a 90% variance threshold to reduce dimensionality when building the kNN model.
library(caret)
library(dplyr)
set.seed(0)
data = cbind(rnorm(15, 100, 10), matrix(rnorm(300, 10, 5), ncol = 20)) %>%
data.frame()
colnames(data) = c('True', paste0('Day',1:20))
tr = data[1:10, ] #training set
tt = data[11:15,] #test set
train.control = trainControl(method = "repeatedcv", number = 5, repeats=3)
k = train(True ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = train.control,
preProcess = c('scale','pca'),
metric = "RMSE",
data = tr)
My question: I believe the PCA threshold currently defaults to 95% (I am not sure). How can I change it to 80%?
You can add the preProcOptions argument to trainControl:
train.control = trainControl(method = "repeatedcv", number = 5, repeats=3, preProcOptions = list(thresh = 0.80))
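To see what a given threshold does before training, one can call preProcess directly and read off how many components it keeps. A small sketch, assuming the simulated data from the question; thresholds and n_comp are illustrative names, and numComp is the element of the preProcess object that holds the number of retained components:

```r
library(caret)

set.seed(0)
# Simulated data as in the question
data <- data.frame(cbind(rnorm(15, 100, 10), matrix(rnorm(300, 10, 5), ncol = 20)))
colnames(data) <- c("True", paste0("Day", 1:20))
tr <- data[1:10, ]  # training set

# Number of principal components retained at several variance thresholds
thresholds <- c(0.80, 0.90, 0.95)
n_comp <- sapply(thresholds, function(th) {
  preProcess(tr[, -1], method = c("scale", "pca"), thresh = th)$numComp
})
data.frame(thresh = thresholds, n_comp = n_comp)
```

A lower threshold can only keep the same number of components or fewer, so the n_comp column should never decrease as thresh grows.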
I am trying to combine signals from different models using the example described here. I have different datasets that predict the same output. However, when I combine the model outputs in a caretList and ensemble the signals, I get an error:
Error in check_bestpreds_resamples(modelLibrary) :
Component models do not have the same re-sampling strategies
Here is the reproducible example
library(caret)
library(caretEnsemble)
df1 <-
data.frame(x1 = rnorm(200),
x2 = rnorm(200),
y = as.factor(sample(c("Jack", "Jill"), 200, replace = T)))
df2 <-
data.frame(z1 = rnorm(400),
z2 = rnorm(400),
y = as.factor(sample(c("Jack", "Jill"), 400, replace = T)))
library(caret)
check_1 <- train( x = df1[,1:2],y = df1[,3],
method = "nnet",
tuneLength = 10,
trControl = trainControl(method = "cv",
classProbs = TRUE,
savePredictions = T))
check_2 <- train( x = df2[,1:2],y = df2[,3] ,
method = "nnet",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv",
classProbs = TRUE,
savePredictions = T))
combine <- c(check_1, check_2)
ens <- caretEnsemble(combine)
First of all, you are trying to combine two models trained on different training data sets. That is not going to work: all models in an ensemble need to be based on the same training set. Otherwise each trained model has a different set of resamples, hence your current error.
Building your models without caretList is also risky, because there is a big chance of ending up with different resampling strategies. You can control that a bit better by using the index argument of trainControl (see vignette).
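As a rough illustration of the index idea, the folds can be created once and passed to every train call. This is only a sketch: glm and knn are swapped in for nnet to keep it light, and folds, m_glm and m_knn are illustrative names:

```r
library(caret)

set.seed(1324)
df1 <- data.frame(x1 = rnorm(200),
                  x2 = rnorm(200),
                  y  = as.factor(sample(c("Jack", "Jill"), 200, replace = TRUE)))

# Fix the training rows of each fold once, then reuse them for every model
folds <- createFolds(df1$y, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv",
                     index = folds,
                     classProbs = TRUE,
                     savePredictions = "final")

m_glm <- train(x = df1[, 1:2], y = df1$y, method = "glm", trControl = ctrl)
m_knn <- train(x = df1[, 1:2], y = df1$y, method = "knn", trControl = ctrl)

# Compare the resampling indexes stored in the two fitted models
identical(m_glm$control$index, m_knn$control$index)
```

Because both calls share the same ctrl object, the two models are resampled on the same rows, which is the condition the ensemble check is looking for.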
If you use 1 dataset you can use the following code:
ctrl <- trainControl(method = "cv",
number = 5,
classProbs = TRUE,
savePredictions = "final")
set.seed(1324)
# will generate the following warning:
# indexes not defined in trControl. Attempting to set them ourselves, so
# each model in the ensemble will have the same resampling indexes.
models <- caretList(x = df1[,1:2],
y = df1[,3] ,
trControl = ctrl,
tuneList = list(
check_1 = caretModelSpec(method = "nnet", tuneLength = 10),
check_2 = caretModelSpec(method = "nnet", tuneLength = 10, preProcess = c("center", "scale"))
))
ens <- caretEnsemble(models)
A glm ensemble of 2 base models: nnet, nnet
Ensemble results:
Generalized Linear Model
200 samples
2 predictor
2 classes: 'Jack', 'Jill'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results:
Accuracy Kappa
0.5249231 0.04164767
Also read this guide on different ensemble strategies.
I am trying to repeat the following lines of code:
x.mat <- as.matrix(train.df[,predictors])
y.class <- train.df$Response
cv.lasso.fit <- cv.glmnet(x = x.mat, y = y.class,
family = "binomial", alpha = 1, nfolds = 10)
... with the caret package, but it doesn't work:
trainControl <- trainControl(method = "cv",
number = 10,
# Compute Recall, Precision, F-Measure
summaryFunction = prSummary,
# prSummary needs calculated class probs
classProbs = T)
modelFit <- train(Response ~ . -Id, data = train.df,
method = "glmnet",
trControl = trainControl,
metric = "F", # Optimize by F-measure
alpha=1,
family="binomial")
The parameter "alpha" is not recognized, and "the model fit fails in every fold".
What am I doing wrong? Help would be much appreciated. Thanks.
Try tuneGrid instead of passing alpha directly to train. For example:
tuneGrid=expand.grid(
.alpha=1,
.lambda=seq(0, 100, by = 0.1))
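Put together, a complete call might look like the sketch below. In caret, alpha is a tuning parameter of glmnet, so passing it through train's ... most likely collides with the value caret supplies from the grid. The data here is a hypothetical stand-in for train.df, the lambda grid is an assumption to adapt to your data, and the default Accuracy metric is used instead of prSummary to keep the sketch short. caret normally infers the binomial family from a two-level factor outcome, so family is left out:

```r
library(caret)

set.seed(1)
# Hypothetical stand-in for train.df: 200 rows, 5 numeric predictors, two classes
n <- 200
train.df <- data.frame(matrix(rnorm(n * 5), ncol = 5))
train.df$Response <- factor(ifelse(train.df$X1 + rnorm(n) > 0, "yes", "no"))

# alpha = 1 fixes the lasso penalty; only lambda is tuned.
# The lambda grid is an assumption and should be adapted to the data.
lasso_grid <- expand.grid(alpha = 1,
                          lambda = 10^seq(-4, 0, length.out = 20))

modelFit <- train(Response ~ ., data = train.df,
                  method = "glmnet",
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = lasso_grid)

# Selected tuning parameters
modelFit$bestTune
```

The summaryFunction = prSummary, classProbs = TRUE and metric = "F" settings from the question can be added back to trainControl and train once the basic fit runs.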
I am comparing different machine learning methods using caret, but although the methods are very different, I am getting identical variable contributions.
avNNet, ctree, enet, knn, M5, pcr, ridge and svmRadial all give the same variable contributions.
Some of these accept importance = TRUE as an argument: avNNet, enet, knn, pcr, ridge and svmRadial.
Others generate an error with importance = TRUE: ctree and M5.
(The error is "Something is wrong; all the RMSE metric values are missing:")
My question is why do different methods give the same variable importance?
This seems wrong, but I can't see what I've done wrong.
library(ggplot2)
library(caret)
library(elasticnet)
library(party)
data_set <- diamonds[1:1000, c(1, 5, 6, 7, 8, 9, 10)]
formula <- price ~ carat + depth + table + x + y + z
set.seed(100)
enet_model <- train(formula,
importance = TRUE,
data = data_set,
method = "enet",
trControl = trainControl(method = "cv"),
preProc = c("center", "scale"))
set.seed(100)
ctree_model <- train(formula,
data = data_set,
method = "ctree",
trControl = trainControl(method = "cv"))
set.seed(100)
knn_model <- train(formula,
importance = TRUE,
data = data_set,
method = "knn",
preProc = c("center", "scale"),
tuneGrid = data.frame(k = 1:20),
trControl = trainControl(method = "cv"))
varImp(enet_model)
varImp(ctree_model)
varImp(knn_model)
I'm using caret 6.0-52
From ?varImp:
For models that do not have corresponding varImp methods, see filterVarImp.
Those methods don't have importance scores implemented so you get model-free measures. I can add one for enet based on the coefficient values but knn and ctree have no obvious methods.
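To make this concrete, the sketch below fits only the knn model and puts its varImp output next to a direct call to filterVarImp on the raw data. The shortened k grid and the names predictors and imp_filter are my own, and nonpara = TRUE is what I understand to be the default that varImp uses for train objects without a model-specific method:

```r
library(caret)

# Same subset of diamonds as in the question (ggplot2 must be installed)
data_set <- as.data.frame(ggplot2::diamonds[1:1000, c(1, 5, 6, 7, 8, 9, 10)])
predictors <- c("carat", "depth", "table", "x", "y", "z")

set.seed(100)
knn_model <- train(price ~ carat + depth + table + x + y + z,
                   data = data_set,
                   method = "knn",
                   preProcess = c("center", "scale"),
                   tuneGrid = data.frame(k = c(5, 10)),
                   trControl = trainControl(method = "cv"))

# Importance reported for the kNN fit (no model-specific method available)
varImp(knn_model, scale = FALSE)

# Model-free measure computed straight from predictors and outcome,
# without using the fitted model at all
imp_filter <- filterVarImp(x = data_set[, predictors],
                           y = data_set$price,
                           nonpara = TRUE)
imp_filter
```

Since the second call never sees the model, any method that falls back to it will report the same scores, which would explain the identical contributions.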