I'm trying to cross-validate with the OneR algorithm and I don't quite know how to do it. With the example code below I get the error "Error in x[0, , drop = FALSE] : incorrect number of dimensions":
glass <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data",
                  col.names=c("","RI","Na","Mg","Al","Si","K","Ca","Ba","Fe","Type"))
str(glass)
head(glass)
standard.features <- scale(glass[,2:10])
data <- cbind(standard.features,glass[11])
data$Type<-factor(data$Type)
anyNA(data)
inTraining <- createDataPartition(data$Type, p = .7, list = FALSE, times =1 )
training <- data[ inTraining,]
testing <- data[-inTraining,]
set.seed(12345)
fitControl <- trainControl(## 5-fold CV
                           method = "cv",
                           number = 5)
model <- OneR(Type~.,data= training)
oneRFit1 <- train(model,
trControl = fitControl)
It is fairly easy to write your own loop to carry out cross-validation. However, it looks like you want to use the caret package to manage it. If so, just use the method argument inside caret's train function to specify that you want to use OneR:
oneRFit1 <- train(Type~.,
                  data = training,
                  method = "OneR",
                  trControl = fitControl)
str(iris)
head(iris)
set.seed(123)
inTraining <- createDataPartition(iris$Species, p = .7, list = FALSE, times =1 )
training <- iris[ inTraining,]
testing <- iris[-inTraining,]
set.seed(123)
train.control <- trainControl(method = "cv", number = 2)
# Train the model
oneRFit <- train(Species ~., data = training, method = "OneR",
trControl = train.control)
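A short follow-up sketch (assuming the fit above ran without errors): print the cross-validation results and check the model on the hold-out set.
oneRFit                                     # resampled accuracy and kappa from the 2-fold CV
pred <- predict(oneRFit, newdata = testing)
confusionMatrix(pred, testing$Species)      # performance on the 30% hold-out set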
I am working with the R programming language. I am trying to learn how to make a "confusion matrix" for multiclass variables (e.g. How to construct the confusion matrix for a multi class variable).
Suppose I generate some data and fit a decision tree model :
#load libraries
library(rpart)
library(caret)
#generate data
a <- rnorm(1000, 10, 10)
b <- rnorm(1000, 10, 5)
d <- rnorm(1000, 5, 10)
group_1 <- sample( LETTERS[1:3], 1000, replace=TRUE, prob=c(0.33,0.33,0.34) )
e = data.frame(a,b,d, group_1)
e$group_1 = as.factor(e$group_1)
#split data into train and test set
trainIndex <- createDataPartition(e$group_1, p = .8,
list = FALSE,
times = 1)
training <- e[trainIndex,]
test <- e[-trainIndex,]
fitControl <- trainControl(## 5-fold CV
                           method = "repeatedcv",
                           number = 5,
                           ## repeated once
                           repeats = 1)
#fit decision tree model
TreeFit <- train(group_1 ~ ., data = training,
method = "rpart2",
trControl = fitControl)
From here, I am able to store the results into a "confusion matrix":
pred <- predict(TreeFit,test)
table_example <- table(pred,test$group_1)
This satisfies my requirements - but this "table" requires me to manually calculate the different accuracy metrics of "A", "B" and "C" (as well as the total accuracy).
My question: Is it possible to use the caret::confusionMatrix() command for this problem?
e.g.
pred <- predict(TreeFit, test, type = "prob")
labels_example <- as.factor(ifelse(pred[,2]>0.5, "1", "0"))
con <- confusionMatrix(labels_example, test$group_1)
This way, I would be able to directly access the accuracy measurements from the confusion matrix. E.g. metric = con$overall[1]
Thanks
Is this what you're looking for?
pred <- predict(TreeFit, test)
# note: confusionMatrix(data, reference) documents predictions first;
# swapping the order transposes the table but gives the same overall accuracy
con <- confusionMatrix(test$group_1, pred)
con
con$overall[1]
Same output as in:
table(test$group_1, pred)
Plus accuracy metrics.
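If the goal is the per-class numbers for "A", "B" and "C" without computing them by hand, they are already stored in the confusionMatrix object; a short sketch (assuming the con object above):
con$overall["Accuracy"]             # overall accuracy
con$byClass                         # per-class sensitivity, specificity, etc.
con$byClass[, "Balanced Accuracy"]  # e.g. per-class balanced accuracy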
I am trying to train a couple of ML models using the trainControl and train functions in Caret but I am always getting the same error which just says:
Error: Stopping
without giving any more details.
The problem is the same for gbm and ranger, so I am wondering if it has something to do with a conflict of the packages I am also using in my code.
library(ggplot2)
library(lattice)
library(caret)
library(rlang)
library(tidyverse)
library(Matrix)
library(glmnet)
library(iterators)
library(parallel)
library(doParallel) # parallel processing.
registerDoParallel(cores=16)
library(randomForest)
library(gbm)
library(ranger)
library(data.table)
library(smooth)
data<- data.frame(A=seq(as.Date("2019-01-01"), by=1, len=100),B=as.numeric(runif(100, 50, 150)),C=as.numeric(runif(100, 50, 150)))
# define data sets
data_training<-data[1:60,]
data_test<-data[(60+1):nrow(data),]
# creating sampling seeds
set.seed(123)
n=nrow(data_training)
tuneLength.num <- 5
seeds <- vector(mode = "list", length = n) # creates an empty vector containing lists
for(i in 1:(n-1)){ # choose tuneLength.num random samples from 1 to 1000
seeds[[i]] <- sample.int(1000, tuneLength.num)
}
# For the last model:
seeds[[n]] <- sample.int(1000, 10)
# Define TimeControl for training and fitting:
trainingTimeControl <- trainControl(method = "timeslice",
initialWindow = 25,
horizon = 1,
fixedWindow = TRUE,
returnResamp="all",
allowParallel = TRUE,
seeds = seeds,
savePredictions = TRUE)
gbm.mod<- caret::train(B ~.- A,
data = data_training,
method = "gbm",
distribution = "gaussian",
trControl = trainingTimeControl,
tuneLength=tuneLength.num,
metric="RMSE")
EDIT: The following code works just fine:
gbm<-gbm(formula = B ~ . - A,
distribution = "gaussian",
data = data_training,
keep.data = TRUE)
It would be great if anyone has an idea what is going on here. The code is working fine with svmRadial.
I am trying to investigate my model in R with machine learning. In general, training the model does not work well.
# Logistic regression, multiclass
for (i in 1:30) {
# split data into training/test
trainPhyIndex <- createDataPartition(subs_phy$Methane, p=10/17,list = FALSE)
trainingPhy <- subs_phy[trainPhyIndex,]
testingPhy <- subs_phy[-trainPhyIndex,]
# Pre-process predictor values
trainXphy <- trainingPhy[,names(trainingPhy)!= "Methane"]
preProcValuesPhy <- preProcess(x= trainXphy,method = c("center","scale"))
# using repeated cross-validation to avoid over-fitting
fitControlPhyGLMNET <- trainControl(method = "repeatedcv",
number = 10,
repeats = 4,
savePredictions="final",
classProbs = TRUE
)
fit_glmnet_phy <- train (Methane~.,
trainingPhy,
method = "glmnet",
tuneGrid = expand.grid(
.alpha =0.1,
.lambda = 0.00023),
metric = "Accuracy",
trControl = fitControlPhyGLMNET)
pred_glmnet_phy <- predict(fit_glmnet_phy, testingPhy)
# Get the confusion matrix to see accuracy value
u <- union(pred_glmnet_phy,testingPhy$Methane)
t <- table(factor(pred_glmnet_phy, u), factor(testingPhy$Methane, u))
accu_glmnet_phy <- confusionMatrix(t)
# accu_glmnet_phy<-confusionMatrix(pred_glmnet_phy,testingPhy$Methane)
glmnetstatsPhy[(nrow(glmnetstatsPhy)+1),] = accu_glmnet_phy$overall
}
glmnetstatsPhy
The program always stops at the fit_glmnet_phy <- train(Methane ~ ., ...) command and shows:
Metric Accuracy not applicable for regression models
I have no idea what this error means.
I also attached the type of the Methane variable (screenshot).
Try normalizing the input columns and mapping the output column to a factor. This helped me resolve a similar issue.
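A minimal sketch of that suggestion, assuming subs_phy and its Methane column exist as in the question; converting the outcome to a factor is what makes caret treat this as classification instead of regression:
# convert the outcome to a factor with syntactically valid level names
# (valid names are required when classProbs = TRUE); do this before splitting
subs_phy$Methane <- factor(make.names(subs_phy$Methane))

# let train() center and scale the predictors instead of a separate preProcess step
fit_glmnet_phy <- train(Methane ~ .,
                        data = trainingPhy,   # training split built from the converted data
                        method = "glmnet",
                        preProcess = c("center", "scale"),
                        metric = "Accuracy",
                        trControl = fitControlPhyGLMNET)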
So I have this assignment where I have to create 3 different models (in R). I can do them individually without a problem. However, I want to take it a step further and create a function that trains all of them with a for loop. (I know I could create a function that trains the 3 models one by one. I am not looking for other solutions to the problem; I want to do it this way, or in a similar fashion, because right now I have 3 models, but imagine if I wanted to train 20!)
I tried creating a list to store all three models, but I keep getting warnings.
library(caret)
library(readr)
library(rstudioapi)
library(e1071)
library(dplyr)
library(rpart)
TrainingFunction <- function(method,formula,data,tune) {
fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
if(method == "rf") {Model <- train(formula, data = data,method = method, trcontrol = fitcontrol , tunelenght = tune)}
else if (method == "knn"){
preObj <- preProcess(data[, c(13,14,15)], method=c("center", "scale"))
data <- predict(preObj, data)
Model <- train(formula, data = data,method = method, trcontrol = fitcontrol , tunelenght = tune)
}
else if (method == "svm"){Model <- svm(formula, data = data,cost=1000 , gamma = 0.001)}
Model
}
So this is a training function I created, and it works, but now I want to train all three at once !
So I tried this:
methods <- c("rf","knn","svm")
Models <- vector(mode = "list" , length = length(methods))
for(i in 1:length(methods))
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
This are the warnings :
Warning messages:
1: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
2: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
3: In svm.default(x, y, scale = scale, ..., na.action = na.action) :
Variable(s) ‘ProductType.GameConsole’ constant. Cannot scale data.
4: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
When I do Models the output is this :
[[1]]
[1] "rf"
[[2]]
[1] "knn"
[[3]]
svm(formula = formula, data = data, cost = 1000, gamma = 0.001)
Consider switch to avoid the many if/else branches, especially if extending to 20 models. Then use lapply to build a list without initialization or iterative assignment:
TrainingFunction <- function(method, formula, data, tune) {
  fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
  Model <- switch(method,
    "rf" = train(formula, data = data, method = method,
                 trControl = fitcontrol, tuneLength = tune),
    "knn" = {
      preObj <- preProcess(data[, c(13, 14, 15)],
                           method = c("center", "scale"))
      data <- predict(preObj, data)
      train(formula, data = data, method = method,
            trControl = fitcontrol, tuneLength = tune)
    },
    "svm" = svm(formula, data = data, cost = 1000, gamma = 0.001)
  )
  Model
}
methods <- c("rf","knn","svm")
Model_list <- lapply(methods, function(m)
  TrainingFunction(m, Volume ~ ., List$trainingSet, 5))
I think the problem comes from this line:
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
If you want to assign your model to the i-th place of the list, you should do it with a double bracket, like this:
{Models[[i]] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
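A tiny illustration of the difference with a toy list (not the models themselves), which reproduces the warning:
x <- vector(mode = "list", length = 3)
x[1]  <- list(a = 1, b = 2)   # warning: number of items to replace is not a multiple of replacement length
x[[1]] <- list(a = 1, b = 2)  # stores the whole object in the first slot, as intended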
Another alternative would be to use lapply instead of an explicit loop, so you avoid that problem altogether:
train_from_method <- function(method) {TrainingFunction(method, Volume ~ ., List$trainingSet, 5)}
Models <- lapply(methods, train_from_method)
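As a small follow-up (assuming the same methods vector), naming the list lets you pull out each fit by method:
names(Models) <- methods
Models[["rf"]]   # the random forest fit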
I am doing a stack of models in R as follows:
ctrl <- trainControl(method="repeatedcv", number=5, repeats=3, returnResamp="final", savePredictions="final", classProbs=TRUE, selectionFunction="oneSE", verboseIter=TRUE)
models_stack <- caretStack(
model_list,
data=train_data,
tuneLength=10,
method="glmnet",
metric="ROC",
trControl=ctrl
)
1) Why am I seeing the following error? What can I do? I am stuck now.
Timing stopped at: 0.89 0.005 0.91
Error in (function (x, y, family = c("gaussian", "binomial", "poisson", : unused argument (data = list(c(-0.00891097103286995, 0.455282701499392, 0.278236211515583, 0.532932725880776, 0.511036607368827, 0.688757947257125, -0.560727863490874, -0.21768155316146, 0.642219917023467, 0.220363129901216, 0.591732278371339, 1.02850020403572, -1.02417799431585, 0.806359545011601, -1.21490317454699, -0.671361009441299, 0.927344615788642, -0.10449847318776, 0.595493217624868, -1.05586363903119, -0.138457794869817, -1.026253562838, -1.38264471633224, -1.32900800143341, 0.0383617314263342, -0.82222313323842, -0.644251885665736, -0.174126438952992, 0.323934240274895, -0.124613523895458, 0.299359713721601, -0.723599218327519, -0.156528054435544, -0.76193093842169, 0.863217455799044, -1.01340448660914, -0.314365383747751, 1.19150804114605, 0.314703439577839, 1.55580594654149, -0.582911462615421, -0.515291378382375, 0.305142268138296, 0.513989405541095, -1.85093305614114, 0.436468060668601, -2.18997828727424, 1.12838871469007, -1.17619542016998, -0.218175589380355
2) Is there not supposed to be a "data" parameter? If I need to use a different dataset for my level-1 supervisor model, what can I do?
3) Also I wanted to use AUC/ROC but got these errors
The metric "AUC" was not in the result set. Accuracy will be used instead.
and
The metric "ROC" was not in the result set. Accuracy will be used instead.
I saw some online examples where ROC is used, so is it just not available for this model? What metrics can I use besides Accuracy for this model, and if I need ROC, what are my options?
As requested by @RLave, this is how my model_list is created:
grid.xgboost <- expand.grid(.nrounds=c(40,50,60),.eta=c(0.2,0.3,0.4),
.gamma=c(0,1),.max_depth=c(2,3,4),.colsample_bytree=c(0.8),
.subsample=c(1),.min_child_weight=c(1))
grid.rf <- expand.grid(.mtry=3:6)
model_list <- caretList(y ~.,
data=train_data_0,
trControl=ctrl,
tuneList=list(
xgbTree=caretModelSpec(method="xgbTree", tuneGrid=grid.xgboost),
rf=caretModelSpec(method="rf", tuneGrid=grid.rf)
)
)
My train_data_0 and train_data both come from the same dataset. The predictors are all numeric and the label is binary.
Your question contains three questions:
Why am I seeing the following error? What can I do? I am stuck now.
caretStack should not have a data parameter; the data is generated from the predictions of the models in caretList. Take a look at this reproducible example:
library(caret)
library(caretEnsemble)
library(mlbench)
using the Sonar data set:
data(Sonar)
create grid for hyperparameter tuning for xgboost:
grid.xgboost <- expand.grid(.nrounds = c(40, 50, 60),
.eta = c(0.2, 0.3, 0.4),
.gamma = c(0, 1),
.max_depth = c(2, 3, 4),
.colsample_bytree = c(0.8),
.subsample = c(1),
.min_child_weight = c(1))
create grid for rf tune:
grid.rf <- expand.grid(.mtry = 3:6)
create train control:
ctrl <- trainControl(method="cv",
number=5,
returnResamp = "final",
savePredictions = "final",
classProbs = TRUE,
selectionFunction = "oneSE",
verboseIter = TRUE,
summaryFunction = twoClassSummary)
tune the models:
model_list <- caretList(Class ~.,
data = Sonar,
trControl = ctrl,
tuneList = list(
xgbTree = caretModelSpec(method="xgbTree",
tuneGrid = grid.xgboost),
rf = caretModelSpec(method = "rf",
tuneGrid = grid.rf))
)
create the stacked ensemble:
models_stack <- caretStack(
model_list,
tuneLength = 10,
method ="glmnet",
metric = "ROC",
trControl = ctrl
)
2) Is there not supposed to be a "data" parameter? If I need to use a different dataset for my level-1 supervisor model, what can I do?
caretStack needs only the predictions from the base models. To create an ensemble of models trained on different data, you must create a new caretList with the appropriate data specified there (see the sketch below).
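A minimal sketch of that idea, assuming a hypothetical second training set train_data_1 with the same outcome y and predictors:
# base models retrained on the other data set
model_list_2 <- caretList(y ~ .,
                          data = train_data_1,   # hypothetical second training set
                          trControl = ctrl,
                          tuneList = list(
                            xgbTree = caretModelSpec(method = "xgbTree", tuneGrid = grid.xgboost),
                            rf = caretModelSpec(method = "rf", tuneGrid = grid.rf)))

# the stack is built from these base models; no data argument is passed to caretStack
models_stack_2 <- caretStack(model_list_2,
                             method = "glmnet",
                             metric = "ROC",
                             trControl = ctrl)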
3) Also I wanted to use AUC/ROC but got these errors
The easiest way to use AUC as the metric is to set summaryFunction = twoClassSummary in trainControl.
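With twoClassSummary in place (as in the ctrl above), ROC, sensitivity and specificity are reported during resampling; a short usage sketch, assuming the Sonar example above and a recent caretEnsemble version:
print(models_stack)   # resampled ROC of the glmnet meta-model

# class probabilities from the stacked ensemble
stack_probs <- predict(models_stack, newdata = Sonar, type = "prob")
head(stack_probs)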