I'm getting an error while trying to train a model with the caret package. The error is the following... Error in train.default(x, y, weights = w, ...) : Stopping. I also get warnings(), which are all the same, because I'm creating the tuneGrid object with the following code... grid <- expand.grid(cp = seq(0, 0.05, 0.005)). This creates a data.frame with 11 rows, which correspond to the 11 warnings. Here is the warning... In eval(expr, envir, enclos) :
model fit failed for Fold01: cp=0 Error in `[.data.frame`(m, labs) : undefined columns selected. It looks as if cp is empty, but I can go to my environment and see the grid object with all 11 rows. I have searched Stack Overflow and found similar questions, but since these functions can be tweaked in so many ways, I haven't found one that fixes my problem.
Here is my code...
require(rpart)
require(rattle)
require(rpart.plot)
require(caret)
setwd('~/Documents/Lipscomb/predictive_analytics/class4/')
data <- read.csv(file = 'data.csv',
head = FALSE)
data <- subset(data, select = -V1)
colnames(data) <- c('diagnostic', 'm.radius', 'm.texture', 'm. perimeter', 'm.area', 'm.smoothness', 'm.compactness', 'm.concavity', 'm.concave.points', 'm.symmetry', 'm.fractal.dimension',
'se.radius', 'se.texture', 'se. perimeter', 'se.area', 'se.smoothness', 'se.copactness', 'se.concavity', 'se.concave.points', 'se.symmetry', 'se.fractal.dimension',
'w.radius', 'w.texture', 'w. perimeter', 'w.area', 'w.smoothness', 'w.copactness', 'w.concavity', 'w.concave.points', 'w.symmetry', 'w.fractal.dimension')
str(data)
set.seed(7)
sample.train <- sample(1:nrow(data), nrow(data) * .8)
sample.test <- setdiff(1:nrow(data), sample.train)
data.train <- data[sample.train, ]
data.test <- subset(data[sample.test, ], select = -diagnostic)
rpart.tree <- rpart(diagnostic ~ ., data = data.train)
out <- predict(rpart.tree, data.test, type = 'class')
table(out, data[sample.test, ]$diagnostic)
fancyRpartPlot(rpart.tree)
temp <- rpart.control(xval = 10, minbucket = 2, minsplit = 4, cp = 0)
dfit <- rpart(diagnostic ~ ., data = data.train, control = temp)
fancyRpartPlot(dfit)
fit.control <- trainControl(method = 'cv', number = 10)
grid <- expand.grid(cp = seq(0, 0.05, 0.005))
trained.tree <- train(diagnostic ~ ., method = 'rpart', data = data.train,
metric = 'Accuracy', maximize = TRUE,
trControl = fit.control, tuneGrid = grid)
I have found a solution to this problem: I changed the column names. The original names include entries with an embedded space, such as 'm. perimeter', which appears to be what made the train function fail with "undefined columns selected". Renaming the columns without spaces fixed the problem.
colnames(data) <- c('diagnostic', 'radius', 'texture', 'perimeter', 'area', 'smoothness', 'compactness', 'concavity', 'concavePoints', 'symmetry', 'fractalDimension',
'SeRadius', 'SeTexture', 'SePerimeter', 'SeArea', 'SeSmoothness', 'SeCopactness', 'SeConcavity', 'SeConcavePoints', 'SeSymmetry', 'SeFractalDimension',
'Wradius', 'Wtexture', 'Wperimeter', 'Warea', 'Wsmoothness', 'Wcopactness', 'Wconcavity', 'WconcavePoints', 'Wsymmetry', 'WfractalDimension')
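As a minimal sketch of an alternative to retyping every name, the original names can be cleaned programmatically. The short vector below is only an illustrative subset of the names in the question; make.names and gsub are base R.

```r
# Illustrative subset of the original names, including the embedded spaces.
bad_names <- c("diagnostic", "m.radius", "m. perimeter", "se. perimeter")

# Option 1: make.names() replaces invalid characters (here the space) with ".".
syntactic <- make.names(bad_names, unique = TRUE)
print(syntactic)

# Option 2: simply drop the spaces.
no_spaces <- gsub(" ", "", bad_names, fixed = TRUE)
print(no_spaces)

# With a real data frame one would then assign, e.g.:
# colnames(data) <- make.names(colnames(data), unique = TRUE)
```

Either result can be used in a formula such as diagnostic ~ . without quoting.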
Good evening, I am currently running a classification algorithm using the caret package. I'm using the upSample and downSample functions to take care of data imbalance. I've taken care of all the NA values, but I keep getting this message: "Error in model.frame.default(form = lost_client ~ SRC + Commission_Rate + :
variable lengths differ (found for 'SRC')"
The code for the dataset
clients4 <- clients[,-c(1:6,8,14,15,16,18,19,20,21,22,23,26,27,28,29,32,33,42,44,50,51,52,53,57, 60:62, 63:66,71, 73:75)]
clients4$lost_client <- as.factor(clients4$lost_client)
clients4$New_Client <- as.factor(clients4$New_Client)
clients4 <- clients4[complete.cases(clients4),]
set.seed(101)
Training <- createDataPartition(clients4$lost_client, p=.80)$Resample1
fitControl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)
glmgrid <- expand.grid(lambda=seq(0,1,.05), alpha=seq(0,1,.1))
rpartgrid <- expand.grid(maxdepth=1:20)
rfgrid <- expand.grid(mtry=1:14)
gbmgrid <- expand.grid(interaction.depth=1:5, n.trees=c(50,100,150,200,250), shrinkage=.1, n.minobsinnode=10)
svmgrid <- expand.grid(cost=seq(0,10, 0.05))
Training <- clients4[Training,]
clients5 <- clients4
clients5$lost_client[which(clients4$lost_client == 0)] = -1
TrainUp <- upSample(x=Training[,-2],
y=Training$lost_client)
TrainDown <- downSample(x=Training[,-2],
y=Training$lost_client)
This is the code for the model itself.
set.seed(3)
m2 <- train(lost_client~SRC+Commission_Rate+Line_of_Business+Pro_Rate+Pro_Increase+Premium+PrevWrittenPremium+PrevWrittenAgencyComm+Office_State+Non_Parent+Policy_Count+Cross_Sell_Prdcr+Provider_Type+num_months+Revenue+SIC_Industry_Code, data = TrainUp, method="rpart2",trControl=fitControl, tuneGrid=rpartgrid, num.threads = 6)
pred3 <- predict(m2, newdata=clients4[-Training,])
confusionMatrix(pred3, clients4[-Training,]$lost_client)
m2$bestTune
rpart.plot(m2$finalModel)
Any idea of what is causing this error?
I've tried to look at similar questions but can't figure out my problem.
I was already able to complete my analysis with random forest (using caret), tuning parameters separately. Now I'm trying to create a function that will perform my analysis all at once.
I created a function with two inputs, the dataset, and variable to be classified.
For now I'm using the iris dataset for simplicity.
RF <- function(data, classvariable) {
# Best mtry
trControl <- trainControl(method = "cv", number = 10,
search = "grid")
set.seed(1234)
tuneGrid <- expand.grid(.mtry = c(1: 3))
RF_mtry <- train(classvariable ~.,
data = dataset,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
ntree = 100)
print(RF_mtry)
mtry = 0
for (i in 1:nrow(RF_mtry$results)) {
if (RF_mtry$results[i,2] > mtry) mtry <-
RF_mtry$results[i,2]
}
trial_mtry <- c(1:3)
best_mtry <- trial_mtry[i]
best_mtry
}
Once I run the function
RF(data = iris, classvariable = Species)
I get the error
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected
I tried running the code outside a function, writing iris directly instead of dataset and Species instead of classvariable, and it works.
Previously I was getting the error
Error in model.frame.default(form = classvariable ~ ., data = trainingDataset, :
variable lengths differ (found for 'Sepal.Length')
Anybody have an idea why it does not work?
Thank you very much.
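One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: inside the function, the formula classvariable ~ . refers to a column literally named classvariable, which iris does not have, and the body uses dataset while the argument is called data. A minimal base R sketch of one workaround is to pass the column name as a string and build the formula with reformulate. The helper name build_model_frame is hypothetical, and model.frame stands in for the train call.

```r
# Hypothetical helper: builds "<classvariable> ~ ." from a column name given as a string.
build_model_frame <- function(data, classvariable) {
  f <- reformulate(".", response = classvariable)  # e.g. Species ~ .
  # model.frame() stands in for train(f, data = data, ...) in this sketch.
  model.frame(f, data = data)
}

mf <- build_model_frame(iris, "Species")
print(names(mf))
```

The same formula object f could be passed as the first argument of train, with data = data instead of data = dataset.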
So I have this assignment where I have to create 3 different models (in R). I can do them individually without a problem. However, I want to take it a step further and create a function that trains all of them with a for loop. (I know I could create a function that trains the 3 models each time. I am not looking for other solutions to the problem; I want to do it this way, or in a similar fashion, because now I have 3 models but imagine if I wanted to train 20!)
I tried creating a list to store all three models, but I keep getting some warnings.
library(caret)
library(readr)
library(rstudioapi)
library(e1071)
library(dplyr)
library(rpart)
TrainingFunction <- function(method,formula,data,tune) {
fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
if(method == "rf") {Model <- train(formula, data = data,method = method, trcontrol = fitcontrol , tunelenght = tune)}
else if (method == "knn"){
preObj <- preProcess(data[, c(13,14,15)], method=c("center", "scale"))
data <- predict(preObj, data)
Model <- train(formula, data = data,method = method, trcontrol = fitcontrol , tunelenght = tune)
}
else if (method == "svm"){Model <- svm(formula, data = data,cost=1000 , gamma = 0.001)}
Model
}
So this is the training function I created, and it works, but now I want to train all three models at once!
So I tried this:
methods <- c("rf","knn","svm")
Models <- vector(mode = "list" , length = length(methods))
for(i in 1:length(methods))
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
These are the warnings:
Warning messages:
1: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
2: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
3: In svm.default(x, y, scale = scale, ..., na.action = na.action) :
Variable(s) ‘ProductType.GameConsole’ constant. Cannot scale data.
4: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
When I do Models the output is this :
[[1]]
[1] "rf"
[[2]]
[1] "knn"
[[3]]
svm(formula = formula, data = data, cost = 1000, gamma = 0.001)
Consider switch to avoid the many if and else especially if extending to 20 models. Then use lapply to build a list without initialization or iterative assignment:
TrainingFunction <- function(method, formula, data, tune) {
  # caret's argument names are case-sensitive: trControl and tuneLength
  fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
  Model <- switch(method,
    "rf" = train(formula, data = data, method = method,
                 trControl = fitcontrol, tuneLength = tune),
    "knn" = {
      # center and scale the selected columns before fitting
      preObj <- preProcess(data[, c(13, 14, 15)],
                           method = c("center", "scale"))
      data <- predict(preObj, data)
      train(formula, data = data, method = method,
            trControl = fitcontrol, tuneLength = tune)
    },
    "svm" = svm(formula, data = data, cost = 1000, gamma = 0.001)
  )
  Model  # return the fitted model
}
methods <- c("rf","knn","svm")
Model_list <-lapply(methods, function(m)
TrainingFunction(m, Volume~., List$trainingSet, 5))
I think the problem comes from this line:
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
If you want to assign your model to the i-th place of the list, you should do it with a double bracket, like this:
{Models[[i]] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
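The difference between the two forms can be reproduced in base R without training anything. In this sketch, fake_model is a made-up stand-in for a fitted model object (a list whose first element is the method name), which is consistent with the "rf" and "knn" entries in the output shown in the question.

```r
# Made-up stand-in for a fitted model: a list with several components.
fake_model <- list(method = "rf", results = data.frame(Accuracy = 0.9))

models <- vector(mode = "list", length = 2)

# Single bracket: R tries to spread the 2 components over 1 slot, warns
# "number of items to replace is not a multiple of replacement length",
# and keeps only the first component.
suppressWarnings(models[1] <- fake_model)

# Double bracket: the whole object is stored in the slot.
models[[2]] <- fake_model

str(models, max.level = 1)
```

Slot 1 ends up holding just the string "rf", while slot 2 holds the complete object.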
Another alternative would be use lapply instead of an explicit loop, so you avoid that problem altogether:
train_from_method <- function(method) {TrainingFunction(method, Volume~., List$trainingSet, 5)}
Models <- lapply(methods, train_from_method)
I get the error Error: nrow(x) == n is not TRUE when using caret. Can anyone help? See the example below.
Get Wisconsin Breast Cancer dataset and load required libraries
library(dplyr)
library(caret)
library(mlbench)
data(BreastCancer)
myControl = trainControl(
method = "cv", number = 5,
repeats = 5, verboseIter = TRUE
)
breast_cancer_y <- BreastCancer %>%
dplyr::select(Class)
breast_cancer_x <- BreastCancer %>%
dplyr::select(-Class)
Also apply median imputation to model for missing data
model <- train(
x = breast_cancer_x,
y = breast_cancer_y,
method = "glmnet",
trControl = myControl,
preProcess = "medianImpute"
)
Getting this error -
Error: nrow(x) == n is not TRUE
Here, breast_cancer_y is a data.frame. Consider using y = breast_cancer_y$Class in the train function.
You also have a character column in breast_cancer_x: the Id column.
I still get other error messages after fixing these; they appear to be related to pre-processing and control.
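The data.frame versus vector distinction can be illustrated in base R. This is a sketch with a made-up three-row data frame, not the BreastCancer data; the link to the nrow(x) == n check is an assumption based on the fact that the length of a data.frame is its number of columns, not its number of rows.

```r
# Made-up stand-in for the outcome column.
df <- data.frame(Class = factor(c("benign", "malignant", "benign")))

y_df  <- df["Class"]   # one-column data.frame, like the result of dplyr::select(Class)
y_vec <- df$Class      # plain factor vector

print(length(y_df))    # number of columns
print(length(y_vec))   # number of observations
print(class(y_vec))
```

Passing the factor vector gives train one outcome value per row of x.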
I'm getting an error when using confusionMatrix, but I don't understand what it is telling me:
Error in sort.list(y) : 'x' must be atomic
Below are the steps I took:
# Cleaning the environment
rm(list = ls())
# Reading the data
hr <- read.csv('C:/HR_comma_sep.csv')
View(hr)
# load library randomForest
library(randomForest)
# Attaching the data
attach(hr)
head(hr)
left <- as.data.frame(unlist(hr$left))
set.seed(27)
rf <- randomForest(left~., data = hr, mtry = 4, ntree = 500, importance = T)
plot(rf)
varImpPlot(rf, sort = T, main = "Variable Importance", n.var = 5)
hr$predicted.response <- predict(rf, left)
library(lattice)
library(ggplot2)
library(caret)
library(e1071)
# Confusion matrix for accuracy calculation
confusionMatrix(data = hr$predicted.response, reference = left,positive ='yes')
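One plausible reading of the message, stated as an assumption: left was turned into a data.frame with as.data.frame, and a data.frame is a list rather than an atomic vector. A minimal base R sketch with made-up values shows the difference and builds a confusion table from two factors that share the same levels.

```r
# Made-up values standing in for hr$left and the predictions.
truth_df <- as.data.frame(c(0, 1, 1, 0))
print(is.atomic(truth_df))      # a data.frame is a list, not an atomic vector

# Extract the column and convert both inputs to factors with identical levels.
truth <- factor(truth_df[[1]], levels = c(0, 1))
pred  <- factor(c(0, 1, 0, 0), levels = c(0, 1))

tab <- table(Predicted = pred, Actual = truth)
print(tab)

accuracy <- sum(diag(tab)) / sum(tab)
print(accuracy)
```

With factors like these, the positive argument of confusionMatrix would need to name one of the actual levels (here "0" or "1") rather than 'yes'.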