I am getting the error "Error: nrow(x) == n is not TRUE" when using caret. Can anyone help? See the example below.
Get Wisconsin Breast Cancer dataset and load required libraries
library(dplyr)
library(caret)
library(mlbench)
data(BreastCancer)
myControl = trainControl(
method = "cv", number = 5,
repeats = 5, verboseIter = TRUE
)
breast_cancer_y <- BreastCancer %>%
dplyr::select(Class)
breast_cancer_x <- BreastCancer %>%
dplyr::select(-Class)
Also apply median imputation for missing data when fitting the model:
model <- train(
x = breast_cancer_x,
y = breast_cancer_y,
method = "glmnet",
trControl = myControl,
preProcess = "medianImpute"
)
This produces the error:
Error: nrow(x) == n is not TRUE
Here, breast_cancer_y is a data.frame. Consider using y = breast_cancer_y$Class in the train function.
You also have a character column in breast_cancer_x: the Id column.
I am still getting some other error messages after fixing these. These appear related to pre-processing and control.
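To make the first point concrete, here is a minimal base R sketch. It assumes, as the error message suggests, that train compares the number of rows of x with the length of y. iris stands in for BreastCancer, and the Id column is added only for illustration.

```r
# Base R only; iris stands in for BreastCancer
x <- iris[, 1:4]
x$Id <- as.character(seq_len(nrow(x)))  # illustrative character column, like Id

y_df  <- iris["Species"]  # one-column data.frame, like dplyr::select(Class)
y_vec <- iris$Species     # plain factor vector, like breast_cancer_y$Class

length(y_df)              # length of a data.frame is its number of columns
length(y_vec)             # length of a vector is the number of observations
nrow(x) == length(y_df)   # the comparison fails for the data.frame
nrow(x) == length(y_vec)  # and holds for the vector

# List the predictors that are not numeric (candidates to drop or recode)
non_numeric <- names(x)[!vapply(x, is.numeric, logical(1))]
non_numeric
```

The same vapply check can be run on breast_cancer_x to see which columns need to be dropped or converted before they are passed to glmnet.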
Related
I have a data set in R on which I would like to fit a glmnet model. The y variable was originally numeric but takes only the values 0 and 1. If I convert it to a factor as follows
df_ML_1976[,names] <- lapply(df_ML_1976[,names] , factor)
and then apply glmnet after dividing into training and test set:
library("dplyr")
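# Keep only columns with at least one non-zero value. The result must be
# assigned, otherwise the filtered data frame is printed and discarded.
df_ML_1976 <- df_ML_1976 %>%
  select(where(~ any(. != 0)))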
#df_ML_1976 <- subset(df_ML_1976, select = -c(X))
library("caret")
default_idx = createDataPartition(df_ML_1976$y_tr4, p = 0.75, list = FALSE)
default_trn = df_ML_1976[default_idx, ]
default_tst = df_ML_1976[-default_idx, ]
## Fitting elasticnet:
cv_5 = trainControl(method = "cv", number = 5)
def_elnet = train(
y_tr4 ~ ., data = default_trn,
method = "glmnet",
trControl = cv_5
)
def_elnet
the following error occurs:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'drop': non-conformable arguments
which does not appear if I do not specify
df_ML_1976[,names] <- lapply(df_ML_1976[,names] , factor)
Why does this happen?
Thank you
I've tried to look at similar questions but can't figure out my problem.
I was already able to complete my analysis with random forest (using caret), tuning parameters separately. Now I'm trying to create a function that will perform my analysis all at once.
I created a function with two inputs, the dataset, and variable to be classified.
For now I'm using the iris dataset for simplicity.
RF <- function(data, classvariable) {
# Best mtry
trControl <- trainControl(method = "cv", number = 10,
search = "grid")
set.seed(1234)
tuneGrid <- expand.grid(.mtry = c(1: 3))
RF_mtry <- train(classvariable ~.,
data = dataset,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
ntree = 100)
print(RF_mtry)
# Pick the mtry with the highest cross-validated accuracy. The original loop
# tracked the best accuracy but always returned the last value of the grid.
best_row <- which.max(RF_mtry$results$Accuracy)
best_mtry <- RF_mtry$results$mtry[best_row]
best_mtry
}
Once I run the function
RF(data = iris, classvariable = Species)
I get the error
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected
I tried running the code without putting it in a function, writing iris directly instead of dataset and Species instead of classvariable, and it works.
Previously I was getting the error
Error in model.frame.default(form = classvariable ~ ., data = trainingDataset, :
variable lengths differ (found for 'Sepal.Length')
Does anybody have an idea why it does not work?
Thank you very much.
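One likely cause, offered as a sketch rather than a confirmed diagnosis: inside the function, classvariable ~ . is taken literally, so R looks for a column named classvariable instead of Species. Passing the column name as a string and building the formula with reformulate() from base R avoids this. build_formula is a hypothetical helper name, not part of caret.

```r
# Hypothetical helper: build "response ~ ." from a column name given as a string
build_formula <- function(data, classvariable) {
  stopifnot(is.character(classvariable), classvariable %in% names(data))
  reformulate(".", response = classvariable)
}

f <- build_formula(iris, "Species")
print(f)

# The response in the formula is now a real column of the data
all.vars(f)[1] %in% names(iris)

# The formula can be used with any formula interface, e.g. model.frame()
mf <- model.frame(f, data = iris)
dim(mf)
```

Inside RF, the call would then presumably become train(build_formula(data, classvariable), data = data, ...), and the function would be invoked as RF(data = iris, classvariable = "Species").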
I have an assignment where I have to create 3 different models in R. I can fit them individually without a problem. However, I want to take it a step further and create a function that trains all of them with a for loop. (I know I could write a function that trains the 3 models each time. I am not looking for other solutions to the problem; I want to do it this way, or in a similar fashion, because now I have 3 models, but imagine if I wanted to train 20!)
I tried creating a list to store all three models, but I keep getting warnings.
library(caret)
library(readr)
library(rstudioapi)
library(e1071)
library(dplyr)
library(rpart)
TrainingFunction <- function(method, formula, data, tune) {
  fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
  # Note: caret's argument names are trControl and tuneLength (case-sensitive)
  if (method == "rf") {
    Model <- train(formula, data = data, method = method,
                   trControl = fitcontrol, tuneLength = tune)
  } else if (method == "knn") {
    # Center and scale columns 13 to 15 before fitting k-nearest neighbours
    preObj <- preProcess(data[, c(13, 14, 15)], method = c("center", "scale"))
    data <- predict(preObj, data)
    Model <- train(formula, data = data, method = method,
                   trControl = fitcontrol, tuneLength = tune)
  } else if (method == "svm") {
    # e1071::svm is called directly here, not through caret
    Model <- svm(formula, data = data, cost = 1000, gamma = 0.001)
  }
  Model
}
This is the training function I created, and it works, but now I want to train all three at once!
So I tried this:
methods <- c("rf","knn","svm")
Models <- vector(mode = "list" , length = length(methods))
for(i in 1:length(methods))
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
These are the warnings:
Warning messages:
1: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
2: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
3: In svm.default(x, y, scale = scale, ..., na.action = na.action) :
Variable(s) ‘ProductType.GameConsole’ constant. Cannot scale data.
4: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
When I print Models, the output is:
[[1]]
[1] "rf"
[[2]]
[1] "knn"
[[3]]
svm(formula = formula, data = data, cost = 1000, gamma = 0.001)
Consider switch to avoid the many if and else branches, especially if extending to 20 models. Then use lapply to build the list without initialization or iterative assignment:
TrainingFunction <- function(method, formula, data, tune) {
  fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
  # The alternatives of switch() must be separated by commas
  Model <- switch(method,
    "rf" = train(formula, data = data, method = method,
                 trControl = fitcontrol, tuneLength = tune),
    "knn" = {
      preObj <- preProcess(data[, c(13, 14, 15)],
                           method = c("center", "scale"))
      data <- predict(preObj, data)
      train(formula, data = data, method = method,
            trControl = fitcontrol, tuneLength = tune)
    },
    "svm" = svm(formula, data = data, cost = 1000, gamma = 0.001)
  )
  Model
}
methods <- c("rf","knn","svm")
Model_list <-lapply(methods, function(m)
TrainingFunction(m, Volume~., List$trainingSet, 5))
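The switch plus lapply pattern can be tried without caret. The following is a minimal sketch in base R in which fit_one is a hypothetical stand-in for TrainingFunction, and lm and glm on mtcars stand in for the real models.

```r
# Hypothetical stand-in for TrainingFunction, using base R models on mtcars
fit_one <- function(method, formula, data) {
  switch(method,
    "lm"  = lm(formula, data = data),
    "glm" = glm(formula, data = data, family = gaussian()),
    stop("Unknown method: ", method)  # unnamed last argument is the default
  )
}

methods <- c("lm", "glm")
model_list <- lapply(methods, function(m) fit_one(m, mpg ~ wt, mtcars))
names(model_list) <- methods  # name each fit so it can be retrieved by method

class(model_list$lm)
class(model_list$glm)
```

Naming the list elements is optional, but it makes it easier to pick out one model later when the list grows to 20 entries.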
I think the problem comes from this line:
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
If you want to assign your model to the i-th place of the list, you should do it with a double bracket, like this:
{Models[[i]] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
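The difference between the two bracket forms can be reproduced in base R. This sketch uses an lm fit on mtcars as a stand-in for the caret models, since a fitted model is itself a list of components.

```r
fit <- lm(mpg ~ wt, data = mtcars)  # a model object is itself a list

# Single bracket: only the first component of the model is stored, and R warns
# that the number of items to replace is not a multiple of replacement length
bad <- vector(mode = "list", length = 1)
suppressWarnings(bad[1] <- fit)
class(bad[[1]])   # the coefficients vector, not a model

# Double bracket: the whole model object is stored in slot 1
good <- vector(mode = "list", length = 1)
good[[1]] <- fit
class(good[[1]])
```

This matches the output in the question, where each slot held only the first component of the corresponding model object.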
Another alternative is to use lapply instead of an explicit loop, which avoids the problem altogether:
train_from_method <- function(method) {TrainingFunction(method, Volume~., List$trainingSet, 5)}
Models <- lapply(methods, train_from_method)
I'm running an SVM in R with the caret package. My entire data frame (named total, which includes train and test) consists of numbers scaled from 0 to 1. My Y is binary (0/1). All the variables have class "num". Here is the code:
model_SVM <- train(
Y ~ ., training,
method = "svmPoly",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = TRUE
)
)
summary(model_SVM)
model_SVM
SVMprediction <-predict(model_SVM, testing)
cmSVM <-confusionMatrix(SVMprediction,testing$Y) # ERROR
print(SVMprediction)
I got this first error at the line marked # ERROR:
> cmSVM <-confusionMatrix(SVMprediction,testing$Y)
Error: `data` and `reference` should be factors with the same levels.
It was solved by adding:
SVMprediction<-as.factor(SVMprediction)
testing$Y<-as.factor(testing$Y)
However, I now get a second error at the same line (marked # ERROR 2 in the full code):
Error in confusionMatrix.default(SVMprediction, testing$Y) :
the data cannot have more levels than the reference
When I check the levels, SVMprediction has 361 levels and testing$Y has 2. How did SVMprediction get 361 levels if Y had just two?
Thanks!
PS: The full code:
library(dplyr)
library(mice)
library(caret)
totalY <- total
total <- total%>%
select(-Y)
# Missing Values with MICE
mod_mice <- mice(data = total, m = 5,meth='cart')
total <- complete(mod_mice)
post_mv_var_top10 <- total
Y <- totalY$Y
total<-cbind(total,Y)
train_ <- total %>%
  filter(!is.na(Y))
test_ <- total %>%
  filter(is.na(Y))
inTraining <- createDataPartition(train_$Y, p = .70, list = FALSE)
training <- train_[ inTraining,]
testing <- train_[-inTraining,]
# MODEL SVM
model_SVM <- train(
Y ~ ., training,
method = "svmPoly",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = TRUE
)
)
summary(model_SVM)
SVMprediction <-predict(model_SVM, testing)
SVMprediction<-as.factor(SVMprediction)
testing$Y<-as.factor(testing$Y)
cmSVM <-confusionMatrix(SVMprediction,testing$Y) # ERROR 2
print(cmSVM)
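A likely explanation, sketched under the assumption that Y was still numeric when train() was called: caret then treats the task as regression, so predict() returns continuous values, and as.factor() creates one level per distinct predicted value. The base R sketch below uses made-up numbers. The 0.5 cutoff is an assumption, not something the question specifies.

```r
set.seed(42)
truth <- c(0, 1, 1, 0, 1, 0, 0, 1)     # made-up binary outcome
pred_numeric <- runif(length(truth))   # stand-in for regression-style predictions

# as.factor() on continuous values gives one level per distinct value
nlevels(as.factor(pred_numeric))

# Workaround after the fact: threshold the predictions (assumed cutoff 0.5)
# and give predictions and reference the same two levels
pred_class  <- factor(ifelse(pred_numeric > 0.5, 1, 0), levels = c(0, 1))
truth_class <- factor(truth, levels = c(0, 1))
identical(levels(pred_class), levels(truth_class))

# Base R cross-tabulation of predictions against the reference
table(Prediction = pred_class, Reference = truth_class)
```

A cleaner fix is probably to convert Y to a factor before training, for example training$Y <- factor(training$Y), so that caret fits a classifier and predict() returns a factor with the same two levels.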
I'm getting an error while trying to train a model with the caret package. The error is:
Error in train.default(x, y, weights = w, ...) : Stopping
I also get warnings, all of the same form. I create the object for tuneGrid with
grid <- expand.grid(cp = seq(0, 0.05, 0.005))
which produces a data.frame with 11 rows that correspond to the 11 warnings I'm getting. Here is the warning:
In eval(expr, envir, enclos) :
model fit failed for Fold01: cp=0 Error in `[.data.frame`(m, labs) : undefined columns selected
It looks as if cp doesn't contain anything, yet I can see the grid object and all 11 rows in my environment. I have searched Stack Overflow and found similar questions, but since these functions can be tweaked in so many ways, I haven't found one that fixes my problem.
Here is my code...
require(rpart)
require(rattle)
require(rpart.plot)
require(caret)
setwd('~/Documents/Lipscomb/predictive_analytics/class4/')
data <- read.csv(file = 'data.csv',
                 header = FALSE)
data <- subset(data, select = -V1)
colnames(data) <- c('diagnostic', 'm.radius', 'm.texture', 'm. perimeter', 'm.area', 'm.smoothness', 'm.compactness', 'm.concavity', 'm.concave.points', 'm.symmetry', 'm.fractal.dimension',
'se.radius', 'se.texture', 'se. perimeter', 'se.area', 'se.smoothness', 'se.copactness', 'se.concavity', 'se.concave.points', 'se.symmetry', 'se.fractal.dimension',
'w.radius', 'w.texture', 'w. perimeter', 'w.area', 'w.smoothness', 'w.copactness', 'w.concavity', 'w.concave.points', 'w.symmetry', 'w.fractal.dimension')
str(data)
set.seed(7)
sample.train <- sample(1:nrow(data), nrow(data) * .8)
sample.test <- setdiff(1:nrow(data), sample.train)
data.train <- data[sample.train, ]
data.test <- subset(data[sample.test, ], select = -diagnostic)
rpart.tree <- rpart(diagnostic ~ ., data = data.train)
out <- predict(rpart.tree, data.test, type = 'class')
table(out, data[sample.test, ]$diagnostic)
fancyRpartPlot(rpart.tree)
temp <- rpart.control(xval = 10, minbucket = 2, minsplit = 4, cp = 0)
dfit <- rpart(diagnostic ~ ., data = data.train, control = temp)
fancyRpartPlot(dfit)
fit.control <- trainControl(method = 'cv', number = 10)
grid <- expand.grid(cp = seq(0, 0.05, 0.005))
trained.tree <- train(diagnostic ~ ., method = 'rpart', data = data.train,
metric = 'Accuracy', maximize = TRUE,
trControl = fit.control, tuneGrid = grid)
I have found a solution to this problem: I changed the column names. The original names caused the error in the train function. The likely culprit is the three names containing a space ('m. perimeter', 'se. perimeter', 'w. perimeter'), which are not syntactically valid R names. This code fixed the problem:
colnames(data) <- c('diagnostic', 'radius', 'texture', 'perimeter', 'area', 'smoothness', 'compactness', 'concavity', 'concavePoints', 'symmetry', 'fractalDimension',
'SeRadius', 'SeTexture', 'SePerimeter', 'SeArea', 'SeSmoothness', 'SeCopactness', 'SeConcavity', 'SeConcavePoints', 'SeSymmetry', 'SeFractalDimension',
'Wradius', 'Wtexture', 'Wperimeter', 'Warea', 'Wsmoothness', 'Wcopactness', 'Wconcavity', 'WconcavePoints', 'Wsymmetry', 'WfractalDimension')
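As an alternative to retyping all the names, base R's make.names() converts arbitrary strings into syntactically valid names. This is a minimal sketch on a few of the original names, not the fix the poster actually used.

```r
raw_names <- c('diagnostic', 'm.radius', 'm. perimeter', 'se. perimeter')

# Valid names come back unchanged; invalid characters such as spaces become "."
clean_names <- make.names(raw_names, unique = TRUE)
clean_names

# Which names had to be changed?
raw_names[raw_names != clean_names]
```

Applied to the data in the question, this would be colnames(data) <- make.names(colnames(data), unique = TRUE), run before calling train.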