I'm running an SVM in R with the caret package. My entire data frame (named total, which includes both train and test) consists of numbers scaled from 0 to 1. My Y is binary (0/1). All the variables have class "num". Here is the code:
model_SVM <- train(
  Y ~ ., training,
  method = "svmPoly",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE
  )
)
summary(model_SVM)
model_SVM
SVMprediction <- predict(model_SVM, testing)
cmSVM <- confusionMatrix(SVMprediction, testing$Y) # ERROR
print(SVMprediction)
I got this first error at the line marked # ERROR:
> cmSVM <- confusionMatrix(SVMprediction, testing$Y)
Error: `data` and `reference` should be factors with the same levels.
It was solved by adding:
SVMprediction <- as.factor(SVMprediction)
testing$Y <- as.factor(testing$Y)
However, I now get a second error at the line marked # ERROR 2:
Error in confusionMatrix.default(SVMprediction, testing$Y) :
  the data cannot have more levels than the reference
When I check the levels, SVMprediction has 361 levels and testing$Y has 2. How did SVMprediction end up with 361 levels if Y has just two?
Thanks!
PS: The full code:
totalY <- total
total <- total %>%
  select(-Y)
# Missing values with MICE
mod_mice <- mice(data = total, m = 5, meth = 'cart')
total <- complete(mod_mice)
post_mv_var_top10 <- total
Y <- totalY$Y
total <- cbind(total, Y)
train_ <- total %>%
  filter(!is.na(Y))
test_ <- total %>%
  filter(is.na(Y))
inTraining <- createDataPartition(train_$Y, p = .70, list = FALSE)
training <- train_[inTraining, ]
testing <- train_[-inTraining, ]
# MODEL SVM
model_SVM <- train(
  Y ~ ., training,
  method = "svmPoly",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE
  )
)
summary(model_SVM)
SVMprediction <- predict(model_SVM, testing)
SVMprediction <- as.factor(SVMprediction)
testing$Y <- as.factor(testing$Y)
cmSVM <- confusionMatrix(SVMprediction, testing$Y) # ERROR 2
print(cmSVM)
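For reference, a hedged hypothesis about the 361 levels: when Y has class "num", caret treats svmPoly as a regression problem, so predict() returns continuous values (roughly one distinct number per test row) rather than two classes. A minimal sketch of the fix under that assumption, converting Y to a factor before training:
training$Y <- as.factor(training$Y)
testing$Y <- as.factor(testing$Y)
model_SVM <- train(
  Y ~ ., training,
  method = "svmPoly",
  trControl = trainControl(method = "cv", number = 10, verboseIter = TRUE)
)
SVMprediction <- predict(model_SVM, testing) # now a factor with the same 2 levels as testing$Y
cmSVM <- confusionMatrix(SVMprediction, testing$Y)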
Good evening, I am currently running a classification algorithm using the caret package. I'm using the upSample and downSample functions to take care of data imbalance. I've taken care of all the NA values; however, I keep getting this message:
Error in model.frame.default(form = lost_client ~ SRC + Commission_Rate + :
  variable lengths differ (found for 'SRC')
The code for the dataset:
clients4 <- clients[,-c(1:6,8,14,15,16,18,19,20,21,22,23,26,27,28,29,32,33,42,44,50,51,52,53,57, 60:62, 63:66,71, 73:75)]
clients4$lost_client <- as.factor(clients4$lost_client)
clients4$New_Client <- as.factor(clients4$New_Client)
clients4 <- clients4[complete.cases(clients4),]
set.seed(101)
Training <- createDataPartition(clients4$lost_client, p=.80)$Resample1
fitControl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)
glmgrid <- expand.grid(lambda=seq(0,1,.05), alpha=seq(0,1,.1))
rpartgrid <- expand.grid(maxdepth=1:20)
rfgrid <- expand.grid(mtry=1:14)
gbmgrid <- expand.grid(interaction.depth=1:5, n.trees=c(50,100,150,200,250), shrinkage=.1, n.minobsinnode=10)
svmgrid <- expand.grid(cost=seq(0,10, 0.05))
Training <- clients4[Training,]
clients5 <- clients4
clients5$lost_client[which(clients4$lost_client == 0)] = -1
TrainUp <- upSample(x=Training[,-2],
y=Training$lost_client)
TrainDown <- downSample(x=Training[,-2],
y=Training$lost_client)
This is the code for the model itself.
set.seed(3)
m2 <- train(lost_client ~ SRC + Commission_Rate + Line_of_Business + Pro_Rate + Pro_Increase +
              Premium + PrevWrittenPremium + PrevWrittenAgencyComm + Office_State + Non_Parent +
              Policy_Count + Cross_Sell_Prdcr + Provider_Type + num_months + Revenue + SIC_Industry_Code,
            data = TrainUp, method = "rpart2", trControl = fitControl,
            tuneGrid = rpartgrid, num.threads = 6)
pred3 <- predict(m2, newdata=clients4[-Training,])
confusionMatrix(pred3, clients4[-Training,]$lost_client)
m2$bestTune
rpart.plot(m2$finalModel)
Any idea what is causing this error?
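A hedged guess at the cause: upSample() and downSample() return the outcome in a column named Class unless yname is given, so lost_client no longer exists inside TrainUp, and the formula ends up resolving it somewhere else with a different length than the predictors. Keeping the original column name should sidestep that; a sketch under that assumption:
TrainUp <- upSample(x = Training[, -2],
                    y = Training$lost_client,
                    yname = "lost_client")
TrainDown <- downSample(x = Training[, -2],
                        y = Training$lost_client,
                        yname = "lost_client")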
So I have this assignment where I have to create 3 different models in R. I can do them individually without a problem. However, I want to take it a step further and create a function that trains all of them with a for loop. (I know I could write a function that trains the 3 models one call at a time. I am not looking for other solutions to the problem; I want to do it this way, or in a similar fashion, because right now I have 3 models, but imagine if I wanted to train 20!)
I tried creating a list to store all three models, but I keep getting some warnings.
library(caret)
library(readr)
library(rstudioapi)
library(e1071)
library(dplyr)
library(rpart)
TrainingFunction <- function(method, formula, data, tune) {
  fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
  if (method == "rf") {
    Model <- train(formula, data = data, method = method, trcontrol = fitcontrol, tunelenght = tune)
  } else if (method == "knn") {
    preObj <- preProcess(data[, c(13, 14, 15)], method = c("center", "scale"))
    data <- predict(preObj, data)
    Model <- train(formula, data = data, method = method, trcontrol = fitcontrol, tunelenght = tune)
  } else if (method == "svm") {
    Model <- svm(formula, data = data, cost = 1000, gamma = 0.001)
  }
  Model
}
So this is a training function I created, and it works, but now I want to train all three at once!
So I tried this:
methods <- c("rf","knn","svm")
Models <- vector(mode = "list" , length = length(methods))
for(i in 1:length(methods))
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
These are the warnings:
Warning messages:
1: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
2: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
3: In svm.default(x, y, scale = scale, ..., na.action = na.action) :
Variable(s) ‘ProductType.GameConsole’ constant. Cannot scale data.
4: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
When I print Models, the output is this:
[[1]]
[1] "rf"
[[2]]
[1] "knn"
[[3]]
svm(formula = formula, data = data, cost = 1000, gamma = 0.001)
Consider switch to avoid the many if/else branches, especially if you extend to 20 models. Then use lapply to build the list without initialization or iterative assignment:
TrainingFunction <- function(method, formula, data, tune) {
  fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
  Model <- switch(method,
    "rf" = train(formula, data = data, method = method,
                 trControl = fitcontrol, tuneLength = tune),
    "knn" = {
      preObj <- preProcess(data[, c(13, 14, 15)],
                           method = c("center", "scale"))
      data <- predict(preObj, data)
      train(formula, data = data, method = method,
            trControl = fitcontrol, tuneLength = tune)
    },
    "svm" = svm(formula, data = data, cost = 1000, gamma = 0.001)
  )
  Model
}
methods <- c("rf","knn","svm")
Model_list <-lapply(methods, function(m)
TrainingFunction(m, Volume~., List$trainingSet, 5))
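With lapply each element holds the full fitted object, so Model_list[[1]] is the rf fit, Model_list[[2]] the knn fit, and so on; adding names(Model_list) <- methods lets you retrieve them by name, e.g. Model_list[["knn"]].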
I think the problem comes from this line:
Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, 5)
If you want to assign your model to the i-th place of the list, you should do it with double brackets, like this:
Models[[i]] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, 5)
Another alternative would be to use lapply instead of an explicit loop, so you avoid that problem altogether:
train_from_method <- function(method) {
  TrainingFunction(method, Volume ~ ., List$trainingSet, 5)
}
Models <- lapply(methods, train_from_method)
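A quick illustration of why the single bracket warns (generic R, no caret needed): an lm fit is itself a list, so single-bracket assignment tries to splice the fit's components into one slot.
models <- vector("list", 2)
fit <- lm(dist ~ speed, cars)
models[1] <- fit   # warning: number of items to replace is not a multiple of replacement length
models[[1]] <- fit # stores the whole model in slot 1, as intended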
I want to extend RandomForest so that each leaf contains a naive Bayes regression instead of an average. As a first step, I tried using mob() to add a linear model, and I got the following error:
Error in root.matrix(crossprod(process)) : matrix is not positive semidefinite
Here is my code:
require(data.table)
require(party)
require(randomForest) # needed for randomForest() below
set.seed(123)
# car.data has no header row, so read it with header = FALSE
data1 <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data', header = FALSE)
colnames(data1) <- c("BuyingPrice", "Maintenance", "NumDoors", "NumPersons", "BootSpace", "Safety", "Condition")
# Split into Train and Validation sets
# Training Set : Validation Set = 70 : 30 (random)
set.seed(100)
train <- sample(nrow(data1), 0.7*nrow(data1), replace = FALSE)
TrainSet <- data1[train,]
ValidSet <- data1[-train,]
summary(TrainSet)
summary(ValidSet)
# Create a Random Forest model with default parameters
model1 <- randomForest(Condition ~ ., data = TrainSet, importance = TRUE)
model1
# Fine tuning parameters of Random Forest model
model2 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
model2
fmBH <- mob(Condition ~ BuyingPrice + Maintenance | NumDoors + NumPersons + BootSpace + Safety,
            data = TrainSet, model = linearModel)
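A hedged diagnostic sketch, assuming the root cause is that mob() with model = linearModel expects a numeric response and numeric regressors, while every column of car.data (including Condition) is read in as a factor:
str(TrainSet)                # every column shows up as a factor
sapply(TrainSet, is.numeric) # all FALSE, so linearModel has nothing numeric to fit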
In the code below I train an NN with cross-validation on the first 20,000 records in the dataset. The dataset contains 8 predictors.
First I split my data into 2 parts:
the first 20,000 rows (train set)
and the last 4,003 rows (out-of-sample test set)
I have done 2 runs:
run 1) a run with 3 predictors
run 2) a run with all 8 predictors (see code below)
Based on cross-validation within the 20,000 rows of the train set, the RMSE (for the optimal parameter setting) improves from 2.30 (run 1) to 2.11 (run 2).
However, when I test both models on the 4,003 rows of the out-of-sample test set, the RMSE improves only negligibly, from 2.64 (run 1) to 2.63 (run 2).
What can be concluded from these seemingly contradictory results?
Thanks!
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Chapter 7: Non-Linear Regression Models
### Required packages: AppliedPredictiveModeling, caret, doMC (optional),
### earth, kernlab, lattice, nnet
################################################################################
library(caret)
### Load the data
mydata <- read.csv(file="data.csv", header=TRUE, sep=",")
validatiex <- mydata[20001:24003,c(1:8)]
validatiey <- mydata[20001:24003,9]
mydata <- mydata[1:20000,]
x <- mydata[,c(1:8)]
y <- mydata[,9]
parti <- createDataPartition(y, times = 1, p=0.8, list = FALSE)
x_train <- x[parti,]
x_test <- x[-parti,]
y_train <- y[parti]
y_test <- y[-parti]
set.seed(100)
indx <- createFolds(y_train, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
## train neural net:
nnetGrid <- expand.grid(decay = c(.1),
                        size = c(5, 15, 30),
                        bag = FALSE)
set.seed(100)
nnetTune <- train(x = x_train, y = y_train,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 30 * (ncol(x_train) + 1) + 30 + 1,
                  maxit = 1000,
                  repeats = 25,
                  allowParallel = FALSE)
nnetTune
plot(nnetTune)
predictions <- predict(nnetTune, validatiex, type = "raw")
rmse <- sqrt(mean((validatiey - predictions)^2)) # out-of-sample RMSE
print(rmse)
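To set the two estimates side by side, a small follow-up sketch: the best cross-validated RMSE can be read off the train object and compared with the out-of-sample value computed above.
cv_rmse <- min(nnetTune$results$RMSE) # best resampled RMSE across the tuning grid
cat("CV RMSE:", cv_rmse, " out-of-sample RMSE:", rmse, "\n")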
I am getting the error Error: nrow(x) == n is not TRUE when I am using caret; can anyone help? See the example below.
Get the Wisconsin Breast Cancer dataset and load the required libraries:
library(dplyr)
library(caret)
library(mlbench)
data(BreastCancer)
myControl <- trainControl(
  method = "cv", number = 5,
  repeats = 5, verboseIter = TRUE
)
breast_cancer_y <- BreastCancer %>%
dplyr::select(Class)
breast_cancer_x <- BreastCancer %>%
dplyr::select(-Class)
I also apply median imputation for the missing data:
model <- train(
x = breast_cancer_x,
y = breast_cancer_y,
method = "glmnet",
trControl = myControl,
preProcess = "medianImpute"
)
Getting this error -
Error: nrow(x) == n is not TRUE
Here, breast_cancer_y is a data.frame, not a vector; consider using y = breast_cancer_y$Class in the train call.
You also have a character column in breast_cancer_x: the Id column.
I am still getting some other error messages after fixing these; they appear related to the pre-processing and the control object.
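A hedged sketch of a version that should get past these errors, assuming the goal is a glmnet fit on numeric predictors: Id is dropped, the factor columns are coerced to numeric, the outcome is passed as a factor vector, and repeats is removed since method = "cv" does not use it.
breast_cancer_x <- BreastCancer %>%
  dplyr::select(-Class, -Id) %>%
  dplyr::mutate(dplyr::across(dplyr::everything(), ~ as.numeric(as.character(.x))))
model <- train(
  x = breast_cancer_x,
  y = BreastCancer$Class,
  method = "glmnet",
  trControl = trainControl(method = "cv", number = 5, verboseIter = TRUE),
  preProcess = "medianImpute"
)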