I'm getting an error when using confusionMatrix and can't work out what it is telling me:
Error in sort.list(y) : 'x' must be atomic
Below are the steps I took:
# Cleaning the environment
rm(list = ls())
# Reading the data
hr <- read.csv('C:/HR_comma_sep.csv')
View(hr)
# load library randomForest
library(randomForest)
# Attaching the data
attach(hr)
head(hr)
left <- as.data.frame(unlist(hr$left))
set.seed(27)
rf <- randomForest(left~., data = hr, mtry = 4, ntree = 500, importance = T)
plot(rf)
varImpPlot(rf, sort = T, main = "Variable Importance", n.var = 5)
hr$predicted.response <- predict(rf, left)
library(lattice)
library(ggplot2)
library(caret)
library(e1071)
# Confusion matrix for accuracy calculation
confusionMatrix(data = hr$predicted.response, reference = left,positive ='yes')
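One likely cause, judging only from the code above: left is rebuilt as a data frame by as.data.frame(unlist(hr$left)), and a data frame is not an atomic vector, so it cannot be sorted into factor levels. The sketch below shows the difference on a made-up hr_toy data frame. The values, the made-up predictions, and the assumption that left is coded 0/1 are all illustrative, and the caret call is guarded so the block also runs without caret installed.

```r
# Toy stand-in for the HR data; the values are invented for illustration.
hr_toy <- data.frame(
  satisfaction = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41),
  left         = c(1, 0, 1, 0, 1, 0)
)

# What the question does: wrap the column in a data frame.
left_df <- as.data.frame(unlist(hr_toy$left))
is.atomic(left_df)    # FALSE: a data frame is a list of columns

# What a confusion matrix needs: two factors with the same levels.
reference <- factor(hr_toy$left, levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 0, 1, 0), levels = c(0, 1))  # made-up predictions
is.atomic(reference)  # TRUE

# Base R cross-tabulation of predictions against the truth.
conf_tab <- table(Predicted = predicted, Reference = reference)
print(conf_tab)

# With caret installed, the same two factors can be passed on directly.
# `positive` must be one of the factor levels ("1" here, not "yes").
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(data = predicted, reference = reference,
                               positive = "1"))
}
```

In the original script this would probably mean converting hr$left to a factor before fitting (so that randomForest does classification), predicting with predict(rf, hr) rather than predict(rf, left), and passing hr$left as the reference.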
Related
Good evening. I am currently running a classification algorithm using the caret package. I'm using the upSample and downSample functions to deal with class imbalance. I've taken care of all the NA values, but I keep getting this message: "Error in model.frame.default(form = lost_client ~ SRC + Commission_Rate + :
variable lengths differ (found for 'SRC')"
The code for the dataset
clients4 <- clients[,-c(1:6,8,14,15,16,18,19,20,21,22,23,26,27,28,29,32,33,42,44,50,51,52,53,57, 60:62, 63:66,71, 73:75)]
clients4$lost_client <- as.factor(clients4$lost_client)
clients4$New_Client <- as.factor(clients4$New_Client)
clients4 <- clients4[complete.cases(clients4),]
set.seed(101)
Training <- createDataPartition(clients4$lost_client, p=.80)$Resample1
fitControl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)
glmgrid <- expand.grid(lambda=seq(0,1,.05), alpha=seq(0,1,.1))
rpartgrid <- expand.grid(maxdepth=1:20)
rfgrid <- expand.grid(mtry=1:14)
gbmgrid <- expand.grid(interaction.depth=1:5, n.trees=c(50,100,150,200,250), shrinkage=.1, n.minobsinnode=10)
svmgrid <- expand.grid(cost=seq(0,10, 0.05))
Training <- clients4[Training,]
clients5 <- clients4
clients5$lost_client[which(clients4$lost_client == 0)] = -1
TrainUp <- upSample(x=Training[,-2],
y=Training$lost_client)
TrainDown <- downSample(x=Training[,-2],
y=Training$lost_client)
This is the code for the model itself.
set.seed(3)
m2 <- train(lost_client~SRC+Commission_Rate+Line_of_Business+Pro_Rate+Pro_Increase+Premium+PrevWrittenPremium+PrevWrittenAgencyComm+Office_State+Non_Parent+Policy_Count+Cross_Sell_Prdcr+Provider_Type+num_months+Revenue+SIC_Industry_Code, data = TrainUp, method="rpart2",trControl=fitControl, tuneGrid=rpartgrid, num.threads = 6)
pred3 <- predict(m2, newdata=clients4[-Training,])
confusionMatrix(pred3, clients4[-Training,]$lost_client)
m2$bestTune
rpart.plot(m2$finalModel)
Any idea of what is causing this error?
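Two things in the code above look like plausible causes, though this is a guess without the data. First, upSample() and downSample() name the outcome column Class unless yname is given, so lost_client would not exist in TrainUp. Second, Training is overwritten with a data frame, so clients4[-Training,] no longer indexes by row number. The sketch below uses an invented clients_toy data frame with hypothetical names train_idx, training and testing. The caret part is guarded so the block runs without caret installed.

```r
# Invented stand-in for clients4; column names mirror the question.
set.seed(101)
clients_toy <- data.frame(
  SRC         = rnorm(20),
  lost_client = factor(rep(c(0, 1), times = c(15, 5)))
)

# Keep the row indices and the training data under separate names,
# so the indices are still available for building the test set.
train_idx <- sample(nrow(clients_toy), size = 16)
training  <- clients_toy[train_idx, ]
testing   <- clients_toy[-train_idx, ]

outcome_col <- which(names(training) == "lost_client")

if (requireNamespace("caret", quietly = TRUE)) {
  # Default: the outcome column comes back named "Class".
  up_default <- caret::upSample(x = training[, -outcome_col, drop = FALSE],
                                y = training$lost_client)
  print(names(up_default))

  # With yname, the outcome keeps the name used in the model formula.
  up_named <- caret::upSample(x = training[, -outcome_col, drop = FALSE],
                              y = training$lost_client,
                              yname = "lost_client")
  print(names(up_named))
}
```

If the outcome column really is missing from TrainUp, the formula may be picking up a lost_client of a different length from elsewhere in the session, which would match the "variable lengths differ" message.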
# data splitting
set.seed(12345)
train <- sample(1:nrow(student.mat.pass.or.fail),size =
ceiling(0.80*nrow(student.mat.pass.or.fail)),replace = FALSE)
# training set
students_train <- student.mat.pass.or.fail[train,]
# test set
students_test <- student.mat.pass.or.fail[-train,]
# penalty matrix
penalty.matrix <- matrix(c(0,1,10,0), byrow=TRUE, nrow=2)
# building the classification tree with rpart
tree <- rpart(class~.,
data = students_train, # as.matrix(students_train)
parms = list(loss = penalty.matrix),
method = "class")
This gives the error "object is not a matrix". Can someone help me? I'm new to R. I also tried as.matrix(students_train), but it still shows the same problem.
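Without the student data it is hard to say where "object is not a matrix" comes from, so the block below is only a working reference point, not a diagnosis. It fits the same kind of tree on kyphosis, a small dataset bundled with rpart, in place of the student data. It relies on three things: data stays a data frame, the response is a factor, and the loss matrix is square with one row per class and zeros on the diagonal.

```r
library(rpart)  # ships with standard R installations

# kyphosis is a small example dataset bundled with rpart;
# its outcome `Kyphosis` is a factor with levels "absent" and "present".
data(kyphosis, package = "rpart")

# Rows are the true class, columns the predicted class, in level order.
# Here, predicting "absent" for a true "present" costs 10; the reverse costs 1.
penalty_matrix <- matrix(c(0,  1,
                           10, 0), byrow = TRUE, nrow = 2)

fit <- rpart(Kyphosis ~ Age + Number + Start,
             data   = kyphosis,      # a data frame, not a matrix
             parms  = list(loss = penalty_matrix),
             method = "class")

pred <- predict(fit, kyphosis, type = "class")
print(table(Predicted = pred, Actual = kyphosis$Kyphosis))
```

Comparing str(students_train) against this setup (factor response with two levels, ordinary atomic columns) may help narrow down what differs.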
I run PCA on my training split and, after removing irrelevant columns, project the test split onto the result.
data <- read.csv('bottom10.csv')
set.seed(1)
inTrain <- createDataPartition(data$cuisine, p = .8)[[1]]
dataTrain <- data[,-1][inTrain,][,-1]
dataTest <- data[,-1][-inTrain,][,-1]
cuisine.pca <- prcomp(dataTrain[,-1])
Then I extract the first 500 components and project the test dataset.
traincom <- cuisine.pca$x[,1:500]
testcom <- scale(dataTest[,-1], cuisine.pca$center) %*% cuisine.pca$rotation
Then I convert the labels to integers and combine the components and labels into xgb.DMatrix form.
label_train <- as.integer(dataTrain$cuisine) - 1
label_test <- as.integer(dataTest$cuisine) - 1
xgb_train <- xgb.DMatrix(data = traincom, label = label_train)
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
Then I build the xgboost model as
xgb.fit <- xgboost(cuisine~., data = xgb_train, nrounds = 40, num_class = 10, early_stopping_rounds = 5)
After I run this there is a warning, but training still runs.
xgboost: label will be ignored
I can predict on the training dataset using the model, but when I try to predict on the test dataset I get an error.
xgb_pred <- predict(xgb.fit, newdata = xgb_train)
sum(label_train == xgb_pred)/length(label_train)
xgb_pred <- predict(xgb.fit, newdata = xgb_test, rescale = T)
Error in predict.xgb.Booster(xgb.fit, newdata = xgb_test, rescale = T) :
Feature names stored in `object` and `newdata` are different!
Please let me know what I am doing wrong.
Regards
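A possible explanation, based only on the code shown: traincom keeps the first 500 components, while testcom multiplies by the full rotation matrix and so keeps every component, which leaves the two matrices with different columns. The call to scale() also leaves its scale argument at the default, which divides the test columns by their own standard deviations even though prcomp() did not scale the training data. The sketch below uses invented data and hypothetical names (features, n_comp, train_pcs, test_pcs) to show a projection that keeps training and test consistent.

```r
set.seed(1)
# Invented numeric data standing in for the ingredient columns.
features <- as.data.frame(matrix(rnorm(60 * 8), nrow = 60))
train_x  <- features[1:48, ]
test_x   <- features[49:60, ]

n_comp <- 3  # the question keeps 500; 3 is enough for a toy example

pca <- prcomp(train_x)  # centers by default, does not scale by default

# Training scores: keep the first n_comp components.
train_pcs <- pca$x[, 1:n_comp]

# Test scores: predict() applies the stored centering and rotation;
# keep the same components as for training.
test_pcs <- predict(pca, newdata = test_x)[, 1:n_comp]

# Manual equivalent. scale = FALSE matters: scale()'s default would also
# divide the test columns by their own standard deviations.
test_manual <- scale(as.matrix(test_x), center = pca$center, scale = FALSE) %*%
  pca$rotation[, 1:n_comp]

identical(colnames(train_pcs), colnames(test_pcs))        # TRUE
isTRUE(all.equal(unname(test_pcs), unname(test_manual)))  # TRUE
```

The "label will be ignored" warning is probably unrelated to the prediction error: the formula passed as the first argument may be getting matched to the label argument, which is not needed when the label is already stored in the xgb.DMatrix.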
I am getting the error "Error: nrow(x) == n is not TRUE" when using caret. Can anyone help? See the example below.
Get the Wisconsin Breast Cancer dataset and load the required libraries
library(dplyr)
library(caret)
library(mlbench)
data(BreastCancer)
myControl = trainControl(
method = "cv", number = 5,
repeats = 5, verboseIter = TRUE
)
breast_cancer_y <- BreastCancer %>%
dplyr::select(Class)
breast_cancer_x <- BreastCancer %>%
dplyr::select(-Class)
Also apply median imputation to model for missing data
model <- train(
x = breast_cancer_x,
y = breast_cancer_y,
method = "glmnet",
trControl = myControl,
preProcess = "medianImpute"
)
Getting this error -
Error: nrow(x) == n is not TRUE
Here, breast_cancer_y is a data.frame. Consider using y = breast_cancer_y$Class in the train function.
You also have a character column in breast_cancer_x: the Id column.
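The first point can be checked without caret. The error message suggests that train compares the number of rows of x with the length of y, and the length of a data frame is its number of columns. The sketch below uses an invented toy data frame in place of BreastCancer; its column names are borrowed from that dataset but the values are made up.

```r
# Invented stand-in for BreastCancer: two predictors and a factor outcome.
toy <- data.frame(
  Cl.thickness = c(5, 3, 8, 1, 4, 7),
  Cell.size    = c(1, 1, 10, 1, 2, 9),
  Class        = factor(c("benign", "benign", "malignant",
                          "benign", "benign", "malignant"))
)

x    <- toy[, c("Cl.thickness", "Cell.size")]
y_df <- toy["Class"]   # one-column data frame, like dplyr::select(Class)
y    <- toy$Class      # plain factor vector

# length() of a data frame is its number of columns, not rows,
# so a row-count check against it fails.
length(y_df)             # 1
nrow(x) == length(y_df)  # FALSE
nrow(x) == length(y)     # TRUE
```

With dplyr, pull(Class) returns the vector form directly, where select(Class) returns a data frame.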
I am still getting some other error messages after fixing these. These appear related to pre-processing and control.
I'm getting an error while trying to train a model with the caret package. The error is: Error in train.default(x, y, weights = w, ...) : Stopping. I also get 11 warnings, all identical, one for each row of the tuneGrid I create with grid <- expand.grid(cp = seq(0, 0.05, 0.005)), which produces a data.frame with 11 rows. Here is the warning: In eval(expr, envir, enclos) :
model fit failed for Fold01: cp=0 Error in `[.data.frame`(m, labs) : undefined columns selected. It looks as if cp doesn't contain anything, but I can see the grid object and all 11 rows in my environment. I have searched Stack Overflow and found similar questions, but since these functions can be tweaked in so many ways, I haven't found one that fixes my problem.
Here is my code...
require(rpart)
require(rattle)
require(rpart.plot)
require(caret)
setwd('~/Documents/Lipscomb/predictive_analytics/class4/')
data <- read.csv(file = 'data.csv',
head = FALSE)
data <- subset(data, select = -V1)
colnames(data) <- c('diagnostic', 'm.radius', 'm.texture', 'm. perimeter', 'm.area', 'm.smoothness', 'm.compactness', 'm.concavity', 'm.concave.points', 'm.symmetry', 'm.fractal.dimension',
'se.radius', 'se.texture', 'se. perimeter', 'se.area', 'se.smoothness', 'se.copactness', 'se.concavity', 'se.concave.points', 'se.symmetry', 'se.fractal.dimension',
'w.radius', 'w.texture', 'w. perimeter', 'w.area', 'w.smoothness', 'w.copactness', 'w.concavity', 'w.concave.points', 'w.symmetry', 'w.fractal.dimension')
str(data)
set.seed(7)
sample.train <- sample(1:nrow(data), nrow(data) * .8)
sample.test <- setdiff(1:nrow(data), sample.train)
data.train <- data[sample.train, ]
data.test <- subset(data[sample.test, ], select = -diagnostic)
rpart.tree <- rpart(diagnostic ~ ., data = data.train)
out <- predict(rpart.tree, data.test, type = 'class')
table(out, data[sample.test, ]$diagnostic)
fancyRpartPlot(rpart.tree)
temp <- rpart.control(xval = 10, minbucket = 2, minsplit = 4, cp = 0)
dfit <- rpart(diagnostic ~ ., data = data.train, control = temp)
fancyRpartPlot(dfit)
fit.control <- trainControl(method = 'cv', number = 10)
grid <- expand.grid(cp = seq(0, 0.05, 0.005))
trained.tree <- train(diagnostic ~ ., method = 'rpart', data = data.train,
metric = 'Accuracy', maximize = TRUE,
trControl = fit.control, tuneGrid = grid)
I have found a solution to this problem: I changed the way I was naming the columns. For some reason, the original column names caused an error in the train function. This code fixed the problem.
colnames(data) <- c('diagnostic', 'radius', 'texture', 'perimeter', 'area', 'smoothness', 'compactness', 'concavity', 'concavePoints', 'symmetry', 'fractalDimension',
'SeRadius', 'SeTexture', 'SePerimeter', 'SeArea', 'SeSmoothness', 'SeCopactness', 'SeConcavity', 'SeConcavePoints', 'SeSymmetry', 'SeFractalDimension',
'Wradius', 'Wtexture', 'Wperimeter', 'Warea', 'Wsmoothness', 'Wcopactness', 'Wconcavity', 'WconcavePoints', 'Wsymmetry', 'WfractalDimension')
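A probable reason the renaming helped: three of the original names ('m. perimeter', 'se. perimeter', 'w. perimeter') contain a space, so they are not syntactically valid R names, and the formula interface can fail to match such columns by name. The sketch below is one way to detect and repair names like these with base R, using three of the original names as a sample.

```r
# Three of the original names; note the space after the dot in the last one.
old_names <- c("diagnostic", "m.radius", "m. perimeter")

# A name is syntactically valid if make.names() leaves it unchanged.
is_valid <- make.names(old_names) == old_names
print(old_names[!is_valid])   # "m. perimeter"

# Option 1: let R repair the names (invalid characters become dots).
repaired <- make.names(old_names, unique = TRUE)
print(repaired)   # "diagnostic" "m.radius" "m..perimeter"

# Option 2: strip the spaces to keep the original naming scheme.
stripped <- gsub(" ", "", old_names, fixed = TRUE)
print(stripped)   # "diagnostic" "m.radius" "m.perimeter"
```

Either result could be assigned back with colnames(data) <- ... before calling train, which would keep the dotted naming scheme rather than switching to camel case.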