How to create a confusion matrix for a decision tree model - r

I am having some difficulties creating a confusion matrix to compare my model prediction to the actual values. My data set has 159 explanatory variables and my target is called "classe".
#Load packages
library(caret)
library(rpart)
#Load Data
df <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", na.strings=c("NA","#DIV/0!",""))
#Split into training and validation
index <- createDataPartition(df$classe, times=1, p=0.5)[[1]]
training <- df[index, ]
validation <- df[-index, ]
#Model
decisionTreeModel <- rpart(classe ~ ., data=training, method="class", cp=0.5)
#Predict
pred1 <- predict(decisionTreeModel, validation)
#Check model performance
confusionMatrix(validation$classe, pred1)
The following error message is generated from the code above:
Error in confusionMatrix.default(validation$classe, pred1) :
The data must contain some levels that overlap the reference.
I think it may have something to do with the pred1 object that the predict function generates: it's a matrix with 5 columns, while validation$classe is a factor with 5 levels. Any ideas on how to solve this?
Thanks in advance

Your call to predict is returning a matrix of class probabilities. If you want the predicted class (the "winner") instead, replace your predict line with this:
pred1 <- predict(decisionTreeModel, validation, type="class")
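For completeness, a minimal sketch of the corrected ending of the script (assuming caret and rpart are loaded and the training/validation split from the question; note that confusionMatrix() expects the prediction first and the reference second, both as factors):
# return the predicted class labels rather than a matrix of probabilities
pred1 <- predict(decisionTreeModel, validation, type="class")
confusionMatrix(pred1, factor(validation$classe))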

Related

How can I handle a confusionMatrix error when it says my Data is null

I am trying to run a random forests analysis in R and it works well when I fit the model and predict it on the test group but when I run the confusionMatrix it gives me the following error:
Error in table(data, reference, dnn = dnn, ...) : all arguments must have the same length
load the test and training data
trainData <- read.csv("./pml-training.csv")
testData <- read.csv("./pml-testing.csv")
dim(trainData)
dim(testData)
Data cleaning - here, variables with near-zero variance or that are almost always NA, as well as the columns containing summary statistics or irrelevant data, will be removed.
trainClean <- trainData[,colMeans(is.na(trainData))< .9]
trainClean <- trainData[,-c(1:7)]
nvz <- nearZeroVar(trainClean)
trainClean <- trainClean[,-nvz]
dim(trainClean)
Split the data into training (70%) and validation (30%)
inTrain <- createDataPartition(y=trainClean$classe, p=0.7, list=FALSE)
train <- trainClean[inTrain,]
valid <- trainClean[-inTrain,]
# Create a control for 3-fold cross-validation
control <- trainControl(method="cv", number=3, verboseIter = FALSE)
Building the models
Random Forests
# Fit the model on train using random forest
modFit <- train(classe~., data=train, method="rf", trControl=control, tuneLength=5, na.action=na.omit)
modFit
modPredict<- predict(modFit, valid, na.action=na.omit) # predict on the valid data set.
# Turn valid$classe into a factor and check it
valid$classe <- as.factor(valid$classe)
modCM <- confusionMatrix(modPredict, as.factor(valid$classe))
modCM
table(modPredict, valid$classe)
When I check the length of modPredict, it is 122, while length(valid$classe) is 5885. If I try dim() on modPredict, I get NULL. I have tried using na.action=na.omit in the prediction chunk, and I have also tried not using na.action=na.omit in either the prediction or the fit chunks.
I checked the test and valid data sets where I split the data using:
```length(train); length(valid); length(valid$classe); nrow(valid); nrow(train)```
The output is:
[1] 94
[1] 94
[1] 5885
[1] 5885
[1] 13737
I have been struggling with this problem and similar problems on my decision tree chunk as well. I don't want people to do my homework for me, but I could use a hint.
Thanks in advance
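A hint rather than a full solution: predict() with na.action = na.omit silently drops every row of valid that still contains an NA in a predictor, which would explain a prediction vector far shorter than nrow(valid). A quick check, assuming the objects from the code above:
# how many rows of valid are complete in the predictor columns?
sum(complete.cases(subset(valid, select = -classe)))
nrow(valid)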

Random Forest model yields incorrect predictions despite having accuracy of over 99 percent

For an ML course, I am supposed to build a model based on the training set to predict the variable "classe" on a validation set. I removed all unnecessary variables from the training set, used cross-validation to prevent over-fitting, and made sure the validation set matched the training set in terms of which columns are removed. When I predict classe on the validation set, every prediction is class A, and I know this is incorrect.
I included the entire script below.
Where did I go wrong?
library(caret)
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "train.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "test.csv")
train <- read.csv("./train.csv")
val <- read.csv("./test.csv")
#getting rid of columns with NAs
nas <- sapply(train, function(x) sum(is.na(x)))
train <- train[, nas<1900]
#removing near zero variance columns
remove <- nearZeroVar(train)
train <- train[, -remove]
#create partition in our training set
set.seed(8675309)
inTrain <- createDataPartition(train$classe, p = .7, list = FALSE)
training <- train[inTrain,]
testing <- train[-inTrain,]
model <- train(classe ~ ., method = "rf", data = training)
confusionMatrix(predict(model, testing), testing$classe)
#make sure validation set has same features as training set
trainforvalid <- subset(training, select = -classe)
val <- val[, colnames(trainforvalid)]
predict(model, val)
#the above step yields all predictions as classe A
This might be happening because the data are unbalanced. If the data have many more observations of class A than of class B, the model can simply learn to always predict class A.
Try using a metric better suited to this case, such as the F1 score.
I also recommend techniques like oversampling or undersampling to mitigate the class-imbalance issue.
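If you want to try resampling inside caret itself, here is a minimal sketch (the sampling argument of trainControl provides built-in down- and up-sampling; treat this as an illustration, not a verified fix for this particular data set):
# down-sample the majority classes within each cross-validation fold
ctrl <- trainControl(method = "cv", number = 5, sampling = "down")
model <- train(classe ~ ., method = "rf", data = training, trControl = ctrl)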

Get row number for prediction with caret

I use caret a lot for my machine learning tasks in R and I like it a lot.
But I face the following problem:
I train a model in caret, say a linear regression with lm()
When I want to score new data, I do: predict(model, new_data)
When new_data contains missing values in my predictors, predict returns no prediction for those rows, instead of, say, NA.
Is it possible to either:
return a prediction for all rows in new_data with a prediction of NA when it is not possible or
return predictions + the row number of the dataframe the prediction corresponds to?
E.g. like the mlr package does with an id column that shows which row the prediction corresponds to.
Here is the link to the mlr-predict page with more details:
mlr-package: predict with row-id
Any help greatly appreciated!
You can identify the cases with missing values prior to running caret::train() by creating a new column with the row names in your data set, since these default to the row numbers in the data frame.
Using the Sonar data set from the mlbench package as an illustration:
library(mlbench)
data(Sonar)
library(caret)
set.seed(95014)
# add row numbers
Sonar$rowId <- rownames(Sonar)
# create training & testing data sets
inTraining <- createDataPartition(Sonar$Class, p = .75, list=FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]
# set column 60 to NA for some values in test data
testing[48:51,60] <- NA
testing[!complete.cases(testing),"rowId"]
...and the output:
> testing[!complete.cases(testing),"rowId"]
[1] "193" "194" "200" "206"
You can then run predict() on the rows in the test data set that are complete cases. Again using the Sonar dataset, with a random forest model and 3-fold cross-validation to expedite processing:
fitControl <- trainControl(method = "cv",number = 3)
fit <- train(Class ~ . - rowId, data = training, method = "rf", trControl = fitControl)
predicted <- predict(fit,testing[complete.cases(testing),])
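If you also want the second behaviour asked about in the question (a prediction for every row, with NA where scoring was impossible), one way is to join the predictions back by rowId. A sketch, assuming fit and testing from above:
# score only the complete rows, then align the results back to all rows by rowId
scored <- testing[complete.cases(testing), ]
scored$pred <- predict(fit, scored)
testing$pred <- scored$pred[match(testing$rowId, scored$rowId)]
# rows that could not be scored now carry NA in testing$pred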
Another way to handle this situation is to use an imputation strategy to eliminate the missing values for the independent variables in your model. My article on GitHub, Strategies for Handling Missing Values, links to a number of research papers on this topic.
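caret can also perform simple imputation itself through the preProcess argument of train(). A sketch under the assumption that median imputation is acceptable for these predictors (knnImpute and bagImpute are other built-in options):
# impute missing predictor values instead of dropping incomplete rows
fit2 <- train(Class ~ . - rowId, data = training, method = "rf",
              trControl = fitControl, preProcess = "medianImpute",
              na.action = na.pass)
predicted2 <- predict(fit2, testing, na.action = na.pass)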

Different number of tuples for the prediction model and test set data in SVM

I have a dataset with two columns, where the first column, timestamp, is a point in time, and Column.10 gives the total power usage at that instant. There are 81502 instances in this data.
I'm doing support vector regression on this data in R, using the e1071 package, to predict future power usage. The code is given below. I first divided the dataset into training and test sets, then modeled the training data with the svm function, and then predicted the power usage for the test set.
library(e1071)
attach(data.csv)
index <- 1:nrow(data.csv)
testindex <- sample(index,trunc(length(index)/3))
testset <- na.omit(data.csv[testindex, ])
trainingset <- na.omit(data.csv[-testindex, ])
model <- svm(Column.10 ~ timestamp, data=trainingset)
prediction <- predict(model, testset[,-2])
tab <- table(pred = prediction, true = testset[,2])
However, when I try to make a confusion matrix from the prediction, I'm getting the error:
Error in table(pred = prediction, true = testset[, 2]) : all arguments must have the same length
So I checked the lengths of the two arguments and found that length(prediction) is 81502, while length(testset[,2]) is 27167.
Since I made the prediction only for the test set, I don't understand how predictions were produced for 81502 values. Why do the prediction and the test set have different numbers of values? How is the power value being predicted for the entire dataset even though I only passed it the test set?
Change
prediction <- predict(model, testset[,-2])
to
prediction <- predict(model, testset)
However, you should not use table() when doing regression; use the MSE instead.
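For example, a minimal sketch of an MSE-based evaluation (assuming model and testset from the question, with the observed values in Column.10):
prediction <- predict(model, testset)
# mean squared error and root mean squared error of the regression
mse <- mean((testset$Column.10 - prediction)^2)
rmse <- sqrt(mse)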

Random Forest Predictions

I am looking for some guidance on a homework assignment I am working on for a class. We are given a dataset with 14K observations and are asked to build a prediction model. Using the caret package, I split the dataset into training and testing sets (4909 observations in testing) to predict the last variable, "classe". I pulled out the near-zero-variance variables and built the model, but when I try to make predictions I only get 97 predictions back. I reviewed the help files but still can't figure out where I am going wrong. Any hints would be appreciated.
Here is the Code:
set.seed(1234)
pml.training <- read.csv("./data/pml-training.csv")
#
library(caret)
inTrain <- createDataPartition(y=pml.training$classe, p=0.75, list=FALSE)
training <- pml.training[inTrain,]
testing <- pml.training[-inTrain,]
# Pull out the Near Zero Value (NZV)
nzv <- nearZeroVar(training, saveMetrics=TRUE)
omit <- which(nzv$nzv==TRUE)
training <- training[,-omit]
testing <- testing[,-omit]
# Fit the model
modFit <- train(classe ~., method="rf", data=training)
modFit
print(modFit$finalModel)
plot(modFit)
# Try and predict on the testing model
pred <- predict(modFit, newdata=testing)
testing$predRight <- pred==testing$classe
print(table(pred, testing$classe))
Thanks, Pat C.
Have you checked
sum(complete.cases(subset(testing, select = -classe)))
?
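If that sum is much smaller than nrow(testing), the missing predictions are the rows being dropped because of NAs in the mostly-empty summary columns, which the near-zero-variance filter does not necessarily remove. One possible adjustment, sketched on the assumption that those columns are not needed for the model:
# drop columns that are almost entirely NA before fitting, so that
# predict() is not forced to discard incomplete rows
mostlyNA <- colMeans(is.na(training)) > 0.9
training <- training[, !mostlyNA]
testing <- testing[, !mostlyNA]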
