'factors with the same levels' in Confusion Matrix - r

I'm trying to make a decision tree, but this error comes up when I make a confusion matrix in the last line:
Error : `data` and `reference` should be factors with the same levels
Here's my code:
library(rpart)
library(caret)
library(dplyr)
library(rpart.plot)
library(xlsx)
library(caTools)
library(data.tree)
library(e1071)
#Loading the Excel File
library(readxl)
FINALDATA <- read_excel("Desktop/FINALDATA.xlsm")
View(FINALDATA)
df <- FINALDATA
View(df)
#Selecting the meaningful columns for prediction
#df <- select(df, City, df$`Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
df <- select(df, City, `Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
#making sure the data is in the right format
df <- mutate(df, City= as.character(City), `Customer type`= as.character(`Customer type`), Gender= as.character(Gender), Quantity= as.numeric(Quantity), Total= as.numeric(Total), Time= as.numeric(Time), Payment = as.character(Payment), Rating= as.numeric(Rating))
#Splitting into training and testing data
set.seed(123)
sample = sample.split('Customer type', SplitRatio = .70)
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)
#Training the Decision Tree Classifier
tree <- rpart(df$`Customer type` ~., data = train)
#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')
#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$`Customer type`)
So I've tried to do this as said in another topic:
confusionMatrix(table(tree.customertype.predicted, test$`Customer type`))
But I still have an error:
Error in !all.equal(nrow(data), ncol(data)) : argument type is invalid

I made a toy data set and examined your code. There were a couple of issues:
R has an easier time with variable names that follow a certain style. Your 'Customer type' variable has a space in it; in general, coding is easier when you avoid spaces, so I renamed it 'Customer_type'. For your data frame you could simply rename the column in the source file, or use names(df) <- gsub("Customer type", "Customer_type", names(df)).
I coded 'Customer_type' as a factor. For you this will look like df$Customer_type <- factor(df$Customer_type)
The documentation for sample.split() says the first argument 'Y' should be a vector of labels, meaning one class label per observation. But in your code you gave a character string with the variable's name. Pass the whole column instead, as in sample.split(df$Customer_type, SplitRatio = 0.70), so the split preserves the proportions of the Customer_type classes (High, Med and Low in my example; you can check yours with levels(df$Customer_type)).
Adjust the rpart() call as shown below.
With these adjustments, your code might be OK.
# toy data
df <- data.frame(City = factor(sample(c("Paris", "Tokyo", "Miami"), 100, replace = T)),
                 Customer_type = factor(sample(c("High", "Med", "Low"), 100, replace = T)),
                 Gender = factor(sample(c("Female", "Male"), 100, replace = T)),
                 Quantity = sample(1:10, 100, replace = T),
                 Total = sample(1:10, 100, replace = T),
                 Date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by = "day"), 100),
                 Rating = factor(sample(1:5, 100, replace = T)))
library(rpart)
library(caret)
library(dplyr)
library(caTools)
library(data.tree)
library(e1071)
#Splitting into training and testing data
set.seed(123)
sample = sample.split(df$Customer_type, SplitRatio = .70) # ADJUST YOUR CODE TO PASS THE LABEL COLUMN ITSELF
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)
#Training the Decision Tree Classifier
tree <- rpart(Customer_type ~., data = train) # ADJUST YOUR CODE SO IT'S LIKE THIS
#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')
#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$Customer_type)
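As a follow-up: the original error simply means confusionMatrix() expects both arguments to be factors with identical levels. A minimal sketch of aligning them explicitly, assuming the tree and test objects from the corrected code above:
pred <- predict(tree, test, type = "class")                # predict() already returns a factor here
ref  <- factor(test$Customer_type, levels = levels(pred))  # put the reference on the same levels
confusionMatrix(pred, ref)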

Try to keep the factor levels of train and test the same as in df:
train$`Customer type` <- factor(train$`Customer type`, unique(df$`Customer type`))
test$`Customer type` <- factor(test$`Customer type`, unique(df$`Customer type`))

Related

How to avoid spread error in two-way repeated measures ANOVA in R?

I am trying to run a two-way repeated measures ANOVA using the rstatix package; however, I keep getting the following error message, which I don't know how to interpret, although I suspect it has something to do with dat$id.
Error in `spread()`:
! Each row of output must be identified by a unique combination of keys.
Keys are shared for 72 rows:
The data I am using has two locations with three measurements per Location for each Date. Any idea how to avoid this error?
Example Data
library(dplyr)
set.seed(321)
dat <- data.frame(matrix(ncol = 3, nrow = 72))
colnames(dat)[1:3] <- c("Date","Location","Value")
dat$Value <- round(rnorm(72, 100,50),0)
dat$Location <- rep(c("Location 1","Location 2"), each = 36)
st <- as.Date("2020-01-01")
en <- as.Date("2020-12-31")
dat$Date <- rep(seq.Date(st,en,by = '1 month'),3)
dat <- dat %>% mutate(id = dense_rank(Date))
dat$Date <- as.factor(dat$Date)
View(dat)
Two-way repeated measures ANOVA that throws the error
library(rstatix)
resaov <- anova_test(
data = dat, dv = Value, wid = id,
within = c(Location, Date))
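A quick diagnostic sketch (using the dat object built above) makes the problem visible: each combination of id, Location and Date occurs three times because of the three replicate measurements, so anova_test() cannot uniquely identify the rows. Whether the replicates should be averaged or given their own identifiers depends on the intended design.
# Diagnostic sketch: list the key combinations that are not unique.
library(dplyr)
dat %>%
  count(id, Location, Date) %>%
  filter(n > 1)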

modelr add_predictions error: in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels)

I am facing the following error using modelr add_predictions function.
modelr add_predictions error: in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): fe.lead.surgeon has new levels ....
In my understanding, this is a common issue when you build the model on a training dataset and apply it to a test dataset, since the test dataset may contain factor levels that were not present in the training dataset. However, I am using the same sample for creating the model and getting the predicted values, and I am still getting this error.
Specifically, here is the code I am using, and I would appreciate any insight into why this error occurs and how to solve it.
# indep is a vector of independent variable names
# dep is a vector of dependent variable names
# id.case is the id variable
# sample is my dataset.
eq <-
paste(indep, collapse = ' + ') %>%
paste(dep, ., sep = ' ~ ') %>%
as.formula
s <-
lm(eq, data = sample %>% select(-id.case))
pred <-
sample %>%
modelr::add_predictions(s) %>%
select(id.case, pred)
As per the request of @SimoneBianchi, I am providing a reproducible example here.
Reproducible example
library(tidyverse)
library(tibble)
library(data.table)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10002)
id <- sample(1:1000, 1000, replace=F)
set.seed(10003)
fe1 <- sample(c('A','B','C'), 1000, replace=T)
set.seed(10001)
fe2 <- sample(c('a','b','c'), 1000, replace=T)
set.seed(10001)
cont1 <- sample(1:300, 1000, replace=T)
set.seed(10004)
value <- sample(1:30, 1000, replace=T)
sample <-
data.frame(id, fe1, fe2, cont1, value)
dep <- 'value'
indep <-
c('fe1','fe2', 'cont1')
eq <-
paste(indep, collapse = ' + ') %>%
paste(dep, ., sep = ' ~ ') %>%
as.formula
s <-
lm(eq, data = sample %>% select(-id))
pred <-
sample %>%
modelr::add_predictions(s) %>%
select(id, pred)
Update and Workaround
One workaround I found is to not use the modelr function but the fitted() function instead. However, I would still like to learn why the regression automatically drops some factor levels from a factor variable. If anyone knows, please leave a comment.
pred <-
sample %>%
cbind(pred = fitted(s))
Closing: Problem found with the dataset
I found that some observations were NA, and those rows contained the factor levels that showed up as "new levels" in the error. After I fixed the NAs, the original code worked fine. So it was a problem with the dataset rather than the code!
Thank you all for trying to help me out.
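As a side note on the "why" above, here is a small self-contained sketch of the mechanism (a made-up toy example, not the real data): lm() silently drops rows with NA via its default na.action = na.omit, and a character predictor only gets factor levels for the values that survive that drop, so predicting on the full data afterwards reports the lost value as a new level.
# Hypothetical toy example, not the original data.
d <- data.frame(y = c(1, 2, 3, NA, 5),
                f = c("A", "B", "A", "C", "B"),
                stringsAsFactors = FALSE)
fit <- lm(y ~ f, data = d)    # the only "C" row is dropped because y is NA
fit$xlevels                   # the model only knows levels "A" and "B"
# predict(fit, newdata = d)   # would now fail: factor f has new level C
# modelr::add_predictions(d, fit) fails in the same way.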

Random Forest Tree for classification

I am trying random forests (RF) for the first time. I am trying to predict the genre of a game based on the factors below:
data <- read.csv("appstore_games.csv")
data <- data %>% drop_na()
data <- data %>% select(Average.User.Rating, User.Rating.Count, Price, Age.Rating, Genres)
data <- data %>% separate(Genres, c("Main Genre","Genre1","Genre2","Genre3"), extra = "drop" )
data1 <- data %>% select(Genre1 , Average.User.Rating, User.Rating.Count, Price )
str(data1)
data1$Genre1 <- as.factor(data1$Genre1)
set.seed(123)
sample <- sample(2 , nrow(data1),replace = TRUE, prob = c(0.7,0.3))
train_data <- data1[sample == 1,]
test_data <- data1[sample == 2,]
library(randomForest)
set.seed(1)
rf <- randomForest(train_data$Genre1 ~., data = train_data , proximity = TRUE, ntree = 200, importance = TRUE)
It shows an error at this point:
Error in randomForest.default(m, y, ...) : Can't have empty classes in y.
Can I know what is wrong here?
Thanks
The genre has names such as Strategy, Entertainment, etc.
I am not completely sure, but I think that could happen if not all levels of your Y are represented in the train data. Maybe check this.
My other idea is that one of your classes in Y is "None".
Try using this before you pass the data to the model:
train_data <- droplevels(train_data)
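A minimal sketch putting both suggestions together, assuming the train_data and test_data objects from the question; it also writes the formula as Genre1 ~ . rather than train_data$Genre1 ~ ., so the formula refers to the columns of the data argument:
library(randomForest)
table(train_data$Genre1)               # any class with a count of 0 causes the error
train_data <- droplevels(train_data)   # drop unused (empty) factor levels
# keep the test factor on the same levels as the training factor;
# genres not seen in training become NA here
test_data$Genre1 <- factor(test_data$Genre1, levels = levels(train_data$Genre1))
set.seed(1)
rf <- randomForest(Genre1 ~ ., data = train_data,
                   proximity = TRUE, ntree = 200, importance = TRUE)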

Descriptive statistics of the data from NHANES 2003-2004

I am trying to reproduce the original results (n = 9643) for the percentage of missing data in Table 5 of the article "A robust imputation method for missing responses and covariates in sample selection models". I downloaded the NHANES 2003-2004 data and created a script to read them. I was able to faithfully reproduce the results for all variables except the income variable. I've read the article several times and researched a lot, but I can't see where I'm going wrong. Does anyone know how to arrive at the 24.41% missing-data value for the income variable? Below is my code.
rm(list = ls())
cat("\014")
library("tidyverse")
library(Hmisc)
mydata <- sasxport.get("https://raw.githack.com/maf335/stack/master/DEMO_C.XPT")
attach(mydata)
newdata <- mydata %>% select(seqn,ridageyr, riagendr, dmdeduc, ridreth1, indhhinc)
names(newdata) <- c("id","age","gender", "educ", "race", "income")
attach(newdata)
##################
mydata2 <- sasxport.get("https://raw.githack.com/maf335/stack/master/BMX_C.XPT")
attach(mydata2)
newdata2 <- mydata2 %>% select(seqn,bmxbmi)
names(newdata2) <- c("id","bmi")
attach(newdata2)
##############
mydata3 <- sasxport.get("https://raw.githack.com/maf335/stack/master/BPX_C.XPT")
attach(mydata3)
newdata3 <- mydata3 %>% select(seqn, bpxsy1)
names(newdata3) <- c("id", "sbp")
attach(newdata3)
#################
dt <- merge(newdata, newdata2, by="id")
data <- merge(dt, newdata3, by= "id")
attach(data)
####################
perc <- function(x, data){
  # summary(x)[[7]] is the "NA's" count of a numeric vector that has missing values
  nna <- ifelse(sum(is.na(x)) != 0, summary(x)[[7]], "x has no missing data")
  perc <- ifelse(sum(is.na(x)) != 0, (nna/length(data$id))*100, "x has no missing data")
  #perc <- (nna/length(data$id))*100
  return(perc)
}
perc(sbp,data)
perc(age,data)
perc(gender,data)
perc(bmi,data)
perc(educ,data)
perc(race,data)
perc(income,data)
hist(data$income, prob= TRUE, breaks = seq(1, 99, 0.5), xlim = c(1,10), ylim = c(0,0.35), main = "Histogram of Income", xlab = "Category")
The article "Subsample ignorable likelihood for regression
analysis with missing data" also presents, in table 1, the income variable with high value of missing data. Even considering a smaller number of observations (n = 9041).
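One possibility worth checking, offered as an assumption rather than something verified against the article: in DEMO_C the household income variable (indhhinc) uses special codes such as 77 (refused) and 99 (don't know), which appear as valid values rather than as NA. If the article counted those responses as missing, the percentage would rise well above the raw NA rate. A sketch of that recoding:
# Assumption: treat "refused" (77) and "don't know" (99) as missing.
data$income[data$income %in% c(77, 99)] <- NA
mean(is.na(data$income)) * 100   # percentage of missing income values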

How to perform Mixed Design ANOVA on MICE imputed data in R?

I have a question about performing a Mixed Design ANOVA in R after multiple imputation using MICE. My data is as follows:
id <- c(1,2,3,4,5,6,7,8,9,10)
group <- c(0,1,1,0,0,1,0,0,0,1)
measure_1 <- c(60,80,90,54,60,61,77,67,88,90)
measure_2 <- c(55,88,88,55,70,62,78,66,65,92)
measure_3 <- c(58,88,85,56,68,62,89,62,70,99)
measure_4 <- c(64,80,78,92,65,64,87,65,67,96)
measure_5 <- c(64,85,80,65,74,69,90,65,70,99)
measure_6 <- c(70,83,80,55,73,64,91,65,91,89)
dat <- data.frame(id, group, measure_1, measure_2, measure_3, measure_4, measure_5, measure_6)
dat$group <- as.factor(dat$group)
So: we have 6 repeated measurements of diastolic blood pressure (measures 1 through 6). The grouping factor is gender, which is called group. This variable is coded 1 if male and 0 if female. Before multiple imputation, we have used the following code in R:
library(reshape)
library(reshape2)
datLong <- melt(dat, id = c("id", "group"), measured = c("measure_1", "measure_2", "measure_3", "measure_4", "measure_5", "measure_6"))
datLong
colnames(datLong) <- c("ID", "Gender", "Time", "Score")
datLong
table(datLong$Time)
datLong$ID <- as.factor(datLong$ID)
library(ez)
model_mixed <- ezANOVA(data = datLong,
                       dv = Score,
                       wid = ID,
                       within = Time,
                       between = Gender,
                       detailed = TRUE,
                       type = 3,
                       return_aov = TRUE)
model_mixed
This worked perfectly. However, our data is not complete. We have missing values, which we impute using MICE:
id <- c(1,2,3,4,5,6,7,8,9,10)
group <- c(0,1,1,0,0,1,0,0,0,1)
measure_1 <- c(60,80,90,54,60,61,77,67,88,90)
measure_2 <- c(55,NA,88,55,70,62,78,66,65,92)
measure_3 <- c(58,88,85,56,68,62,89,62,70,99)
measure_4 <- c(64,80,78,92,NA,NA,87,65,67,96)
measure_5 <- c(64,85,80,65,74,69,90,65,70,99)
measure_6 <- c(70,NA,80,55,73,64,91,65,91,89)
dat <- data.frame(id, group, measure_1, measure_2, measure_3, measure_4, measure_5, measure_6)
dat$group <- as.factor(dat$group)
library(mice)
imp_anova <- mice(dat, maxit = 0)
meth <- imp_anova$method
pred <- imp_anova$predictorMatrix
imp_anova <- mice(dat, method = meth, predictorMatrix = pred, seed = 2018, maxit = 10, m = 5)
(The imputation gives logged events because of the made-up data and the simple imputation setup, e.g. id being used as a predictor. For my real data, the imputation was correct and valid.)
Now I have the imputed dataset of class ‘mids’. I have searched the internet, but I cannot find how I can perform the mixed design ANOVA on this imputed set, as I did before with the complete set using ezANOVA. Is there anyone who can and wants to help me?
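A hedged sketch of one possible starting point (assuming the imp_anova mids object from above): extract each completed dataset, reshape it to long format as before, and run the same mixed ANOVA per imputation. Properly pooling repeated-measures F-tests across imputations needs a dedicated procedure (for example the D1/D2 pooling implemented in packages such as mitml or miceadds), which this sketch does not attempt.
# Sketch: per-imputation mixed ANOVA; no pooling of the F-tests is done here.
library(mice)
library(reshape2)
library(ez)
anova_per_imp <- lapply(seq_len(imp_anova$m), function(i) {
  dat_i <- complete(imp_anova, action = i)                   # i-th completed dataset
  long_i <- melt(dat_i, id.vars = c("id", "group"),
                 variable.name = "Time", value.name = "Score")
  long_i$id <- as.factor(long_i$id)
  ezANOVA(data = long_i, dv = Score, wid = id,
          within = Time, between = group,
          detailed = TRUE, type = 3)
})
anova_per_imp[[1]]   # results for the first imputed dataset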
