I am trying to run a two-way repeated measures ANOVA using the rstatix package. However, I keep getting the following error message, which I don't know how to interpret, although I suspect it has something to do with dat$id.
Error in `spread()`:
! Each row of output must be identified by a unique combination of keys.
Keys are shared for 72 rows:
The data I am using has two locations with three measurements per Location for each Date. Any idea how to avoid this error?
Example Data
library(dplyr)
set.seed(321)
dat <- data.frame(matrix(ncol = 3, nrow = 72))
colnames(dat)[1:3] <- c("Date","Location","Value")
dat$Value <- round(rnorm(72, 100,50),0)
dat$Location <- rep(c("Location 1","Location 2"), each = 36)
st <- as.Date("2020-01-01")
en <- as.Date("2020-12-31")
dat$Date <- rep(seq.Date(st,en,by = '1 month'),3)
dat <- dat %>% mutate(id = dense_rank(Date))
dat$Date <- as.factor(dat$Date)
View(dat)
Two-way repeated measures ANOVA that throws the error
library(rstatix)
resaov <- anova_test(
data = dat, dv = Value, wid = id,
within = c(Location, Date))
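The error says that several rows share the same combination of keys, so a quick check is to count how many rows fall into each id x Location x Date cell. Below is a minimal base R sketch of that check. It rebuilds the example data without dplyr, and the `rep_id` column is a hypothetical name introduced only for illustration. It assumes the three measurements per Location and Date can be treated as the repeated units, which may or may not match the real design.

```r
# Rebuild the example data with base R only (same layout as the example above)
set.seed(321)
dat <- data.frame(
  Date     = rep(seq.Date(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "1 month"), 6),
  Location = rep(c("Location 1", "Location 2"), each = 36),
  Value    = round(rnorm(72, 100, 50), 0)
)
dat$id <- as.integer(factor(dat$Date))  # equivalent to dense_rank(Date) here

# Count the rows in each id x Location x Date cell
key <- paste(dat$id, dat$Location, dat$Date, sep = " | ")
rows_per_cell <- table(key)
range(rows_per_cell)  # every cell holds 3 rows, so the keys are not unique

# ASSUMPTION: number the replicates within each Location x Date cell
# (rep_id is a hypothetical column, not part of the original data)
dat$rep_id <- ave(seq_len(nrow(dat)), dat$Location, as.character(dat$Date),
                  FUN = seq_along)
key2 <- paste(dat$rep_id, dat$Location, dat$Date, sep = " | ")
all(table(key2) == 1)  # each rep_id x Location x Date cell now has one row
```

If that assumption holds, an identifier like `rep_id` gives one row per cell. Whether it is a valid subject identifier depends on how the measurements were actually taken.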
I am facing the following error when using the add_predictions function from modelr.
modelr add_predictions error: in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): fe.lead.surgeon has new levels ....
As I understand it, this is a common issue when you fit a model on a training dataset and apply it to a test dataset, because the test data may contain factor levels that were never seen during training. However, I am using the same sample both to fit the model and to get the predicted values, and I still get this error.
Here is the code I am using. I would appreciate any insight into why this error occurs and how to solve it.
# indep is a vector of independent variable names
# dep is a vector of dependent variable names
# id.case is the id variable
# sample is my dataset.
eq <-
paste(indep, collapse = ' + ') %>%
paste(dep, ., sep = ' ~ ') %>%
as.formula
s <-
lm(eq, data = sample %>% select(-id.case))
pred <-
sample %>%
modelr::add_predictions(s) %>%
select(id.case, pred)
As per the request of #SimoneBianchi, I am providing the reproducible example here.
Reproducible example
library(tidyverse)
library(tibble)
library(data.table)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10002)
id <- sample(1:1000, 1000, replace=F)
set.seed(10003)
fe1 <- sample(c('A','B','C'), 1000, replace=T)
set.seed(10001)
fe2 <- sample(c('a','b','c'), 1000, replace=T)
set.seed(10001)
cont1 <- sample(1:300, 1000, replace=T)
set.seed(10004)
value <- sample(1:30, 1000, replace=T)
sample <-
data.frame(id, fe1, fe2, cont1, value)
dep <- 'value'
indep <-
c('fe1','fe2', 'cont1')
eq <-
paste(indep, collapse = ' + ') %>%
paste(dep, ., sep = ' ~ ') %>%
as.formula
s <-
lm(eq, data = sample %>% select(-id))
pred <-
sample %>%
modelr::add_predictions(s) %>%
select(id, pred)
Update and Workaround
One workaround I found is to use the fitted function instead of the modelr function. However, I would still like to learn why the regression automatically drops some factor levels from a factor variable. If anyone knows, please leave a comment.
pred <-
sample %>%
cbind(pred = fitted(s))
Closing: Problem found with the dataset
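I found that some observations contained NA values, and those rows carried the factor levels that the error reported as new. Since lm() drops incomplete rows by default, those levels never reached the fitted model. After I fixed the NA values, the original code worked fine. So it was a problem with the dataset rather than the code!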
Thank you all for trying to help me out.
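For anyone hitting the same message, here is a small base R sketch of the mechanism. The data frame `toy` is made up for illustration and is not the original dataset. Level "C" occurs only in a row with a missing outcome, so lm() drops that row and predict() then sees "C" as a new level.

```r
# Hypothetical toy data: level "C" of fe1 appears only in a row whose outcome is NA
toy <- data.frame(
  fe1   = c("A", "A", "B", "B", "C"),
  cont1 = c(1, 2, 3, 4, 5),
  value = c(10, 12, 15, 11, NA)
)

fit <- lm(value ~ fe1 + cont1, data = toy)

# lm() dropped the incomplete row, so "C" is not among the levels the model knows
fit$xlevels$fe1

# Predicting on the full data fails because of the unseen level
res <- tryCatch(predict(fit, newdata = toy),
                error = function(e) conditionMessage(e))
res

# Levels present in the data but absent from the fitted model
missing_levels <- setdiff(unique(toy$fe1), fit$xlevels$fe1)
missing_levels
```

Comparing the levels in the data with `fit$xlevels` is a quick way to spot which rows were lost to missing values.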
I am trying random forest (RF) for the first time. I am trying to predict the genre of a game from the predictors selected below.
library(dplyr)  # %>% and select()
library(tidyr)  # drop_na() and separate()
data <- read.csv("appstore_games.csv")
data <- data %>% drop_na()
data <- data %>% select(Average.User.Rating, User.Rating.Count, Price, Age.Rating, Genres)
data <- data %>% separate(Genres, c("Main Genre","Genre1","Genre2","Genre3"), extra = "drop" )
data1 <- data %>% select(Genre1 , Average.User.Rating, User.Rating.Count, Price )
str(data1)
data1$Genre1 <- as.factor(data1$Genre1)
set.seed(123)
sample <- sample(2 , nrow(data1),replace = TRUE, prob = c(0.7,0.3))
train_data <- data1[sample == 1,]
test_data <- data1[sample == 2,]
library(randomForest)
set.seed(1)
# Use the bare column name on the left-hand side so "." does not also include Genre1 as a predictor
rf <- randomForest(Genre1 ~ ., data = train_data, proximity = TRUE, ntree = 200, importance = TRUE)
It throws an error at this point:
Error in randomForest.default(m, y, ...) : Can't have empty classes in y.
Can I know what is wrong here?
Thanks
The genres have names such as Strategy, Entertainment, etc.
I am not completely sure, but I think this can happen if not all levels of your Y are represented in the training data. Maybe check this.
My other idea is that one of your classes in Y is "None".
train_data <- droplevels(train_data)
Try using this before you pass data to the model
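To see why droplevels() can help, here is a minimal base R sketch with made-up data. The `toy` data frame and its genre names are hypothetical and only stand in for data1. Subsetting a data frame keeps every factor level, even when no rows with that level remain, and that leaves an empty class.

```r
# Hypothetical toy data standing in for data1
toy <- data.frame(
  Genre1 = factor(c("Strategy", "Puzzle", "Strategy", "Action", "Puzzle")),
  Price  = c(0, 1.99, 0, 0.99, 2.99)
)

# Suppose the random split leaves the only "Action" row out of the training set
train_toy <- toy[toy$Genre1 != "Action", ]

# "Action" is still a level of the factor, now with a count of 0
counts_before <- table(train_toy$Genre1)
counts_before

# droplevels() removes levels that no longer occur in the data
train_toy <- droplevels(train_toy)
counts_after <- table(train_toy$Genre1)
counts_after
```

Running table() on the response of the training set before fitting is a quick way to check for classes with zero observations.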
I am trying to reproduce the original missing-data percentages (n = 9643) from Table 5 of the article "A robust imputation method for missing responses and covariates in sample selection models". I downloaded the NHANES 2003-2004 data and wrote a script to read them. I was able to reproduce the results for all variables except income. I have read the article several times and searched extensively, but I can't see where I'm going wrong. Does anyone know how to obtain the 24.41% missing-data value for the income variable? My code is below.
rm(list = ls())
cat("\014")
library("tidyverse")
library(Hmisc)
mydata <- sasxport.get("https://raw.githack.com/maf335/stack/master/DEMO_C.XPT")
attach(mydata)
newdata <- mydata %>% select(seqn,ridageyr, riagendr, dmdeduc, ridreth1, indhhinc)
names(newdata) <- c("id","age","gender", "educ", "race", "income")
attach(newdata)
##################
mydata2 <- sasxport.get("https://raw.githack.com/maf335/stack/master/BMX_C.XPT")
attach(mydata2)
newdata2 <- mydata2 %>% select(seqn,bmxbmi)
names(newdata2) <- c("id","bmi")
attach(newdata2)
##############
mydata3 <- sasxport.get("https://raw.githack.com/maf335/stack/master/BPX_C.XPT")
attach(mydata3)
newdata3 <- mydata3 %>% select(seqn, bpxsy1)
names(newdata3) <- c("id", "sbp")
attach(newdata3)
#################
dt <- merge(newdata, newdata2, by="id")
data <- merge(dt, newdata3, by= "id")
attach(data)
####################
perc <- function(x, data) {
# Count NA values directly instead of relying on the position of "NA's" in summary()
n_missing <- sum(is.na(x))
if (n_missing == 0) return("x has no missing data")
# Percentage of missing values relative to the number of rows in data
(n_missing / length(data$id)) * 100
}
perc(sbp,data)
perc(age,data)
perc(gender,data)
perc(bmi,data)
perc(educ,data)
perc(race,data)
perc(income,data)
hist(data$income, prob= TRUE, breaks = seq(1, 99, 0.5), xlim = c(1,10), ylim = c(0,0.35), main = "Histogram of Income", xlab = "Category")
The article "Subsample ignorable likelihood for regression analysis with missing data" also reports, in Table 1, a high percentage of missing data for the income variable, even with a smaller number of observations (n = 9041).
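One thing that might be worth checking is whether special response codes are being counted as missing in addition to true NA values. The sketch below uses made-up values, and treating codes 77 and 99 as "refused" and "don't know" is an assumption on my part, not something the articles confirm. The helper `pct_missing` is a hypothetical name used only for this illustration.

```r
# Hypothetical income codes; 77 and 99 are ASSUMED to mean "refused" / "don't know"
income_toy <- c(1, 5, 11, 77, NA, 99, 3, NA, 8, 6)

# Percentage of values that are NA or belong to an extra set of "missing" codes
pct_missing <- function(x, extra_missing = numeric(0)) {
  mean(is.na(x) | x %in% extra_missing) * 100
}

only_na    <- pct_missing(income_toy)             # true NA values only
with_codes <- pct_missing(income_toy, c(77, 99))  # NA plus the assumed special codes
c(only_na = only_na, with_codes = with_codes)
```

Comparing the two percentages on the real income column would show whether such codes could account for part of the gap. Whether that reproduces the published figure is not something this sketch can establish.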
I have a question about performing a Mixed Design ANOVA in R after multiple imputation using MICE. My data is as follows:
id <- c(1,2,3,4,5,6,7,8,9,10)
group <- c(0,1,1,0,0,1,0,0,0,1)
measure_1 <- c(60,80,90,54,60,61,77,67,88,90)
measure_2 <- c(55,88,88,55,70,62,78,66,65,92)
measure_3 <- c(58,88,85,56,68,62,89,62,70,99)
measure_4 <- c(64,80,78,92,65,64,87,65,67,96)
measure_5 <- c(64,85,80,65,74,69,90,65,70,99)
measure_6 <- c(70,83,80,55,73,64,91,65,91,89)
dat <- data.frame(id, group, measure_1, measure_2, measure_3, measure_4, measure_5, measure_6)
dat$group <- as.factor(dat$group)
So we have 6 repeated measurements of diastolic blood pressure (measure_1 to measure_6). The grouping factor is gender, which is stored in the variable group. This variable is coded 1 for male and 0 for female. Before multiple imputation, we used the following code in R:
library(reshape)
library(reshape2)
datLong <- melt(dat, id = c("id", "group"), measured = c("measure_1", "measure_2", "measure_3", "measure_4", "measure_5", "measure_6"))
datLong
colnames(datLong) <- c("ID", "Gender", "Time", "Score")
datLong
table(datLong$Time)
datLong$ID <- as.factor(datLong$ID)
library(ez)
model_mixed <- ezANOVA(data = datLong,
dv = Score,
wid = ID,
within = Time,
between = Gender,
detailed = TRUE,
type = 3,
return_aov = TRUE)
model_mixed
This worked perfectly. However, our data is not complete. We have missing values, which we impute using MICE:
id <- c(1,2,3,4,5,6,7,8,9,10)
group <- c(0,1,1,0,0,1,0,0,0,1)
measure_1 <- c(60,80,90,54,60,61,77,67,88,90)
measure_2 <- c(55,NA,88,55,70,62,78,66,65,92)
measure_3 <- c(58,88,85,56,68,62,89,62,70,99)
measure_4 <- c(64,80,78,92,NA,NA,87,65,67,96)
measure_5 <- c(64,85,80,65,74,69,90,65,70,99)
measure_6 <- c(70,NA,80,55,73,64,91,65,91,89)
dat <- data.frame(id, group, measure_1, measure_2, measure_3, measure_4, measure_5, measure_6)
dat$group <- as.factor(dat$group)
library(mice)
imp_anova <- mice(dat, maxit = 0)
meth <- imp_anova$method
pred <- imp_anova$predictorMatrix
imp_anova <- mice(dat, method = meth, predictorMatrix = pred, seed = 2018, maxit = 10, m = 5)
(The imputation gives logged events because of the made-up data and the simple imputation code, e.g. id is used as a predictor. For my real data, the imputation was correct and valid.)
Now I have the imputed dataset of class ‘mids’. I have searched the internet, but I cannot find how I can perform the mixed design ANOVA on this imputed set, as I did before with the complete set using ezANOVA. Is there anyone who can and wants to help me?