Can't use glht post-hoc with repeated measures ANOVA in R?

I have a data frame with this structure:
'data.frame': 39 obs. of 3 variables:
$ topic : Factor w/ 13 levels "Acido Folico",..: 1 2 3 4 5 6 7 8 9 10 ...
$ variable: Factor w/ 3 levels "Both","Preconception",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : int 14 1 36 17 5 9 19 9 19 25 ...
and I want to test the effect value ~ variable, considering that observations are grouped by topic. So I thought to use a repeated-measures ANOVA, where "variable" is treated as a repeated measure on every topic.
The call is aov(value ~ variable + Error(topic/variable)).
Up to this point everything is OK.
Then I wanted to perform a post-hoc test with glht(model, linfct = mcp(variable = 'Tukey')), but I receive two errors:
‘glht’ does not support objects of class ‘aovlist’
no ‘model.matrix’ method for ‘model’ found!
Since taking out the Error term makes the errors go away, I suppose that is the problem.
So, how can I perform a post-hoc test over a repeated measure anova?
Thanks!
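One common workaround (not from this thread, so treat it as a hedged sketch): refit the same design as a mixed model with nlme::lme, a class glht does support. For this single within-subject factor, a random intercept per topic is the usual equivalent of Error(topic/variable); df below stands for the poster's data frame.

```r
library(nlme)       # lme objects are supported by multcomp::glht
library(multcomp)

# Mixed-model equivalent of aov(value ~ variable + Error(topic/variable)):
# a random intercept for each topic (df is the data frame shown above).
m <- lme(value ~ variable, random = ~ 1 | topic, data = df)

# Tukey all-pairs comparisons over the three levels of 'variable':
summary(glht(m, linfct = mcp(variable = "Tukey")))
```

The same works with lme4::lmer in place of lme; multcomp supports merMod objects as well.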

Related

Adding random term into glmer mixed-effect model; error message: failure to converge

I'm analyzing data from an experiment, replicated in time, where I measured plant emergence at the soil surface. I had 3 experimental runs, represented by the term trialnum, and would like to include trialnum as a random effect.
Here is a summary of variables involved:
data.frame: 768 obs. of 9 variables:
$ trialnum : Factor w/ 2 levels "2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ Flood : Factor w/ 4 levels "0","5","10","15": 2 2 2 2 2 2 1 1 1 1 ...
$ Burial : Factor w/ 4 levels "1.3","2.5","5",..: 3 3 3 3 3 3 4 4 4 4 ...
$ biotype : Factor w/ 6 levels "0","1","2","3",..: 1 2 3 4 5 6 1 2 3 4 ...
$ soil : int 0 0 0 0 0 0 0 0 0 0 ...
$ n : num 15 15 15 15 15 15 15 15 15 15 ...
Where trialnum is the experimental run, Flood, Burial, and biotype are input/independent variables, and soil is the response/dependent variable.
I previously created this model with all input variables:
glmfitALL <- glm(cbind(soil, n) ~ trialnum*Flood*Burial*biotype, family = binomial(logit), data = total)
From this model I found that by running
anova(glmfitALL, test = "Chisq")
trialnum is significant. There were 3 experimental runs; I'm only including 2 of them in my analysis. I have been advised to incorporate trialnum as a random effect so that I do not have to report the experimental runs separately.
To do this, I created the following model:
glmerfitALL <-glmer(cbind(soil,n)~Flood*Burial*biotype + (1|trialnum),
data = total,
family = binomial(logit),
control = glmerControl(optimizer = "bobyqa"))
From this I get the following error message:
maxfun < 10 * length(par)^2 is not recommended.
unable to evaluate scaled gradient
Model failed to converge: degenerate Hessian with 9 negative eigenvalues
I have tried running this model in a variety of ways including:
glmerfitALL <-glmer(cbind(soil,n)~Flood*Burial*biotype*(1|trialnum),
data = total,
family = binomial(logit),
control = glmerControl(optimizer = "bobyqa"))
as well as incorporating REML = FALSE and using optimx in place of bobyqa, but every variant produced a similar error message.
Because this is an "eigenvalue" error, does that mean there is a problem with my source file/original data?
I also found previous threads about these lme4 error messages (sorry, I did not save the link) and saw some comments raising issues with the lack of replicates of the random effect. Since I only have 2 levels, trialnum2 and trialnum3, can I even use trialnum as a random effect?
Regarding the eigenvalue warning, the chief recommendation is centring and/or scaling the predictors.
Regarding the number of random-effect groups, around five levels is usually cited as the practical minimum; with only two, the between-group variance is very poorly estimated.
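Two further things worth trying, sketched below with the question's variable names (untested on the poster's data, so treat this as an assumption-laden outline rather than a fix):

```r
library(lme4)

# 1. Simplify the fixed effects: a full three-way interaction of three
#    factors is a great many parameters for a binomial GLMM, and is a
#    frequent cause of degenerate-Hessian warnings.
glmerfit2 <- glmer(cbind(soil, n) ~ Flood + Burial + biotype + (1 | trialnum),
                   data = total, family = binomial,
                   control = glmerControl(optimizer = "bobyqa"))

# 2. With only 2 levels of trialnum, treating it as a fixed blocking
#    factor in a plain glm is often the more defensible choice:
glmfit2 <- glm(cbind(soil, n) ~ trialnum + Flood + Burial + biotype,
               data = total, family = binomial)
```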

Single input data.frame instead of testset.csv fails in randomForest code in R

I have written an R script which runs successfully and predicts output, but only when a csv with multiple entries is passed to the classifier.
training_set = read.csv('finaldata.csv')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[-5],
y = training_set$Song,
ntree = 50)
test_set = read.csv('testSet.csv')
y_pred = predict(classifier, newdata = test_set)
The above code runs successfully, but instead of giving 10+ inputs to the classifier, I want to pass a single-row data.frame as input. That works with other classifiers, but not this one. Why?
So the following code doesn't work and throws an error:
y_pred = predict(classifier, data.frame(Emot="happy",Pact="Walking",Mact="nothing",Session="morning"))
Error in predict.randomForest(classifier, data.frame(Emot = "happy", :
Type of predictors in new data do not match that of the training data.
I even tried keeping a single entry in testinput.csv, but it still throws the same error! How can I solve this? This code is the back-end of another program of mine, and I want to pass only a single entry as the test input. Also, all variables are factors in both the training and the test set. Help appreciated.
PS: Previous solutions to the same error didn't help me.
str(test_set)
'data.frame': 1 obs. of 5 variables:
$ Emot : Factor w/ 1 level "fear": 1
$ Pact : Factor w/ 1 level "Bicycling": 1
$ Mact : Factor w/ 1 level "browsing": 1
$ Session: Factor w/ 1 level "morning": 1
$ Song : Factor w/ 1 level "Dusk Till Dawn.mp3": 1
str(training_set)
'data.frame': 1052 obs. of 5 variables:
$ Emot : Factor w/ 8 levels "anger","contempt",..: 4 7 6 6 4 3 4 6 4 6 ...
$ Pact : Factor w/ 5 levels "Bicycling","Driving",..: 1 2 2 2 4 3 1 1 3 4 ...
$ Mact : Factor w/ 6 levels "browsing","chatting",..: 1 6 1 4 5 1 5 6 6 6 ...
$ Session: Factor w/ 4 levels "afternoon","evening",..: 3 4 3 2 1 3 1 1 2 1 ...
$ Song : Factor w/ 101 levels "Aaj Ibaadat.mp3",..: 29 83 47 72 29 75 77 8 30 53 ...
OK, this worked, though it's a weird solution: equalize the factor levels of the training and test sets. The following code binds the first row of the training set to the test set and then deletes it.
test_set <- rbind(training_set[1, ] , test_set)
test_set <- test_set[-1,]
Done! It works for a single input as well as a single-entry .csv file, without any error from the model.
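A less fragile variant of the same idea, in base R only (match_levels is a hypothetical helper written for this sketch, not part of randomForest): rebuild every factor column of the test set with the training set's levels, so predict.randomForest sees identical factors.

```r
# Rebuild each factor column of the test set using the levels from the
# corresponding training-set column; predict.randomForest requires the
# factor levels to match exactly, not merely be compatible.
match_levels <- function(test, train) {
  for (nm in intersect(names(test), names(train))) {
    if (is.factor(train[[nm]])) {
      test[[nm]] <- factor(test[[nm]], levels = levels(train[[nm]]))
    }
  }
  test
}

# Toy illustration:
train_df <- data.frame(Emot = factor(c("happy", "fear", "anger")))
test_df  <- data.frame(Emot = factor("happy"))
test_df  <- match_levels(test_df, train_df)
levels(test_df$Emot)  # now the same three levels as the training set
```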

High OOB error rate for random forest

I am trying to develop a model to predict the WaitingTime variable. I am running a random forest on the following dataset.
$ BookingId : Factor w/ 589855 levels "00002100-1E20-E411-BEB6-0050568C445E",..: 223781 471484 372126 141550 246376 512394 566217 38486 560536 485266 ...
$ PickupLocality : int 1 67 77 -1 33 69 67 67 67 67 ...
$ ExZone : int 0 0 0 0 1 1 0 0 0 0 ...
$ BookingSource : int 2 2 2 2 2 2 7 7 7 7 ...
$ StarCustomer : int 1 1 1 1 1 1 1 1 1 1 ...
$ PickupZone : int 24 0 0 0 6 11 0 0 0 0 ...
$ ScheduledStart_Day : int 14 20 22 24 24 24 31 31 31 31 ...
$ ScheduledStart_Month : int 6 6 6 6 6 6 7 7 7 7 ...
$ ScheduledStart_Hour : int 14 17 7 2 8 8 1 2 2 2 ...
$ ScheduledStart_Minute : int 6 0 58 55 53 54 54 0 12 19 ...
$ ScheduledStart_WeekDay: int 1 7 2 4 4 4 6 6 6 6 ...
$ Season : int 1 1 1 1 1 1 1 1 1 1 ...
$ Pax : int 1 3 2 4 2 2 2 4 1 4 ...
$ WaitingTime : int 45 10 25 5 15 25 40 15 40 30 ...
I am splitting the dataset into 80%/20% training/test subsets using sample() and then running a random forest, excluding the BookingId factor, which is only used to validate the predictions.
set.seed(1)
index <- sample(1:nrow(data),round(0.8*nrow(data)))
train <- data[index,]
test <- data[-index,]
library(randomForest)
extractFeatures <- function(data) {
  features <- c("PickupLocality",
                "BookingSource",
                "StarCustomer",
                "ScheduledStart_Month",
                "ScheduledStart_Day",
                "ScheduledStart_WeekDay",
                "ScheduledStart_Hour",
                "Season",
                "Pax")
  fea <- data[, features]
  return(fea)
}
rf <- randomForest(extractFeatures(train), as.factor(train$WaitingTime), ntree=600, mtry=2, importance=TRUE)
The problem is that all my attempts to decrease the OOB error rate and increase the accuracy have failed. The maximum accuracy I managed to achieve was ~23%.
I tried changing the number of features used, different ntree and mtry values, different training/test ratios, and also considering only data with WaitingTime <= 40. My last attempt was to follow MrFlick's suggestion and draw the same sample size for all classes of my predicted variable (WaitingTime).
tempdata <- subset(tempdata, WaitingTime <= 40)
rndid <- with(tempdata, ave(tempdata$Season, tempdata$WaitingTime, FUN=function(x) {sample.int(length(x))}))
data <- tempdata[rndid<=27780,]
Do you know any other way I can reach an accuracy of at least 50%?
Records by WaitingTime class:
Thanks in advance!
Messing with the randomForest hyperparameters will almost assuredly not significantly increase your performance.
I would suggest using a regression approach for your data. Since waiting time isn't categorical, a classification approach may not work very well; your classification model loses the ordering information that 5 < 10 < 15, etc.
One thing to try first is a simple linear regression. Bin the predicted values from the test set and recalculate the accuracy. Better? Worse? If it's better, then go ahead and try a randomForest regression model (or, as I would prefer, gradient boosted machines).
Secondly, it's possible that your data is just not predictive of the variable you're interested in. Maybe the data got corrupted somewhere upstream. A good first diagnostic is to calculate the correlation and/or mutual information of the predictors with the outcome.
Also, with so many class labels, 23% might actually not be that bad. The probability of a particular data point being correctly labeled by a random guess is N_class/N (the class's share of the data), so the accuracy of a random-guess model is nowhere near 50%. You can calculate the adjusted Rand index to show that your model beats random guessing.
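To make the last point concrete, a base-R sketch (the per-class counts are made up, since the question's "Records by WaitingTime class" table is not shown):

```r
# Chance-level accuracy for a multi-class problem is the majority
# class's share of the data, not 50%.
# Hypothetical record counts per WaitingTime class:
counts   <- c("5" = 2000, "10" = 5000, "15" = 4000, "25" = 3000, "40" = 1000)
baseline <- max(counts) / sum(counts)
round(baseline, 2)  # 0.33 -- this, not 50%, is the bar to beat
```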

Grouping error with lmer

I have a data frame with the following structure:
> t <- read.csv("combinedData.csv")[,1:7]
> str(t)
'data.frame': 699 obs. of 7 variables:
$ Awns : int 0 0 0 0 0 0 0 0 1 0 ...
$ Funnel : Factor w/ 213 levels "MEL001","MEL002",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Plant : int 1 2 1 3 8 1 1 2 3 5 ...
$ Line : Factor w/ 8 levels "a","b","c","cA",..: 2 2 1 1 1 3 1 1 1 1 ...
$ X : int 1 2 3 4 7 8 9 10 11 12 ...
$ ID : Factor w/ 699 levels "MEL_001-1b","MEL_001-2b",..: 1 2 3 4 5 6 7 8 9 10 ...
$ BobWhite_c10082_241: int 2 2 2 2 2 2 0 2 2 0 ...
I want to construct a mixed effect model. I know in my data frame that the random effect I want to include (Funnel) is a factor, but it does not work:
> lmer(t$Awns ~ (1|t$Funnel) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Funnel within model frame: try adding grouping factor to data frame explicitly if possible
In fact this happens whatever I want to include as a random effect e.g. Plant:
> lmer(t$Awns ~ (1|t$Plant) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Plant within model frame: try adding grouping factor to data frame explicitly if possible
Why is R giving me this error? The only other explanation my google-fu turned up is that the random effect passed in wasn't a factor in the data frame. But as str() shows, t$Funnel certainly is.
It is actually not so easy to provide a convenient syntax for modeling functions and at the same time have a robust implementation. Most package authors assume that you use the data parameter, and even then scoping issues can occur. Thus, strange things can happen if you specify variables with the df$col syntax, since package authors rarely spend much effort making this work correctly and don't include many unit tests for it.
It is therefore strongly recommended to use the data parameter whenever the modeling function offers a formula interface; odd errors like yours are typical when you don't (with other model functions like lm, too).
In your example:
lmer(Awns ~ (1|Funnel) + BobWhite_c10082_241, data = t)
This not only works, but is also more convenient to write.

Getting an error "(subscript) logical subscript too long" while training SVM from e1071 package in R

I am training an SVM on my training data (the e1071 package in R). The following is the information about my data.
> str(train)
'data.frame': 891 obs. of 10 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 15 levels "capt","col","countess",..: 12 13 9 13 12 12 12 8 13 13
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 ...
$ ticket : Factor w/ 533 levels "110152","110413",..: 516 522 531 50 473 276 86 396
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 9 levels "a","b","c","d",..: 9 3 9 3 9 9 5 9 9 9 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ family : int 1 1 0 1 0 0 0 4 2 1 ...
I train it as the following.
library(e1071)
model1 <- svm(survived~.,data=train, type="C-classification")
No problem here. But when I predict as:
pred <- predict(model1,test)
I get the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
I also tried removing the "ticket" predictor from both the train and test data, but I still get the same error. What is the problem?
There might be a difference in the number of levels in one of the factors in the 'test' dataset.
Run str(test) and check that the factor variables have the same levels as the corresponding variables in the 'train' dataset.
I.e., the example below shows my.test$foo has only 4 levels:
str(my.train)
'data.frame': 554 obs. of 7 variables:
....
$ foo: Factor w/ 5 levels "C","Q","S","X","Z": 2 2 4 3 4 4 4 4 4 4 ...
str(my.test)
'data.frame': 200 obs. of 7 variables:
...
$ foo: Factor w/ 4 levels "C","Q","S","X": 3 3 3 3 1 3 3 3 3 3 ...
That's correct: the train data contains 2 blanks for embarked. Because of this there is one extra categorical level for the blanks, and that is why you are getting this error.
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The first level is the blank.
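If the blank level is indeed the culprit, one base-R remedy (a sketch with toy data, not the poster's full pipeline) is to recode blanks to NA and drop the unused level before fitting:

```r
# Recode empty-string embarked values to NA, then drop the now-unused
# blank level so train and test can share the same set of levels.
embarked <- factor(c("S", "C", "", "S", "Q", ""))
embarked[embarked == ""] <- NA
embarked <- droplevels(embarked)
levels(embarked)  # "C" "Q" "S" -- the blank level is gone
```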
I encountered the same problem today. It turned out that the svm model in the e1071 package expects rows to be the observations: one row is one sample, not one column. If you use columns as samples and rows as variables, this error will occur.
Probably your data is good (no new levels in the test data) and you just need a small trick; then the prediction will work fine.
test.df = rbind(train.df[1,],test.df)
test.df = test.df[-1,]
This trick was from R Random Forest - type of predictors in new data do not match.
I encountered this problem today, used the trick above, and it solved the problem.
I have been playing with that data set as well. I know this was a long time ago, but one thing you can do is explicitly include only the columns you feel will add to the model, like so:
fit <- svm(Survived~Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train)
This eliminated the problem for me by dropping columns (like ticket number) that contribute no relevant data.
Another issue that, once fixed, resolved my code: I had forgotten to make some of my independent variables factors.
