Grouping error with lmer - r

I have a data frame with the following structure:
> t <- read.csv("combinedData.csv")[,1:7]
> str(t)
'data.frame': 699 obs. of 7 variables:
$ Awns : int 0 0 0 0 0 0 0 0 1 0 ...
$ Funnel : Factor w/ 213 levels "MEL001","MEL002",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Plant : int 1 2 1 3 8 1 1 2 3 5 ...
$ Line : Factor w/ 8 levels "a","b","c","cA",..: 2 2 1 1 1 3 1 1 1 1 ...
$ X : int 1 2 3 4 7 8 9 10 11 12 ...
$ ID : Factor w/ 699 levels "MEL_001-1b","MEL_001-2b",..: 1 2 3 4 5 6 7 8 9 10 ...
$ BobWhite_c10082_241: int 2 2 2 2 2 2 0 2 2 0 ...
I want to fit a mixed-effects model. The random effect I want to include (Funnel) is a factor in my data frame, but the call fails:
> lmer(t$Awns ~ (1|t$Funnel) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Funnel within model frame: try adding grouping factor to data frame explicitly if possible
In fact this happens whatever I want to include as a random effect e.g. Plant:
> lmer(t$Awns ~ (1|t$Plant) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Plant within model frame: try adding grouping factor to data frame explicitly if possible
Why is R giving me this error? The only other answer I could find by searching says that the random effect passed in wasn't a factor in the data frame. But as str shows, t$Funnel certainly is.

It is actually not easy to provide a convenient syntax for modeling functions and a robust implementation at the same time. Most package authors assume that you use the data parameter, and even then scoping issues can occur. If you specify variables with DF$col syntax, strange things can happen, because package authors rarely put much effort into making this work correctly and rarely include many unit tests for it.
It is therefore strongly recommended to use the data parameter whenever the model function offers a formula method. If you don't follow that practice, you can run into problems with other model functions too, such as lm.
In your example:
lmer(Awns ~ (1|Funnel) + BobWhite_c10082_241, data = t)
This not only works, but is also more convenient to write.
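As a small illustration of the kind of surprise meant here, the sketch below uses lm from base R instead of lmer, so it runs without lme4. The data frame df and its columns x and y are made up for the example. With the DF$col syntax, predict cannot map newdata onto the model terms and silently falls back to the original data:

```r
# Hypothetical toy data, only for illustration
set.seed(1)
df <- data.frame(x = 1:10)
df$y <- 2 * df$x + rnorm(10)

# Same model, specified in two ways
fit_dollar <- lm(df$y ~ df$x)        # DF$col syntax
fit_data   <- lm(y ~ x, data = df)   # data argument

new <- data.frame(x = c(11, 12, 13))

# With the data argument, predict() uses the 3 new rows
p_data <- predict(fit_data, newdata = new)

# With DF$col syntax, the term `df$x` is looked up in the calling
# environment, so the 10 original rows are used (R only warns about it)
p_dollar <- suppressWarnings(predict(fit_dollar, newdata = new))

length(p_data)
length(p_dollar)
```

The two fits have the same coefficients; only the later use of the fitted object differs.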

Related

Single input data.frame instead of testset.csv fails in randomForest code in R

I have written an R script that runs and predicts successfully, but only when a CSV with multiple entries is passed as input to the classifier.
training_set = read.csv('finaldata.csv')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[-5],
y = training_set$Song,
ntree = 50)
test_set = read.csv('testSet.csv')
y_pred = predict(classifier, newdata = test_set)
The code above runs successfully, but instead of giving 10+ inputs to the classifier, I want to pass a data.frame with a single row as input. That works with other classifiers but not with this one. Why?
The following code doesn't work and throws an error:
y_pred = predict(classifier, data.frame(Emot="happy",Pact="Walking",Mact="nothing",Session="morning"))
Error in predict.randomForest(classifier, data.frame(Emot = "happy", :
Type of predictors in new data do not match that of the training data.
I even tried keeping a single entry in testinput.csv, and it still throws the same error. How can I solve it? This code is the back end of another program of mine, and I want to pass only a single entry as the test input to predict results. All columns are factors in both the training and the test set. Help appreciated.
PS: Previous solutions to the same error didn't help me.
str(test_set)
'data.frame': 1 obs. of 5 variables:
$ Emot : Factor w/ 1 level "fear": 1
$ Pact : Factor w/ 1 level "Bicycling": 1
$ Mact : Factor w/ 1 level "browsing": 1
$ Session: Factor w/ 1 level "morning": 1
$ Song : Factor w/ 1 level "Dusk Till Dawn.mp3": 1
str(training_set)
'data.frame': 1052 obs. of 5 variables:
$ Emot : Factor w/ 8 levels "anger","contempt",..: 4 7 6 6 4 3 4 6 4 6 ...
$ Pact : Factor w/ 5 levels "Bicycling","Driving",..: 1 2 2 2 4 3 1 1 3 4 ...
$ Mact : Factor w/ 6 levels "browsing","chatting",..: 1 6 1 4 5 1 5 6 6 6 ...
$ Session: Factor w/ 4 levels "afternoon","evening",..: 3 4 3 2 1 3 1 1 2 1 ...
$ Song : Factor w/ 101 levels "Aaj Ibaadat.mp3",..: 29 83 47 72 29 75 77 8 30 53 ...
OK, this worked, although it is a weird solution: equalize the factor levels of the training and test sets. The following code binds the first row of the training set to the test set and then deletes it.
test_set <- rbind(training_set[1, ] , test_set)
test_set <- test_set[-1,]
Done! It works for a single input as well as a single-entry .csv file, without raising the error in the model.
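The rbind trick works because rbind on data frames merges factor levels, so the test set ends up with the training levels. A more direct way to get the same effect, sketched here with made-up two-column data (the column values are assumptions, not the real finaldata.csv), is to rebuild each test factor with the training levels:

```r
# Hypothetical stand-ins for the real training and test data
training_set <- data.frame(
  Emot    = factor(c("anger", "fear", "happy")),
  Session = factor(c("morning", "evening", "morning"))
)
test_set <- data.frame(
  Emot    = factor("fear"),
  Session = factor("morning")
)

# Give every factor column in the test set the levels seen in training
for (col in names(test_set)) {
  if (is.factor(training_set[[col]])) {
    test_set[[col]] <- factor(test_set[[col]],
                              levels = levels(training_set[[col]]))
  }
}

str(test_set)
```

Note that a value in the test set that never appeared in training becomes NA with this approach, which at least makes the mismatch visible.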

R multiway split trees using ctree {partykit}

I want to analyze my data with conditional inference trees using the ctree function from partykit. I specifically went for this function because, if I understood correctly, it's one of the few that allow multiway splits. I need this option because all of my variables are multilevel (unordered) categorical variables.
However, trying to enable multiway split using ctree_control gives the following error:
aufprallentree <- ctree(case ~., data = aufprallen,
control = ctree_control(minsplit = 10, minbucket = 5, multiway = TRUE))
## Error in 1:levels(x) : NA/NaN argument
## In addition: Warning messages:
## 1: In 1:levels(x) :
## numerical expression has 4 elements: only the first used
## 2: In partysplit(as.integer(isel), index = 1:levels(x)) :
## NAs introduced by coercion
Does anyone know how to solve this? Or am I mistaken, and ctree does not allow multiway splits?
For clarity, an overview of my data: (no NAs)
str(aufprallen)
## 'data.frame': 299 obs. of 10 variables:
## $ prep : Factor w/ 6 levels "an","auf","hinter",..: 2 2 2 2 2 2 1 2 2 2 ...
## $ prep_main : Factor w/ 2 levels "auf","other": 1 1 1 1 1 1 2 1 1 1 ...
## $ case : Factor w/ 2 levels "acc","dat": 1 1 2 1 1 1 2 1 1 1 ...
## $ sense : Factor w/ 3 levels "crashdown","crashinto",..: 2 2 1 3 2 2 1 2 1 2 ...
## $ PO_type : Factor w/ 4 levels "object","region",..: 4 4 3 1 4 4 3 4 3 4 ...
## $ PO_type2 : Factor w/ 3 levels "object","region",..: 1 1 3 1 1 1 3 1 3 1 ...
## $ perfectivity : Factor w/ 2 levels "imperfective",..: 1 1 2 2 1 1 1 1 1 1 ...
## $ mit_Körperteil: Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
## $ PP_place : Factor w/ 4 levels "back","front",..: 4 1 1 1 1 1 1 1 1 1 ...
## $ PP_place_main : Factor w/ 3 levels "marked","rel",..: 2 3 3 3 3 3 3 3 3 3 ...
Thanks in advance!
A couple of remarks:
The error with 1:levels(x) was a bug in ctree. The code should have been 1:nlevels(x). I just fixed this on R-Forge - so you can check out the SVN from there and manually install the package if you want to use the option now. (Contact me off-list if you need more details on this.) Torsten will probably also make a new CRAN release in the next weeks.
Another function that can learn binary classification trees with multiway splits is glmtree in the partykit package. The code would be glmtree(case ~ ., data = aufprallen, family = binomial, catsplit = "multiway", minsize = 5). It uses parameter instability tests instead of conditional inference for association to determine the splitting variables and adopts the formal likelihood. But in many cases the results are fairly similar to ctree.
In both algorithms, the multiway splits are very basic: If a categorical variable is selected for splitting, then no split selection is done at all. Instead all categories get their own daughter node. There are algorithms that try to determine optimal groupings of categories with a data-driven number of daughter nodes (between 2 and the number of categories).
Even though you have categorical predictor variables with more than two levels you don't need multiway splits. Many algorithms just use binary splits because any multiway split can be represented by a sequence of binary splits. In many datasets, however, it turns out that it is beneficial to not separate all but just a few of the categories in a splitting factor.
Overall my recommendation would be to start out with standard conditional inference trees with binary splits only. And only if it turns out that this leads to many binary splits in the same factor, then I would go on to explore multiway splits.
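The bug described in the first remark is easy to reproduce in base R without partykit. The factor below is a made-up example; levels() returns the character labels, while nlevels() returns their count, which is what an index sequence needs:

```r
# Hypothetical factor with four levels
x <- factor(c("object", "region", "surface", "other"))

# Buggy form: levels(x) is a character vector, so 1:levels(x) fails.
# The error message is captured instead of stopping the script.
buggy <- tryCatch(
  suppressWarnings(1:levels(x)),
  error = function(e) conditionMessage(e)
)

# Fixed form: one index per level
fixed <- seq_len(nlevels(x))

print(buggy)
print(fixed)
```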

Can't use glht post-hoc with repeated measures ANOVA in R?

I have a data frame with this structure:
'data.frame': 39 obs. of 3 variables:
$ topic : Factor w/ 13 levels "Acido Folico",..: 1 2 3 4 5 6 7 8 9 10 ...
$ variable: Factor w/ 3 levels "Both","Preconception",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : int 14 1 36 17 5 9 19 9 19 25 ...
and I want to test the effect value ~ variable, considering that observations are grouped by topic. So I thought of using a repeated measures ANOVA, where "variable" is treated as a repeated measure on every topic.
The call is aov(value ~ variable + Error(topic/variable)).
Up to this point everything is OK.
Then I wanted to perform a post-hoc test with glht(model, linfct= mcp(variable = 'Tukey')), but I receive two errors:
‘glht’ does not support objects of class ‘aovlist’
no ‘model.matrix’ method for ‘model’ found!
Since removing the Error term makes the errors go away, I suppose that is the problem.
So, how can I perform a post-hoc test over a repeated measure anova?
Thanks!

Assign list of attributes() to sublist in R

I have a data frame called 'situations' containing a list of attributes.
> str(situations)
'data.frame': 24 obs. of 8 variables:
$ ID.SITUATION : Factor w/ 24 levels "cnf_01_be","cnf_02_ch",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ELICITATION.D : Factor w/ 2 levels "NATUREL","SEMI.DIRIGE": 1 1 1 1 1 1 1 1 2 2 ...
$ INTERLOCUTEUR.C : Factor w/ 3 levels "DIALOGUE","MONOLOGUE",..: 2 2 2 2 3 3 3 3 1 1 ...
$ PREPARATION.D : Factor w/ 3 levels "PREPARE","SEMI.PREPARE",..: 2 2 2 2 3 3 3 3 3 3 ...
$ INTERACTIVITE.D : Factor w/ 3 levels "INTERACTIF","NON. INTERACTIF",..: 2 2 2 2 1 1 1 1 3 3 ...
$ MEDIATISATION.D : Factor w/ 3 levels "MEDIATIQUE","NON.MEDIATIQUE",..: 2 2 2 2 2 2 2 2 2 2 ...
$ PROFESSIONNALISATION.C: Factor w/ 1 level "PRO": 1 1 1 1 1 1 1 1 1 1 ...
$ ID.TASK : Factor w/ 5 levels "conference scientifique",..: 1 1 1 1 2 2 2 2 3 3 ...
I have as many observations in this data frame (24) as I have sublists in a given corpus.
The ID.SITUATION values (e.g. cnf_01_be) correspond to the names of the sublists (e.g. cnf_01_be).
I know how to assign attributes individually:
attributes(corpus$cnf_01_be) = situations[1,]
attributes(corpus$cnf_02_ch) = situations[2,]
And retrieve them for a specific purpose :
attr(corpus$cnf_01_be, "ELICITATION.D")
attr(corpus$cnf_02_ch, "ELICITATION.D")
attr(corpus$cnf_02_ch, "PREPARATION.D")
But how can I use, for example, lapply to assign attributes automatically to all the sublists in my corpus?
I feel like all my attempts are going in the wrong direction:
setattr <- function(x,y) {
attributes(x) <- situations[y,]
return(attributes)
}
...or...
lapply(corpus,setattr)
lapply(corpus, attributes(corpus) <- situations[c(1:length(situations[,1])),])
Thanks in advance!
The main problem with using lapply (and similar approaches) is that they cannot normally change the original object of interest, but rather return a new structure. So if you already have a list "corpus" and just want to change its members' attributes, you can't usually do that inside a function.
One way to overcome this limitation is to use an eval.parent() call instead of the usual assignment. This function evaluates the assignment expression in the parent environment (the environment that called the function), rather than on the local instances (copies) of the objects you assign to. If you use this, you don't have to return any value.
Another option would be to create a local copy of your corpus list within the function, add all the attributes to it, then return the whole structure from the function and use it to replace the old list. If your list is big or complex, this is probably not a wise choice.
Here is some code that does it. Note that this is ugly code. I'm still looking to see if I can make it simpler, but because of the issues above, I'm not sure there is a much simpler option. Anyway, I hope the following will do the trick for you:
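f = function(lname, data) {
  # names of the sublists, looked up in the caller's environment
  snames = eval.parent(parse(text = paste("names(", lname, ")")))
  for (xn in snames) {
    # row of `data` whose ID.SITUATION equals the sublist name (NA if none)
    idx = match(xn, as.character(data$ID.SITUATION))
    if (!is.na(idx)) {
      # global temporary, so the evaluated expression can see it
      tmp___ <<- data[idx, ]
      # quote the sublist name so it is used as a string index
      cmm = paste0("attributes(", lname, "[['", xn, "']]) = tmp___")
      eval.parent(parse(text = cmm))
    }
  }
}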
Note that in order to use it you need to supply your list name (as a character string, and not as a variable), and your data frame. In your case the call would be:
f("corpus",situations)
I hope this helps.
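For comparison, here is a minimal sketch of the second option mentioned above (build the modified list and assign it back). The two-element corpus and the two-column situations data frame are made up for the example, and the sketch assumes every sublist name has a matching ID.SITUATION. Unlike attributes(x) <- ..., it adds attributes one by one with attr(), so any attributes already present on a sublist are kept:

```r
# Hypothetical miniature versions of the real objects
corpus <- list(cnf_01_be = c(1, 2), cnf_02_ch = c(3, 4))
situations <- data.frame(
  ID.SITUATION  = c("cnf_01_be", "cnf_02_ch"),
  ELICITATION.D = c("NATUREL", "SEMI.DIRIGE"),
  stringsAsFactors = TRUE
)

# Assumption of this sketch: every sublist has a row in situations
stopifnot(all(names(corpus) %in% situations$ID.SITUATION))

# Map walks over the sublists and their names in parallel and returns
# a new named list, which then replaces the old corpus
corpus <- Map(function(x, id) {
  row <- situations[match(id, situations$ID.SITUATION), , drop = FALSE]
  for (col in names(row)) {
    attr(x, col) <- row[[col]]   # one attribute per column
  }
  x
}, corpus, names(corpus))

attr(corpus$cnf_02_ch, "ELICITATION.D")
```

This copies the list once, which is usually acceptable for a corpus of a few dozen sublists.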

Getting an error "(subscript) logical subscript too long" while training SVM from e1071 package in R

I am training an SVM on my training data (e1071 package in R). Here is the structure of my data:
> str(train)
'data.frame': 891 obs. of 10 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 15 levels "capt","col","countess",..: 12 13 9 13 12 12 12 8 13 13
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 ...
$ ticket : Factor w/ 533 levels "110152","110413",..: 516 522 531 50 473 276 86 396
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 9 levels "a","b","c","d",..: 9 3 9 3 9 9 5 9 9 9 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ family : int 1 1 0 1 0 0 0 4 2 1 ...
I train it as follows:
library(e1071)
model1 <- svm(survived~.,data=train, type="C-classification")
No problem here. But when I predict as:
pred <- predict(model1,test)
I get the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
I also tried removing the "ticket" predictor from both the train and test data, but I still get the same error. What is the problem?
There might be a difference in the number of levels in one of the factors in the 'test' dataset.
Run str(test) and check that the factor variables have the same levels as the corresponding variables in the 'train' dataset.
For example, the output below shows that my.test$foo has only 4 levels:
str(my.train)
'data.frame': 554 obs. of 7 variables:
....
$ foo: Factor w/ 5 levels "C","Q","S","X","Z": 2 2 4 3 4 4 4 4 4 4 ...
str(my.test)
'data.frame': 200 obs. of 7 variables:
...
$ foo: Factor w/ 4 levels "C","Q","S","X": 3 3 3 3 1 3 3 3 3 3 ...
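Instead of comparing the str output by eye, the same check can be done programmatically. This is a minimal sketch in base R; my.train and my.test are tiny made-up data frames that mirror the foo example above, and mismatched is a name introduced here:

```r
# Hypothetical data mirroring the example: foo has 5 levels in train, 4 in test
my.train <- data.frame(foo = factor(c("C", "Q", "S", "X", "Z")), n = 1:5)
my.test  <- data.frame(foo = factor(c("S", "C", "Q", "X")),      n = 1:4)

# Names of the factor columns in the training data
factor_cols <- names(my.train)[vapply(my.train, is.factor, logical(1))]

# TRUE for each factor column whose levels are identical in both sets
same_levels <- vapply(factor_cols, function(col) {
  identical(levels(my.train[[col]]), levels(my.test[[col]]))
}, logical(1))

# Columns that need attention before calling predict()
mismatched <- factor_cols[!same_levels]
print(mismatched)
```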
That's correct. The train data contains 2 blanks for embarked, so there is one extra categorical level for the blanks, and that is why you are getting this error:
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The first level is blank.
I encountered the same problem today. It turned out that the svm model in the e1071 package expects samples in rows, meaning one row is one sample, not one column. If you use columns as samples and rows as variables, this error will occur.
Your data is probably fine (no new levels in the test data), and you just need a small trick to make prediction work:
test.df = rbind(train.df[1,],test.df)
test.df = test.df[-1,]
This trick is from "R Random Forest - type of predictors in new data do not match".
I hit this problem today, applied the trick above, and it solved the problem.
I have been playing with that data set as well. I know this was a long time ago, but one thing you can do is explicitly include only the columns you think will add to the model, like so:
fit <- svm(Survived~Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train)
This eliminated the problem for me by dropping columns that contribute nothing to the model (like the ticket number).
Another issue that I had to fix in my code was that I had forgotten to convert some of my independent variables to factors.
