Multiway split trees using ctree {partykit} in R

I want to analyze my data with conditional inference trees using the ctree function from partykit. I specifically chose this function because, if I understood correctly, it is one of the few that allows multiway splits. I need this option because all of my variables are multilevel (unordered) categorical variables.
However, trying to enable multiway split using ctree_control gives the following error:
aufprallentree <- ctree(case ~ ., data = aufprallen,
                        control = ctree_control(minsplit = 10, minbucket = 5, multiway = TRUE))
## Error in 1:levels(x) : NA/NaN argument
## In addition: Warning messages:
## 1: In 1:levels(x) :
## numerical expression has 4 elements: only the first used
## 2: In partysplit(as.integer(isel), index = 1:levels(x)) :
## NAs introduced by coercion
Does anyone know how to solve this? Or am I mistaken, and ctree does not allow multiway splits?
For clarity, an overview of my data: (no NAs)
str(aufprallen)
## 'data.frame': 299 obs. of 10 variables:
## $ prep : Factor w/ 6 levels "an","auf","hinter",..: 2 2 2 2 2 2 1 2 2 2 ...
## $ prep_main : Factor w/ 2 levels "auf","other": 1 1 1 1 1 1 2 1 1 1 ...
## $ case : Factor w/ 2 levels "acc","dat": 1 1 2 1 1 1 2 1 1 1 ...
## $ sense : Factor w/ 3 levels "crashdown","crashinto",..: 2 2 1 3 2 2 1 2 1 2 ...
## $ PO_type : Factor w/ 4 levels "object","region",..: 4 4 3 1 4 4 3 4 3 4 ...
## $ PO_type2 : Factor w/ 3 levels "object","region",..: 1 1 3 1 1 1 3 1 3 1 ...
## $ perfectivity : Factor w/ 2 levels "imperfective",..: 1 1 2 2 1 1 1 1 1 1 ...
## $ mit_Körperteil: Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
## $ PP_place : Factor w/ 4 levels "back","front",..: 4 1 1 1 1 1 1 1 1 1 ...
## $ PP_place_main : Factor w/ 3 levels "marked","rel",..: 2 3 3 3 3 3 3 3 3 3 ...
Thanks in advance!

A couple of remarks:
The error with 1:levels(x) was a bug in ctree. The code should have been 1:nlevels(x). I just fixed this on R-Forge, so you can check out the SVN from there and install the package manually if you want to use the option now. (Contact me off-list if you need more details.) Torsten will probably also make a new CRAN release in the coming weeks.
Another function that can learn binary classification trees with multiway splits is glmtree in the partykit package. The call would be glmtree(case ~ ., data = aufprallen, family = binomial, catsplit = "multiway", minsize = 5). It uses parameter instability tests, rather than conditional inference tests of association, to select the splitting variables, and it adopts a formal likelihood framework. In many cases, though, the results are fairly similar to those of ctree.
In both algorithms, the multiway splits are very basic: If a categorical variable is selected for splitting, then no split selection is done at all. Instead all categories get their own daughter node. There are algorithms that try to determine optimal groupings of categories with a data-driven number of daughter nodes (between 2 and the number of categories).
Even though you have categorical predictor variables with more than two levels, you don't necessarily need multiway splits. Many algorithms use only binary splits, because any multiway split can be represented by a sequence of binary splits. In many datasets, moreover, it turns out to be beneficial to separate only some of the categories of a splitting factor rather than all of them at once.
Overall, my recommendation would be to start out with standard conditional inference trees with binary splits only. Only if that leads to many binary splits in the same factor would I go on to explore multiway splits.
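To make that concrete, here is a minimal sketch. Since the aufprallen data are not available, it uses simulated stand-in factors with generic level names; the minsplit/minbucket/minsize values are taken from the question.

```r
library(partykit)

# simulated stand-in for the asker's data: a two-level response and
# several unordered multilevel factors
set.seed(1)
aufprallen <- data.frame(
  case  = factor(sample(c("acc", "dat"), 299, replace = TRUE)),
  prep  = factor(sample(letters[1:6], 299, replace = TRUE)),
  sense = factor(sample(c("crashdown", "crashinto", "other"), 299, replace = TRUE))
)

# recommended starting point: default binary splits
tr <- ctree(case ~ ., data = aufprallen,
            control = ctree_control(minsplit = 10, minbucket = 5))

# alternative with multiway splits on categorical predictors
gt <- glmtree(case ~ ., data = aufprallen, family = binomial,
              catsplit = "multiway", minsize = 5)
```

With noise data like this, both trees will typically consist of just a root node; the point is only the shape of the two calls.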

Related

Convert dummy variable from numeric to factor for chi-square test in R

I want to perform a chi-square test in R using the following dataset, after creating dummy variables. The p-value I get from the chi-square test is 1, which is incorrect. I suspect this is because after dummy variable creation the data structure changes from factor to numeric. This is a hypothesis-testing question that asks whether the defective percentage varies across the 4 countries at the 5% significance level. Please advise what the possible error is and what the solution is.
Subset of datasets used
Phillippines Indonesia Malta India
Error Free Error Free Defective Error Free
Error Free Error Free Error Free Defective
Error Free Defective Defective Error Free
Error Free Error Free Error Free Error Free
Error Free Error Free Defective Error Free
Error Free Error Free Error Free Error Free
The structure of the initial data is factor:
> str(data)
'data.frame': 300 obs. of 4 variables:
$ Phillippines: Factor w/ 2 levels "Defective","Error Free": 2 2 2 2 2 2 2 2 2 2 ...
$ Indonesia : Factor w/ 2 levels "Defective","Error Free": 2 2 1 2 2 2 1 2 2 2 ...
$ Malta : Factor w/ 2 levels "Defective","Error Free": 1 2 1 2 1 2 2 2 2 2 ...
$ India : Factor w/ 2 levels "Defective","Error Free": 2 1 2 2 2 2 2 2 2 2 …
I create dummy variables for the categorical data (Error Free / Defective) with the following code:
library(caret)
dmy <- dummyVars("~ .", data = data, fullRank = T)
trsf <- data.frame(predict(dmy, newdata = data))
After dummy variable creation, the dummy variables are numeric:
> str(trsf)
'data.frame': 300 obs. of 4 variables:
$ Phillippines.Error.Free: num 1 1 1 1 1 1 1 1 1 1 ...
$ Indonesia.Error.Free : num 1 1 0 1 1 1 0 1 1 1 ...
$ Malta.Error.Free : num 0 1 0 1 0 1 1 1 1 1 ...
$ India.Error.Free : num 1 0 1 1 1 1 1 1 1 1 ...
The p-value of the chi-square test is 1:
> chisq.test(trsf)
Pearson's Chi-squared test
data: trsf
X-squared = 112.75, df = 897, p-value = 1
Warning message:
In chisq.test(trsf) : Chi-squared approximation may be incorrect
I tried applying as.factor and performing the chi-square test, but I get the following error:
trsf_2 <- as.factor(trsf)
str(trsf_2)
Factor w/ 4 levels "c(1, 1, 1, 1, 1, 0, 0, 0, 0, 1)",..: NA NA NA NA
- attr(*, "names")= chr [1:4] "Phillippines.Error.Free" "Indonesia.Error.Free" "Malta.Error.Free" "India.Error.Free"
> chisq.test(trsf_2)
Error in chisq.test(trsf_2) :
all entries of 'x' must be nonnegative and finite
In addition: Warning message:
In Ops.factor(x, 0) : ‘<’ not meaningful for factors
You could try
dataset <- as.data.frame(lapply(data, as.numeric))
chisq.test(dataset)
However, I am not sure that the chi-square test is the most appropriate method for binary variables. May I suggest the Phi coefficient? You can find information here:
https://en.wikipedia.org/wiki/Phi_coefficient
However, you will need to write a loop if you do not want to compute it manually for each pair of variables (i.e., countries).
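For what it's worth, a sketch of a more direct route: keep the factor columns, tabulate Defective vs. Error Free per country, and run chisq.test on the resulting 2 x 4 contingency table, which is the usual way to test whether a proportion varies across groups. The data frame below is a simulated stand-in, since only a subset of the real data is shown.

```r
set.seed(1)
outcomes <- c("Defective", "Error Free")
data <- data.frame(
  Phillippines = factor(sample(outcomes, 300, replace = TRUE, prob = c(0.10, 0.90))),
  Indonesia    = factor(sample(outcomes, 300, replace = TRUE, prob = c(0.20, 0.80))),
  Malta        = factor(sample(outcomes, 300, replace = TRUE, prob = c(0.30, 0.70))),
  India        = factor(sample(outcomes, 300, replace = TRUE, prob = c(0.10, 0.90)))
)

# 2 x 4 contingency table: rows are outcomes, columns are countries
counts <- sapply(data, table)

chisq.test(counts)  # tests whether the defective proportion varies by country
```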

Adding random term into glmer mixed-effect model; error message: failure to converge

I'm analyzing data from an experiment, replicated in time, where I measured plant emergence at the soil surface. I had 3 experimental runs, represented by the term trialnum, and would like to include trialnum as a random effect.
Here is a summary of variables involved:
data.frame: 768 obs. of 9 variables:
$ trialnum : Factor w/ 2 levels "2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ Flood : Factor w/ 4 levels "0","5","10","15": 2 2 2 2 2 2 1 1 1 1 ...
$ Burial : Factor w/ 4 levels "1.3","2.5","5",..: 3 3 3 3 3 3 4 4 4 4 ...
$ biotype : Factor w/ 6 levels "0","1","2","3",..: 1 2 3 4 5 6 1 2 3 4 ...
$ soil : int 0 0 0 0 0 0 0 0 0 0 ...
$ n : num 15 15 15 15 15 15 15 15 15 15 ...
Where trialnum is the experimental run, Flood, Burial, and biotype are input/independent variables, and soil is the response/dependent variable.
I previously created this model with all input variables:
glmfitALL <- glm(cbind(soil, n) ~ trialnum*Flood*Burial*biotype, family = binomial(logit), data = total)
From this model, by running
anova(glmfitALL, test = "Chisq")
I found that trialnum is significant. There were 3 experimental runs; I'm only including 2 of them in my analysis. I have been advised to incorporate trialnum as a random effect so that I do not have to report the experimental runs separately.
To do this, I created the following model:
glmerfitALL <- glmer(cbind(soil, n) ~ Flood*Burial*biotype + (1|trialnum),
                     data = total,
                     family = binomial(logit),
                     control = glmerControl(optimizer = "bobyqa"))
From this I get the following error message:
maxfun < 10 * length(par)^2 is not recommended.
unable to evaluate scaled gradient
Model failed to converge: degenerate Hessian with 9 negative eigenvalues
I have tried running this model in a variety of ways including:
glmerfitALL <- glmer(cbind(soil, n) ~ Flood*Burial*biotype*(1|trialnum),
                     data = total,
                     family = binomial(logit),
                     control = glmerControl(optimizer = "bobyqa"))
as well as setting REML = FALSE and using optimx in place of bobyqa, but every variant resulted in a similar error message.
Because this is an "eigenvalue" error, does that mean there is a problem with my source file/original data?
I also found previous threads regarding lme4 error messages (sorry, I did not save the link), and saw some comments raising the issue of too few replicates of the random effect. Because I only have 2 replicates, trialnum2 and trialnum3, can I even use trialnum as a random effect?
Regarding the eigenvalue error, the chief recommendation is centering and/or scaling the predictors.
Regarding the number of random-effect groups, around five is an approximate minimum; with only two levels, the random-intercept variance is very hard to estimate.
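Given that, one pragmatic alternative is to keep trialnum as a fixed blocking factor in a plain glm rather than a random effect. A sketch with simulated stand-in data (the factor levels for Burial are placeholders, not the asker's): note also that the two-column binomial response should be (successes, failures), i.e. cbind(soil, n - soil) if n is the total count, not cbind(soil, n).

```r
set.seed(1)
# placeholder design with the same structure as the question
total <- expand.grid(trialnum = factor(c("2", "3")),
                     Flood    = factor(c("0", "5", "10", "15")),
                     Burial   = factor(paste0("b", 1:4)),
                     biotype  = factor(0:5))
total$n    <- 15
total$soil <- rbinom(nrow(total), size = total$n, prob = 0.4)

# binomial response as (successes, failures); trialnum as a fixed block
glmfit <- glm(cbind(soil, n - soil) ~ trialnum + Flood * Burial * biotype,
              family = binomial(logit), data = total)
```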

Single input data.frame instead of testset.csv fails in randomForest code in R

I have written an R script which runs successfully and predicts output, but only when a CSV with multiple entries is passed as input to the classifier.
training_set = read.csv('finaldata.csv')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[-5],
                          y = training_set$Song,
                          ntree = 50)
test_set = read.csv('testSet.csv')
y_pred = predict(classifier, newdata = test_set)
The above code runs successfully, but instead of giving 10+ inputs to the classifier, I want to pass a single-row data.frame as input. That works with other classifiers, but not with this one. Why?
So the following code doesn't work and throws an error:
y_pred = predict(classifier, data.frame(Emot="happy",Pact="Walking",Mact="nothing",Session="morning"))
Error in predict.randomForest(classifier, data.frame(Emot = "happy", :
Type of predictors in new data do not match that of the training data.
I even tried keeping a single entry in testinput.csv, and it still throws the same error! How do I solve this? This code is the back-end of another program of mine, and I want to pass only a single entry as the test input for prediction. Also, all columns are factors in the training as well as the test set. Help appreciated.
PS: Previous solutions to the same error didn't help me.
str(test_set)
'data.frame': 1 obs. of 5 variables:
$ Emot : Factor w/ 1 level "fear": 1
$ Pact : Factor w/ 1 level "Bicycling": 1
$ Mact : Factor w/ 1 level "browsing": 1
$ Session: Factor w/ 1 level "morning": 1
$ Song : Factor w/ 1 level "Dusk Till Dawn.mp3": 1
str(training_set)
'data.frame': 1052 obs. of 5 variables:
$ Emot : Factor w/ 8 levels "anger","contempt",..: 4 7 6 6 4 3 4 6 4 6 ...
$ Pact : Factor w/ 5 levels "Bicycling","Driving",..: 1 2 2 2 4 3 1 1 3 4 ...
$ Mact : Factor w/ 6 levels "browsing","chatting",..: 1 6 1 4 5 1 5 6 6 6 ...
$ Session: Factor w/ 4 levels "afternoon","evening",..: 3 4 3 2 1 3 1 1 2 1 ...
$ Song : Factor w/ 101 levels "Aaj Ibaadat.mp3",..: 29 83 47 72 29 75 77 8 30 53 ...
OK, this worked, though it is a weird solution: equalize the factor levels of the training and test sets. The following code binds the first row of the training set onto the test set (which copies the training factor levels over) and then deletes it:
test_set <- rbind(training_set[1, ] , test_set)
test_set <- test_set[-1,]
Done! It works for a single input as well as a single-entry .csv file, without causing errors in the model.
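An alternative sketch that avoids the rbind trick is to coerce each test column to a factor carrying the training levels. align_levels is a made-up helper name, and the two small data frames are toy stand-ins for the asker's sets:

```r
# coerce factor columns of `test` to use the factor levels from `train`
align_levels <- function(test, train) {
  for (nm in intersect(names(test), names(train))) {
    if (is.factor(train[[nm]])) {
      test[[nm]] <- factor(test[[nm]], levels = levels(train[[nm]]))
    }
  }
  test
}

# toy stand-ins for the training set and the single-row test set
training_set <- data.frame(Emot = factor(c("happy", "fear", "anger")),
                           Pact = factor(c("Walking", "Bicycling", "Driving")))
test_set <- data.frame(Emot = "happy", Pact = "Walking")

test_set <- align_levels(test_set, training_set)
levels(test_set$Emot)  # now the full training levels, not just "happy"
```

After this, predict() sees identical factor structures in training and test data, which is what randomForest checks for.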

Grouping error with lmer

I have a data frame with the following structure:
> t <- read.csv("combinedData.csv")[,1:7]
> str(t)
'data.frame': 699 obs. of 7 variables:
$ Awns : int 0 0 0 0 0 0 0 0 1 0 ...
$ Funnel : Factor w/ 213 levels "MEL001","MEL002",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Plant : int 1 2 1 3 8 1 1 2 3 5 ...
$ Line : Factor w/ 8 levels "a","b","c","cA",..: 2 2 1 1 1 3 1 1 1 1 ...
$ X : int 1 2 3 4 7 8 9 10 11 12 ...
$ ID : Factor w/ 699 levels "MEL_001-1b","MEL_001-2b",..: 1 2 3 4 5 6 7 8 9 10 ...
$ BobWhite_c10082_241: int 2 2 2 2 2 2 0 2 2 0 ...
I want to construct a mixed effect model. I know in my data frame that the random effect I want to include (Funnel) is a factor, but it does not work:
> lmer(t$Awns ~ (1|t$Funnel) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Funnel within model frame: try adding grouping factor to data frame explicitly if possible
In fact this happens whatever I want to include as a random effect e.g. Plant:
> lmer(t$Awns ~ (1|t$Plant) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Plant within model frame: try adding grouping factor to data frame explicitly if possible
Why is R giving me this error? The only other answer my googling turned up is that the random effect wasn't a factor in the data frame. But as str shows, t$Funnel certainly is.
It is actually not so easy to provide a convenient syntax for modeling functions and at the same time have a robust implementation. Most package authors assume that you use the data parameter, and even then scoping issues can occur. Thus, strange things can happen if you specify variables with the DF$col syntax, since package authors rarely spend much effort making this work correctly and don't include many unit tests for it.
It is therefore strongly recommended to use the data parameter if the modeling function offers a formula interface. Strange things can happen if you don't follow that practice (also with other modeling functions like lm).
In your example:
lmer(Awns ~ (1|Funnel) + BobWhite_c10082_241, data = t)
This not only works, but is also more convenient to write.

Can't use glht post-hoc with repeated measures ANOVA in R?

I have a data frame with this structure:
'data.frame': 39 obs. of 3 variables:
$ topic : Factor w/ 13 levels "Acido Folico",..: 1 2 3 4 5 6 7 8 9 10 ...
$ variable: Factor w/ 3 levels "Both","Preconception",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : int 14 1 36 17 5 9 19 9 19 25 ...
and I want to test the effect value ~ variable, considering that observations are grouped by topic. So I thought to use a repeated-measures ANOVA, where variable is treated as a repeated measure on every topic.
The call is aov(value ~ variable + Error(topic/variable)).
Up to this point, everything is OK.
Then I wanted to perform a post-hoc test with glht(model, linfct = mcp(variable = 'Tukey')), but I receive two errors:
‘glht’ does not support objects of class ‘aovlist’
no ‘model.matrix’ method for ‘model’ found!
Since taking out the Error term resolves the errors, I suppose the Error term is the cause.
So, how can I perform a post-hoc test on a repeated-measures ANOVA?
Thanks!
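One common way around this (a sketch, not from the thread itself): glht indeed does not handle the aovlist that aov returns when an Error term is present, but it does accept mixed models fit with nlme::lme, so the same repeated-measures structure can be expressed as a random intercept per topic and then tested. The data frame here is a simulated stand-in with the same shape as the question (13 topics x 3 variables).

```r
library(nlme)
library(multcomp)

# stand-in data: 13 topics, each measured under 3 variable conditions
set.seed(1)
d <- expand.grid(topic    = factor(paste0("t", 1:13)),
                 variable = factor(c("Both", "Preconception", "Pregnancy")))
d$value <- rpois(nrow(d), lambda = 15)

# random intercept per topic plays the role of the Error(topic) stratum
model <- lme(value ~ variable, random = ~ 1 | topic, data = d)

# Tukey all-pairwise comparisons among the three variable levels
summary(glht(model, linfct = mcp(variable = "Tukey")))
```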