I am running a mixed model in R. However, I am having some difficulty understanding what type of model I should be running for the data that I have.
Let's call the dependent variable the number of early button presses in a computerised experiment. An experiment is made up of multiple trials. In each trial a participant has to press a button to react to a target appearing on a screen. However, they may press the button too early, and this is what is being measured as the outcome variable. So, for example, participant A may have a total of 3 early button presses across trials in an experiment, whereas participant B may have 15.
In a straightforward linear regression model using the lm command in R, I would treat this outcome as a continuous numerical variable. After all, it's a number that participants score in the experiment. However, I am not trying to run a linear regression; I am trying to run a mixed model with random effects. My understanding of a mixed model in R is that the data should be structured with one row per participant per trial. When the data is structured like this at trial level, suddenly I have a lot of 1s and 0s in my outcome column, since at trial level a participant may accidentally press the button too early (scoring a 1) or not (scoring a 0).
Does this sound like something that needs to be treated as categorical? If so, would it then be modelled through the glmer function with family set to binomial?
Thanks
As stated by Martin, this seems to be more of a Cross Validated question. But I'll throw in my 2 cents here.
The question often becomes what you're interested in with the experiment, and whether you have cause to believe that there is a random effect in your model. In your example you have 2 possible effects that could be random: the individuals and the trials. In classical random-effect models the random effects are often chosen based on a series of rules of thumb, such as:
Whether the parameter can be thought of as random. This often refers to the levels of a factor changing between experiments. In this situation both the individuals and the trials are likely to change between experiments.
Whether you're interested in the systematic effect (e.g. how much did A affect B); if so, the effect is not random and should be considered for the fixed effects. In your case, this is really only relevant if there are enough trials to see a systematic effect across individuals, but one could then question how relevant this effect would be for generalised results.
Several other rules of thumb exist out there, but this at least gives us a place to start. The next question becomes which effect we're actually interested in. In your case it is not quite clear, but it sounds like you're interested in one of the following:
How many early button presses can we expect for any given trial?
How many early button presses can we expect for any given individual?
How big is the chance that an early button press happens during any given trial?
For the first 2, you can benefit from averaging over either individual or trial and using a linear mixed-effect model with the counterpart as a random effect, although I would argue that a Poisson generalized linear mixed model is likely a better fit, as you are modelling counts that can only be positive. E.g., in a rather general sense:
# df is assumed to contain the raw trial-level data

# 1) one row per individual, random effect for individual
df_agg <- aggregate(. ~ individual, data = df)
lmer(early_clicks ~ . - individual + (1 | individual), data = df_agg)
# or better:
glmer(early_clicks ~ . - individual + (1 | individual), family = poisson, data = df_agg)

# 2) one row per trial, random effect for trial
df_agg <- aggregate(. ~ trial, data = df)
lmer(early_clicks ~ . - trial + (1 | trial), data = df_agg)
# or better:
glmer(early_clicks ~ . - trial + (1 | trial), family = poisson, data = df_agg)

# 3) model each trial-level outcome (0/1) directly
glmer(early_clicks ~ . + (1 | trial) + (1 | individual), family = binomial, data = df)
Note that we could use 3) to get answers for 1) and 2), by using 3) to predict probabilities and using these to find the expected early_clicks. However, one can show theoretically that the estimation methods used in linear mixed models are exact, while this is not possible for generalized linear mixed models. As such the results may differ slightly (or quite substantially) between the models. Especially in 3), the number of random effects may be quite substantial compared to the number of observations, and in practice the model may be impossible to estimate.
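For illustration, turning the trial-level probabilities from 3) into expected counts per individual might look like the following sketch (the intercept-only formula and the name m3 are my assumptions, not the questioner's actual model):

```r
library(lme4)

# assumed fit corresponding to 3) above, with no further covariates
m3 <- glmer(early_clicks ~ (1 | trial) + (1 | individual),
            family = binomial, data = df)

# predicted probability of an early press for every trial-level row
df$p_early <- predict(m3, type = "response")

# expected number of early presses per individual is the sum of
# that individual's per-trial probabilities
expected <- aggregate(p_early ~ individual, data = df, FUN = sum)
```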
Disclaimer
I have only very briefly gone over some principles, and while this may serve as a very brief introduction it is by no means exhaustive. In the last 15-20 years the theory and practice of mixed-effect models have been extended substantially. If you'd like more information about mixed-effect models I'd suggest starting at the GLMM FAQ by Ben Bolker (and others) and the references listed within it. For estimation and implementations I suggest reading the vignettes of the lme4, glmmTMB and possibly merTools packages, glmmTMB being a more recent and interesting project.
I have a dataframe of relative proportions of cell types, measured at 3 different timepoints in several individuals.
I want to write a model that primarily evaluates the effect of time, but also the potential influence of a few other factors on the longitudinal development.
My approach so far would be to loop over every dependent variable and every possible metadata variable and evaluate the model:
model <- glmmTMB(cell_type1 ~ Age + metadata1 + (1 | patient_id),
                 data = data, family = beta_family(),
                 na.action = na.omit, REML = FALSE)
car::Anova(model, test.statistic = "Chisq", type = 2)
I have a hard time interpreting the results from this. For example, if I do get a significant p-value for both my cell type and my metadata variable, can I say that the effect of Age is significantly influenced by my metadata variable? If so, what is the best way to quantify this influence? The metadata variables are usually fixed for one individual and don't change over the observation period.
Unfortunately, the potential influencing variables on my time-dependent effect include binary, categorical as well as continuous data.
What would be the best way to approach this problem?
Edit: Another approach, which seems a little more sensible to me right now, would be to calculate a baseline model (time ~ celltype) and compare it to another model, time ~ celltype + metadata, via a likelihood ratio test. However, I'd still be stuck at quantifying the effect of the metadata variable…
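A minimal sketch of that nested-model comparison, reusing the glmmTMB setup from above (the model names m0/m1 are placeholders):

```r
library(glmmTMB)

# baseline: time effect only
m0 <- glmmTMB(cell_type1 ~ Age + (1 | patient_id),
              data = data, family = beta_family(), REML = FALSE)

# extended: adds the candidate metadata variable
m1 <- glmmTMB(cell_type1 ~ Age + metadata1 + (1 | patient_id),
              data = data, family = beta_family(), REML = FALSE)

# likelihood ratio test between the nested models
anova(m0, m1)
```

Note that a significant main effect of metadata1 does not by itself say the effect of Age depends on it; for that question the comparison would need a model containing the interaction Age * metadata1.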
I am looking at average home range size on two sites (one that has undergone habitat restoration, the other an experimental control) during three phases of the restoration process (before, during, and two years after). I want to see if differences in mean home range size exist across sites and periods. Based on having two categorical variables (site and period), I assume this would be done using a repeated-measures ANOVA? I need to see what code would be used, since I have never done an ANOVA in R before.
rm(list = ls())
hrdata <- read.csv(xxx)
hrdata
I think you could do this with a linear model, but see (https://stats.stackexchange.com/questions/20002/regression-vs-anova-discrepancy-aov-vs-lm-in-r) for a discussion of ANOVA vs regression.
The code would look something like this:
lm1 <- lm(HRS ~ Site * Period, data=hrdata)
The first bit of this code is simply storing this linear model (lm) as an R object, which we have named lm1.
Then you can do:
summary(lm1)
This would look at the effects of Site (habitat restoration vs control), Period (before, during, and after), and the interaction between the two.
There are lots of posts about interpreting these summary results. I have posted some below. This first one may be useful if you aren't sure how to interpret the interaction terms:
https://stats.stackexchange.com/questions/56784/how-to-interpret-the-interaction-term-in-lm-formula-in-r
https://stats.stackexchange.com/questions/59250/how-to-interpret-the-output-of-the-summary-method-for-an-lm-object-in-r
https://stats.stackexchange.com/questions/115304/interpreting-output-from-anova-when-using-lm-as-input
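To get ANOVA-style tests from that fitted model rather than coefficient-level output, something like the following sketch could be used (the `car` package is an extra assumption beyond the answer above):

```r
lm1 <- lm(HRS ~ Site * Period, data = hrdata)

# sequential (Type I) ANOVA table for the main effects and interaction
anova(lm1)

# if the design is unbalanced, Type II tests may be preferable
# (requires the car package)
car::Anova(lm1, type = 2)
```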
I have a dataset in which individuals, each belonging to a particular group, repeatedly chose between multiple discrete outcomes.
subID  group  choice
1      Big    A
1      Big    B
2      Small  B
2      Small  B
2      Small  C
3      Big    A
3      Big    B
...
I want to test how group membership influences choice, and want to account for non-independence of observations due to repeated choices being made by the same individuals. In turn, I planned to implement a mixed multinomial regression treating group as a fixed effect and subID as a random effect. It seems that there are a few options for multinomial logits in R, and I'm hoping for some guidance on which may be most easily implemented for this mixed model:
1) multinom - GLM, via nnet, allows the usage of the multinom function. This appears to be a nice, clear, straightforward option... for fixed-effect models. However, is there a manner to implement random effects with multinom? A previous CV post suggests that multinom is able to handle a mixed-effects GLM with a Poisson distribution and a log link. However, I don't understand (a) why this is the case or (b) the required syntax. Can anyone clarify?
2) mlogit - A fantastic package, with incredibly helpful vignettes. However, the "mixed logit" documentation refers to models that have random effects related to alternative specific covariates (implemented via the rpar argument). My model has no alternative specific variables; I simply want to account for the random intercepts of the participants. Is this possible with mlogit? Is that variance automatically accounted for by setting subID as the id.var when shaping the data to long form with mlogit.data? EDIT: I just found an example of "tricking" mlogit to provide random coefficients for variables that vary across individuals (very bottom here), but I don't quite understand the syntax involved.
3) MCMCglmm is evidently another option. However, as a relative novice with R and someone completely unfamiliar with Bayesian stats, I'm not personally comfortable parsing example syntax of mixed logits with this package, or, even following the syntax, making guesses at priors or other needed arguments.
Any guidance toward the most straightforward approach and its syntax implementation would be thoroughly appreciated. I'm also wondering if the random effect of subID needs to be nested within group (as individuals are members of groups), but that may be a question for CV instead. In any case, many thanks for any insights.
I would recommend the Apollo package by Hess & Palma. It comes with great documentation and a quite helpful user group.
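If you want something closer to the glmer-style workflow, one possibility (my suggestion, not part of the answer above, and untested on your data) is mblogit from the mclogit package, which fits multinomial baseline-category logit models with random effects:

```r
library(mclogit)

# group as the fixed effect, random intercept per participant
fit <- mblogit(choice ~ group, random = ~ 1 | subID, data = df)
summary(fit)
```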
I'm trying to run a GLM in R for biomass data (reproductive biomass and the ratio of reproductive biomass to vegetative biomass) as a function of habitat type ("hab"), year of data collection ("year"), and site of data collection ("site"). My data look like they would fit a Gamma distribution well, but I have 8 observations with zero biomass (out of ~800 observations), so the model won't run. What's the best way to deal with this? What would be another error distribution to use? Or would adding a very small value (such as .0000001) to my zero observations be viable?
My model is:
reproductive_biomass <- glm(repro.biomass ~ hab * year + site,
                            data = biom, family = Gamma(link = "log"))
Ah, zeroes - gotta love them.
Depending on the system you're studying, I'd be tempted to check out zero-inflated or hurdle models. The basic idea is that there are two components to the model: a binomial process deciding whether the response is zero or nonzero, and then a Gamma that works on the nonzeroes. The slick part is you can then do inference on the coefficients of both parts, and even use different predictors in each.
http://seananderson.ca/2014/05/18/gamma-hurdle.html ... but a search for "zero-inflated gamma" or "tweedie models" might also yield something informative and/or scholarly.
In an ideal world, your analytic tool should fit your system and your intended inferences. The zero-inflated world is pretty sweet, but is conditional on the assumption of separate processes. Thus an important question to answer, of course, is what zeroes "mean" in the context of your study, and only you can answer that - whether they're numbers that just happened to be really really small, or true zeroes that are the result of some confounding process like your coworker spilling the bleach (or something otherwise uninteresting to your study), or else true zeroes that ARE interesting.
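A minimal sketch of the two-part hurdle idea in base R, using the variable names from the question (this follows the linked gamma-hurdle post; it is not a drop-in replacement for the single glm call above):

```r
# part 1: binomial model for zero vs. nonzero biomass
biom$nonzero <- as.numeric(biom$repro.biomass > 0)
m_zero <- glm(nonzero ~ hab * year + site,
              data = biom, family = binomial)

# part 2: Gamma model fit only to the strictly positive observations
m_pos <- glm(repro.biomass ~ hab * year + site,
             data = subset(biom, repro.biomass > 0),
             family = Gamma(link = "log"))

# expected biomass = P(nonzero) * E[biomass | nonzero]
pred <- predict(m_zero, type = "response") *
        predict(m_pos, newdata = biom, type = "response")
```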
Another thought: ask the same question over on Cross Validated, and you'll probably get an even more statistically informed answer. Good luck!
I'm new to R and lme4, and am having trouble understanding how I should code the error term for my analysis.
My data has the following columns: Elevation, Cover, Treatment, Date, Flux.
The study is to see how elevation (either upper or lower) and community (meadow or aspen) affect soil flux (the [continuous] response variable).
We initially included a treatment, but other analyses we've run indicate that it doesn't have much effect, so we are excluding it for now.
We repeatedly took the flux measurements from the same plots over the course of a couple of years. We want to run a linear mixed model instead of a repeated-measures ANOVA.
My model currently looks like this:
fluxmodel <- lmer(Flux ~ Cover*Elevation + (1 | Date), data = mydata)
summary(fluxmodel)
anova(fluxmodel)
My question is: what should I do with the error term, which is currently (1 | Date)? I haven't found any helpful explanation of the difference between | and / in the error term, or of which syntax would be best to use in a repeated-measures scenario like this. Given my experimental design, how should I code the error?