I have a data set of about 300 people with behavioral items (answers: yes/no/NA) plus variables on age, place of residence (city/country), income, etc.
First, I would like to estimate the item difficulties for the overall sample. Which R package is best for this, and how does it work? I don't fully understand some of the code I have found.
In a next step, I want to examine whether different groups (young/old, city/country, income split at the median) differ significantly in their item difficulties.
How do I do that? Is this possible with Wald tests, Rasch trees, or raschmix? And do I need latent groups, i.e. groups identified in a data-driven way?
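For what it's worth, here is a minimal sketch of how this could look with the eRm and psychotree packages (the item and covariate names are made up; adapt them to your data):

library(eRm)         # Rasch model via conditional maximum likelihood
library(psychotree)  # Rasch trees

# Item difficulties for the overall sample; 'dat' is assumed to hold the
# 0/1 items in columns item1..item3 and the covariates alongside them
fit <- RM(dat[, c("item1", "item2", "item3")])
summary(fit)
-coef(fit)   # eRm reports easiness parameters; difficulties are their negatives

# Wald test for item-wise DIF between two known groups (e.g. city vs. country)
Waldtest(fit, splitcr = dat$city)

# Rasch tree: detects groups with different item parameters in a data-driven
# way, using the observed covariates (no latent classes needed)
dat$resp <- as.matrix(dat[, c("item1", "item2", "item3")])
rt <- raschtree(resp ~ age + income + city, data = dat)
plot(rt)

# If you want latent (unobserved) classes instead, see psychomix::raschmix()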
I have a data frame of relative proportions of cell types, measured at 3 different time points in several individuals.
I want to write a model that primarily evaluates the effect of time, but also the potential influence of a few other factors on the longitudinal development.
My approach so far would be to loop over every dependent variable and every possible metadata variable and evaluate the model:
model <- glmmTMB(cell_type1 ~ Age + metadata1 + (1 | patient_id),
                 data = data, family = beta_family(),
                 na.action = na.omit, REML = FALSE)
car::Anova(model, test.statistic = "Chisq", type = 2)
I have a hard time interpreting the results from this. For example, if I get a significant p-value for both Age and my metadata variable, can I say that the effect of Age is significantly influenced by the metadata variable? If so, what is the best way to quantify this influence? The metadata variables are usually fixed for one individual and don't change over the observation period.
Unfortunately, the potential influences on my time-dependent effect include binary, categorical, and continuous variables.
What would be the best way to approach this problem?
Edit: Another approach that seems a little more sensible to me right now would be to fit a baseline model (celltype ~ time) and compare it to a model celltype ~ time + metadata via a likelihood ratio test. However, I'd still be stuck on quantifying the effect of the metadata variable…
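Just to illustrate the distinction: two significant main effects do not tell you that the metadata variable changes the Age effect; for that you need an interaction term. A minimal sketch with glmmTMB, reusing the variable names from your code:

library(glmmTMB)

# Main-effects model: Age and metadata1 each shift the mean, but the
# Age slope is the same for every value of metadata1
m0 <- glmmTMB(cell_type1 ~ Age + metadata1 + (1 | patient_id),
              data = data, family = beta_family(), REML = FALSE)

# Interaction model: the Age slope is allowed to differ with metadata1
m1 <- glmmTMB(cell_type1 ~ Age * metadata1 + (1 | patient_id),
              data = data, family = beta_family(), REML = FALSE)

# Likelihood ratio test for the interaction (models fitted with ML)
anova(m0, m1)

# The Age:metadata1 coefficient quantifies how the Age effect changes
# per unit of metadata1 (on the logit scale of the beta model)
summary(m1)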
Is it possible to use cluster standard errors and multilevel models together, and how does one implement this in R?
In my setup I am running a conjoint experiment in 26 countries with 2000 participants per country. As in any conjoint experiment, each participant is shown two vignettes and asked to choose/rate each vignette. The same participant is then shown two fresh vignettes and asked to repeat the task, so each participant performs two comparisons. The hierarchy is thus comparisons nested within individuals nested within countries. I am currently running a multilevel model with each comparison at level 1 and country as the level-2 unit. Obviously comparisons within individuals are likely to be correlated, so I'd like to cluster standard errors at the individual level as well. It seems overkill to add another level to the MLM for this, since my clusters are extremely small (n = 2) and it makes more sense to do the analysis at the individual level (not to mention that it unnecessarily complicates the model: with 2000 individuals * 26 countries the parameter space becomes huge). Is this possible? If so, how does one do this in R together with a multilevel model setup?
The cluster size of 2 is not an issue, and I don't see any issue with the parameter space either. If you fit random intercepts for participants and countries, these are estimated as latent normally distributed variables. A model such as
lmer(outcome ~ fixed_effects + (1 | country/participant), data = dat)
will handle the dependencies within clusters (at both the participant and the country level), so there is no need for cluster standard errors.
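For concreteness, a minimal sketch of that model in lme4 (outcome and predictor names are placeholders):

library(lme4)

# (1 | country/participant) expands to (1 | country) + (1 | country:participant),
# i.e. random intercepts for countries and for participants within countries
fit <- lmer(rating ~ attribute1 + attribute2 + (1 | country/participant),
            data = conjoint)
summary(fit)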
I am fitting several mixed models using the lmer function in the package lme4, each with the same fixed effects and random effects, but different response variables. The purpose of these models is to determine how environmental conditions influence different fruit and seed traits in a particular tree species. I want to know which traits respond most strongly to which environmental variables, and how well the variation in each trait is captured by each model overall.
The data have been collected from several sites, and several trees within each site.
Response variables: measures of fruits and seeds, e.g. fresh mass, dry mass, volume
Fixed effects: temperature, rainfall, soil nitrogen, soil phosphorus
Random effects: Sites, trees
Example of model notation I have been using:
lmer(fruit.mass ~ temperature + rainfall + soil.N + soil.P +
       (1 | site/tree), data = fruit)
My problem: some of the models run fine with no detectable issues; however, some produce a singular fit where the estimated variance for 'site' is 0.
I know there is considerable debate around how to deal with singular fits, although one approach is to drop site and keep the tree-level random effect. The models run fine after this.
My question: Should I then also drop the site random effect from the models that weren't singular if I want to compare these models in any way? If so, are there certain methods for comparing model performance more suited to this situation?
If this is covered in publications or discussion threads then any links would be much appreciated.
When a model converges to a singular fit, it indicates that the random-effects structure is overfitted. Therefore I would argue that it does not make sense to compare these models. I would also be concerned about multiple testing in this situation.
I would drop the models that give "boundary is singular" warnings; the rest are fine and you can compare among them. Of note: it is best to specify REML = FALSE when comparing models (I can give references if needed). Once the best model is selected, you can refit it normally (i.e. with REML). I would recommend conditional AIC (cAIC4 package), for example anocAIC(model1, model2, ...). Another good option is the performance package, which has many functions, such as model_performance and check_model.
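A sketch of that workflow (model and variable names follow the question; these calls are one way to do it, not the only one):

library(lme4)
library(cAIC4)        # conditional AIC for mixed models
library(performance)  # fit indices and diagnostics

# Fit candidate models with ML (REML = FALSE) for comparison
m_full <- lmer(fruit.mass ~ temperature + rainfall + soil.N + soil.P +
                 (1 | site/tree), data = fruit, REML = FALSE)
m_tree <- lmer(fruit.mass ~ temperature + rainfall + soil.N + soil.P +
                 (1 | tree), data = fruit, REML = FALSE)

anocAIC(m_full, m_tree)      # compare by conditional AIC
model_performance(m_tree)    # R2, ICC, RMSE, ...
check_model(m_tree)          # residual diagnostics

# Refit the selected model with REML for the final estimates
m_final <- lmer(fruit.mass ~ temperature + rainfall + soil.N + soil.P +
                  (1 | tree), data = fruit, REML = TRUE)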
Dataset Description: I use a dataset with neuropsychological (np) tests from several subjects. Every subject has more than one test in his/her follow-up, i.e. one test per year. I study the cognitive decline in these subjects. The information I have is: individual (identity) number, education (years), gender (M/F as factor), age (years), and time from baseline (= years after the first np test).
AIM: My aim is to measure the rate of change in the np tests, i.e. the cognitive decline per year, for each subject. To do that I use linear mixed effects models (LMEMs), taking the above parameters into account, and I compute the slope for each subject.
Question: When I run the possible models (combining different parameters each time), I also check their singularity, and the result in almost all cases is TRUE. So my models are singular! If I wanted to use these models for prediction, this would not be good, as it means the models overfit the data. But since I just want to find the slope for each individual, I think this is not a problem; or, even better, an advantage, as singularity then offers a more precise calculation of the subjects' slopes. Do you think this reasoning is correct?
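For reference, per-subject slopes usually come out of a random-slope model like the sketch below (column names are made up):

library(lme4)

# Random intercept and slope for time-from-baseline within subject
fit <- lmer(np_score ~ time_bl + age + education + gender + (time_bl | id),
            data = np)

# Per-subject slope = fixed-effect slope + subject-specific deviation
coef(fit)$id[, "time_bl"]

# Note: if the fit is singular because the slope variance is estimated as 0,
# every subject gets the same slope (the fixed effect)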
In my study I sampled the same sites in different regions over many years. Each site has different properties in each year, which is important for my research question. I want to know whether the properties of the sites affect biodiversity on the sites, and I am interested in the interaction of the properties and the regions.
Overview:
Biodiversity = response
Site property = fixed factor, changes every year
Region = fixed factor , same regions every year
Site = random effect, is repeatedly sampled in the different sampling years
Year = random effect, is the factor in which "site" is repeated
At the moment my model looks like this:
mod1 <- lmer(biodiversity~region*siteProperty+(1|Year)+(1|site))
I'm not sure if this accounts for the repeated measures.
Alternatively I was thinking about this one, as it also includes the nesting of sites within years, but maybe that is not necessary:
mod2 <- lmer(biodiversity~region*siteProperty+(1|Year)+(1|Year:site))
The problem with this approach is that it only works if my site properties are not zero. But I have zeros in different properties, and I need to analyse their effects as well.
If you need more information, just ask.
Thanks for your help!
Your first example,
mod1 <- lmer(biodiversity~region*siteProperty+(1|Year)+(1|site))
should be fine (although I'd encourage you to use the data argument explicitly ...). If you have samples from "many years" for each site, you might want to consider:
- including a trend model, i.e. include Year, as a numeric variable, in the fixed-effect part of the model as well, either as a simple linear term or as part of an additive model (e.g. using splines::ns); a sketch follows below;
- checking for/allowing for autocorrelation (although this is tricky in lme4; you can use lme, but then crossed random effects of Year and site become harder).
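A sketch of the trend-model suggestion (assuming your data frame, here called mydata, contains a numeric copy Year_num of Year):

library(lme4)
library(splines)

# Natural-spline trend in the fixed effects; Year and site are kept as
# crossed random effects
mod3 <- lmer(biodiversity ~ region * siteProperty + ns(Year_num, df = 3) +
               (1 | Year) + (1 | site), data = mydata)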
If you have one sample per site/year combination, you don't want (1|Year:site), because that term will be the same as the residual variability term.
Your statement "only works if my site properties are not zero" doesn't make sense to me: in general, having zeros in predictor variables shouldn't cause any problems ... ?