I have a data frame with post and follow-up measurements for approximately 200 people. In the study, we try to find out whether there is a correlation between sports participation and distress symptoms. We have two measurement periods, both conducted after a workshop about health and sports: post was conducted 6 months after the workshop, and follow-up one year after the workshop. We formed the following hypothesis: "Participation in sport by obese people within one year after a workshop correlates significantly positively with psychological distress symptoms at follow-up." I assume the dependent variable is psychological distress and the independent variable is participation in sports activities. The data structure looks like:
str(Df)
$ measurement_period : Factor w/ 2 levels "0","1": 1 1 1 1
$ psychological_distress : int 12 45 32 85
$ participation : Factor w/ 2 levels "0","1": 1 1 1 1
$ id : num 1 2 3 4
After reading some posts here, we believe that there are 2 levels in the model: 1) measurement period (post and follow-up) and 2) id.
At first we fitted an unconditional model (an intercept-only model, to confirm whether a multilevel model is warranted; we hope this is right) with the following code:
test <- lmer(psychological_distress ~ 1 + (1|id), data = Df)
But we are not sure if the model is appropriate given the data structure, and whether the level 1 and level 2 classification is correct.
Thank you very much in advance!
Your model:
lmer(psychological_distress ~ 1 + (1|id), data = Df)
is a variance components model. It will tell you how much of the variation in psychological_distress is attributable to the id level, and how much is attributable to the unit/residual level. That isn't going to answer your research question:
we try to find out if there is a correlation between sports participation and distress symptoms
To look into this, you need to include the participation variable as a fixed effect, and also the time variable, and their interaction. So in the first instance I would consider this:
lmer(psychological_distress ~ measurement_period*participation + (1|id), data = Df)
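Incidentally, the intercept-only model you already fitted is still useful as a baseline: the share of variance attributable to id is the intraclass correlation (ICC). A minimal sketch, assuming the performance package is available:

library(lme4)
library(performance)

# unconditional model: how much variance sits between people (id)?
m0 <- lmer(psychological_distress ~ 1 + (1 | id), data = Df)
icc(m0)  # proportion of total variance attributable to between-person differences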
A good website on how to fit longitudinal and growth models using lme4 is https://rpsychologist.com/r-guide-longitudinal-lme-lmer
As Robert pointed out, and as demonstrated on the website, it is often useful to fit an interaction between "time" and "group" (e.g., treatment vs. control), to see how the outcome changes for each group over time. You can see this change by looking at the coefficients, but it's usually easier to plot (adjusted) predictions.
Here's a toy example:
library(parameters)
library(datawizard)
library(lme4)
library(ggeffects)
data("qol_cancer")
# filter two time points
qol_cancer <- data_filter(qol_cancer, time %in% c(1, 2))
# create fake treatment/control variable
set.seed(123)
treatment <- sample(unique(qol_cancer$ID), size = length(unique(qol_cancer$ID)) / 2, replace = FALSE)
qol_cancer$treatment <- 0
qol_cancer$treatment[qol_cancer$ID %in% treatment] <- 1
qol_cancer$time <- as.factor(qol_cancer$time)
qol_cancer$treatment <- factor(qol_cancer$treatment, labels = c("control", "treatment"))
m <- lmer(QoL ~ time * treatment + (1 + time | ID),
data = qol_cancer,
control = lmerControl(check.nobs.vs.nRE = "ignore"))
model_parameters(m)
#> # Fixed Effects
#>
#> Parameter | Coefficient | SE | 95% CI | t(368) | p
#> ----------------------------------------------------------------------------------------
#> (Intercept) | 70.74 | 2.15 | [66.52, 74.97] | 32.90 | < .001
#> time [2] | 0.27 | 2.22 | [-4.10, 4.64] | 0.12 | 0.905
#> treatment [treatment] | 4.88 | 3.04 | [-1.10, 10.86] | 1.60 | 0.110
#> time [2] * treatment [treatment] | 1.95 | 3.14 | [-4.23, 8.13] | 0.62 | 0.535
#>
#> # Random Effects
#>
#> Parameter | Coefficient
#> ---------------------------------------
#> SD (Intercept: ID) | 15.14
#> SD (time2: ID) | 7.33
#> Cor (Intercept~time2: ID) | -0.62
#> SD (Residual) | 14.33
#>
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#> using a Wald t-distribution approximation.
ggpredict(m, c("time", "treatment")) |> plot()
Regarding the statistical significance of the interaction term: the p-values from the summary might be misleading. If you're really interested in statistically significant differences, either between time points or between groups (treatment vs. control), it is recommended to calculate pairwise contrasts, including p-values. You can do this, e.g., with the emmeans package.
library(emmeans)
emmeans(m, c("time", "treatment")) |> contrast(method = "pairwise", adjust = "none")
#> contrast estimate SE df t.ratio p.value
#> time1 control - time2 control -0.266 2.22 186 -0.120 0.9049
#> time1 control - time1 treatment -4.876 3.04 186 -1.604 0.1105
#> time1 control - time2 treatment -7.092 2.89 316 -2.453 0.0147
#> time2 control - time1 treatment -4.610 2.89 316 -1.594 0.1118
#> time2 control - time2 treatment -6.826 2.73 186 -2.497 0.0134
#> time1 treatment - time2 treatment -2.216 2.22 186 -0.997 0.3199
#>
#> Degrees-of-freedom method: kenward-roger
Created on 2022-05-22 by the reprex package (v2.0.1)
Here you can see, e.g., that treatment and control do not differ regarding their QoL at time point 1, but they do at time point 2.
I am trying to test the effect of a treatment on the proportion of juveniles in a population of migrating birds. The birds were counted and identified as juveniles or adults daily, but the treatment was applied only on every second day. Days without treatment were used as a control. The problem is that the proportion of juveniles in the population is expected to be affected not only by the treatment, but also by migration phenology. For example, it is possible that on a given day more juveniles migrated to the study area, and therefore this, and not only the treatment, affected the proportion of juveniles in the population. To account for this problem, I also checked the proportion of juveniles every day at a nearby site that was not affected by the treatment (i.e., a control site). Hence, I have two types of controls.
To analyze the data, I thought of using a binomial GLMM, with the proportion of juveniles as the variable of interest, the treatment as a categorical (with or without treatment) explanatory variable, and day as a random-intercept factor; I use weights to account for the different number of birds on each day. But I am not sure how to input the data from the control site. From what I read, it should be used as an offset, but I am not sure exactly how.
Is the link function affected by the fact that it (the juvenile proportion at the control site) is a proportion?
Is it better to use the juvenile proportion at the control site in an interaction instead of an offset (i.e., ~ Treatment * Juv.prop.cntrl.site)?
This is the model I have so far, but I am not sure if it makes sense, especially whether the offset is set correctly:
glm(Juv.prop.exp.site ~ Treatment + Day, offset = Juv.prop.cntrl.site, weights = Tot.birds.exp.site, data = df, family = binomial)
where Juv.prop.exp.site is the number of juveniles divided by the total at this site (juveniles + adults).
See the data here: DATA (day starts at 11, because during the first 10 days no birds of that species were observed)
Normally, I would suggest that questions regarding statistical analysis be migrated to CrossValidated, where you will get better answers to purely statistical questions. However, in your case, it will help a lot to reshape your data into a tidy format before analysis, which is more of a programming problem.
Essentially, you need one column each for day, site, treatment, number of juveniles, and number of adults. I am assuming that in your data, "V" is the treatment and "X" is the control.
library(tidyverse)
df <- data %>%
select(1, 2, 4, 5, 8, 9) %>%
rename_all(~gsub("\\.site", "_site", .x)) %>%
pivot_longer(1:4, names_sep = "\\.", names_to = c(".value", "Site")) %>%
mutate(Treatment = ifelse(Site == "Exp_site", Treatment, "X")) %>%
mutate(Treatment = ifelse(Treatment == "V", "Treatment", "Control")) %>%
mutate(Site = ifelse(Site == "Exp_site", "Experimental", "Control")) %>%
rename(Juveniles = Juv, Adults = Ad) %>%
select(2, 1, 3:5)
This makes your data look like this, and to my mind this is easier to analyse (and to reason about):
df
#> # A tibble: 100 x 5
#> Day Treatment Site Juveniles Adults
#> <int> <chr> <chr> <int> <int>
#> 1 11 Control Experimental 1 0
#> 2 11 Control Control 0 0
#> 3 12 Treatment Experimental 2 1
#> 4 12 Control Control 1 0
#> 5 13 Control Experimental 2 0
#> 6 13 Control Control 1 1
#> 7 14 Treatment Experimental 6 3
#> 8 14 Control Control 4 2
#> 9 15 Control Experimental 6 4
#> 10 15 Control Control 1 2
#> # ... with 90 more rows
#> # ℹ Use `print(n = ...)` to see more rows
You can then fit a binomial GLM like this, with Treatment and Site as independent variables.
model <- glm(cbind(Juveniles, Adults) ~ Treatment + Site,
data = df, family = binomial)
summary(model)
#> Call:
#> glm(formula = cbind(Juveniles, Adults) ~ Treatment + Site, family = binomial,
#> data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.4652 -0.6971 0.0000 0.7895 2.9541
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.0059 0.1461 6.886 5.74e-12 ***
#> TreatmentTreatment 0.3012 0.2877 1.047 0.295
#> SiteExperimental -0.1632 0.2598 -0.628 0.530
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 118.16 on 88 degrees of freedom
#> Residual deviance: 117.07 on 86 degrees of freedom
#> AIC: 244.13
#>
#> Number of Fisher Scoring iterations: 4
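The question originally asked for a GLMM with day as a random intercept, and the reshaped data supports that directly. A sketch, assuming lme4 (a glmmTMB fit would look the same):

library(lme4)

# Binomial GLMM: Treatment and Site as fixed effects, Day as a random
# intercept to absorb day-to-day migration phenology shared by both sites
model_glmm <- glmer(cbind(Juveniles, Adults) ~ Treatment + Site + (1 | Day),
                    data = df, family = binomial)
summary(model_glmm)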
I am running a multinomial analysis with vglm(). It all works, but then I try to follow the instructions from the following website (https://rcompanion.org/handbook/H_08.html) to do a pairwise test, because emmeans cannot handle pairwise comparisons for vglm models. The lrtest() part gives me the following error:
Error in lrtest.default(model) :
'list' object cannot be coerced to type 'double'
I cannot figure out what is wrong. I even copied and pasted the exact code that the website used (see below) and get the same error with their own code and dataset. Any ideas?
Their code and suggestion for doing pairwise testing with vglm() is the only pairwise-testing option for vglm() I found anywhere on the web.
Here is the code along with all the expected output and extra details from their website (it is simpler than mine but gives the same error anyway).
Input = ("
County Sex Result Count
Bloom Female Pass 9
Bloom Female Fail 5
Bloom Male Pass 7
Bloom Male Fail 17
Cobblestone Female Pass 11
Cobblestone Female Fail 4
Cobblestone Male Pass 9
Cobblestone Male Fail 21
Dougal Female Pass 9
Dougal Female Fail 7
Dougal Male Pass 19
Dougal Male Fail 9
Heimlich Female Pass 15
Heimlich Female Fail 8
Heimlich Male Pass 14
Heimlich Male Fail 17
")
Data = read.table(textConnection(Input),header=TRUE)
### Order factors otherwise R will alphabetize them
Data$County = factor(Data$County,
levels=unique(Data$County))
Data$Sex = factor(Data$Sex,
levels=unique(Data$Sex))
Data$Result = factor(Data$Result,
levels=unique(Data$Result))
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
### Remove unnecessary objects
rm(Input)
Multinomial regression
library(VGAM)
model = vglm(Result ~ Sex + County + Sex:County,
family=multinomial(refLevel=1),
weights = Count,
data = Data)
summary(model)
library(car)
Anova(model,
      type="II",
      test="Chisq")
Analysis of Deviance Table (Type II tests)
Response: Result
Df Chisq Pr(>Chisq)
Sex 1 6.7132 0.00957 **
County 3 4.1947 0.24120
Sex:County 3 7.1376 0.06764 .
library(rcompanion)
nagelkerke(model)
$Pseudo.R.squared.for.model.vs.null
Pseudo.R.squared
McFadden 0.0797857
Cox and Snell (ML) 0.7136520
Nagelkerke (Cragg and Uhler) 0.7136520
$Likelihood.ratio.test
Df.diff LogLik.diff Chisq p.value
7 -10.004 20.009 0.0055508
library(lmtest)
lrtest(model)
Likelihood ratio test
Model 1: Result ~ Sex + County + Sex:County
Model 2: Result ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 8 -115.39
2 15 -125.39 7 20.009 0.005551 **
Post-hoc analysis
At the time of writing, the lsmeans package cannot be used with vglm models.
One option for post-hoc analysis would be to conduct analyses on reduced models, including only two levels of a factor. For example, if the County × Sex interaction term had been significant, the following code could be used to create a reduced dataset with only Bloom–Female and Bloom–Male, and to analyze these data with vglm().
Data.b = Data[Data$County=="Bloom" &
(Data$Sex=="Female"| Data$Sex=="Male") , ]
Data.b$County = factor(Data.b$County)
Data.b$Sex = factor(Data.b$Sex)
summary(Data.b)
County Sex Result Count
Bloom:4 Female:2 Pass:2 Min. : 5.0
Male :2 Fail:2 1st Qu.: 6.5
Median : 8.0
Mean : 9.5
3rd Qu.:11.0
Max. :17.0
library(VGAM)
model.b = vglm(Result ~ Sex,
family=multinomial(refLevel=1),
weights = Count,
data = Data.b)
lrtest(model.b)
Likelihood ratio test
#Df LogLik Df Chisq Pr(>Chisq)
1 2 -23.612
2 3 -25.864 1 4.5041 0.03381 *
Summary table of results
Comparison p-value
Bloom–Female - Bloom–Male 0.034
Cobblestone–Female - Cobblestone–Male 0.0052
Dougal–Female - Dougal–Male 0.44
Heimlich–Female - Heimlich–Male 0.14
p.value = c(0.034, 0.0052, 0.44, 0.14)
p.adj = p.adjust(p.value,
method = "fdr")
p.adj = signif(p.adj,
2)
p.adj
[1] 0.068 0.021 0.440 0.190
Comparison p-value p.adj
Bloom–Female - Bloom–Male 0.034 0.068
Cobblestone–Female - Cobblestone–Male 0.0052 0.021
Dougal–Female - Dougal–Male 0.44 0.44
Heimlich–Female - Heimlich–Male 0.14 0.19
It looks to me like qdrg() can be used. As I commented, you can't use the lazy interface; you have to give all the specific needed parameters:
> library(emmeans)
> RG = qdrg(formula(model), Data, coef(model), vcov(model), link = "log")
> RG
'emmGrid' object with variables:
Sex = Female, Male
County = Bloom, Cobblestone, Dougal, Heimlich
Transformation: “log”
> emmeans(RG, consec ~ Sex | County)
$emmeans
County = Bloom:
Sex emmean SE df asymp.LCL asymp.UCL
Female -0.588 0.558 Inf -1.68100 0.5054
Male 0.887 0.449 Inf 0.00711 1.7675
County = Cobblestone:
Sex emmean SE df asymp.LCL asymp.UCL
Female -1.012 0.584 Inf -2.15597 0.1328
Male 0.847 0.398 Inf 0.06643 1.6282
County = Dougal:
Sex emmean SE df asymp.LCL asymp.UCL
Female -0.251 0.504 Inf -1.23904 0.7364
Male -0.747 0.405 Inf -1.54032 0.0459
County = Heimlich:
Sex emmean SE df asymp.LCL asymp.UCL
Female -0.629 0.438 Inf -1.48668 0.2295
Male 0.194 0.361 Inf -0.51320 0.9015
Results are given on the log (not the response) scale.
Confidence level used: 0.95
$contrasts
County = Bloom:
contrast estimate SE df z.ratio p.value
Male - Female 1.475 0.716 Inf 2.060 0.0394
County = Cobblestone:
contrast estimate SE df z.ratio p.value
Male - Female 1.859 0.707 Inf 2.630 0.0085
County = Dougal:
contrast estimate SE df z.ratio p.value
Male - Female -0.496 0.646 Inf -0.767 0.4429
County = Heimlich:
contrast estimate SE df z.ratio p.value
Male - Female 0.823 0.567 Inf 1.450 0.1470
Results are given on the log (not the response) scale.
If I understand this model correctly, the response is the log of the ratio of the 2nd multinomial response to the 1st. So what we see above is estimated differences of logs, and estimated differences of those differences. If run with type = "response", you would get estimated ratios, and ratios of those ratios.
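For example (a sketch reusing the RG grid built above):

# same contrasts, back-transformed: estimated ratios of the 2nd to the 1st
# response category per cell, and ratios of those ratios for the contrasts
emmeans(RG, consec ~ Sex | County, type = "response")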
Probably something changed in either the VGAM package or the lmtest package since that was written.
But the following will work for a likelihood ratio test for vglm models:
VGAM::lrtest(model)          # likelihood ratio test for a single vglm fit
VGAM::lrtest(model, model2)  # or compare two nested vglm fits
I am trying to replicate Stata's marginal effects from multinomial logit models in R, but with no success. For the multinomial logit model I used the multinom() function from the nnet package, and for the marginal effects I used the margins package, but the marginal_effects() function seems to only display effects of a single variable. What if I want the marginal effects of a variable conditioned on another variable? Here is the output from Stata:
. margins, dydx(male) at(site=(1 2 3)) #male conditioned on site
Average marginal effects Number of obs = 615
Model VCE : OIM
dy/dx w.r.t. : 1.male
1._predict : Pr(insure==Indemnity), predict(pr outcome(1))
2._predict : Pr(insure==Prepaid), predict(pr outcome(2))
3._predict : Pr(insure==Uninsure), predict(pr outcome(3))
1._at : site = 1
2._at : site = 2
3._at : site = 3
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.male |
_predict#_at |
1 1 | -.1492951 .0728108 -2.05 0.040 -.2920016 -.0065885
1 2 | -.159346 .0723512 -2.20 0.028 -.3011517 -.0175403
1 3 | -.055138 .0875712 -0.63 0.529 -.2267745 .1164984
2 1 | .0763095 .0765406 1.00 0.319 -.0737074 .2263264
2 2 | .1747759 .0730055 2.39 0.017 .0316877 .3178641
2 3 | .0861997 .0843816 1.02 0.307 -.0791852 .2515846
3 1 | .0729855 .0516839 1.41 0.158 -.0283131 .1742842
3 2 | -.0154299 .0104982 -1.47 0.142 -.036006 .0051462
3 3 | -.0310617 .0495625 -0.63 0.531 -.1282025 .0660791
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
My attempt to calculate the marginal effects of male using the marginal_effects function:
library(nnet)
sysdsn1$insure <- as.factor(sysdsn1$insure)
sysdsn1$male <- as.factor(sysdsn1$male)
sysdsn1$site <- as.factor(sysdsn1$site)
sysdsn1$nonwhite <- as.factor(sysdsn1$nonwhite)
sysdsn1$insure <- relevel(sysdsn1$insure, ref = "3") #set the reference level
mn0 <- multinom(insure ~ age + male*site + nonwhite, data = sysdsn1) #multinomial logit model
head(marginal_effects(mn0, variables = "male")) # this only calculates the marginal effects of male; how to condition on site?
dydx_male1
1 -0.01310874
2 -0.01744213
3 0.07911846
4 -0.03386199
5 -0.01728126
6 -0.01638176
Data
Data can be downloaded from http://www.stata-press.com/data/r13/sysdsn1.dta and imported into R
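One avenue worth trying: like Stata's margins command, the margins package exposes an at argument, so you may be able to request the effect of male at each level of site. An untested sketch; whether margins fully supports multinom fits this way is an assumption:

library(margins)
# condition the marginal effect of male on site, analogous to Stata's
# `margins, dydx(male) at(site=(1 2 3))`
mfx <- margins(mn0, variables = "male", at = list(site = c("1", "2", "3")))
summary(mfx)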
I normally work with the lme4 package, but the glmmTMB package is becoming increasingly better suited to highly complicated data (think overdispersion and/or zero-inflation).
Is there a way to extract posterior modes and credible intervals from glmmTMB models, similar to how it is done for lme4 models (example here)?
Details:
I am working with count data (available here) that are zero-inflated and overdispersed and have random effects. The package best suited to this sort of data is glmmTMB (details here). (Note two outliers: euc0 == 78 and np_other_grass == 20.)
The data looks like this:
euc0 ea_grass ep_grass np_grass np_other_grass month year precip season prop_id quad
3 5.7 0.0 16.7 4.0 7 2006 526 Winter Barlow 1
0 6.7 0.0 28.3 0.0 7 2006 525 Winter Barlow 2
0 2.3 0.0 3.3 0.0 7 2006 524 Winter Barlow 3
0 1.7 0.0 13.3 0.0 7 2006 845 Winter Blaber 4
0 5.7 0.0 45.0 0.0 7 2006 817 Winter Blaber 5
0 11.7 1.7 46.7 0.0 7 2006 607 Winter DClark 3
The glmmTMB model:
model <- glmmTMB(euc0 ~ ea_grass + ep_grass + np_grass + np_other_grass + (1|prop_id), data = euc, family = nbinom2) # nbinom2 lets the variance increase quadratically
summary(model)
confint(model) #this gives the confidence intervals
How I would normally extract the posterior mode and credible intervals for an lmer/glmer model:
# extracting model estimates and credible intervals
sm.model <- arm::sim(model, n.sim = 1000)
smfixef.model <- sm.model@fixef
smfixef.model <- coda::as.mcmc(smfixef.model)
MCMCglmm::posterior.mode(smfixef.model) # mode of the distribution
coda::HPDinterval(smfixef.model)        # credible intervals
# among-brood variance
bid <- sm.model@ranef$prop_id[,,1]
bvar <- as.vector(apply(bid, 1, var)) # posterior distribution of the between-brood variance
bvar <- coda::as.mcmc(bvar)
MCMCglmm::posterior.mode(bvar) # mode of the distribution
coda::HPDinterval(bvar)        # credible intervals
Most of an answer:
Getting a multivariate normal sample of the parameters of the conditional model is pretty easy (I think this is what arm::sim() is doing):
library(MASS)
pp <- fixef(model)$cond
vv <- vcov(model)$cond
samp <- MASS::mvrnorm(1000, mu=pp, Sigma=vv)
(then use the rest of your method above).
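Spelled out with the rest of the recipe from the question, that would be (a sketch reusing the samp matrix above):

library(coda)
library(MCMCglmm)

samp_mcmc <- coda::as.mcmc(samp)     # treat the multivariate normal draws as posterior samples
MCMCglmm::posterior.mode(samp_mcmc)  # modes of the fixed-effect distributions
coda::HPDinterval(samp_mcmc)         # 95% credible intervals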
I'm a little skeptical that your second example is doing what you want it to do. The variance of the conditional modes is not necessarily a good estimate of the between-group variance (e.g. see here). Furthermore, I'm nervous about the half-assed Bayesian approach (e.g., why no priors? why look at the posterior mode, which is rarely a meaningful value in a Bayesian context?), although I do sometimes use similar approaches myself! However, it's not too hard to use glmmTMB results to do a proper Markov chain Monte Carlo analysis:
library(tmbstan)
library(rstan)
library(coda)
library(emdbook) ## for lump.mcmc.list(), or use runjags::combine.mcmc()
t2 <- system.time(m2 <- tmbstan(model$obj))
m3 <- rstan::As.mcmc.list(m2)
lattice::xyplot(m3,layout=c(5,6))
m4 <- emdbook::lump.mcmc.list(m3)
coda::HPDinterval(m4)
It may be helpful to know that the theta column of m4 is the log of the among-group standard deviation ...
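To report that on the standard-deviation scale, exponentiate the draws before summarizing. A sketch; the exact column name is an assumption, so match it with grep():

m4_mat <- as.matrix(m4)
theta_cols <- grep("theta", colnames(m4_mat))  # log-SD column(s); name assumed
sd_draws <- coda::as.mcmc(exp(m4_mat[, theta_cols, drop = FALSE]))
coda::HPDinterval(sd_draws)  # credible interval on the SD scale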
(See vignette("mcmc", package="glmmTMB") for a little bit more information ...)
I think Ben has already answered your question, so my answer does not add much to the discussion... Maybe just one thing: you wrote in your comments that you're interested in the within- and between-group variances. You can get this information via parameters::random_parameters() (if I did not misunderstand what you were looking for). See the example below, which first generates simulated samples from a multivariate normal (just like in Ben's example) and then gives you a summary of the random-effect variances...
library(readr)
library(glmmTMB)
library(parameters)
library(bayestestR)
library(insight)
euc_data <- read_csv("D:/Downloads/euc_data.csv")
model <-
glmmTMB(
euc0 ~ ea_grass + ep_grass + np_grass + np_other_grass + (1 | prop_id),
data = euc_data,
family = nbinom2
) # nbinom2 lets the variance increase quadratically
# generate samples
samples <- parameters::simulate_model(model)
#> Model has no zero-inflation component. Simulating from conditional parameters.
# describe samples
bayestestR::describe_posterior(samples)
#> # Description of Posterior Distributions
#>
#> Parameter | Median | 89% CI | pd | 89% ROPE | % in ROPE
#> --------------------------------------------------------------------------------
#> (Intercept) | -1.072 | [-2.183, -0.057] | 0.944 | [-0.100, 0.100] | 1.122
#> ea_grass | -0.001 | [-0.033, 0.029] | 0.525 | [-0.100, 0.100] | 100.000
#> ep_grass | -0.050 | [-0.130, 0.038] | 0.839 | [-0.100, 0.100] | 85.297
#> np_grass | -0.020 | [-0.054, 0.012] | 0.836 | [-0.100, 0.100] | 100.000
#> np_other_grass | -0.002 | [-0.362, 0.320] | 0.501 | [-0.100, 0.100] | 38.945
# or directly get summary of sample description
sp <- parameters::simulate_parameters(model, ci = .95, ci_method = "hdi", test = c("pd", "p_map"))
sp
#> Model has no zero-inflation component. Simulating from conditional parameters.
#> # Description of Posterior Distributions
#>
#> Parameter | Coefficient | p_MAP | pd | CI
#> --------------------------------------------------------------
#> (Intercept) | -1.037 | 0.281 | 0.933 | [-2.305, 0.282]
#> ea_grass | -0.001 | 0.973 | 0.511 | [-0.042, 0.037]
#> ep_grass | -0.054 | 0.553 | 0.842 | [-0.160, 0.047]
#> np_grass | -0.019 | 0.621 | 0.802 | [-0.057, 0.023]
#> np_other_grass | 0.019 | 0.999 | 0.540 | [-0.386, 0.450]
plot(sp) + see::theme_modern()
#> Model has no zero-inflation component. Simulating from conditional parameters.
# random effect variances
parameters::random_parameters(model)
#> # Random Effects
#>
#> Within-Group Variance 2.92 (1.71)
#> Between-Group Variance
#> Random Intercept (prop_id) 2.1 (1.45)
#> N (groups per factor)
#> prop_id 18
#> Observations 346
insight::get_variance(model)
#> Warning: mu of 0.2 is too close to zero, estimate of random effect variances may be unreliable.
#> $var.fixed
#> [1] 0.3056285
#>
#> $var.random
#> [1] 2.104233
#>
#> $var.residual
#> [1] 2.91602
#>
#> $var.distribution
#> [1] 2.91602
#>
#> $var.dispersion
#> [1] 0
#>
#> $var.intercept
#> prop_id
#> 2.104233
Created on 2020-05-26 by the reprex package (v0.3.0)
I'm trying to run anova() in R and running into some difficulty. This is what I've done up to now to help shed some light on my question.
Here is the str() of my data to this point.
str(mhw)
'data.frame': 500 obs. of 5 variables:
$ r : int 1 2 3 4 5 6 7 8 9 10 ...
$ c : int 1 1 1 1 1 1 1 1 1 1 ...
$ grain: num 3.63 4.07 4.51 3.9 3.63 3.16 3.18 3.42 3.97 3.4 ...
$ straw: num 6.37 6.24 7.05 6.91 5.93 5.59 5.32 5.52 6.03 5.66 ...
$ Quad : Factor w/ 4 levels "NE","NW","SE",..: 2 2 2 2 2 2 2 2 2 2 ...
Column r is a numerical value indicating the row in the field in which an individual plot resides
Column c is a numerical value indicating the column in which an individual plot resides
Column Quad corresponds to the geographical quadrant of the field in which each plot resides
Quad <- ifelse(mhw$c > 13 & mhw$r < 11, "NE",ifelse(mhw$c < 13 & mhw$r < 11,"NW", ifelse(mhw$c < 13 & mhw$r >= 11, "SW","SE")))
mhw <- cbind(mhw, Quad)
I have fitted an lm() as follows:
nov.model <- lm(mhw$grain ~ mhw$straw)
anova(nov.model)
This is an anova() for the entire field, which is testing grain yield against straw yield for each plot in the dataset.
My trouble is that I want to run an individual anova() for each level of the Quad column, to test grain yield against straw yield within each quadrant.
Perhaps a with() might fix that. I have never used it before, and I am currently in the process of learning R. Any help would be greatly appreciated.
I think you are looking for the by() facility in R.
fit <- with(mhw, by(mhw, Quad, function (dat) lm(grain ~ straw, data = dat)))
Since you have 4 levels in Quad, you end up with 4 linear models in fit, i.e., fit is a "by" class object (a type of "list") of length 4.
To get the coefficients for each model, you can use
sapply(fit, coef)
To produce the model summaries, use
lapply(fit, summary)
To get the ANOVA table for each model, use
lapply(fit, anova)
As a reproducible example, I am taking the example from ?by:
tmp <- with(warpbreaks,
by(warpbreaks, tension,
function(x) lm(breaks ~ wool, data = x)))
class(tmp)
# [1] "by"
mode(tmp)
# [1] "list"
sapply(tmp, coef)
# L M H
#(Intercept) 44.55556 24.000000 24.555556
#woolB -16.33333 4.777778 -5.777778
lapply(tmp, anova)
#$L
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 1200.5 1200.50 5.6531 0.03023 *
#Residuals 16 3397.8 212.36
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#$M
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 102.72 102.722 1.2531 0.2795
#Residuals 16 1311.56 81.972
#
#$H
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 150.22 150.222 2.3205 0.1472
#Residuals 16 1035.78 64.736
I was aware of this option, but not familiar with it. Thanks to @Roland for providing code for the above reproducible example:
library(nlme)
lapply(lmList(breaks ~ wool | tension, data = warpbreaks), anova)
For your data I think it would be
fit <- lmList(grain ~ straw | Quad, data = mhw)
lapply(fit, anova)
You don't need to install nlme; it comes with R as one of the recommended packages.