how to calculate heritability from half-sib design - r

I'm trying to measure heritability of a trait, flowering time (FT), for a set of data collected from a half-sib design. The data includes FT for each mother plant and 2 half siblings from that mother plant for ~150 different maternal lines (ML). Paternity is unknown.
I've tried:
Estimating heritability with a regression of the maternal FT and the mean sibling FT, and doubling the slope. This worked fine, and produced an estimate of 0.14.
Running an ANOVA and using the between ML variation to estimate additive genetic variance. Got the idea from slide 25 of this powerpoint and from this thread on within and between variance calculation
fit = lm(FT ~ ML, her)
anova(fit)
her is the dataset, which, in this case, only includes the half sib FT values (I excluded the mother FT values for this attempt at heritability)
From the ANOVA output I have used the "ML" term mean square as the between ML variation, which is also equal to 1/4 of the additive genetic variance because coefficient of relatedness between half-sibs is 0.25. This value turned out to be 0.098. Also, by multiplying this by 4 I could get the additive genetic variance.
I have used the "residuals" mean square as all variability save for that accounted for by the "ML" term. So, all of variance minus 1/4 of additive genetic variance. This turned out to be 1.342.
And then attempted to calculate heritability as Va/Vp = (4*0.098)/(1.342 + 0.098) = 0.27
This is quite different from my slope estimate, and I'm not sure if my reasoning is correct.
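For comparison, here is a minimal sketch of the textbook sib-analysis calculation on simulated data. It assumes a balanced design with k = 2 offspring per maternal line, that offspring within a line are half-sibs, and that there are no maternal or common-environment effects. The data frame her_sim and the variance values used to simulate it are made up for illustration. In this version the between-line variance component, (MS_between - MS_within)/k, is taken as the estimate of 1/4 of the additive genetic variance, rather than the ML mean square itself.

```r
set.seed(1)
n_fam <- 150   # number of maternal lines
k     <- 2     # offspring per maternal line (balanced)

# Hypothetical simulated data: family variance 0.1, within-family variance 1.3
fam_eff <- rnorm(n_fam, sd = sqrt(0.1))
her_sim <- data.frame(
  ML = factor(rep(seq_len(n_fam), each = k)),
  FT = rep(fam_eff, each = k) + rnorm(n_fam * k, sd = sqrt(1.3))
)

# One-way ANOVA on the offspring values only
tab <- anova(lm(FT ~ ML, data = her_sim))
ms_between <- tab["ML", "Mean Sq"]
ms_within  <- tab["Residuals", "Mean Sq"]

# Method-of-moments variance components for a balanced one-way design
var_fam    <- (ms_between - ms_within) / k  # estimates 1/4 V_A under the assumptions
var_within <- ms_within                     # within-family variance

# Phenotypic variance is the sum of the two components
h2 <- 4 * var_fam / (var_fam + var_within)
h2
```

With real data, her_sim would be replaced by the offspring-only data set. A negative var_fam can occur by chance when differences among maternal lines are small relative to the within-line variation.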
I've tried things with the sommer and heritability packages of R but haven't had success using either for a half-sib design and haven't found an example of a half-sib design with either package.
Any suggestions?


Output from Linear Mixed Models differs from Estimated Marginal Means

I have a query about the output statistics gained from linear mixed models (using the lmer function) relative to the output statistics taken from the estimated marginal means gained from this model
Essentially, I am running an LMM comparing the within-subjects effect of different contexts (with "Negative" coded as the baseline) on enjoyment ratings. The LMM output suggests that the difference between negative and polite contexts is not significant, with a p-value of .35. See the screenshot below with the relevant line highlighted:
[Screenshot: LMM output]
However, when I then run the lsmeans function on the same model (with the Holm correction), the p-value for the comparison between Negative and Polite context categories is now .05, and all of the other statistics have changed too. Again, see the screenshot below with the relevant line highlighted:
[Screenshot: LSMeans output]
I'm probably being dense because my understanding of LMMs isn't hugely advanced, but I've tried to Google the reason for this and yet I can't seem to find out why? I don't think it has anything to do with the corrections because the smaller p-value is observed when the Holm correction is used. Therefore, I was wondering why this is the case, and which value I should report/stick with and why?
Thank you for your help!
Regression coefficients and marginal means are not one and the same. Once you learn these concepts it'll be easier to figure out which one is more informative and therefore which one you should report.
After we fit a regression by estimating its coefficients, we can predict the outcome y_i given the m input variables X_i = (X_i1, ..., X_im). If the inputs are informative about the outcome, the predicted y_i is different for different X_i. If we average the predictions y_i for examples with X_ij = x_j, we get the marginal effect of the jth feature at the value x_j. It's crucial to keep track of which inputs are kept fixed (and at what values) and which inputs are averaged over (aka marginalized out).
In your case, contextCatPolite in the coefficients summary is the difference between Polite and Negative when smileType is set to its reference level (no reward, I'd guess). In the emmeans contrasts, Polite - Negative is the average difference over all smileTypes.
Interactions have a way of making interpretation more challenging and your model includes an interaction between smileType and contextCat. See Interaction analysis in emmeans.
To add to @dipetkov's answer, the coefficients in your LMM are based on treatment coding (sometimes called 'dummy' coding). With the interactions in the model, these coefficients are no longer "main effects" in the traditional sense of factorial ANOVA. For instance, if you have:
y = b_0 + b_1(X_1) + b_2(X_2) + b_3 (X_1 * X_2)
...b_1 is "the effect of X_1" only when X_2 = 0:
y = b_0 + b_1(X_1) + b_2(0) + b_3 (X_1 * 0)
y = b_0 + b_1(X_1)
Thus, as @dipetkov points out, 1.625 is not the difference between Negative and Polite on average across all other factors (which you get from emmeans). Instead, this coefficient is the difference between Negative and Polite specifically when smileType = 0.
If you use sum-to-zero contrast coding (also called deviation or effect coding) for smileType instead of treatment coding, then the contextCat coefficients from the regression output would match the differences in estimated marginal means, because smileType = 0 would then correspond to the average across smile types. The coding scheme thus has a huge effect on the estimated values and statistical significance of regression coefficients, but it should not affect F-tests based on the reduction in deviance/variance (because no matter how you code it, a given variable explains the same amount of variance).
https://stats.oarc.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/
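The effect of the coding scheme can be checked on a small simulated example. This is only a sketch: the data, the factor names context and smile, and their levels are invented, and base R's lm stands in for lmer so that the block has no package dependencies. It compares the context coefficient under treatment coding and under sum-to-zero coding of the other factor with differences in the raw cell means.

```r
set.seed(42)

# Hypothetical balanced 2 x 3 design with 20 replicates per cell
d <- expand.grid(context = c("Negative", "Polite"),
                 smile   = c("none", "reward", "affiliation"),
                 rep     = 1:20)
# Add an effect of Polite only in the "reward" cell, so an interaction exists
d$y <- rnorm(nrow(d)) +
  ifelse(d$context == "Polite" & d$smile == "reward", 2, 0)

# Treatment (dummy) coding for both factors: R's default
fit_trt <- lm(y ~ context * smile, data = d)

# Sum-to-zero coding for smile only; context keeps treatment coding
fit_sum <- lm(y ~ context * smile, data = d,
              contrasts = list(smile = "contr.sum"))

# Polite - Negative difference within each smile level, from raw cell means
cell <- tapply(d$y, list(d$context, d$smile), mean)
diff_by_smile <- cell["Polite", ] - cell["Negative", ]

# Treatment coding: difference at the reference smile level ("none")
b_trt <- unname(coef(fit_trt)["contextPolite"])
# Sum coding: difference averaged with equal weight over smile levels
b_sum <- unname(coef(fit_sum)["contextPolite"])

c(treatment = b_trt, at_reference = unname(diff_by_smile["none"]))
c(sum_coded = b_sum, averaged = mean(diff_by_smile))
```

In this balanced, saturated model the two pairs agree exactly. With unbalanced data or a mixed model the same interpretation holds for the model-based cell estimates rather than for the raw cell means.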

GLS / GLM nested design with autocorrelation over time

Still fairly new to GLM and a bit confused about how to establish my model.
About my project:
I sampled the microbiome (and measured a diversity index value = Shannon) from the root system of a sample of 9 trees (=tree1_cat).
In each tree I sampled fine and thick roots (=rootpart) and each tree was sampled four times (=days) over the course of one season. Thus I have a nested design but have to keep time in mind for autocorrelation. Also, not all values are present, so I have a few missing values. So far I have tried and tested the following:
Model <- gls(Shannon ~ tree1_cat/rootpart + tree1_cat + days,
na.action = na.omit, data = psL.meta,
correlation = corAR1(form =~ 1|days),
weights = varIdent(form= ~ 1|days))
Furthermore, I've tried to get more insight and used anova(Model) to get the p-values of those factors. Am I allowed to use those p-values? Also, I've used emmeans(Model, specs = pairwise ~ rootpart) for pairwise comparisons, but since rootpart was entered as a nested factor it only gives me the paired interactions.
It all works, but I am not sure, whether this is the right model! Any help would be highly appreciated!
It would be helpful to know your scientific question, but let's suppose you're interested in differences in Shannon diversity between fine and thick roots and in time trends. A model you could use would be:
library(lmerTest)
lmer(Shannon ~ rootpart*days + (rootpart*days|tree1_cat), data = ...)
The fixed-effect component rootpart*days can be expanded into 1 + rootpart + days + rootpart:days (where 1 signifies the intercept)
intercept: Shannon diversity (SD) in fine roots on day 0 (hopefully that's the beginning of the season)
rootpart: difference between fine and thick roots on day 0
days: change per day in SD in fine roots (slope)
rootpart:days: difference in slope between thick roots and fine roots
The random-effect component (rootpart*days|tree1_cat) measures how all four of these effects vary across trees, and their correlations (e.g. do trees with a larger-than-average difference between fine and thick roots on day 0 also have a larger-than-average change over time in fine root SD?)
This 'maximal' random effects model is almost certainly too complex for your data. A rough rule of thumb says you should have 10-20 data points per parameter estimated, and the fixed-effect model alone takes 4 parameters. A full model with 4 random effects requires the estimate of a 4×4 covariance matrix, which has (4*5)/2 = 10 parameters all by itself. I might just try (1+days|tree1_cat) (random slopes) or (rootpart|tree1_cat) (among-tree difference in fine vs. thick differences), with a bias towards allowing for the variation in the effect that is your primary interest (e.g. if your primary question is about fine vs. thick then go with (rootpart|tree1_cat)).
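The parameter count above can be set against the size of the design described in the question (9 trees, 2 root parts, 4 sampling dates). The sketch below assumes one observation per tree, root part and date before any missing values are dropped, and counts one residual variance in addition to the fixed effects and the covariance matrix.

```r
# Design from the question: 9 trees x 2 root parts x 4 sampling dates
n_obs_max <- 9 * 2 * 4            # observations before removing missing values

n_fixed <- 4                      # intercept, rootpart, days, rootpart:days
n_re    <- 4                      # one random effect per fixed-effect term
n_cov   <- n_re * (n_re + 1) / 2  # variances and covariances in a 4 x 4 matrix
n_resid <- 1                      # residual variance

n_par <- n_fixed + n_cov + n_resid
n_obs_max / n_par                 # observations per estimated parameter
```

This gives 72 observations for 15 parameters, or about 4.8 observations per parameter, well short of the 10-20 suggested by the rule of thumb.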
I probably wouldn't worry about autocorrelation at all, nor heteroscedasticity by day (your varIdent(~1|days) term) unless those patterns are very strongly evident in the data.
If you want to allow for autocorrelation you'll need to fit the model with nlme::lme or glmmTMB (lmer still doesn't have machinery for autocorrelation models); something like
library(nlme)
lme(Shannon ~ rootpart*days,
random = ~days|tree1_cat,
data = ...,
correlation = corCAR1(form = ~days|tree1_cat)
)
You need to use corCAR1 (continuous-time autoregressive order-1) rather than the more common corAR1 for unevenly sampled data. Be aware that lme is more finicky/worse at dealing with singular models, so you may discover you need to simplify your model before you can actually get this model to run.

How to specify icc_pre_subject and var_ratio in study_parameters function (powerlmm package)?

I am trying to conduct a power analysis for studies that I will analyze with a linear mixed model. I conducted a pilot study to see the effect sizes of the fixed effects and the random-effects estimates, which are required as inputs to an R function, study_parameters().
First, I built an lmer model using the data from the pilot study. In the model, the reaction time for the stimuli is the dependent variable, and the experimental condition (with 2 levels), the number of trials (from 0 to 159, coded as numeric values), and the interaction between the condition and the number of trials are included as fixed factors. The experimental condition is a between-subject factor, but the number of trials is a within-subject factor: all participants go through the trials from 0 to 159. For random effects, I set a random intercept and slope for participants, and a random intercept for the beauty rating of each item (as a control factor). Together, the model looks like:
lmer(Reaction time ~ Condition*Number of trial + (1 + Number of trial|Subject) + (1|Beautyrating))
For the power analysis I want to use the function study_parameters() in the powerlmm package. In this function, we have to specify icc_pre_subject and var_ratio as the parameters that carry the random-effect variance information. What I want to do here is set these parameters based on the results of the pilot study.
From the tutorial, the two variables are defined as follows:
icc_pre_subject: the amount of the total baseline variance the is between-subjects. (there is a typo in the sentence in the tutorial). icc_pre_subject would be the 2-level ICC if there was no random slopes.
icc_pre_subject = var(subject_intercepts)/(var(subject_intercepts) + var(within-subject_error))
var_ratio: the ratio of total random slope variance over the level-1 residual variance.
var_ratio = var(subject_slopes)/var(within-subject_error)
Here, I am not sure what var(within-subject_error) means, or how to specify it.
These are the random-effects results from the model fitted to the pilot study data:
My question:
Which numbers should I use to specify icc_pre_subject and var_ratio in study_parameters()?
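As a rough sketch of how the two formulas above translate into numbers, assuming that "within-subject error" refers to the residual variance (the Residual row of the lmer random-effects table). The variance components below are hypothetical placeholders, not values from the pilot study. Note that the random intercept for Beautyrating does not appear in either formula as quoted.

```r
# Hypothetical variance components (NOT taken from the pilot study);
# in practice they would be read off the random-effects table of the fitted
# model, e.g. via as.data.frame(VarCorr(fit)) in lme4
var_subject_intercept <- 400    # variance of subject random intercepts
var_subject_slope     <- 0.5    # variance of subject random slopes for trial
var_residual          <- 1600   # residual variance, assumed to be the
                                # "within-subject error" in the formulas

# Share of baseline variance that lies between subjects
icc_pre_subject <- var_subject_intercept /
  (var_subject_intercept + var_residual)

# Random slope variance relative to the residual variance
var_ratio <- var_subject_slope / var_residual

c(icc_pre_subject = icc_pre_subject, var_ratio = var_ratio)
```

These two values would then be passed to study_parameters() as its icc_pre_subject and var_ratio arguments, alongside the other design settings the function requires.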

Estimation of random versus fixed effect size in mixed models in r

I am analyzing the effect of food deprivation on bird chicks' calls. The idea is to show that food deprivation (the experiment) contributes much more to acoustic parameter changes than species or individual identity. We have 3 species and 8 individuals of each species. A reviewer of our manuscript said that we should use mixed models with experiment and species as fixed effects and individual as a random effect. I would like to rank the contributions of all 3 variables. How do I do it? I get the variance that the random intercept adds to the population intercept, but it's not quite what I need. Also, anova(model_name) gives me F-values for both fixed effects but not for the random one. A paper by Nakagawa & Schielzeth (2013) discusses R-squared for random vs. fixed effects, but I have two fixed variables and would like to rank all three. Any thoughts?
Thank you!
Here is the code (a model with a random intercept works better than intercept and slope).
m_Fmax <- lme(F_max ~ Experiment + Species + Experiment*Species, random=~1|chicks, method = 'REML')

Variable sample size per cluster/group in mixed effects logistic regression

I am attempting to run mixed effects logistic regression models, yet am concerned about the variable sample sizes in each cluster/group, and also the very low number of "successes" in some models.
I have ~ 700 trees distributed across 163 field plots (i.e., the cluster/group), visited annually from 2004-11. I am fitting separate mixed effects logistic regression models (hereafter GLMMs) for each year of the study to compare this output to inference from a shared frailty model (i.e., survival analysis with random effect).
The number of trees per plot varies from 1-22. Also, some years have a very low number of "successes" (i.e., diseased trees). For example, in 2011 there were only 4 successes out of 694 "failures" (i.e., healthy trees).
My questions are: (1) is there a general rule for the ideal number of samples|group when the inference focus is only on estimating the fixed effects in the GLMM, and (2) are GLMMs stable when there is such an extreme difference in the ratio of successes:failures.
Thank you for any advice or suggestions of sources.
-Sarah
(Hi, Sarah, sorry I didn't answer previously via e-mail ...)
It's hard to answer these questions in general -- you're stuck
with your data, right? So it's not a question of power analysis.
If you want to make sure that your results will be reasonably
reliable, probably the best thing to do is to run some simulations.
I'm going to show off a fairly recent feature of lme4 (in the
development version 1.1-1, on Github), which is to simulate
data from a GLMM given a formula and a set of parameters.
First I have to simulate the predictor variables (you wouldn't
have to do this, since you already have the data -- although
you might want to try varying the range of number of plots,
trees per plot, etc.).
set.seed(101)
## simulate number of trees per plot
## want mean of 700/163=4.3 trees, range=1-22
## by trial and error this is about right
r1 <- rnbinom(163,mu=3.3,size=2)+1
## generate plots and trees within plots
d <- data.frame(plot=factor(rep(1:163,r1)),
tree=factor(unlist(lapply(r1,seq))))
## expand by year
library(plyr)
d2 <- ddply(d,c("plot","tree"),
transform,year=factor(2004:2011))
Now set up the parameters: I'm going to assume year is a fixed
effect and that overall disease incidence is plogis(-2)=0.12 except
in 2011 when it is plogis(-2-3)=0.0067. The among-plot standard deviation
is 1 (on the logit scale), as is the among-tree-within-plot standard
deviation:
beta <- c(-2,0,0,0,0,0,0,-3)
theta <- c(1,1) ## sd by plot and plot:tree
Now simulate: year as fixed effect, plot and tree-within-plot as
random effects
library(lme4)
s1 <- simulate(~year+(1|plot/tree),family=binomial,
newdata=d2,newparams=list(beta=beta,theta=theta))
d2$diseased <- s1[[1]]
Summarize/check:
d2sum <- ddply(d2,c("year","plot"),
summarise,
n=length(tree),
nDis=sum(diseased),
propDis=nDis/n)
library(ggplot2)
library(Hmisc) ## for mean_cl_boot
theme_set(theme_bw())
ggplot(d2sum,aes(x=year,y=propDis))+geom_point(aes(size=n),alpha=0.3)+
stat_summary(fun.data=mean_cl_boot,colour="red")
Now fit the model:
g1 <- glmer(diseased~year+(1|plot/tree),family=binomial,
data=d2)
fixef(g1)
You can try this many times and see how often the results are reliable ...
As Josh said, this is a better question for CrossValidated.
There are no hard and fast rules for logistic regression, but one rule of thumb is that 10 successes and 10 failures are needed per cell in the design (cluster in this case) times the number of continuous variables in the model.
In your case, I would think the model, if it converges, would be unstable. You can examine that by bootstrapping the errors of the estimates of the fixed effects.
