Wilcoxon test using data stratification - R

I have a really basic problem. I have the concentrations of one chemical stored in one column and the gender of the study participant in a second column.
What is the code to do the Wilcoxon test to see if there is a difference between the concentrations found in boys and the concentrations found in girls? Some explanation of the code would also be useful for me to understand how it works. Thanks!
I got the following code for an ANOVA to work, which would also be fine. Can anyone tell me whether it does what I need?
av <- aov(UC_MEHP ~ BQF05C1, data=data)
av
summary(av)
The output looks like this:
> av <- aov(UC_MEHP ~ BQF05C1, data=data)
> av
Call:
aov(formula = UC_MEHP ~ BQF05C1, data = data)
Terms:
BQF05C1 Residuals
Sum of Squares 0.3445 2917.4564
Deg. of Freedom 1 151
Residual standard error: 4.395555
Estimated effects may be unbalanced
21 observations deleted due to missingness
> summary(av)
Df Sum Sq Mean Sq F value Pr(>F)
BQF05C1 1 0.3 0.344 0.018 0.894
Residuals 151 2917.5 19.321
21 observations deleted due to missingness
I'm sorry, I know it's not a very advanced question...

From ?wilcox.test:
## S3 method for class 'formula'
wilcox.test(formula, data, subset, na.action, ...)
...
formula: a formula of the form ‘lhs ~ rhs’ where ‘lhs’ is a numeric
variable giving the data values and ‘rhs’ a factor with two
levels giving the corresponding groups.
So wilcox.test(UC_MEHP ~ BQF05C1, data=data) should work (assuming that BQF05C1 is the column specifying gender and UC_MEHP is the concentration).
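For example, here is a minimal self-contained sketch (the column names follow the question, but the simulated values are invented):
set.seed(1)
dat <- data.frame(
  UC_MEHP = c(rnorm(75, mean = 5), rnorm(78, mean = 5.4)),    # chemical concentrations
  BQF05C1 = factor(rep(c("boy", "girl"), times = c(75, 78)))  # gender, a factor with two levels
)
# Wilcoxon rank-sum (Mann-Whitney) test comparing boys and girls
wilcox.test(UC_MEHP ~ BQF05C1, data = dat)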

Related

Is it possible to compute error terms for each sample using stan_lmer?

(this is my first question here, please be gentle :)
I ran a two-level model with intercept and a single predictor using this code:
TE_stanlmer1 <- stan_lmer(formula = LastCost ~ 1 + C.CostBefore + (1 | Therapist),
data = d3,
seed = 349)
These are the error terms and the auxiliary parameters I received:
Auxiliary parameter(s):
Median MAD_SD
sigma 2789.48 53.82
Error terms:
Groups Name Std.Dev.
Therapist (Intercept) 264
Residual 2790
Num. levels: Therapist 26
and these are the estimates:
mean sd 2.5% 97.5%
(Intercept) 1412.79 99.13 1217.75 1602.36
sigma 2789.55 54.61 2684.24 2898.25
Sigma[Therapist:(Intercept),(Intercept)] 69780.39 63411.61 1475.82 231756.53
I created a data frame with the following command:
posterior <- as.data.frame(TE_stanlmer1)
For each iteration, I got a value for the intercept, sigma, Sigma[Therapist:(Intercept),(Intercept)], and Therapist 1...Therapist 26.
Is it possible to compute error terms for each iteration? When comparing Sigma[Therapist:(Intercept),(Intercept)] to sigma, it seems like 99% of the variance comes from the Therapist level. The original data doesn't look this way but rather the other way around (as the error terms seem to imply), so I wanted to dig deeper into that.
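For instance, would something like the following be a sensible way to get per-draw quantities out of that data frame? (Just a sketch using the column names listed above; note that Sigma[Therapist:(Intercept),(Intercept)] is a variance while sigma is a standard deviation.)
post <- as.data.frame(TE_stanlmer1)
# between-therapist SD per draw (square root of the Therapist intercept variance)
therapist_sd <- sqrt(post[["Sigma[Therapist:(Intercept),(Intercept)]"]])
# residual SD per draw
residual_sd <- post[["sigma"]]
# share of variance at the Therapist level, per draw
icc <- therapist_sd^2 / (therapist_sd^2 + residual_sd^2)
quantile(icc, c(0.025, 0.5, 0.975))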
Thank you very much!

Two-level modelling with lme in R

I am interested in estimating a mixed effects model with two random components (I am sorry for the somewhat imprecise notation; I am somewhat new to these kinds of models). I also want the standard errors of the variances of the random components. That is why I am somewhat bound to using lme from the nlme package: I found this description of how to calculate those standard errors and, also interesting, the standard error for a function of these variances (link).
I believe I know how to use lmer from the lme4 package. I am ultimately interested in model2. For model1, both commands yield the same estimates, but model2 with lme yields different results than model2 with lmer from the lme4 package. Could you help me figure out how to set up the random components for lme? This would be much appreciated. Thanks. Please find my MWE below.
Best
Daniel
#### load all packages #####
loadpackage <- function(x) {
  for (i in x) {
    # require() returns TRUE invisibly if it was able to load the package
    if (!require(i, character.only = TRUE)) {
      # if the package could not be loaded, (re-)install it
      install.packages(i, dependencies = TRUE)
    }
    # load the package (after installing)
    library(i, character.only = TRUE)
  }
}
# Then try/install packages...
loadpackage( c("nlme", "msm", "lmeInfo", "lme4"))
alcohol1 <- read.table("https://stats.idre.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", header=T, sep=",")
attach(alcohol1)
id <- as.factor(id)
age <- as.factor(age)
model1.lmer <-lmer(alcuse ~ 1 + peer + (1|id))
summary(model1.lmer)
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age))
summary(model2.lmer)
model1.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id, method ="REML")
summary(model1.lme)
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id + 1|age, method ="REML")
Edit (15/09/2021):
Estimating the model as follows and then retrieving the estimates via nlme::VarCorr gives me different results. While the estimates seem to be in the ballpark, it is as if they are switched across components.
model2a.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2a.lme)
nlme::VarCorr(model2a.lme)
Variance StdDev
id = pdLogChol(1)
(Intercept) 0.38390274 0.6195989
age = pdLogChol(1)
(Intercept) 0.47892113 0.6920413
Residual 0.08282585 0.2877948
EDIT (16/09/2021):
Since Bob pushed me to think more about my model, I want to give some additional information. Please note that the data I use in the MWE do not match my true data; I just used them for illustrative purposes since I cannot upload my true data. I have a household panel with income, demographic information and parent indicators.
I am interested in intergenerational mobility. Sibling correlations of permanent income are one industry standard. At the very least, contemporaneous observations are very bad proxies of permanent income. Due to transitory shocks, i.e., classical measurement error, those estimates are most certainly attenuated. For this reason, we exploit the longitudinal dimension of our data.
For sibling correlations, this amounts to hypothesising that the income process is as follows:
$$Y_{ijt} = \beta X_{ijt} + \epsilon_{ijt}.$$
With Y being income of individual i from family j in year t. X comprises age and survey-year indicators to account for life-cycle effects and macroeconomic conditions in survey years. Epsilon is a compound term comprising a random individual and family component as well as a transitory component (measurement error and short-lived shocks). It looks as follows:
$$\epsilon_{ijt} = \alpha_i + \gamma_j + \eta_{ijt}.$$
The variance of income is then:
$$\sigma^2_\epsilon = \sigma^2_\alpha + \sigma^2_\gamma + \sigma^2_\eta.$$
The quantity we are interested in is
$$\rho = \frac{\sigma^2_\gamma}{\sigma^2_\alpha + \sigma^2_\gamma},$$
which reflects the share of the variation in permanent income that siblings share through family (and other shared characteristics).
B.t.w.: the struggle is simply because I want to have standard errors for all estimates and for $\rho$.
This is an example of crossed vs nested random effects. (Note that the example you refer to is fitting a different kind of model, a random-slopes model rather than a model with two different grouping variables ...)
If you try with(alcohol1, table(age,id)) you can see that every id is associated with every possible age (14, 15, 16). Or subset(alcohol1, id==1) for example:
id age coa male age_14 alcuse peer cpeer ccoa
1 1 14 1 0 0 1.732051 1.264911 0.2469111 0.549
2 1 15 1 0 1 2.000000 1.264911 0.2469111 0.549
3 1 16 1 0 2 2.000000 1.264911 0.2469111 0.549
There are three possible models you could fit for a model with random effects of age (indexed by i) and id (indexed by j):
crossed ((1|age) + (1|id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_j +epsr_{ij}; alcohol use varies among individuals and, independently, across ages (this model won't work very well because there are only three distinct ages in the data set, more levels are usually needed)
id nested within age ((1|age/id) = (1|age) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_{ij} + epsr_{ij}; alcohol use varies across ages, and varies across individuals within ages (see note above about number of levels).
age nested within id ((1|id/age) = (1|id) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_j + eps2_{ij} + epsr_{ij}; alcohol use varies across individuals, and varies across ages within individuals
Here eps1_i, eps2_{ij}, and epsr_{ij} are normal deviates; epsr is the residual error term.
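For concreteness, here are the three specifications written out as lmer calls (a sketch using the alcohol1 data read in above; the two nested fits are wrapped in try() because, as explained next, they fail here):
library(lme4)
alc <- transform(alcohol1, id = factor(id), age = factor(age))
# crossed random effects: (1|age) + (1|id)
m_crossed <- lmer(alcuse ~ 1 + peer + (1 | id) + (1 | age), data = alc)
# id nested within age
m_id_in_age <- try(lmer(alcuse ~ 1 + peer + (1 | age/id), data = alc))
# age nested within id
m_age_in_id <- try(lmer(alcuse ~ 1 + peer + (1 | id/age), data = alc))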
The latter two models actually don't make sense in this case; because there is only one observation per age/id combination, the nested variance (eps2) is completely confounded with the residual variance (epsr). lme doesn't notice this; if you try to fit one of the nested models in lmer it will give an error that
number of levels of each grouping factor must be < number of observations (problems: id:age)
(Although if you try to compute confidence intervals based on model2a.lme you'll get an error "cannot get confidence intervals on var-cov components: Non-positive definite approximate variance-covariance", which is a hint that something is wrong.)
You could restate this problem as saying that the residual variation, and the variation among ages within individuals, are jointly unidentifiable (can't be separated from each other, statistically).
The updated answer here shows how to get the standard errors of the variance components from an lmer model, so you shouldn't be stuck with lme (but you should think carefully about which model you're really trying to fit ...)
The GLMM FAQ might also be useful.
More generally, the standard error of
rho = (V_gamma)/(V_alpha + V_gamma)
will be hard to compute accurately, because this is a nonlinear function of the model parameters. You can apply the delta method, but the most reliable approach would be to use parametric bootstrapping: if you have a fitted model m, then something like this should work:
var_ratio <- function(m) {
  # one row per variance component: columns grp, var1, var2, vcov, sdcor
  v <- as.data.frame(VarCorr(m))
  v_family <- v$vcov[v$grp == "family"]
  v_id <- v$vcov[v$grp == "id"]
  v_family / (v_family + v_id)
}
confint(m, method = "boot", FUN = var_ratio)
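(Side note: confint(..., method = "boot") refits the model for every bootstrap replicate, so it is slower than the delta method, but it avoids relying on a local linear approximation of a ratio whose sampling distribution is bounded and typically skewed.)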
You should specify random effects in lme by using / not +
By lmer
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age), data = alcohol1)
summary(model2.lmer)
Linear mixed model fit by REML ['lmerMod']
Formula: alcuse ~ 1 + peer + (1 | id) + (1 | age)
Data: alcohol1
REML criterion at convergence: 651.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.0228 -0.5310 -0.1329 0.5854 3.1545
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.08078 0.2842
age (Intercept) 0.30313 0.5506
Residual 0.56175 0.7495
Number of obs: 246, groups: id, 82; age, 82
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.3039 0.1438 2.113
peer 0.6074 0.1151 5.276
Correlation of Fixed Effects:
(Intr)
peer -0.814
By lme
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2.lme)
Linear mixed-effects model fit by REML
Data: alcohol1
AIC BIC logLik
661.3109 678.7967 -325.6554
Random effects:
Formula: ~1 | id
(Intercept)
StdDev: 0.4381206
Formula: ~1 | age %in% id
(Intercept) Residual
StdDev: 0.4381203 0.7494988
Fixed effects: alcuse ~ 1 + peer
Value Std.Error DF t-value p-value
(Intercept) 0.3038946 0.1438333 164 2.112825 0.0361
peer 0.6073948 0.1151228 80 5.276060 0.0000
Correlation:
(Intr)
peer -0.814
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.0227793 -0.5309669 -0.1329302 0.5853768 3.1544873
Number of Observations: 246
Number of Groups:
id age %in% id
82 82
Okay, finally. Just to sketch my confidential data: I have a panel of individuals. The data include siblings, identified via mnr. income is earnings, wavey the survey year, age the age factors, female a factor for gender, and pid the factor identifying the individual.
m1 <- lmer(income ~ age + wavey + female + (1|pid) + (1 | mnr),
data = panel)
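# NOTE (assumption, not stated in the post): vcov(m1, full = TRUE) appears to rely on the
# merDeriv package; plain lme4's vcov() only returns the covariance of the fixed effects.
# With full = TRUE, the final rows/columns hold the sampling (co)variances of the variance
# components, which is what the [58:60, 58:60] block below extracts.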
vv <- vcov(m1, full = TRUE)
covvar <- vv[58:60, 58:60]
covvar
3 x 3 Matrix of class "dgeMatrix"
cov_pid.(Intercept) cov_mnr.(Intercept) residual
[1,] 2.6528679 -1.4624588 -0.4077576
[2,] -1.4624588 3.1015001 -0.0597926
[3,] -0.4077576 -0.0597926 1.1634680
mean <- as.data.frame(VarCorr(m1))$vcov
mean
[1] 17.92341 16.86084 56.77185
deltamethod(~ x2/(x1+x2), mean, covvar, ses =TRUE)
[1] 0.04242089
The last scalar is the delta-method standard error of the quantity I interpret as the shared background of the siblings in permanent income.
Thanks to @Ben Bolker who pointed me in this direction.

How to get value of group = 0 in linear mixed model

I have a very simple stat question probably.
So, I am fitting linear mixed models like this:
lme(dependent ~ Group + Sex + Age + npgs, data=boookclub, random = ~ 1| subject)
Group is a factor variable with levels 0, 1, 2, 3.
The dependent variables are continuous and standardized (mean 0), and the others are covariates: Sex is a factor with Male/Female levels, Age is numerical, and npgs is numerical, continuous and standardized as well.
When I get the table with beta, standard error, t and p values, I get this:
Value Std.Error DF t-value p-value
(Intercept) -0.04550502 0.02933385 187 -1.551280 0.0025
Group1 0.04219801 0.03536929 181 1.193069 0.2344
Group2 0.03350827 0.03705896 181 0.904188 0.3671
Group3 0.00192119 0.03012654 181 0.063771 0.9492
SexMale 0.03866387 0.05012901 181 0.771287 0.4415
Age -0.00011675 0.00148684 181 -0.078520 0.9375
npgs 0.15308844 0.01637163 181 9.350835 0.0000
SexMale:Age 0.00492966 0.00276117 181 1.785352 0.0759
My problem is: how do I get the beta for Group0? In this case the intercept corresponds to Group0, but also to the average of npgs, since npgs is standardized. How do I get the beta for Group0, and how can I check whether Group0 is significantly associated with the dependent variable? I'd like to see the effect of all Group levels.
Thanks
The easiest way to do what you want may be with the emmeans package, but you may also have some conceptual issues. Technical details first, then conceptual:
Technical
Fitting an example (this isn't necessarily statistically sensible, but I wanted an example with a categorical fixed effect)
library(nlme)
m1 <- lme(Yield~Variety, random = ~1|Block, data=Alfalfa)
As with your example, the effects are "intercept" (= mean of the baseline group, which is the "Cossack" variety in this case [by default, the alphabetically-first group]), "Ladak" (difference between Ladak and Cossack means) and "Ranger" (similarly). (As @Ben hints in the comments above, R automatically generates dummies for [most of] the levels of the categorical variables [factors] in your model.)
coef(summary(m1))
## Value Std.Error DF t-value p-value
## (Intercept) 1.57166667 0.11665326 64 13.4729767 2.373343e-20
## VarietyLadak 0.09458333 0.07900687 64 1.1971532 2.356624e-01
## VarietyRanger -0.01916667 0.07900687 64 -0.2425949 8.090950e-01
The emmeans package is a convenient way to see predicted values for each group without recoding.
library(emmeans)
emmeans(m1, spec = ~Variety)
## Variety emmean SE df lower.CL upper.CL
## Cossack 1.57 0.117 5 1.27 1.87
## Ladak 1.67 0.117 5 1.37 1.97
## Ranger 1.55 0.117 5 1.25 1.85
Conceptual
You can't "check if Group0 is significantly associated with the dependent [response] variable". You can only check whether the response variables differs significantly between two groups, or whether it differs significantly among all groups (e.g. the results of anova()). You have to pick a baseline. (If you insist, you can test all pairwise comparisons among groups; emmeans can help with this too.) If you "remove the intercept" (by fitting Variety ~Yield-1, or by looking at the results that emmeans produces) then the difference you are quantifying is the difference between the mean of a particular group and zero. This is usually not a meaningful question; in the example here, for instance, this would be testing whether a wheat variety gave a yield that was significantly greater than zero — probably not very interesting.
On the other hand, if you are just interested in estimating the expected value in each group (conditioning on the baseline values of the other variables in the model), along with the standard errors/CIs, then the answers you get from emmeans are perfectly sensible.
There's a related question here that explains why you get an NA value if you manually create dummies for every level of your factor ...
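If all pairwise comparisons among the groups are what you are after, here is a minimal sketch continuing the example above (pairs() applies a Tukey multiplicity adjustment by default):
library(emmeans)
# all pairwise differences between the Variety means, with adjusted p-values
pairs(emmeans(m1, spec = ~ Variety))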

manova () in R with between and within subject factors

I am using R 3.5.2 (stats package) to run a manova() with:
participant 1:20
gender as between subject factor
group as within subject factor
anxiety as dependent measure
BAC as dependent measures
The dataset follow:
treat4 = data.frame (
participant = rep(1:20,3),
gender = factor (rep(c(rep("male", 10), rep ("female", 10)),3)),
group = factor (c(rep("control",20), rep("run",20), rep("party",20))),
anxiety = round(c(rnorm(20, mean=55, sd=5),rnorm(20, mean=20, sd=5),rnorm(20, mean=75, sd=5))),
BAC = round(c(rep(0.01,20), rep(0.01,20), rnorm(20, mean= 0.09, sd=0.01)),2))
I apply the manova () function and summarize as follows:
mod = manova(cbind(anxiety,BAC) ~ gender + Error(group),data=treat4)
summary (mod)
This is what I get:
Error: group
Df Pillai approx F num Df den Df Pr(>F)
Residuals 2
Error: Within
Df Pillai approx F num Df den Df Pr(>F)
gender 1 0.013447 0.37482 2 55 0.6892
Residuals 56
There are a couple of issues:
1) Gender seems to be treated as a within-subjects factor
2) I don't get any statistics for the group factor
Any help?
If anxiety and BAC are your dependent variables, you place them on the left side of the tilde (~) with cbind() to indicate a multivariate response, and use Error() to specify the within-group effect (or random effect). The rest, on the right side of the tilde (~), are your between-group effects (or fixed effects):
manova(cbind(anxiety,BAC) ~ gender + Error(group),data=treat4)
Call:
manova(cbind(anxiety, BAC) ~ gender + Error(group), data = treat4)
Grand Means:
anxiety BAC
49.96666667 0.03766667
Stratum 1: group
Terms:
Residuals
anxiety 33156.63
BAC 0.09185333
Deg. of Freedom 2
Residual standard errors: 128.7568 0.2143051
Stratum 2: Within
Terms:
gender Residuals
anxiety 8.0667 1527.2333
BAC 0.0000 0.0034
Deg. of Freedom 1 56
Residual standard errors: 5.222262 0.007807201
Estimated effects are balanced
Thanks @StupidWolf for your answer.
However, when I apply summary() to the model:
summary(manova(cbind(anxiety,BAC) ~ gender + Error(group),data=treat4))
I get the following:
Error: group
Df Pillai approx F num Df den Df Pr(>F)
Residuals 2
Error: Within
Df Pillai approx F num Df den Df Pr(>F)
gender 1 0.039097 1.1189 2 55 0.334
Residuals 56
There are a couple of issues:
1) Gender seems to be treated as a within-subjects factor
2) I don't get any statistics for the group factor
I know this comes a bit late but I faced the same issue and I think you can simply solve it this way:
summary(manova(cbind(anxiety,BAC) ~ gender + group + Error(factor(participant)),data=treat4))
Basically, you need to add group as an IV (by doing + group).
Then you use Error() to indicate how it should identify unique subjects: it needs to do this by the participant number, rather than by the group.
Don't forget to turn participant into a factor; otherwise it causes problems!
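Putting that together, a minimal sketch (just the call above, with participant converted to a factor in the data frame first):
treat4$participant <- factor(treat4$participant)  # unique subjects as a factor
mod2 <- manova(cbind(anxiety, BAC) ~ gender + group + Error(participant), data = treat4)
# gender (between subjects) should now be tested in the participant stratum,
# and group (within subjects) in the Within stratum
summary(mod2)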

ANOVA LMER Eta squared

I used the lme4 package (lmer) to run mixed models. When I use the anova() function to retrieve the ANOVA results, everything works. However, when I try to calculate the eta squared, I consistently get the error below. Any ideas?
Dyestuff is a dataset available with the lmerTest package. I use package ‘lme4’ version 1.1-21, package ‘lmerTest’ version 3.1-0 and package ‘sjstats’ version 0.17.7.
fm1 <- lmer(Yield ~ 1 + (1|Batch), Dyestuff)
am <- anova(fm1, test="F")
eta_sq(am, partial = FALSE, ci.lvl = NULL, n = 1000, method = c("dist", "quantile"))
Error: Result 2 is not a length 1 atomic vector
In addition: Warning message:
In tidy.anova(model) :
The following column names in ANOVA output were not recognized or transformed: NumDF, DenDF
tl;dr
it may be theoretically difficult to compute eta-squared for mixed models, see e.g. this CV question (it does suggest some ways of computing R^2 values for mixed models, which might satisfy your need for an effect size; see the sketch just after this list)
practically speaking, the proximal problem seems to be that internally the eta-squared computation in sjstats expects the anova() method to return a table containing a row corresponding to the residual variance, while anova() for lmerTest models (anova.lmerModLmerTest) returns a table with rows only for the fixed-effect terms (not the residual variance).
in any case you might expect to have trouble computing an eta-squared for a model with no non-trivial fixed effects (i.e. a fixed-effect intercept only) ...
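As one possible substitute effect size (my suggestion, not something sjstats computes here): the marginal/conditional R^2 of Nakagawa & Schielzeth, e.g. via the MuMIn package, sketched on the sleepstudy model that appears further down:
library(lme4)
library(MuMIn)  # assumption: one of several packages implementing this R^2
fm2 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
# R2m: variance explained by the fixed effects alone;
# R2c: variance explained by fixed plus random effects
r.squaredGLMM(fm2)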
This might be more appropriate for the sjstats issues list but I'll use this space to share what I've figured out so far.
Fitting an intercept-only model gives a similar error even if it's just an lm() fit (which ought to work if anything does):
fm0 <- lm(Yield ~ 1 , Dyestuff)
am0 <- anova(fm0, test="F")
eta_sq(am0)
Error: Result 2 must be a single double, not a double vector of length 0
Run rlang::last_error() to see where the error occurred.
However: fitting a non-trivial (more fixed effects than just the intercept) lmer(Test) model also fails:
fm2 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
am2 <- anova(fm2, test="F")
eta_sq(am2)
Error: Result 2 must be a single double, not a double vector of length 0
Run rlang::last_error() to see where the error occurred.
In addition: Warning message:
In tidy.anova(model) :
The following column names in ANOVA output were not recognized or transformed: NumDF, DenDF
(From what I can tell the warning message is actually harmless.)
The proximal cause of this problem seems to be that the internal sjstats:::aov_stat_summary() function returns a table with only a single row, for the SSQ/MSQ/etc. due to Days; it should also have a row for the residual SSQ/MSQ/etc.
sjstats:::aov_stat_summary(am2)
## term sumsq meansq NumDF DenDF statistic p.value
## 1 Days 30030.94 30030.94 1 16.99998 45.85296 3.263825e-06
The problem is that the number of terms is internally computed as (nrow(aov.sum)-1), which doesn't make sense here.
Compare this with what we get with a 1+Days model using lm():
fm3 <- lm(Reaction ~ Days , sleepstudy)
am3 <- anova(fm3, test="F")
sjstats:::aov_stat_summary(am3)
## term df sumsq meansq statistic p.value
## 1 Days 1 162702.7 162702.652 71.46442 9.894096e-15
## 2 Residuals 178 405251.6 2276.694 NA NA
Digging a little deeper, we can see that this is a direct consequence of the way the anova() results are reported for mixed models:
anova(fm2)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Days 30031 30031 1 17 45.853 3.264e-06 ***
Note there is no "residuals" row. In contrast:
anova(fm3)
## Analysis of Variance Table
## Response: Reaction
## Df Sum Sq Mean Sq F value Pr(>F)
## Days 1 162703 162703 71.464 9.894e-15 ***
## Residuals 178 405252 2277
I think that if you use the function anova_stats() from the sjstats package, it works:
fm2 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
am2 <- anova_stats(fm2, test="F")
