Producing anova from already summarized data - r

I have a table that looks like this:
I'm trying to run aov() on the above table, but I'm only able to create a partial output. I'm not sure how to include the standard deviation in the calculation.
Right now I'm concatenating and repeating each group like so:
groups <- c(rep('LHS', 121), rep('HS', 546), rep('Jr', 97), rep('Bachelors', 253), rep('Graduate', 155))
And then doing the same for the means (since I don't have access to the original data sheet):
means <- c(rep(38.67, 121), rep(39.6, 546), rep(41.39, 97), rep(42.55, 253), rep(40.85, 155))
At this point I can create a data frame and then run aov on it:
df <- data.frame(groups, means)
groups.aov <- aov(means ~ groups, data = df)
Unfortunately summary(groups.aov) only gives me a partial result.
Df Sum Sq Mean Sq F value Pr(>F)
groups 4 2004 501 4.247e+27 <2e-16 ***
Residuals 1167 0 0
Any other way I can go, where I can factor in the SD?

We simulate some data so that we know the calculations are correct:
set.seed(100)
df = data.frame(
groups=rep(letters[1:4],times=seq(20,35,by=5)),
value=rnorm(110,rep(1:4,times=seq(20,35,by=5)),1))
We get back something like the table you see above:
library(dplyr)
res <- df %>% group_by(groups) %>% summarize_all(c(mean=mean,sd=sd,n=length))
total <- data.frame(groups="total",mean=mean(df$value),sd=sd(df$value),n=nrow(df))
rbind(res,total)
# A tibble: 5 x 4
groups mean sd n
<fct> <dbl> <dbl> <int>
1 a 0.937 1.14 20
2 b 1.91 0.851 25
3 c 3.01 0.780 30
4 d 4.01 0.741 35
5 total 2.70 1.42 110
ANOVA always works with sums of squares. To get from a standard deviation back to a sum of squares, square it (giving the variance) and multiply by n-1; from the sums of squares you can derive the F value. The detailed calculations:
# number of groups
ngroups = nrow(res)
# total sum of squares
SST = (total$sd^2)*(total$n-1)
#error within groups
SSE = sum((res$sd^2)*(res$n-1))
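aovtable = data.frame(
# residual Df is N - k (total n minus number of groups), not N - k - 1
Df = c(ngroups-1,total$n-ngroups),
SumSq = c(SST-SSE,SSE)
)
aovtable$MeanSq = aovtable$SumSq / aovtable$Df
aovtable$F = c(aovtable$MeanSq[1]/aovtable$MeanSq[2],NA)
aovtable$p = c(pf(aovtable$F[1],aovtable$Df[1],aovtable$Df[2],lower.tail=FALSE),NA)
And we can compare the two results. Note that the table printed below was generated with a residual Df of 105 (one too few), which is why its F value of 62.63 differs slightly from the 63.23 reported by aov(). With a residual Df of 106, the residual mean square is 78.55147/106 and F comes out at about 63.23, matching aov():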
aovtable
Df SumSq MeanSq F p
1 3 140.55970 46.8532330 62.62887 2.705082e-23
2 105 78.55147 0.7481092 NA NA
summary(aov(value~groups,data=df))
Df Sum Sq Mean Sq F value Pr(>F)
groups 3 140.56 46.85 63.23 <2e-16 ***
Residuals 106 78.55 0.74
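If the SD of the pooled data is not available (as in the question, where only per-group means, SDs and sizes are given), the between-group sum of squares can be computed directly from the group means instead of as SST - SSE. The following is a minimal sketch under that assumption; the function name anova_from_summary is made up for illustration, and the inputs are the rounded values from the simulated table above, so the result only approximately matches aov().

```r
# Hypothetical helper: one-way ANOVA table from group means, SDs and sizes.
anova_from_summary <- function(means, sds, ns) {
  k <- length(means)                     # number of groups
  N <- sum(ns)                           # total sample size
  grand <- sum(ns * means) / N           # weighted grand mean
  ss_between <- sum(ns * (means - grand)^2)
  ss_within  <- sum((ns - 1) * sds^2)    # sd^2 * (n - 1), summed over groups
  df <- c(k - 1, N - k)
  ms <- c(ss_between, ss_within) / df
  f  <- ms[1] / ms[2]
  p  <- pf(f, df[1], df[2], lower.tail = FALSE)
  data.frame(Df = df, SumSq = c(ss_between, ss_within), MeanSq = ms,
             F = c(f, NA), p = c(p, NA),
             row.names = c("groups", "Residuals"))
}

# Rounded summary values of groups a-d from the simulated example
tab <- anova_from_summary(means = c(0.937, 1.91, 3.01, 4.01),
                          sds   = c(1.14, 0.851, 0.780, 0.741),
                          ns    = c(20, 25, 30, 35))
print(tab)
```

For the data in the question, the same call would take the five group means and sizes already listed there, plus the group SDs from the original table.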

Related

R: Compute Cohen's d based on t-statistic of a coefficient in multiple linear regression

I'm looking at age- and sex-adjusted group differences in a continuous variable of interest. As done in other studies in my field, I want to calculate Cohen's d based on contrasts extracted from a multiple linear regression model.
The original formula (Nakagawa & Cuthill, 2007) is as follows:
d = t * (n1 + n2) / (sqrt(n1 * n2) * sqrt(df'))
where:
n1 = sample size in Group 1
n2 = sample size in Group 2
df' = degrees of freedom used for a corresponding t value in a linear model
t = t-statistic corresponding to the contrast of interest
So far I've attempted to apply this in R, but the results are looking strange (much larger effect sizes than expected).
Here's some simulated data:
library(broom)
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006,1007, 1008, 1009, 1010),
Group = as.numeric(c('0','1','0','0','1','1','0','1','0','1')),
age = as.numeric(c('23','28','30','15','7','18','29','27','14','22')),
sex = as.numeric(c('1','0','1','0','0','1','1','0','0','1')),
test_score = as.numeric(c('18','20','19','15','20','23','19','25','10','14')))
# run lm and extract regression coefficients
model <- lm(test_score ~ Group + age + sex, data = df)
tidy_model <- tidy(model)
tidy_model
# A tibble: 4 x 5
#term estimate std.error statistic p.value
#<chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 11.1 4.41 2.52 0.0451
# 2 Group 4.63 2.65 1.75 0.131
# 3 age 0.225 0.198 1.13 0.300
# 4 sex 0.131 2.91 0.0452 0.965
t_statistic <- tidy_model[2,4] # = 1.76
n <- 5 #(equal n of participants in Group1 as in Group2)
cohens_d <- t_statistic*(n + n)/(sqrt(n * n) * sqrt(1)) # 1 dof for 1 estimated parameter (group contrast)
cohens_d # = 3.518096
Could you please flag up where I'm going wrong?
You have set the degrees of freedom to 1. However, the model actually has 6 residual degrees of freedom (10 observations minus 4 estimated coefficients), which you can see in the output of summary(model).
With df' = 6, Cohen's d becomes 3.518096/sqrt(6), or about 1.44, which should be more in line with what you expect.
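To avoid hard-coding the degrees of freedom, they can be read from the fitted model. This is a small sketch using only base R and the data from the question; the variable names t_stat and df_resid are chosen here for illustration.

```r
# Data copied from the question
df <- data.frame(
  Group      = c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1),
  age        = c(23, 28, 30, 15, 7, 18, 29, 27, 14, 22),
  sex        = c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  test_score = c(18, 20, 19, 15, 20, 23, 19, 25, 10, 14)
)

model <- lm(test_score ~ Group + age + sex, data = df)

# t-statistic of the group contrast and residual degrees of freedom
t_stat   <- summary(model)$coefficients["Group", "t value"]
df_resid <- df.residual(model)

# group sizes taken from the data rather than typed in
n1 <- sum(df$Group == 0)
n2 <- sum(df$Group == 1)

cohens_d <- t_stat * (n1 + n2) / (sqrt(n1 * n2) * sqrt(df_resid))
cohens_d
```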

What is wrong with my syntax in lme4::lmer() for a split-split plot design with unbalanced repeated measures?

I am trying to use the lme4 package in R and function lmer() to fit a model for my split-split plot design. I would have used a repeated measures ANOVA if I did not have a small number of observations missing, but the missing data should be no problem with a linear mixed effects model.
My data frame (data) has a simple structure with four factors and a numeric outcome variable called all_vai. Note that in this example data frame, not all levels of all factors are crossed even though they would be in my real data (except for the missing observations). It shouldn't matter for my question, which is an attempt to fix problematic syntax.
collected_vai <- rnorm(125, mean = 6, sd = 1)
missing <- rep(NA, times = 3)
all_vai <- c(collected_vai, missing)
year1 <- rep(2018, times = 32)
year2 <- rep(2019, times = 32)
year3 <- rep(2020, times = 32)
year4 <- rep(2021, times = 32)
year <- c(year1, year2, year3, year4)
disturbance_severity <- rep(c(0,45,65,85), each = 32)
treatment <- rep(c("B" , "T"), each = 64)
replicate <- rep(c("A", "B", "C", "D"), each = 32)
data = data.frame(all_vai, year, disturbance_severity, treatment, replicate)
data$year <- as.factor(data$year)
data$disturbance_severity <- as.factor(data$disturbance_severity)
data$treatment <- as.factor(data$treatment)
data$replicate <- as.factor(data$replicate)
Here is the model I ran for an identical data set with a different (normally distributed) numeric outcome and no missing observations -- i.e., this is the model I would be running if I didn't have unbalanced repeated measures now due to missing data:
VAImodel1 <- aov(all_vai ~ disturbance_severity*treatment*year + Error(replicate/disturbance_severity/treatment/year), data = data)
summary(VAImodel1)
When I run this, I get the error message: "Warning message:
In aov(mean_vai ~ disturbance_severity * treatment * Year + Error(Replicate/disturbance_severity/treatment/Year), :
Error() model is singular"
I have observations from different years nested within different treatments, which are nested within different disturbance severities, and all of this nested within replicates (which are experimental blocks). So I tried using this structure in lme4:
library(lme4)
library(lmerTest)
VAImodel2 <- lmer(all_vai ~ (year|replicate:disturbance_severity:treatment) + disturbance_severity*treatment*year, data = data)
summary(VAImodel2)
And this is the error message I get: "Error: number of observations (=125) <= number of random effects (=128) for term (Year | Replicate:disturbance_severity:treatment); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable"
Next I tried simplifying my model so that I was not running out of degrees of freedom, by removing the treatment variable and interaction term, like so:
VAImodel3 <- lmer(all_vai ~ (year|replicate:disturbance_severity) + disturbance_severity*year, data = data)
summary(VAImodel3)
This time I get a different error: "boundary (singular) fit: see ?isSingular
Warning message:
Model failed to converge with 1 negative eigenvalue: -1.2e-01 "
Thank you in advance for any help.
The problem lies in how the example data were prepared.
Let's start by defining the allowed values for your variables year, disturbance_severity, treatment and replicate.
library(tidyverse)
set.seed(123)
yars = 2018:2021
disturbances = c(0,45,65,85)
treatments = c("B" , "T")
replicates = c("A", "B", "C", "D")
n = length(yars)*length(disturbances)*length(treatments)*length(replicates)*1
nNA=3
Please note that I first created the variables yars, disturbances, treatments and replicates with all the allowed values.
Then I calculated the amount of data in n (you can increase the last value in the multiplication from 1 e.g. to 10) and determined how many values will be missing in the variable nNA.
The key aspect is the use of the function expand.grid(yars, disturbances, treatments, replicates) which will return the appropriate table with the correct distribution of values.
Look at the first few lines of what expand.grid returns.
Var1 Var2 Var3 Var4
1 2018 0 B A
2 2019 0 B A
3 2020 0 B A
4 2021 0 B A
5 2018 45 B A
6 2019 45 B A
7 2020 45 B A
8 2021 45 B A
9 2018 65 B A
10 2019 65 B A
11 2020 65 B A
12 2021 65 B A
13 2018 85 B A
14 2019 85 B A
15 2020 85 B A
16 2021 85 B A
17 2018 0 T A
18 2019 0 T A
This layout, in which every combination of factor levels is present, is the crucial part.
The next step is straightforward: we build a tibble from it and pass it to the aov function.
data = tibble(sample(c(rnorm(n-nNA, mean = 6, sd = 1), rep(NA, nNA)), n)) %>%
mutate(expand.grid(yars, disturbances, treatments, replicates)) %>%
rename_with(~c("all_vai", "year", "disturbance_severity", "treatment", "replicate"))
VAImodel1 <- aov(all_vai ~ disturbance_severity*treatment*year +
Error(replicate/disturbance_severity/treatment/year), data = data)
summary(VAImodel1)
output
Error: replicate
Df Sum Sq Mean Sq F value Pr(>F)
disturbance_severity 1 0.1341 0.1341 0.093 0.811
treatment 1 0.0384 0.0384 0.027 0.897
Residuals 1 1.4410 1.4410
Error: replicate:disturbance_severity
Df Sum Sq Mean Sq F value Pr(>F)
disturbance_severity 1 0.1391 0.1391 0.152 0.763
treatment 1 0.1819 0.1819 0.199 0.733
year 1 1.4106 1.4106 1.545 0.431
Residuals 1 0.9129 0.9129
Error: replicate:disturbance_severity:treatment
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 0.4647 0.4647 0.698 0.491
year 1 0.8127 0.8127 1.221 0.384
Residuals 2 1.3311 0.6655
Error: replicate:disturbance_severity:treatment:year
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 2.885 2.8846 3.001 0.144
year 1 0.373 0.3734 0.388 0.560
treatment:year 1 0.002 0.0015 0.002 0.970
Residuals 5 4.806 0.9612
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 0.03 0.031 0.039 0.8430
year 1 1.29 1.292 1.662 0.2002
treatment:year 1 4.30 4.299 5.532 0.0206 *
Residuals 102 79.26 0.777
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now there is no "Error() model is singular" warning.
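For the lmer() part of the question, one commonly used way to write a split-split plot with blocks is to give each level of the nesting its own random intercept and leave year as the lowest-level fixed factor. The sketch below is an assumption about what the design calls for, not a confirmed fix for the real data: it rebuilds the fully crossed example with expand.grid(), converts the design columns to factors as in the question, and drops three observations at random. Because the outcome is pure noise, lme4 may report a singular fit here.

```r
library(lme4)

set.seed(123)
# Fully crossed design: 4 years x 4 severities x 2 treatments x 4 replicates
design <- expand.grid(
  year                 = factor(2018:2021),
  disturbance_severity = factor(c(0, 45, 65, 85)),
  treatment            = factor(c("B", "T")),
  replicate            = factor(c("A", "B", "C", "D"))
)

# Simulated outcome with 3 values missing at random positions
design$all_vai <- sample(c(rnorm(nrow(design) - 3, mean = 6, sd = 1),
                           rep(NA, 3)))

# Random intercepts for block, whole plot and split plot;
# the year-level variation is left to the residual
fit <- lmer(
  all_vai ~ disturbance_severity * treatment * year +
    (1 | replicate/disturbance_severity/treatment),
  data = design
)
summary(fit)
```

Rows with a missing outcome are dropped by default, so the unbalanced data do not need special handling in the call itself.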

How to calculate the Intraclass correlation (ICC) in R?

I have a dataset that is in a long format with 200 variables, 94 subjects, and each subject has anywhere from 1 to 3 measurements for each variable.
Eg:
ID measurement var1 var2 . . .
1 1 2 6
1 2 3 8
1 3 6 12
2 1 3 9
2 2 4 4
2 3 5 3
3 1 1 11
3 2 1 4
. . . .
. . . .
. . . .
However, some variables have missing values for one of three measurements. It was suggested to me that before imputing missing values with the mean for the subject, I should use a repeated measures ANOVA or mixed model in order to confirm the repeatability of measurements.
The first thing I found to calculate the ICC was the ICC() function from the psych package. However, from what I can tell this requires that the data have one row per subject and one column per measurement, which would be further complicated by the fact that I have 200 variables I need to calculate the ICC for individually. I did go ahead and calculate the ICC for a single variable, and obtained this output:
Intraclass correlation coefficients
type ICC F df1 df2 p lower bound upper bound
Single_raters_absolute ICC1 0.38 2.8 93 188 0.00000000067 0.27 0.49
Single_random_raters ICC2 0.38 2.8 93 186 0.00000000068 0.27 0.49
Single_fixed_raters ICC3 0.38 2.8 93 186 0.00000000068 0.27 0.49
Average_raters_absolute ICC1k 0.65 2.8 93 188 0.00000000067 0.53 0.74
Average_random_raters ICC2k 0.65 2.8 93 186 0.00000000068 0.53 0.74
Average_fixed_raters ICC3k 0.65 2.8 93 186 0.00000000068 0.53 0.74
Number of subjects = 94 Number of Judges = 3
Next, I tried to calculate the ICC using a mixed model. Using this code:
m1 <- lme(var1 ~ measurement, random=~1|ID, data=mydata, na.action=na.omit)
summary(m1)
The output looks like this:
Linear mixed-effects model fit by REML
Data: mydata
AIC BIC logLik
-1917.113 -1902.948 962.5564
Random effects:
Formula: ~1 | ORIGINAL_ID
(Intercept) Residual
StdDev: 0.003568426 0.004550419
Fixed effects: var1 ~ measurement
Value Std.Error DF t-value p-value
(Intercept) 0.003998953 0.0008388997 162 4.766902 0.0000
measurement 0.000473053 0.0003593452 162 1.316429 0.1899
Correlation:
(Intr)
measurement -0.83
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.35050264 -0.30417725 -0.03383329 0.25106803 12.15267443
Number of Observations: 257
Number of Groups: 94
Is this the correct model to use to assess ICC? It is not clear to me what the correlation (Intr) is measuring, and it is different from the ICC obtained using ICC().
This is my first time calculating and using intraclass correlation, so any help is appreciated!
Using a mock dataset...
set.seed(42)
n <- 6
dat <- data.frame(id=rep(1:n, 2),
group= as.factor(rep(LETTERS[1:2], n)),
V1 = rnorm(n*2), # one draw per row; rnorm(n) would be recycled and repeat each subject's value
V2 = runif(n*2, min=0, max=100),
V3 = runif(n*2, min=0, max=100),
V4 = runif(n*2, min=0, max=100),
V5 = runif(n*2, min=0, max=100))
Loading some libraries...
library(lme4)
library(purrr)
library(tidyr)
# Add list of variable names to the vector below...
var_list <- c("V1","V2","V3","V4","V5")
map_dfr() is from the purrr library. I use lme4::VarCorr() to get the variances at each level.
map_dfr(var_list,
function(x){
formula_mlm = as.formula(paste0(x,"~ group + (1|id)"));
model_fit = lmer(formula_mlm,data=dat);
re_variances = VarCorr(model_fit,comp="Variance") %>%
data.frame() %>%
dplyr::mutate(variable = x);
return(re_variances)
}) %>%
dplyr::select(variable,grp,vcov) %>%
pivot_wider(names_from="grp",values_from="vcov") %>%
dplyr::mutate(icc = id/(id+Residual))
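The same ratio can be computed by hand for the single lme fit shown in the question, using the two standard deviations printed under "Random effects". The "Correlation: (Intr)" entry in that output is the correlation between the fixed-effect estimates, not an ICC. A short sketch, with the numbers copied from the question's output:

```r
# Standard deviations reported by summary(m1) in the question
sd_id    <- 0.003568426   # between-subject (random intercept)
sd_resid <- 0.004550419   # within-subject (residual)

# ICC = between-subject variance / total variance
icc <- sd_id^2 / (sd_id^2 + sd_resid^2)
round(icc, 2)
```

This gives roughly 0.38, in line with the single-rater values reported by psych::ICC() in the question.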

Wrong degrees of freedom in lsmeans and SE calculation in R

I have this sample data:
Sample Replication Days
1 1 10
1 1 14
1 1 13
1 1 14
2 1 NA
2 1 5
2 1 18
2 1 20
1 2 16
1 2 NA
1 2 18
1 2 21
2 2 15
2 2 7
2 2 12
2 2 14
I have four observations for each sample, with a total of 64 samples in each of the two replications. In total, I have 512 values across both replications. I also have some missing values designated as 'NA'. I performed an ANOVA on the mean values for each Sample in each Rep, which I generated using
library(tidyverse)
df <- Data %>% group_by(Sample, Rep) %>% summarise(Mean = mean(Days, na.rm = TRUE))
curve.anova <- aov(Mean~Rep+Sample, data=df)
Result of anova is:
> summary(curve.anova)
Df Sum Sq Mean Sq F value Pr(>F)
Rep 1 6.1 6.071 2.951 0.0915 .
Sample 63 1760.5 27.945 13.585 <2e-16 ***
Residuals 54 111.1 2.057
I created a table for mean and SE values,
ANOVA<-lsmeans(curve.anova, ~Sample)
ANOVA<-summary(ANOVA)
write.csv(ANOVA, file="Desktop/ANOVA.csv")
A few lines from file are:
Sample lsmean SE df lower.CL upper.CL
1 24.875 1.014145417 54 22.84176086 26.90823914
2 25.5 1.014145417 54 23.46676086 27.53323914
3 31.32575758 1.440722628 54 28.43728262 34.21423253
4 26.375 1.014145417 54 24.34176086 28.40823914
5 26.42424242 1.440722628 54 23.53576747 29.31271738
6 25.5 1.014145417 54 23.46676086 27.53323914
7 28.375 1.014145417 54 26.34176086 30.40823914
8 24.875 1.014145417 54 22.84176086 26.90823914
9 21.16666667 1.014145417 54 19.13342752 23.19990581
10 23.875 1.014145417 54 21.84176086 25.90823914
The df for all 64 samples is 54, and the error bars in the ggplot are mostly equal for all the Samples. The SE values are larger than the manually calculated values. Based on the anova results, df=54 is the residual df.
I want to double-check that the ANOVA results are correct and that I am correctly generating the lsmeans and SE, so that I can plot a bar graph in ggplot with confidence interval error bars.
I will appreciate any help. Thank you!
After reading your comments, I think your workflow has an issue. Basically, you are applying the ANOVA to the means of the different samples rather than to the raw observations.
So, in your example, when you run:
curve.anova <- aov(Mean~Rep+Sample, data=df)
You are comparing these values:
> df
# A tibble: 4 x 3
# Groups: Sample [2]
Sample Replication Mean
<dbl> <dbl> <dbl>
1 1 1 12.8
2 1 2 18.3
3 2 1 14.3
4 2 2 12
So, basically, you are comparing two groups with two values per group.
So, when you tried to remove the Replication group, you got an error because the output of:
df = Data %>% group_by(Sample) %>% summarise(Mean = mean(Days, na.rm = TRUE))
is now:
# A tibble: 2 x 2
Sample Mean
<dbl> <dbl>
1 1 15.1
2 2 13
So, applying an ANOVA to that dataset means comparing two groups with one value each, which leaves no residual degrees of freedom, so residuals and SE cannot be computed.
Instead, you should do it on the full dataset without trying to calculate the mean first:
anova_data <- aov(Days~Sample+Replication, data=Data)
anova_data2 <- aov(Days~Sample, data=Data)
And their outputs are:
> summary(anova_data)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 16.07 16.071 0.713 0.416
Replication 1 9.05 9.054 0.402 0.539
Residuals 11 247.80 22.528
2 observations deleted due to missingness
> summary(anova_data2)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 16.07 16.07 0.751 0.403
Residuals 12 256.86 21.41
2 observations deleted due to missingness
Now you can apply lsmeans (from the lsmeans package, which has since been superseded by emmeans):
A_d = summary(lsmeans(anova_data, ~Sample))
A_d2 = summary(lsmeans(anova_data2, ~Sample))
> A_d
Sample lsmean SE df lower.CL upper.CL
1 15.3 1.8 11 11.29 19.2
2 12.9 1.8 11 8.91 16.9
Results are averaged over the levels of: Replication
Confidence level used: 0.95
> A_d2
Sample lsmean SE df lower.CL upper.CL
1 15.1 1.75 12 11.33 19.0
2 13.0 1.75 12 9.19 16.8
Confidence level used: 0.95
Dropping Replication does not change the means and SE much (which is good, because it means that your replicates are consistent and there is not much variability between them), but it narrows the confidence interval.
So, to plot it, you can use:
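The reason the SE is identical for both samples is that lsmeans uses the pooled residual variance of the model, not each group's own standard deviation. For the one-way model this can be checked by hand in base R. This is a sketch on the small example data, with Sample stored as a factor; the names mse, n_per_sample and ci_half are illustrative.

```r
# Example data from this answer, with Sample as a factor
Data <- data.frame(
  Sample = factor(rep(c(1, 2, 1, 2), each = 4)),
  Days   = c(10, 14, 13, 14, NA, 5, 18, 20, 16, NA, 18, 21, 15, 7, 12, 14)
)

fit <- aov(Days ~ Sample, data = Data)

# Pooled residual mean square and number of non-missing values per sample
mse          <- deviance(fit) / df.residual(fit)
n_per_sample <- tapply(!is.na(Data$Days), Data$Sample, sum)

# SE of each sample mean and half-width of the 95% confidence interval
se      <- sqrt(mse / n_per_sample)
ci_half <- qt(0.975, df.residual(fit)) * se
round(se, 2)
```

Both SEs come out at about 1.75, matching A_d2 above. For error bars that show confidence intervals rather than SE, the lower.CL and upper.CL columns of the lsmeans summary can be used as ymin and ymax.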
library(ggplot2)
ggplot(A_d, aes(x=as.factor(Sample), y=lsmean)) +
geom_bar(stat="identity", colour="black") +
geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)
Based on your initial question, if you want to check that the output of the ANOVA is correct, you can simulate data with known group means like this (Sample is stored as a factor so that lsmeans returns one row per sample):
d2 <- data.frame(Sample = factor(c(rep(1,10), rep(2,10))),
Days = c(rnorm(10, mean =3), rnorm(10, mean = 8)))
Then,
curve.d2 <- aov(Days ~ Sample, data = d2)
ANOVA2 <- lsmeans(curve.d2, ~Sample)
ANOVA2 <- summary(ANOVA2)
And you get the following output:
> summary(curve.d2)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 139.32 139.32 167.7 1.47e-10 ***
Residuals 18 14.96 0.83
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> ANOVA2
Sample lsmean SE df lower.CL upper.CL
1 2.62 0.288 18 2.02 3.23
2 7.90 0.288 18 7.29 8.51
Confidence level used: 0.95
And for the plot
ggplot(ANOVA2, aes(x=as.factor(Sample), y=lsmean)) +
geom_bar(stat="identity", colour="black") +
geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)
As you can see, the lsmeans for d2 are close to the 3 and 8 that we set in the first place. So, I think your output is correct. Your data may simply not show any significant differences, and the SE values are the same because they are all derived from the same pooled residual variance.
I hope this answer helps you.
Data
Data = data.frame(Sample = factor(c(rep(1,4), rep(2,4),rep(1,4), rep(2,4))),
Replication = factor(c(rep(1,8), rep(2,8))),
Days = c(10,14,13,14,NA,5,18,20,16,NA,18,21,15,7,12,14))

Output t.test results to a data frame in R

I have a data frame of values from individuals linked to groups. I want to identify those groups who have mean values greater than the mean value plus one standard deviation for the whole data set. To do this, I'm calculating the mean value and standard deviation for the entire data frame and then running pairwise t-tests to compare to each group mean. I'm running into trouble outputting the results.
> head(df)
individual group value
1 11559638 75 0.371
2 11559641 75 0.367
3 11559648 75 0.410
4 11559650 75 0.417
5 11559652 75 0.440
6 11559654 75 0.395
> allvalues <- data.frame(mean=rep(mean(df$value), length(df$individual)), sd=rep(sd(df$value), length(df$individual)))
> valueplus <- with(df, by(df, df$individual, function(x) t.test(allvalues$mean + allvalues$sd, df$value, data=x)))
> tmpplus
--------------------------------------------------------------------------
df$individuals: 10
NULL
--------------------------------------------------------------------------
df$individuals: 20
NULL
--------------------------------------------------------------------------
df$individuals: 21
Welch Two Sample t-test
data: allvalues$mean + allvalues$sd and df$value
t = 84.5217, df = 4999, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.04676957 0.04899068
sample estimates:
mean of x mean of y
0.4719964 0.4241162
How do I get the results into a data frame? I'd expect the output to look something like this:
groups t df p-value mean.x mean.y
1 10 NULL NULL NULL NULL NULL
2 20 NULL NULL NULL NULL NULL
3 21 84.5217 4999 2.2e-16 0.4719964 0.4241162
From a purely programming perspective, you are asking how to get the output of t.test into a data.frame. Try the following, using mtcars:
library(broom)
tidy(t.test(mtcars$mpg))
estimate statistic p.value parameter conf.low conf.high
1 20.09062 18.85693 1.526151e-18 31 17.91768 22.26357
Or for multiple groups:
library(dplyr)
mtcars %>% group_by(vs) %>% do(tidy(t.test(.$mpg)))
# A tibble: 2 x 9
# Groups: vs [2]
vs estimate statistic p.value parameter conf.low conf.high method alternative
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 0 16.6 18.3 1.32e-12 17 14.7 18.5 One Sample t-test two.sided
2 1 24.6 17.1 2.75e-10 13 21.5 27.7 One Sample t-test two.sided
Needless to say, you'll need to adjust the code to fit your specific setting.
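If you would rather not depend on broom, the same table can be assembled in base R by pulling the pieces out of each htest object. The sketch below is only an illustration on mtcars: it assumes the comparison of interest is a one-sample test of each group's mean against the overall mean plus one standard deviation, and the column names are made up for this example.

```r
# Threshold: overall mean plus one standard deviation
threshold <- mean(mtcars$mpg) + sd(mtcars$mpg)

# One-sample t-test per group, collected as one-row data frames
rows <- lapply(split(mtcars$mpg, mtcars$cyl), function(x) {
  tt <- t.test(x, mu = threshold, alternative = "greater")
  data.frame(t       = unname(tt$statistic),
             df      = unname(tt$parameter),
             p.value = tt$p.value,
             mean    = unname(tt$estimate))
})

# Bind the rows and add the group label as the first column
res <- cbind(group = names(rows), do.call(rbind, rows))
res
```

Groups with fewer than two observations would make t.test() fail, so with real data they need to be filtered out or handled separately first.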
