For a research project, I need to run an ANOVA to test whether the differences between some treatments are statistically significant.
The experiment consisted of inoculating bacteria into tubes containing different treatments at different concentrations.
My dependent variable is the optical density at 660 nm (OD660) measured on a spectrophotometer; I measured the OD at 13 time points.
Here is the full dataset; it is not very big:
od34_stat1 <- data.frame(
OD = c(0.032667,0.09,0.157,0.184,0.345667,
0.4445,0.47725,0.53925,0.74,0.750667,0.859167,0.880333,
0.8275,0.034667,0.0935,0.146,0.1725,0.522167,0.5865,0.71075,
0.69875,0.927,0.929667,1.063167,1.037333,0.973,0.031167,
0.1045,0.139,0.1665,0.425667,0.523,0.69875,0.80575,
1.0435,0.994667,1.085667,1.215333,1.1145,0.034667,0.1085,
0.1285,0.1645,0.349667,0.474,0.74075,0.78125,1.0815,
0.937167,1.045667,1.104333,0.9555,0.028167,0.065,0.13,0.1715,
0.331667,0.4015,0.45775,0.54425,0.811,0.739167,0.797167,
0.773333,0.6905,0.021167,0.0835,0.131,0.1585,0.279167,
0.384,0.40225,0.46975,0.646,0.625667,0.684667,0.701333,
0.5885,0.015667,0.0655,0.086,0.12,0.191667,0.261,0.29875,
0.35825,0.446,0.411167,0.364667,0.369333,0.31),
Treatment = as.factor(c("0_CNTRL","0_CNTRL",
"0_CNTRL","0_CNTRL","0_CNTRL","0_CNTRL","0_CNTRL",
"0_CNTRL","0_CNTRL","0_CNTRL","0_CNTRL",
"0_CNTRL","0_CNTRL","10_TOX","10_TOX","10_TOX","10_TOX",
"10_TOX","10_TOX","10_TOX","10_TOX","10_TOX",
"10_TOX","10_TOX","10_TOX","10_TOX","25_TOX",
"25_TOX","25_TOX","25_TOX","25_TOX","25_TOX","25_TOX",
"25_TOX","25_TOX","25_TOX","25_TOX","25_TOX",
"25_TOX","50_TOX","50_TOX","50_TOX","50_TOX",
"50_TOX","50_TOX","50_TOX","50_TOX","50_TOX",
"50_TOX","50_TOX","50_TOX","50_TOX","10_CNTRL",
"10_CNTRL","10_CNTRL","10_CNTRL","10_CNTRL","10_CNTRL",
"10_CNTRL","10_CNTRL","10_CNTRL","10_CNTRL",
"10_CNTRL","10_CNTRL","10_CNTRL","25_CNTRL","25_CNTRL",
"25_CNTRL","25_CNTRL","25_CNTRL","25_CNTRL",
"25_CNTRL","25_CNTRL","25_CNTRL","25_CNTRL",
"25_CNTRL","25_CNTRL","25_CNTRL","50_CNTRL","50_CNTRL",
"50_CNTRL","50_CNTRL","50_CNTRL","50_CNTRL",
"50_CNTRL","50_CNTRL","50_CNTRL","50_CNTRL","50_CNTRL",
"50_CNTRL","50_CNTRL")),
Time = as.factor(c("0","2","4","6",
"70","94","478","496","568","616","736","784",
"808","0","2","4","6","70","94","478","496",
"568","616","736","784","808","0","2","4","6",
"70","94","478","496","568","616","736","784",
"808","0","2","4","6","70","94","478","496",
"568","616","736","784","808","0","2","4",
"6","70","94","478","496","568","616","736",
"784","808","0","2","4","6","70","94","478",
"496","568","616","736","784","808","0","2","4",
"6","70","94","478","496","568","616","736",
"784","808"))
)
So I tried a repeated measures ANOVA: since I measured the OD over time, I assume Time is my repeated-measures factor (?).
I need to see whether there are statistically significant differences between the treatment groups (e.g. is there a significant difference between 0_CNTRL and 25_TOX?).
Initially I found code that correctly performs the repeated measures ANOVA, but it reports the differences between time points: it tells me whether there is a difference between Time 4 and Time 6, etc. That is not the question I need answered, and above all the result is too scattered to interpret.
This is the original code (I followed this guide: https://www.datanovia.com/en/lessons/repeated-measures-anova-in-r/#one-way-repeated-measures-anova):
library(tidyverse)
library(ggpubr)
library(rstatix)
library(ggplot2)
##Factors
od34_stat1$Treatment <- as.factor(od34_stat1$Treatment)
od34_stat1$Time <- as.factor(od34_stat1$Time)
#Interactionplot - Boxplot
bxp34 <- ggboxplot(od34_stat1, x = "Time", y = "OD", add = "point")
bxp34
##Check assumptions: Outliers
od34_stat1 %>%
group_by(Time) %>%
identify_outliers(OD)
##Check assumptions: Normality
od34_stat1 %>%
group_by(Time) %>%
shapiro_test(OD)
#OR
ggqqplot(od34_stat1, "OD", facet.by = "Time")
#Computing One-Way repeated measure ANOVA
od34.aov <- anova_test(data = od34_stat1, dv = OD, wid = Treatment, within = Time)
get_anova_table(od34.aov)
# Pairwise comparisons
od34.pwc <- od34_stat1 %>%
pairwise_t_test(
OD ~ Time, paired = TRUE,
p.adjust.method = "bonferroni"
)
od34.pwc
##Creating Report
od34.pwc <- od34.pwc %>% add_xy_position(x = "Time")
bxp34 +
stat_pvalue_manual(od34.pwc) +
labs(
subtitle = get_test_label(od34.aov, detailed = TRUE),
caption = get_pwc_label(od34.pwc)
)
Okay, here is my problem: the output is now about the "Time" factor. However, the guide uses a dataset with only 3 measurement times for the dependent variable, while I measured 13 times. Moreover, I think the intent of the guide is precisely to look at differences over time, while mine is to look at the differences between the Treatments whose OD was measured over Time.
So what I thought, as an RStudio noob, was to change "Time" to "Treatment" in the code. This way the output is just what I need.
My concern is that after changing these factors the result is clear but may not make logical sense.
Revised code:
#Interactionplot - Boxplot
bxp34_1 <- ggboxplot(od34_stat1, x = "Treatment", y = "OD", add = "point")
bxp34_1
##Check assumptions: Outliers
od34_stat1 %>%
group_by(Time) %>%
identify_outliers(OD)
##Check assumptions: Normality
od34_stat1 %>%
group_by(Treatment) %>%
shapiro_test(OD)
#OR
ggqqplot(od34_stat1, "OD", facet.by = "Treatment")
#Computing One-Way repeated measure ANOVA
od34.aov_1 <- anova_test(data = od34_stat1, dv = OD, wid = Time, within = Treatment)
get_anova_table(od34.aov_1)
# Pairwise comparisons
od34.pwc_1 <- od34_stat1 %>%
pairwise_t_test(
OD ~ Treatment, paired = TRUE,
p.adjust.method = "bonferroni"
)
od34.pwc_1
##Creating Report
od34.pwc_1 <- od34.pwc_1 %>% add_xy_position(x = "Treatment")
bxp34_1 +
stat_pvalue_manual(od34.pwc_1) +
labs(
subtitle = get_test_label(od34.aov_1, detailed = TRUE),
caption = get_pwc_label(od34.pwc_1)
)
This way my graphical output (the boxplot annotated with od34.pwc_1) lets me explain the statistical significance of the differences between treatments.
I hope I have summarized my doubts clearly. What do you think? Is it right to do this?
And if it is not correct, what would you recommend for analyzing and visualizing the differences between these treatments?
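For reference, the swapped call (wid = Time, within = Treatment) roughly amounts to treating each time point as a block in which every treatment is observed once. Below is a minimal base-R sketch of that layout. It assumes a hypothetical toy data frame `toy` (3 made-up treatments, 4 time points, simulated OD values) rather than the real data, and it uses only `aov()`. It is meant to show the structure of the model, not to be a recommended analysis, and it ignores the ordering of the time points.

```r
# Hypothetical toy data (NOT the real measurements): 3 treatments x 4 time points,
# one OD value per Treatment-by-Time cell, mirroring the layout of the real design.
toy <- expand.grid(
  Time = factor(c(0, 2, 4, 6)),
  Treatment = factor(c("A", "B", "C"))
)
set.seed(1)
toy$OD <- 0.10 * as.integer(toy$Time) +
  0.05 * as.integer(toy$Treatment) +
  rnorm(nrow(toy), sd = 0.01)

# Every Treatment-by-Time combination occurs exactly once
cells <- table(toy$Treatment, toy$Time)
print(cells)

# Time plays the role of the "subject" (block); Treatment varies within it
fit <- aov(OD ~ Treatment + Error(Time), data = toy)
print(summary(fit))
```

The Treatment effect is tested in the "Within" stratum, that is, against the variation left after removing the differences between time points.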
I'm trying to perform a pretty straightforward clustering analysis but can't get the results right. My question for a large dataset is "Which diseases are frequently reported together?". The simplified data sample below should result in 2 clusters: 1) headache / dizziness 2) nausea / abd pain. However, I can't get the code right. I'm using the pam and daisy functions. For this example I manually assign 2 clusters (k=2) because I know the desired result, but in reality I explore several values for k.
Does anyone know what I'm doing wrong here?
library(cluster)
library(dplyr)
dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"))
gower_dist <- daisy(dat, metric = "gower")
k <- 2
pam_fit <- pam(gower_dist, diss = TRUE, k) # performs cluster analysis
pam_results <- dat %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
head(pam_results$the_summary)
The format in which you pass the dataset to the clustering algorithm does not suit your objective. If you want to group diseases that are reported together but also include the IDs in the dissimilarity matrix, the IDs contribute to the distances, which you do not want, since your objective concerns only the diseases.
Hence, we need to build a dataset in which each row is a patient with all the diseases he/she reported, and then construct the dissimilarity matrix only on the numeric features. For this, I add a column presence with value 1 if the disease is reported by the patient and 0 otherwise; the zeros are filled in by pivot_wider through its values_fill argument.
Here is the code I used. I think it does what you wanted; please tell me if it does.
library(cluster)
library(dplyr)
library(tidyr)
dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"),
presence = 1)
# build the wider dataset: each row is a patient
dat_wider <- pivot_wider(
dat,
id_cols = ID,
names_from = PTName,
values_from = presence,
values_fill = list(presence = 0)
)
# in the dissimilarity matrix construction, we leave out the column ID
gower_dist <- daisy(dat_wider %>% select(-ID), metric = "gower")
k <- 2
set.seed(123)
pam_fit <- pam(gower_dist, diss = TRUE, k)
pam_results <- dat_wider %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
head(pam_results$the_summary)
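Note that the code above groups patients. If the goal is to group the diseases themselves, one option is to cluster the columns instead of the rows. The following is a minimal sketch under that assumption and is not part of the code above: it builds the patient-by-disease matrix with base `table()` (the names `inc`, `disease_dist`, and `pam_diseases` are illustrative), transposes it so that each row is a disease, and runs `pam()` on the Jaccard distance from `dist(method = "binary")`.

```r
library(cluster)

dat <- data.frame(
  ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
  PTName = c("headache","dizziness","nausea","abd pain","dizziness",
             "headache","abd pain","nausea","headache","dizziness")
)

# Patient-by-disease matrix (rows: patients, columns: diseases).
# Each ID/disease pair appears at most once here, so the counts are 0/1.
inc <- unclass(table(dat$ID, dat$PTName))
storage.mode(inc) <- "double"

# Transpose so each row is a disease; two diseases are close when
# they are reported by the same patients
disease_dist <- dist(t(inc), method = "binary")

pam_diseases <- pam(disease_dist, k = 2, diss = TRUE)

# Cluster label per disease, named by disease
cl <- setNames(as.integer(pam_diseases$clustering), colnames(inc))
print(cl)
```

On this toy sample, headache and dizziness end up in one cluster and nausea and abd pain in the other.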
Furthermore, since you are working only with binary data, instead of Gower's distance you can consider the Simple Matching or Jaccard distance if they suit your data better. In R you can compute them with
p <- ncol(dat_wider) - 1 # number of binary variables (all columns except ID)
sm_dist <- dist(dat_wider %>% select(-ID), method = "manhattan")/p
j_dist <- dist(dat_wider %>% select(-ID), method = "binary")
respectively, where p is the number of binary variables you want to consider (here, every column except ID).
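As a small worked example of the two distances, assume a hypothetical 0/1 matrix `x` with three patients and four diseases (values made up for illustration). Simple matching counts mismatching positions over all p variables, while Jaccard ignores positions where both rows are 0.

```r
# Hypothetical binary data: rows are patients, columns are diseases
x <- rbind(
  a = c(1, 1, 0, 0),
  b = c(1, 0, 0, 0),
  c = c(0, 0, 1, 1)
)
p <- ncol(x)  # number of binary variables

# Simple matching distance: share of variables on which two rows differ
sm_dist <- dist(x, method = "manhattan") / p

# Jaccard distance: mismatches divided by positions where at least one row is 1
j_dist <- dist(x, method = "binary")

print(as.matrix(sm_dist))
print(as.matrix(j_dist))
```

Here a and b differ on one of four variables (simple matching 0.25), but only one of the two diseases reported by either of them is shared (Jaccard 0.5).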
I frequently want to create summary tables for studies where I compare several variables between two groups, listing values for each variable along with the difference between that variable for the two groups.
For example, say I want to compare age groups (young and old) and the proportion of males between two groups, A and B. I’d like to end up with a table with one row per variable (age, proportion of males) and the following columns, repeated for each group where applicable: numerator, denominator, rate, difference between the two rates, 95% CI, and p-value from a chi-square test.
I’m looking for a general approach to this type of table.
Let’s say I have the following table:
library(dplyr)
AgeGroup <- sample(c("Young", "Old"), 10, replace = TRUE)
Gender <- sample(c("Male", "Female"), 10, replace = TRUE)
df <- data.frame(AgeGroup, Gender)
df
I can create a summary table without the comparison easily:
df1 <- df %>%
group_by(AgeGroup) %>%
summarise(num_M = sum(Gender == "Male"),
den_M = n(),
prop_M = num_M/den_M)
df1
But I can’t figure out how to create additional columns that compare the different rows of the grouped data. Let’s say I want to run a chi-square test on the proportion of Males in each AgeGroup and add the p-value to the summary table above.
It would look like this (numbers, obviously, are examples), Y = Young, O = Old:
Any gentle nudges in the right direction would be greatly appreciated.
Thanks!
I like the finalfit package for summary tables. If you need to add custom summary functions, it might not be flexible enough, but its default stats cover everything you've asked for in your example, e.g. numbers in each group, proportions, and a chi-squared test. If you have continuous variables it will calculate means and SDs in each group.
library(finalfit)
finalfit::summary_factorlist(
df,
dependent = "Gender",
explanatory = "AgeGroup",
total_col = TRUE,
p = TRUE
)
Output:
     label levels   Female      Male Total     p
1 AgeGroup    Old  0 (0.0) 6 (100.0)     6 0.197
2           Young 1 (25.0)  3 (75.0)     4
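If you would rather stay in base R, a sketch along the following lines may also work. It assumes made-up counts (12 of 20 Young and 7 of 20 Old are male; these are not from the question) and uses `prop.test()`, which for two groups gives the difference in proportions with a 95% confidence interval and the same p-value as the default `chisq.test()` on the 2x2 table. The names `summary_tab`, `diff`, `ci_low`, `ci_high`, and `p` are illustrative.

```r
# Made-up example data, NOT from the question
df <- data.frame(
  AgeGroup = rep(c("Young", "Old"), each = 20),
  Gender = c(rep(c("Male", "Female"), times = c(12, 8)),
             rep(c("Male", "Female"), times = c(7, 13)))
)

tab <- table(df$AgeGroup, df$Gender)   # rows sorted alphabetically: Old, Young
num_M <- tab[, "Male"]                 # numerators (males per age group)
den_M <- rowSums(tab)                  # denominators (group sizes)

# Two-sample test of equal proportions (chi-squared with continuity correction)
pt <- prop.test(x = as.vector(num_M), n = as.vector(den_M))

summary_tab <- data.frame(
  AgeGroup = names(num_M),
  num_M = as.vector(num_M),
  den_M = as.vector(den_M),
  prop_M = as.vector(num_M / den_M),
  diff = unname(pt$estimate[1] - pt$estimate[2]),  # Old minus Young
  ci_low = pt$conf.int[1],
  ci_high = pt$conf.int[2],
  p = pt$p.value
)
print(summary_tab)
```

The comparison columns are repeated on both rows because they describe the pair of groups rather than a single group.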