I am attempting to run a Friedman test on ordinal data in R and am getting errors. The data can be found here on Dropbox: https://www.dropbox.com/s/gh8crh18y1ueriy/seltoutput.xlsx?dl=0.
As a description of the data:
group1: group assignments, 2 levels
time1: time points, 2 levels
loameasure: ordinal data, 5 levels
distmeasure: continuous data
vectemp: participant IDs
After importing the data, I run the following to format it correctly:
selt$loameasure <- factor(selt$loameasure)
selt$distmeasure <- as.numeric(selt$distmeasure)
selt$time1 <- factor(selt$time1)
Then I run:
friedman_test(formula = loameasure ~ time1 | vectemp, data = selt)
Then I get the error:
Error in friedman.test.default(c(3L, 2L, 3L, 2L, 2L, 5L, 2L, 1L, 3L, 4L, :
not an unreplicated complete block design
I thought that loameasure and time1 had to be factors, but I also tried them as numeric and got a similar error:
Error in friedman.test.default(c(3, 2, 3, 2, 2, 5, 2, 1, 3, 4, 2, 2, 4, :
not an unreplicated complete block design
I've been playing around with this for days and haven't been able to figure out what my problem is. I would love some assistance! Thank you in advance!
As far as I can tell, a Friedman test is not appropriate in your situation. I would suggest performing a two-way ANOVA for unbalanced designs with the Type-III sums-of-squares method.
The assumptions of normality of residuals and homogeneity of variance are met (both are checked below).
I have tried to guide you through how to perform the test and what some of the steps mean. It is not complete (interpretation is missing, etc.), but it should be a starting point and a direction for you.
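For reference, the "not an unreplicated complete block design" error usually means that each participant does not contribute exactly one loameasure value per time point. A minimal diagnostic sketch, assuming your data are still in selt as above:
# Each cell of this table must be exactly 1: a 0 (missing time point) or a
# 2+ (replicated measurement) for any participant triggers the error
table(selt$vectemp, selt$time1)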
We want to know whether loameasure depends on group1 and time1.
We will perform a two-way ANOVA with two factors:
Dependent variable: loameasure
Independent variables: group1 and time1
library(readxl)
# load your data
df <- read_excel("C:/Users/coding/Downloads/seltoutput.xlsx",
                 col_types = c("numeric", "numeric", "numeric"))
# Prepare data
# group1 to factor
df$group1 <- factor(df$group1,
                    levels = c(0, 1),
                    labels = c("Group_0", "Group_1"))
# time1 to factor
df$time1 <- factor(df$time1,
                   levels = c(1, 2),
                   labels = c("Time_1", "Time_2"))
# Visualize -------------------------------------------------------------------
library("ggpubr")
ggboxplot(df, x = "time1", y = "loameasure", color = "group1",
          palette = c("#00AFBB", "#E7B800"))
ggline(df, x = "time1", y = "loameasure", color = "group1",
       add = c("mean_se", "dotplot"),
       palette = c("#00AFBB", "#E7B800"))
# First decide if this is a balanced or unbalanced design ----------------------
table(df$group1, df$time1)
table(df$group1, df$time1)
# Output
# Time_1 Time_2
# Group_0 20 20
# Group_1 29 29
# Here it is an unbalanced design
# An unbalanced design has unequal numbers of subjects in each group!
## We will perform a two-way ANOVA test in R for an unbalanced design
# The recommended method is Type-III sums of squares.
# you need `car` package
library(car)
# Our 2 way anova of unbalanced design (SS Type-III)
df_anova <- aov(loameasure ~ group1 * time1, data = df)
Anova(df_anova, type = "III")
## Output
# Anova Table (Type III tests)
# Response: loameasure
# Sum Sq Df F value Pr(>F)
# (Intercept) 120.050 1 83.9312 1.116e-14 ***
# group1 0.700 1 0.4891 0.48607
# time1 62.500 1 43.6960 2.301e-09 ***
# group1:time1 5.716 1 3.9963 0.04849 *
# Residuals 134.452 94
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Normality check ---------------------------------------------------------
# Extract the residuals (difference between each individual and their group1/time1 combination mean)
res <- df_anova$residuals
# Histogram of residuals: Residuals should be normally distributed
hist(res,main="Histogram of residuals", xlab = "Residuals")
# Run a Shapiro-Wilk test on the residuals
shapiro.test(x = res)
# Output
# data: res
# W = 0.97708, p-value = 0.08434
# P is > 0.05 therefore normality can be assumed.
# Homogeneity test ---------------------------------------------------------
# Levene's test for equality of variances (in `car` package)
library(car)
leveneTest(loameasure ~ time1 * group1, data = df)
# Output:
# Levene's Test for Homogeneity of Variance (center = median)
# Df F value Pr(>F)
# group 3 0.3196 0.8112
# 94
# P is > 0.05 therefore equal variances can be assumed.
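Since the group1:time1 interaction above is significant (p = 0.048), pairwise follow-up comparisons may be worth running. A minimal sketch using the fitted model above; note that with an unbalanced design these intervals are only approximate:
# Tukey HSD pairwise comparisons for the significant interaction term
TukeyHSD(df_anova, which = "group1:time1")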
Related
This may be a trivial question.
In my data, I have two groups, grp1 and grp2. In each group, some observations are assigned to the treatment group and some to the control group.
My question is whether the effect of the treatment on dv differs significantly between grp1 and grp2. In other words, this is a difference in differences.
I want to estimate if the following difference is significant:
dd = mean(dv_grp1_treat-dv_grp1_control)-mean(dv_grp2_treat-dv_grp2_control)
# create data
install.packages("librarian")
librarian::shelf(librarian, tidyverse, truncnorm)
aud_tr <- data.frame(dv = rtruncnorm(625, a = 0, b = 4, mean = 2.1, sd = 1)) %>% mutate(group = "grp1_treat")
aud_notr <- data.frame(dv = rtruncnorm(625, a = 0, b = 4, mean = 2, sd = 1)) %>% mutate(group = "grp1_control")
noaud_tr <- data.frame(dv = rtruncnorm(625, a = 0, b = 4, mean = 2.4, sd = 1)) %>% mutate(group = "grp2_treat")
noaud_notr <- data.frame(dv = rtruncnorm(625, a = 0, b = 4, mean = 2.1, sd = 1)) %>% mutate(group = "grp2_control")
df <- bind_rows(aud_tr, aud_notr, noaud_tr, noaud_notr)
unique(df$group)
[1] "grp1_treat"   "grp1_control" "grp2_treat"   "grp2_control"
I know how to run t.test for the difference in means within each group, but how do I do it if I want to examine the difference across groups?
t.test(df$dv[df$group=="grp1_treat"],df$dv[df$group=="grp1_control"])
t.test(df$dv[df$group=="grp2_treat"],df$dv[df$group=="grp2_control"])
It sounds like you need a two-way analysis of variance (ANOVA). Firstly, you should ensure that you separate out "group membership" and "treatment versus control" into two columns, since these are really two distinct variables:
df$treatment <- ifelse(grepl('treat', df$group), 'treat', 'control')
df$group <- ifelse(grepl('1', df$group), 'grp1', 'grp2')
Then you can carry out a two-way ANOVA using aov:
summary(aov(dv ~ group + treatment, data = df))
#> Df Sum Sq Mean Sq F value Pr(>F)
#> group 1 1.18 1.175 1.362 0.245
#> treatment 1 26.14 26.145 30.307 1.14e-07 ***
#> Residuals 197 169.95 0.863
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This tells you that, in this sample, the effect of treatment was significant, but the effect of group membership was not.
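One caveat: the model above tests only the two main effects. The difference in differences you defined is the group-by-treatment interaction, so you may want to include it explicitly. A sketch on the same data:
# The group:treatment interaction term is the difference-in-differences estimate
summary(aov(dv ~ group * treatment, data = df))
# equivalently, as a regression coefficient with standard error and p-value:
summary(lm(dv ~ group * treatment, data = df))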
Data
Obviously, we don't have your data since it wasn't supplied in the question, but the following sample data frame has the same names and structure as your own:
set.seed(1)
df <- data.frame(dv = c(rnorm(50, 3.2), rnorm(50, 3.8),
rnorm(50, 3.5), rnorm(50, 4.1)),
group = rep(c('grp1_control', 'grp1_treat',
'grp2_control', 'grp2_treat'), each = 50))
Related
df <- data.frame(rating1 = c(1,5,2,4,5),
                 rating2 = c(2,1,2,4,2),
                 rating3 = c(0,2,1,2,0),
                 race = c("black", "asian", "white", "black", "white"),
                 gender = c("male", "female", "female", "male", "female"))
I'd like to conduct a t-test comparing each group mean (e.g. the mean of Asians on rating1) with the overall mean of that rating (e.g. rating1). Below is my code for Asians on rating1.
asian_df <- df %>%
filter(race == "asian")
t.test(asian_df$rating1, mean(df$rating1))
Then for Blacks in rating 2, I'd run
black_df <- df %>%
filter(race == "black")
t.test(black_df$rating2, mean(df$rating2))
How can I write a function that automates the t-test for each group? So far I have to manually change the variable names to run it for each race, each gender, and each rating (rating1 to rating3). Thanks!
Performing multiple t-tests increases your risk of Type I error and you will need to adjust for multiple comparisons in order for your results to be valid/meaningful. You can run the t-tests by looping through your variables, e.g.
library(tidyverse)
df <- data.frame(rating1 = c(5,8,7,8,9,6,9,7,8,5,8,5),
                 rating2 = c(2,7,8,4,9,3,6,1,7,3,9,1),
                 rating3 = c(0,6,1,2,7,2,9,1,6,2,3,1),
                 race = c("asian", "asian", "asian", "black", "asian", "black",
                          "white", "black", "white", "black", "white", "black"),
                 gender = c("male", "female", "female", "male", "female", "male",
                            "female", "male", "female", "male", "female", "male"))
for (rac in unique(df$race)) {
  tmp_df <- df %>%
    filter(race == rac)
  print(rac)
  print(t.test(tmp_df$rating1,
               rep(mean(df$rating1),
                   length(tmp_df$rating1))))
}
[1] "asian"
Welch Two Sample t-test
data: tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = 0.19518, df = 3, p-value = 0.8577
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.550864 2.884198
sample estimates:
mean of x mean of y
7.250000 7.083333
[1] "black"
Welch Two Sample t-test
data: tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = -1.5149, df = 4, p-value = 0.2044
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.5022651 0.7355985
sample estimates:
mean of x mean of y
6.200000 7.083333
[1] "white"
Welch Two Sample t-test
data: tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = 3.75, df = 2, p-value = 0.06433
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1842176 2.6842176
sample estimates:
mean of x mean of y
8.333333 7.083333
for (gend in unique(df$gender)) {
  tmp_df <- df %>%
    filter(gender == gend)
  print(gend)
  print(t.test(tmp_df$rating1,
               rep(mean(df$rating1),
                   length(tmp_df$rating1))))
}
[1] "male"
Welch Two Sample t-test
data: tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = -2.0979, df = 5, p-value = 0.09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.4107761 0.2441094
sample estimates:
mean of x mean of y
6.000000 7.083333
[1] "female"
Welch Two Sample t-test
data: tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = 3.5251, df = 5, p-value = 0.01683
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2933469 1.8733198
sample estimates:
mean of x mean of y
8.166667 7.083333
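To automate this across every rating and both grouping variables, as the question asks, you could wrap the loop in a helper function. This is only a sketch (the name run_group_tests is made up), and it uses the one-sample form t.test(x, mu = ...), which is the more standard way to compare a subgroup mean against a fixed overall mean; it will therefore give slightly different statistics than the two-sample calls above:
# Hypothetical helper: one-sample t-test of each subgroup mean against the
# overall mean of that rating, for every rating and every grouping variable
run_group_tests <- function(df, ratings, groups) {
  for (g in groups) {
    for (lvl in unique(df[[g]])) {
      for (r in ratings) {
        cat(g, "=", lvl, "|", r, "\n")
        print(t.test(df[df[[g]] == lvl, r], mu = mean(df[[r]])))
      }
    }
  }
}
run_group_tests(df, c("rating1", "rating2", "rating3"), c("race", "gender"))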
Due to multiple testing (in this example, 5 t-tests) your chance of a false positive is 1 - (1 - 0.05)^5 = 22.62% <- very high. To account for this, you can apply the Bonferroni correction, which basically takes your required p-value (in this case, p < 0.05) and divides it by the number of tests (i.e. the new p-value required to reject the null is p < 0.01). When you apply this correction, even the 'best' t-test result (gender; p-value = 0.01683) is not statistically significant.
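Rather than dividing the threshold by hand, you can let R adjust the p-values directly. A sketch using the five p-values printed above:
# Bonferroni-adjusted p-values; compare these against the usual 0.05 threshold
pvals <- c(0.8577, 0.2044, 0.06433, 0.09, 0.01683)
p.adjust(pvals, method = "bonferroni")
# the smallest becomes 0.01683 * 5 = 0.08415, so none remain significant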
An alternative approach would be to compare means across all conditions using ANOVA, then use Tukey's HSD to determine which groups differ. Tukey's HSD controls the family-wise error rate across its pairwise comparisons, so you don't need a separate correction for multiple testing, and your results remain valid. Adapting this approach to your problem might be a better way to go; note that rating1 + rating2 + rating3 on the left-hand side of the formula below models the sum of the three ratings as a single response, e.g.
anova_one_way <- aov(rating1 + rating2 + rating3 ~ race + gender, data = df)
summary(anova_one_way)
Df Sum Sq Mean Sq F value Pr(>F)
race 2 266.70 133.35 14.01 0.00243 **
gender 1 140.08 140.08 14.72 0.00497 **
Residuals 8 76.13 9.52
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
TukeyHSD(anova_one_way)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = rating1 + rating2 + rating3 ~ race + gender, data = df)
$race
diff lwr upr p adj
black-asian -7.050000 -12.963253 -1.136747 0.0224905
white-asian 4.416667 -2.315868 11.149201 0.2076254
white-black 11.466667 5.029132 17.904201 0.0023910
$gender
diff lwr upr p adj
male-female -3.416667 -7.523829 0.6904958 0.0913521
Related
I have a data frame with count numbers, and I want to perform a chisq.test for each value of the variable Cluster. So basically, I need 4 contingency tables (for "A", "B", "C", "D") where rows = Category, columns = Drug, value = Total. Subsequently, a chisq.test should be run on all 4 tables.
Example data frame
df <- data.frame(Cluster = c(rep("A",8), rep("B",8), rep("C",8), rep("D",8)),
                 Category = rep(c(rep("0-1",2), rep("2-4",2), rep("5-12",2), rep(">12",2)), 2),
                 Drug = rep(c("drug X", "drug Y"), 16),
                 Total = as.numeric(sample(20:200, 32, replace = TRUE)))
Firstly, use xtabs() to produce stratified contingency tables.
tab <- xtabs(Total ~ Category + Drug + Cluster, df)
tab
# , , Cluster = A
#
# Drug
# Category drug X drug Y
# >12 92 75
# 0-1 33 146
# 2-4 193 95
# 5-12 76 195
#
# etc.
Then use apply() to conduct a Pearson's Chi-squared test over each stratum.
apply(tab, 3, chisq.test)
# $A
#
# Pearson's Chi-squared test
#
# data: array(newX[, i], d.call, dn.call)
# X-squared = 145.98, df = 3, p-value < 2.2e-16
#
# etc.
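If you prefer a compact summary over the full printed list, the test objects returned by apply() can be reduced to a small data frame. A sketch:
# Collect the per-cluster chi-squared statistics and p-values into one table
res <- apply(tab, 3, chisq.test)
data.frame(Cluster   = names(res),
           X.squared = sapply(res, function(x) unname(x$statistic)),
           p.value   = sapply(res, function(x) x$p.value))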
Furthermore, you can perform a Cochran-Mantel-Haenszel chi-squared test for conditional independence.
mantelhaen.test(tab)
# Cochran-Mantel-Haenszel test
#
# data: tab
# Cochran-Mantel-Haenszel M^2 = 59.587, df = 3, p-value = 7.204e-13
Related
I am trying to run correlation tests simultaneously on data frames representing the number of unique stores an employee is assigned, with columns representing different regions. My data frame is split by the number of unique stores each employee has:
unique_store_breakdown <- split(Data, as.factor(Data$unique_stores))
Ideally I would like the output:
Region -- unique_store -- correlation
Midwest ------- 1 -------------- .05
Midwest ------- 2 -------------- .04
.
.
Southeast ----- 1 ------------- 0.75
.
.
cor_tests <-list()
counter = 0
for (i in unique(j$region)){
for (j in 1: length(unique_store_breakdown)){
counter = counter + 1
#Create new variables for correlation test
x = as.numeric(j[j$region == i,]$quality)
y = as.numeric(j[j$region == i,]$rsv)
cor_tests[[counter]] <- cor.test(x,y)
}}
cor_tests
I am able to run this for one data frame at a time, but when I try to add the nested loop (the j term) I receive the error "Error: $ operator is invalid for atomic vectors." Additionally, I would also like to output the results as a data frame rather than a list, if possible.
If all you want to do is perform cor.test() for each store, it should be fairly simple using by(). The output from by() is a regular list, it's just the printing that is a little special.
# example data
set.seed(1)
dtf <- data.frame(store=rep(1:3, each=30), rsv=rnorm(90))
dtf$quality <- dtf$rsv + rnorm(90, 0, dtf$store)
# perform cor.test for every store
by(dtf, dtf$store, function(x) cor.test(x$quality, x$rsv))
# dtf$store: 1
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 5.5485, df = 28, p-value = 6.208e-06
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.4915547 0.8597796
# sample estimates:
# cor
# 0.7236681
#
# ------------------------------------------------------------------------------
# dtf$store: 2
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 0.68014, df = 28, p-value = 0.502
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.2439893 0.4663368
# sample estimates:
# cor
# 0.1274862
#
# ------------------------------------------------------------------------------
# dtf$store: 3
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 2.2899, df = 28, p-value = 0.02977
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.04304952 0.66261810
# sample estimates:
# cor
# 0.397159
#
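To get the results as a data frame rather than a list, as requested, you can extract the pieces you need from each test object. A sketch; for the region-by-store breakdown in the question, the same idea works with by(Data, list(Data$region, Data$unique_stores), ...):
# Reduce the list of cor.test results to one row per store
res <- by(dtf, dtf$store, function(x) cor.test(x$quality, x$rsv))
data.frame(store       = names(res),
           correlation = sapply(res, function(x) unname(x$estimate)),
           p.value     = sapply(res, function(x) x$p.value))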
Related
I'm exploring the same data and trying to highlight the within-group variance versus the between-group variance. I have been able to show that the between-group variance is very strong; however, the nature of the data should show weak within-group variance (my Shapiro-Wilk normality test points this way). I believe some resampling with a Welch correction might demonstrate this.
I was wondering if someone knows of a resampling-based ANOVA with a Welch correction in R. I see there is an R implementation of the permutation test, but with no correction. If not, how would I code the test directly using this implementation?
http://finzi.psych.upenn.edu/library/lmPerm/html/aovp.html
Here is the outline for my basic between group ANOVA:
fit <- lm(formula = data$Boys ~ data$GroupofBoys)
anova(fit)
I believe you're correct that there isn't an easy way to do a Welch-corrected ANOVA with resampling, but it should be possible to cobble a few things together to make it work.
require('Ecdat')
I'll use the "Star" dataset from the "Ecdat" package, which looks at the effects of small class sizes on standardized test scores.
star<-Star
attach(star)
head(star)
tmathssk treadssk classk totexpk sex freelunk race schidkn
2 473 447 small.class 7 girl no white 63
3 536 450 small.class 21 girl no black 20
5 463 439 regular.with.aide 0 boy yes black 19
11 559 448 regular 16 boy no white 69
12 489 447 small.class 5 boy yes white 79
13 454 431 regular 8 boy yes white 5
Some exploratory analysis:
# boxplots
boxplot(treadssk ~ classk, ylab="Total Reading Scaled Score")
title("Reading Scores by Class Size")
#histograms
hist(treadssk, xlab="Total Reading Scaled Score")
Run a regular ANOVA:
model1 = aov(treadssk ~ classk, data = star)
summary(model1)
Df Sum Sq Mean Sq F value Pr(>F)
classk 2 37201 18601 18.54 9.44e-09 ***
Residuals 5745 5764478 1003
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A look at the ANOVA residuals:
#qqplot
qqnorm(residuals(model1),ylab="Reading Scaled Score")
qqline(residuals(model1),ylab="Reading Scaled Score")
The Q-Q plot shows that the ANOVA residuals deviate from the normal Q-Q line.
#Fitted Y vs. Residuals
plot(fitted(model1), residuals(model1))
The plot of fitted values vs. residuals shows a converging trend in the residuals; we can run a Shapiro-Wilk test just to be sure.
shapiro.test(treadssk[1:5000]) # shapiro.test is constrained to sample sizes between 3 and 5000
Shapiro-Wilk normality test
data: treadssk[1:5000]
W = 0.92256, p-value < 2.2e-16
This just confirms that we aren't going to be able to assume a normal distribution.
We can use the bootstrap to estimate the true F-distribution.
# Bootstrap version (with 10,000 iterations)
# Center each group at its own mean so the null hypothesis holds in the resamples
mean_read = tapply(treadssk, classk, mean)
grpA = treadssk[classk == "regular"] - mean_read["regular"]
grpB = treadssk[classk == "small.class"] - mean_read["small.class"]
grpC = treadssk[classk == "regular.with.aide"] - mean_read["regular.with.aide"]
# group labels matching the order in which the resampled scores are stacked
sim_classk <- factor(rep(c("regular", "small.class", "regular.with.aide"),
                         times = c(2000, 1733, 2015)))
R = 10000
sim_Fstar = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size = 2000, replace = TRUE)
  groupB = sample(grpB, size = 1733, replace = TRUE)
  groupC = sample(grpC, size = 2015, replace = TRUE)
  sim_score = c(groupA, groupB, groupC)
  sim_data = data.frame(sim_score, sim_classk)
  # keep the Welch F statistic from each resample
  sim_Fstar[i] = oneway.test(sim_score ~ sim_classk, data = sim_data)$statistic
}
Now we need to get the set of unique pairs of the sim_classk factor:
allPairs <- expand.grid(levels(sim_data$sim_classk), levels(sim_data$sim_classk))
## http://stackoverflow.com/questions/28574006/unique-combination-of-two-columns-in-r/28574136#28574136
allPairs <- unique(t(apply(allPairs, 1, sort)))
allPairs <- allPairs[ allPairs[,1] != allPairs[,2], ]
allPairs
[,1] [,2]
[1,] "regular" "small.class"
[2,] "regular" "regular.with.aide"
[3,] "regular.with.aide" "small.class"
Since oneway.test() applies a Welch correction by default, we can use that on our simulated data.
allResults <- apply(allPairs, 1, function(p) {
  # http://stackoverflow.com/questions/28587498/post-hoc-tests-for-one-way-anova-with-welchs-correction-in-r
  dat <- sim_data[sim_data$sim_classk %in% p, ]
  ret <- oneway.test(sim_score ~ sim_classk, data = dat, na.action = na.omit)
  ret$sim_classk <- p
  ret
})
length(allResults)
[1] 3
allResults[[1]]
One-way analysis of means (not assuming equal variances)
data: sim_score and sim_classk
F = 1.7741, num df = 2.0, denom df = 1305.9, p-value = 0.170
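Finally, with sim_Fstar filled in by the loop above, you can compare the observed Welch F statistic against the bootstrap null distribution to get a resampling-based p-value. A sketch:
# Observed Welch F statistic on the real data
F_obs <- oneway.test(treadssk ~ classk, data = star)$statistic
# Proportion of bootstrap F statistics at least as extreme as the observed one
mean(sim_Fstar >= F_obs)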