All,
I would like to perform the equivalent of TukeyHSD on the rank ordering median shift test that such as kruskal wallis
X=matrix(c(1,1,1,1,2,2,2,4,4,4,4,4,1,3,6,9,4,6,8,10,1,2,1,3),ncol=2)
anova=aov(X[,2]~factor(X[,1]))
TukeyHSD(anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = X[, 2] ~ factor(X[, 1]))
##
## $`factor(X[, 1])`
## diff lwr upr p adj
## 2-1 1.25 -5.927068 8.427068 0.8794664
## 4-1 -1.35 -7.653691 4.953691 0.8246844
## 4-2 -2.60 -9.462589 4.262589 0.5617125
kruskal.test(X[,2]~factor(X[,1]))
##
## Kruskal-Wallis rank sum test
##
## data: X[, 2] by factor(X[, 1])
## Kruskal-Wallis chi-squared = 1.7325, df = 2, p-value = 0.4205
I would like now to analyze the contrasts. Please help. Thanks.
Rik
If you want to do multiple comparisons after a Kruskal-Wallis test, you need the kruskalmc function from the pgirmess package. Before you can implement this function, you will need to transform your matrix to a dataframe. In your example:
# convert matrix to dataframe
dfx <- as.data.frame(X)
# the Kruskal-Wallis test & output
kruskal.test(dfx$V2~factor(dfx$V1))
Kruskal-Wallis rank sum test
data: dfx$V2 by factor(dfx$V1)
Kruskal-Wallis chi-squared = 1.7325, df = 2, p-value = 0.4205
# the post-hoc tests & output
kruskalmc(V2~factor(V1), data = dfx)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
1-2 1.75 6.592506 FALSE
1-4 1.65 5.790265 FALSE
2-4 3.40 6.303642 FALSE
If you want the compact letter display similar to what is outputed from TukeyHSD, for the Kruskal test, the library agricolae allows it with the function kruskal. Using your own data:
library(agricolae)
print( kruskal(X[, 2], factor(X[, 1]), group=TRUE, p.adj="bonferroni") )
#### ...
#### $groups
#### trt means M
#### 1 2 8.50 a
#### 2 1 6.75 a
#### 3 4 5.10 a
(well, in this example the groups are not considered different, same result than the other answer..)
Related
I've just run an ANOVA (aov) in r between 3 groups. Group 1,2,3.
After running TukeyHSD for my model, my comparisons are compared in the order of groups:
2-1,
3-1,
3-2
can this be changed so is as follows:
1-2,
1-3,
2-3
Thanks
Using relevel will not work since you don't want to change the level order, just the labels. First we need some reproducible data:
data(iris)
SL <- iris$Sepal.Length
Sp <- as.factor(as.numeric(iris$Species))
iris.aov <- aov(SL~Sp)
iris.mc <- TukeyHSD(iris.aov)
iris.mc
# Tukey multiple comparisons of means
# 95% family-wise confidence level
#
# Fit: aov(formula = SL ~ Sp)
#
# $Sp
# diff lwr upr p adj
# 2-1 0.930 0.6862273 1.1737727 0
# 3-1 1.582 1.3382273 1.8257727 0
# 3-2 0.652 0.4082273 0.8957727 0
Now to switch the labels we use expand.grid to create ones with the first and second group id switched:
ngroups <- 3
Grps <- expand.grid(seq(ngroups), seq(ngroups))
Grps <- Grps[Grps$Var1 < Grps$Var2,] # Unique groups
newlbls <- unname(apply(Grps, 1, paste0, collapse="-"))
dimnames(iris.mc$Sp)[[1]] <- newlbls
iris.mc
# Tukey multiple comparisons of means
# 95% family-wise confidence level
#
# Fit: aov(formula = SL ~ Sp)
#
# $Sp
# diff lwr upr p adj
# 1-2 0.930 0.6862273 1.1737727 0
# 1-3 1.582 1.3382273 1.8257727 0
# 2-3 0.652 0.4082273 0.8957727 0
I am attempting to run a friedman test on ordinal data in R and am getting errors. The data can be found here on dropbox https://www.dropbox.com/s/gh8crh18y1ueriy/seltoutput.xlsx?dl=0.
As a description of the data:
group1: group assignments, 2 levels
time1: time points, 2 levels
loameasure: ordinal data, 5 levels
distmeasure: continuous data
vectemp: participant IDs
After importing the data I run the following to correctly format:
selt$loameasure<-factor(selt$loameasure)
selt$distmeasure<-as.numeric(selt$distmeasure)
selt$time1<-factor(selt$time1)
Then I run:
friedman_test(formula = loameasure ~ time1 | vectemp, data = selt)
Then I get the error:
Error in friedman.test.default(c(3L, 2L, 3L, 2L, 2L, 5L, 2L, 1L, 3L, 4L, :
not an unreplicated complete block design
I thought that loameasure and time1 had to be factors but I did try them as numeric and I get a similar error:
Error in friedman.test.default(c(3, 2, 3, 2, 2, 5, 2, 1, 3, 4, 2, 2, 4, :
not an unreplicated complete block design
I've been playing around with this for days and haven't been able to figure out what my problem is. I would love some assistance! Thank you in advance!
As far as I can anticipate a Friedman test is not appropriate in your situation. I would suggest to perform a two-way ANOVA test for unbalanced designs with Type-III sums of square method.
The assumptions of Normality of residuals and homogenity are given.
I have tried to guide you how to perform the test and the meaning of some steps. It is not complete (lacking of interpretation etc..) But this should be a begin and direction for you.
We want to know if loameasure depends on group1 and time1
We will perform a two-way anova with two factors
Dependent variable: loameasure
Independent variable: group1 and time1
library(readxl)
# load your data
df <- read_excel("C:/Users/coding/Downloads/seltoutput.xlsx",
col_types = c("numeric", "numeric", "numeric"))
# Prepare data
# group1 to factor
df$group1 <- factor(df$group1,
levels = c(0, 1),
labels = c("Group_0", "Group_1"))
# time1 to factor
df$time1 <- factor(df$time1,
levels = c(1, 2),
labels = c("Time_1", "Time_2"))
----------------------------------------------------------------------------
# Visualize
library("ggpubr")
ggboxplot(df, x = "time1", y = "loameasure", color = "group1",
palette = c("#00AFBB", "#E7B800"))
ggline(df, x = "time1", y = "loameasure", color = "group1",
add = c("mean_se", "dotplot"),
palette = c("#00AFBB", "#E7B800"))
-----------------------------------------------------------------------------
# first decide if balanced or unbalnced design
table(df$group1, df$time1)
# Output
# Time_1 Time_2
# Group_0 20 20
# Group_1 29 29
# Here it is a unbalance design
# An unbalanced design has unequal numbers of subjects in each group!
## We will perform two-way ANOVA test in R for unbalanced designs !!!!!!!!!!!
# The recommended method are the Type-III sums of squares.
# you need `car` package
library(car)
# Our 2 way anova of unbalanced design (SS Type-III)
df_anova <- aov(loameasure ~ group1 * time1, data = df)
Anova(df_anova, type = "III")
## Output
# Anova Table (Type III tests)
# Response: loameasure
# Sum Sq Df F value Pr(>F)
# (Intercept) 120.050 1 83.9312 1.116e-14 ***
# group1 0.700 1 0.4891 0.48607
# time1 62.500 1 43.6960 2.301e-09 ***
# group1:time1 5.716 1 3.9963 0.04849 *
# Residuals 134.452 94
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Nomrality check ---------------------------------------------------------
# call residuals (difference between each indivdual and their group1/time1 combination mean)
res <- df_anova$residuals
# Histogram of residuals: Residuals should be normally distributed
hist(res,main="Histogram of residuals", xlab = "Residuals")
# # Extract the residuals
# Run Shapiro-Wilk test
shapiro.test(x = res )
# Output
# data: res
# W = 0.97708, p-value = 0.08434
# P is > 0.05 therefore normality can be assumed.
# Homogenity test ---------------------------------------------------------
# Levene's test for equality of variances (in `car` package)
library(car)
leveneTest(loameasure~ time1 * group1,data=df)
# Output:
# Levene's Test for Homogeneity of Variance (center = median)
# Df F value Pr(>F)
# group 3 0.3196 0.8112
# 94
# P is > 0.05 therefore equal variances can be assumed.
I have a data frame with count numbers and I want to perform a chisq.test for each value of the variable Cluster. So basically, I need 4 contingency tables (for "A","B","C","D") where rows = Category, columns = Drug, value = Total. And subsequently a chisq.test should be run for all 4 tabels.
Example data frame
df <- data.frame(Cluster = c(rep("A",8),rep("B",8),rep("C",8),rep("D",8)),
Category = rep(c(rep("0-1",2),rep("2-4",2),rep("5-12",2),rep(">12",2)),2),
Drug = rep(c("drug X","drug Y"),16),
Total = as.numeric(sample(20:200,32,replace=TRUE)))
Firstly, use xtabs() to produce stratified contingency tables.
tab <- xtabs(Total ~ Category + Drug + Cluster, df)
tab
# , , Cluster = A
#
# Drug
# Category drug X drug Y
# >12 92 75
# 0-1 33 146
# 2-4 193 95
# 5-12 76 195
#
# etc.
Then use apply() to conduct a Pearson's Chi-squared test over each stratum.
apply(tab, 3, chisq.test)
# $A
#
# Pearson's Chi-squared test
#
# data: array(newX[, i], d.call, dn.call)
# X-squared = 145.98, df = 3, p-value < 2.2e-16
#
# etc.
Furthermore, you can perform a Cochran-Mantel-Haenszel chi-squared test for conditional independence.
mantelhaen.test(tab)
# Cochran-Mantel-Haenszel test
#
# data: tab
# Cochran-Mantel-Haenszel M^2 = 59.587, df = 3, p-value = 7.204e-13
I am trying to run a correlation test on different data frames representing the number of unique stores an employee is assigned and columns repenting different regions simultaneously. My data frame is split by the number of unique stores each employee has by:
unique_store_breakdown <- split(Data, as.factor(Data$unique_stores))
Ideally I would like the output:
Region -- unique_store -- correlation
Midwest ------- 1 -------------- .05
Midwest ------- 2 -------------- .04
.
.
Southeast ----- 1 ------------- 0.75
.
.
cor_tests <-list()
counter = 0
for (i in unique(j$region)){
for (j in 1: length(unique_store_breakdown)){
counter = counter + 1
#Create new variables for correlation test
x = as.numeric(j[j$region == i,]$quality)
y = as.numeric(j[j$region == i,]$rsv)
cor_tests[[counter]] <- cor.test(x,y)
}}
cor_tests
I am able to run this for one dataframe at a time, but when I try to add the nested loop (j term) I receive the error "Error: $ operator is invalid for atomic vectors. Additionally I would also like to output the results as a dataframe rather than a list if possible.
If all you want to do is perform cor.test() for each store, it should be fairly simple using by(). The output from by() is a regular list, it's just the printing that is a little special.
# example data
set.seed(1)
dtf <- data.frame(store=rep(1:3, each=30), rsv=rnorm(90))
dtf$quality <- dtf$rsv + rnorm(90, 0, dtf$store)
# perform cor.test for every store
by(dtf, dtf$store, function(x) cor.test(x$quality, x$rsv))
# dtf$store: 1
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 5.5485, df = 28, p-value = 6.208e-06
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.4915547 0.8597796
# sample estimates:
# cor
# 0.7236681
#
# ------------------------------------------------------------------------------
# dtf$store: 2
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 0.68014, df = 28, p-value = 0.502
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.2439893 0.4663368
# sample estimates:
# cor
# 0.1274862
#
# ------------------------------------------------------------------------------
# dtf$store: 3
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 2.2899, df = 28, p-value = 0.02977
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.04304952 0.66261810
# sample estimates:
# cor
# 0.397159
#
I'm doing some exploring with the same data and I'm trying to highlight the in-group variance versus the between group variance. Now I have been able to successfully show the between group variance is very strong, however, the nature of the data should show weak within group variance. (I.e. My Shapiro-Wilk normality test shows this) I believe if I do some re-sampling with a welch correction, this might be the case.
I was wondering if someone knew if there was a re-sampling based anova with a Welch correction in R. I see there is an R implementation of the permutation test but with no correction. If not, how would I code the test directly while using this implementation.
http://finzi.psych.upenn.edu/library/lmPerm/html/aovp.html
Here is the outline for my basic between group ANOVA:
fit <- lm(formula = data$Boys ~ data$GroupofBoys)
anova(fit)
I believe you're correct in that there isn't an easy way to do welch corrected anova with resampling, but it should be possible to hobble a few things together to make it work.
require('Ecdat')
I'll use the “Star” dataset from the “Ecdat" package which looks at the effects of small class sizes on standardized test scores.
star<-Star
attach(star)
head(star)
tmathssk treadssk classk totexpk sex freelunk race schidkn
2 473 447 small.class 7 girl no white 63
3 536 450 small.class 21 girl no black 20
5 463 439 regular.with.aide 0 boy yes black 19
11 559 448 regular 16 boy no white 69
12 489 447 small.class 5 boy yes white 79
13 454 431 regular 8 boy yes white 5
Some exploratory analysis:
#bloxplots
boxplot(treadssk ~ classk, ylab="Total Reading Scaled Score")
title("Reading Scores by Class Size")
#histograms
hist(treadssk, xlab="Total Reading Scaled Score")
Run regular anova
model1 = aov(treadssk ~ classk, data = star)
summary(model1)
Df Sum Sq Mean Sq F value Pr(>F)
classk 2 37201 18601 18.54 9.44e-09 ***
Residuals 5745 5764478 1003
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A look at the anova residuals
#qqplot
qqnorm(residuals(model1),ylab="Reading Scaled Score")
qqline(residuals(model1),ylab="Reading Scaled Score")
qqplot shows that ANOVA residuals deviate from the normal qqline
#Fitted Y vs. Residuals
plot(fitted(model1), residuals(model1))
Fitted Y vs. Residuals shows converging trend in the residuals, can test with a Shapiro-Wilk test just to be sure
shapiro.test(treadssk[1:5000]) #shapiro.test contrained to sample sizes between 3 and 5000
Shapiro-Wilk normality test
data: treadssk[1:5000]
W = 0.92256, p-value < 2.2e-16
Just confirms that we aren't going to be able to assume a normal distribution.
We can use bootstrap to estimate the true F-dist.
#Bootstrap version (with 10,000 iterations)
mean_read = mean(treadssk)
grpA = treadssk[classk=="regular"] - mean_read[1]
grpB = treadssk[classk=="small.class"] - mean_read[2]
grpC = treadssk[classk=="regular.with.aide"] - mean_read[3]
sim_classk <- classk
R = 10000
sim_Fstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=2000, replace=T)
groupB = sample(grpB, size=1733, replace=T)
groupC = sample(grpC, size=2015, replace=T)
sim_score = c(groupA,groupB,groupC)
sim_data = data.frame(sim_score,sim_classk)
}
Now we need to get the set of unique pairs of the Group factor
allPairs <- expand.grid(levels(sim_data$sim_classk), levels(sim_data$sim_classk))
## http://stackoverflow.com/questions/28574006/unique-combination-of-two-columns-in-r/28574136#28574136
allPairs <- unique(t(apply(allPairs, 1, sort)))
allPairs <- allPairs[ allPairs[,1] != allPairs[,2], ]
allPairs
[,1] [,2]
[1,] "regular" "small.class"
[2,] "regular" "regular.with.aide"
[3,] "regular.with.aide" "small.class"
Since oneway.test() applies a Welch correction by default, we can use that on our simulated data.
allResults <- apply(allPairs, 1, function(p) {
#http://stackoverflow.com/questions/28587498/post-hoc-tests-for-one-way-anova-with-welchs-correction-in-r
dat <- sim_data[sim_data$sim_classk %in% p, ]
ret <- oneway.test(sim_score ~ sim_classk, data = sim_data, na.action = na.omit)
ret$sim_classk <- p
ret
})
length(allResults)
[1] 3
allResults[[1]]
One-way analysis of means (not assuming equal variances)
data: sim_score and sim_classk
F = 1.7741, num df = 2.0, denom df = 1305.9, p-value = 0.170