I've just run an ANOVA (aov) in R between 3 groups: Group 1, 2, 3.
After running TukeyHSD for my model, my comparisons are listed in the order of groups:
Can this be changed so that it is as follows:
Using relevel will not work since you don't want to change the level order, just the labels. First we need some reproducible data:
SL <- iris$Sepal.Length
Sp <- as.factor(as.numeric(iris$Species))
iris.aov <- aov(SL~Sp)
iris.mc <- TukeyHSD(iris.aov)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
# Fit: aov(formula = SL ~ Sp)
# $Sp
# diff lwr upr p adj
# 2-1 0.930 0.6862273 1.1737727 0
# 3-1 1.582 1.3382273 1.8257727 0
# 3-2 0.652 0.4082273 0.8957727 0
Now to switch the labels we use expand.grid to create ones with the first and second group id switched:
ngroups <- 3
Grps <- expand.grid(seq(ngroups), seq(ngroups))
Grps <- Grps[Grps$Var1 < Grps$Var2,] # Unique groups
newlbls <- unname(apply(Grps, 1, paste0, collapse="-"))
dimnames(iris.mc$Sp)[[1]] <- newlbls
# Tukey multiple comparisons of means
# 95% family-wise confidence level
# Fit: aov(formula = SL ~ Sp)
# $Sp
# diff lwr upr p adj
# 1-2 0.930 0.6862273 1.1737727 0
# 1-3 1.582 1.3382273 1.8257727 0
# 2-3 0.652 0.4082273 0.8957727 0
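Relabelling alone leaves the values untouched, so the row now called 1-2 still holds the mean of group 2 minus the mean of group 1. If the reversed label should also carry the reversed difference, the estimate has to be negated and the interval bounds swapped. A minimal sketch, assuming level names that contain no hyphen (`flipped` is a name chosen here for illustration):

```r
SL <- iris$Sepal.Length
Sp <- as.factor(as.numeric(iris$Species))
iris.mc <- TukeyHSD(aov(SL ~ Sp))

flipped <- iris.mc$Sp
# Reverse each label, e.g. "2-1" becomes "1-2" (assumes no "-" inside level names)
rownames(flipped) <- sapply(strsplit(rownames(flipped), "-"),
                            function(p) paste(rev(p), collapse = "-"))
# Reversing a comparison negates the difference and swaps the interval bounds;
# the adjusted p-values stay the same
flipped[, c("diff", "lwr", "upr")] <- -flipped[, c("diff", "upr", "lwr")]
flipped
```

The adjusted p-values do not depend on the direction of the comparison, so the last column is left as it is.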
I am trying to run a correlation test on different data frames representing the number of unique stores an employee is assigned, and on columns representing different regions, simultaneously. My data frame is split by the number of unique stores each employee has by:
unique_store_breakdown <- split(Data, as.factor(Data$unique_stores))
Ideally I would like the output:
Region -- unique_store -- correlation
Midwest ------- 1 -------------- .05
Midwest ------- 2 -------------- .04
Southeast ----- 1 ------------- 0.75
cor_tests <-list()
counter = 0
for (i in unique(j$region)){
  for (j in 1: length(unique_store_breakdown)){
    counter = counter + 1
    #Create new variables for correlation test
    x = as.numeric(j[j$region == i,]$quality)
    y = as.numeric(j[j$region == i,]$rsv)
    cor_tests[[counter]] <- cor.test(x,y)
  }
}
I am able to run this for one data frame at a time, but when I try to add the nested loop (j term) I receive the error "Error: $ operator is invalid for atomic vectors". Additionally, I would like to output the results as a data frame rather than a list if possible.
If all you want to do is perform cor.test() for each store, it should be fairly simple using by(). The output from by() is a regular list, it's just the printing that is a little special.
# example data
dtf <- data.frame(store=rep(1:3, each=30), rsv=rnorm(90))
dtf$quality <- dtf$rsv + rnorm(90, 0, dtf$store)
# perform cor.test for every store
by(dtf, dtf$store, function(x) cor.test(x$quality, x$rsv))
# dtf$store: 1
# Pearson's product-moment correlation
# data: x$quality and x$rsv
# t = 5.5485, df = 28, p-value = 6.208e-06
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.4915547 0.8597796
# sample estimates:
# cor
# 0.7236681
# ------------------------------------------------------------------------------
# dtf$store: 2
# Pearson's product-moment correlation
# data: x$quality and x$rsv
# t = 0.68014, df = 28, p-value = 0.502
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.2439893 0.4663368
# sample estimates:
# cor
# 0.1274862
# ------------------------------------------------------------------------------
# dtf$store: 3
# Pearson's product-moment correlation
# data: x$quality and x$rsv
# t = 2.2899, df = 28, p-value = 0.02977
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.04304952 0.66261810
# sample estimates:
# cor
# 0.397159
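The error in the question comes from the loop variable: inside `for (j in 1:length(...))`, `j` is an integer index rather than a data frame, so `j$region` fails. To get one row per region and store count as a data frame, one option is to split on both variables and bind the results. This is a sketch on made-up data; the column names `region`, `unique_stores`, `quality` and `rsv` follow the question, and the values are simulated:

```r
set.seed(1)
# Simulated stand-in for the real data
dtf <- data.frame(region = rep(c("Midwest", "Southeast"), each = 60),
                  unique_stores = rep(rep(1:2, each = 30), times = 2),
                  rsv = rnorm(120))
dtf$quality <- dtf$rsv + rnorm(120)

# One sub data frame per region/store combination (empty combinations dropped)
groups <- split(dtf, list(dtf$region, dtf$unique_stores), drop = TRUE)

# Run cor.test on each piece and keep the pieces of interest as one row
res <- do.call(rbind, lapply(groups, function(d) {
  ct <- cor.test(d$quality, d$rsv)
  data.frame(region = d$region[1],
             unique_stores = d$unique_stores[1],
             correlation = unname(ct$estimate),
             p.value = ct$p.value)
}))
rownames(res) <- NULL
res
```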
I'm using predict.lm(fit, newdata=newdata, interval="prediction") to get predictions and their prediction intervals (PI) for new observations. Now I would like to aggregate (sum and mean) these predictions and their PI's based on an additional variable (i.e. a spatial aggregation on the zip code level of predictions for single households).
I learned from StackExchange, that you cannot aggregate the prediction intervals of single predictions just by aggregating the limits of the prediction intervals. The post is very helpful to understand why this can't be done, but I have a hard time translating this bit into actual code. The answer reads:
Here's a reproducible example:
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit regression model
fit1 <- lm(Petal.Width ~ Petal.Length, data=train)
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
#Predict Petal.Width for new data incl prediction intervals for each prediction
predictions1<-predict(fit1, newdata=pred, interval="prediction")
predictions2<-predict(fit2, newdata=pred, interval="prediction")
# Aggregate data by summing predictions for species
#NOT correct for prediction intervals
predictions_agg1<-data.frame(predictions1,Species=pred$Species) %>%
group_by(Species) %>%
predictions_agg2<-data.frame(predictions2,Species=pred$Species) %>%
group_by(Species) %>%
I couldn't find a good tutorial or package which describes how to properly aggregate predictions and their PI's in R when using predict.lm(). Is there something out there? Would highly appreciate if you could point me in the right direction on how to do this in R.
Your question is closely related to a thread I answered 2 years ago: linear model with `lm`: how to get prediction variance of sum of predicted values. It provides an R implementation of Glen_b's answer on Cross Validated. Thanks for quoting that Cross Validated thread; I didn't know it; perhaps I can leave a comment there linking the Stack Overflow thread.
I have polished my original answer, wrapping up line-by-line code cleanly into easy-to-use functions lm_predict and agg_pred. Solving your question is then simplified to applying those functions by group.
Consider the iris example in your question, and the second model fit2 for demonstration.
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
We split pred by group Species, then apply lm_predict (with diag = FALSE) on all sub data frames.
oo <- lapply(split(pred, pred$Species), lm_predict, lmObject = fit2, diag = FALSE)
To use agg_pred we need to specify a weight vector whose length equals the number of observations in the group. We can determine this from the length of fit in each oo[[i]]:
n <- lengths(lapply(oo, "[[", 1))
#setosa versicolor virginica
# 11 13 14
If the aggregation operation is a sum, we do
w <- lapply(n, rep.int, x = 1)
#List of 3
# $ setosa : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
# $ versicolor: num [1:13] 1 1 1 1 1 1 1 1 1 1 ...
# $ virginica : num [1:14] 1 1 1 1 1 1 1 1 1 1 ...
SUM <- Map(agg_pred, w, oo)
SUM[[1]] ## result for the first group, for example
#[1] 2.499728
#[1] 0.1271554
# lower upper
#1.792908 3.206549
# lower upper
#0.999764 3.999693
sapply(SUM, "[[", "CI") ## some nice presentation for CI, for example
# setosa versicolor virginica
#lower 1.792908 16.41526 26.55839
#upper 3.206549 17.63953 28.10812
If the aggregation operation is an average, we rescale w by n and call agg_pred.
w <- mapply("/", w, n, SIMPLIFY = FALSE) ## keep a list even if all groups have equal size
#List of 3
# $ setosa : num [1:11] 0.0909 0.0909 0.0909 0.0909 0.0909 ...
# $ versicolor: num [1:13] 0.0769 0.0769 0.0769 0.0769 0.0769 ...
# $ virginica : num [1:14] 0.0714 0.0714 0.0714 0.0714 0.0714 ...
AVE <- Map(agg_pred, w, oo)
AVE[[2]] ## result for the second group, for example
#[1] 1.3098
#[1] 0.0005643196
# lower upper
#1.262712 1.356887
# lower upper
#1.189562 1.430037
sapply(AVE, "[[", "PI") ## some nice presentation for PI, for example
# setosa versicolor virginica
#lower 0.09088764 1.189562 1.832255
#upper 0.36360845 1.430037 2.072496
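The functions lm_predict and agg_pred are defined in the linked thread and not reproduced here. The computation behind the sum and average cases can be sketched as follows. This is not the code from that thread: `agg_interval_sketch` is a hypothetical helper, it assumes a full-rank model with independent errors, and it uses a deterministic train/prediction split instead of a random one so that the result is reproducible.

```r
# Deterministic split (an assumption for reproducibility): every 4th row is held out
pred_ind <- seq(4, nrow(iris), by = 4)
train <- iris[-pred_ind, ]
pred <- iris[pred_ind, ]
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data = train)

# Hypothetical helper: interval for the weighted sum w'y of new observations
agg_interval_sketch <- function(fit, newdata, w, level = 0.95) {
  X <- model.matrix(delete.response(terms(fit)), newdata)  # design matrix of new data
  est <- sum(w * drop(X %*% coef(fit)))                     # weighted sum of predictions
  V <- X %*% vcov(fit) %*% t(X)                             # covariance of the fitted values
  var_ci <- drop(t(w) %*% V %*% w)                          # variance of the aggregated mean
  var_pi <- var_ci + sigma(fit)^2 * sum(w^2)                # add independent residual noise
  q <- qt(1 - (1 - level) / 2, df.residual(fit))
  list(mean = est,
       CI = est + c(lower = -1, upper = 1) * q * sqrt(var_ci),
       PI = est + c(lower = -1, upper = 1) * q * sqrt(var_pi))
}

by_species <- split(pred, pred$Species)
# Weights of 1 give the sum, weights of 1/n give the average
sums <- lapply(by_species, function(d) agg_interval_sketch(fit2, d, rep(1, nrow(d))))
avgs <- lapply(by_species, function(d) agg_interval_sketch(fit2, d, rep(1 / nrow(d), nrow(d))))
sapply(sums, "[[", "PI")
```

The off-diagonal terms of `V` are what summing the individual interval limits ignores: predictions from the same fitted model share the estimation error in the coefficients, so they are correlated.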
This is great! Thank you so much! There is one thing I forgot to mention: in my actual application I need to sum ~300,000 predictions, which would create a full variance-covariance matrix of about 700 GB. Do you know of a computationally more efficient way to get directly to the sum of the variance-covariance matrix?
Use the fast_agg_pred function provided in the revision of the original Q & A. Let's start it all over.
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
## list of new data
newdatlist <- split(pred, pred$Species)
n <- sapply(newdatlist, nrow)
#setosa versicolor virginica
# 11 13 14
If the aggregation operation is a sum, we do
w <- lapply(n, rep.int, x = 1)
SUM <- mapply(fast_agg_pred, w, newdatlist,
              MoreArgs = list(lmObject = fit2, alpha = 0.95),
              SIMPLIFY = FALSE)
If the aggregation operation is an average, we do
w <- mapply("/", w, n, SIMPLIFY = FALSE)
AVE <- mapply(fast_agg_pred, w, newdatlist,
              MoreArgs = list(lmObject = fit2, alpha = 0.95),
              SIMPLIFY = FALSE)
Note that we can't use Map in this case as we need to provide more arguments to fast_agg_pred. Use mapply in this situation, with MoreArgs and SIMPLIFY.
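fast_agg_pred itself lives in the linked Q & A. The reason a fast version is possible can be sketched independently: the variance of the weighted sum is w'XVX'w, which equals (X'w)'V(X'w), so only the p-vector X'w is needed and the n-by-n matrix never has to be formed. The helper below is hypothetical (not the linked implementation) and assumes a full-rank model with independent errors and a deterministic split of iris.

```r
# Deterministic split (an assumption for reproducibility): every 4th row is held out
pred_ind <- seq(4, nrow(iris), by = 4)
train <- iris[-pred_ind, ]
pred <- iris[pred_ind, ]
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data = train)

# Hypothetical helper: same interval as the full-matrix approach, in O(n p) memory
fast_agg_interval_sketch <- function(fit, newdata, w, level = 0.95) {
  X <- model.matrix(delete.response(terms(fit)), newdata)
  xw <- drop(crossprod(X, w))                        # X'w, a vector of length p
  est <- sum(xw * coef(fit))                         # equals w' X beta-hat
  var_ci <- drop(crossprod(xw, vcov(fit) %*% xw))    # (X'w)' V (X'w)
  var_pi <- var_ci + sigma(fit)^2 * sum(w^2)         # add independent residual noise
  q <- qt(1 - (1 - level) / 2, df.residual(fit))
  list(mean = est, var = var_ci,
       CI = est + c(lower = -1, upper = 1) * q * sqrt(var_ci),
       PI = est + c(lower = -1, upper = 1) * q * sqrt(var_pi))
}

newdatlist <- split(pred, pred$Species)
SUM_sketch <- lapply(newdatlist, function(d) fast_agg_interval_sketch(fit2, d, rep(1, nrow(d))))
sapply(SUM_sketch, "[[", "PI")
```

With 300,000 rows and a handful of predictors, the largest object here is the design matrix itself.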
I have the following data (dat)
V W   X Y   Z
1 8  89 3 900
1 8 100 2 800
0 9 333 4 980
0 9 560 1 999
I wish to perform Tukey's HSD pairwise test on the above data set.
dat1 <- gather(dat) #convert to long form
pairwise.t.test(dat1$key, dat1$value, p.adj = "holm")
However, every time I try to run it, it keeps running and does not yield an output. Any suggestions on how to correct this?
I would also like to perform the same test using the function TukeyHSD(). However, whether I use the wide or the long format, I run into an error that says
"Error in UseMethod("TukeyHSD") :
no applicable method for 'TukeyHSD' applied to an object of class "data.frame""
'x' needs to be dat1$value. Since the arguments are not named, the first is taken as 'x' and the second as 'g':
pairwise.t.test( dat1$value, dat1$key, p.adj = "holm")
#data: dat1$value and dat1$key
# V W X Y
#W 1.000 - - -
#X 0.018 0.018 - -
#Y 1.000 1.000 0.018 -
#Z 4.1e-08 4.1e-08 2.8e-06 4.1e-08
#P value adjustment method: holm
Or we name the arguments and pass them in any order we want:
pairwise.t.test(g = dat1$key, x= dat1$value, p.adj = "holm")
Regarding TukeyHSD, it needs a fitted aov model rather than a data frame:
TukeyHSD(aov(value~key, data = dat1), ordered = TRUE)
#Tukey multiple comparisons of means
# 95% family-wise confidence level
# factor levels have been ordered
#Fit: aov(formula = value ~ key, data = dat1)
# diff lwr upr p adj
#Y-V 2.00 -233.42378 237.4238 0.9999999
#W-V 8.00 -227.42378 243.4238 0.9999691
#X-V 270.00 34.57622 505.4238 0.0211466
#Z-V 919.25 683.82622 1154.6738 0.0000000
#W-Y 6.00 -229.42378 241.4238 0.9999902
#X-Y 268.00 32.57622 503.4238 0.0222406
#Z-Y 917.25 681.82622 1152.6738 0.0000000
#X-W 262.00 26.57622 497.4238 0.0258644
#Z-W 911.25 675.82622 1146.6738 0.0000000
#Z-X 649.25 413.82622 884.6738 0.0000034
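For a version that runs from scratch without tidyr, the data can be rebuilt and reshaped with base R's stack(). This is a sketch that assumes the five columns are named V, W, X, Y and Z, as the group labels in the output above suggest:

```r
# Rebuild the wide data (column names V..Z are inferred from the output above)
dat <- data.frame(V = c(1, 1, 0, 0),
                  W = c(8, 8, 9, 9),
                  X = c(89, 100, 333, 560),
                  Y = c(3, 2, 4, 1),
                  Z = c(900, 800, 980, 999))

# Long form: stack() returns the columns 'values' and 'ind'
dat1 <- stack(dat)
names(dat1) <- c("value", "key")

# Numeric response first, grouping factor second
pw <- pairwise.t.test(x = dat1$value, g = dat1$key, p.adjust.method = "holm")

# TukeyHSD works on the fitted aov object, not on the data frame
tk <- TukeyHSD(aov(value ~ key, data = dat1))
tk$key["Z-V", ]
```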
I'm doing some exploring with the same data and I'm trying to highlight the within-group variance versus the between-group variance. I have been able to show that the between-group variance is very strong; however, the nature of the data should show weak within-group variance (my Shapiro-Wilk normality test suggests this). I believe that if I do some re-sampling with a Welch correction, this might be the case.
I was wondering if someone knew whether there is a re-sampling based ANOVA with a Welch correction in R. I see there is an R implementation of the permutation test, but with no correction. If not, how would I code the test directly while using this implementation?
Here is the outline for my basic between group ANOVA:
fit <- lm(formula = data$Boys ~ data$GroupofBoys)
I believe you're correct that there isn't an easy way to do a Welch-corrected ANOVA with resampling, but it should be possible to cobble a few things together to make it work.
I'll use the "Star" dataset from the "Ecdat" package, which looks at the effects of small class sizes on standardized test scores.
   tmathssk treadssk            classk totexpk  sex freelunk  race schidkn
2       473      447       small.class       7 girl       no white      63
3       536      450       small.class      21 girl       no black      20
5       463      439 regular.with.aide       0  boy      yes black      19
11      559      448           regular      16  boy       no white      69
12      489      447       small.class       5  boy      yes white      79
13      454      431           regular       8  boy      yes white       5
Some exploratory analysis:
boxplot(treadssk ~ classk, ylab="Total Reading Scaled Score")
title("Reading Scores by Class Size")
hist(treadssk, xlab="Total Reading Scaled Score")
Run regular anova
model1 = aov(treadssk ~ classk, data = star)
              Df  Sum Sq Mean Sq F value   Pr(>F)
classk         2   37201   18601   18.54 9.44e-09 ***
Residuals   5745 5764478    1003
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A look at the anova residuals
qqnorm(residuals(model1),ylab="Reading Scaled Score")
qqline(residuals(model1),ylab="Reading Scaled Score")
qqplot shows that ANOVA residuals deviate from the normal qqline
#Fitted Y vs. Residuals
plot(fitted(model1), residuals(model1))
The fitted-versus-residuals plot shows a converging trend in the residuals; we can run a Shapiro-Wilk test just to be sure
shapiro.test(treadssk[1:5000]) #shapiro.test is constrained to sample sizes between 3 and 5000
Shapiro-Wilk normality test
data: treadssk[1:5000]
W = 0.92256, p-value < 2.2e-16
Just confirms that we aren't going to be able to assume a normal distribution.
We can use bootstrap to estimate the true F-dist.
#Bootstrap version (with 10,000 iterations)
mean_read = tapply(treadssk, classk, mean) #one mean per class type, named by level
grpA = treadssk[classk=="regular"] - mean_read["regular"]
grpB = treadssk[classk=="small.class"] - mean_read["small.class"]
grpC = treadssk[classk=="regular.with.aide"] - mean_read["regular.with.aide"]
sim_classk <- classk
R = 10000
sim_Fstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=2000, replace=T)
groupB = sample(grpB, size=1733, replace=T)
groupC = sample(grpC, size=2015, replace=T)
sim_score = c(groupA,groupB,groupC)
sim_data = data.frame(sim_score,sim_classk)
Now we need to get the set of unique pairs of the Group factor
allPairs <- expand.grid(levels(sim_data$sim_classk), levels(sim_data$sim_classk))
## http://stackoverflow.com/questions/28574006/unique-combination-of-two-columns-in-r/28574136#28574136
allPairs <- unique(t(apply(allPairs, 1, sort)))
allPairs <- allPairs[ allPairs[,1] != allPairs[,2], ]
[,1] [,2]
[1,] "regular" "small.class"
[2,] "regular" "regular.with.aide"
[3,] "regular.with.aide" "small.class"
Since oneway.test() applies a Welch correction by default, we can use that on our simulated data.
allResults <- apply(allPairs, 1, function(p) {
  dat <- sim_data[sim_data$sim_classk %in% p, ]
  ret <- oneway.test(sim_score ~ sim_classk, data = sim_data, na.action = na.omit)
  ret$sim_classk <- p
  ret
})
[1] 3
One-way analysis of means (not assuming equal variances)
data: sim_score and sim_classk
F = 1.7741, num df = 2.0, denom df = 1305.9, p-value = 0.170
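Putting the pieces together, a bootstrap of the Welch F statistic can be sketched in a few lines. This is one possible arrangement rather than a standard routine: it centers each group at its own mean to impose the null hypothesis, resamples within groups so that each group keeps its size and spread, and compares the observed Welch F with the resampled ones. The data are simulated here so that the example runs without the Ecdat package.

```r
set.seed(42)
# Simulated example data: three groups with unequal sizes and variances
score <- c(rnorm(40, 50, 5), rnorm(60, 52, 10), rnorm(50, 55, 15))
group <- factor(rep(c("a", "b", "c"), times = c(40, 60, 50)))

# Observed Welch F statistic (oneway.test uses var.equal = FALSE by default)
F_obs <- unname(oneway.test(score ~ group)$statistic)

# Impose the null hypothesis: subtract each group's own mean
centered <- score - ave(score, group)

R <- 1000
F_star <- numeric(R)
for (i in seq_len(R)) {
  # Resample with replacement inside each group; labels stay aligned with scores
  resampled <- ave(centered, group, FUN = function(v) sample(v, replace = TRUE))
  F_star[i] <- oneway.test(resampled ~ group)$statistic
}

# Bootstrap p-value: share of resampled statistics at least as large as the observed one
p_boot <- mean(F_star >= F_obs)
p_boot
```

Using ave() for the resampling keeps every simulated score next to its own group label, which matters because the Welch statistic depends on the per-group variances.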
I would like to perform the equivalent of TukeyHSD for a rank-based median-shift test such as Kruskal-Wallis.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
## Fit: aov(formula = X[, 2] ~ factor(X[, 1]))
## $`factor(X[, 1])`
## diff lwr upr p adj
## 2-1 1.25 -5.927068 8.427068 0.8794664
## 4-1 -1.35 -7.653691 4.953691 0.8246844
## 4-2 -2.60 -9.462589 4.262589 0.5617125
## Kruskal-Wallis rank sum test
## data: X[, 2] by factor(X[, 1])
## Kruskal-Wallis chi-squared = 1.7325, df = 2, p-value = 0.4205
I would like now to analyze the contrasts. Please help. Thanks.
If you want to do multiple comparisons after a Kruskal-Wallis test, you need the kruskalmc function from the pgirmess package. Before you can implement this function, you will need to transform your matrix to a dataframe. In your example:
# convert matrix to dataframe
dfx <- as.data.frame(X)
# the Kruskal-Wallis test & output
kruskal.test(dfx$V2 ~ factor(dfx$V1))
Kruskal-Wallis rank sum test
data: dfx$V2 by factor(dfx$V1)
Kruskal-Wallis chi-squared = 1.7325, df = 2, p-value = 0.4205
# the post-hoc tests & output
library(pgirmess)
kruskalmc(V2~factor(V1), data = dfx)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
obs.dif critical.dif difference
1-2 1.75 6.592506 FALSE
1-4 1.65 5.790265 FALSE
2-4 3.40 6.303642 FALSE
If you want a compact letter display similar to what is output from TukeyHSD, for the Kruskal test, the agricolae package provides it with the function kruskal. Using your own data:
print( kruskal(X[, 2], factor(X[, 1]), group=TRUE, p.adj="bonferroni") )
#### ...
#### $groups
#### trt means M
#### 1 2 8.50 a
#### 2 1 6.75 a
#### 3 4 5.10 a
(Well, in this example the groups are not considered different, the same result as in the other answer.)
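If installing an extra package is not an option, base R offers a different post-hoc procedure: pairwise Wilcoxon rank-sum tests with a multiplicity adjustment via pairwise.wilcox.test. This is not the same method as kruskalmc or agricolae's kruskal, so the results can differ. A minimal sketch on simulated data, since the matrix X from the question is not shown; only the group labels 1, 2 and 4 are taken from it:

```r
set.seed(7)
# Simulated stand-in for X: a group column and a response column
g <- factor(rep(c(1, 2, 4), times = c(4, 3, 5)))
y <- rnorm(length(g), mean = 6, sd = 3)

# Overall test
kw <- kruskal.test(y ~ g)

# Pairwise rank-sum tests, Holm-adjusted; exact = FALSE is passed on to wilcox.test
pw <- pairwise.wilcox.test(y, g, p.adjust.method = "holm", exact = FALSE)
pw$p.value
```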