I am using the oranges data provided with lsmeans.
library(lsmeans)
oranges.rg1<-lm(sales1 ~ price1 + price2 + day + store, data = oranges)
days.lsm <- lsmeans(oranges.rg1, "day")
days_contr.lsm <- contrast(days.lsm, "trt.vs.ctrl", ref = c(5,6))
The confidence intervals can be visualized with plot(contrast(days.lsm, "trt.vs.ctrl", ref = c(5,6))), but they are not shown when printing days_contr.lsm
> days_contr.lsm
contrast estimate SE df t.ratio p.value
1 - avg(5,6) -7.8538769 2.194243 23 -3.579 0.0058
2 - avg(5,6) -6.9234858 2.127341 23 -3.255 0.0125
3 - avg(5,6) 0.2462789 2.155529 23 0.114 0.9979
4 - avg(5,6) -4.6760034 2.110761 23 -2.215 0.1184
How can I extract the confidence intervals to a data.frame?
> days_contr.lsm
contrast estimate SE df t.ratio p.value lower.CL upper.CL
1 - avg(5,6) -7.8538769 2.194243 23 -3.579 0.0058 ? ?
2 - avg(5,6) -6.9234858 2.127341 23 -3.255 0.0125 ? ?
3 - avg(5,6) 0.2462789 2.155529 23 0.114 0.9979 ? ?
4 - avg(5,6) -4.6760034 2.110761 23 -2.215 0.1184 ? ?
confint(contrast(days.lsm, "trt.vs.ctrl", ref = c(5,6))) worked fine
At risk of beating a dead horse, I feel that the main point of the question is getting the confidence intervals, given that what is seen in days_contr.lsm is only the t ratios and P values.
This happened because the default method for summarizing contrast() results is to show tests and not CIs, whereas the default method for summarizing emmeans() results is to show CIs and not tests. The infer argument of summary.emmGrid() controls what you see. Thus, you can get both CIs and tests using
summary(days_contr.lsm, infer = c(TRUE, TRUE))
and this would fill in the question marks in the OP. The summary() result, by the way, is of class c("summary_emm", "data.frame"); it is a data.frame with a special print method that often shows some additional annotations.
There are additional emmGrid methods confint() and test() that run summary() with infer = c(TRUE, FALSE) and infer = c(FALSE, TRUE) respectively (though both have additional capabilities). The as.data.frame() method is just as.data.frame(summary(...)). For details, see the help page for emmeans::summary.emmGrid.
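As a minimal sketch of the approach described above, the following pulls the estimates, tests, and confidence limits into a plain data.frame. It assumes the emmeans package (which also ships the oranges data); the object names days.emm, days_contr, and ci_df are only illustrative.

```r
library(emmeans)

# Same model as in the question, refit with emmeans instead of lsmeans
oranges.rg1 <- lm(sales1 ~ price1 + price2 + day + store, data = oranges)
days.emm <- emmeans(oranges.rg1, "day")
days_contr <- contrast(days.emm, "trt.vs.ctrl", ref = c(5, 6))

# infer = c(TRUE, TRUE) requests both confidence intervals and tests
ci_df <- as.data.frame(summary(days_contr, infer = c(TRUE, TRUE)))

# The limits are now ordinary columns
ci_df[, c("contrast", "estimate", "lower.CL", "upper.CL", "p.value")]
```

Using confint(days_contr) instead would return the same limits without the t ratios and p-values.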
Related
I originally posted this on Cross Validated, but I think it might be more appropriate for SO since it's purely about software syntax.
This is a follow-up question to this post. I ran a multinomial logistic regression examining the difference in log-odds of respondents indicating they treated a range of different medical conditions (pain, sleep, mental-health/substance use (mhsu) and all other conditions (allOther)) with either licit or illicit medical cannabis.
Here is the toy data
df <- tibble(mcType = factor(rep(c("licit", "illicit"),
times = c(534,1207))),
cond = factor(c(rep(c("pain","mhsu","allOther","sleep"),
times = c(280,141,82,31)),
rep(c("pain","mhsu","allOther","sleep"),
times = c(491,360,208,148))),
levels = c("pain","sleep","mhsu","allOther")))
And the proportions of each type of condition reported for each type of cannabis
mcType cond n tot perc
<fct> <fct> <int> <int> <dbl>
1 illicit pain 491 1207 40.7
2 illicit sleep 148 1207 12.3
3 illicit mhsu 360 1207 29.8
4 illicit allOther 208 1207 17.2
5 licit pain 280 534 52.4
6 licit sleep 31 534 5.81
7 licit mhsu 141 534 26.4
8 licit allOther 82 534 15.4
To see whether there were differences in the relative proportion of respondents indicating each type of condition based on the type of cannabis they report using I ran a multinomial logistic regression using multinom() in the nnet package. Output below,
library(nnet)
summary(mm <- multinom(cond ~ mcType,
data = df))
# output
Coefficients:
(Intercept) mcTypelicit
sleep -1.1992431 -1.0014884
mhsu -0.3103369 -0.3756443
allOther -0.8589398 -0.3691759
Std. Errors:
(Intercept) mcTypelicit
sleep 0.09377333 0.2112368
mhsu 0.06938587 0.1244098
allOther 0.08273132 0.1503720
Residual Deviance: 4327.814
AIC: 4339.814
Then I ran tests of simple effects, using the emmeans package. In this blog post the author suggests that the emmeans package applies error correction by default, but that you can control this via the adjust = argument.
# testing effect of mc type at each level of condition. first create emmeans object
library(emmeans)
(em_mcTypeByCond <- emmeans(object = mm,
specs = ~mcType|cond,
adjust = "bonferroni"))
# output
cond = pain:
mcType prob SE df lower.CL upper.CL
illicit 0.4068 0.01414 6 0.3648 0.4488
licit 0.5243 0.02161 6 0.4602 0.5885
cond = sleep:
mcType prob SE df lower.CL upper.CL
illicit 0.1226 0.00944 6 0.0946 0.1506
licit 0.0581 0.01012 6 0.0280 0.0881
cond = mhsu:
mcType prob SE df lower.CL upper.CL
illicit 0.2983 0.01317 6 0.2592 0.3374
licit 0.2641 0.01908 6 0.2074 0.3207
cond = allOther:
mcType prob SE df lower.CL upper.CL
illicit 0.1723 0.01087 6 0.1401 0.2046
licit 0.1535 0.01560 6 0.1072 0.1999
Confidence level used: 0.95
Conf-level adjustment: bonferroni method for 2 estimates
The problem is that I don't seem to be able to choose any other method of error correction (e.g. "BH", "fdr", "westfall", "holm"). I am not sure if it is because I am applying the correction at the wrong step, i.e. before I apply any tests.
So I tried applying the adjust argument within the pairs() function (testing the difference in probability of each condition between the two types of cannabis)
(mcTypeByCond_test <- pairs(em_mcTypeByCond,
adjust = "bonferroni"))
cond = pain:
contrast estimate SE df t.ratio p.value
illicit - licit -0.1175 0.0258 6 -4.551 0.0039
cond = sleep:
contrast estimate SE df t.ratio p.value
illicit - licit 0.0646 0.0138 6 4.665 0.0034
cond = mhsu:
contrast estimate SE df t.ratio p.value
illicit - licit 0.0342 0.0232 6 1.476 0.1905
cond = allOther:
contrast estimate SE df t.ratio p.value
illicit - licit 0.0188 0.0190 6 0.987 0.3616
But as you can see this does not provide any message telling me what type of error correction was applied (I assume none, and tried several different methods). Also I want to control error across all four pairwise comparisons.
So I need to know how and at what stage I need to make the arguments specifying adjustment of p-values.
Any help much appreciated
P-value adjustments are applied to each by group, and there is only one comparison - hence no multiplicity - in each group. And no annotation about adjustments is shown when no adjustments are made.
To apply an adjustment to all the results, you need to remove the by variable from consideration when displaying the results:
summary(pairs(...), by = NULL, adjust = "bonf")
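A small sketch of this, using the built-in warpbreaks data as a stand-in for the multinomial model in the question (the model and object names here are assumptions for illustration only). Each by group holds a single comparison, so the adjustment only has an effect once by = NULL pools the three tests into one family.

```r
library(emmeans)

# Stand-in model: one pairwise comparison (wool A vs. B) per tension level
fit <- lm(breaks ~ wool * tension, data = warpbreaks)
emm <- emmeans(fit, ~ wool | tension)
prs <- pairs(emm)

# Treat all three comparisons as one family, with and without adjustment
raw <- as.data.frame(summary(prs, by = NULL, adjust = "none"))
adj <- as.data.frame(summary(prs, by = NULL, adjust = "holm"))

cbind(raw[, c("contrast", "tension")],
      p.raw = raw$p.value, p.holm = adj$p.value)
```

Printing summary(prs, by = NULL, adjust = "holm") directly also shows the annotation naming the adjustment method and the number of tests.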
I am using emmeans to conduct a contrast of a contrast (i.e., testing for an interaction effect through 1st/2nd differences).
It involves 3 steps:
estimate means using “emmeans”
estimate if there is a difference in means (1st difference) using “pairs”
estimate if there is a difference in the difference (2nd difference) using ????
While I can execute steps 1 and 2 (see the reprex below with fictitious data), I'm stuck on step 3. Any tips?
(the contrast of a contrast shown in the vignette here is for alternative functional forms, which is somewhat different than what I want to test)
suppressPackageStartupMessages({
library(emmeans)})
# create ex. data set. 1 row per respondent (dataset shows 2 resp).
cedata.1 <- data.frame( id = c(1,1,1,1,1,1,2,2,2,2,2,2),
QES = c(1,1,2,2,3,3,1,1,2,2,3,3), # Choice set
Alt = c(1,2,1,2,1,2,1,2,1,2,1,2), # Alt 1 or Alt 2 in choice set
Choice = c(0,1,1,0,1,0,0,1,0,1,0,1), # Dep variable. if Chosen (1) or not (0)
LOC = c(0,0,1,1,0,1,0,1,1,0,0,1), # Indep variable per Choice set, binary categorical
SIZE = c(1,1,1,0,0,1,0,0,1,1,0,1), # Indep variable per Choice set, binary categorical
gender = c(1,1,1,1,1,1,0,0,0,0,0,0) # Indep variable per individual, binary categorical
)
# estimate model
glm.model <- glm(Choice ~ LOC*SIZE, data=cedata.1, family = binomial(link = "logit"))
# estimate means (i.e., values used to calc 1st diff).
comp1.loc.size <- emmeans(glm.model, ~ LOC * SIZE)
# calculate 1st diff (and p value)
pairs(comp1.loc.size, simple = "SIZE") # gives result I want
#> LOC = 0:
#> contrast estimate SE df z.ratio p.value
#> 0 - 1 -1.39 1.73 Inf -0.800 0.4235
#>
#> LOC = 1:
#> contrast estimate SE df z.ratio p.value
#> 0 - 1 0.00 1.73 Inf 0.000 1.0000
#>
#> Results are given on the log odds ratio (not the response) scale.
# calculate 2nd diff (and p value)
# ** the following gives the relevant values for doing the 2nd diff comparison (i.e., -1.39 and 0.00)...but how to make the statistical comparison?
pairs(comp1.loc.size, simple = "SIZE")
#> LOC = 0:
#> contrast estimate SE df z.ratio p.value
#> 0 - 1 -1.39 1.73 Inf -0.800 0.4235
#>
#> LOC = 1:
#> contrast estimate SE df z.ratio p.value
#> 0 - 1 0.00 1.73 Inf 0.000 1.0000
#>
#> Results are given on the log odds ratio (not the response) scale.
pairs(pairs(comp1.loc.size, simple = "SIZE"), by = NULL)
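To see what the nested pairs() call computes, here is a sketch on a 2 x 2 design built from the built-in warpbreaks data (a stand-in chosen for illustration, not the choice data above). In a 2 x 2 layout the second difference should coincide with the interaction coefficient of the model, which gives a quick sanity check.

```r
library(emmeans)

# Reduce warpbreaks to a 2 x 2 design (wool A/B by tension L/H)
wb <- droplevels(subset(warpbreaks, tension != "M"))
fit <- lm(breaks ~ wool * tension, data = wb)

emm <- emmeans(fit, ~ wool * tension)

# 1st differences: L - H within each wool
first_diff <- pairs(emm, simple = "tension")

# 2nd difference: compare the two 1st differences with each other
second_diff <- pairs(first_diff, by = NULL)
dd <- as.data.frame(second_diff)
dd

# For comparison: the interaction coefficient from the model
coef(fit)["woolB:tensionH"]
```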
Another solution:
# estimate means (i.e., values used to calc 1st diff).
comp1.loc.size <- emmeans(glm.model, ~ LOC | SIZE)
# second difference:
pairs(pairs(emmeans::regrid(comp1.loc.size)), by = NULL)
PS: This solution is almost a copy of the solution here: Testing contrast of contrast (first/second difference) in outcome
I'm using emmeans to perform custom comparisons to a control group. The trt.vs.ctrl approach works perfectly for me if I'm only interested in comparing one factor, but then fails (or I fail) when I set the comparison to be more complicated (i.e., the control group is described by a specific combination of 2+ variables).
Example code below. Say that using the pigs data, I want to compare all diets to the low percent fish diet. Note how in the nd data frame, "fish" only has 9% associated with it. However, when I run emmeans, the function does not pick up on the nesting, and while the control is correct, the treatment groups also include various values of fish and percents. Which means that the p-value adjustment is wrong.
So the two approaches I can think of:
How do I make emmeans pick up on the nesting in this case, or
How do I do the dunnettx adjustment manually (=I can use adjustment "none", then pull out the tests I actually want, and adjust the p-value myself?).
library(emmeans)
library(dplyr)
pigs.lm <- lm(log(conc) ~ source + factor(percent), data = pigs)
nd <- expand.grid(source = levels(pigs$source), percent = unique(pigs$percent)) %>%
filter(percent == 9 | source != "fish")
emmeans(pigs.lm, trt.vs.ctrl ~ source + percent,
data = nd, covnest = TRUE, cov.reduce = FALSE)
Appreciate your help.
The suggestion to use include worked perfectly. Posting my code here in case anyone else has the same issue in the future.
library(emmeans)
library(dplyr)
library(tidyr)
pigs.lm <- lm(log(conc) ~ source + factor(percent), data = pigs)
nd <- expand.grid(source = levels(pigs$source), percent = unique(pigs$percent)) %>%
filter(percent == 9 | source != "fish")
ems <- emmeans(pigs.lm, trt.vs.ctrl ~ source + percent,
data = nd, covnest = TRUE, cov.reduce = FALSE)
# to identify which levels to exclude - in this case,
# I only want the low-percent fish to remain as the ref level
aux <- as.data.frame(ems[[1]]) %>%
mutate(ID = 1:n()) %>%
filter(!grepl("fish", source) | ID == 1)
emmeans(pigs.lm, trt.vs.ctrl ~ source + percent,
data = nd, covnest = TRUE, cov.reduce = FALSE, include = aux$ID)
I'm not totally clear on what you are trying to accomplish, but I don't think filtering the data is the solution.
If your goal is to compare the marginal means for source with the (fish, 9 percent) combination, you can do it by constructing two sets of emmeans, then subsetting and combining:
emm1 = emmeans(pigs.lm, "source")
emm2 = emmeans(pigs.lm, ~source*percent)
emm3 = emm2[1] + emm1 # or rbind(emm2[1], emm1)
Then you get
> confint(emm3, adjust ="none")
source percent emmean SE df lower.CL upper.CL
fish 9 3.22 0.0536 23 3.11 3.33
fish . 3.39 0.0367 23 3.32 3.47
soy . 3.67 0.0374 23 3.59 3.74
skim . 3.80 0.0394 23 3.72 3.88
Results are averaged over some or all of the levels of: percent
Results are given on the log (not the response) scale.
Confidence level used: 0.95
> contrast(emm3, "trt.vs.ctrl1")
contrast estimate SE df t.ratio p.value
fish,. - fish,9 0.174 0.0366 23 4.761 0.0002
soy,. - fish,9 0.447 0.0678 23 6.595 <.0001
skim,. - fish,9 0.576 0.0696 23 8.286 <.0001
Results are averaged over some or all of the levels of: percent
Results are given on the log (not the response) scale.
P value adjustment: dunnettx method for 3 tests
Another (much more tedious, more error-prone) way to do the same thing is to get the EMMs for the factor combinations, and then use custom contrasts:
> contrast(emm2, list(con1 = c(-3,0,0, 1,0,0, 1,0,0, 1,0,0)/4,
+ con2 = c(-4,1,0, 0,1,0, 0,1,0, 0,1,0)/4,
+ con3 = c(-4,0,1, 0,0,1, 0,0,1, 0,0,1)/4),
+ adjust = "mvt")
contrast estimate SE df t.ratio p.value
con1 0.174 0.0366 23 4.761 0.0002
con2 0.447 0.0678 23 6.595 <.0001
con3 0.576 0.0696 23 8.286 <.0001
Results are given on the log (not the response) scale.
P value adjustment: mvt method for 3 tests
(The mvt adjustment is the exact correction for which dunnettx is only an approximation. It doesn't default to mvt because it is computationally heavy for a large number of tests.)
In answer to the last part of the question, you may use exclude (or include) to focus on a subset of the levels; see ? pairwise.emmc.
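A sketch of the include route, assuming the same pigs.lm model as above (the helper names grid and keep are illustrative). It keeps (fish, 9) as the reference and drops the remaining fish combinations, so the multiplicity adjustment counts only the comparisons that are actually shown.

```r
library(emmeans)

pigs.lm <- lm(log(conc) ~ source + factor(percent), data = pigs)
emm2 <- emmeans(pigs.lm, ~ source * percent)

# Row 1 of the grid is (fish, 9); keep it plus every non-fish combination
grid <- as.data.frame(emm2)
keep <- which(grid$source != "fish" | grid$percent == 9)

# ref indexes the full set of levels, so ref = 1 is still (fish, 9)
con <- contrast(emm2, "trt.vs.ctrl", ref = 1, include = keep)
con
```

With exclude you would list the rows to drop instead of the rows to keep; the two arguments are alternatives and are not meant to be combined.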
I have the following data (dat)
V W X Y Z
1 8 89 3 900
1 8 100 2 800
0 9 333 4 980
0 9 560 1 999
I wish to perform Tukey's HSD pairwise test on the above data set.
library(tidyr) # gather() comes from tidyr, not reshape2
dat1 <- gather(dat) #convert to long form
pairwise.t.test(dat1$key, dat1$value, p.adj = "holm")
However, every time I try to run it, it keeps running and does not yield an output. Any suggestions on how to correct this?
I would also like to perform the same test using the function TukeyHSD(). However, whether I use the wide or the long format, I run into an error that says
Error in UseMethod("TukeyHSD") :
no applicable method for 'TukeyHSD' applied to an object of class "data.frame"
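The first argument of pairwise.t.test() is 'x' (the response) and the second is 'g' (the grouping factor). Because the arguments are not named in the call, they are matched by position, so dat1$value has to come first: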
pairwise.t.test( dat1$value, dat1$key, p.adj = "holm")
#data: dat1$value and dat1$key
# V W X Y
#W 1.000 - - -
#X 0.018 0.018 - -
#Y 1.000 1.000 0.018 -
#Z 4.1e-08 4.1e-08 2.8e-06 4.1e-08
#P value adjustment method: holm
Or we can name the arguments and then pass them in any order:
pairwise.t.test(g = dat1$key, x= dat1$value, p.adj = "holm")
Regarding TukeyHSD(), it needs a fitted aov model rather than a data.frame:
TukeyHSD(aov(value~key, data = dat1), ordered = TRUE)
#Tukey multiple comparisons of means
# 95% family-wise confidence level
# factor levels have been ordered
#Fit: aov(formula = value ~ key, data = dat1)
#$key
# diff lwr upr p adj
#Y-V 2.00 -233.42378 237.4238 0.9999999
#W-V 8.00 -227.42378 243.4238 0.9999691
#X-V 270.00 34.57622 505.4238 0.0211466
#Z-V 919.25 683.82622 1154.6738 0.0000000
#W-Y 6.00 -229.42378 241.4238 0.9999902
#X-Y 268.00 32.57622 503.4238 0.0222406
#Z-Y 917.25 681.82622 1152.6738 0.0000000
#X-W 262.00 26.57622 497.4238 0.0258644
#Z-W 911.25 675.82622 1146.6738 0.0000000
#Z-X 649.25 413.82622 884.6738 0.0000034
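For completeness, here is a self-contained sketch that rebuilds the example data and runs both tests using base R only. It swaps in stack() for gather(), which is an assumption of this sketch rather than what the answer above used; the columns are renamed so the later calls match.

```r
# Rebuild the wide data from the question
dat <- data.frame(V = c(1, 1, 0, 0),
                  W = c(8, 8, 9, 9),
                  X = c(89, 100, 333, 560),
                  Y = c(3, 2, 4, 1),
                  Z = c(900, 800, 980, 999))

# Long form: one row per observation, with a grouping factor
dat1 <- stack(dat)
names(dat1) <- c("value", "key")

# Pairwise t tests: x is the response, g is the grouping factor
res <- pairwise.t.test(x = dat1$value, g = dat1$key, p.adjust.method = "holm")

# Tukey's HSD works on a fitted aov model, not on the data.frame itself
tuk <- TukeyHSD(aov(value ~ key, data = dat1))
tuk
```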
I'm doing some exploring with the same data and I'm trying to highlight the within-group variance versus the between-group variance. I have been able to show that the between-group variance is very strong; however, the nature of the data should show weak within-group variance (my Shapiro-Wilk normality test shows this). I believe that if I do some resampling with a Welch correction, this might be the case.
I was wondering if someone knew if there was a re-sampling based anova with a Welch correction in R. I see there is an R implementation of the permutation test but with no correction. If not, how would I code the test directly while using this implementation.
http://finzi.psych.upenn.edu/library/lmPerm/html/aovp.html
Here is the outline for my basic between group ANOVA:
fit <- lm(formula = data$Boys ~ data$GroupofBoys)
anova(fit)
I believe you're correct that there isn't an easy way to do a Welch-corrected ANOVA with resampling, but it should be possible to cobble a few things together to make it work.
require('Ecdat')
I'll use the "Star" dataset from the "Ecdat" package, which looks at the effects of small class sizes on standardized test scores.
star<-Star
attach(star)
head(star)
tmathssk treadssk classk totexpk sex freelunk race schidkn
2 473 447 small.class 7 girl no white 63
3 536 450 small.class 21 girl no black 20
5 463 439 regular.with.aide 0 boy yes black 19
11 559 448 regular 16 boy no white 69
12 489 447 small.class 5 boy yes white 79
13 454 431 regular 8 boy yes white 5
Some exploratory analysis:
#boxplots
boxplot(treadssk ~ classk, ylab="Total Reading Scaled Score")
title("Reading Scores by Class Size")
#histograms
hist(treadssk, xlab="Total Reading Scaled Score")
Run regular anova
model1 = aov(treadssk ~ classk, data = star)
summary(model1)
Df Sum Sq Mean Sq F value Pr(>F)
classk 2 37201 18601 18.54 9.44e-09 ***
Residuals 5745 5764478 1003
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A look at the anova residuals
#qqplot
qqnorm(residuals(model1),ylab="Reading Scaled Score")
qqline(residuals(model1),ylab="Reading Scaled Score")
qqplot shows that ANOVA residuals deviate from the normal qqline
#Fitted Y vs. Residuals
plot(fitted(model1), residuals(model1))
The fitted Y vs. residuals plot shows a converging trend in the residuals; we can run a Shapiro-Wilk test just to be sure
shapiro.test(treadssk[1:5000]) #shapiro.test constrained to sample sizes between 3 and 5000
Shapiro-Wilk normality test
data: treadssk[1:5000]
W = 0.92256, p-value < 2.2e-16
Just confirms that we aren't going to be able to assume a normal distribution.
We can use bootstrap to estimate the true F-dist.
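#Bootstrap version (with 10,000 iterations)
# center each group at its own mean so that resampling happens under the null
mean_read = sapply(split(treadssk, classk), mean)
grpA = treadssk[classk=="regular"] - mean_read[["regular"]]
grpB = treadssk[classk=="small.class"] - mean_read[["small.class"]]
grpC = treadssk[classk=="regular.with.aide"] - mean_read[["regular.with.aide"]]
# group labels in the same order as the stacked resamples below
sim_classk <- factor(rep(c("regular", "small.class", "regular.with.aide"),
times = c(2000, 1733, 2015)), levels = levels(classk))
R = 10000
sim_Fstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=2000, replace=T)
groupB = sample(grpB, size=1733, replace=T)
groupC = sample(grpC, size=2015, replace=T)
sim_score = c(groupA,groupB,groupC)
sim_data = data.frame(sim_score,sim_classk)
# Welch F statistic for this bootstrap sample
sim_Fstar[i] = oneway.test(sim_score ~ sim_classk, data = sim_data)$statistic
}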
Now we need to get the set of unique pairs of the Group factor
allPairs <- expand.grid(levels(sim_data$sim_classk), levels(sim_data$sim_classk))
## http://stackoverflow.com/questions/28574006/unique-combination-of-two-columns-in-r/28574136#28574136
allPairs <- unique(t(apply(allPairs, 1, sort)))
allPairs <- allPairs[ allPairs[,1] != allPairs[,2], ]
allPairs
[,1] [,2]
[1,] "regular" "small.class"
[2,] "regular" "regular.with.aide"
[3,] "regular.with.aide" "small.class"
Since oneway.test() applies a Welch correction by default, we can use that on our simulated data.
allResults <- apply(allPairs, 1, function(p) {
#http://stackoverflow.com/questions/28587498/post-hoc-tests-for-one-way-anova-with-welchs-correction-in-r
dat <- sim_data[sim_data$sim_classk %in% p, ]
ret <- oneway.test(sim_score ~ sim_classk, data = dat, na.action = na.omit)
ret$sim_classk <- p
ret
})
length(allResults)
[1] 3
allResults[[1]]
One-way analysis of means (not assuming equal variances)
data: sim_score and sim_classk
F = 1.7741, num df = 2.0, denom df = 1305.9, p-value = 0.170
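Putting the pieces together, a bootstrap p-value for the overall Welch F test could be sketched as follows. This uses the built-in PlantGrowth data instead of Star so that it runs without extra packages, and the helper welch_F and the choice of 2000 resamples are assumptions of this sketch, not part of the answer above.

```r
set.seed(1)

# Welch F statistic (oneway.test defaults to var.equal = FALSE)
welch_F <- function(y, g) unname(oneway.test(y ~ g)$statistic)

y <- PlantGrowth$weight
g <- PlantGrowth$group

# Observed statistic on the original data
F_obs <- welch_F(y, g)

# Center each group at its own mean so that the null hypothesis holds
centered <- y - ave(y, g)

# Resample within groups, keeping the original group sizes
R <- 2000
F_star <- replicate(R, {
  y_star <- ave(centered, g,
                FUN = function(v) sample(v, length(v), replace = TRUE))
  welch_F(y_star, g)
})

# Share of bootstrap statistics at least as large as the observed one
p_boot <- mean(F_star >= F_obs)
p_boot
```

Resampling within groups keeps each group's own spread, which is the reason for using a Welch-type statistic in the first place.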