I have the following model:
prior1 <- c(
prior(normal(0, 50), class = b),
prior(exponential(0.1), class = sd),
prior(exponential(0.1), class = sigma))
BMvpa <- brm(
  RT ~ 1 + GroupC*PrimeC*CongC*EmoC*SexC
    + (1 + CongC*PrimeC*EmoC || ID)
    + (1 + GroupC*PrimeC*CongC*EmoC*SexC || Target),
  data = df1,
  family = exgaussian(),
  prior = prior1,
  warmup = 2000, iter = 5000,
  chains = 3,
  cores = parallel::detectCores(),
  sample_prior = TRUE
)
Here is part of my output:
Predictors Estimate SE Lower Upper Rhat BF01
1 Intercept 700.175 13.470 674.330 726.617 1.000 <NA>
2 Group 49.027 33.542 -17.261 115.122 1.000 0.51
3 Prime -12.197 2.816 -17.799 -6.655 1.000 0.017
4 Congruency -15.879 2.798 -21.507 -10.435 1.001 <0.001
5 Emotion 17.092 6.860 3.740 30.373 1.000 0.32
6 Sex 5.362 24.381 -42.347 52.871 1.001 2.031
...
21 Group x Prime x Emotion 22.339 8.509 5.464 39.136 1.001 0.24
...
Each variable (Group, Sex, Prime, Congruency, Emotion) is dichotomous and coded -0.5 (TD, M, LSF, ICG, Joy) or +0.5 (ASD, F, HSF, CG, Anger).
I would like to look in more detail at the Group x Prime x Emotion interaction (shown in the figure) and obtain the posterior distribution of the effect of Prime for each group and each emotion.
I thought of two strategies.
1/ First, using emmeans:
BMGPE_emm <- emmeans(BMvpa, ~ GroupC:PrimeC:EmoC)
BMGPE_fac <- update(BMGPE_emm, levels =list(GroupC= c("TD","ASD"), PrimeC= c("LSF","HSF"), EmoC=c("joy","anger")))
contRT1 <- as.data.frame(contrast(BMGPE_fac, method = "pairwise", by = c("GroupC","EmoC")))
Output:
1 LSF - HSF TD joy 8.706978 -0.3446188 18.33661
2 LSF - HSF ASD joy 20.487280 10.2622944 30.46115
3 LSF - HSF TD anger 15.029082 6.2702713 24.67623
4 LSF - HSF ASD anger 4.412052 -5.6749680 14.60393
I am not sure about this because I would have expected only negative estimates (HSF primes reducing response time).
Additionally, is there a possibility to compute a Bayes factor here?
2/ Second, using the hypothesis function (I read the blog post by Matti Vuorre). I think this would be the best approach, but the results I got are even stranger, and I probably made a mistake (I expected only negative estimates):
> hypothesis(BMvpa,c(qAJ = "PrimeC + 0.5*GroupC - 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qAJ 3.77 17.35 -30.51 37.79 3.56 0.78
---
> hypothesis(BMvpa,c(qAA = "PrimeC + 0.5*GroupC + 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qAA 20.86 17.46 -13.54 55.14 1.63 0.62
---
> hypothesis(BMvpa,c(qTJ = "PrimeC - 0.5*GroupC - 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qTJ -45.26 17.34 -78.93 -10.78 0.15 0.13 *
---
> hypothesis(BMvpa,c(qTA = "PrimeC - 0.5*GroupC - 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qTA -45.26 17.34 -78.93 -10.78 0.15 0.13 *
---
So my question is: how can I get the posterior distribution for the effect of Prime in each subgroup (and a Bayes factor)?
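One possible sketch: the conditional Prime effect can be assembled directly from the posterior draws, assuming brms's usual coefficient naming (b_PrimeC, b_GroupC:PrimeC, b_PrimeC:EmoC, b_GroupC:PrimeC:EmoC; check the exact names with variables(BMvpa)). With the ±0.5 coding, the Prime (HSF - LSF) effect in one Group x Emotion cell, averaged over Congruency and Sex (both centered at 0), is b_PrimeC plus the corresponding weighted interaction terms:
library(posterior)   # as_draws_df()
d <- as_draws_df(BMvpa)
# example cell: ASD (GroupC = +0.5) and anger (EmoC = +0.5)
prime_ASD_anger <- d$b_PrimeC +
  0.5  * d[["b_GroupC:PrimeC"]] +
  0.5  * d[["b_PrimeC:EmoC"]] +
  0.25 * d[["b_GroupC:PrimeC:EmoC"]]
quantile(prime_ASD_anger, c(0.025, 0.5, 0.975))  # posterior median and 95% interval
mean(prime_ASD_anger < 0)                        # posterior probability the effect is negative
Repeating this with GroupC = -0.5 and/or EmoC = -0.5 gives the other three cells; since sample_prior = TRUE was set, a Savage-Dickey Bayes factor for the same linear combination could in principle be obtained with hypothesis() or bayestestR, though the coefficient naming required there should be double-checked.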
These are three different ways to run an individual fixed effects model, which give more or less the same results (see below). My main question is how to get predicted probabilities or average marginal effects using the second model (model_plm) or the third model (model_felm). I know how to do it using the first model (model_lm) and show an example below using ggeffects, but that only works when I have a small sample.
As I have over a million individuals, my model only works using model_plm and model_felm. If I use model_lm, it takes a lot of time to run with one million individuals, since they are controlled for in the model. I also get the following error: Error: vector memory exhausted (limit reached?). I checked many threads on StackOverflow to work around that error, but nothing seems to solve it.
I was wondering whether there is an efficient way to work around this issue. My main interest is to extract the predicted probabilities of the interaction residence*union. I usually extract predicted probabilities or average marginal effects using one of these packages: ggeffects, emmeans, or margins.
library(lfe)
library(plm)
library(ggeffects)
data("Males")
model_lm = lm(wage ~ exper + residence + health + residence*union + factor(nr) - 1, data = Males)
model_plm = plm(wage ~ exper + residence + health + residence*union, model = "within", index = c("nr", "year"), data = Males)
model_felm = felm(wage ~ exper + residence + health + residence*union | nr, data = Males)
pred_ggeffects <- ggpredict(model_lm, c("residence", "union"),
                            vcov.fun = "vcovCL",
                            vcov.type = "HC1",
                            vcov.args = list(cluster = Males$nr))
I tried adjusting formula/datasets to get emmeans and plm to play nice. Let me know if there's something here. I realized the biglm answer wasn't going to cut it for a million individuals after some testing.
library(emmeans)
library(plm)
data("Males")
## this runs but we need to get an equivalent result with expanded formula
## and expanded dataset
model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr"), data=Males)
## expanded dataset
Males2 <- data.frame(wage=Males[complete.cases(Males),"wage"],
model.matrix(wage ~ exper + residence + health + residence*union, Males),
nr=Males[complete.cases(Males),"nr"])
(fmla2 <- as.formula(paste("wage ~ ", paste(names(coef(model_plm)), collapse= "+"))))
## expanded formula
model_plm2 <- plm(fmla2,
model = "within",
index=c("nr"),
data=Males2)
(fmla2_rg <- as.formula(paste("wage ~ -1 +", paste(names(coef(model_plm)), collapse= "+"))))
plm2_rg <- qdrg(fmla2_rg,
data = Males2,
coef = coef(model_plm2),
vcov = vcov(model_plm2),
df = model_plm2$df.residual)
plm2_rg
### when all 3 residences are 0, that's `rural area`
### then just pick the rows when one of the residences are 1
emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
Which gives, after some row-deletion:
> ### when all 3 residences are 0, that's `rural area`
> ### then just pick the rows when one of the residences are 1
> emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
residencenorth_east residencenothern_central residencesouth unionyes emmean SE df lower.CL upper.CL
0 0 0 0 0.3777 0.0335 2677 0.31201 0.443
1 0 0 0 0.3301 0.1636 2677 0.00929 0.651
0 1 0 0 0.1924 0.1483 2677 -0.09834 0.483
0 0 1 0 0.2596 0.1514 2677 -0.03732 0.557
0 0 0 1 0.2875 0.1473 2677 -0.00144 0.576
1 0 0 1 0.3845 0.1647 2677 0.06155 0.708
0 1 0 1 0.3326 0.1539 2677 0.03091 0.634
0 0 1 1 0.3411 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: healthyes
Confidence level used: 0.95
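The "row-deletion" can also be done programmatically; a sketch (column names assumed to match the grid variables above), keeping only rows in which at most one residence dummy equals 1:
emm_df <- as.data.frame(
  emmeans(plm2_rg, c("residencenorth_east", "residencenothern_central",
                     "residencesouth", "unionyes")))
res_cols <- c("residencenorth_east", "residencenothern_central", "residencesouth")
emm_df[rowSums(emm_df[, res_cols]) <= 1, ]   # all three zero = rural_area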
The problem seems to be that when we add -1 to the formula, that creates an extra column in the model matrix that is not included in the regression coefficients. (This is a byproduct of the way that R creates factor codings.)
So I can work around this by adding a strategically placed coefficient of zero. We also have to fix up the covariance matrix the same way:
library(emmeans)
library(plm)
data("Males")
mod <- plm(wage ~ exper + residence + health + residence*union,
model = "within",
index = "nr",
data = Males)
BB <- c(coef(mod)[1], 0, coef(mod)[-1])
k <- length(BB)
VV <- matrix(0, nrow = k, ncol = k)
VV[c(1, 3:k), c(1, 3:k)] <- vcov(mod)
RG <- qdrg(~ -1 + exper + residence + health + residence*union,
data = Males, coef = BB, vcov = VV, df = df.residual(mod))
Verify that things line up:
> names(RG@bhat)
[1] "exper" ""
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
> colnames(RG@linfct)
[1] "exper" "residencerural_area"
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
They do line up, so we can get the results we need:
(EMM <- emmeans(RG, ~ residence * union))
residence union emmean SE df lower.CL upper.CL
rural_area no 0.378 0.0335 2677 0.31201 0.443
north_east no 0.330 0.1636 2677 0.00929 0.651
nothern_central no 0.192 0.1483 2677 -0.09834 0.483
south no 0.260 0.1514 2677 -0.03732 0.557
rural_area yes 0.287 0.1473 2677 -0.00144 0.576
north_east yes 0.385 0.1647 2677 0.06155 0.708
nothern_central yes 0.333 0.1539 2677 0.03091 0.634
south yes 0.341 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: health
Confidence level used: 0.95
In general, the key is to identify where the added column occurs. It's going to be the position of the first level of the first factor in the model formula. You can check it by comparing names(coef(mod)) with colnames(model.matrix(formula, data = data)), where formula is the model formula with the intercept removed.
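A quick check along those lines, as a sketch using the objects defined above:
## compare the model-matrix columns (intercept removed) with the plm coefficients
mm <- model.matrix(~ -1 + exper + residence + health + residence*union, data = Males)
setdiff(colnames(mm), names(coef(mod)))        # the column that has no coefficient
which(!(colnames(mm) %in% names(coef(mod))))   # its position (where the zero goes)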
Update: a general function
Here's a function that may be used to create a reference grid for any plm object. It turns out that sometimes these objects do have an intercept (e.g., random-effects models) so we have to check. For models lacking an intercept, you really should use this only for contrasts.
plmrg = function(object, ...) {
    form = formula(formula(object))
    # "within" models have no intercept; drop it from the formula in that case
    if (!("(Intercept)" %in% names(coef(object))))
        form = update(form, ~ . - 1)
    data = eval(object$call$data, environment(form))
    mmat = model.matrix(form, data)
    # model-matrix columns that actually have coefficients in the plm fit
    sel = which(colnames(mmat) %in% names(coef(object)))
    k = ncol(mmat)
    # pad the coefficients and covariance matrix with zeros for the dropped columns
    b = rep(0, k)
    b[sel] = coef(object)
    v = matrix(0, nrow = k, ncol = k)
    v[sel, sel] = vcov(object)
    emmeans::qdrg(formula = form, data = data,
        coef = b, vcov = v, df = df.residual(object), ...)
}
Test run:
> (rg = plmrg(mod, at = list(exper = c(3,6,9))))
'emmGrid' object with variables:
exper = 3, 6, 9
residence = rural_area, north_east, nothern_central, south
health = no, yes
union = no, yes
> emmeans(rg, "residence")
NOTE: Results may be misleading due to involvement in interactions
residence emmean SE df lower.CL upper.CL
rural_area 0.313 0.0791 2677 0.1579 0.468
north_east 0.338 0.1625 2677 0.0190 0.656
nothern_central 0.243 0.1494 2677 -0.0501 0.536
south 0.281 0.1514 2677 -0.0161 0.578
Results are averaged over the levels of: exper, health, union
Confidence level used: 0.95
This potential solution uses biglm::biglm() to fit the lm model and then uses emmeans::qdrg() with a nuisance specified. Does this approach help in your situation?
library(biglm)
library(emmeans)
## the biglm coefficients using factor() with all the `nr` levels has NAs.
## so restrict data to complete cases in the `biglm()` call
model_biglm <- biglm(wage ~ -1 +exper + residence+health + residence*union + factor(nr), data=Males[!is.na(Males$residence),])
summary(model_biglm)
## double check that biglm and lm give same/similar model
## summary(model_biglm)
## summary(model_lm)
summary(model_biglm)$rsq
summary(model_lm)$r.squared
identical(coef(model_biglm), coef(model_lm)) ## not identical! but plot the coefficients...
head(cbind(coef(model_biglm), coef(model_lm)))
tail(cbind(coef(model_biglm), coef(model_lm)))
plot(cbind(coef(model_biglm), coef(model_lm))); abline(0,1,col="blue")
## do a "[q]uick and [d]irty [r]eference [g]rid" and follow examples
### from ?qdrg and https://cran.r-project.org/web/packages/emmeans/vignettes/FAQs.html
rg1 <- qdrg(wage ~ -1 + exper + residence+health + residence*union + factor(nr),
data = Males,
coef = coef(model_biglm),
vcov = vcov(model_biglm),
df = model_biglm$df.resid,
nuisance="nr")
## Since we already specified nuisance in qdrg() we don't in emmeans():
emmeans(rg1, c("residence","union"))
Which gives:
> emmeans(rg1, c("residence","union"))
residence union emmean SE df lower.CL upper.CL
rural_area no 1.72 0.1417 2677 1.44 2.00
north_east no 1.67 0.0616 2677 1.55 1.79
nothern_central no 1.53 0.0397 2677 1.45 1.61
south no 1.60 0.0386 2677 1.52 1.68
rural_area yes 1.63 0.2011 2677 1.23 2.02
north_east yes 1.72 0.0651 2677 1.60 1.85
nothern_central yes 1.67 0.0503 2677 1.57 1.77
south yes 1.68 0.0460 2677 1.59 1.77
Results are averaged over the levels of: 1 nuisance factors, health
Confidence level used: 0.95
I would like to perform a Cox regression for the following question: a group of patients receives a treatment "drug" or not (0 / 1). My time variable "time" tells me how many days the patient was observed, and "status" tells me whether the patient survived or died (died = 1, survived = 0).
library(survival)
set.seed(123)
df <- data.frame(time = round(runif(100, min = 1, max = 70)),
status = round(runif(100, min = 0, max = 1)),
drug = round(runif(100, min = 0, max = 1)),
age40 = round(runif(100, min = 0, max = 1)),
stringsAsFactors = FALSE)
object <- Surv(df$time, df$status)
model <- coxph(object ~ drug, data = df)
summary(model)
This works fine for me and tells me that the HR is 0.89, so the drug appears to reduce the risk of dying.
Now I want to do some subgroup analysis, e.g. how does the HR change if the patient is <= 40 years or > 40 years old (age40: 0 vs 1)?
Is all I have to do to include the variable "age40" into the coxph?
object2 <- Surv(df$time, df$status)
model2 <- coxph(object2 ~ drug + age40, data = df)
summary(model2)
If I do that, my HR in the summary for drug1 changes slightly to 0.86 and I get another one for age40 (1.12).
Now my question is: How are the Hazard Ratios for dying under treatment (drug = 1) if the patient is <= 40 or > 40 years old.
EDIT: Another question would be how to graphically show the different HRs of the effect of drug on status in a forest plot, e.g. like this: https://rpkgs.datanovia.com/survminer/reference/ggforest-2.png.
Instead of "sex", "rx", "adhere" etc. I would like to show the HRs for Age40 = 0 vs. 1 and other variables as well, like hypertension = 0 vs. 1, smoker = 0 vs. 1.
Thank you!
The function you need to use is predict on your model2, and it needs to be supplied with a newdata argument that includes all the cases that you want to consider:
exp( predict(model2, newdata=expand.grid(drug=c(0,1), age40=c(0,1))) )
# 1 2 3 4
#1.0000000 0.8564951 1.1268713 0.9651598
You now have all 4 possible combinations of drug and age40. The base case has a value of unity because you are estimating risk ratios from a baseline case of {drug=0, age40=0}. You can see which combinations the other risk ratios are associated with:
expand.grid(drug=c(0,1), age40=c(0,1))
drug age40
1 0 0
2 1 0
3 0 1
4 1 1
Notice that the ratio of drug=0 to drug=1 is the same for each age category considered separately. If you had wanted to see whether the effect of drug was different in the two age categories, you would have used an interaction model:
model3 <- coxph(object2 ~ drug * age40, data = df)
summary(model3)
#----------------
Call:
coxph(formula = object2 ~ drug * age40, data = df)
n= 100, number of events= 50
coef exp(coef) se(coef) z Pr(>|z|)
drug -0.18524 0.83091 0.45415 -0.408 0.683
age40 0.09611 1.10089 0.39560 0.243 0.808
drug:age40 0.05679 1.05843 0.63094 0.090 0.928
exp(coef) exp(-coef) lower .95 upper .95
drug 0.8309 1.2035 0.3412 2.024
age40 1.1009 0.9084 0.5070 2.390
drug:age40 1.0584 0.9448 0.3073 3.645
Concordance= 0.528 (se = 0.042 )
Likelihood ratio test= 0.34 on 3 df, p=1
Wald test = 0.33 on 3 df, p=1
Score (logrank) test = 0.33 on 3 df, p=1
And the effect estimates are now a bit different:
exp( predict(model3, newdata=expand.grid(drug=c(0,1), age40=c(0,1))) )
# 1 2 3 4
#1.0000000 0.8309089 1.1008850 0.9681861
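The age-specific hazard ratios for drug can also be read directly off the interaction-model coefficients (a sketch; the values agree with the predictions above up to rounding):
cf <- coef(model3)
exp(cf["drug"])                      # HR of drug when age40 = 0 (about 0.83)
exp(cf["drug"] + cf["drug:age40"])   # HR of drug when age40 = 1 (about 0.88 = 0.968/1.101)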
Use the strata argument:
coxph(object ~ drug + strata(age40), data = df)
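For the forest-plot part of the EDIT, one option is survminer::ggforest(), which draws one row per model term; a sketch (the factor labels are assumptions about the 0/1 coding):
library(survminer)                     # provides ggforest()
df$drug  <- factor(df$drug,  labels = c("no drug", "drug"))
df$age40 <- factor(df$age40, labels = c("<=40", ">40"))
model_f  <- coxph(Surv(time, status) ~ drug + age40, data = df)
ggforest(model_f, data = df)           # other binary covariates can be added the same way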
I'm hoping any of you could shed some light on the following. I have been attempting to replicate a Cox PH model from Stata in R. As you can see below, I get the same results for Cox PH models without tvcs in both programs:
Stata Cox PH model
stset date_endpoint, failure(cause_endpoint2==4) exit(failure) origin(time capture_date) id(wolf_ID)
id: wolf_ID
failure event: cause_endpoint2 == 4
obs. time interval: (date_endpoint[_n-1], date_endpoint]
exit on or before: failure
t for analysis: (time-origin)
origin: time capture_date
--------------------------------------------------------------------------
5,664 total observations
0 exclusions
--------------------------------------------------------------------------
5,664 observations remaining, representing
513 subjects
231 failures in single-failure-per-subject data
279,430.5 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 3,051
stcox deer_hunt bear_hunt, strata(winter lib_kill) efron robust cluster(wolf_ID)
failure _d: cause_endpoint2 == 4
analysis time _t: (date_endpoint-origin)
origin: time capture_date
id: wolf_ID
Iteration 0: log pseudolikelihood = -993.65252
Iteration 1: log pseudolikelihood = -992.55768
Iteration 2: log pseudolikelihood = -992.55733
Refining estimates:
Iteration 0: log pseudolikelihood = -992.55733
Stratified Cox regr. -- Efron method for ties
No. of subjects = 513 Number of obs = 5,664
No. of failures = 231
Time at risk = 279430.5
Wald chi2(2) = 2.21
Log pseudolikelihood = -992.55733 Prob > chi2 = 0.3317
(Std. Err. adjusted for 513 clusters in wolf_ID)
--------------------------------------------------------------------------
| Robust
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf.Interval]
-------------+------------------------------------------------------------
deer_hunt | .7860433 .1508714 -1.25 0.210 .539596 1.145049
bear_hunt | 1.21915 .2687211 0.90 0.369 .7914762 1.877916
--------------------------------------------------------------------------
Stratified by winter lib_kill
R Cox PH model
> LTF.coxph <- coxph(Surv(`_t0`,`_t`, endpoint_r_enc=="ltf") ~ deer_hunt
+ + bear_hunt + strata(winter, lib_kill), data=statadta, ties = "efron", id = wolf_ID)
> summary(LTF.coxph)
Call:
coxph(formula = Surv(`_t0`, `_t`, endpoint_r_enc == "ltf") ~
deer_hunt + bear_hunt + strata(winter, lib_kill), data = statadta,
ties = "efron", id = wolf_ID)
n= 5664, number of events= 231
coef exp(coef) se(coef) z Pr(>|z|)
deer_hunt -0.2407 0.7860 0.1849 -1.302 0.193
bear_hunt 0.1982 1.2191 0.2174 0.911 0.362
exp(coef) exp(-coef) lower .95 upper .95
deer_hunt 0.786 1.2722 0.5471 1.129
bear_hunt 1.219 0.8202 0.7962 1.867
Concordance= 0.515 (se = 0.022 )
Likelihood ratio test= 2.19 on 2 df, p=0.3
Wald test = 2.21 on 2 df, p=0.3
Score (logrank) test = 2.22 on 2 df, p=0.3
> cox.zph(LTF.coxph)
chisq df p
deer_hunt 5.5773 1 0.018
bear_hunt 0.0762 1 0.783
GLOBAL 5.5773 2 0.062
The problem I have is that results look very different when adding a time-varying coefficient (tvc() in Stata and tt() in R) for one of the variables in my model. Nothing is the same between models (coefficients for all variables or their significance).
Stata Cox PH model with tvc()
stcox deer_hunt bear_hunt, tvc(deer_hunt) strata(winter lib_kill) efron robust cluster(wolf_ID)
failure _d: cause_endpoint2 == 4
analysis time _t: (date_endpoint-origin)
origin: time capture_date
id: wolf_ID
Iteration 0: log pseudolikelihood = -993.65252
Iteration 1: log pseudolikelihood = -990.70475
Iteration 2: log pseudolikelihood = -990.69386
Iteration 3: log pseudolikelihood = -990.69386
Refining estimates:
Iteration 0: log pseudolikelihood = -990.69386
Stratified Cox regr. -- Efron method for ties
No. of subjects = 513 Number of obs = 5,664
No. of failures = 231
Time at risk = 279430.5
Wald chi2(3) = 4.72
Log pseudolikelihood = -990.69386 Prob > chi2 = 0.1932
(Std. Err. adjusted for 513 clusters in wolf_ID)
--------------------------------------------------------------------------
| Robust
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf Interval]
-------------+------------------------------------------------------------
main |
deer_hunt | 1.043941 .2643779 0.17 0.865 .6354908 1.714915
bear_hunt | 1.204522 .2647525 0.85 0.397 .7829279 1.853138
-------------+------------------------------------------------------------
tvc |
deer_hunt | .9992683 .0004286 -1.71 0.088 .9984287 1.000109
------------------------------------------------------------------------------
Stratified by winter lib_kill
Note: Variables in tvc equation interacted with _t.
R Cox PH model with tt()
> LTF.tvc1.coxph <- coxph(Surv(`_t0`,`_t`, endpoint_r_enc=="ltf") ~ deer_hunt + bear_hunt + tt(deer_hunt) + strata(winter, lib_kill),
+ data=statadta, ties = "efron", id = wolf_ID, cluster = wolf_ID,
+ tt=function(x,t,...){x*t})
> summary(LTF.tvc1.coxph)
Call:
coxph(formula = Surv(`_t0`, `_t`, endpoint_r_enc == "ltf") ~
deer_hunt + bear_hunt + tt(deer_hunt) + strata(winter, lib_kill),
data = statadta, ties = "efron", tt = function(x, t, ...) {
x * t
}, id = wolf_ID, cluster = wolf_ID)
n= 5664, number of events= 231
coef exp(coef) se(coef) robust se z Pr(>|z|)
deer_hunt 0.4741848 1.6067039 0.2082257 0.2079728 2.280 0.02261 *
bear_hunt -0.7923208 0.4527927 0.1894531 0.1890497 -4.191 2.78e-05 ***
tt(deer_hunt) -0.0009312 0.9990692 0.0003442 0.0003612 -2.578 0.00994 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
deer_hunt 1.6067 0.6224 1.0688 2.4153
bear_hunt 0.4528 2.2085 0.3126 0.6559
tt(deer_hunt) 0.9991 1.0009 0.9984 0.9998
Concordance= 0.634 (se = 0.02 )
Likelihood ratio test= 28.29 on 3 df, p=3e-06
Wald test = 25.6 on 3 df, p=1e-05
Score (logrank) test = 26.19 on 3 df, p=9e-06, Robust = 32.6 p=4e-07
Moreover, I checked this post before posting, but did not find it very helpful. The 'noadjust' Stata command was useful for the SEs, but it does not answer my main issue: I also get different covariate coefficients between the two programs for the main and time-varying effects when I add time-varying effects to the Cox model in each program (with the same formula for the time-varying effects). That is really my main concern: the difference in covariate estimates seems substantial and, I believe, would lead to different recommendations.
I have been unable to figure out what is happening there, and am hoping the community can help.
I have a question about how to calculate differences between the coefficients of categorical variables in a GLM fitted with RStan.
As an example, I used the iris dataset in R to check whether I can calculate the posterior distribution of the differences between coefficients.
First, I ran a basic glm as below and tested the differences between coefficients.
library(tidyverse)
library(magrittr)
library(multcomp)
iris_glm <-
glm(Sepal.Length ~ Species, data = iris)
multcomp::glht(iris_glm, linfct = mcp(Species = "Tukey")) %>%
summary(.) %>%
broom::tidy()
lhs rhs estimate std.error statistic p.value
1 versicolor - setosa 0 0.930 0.1029579 9.032819 0.000000e+00
2 virginica - setosa 0 1.582 0.1029579 15.365506 0.000000e+00
3 virginica - versicolor 0 0.652 0.1029579 6.332686 4.294805e-10
Next, I fitted a Bayesian GLM using Stan as in the code below, and calculated the posterior distribution of the differences between coefficients in the generated quantities section.
# Make the model matrix for Rstan
iris_mod <-
model.matrix(Sepal.Length ~ Species, data = iris) %>%
as.data.frame(.)
# Input data
stan_data <-
list(N = nrow(iris_mod),
SL = iris$Sepal.Length,
Intercept = iris_mod$`(Intercept)`,
versicolor = iris_mod$Speciesversicolor,
virginica = iris_mod$Speciesvirginica)
# Stan code
data{
int N;
real <lower = 0> SL[N];
int <lower = 1> Intercept[N];
int <lower = 0, upper = 1> versicolor[N];
int <lower = 0, upper = 1> virginica[N];
}
parameters{
real beta0;
real beta1;
real beta2;
real <lower = 0> sigma;
}
transformed parameters{
real mu[N];
for(n in 1:N) mu[n] = beta0*Intercept[n] + beta1*versicolor[n] +
beta2*virginica[n];
}
model{
for(n in 1:N) SL[n] ~ normal(mu[n], sigma);
}
generated quantities{
real diff_beta0_beta1;
real diff_beta1_beta2;
real diff_beta0_beta2;
diff_beta0_beta1 = (beta0 + beta1) - beta0;
diff_beta1_beta2 = (beta0 + beta1) - (beta0 + beta2);
diff_beta0_beta2 = (beta0 + beta2) - beta0;
}
library(rstan)
fit_stan <-
stan(file = "iris.stan", data = stan_data, chains = 4,
seed = 1234)
# confirmation of posterior distribution
print(fit_stan, pars = c("diff_beta0_beta1", "diff_beta1_beta2",
"diff_beta0_beta2"))
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
diff_beta0_beta1 0.92 0 0.1 0.73 0.86 0.92 0.99 1.13 2041 1
diff_beta1_beta2 0.65 0 0.1 0.45 0.58 0.65 0.72 0.86 4000 1
diff_beta0_beta2 1.58 0 0.1 1.38 1.51 1.58 1.64 1.78 1851 1
Finally, I got the same results from the frequentist and the Bayesian methods.
I think this is the correct way, but I'm not sure, because I could not find any information or examples of it.
I would also like to confirm that this approach can be extended to other error distributions (including Poisson, gamma, binomial, negative binomial, etc.).
If there are other good ways or any advice, please let me know.
You can calculate any function (including the difference in coefficients) of draws (such as those produced by Stan) from any proper posterior distribution.
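For example, with the model above, the same differences can be computed after sampling, directly from the extracted draws (a sketch; no generated quantities block is needed):
post <- rstan::extract(fit_stan, pars = c("beta1", "beta2"))
diff_virg_vers <- post$beta2 - post$beta1       # virginica minus versicolor effect
quantile(diff_virg_vers, c(0.025, 0.5, 0.975))  # posterior summary of the difference
mean(diff_virg_vers > 0)                        # posterior probability the difference is positive
The same post-processing works for any other error distribution (Poisson, gamma, binomial, negative binomial, ...), since it only manipulates the posterior draws.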
Does anyone know how to do post hoc tests in an ANCOVA model with a factorial design?
I have two vectors consisting of 23 baseline values (covariate) and 23 values after treatment (dependent variable), and I have two factors, each with two levels. I created an ANCOVA model and calculated the adjusted means, standard errors, and confidence intervals. Example:
library(effects)
baseline = c(0.7672,1.846,0.6487,0.4517,0.5599,0.2255,0.5946,1.435,0.5374,0.4901,1.258,0.5445,1.078,1.142,0.5,1.044,0.7824,1.059,0.6802,0.8003,0.5547,1.003,0.9213)
after_treatment = c(0.4222,1.442,0.8436,0.5544,0.8818,0.08789,0.6291,1.23,0.4093,0.7828,-0.04061,0.8686,0.8525,0.8036,0.3758,0.8531,0.2897,0.8127,1.213,0.05276,0.7364,1.001,0.8974)
age = factor(c(rep(c("Young","Old"),11),"Young"))
treatment = factor(c(rep("Drug",12),rep("Placebo",11)))
ANC = aov(after_treatment ~ baseline + treatment*age)
effect_treatage = effect("treatment*age",ANC)
data.frame(effect_treatage)
treatment age fit se lower upper
1 Drug Old 0.8232137 0.1455190 0.5174897 1.1289377
2 Placebo Old 0.6168641 0.1643178 0.2716452 0.9620831
3 Drug Young 0.5689036 0.1469175 0.2602413 0.8775659
4 Placebo Young 0.7603360 0.1462715 0.4530309 1.0676410
Now I want to test whether there is a difference between the adjusted means of
Young-Placebo:Young-Drug
Old-Placebo:Old-Drug
Young-Placebo:Old-Drug
Old-Placebo:Young-Drug
So I tried:
library(multcomp)
pH = glht(ANC, linfct = mcp(treatment*age="Tukey"))
# Error: unexpected '=' in "ph = glht(ANC_nback, linfct = mcp(treat*age="
And:
pH = TukeyHSD(ANC)
# Error in rep.int(n, length(means)) : unimplemented type 'NULL' in 'rep3'
# In addition: Warning message:
# In replications(paste("~", xx), data = mf) : non-factors ignored: baseline
Does anyone know how to resolve this?
Many thanks!
PS for more info see
R: How to graphically plot adjusted means, SE, CI ANCOVA
If you wish to use multcomp, you can take advantage of the seamless interface between the lsmeans and multcomp packages (see ?lsm), as lsmeans provides support for glht().
baseline = c(0.7672,1.846,0.6487,0.4517,0.5599,0.2255,0.5946,1.435,0.5374,0.4901,1.258,0.5445,1.078,1.142,0.5,1.044,0.7824,1.059,0.6802,0.8003,0.5547,1.003,0.9213)
after_treatment = c(0.4222,1.442,0.8436,0.5544,0.8818,0.08789,0.6291,1.23,0.4093,0.7828,-0.04061,0.8686,0.8525,0.8036,0.3758,0.8531,0.2897,0.8127,1.213,0.05276,0.7364,1.001,0.8974)
age = factor(c(rep(c("Young","Old"),11),"Young"))
treatment = factor(c(rep("Drug",12),rep("Placebo",11)))
Treat <- data.frame(baseline, after_treatment, age, treatment)
ANC <- aov(after_treatment ~ baseline + treatment*age, data=Treat)
library(multcomp)
library(lsmeans)
summary(glht(ANC, linfct = lsm(pairwise ~ treatment * age)))
## Note: df set to 18
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: aov(formula = after_treatment ~ baseline + treatment * age, data = Treat)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Drug,Old - Placebo,Old == 0 0.20635 0.21913 0.942 0.783
## Drug,Old - Drug,Young == 0 0.25431 0.20698 1.229 0.617
## Drug,Old - Placebo,Young == 0 0.06288 0.20647 0.305 0.990
## Placebo,Old - Drug,Young == 0 0.04796 0.22407 0.214 0.996
## Placebo,Old - Placebo,Young == 0 -0.14347 0.22269 -0.644 0.916
## Drug,Young - Placebo,Young == 0 -0.19143 0.20585 -0.930 0.789
## (Adjusted p values reported -- single-step method)
This eliminates the need for reparametrization. You can achieve the same results by using lsmeans alone:
lsmeans(ANC, list(pairwise ~ treatment * age))
## $`lsmeans of treatment, age`
## treatment age lsmean SE df lower.CL upper.CL
## Drug Old 0.8232137 0.1455190 18 0.5174897 1.1289377
## Placebo Old 0.6168641 0.1643178 18 0.2716452 0.9620831
## Drug Young 0.5689036 0.1469175 18 0.2602413 0.8775659
## Placebo Young 0.7603360 0.1462715 18 0.4530309 1.0676410
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## Drug,Old - Placebo,Old 0.20634956 0.2191261 18 0.942 0.7831
## Drug,Old - Drug,Young 0.25431011 0.2069829 18 1.229 0.6175
## Drug,Old - Placebo,Young 0.06287773 0.2064728 18 0.305 0.9899
## Placebo,Old - Drug,Young 0.04796056 0.2240713 18 0.214 0.9964
## Placebo,Old - Placebo,Young -0.14347183 0.2226876 18 -0.644 0.9162
## Drug,Young - Placebo,Young -0.19143238 0.2058455 18 -0.930 0.7893
##
## P value adjustment: tukey method for comparing a family of 4 estimates
You need to use the which argument in TukeyHSD: "listing terms in the fitted model for which the intervals should be calculated". This is needed because you have a non-factor variable in the model ('baseline'). That variable causes trouble when it is included, which happens by default when which is not specified.
ANC = aov(after_treatment ~ baseline + treatment*age)
TukeyHSD(ANC, which = c("treatment:age"))
If you wish to use the more flexible glht, see section 3, page 8- here
Reparametrization is a possibility here:
treatAge <- interaction(treatment, age)
ANC1 <- aov(after_treatment ~ baseline + treatAge)
#fits are equivalent:
all.equal(logLik(ANC), logLik(ANC1))
#[1] TRUE
library(multcomp)
summary(glht(ANC1, linfct = mcp(treatAge="Tukey")))
# Simultaneous Tests for General Linear Hypotheses
#
#Multiple Comparisons of Means: Tukey Contrasts
#
#
#Fit: aov(formula = after_treatment ~ baseline + treatAge)
#
#Linear Hypotheses:
# Estimate Std. Error t value Pr(>|t|)
#Placebo.Old - Drug.Old == 0 -0.20635 0.21913 -0.942 0.783
#Drug.Young - Drug.Old == 0 -0.25431 0.20698 -1.229 0.617
#Placebo.Young - Drug.Old == 0 -0.06288 0.20647 -0.305 0.990
#Drug.Young - Placebo.Old == 0 -0.04796 0.22407 -0.214 0.996
#Placebo.Young - Placebo.Old == 0 0.14347 0.22269 0.644 0.916
#Placebo.Young - Drug.Young == 0 0.19143 0.20585 0.930 0.789
#(Adjusted p values reported -- single-step method)