Error with post hoc tests in ANCOVA with factorial design - r

Does anyone know how to do post hoc tests in an ANCOVA model with a factorial design?
I have two vectors consisting of 23 baseline values (covariate) and 23 values after treatment (dependent variable), and two factors, each with two levels. I created an ANCOVA model and calculated the adjusted means, standard errors, and confidence intervals. Example:
library(effects)
baseline = c(0.7672,1.846,0.6487,0.4517,0.5599,0.2255,0.5946,1.435,0.5374,0.4901,1.258,0.5445,1.078,1.142,0.5,1.044,0.7824,1.059,0.6802,0.8003,0.5547,1.003,0.9213)
after_treatment = c(0.4222,1.442,0.8436,0.5544,0.8818,0.08789,0.6291,1.23,0.4093,0.7828,-0.04061,0.8686,0.8525,0.8036,0.3758,0.8531,0.2897,0.8127,1.213,0.05276,0.7364,1.001,0.8974)
age = factor(c(rep(c("Young","Old"),11),"Young"))
treatment = factor(c(rep("Drug",12),rep("Placebo",11)))
ANC = aov(after_treatment ~ baseline + treatment*age)
effect_treatage = effect("treatment*age",ANC)
data.frame(effect_treatage)
treatment age fit se lower upper
1 Drug Old 0.8232137 0.1455190 0.5174897 1.1289377
2 Placebo Old 0.6168641 0.1643178 0.2716452 0.9620831
3 Drug Young 0.5689036 0.1469175 0.2602413 0.8775659
4 Placebo Young 0.7603360 0.1462715 0.4530309 1.0676410
Now I want to test whether there is a difference between the adjusted means of
Young-Placebo:Young-Drug
Old-Placebo:Old-Drug
Young-Placebo:Old-Drug
Old-Placebo:Young-Drug
So I tried:
library(multcomp)
pH = glht(ANC, linfct = mcp(treatment*age="Tukey"))
# Error: unexpected '=' in "pH = glht(ANC, linfct = mcp(treatment*age="
And:
pH = TukeyHSD(ANC)
# Error in rep.int(n, length(means)) : unimplemented type 'NULL' in 'rep3'
# In addition: Warning message:
# In replications(paste("~", xx), data = mf) : non-factors ignored: baseline
Does anyone know how to resolve this?
Many thanks!
PS for more info see
R: How to graphically plot adjusted means, SE, CI ANCOVA

If you wish to use multcomp, then you can take advantage of the wonderful and seamless interface between the lsmeans and multcomp packages (see ?lsm), as lsmeans provides support for glht().
baseline = c(0.7672,1.846,0.6487,0.4517,0.5599,0.2255,0.5946,1.435,0.5374,0.4901,1.258,0.5445,1.078,1.142,0.5,1.044,0.7824,1.059,0.6802,0.8003,0.5547,1.003,0.9213)
after_treatment = c(0.4222,1.442,0.8436,0.5544,0.8818,0.08789,0.6291,1.23,0.4093,0.7828,-0.04061,0.8686,0.8525,0.8036,0.3758,0.8531,0.2897,0.8127,1.213,0.05276,0.7364,1.001,0.8974)
age = factor(c(rep(c("Young","Old"),11),"Young"))
treatment = factor(c(rep("Drug",12),rep("Placebo",11)))
Treat <- data.frame(baseline, after_treatment, age, treatment)
ANC <- aov(after_treatment ~ baseline + treatment*age, data=Treat)
library(multcomp)
library(lsmeans)
summary(glht(ANC, linfct = lsm(pairwise ~ treatment * age)))
## Note: df set to 18
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: aov(formula = after_treatment ~ baseline + treatment * age, data = Treat)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Drug,Old - Placebo,Old == 0 0.20635 0.21913 0.942 0.783
## Drug,Old - Drug,Young == 0 0.25431 0.20698 1.229 0.617
## Drug,Old - Placebo,Young == 0 0.06288 0.20647 0.305 0.990
## Placebo,Old - Drug,Young == 0 0.04796 0.22407 0.214 0.996
## Placebo,Old - Placebo,Young == 0 -0.14347 0.22269 -0.644 0.916
## Drug,Young - Placebo,Young == 0 -0.19143 0.20585 -0.930 0.789
## (Adjusted p values reported -- single-step method)
This eliminates the need for reparametrization. You can achieve the same results by using lsmeans alone:
lsmeans(ANC, list(pairwise ~ treatment * age))
## $`lsmeans of treatment, age`
## treatment age lsmean SE df lower.CL upper.CL
## Drug Old 0.8232137 0.1455190 18 0.5174897 1.1289377
## Placebo Old 0.6168641 0.1643178 18 0.2716452 0.9620831
## Drug Young 0.5689036 0.1469175 18 0.2602413 0.8775659
## Placebo Young 0.7603360 0.1462715 18 0.4530309 1.0676410
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## Drug,Old - Placebo,Old 0.20634956 0.2191261 18 0.942 0.7831
## Drug,Old - Drug,Young 0.25431011 0.2069829 18 1.229 0.6175
## Drug,Old - Placebo,Young 0.06287773 0.2064728 18 0.305 0.9899
## Placebo,Old - Drug,Young 0.04796056 0.2240713 18 0.214 0.9964
## Placebo,Old - Placebo,Young -0.14347183 0.2226876 18 -0.644 0.9162
## Drug,Young - Placebo,Young -0.19143238 0.2058455 18 -0.930 0.7893
##
## P value adjustment: tukey method for comparing a family of 4 estimates
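If you also need simultaneous confidence intervals for these six differences, confint() on the same glht object provides them, e.g.:
ph <- glht(ANC, linfct = lsm(pairwise ~ treatment * age))
confint(ph)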

You need to use the which argument in TukeyHSD ("listing terms in the fitted model for which the intervals should be calculated"). This is needed because the model contains a non-factor variable ('baseline'); by default all terms are included, and the continuous covariate is what causes the error.
ANC = aov(after_treatment ~ baseline + treatment*age)
TukeyHSD(ANC, which = c("treatment:age"))
If you wish to use the more flexible glht, see section 3 (page 8 onward) here
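As an illustration of that flexibility, here is a minimal sketch that builds the six pairwise contrasts of adjusted cell means by hand as a contrast matrix for linfct. It assumes R's default treatment contrasts, so that coef(ANC) is (Intercept), baseline, treatmentPlacebo, ageYoung, treatmentPlacebo:ageYoung; check names(coef(ANC)) before relying on it.
library(multcomp)
## each row is a difference of adjusted cell means; columns follow coef(ANC)
K <- rbind(
  "Drug,Old - Placebo,Old"      = c(0, 0, -1,  0,  0),
  "Drug,Old - Drug,Young"       = c(0, 0,  0, -1,  0),
  "Drug,Old - Placebo,Young"    = c(0, 0, -1, -1, -1),
  "Placebo,Old - Drug,Young"    = c(0, 0,  1, -1,  0),
  "Placebo,Old - Placebo,Young" = c(0, 0,  0, -1, -1),
  "Drug,Young - Placebo,Young"  = c(0, 0, -1,  0, -1))
summary(glht(ANC, linfct = K))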

Reparametrization is a possibility here:
treatAge <- interaction(treatment, age)
ANC1 <- aov(after_treatment ~ baseline + treatAge)
#fits are equivalent:
all.equal(logLik(ANC), logLik(ANC1))
#[1] TRUE
library(multcomp)
summary(glht(ANC1, linfct = mcp(treatAge="Tukey")))
# Simultaneous Tests for General Linear Hypotheses
#
#Multiple Comparisons of Means: Tukey Contrasts
#
#
#Fit: aov(formula = after_treatment ~ baseline + treatAge)
#
#Linear Hypotheses:
# Estimate Std. Error t value Pr(>|t|)
#Placebo.Old - Drug.Old == 0 -0.20635 0.21913 -0.942 0.783
#Drug.Young - Drug.Old == 0 -0.25431 0.20698 -1.229 0.617
#Placebo.Young - Drug.Old == 0 -0.06288 0.20647 -0.305 0.990
#Drug.Young - Placebo.Old == 0 -0.04796 0.22407 -0.214 0.996
#Placebo.Young - Placebo.Old == 0 0.14347 0.22269 0.644 0.916
#Placebo.Young - Drug.Young == 0 0.19143 0.20585 0.930 0.789
#(Adjusted p values reported -- single-step method)

Related

How to run predicted probabilities (or average marginal effects) for individual fixed effects in panel data using R?

These are three different ways to run an individual fixed effects model, and they give more or less the same results (see below). My main question is how to get predicted probabilities or average marginal effects using the second model (model_plm) or the third model (model_felm). I know how to do it using the first model (model_lm) and show an example below using ggeffects, but that only works when I have a small sample.
As I have over a million individuals, my model only works using model_plm and model_felm. If I use model_lm, it takes a long time to run with one million individuals, since they are controlled for in the model. I also get the following error: Error: vector memory exhausted (limit reached?). I checked many threads on Stack Overflow to work around that error, but nothing seems to solve it.
I was wondering whether there is an efficient way to work around this issue. My main interest is to extract the predicted probabilities of the interaction residence*union. I usually extract predicted probabilities or average marginal effects using one of these packages: ggeffects, emmeans, or margins.
library(lfe)
library(plm)
library(ggeffects)
data("Males")
model_lm = lm(wage ~ exper + residence+health + residence*union +factor(nr)-1, data=Males)
model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr", "year"), data=Males)
model_felm = felm(wage ~ exper + residence + health + residence*union | nr, data= Males)
pred_ggeffects <- ggpredict(model_lm, c("residence","union"),
vcov.fun = "vcovCL",
vcov.type = "HC1",
vcov.args = list(cluster = Males$nr))
I tried adjusting the formula/dataset to get emmeans and plm to play nicely together; let me know if there's something useful here. (After some testing, I realized the biglm answer below wasn't going to cut it for a million individuals.)
library(emmeans)
library(plm)
data("Males")
## this runs but we need to get an equivalent result with expanded formula
## and expanded dataset
model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr"), data=Males)
## expanded dataset
Males2 <- data.frame(wage=Males[complete.cases(Males),"wage"],
model.matrix(wage ~ exper + residence + health + residence*union, Males),
nr=Males[complete.cases(Males),"nr"])
(fmla2 <- as.formula(paste("wage ~ ", paste(names(coef(model_plm)), collapse= "+"))))
## expanded formula
model_plm2 <- plm(fmla2,
model = "within",
index=c("nr"),
data=Males2)
(fmla2_rg <- as.formula(paste("wage ~ -1 +", paste(names(coef(model_plm)), collapse= "+"))))
plm2_rg <- qdrg(fmla2_rg,
data = Males2,
coef = coef(model_plm2),
vcov = vcov(model_plm2),
df = model_plm2$df.residual)
plm2_rg
### when all 3 residences are 0, that's `rural area`
### then just pick the rows when one of the residences are 1
emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
Which gives, after some row-deletion:
> ### when all 3 residences are 0, that's `rural area`
> ### then just pick the rows when one of the residences are 1
> emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
residencenorth_east residencenothern_central residencesouth unionyes emmean SE df lower.CL upper.CL
0 0 0 0 0.3777 0.0335 2677 0.31201 0.443
1 0 0 0 0.3301 0.1636 2677 0.00929 0.651
0 1 0 0 0.1924 0.1483 2677 -0.09834 0.483
0 0 1 0 0.2596 0.1514 2677 -0.03732 0.557
0 0 0 1 0.2875 0.1473 2677 -0.00144 0.576
1 0 0 1 0.3845 0.1647 2677 0.06155 0.708
0 1 0 1 0.3326 0.1539 2677 0.03091 0.634
0 0 1 1 0.3411 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: healthyes
Confidence level used: 0.95
The problem seems to be that when we add -1 to the formula, that creates an extra column in the model matrix that is not included in the regression coefficients. (This is a byproduct of the way that R creates factor codings.)
So I can work around this by adding a strategically placed coefficient of zero. We also have to fix up the covariance matrix the same way:
library(emmeans)
library(plm)
data("Males")
mod <- plm(wage ~ exper + residence + health + residence*union,
model = "within",
index = "nr",
data = Males)
BB <- c(coef(mod)[1], 0, coef(mod)[-1])
k <- length(BB)
VV <- matrix(0, nrow = k, ncol = k)
VV[c(1, 3:k), c(1, 3:k)] <- vcov(mod)
RG <- qdrg(~ -1 + exper + residence + health + residence*union,
data = Males, coef = BB, vcov = VV, df = df.residual(mod))
Verify that things line up:
> names(RG@bhat)
[1] "exper" ""
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
> colnames(RG@linfct)
[1] "exper" "residencerural_area"
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
They do line up, so we can get the results we need:
(EMM <- emmeans(RG, ~ residence * union))
residence union emmean SE df lower.CL upper.CL
rural_area no 0.378 0.0335 2677 0.31201 0.443
north_east no 0.330 0.1636 2677 0.00929 0.651
nothern_central no 0.192 0.1483 2677 -0.09834 0.483
south no 0.260 0.1514 2677 -0.03732 0.557
rural_area yes 0.287 0.1473 2677 -0.00144 0.576
north_east yes 0.385 0.1647 2677 0.06155 0.708
nothern_central yes 0.333 0.1539 2677 0.03091 0.634
south yes 0.341 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: health
Confidence level used: 0.95
In general, the key is to identify where the added column occurs. It will be at the position of the first level of the first factor in the model formula. You can check this by comparing names(coef(mod)) with colnames(model.matrix(formula, data = data)), where formula is the model formula with the intercept removed.
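For example, with the model above (a sketch reusing mod and Males from the code above):
## model matrix for the intercept-free formula
mm <- model.matrix(~ -1 + exper + residence + health + residence*union,
                   data = Males)
## the column present in the model matrix but missing from the plm coefficients
setdiff(colnames(mm), names(coef(mod)))
## expect "residencerural_area": the first level of the first factor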
Update: a general function
Here's a function that may be used to create a reference grid for any plm object. It turns out that sometimes these objects do have an intercept (e.g., random-effects models) so we have to check. For models lacking an intercept, you really should use this only for contrasts.
plmrg = function(object, ...) {
    form = formula(formula(object))
    # drop the intercept unless the plm fit actually has one
    if (!("(Intercept)" %in% names(coef(object))))
        form = update(form, ~ . - 1)
    data = eval(object$call$data, environment(form))
    mmat = model.matrix(form, data)
    # positions of model-matrix columns that have a matching coefficient
    sel = which(colnames(mmat) %in% names(coef(object)))
    # pad coefficients and covariance with zeros for the unmatched columns
    k = ncol(mmat)
    b = rep(0, k)
    b[sel] = coef(object)
    v = matrix(0, nrow = k, ncol = k)
    v[sel, sel] = vcov(object)
    emmeans::qdrg(formula = form, data = data,
                  coef = b, vcov = v, df = df.residual(object), ...)
}
Test run:
> (rg = plmrg(mod, at = list(exper = c(3,6,9))))
'emmGrid' object with variables:
exper = 3, 6, 9
residence = rural_area, north_east, nothern_central, south
health = no, yes
union = no, yes
> emmeans(rg, "residence")
NOTE: Results may be misleading due to involvement in interactions
residence emmean SE df lower.CL upper.CL
rural_area 0.313 0.0791 2677 0.1579 0.468
north_east 0.338 0.1625 2677 0.0190 0.656
nothern_central 0.243 0.1494 2677 -0.0501 0.536
south 0.281 0.1514 2677 -0.0161 0.578
Results are averaged over the levels of: exper, health, union
Confidence level used: 0.95
This potential solution uses biglm::biglm() to fit the lm model and then uses emmeans::qdrg() with the nuisance argument specified. Does this approach help in your situation?
library(biglm)
library(emmeans)
## the biglm coefficients using factor() with all the `nr` levels have NAs,
## so restrict data to complete cases in the `biglm()` call
model_biglm <- biglm(wage ~ -1 +exper + residence+health + residence*union + factor(nr), data=Males[!is.na(Males$residence),])
summary(model_biglm)
## double check that biglm and lm give same/similar model
## summary(model_biglm)
## summary(model_lm)
summary(model_biglm)$rsq
summary(model_lm)$r.squared
identical(coef(model_biglm), coef(model_lm)) ## not identical! but plot the coefficients...
head(cbind(coef(model_biglm), coef(model_lm)))
tail(cbind(coef(model_biglm), coef(model_lm)))
plot(cbind(coef(model_biglm), coef(model_lm))); abline(0,1,col="blue")
## do a "[q]uick and [d]irty [r]eference [g]rid" and follow examples
### from ?qdrg and https://cran.r-project.org/web/packages/emmeans/vignettes/FAQs.html
rg1 <- qdrg(wage ~ -1 + exper + residence+health + residence*union + factor(nr),
data = Males,
coef = coef(model_biglm),
vcov = vcov(model_biglm),
df = model_biglm$df.resid,
nuisance="nr")
## Since we already specified nuisance in qdrg() we don't in emmeans():
emmeans(rg1, c("residence","union"))
Which gives:
> emmeans(rg1, c("residence","union"))
residence union emmean SE df lower.CL upper.CL
rural_area no 1.72 0.1417 2677 1.44 2.00
north_east no 1.67 0.0616 2677 1.55 1.79
nothern_central no 1.53 0.0397 2677 1.45 1.61
south no 1.60 0.0386 2677 1.52 1.68
rural_area yes 1.63 0.2011 2677 1.23 2.02
north_east yes 1.72 0.0651 2677 1.60 1.85
nothern_central yes 1.67 0.0503 2677 1.57 1.77
south yes 1.68 0.0460 2677 1.59 1.77
Results are averaged over the levels of: 1 nuisance factors, health
Confidence level used: 0.95

R post-hoc test brms

I have the following model:
prior1 <- c(
prior(normal(0, 50), class = b),
prior(exponential(0.1), class = sd),
prior(exponential(0.1), class = sigma))
BMvpa <- brm (
RT ~ 1 + GroupC*PrimeC*CongC*EmoC*SexC
+ (1 + CongC *PrimeC*EmoC || ID)
+ (1 + GroupC*PrimeC*CongC*EmoC*SexC || Target),
data = df1,
family = exgaussian(),
prior = prior1,
warmup = 2000, iter = 5000,
chains = 3,
cores= parallel::detectCores(),
sample_prior = TRUE
)
Here is part of my output:
Predictors Estimate SE Lower Upper Rhat BF01
1 Intercept 700.175 13.470 674.330 726.617 1.000 <NA>
2 Group 49.027 33.542 -17.261 115.122 1.000 0.51
3 Prime -12.197 2.816 -17.799 -6.655 1.000 0.017
4 Congruency -15.879 2.798 -21.507 -10.435 1.001 <0.001
5 Emotion 17.092 6.860 3.740 30.373 1.000 0.32
6 Sex 5.362 24.381 -42.347 52.871 1.001 2.031
...
21 Group x Prime x Emotion 22.339 8.509 5.464 39.136 1.001 0.24
...
Each variable (Group, Sex, Prime, Congruency, Emotion) is dichotomous, coded -0.5 (TD, M, LSF, ICG, Joy) or +0.5 (ASD, F, HSF, CG, Anger).
I would like to look in more detail at the interaction Group x Prime x Emotion (represented in the figure) and would like to know the posterior distribution of the effect of Prime for each group and each emotion.
I thought about 2 strategies.
1/ First, using emmeans:
BMGPE_emm <- emmeans(BMvpa, ~ GroupC:PrimeC:EmoC)
BMGPE_fac <- update(BMGPE_emm, levels =list(GroupC= c("TD","ASD"), PrimeC= c("LSF","HSF"), EmoC=c("joy","anger")))
contRT1 <- as.data.frame(contrast(BMGPE_fac, method = "pairwise", by = c("GroupC","EmoC")))
Output:
contrast GroupC EmoC estimate lower.HPD upper.HPD
1 LSF - HSF TD joy 8.706978 -0.3446188 18.33661
2 LSF - HSF ASD joy 20.487280 10.2622944 30.46115
3 LSF - HSF TD anger 15.029082 6.2702713 24.67623
4 LSF - HSF ASD anger 4.412052 -5.6749680 14.60393
I am not sure about this because I would have expected only negative estimates (HSF primes reducing response time).
Additionally, is it possible to compute a Bayes factor here?
2/ Second, using the hypothesis function (I read the blog post of Matti Vuorre). I think this would be the best approach, but the results I got are even stranger, and I think I probably made a mistake (I expected only negative estimates):
> hypothesis(BMvpa,c(qAJ = "PrimeC + 0.5*GroupC - 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qAJ 3.77 17.35 -30.51 37.79 3.56 0.78
---
> hypothesis(BMvpa,c(qAA = "PrimeC + 0.5*GroupC + 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qAA 20.86 17.46 -13.54 55.14 1.63 0.62
---
> hypothesis(BMvpa,c(qTJ = "PrimeC - 0.5*GroupC - 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qTJ -45.26 17.34 -78.93 -10.78 0.15 0.13 *
---
> hypothesis(BMvpa,c(qTA = "PrimeC - 0.5*GroupC - 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qTA -45.26 17.34 -78.93 -10.78 0.15 0.13 *
---
So my question is: how can I get the posterior distribution for the effect of Prime in each subgroup (and a BF)?
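As a sketch of one possible fix (not verified against this model; check the coefficient names with fixef(BMvpa) first): with the ±0.5 coding, the simple effect of Prime within a Group x Emotion cell is built from the Prime coefficient and its interactions with Group and Emotion, not from the Group and Emotion main effects. For the ASD/anger cell (both coded +0.5) that would be:
## hypothetical composition; the interaction names are assumed from the model formula
hypothesis(BMvpa,
  c(qAA = "PrimeC + 0.5*GroupC:PrimeC + 0.5*PrimeC:EmoC + 0.25*GroupC:PrimeC:EmoC = 0"))
Since the model was fit with sample_prior = TRUE, the Evid.Ratio column of hypothesis() already gives a (Savage-Dickey) Bayes factor for each such point hypothesis.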

How to change unit increment in hazard ratio from coxph and frailty model in R?

I ran a coxph model and a frailty model, but now I would like to express the hazard ratio for a continuous variable (age) per 5-unit increment instead of per 1-unit increment. Is there a function in R that can perform such a task? If so, does it also work for frailty models? I used the package frailtypack.
library('survival')
data(veteran)
cox <- coxph(Surv(time, status) ~ age, data = veteran)
summary(cox)
# Call:
# coxph(formula = Surv(time, status) ~ age, data = veteran)
#
# n= 137, number of events= 128
#
# coef exp(coef) se(coef) z Pr(>|z|)
# age 0.007500 1.007528 0.009565 0.784 0.433
#
# exp(coef) exp(-coef) lower .95 upper .95
# age 1.008 0.9925 0.9888 1.027
#
# Concordance= 0.515 (se = 0.029 )
# Likelihood ratio test= 0.63 on 1 df, p=0.4
# Wald test = 0.61 on 1 df, p=0.4
# Score (logrank) test = 0.62 on 1 df, p=0.4
Just add a new variable that represents the age group each subject belongs to; for example 1: 0-4, 2: 5-9, 3: 10-14, etc.
This is an example using the veteran dataset in the survival package. The data has a continuous variable age. Adding this as a predictor to the model will give you the relative risk (hazard ratio) for a one-year increase in age. If you are interested in an x-year increment, you should generate a new variable which groups subjects accordingly. For these data I applied the following grouping; group 1: younger than 40, group 2: 40 to <50, group 3: 50 to <60, group 4: 60 to <70, and group 5: 70 or older. With this grouping, the HR for a 10-year increment is 1.049; in other words, the risk increases by about 5% for every 10-year increase in age. Note that the association is not statistically significant.
library(survival)
data(veteran)
veteran$ageCat <- 5
veteran$ageCat[veteran$age < 70] <- 4
veteran$ageCat[veteran$age < 60] <- 3
veteran$ageCat[veteran$age < 50] <- 2
veteran$ageCat[veteran$age < 40] <- 1
table(veteran$ageCat)
1 2 3 4 5
11 20 22 72 12
cox <- coxph(Surv(time, status) ~ ageCat, data = veteran)
summary(cox)
Call:
coxph(formula = Surv(time, status) ~ ageCat, data = veteran)
n= 137, number of events= 128
coef exp(coef) se(coef) z Pr(>|z|)
ageCat 0.04793 1.04910 0.09265 0.517 0.605
exp(coef) exp(-coef) lower .95 upper .95
ageCat 1.049 0.9532 0.8749 1.258
Concordance= 0.509 (se = 0.028 )
Rsquare= 0.002 (max possible= 0.999 )
Likelihood ratio test= 0.27 on 1 df, p=0.6024
Wald test = 0.27 on 1 df, p=0.6049
Score (logrank) test = 0.27 on 1 df, p=0.6048
@milan's post answers a similar question, but not the one asked. Since age was split into decades and modeled as a continuous variable, the hazard ratio compares a subject's age decade to the next youngest decade. That is, the HR for subjects aged 51 vs 49 or 59 vs 41 would be the same, despite there being 2 or 18 years between them.
Anyway, the default, as you suggest, is a 1-unit increment in the continuous variable, age in this case. It's not always useful to compare subjects by a 1-unit change, especially when the range is much larger.
You can do the following, which is naive to the model, so it should work for lm, glm, survival::coxph, frailtypack::frailtyPenal, etc.
library('survival')
data(veteran)
## 1-year increase in age
cox <- coxph(Surv(time, status) ~ age, data = veteran)
exp(coef(cox))
# age
# 1.007528
For a multiplicative model like the Cox regression, you can get the x-unit change after the model is fit, since the hazard ratio for an x-unit increase is exp(x * beta) = exp(beta)^x:
## 5-year increase in age
exp(coef(cox)) ^ 5
# age
# 1.038211
## or equivalently
exp(coef(cox) * 5)
# age
# 1.038211
However, it's easier to create a variable for the age transformation and then fit the model:
## or you can create a variable to model
veteran <- within(veteran, {
age5 <- age / 5
})
cox5_1 <- coxph(Surv(time, status) ~ age5, data = veteran)
exp(coef(cox5_1))
# age5
# 1.038211
cox5_2 <- coxph(Surv(time, status) ~ I(age / 5), data = veteran)
exp(coef(cox5_2))
# I(age/5)
# 1.038211
Note that you need to use I here in the formula interface, since some operators have special meanings in formulae. For example, lm(mpg ~ wt - 1, mtcars) and lm(mpg ~ I(wt - 1), mtcars) are two different models.
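A quick way to see the difference:
coef(lm(mpg ~ wt - 1, mtcars))     # "-" removes the intercept: a single slope on wt
coef(lm(mpg ~ I(wt - 1), mtcars))  # I() shifts wt: an intercept plus a slope on (wt - 1)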
You can use these methods in other models, for example frailtyPenal if that is indeed the one you are using:
library('frailtypack')
fp <- frailtyPenal(Surv(time, status) ~ age, data = veteran, n.knots = 12, kappa = 1e5)
exp(fp$coef)
exp(fp$coef) ^ 5
fp5_1 <- frailtyPenal(Surv(time, status) ~ age5, data = veteran, n.knots = 12, kappa = 1e5)
fp5_2 <- frailtyPenal(Surv(time, status) ~ I(age / 5), data = veteran, n.knots = 12, kappa = 1e5)
exp(fp5_1$coef)
exp(fp5_2$coef)

R - Plm and lm - Fixed effects

I have a balanced panel data set, df, that essentially consists of three variables, A, B, and Y, that vary over time for a number of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second equation I specify the indices (region and year), the model type ('within', i.e. FE), and the nature of the FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I do not consider time fixed effects and use the argument effect = 'individual' instead.
What's the deal here? Am I missing something? Are there any other R packages that allow running the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in first differences (in Python):
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the plm package in R gives the same results for the first-difference model:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu,
               data = data, model = "fd")
No problem so far. However, the fixed-effects model reports different estimates.
modelfx <- plm(lrent ~ lpop + lavginc + pctstu, data = data,
               model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.

Doing calculations on summary elements

Is there an easy way to run follow-up mathematical calculations on elements of a summary? I have log-transformed data that is run through an ANOVA analysis. I would like to calculate the antilog of the summary output.
I have the following code:
require(multcomp)
inc <- log(Inc)
myanova <- aov(inc ~ educ)
tukey <- glht(myanova, linfct = mcp(educ = "Tukey"))
summary(tukey)
Which produces an output as follows:
Estimate Std. Error t value Pr(>|t|)
12 - under12 == 0 0.32787 0.08493 3.861 0.00104 **
13to15 - under12 == 0 0.49187 0.08775 5.606 < 0.001 ***
16 - under12 == 0 0.89775 0.09217 9.740 < 0.001 ***
over16 - under12 == 0 0.99856 0.09316 10.719 < 0.001 ***
13to15 - 12 == 0 0.16400 0.04674 3.509 0.00394 **
etc.
How can I easily execute an antilog calculation on the Estimate values?
This is a bit of a hack, so I'd recommend further checking, but if all you want is to see exponentiated estimates and standard errors, I think something similar to the following will work (I used different data).
> amod <- aov(breaks ~ tension, data = warpbreaks)
> tukey = glht(amod, linfct = mcp(tension = "Tukey"))
> tsum = summary(tukey)
> tsum[[10]]$coefficients = exp(tsum[[10]]$coefficients)
> tsum[[10]]$sigma = exp(tsum[[10]]$sigma)
> tsum
If you want to use coef(tukey) to give you the estimates, then you would reverse the transformation with:
exp(coef(tukey))
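If you also want the confidence limits on the back-transformed scale, exponentiating the simultaneous intervals is a cleaner route than modifying the summary internals (a small sketch):
ci <- confint(tukey)$confint  # matrix with columns Estimate, lwr, upr
exp(ci)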
I think this should work:
coef(tukey)
to get the estimated values. Here is an example:
amod <- aov(breaks ~ tension, data = warpbreaks)
tukey <- glht(amod, linfct = mcp(tension = "Tukey"))
Now, if you want to get all the Tukey summary elements, apply head (or tail) to the summary to get a named list with the summary elements.
head(summary(tukey))
$model
Call:
aov(formula = breaks ~ tension, data = warpbreaks)
Terms:
tension Residuals
Sum of Squares 2034.259 7198.556
Deg. of Freedom 2 51
Residual standard error: 11.88058
Estimated effects may be unbalanced
$linfct
(Intercept) tensionM tensionH
M - L 0 1 0
H - L 0 0 1
H - M 0 -1 1
attr(,"type")
[1] "Tukey"
$rhs
[1] 0 0 0
$coef
(Intercept) tensionM tensionH
36.38889 -10.00000 -14.72222
$vcov
(Intercept) tensionM tensionH
(Intercept) 7.841564 -7.841564 -7.841564
tensionM -7.841564 15.683128 7.841564
tensionH -7.841564 7.841564 15.683128
$df
[1] 51
