NaN and Inf in survreg model summary - R

I'm using survival analysis to evaluate the relative distance (instead of time, as is usually the case with survival statistics) before a given event happened. As the dataset I'm working with is quite big, you can download the .rds file of my dataset here
When modeling the relative distance using survreg(), I encountered NaN and Inf z and p values (presumably deriving from 0 values of Std Error) in the model summary:
Call:
survreg(formula = Surv(RelDistance, Status) ~ Backshore + LowerBSize +
I(LowerBSize^2) + I(LowerBSize^3) + State, data = DataLong,
dist = "exponential")
Value Std. Error z p
(Intercept) 2.65469 1.16e-01 22.9212 2.85e-116
BackshoreDune -0.08647 9.21e-02 -0.9387 3.48e-01
BackshoreForest / Tree (>3m) -0.07017 0.00e+00 -Inf 0.00e+00
BackshoreGrass - pasture -0.79275 1.63e-01 -4.8588 1.18e-06
BackshoreGrass - tussock -0.14687 1.00e-01 -1.4651 1.43e-01
BackshoreMangrove 0.08207 0.00e+00 Inf 0.00e+00
BackshoreSeawall -0.47019 1.43e-01 -3.2889 1.01e-03
BackshoreShrub (<3m) -0.14004 9.45e-02 -1.4815 1.38e-01
BackshoreUrban / Building 0.00000 0.00e+00 NaN NaN
LowerBSize -0.96034 1.96e-02 -49.0700 0.00e+00
I(LowerBSize^2) 0.06403 1.87e-03 34.2782 1.66e-257
I(LowerBSize^3) -0.00111 3.84e-05 -28.8070 1.75e-182
StateNT 0.14936 0.00e+00 Inf 0.00e+00
StateQLD 0.09533 1.02e-01 0.9384 3.48e-01
StateSA 0.01030 1.06e-01 0.0973 9.22e-01
StateTAS 0.19096 1.26e-01 1.5171 1.29e-01
StateVIC -0.07467 1.26e-01 -0.5917 5.54e-01
StateWA 0.24800 9.07e-02 2.7335 6.27e-03
Scale fixed at 1
Exponential distribution
Loglik(model)= -1423.4 Loglik(intercept only)= -3282.8
Chisq= 3718.86 on 17 degrees of freedom, p= 0
Number of Newton-Raphson Iterations: 6
n= 6350
I thought the Inf and NaN could be caused by data separation, and merged some levels of Backshore together:
levels(DataLong$Backshore)[levels(DataLong$Backshore)%in%c("Grass -
pasture", "Grass - tussock", "Shrub (<3m)")] <- "Grass - pasture & tussock
/ Shrub(<3m)"
levels(DataLong$Backshore)[levels(DataLong$Backshore)%in%c("Seawall",
"Urban / Building")] <- "Anthropogenic"
levels(DataLong$Backshore)[levels(DataLong$Backshore)%in%c("Forest / Tree
(>3m)", "Mangrove")] <- "Tree(>3m) / Mangrove"
but the issue persists when running the model again (e.g. for BackshoreTree(>3m) / Mangrove below).
Call:
survreg(formula = Surv(RelDistance, Status) ~ Backshore + LowerBSize +
I(LowerBSize^2) + I(LowerBSize^3) + State, data = DataLong,
dist = "exponential")
Value Std. Error z p
(Intercept) 2.6684 1.18e-01 22.551 1.32e-112
BackshoreDune -0.1323 9.43e-02 -1.402 1.61e-01
BackshoreTree(>3m) / Mangrove -0.0530 0.00e+00 -Inf 0.00e+00
BackshoreGrass - pasture & tussock / Shrub(<3m) -0.2273 8.95e-02 -2.540 1.11e-02
BackshoreAnthropogenic -0.5732 1.38e-01 -4.156 3.24e-05
LowerBSize -0.9568 1.96e-02 -48.920 0.00e+00
I(LowerBSize^2) 0.0639 1.87e-03 34.167 7.53e-256
I(LowerBSize^3) -0.0011 3.84e-05 -28.713 2.59e-181
StateNT 0.2892 0.00e+00 Inf 0.00e+00
StateQLD 0.0715 1.00e-01 0.713 4.76e-01
StateSA 0.0507 1.05e-01 0.482 6.30e-01
StateTAS 0.1990 1.26e-01 1.581 1.14e-01
StateVIC -0.0604 1.26e-01 -0.479 6.32e-01
StateWA 0.2709 9.05e-02 2.994 2.76e-03
Scale fixed at 1
Exponential distribution
Loglik(model)= -1428.4 Loglik(intercept only)= -3282.8
Chisq= 3708.81 on 13 degrees of freedom, p= 0
Number of Newton-Raphson Iterations: 6
n= 6350
I've looked for an explanation of this behaviour in the survival package documentation and online, but I couldn't find anything related to it.
Does anyone know what could be the cause of Inf and NaNs in this case?

The covariate LowerBSize perfectly predicts the Status outcome; Status==0 only if LowerBSize==0 and Status==1 only if LowerBSize>0.
table(DataLong$LowerBSize, DataLong$Status)
0 1
0 4996 0
1.2 0 271
2.4 0 331
4.9 0 256
9.6 0 155
19.2 0 148
36.3 0 193
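Equivalently, the perfect separation can be confirmed with a single logical check (a quick sketch, assuming DataLong is loaded):
with(DataLong, all((Status == 1) == (LowerBSize > 0)))  # TRUE: Status is fully determined by LowerBSize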
A convenient way to consider LowerBSize in the model is to include the binary variable LowerBSize>0:
survreg(formula = Surv(RelDistance, Status) ~ Backshore + State +
I(LowerBSize>0), data = DataLong, dist = "exponential")
Coefficients:
(Intercept) BackshoreDune BackshoreForest / Tree (>3m)
22.97248461 -0.04798348 -0.27440059
BackshoreGrass - pasture BackshoreGrass - tussock BackshoreMangrove
-0.33624746 -0.07545700 0.12020217
BackshoreSeawall BackshoreShrub (<3m) BackshoreUrban / Building
-0.01008893 -0.05115076 0.29125024
StateNT StateQLD StateSA
0.15385826 0.11617931 0.08405151
StateTAS StateVIC StateWA
0.14914393 0.08803225 0.06395311
I(LowerBSize > 0)TRUE
-23.75967069
Scale fixed at 1
Loglik(model)= -316.5 Loglik(intercept only)= -3282.8
Chisq= 5932.66 on 15 degrees of freedom, p= <2e-16
n= 6350

@MarcoSandri is correct that censoring is confounded with LowerBSize, but I'm not sure that's the entire solution. It could explain why the model is so unstable, but that by itself shouldn't (AFAICT) make the model ill-posed. If I replace LowerBSize + I(LowerBSize^2) + I(LowerBSize^3) with an orthogonal polynomial (poly(LowerBSize, 3)) I get more reasonable-looking answers:
ss3 <- survreg(formula = Surv(RelDistance, Status) ~ Backshore +
poly(LowerBSize,3) + State, data = DataLong,
dist = "exponential")
Call:
survreg(formula = Surv(RelDistance, Status) ~ Backshore + poly(LowerBSize,
3) + State, data = DataLong, dist = "exponential")
Value Std. Error z p
(Intercept) 2.18e+00 1.34e-01 16.28 < 2e-16
BackshoreDune -1.56e-01 1.06e-01 -1.47 0.14257
BackshoreForest / Tree (>3m) -2.24e-01 2.01e-01 -1.11 0.26549
BackshoreGrass - pasture -8.63e-01 1.74e-01 -4.97 6.7e-07
BackshoreGrass - tussock -2.14e-01 1.13e-01 -1.89 0.05829
BackshoreMangrove 3.66e-01 4.59e-01 0.80 0.42519
BackshoreSeawall -5.37e-01 1.53e-01 -3.51 0.00045
BackshoreShrub (<3m) -2.08e-01 1.08e-01 -1.92 0.05480
BackshoreUrban / Building -1.17e+00 3.22e-01 -3.64 0.00028
poly(LowerBSize, 3)1 -6.58e+01 1.41e+00 -46.63 < 2e-16
poly(LowerBSize, 3)2 5.09e+01 1.19e+00 42.72 < 2e-16
poly(LowerBSize, 3)3 -4.05e+01 1.41e+00 -28.73 < 2e-16
StateNT 2.61e-01 1.93e-01 1.35 0.17557
StateQLD 9.72e-02 1.12e-01 0.87 0.38452
StateSA -4.11e-04 1.15e-01 0.00 0.99715
StateTAS 1.91e-01 1.35e-01 1.42 0.15581
StateVIC -9.55e-02 1.35e-01 -0.71 0.47866
StateWA 2.46e-01 1.01e-01 2.44 0.01463
If I fit exactly the same model but with poly(LowerBSize,3,raw=TRUE) (calling the result ss4, see below) I get your pathologies again. Furthermore, the model with orthogonal polynomials actually fits better (has a higher log-likelihood):
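Here ss4 is the corresponding raw-polynomial fit; its call isn't shown above, but it would be exactly the same as ss3 with raw = TRUE:
ss4 <- survreg(formula = Surv(RelDistance, Status) ~ Backshore +
    poly(LowerBSize, 3, raw = TRUE) + State, data = DataLong,
    dist = "exponential")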
logLik(ss4)
## 'log Lik.' -1423.382 (df=18)
logLik(ss3)
## 'log Lik.' -1417.671 (df=18)
In a perfect mathematical/computational world, this shouldn't be true - it's another indication that there's something unstable about specifying the LowerBSize effects this way. I'm a little surprised this happens - the number of unique values of LowerBSize is small but shouldn't be pathological, and the range of values isn't huge or far from zero ...
I still can't say what's really causing this, but the proximal problem is probably the strong correlation between the linear/quadratic/cubic terms: for something simpler like linear regression a correlation of 0.993 (between quad & cubic terms) doesn't cause severe problems, but the more complicated the numerical problem (e.g. survival analysis vs. linear regression), the more correlation can be an issue ...
X <- model.matrix( ~ Backshore + LowerBSize +
I(LowerBSize^2) + I(LowerBSize^3) + State,
data=DataLong)
print(cor(X[,grep("LowerBSize",colnames(X))]),digits=3)
library(corrplot)
png("survcorr.png")
corrplot.mixed(cor(X[,-1]),lower="ellipse",upper="number",
tl.cex=0.4)
dev.off()
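For comparison, the columns returned by poly() are orthogonal by construction, which is presumably why ss3 is better behaved. A quick check (assuming DataLong is loaded):
## raw polynomial terms: pairwise correlations close to 1
round(cor(cbind(LowerBSize  = DataLong$LowerBSize,
                LowerBSize2 = DataLong$LowerBSize^2,
                LowerBSize3 = DataLong$LowerBSize^3)), 3)
## orthogonal polynomial terms: pairwise correlations essentially 0
round(cor(poly(DataLong$LowerBSize, 3)), 3)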

Related

Different results of coxph with time-varying coefficients between Stata and R

I'm hoping someone can shed some light on the following. I have been attempting to replicate a Cox PH model from Stata in R. As you can see below, I get the same results for Cox PH models without time-varying coefficients (tvcs) in both programs:
Stata Cox PH model
stset date_endpoint, failure(cause_endpoint2==4) exit(failure) origin(time capture_date) id(wolf_ID)
id: wolf_ID
failure event: cause_endpoint2 == 4
obs. time interval: (date_endpoint[_n-1], date_endpoint]
exit on or before: failure
t for analysis: (time-origin)
origin: time capture_date
--------------------------------------------------------------------------
5,664 total observations
0 exclusions
--------------------------------------------------------------------------
5,664 observations remaining, representing
513 subjects
231 failures in single-failure-per-subject data
279,430.5 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 3,051
stcox deer_hunt bear_hunt, strata(winter lib_kill) efron robust cluster(wolf_ID)
failure _d: cause_endpoint2 == 4
analysis time _t: (date_endpoint-origin)
origin: time capture_date
id: wolf_ID
Iteration 0: log pseudolikelihood = -993.65252
Iteration 1: log pseudolikelihood = -992.55768
Iteration 2: log pseudolikelihood = -992.55733
Refining estimates:
Iteration 0: log pseudolikelihood = -992.55733
Stratified Cox regr. -- Efron method for ties
No. of subjects = 513 Number of obs = 5,664
No. of failures = 231
Time at risk = 279430.5
Wald chi2(2) = 2.21
Log pseudolikelihood = -992.55733 Prob > chi2 = 0.3317
(Std. Err. adjusted for 513 clusters in wolf_ID)
--------------------------------------------------------------------------
| Robust
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf.Interval]
-------------+------------------------------------------------------------
deer_hunt | .7860433 .1508714 -1.25 0.210 .539596 1.145049
bear_hunt | 1.21915 .2687211 0.90 0.369 .7914762 1.877916
--------------------------------------------------------------------------
Stratified by winter lib_kill
R Cox PH model
> LTF.coxph <- coxph(Surv(`_t0`,`_t`, endpoint_r_enc=="ltf") ~ deer_hunt
+ + bear_hunt + strata(winter, lib_kill), data=statadta, ties = "efron", id = wolf_ID)
> summary(LTF.coxph)
Call:
coxph(formula = Surv(`_t0`, `_t`, endpoint_r_enc == "ltf") ~
deer_hunt + bear_hunt + strata(winter, lib_kill), data = statadta,
ties = "efron", id = wolf_ID)
n= 5664, number of events= 231
coef exp(coef) se(coef) z Pr(>|z|)
deer_hunt -0.2407 0.7860 0.1849 -1.302 0.193
bear_hunt 0.1982 1.2191 0.2174 0.911 0.362
exp(coef) exp(-coef) lower .95 upper .95
deer_hunt 0.786 1.2722 0.5471 1.129
bear_hunt 1.219 0.8202 0.7962 1.867
Concordance= 0.515 (se = 0.022 )
Likelihood ratio test= 2.19 on 2 df, p=0.3
Wald test = 2.21 on 2 df, p=0.3
Score (logrank) test = 2.22 on 2 df, p=0.3
> cox.zph(LTF.coxph)
chisq df p
deer_hunt 5.5773 1 0.018
bear_hunt 0.0762 1 0.783
GLOBAL 5.5773 2 0.062
The problem I have is that results look very different when adding a time-varying coefficient (tvc() in Stata and tt() in R) for one of the variables in my model. Nothing is the same between models (coefficients for all variables or their significance).
Stata Cox PH model with tvc()
stcox deer_hunt bear_hunt, tvc(deer_hunt) strata(winter lib_kill) efron robust cluster(wolf_ID)
failure _d: cause_endpoint2 == 4
analysis time _t: (date_endpoint-origin)
origin: time capture_date
id: wolf_ID
Iteration 0: log pseudolikelihood = -993.65252
Iteration 1: log pseudolikelihood = -990.70475
Iteration 2: log pseudolikelihood = -990.69386
Iteration 3: log pseudolikelihood = -990.69386
Refining estimates:
Iteration 0: log pseudolikelihood = -990.69386
Stratified Cox regr. -- Efron method for ties
No. of subjects = 513 Number of obs = 5,664
No. of failures = 231
Time at risk = 279430.5
Wald chi2(3) = 4.72
Log pseudolikelihood = -990.69386 Prob > chi2 = 0.1932
(Std. Err. adjusted for 513 clusters in wolf_ID)
--------------------------------------------------------------------------
| Robust
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf Interval]
-------------+------------------------------------------------------------
main |
deer_hunt | 1.043941 .2643779 0.17 0.865 .6354908 1.714915
bear_hunt | 1.204522 .2647525 0.85 0.397 .7829279 1.853138
-------------+------------------------------------------------------------
tvc |
deer_hunt | .9992683 .0004286 -1.71 0.088 .9984287 1.000109
------------------------------------------------------------------------------
Stratified by winter lib_kill
Note: Variables in tvc equation interacted with _t.
R Cox PH model with tt()
> LTF.tvc1.coxph <- coxph(Surv(`_t0`,`_t`, endpoint_r_enc=="ltf") ~ deer_hunt + bear_hunt + tt(deer_hunt) + strata(winter, lib_kill),
+ data=statadta, ties = "efron", id = wolf_ID, cluster = wolf_ID,
+ tt=function(x,t,...){x*t})
> summary(LTF.tvc1.coxph)
Call:
coxph(formula = Surv(`_t0`, `_t`, endpoint_r_enc == "ltf") ~
deer_hunt + bear_hunt + tt(deer_hunt) + strata(winter, lib_kill),
data = statadta, ties = "efron", tt = function(x, t, ...) {
x * t
}, id = wolf_ID, cluster = wolf_ID)
n= 5664, number of events= 231
coef exp(coef) se(coef) robust se z Pr(>|z|)
deer_hunt 0.4741848 1.6067039 0.2082257 0.2079728 2.280 0.02261 *
bear_hunt -0.7923208 0.4527927 0.1894531 0.1890497 -4.191 2.78e-05 ***
tt(deer_hunt) -0.0009312 0.9990692 0.0003442 0.0003612 -2.578 0.00994 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
deer_hunt 1.6067 0.6224 1.0688 2.4153
bear_hunt 0.4528 2.2085 0.3126 0.6559
tt(deer_hunt) 0.9991 1.0009 0.9984 0.9998
Concordance= 0.634 (se = 0.02 )
Likelihood ratio test= 28.29 on 3 df, p=3e-06
Wald test = 25.6 on 3 df, p=1e-05
Score (logrank) test = 26.19 on 3 df, p=9e-06, Robust = 32.6 p=4e-07
Moreover, I checked this post before posting, but I did not find it very helpful. The 'noadjust' Stata option was useful for the standard errors, but it does not address my main issue: the covariate coefficients for both the main and time-varying effects differ between programs when I add the time-varying effect to the Cox model in each program (using the same formula to calculate the time-varying effect). That is really my main concern: the difference in covariate estimates seems substantial and would result in different conclusions, I believe.
I have been unable to figure out what is happening there, and am hoping the community can help.

Recreate spss GEE regression table in R

I have the (sample) dataset below:
round<-c( 0.125150, 0.045800, -0.955299, -0.232007, 0.120880, -0.041525, 0.290473, -0.648752, 0.113264, -0.403685)
square<-c(-0.634753, 0.000492, -0.178591, -0.202462, -0.592054, -0.583173, -0.632375, -0.176673, -0.680557, -0.062127)
ideo<-c(0,1,0,1,0,1,0,0,1,1)
ex<-data.frame(round,square,ideo)
When I ran the GEE regression in SPSS I got this table as a result.
I used the packages gee and geepack in R to run the same analysis and got these results:
#gee
summary(gee(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 1.0541 0.4099 2.572 0.1328 7.937
square 1.1811 0.8321 1.419 0.4095 2.884
round 0.7072 0.5670 1.247 0.1593 4.439
#geepack
summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 1.054 0.133 63.00 2.1e-15 ***
square 1.181 0.410 8.32 0.0039 **
round 0.707 0.159 19.70 9.0e-06 ***
---
I would like to recreate exactly the SPSS table (not the results, since I use a subset of the original dataset), but I do not know how to obtain all of these results.
A tiny bit of tidyverse magic can get the same results - more or less.
Get the information from coef(summary(geeglm())) and compute the necessary columns:
library("tidyverse")
library("geepack")
coef(summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))) %>%
mutate(lowerWald = Estimate-1.96*Std.err, # Lower Wald CI
upperWald=Estimate+1.96*Std.err, # Upper Wald CI
df=1,
ExpBeta = exp(Estimate)) %>% # Transformed estimate
mutate(lWald=exp(lowerWald), # Lower transformed
uWald=exp(upperWald)) # Upper transformed
This produces the following (with the data you provided). The order and the names of the columns could be modified to suit your needs.
Estimate Std.err Wald Pr(>|W|) lowerWald upperWald df ExpBeta lWald uWald
1 1.0541 0.1328 62.997 2.109e-15 0.7938 1.314 1 2.869 2.212 3.723
2 1.1811 0.4095 8.318 3.925e-03 0.3784 1.984 1 3.258 1.460 7.270
3 0.7072 0.1593 19.704 9.042e-06 0.3949 1.019 1 2.028 1.484 2.772

R - Plm and lm - Fixed effects

I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, which vary over time for a bunch of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I do not include time fixed effects and instead use the argument effect = 'individual'.
What's the deal here? Am I missing something? Are there any other R packages that allow me to run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
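A quick way to confirm that the three approaches agree is to line the slope estimates up side by side (a sketch using the model objects created above):
## the region/year dummies differ across methods, but the slopes on a and b should match
cbind(lm   = coef(model.a)[c("a", "b")],
      plm  = coef(model.b),
      felm = coef(model.c))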
This does not seem to be a data issue.
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in differences (in Python)
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the PLM package in R gives the same results for the first-difference models:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu,
               data = data, model = "fd")
No problem so far. However, the fixed effect reports different estimates.
modelfx <- plm(lrent ~ lpop + lavginc + pctstu, data = data,
               model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.

Calculating OR for logistic regression using rms

I have a logistic regression model, for which I have been using the rms package. The model fits best using a log term for tn1, and for clinical interpretation I’m using log2. I ran the model using lrm from the rms package, and then to double check, I ran it using glm. The initial coefficients are the same:
h <- lrm(formula = outcomehosp ~ I(log2(tn1 + 0.001)) + apscore_ad +
emsurg + corrapiidiag, data = d, x = TRUE, y = TRUE)
Coef S.E. Wald Z Pr(>|Z|)
Intercept -3.4570 0.3832 -9.02 <0.0001
tn1 0.0469 0.0180 2.60 0.0093
apscore_ad 0.1449 0.0127 11.44 <0.0001
emsurg 0.0731 0.3228 0.23 0.8208
f <- glm(formula = outcomehosp ~ apscore_ad + emsurg + corrapiidiag +
I(log2(tn1 + 0.001)), family = binomial(), data = tn24)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.45699 0.38315 -9.023 < 2e-16
I(log2(tn1 + 0.001)) 0.04690 0.01804 2.600 0.00932
apscore_ad 0.14487 0.01267 11.438 < 2e-16
emsurg 0.07310 0.32277 0.226 0.82082
However, when I try to get the odds ratios, they are noticeably different for tn1 between the two models, and this doesn't seem to be explained by the log2 transformation alone.
summary(h)
Effects Response : outcomehosp
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
tn1 0 0.21 0.21 0.362120 0.15417 6.5300e-02 0.673990
Odds Ratio 0 0.21 0.21 1.436400 NA 1.0675e+00 1.962100
apscore_ad 14 25.00 11.00 1.593600 0.15631 1.3605e+00 1.961000
Odds Ratio 14 25.00 11.00 4.921400 NA 3.8981e+00 7.106600
emsurg 0 1.00 1.00 0.073103 0.33051 -5.8224e-01 0.734860
Odds Ratio 0 1.00 1.00 1.075800 NA 5.5865e-01 2.085200
exp(f$coefficients)
(Intercept) 0.03152467
apscore_ad 1.15589222
emsurg 1.07584115
I(log2(tn1 + 0.001)) 1.04802
Would anyone be able to explain what the rms package is calculating the odds ratio of? Many thanks.
The tn1 effect from summary(h) is the log odds ratio for tn1 going from 0 to 0.21 -- the inter-quartile range. See ?summary.rms.
So the effect in the first row of summary(h) is 0.36212 = (log2(0.21 + 0.001) - log2(0 + 0.001)) * 0.0469.
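You can verify this by hand from the numbers reported above (a quick check; 0.0469 is the fitted coefficient and 0 and 0.21 are the quartiles summary.rms used):
b   <- 0.0469                                    # coefficient for I(log2(tn1 + 0.001))
eff <- (log2(0.21 + 0.001) - log2(0 + 0.001)) * b
eff        # ~0.362, the "Effect" reported for tn1
exp(eff)   # ~1.44, the "Odds Ratio" reported for tn1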

Error with post hoc tests in ANCOVA with factorial design

Does anyone know how to do post hoc tests in an ANCOVA model with a factorial design?
I have two vectors consisting of 23 baseline values (the covariate) and 23 values after treatment (the dependent variable), and I have two factors, each with two levels. I created an ANCOVA model and calculated the adjusted means, standard errors and confidence intervals. Example:
library(effects)
baseline = c(0.7672,1.846,0.6487,0.4517,0.5599,0.2255,0.5946,1.435,0.5374,0.4901,1.258,0.5445,1.078,1.142,0.5,1.044,0.7824,1.059,0.6802,0.8003,0.5547,1.003,0.9213)
after_treatment = c(0.4222,1.442,0.8436,0.5544,0.8818,0.08789,0.6291,1.23,0.4093,0.7828,-0.04061,0.8686,0.8525,0.8036,0.3758,0.8531,0.2897,0.8127,1.213,0.05276,0.7364,1.001,0.8974)
age = factor(c(rep(c("Young","Old"),11),"Young"))
treatment = factor(c(rep("Drug",12),rep("Placebo",11)))
ANC = aov(after_treatment ~ baseline + treatment*age)
effect_treatage = effect("treatment*age",ANC)
data.frame(effect_treatage)
treatment age fit se lower upper
1 Drug Old 0.8232137 0.1455190 0.5174897 1.1289377
2 Placebo Old 0.6168641 0.1643178 0.2716452 0.9620831
3 Drug Young 0.5689036 0.1469175 0.2602413 0.8775659
4 Placebo Young 0.7603360 0.1462715 0.4530309 1.0676410
Now I want test if there is a difference between the adjusted means of
Young-Placebo:Young-Drug
Old-Placebo:Old-Drug
Young-Placebo:Old-Drug
Old-Placebo:Young-Drug
So I tried:
library(multcomp)
pH = glht(ANC, linfct = mcp(treatment*age="Tukey"))
# Error: unexpected '=' in "ph = glht(ANC_nback, linfct = mcp(treat*age="
And:
pH = TukeyHSD(ANC)
# Error in rep.int(n, length(means)) : unimplemented type 'NULL' in 'rep3'
# In addition: Warning message:
# In replications(paste("~", xx), data = mf) : non-factors ignored: baseline
Does anyone know how to resolve this?
Many thanks!
PS for more info see
R: How to graphically plot adjusted means, SE, CI ANCOVA
If you wish to use multcomp, you can take advantage of the seamless interface between the lsmeans and multcomp packages (see ?lsm), where lsmeans provides support for glht().
baseline = c(0.7672,1.846,0.6487,0.4517,0.5599,0.2255,0.5946,1.435,0.5374,0.4901,1.258,0.5445,1.078,1.142,0.5,1.044,0.7824,1.059,0.6802,0.8003,0.5547,1.003,0.9213)
after_treatment = c(0.4222,1.442,0.8436,0.5544,0.8818,0.08789,0.6291,1.23,0.4093,0.7828,-0.04061,0.8686,0.8525,0.8036,0.3758,0.8531,0.2897,0.8127,1.213,0.05276,0.7364,1.001,0.8974)
age = factor(c(rep(c("Young","Old"),11),"Young"))
treatment = factor(c(rep("Drug",12),rep("Placebo",11)))
Treat <- data.frame(baseline, after_treatment, age, treatment)
ANC <- aov(after_treatment ~ baseline + treatment*age, data=Treat)
library(multcomp)
library(lsmeans)
summary(glht(ANC, linfct = lsm(pairwise ~ treatment * age)))
## Note: df set to 18
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: aov(formula = after_treatment ~ baseline + treatment * age, data = Treat)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Drug,Old - Placebo,Old == 0 0.20635 0.21913 0.942 0.783
## Drug,Old - Drug,Young == 0 0.25431 0.20698 1.229 0.617
## Drug,Old - Placebo,Young == 0 0.06288 0.20647 0.305 0.990
## Placebo,Old - Drug,Young == 0 0.04796 0.22407 0.214 0.996
## Placebo,Old - Placebo,Young == 0 -0.14347 0.22269 -0.644 0.916
## Drug,Young - Placebo,Young == 0 -0.19143 0.20585 -0.930 0.789
## (Adjusted p values reported -- single-step method)
This eliminates the need for reparametrization. You can achieve the same results by using lsmeans alone:
lsmeans(ANC, list(pairwise ~ treatment * age))
## $`lsmeans of treatment, age`
## treatment age lsmean SE df lower.CL upper.CL
## Drug Old 0.8232137 0.1455190 18 0.5174897 1.1289377
## Placebo Old 0.6168641 0.1643178 18 0.2716452 0.9620831
## Drug Young 0.5689036 0.1469175 18 0.2602413 0.8775659
## Placebo Young 0.7603360 0.1462715 18 0.4530309 1.0676410
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## Drug,Old - Placebo,Old 0.20634956 0.2191261 18 0.942 0.7831
## Drug,Old - Drug,Young 0.25431011 0.2069829 18 1.229 0.6175
## Drug,Old - Placebo,Young 0.06287773 0.2064728 18 0.305 0.9899
## Placebo,Old - Drug,Young 0.04796056 0.2240713 18 0.214 0.9964
## Placebo,Old - Placebo,Young -0.14347183 0.2226876 18 -0.644 0.9162
## Drug,Young - Placebo,Young -0.19143238 0.2058455 18 -0.930 0.7893
##
## P value adjustment: tukey method for comparing a family of 4 estimates
You need to use the which argument in TukeyHSD; it "list[s] terms in the fitted model for which the intervals should be calculated". This is needed because you have a non-factor variable in the model (baseline). That variable causes trouble when it is included, which is the default when which is not specified.
ANC = aov(after_treatment ~ baseline + treatment*age)
TukeyHSD(ANC, which = c("treatment:age"))
If you wish to use the more flexible glht, see section 3 (page 8) here.
Reparametrization is a possibility here:
treatAge <- interaction(treatment, age)
ANC1 <- aov(after_treatment ~ baseline + treatAge)
#fits are equivalent:
all.equal(logLik(ANC), logLik(ANC1))
#[1] TRUE
library(multcomp)
summary(glht(ANC1, linfct = mcp(treatAge="Tukey")))
# Simultaneous Tests for General Linear Hypotheses
#
#Multiple Comparisons of Means: Tukey Contrasts
#
#
#Fit: aov(formula = after_treatment ~ baseline + treatAge)
#
#Linear Hypotheses:
# Estimate Std. Error t value Pr(>|t|)
#Placebo.Old - Drug.Old == 0 -0.20635 0.21913 -0.942 0.783
#Drug.Young - Drug.Old == 0 -0.25431 0.20698 -1.229 0.617
#Placebo.Young - Drug.Old == 0 -0.06288 0.20647 -0.305 0.990
#Drug.Young - Placebo.Old == 0 -0.04796 0.22407 -0.214 0.996
#Placebo.Young - Placebo.Old == 0 0.14347 0.22269 0.644 0.916
#Placebo.Young - Drug.Young == 0 0.19143 0.20585 0.930 0.789
#(Adjusted p values reported -- single-step method)
