Different results of coxph with time-varying coefficients between Stata and R - r

I'm hoping any of you could shed some light on the following. I have been attempting to replicate a Cox PH model from Stata in R. As you can see below, I get the same results for Cox PH models without tvcs in both programs:
Stata Cox PH model
stset date_endpoint, failure(cause_endpoint2==4) exit(failure) origin(time capture_date) id(wolf_ID)
id: wolf_ID
failure event: cause_endpoint2 == 4
obs. time interval: (date_endpoint[_n-1], date_endpoint]
exit on or before: failure
t for analysis: (time-origin)
origin: time capture_date
--------------------------------------------------------------------------
5,664 total observations
0 exclusions
--------------------------------------------------------------------------
5,664 observations remaining, representing
513 subjects
231 failures in single-failure-per-subject data
279,430.5 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 3,051
stcox deer_hunt bear_hunt, strata(winter lib_kill) efron robust cluster(wolf_ID)
failure _d: cause_endpoint2 == 4
analysis time _t: (date_endpoint-origin)
origin: time capture_date
id: wolf_ID
Iteration 0: log pseudolikelihood = -993.65252
Iteration 1: log pseudolikelihood = -992.55768
Iteration 2: log pseudolikelihood = -992.55733
Refining estimates:
Iteration 0: log pseudolikelihood = -992.55733
Stratified Cox regr. -- Efron method for ties
No. of subjects = 513 Number of obs = 5,664
No. of failures = 231
Time at risk = 279430.5
Wald chi2(2) = 2.21
Log pseudolikelihood = -992.55733 Prob > chi2 = 0.3317
(Std. Err. adjusted for 513 clusters in wolf_ID)
--------------------------------------------------------------------------
| Robust
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf.Interval]
-------------+------------------------------------------------------------
deer_hunt | .7860433 .1508714 -1.25 0.210 .539596 1.145049
bear_hunt | 1.21915 .2687211 0.90 0.369 .7914762 1.877916
--------------------------------------------------------------------------
Stratified by winter lib_kill
R Cox PH model
> LTF.coxph <- coxph(Surv(`_t0`,`_t`, endpoint_r_enc=="ltf") ~ deer_hunt
+ + bear_hunt + strata(winter, lib_kill), data=statadta, ties = "efron", id = wolf_ID)
> summary(LTF.coxph)
Call:
coxph(formula = Surv(`_t0`, `_t`, endpoint_r_enc == "ltf") ~
deer_hunt + bear_hunt + strata(winter, lib_kill), data = statadta,
ties = "efron", id = wolf_ID)
n= 5664, number of events= 231
coef exp(coef) se(coef) z Pr(>|z|)
deer_hunt -0.2407 0.7860 0.1849 -1.302 0.193
bear_hunt 0.1982 1.2191 0.2174 0.911 0.362
exp(coef) exp(-coef) lower .95 upper .95
deer_hunt 0.786 1.2722 0.5471 1.129
bear_hunt 1.219 0.8202 0.7962 1.867
Concordance= 0.515 (se = 0.022 )
Likelihood ratio test= 2.19 on 2 df, p=0.3
Wald test = 2.21 on 2 df, p=0.3
Score (logrank) test = 2.22 on 2 df, p=0.3
> cox.zph(LTF.coxph)
chisq df p
deer_hunt 5.5773 1 0.018
bear_hunt 0.0762 1 0.783
GLOBAL 5.5773 2 0.062
The problem I have is that results look very different when adding a time-varying coefficient (tvc() in Stata and tt() in R) for one of the variables in my model. Nothing is the same between models (coefficients for all variables or their significance).
Stata Cox PH model with tvc()
stcox deer_hunt bear_hunt, tvc(deer_hunt) strata(winter lib_kill) efron robust cluster(wolf_ID)
failure _d: cause_endpoint2 == 4
analysis time _t: (date_endpoint-origin)
origin: time capture_date
id: wolf_ID
Iteration 0: log pseudolikelihood = -993.65252
Iteration 1: log pseudolikelihood = -990.70475
Iteration 2: log pseudolikelihood = -990.69386
Iteration 3: log pseudolikelihood = -990.69386
Refining estimates:
Iteration 0: log pseudolikelihood = -990.69386
Stratified Cox regr. -- Efron method for ties
No. of subjects = 513 Number of obs = 5,664
No. of failures = 231
Time at risk = 279430.5
Wald chi2(3) = 4.72
Log pseudolikelihood = -990.69386 Prob > chi2 = 0.1932
(Std. Err. adjusted for 513 clusters in wolf_ID)
--------------------------------------------------------------------------
| Robust
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf Interval]
-------------+------------------------------------------------------------
main |
deer_hunt | 1.043941 .2643779 0.17 0.865 .6354908 1.714915
bear_hunt | 1.204522 .2647525 0.85 0.397 .7829279 1.853138
-------------+------------------------------------------------------------
tvc |
deer_hunt | .9992683 .0004286 -1.71 0.088 .9984287 1.000109
------------------------------------------------------------------------------
Stratified by winter lib_kill
Note: Variables in tvc equation interacted with _t.
R Cox PH model with tt()
> LTF.tvc1.coxph <- coxph(Surv(`_t0`,`_t`, endpoint_r_enc=="ltf") ~ deer_hunt + bear_hunt + tt(deer_hunt) + strata(winter, lib_kill),
+ data=statadta, ties = "efron", id = wolf_ID, cluster = wolf_ID,
+ tt=function(x,t,...){x*t})
> summary(LTF.tvc1.coxph)
Call:
coxph(formula = Surv(`_t0`, `_t`, endpoint_r_enc == "ltf") ~
deer_hunt + bear_hunt + tt(deer_hunt) + strata(winter, lib_kill),
data = statadta, ties = "efron", tt = function(x, t, ...) {
x * t
}, id = wolf_ID, cluster = wolf_ID)
n= 5664, number of events= 231
coef exp(coef) se(coef) robust se z Pr(>|z|)
deer_hunt 0.4741848 1.6067039 0.2082257 0.2079728 2.280 0.02261 *
bear_hunt -0.7923208 0.4527927 0.1894531 0.1890497 -4.191 2.78e-05 ***
tt(deer_hunt)-0.0009312 0.9990692 0.0003442 0.0003612 -2.578 0.00994 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
deer_hunt 1.6067 0.6224 1.0688 2.4153
bear_hunt 0.4528 2.2085 0.3126 0.6559
tt(deer_hunt) 0.9991 1.0009 0.9984 0.9998
Concordance= 0.634 (se = 0.02 )
Likelihood ratio test= 28.29 on 3 df, p=3e-06
Wald test = 25.6 on 3 df, p=1e-05
Score (logrank) test = 26.19 on 3 df, p=9e-06, Robust = 32.6 p=4e-07
Moreover, I checked this post before posting this because I did not find it very helpful. The 'noadjust' Stata command was useful for SEs, but it does not answer my main issue of also getting different covariate coefficients between programs for the main and time-varying effects when I add those time-varying effects to the Cox model in each program (and the same formula for calculating the time-varying effects). That is really my main concern: the difference in covariate estimates seems substantial and would result in different prescriptions, I believe
I have been unable to figure out what is happening there, and am hoping the community can help.

Related

R post-hoc test brms

I have the following model:
prior1 <- c(
prior(normal(0, 50), class = b),
prior(exponential(0.1), class = sd),
prior(exponential(0.1), class = sigma))
BMvpa <- brm (
RT ~ 1 + GroupC*PrimeC*CongC*EmoC*SexC
+ (1 + CongC *PrimeC*EmoC || ID)
+ (1 + GroupC*PrimeC*CongC*EmoC*SexC || Target),
data = df1,
family = exgaussian(),
prior = prior1,
warmup = 2000, iter = 5000,
chains = 3,
cores= parallel::detectCores(),
sample_prior = TRUE
)
Here is part of my output:
Predictors Estimate SE Lower Upper Rhat BF01
1 Intercept 700.175 13.470 674.330 726.617 1.000 <NA>
2 Group 49.027 33.542 -17.261 115.122 1.000 0.51
3 Prime -12.197 2.816 -17.799 -6.655 1.000 0.017
4 Congruency -15.879 2.798 -21.507 -10.435 1.001 <0.001
5 Emotion 17.092 6.860 3.740 30.373 1.000 0.32
6 Sex 5.362 24.381 -42.347 52.871 1.001 2.031
...
21 Group x Prime x Emotion 22.339 8.509 5.464 39.136 1.001 0.24
...
Each variable (Group, Sex, Prime, Congruency , Emotion) is dichotomous and coded -0.5 (TD, M, LSF, ICG, Joy), +0.5 (ASD, F, HSF, CG, Anger).
I would like to go more in details in the interaction Group x Prime x Emotion (represented on the figure) and would like to know the posterior distribution regarding the effect of Prime for each group and each emotion.
I thought about 2 strategies.
1/First using emmeans:
BMGPE_emm <- emmeans(BMvpa, ~ GroupC:PrimeC:EmoC)
BMGPE_fac <- update(BMGPE_emm, levels =list(GroupC= c("TD","ASD"), PrimeC= c("LSF","HSF"), EmoC=c("joy","anger")))
contRT1 <- as.data.frame(contrast(BMGPE_fac, method = "pairwise", by = c("GroupC","EmoC")))
Output:
1 LSF - HSF TD joy 8.706978 -0.3446188 18.33661
2 LSF - HSF ASD joy 20.487280 10.2622944 30.46115
3 LSF - HSF TD anger 15.029082 6.2702713 24.67623
4 LSF - HSF ASD anger 4.412052 -5.6749680 14.60393
I am not sure about this because I would have expected only negative estimates (HSF primes reducing response time).
Additionally, is there a possibility to compute a Bayes factor here?
2/ Second, using the hypothesisfunction (I read the blogpost of Matti Vuorre). I think it would be the best but the result I got are even more strange and I think I probably made a mistake (I was expected only negative estimates) :
> hypothesis(BMvpa,c(qAJ = "PrimeC + 0.5*GroupC - 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qAJ 3.77 17.35 -30.51 37.79 3.56 0.78
---
> hypothesis(BMvpa,c(qAA = "PrimeC + 0.5*GroupC + 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qAA 20.86 17.46 -13.54 55.14 1.63 0.62
---
> hypothesis(BMvpa,c(qTJ = "PrimeC - 0.5*GroupC - 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qTJ -45.26 17.34 -78.93 -10.78 0.15 0.13 *
---
> hypothesis(BMvpa,c(qTA = "PrimeC - 0.5*GroupC - 0.5*EmoC = 0"))
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 qTA -45.26 17.34 -78.93 -10.78 0.15 0.13 *
---
So my question is: how could I have the posterior distribution for the efefct of prime in each subgroup (and BF).

How to fix "ran out of iterations and did not converge or more coefficientsmay be infinite" in coxph function?

I want to write code for calculating hazard rate using coxph for a dataset. This data has 5 variables, 2 of them are used in Surv(), and two of them are used as covariates. Now I can write the function which can simply calculate hazard rate for two covariates after input dataname. However, when I want to calculate hazard ratio using same function for 3 covariates, the program said "run out of iterations and did not converge or more coefficients maybe infinite",and the result contains all five variables as covariates (which should be three). Here is my code, can anyone correct it? Thanks!
library(KMsurv)
library(survival)
data(larynx)
larynx2 = larynx[,c(2,5,1,3,4)]
larynx2$stage = as.factor(larynx2$stage)
mod = function(dataname){
fit = coxph(Surv(dataname[,1],dataname[,2]) ~ ., data = dataname, ties = "breslow")
return(list(result = summary(fit)))
}
mod(larynx2)
How about this? Since column names in the formula works, we build the formula dynamically using the column names:
mod = function(dataname) {
form = as.formula(sprintf("Surv(%s, %s) ~ .", names(dataname)[1], names(dataname)[2]))
fit = coxph(form, data = dataname, ties = "breslow")
return(list(result = summary(fit)))
}
mod(larynx2)
# $result
# Call:
# coxph(formula = form, data = dataname, ties = "breslow")
#
# n= 90, number of events= 50
#
# coef exp(coef) se(coef) z Pr(>|z|)
# stage2 0.15078 1.16275 0.46459 0.325 0.7455
# stage3 0.64090 1.89820 0.35616 1.799 0.0719 .
# stage4 1.72100 5.59012 0.43660 3.942 8.09e-05 ***
# age 0.01855 1.01872 0.01432 1.295 0.1954
# diagyr -0.01923 0.98096 0.07655 -0.251 0.8017
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# exp(coef) exp(-coef) lower .95 upper .95
# stage2 1.163 0.8600 0.4678 2.890
# stage3 1.898 0.5268 0.9444 3.815
# stage4 5.590 0.1789 2.3757 13.154
# age 1.019 0.9816 0.9905 1.048
# diagyr 0.981 1.0194 0.8443 1.140
#
# Concordance= 0.676 (se = 0.039 )
# Rsquare= 0.182 (max possible= 0.988 )
# Likelihood ratio test= 18.13 on 5 df, p=0.003
# Wald test = 20.87 on 5 df, p=9e-04
# Score (logrank) test = 24.4 on 5 df, p=2e-04

Recreate spss GEE regression table in R

I have the (sample) dataset below:
round<-c( 0.125150, 0.045800, -0.955299, -0.232007, 0.120880, -0.041525, 0.290473, -0.648752, 0.113264, -0.403685)
square<-c(-0.634753, 0.000492, -0.178591, -0.202462, -0.592054, -0.583173, -0.632375, -0.176673, -0.680557, -0.062127)
ideo<-c(0,1,0,1,0,1,0,0,1,1)
ex<-data.frame(round,square,ideo)
When I ran the GEE regression in SPSS I took this table as a result.
I used packages gee and geepack in R to run the same analysis and I took these results:
#gee
summary(gee(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 1.0541 0.4099 2.572 0.1328 7.937
square 1.1811 0.8321 1.419 0.4095 2.884
round 0.7072 0.5670 1.247 0.1593 4.439
#geepack
summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 1.054 0.133 63.00 2.1e-15 ***
square 1.181 0.410 8.32 0.0039 **
round 0.707 0.159 19.70 9.0e-06 ***
---
I would like to recreate exactly the table of SPSS(not the results as I use a subset of the original dataset)but I do not know how to achieve all these results.
A tiny bit of tidyverse magic can get the same results - more or less.
Get the information from coef(summary(geeglm())) and compute the necessary columns:
library("tidyverse")
library("geepack")
coef(summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))) %>%
mutate(lowerWald = Estimate-1.96*Std.err, # Lower Wald CI
upperWald=Estimate+1.96*Std.err, # Upper Wald CI
df=1,
ExpBeta = exp(Estimate)) %>% # Transformed estimate
mutate(lWald=exp(lowerWald), # Upper transformed
uWald=exp(upperWald)) # Lower transformed
This produces the following (with the data you provided). The order and the names of the columns could be modified to suit your needs
Estimate Std.err Wald Pr(>|W|) lowerWald upperWald df ExpBeta lWald uWald
1 1.0541 0.1328 62.997 2.109e-15 0.7938 1.314 1 2.869 2.212 3.723
2 1.1811 0.4095 8.318 3.925e-03 0.3784 1.984 1 3.258 1.460 7.270
3 0.7072 0.1593 19.704 9.042e-06 0.3949 1.019 1 2.028 1.484 2.772

R - Plm and lm - Fixed effects

I have a balanced panel data set, df, that essentially consists in three variables, A, B and Y, that vary over time for a bunch of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Despite I seem to be doing things correctly, I get extremely different results. The problem disappears when I do not consider time fixed effects - and use the argument effect = 'individual'.
What's the deal here? Am I missing something? Are there any other R packages that allow to run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in differences (in Python)
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the PLM package in R gives the same results for the first-difference models:
library(plm) modelfd <- plm(lrent~lpop + lavginc + pctstu,
data=data,model = "fd")
No problem so far. However, the fixed effect reports different estimates.
modelfx <- plm(lrent~lpop + lavginc + pctstu, data=data, model =
"within", effect="time") summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guest is that I am miss understanding something on the R package.

Why doesn't Stata and R produce the same output for a Multinomial Logit model?

Let's say I want to do a simple multinomial logit model using an online dataset in R:
library(nnet)
data <- data.table(read.dta('http://data.princeton.edu/wws509/datasets/irished.dta'))
ml <- multinom(educg ~ gender + prestigeg + reasong, data=data)
summary(ml)
You get the following output
summary(ml)
Call:
multinom(formula = educg ~ gender + prestigeg + reasong, data = data)
Coefficients:
(Intercept) genderfemale prestigegQ2 prestigegQ3 prestigegQ4 reasongQ2 reasongQ3 reasongQ4
senior -1.650999 0.3051297 0.8704957 1.189714 1.340206 -0.08303942 1.035163 1.627145
3rd level -5.792979 0.1615402 1.5331076 1.682500 2.227006 2.11053104 3.232968 4.963707
Std. Errors:
(Intercept) genderfemale prestigegQ2 prestigegQ3 prestigegQ4 reasongQ2 reasongQ3 reasongQ4
senior 0.3203241 0.2304163 0.3023462 0.3376034 0.3288158 0.2990188 0.3063954 0.3488479
3rd level 1.1165939 0.3477700 0.5534933 0.5878517 0.5433370 1.0789145 1.0644124 1.0532858
Residual Deviance: 730.8832
AIC: 762.8832
If I perform a similar routine in Stata:
use http://data.princeton.edu/wws509/datasets/irished.dta
mlogit educg gender prestigeg reasong
I get the following output:
Iteration 0: log likelihood = -433.16499
Iteration 1: log likelihood = -376.86517
Iteration 2: log likelihood = -371.52279
Iteration 3: log likelihood = -371.42355
Iteration 4: log likelihood = -371.42343
Iteration 5: log likelihood = -371.42343
Multinomial logistic regression Number of obs = 435
LR chi2(6) = 123.48
Prob > chi2 = 0.0000
Log likelihood = -371.42343 Pseudo R2 = 0.1425
------------------------------------------------------------------------------
educg | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
junior | (base outcome)
-------------+----------------------------------------------------------------
senior |
gender | .2577712 .2247087 1.15 0.251 -.1826498 .6981921
prestigeg | .4394042 .1027884 4.27 0.000 .2379427 .6408657
reasong | .5584275 .1059711 5.27 0.000 .3507279 .766127
_cons | -2.890597 .533933 -5.41 0.000 -3.937086 -1.844108
-------------+----------------------------------------------------------------
3rd_level |
gender | .1360704 .3416126 0.40 0.690 -.5334779 .8056188
prestigeg | .6387618 .1532933 4.17 0.000 .3383125 .9392111
reasong | 1.431763 .197151 7.26 0.000 1.045355 1.818172
_cons | -7.032375 .9904472 -7.10 0.000 -8.973616 -5.091134
------------------------------------------------------------------------------
Why are these values completely different? How do I get Stata-like output for a multinomial logit model in R?

Resources