I'm trying to replicate in R a Cox proportional hazards model estimated in Stata, using the following data: http://iojournal.org/wp-content/uploads/2015/05/FortnaReplicationData.dta
The commands in Stata are the following:
stset enddate2009, id(VPFid) fail(warends) origin(time startdate)
stcox HCTrebels o_rebstrength demdum independenceC transformC lnpop lngdppc africa diffreligion warage if keepobs==1, cluster(js_countryid)
Cox regression -- Breslow method for ties
No. of subjects = 104 Number of obs = 566
No. of failures = 86
Time at risk = 194190
Wald chi2(10) = 56.29
Log pseudolikelihood = -261.94776 Prob > chi2 = 0.0000
(Std. Err. adjusted for 49 clusters in js_countryid)
-------------------------------------------------------------------------------
| Robust
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
--------------+----------------------------------------------------------------
HCTrebels | .4089758 .1299916 -2.81 0.005 .2193542 .7625165
o_rebstrength | 1.157554 .2267867 0.75 0.455 .7884508 1.699447
demdum | .5893352 .2353317 -1.32 0.185 .2694405 1.289027
independenceC | .5348951 .1882826 -1.78 0.075 .268316 1.066328
transformC | .5277051 .1509665 -2.23 0.025 .3012164 .9244938
lnpop | .9374204 .0902072 -0.67 0.502 .7762899 1.131996
lngdppc | .9158258 .1727694 -0.47 0.641 .6327538 1.325534
africa | .5707749 .1671118 -1.92 0.055 .3215508 1.013165
diffreligion | 1.537959 .4472004 1.48 0.139 .869834 2.719275
warage | .9632408 .0290124 -1.24 0.214 .9080233 1.021816
-------------------------------------------------------------------------------
With R, I'm using the following:
library(foreign)   # for read.dta()
library(survival)  # for Surv() and coxph()
data <- read.dta("FortnaReplicationData.dta")
data4 <- subset(data, keepobs==1)
data4$end_date <- data4$`_t`
data4$start_date <- data4$`_t0`
levels(data4$o_rebstrength) <- c(0:4)
data4$o_rebstrength <- as.numeric(levels(data4$o_rebstrength))[data4$o_rebstrength]
data4 <- data4[,c("start_date", "end_date","HCTrebels", "o_rebstrength", "demdum", "independenceC", "transformC", "lnpop", "lngdppc", "africa", "diffreligion", "warage", "js_countryid", "warends")]
data4 <- na.omit(data4)
surv <- coxph(Surv(start_date, end_date, warends) ~ HCTrebels + o_rebstrength +
              demdum + independenceC + transformC + lnpop + lngdppc + africa +
              diffreligion + warage + cluster(js_countryid),
              data = data4, robust = TRUE, method = "breslow")
coef exp(coef) se(coef) robust se z p
HCTrebels -0.8941 0.4090 0.3694 0.3146 -2.84 0.0045
o_rebstrength 0.1463 1.1576 0.2214 0.1939 0.75 0.4505
demdum -0.5288 0.5893 0.4123 0.3952 -1.34 0.1809
independenceC -0.6257 0.5349 0.3328 0.3484 -1.80 0.0725
transformC -0.6392 0.5277 0.3384 0.2831 -2.26 0.0240
lnpop -0.0646 0.9374 0.1185 0.0952 -0.68 0.4974
lngdppc -0.0879 0.9158 0.2060 0.1867 -0.47 0.6377
africa -0.5608 0.5708 0.3024 0.2898 -1.94 0.0530
diffreligion 0.4305 1.5380 0.3345 0.2878 1.50 0.1347
warage -0.0375 0.9632 0.0405 0.0298 -1.26 0.2090
Likelihood ratio test=30.1 on 10 df, p=0.000827
n= 566, number of events= 86
I get the same hazard ratio coefficients, but the standard errors do not look the same. The z and p values are close but not exactly the same. What might explain the difference between the results in R and Stata?
As user20650 noticed, when you include the nohr option in Stata you get exactly the same standard errors as in R. Still, there was a small difference in the standard errors when using clusters. user20650 again noticed that the difference arises because Stata multiplies the clustered variance estimates by g/(g − 1), where g is the number of clusters, while R does not make this adjustment. So a solution is to include noadjust in Stata, or to adjust the standard errors in R by doing:
sqrt(diag(vcov(surv))* (49/48))
If we still want R to reproduce the standard errors Stata reports when nohr is not specified, we need to know that when nohr is left off we obtain $exp(\beta)$ with standard errors on that scale, obtained by applying the delta method to the original standard-error estimates. "The delta method obtains the standard error of a transformed variable by calculating the variance of the corresponding first-order Taylor expansion, which for the transform $exp(\beta)$ amounts to multiplying the original standard error by $exp(\hat{\beta})$. This trick of calculation yields identical results as does transforming the parameters prior to estimation and then reestimating" (Cleves et al. 2010). In R we can do it using:
library(msm)
v <- diag(vcov(surv)) * (49/48)  # cluster-adjusted variances of the coefficients
mapply(function(b, vv) deltamethod(~ exp(x1), b, vv), coef(surv), v)
HCTrebels o_rebstrength demdum independenceC transformC lnpop lngdppc africa diffreligion warage
0.1299916 0.2267867 0.2353317 0.1882826 0.1509665 0.0902072 0.1727694 0.1671118 0.4472004 0.02901243
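Equivalently, since the delta method for $exp(\beta)$ just multiplies each standard error by $exp(\hat{\beta})$ (per the Cleves et al. quote above), the same numbers come out of a one-liner without msm:
# exp(beta) times the cluster-adjusted standard error of beta
exp(coef(surv)) * sqrt(diag(vcov(surv)) * (49/48))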
I'm hoping any of you could shed some light on the following. I have been attempting to replicate a Cox PH model from Stata in R. As you can see below, I get the same results for Cox PH models without tvcs in both programs:
Stata Cox PH model
stset date_endpoint, failure(cause_endpoint2==4) exit(failure) origin(time capture_date) id(wolf_ID)
id: wolf_ID
failure event: cause_endpoint2 == 4
obs. time interval: (date_endpoint[_n-1], date_endpoint]
exit on or before: failure
t for analysis: (time-origin)
origin: time capture_date
--------------------------------------------------------------------------
5,664 total observations
0 exclusions
--------------------------------------------------------------------------
5,664 observations remaining, representing
513 subjects
231 failures in single-failure-per-subject data
279,430.5 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 3,051
stcox deer_hunt bear_hunt, strata(winter lib_kill) efron robust cluster(wolf_ID)
failure _d: cause_endpoint2 == 4
analysis time _t: (date_endpoint-origin)
origin: time capture_date
id: wolf_ID
Iteration 0: log pseudolikelihood = -993.65252
Iteration 1: log pseudolikelihood = -992.55768
Iteration 2: log pseudolikelihood = -992.55733
Refining estimates:
Iteration 0: log pseudolikelihood = -992.55733
Stratified Cox regr. -- Efron method for ties
No. of subjects = 513 Number of obs = 5,664
No. of failures = 231
Time at risk = 279430.5
Wald chi2(2) = 2.21
Log pseudolikelihood = -992.55733 Prob > chi2 = 0.3317
(Std. Err. adjusted for 513 clusters in wolf_ID)
--------------------------------------------------------------------------
| Robust
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+------------------------------------------------------------
deer_hunt | .7860433 .1508714 -1.25 0.210 .539596 1.145049
bear_hunt | 1.21915 .2687211 0.90 0.369 .7914762 1.877916
--------------------------------------------------------------------------
Stratified by winter lib_kill
R Cox PH model
> LTF.coxph <- coxph(Surv(`_t0`,`_t`, endpoint_r_enc=="ltf") ~ deer_hunt
+ + bear_hunt + strata(winter, lib_kill), data=statadta, ties = "efron", id = wolf_ID)
> summary(LTF.coxph)
Call:
coxph(formula = Surv(`_t0`, `_t`, endpoint_r_enc == "ltf") ~
deer_hunt + bear_hunt + strata(winter, lib_kill), data = statadta,
ties = "efron", id = wolf_ID)
n= 5664, number of events= 231
coef exp(coef) se(coef) z Pr(>|z|)
deer_hunt -0.2407 0.7860 0.1849 -1.302 0.193
bear_hunt 0.1982 1.2191 0.2174 0.911 0.362
exp(coef) exp(-coef) lower .95 upper .95
deer_hunt 0.786 1.2722 0.5471 1.129
bear_hunt 1.219 0.8202 0.7962 1.867
Concordance= 0.515 (se = 0.022 )
Likelihood ratio test= 2.19 on 2 df, p=0.3
Wald test = 2.21 on 2 df, p=0.3
Score (logrank) test = 2.22 on 2 df, p=0.3
> cox.zph(LTF.coxph)
chisq df p
deer_hunt 5.5773 1 0.018
bear_hunt 0.0762 1 0.783
GLOBAL 5.5773 2 0.062
The problem I have is that the results look very different when I add a time-varying coefficient (tvc() in Stata, tt() in R) for one of the variables in my model. Nothing matches between the two models: neither the coefficients for any variable nor their significance.
Stata Cox PH model with tvc()
stcox deer_hunt bear_hunt, tvc(deer_hunt) strata(winter lib_kill) efron robust cluster(wolf_ID)
failure _d: cause_endpoint2 == 4
analysis time _t: (date_endpoint-origin)
origin: time capture_date
id: wolf_ID
Iteration 0: log pseudolikelihood = -993.65252
Iteration 1: log pseudolikelihood = -990.70475
Iteration 2: log pseudolikelihood = -990.69386
Iteration 3: log pseudolikelihood = -990.69386
Refining estimates:
Iteration 0: log pseudolikelihood = -990.69386
Stratified Cox regr. -- Efron method for ties
No. of subjects = 513 Number of obs = 5,664
No. of failures = 231
Time at risk = 279430.5
Wald chi2(3) = 4.72
Log pseudolikelihood = -990.69386 Prob > chi2 = 0.1932
(Std. Err. adjusted for 513 clusters in wolf_ID)
--------------------------------------------------------------------------
| Robust
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+------------------------------------------------------------
main |
deer_hunt | 1.043941 .2643779 0.17 0.865 .6354908 1.714915
bear_hunt | 1.204522 .2647525 0.85 0.397 .7829279 1.853138
-------------+------------------------------------------------------------
tvc |
deer_hunt | .9992683 .0004286 -1.71 0.088 .9984287 1.000109
------------------------------------------------------------------------------
Stratified by winter lib_kill
Note: Variables in tvc equation interacted with _t.
R Cox PH model with tt()
> LTF.tvc1.coxph <- coxph(Surv(`_t0`,`_t`, endpoint_r_enc=="ltf") ~ deer_hunt + bear_hunt + tt(deer_hunt) + strata(winter, lib_kill),
+ data=statadta, ties = "efron", id = wolf_ID, cluster = wolf_ID,
+ tt=function(x,t,...){x*t})
> summary(LTF.tvc1.coxph)
Call:
coxph(formula = Surv(`_t0`, `_t`, endpoint_r_enc == "ltf") ~
deer_hunt + bear_hunt + tt(deer_hunt) + strata(winter, lib_kill),
data = statadta, ties = "efron", tt = function(x, t, ...) {
x * t
}, id = wolf_ID, cluster = wolf_ID)
n= 5664, number of events= 231
coef exp(coef) se(coef) robust se z Pr(>|z|)
deer_hunt 0.4741848 1.6067039 0.2082257 0.2079728 2.280 0.02261 *
bear_hunt -0.7923208 0.4527927 0.1894531 0.1890497 -4.191 2.78e-05 ***
tt(deer_hunt) -0.0009312 0.9990692 0.0003442 0.0003612 -2.578 0.00994 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
deer_hunt 1.6067 0.6224 1.0688 2.4153
bear_hunt 0.4528 2.2085 0.3126 0.6559
tt(deer_hunt) 0.9991 1.0009 0.9984 0.9998
Concordance= 0.634 (se = 0.02 )
Likelihood ratio test= 28.29 on 3 df, p=3e-06
Wald test = 25.6 on 3 df, p=1e-05
Score (logrank) test = 26.19 on 3 df, p=9e-06, Robust = 32.6 p=4e-07
Moreover, I checked this post before posting, but I did not find it very helpful. The noadjust Stata option was useful for the SEs, but it does not answer my main issue: I also get different covariate coefficients between the programs for the main and time-varying effects when I add time-varying effects to the Cox model in each program (using the same formula for the time-varying effect). That is really my main concern: the difference in covariate estimates seems substantial and would lead to different conclusions, I believe.
I have been unable to figure out what is happening there, and am hoping the community can help.
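For what it's worth, one way to pin down what each program is fitting is to avoid tt() altogether: split the follow-up at the observed event times with survival::survSplit and enter the interaction with analysis time as an ordinary covariate. A minimal sketch with hypothetical names (d, x, time, event; not the wolf data above):
library(survival)
# d: one record per subject, with columns time, event (0/1), covariate x (all hypothetical)
cuts <- sort(unique(d$time[d$event == 1]))     # the observed event times
long <- survSplit(Surv(time, event) ~ ., data = d, cut = cuts)
long$x_t <- long$x * long$time                 # x interacted with analysis time
coxph(Surv(tstart, time, event) ~ x + x_t, data = long, ties = "efron")
Because the partial likelihood only evaluates covariates at the event times, this reproduces what tt = function(x, t, ...) x*t computes, and it is directly comparable to Stata's tvc(), whose default texp() is _t (as the note in the Stata output above confirms).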
I have the (sample) dataset below:
round<-c( 0.125150, 0.045800, -0.955299, -0.232007, 0.120880, -0.041525, 0.290473, -0.648752, 0.113264, -0.403685)
square<-c(-0.634753, 0.000492, -0.178591, -0.202462, -0.592054, -0.583173, -0.632375, -0.176673, -0.680557, -0.062127)
ideo<-c(0,1,0,1,0,1,0,0,1,1)
ex<-data.frame(round,square,ideo)
When I ran the GEE regression in SPSS, I got a table of results.
I used the gee and geepack packages in R to run the same analysis, and I got these results:
#gee
summary(gee(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 1.0541 0.4099 2.572 0.1328 7.937
square 1.1811 0.8321 1.419 0.4095 2.884
round 0.7072 0.5670 1.247 0.1593 4.439
#geepack
summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 1.054 0.133 63.00 2.1e-15 ***
square 1.181 0.410 8.32 0.0039 **
round 0.707 0.159 19.70 9.0e-06 ***
---
I would like to recreate the SPSS table exactly (not the results, as I use a subset of the original dataset), but I do not know how to obtain all of these statistics.
A tiny bit of tidyverse magic can get the same results - more or less.
Get the information from coef(summary(geeglm())) and compute the necessary columns:
library("tidyverse")
library("geepack")
coef(summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))) %>%
mutate(lowerWald = Estimate-1.96*Std.err, # Lower Wald CI
upperWald=Estimate+1.96*Std.err, # Upper Wald CI
df=1,
ExpBeta = exp(Estimate)) %>% # Transformed estimate
mutate(lWald=exp(lowerWald), # Upper transformed
uWald=exp(upperWald)) # Lower transformed
This produces the following (with the data you provided). The order and the names of the columns can be modified to suit your needs:
Estimate Std.err Wald Pr(>|W|) lowerWald upperWald df ExpBeta lWald uWald
1 1.0541 0.1328 62.997 2.109e-15 0.7938 1.314 1 2.869 2.212 3.723
2 1.1811 0.4095 8.318 3.925e-03 0.3784 1.984 1 3.258 1.460 7.270
3 0.7072 0.1593 19.704 9.042e-06 0.3949 1.019 1 2.028 1.484 2.772
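For completeness, the same columns can be built in base R from the coefficient data frame, without the pipe (a sketch reusing the geeglm call from above; fit and cs are arbitrary names):
fit <- geeglm(ideo ~ square + round, data = ex, id = ideo,
              corstr = "independence")
cs <- coef(summary(fit))       # Estimate, Std.err, Wald, Pr(>|W|)
z <- qnorm(0.975)              # the 1.96 used above
cs$lowerWald <- cs$Estimate - z * cs$Std.err
cs$upperWald <- cs$Estimate + z * cs$Std.err
cs$df <- 1
cs$ExpBeta <- exp(cs$Estimate)
cs$lWald <- exp(cs$lowerWald)
cs$uWald <- exp(cs$upperWald)
cs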
Let's say I want to do a simple multinomial logit model using an online dataset in R:
library(foreign)     # for read.dta()
library(data.table)  # for data.table()
library(nnet)
data <- data.table(read.dta('http://data.princeton.edu/wws509/datasets/irished.dta'))
ml <- multinom(educg ~ gender + prestigeg + reasong, data=data)
summary(ml)
You get the following output:
summary(ml)
Call:
multinom(formula = educg ~ gender + prestigeg + reasong, data = data)
Coefficients:
(Intercept) genderfemale prestigegQ2 prestigegQ3 prestigegQ4 reasongQ2 reasongQ3 reasongQ4
senior -1.650999 0.3051297 0.8704957 1.189714 1.340206 -0.08303942 1.035163 1.627145
3rd level -5.792979 0.1615402 1.5331076 1.682500 2.227006 2.11053104 3.232968 4.963707
Std. Errors:
(Intercept) genderfemale prestigegQ2 prestigegQ3 prestigegQ4 reasongQ2 reasongQ3 reasongQ4
senior 0.3203241 0.2304163 0.3023462 0.3376034 0.3288158 0.2990188 0.3063954 0.3488479
3rd level 1.1165939 0.3477700 0.5534933 0.5878517 0.5433370 1.0789145 1.0644124 1.0532858
Residual Deviance: 730.8832
AIC: 762.8832
If I perform a similar routine in Stata:
use http://data.princeton.edu/wws509/datasets/irished.dta
mlogit educg gender prestigeg reasong
I get the following output:
Iteration 0: log likelihood = -433.16499
Iteration 1: log likelihood = -376.86517
Iteration 2: log likelihood = -371.52279
Iteration 3: log likelihood = -371.42355
Iteration 4: log likelihood = -371.42343
Iteration 5: log likelihood = -371.42343
Multinomial logistic regression Number of obs = 435
LR chi2(6) = 123.48
Prob > chi2 = 0.0000
Log likelihood = -371.42343 Pseudo R2 = 0.1425
------------------------------------------------------------------------------
educg | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
junior | (base outcome)
-------------+----------------------------------------------------------------
senior |
gender | .2577712 .2247087 1.15 0.251 -.1826498 .6981921
prestigeg | .4394042 .1027884 4.27 0.000 .2379427 .6408657
reasong | .5584275 .1059711 5.27 0.000 .3507279 .766127
_cons | -2.890597 .533933 -5.41 0.000 -3.937086 -1.844108
-------------+----------------------------------------------------------------
3rd_level |
gender | .1360704 .3416126 0.40 0.690 -.5334779 .8056188
prestigeg | .6387618 .1532933 4.17 0.000 .3383125 .9392111
reasong | 1.431763 .197151 7.26 0.000 1.045355 1.818172
_cons | -7.032375 .9904472 -7.10 0.000 -8.973616 -5.091134
------------------------------------------------------------------------------
Why are these values completely different? How do I get Stata-like output for a multinomial logit model in R?
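Two things are worth noting here. First, as called, the two programs are not fitting the same model: Stata's mlogit treats gender, prestigeg, and reasong as continuous codes (one coefficient each), whereas R's multinom expands the factors into dummies (genderfemale, prestigegQ2, and so on); using factor-variable notation in Stata (i.gender, etc.) would match R's coding. Second, multinom's summary reports only coefficients and standard errors, but Stata-style Wald z and p values can be computed by hand, for example:
s <- summary(ml)
z <- s$coefficients / s$standard.errors   # Wald z statistics
p <- 2 * pnorm(-abs(z))                   # two-sided p-values
round(z, 3)
round(p, 4)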
I have a logistic regression model, for which I have been using the rms package. The model fits best using a log term for tn1, and for clinical interpretation I’m using log2. I ran the model using lrm from the rms package, and then to double check, I ran it using glm. The initial coefficients are the same:
h <- lrm(formula = outcomehosp ~ I(log2(tn1 + 0.001)) + apscore_ad +
emsurg + corrapiidiag, data = d, x = TRUE, y = TRUE)
Coef S.E. Wald Z Pr(>|Z|)
Intercept -3.4570 0.3832 -9.02 <0.0001
tn1 0.0469 0.0180 2.60 0.0093
apscore_ad 0.1449 0.0127 11.44 <0.0001
emsurg 0.0731 0.3228 0.23 0.8208
f <- glm(formula = outcomehosp ~ apscore_ad + emsurg + corrapiidiag +
I(log2(tn1 + 0.001)), family = binomial(), data = tn24)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.45699 0.38315 -9.023 < 2e-16
I(log2(tn1 + 0.001)) 0.04690 0.01804 2.600 0.00932
apscore_ad 0.14487 0.01267 11.438 < 2e-16
emsurg 0.07310 0.32277 0.226 0.82082
However, when I try to get the odds ratios, they are noticeably different for tn1 between the two models, and this doesn’t seem to be due to the log2 transformation.
summary(h)
Effects Response : outcomehosp
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
tn1 0 0.21 0.21 0.362120 0.15417 6.5300e-02 0.673990
Odds Ratio 0 0.21 0.21 1.436400 NA 1.0675e+00 1.962100
apscore_ad 14 25.00 11.00 1.593600 0.15631 1.3605e+00 1.961000
Odds Ratio 14 25.00 11.00 4.921400 NA 3.8981e+00 7.106600
emsurg 0 1.00 1.00 0.073103 0.33051 -5.8224e-01 0.734860
Odds Ratio 0 1.00 1.00 1.075800 NA 5.5865e-01 2.085200
exp(f$coefficients)
(Intercept) 0.03152467
apscore_ad 1.15589222
emsurg 1.07584115
I(log2(tn1 + 0.001)) 1.04802
Would anyone be able to explain what the rms package is calculating the odds ratio of? Many thanks.
The tn1 effect from summary(h) is the effect on the log of the odds ratio of tn1 going from 0 to 0.21 -- the inter-quartile range. See ?summary.rms.
So, the effect from the first row of summary(h) is 0.36212 = (log2(0.211)-log2(0.001))*.0469.
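The arithmetic can be checked directly (0.211 is 0.21 plus the 0.001 offset used in the model formula):
eff <- (log2(0.21 + 0.001) - log2(0 + 0.001)) * 0.0469
eff       # 0.3621, the tn1 "Effect" row in summary(h)
exp(eff)  # 1.4364, the reported odds ratio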
I've never noticed this behavior before, but I'm surprised at the output naming conventions for linear model summaries. My question, essentially, is why row names in a linear model summary always seem to carry the name of the column they came from.
An example
Suppose you had some data for 300 movie audience members from three different cities:
Chicago
Milwaukee
Dayton
And suppose all of them were subjected to the stinking pile of confusing, contaminated waste that was Spider-Man 3. After enduring the entirety of that cinematic abomination, they were asked to rate the movie on a 100-point scale.
Because all of the audience members were reasonable human beings, the ratings were all below zero. (Naturally. Anyone who's seen the movie would agree.)
Here's what that might look like in R:
> score <- rnorm(n = 300, mean = -50, sd = 10)
> city <- rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
> spider.man.3.sucked <- data.frame(score, city)
> head(spider.man.3.sucked)
score city
1 -64.57515 Chicago
2 -50.51050 Milwaukee
3 -56.51409 Dayton
4 -45.55133 Chicago
5 -47.88686 Milwaukee
6 -51.22812 Dayton
Great. So let's run a quick linear model, assign it to lm1, and get its summary output:
> lm1 <- lm(score ~ city, data = spider.man.3.sucked)
> summary(lm1)
Call:
lm(formula = score ~ city, data = spider.man.3.sucked)
Residuals:
Min 1Q Median 3Q Max
-29.8515 -6.1090 -0.4745 6.0340 26.2616
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.3621 0.9630 -53.337 <2e-16 ***
cityDayton 1.1892 1.3619 0.873 0.383
cityMilwaukee 0.8288 1.3619 0.609 0.543
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.63 on 297 degrees of freedom
Multiple R-squared: 0.002693, Adjusted R-squared: -0.004023
F-statistic: 0.4009 on 2 and 297 DF, p-value: 0.6701
What's bugging me
The part I want to highlight is this:
cityDayton 1.1892 1.3619 0.873 0.383
cityMilwaukee 0.8288 1.3619 0.609 0.543
It looks like R sensibly concatenated the column name (city, if you remember from above) with the distinct value (in this case either Dayton or Milwaukee). If I don't want R to output in that format, is there any way to override it? For example, in my case all I'd need is simply:
Dayton 1.1892 1.3619 0.873 0.383
Milwaukee 0.8288 1.3619 0.609 0.543
Two questions in one
So,
What's controlling the format of the output for linear model summary rows, and
Can/should I change it?
The extractor function for that component of a summary object is coef. Does this provide the means to control your output acceptably?
summ <- summary(lm1)
csumm <- coef(summ)
rownames(csumm) <- sub("^city", "", rownames(csumm))
print(csumm[-1,], digits=4)
# Estimate Std. Error t value Pr(>|t|)
# Dayton 0.8133 1.485 0.5478 0.5842
# Milwaukee 0.3891 1.485 0.2621 0.7934
(No random seed was set so cannot match your values.)
For 1), it appears to happen inside model.matrix.default(), and inside internal compiled R code at that.
It might be difficult to change this easily - the obvious way would be to write your own model.matrix.default() that calls model.matrix.default() and updates the names afterwards. But this isn't tested or tried.
Here is a hack
# RUN REGRESSION
require(reshape2)  # the tips dataset lives in reshape2
lm1 = lm(tip ~ total_bill + sex + day, data = tips)
# FUNCTION TO REMOVE FACTOR NAMES FROM MODEL SUMMARY
remove_factors = function(mod){
mydf = mod$model
# PREPARE VECTOR OF VARIABLES WITH REPETITIONS = UNIQUE FACTOR LEVELS
vars = names(mod$model)[-1]
eachlen = sapply(mydf[,vars,drop=F], function(x)
ifelse(is.numeric(x), 1, length(unique(x)) - 1))
vars = rep(vars, eachlen)
# REPLACE COEF NAMES WITH VARIABLE NAME WHEN APPROPRIATE
coefs = names(mod$coefficients)[-1]
coefs2 = stringr::str_replace(coefs, vars, "")
names(mod$coefficients)[-1] = ifelse(coefs2 == "", coefs, coefs2)
return(mod)
}
summary(remove_factors(lm1))
This gives
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.95588 0.27579 3.47 0.00063 ***
total_bill 0.10489 0.00758 13.84 < 2e-16 ***
Male -0.03844 0.14215 -0.27 0.78706
Sat -0.08088 0.26226 -0.31 0.75806
Sun 0.08282 0.26741 0.31 0.75706
Thur -0.02063 0.26975 -0.08 0.93910
However, it is not always advisable to do this, as you can see from running the same hack on a different regression. It is not clear what the Yes variable in the last row stands for. R by default writes it as smokerYes to make its meaning clear. So use with caution.
lm2 = lm(tip ~ total_bill + sex + day + smoker, data = tips)
summary(remove_factors(lm2))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.05182 0.29315 3.59 0.00040 ***
total_bill 0.10569 0.00763 13.86 < 2e-16 ***
Male -0.03769 0.14217 -0.27 0.79114
Sat -0.12636 0.26648 -0.47 0.63582
Sun 0.00407 0.27959 0.01 0.98841
Thur -0.09283 0.27994 -0.33 0.74048
Yes -0.13935 0.14422 -0.97 0.33489