I am using the survey package to analyse a longitudinal database. The data look like this:
personid spellid long.w Dur rc sex 1 10 age
1 1 278 6.4702295519 0 0 47 20 16
2 1 203 2.8175129012 1 1 126 87 62
3 1 398 6.1956669321 0 0 180 6 37
4 1 139 7.2791061847 1 0 104 192 20
7 1 10 3.6617503439 1 0 18 24 25
8 1 3 2.265464682 0 1 168 136 40
9 1 134 6.3180994022 0 1 116 194 35
10 1 272 6.9167936912 0 0 39 119 45
11 1 296 5.354798213 1 1 193 161 62
After the variable sex there are 10 bootstrap weight columns, followed by the variable age.
The longitudinal weight is given in the column long.w.
I am using the following code:
data.1 <- read.table("Panel.csv", sep = ",",header=T)
library(survey)
library(survival)
#### Unweighted model
mod.1 <- summary(coxph(Surv(Dur, rc) ~ age + sex, data.1))
mod.1
coxph(formula = Surv(Dur, rc) ~ age + sex, data = data.1)
n= 36, number of events= 14
coef exp(coef) se(coef) z Pr(>|z|)
age -4.992e-06 1.000e+00 2.291e-02 0.000 1.000
sex 5.277e-01 1.695e+00 5.750e-01 0.918 0.359
exp(coef) exp(-coef) lower .95 upper .95
age 1.000 1.00 0.9561 1.046
sex 1.695 0.59 0.5492 5.232
Concordance= 0.651 (se = 0.095 )
Rsquare= 0.024 (max possible= 0.858 )
### --- Weights
weights <- data.1[,7:16]*data.1$long.w
panel <-svrepdesign(data=data.1,
weights=data.1[,3],
type="BRR",
repweights=weights,
combined.weights=TRUE
)
#### Weighted model
mod.1.w <- svycoxph(Surv(Dur,rc)~ age+ sex ,design=panel)
summary(mod.1.w)
Balanced Repeated Replicates with 10 replicates.
Call:
svycoxph.svyrep.design(formula = Surv(Dur, rc) ~ age + sex, design = panel)
n= 36, number of events= 14
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0198 1.0200 0.0131 1.512 0.131
sex 1.0681 2.9098 0.2336 4.572 4.84e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.02 0.9804 0.9941 1.047
sex 2.91 0.3437 1.8407 4.600
Concordance= 0.75 (se = 0.677 )
Rsquare= NA (max possible= NA )
Likelihood ratio test= NA on 2 df, p=NA
Wald test = 28.69 on 2 df, p=5.875e-07
Score (logrank) test = NA on 2 df, p=NA
### ----
> panel.2 <-svrepdesign(data=data.1,
+ weights=data.1[,3],
+ type="BRR",
+ repweights=data.1[,7:16],
+ combined.weights=FALSE,
+ )
Warning message:
In svrepdesign.default(data = data.1, weights = data.1[, 3], type = "BRR", :
Data look like combined weights: mean replication weight is 101.291666666667 and mean sampling weight is 203.944444444444
mod.2.w <- svycoxph(Surv(Dur,rc)~ age+ sex ,design=panel.2)
> summary(mod.2.w)
Call: svrepdesign.default(data = data.1, weights = data.1[, 3], type = "BRR",
repweights = data.1[, 7:16], combined.weights = FALSE, )
Balanced Repeated Replicates with 10 replicates.
Call:
svycoxph.svyrep.design(formula = Surv(Dur, rc) ~ age + sex, design = panel.2)
n= 36, number of events= 14
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0198 1.0200 0.0131 1.512 0.131
sex 1.0681 2.9098 0.2336 4.572 4.84e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.02 0.9804 0.9941 1.047
sex 2.91 0.3437 1.8407 4.600
Concordance= 0.75 (se = 0.677 )
Rsquare= NA (max possible= NA )
Likelihood ratio test= NA on 2 df, p=NA
Wald test = 28.69 on 2 df, p=5.875e-07
Score (logrank) test = NA on 2 df, p=NA
The sum of the longitudinal weights is 7,342. The total number of events should be around 2,357 and the number of censored observations around 4,985, for a "population" of 7,342 individuals.
Do models mod.1.w and mod.2.w take the longitudinal weights into account? If they do, why does the summary report only n= 36, number of events= 14?
The design works well for other statistics. For example, the mean of Dur in data.1 is around 4.9 when the sampling design is ignored and 5.31 when it is taken into account with svymean(~Dur, panel.2).
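As a rough check on the totals quoted above, here is a minimal sketch that uses only the nine rows printed at the top of the question (not the full 36-row file), so the numbers are illustrative only. The vector names `long_w` and `rc_flag` are introduced here. The reading that `n=` and `number of events=` are unweighted row counts, while the weights enter the coefficient estimates, is an assumption; it is consistent with mod.1.w reporting the same n as mod.1 but different coefficients.

```r
# Nine rows copied from the printed data; the real file has 36 rows.
long_w  <- c(278, 203, 398, 139, 10, 3, 134, 272, 296)  # longitudinal weights
rc_flag <- c(0, 1, 0, 1, 1, 0, 0, 0, 1)                 # event indicator (rc)

n_rows          <- length(rc_flag)       # unweighted sample size
n_events        <- sum(rc_flag)          # unweighted number of events
weighted_pop    <- sum(long_w)           # estimated population size
weighted_events <- sum(long_w * rc_flag) # estimated number of events in the population

c(n_rows = n_rows, n_events = n_events,
  weighted_pop = weighted_pop, weighted_events = weighted_events)
```

On the full design object, `svytotal(~rc, panel)` should give the corresponding design-based total together with a standard error.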
A minimal example is given below. Why does the last row (the last level of the combination) get NA? I thought it should be treated as the reference level; am I right? Which level is the reference level in the combination?
> library(survival)
> summary(coxph(Surv(time, status) ~ factor(ph.ecog) * factor(sex), data = lung))
Call:
coxph(formula = Surv(time, status) ~ factor(ph.ecog) * factor(sex),
data = lung)
n= 227, number of events= 164
(1 observation deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
factor(ph.ecog)1 0.40410 1.49796 0.23525 1.718 0.08584 .
factor(ph.ecog)2 0.84691 2.33242 0.26889 3.150 0.00163 **
factor(ph.ecog)3 2.01808 7.52390 1.03086 1.958 0.05027 .
factor(sex)2 -0.67710 0.50809 0.38435 -1.762 0.07812 .
factor(ph.ecog)1:factor(sex)2 0.07719 1.08024 0.45168 0.171 0.86432
factor(ph.ecog)2:factor(sex)2 0.32936 1.39008 0.49576 0.664 0.50646
factor(ph.ecog)3:factor(sex)2 NA NA 0.00000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
factor(ph.ecog)1 1.4980 0.6676 0.9446 2.375
factor(ph.ecog)2 2.3324 0.4287 1.3770 3.951
factor(ph.ecog)3 7.5239 0.1329 0.9976 56.743
factor(sex)2 0.5081 1.9682 0.2392 1.079
factor(ph.ecog)1:factor(sex)2 1.0802 0.9257 0.4457 2.618
factor(ph.ecog)2:factor(sex)2 1.3901 0.7194 0.5261 3.673
factor(ph.ecog)3:factor(sex)2 NA NA NA NA
Concordance= 0.643 (se = 0.025 )
Likelihood ratio test= 30.07 on 6 df, p=0.00004
Wald test = 29.75 on 6 df, p=0.00004
Score (logrank) test = 32.92 on 6 df, p=0.00001
Stranger still, when I fit only the interaction, one or more of the last levels get NA.
> summary(coxph(Surv(time, status) ~ factor(ph.ecog) : factor(sex), data = lung))
Call:
coxph(formula = Surv(time, status) ~ factor(ph.ecog):factor(sex),
data = lung)
n= 227, number of events= 164
(1 observation deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
factor(ph.ecog)0:factor(sex)1 -0.49916 0.60704 0.31688 -1.575 0.11520
factor(ph.ecog)1:factor(sex)1 -0.09506 0.90932 0.28525 -0.333 0.73894
factor(ph.ecog)2:factor(sex)1 0.34774 1.41587 0.31502 1.104 0.26964
factor(ph.ecog)3:factor(sex)1 1.51892 4.56729 1.04054 1.460 0.14436
factor(ph.ecog)0:factor(sex)2 -1.17627 0.30843 0.41739 -2.818 0.00483 **
factor(ph.ecog)1:factor(sex)2 -0.69498 0.49909 0.31614 -2.198 0.02793 *
factor(ph.ecog)2:factor(sex)2 NA NA 0.00000 NA NA
factor(ph.ecog)3:factor(sex)2 NA NA 0.00000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
factor(ph.ecog)0:factor(sex)1 0.6070 1.6473 0.3262 1.1296
factor(ph.ecog)1:factor(sex)1 0.9093 1.0997 0.5199 1.5904
factor(ph.ecog)2:factor(sex)1 1.4159 0.7063 0.7636 2.6252
factor(ph.ecog)3:factor(sex)1 4.5673 0.2189 0.5942 35.1048
factor(ph.ecog)0:factor(sex)2 0.3084 3.2422 0.1361 0.6989
factor(ph.ecog)1:factor(sex)2 0.4991 2.0037 0.2686 0.9274
factor(ph.ecog)2:factor(sex)2 NA NA NA NA
factor(ph.ecog)3:factor(sex)2 NA NA NA NA
Concordance= 0.643 (se = 0.025 )
Likelihood ratio test= 30.07 on 6 df, p=0.00004
Wald test = 29.75 on 6 df, p=0.00004
Score (logrank) test = 32.92 on 6 df, p=0.00001
UPDATE
Thanks to the check suggested by @rawr, I examined my real data and found that the NA cannot be explained that way:
> with(data_case, ftable(class, Grouping, OS_Status))
OS_Status 0 1
class Grouping
A a 33 14
b 22 26
B a 49 21
b 43 28
C a 86 25
b 77 42
> fit = coxph(Surv(OS, OS_Status) ~ class:Grouping, data_case)
> summary(fit)
Call:
coxph(formula = Surv(OS, OS_Status) ~ class:Grouping, data = data_case)
n= 466, number of events= 156
coef exp(coef) se(coef) z Pr(>|z|)
classA:Groupinga -0.3504 0.7044 0.3099 -1.131 0.2582
classB:Groupinga -0.4621 0.6299 0.2695 -1.715 0.0864 .
classC:Groupinga -0.6477 0.5232 0.2534 -2.556 0.0106 *
classA:Groupingb 0.3717 1.4502 0.2503 1.485 0.1376
classB:Groupingb -0.1213 0.8858 0.2455 -0.494 0.6213
classC:Groupingb NA NA 0.0000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
classA:Groupinga 0.7044 1.4196 0.3838 1.2930
classB:Groupinga 0.6299 1.5875 0.3714 1.0684
classC:Groupinga 0.5232 1.9112 0.3184 0.8598
classA:Groupingb 1.4502 0.6896 0.8879 2.3685
classB:Groupingb 0.8858 1.1289 0.5475 1.4331
classC:Groupingb NA NA NA NA
Concordance= 0.58 (se = 0.027 )
Likelihood ratio test= 16.48 on 5 df, p=0.006
Wald test = 16.74 on 5 df, p=0.005
Score (logrank) test = 17.5 on 5 df, p=0.004
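One possible explanation for the NA rows, sketched under assumptions: an interaction-only term produces one indicator column per cell, and those columns add up to a constant. coxph has no intercept because the baseline hazard plays that role, so one cell indicator is redundant and is reported as NA. An empty cell adds a second redundant (all-zero) column. The toy data below are hypothetical and only mimic the class/Grouping layout; the rank of the design matrix is checked with base R.

```r
# Balanced toy data: every class x Grouping cell has 5 rows (hypothetical).
d <- expand.grid(class = c("A", "B", "C"), Grouping = c("a", "b"))
d <- d[rep(seq_len(nrow(d)), each = 5), ]

X <- model.matrix(~ class:Grouping, data = d)
n_cols    <- ncol(X)      # intercept + 6 cell indicators
rank_full <- qr(X)$rank   # the 6 indicators sum to the intercept column

# Same design after emptying one cell: its indicator becomes all zeros.
d2 <- d[!(d$class == "C" & d$Grouping == "b"), ]
X2 <- model.matrix(~ class:Grouping, data = d2)
rank_empty <- qr(X2)$rank

c(columns = n_cols, rank_full = rank_full, rank_empty = rank_empty)
```

With all cells populated, one column is aliased, which matches the single NA in the class:Grouping fit. With one empty cell, two columns are aliased, which matches the two NA rows in the lung example where no one has ph.ecog 3 and sex 2.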
male_data:
surgery age cancer survival
a00001 yes <=50 0 10
a00002 yes >50 1 15
a00003 no >50 0 2
.
.
.
.
Result:
Call:
coxph(formula = Surv(survival, cancer) ~ surgery + age, data = male_data)
n= 550517, number of events= 3276
coef exp(coef) se(coef) z Pr(>|z|)
surgery:yes -0.03 0.97 0.04 -0.88 0.377
age:>50 3.26 26.09 0.04 78.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
surgery:yes 0.97 1.03 0.9 1.04
age:>50 26.09 0.04 24.05 28.3
Concordance= 0.817 (se = 0.005 )
Likelihood ratio test= 7607 on 2 df, p=<2e-16
Wald test = 6182 on 2 df, p=<2e-16
Score (logrank) test = 13993 on 2 df, p=<2e-16
Now I need to enter the result in a new form, but I do not know what "adjusted by" means. Can I find this value in the result? I would appreciate any advice!
In any observational study, you need to define clearly your exposure and outcome of interest. It seems that your exposure of interest is whether the patient had surgery or not and the outcome either the incidence of cancer or death from cancer. If this is the case, your results are adjusted by age.
You are the only person who can answer this question! It truly depends on the research question and the underlying causal structure you are thinking about.
I want to calculate confidence intervals (CIs) for mixed models: a zero-inflated negative binomial model and a hurdle model. My code for the hurdle model looks like this (x1 and x2 are continuous, x3 is categorical):
m1 <- glmmTMB(count ~ x1 + x2 + x3 + (1 | year/class),
              data = bd,
              ziformula = ~ x2 + x3 + (1 | year/class),
              family = truncated_nbinom2)
I used confint, and I got these results:
ci <- confint(m1,parm="beta_")
ci
2.5 % 97.5 % Estimate
cond.(Intercept) 1.816255e-01 0.448860094 0.285524861
cond.x1 9.045278e-01 0.972083366 0.937697401
cond.x2 1.505770e+01 26.817439186 20.094998772
cond.x3high 1.190972e+00 1.492335046 1.333164894
cond.x3low 1.028147e+00 1.215828654 1.118056377
cond.x3reg 1.135515e+00 1.385833853 1.254445909
class:year.cond.Std.Dev.(Intercept) 2.256324e+00 2.662976154 2.441845815
year.cond.Std.Dev.(Intercept) 1.051889e+00 1.523719169 1.157153015
zi.(Intercept) 1.234418e-04 0.001309705 0.000402085
zi.x2 2.868578e-02 0.166378014 0.069084606
zi.x3high 8.972025e-01 1.805832900 1.272869874
Am I calculating the intervals correctly? Why is there only one category in x3 for zi?
If possible, I would also like to know if it's possible to plot these CIs.
Thanks!
Data looks like this:
class id year count x1 x2 x3
956 5 3002 2002 3 15.6 47.9 high
957 5 4004 2002 3 14.3 47.9 low
958 5 6021 2002 3 14.2 47.9 high
959 4 2030 2002 3 10.5 46.3 high
960 4 2031 2002 3 15.3 46.3 high
961 4 2034 2002 3 15.2 46.3 reg
with x1 and x2 continuous, x3 three level categorical variable (factor)
Summary of the model:
summary(m1)
'giveCsparse' has been deprecated; setting 'repr = "T"' for you'giveCsparse' has been deprecated; setting 'repr = "T"' for you'giveCsparse' has been deprecated; setting 'repr = "T"' for you
Family: truncated_nbinom2 ( log )
Formula: count ~ x1 + x2 + x3 + (1 | year/class)
Zero inflation: ~x2 + x3 + (1 | year/class)
Data: bd
AIC BIC logLik deviance df.resid
37359.7 37479.7 -18663.8 37327.7 13323
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
class:year(Intercept) 0.79701 0.8928
year (Intercept) 0.02131 0.1460
Number of obs: 13339, groups: class:year, 345; year, 15
Zero-inflation model:
Groups Name Variance Std.Dev.
dpto:year (Intercept) 1.024e+02 1.012e+01
year (Intercept) 7.842e-07 8.856e-04
Number of obs: 13339, groups: class:year, 345; year, 15
Overdispersion parameter for truncated_nbinom2 family (): 1.02
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.25343 0.23081 -5.431 5.62e-08 ***
x1 -0.06433 0.01837 -3.501 0.000464 ***
x2 3.00047 0.14724 20.378 < 2e-16 ***
x3high 0.28756 0.05755 4.997 5.82e-07 ***
x3low 0.11159 0.04277 2.609 0.009083 **
x3reg 0.22669 0.05082 4.461 8.17e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.8188 0.6025 -12.977 < 2e-16 ***
x2 -2.6724 0.4484 -5.959 2.53e-09 ***
x3high 0.2413 0.1784 1.352 0.17635
x3low -0.1325 0.1134 -1.169 0.24258
x3reg -0.3806 0.1436 -2.651 0.00802 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
CI with broom.mixed
> broom.mixed::tidy(m1, effects="fixed", conf.int=TRUE)
# A tibble: 12 x 9
effect component term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 fixed cond (Intercept) -1.25 0.231 -5.43 5.62e- 8 -1.71 -0.801
2 fixed cond x1 -0.0643 0.0184 -3.50 4.64e- 4 -0.100 -0.0283
3 fixed cond x2 3.00 0.147 20.4 2.60e-92 2.71 3.29
4 fixed cond x3high 0.288 0.0575 5.00 5.82e- 7 0.175 0.400
5 fixed cond x3low 0.112 0.0428 2.61 9.08e- 3 0.0278 0.195
6 fixed cond x3reg 0.227 0.0508 4.46 8.17e- 6 0.127 0.326
7 fixed zi (Intercept) -9.88 1.32 -7.49 7.04e-14 -12.5 -7.30
8 fixed zi x1 0.214 0.120 1.79 7.38e- 2 -0.0206 0.448
9 fixed zi x2 -2.69 0.449 -6.00 2.01e- 9 -3.57 -1.81
10 fixed zi x3high 0.232 0.178 1.30 1.93e- 1 -0.117 0.582
11 fixed zi x3low -0.135 0.113 -1.19 2.36e- 1 -0.357 0.0878
12 fixed zi x4reg -0.382 0.144 -2.66 7.74e- 3 -0.664 -0.101
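On the plotting question, a minimal base-graphics sketch is shown here. It copies the conditional-model rows (without the intercept) from the tidy() output above into a small data frame; the object name `ci_tab` is introduced for illustration, and the values are on the model's link (log) scale.

```r
# Conditional-model estimates and Wald limits copied from the tidy() output.
ci_tab <- data.frame(
  term      = c("x1", "x2", "x3high", "x3low", "x3reg"),
  estimate  = c(-0.0643, 3.00, 0.288, 0.112, 0.227),
  conf.low  = c(-0.100, 2.71, 0.175, 0.0278, 0.127),
  conf.high = c(-0.0283, 3.29, 0.400, 0.195, 0.326)
)

# One horizontal line per term, first term at the top.
y_pos <- rev(seq_len(nrow(ci_tab)))
plot(ci_tab$estimate, y_pos,
     xlim = range(ci_tab$conf.low, ci_tab$conf.high),
     pch = 19, yaxt = "n", ylab = "",
     xlab = "Estimate with 95% CI (link scale)")
segments(ci_tab$conf.low, y_pos, ci_tab$conf.high, y_pos)
axis(2, at = y_pos, labels = ci_tab$term, las = 1)
abline(v = 0, lty = 2)  # reference line at no effect
```

The same data frame could be passed to a plotting package instead; base graphics is used here only to keep the sketch dependency-free.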
tl;dr as far as I can tell this is a bug in confint.glmmTMB (and probably in the internal function glmmTMB:::getParms). In the meantime, broom.mixed::tidy(m1, effects="fixed") should do what you want. (There's now a fix in progress in the development version on GitHub, should make it to CRAN sometime? soon ...)
Reproducible example:
set up data
set.seed(101)
n <- 1e3
bd <- data.frame(
year=factor(sample(2002:2018, size=n, replace=TRUE)),
class=factor(sample(1:20, size=n, replace=TRUE)),
x1 = rnorm(n),
x2 = rnorm(n),
x3 = factor(sample(c("low","reg","high"), size=n, replace=TRUE),
levels=c("low","reg","high")),
count = rnbinom(n, mu = 3, size=1))
fit
library(glmmTMB)
m1 <- glmmTMB(count ~ x1 + x2 + x3 + (1 | year/class),
              data = bd,
              ziformula = ~ x2 + x3 + (1 | year/class),
              family = truncated_nbinom2)
confidence intervals
confint(m1, "beta_") ## wrong/ incomplete
broom.mixed::tidy(m1, effects="fixed", conf.int=TRUE) ## correct
You may want to think about which kind of confidence intervals you want:
Wald CIs (the default) are much faster to compute and are generally OK as long as (1) your data set is large and (2) you aren't estimating any parameters on the log/logit scale that are near the boundaries.
Likelihood profile CIs are more accurate but much slower.
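To make the Wald option concrete, here is a small worked sketch that rebuilds the interval for x1 in the conditional model by hand from the estimate and standard error printed in the question's summary(m1). The variable names are introduced here; this is ordinary estimate plus or minus z times SE arithmetic, not a call into glmmTMB.

```r
est <- -0.06433   # x1 estimate, conditional model (from summary(m1))
se  <-  0.01837   # its standard error
z   <- qnorm(0.975)

# Wald interval on the link (log) scale
wald_ci <- c(lower = est - z * se, estimate = est, upper = est + z * se)
wald_ci

# Exponentiate to express the same interval as a rate ratio
exp(wald_ci)
```

The link-scale limits agree with the conf.low and conf.high that broom.mixed::tidy reports for x1, and the exponentiated values agree closely with the cond.x1 row of the confint table in the question, which suggests that table was shown on the exponentiated scale.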
I am trying to explore regressions between abundances and 3 variables. My data (test.gam) look like this:
# A tibble: 6 x 5
Site Abundance SPM isotherm SiOH4
<chr> <dbl> <dbl> <dbl> <dbl>
1 cycle1 0.769 5960367. 102. 18.2
2 cycle1 0.632 6496360. 97.5 18.2
3 cycle1 0.983 5328652. 105 18.2
4 cycle1 1 6212034. 110 18.2
5 cycle1 0.821 5468987. 105 18.2
6 cycle1 0.734 5280549. 112. 18.2
For one of these variables (SiOH4), I have only one value per Site, while for the other 2 variables I have a single value for each station (each row being a station).
To plot the relation between abundances and SiOH4, I would simply compute a mean value for each Site. The relation shows a steady increase of abundance with SiOH4 level: Plot1.
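The per-Site averaging described above can be sketched in base R as follows. The data frame `test_gam_head` is a stand-in built from the six rows printed earlier (one Site only), so the result is illustrative rather than the real summary.

```r
# Six rows copied from the printed tibble; all belong to Site "cycle1".
test_gam_head <- data.frame(
  Site      = rep("cycle1", 6),
  Abundance = c(0.769, 0.632, 0.983, 1, 0.821, 0.734),
  SiOH4     = rep(18.2, 6)
)

# One row per Site: mean Abundance and the (constant) SiOH4 value.
site_means <- aggregate(cbind(Abundance, SiOH4) ~ Site,
                        data = test_gam_head, FUN = mean)
site_means
```

Applied to the full data, this gives one point per Site for the Abundance versus SiOH4 plot.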
Now I tried running a GAM on this data using the following code:
mod_gam1 <- gam(Abundance ~ s(isotherm, bs = "cr", k = 5)
+ SPM + s(SiOH4, bs = "cr", k = 5), data = test.gam, family = gaussian(link = log), gamma = 1.4)
giving me these results:
Family: gaussian
Link function: log
Formula:
Abundance ~ s(isotherm, bs = "cr", k = 5) + SPM + s(SiOH4, bs = "cr",
k = 5)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.182e-01 8.244e-02 -9.925 < 2e-16 ***
SPM -4.356e-08 1.153e-08 -3.778 0.000219 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(isotherm) 2.019 2.485 10.407 1.46e-05 ***
s(SiOH4) 3.861 3.986 9.823 1.01e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.492 Deviance explained = 51.2%
GCV = 0.044202 Scale est. = 0.040674 n = 177
So I am quite happy with the results, but when checking with gam.check I find that k is too low.
Method: GCV Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-8.801477e-14,5.555545e-13]
(score 0.04420205 & scale 0.04067442).
Hessian positive definite, eigenvalue range [6.631202e-05,7.084933e-05].
Model rank = 10 / 10
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(isotherm) 4.00 2.02 0.85 0.01 **
s(SiOH4) 4.00 3.86 0.59 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that I deliberately set k to 5; otherwise the pattern is overfitted.
I thought that this might be due to the fact that many of the values in SiOH4 are repeated. I modified my data to keep only the first value for each Site (replacing the value in all other rows with NA), like this:
# A tibble: 6 x 5
# Groups: Site [1]
Site Abundance SPM isotherm SiOH4
<chr> <dbl> <dbl> <dbl> <dbl>
1 cycle1 0.769 5960367. 102. 18.2
2 cycle1 0.632 6496360. 97.5 NA
3 cycle1 0.983 5328652. 105 NA
4 cycle1 1 6212034. 110 NA
5 cycle1 0.821 5468987. 105 NA
6 cycle1 0.734 5280549. 112. NA
I hoped this would prevent the repeated values. But this way I also lose most of my rows, because of the na.omit option. However, when I run the same GAM on this data, gam.check no longer reports k as too low.
So do I need to keep the repeated values and ignore the warning from gam.check, or is there a way to keep all rows even when NAs exist?
I am fitting a Cox model to some data structured as follows:
str(test)
'data.frame': 147 obs. of 8 variables:
$ AGE : int 71 69 90 78 61 74 78 78 81 45 ...
$ Gender : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...
$ RACE : Factor w/ 5 levels "","BLACK","HISPANIC",..: 5 2 5 5 5 5 5 5 5 1 ...
$ SIDE : Factor w/ 2 levels "L","R": 1 1 2 1 2 1 1 1 2 1 ...
$ LESION.INDICATION: Factor w/ 12 levels "CLAUDICATION",..: 1 11 4 11 9 1 1 11 11 11 ...
$ RUTH.CLASS : int 3 5 4 5 4 3 3 5 5 5 ...
$ LESION.TYPE : Factor w/ 3 levels "","OCCLUSION",..: 3 3 2 3 3 3 2 3 3 3 ...
$ Primary : int 1190 1032 166 689 219 840 1063 115 810 157 ...
The RUTH.CLASS variable is actually a factor, and I've changed it to one as follows:
> test$RUTH.CLASS <- as.factor(test$RUTH.CLASS)
> summary(test$RUTH.CLASS)
3 4 5 6
48 56 35 8
Great.
After fitting the model:
stent.surv <- Surv(test$Primary)
> cox.ruthclass <- coxph(stent.surv ~ RUTH.CLASS, data=test )
>
> summary(cox.ruthclass)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS, data = test)
n= 147, number of events= 147
coef exp(coef) se(coef) z Pr(>|z|)
RUTH.CLASS4 0.1599 1.1734 0.1987 0.804 0.42111
RUTH.CLASS5 0.5848 1.7947 0.2263 2.585 0.00974 **
RUTH.CLASS6 0.3624 1.4368 0.3846 0.942 0.34599
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4 1.173 0.8522 0.7948 1.732
RUTH.CLASS5 1.795 0.5572 1.1518 2.796
RUTH.CLASS6 1.437 0.6960 0.6762 3.053
Concordance= 0.574 (se = 0.026 )
Rsquare= 0.045 (max possible= 1 )
Likelihood ratio test= 6.71 on 3 df, p=0.08156
Wald test = 7.09 on 3 df, p=0.06902
Score (logrank) test = 7.23 on 3 df, p=0.06478
> levels(test$RUTH.CLASS)
[1] "3" "4" "5" "6"
When I fit more variables in the model, similar things happen:
cox.fit <- coxph(stent.surv ~ RUTH.CLASS + LESION.INDICATION + LESION.TYPE, data=test )
>
> summary(cox.fit)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS + LESION.INDICATION +
LESION.TYPE, data = test)
n= 147, number of events= 147
coef exp(coef) se(coef) z Pr(>|z|)
RUTH.CLASS4 -0.5854 0.5569 1.1852 -0.494 0.6214
RUTH.CLASS5 -0.1476 0.8627 1.0182 -0.145 0.8847
RUTH.CLASS6 -0.4509 0.6370 1.0998 -0.410 0.6818
LESION.INDICATIONEMBOLIC -0.4611 0.6306 1.5425 -0.299 0.7650
LESION.INDICATIONISCHEMIA 1.3794 3.9725 1.1541 1.195 0.2320
LESION.INDICATIONISCHEMIA/CLAUDICATION 0.2546 1.2899 1.0189 0.250 0.8027
LESION.INDICATIONREST PAIN 0.5302 1.6993 1.1853 0.447 0.6547
LESION.INDICATIONTISSUE LOSS 0.7793 2.1800 1.0254 0.760 0.4473
LESION.TYPEOCCLUSION -0.5886 0.5551 0.4360 -1.350 0.1770
LESION.TYPESTEN -0.7895 0.4541 0.4378 -1.803 0.0714 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4 0.5569 1.7956 0.05456 5.684
RUTH.CLASS5 0.8627 1.1591 0.11726 6.348
RUTH.CLASS6 0.6370 1.5698 0.07379 5.499
LESION.INDICATIONEMBOLIC 0.6306 1.5858 0.03067 12.964
LESION.INDICATIONISCHEMIA 3.9725 0.2517 0.41374 38.141
LESION.INDICATIONISCHEMIA/CLAUDICATION 1.2899 0.7752 0.17510 9.503
LESION.INDICATIONREST PAIN 1.6993 0.5885 0.16645 17.347
LESION.INDICATIONTISSUE LOSS 2.1800 0.4587 0.29216 16.266
LESION.TYPEOCCLUSION 0.5551 1.8015 0.23619 1.305
LESION.TYPESTEN 0.4541 2.2023 0.19250 1.071
Concordance= 0.619 (se = 0.028 )
Rsquare= 0.137 (max possible= 1 )
Likelihood ratio test= 21.6 on 10 df, p=0.01726
Wald test = 22.23 on 10 df, p=0.01398
Score (logrank) test = 23.46 on 10 df, p=0.009161
> levels(test$LESION.INDICATION)
[1] "CLAUDICATION" "EMBOLIC" "ISCHEMIA" "ISCHEMIA/CLAUDICATION"
[5] "REST PAIN" "TISSUE LOSS"
> levels(test$LESION.TYPE)
[1] "" "OCCLUSION" "STEN"
truncated output from model.matrix below:
> model.matrix(cox.fit)
RUTH.CLASS4 RUTH.CLASS5 RUTH.CLASS6 LESION.INDICATIONEMBOLIC LESION.INDICATIONISCHEMIA
1 0 0 0 0 0
2 0 1 0 0 0
We can see that the first level of each of these is being excluded from the model. Any input would be greatly appreciated. I noticed that for the LESION.TYPE variable the blank level "" is the one not included, but that is not by design: it should be NA or something similar.
I'm confused and could use some help with this. Thanks.
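For the blank level mentioned above, one way to turn "" into NA is sketched below on a small hypothetical factor (`lesion_type` is a stand-in for test$LESION.TYPE, not the real column). It assumes the blank entries really are missing values.

```r
# Hypothetical stand-in for test$LESION.TYPE, including one blank entry.
lesion_type <- factor(c("", "OCCLUSION", "STEN", "STEN"))

lesion_type[lesion_type == ""] <- NA     # mark blank entries as missing
lesion_type <- droplevels(lesion_type)   # drop the now-unused "" level

levels(lesion_type)
sum(is.na(lesion_type))
```

After this, the first remaining level (here "OCCLUSION") becomes the reference level, and rows with NA are dropped by the model's default NA handling.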
Factor coefficients in any model are estimated relative to a base level (a contrast). By default, your contrasts use one level of each factor as that base. There is no point in estimating a coefficient for the dropped level, because the model already gives the prediction for that level when all the other indicator columns are 0 (the levels of a factor are exhaustive and mutually exclusive for every observation). You can alter the default by changing the contrasts in your options.
For your coefficients to be versus an average of all levels:
options(contrasts=c(unordered="contr.sum", ordered="contr.poly"))
For your coefficients to be versus a specific treatment (what you have above and your default):
options(contrasts=c(unordered="contr.treatment", ordered="contr.poly"))
As you can see, there are two types of factors in R: unordered (or categorical, e.g. red, green, blue) and ordered (e.g. strongly disagree, disagree, no opinion, agree, strongly agree).
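To see the effect of the contrast choice without fitting a model, here is a small sketch using model.matrix on a hypothetical factor with the same levels as RUTH.CLASS. It assumes the default options (treatment contrasts for unordered factors) and uses contrasts.arg and relevel rather than changing the global options.

```r
# Hypothetical factor with the levels 3, 4, 5, 6 seen in the question.
d <- data.frame(RUTH = factor(c(3, 4, 5, 6, 3, 4, 5, 6)))

# Default treatment contrasts: the first level ("3") is the baseline.
default_cols <- colnames(model.matrix(~ RUTH, data = d))

# Choose a different baseline with relevel(): now "5" is dropped.
d_ref5 <- transform(d, RUTH = relevel(RUTH, ref = "5"))
ref5_cols <- colnames(model.matrix(~ RUTH, data = d_ref5))

# Sum-to-zero contrasts for this factor only, without touching options().
sum_cols <- colnames(model.matrix(~ RUTH, data = d,
                                  contrasts.arg = list(RUTH = "contr.sum")))

list(default = default_cols, ref5 = ref5_cols, sum = sum_cols)
```

In every case there are three columns for the four levels; the choice of contrast changes what each coefficient is compared against, not the fitted model.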