R - Cox hazard model not including levels of a factor

I am fitting a Cox model to data structured as follows:
str(test)
'data.frame': 147 obs. of 8 variables:
$ AGE : int 71 69 90 78 61 74 78 78 81 45 ...
$ Gender : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...
$ RACE : Factor w/ 5 levels "","BLACK","HISPANIC",..: 5 2 5 5 5 5 5 5 5 1 ...
$ SIDE : Factor w/ 2 levels "L","R": 1 1 2 1 2 1 1 1 2 1 ...
$ LESION.INDICATION: Factor w/ 12 levels "CLAUDICATION",..: 1 11 4 11 9 1 1 11 11 11 ...
$ RUTH.CLASS : int 3 5 4 5 4 3 3 5 5 5 ...
$ LESION.TYPE : Factor w/ 3 levels "","OCCLUSION",..: 3 3 2 3 3 3 2 3 3 3 ...
$ Primary : int 1190 1032 166 689 219 840 1063 115 810 157 ...
The RUTH.CLASS variable is really a factor, so I converted it:
> test$RUTH.CLASS <- as.factor(test$RUTH.CLASS)
> summary(test$RUTH.CLASS)
3 4 5 6
48 56 35 8
That works. After fitting the model, however, only three of the four levels appear in the output:
stent.surv <- Surv(test$Primary)
> cox.ruthclass <- coxph(stent.surv ~ RUTH.CLASS, data=test )
>
> summary(cox.ruthclass)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS, data = test)
n= 147, number of events= 147
coef exp(coef) se(coef) z Pr(>|z|)
RUTH.CLASS4 0.1599 1.1734 0.1987 0.804 0.42111
RUTH.CLASS5 0.5848 1.7947 0.2263 2.585 0.00974 **
RUTH.CLASS6 0.3624 1.4368 0.3846 0.942 0.34599
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4 1.173 0.8522 0.7948 1.732
RUTH.CLASS5 1.795 0.5572 1.1518 2.796
RUTH.CLASS6 1.437 0.6960 0.6762 3.053
Concordance= 0.574 (se = 0.026 )
Rsquare= 0.045 (max possible= 1 )
Likelihood ratio test= 6.71 on 3 df, p=0.08156
Wald test = 7.09 on 3 df, p=0.06902
Score (logrank) test = 7.23 on 3 df, p=0.06478
> levels(test$RUTH.CLASS)
[1] "3" "4" "5" "6"
When I add more variables to the model, the same thing happens:
cox.fit <- coxph(stent.surv ~ RUTH.CLASS + LESION.INDICATION + LESION.TYPE, data=test )
>
> summary(cox.fit)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS + LESION.INDICATION +
LESION.TYPE, data = test)
n= 147, number of events= 147
coef exp(coef) se(coef) z Pr(>|z|)
RUTH.CLASS4 -0.5854 0.5569 1.1852 -0.494 0.6214
RUTH.CLASS5 -0.1476 0.8627 1.0182 -0.145 0.8847
RUTH.CLASS6 -0.4509 0.6370 1.0998 -0.410 0.6818
LESION.INDICATIONEMBOLIC -0.4611 0.6306 1.5425 -0.299 0.7650
LESION.INDICATIONISCHEMIA 1.3794 3.9725 1.1541 1.195 0.2320
LESION.INDICATIONISCHEMIA/CLAUDICATION 0.2546 1.2899 1.0189 0.250 0.8027
LESION.INDICATIONREST PAIN 0.5302 1.6993 1.1853 0.447 0.6547
LESION.INDICATIONTISSUE LOSS 0.7793 2.1800 1.0254 0.760 0.4473
LESION.TYPEOCCLUSION -0.5886 0.5551 0.4360 -1.350 0.1770
LESION.TYPESTEN -0.7895 0.4541 0.4378 -1.803 0.0714 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4 0.5569 1.7956 0.05456 5.684
RUTH.CLASS5 0.8627 1.1591 0.11726 6.348
RUTH.CLASS6 0.6370 1.5698 0.07379 5.499
LESION.INDICATIONEMBOLIC 0.6306 1.5858 0.03067 12.964
LESION.INDICATIONISCHEMIA 3.9725 0.2517 0.41374 38.141
LESION.INDICATIONISCHEMIA/CLAUDICATION 1.2899 0.7752 0.17510 9.503
LESION.INDICATIONREST PAIN 1.6993 0.5885 0.16645 17.347
LESION.INDICATIONTISSUE LOSS 2.1800 0.4587 0.29216 16.266
LESION.TYPEOCCLUSION 0.5551 1.8015 0.23619 1.305
LESION.TYPESTEN 0.4541 2.2023 0.19250 1.071
Concordance= 0.619 (se = 0.028 )
Rsquare= 0.137 (max possible= 1 )
Likelihood ratio test= 21.6 on 10 df, p=0.01726
Wald test = 22.23 on 10 df, p=0.01398
Score (logrank) test = 23.46 on 10 df, p=0.009161
> levels(test$LESION.INDICATION)
[1] "CLAUDICATION" "EMBOLIC" "ISCHEMIA" "ISCHEMIA/CLAUDICATION"
[5] "REST PAIN" "TISSUE LOSS"
> levels(test$LESION.TYPE)
[1] "" "OCCLUSION" "STEN"
Truncated output from model.matrix:
> model.matrix(cox.fit)
RUTH.CLASS4 RUTH.CLASS5 RUTH.CLASS6 LESION.INDICATIONEMBOLIC LESION.INDICATIONISCHEMIA
1 0 0 0 0 0
2 0 1 0 0 0
We can see that the first level of each of these factors is excluded from the model. Any input would be greatly appreciated. I also noticed that the blank level "" of LESION.TYPE is not included, but that is not by design: it should be NA or something similar.
I'm confused and could use some help with this. Thanks.

For a factor, a model returns coefficients relative to a base level, as defined by a contrast. R's default is treatment contrasts, which use the first level of each factor as the reference. No coefficient is estimated for the reference level because it would be redundant: every observation belongs to exactly one level, so the reference level is simply the case where all of the factor's dummy variables are 0, and each reported coefficient is the difference from that case. You can change the default by setting the contrasts in your options.
For coefficients relative to the average over all levels (sum-to-zero contrasts):
options(contrasts=c(unordered="contr.sum", ordered="contr.poly"))
For coefficients relative to a specific reference level (what you have above, and the default):
options(contrasts=c(unordered="contr.treatment", ordered="contr.poly"))
As you can see, there are two types of factors in R: unordered (or categorical, e.g. red, green, blue) and ordered (e.g. strongly disagree, disagree, no opinion, agree, strongly agree).
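If you only want a different reference level for one factor, releveling it is an alternative to changing the global contrasts. The sketch below is not from the original post: it uses the lung data shipped with the survival package as a stand-in for test, and the lesion vector is a hypothetical toy. It also shows one way to turn a blank level "" into NA so that it does not silently become the reference level.

```r
library(survival)

# Stand-in data (assumption): lung from the survival package, not the asker's `test`
d <- lung
d$ph.ecog <- factor(d$ph.ecog)            # levels "0" "1" "2" "3"

# Default treatment contrasts: the first level ("0") is the reference and gets no row
fit0 <- coxph(Surv(time, status) ~ ph.ecog, data = d)
names(coef(fit0))

# Choose a different reference level explicitly
d$ph.ecog_ref1 <- relevel(d$ph.ecog, ref = "1")
fit1 <- coxph(Surv(time, status) ~ ph.ecog_ref1, data = d)
names(coef(fit1))

# Hypothetical factor with a blank level, similar to LESION.TYPE
lesion <- factor(c("", "OCCLUSION", "STEN", "STEN", ""))
lesion[lesion == ""] <- NA                # blank entries become missing
lesion <- droplevels(lesion)              # remove the now-empty "" level
levels(lesion)
```

Both fits describe the same model; only the comparison that each coefficient expresses changes.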

Related

How to understand and explain the level with NA value in coxph interaction model

A minimal example is given below. Why does the last row (the last combination of levels) get NA? I thought it should be treated as the reference level; is that right? Which combination is the reference level?
> library(survival)
> summary(coxph(Surv(time, status) ~ factor(ph.ecog) * factor(sex), data = lung))
Call:
coxph(formula = Surv(time, status) ~ factor(ph.ecog) * factor(sex),
data = lung)
n= 227, number of events= 164
(1 observation deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
factor(ph.ecog)1 0.40410 1.49796 0.23525 1.718 0.08584 .
factor(ph.ecog)2 0.84691 2.33242 0.26889 3.150 0.00163 **
factor(ph.ecog)3 2.01808 7.52390 1.03086 1.958 0.05027 .
factor(sex)2 -0.67710 0.50809 0.38435 -1.762 0.07812 .
factor(ph.ecog)1:factor(sex)2 0.07719 1.08024 0.45168 0.171 0.86432
factor(ph.ecog)2:factor(sex)2 0.32936 1.39008 0.49576 0.664 0.50646
factor(ph.ecog)3:factor(sex)2 NA NA 0.00000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
factor(ph.ecog)1 1.4980 0.6676 0.9446 2.375
factor(ph.ecog)2 2.3324 0.4287 1.3770 3.951
factor(ph.ecog)3 7.5239 0.1329 0.9976 56.743
factor(sex)2 0.5081 1.9682 0.2392 1.079
factor(ph.ecog)1:factor(sex)2 1.0802 0.9257 0.4457 2.618
factor(ph.ecog)2:factor(sex)2 1.3901 0.7194 0.5261 3.673
factor(ph.ecog)3:factor(sex)2 NA NA NA NA
Concordance= 0.643 (se = 0.025 )
Likelihood ratio test= 30.07 on 6 df, p=0.00004
Wald test = 29.75 on 6 df, p=0.00004
Score (logrank) test = 32.92 on 6 df, p=0.00001
Stranger still, when the model contains only the interaction, the last one or more combinations get NA:
> summary(coxph(Surv(time, status) ~ factor(ph.ecog) : factor(sex), data = lung))
Call:
coxph(formula = Surv(time, status) ~ factor(ph.ecog):factor(sex),
data = lung)
n= 227, number of events= 164
(1 observation deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
factor(ph.ecog)0:factor(sex)1 -0.49916 0.60704 0.31688 -1.575 0.11520
factor(ph.ecog)1:factor(sex)1 -0.09506 0.90932 0.28525 -0.333 0.73894
factor(ph.ecog)2:factor(sex)1 0.34774 1.41587 0.31502 1.104 0.26964
factor(ph.ecog)3:factor(sex)1 1.51892 4.56729 1.04054 1.460 0.14436
factor(ph.ecog)0:factor(sex)2 -1.17627 0.30843 0.41739 -2.818 0.00483 **
factor(ph.ecog)1:factor(sex)2 -0.69498 0.49909 0.31614 -2.198 0.02793 *
factor(ph.ecog)2:factor(sex)2 NA NA 0.00000 NA NA
factor(ph.ecog)3:factor(sex)2 NA NA 0.00000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
factor(ph.ecog)0:factor(sex)1 0.6070 1.6473 0.3262 1.1296
factor(ph.ecog)1:factor(sex)1 0.9093 1.0997 0.5199 1.5904
factor(ph.ecog)2:factor(sex)1 1.4159 0.7063 0.7636 2.6252
factor(ph.ecog)3:factor(sex)1 4.5673 0.2189 0.5942 35.1048
factor(ph.ecog)0:factor(sex)2 0.3084 3.2422 0.1361 0.6989
factor(ph.ecog)1:factor(sex)2 0.4991 2.0037 0.2686 0.9274
factor(ph.ecog)2:factor(sex)2 NA NA NA NA
factor(ph.ecog)3:factor(sex)2 NA NA NA NA
Concordance= 0.643 (se = 0.025 )
Likelihood ratio test= 30.07 on 6 df, p=0.00004
Wald test = 29.75 on 6 df, p=0.00004
Score (logrank) test = 32.92 on 6 df, p=0.00001
UPDATE
Thanks to the check suggested by @rawr, I looked at my real data and found that the NA there cannot be explained this way:
> with(data_case, ftable(class, Grouping, OS_Status))
OS_Status 0 1
class Grouping
A a 33 14
b 22 26
B a 49 21
b 43 28
C a 86 25
b 77 42
> fit = coxph(Surv(OS, OS_Status) ~ class:Grouping, data_case)
> summary(fit)
Call:
coxph(formula = Surv(OS, OS_Status) ~ class:Grouping, data = data_case)
n= 466, number of events= 156
coef exp(coef) se(coef) z Pr(>|z|)
classA:Groupinga -0.3504 0.7044 0.3099 -1.131 0.2582
classB:Groupinga -0.4621 0.6299 0.2695 -1.715 0.0864 .
classC:Groupinga -0.6477 0.5232 0.2534 -2.556 0.0106 *
classA:Groupingb 0.3717 1.4502 0.2503 1.485 0.1376
classB:Groupingb -0.1213 0.8858 0.2455 -0.494 0.6213
classC:Groupingb NA NA 0.0000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
classA:Groupinga 0.7044 1.4196 0.3838 1.2930
classB:Groupinga 0.6299 1.5875 0.3714 1.0684
classC:Groupinga 0.5232 1.9112 0.3184 0.8598
classA:Groupingb 1.4502 0.6896 0.8879 2.3685
classB:Groupingb 0.8858 1.1289 0.5475 1.4331
classC:Groupingb NA NA NA NA
Concordance= 0.58 (se = 0.027 )
Likelihood ratio test= 16.48 on 5 df, p=0.006
Wald test = 16.74 on 5 df, p=0.005
Score (logrank) test = 17.5 on 5 df, p=0.004
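One way to see which combination acts as the reference is to build the combined factor yourself. The sketch below is an illustration, not part of the original question; grp and grp2 are hypothetical helper columns. With a formula that contains only a:b, R creates an indicator column for every combination. An empty combination (there is no patient with ph.ecog 3 and sex 2 in lung) gives an all-zero column, and because the remaining indicators sum to one in every row, one more column is redundant with the baseline hazard. Both come back as NA, and the redundant one effectively serves as the reference, which would also explain the single NA when every cell is populated.

```r
library(survival)

d <- lung

# Cross-tabulation shows which cells are empty in the first place
with(d, table(ph.ecog, sex))

# Hypothetical helper: one factor with a level per observed combination
d$grp <- droplevels(interaction(d$ph.ecog, d$sex))
levels(d$grp)                      # the empty combination "3.2" is gone

# Treatment contrasts: the first level of grp ("0.1") is the explicit reference
fit_grp <- coxph(Surv(time, status) ~ grp, data = d)
coef(fit_grp)

# To compare against another cell, relevel before fitting
d$grp2 <- relevel(d$grp, ref = "2.2")
fit_grp2 <- coxph(Surv(time, status) ~ grp2, data = d)
coef(fit_grp2)
```

With "2.2" as the reference, the coefficients should line up with the non-NA rows of the interaction-only fit above, which suggests that ph.ecog 2 with sex 2 was the implicit reference there.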

Plot interaction in paneldata

I have run this regression without any problems, and I get one coefficient for each interaction between econ_sit and educ_cat. econ_sit is a continuous variable, and educ_cat is a categorical variable with values 1-6. How can I plot the coefficients for the interaction terms only, in a clear way?
model_int_f <- felm(satis_gov_sc ~ econ_sit*factor(educ_cat) + factor(benefit) + econ_neth + age + gender + pol_sof
| factor(wave) + factor(id) # Respondent and time fixed effects
| 0
| id, # Cluster standard errors on each respondent
data = full1)
summary(model_int_f)
Call:
felm(formula = satis_gov_sc ~ econ_sit * factor(educ_cat) + factor(benefit) + econ_neth + age + gender + pol_sof | factor(wave) + factor(id) | 0 | id, data = full1)
Residuals:
Min 1Q Median 3Q Max
-0.58468 -0.04464 0.00000 0.04728 0.78470
Coefficients:
Estimate Cluster s.e. t value Pr(>|t|)
econ_sit 0.1411692 0.0603100 2.341 0.01928 *
factor(educ_cat)2 0.0525580 0.0450045 1.168 0.24292
factor(educ_cat)3 0.1229048 0.0576735 2.131 0.03313 *
factor(educ_cat)4 0.1244146 0.0486455 2.558 0.01057 *
factor(educ_cat)5 0.1245556 0.0520246 2.394 0.01669 *
factor(educ_cat)6 0.1570034 0.0577240 2.720 0.00655 **
factor(benefit)2 -0.0030380 0.0119970 -0.253 0.80010
factor(benefit)3 0.0026064 0.0072590 0.359 0.71957
econ_neth 0.0642726 0.0131940 4.871 1.14e-06 ***
age 0.0177453 0.0152661 1.162 0.24512
gender 0.1088780 0.0076137 14.300 < 2e-16 ***
pol_sof 0.0006003 0.0094504 0.064 0.94935
econ_sit:factor(educ_cat)2 -0.0804820 0.0653488 -1.232 0.21816
econ_sit:factor(educ_cat)3 -0.0950652 0.0793818 -1.198 0.23114
econ_sit:factor(educ_cat)4 -0.1259772 0.0692072 -1.820 0.06877 .
econ_sit:factor(educ_cat)5 -0.1469749 0.0654870 -2.244 0.02485 *
econ_sit:factor(educ_cat)6 -0.1166243 0.0693709 -1.681 0.09279 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1161 on 11159 degrees of freedom
(23983 observations deleted due to missingness)
Multiple R-squared(full model): 0.8119 Adjusted R-squared: 0.717
Multiple R-squared(proj model): 0.00657 Adjusted R-squared: -0.4946
F-statistic(full model, *iid*):8.557 on 5630 and 11159 DF, p-value: < 2.2e-16
F-statistic(proj model): 55.38 on 17 and 5609 DF, p-value: < 2.2e-16
This is what my data looks like:
$ id : num 1 1 1 1 2 2 2 2 3 3 3 3
$ wave : chr "2013" "2015" "2016" "2017" ...
$ satis_gov_sc: num 0.5 0.4 0.4 0.6 0.6 0.5 0.6 0.7 0.7 0.7 ...
$ econ_sit : num NA NA 0.708 0.75 0.708 ...
$ educ_cat : num 5 5 5 5 5 6 6 6 6 6 ...
$ benefit : num 3 3 3 3 3 3 3 3 3 3 ...
$ econ_neth : num NA 0.6 0.6 0.7 0.7 0.5 0.4 0.6 0.8 0.7 ...
$ age : num 58 60 61 62 63 51 53 54 55 56 ...
$ gender : num 1 1 1 1 1 1 1 1 1 1 ...
$ pol_sof : num 1 1 1 0.8 1 1 1 1 0.8 1 ...
I've tried to run a simple plot_model call with the following code:
plot_model(model_int_f, type = "pred", terms = c("econ_sit", "educ_cat"))
However, I only get an error, because felm objects are not compatible with type = "pred":
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "felm"
Any suggestions on how to plot the interaction terms?
Thanks in advance!
felm does not have a predict method, so it is not compatible with plot_model(type = "pred"). You could use another fixed-effects library instead.
Here's an example using fixest. Since you did not provide a sample of your data, I have used data(iris).
library(fixest)
library(sjPlot)
# Species fixed effects, standard errors clustered by Species
res <- feols(Sepal.Length ~ Sepal.Width + Petal.Length:Species | Species, data = iris, cluster = "Species")
plot_model(res, type = "pred", terms = c("Petal.Length", "Species"))
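If only the interaction coefficients are needed, another option is to plot them directly from the coefficient table, which needs no predict method. The following is a minimal sketch, assuming lm on iris as a stand-in for the felm fit. plot_interactions is a hypothetical helper, and it assumes that the first two columns of coef(summary(model)) hold the estimate and its standard error, which should be checked for a felm object.

```r
# Hypothetical helper: dot-and-whisker plot of the interaction coefficients only
plot_interactions <- function(model, level = 0.95) {
  tab <- coef(summary(model))
  # interaction terms are the rows whose name contains ":"
  int <- tab[grepl(":", rownames(tab), fixed = TRUE), , drop = FALSE]
  est <- int[, 1]                        # assumed: column 1 = estimate
  se  <- int[, 2]                        # assumed: column 2 = standard error
  z   <- qnorm(1 - (1 - level) / 2)      # normal approximation for the interval
  lo  <- est - z * se
  hi  <- est + z * se
  pos <- seq_along(est)
  plot(est, pos, xlim = range(lo, hi, 0), yaxt = "n", pch = 19,
       xlab = "Estimate", ylab = "")
  segments(lo, pos, hi, pos)             # confidence intervals
  abline(v = 0, lty = 2)                 # reference line at zero
  axis(2, at = pos, labels = rownames(int), las = 1, cex.axis = 0.6)
  invisible(data.frame(term = rownames(int), estimate = est,
                       lower = lo, upper = hi))
}

# Stand-in model (assumption): continuous-by-factor interaction on iris
fit <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)
res <- plot_interactions(fit)
res
```

Each point is the difference in the slope of the continuous variable relative to the reference category, so the plot reads the same way as the interaction rows of the summary.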

R coxph() with interaction term, Warning: X matrix deemed to be singular

Please be patient with me. I'm new to this site.
I am modeling turtle nest survival using the coxph() function and have run into a confusing problem with an interaction term between species and nest cages. I have nests from 3 species of turtles (7, 10, and 111 nests per species).
Species 1 (7 nests): all nests have nest cages.
Species 2 (10 nests): no nests have nest cages.
Species 3 (111 nests): about half of the nests have nest cages.
Here is my model with the summary output:
S<-Surv(time, event)
n8<-coxph(S~species:cage, data=nesta1)
Warning message:
In coxph(S ~ species:cage, data = nesta1) :
X matrix deemed to be singular; variable 1 5 6
summary(n8)
Call:
coxph(formula = S ~ species:cage, data = nesta1)
n= 128, number of events= 73
coef exp(coef) se(coef) z Pr(>|z|)
species1:cageN NA NA 0.0000 NA NA
species2:cageN 1.2399 3.4554 0.3965 3.128 0.00176 **
species3:cageN 0.5511 1.7351 0.2664 2.068 0.03860 *
species1:cageY -0.1054 0.8999 0.6145 -0.172 0.86379
species2:cageY NA NA 0.0000 NA NA
species3:cageY NA NA 0.0000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
species1:cageN NA NA NA NA
species2:cageN 3.4554 0.2894 1.5887 7.515
species3:cageN 1.7351 0.5763 1.0293 2.925
species1:cageY 0.8999 1.1112 0.2698 3.001
species2:cageY NA NA NA NA
species3:cageY NA NA NA NA
Concordance= 0.61 (se = 0.038 )
Rsquare= 0.079 (max possible= 0.993 )
Likelihood ratio test= 10.57 on 3 df, p=0.01426
Wald test = 11.36 on 3 df, p=0.009908
Score (logrank) test = 12.22 on 3 df, p=0.006672
I understand that I would have singularities for species 1 and 2, but not for species 3. Why would the "species3:cageY" line be singular when there are species 3 nests with nest cages on them?
Is it ok to include species 1 and 2 even though they have those singularities?
Edit: I cannot find any errors in my data. I have decimal numbers for the time variable for a few nests, but that doesn't seem to be a problem for species 3 nests without a nest cage. For species 3, I have the full range of time values for nests with and without a nest cage and I have both true and false events for nests with and without a nest cage.
Edit:
with( nesta1, table(event, species, cage))
, , cage = N
species
event 1 2 3
0 0 1 24
1 0 9 38
, , cage = Y
species
event 1 2 3
0 4 0 26
1 3 0 23
Edit 2: I understand that interaction-only models are not very useful, but the interaction term results behave the same way whether I have other main effects in the model or not. I've removed the other main effects to simplify this question.
Thank you!
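The cross-tabulation in the question suggests an explanation. The sketch below rebuilds only the species-by-cage counts from that table (no survival times, so it is an illustration under that assumption, not a refit of the model) and inspects the design matrix that species:cage produces. Two combinations have no nests, so their indicator columns are all zero. The four remaining indicators add up to one in every row, and a constant is absorbed by the baseline hazard in a Cox model, so one of them, here species3:cageY, cannot be estimated separately and acts as the reference.

```r
# Nest counts per species and cage status, summed over event from the table
counts <- data.frame(
  species = factor(rep(1:3, times = 2)),
  cage    = factor(rep(c("N", "Y"), each = 3)),
  n       = c(0, 10, 62, 7, 0, 49)
)

# One row per nest; factor levels are kept even for empty combinations
nests <- counts[rep(seq_len(nrow(counts)), counts$n), c("species", "cage")]

# Design matrix for the interaction-only formula, without the intercept column
X <- model.matrix(~ species:cage, data = nests)[, -1]
colSums(X)                       # all-zero columns correspond to empty cells

# Columns that can be estimated once a constant is accounted for
estimable <- qr(cbind(1, X))$rank - 1
estimable
```

The estimable count agrees with the 3 df reported by the tests in the summary. Under this reading, the species3:cageN coefficient is the log hazard ratio of uncaged versus caged species 3 nests.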

Survey package (survival analysis)

I am using the survey package to analyse a longitudinal database. The data looks like
personid spellid long.w Dur rc sex 1 10 age
1 1 278 6.4702295519 0 0 47 20 16
2 1 203 2.8175129012 1 1 126 87 62
3 1 398 6.1956669321 0 0 180 6 37
4 1 139 7.2791061847 1 0 104 192 20
7 1 10 3.6617503439 1 0 18 24 25
8 1 3 2.265464682 0 1 168 136 40
9 1 134 6.3180994022 0 1 116 194 35
10 1 272 6.9167936912 0 0 39 119 45
11 1 296 5.354798213 1 1 193 161 62
After the variable sex come 10 bootstrap weights (columns 1 to 10), followed by the variable age.
The longitudinal weight is given in the column long.w.
I am using the following code:
data.1 <- read.table("Panel.csv", sep = ",",header=T)
library(survey)
library(survival)
#### Unweighted model
mod.1 <- summary(coxph(Surv(Dur, rc) ~ age + sex, data.1))
mod.1
coxph(formula = Surv(Dur, rc) ~ age + sex, data = data.1)
n= 36, number of events= 14
coef exp(coef) se(coef) z Pr(>|z|)
age -4.992e-06 1.000e+00 2.291e-02 0.000 1.000
sex 5.277e-01 1.695e+00 5.750e-01 0.918 0.359
exp(coef) exp(-coef) lower .95 upper .95
age 1.000 1.00 0.9561 1.046
sex 1.695 0.59 0.5492 5.232
Concordance= 0.651 (se = 0.095 )
Rsquare= 0.024 (max possible= 0.858 )
### --- Weights
weights <- data.1[,7:16]*data.1$long.w
panel <-svrepdesign(data=data.1,
weights=data.1[,3],
type="BRR",
repweights=weights,
combined.weights=TRUE
)
#### Weighted model
mod.1.w <- svycoxph(Surv(Dur,rc)~ age+ sex ,design=panel)
summary(mod.1.w)
Balanced Repeated Replicates with 10 replicates.
Call:
svycoxph.svyrep.design(formula = Surv(Dur, rc) ~ age + sex, design = panel)
n= 36, number of events= 14
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0198 1.0200 0.0131 1.512 0.131
sex 1.0681 2.9098 0.2336 4.572 4.84e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.02 0.9804 0.9941 1.047
sex 2.91 0.3437 1.8407 4.600
Concordance= 0.75 (se = 0.677 )
Rsquare= NA (max possible= NA )
Likelihood ratio test= NA on 2 df, p=NA
Wald test = 28.69 on 2 df, p=5.875e-07
Score (logrank) test = NA on 2 df, p=NA
### ----
> panel.2 <-svrepdesign(data=data.1,
+ weights=data.1[,3],
+ type="BRR",
+ repweights=data.1[,7:16],
+ combined.weights=FALSE,
+ )
Warning message:
In svrepdesign.default(data = data.1, weights = data.1[, 3], type = "BRR", :
Data look like combined weights: mean replication weight is 101.291666666667 and mean sampling weight is 203.944444444444
mod.2.w <- svycoxph(Surv(Dur,rc)~ age+ sex ,design=panel.2)
> summary(mod.2.w)
Call: svrepdesign.default(data = data.1, weights = data.1[, 3], type = "BRR",
repweights = data.1[, 7:16], combined.weights = FALSE, )
Balanced Repeated Replicates with 10 replicates.
Call:
svycoxph.svyrep.design(formula = Surv(Dur, rc) ~ age + sex, design = panel.2)
n= 36, number of events= 14
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0198 1.0200 0.0131 1.512 0.131
sex 1.0681 2.9098 0.2336 4.572 4.84e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.02 0.9804 0.9941 1.047
sex 2.91 0.3437 1.8407 4.600
Concordance= 0.75 (se = 0.677 )
Rsquare= NA (max possible= NA )
Likelihood ratio test= NA on 2 df, p=NA
Wald test = 28.69 on 2 df, p=5.875e-07
Score (logrank) test = NA on 2 df, p=NA
The sum of the longitudinal weights is 7,342. The total number of events should be around 2,357 and the number of censored observations around 4,985, for a "population" of 7,342 individuals.
Do models mod.1.w and mod.2.w take the longitudinal weights into account? If they do, why does the summary report only n= 36, number of events= 14?
The design works well for other statistics. For example, the mean of Dur in data.1 is around 4.9 without the sampling design and 5.31 with svymean(~Dur, panel.2).
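As far as I understand, svycoxph passes the design weights on to coxph, and the n and number of events printed in the summary count rows of the data rather than sums of weights. The sketch below illustrates that behaviour with plain coxph and its weights argument. The lung data and the randomly generated weights w are stand-ins (assumptions), not the panel data from the question.

```r
library(survival)

set.seed(1)
d <- lung[, c("time", "status", "age", "sex")]
d$w <- runif(nrow(d), min = 50, max = 400)   # hypothetical sampling weights

fit_u <- coxph(Surv(time, status) ~ age + sex, data = d)               # unweighted
fit_w <- coxph(Surv(time, status) ~ age + sex, data = d, weights = w)  # weighted

# The weights change the estimates ...
rbind(unweighted = coef(fit_u), weighted = coef(fit_w))

# ... but n and the event count still refer to rows in the sample
c(n = fit_w$n, events = fit_w$nevent, sum_of_weights = sum(d$w))
```

If the same holds for the replicate-weight design, a changed coefficient table together with an unchanged n= 36 is what one would expect when the weights are in use.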

Models giving 100% accuracy, random forest, logit, C5.0?

When fitting models to predict the outcome "death", I get 100% accuracy, which is obviously wrong. Could someone tell me what I am missing?
library(caret)
set.seed(100)
intrain <- createDataPartition(riskFinal$death,p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]
control <- trainControl(method="repeatedcv", repeats=3, number=5)
#C5.0 decision tree
set.seed(100)
modelC50 <- train(death~., data=training_Score, method="C5.0",trControl=control)
summary(modelC50)
#Call:
#C5.0.default(x = structure(c(3, 4, 2, 30, 4, 12, 156, 0.0328767150640488, 36, 0.164383560419083, 22,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
# 0, 0, 0, 0,
#C5.0 [Release 2.07 GPL Edition] Tue Aug 4 10:23:10 2015
#-------------------------------
#Class specified by attribute `outcome'
#Read 27875 cases (23 attributes) from undefined.data
#21 attributes winnowed
#Estimated importance of remaining attributes:
#-2147483648% no.subjective.fevernofever
#Rules:
#Rule 1: (26982, lift 1.0)
# no.subjective.fevernofever <= 0
# -> class no [1.000]
#Rule 2: (893, lift 31.2)
# no.subjective.fevernofever > 0
# -> class yes [0.999]
#Default class: no
#Evaluation on training data (27875 cases):
# Rules
# ----------------
# No Errors
# 2 0( 0.0%) <<
# (a) (b) <-classified as
# ---- ----
# 26982 (a): class no
# 893 (b): class yes
# Attribute usage:
# 100.00% no.subjective.fevernofever
#Time: 0.1 secs
predictC50 <- predict(modelC50, testing_Score)
confusionMatrix(predictC50, testing_Score$death)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
For the Random Forest model
set.seed(100)
modelRF <- train(death~., data=training_Score, method="rf", trControl=control)
predictRF <- predict(modelRF,testing_Score)
confusionMatrix(predictRF, testing_Score$death)
#Confusion Matrix and Statistics
#
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
predictRFprobs <- predict(modelRF, testing_Score, type = "prob")
For the Logit model
set.seed(100)
modelLOGIT <- train(death~., data=training_Score,method="glm",family="binomial", trControl=control)
summary(modelLOGIT)
#Call:
#NULL
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-2.409e-06 -2.409e-06 -2.409e-06 -2.409e-06 2.409e-06
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -2.657e+01 7.144e+04 0.000 1.000
#age.in.months 3.554e-15 7.681e+01 0.000 1.000
#temp -1.916e-13 1.885e+03 0.000 1.000
#genderfemale 3.644e-14 4.290e+03 0.000 1.000
#no.subjective.fevernofever 5.313e+01 1.237e+04 0.004 0.997
#palloryes -1.156e-13 4.747e+03 0.000 1.000
#jaundiceyes -2.330e-12 1.142e+04 0.000 1.000
#vomitingyes 1.197e-13 4.791e+03 0.000 1.000
#diarrheayes -3.043e-13 4.841e+03 0.000 1.000
#dark.urineyes -6.958e-13 1.037e+04 0.000 1.000
#intercostal.retractionyes 2.851e-13 1.003e+04 0.000 1.000
#subcostal.retractionyes 7.414e-13 1.012e+04 0.000 1.000
#wheezingyes -1.756e-12 1.091e+04 0.000 1.000
#rhonchiyes -1.659e-12 1.074e+04 0.000 1.000
#difficulty.breathingyes 4.496e-13 6.504e+03 0.000 1.000
#deep.breathingyes 1.086e-12 7.075e+03 0.000 1.000
#convulsionsyes -1.294e-12 6.424e+03 0.000 1.000
#lethargyyes -4.338e-13 6.188e+03 0.000 1.000
#unable.to.sityes -4.284e-13 8.118e+03 0.000 1.000
#unable.to.drinkyes 7.297e-13 6.507e+03 0.000 1.000
#altered.consciousnessyes 2.907e-12 1.071e+04 0.000 1.000
#unconsciousnessyes 2.868e-11 1.505e+04 0.000 1.000
#meningeal.signsyes -1.177e-11 1.570e+04 0.000 1.000
#(Dispersion parameter for binomial family taken to be 1)
# Null deviance: 7.9025e+03 on 27874 degrees of freedom
#Residual deviance: 1.6172e-07 on 27852 degrees of freedom
#AIC: 46
#Number of Fisher Scoring iterations: 25
predictLOGIT <- predict(modelLOGIT, testing_Score)
confusionMatrix(predictLOGIT, testing_Score$death)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
The data before slicing was:
str(riskFinal)
#'data.frame': 46458 obs. of 23 variables:
# $ age.in.months : num 3 3 4 2 1.16 ...
# $ temp : num 35.5 39.4 36.8 35.2 35 34.3 37.2 35.2 34.6 35.3 ...
# $ gender : Factor w/ 2 levels "male","female": 1 2 2 2 1 1 1 2 1 1 ...
# $ no.subjective.fever : Factor w/ 2 levels "fever","nofever": 1 1 2 2 1 1 2 2 2 1 ...
# $ pallor : Factor w/ 2 levels "no","yes": 2 2 1 1 2 2 2 1 2 2 ...
# $ jaundice : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ vomiting : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 2 1 1 ...
# $ diarrhea : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
# $ dark.urine : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ intercostal.retraction: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 1 2 ...
# $ subcostal.retraction : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 1 1 ...
# $ wheezing : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# $ rhonchi : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
# $ difficulty.breathing : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 1 1 1 2 ...
# $ deep.breathing : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 2 ...
# $ convulsions : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 2 1 2 2 ...
# $ lethargy : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unable.to.sit : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ unable.to.drink : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
# $ altered.consciousness : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unconsciousness : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ meningeal.signs : Factor w/ 2 levels "no","yes": 1 2 2 1 1 2 1 2 2 1 ...
# $ death : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 2 2 2 1 ...
EDIT: Based on the comments, I realized that the no.subjective.fever variable had exactly the same values as the target variable death, so I excluded it from the model. Then I got even stranger results:
RANDOM FOREST
set.seed(100)
nmodelRF<- train(death~.-no.subjective.fever, data=training_Score, method="rf", trControl=control)
summary(nmodelRF)
npredictRF<-predict(nmodelRF,testing_Score)
> confusionMatrix(npredictRF, testing_Score$death)
# Confusion Matrix and Statistics
#
# Reference
# Prediction no yes
# no 17988 595
# yes 0 0
#
# Accuracy : 0.968
# 95% CI : (0.9653, 0.9705)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : 0.5109
#
# Kappa : 0
# Mcnemar's Test P-Value : <2e-16
#
# Sensitivity : 1.000
# Specificity : 0.000
# Pos Pred Value : 0.968
# Neg Pred Value : NaN
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 1.000
# Balanced Accuracy : 0.500
#
# 'Positive' Class : no
Logit
set.seed(100)
nmodelLOGIT<- train(death~.-no.subjective.fever, data=training_Score,method="glm",family="binomial", trControl=control)
>summary(nmodelLOGIT)
# Call:
# NULL
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.5113 -0.2525 -0.2041 -0.1676 3.1698
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.432065 1.084942 2.242 0.024984 *
#age.in.months -0.001047 0.001293 -0.810 0.417874
#temp -0.168704 0.028815 -5.855 4.78e-09 ***
#genderfemale -0.053306 0.070468 -0.756 0.449375
#palloryes 0.282123 0.076518 3.687 0.000227 ***
#jaundiceyes 0.323755 0.144607 2.239 0.025165 *
#vomitingyes -0.533661 0.082948 -6.434 1.25e-10 ***
#diarrheayes -0.040272 0.080417 -0.501 0.616520
#dark.urineyes -0.583666 0.168787 -3.458 0.000544 ***
#intercostal.retractionyes -0.021717 0.129607 -0.168 0.866926
#subcostal.retractionyes 0.269588 0.128772 2.094 0.036301 *
#wheezingyes -0.587940 0.150475 -3.907 9.34e-05 ***
#rhonchiyes -0.008565 0.140095 -0.061 0.951249
#difficulty.breathingyes 0.397394 0.087789 4.527 5.99e-06 ***
#deep.breathingyes 0.399302 0.098761 4.043 5.28e-05 ***
#convulsionsyes 0.132609 0.094038 1.410 0.158491
#lethargyyes 0.338599 0.089934 3.765 0.000167 ***
#unable.to.sityes 0.452111 0.104556 4.324 1.53e-05 ***
#unable.to.drinkyes 0.516878 0.089685 5.763 8.25e-09 ***
#altered.consciousnessyes 0.433672 0.123288 3.518 0.000436 ***
#unconsciousnessyes 0.754012 0.136105 5.540 3.03e-08 ***
#meningeal.signsyes 0.188823 0.161088 1.172 0.241130
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 7902.5 on 27874 degrees of freedom
# Residual deviance: 7148.5 on 27853 degrees of freedom
# AIC: 7192.5
#
# Number of Fisher Scoring iterations: 6
npredictLOGIT<-predict(nmodelLOGIT,testing_Score)
>confusionMatrix(npredictLOGIT, testing_Score$death)
# Confusion Matrix and Statistics
#
# Reference
# Prediction no yes
# no 17982 592
# yes 6 3
#
# Accuracy : 0.9678
# 95% CI : (0.9652, 0.9703)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : 0.5605
#
# Kappa : 0.009
# Mcnemar's Test P-Value : <2e-16
#
# Sensitivity : 0.999666
# Specificity : 0.005042
# Pos Pred Value : 0.968127
# Neg Pred Value : 0.333333
# Prevalence : 0.967981
# Detection Rate : 0.967659
# Detection Prevalence : 0.999516
# Balanced Accuracy : 0.502354
#
# 'Positive' Class : no
The 100% accuracy results are probably not correct. I assume they arise because the target variable (or another variable with essentially the same entries as the target, as pointed out in a comment by @ulfelder) is included in both the training set and the test set. Such columns usually need to be removed before model building and testing: they encode the target itself, whereas the train/test data should contain only information that (hopefully) leads to a correct classification of the target.
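A quick check for this kind of leakage is to cross-tabulate each factor predictor against the target and flag those whose every value maps to a single outcome class. The sketch below is a hypothetical helper (find_leaks and the toy data frame are not from the question) and only looks at factor columns, since a continuous variable with many unique values would be flagged by chance.

```r
# Hypothetical helper: names of factor predictors that determine the target exactly
find_leaks <- function(df, target) {
  preds <- setdiff(names(df), target)
  perfect <- vapply(preds, function(v) {
    x <- df[[v]]
    if (!is.factor(x)) return(FALSE)        # skip numeric columns
    tab <- table(x, df[[target]])
    all(rowSums(tab > 0) <= 1)              # each value occurs with one class only
  }, logical(1))
  preds[perfect]
}

# Toy data (assumption): fever mirrors death, pallor does not
toy <- data.frame(
  death  = factor(c("no", "no", "yes", "yes", "no", "yes")),
  fever  = factor(c("fever", "fever", "nofever", "nofever", "fever", "nofever")),
  pallor = factor(c("no", "yes", "yes", "no", "yes", "no"))
)
find_leaks(toy, "death")
```

On the real data the call would be find_leaks(riskFinal, "death"); any column it returns is worth inspecting before training.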
You could try the following:
target <- riskFinal$death
set.seed(100)
intrain <- createDataPartition(riskFinal$death,p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]
# keep the target separately, then drop it from the predictor sets
train_target <- training_Score$death
test_target <- testing_Score$death
training_Score <- training_Score[,-which(colnames(training_Score)=="death")]
testing_Score <- testing_Score[,-which(colnames(testing_Score)=="death")]
# a predictor that duplicates the target (no.subjective.fever, per the question's edit) should be dropped the same way
modelRF <- train(training_Score, train_target, method="rf", trControl=control)
Then you can proceed as before, noting that the target "death" is now stored in the variables train_target and test_target.
Hope this helps.
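The results after the question's edit are less strange than they look: only about 3% of the cases are deaths, so a model that always predicts "no" already reaches 96.8% accuracy (the No Information Rate in the output). A minimal sketch on simulated data, assuming a rare outcome and a single hypothetical predictor x, shows how the default 0.5 cut-off can hide what the model has learned, and how a lower cut-off trades specificity for sensitivity.

```r
set.seed(100)
n <- 5000
x <- rnorm(n)                              # hypothetical predictor
y <- rbinom(n, 1, plogis(-4 + x))          # rare outcome (1 = event)

fit  <- glm(y ~ x, family = binomial)
prob <- fitted(fit)                        # predicted event probabilities

# Accuracy, sensitivity and specificity with the event (y == 1) as positive class
# (the caret output in the question uses "no" as the positive class instead)
rates <- function(cutoff) {
  pred <- prob > cutoff
  c(cutoff      = cutoff,
    accuracy    = mean(pred == (y == 1)),
    sensitivity = mean(pred[y == 1]),
    specificity = mean(!pred[y == 0]))
}

r_default    <- rates(0.5)                 # default classification cut-off
r_prevalence <- rates(mean(y))             # cut-off at the observed event rate
round(rbind(default = r_default, prevalence = r_prevalence), 3)
```

With caret, the analogous step would be to request class probabilities with predict(..., type = "prob") and choose the cut-off yourself, or to judge the models by a metric other than accuracy.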
