How to include the interaction between a covariate and time for a non-proportional hazards model?

I often find that the proportional hazards assumption for the Cox regressions doesn’t hold.
Take the following data as an example.
> head(data2)
   no np_p age_dx1 race1 mr_dx er_1 pr_1 sct_1 surv_mo km_stts1
1  20    1       2     4     1    2    2     4      52        1
2  33    1       3     1     2    1    2     1      11        1
3  67    1       2     4     4    1    1     3      20        1
4  90    1       3     1     3    3    3     2      11        1
5 143    1       2     4     3    1    1     2     123        0
6 180    1       3     1     3    1    1     2       9        1
First, I fitted a Cox regression model.
> fit2 <- coxph(Surv(surv_mo, km_stts1) ~ np_p + age_dx1 + race1 + mr_dx + er_1 + pr_1 + sct_1, data = data2)
Second, I assessed the proportional hazards assumption.
> check_PH2 <- cox.zph(fit2, transform = "km")
> check_PH2
             rho   chisq        p
np_p     0.00946  0.0748 7.84e-01
age_dx1 -0.00889  0.0640 8.00e-01
race1   -0.03148  0.7827 3.76e-01
mr_dx   -0.03120  0.7607 3.83e-01
er_1    -0.14741 18.5972 1.61e-05
pr_1     0.05906  2.9330 8.68e-02
sct_1    0.17651 23.8030 1.07e-06
GLOBAL        NA 53.2844 3.26e-09
So, this means that the effects of er_1 and sct_1 are non-proportional over time (right?).
In my opinion, I can include an interaction between each of these two covariates and time separately in the model, but I don't know how to do this in R.
Thank you.
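One way to do this (a sketch, not part of the original question) is coxph's time-transform facility: wrap the offending covariates in tt() and supply a tt function, so each one gets a coefficient that changes with time. The choice of log(t) as the transform is an assumption made here for illustration:
fit2_tt <- coxph(Surv(surv_mo, km_stts1) ~ np_p + age_dx1 + race1 + mr_dx +
                   er_1 + pr_1 + sct_1 + tt(er_1) + tt(sct_1),
                 data = data2,
                 tt = function(x, t, ...) x * log(t))
summary(fit2_tt)  # the tt() rows estimate how each effect changes with time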

Related

Cannot introduce offset into negative binomial regression

I'm performing a count data analysis in R on a dataset called 'doctor', which looks like this:
   V2  V3        L4 V5
1   1  32 10.866795  1
2   2 104 10.674706  1
3   3 206 10.261581  1
4   4 186  9.446440  1
5   5 102  8.578665  1
6   1   2  9.841080  2
7   2  12  9.275472  2
8   3  28  8.649974  2
9   4  28  7.857481  2
10  5  31  7.287561  2
The best model according to stepwise AIC was V3 ~ V2 + L4 + V5 + V2:L4:V5. Now I want to set L4 as the offset and fit a negative binomial regression including the interaction, so I used the code nbinom = glm.nb(V3 ~ V2 + V5 + V2:V5, offset = L4), but I get this error message: Error in glm.control(...) : unused argument (offset = L4). What have I done wrong here?
Offsets are entered using an offset() term in the model formula (glm.nb passes unrecognized arguments such as offset = L4 on to glm.control, which is what produces the "unused argument" error):
nbinom <- glm.nb(V3 ~ V2 + V5 + V2:V5 + offset(L4))
Also, you can use V2*V5 instead of V2 + V5 + V2:V5.
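A fuller, self-contained version of the fix (a sketch; it assumes the data frame is named doctor, as in the question, and that L4 is already on the log scale, which its values suggest):
library(MASS)  # provides glm.nb
nbinom <- glm.nb(V3 ~ V2 * V5 + offset(L4), data = doctor)  # V2*V5 expands to V2 + V5 + V2:V5
summary(nbinom)
# If L4 were a raw exposure rather than already logged, the log link would
# call for offset(log(L4)) instead.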

Error message in Firth's logistic regression

I have produced a logistic regression model in R using the logistf function from the logistf package due to quasi-complete separation. I get the error message:
Error in solve.default(object$var[2:(object$df + 1), 2:(object$df + 1)]) :
system is computationally singular: reciprocal condition number = 3.39158e-17
The data is structured as shown below, though a lot of it has been cut here. Numbers represent levels (i.e. 1 = very low, 5 = very high), not counts. Variables OrdA to OrdH are ordered factors; the variable Binary is a factor.
OrdA OrdB OrdC OrdE OrdF OrdG OrdH Binary
   1    3    4    1    1    2    1      1
   2    3    4    5    1    3    1      1
   1    3    2    5    2    4    1      0
   1    1    1    1    3    1    2      0
   3    2    2    2    1    1    1      0
I have read here that this can be caused by multicollinearity, but have tested this and it is not the problem.
VIFModel <- lm(Binary ~ OrdA + OrdB + OrdC + OrdD + OrdE + OrdF + OrdG + OrdH,
               data = VIFdata)
vif(VIFModel)
     GVIF Df GVIF^(1/(2*Df))
OrdA 6.09  3            1.35
OrdB 3.50  2            1.37
OrdC 7.09  3            1.38
OrdD 6.07  2            1.57
OrdE 5.48  4            1.23
OrdF 3.05  2            1.32
OrdG 5.41  4            1.23
OrdH 3.03  2            1.31
The post also indicates that the problem can be caused by having "more variables than observations." However, I have 8 independent variables and 82 observations.
For context, each independent variable is ordinal with 5 levels, and the binary dependent variable has "successes" for 30% of the observations. I'm not sure whether this is related to the problem. How do I fix it?
X <- model.matrix(Binary ~ OrdA + OrdB + OrdC + OrdD + OrdE + OrdF + OrdG + OrdH,
                  Data3)
dim(X); Matrix::rankMatrix(X)
[1] 82 24
[1] 23
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.820766e-14
Short answer: your ordinal input variables are transformed to 24 predictor variables (number of columns of the model matrix), but the rank of your model matrix is only 23, so you do indeed have multicollinearity in your predictor variables. I don't know what vif is doing ...
You can use svd(X) to help figure out which components are collinear ...
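A sketch of that svd() diagnostic (X is the model matrix built above; its column names come from model.matrix):
s <- svd(X)
round(s$d, 6)                  # singular values; one near zero confirms the rank deficiency
dep <- s$v[, which.min(s$d)]   # right singular vector for the smallest singular value
names(dep) <- colnames(X)
round(dep[abs(dep) > 0.1], 3)  # large-magnitude entries identify the collinear columns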

I ran a basic logistic regression model in R, and my calculated probabilities are never less than .5, which doesn't seem right to me.

I'm trying to learn R and predictive modeling so I thought I'd do a logistic regression to determine probabilities of teams in my fantasy football league making the playoffs.
I have a table with data from a prior season that has Team Name, Total Wins, Total Losses, Points For, Points Against, and whether they made the playoffs (1 or 0); see the table below.
Team    Wins Losses PointsFor PointsAgainst Playoffs
<fctr> <dbl>  <dbl>     <dbl>         <dbl>    <dbl>
Team 1     6      1     649.4         567.3        1
Team 2     5      2     603.3         523.5        1
Team 3     5      2     603.0         491.2        1
Team 4     4      3     604.7         616.0        1
Team 5     4      3     581.2         587.9        0
Team 6     3      4     635.1         623.5        0
Team 7     3      4     626.6         619.8        0
Team 8     3      4     577.1         620.6        0
Team 9     3      4     556.5         552.4        0
Team 10    3      4     551.9         637.4        0
Team 11    3      4     517.1         515.7        1
Team 12    0      7     446.3         596.9        0
Then I run the model below.
playoff.lm <- glm( Playoffs ~ Wins + Losses + PointsFor + PointsAgainst, data = predict.2017, family = binomial(link = "logit"))
With the following results
Call:
glm(formula = Playoffs ~ Wins + Losses + PointsFor + PointsAgainst,
    family = binomial(link = "logit"), data = predict.2017)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.46807  -0.22076  -0.07455   0.05830   1.44255

Coefficients: (1 not defined because of singularities)
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    15.45650   20.16166   0.767    0.443
Wins            3.27222    3.04645   1.074    0.283
Losses               NA         NA      NA       NA
PointsFor      -0.01558    0.08507  -0.183    0.855
PointsAgainst  -0.03203    0.06161  -0.520    0.603

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16.3006 on 11 degrees of freedom
Residual deviance:  5.7816 on  8 degrees of freedom
AIC: 13.782

Number of Fisher Scoring iterations: 8
I then use predict with data from this season to estimate probabilities of teams making the playoffs.
playoffpredict <- predict(playoff.lm, newdata = summary.2018, type = "response")
invlogit <- function(x) {
  1 / (1 + exp(-x))
}
prob <- sapply(playoffpredict, invlogit)
results <- cbind(summary.2018, prob)
results
The problem is my probabilities never get less than .5. This doesn't seem right.
Team    Wins Losses PointsFor PointsAgainst Prob of Playoff
Team 1     7      0     829.7         561.3          0.7309
Team 2     5      2     743.9         673.0          0.5522
Team 3     5      2     694.8         609.0          0.6933
Team 4     5      2     649.7         634.4          0.6897
Team 5     4      3     696.9         610.1          0.5339
Team 6     3      4     694.2         679.7          0.5002
Team 7     3      4     679.8         666.8          0.5003
Team 8     3      4     569.5         621.9          0.5072
Team 9     2      5     797.5         818.3          0.5000
Team 10    2      5     662.7         853.8          0.5000
Team 11    2      5     655.3         732.0          0.5000
Team 12    1      6     549.8         763.5          0.5000
I know it's pretty simple, and not a lot of data. I will add more data from prior seasons but I'm just trying to get started.
Firstly, it seems that your model isn't very accurate: the p-values are high, so none of your variables is significant. Try combinations or transformations of them (for instance log(Losses)), or something else. Your poor predictions may come from the low power of your model.
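One more thing worth checking in the posted code (an observation on the code above, not part of the original answer): predict(..., type = "response") already returns probabilities, so feeding them through invlogit() a second time squeezes everything into [0.5, 0.731], which is exactly the range in the results table:
prob <- predict(playoff.lm, newdata = summary.2018, type = "response")  # already on the 0-1 scale
invlogit <- function(x) 1 / (1 + exp(-x))
range(invlogit(c(0, 1)))              # 0.5000000 0.7310586
results <- cbind(summary.2018, prob)  # use the response-scale predictions directly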

How to calculate survival probabilities in R?

I am trying to fit a parametric survival model. I think I managed to do so. However, I could not succeed in calculating the survival probabilities:
library(survival)
zaman <- c(65,156,100,134,16,108,121,4,39,143,56,26,22,1,1,5,65,
56,65,17,7,16,22,3,4,2,3,8,4,3,30,4,43)
test <- c(rep(1,17),rep(0,16))
WBC <- c(2.3,0.75,4.3,2.6,6,10.5,10,17,5.4,7,9.4,32,35,100,
100,52,100,4.4,3,4,1.5,9,5.3,10,19,27,28,31,26,21,79,100,100)
status <- c(rep(1,33))
data <- data.frame(zaman,test,WBC)
surv3 <- Surv(zaman[test == 1], status[test == 1])
fit3 <- survreg(surv3 ~ log(WBC[test == 1]), dist = "weibull")
On the other hand, there's no problem at all calculating the survival probabilities with the Kaplan-Meier estimator:
fit2 <- survfit(Surv(zaman[test==0], status[test==0]) ~ 1)
summary(fit2)$surv
Any idea why?
You can get predictions from a survreg object with predict:
predict(fit3)
If you're interested in combining this with the original data, and also in the residuals and standard errors of the predictions, you can use the augment function in my broom package:
library(broom)
augment(fit3)
A full analysis might look something like:
library(survival)
library(broom)
data <- data.frame(zaman, test, WBC, status)
subdata <- data[data$test == 1, ]
fit3 <- survreg(Surv(zaman, status) ~ log(WBC), data = subdata, dist = "weibull")
augment(fit3, subdata)
With the output:
   zaman test    WBC status   .fitted    .se.fit     .resid
1     65    1   2.30      1 115.46728  43.913188 -50.467281
2    156    1   0.75      1 197.05852 108.389586 -41.058516
3    100    1   4.30      1  85.67236  26.043277  14.327641
4    134    1   2.60      1 108.90836  39.624106  25.091636
5     16    1   6.00      1  73.08498  20.029707 -57.084979
6    108    1  10.50      1  55.96298  13.989099  52.037022
7    121    1  10.00      1  57.28065  14.350609  63.719348
8      4    1  17.00      1  44.47189  11.607368 -40.471888
9     39    1   5.40      1  76.85181  21.708514 -37.851810
10   143    1   7.00      1  67.90395  17.911170  75.096054
11    56    1   9.40      1  58.99643  14.848751  -2.996434
12    26    1  32.00      1  32.88935  10.333303  -6.889346
13    22    1  35.00      1  31.51314  10.219871  -9.513136
14     1    1 100.00      1  19.09922   8.963022 -18.099216
15     1    1 100.00      1  19.09922   8.963022 -18.099216
16     5    1  52.00      1  26.09034   9.763728 -21.090343
17    65    1 100.00      1  19.09922   8.963022  45.900784
In this case, the .fitted column contains the model's predictions on the response scale, i.e. predicted survival times (note that they are far greater than 1, so they are not probabilities).
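If what you are after is a survival probability S(t) rather than a fitted time, one way is to use survreg's Weibull parameterization, in which the shape is 1/fit3$scale and the scale is exp(linear predictor). A sketch (the time point t0 is an arbitrary choice for illustration):
lp <- predict(fit3, type = "lp")               # linear predictor = log of the Weibull scale
t0 <- 52                                       # example time point (arbitrary)
S_t0 <- exp(-(t0 / exp(lp))^(1 / fit3$scale))  # S(t) = exp(-(t/scale)^shape)
# equivalently:
S_t0 <- 1 - pweibull(t0, shape = 1 / fit3$scale, scale = exp(lp))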

Multivariate Linear Mixed Model in lme4

I wonder how to fit multivariate linear mixed model with lme4. I fitted univariate linear mixed models with the following code:
library(lme4)
lmer.m1 <- lmer(Y1~A*B+(1|Block)+(1|Block:A), data=Data)
summary(lmer.m1)
anova(lmer.m1)
lmer.m2 <- lmer(Y2~A*B+(1|Block)+(1|Block:A), data=Data)
summary(lmer.m2)
anova(lmer.m2)
I'd like to know how to fit multivariate linear mixed model with lme4. The data is below:
Block A B    Y1    Y2
    1 1 1 135.8 121.6
    1 1 2 149.4 142.5
    1 1 3 155.4 145.0
    1 2 1 105.9 106.6
    1 2 2 112.9 119.2
    1 2 3 121.6 126.7
    2 1 1 121.9 133.5
    2 1 2 136.5 146.1
    2 1 3 145.8 154.0
    2 2 1 102.1 116.0
    2 2 2 112.0 121.3
    2 2 3 114.6 137.3
    3 1 1 133.4 132.4
    3 1 2 139.1 141.8
    3 1 3 157.3 156.1
    3 2 1 101.2  89.0
    3 2 2 109.8 104.6
    3 2 3 111.0 107.7
    4 1 1 124.9 133.4
    4 1 2 140.3 147.7
    4 1 3 147.1 157.7
    4 2 1 110.5  99.1
    4 2 2 117.7 100.9
    4 2 3 129.5 116.2
Thanks in advance for your time and cooperation.
This can sometimes be faked satisfactorily in nlme/lme4 by simply reformatting your data like
require(reshape)
Data = melt(data, id.vars=1:3, variable_name='Y')
Data$Y = factor(gsub('Y(.+)', '\\1', Data$Y))
> Data
  Block A B Y value
1     1 1 1 1 135.8
2     1 1 2 1 149.4
3     1 1 3 1 155.4
4     1 2 1 1 105.9
5     1 2 2 1 112.9
6     1 2 3 1 121.6
...
and then including the new variable Y in your linear mixed model.
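For instance (a sketch, not from the original answer; the (0 + Y | Block) term lets the block effects for the two outcomes differ and be correlated, though with only four blocks the fit may come back singular):
library(lme4)
m0 <- lmer(value ~ Y * A * B + (0 + Y | Block), data = Data)
summary(m0)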
However, for true Multivariate Generalized Linear Mixed Models (MGLMM), you will probably need the sabreR package or similar. There is also an entire book to accompany the package, Multivariate Generalized Linear Mixed Models Using R. If you have a proxy to a subscribing institution, you might even be able to download it for free from http://www.crcnetbase.com/isbn/9781439813270. I would refer you there for any further advice, as this is a meaty topic and I am very much a novice.
lmer and its elder sibling lme are inherently "one parameter left of ~". Have a look at the car package; it offers no off-the-shelf repeated-measures support, but you will find a few comments on the subject by searching the R list:
John Fox on car package
John's answer above should be largely right. You add a dummy variable (i.e., the factor variable Y) to the model. Here you have three subscripts: i = 1, ..., N for observations, j = 1, ..., 4 for blocks, and h = 1, 2 for the dependent variables. But you also need to force the level-1 error term to 0 (or to near zero), which I'm not sure lme4 does; Ben Bolker might provide more information. This is described more in Goldstein (2011), Chap. 6, and Chap. 7 for latent multivariate models. That is:
Y_{hij} = \beta_{01} z_{1ij} + \beta_{02} z_{2ij} + \beta X_{ij} + u_{1j} z_{1ij} + u_{2j} z_{2ij}
So:
require(reshape2)
Data = melt(data, id.vars = 1:3, variable.name = 'Y')  # reshape2 uses variable.name, not variable_name
Data$Y = factor(gsub('Y(.+)', '\\1', Data$Y))
m1 <- lmer(value ~ Y + A*B + (1|Block) + (1|Block:A), data = Data)
# not sure how to set the level-1 variance to 0, @BenBolker
# also unclear to me if you're requesting Y*A*B instead of Y + A*B
