I'm doing an analysis where I want to find what are determinants for (cardiovascular) events in my cohort of patients. I want to do a cox analysis (coxph from the survival package). Now I want to find which determinants are independent determinants.
Now the question is: do I simply make one model in which I throw all my prespecified determinants (age, gender, BMI, cholesterol levels etc) and then see what happens? Or do I have to do univariate testing first? And then only add the significant/near significant (e.g. p value <0.10) in the full cox model?
This was the model I am using now:
model1 <- coxph(Surv(time,event==1)~age+gender+ckdepi+smoking+cholesterol+hdl+bpsys+t2dm+bmi, data=data)
And this is my output:
coef exp(coef) se(coef) z p
age 0.04235 1.04326 0.00911 4.65 3.3e-06***
gender -0.36583 0.69362 0.11538 -3.17 0.00152**
ckdepi -0.01078 0.98927 0.00309 -3.49 0.00048***
smoking 0.14560 1.15674 0.03070 4.74 2.1e-06***
chol 0.10312 1.10862 0.03746 2.75 0.00590**
hdl -0.04485 0.95614 0.13231 -0.34 0.73465
sysbp 0.00356 1.00357 0.00225 1.58 0.11392
t2dm 0.39539 1.48496 0.11763 3.36 0.00078***
bmi -0.00898 0.99106 0.01227 -0.73 0.46427
Likelihood ratio test=97.4 on 9 df, p=0
n= 3084, number of events= 525
(2 observations deleted due to missingness)
Also, I THINK I need to do testing for linearity (e.g. use restricted cubic splines or addquadratic functions) in the fully adjusted model, but I'm not sure. Is this correct?
In that case, my next step would be to test at least ckdepi and bmi since I'm pretty sure they might not be linear.
Related
I have a very simple stat question probably.
So, I am fitting linear mixed models like this:
lme(dependent ~ Group + Sex + Age + npgs, data=boookclub, random = ~ 1| subject)
Group is a factor variable with levels = 0, 1 , 2 , 3
The dependent are continuous variables standardized (mean 0) and the others are covariates with sex being factor, with Male/Female levels, Age being numerical, and npgs being numerical continuous standardized as well.
When I get the table with beta, standard error, t and p values, I get this:
Value Std.Error DF t-value p-value
(Intercept) -0.04550502 0.02933385 187 -1.551280 0.0025
Group1 0.04219801 0.03536929 181 1.193069 0.2344
Group2 0.03350827 0.03705896 181 0.904188 0.3671
Group3 0.00192119 0.03012654 181 0.063771 0.9492
SexMale 0.03866387 0.05012901 181 0.771287 0.4415
Age -0.00011675 0.00148684 181 -0.078520 0.9375
npgs 0.15308844 0.01637163 181 9.350835 0.0000
SexMale:Age 0.00492966 0.00276117 181 1.785352 0.0759
My problem is: how do I get the beta of Group0? In this case the intercept is Group0 but also the average of npgs, being npgs standardized. How do I get the Beta of Group0? And how can I check if Group0 is significantly associated to the dependent? I'd like to see the effect of all Group levels.
Thanks
The easiest way to do what you want may be with the emmeans package, but you may also have some conceptual issues. Technical details first, then conceptual:
Technical
Fitting an example (this isn't necessarily statistically sensible, but I wanted an example with a categorical fixed effect)
library(nlme)
m1 <- lme(Yield~Variety, random = ~1|Block, data=Alfalfa)
As with your example, the effects are "intercept" (= mean of the baseline group, which is the "Cossack" variety in this case [by default, the alphabetically-first group]), "Ladak" (difference between Ladak and Cossack means) and "Ranger" (similarly). (As #Ben hints in the comments above, R automatically generates dummies for [most of] the levels of the categorical variables [factors] in your model.)
coef(summary(m1))
## Value Std.Error DF t-value p-value
## (Intercept) 1.57166667 0.11665326 64 13.4729767 2.373343e-20
## VarietyLadak 0.09458333 0.07900687 64 1.1971532 2.356624e-01
## VarietyRanger -0.01916667 0.07900687 64 -0.2425949 8.090950e-01
The emmeans package is a convenient way to see predicted values for each group without recoding.
library(emmeans)
emmeans(m1, spec = ~Variety)
## Variety emmean SE df lower.CL upper.CL
## Cossack 1.57 0.117 5 1.27 1.87
## Ladak 1.67 0.117 5 1.37 1.97
## Ranger 1.55 0.117 5 1.25 1.85
Conceptual
You can't "check if Group0 is significantly associated with the dependent [response] variable". You can only check whether the response variables differs significantly between two groups, or whether it differs significantly among all groups (e.g. the results of anova()). You have to pick a baseline. (If you insist, you can test all pairwise comparisons among groups; emmeans can help with this too.) If you "remove the intercept" (by fitting Variety ~Yield-1, or by looking at the results that emmeans produces) then the difference you are quantifying is the difference between the mean of a particular group and zero. This is usually not a meaningful question; in the example here, for instance, this would be testing whether a wheat variety gave a yield that was significantly greater than zero — probably not very interesting.
On the other hand, if you are just interested in estimating the expected value in each group (conditioning on the baseline values of the other variables in the model), along with the standard errors/CIs, then the answers you get from emmeans are perfectly sensible.
There's a related question here that explains why you get an NA value if you manually create dummies for every level of your factor ...
I've run a Cox regression model with a categorical predictor and I'm not sure whether I can compare the different levels to each other, or just to the reference category?
For example, in the lung dataset:
lung$ph.ecog=factor(lung$ph.ecog)
fit1=coxph(Surv(lung$time, lung$status) ~ ph.ecog, data=lung)
I know you'd never set ph.ecog as a factor, its just demonstrative. The output of this model is (the reference is 0):
coef exp(coef) se(coef) z Pr(>|z|)
ph.ecog1 0.3688 1.4461 0.1987 1.857 0.0634 .
ph.ecog2 0.9164 2.5002 0.2245 4.081 4.48e-05 ***
ph.ecog3 2.2080 9.0973 1.0258 2.152 0.0314 *
So, I know that ph.ecog=1 has a hazard ratio of 1.44 which means it is 1.4 times more likely than ph.ecog=0, and a level of 2 is 2.5 times more likely than 0, but can I say that ph.ecog 2 is 1.728926 (=2.5002/1.4461) more likely than ph.ecog=1?
Is this also true for other regression models, such as odds ratios?
Many thanks
I have been trying to perform k-fold cross-validation in R on a data set that I have created. The link to this data is as follows:
https://drive.google.com/open?id=0B6vqHScIRbB-S0ZYZW1Ga0VMMjA
I used the following code:
library(DAAG)
six = read.csv("six.csv") #opening file
fit <- lm(Height ~ GLCM.135 + Blue + NIR, data=six) #applying a regression model
summary(fit) # show results
CVlm(data =six, m=10, form.lm = formula(Height ~ GLCM.135 + Blue + NIR )) # 10 fold cross validation
This produces the following output (Summarized version)
Sum of squares = 7.37 Mean square = 1.47 n = 5
Overall (Sum over all 5 folds)
ms
3.75
Warning message:
In CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + :
As there is >1 explanatory variable, cross-validation
predicted values for a fold are not a linear function
of corresponding overall predicted values. Lines that
are shown for the different folds are approximate
I do not understand what the ms value refers to as I have seen different interpretations on the internet. It is my understanding that K-fold cross validations produce a overall RMSE value for a specified model (which is what I am trying to obtain for my research).
I also don't understand why the results generated produce a Overall (Sum over all 5 folds), when I have specified a 10 fold cross validation in the code.
If anyone can help it would be much appreciated.
When I ran this same thing, I saw that it did do 10 folds, but the final output printed was the same as yours ("Sum over all 5 folds"). The "ms" is the mean squared prediction error. The value of 3.75 is not exactly a simple average across all 10 folds either (got 3.67):
msaverage <- (1.19+6.04+1.26+2.37+3.57+5.24+8.92+2.03+4.62+1.47)/10
msaverage
Notice the average as well as most folds are higher than "Residual standard error" (1.814). This is what we would expect as the CV error represents model performance likely on "test" data (not data used to trained the model). For instance on Fold 10, notice the residuals calculated are on the predicted observations (5 observations) that were not used in the training for that model:
fold 10
Observations in test set: 5
12 14 26 54 56
Predicted 20.24 21.18 22.961 18.63 17.81
cvpred 20.15 21.14 22.964 18.66 17.86
Height 21.98 22.32 22.870 17.12 17.37
CV residual 1.83 1.18 -0.094 -1.54 -0.49
Sum of squares = 7.37 Mean square = 1.47 n = 5
It appears this warning we received may be common too -- also saw it in this article: http://www.rpubs.com/jmcimula/xCL1aXpM3bZ
One thing I can suggest that may be useful to you is that in the case of linear regression, there is a closed form solution for leave-one-out-cross-validation (loocv) without actually fitting multiple models.
predictedresiduals <- residuals(fit)/(1 - lm.influence(fit)$hat)
PRESS <- sum(predictedresiduals^2)
PRESS #Predicted Residual Sum of Squares Error
fitanova <- anova(fit) #Anova to get total sum of squares
tss <- sum(fitanova$"Sum Sq") #Total sum of squares
predrsquared <- 1 - PRESS/(tss)
predrsquared
Notice this value is 0.574 vs. the original Rsquared of 0.6422
To better convey the concept of RMSE, it is useful to see the distribution of the predicted residuals:
hist(predictedresiduals)
RMSE can then calculated simply as:
sd(predictedresiduals)
I am running a mixed effects model using the coxme() function in R. The model analyzes the event of product success of firms in different countries.
Fixed effects are for example GDP, population, technology and cultural variables. Random effects are the different countries.
I know that with coxph() it is possible to test for proportional hazard using the cox.zph() command.
My question: How can I check for proportional hazard with coxme()?
Fixed effects in a random-effects coxme model can be checked for proportional hazards (PH) with the same cox.zph() function used for standard coxph() models. According to the manual, the fit argument for cox.zph() is "the result of fitting a Cox regression model, using the coxph or coxme functions."
Random effects "are not checked for proportional hazards, rather they are treated as a fixed offset in model."
An example, borrowed from this Cross-Validated question:
> library(survival)
> library(coxme)
> df <- stanford2
> df$cid <- round(df$id / 10) + 1 ## generates some clusters
> fit <- coxme(Surv(time, status) ~ age + t5 + (1 | cid),data=df)
> fit
Cox mixed-effects model fit by maximum likelihood
Data: df
events, n = 102, 157 (27 observations deleted due to missingness)
Iterations= 2 12
NULL Integrated Fitted
Log-likelihood -451.0944 -446.8618 -446.8261
Chisq df p AIC BIC
Integrated loglik 8.47 3.00 0.037317 2.47 -5.41
Penalized loglik 8.54 2.04 0.014582 4.46 -0.88
Model: Surv(time, status) ~ age + t5 + (1 | cid)
Fixed coefficients
coef exp(coef) se(coef) z p
age 0.02960206 1.030045 0.01135724 2.61 0.0091
t5 0.17056610 1.185976 0.18330590 0.93 0.3500
Random effects
Group Variable Std Dev Variance
cid Intercept 0.0199835996 0.0003993443
> cox.zph(fit)
chisq df p
age 0.831 3.00 0.84
t5 2.062 2.04 0.36
GLOBAL 2.767 5.04 0.74
This was done with survival_3.1-11 and coxme_2.2-16.
I am testing differences on the number of pollen grains loading on plant stigmas in different habitats and stigma types.
My sample design comprises two habitats, with 10 sites each habitat.
In each site, I have up to 3 stigma types (wet, dry and semidry), and for each stigma stype, I have different number of plant species, with different number of individuals per plant species (code).
So, I ended up with nested design as follow: habitat/site/stigmatype/stigmaspecies/code
As it is a descriptive study, stigmatype, stigmaspecies and code vary between sites.
My response variable (n) is the number of pollengrains (log10+1)per stigma per plant, average because i collected 3 stigmas per plant.
Data doesnt fit Poisson distribution because (i) is not integers, and (ii) variance much higher than the mean (ratio = 911.0756). So, I fitted as negative.binomial.
After model selection, I have:
m4a <- glmer(n ~ habitat*stigmatype + (1|stigmaspecies/code),
family=negative.binomial(2))
> summary(m4a)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: Negative Binomial(2) ( log )
Formula: n ~ habitat * stigmatype + (1 | stigmaspecies/code)
AIC BIC logLik deviance
993.9713 1030.6079 -487.9856 975.9713
Random effects:
Groups Name Variance Std.Dev.
code:stigmaspecies (Intercept) 1.034e-12 1.017e-06
stigmaspecies (Intercept) 4.144e-02 2.036e-01
Residual 2.515e-01 5.015e-01
Number of obs: 433, groups: code:stigmaspecies, 433; stigmaspecies, 41
Fixed effects:
Estimate Std. Error t value Pr(>|z|)
(Intercept) -0.31641 0.08896 -3.557 0.000375 ***
habitatnon-invaded -0.67714 0.10060 -6.731 1.68e-11 ***
stigmatypesemidry -0.24193 0.15975 -1.514 0.129905
stigmatypewet -0.07195 0.18665 -0.385 0.699885
habitatnon-invaded:stigmatypesemidry 0.60479 0.22310 2.711 0.006712 **
habitatnon-invaded:stigmatypewet 0.16653 0.34119 0.488 0.625491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) hbttn- stgmtyps stgmtypw hbttnn-nvdd:stgmtyps
hbttnn-nvdd -0.335
stgmtypsmdr -0.557 0.186
stigmatypwt -0.477 0.160 0.265
hbttnn-nvdd:stgmtyps 0.151 -0.451 -0.458 -0.072
hbttnn-nvdd:stgmtypw 0.099 -0.295 -0.055 -0.403 0.133
Two questions:
How do I check for overdispersion from this output?
What is the best way to go through model validation here?
I have been using:
qqnorm(resid(m4a))
hist(resid(m4a))
plot(fitted(m4a),resid(m4a))
While qqnorm() and hist() seem ok, and there is a tendency of heteroscedasticity on the 3rd graph. And here is my final question:
Can I go through model validation with this graph in glmer? or is there a better way to do it? if not, how much should I worry about the 3rd graph?
a simple way to check for overdispersion in glmer is:
> library("blmeco")
> dispersion_glmer(your_model) #it shouldn't be over
> 1.4
To solve overdispersion I usually add an observation level random factor
For model validation I usually start from these plots...but then depends on your specific model...
par(mfrow=c(2,2))
qqnorm(resid(your_model), main="normal qq-plot, residuals")
qqline(resid(your_model))
qqnorm(ranef(your_model)$id[,1])
qqline(ranef(your_model)$id[,1])
plot(fitted(your_model), resid(your_model)) #residuals vs fitted
abline(h=0)
dat_kackle$fitted <- fitted(your_model) #fitted vs observed
plot(your_data$fitted, jitter(your_data$total,0.1))
abline(0,1)
hope this helps a little....
cheers
Just an addition to Q1 for those who might find this by googling: the blmco dispersion_glmer function appears to be outdated. It is better to use #Ben_Bolker's function for this purpose:
overdisp_fun <- function(model) {
rdf <- df.residual(model)
rp <- residuals(model,type="pearson")
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}
Source: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion.
With the highlighted notion:
Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter.
PS. Why outdated?
The lme4 package includes the residuals function these days, and Pearson residuals are supposedly more robust for this type of calculation than the deviance residuals. The blmeco::dispersion_glmer sums up the deviance residuals together with u cubed, divides by residual degrees of freedom and takes a square root of the value (the function):
dispersion_glmer <- function (modelglmer)
{
n <- length(resid(modelglmer))
return(sqrt(sum(c(resid(modelglmer), modelglmer#u)^2)/n))
}
The blmeco solution gives considerably higher deviance/df ratios than Bolker's function. Since Ben is one of the authors of the lme4 package, I would trust his solution more although I am not qualified to rationalize the statistical reason.
x <- InsectSprays
x$id <- rownames(x)
mod <- lme4::glmer(count ~ spray + (1|id), data = x, family = poisson)
blmeco::dispersion_glmer(mod)
# [1] 1.012649
overdisp_fun(mod)
# chisq ratio rdf p
# 55.7160734 0.8571704 65.0000000 0.7873823