predict.coxph() and survC1::Est.Cval -- type for predict() output

Given a coxph() model, I want to use predict() to predict hazards and then use survC1::Est.Cval( . . . nofit=TRUE) to get a c-value for the model.
The Est.Cval() documentation is rather terse, but says that "nofit=TRUE: If TRUE, the 3rd column of mydata is used as the risk score directly in calculation of C."
Say, for simplicity, that I want to predict on the same data I built the model on. For
coxModel a Cox regression model from coxph();
time a vector of times (positive reals), the same times that coxModel was built on; and
event a 0/1 vector, the same length, of event/censor indicators, the same events that coxModel was built on --
does this indicate that I want
predictions <- predict(coxModel, type="risk")
dd <- cbind(time, event, predictions)
Est.Cval(mydata=dd, tau=tau, nofit=TRUE)
or should that first line be
predictions <- predict(coxModel, type="lp")
?
Thanks for any help,

The answer is that it doesn't matter.
Basically, the concordance value tests, over all comparable pairs of times (events and censorings), how probable it is that the later time has the lower risk score (for a really good model, almost always). But since e^u is a monotonic function of real u, and the c-value only uses comparisons, it doesn't matter whether you provide the hazard ratio, exp(β'x), or the linear predictor, β'x.
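We can see the invariance in miniature (a toy sketch, separate from survC1):
# Rank-based measures such as concordance use only the ordering of the
# scores, and exp() preserves that ordering
u <- rnorm(10)                # some hypothetical linear predictors
all(rank(u) == rank(exp(u)))  # TRUE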
Since #42 motivated me to come up with a minimal working example, we can test this. We'll compare the values that Est.Cval() provides using one input versus using the other; and we can compare both to the value we get from coxph().
(That last value won't match exactly, because Est.Cval() uses the method of Uno et al. 2011 (Uno, H., Cai, T., Pencina, M. J., D’Agostino, R. B. & Wei, L. J. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statist. Med. 30, 1105–1117 (2011), https://onlinelibrary.wiley.com/doi/full/10.1002/sim.4154) but it can serve as a sanity check, since the values should be close.)
The following is based on the example worked through in Survival Analysis with R, 2017-09-25, by Joseph Rickert, https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/.
library("survival")
library("survC1")
# Load dataset included with survival package
data("veteran")
# The variable `time` records survival time; `status` indicates whether the
# patient's death was observed (status = 1) or the survival time was censored
# (status = 0).
# The model they build in the example:
coxModel <- coxph(Surv(time, status) ~ trt + celltype + karno + diagtime +
                    age + prior, data=veteran)
# The results
summary(coxModel)
Note the c-score it gives us:
Concordance= 0.736 (se = 0.021 )
Now, we calculate the c-score given by Est.Cval() on the two types of values:
# The value from Est.Cval(), using a risk input
cvalByRisk <- Est.Cval(mydata=cbind(time=veteran$time, event=veteran$status,
                         predictions=predict(object=coxModel, newdata=veteran, type="risk")),
                       tau=2000, nofit=TRUE)
# The value from Est.Cval(), using a linear predictor input
cvalByLp <- Est.Cval(mydata=cbind(time=veteran$time, event=veteran$status,
                       predictions=predict(object=coxModel, newdata=veteran, type="lp")),
                     tau=2000, nofit=TRUE)
And we compare the results:
cvalByRisk$Dhat
[1] 0.7282348
cvalByLp$Dhat
[1] 0.7282348
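As expected, they agree exactly (a one-line check on the objects above):
all.equal(cvalByRisk$Dhat, cvalByLp$Dhat)  # TRUE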

Related

GAM smooths interaction differences - calculate p value using mgcv and gratia 0.6

I am using the useful gratia package by Gavin Simpson to extract the difference in two smooths for two different levels of a factor variable. The smooths are generated by the wonderful mgcv package. For example
library(mgcv)
library(gratia)
m1 <- gam(outcome ~ s(dep_var, by = fact_var) + fact_var, data = my.data)
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
draw(diff1)
This gives me a graph of the difference between the two smooths for each level of the "by" variable in the gam() call. The graph has a shaded 95% credible interval (CI) for the difference.
Statistical significance, or areas of statistical significance at the 0.05 level, is assessed by whether, and where, the y = 0 line falls outside the CI, the y axis representing the difference between the smooths.
Here is an example from Gavin's site where the "by" factor variable had 3 levels.
The differences are clearly statistically significant (at 0.05) over nearly all of the graphs.
Here is another example I have generated using a "by" variable with 2 levels.
The difference in my example is clearly not statistically significant anywhere.
In the mgcv package, an approximate p value is reported for each smooth fit, testing the null hypothesis that the smooth's coefficients are all zero, based on a chi-square test.
My question is, can anyone suggest a way of calculating a p value that similarly assesses the difference between the two smooths instead of solely relying on graphical evidence?
The output from difference_smooths() is a data frame with the differences between the smooth functions at 100 points in the range of the smoothed variable, the standard error of the difference, and the upper and lower limits of the CI.
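That data frame can be queried directly. A sketch, assuming the lower/upper column names used by gratia 0.6 (check names(diff1) for your version):
# Rows where the pointwise CI excludes zero
subset(diff1, lower > 0 | upper < 0)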
The release post for gratia 0.4 explains the difference_smooths() function, but gratia is now at version 0.6.
Thanks in advance for taking the time to consider this.
Don
One way of getting a p value for the interaction between the by factor levels is to use the ci_level option of difference_smooths(); the default is 0.95. The ci_level can be varied to find a level at which y = 0 is no longer within the CI bands. If, for example, this occurred at ci_level = my_level, the p value for testing the hypothesis that the difference is zero everywhere would be 1 - my_level.
This is not totally satisfactory. It takes a little manual experimentation, and it may be difficult to discern accurately when zero drops out of the CI, although a function could be written to search the data frame returned by difference_smooths() as ci_level is varied (a rough sketch follows). That is not totally satisfactory either, because detecting a non-zero CI depends on the 100 points difference_smooths() chooses to assess the difference between the two curves. Then again, the standard errors are approximate for a GAM using mgcv, so that shouldn't be too much of a problem.
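Here is one way such a search might look (a sketch only: it assumes difference_smooths() accepts ci_level and returns lower/upper columns, and it treats "zero drops out" as zero leaving the band at any evaluation point):
approx.p <- function(model, smooth, levels = seq(0.99, 0.50, by = -0.01)) {
  for (cl in levels) {
    d <- gratia::difference_smooths(model, smooth = smooth, ci_level = cl)
    # zero has dropped out if the band excludes it somewhere
    if (any(d$lower > 0 | d$upper < 0)) return(1 - cl)
  }
  NA  # zero never left the band over the levels searched
}
approx.p(m1, smooth = "s(dep_var)")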
Here is a graph where the zero first drops out of the CI.
Zero dropped out at ci_level = 0.88 and was still in the interval at ci_level = 0.89. So an approximate p value would be 1 - 0.88 = 0.12.
Can anyone think of a better way?
Reply to Gavin Simpson's comments Feb 19
Thanks very much Gavin for taking the time to make your comments.
I am not sure that using the criterion >= 0 (for negative diffs) is a good way to go, because with draws from the posterior there are likely to be many diffs that meet it. I am interpreting your criterion as: sample the posterior distribution, count how many differences meet the criterion, and take that percentage as the p value. Correct me if I have misunderstood. Using this approach, I consistently got p values of around 0.45 - 0.5 for different gam models, even when it was clear the difference in the smooths should be statistically significant, at least at p = 0.05, because the confidence band around the smooth did not contain zero at a number of points.
Instead, I was thinking perhaps it would be better to compare the means of the posterior distribution of each of the diffs. For example
# get coefficients for the by smooths
coeff.level1 <- coef(gam.model1)[31:38]
coeff.level0 <- coef(gam.model1)[23:30]
# these indices are specific to my multi-variable gam.model1
# in my case 8 coefficients per smooth
# get posterior coefficients variances for the by smooths' coefficients
vp_level1 <- gam.model1$Vp[31:38, 31:38]
vp_level0 <- gam.model1$Vp[23:30, 23:30]
#run the simulation to get the distribution of each
#difference coefficient using the joint variance
library(MASS)
no.draws = 1000
sim <- mvrnorm(n = no.draws, mu = (coeff.level1 - coeff.level0),
               Sigma = (vp_level1 + vp_level0))
# sim is a no.draws X no. of coefficients (8 in my case) matrix
# put the results into a data.frame.
y.group <- data.frame(y = as.vector(sim),
                      group = rep(1:8, each = no.draws))
# y has the differences sampled from their posterior distributions.
# group is just a grouping name for the 8 sets of differences,
# (one set for each difference in coefficients)
# compare means with a linear regression
lm.test <- lm(y ~ as.factor(group), data = y.group)
summary(lm.test)
# The p value for the F statistic tells you how
# compatible the data are with the null hypothesis that
# all the group means are equal to each other.
# Same F statistic and p value from
anova(lm.test)
One could argue that if the coefficients are not all equal to each other, then they cannot all be equal to zero, but that isn't quite the test we want here.
The basis of the smooth tests of fit given by summary(gam.model1) in mgcv
is a joint test of all coefficients == 0. This would be a type of likelihood ratio test in which the model fit with and without a term is compared.
I would appreciate some ideas how to do this with the difference between two smooths.
Now that I have got this far, I have had a rethink of your original suggestion of using the criterion >= 0 (for negative diffs). I reinterpreted it as follows: for each simulated coefficient-difference distribution (in my case 8), count when this occurs, and make a table with one row per distribution (8 rows in my case) and two columns holding this count and (number of simulation draws minus count). Then run a chi square test on that table. When I did this, I got a very low p value when I believe I shouldn't have, as 0 was well within the smooth difference CI across almost all the levels of the exposure. Maybe I am still misunderstanding your suggestion.
Follow up thought Feb 24
In a follow up thought, we could create a variable that represents the interaction between the by factor and continuous variable
library(dplyr)
my.dat <- my.dat %>%
  mutate(interact.var = ifelse(factor.2levels == "yes", 1, 0) * cont.var)
Here I am assuming that factor.2levels has the levels ("no", "yes"), and "no" is the reference level. The ifelse function creates a dummy variable which is multiplied by the continuous variable to generate the interaction variable.
Then we place this interaction variable in the GAM and get the usual statistical test for fit, that is, testing all the coefficients == 0.
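One way to write that model (a sketch only; I am guessing at the intended structure and reusing the hypothetical variable names above, with outcome standing in for the response):
m.int <- gam(outcome ~ s(cont.var) + s(interact.var) + factor.2levels,
             data = my.dat)
summary(m.int)  # the p value for s(interact.var) tests the difference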
@GavinSimpson actually posted a method of how to get the difference between two smooths and assess its statistical significance here in 2017. Thanks to Matteo Fasiolo for pointing me in that direction.
In that approach, the by variable is converted to an ordered categorical variable which causes mgcv::gam to produce difference smooths in comparison to the reference level. Statistical significance for the difference smooths is then tested in the usual way with the summary command for the gam model.
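For reference, a sketch of that ordered-factor idiom, using the names from the example at the top of this question:
my.data$fact_ord <- ordered(my.data$fact_var)
m.ord <- gam(outcome ~ s(dep_var) + s(dep_var, by = fact_ord) + fact_ord,
             data = my.data)
summary(m.ord)  # the by-smooth terms are differences from the reference level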
However, and correct me if I have misunderstood, the ordered factor approach causes the smooth for the main effect to now be the smooth for the reference level of the ordered factor.
The approach I suggested (see the main post under the heading Follow up thought Feb 24, where the interaction variable is created) gives an almost identical p value for the difference smooth but does not change the smooth for the main effect. It also does not change the intercept or the linear term for the by categorical variable, both of which changed with the ordered-factor approach.

Visualizing regression coefficients of a regression

I am trying to figure out the best way to display a list of 30+ coefficients on a regression of a continuous variable.
(This may belong more in CrossValidated, I am not sure.)
Here is my example:
library("nycflights13")
library(dplyr)
flights <- nycflights13::flights
flights <- sample_n(flights, 3000)
m1 <- glm(formula = arr_delay ~ . , data = flights)
summary(m1)
An option is dwplot from dotwhisker
library(dotwhisker)
dwplot(m1)
As @BenBolker commented, by default dwplot scales regression coefficients by 2 standard deviations of the predictor variable.
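If you would rather see the raw coefficients, that scaling can be switched off (by_2sd is an argument of dwplot; its default has varied across dotwhisker versions):
dwplot(m1, by_2sd = FALSE)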
Or if we need a data.frame/tibble, then use tidy from broom
library(broom)
tidy(m1)
This may help. You could select a specific coefficient with the following:
str(flights)    # to print the list of data features
coef(m1)["age"] # here I just suppose that you have a column called "age"; you
                # could select as many feature coefficients as you want by
                # indexing with a vector of relevant names
You could have a look at :
extract coefficients from glm in R
tl;dr dwplot is still (a) right answer, but there's a lot to say about the details of how you're fitting this model (and why it takes a really really long time).
glm vs lm
You're using glm() to fit a linear model, which isn't incorrect (and which would allow you to generalize to problems with count or binary responses). However, it's overkill in this case — lm() will work just fine, and be faster [considerably faster when it comes to generating confidence intervals etc.]
system.time(m1 <- glm(formula = arr_delay ~ . , data = flights)) ## 6 seconds
system.time(m2 <- lm(formula = arr_delay ~ . , data = flights, x=TRUE)) ## 13 seconds
(the reason for including x=TRUE will be discussed below)
The time difference becomes more stark when tidying/computing confidence intervals:
setTimeLimit(elapsed=600)
system.time(tidy(m1, conf.int=TRUE)) ## gave up after 10 minutes
system.time(tt <- tidy(m2, conf.int=TRUE)) ## 3.2 seconds
Tidying glms by default uses MASS::confint.glm() to compute confidence intervals by likelihood profiling, which is more accurate than Wald (mean +/- 1.96*SE) intervals for non-Gaussian responses, but way slower.
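For comparison, the Wald intervals described above can be computed by hand from the fitted model:
# Wald 95% CIs: estimate +/- 1.96 * standard error
est  <- coef(m2)
se   <- sqrt(diag(vcov(m2)))
wald <- cbind(lower = est - 1.96*se, upper = est + 1.96*se)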
modeling choices
One of the reasons that everything is so slow is that there are lots of parameters (length(coef(m2)) is 1761). Why?
Although there are only 19 columns in the input data frame (so we might naively expect 18 coefficients), 4 of them are categorical, so get expanded to indicator variables:
catvars <- names(flights)[sapply(flights,is.character)]
sapply(catvars, function(x) length(unique(flights[[x]])))
## carrier tailnum origin dest
## 15 1653 3 94
So, most of the coefficients come from modeling the departures of individual planes (tailnum) [table(table(flights$tailnum)) shows that in this subsample of the data, more than half of the planes are recorded only once ...] It might not make sense to include this variable (if I were going to use tailnum, I would treat it as a random effect, although that would add a lot of modeling complexity).
Let's proceed without tailnum (we will still have plenty of coefficients to worry about).
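(The refit isn't shown in the original post; a plausible version, keeping x=TRUE as before:)
m3 <- lm(arr_delay ~ . - tailnum, data = flights, x = TRUE)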
plotting
At this point we're doing approximately what dotwhisker::dwplot does, but doing it by hand for more flexibility (in particular, ordering the terms by value).
The next step (1) extracts coefficients/conf int etc.; (2) scales non-binary variables by 2SD (using an internal function from dotwhisker); (3) drops the intercept; (4) makes term a factor ordered by the coefficient value and computes whether the term is significant (i.e., whether the lower and upper CI limits are both above or both below zero).
tt <- (tidy(m3, conf.int=TRUE)
    %>% dotwhisker::by_2sd(flights)
    %>% filter(term != "(Intercept)")
    %>% mutate(term = reorder(factor(term), estimate),
               sig = (conf.low * conf.high) > 0)  ## both CI limits on the same side of zero
)
Plot:
library(ggplot2)
(ggplot(tt, aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high))
    + geom_pointrange(aes(colour = sig))
    + geom_vline(xintercept = 0, lty = 2)
    + scale_colour_manual(values = c("black", "red"))
)

Generalized linear model vs Generalized additive model

I'm trying to follow this paper: Using a data science approach to predict cocaine use frequency from depressive symptoms, where they use glm and gam with the Beck Depression Inventory. I found a similar dataset to test those models, but I'm having a hard time with both. For example, I have two variables, d64a and d64b, coded 1,2,3,4, meaning they are ordinal. Also, in the paper y2 only takes the value 1, but I also have an extra variable (which could be the dependent variable: the proportion of consumption).
For the GAM model I have:
b<-gam(y2~s(d64a)+s(d64b),data=DATOS2)
but I have the following error:
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
A term has fewer unique covariate combinations than specified maximum degrees of freedom
Meanwhile for the glm, I have the following:
d<-glm(y2~d64a+d64b,data=DATOS2)
Since d64a and d64b are ordinal, I don't know whether I have to use factor()?
The error message tells you that one or both of d64a and d64b have fewer than nine unique values.
By default s(...) will create a basis with nine functions. You get this error if there are fewer than nine unique values in the covariate.
Check which covariates are affected using:
length(unique(DATOS2$d64a))
length(unique(DATOS2$d64b))
and see what the number of unique values is for each of the covariates you wish to include. Then set the k argument to the number returned above if it is less than nine. For example, assume that the above checks returned 5 and 7 unique values; then you would indicate this by setting k as follows:
b <- gam(y2 ~ s(d64a, k = 5) + s(d64b, k = 7), data = DATOS2)

glm summary not giving coefficient values

I'm trying to apply glm on a given dataset, but summary(model1) is not giving me the correct output: instead of the Estimate, Std. Error, z value, Pr(>|z|), etc. for each coefficient, it just gives NA for the individual attribute elements.
TEXT <- c('Learned a new concept today : metamorphic testing. t.co/0is1IUs3aW','BMC Bioinformatics BioMed Central: Detecting novel ncRNAs by experimental #RNomics is not an easy task... http:/t.co/ui3Unxpx #bing #MyEN','BMC Bioinformatics BioMed Central: small #RNA with a regulatory function as a scientific ... Detecting novel… http:/t.co/wWHOEkR0vc #bing','True or false? link(#Addition, #Classification) http:/t.co/zMJuTFt8iq #Oxytocin','Biologists do have a sense of humor, especially computational bio people http:/t.co/wFZqaaFy')
NAME <- c('QSoft Consulting','Fabrice Leclerc','Sungsam Gong','Frederic','Zach Stednick')
SCREEN_NAME <-c ('QSoftConsulting','rnomics','sunggong','rnomics','jdwasmuth')
FOLLOWERS_COUNT <- c(734,1900,234,266,788)
RETWEET <- c(1,3,5,0,2)
FRIENDS_COUNT <-c(34,532,77,213,422)
STATUSES_COUNT <- c(234,643,899,222,226)
FAVOURITES_COUNT <- c(144,2677,445,930,254)
df <- data.frame(TEXT,NAME,SCREEN_NAME,FOLLOWERS_COUNT,RETWEET,FRIENDS_COUNT,STATUSES_COUNT,FAVOURITES_COUNT)
mydata<-df
mydata$FAVOURITES_COUNT <- ifelse( mydata$FAVOURITES_COUNT >= 445, 1, 0) #converting fav_count to binary values
Splitting data
library(caret)
split=0.60
trainIndex <- createDataPartition(mydata$FAVOURITES_COUNT, p=split, list=FALSE)
data_train <- mydata[ trainIndex,]
data_test <- mydata[-trainIndex,]
glm model
library(e1071)
model1 <- glm(FAVOURITES_COUNT~.,family = binomial, data = data_train)
summary(model1)
I want to get the p values for further analysis. So far I think my code is right; how can I get the correct output?
A binomial distribution will only work if the dependent variable has two outcomes. You should consider a Poisson distribution when the dependent variable is a count. See here for more details: http://www.statmethods.net/advstats/glm.html
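For illustration, a minimal sketch of that suggestion, fitted to the raw counts in df (before the ifelse() recoding above) and using only the numeric predictors, since the free-text columns have one unique value per row:
model2 <- glm(FAVOURITES_COUNT ~ RETWEET + FRIENDS_COUNT + STATUSES_COUNT,
              family = poisson, data = df)
summary(model2)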
Your code for fitting the GLM is programmatically correct. However, there are a few issues:
As mentioned in the comments, for every variable that is categorical, you should use as.factor() to make it into a factor. GLM doesn't know what a "string" variable is.
As MorganBall indicated, if your data truly is count data, you may consider fitting it using a Poisson GLM, instead of converting to binary and using Logistic regression.
You indicate that you have 13 parameters and 1000 observations. While this may seem like enough data, note that some of these parameters may have very few (close to 0?) observations in them. This is a problem.
In addition, did you make sure that your data does not perfectly separate the response? Because if there are some combinations of parameters that do separate the response perfectly, the maximum likelihood estimate won't converge and theoretically goes to infinity. Practically speaking, you'll get very large standard errors for your estimates.
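A tiny made-up example of perfect separation and its symptoms:
y <- c(0, 0, 0, 1, 1, 1)
x <- 1:6   # x > 3 predicts y perfectly
summary(glm(y ~ x, family = binomial))
# glm() warns that fitted probabilities of 0 or 1 occurred;
# the estimates and standard errors are huge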

Standardisation in MuMIn package in R

I am using the 'MuMIn' package in R to select models and calculate the effect sizes of the input variables (rain, brk, onset, wid). To make the effect sizes comparable between variables, I standardised them using the standardize function in the arm package. Here is the code I am following:
For reference, please refer to the appendix of this paper: http://onlinelibrary.wiley.com/doi/10.1111/j.1420-9101.2010.02210.x/full
Grueber et al. 2011: Multimodel inference in ecology and evolution: challenges and solutions
library(lme4)
library(arm)
library(MuMIn)
data1 <- read.csv("data.csv", header=TRUE) # reads the data
global.model <- lmer(yld.res ~ rain + brk + onset + wid + (1|state),
                     data=data1, REML=FALSE) # prepares a global model
stdz.model <- standardize(global.model, standardize.y=FALSE) # standardises the input variables
model.set <- dredge(stdz.model) # generates the full submodel set
top.models <- get.models(model.set, subset=delta<2) # selects models with delta AIC < 2
model.avg(top.models) # calculates the average effect size of the input variables
Here is the result of model.avg(top.models) which gives the average effect size of each input variable
Coefficients:
(Intercept) brk rain wid onset
subset -4.281975e-14 -106.0919 51.54688 39.82837 35.68766
I read up on how the standardize function works: it subtracts the mean and divides by 2 SD.
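That is (a one-line sketch on a made-up numeric predictor x):
x <- rnorm(100, mean = 50, sd = 10)   # hypothetical input variable
z <- (x - mean(x)) / (2 * sd(x))      # centre, then divide by twice the SD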
My question is this: since I have standardised the input variables, should the effect sizes not be between -1 and 1? Or are the effect sizes shown in the output correct?
Please advise.
Thanks a lot.
This is more of a statistical question than a programming question, but: you've only standardized the predictor variables, not the response variable (you specified standardize.y=FALSE); therefore, each of your coefficients represents the expected change of the response (in the response's units!) per 2 SD change in the predictor. If the range of the response is large (as it must be in your example), then there could be a very large change. For example, if I were analyzing the change in elephant weight measured in milligrams, I could expect very large changes in the response for reasonably small changes in the predictors (e.g. sex, age, food availability). You should probably use standardize.y=TRUE if you want truly nondimensional/unitless effect sizes. Even nondimensional effects aren't necessarily constrained to be between -1 and +1, but it would be surprising for them to be so large.
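So, with that change, the standardisation line above would become:
stdz.model <- standardize(global.model, standardize.y = TRUE)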
By the way, I think your standardize function comes from the arm package, not from MuMIn (library("sos"); findFn("standardize", sortby="Function")).
