I'm trying to develop a predictive model using cancer survival data, and I used the R package survivalsvm, which implements a support vector machine method. After running the code below I got results, but I'm finding them difficult to interpret. I know that Cox regression predicts the cumulative hazard function, but is that also what survivalsvm predicts? I ran both a Cox model and a survivalsvm model and the results are quite different:
library(survivalsvm)
library(survival)  # for Surv()
smodel_svm <- survivalsvm(Surv(time, outcome) ~ radius.mean + tumor.size,
                          data = training_set, gamma.mu = 1)
pred_test_svm <- predict(smodel_svm, test_set)  # predictions on the test set
summary(pred_test_svm)
The difference is probably because you are using the default parameters, i.e. type = "regression", which uses the regression approach described in this paper.
In summary, the authors (Van Belle et al.) propose a different approach (MODEL 2 and MODEL 3) that essentially uses a Cox model but with both regression and ranking constraints.
Note however that the authors concluded:
Comparison of model 2 with the Cox model revealed no significant differences in performance. The advantage of model 2 above the Cox model lies in the easy extension towards non-linear models without the need to check non-linearities in the variables before modelling.
From the function's documentation (focusing on the parameter type):
The following denotations are used for the models implemented:
'regression' referring to the regression approach, named SVCR in Van Belle et al. (2011b),
'vanbelle1' according to the first version of survival support vector machines based on ranking constraints, named RANKSVMC by Van Belle et al. (2011b),
'vanbelle2' according to the second version of survival support vector machines based on ranking constraints, as presented in model 1 by Van Belle et al. (2011b), and 'hybrid', which combines the regression and ranking constraints.
Predictions from survivalsvm have to be interpreted as ranks. The point of survival SVM is to predict ranks among individuals, for example to estimate which patient should be treated earlier than others. See also Fouodo et al. (2018) for more details about using the package in R.
If you use the regression approach, the prediction is an estimate of survival time. If you use vanbelle1 or vanbelle2, the prediction is a rank; the hybrid method also returns a rank. As far as I know, you can then use these ranks (sometimes referred to as a prognostic index) to cluster observations into high-risk and low-risk groups.
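For illustration, here is a minimal sketch of fitting one of the ranking variants with the variable names from your question (the choices of diff.meth, opt.meth and gamma.mu below are just assumptions to make the call complete; see ?survivalsvm for the available options):

library(survivalsvm)
library(survival)

# ranking variant: predictions are rank/utility scores, not survival times or hazards
fit_rank <- survivalsvm(Surv(time, outcome) ~ radius.mean + tumor.size,
                        data      = training_set,
                        type      = "vanbelle2",
                        diff.meth = "makediff3",  # a difference method is required for the ranking/hybrid types
                        gamma.mu  = 1,
                        opt.meth  = "quadprog")

pred_rank <- predict(fit_rank, test_set)
pred_rank$predicted  # scores are only meaningful as an ordering (a prognostic index),
                     # e.g. split at the median to form high-risk and low-risk groups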
I am performing a spatial analysis of student grades according to their city of origin using R. I have several covariates such as poverty, education and socio-cultural indices. So far I have fitted univariate models such as: linear regression, weighted linear regression and CAR (conditional autoregressive).
Now, I am reading "Hierarchical Modeling and Analysis for Spatial Data" from Banerjee, Carlin and Gelfand. I am interested in applying multivariate models, in particular a MCAR (Multivariate Conditional Autoregressive) model.
However, I have not found any R (or Python) package that implements it. The closest I have found is the "spatialreg" package, which includes univariate CAR and SAR models.
Is there any library that you know of that includes them? Thanks in advance
I have found "CARBayes" package. This works perfectly for fitting MCAR model.
I am doing a counterfactual impact evaluation on survival data. More precisely, I am trying to evaluate the impact of vocational training on time spent in unemployment. I use the Kaplan-Meier estimator of the survival curve (package survival).
Before doing Kaplan-Meier, I use coarsened exact matching (the aim is the ATT) to get the control and treatment groups close in terms of pretreatment covariates (package MatchIt).
For the Kaplan-Meier estimator, I have to use the weights from the matching, which works well using the weights option and robust standard errors of survfit:
library(survival)
library(survminer)
kp_cem <- survfit(Surv(time = time_cem, event = status_cem) ~ treatment_cem,
                  data = data_impact_cem, robust = TRUE, weights = weights)
However, when I try to use a log-rank test to test for the difference in survival curves between the treatment and control groups, I cannot take the frequency weights from the matching into account, so the test statistic is not correct.
log_rank <- survdiff(Surv(time=time_cem,event=status_cem)~treatment_cem, data=data_impact_cem,rho=0)
I tried the option pval = TRUE of ggsurvplot (package survminer), but the problem is the same: the frequency weights are not taken into account.
How can I include frequency weights in survdiff? Are there other packages to compute log-rank test taking into account frequency weights (obtained after matching)?
There are at least two ways to do this:
First, you can use the survey::svylogrank function, as @IRTFM suggests. This will treat the weights as sampling weights, but I think that's OK with the robust standard errors that svylogrank uses.
Second, you can use survival::coxph. The log-rank test is the score test in a Cox model, and coxph takes frequency weights. Use robust = TRUE if you want a robust score test: it will be at the bottom of the output of summary(your_cox_model), and you can extract it as summary(your_cox_model)$robscore.
Thank you very much @Thomas Lumley and @IRTFM for your answers.
Here is how I apply your 2 suggestions (I added some comments + references).
1. Using survey::svylogrank
I don't feel very comfortable using sampling weights when what I actually have are frequency weights.
How should I specify the survey design? The weights come from coarsened exact matching (matchit with method = "cem"), which is a form of stratum matching.
Should I specify the strata and the weights in the survey design? In the MatchIt vignette Estimating Effects After Matching, it is suggested to use only the weights and robust standard errors in the survival analysis, not the strata (p. 27).
Here is how I specify the design and obtain the log-rank test with the survey package, taking the weights from matching into account:
library(survey)
# id = individual identifier, strata = CEM subclasses from matchit, weights = matching weights
design_weights <- svydesign(id = ~ibis, strata = ~subclass, weights = ~weights,
                            data = data_impact_cem)
log_rank <- svylogrank(Surv(time = time_cem, event = status_cem) ~ treatment_cem,
                       design = design_weights, rho = 0)
2. Using survival::coxph
Thank you for this piece of information. Being quite new to survival analysis, I had overlooked this nice property: the equivalence between the score test of a Cox model and the log-rank test. For people wanting more information on this subject, I found this book very instructive: Moore, D. (2016). Applied Survival Analysis Using R. New York, NY: Springer (p. 58).
I find this second option more attractive than the first one involving survey. Here is how I apply it:
library(survival)
cox_cem <- coxph(Surv(time = time_cem, event = status_cem) ~ treatment_cem,
                 data = data_impact_cem, robust = TRUE, weights = weights)
sum_cox_cem <- summary(cox_cem)
# robust score test = log-rank test taking the matching weights into account
score_test <- round(sum_cox_cem$robscore[["test"]], 3)
pvalue <- sum_cox_cem$robscore[["pvalue"]]
pvalue <- if (pvalue < 0.001) "<0.001" else round(pvalue, 3)
The two test statistics turned out to be quite close in the end.
Still, I wonder why a weights option does not exist in survdiff.
I'm using the book Applied Survival Analysis Using R by Moore to try to model some time-to-event data. The issue I'm running into is plotting the estimated survival curves from the Cox model, which makes me wonder whether my understanding of the model is wrong. My data are simple: a time column t, an event indicator column i (1 for event, 0 for censored), and a predictor column p with 6 factor levels.
I believe I can plot estimated survival curves for a Cox model as shown below, but I don't understand how to use survfit and base plot, nor the functions from survminer, to achieve the same end. Here is some generic code to clarify my question; I'll use the pharmacoSmoking data set to demonstrate the issue.
library(survival)
library(asaur)

t <- pharmacoSmoking$longestNoSmoke
i <- pharmacoSmoking$relapse
p <- pharmacoSmoking$levelSmoking
data <- data.frame(t = t, i = i, p = p)  # data.frame() keeps p as a factor (cbind would coerce it)
model <- coxph(Surv(t, i) ~ p, data = data)
As I understand it, with the following code snippets, modeled after the book's examples, the baseline (cumulative) hazard at my reference factor level of p may be obtained with
base <- basehaz(model, centered = FALSE)
An estimate of the survival curve is given by
s <- exp(-base$hazard)  # baseline survival S0(t) = exp(-H0(t))
t <- base$time
plot(s ~ t, type = "l")
The survival curve associated with a different factor level may then be given by
beta_n <- model$coefficients  # only one coefficient in this case
s_n <- s^(exp(beta_n))
lines(s_n ~ t)
where beta_n is the coefficient for the nth factor level from the Cox model. The code above gives what I think are estimated survival curves for heavy vs. light smokers in the pharmacoSmoking data set.
Since that's a fair bit of code, I was looking to packages for a one-liner solution. I had a hard time with the documentation for survival (there weren't many examples in the docs) and also tried survminer. For the latter I've tried:
library(survminer)
ggadjustedcurves(model, variable = "p", data = data)
This gives me something different from my prior code, although it is similar. Is the method I used earlier incorrect, or is there a different methodology that accounts for the difference? The survminer code doesn't work on my data (I get a 'cannot allocate vector of size ...' error, and my data is ~1m rows), which seems weird considering I can make the plots with my earlier approach without a problem. This is the primary reason I'm wondering whether I understand how to plot survival curves for my model.
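For completeness, my best guess from the docs at the survfit route is something like the sketch below (assuming p is stored as a factor in data; part of my question is whether this is actually equivalent to my manual approach):

# one row of newdata per factor level of p; survfit() then returns one curve per row
new_p <- data.frame(p = factor(levels(data$p), levels = levels(data$p)))
fit_curves <- survfit(model, newdata = new_p)
plot(fit_curves, col = seq_len(nrow(new_p)), xlab = "time", ylab = "estimated survival")
legend("topright", legend = levels(data$p), col = seq_len(nrow(new_p)), lty = 1)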
(I am using R and the lqmm package)
I was wondering how to account for autocorrelation in a linear quantile mixed model (LQMM).
I have a data frame that looks like this:
df1 <- data.frame(Time = seq(as.POSIXct("2017-11-13 00:00:00", tz = "UTC"),
                             as.POSIXct("2017-11-13 00:01:59", tz = "UTC"), "sec"),
                  HeartRate = rnorm(120, mean = 60, sd = 10),
                  Treatment = rep("TreatmentA", 120),
                  AnimalID = rep("ID01", 120),
                  Experiment = rep("Exp01", 120))

df2 <- data.frame(Time = seq(as.POSIXct("2017-08-11 00:00:00", tz = "UTC"),
                             as.POSIXct("2017-08-11 00:01:59", tz = "UTC"), "sec"),
                  HeartRate = rnorm(120, mean = 62, sd = 14),
                  Treatment = rep("TreatmentB", 120),
                  AnimalID = rep("ID02", 120),
                  Experiment = rep("Exp02", 120))

df <- rbind(df1, df2)
head(df)
With:
The heart rate (HeartRate) is measured every second on several animals (AnimalID). These measurements are taken during experiments (Experiment) with different possible treatments (Treatment). Each animal (AnimalID) was observed in multiple experiments with different treatments. I wish to look at the effect of the variable Treatment on the 90th percentile of the heart rate, including Experiment as a random effect and accounting for the autocorrelation (as heart rates are taken every second). (If there is a way to also include AnimalID as a random effect, it would be even better.)
Model for now:
library(lqmm)
# grouping factor goes in `group =`, not inside the random formula
model <- lqmm(fixed = HeartRate ~ Treatment, random = ~ 1, group = Experiment,
              data = df, tau = 0.9)
Thank you very much in advance for your help.
Let me know if you need more information.
For resources on thinking about this type of problem, you might look at chapters 17 and 19 of Koenker et al. (2018), Handbook of Quantile Regression, CRC Press. Neither chapter has nice R code to work from, but they discuss different approaches to the kind of data you're working with. lqmm does use nlme machinery, so there may be a way to customize the covariance matrices for the random effects, but I suspect it would be easiest either to ask the package author for help or to do a deep dive into the package code to figure out how to do it.
Another resource is the quantile regression model for mixed effects that accounts for autocorrelation in 'Quantile regression for mixed models with an application to examine blood pressure trends in China' by Smith et al. (2015). They model a bivariate response with a copula, but you could use a simplified version with a univariate response. I think their model at this point only incorporates a lag-1 correlation structure within subjects/clusters. The code for that model does not seem to be available online either, though.
Apologies in advance for no data samples:
I built a random forest of 128 trees, with no tuning, with 1 binary outcome and 4 continuous explanatory variables. I then compared the AUC of this forest against that of a previously built forest when predicting on cases. What I want to figure out is how to determine what exactly is lending predictive power to this new forest. Univariate analysis with the outcome variable led to no significant findings. Any technique recommendations would be greatly appreciated.
EDIT: To summarize, I want to perform multivariate analysis on these 4 explanatory variables to identify what interactions are taking place that may explain the forest's predictive power.
Random forest is what's known as a "black box" learning algorithm, because there is no good way to interpret the relationship between the input and outcome variables. You can, however, use something like a variable importance plot or a partial dependence plot to get a sense of which variables contribute the most to the predictions.
Here are some discussions on variable importance plots, also here and here. They are implemented in the randomForest package as varImpPlot() and in the caret package as varImp(). The interpretation of the plot depends on the metric you use to assess variable importance. For example, if you use MeanDecreaseAccuracy, a high value for a variable means that, on average, a model that includes this variable reduces classification error by a good amount.
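For instance, a minimal sketch with the randomForest package (the data frame df and the column names y, x1, ..., x4 are placeholders for your binary outcome and 4 predictors):

library(randomForest)

# y must be a factor so that randomForest does classification
rf <- randomForest(y ~ x1 + x2 + x3 + x4, data = df,
                   ntree = 128, importance = TRUE)
varImpPlot(rf)    # plots MeanDecreaseAccuracy and MeanDecreaseGini
importance(rf)    # the same importance measures as a matrix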
Here are some other discussions on partial dependence plots for predictive models, also here. They are implemented in the randomForest package as partialPlot().
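Continuing the same hypothetical rf object, a partial dependence plot for one predictor looks roughly like this (which.class must match one of the levels of your outcome factor):

# partial dependence of the predicted probability of class "1" on x1
partialPlot(rf, pred.data = df, x.var = "x1", which.class = "1")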
In practice, 4 explanatory variables is not many, so you can easily run a binary logistic regression (possibly with L2 regularization) for a more interpretable model, and compare its performance against the random forest. See this discussion about variable selection. It is implemented in the glmnet package. Basically, L2 regularization, also known as ridge, is a penalty term added to your loss function that shrinks your coefficients to reduce variance, at the expense of increased bias. This effectively reduces prediction error if the reduction in variance more than compensates for the added bias (which is often the case). Since you only have 4 input variables, I suggested L2 instead of L1 (also known as lasso, which also does automatic feature selection). See this answer for ridge and lasso shrinkage parameter tuning using cv.glmnet: How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?
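As a rough sketch of that ridge-penalized logistic regression with glmnet (again with placeholder column names; alpha = 0 selects the ridge penalty, alpha = 1 would give the lasso):

library(glmnet)

X <- as.matrix(df[, c("x1", "x2", "x3", "x4")])  # predictors as a numeric matrix
y <- df$y                                        # binary outcome (factor or 0/1)

cv_ridge <- cv.glmnet(X, y, family = "binomial", alpha = 0)   # cross-validation over lambda
coef(cv_ridge, s = "lambda.min")                              # shrunken coefficients
pred <- predict(cv_ridge, newx = X, s = "lambda.min", type = "response")  # predicted probabilities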