I would like to analyze my data with a gradient boosted model.
However, because my data come from a cohort study, I am having trouble understanding the output of this model.
Here is my code. The analysis uses example data.
install.packages("randomForestSRC")
install.packages("gbm")
install.packages("survival")
library(randomForestSRC)
library(gbm)
library(survival)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
set.seed(9512)
train <- sample(1:nrow(data), round(nrow(data)*0.7))
data.train <- data[train, ]
data.test <- data[-train, ]
set.seed(9741)
gbm <- gbm(Surv(days, status)~.,
data.train,
interaction.depth=2,
shrinkage=0.01,
n.trees=500,
distribution="coxph")
summary(gbm)
set.seed(9741)
gbm.pred <- predict.gbm(gbm,
n.trees=500,
newdata=data.test,
type="response")
As I read the package documentation, "gbm.pred" is the result of Cox's partial likelihood.
set.seed(9741)
lambda0 = basehaz.gbm(t=data.test$days,
delta=data.test$status,
t.eval=sort(data.test$days),
cumulative = FALSE,
f.x=gbm.pred,
smooth=T)
hazard=lambda0*exp(gbm.pred)
In this code, lambda0 is the baseline hazard function.
So, according to the formula h(t|x) = lambda0(t)*exp(f(x)),
"hazard" is the hazard function.
However, what I wanted to calculate was the survival function,
because I would like to compare the outcome in the original data (data$status) with the prediction (survival function).
Please let me know how to calculate the survival function.
Thank you
Actually, basehaz.gbm can return the cumulative baseline hazard function (the integral part, \int_0^t \lambda_0(z) dz) if you set cumulative = TRUE, and the survival function can then be computed as follows:
S(t|X) = \exp\{-e^{f(X)} \int_0^t \lambda_0(z) dz\}
f(X) is the gbm prediction, which is the log hazard ratio relative to the baseline.
I think this tutorial on gbm-based survival analysis will help you:
https://github.com/liupei101/Tutorial-Machine-Learning-Based-Survival-Analysis/blob/master/Tutorial_Survival_GBM.ipynb
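As a minimal numeric sketch of the formula above (base R only; the cumulative baseline hazard values H0 and the predictions f are made-up numbers for illustration, not output from the model in the question), the survival curve for each subject could be computed like this:

```r
# Hypothetical cumulative baseline hazard H0(t) at three evaluation times
# (in practice this could come from basehaz.gbm(..., cumulative = TRUE)).
t.eval <- c(100, 500, 1000)
H0 <- c(0.05, 0.20, 0.45)

# Hypothetical gbm predictions f(x) on the link (log-hazard) scale for two subjects.
f <- c(-0.3, 0.4)

# S(t | x) = exp(-exp(f(x)) * H0(t)); rows are subjects, columns are times.
S <- exp(-outer(exp(f), H0))
dimnames(S) <- list(paste0("subject", 1:2), paste0("t=", t.eval))
print(round(S, 3))
```

Each row is a predicted survival curve, so it can be compared with the observed status at a chosen time point.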
I am fitting time-to-event survival data using surv.CoxBoost in the mlr package. My question: is there any way to get the relative importance of the variables in the fitted model? I have seen this post detailing variable importance for cv.glmnet but haven't seen any on CoxBoost.
Any idea?
Below is an example of a model using CoxBoost. You may need to install CoxBoost from here, as it seems to be no longer on CRAN.
library(randomForestSRC)
library(mlr)
library(survival)
library(CoxBoost)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
set.seed(9512)
train <- sample(1:nrow(data), round(nrow(data)*0.7))
data.train <- data[train, ]
data.test <- data[-train, ]
task = makeSurvTask( data=data.train, target=c('days', 'status'))
learner= makeLearner("surv.CoxBoost")
trained.learner=train(learner,task)
CoxBoostfit <- trained.learner$learner.model
CoxBoostfit$coefficients
I analyzed my data with the 'gbm' R package. My data come from a cohort study, so I fitted the 'gbm' model with the 'coxph' distribution.
After constructing the model, I would like to see how well it predicts. However, as the code below shows, the predicted values are negative, and I am having trouble understanding this.
Please let me know how to interpret these values.
Here's my code.
install.packages("survival")
install.packages("randomForestSRC")
install.packages("gbm")
library(survival)
library(randomForestSRC)
library(gbm)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
exposure <- setdiff(names(data), c("days", "status"))
formula <- as.formula(paste("Surv(days, status)~", paste(exposure, collapse="+")))
set.seed(123)
ex <- gbm(Surv(days, status)~.,
data=data,
distribution="coxph",
cv.folds=5,
shrinkage=.01,
n.trees=1000)
set.seed(123)
pred <- predict(ex, n.trees=1000, type="response")
Read the ?predict.gbm help page, particularly the parameter type. By default predictions are on the link scale.
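To illustrate with made-up numbers (not output from the model above): as far as I can tell from ?predict.gbm, type="link" and type="response" return the same values for distribution="coxph", so the predictions are log relative hazards and negative values are expected. Exponentiating them gives hazards relative to the baseline:

```r
# Hypothetical link-scale predictions (log relative hazards) for four subjects.
lp <- c(-1.2, -0.1, 0, 0.8)

# exp() converts them to hazards relative to the baseline (f(x) = 0).
rel.hazard <- exp(lp)
print(round(rel.hazard, 3))

# Negative predictions mean a hazard below the baseline, positive ones above it.
print(lp < 0)
```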
I would like to develop a Cox proportional hazards model in R, use it to make predictions on new input, and evaluate the accuracy of the model. For the evaluation I would like to use the Brier score.
# import various packages, needed at some point of the script
library("survival")
library("survminer")
library("prodlim")
library("randomForestSRC")
library("pec")
library("rpart")
library("mlr")
library("Hmisc")
library("ipred")
# load lung cancer data
data("lung")
head(lung)
# recode status variable
lung$status <- lung$status-1
# Delete rows with missing values
lung <- na.omit(lung)
# split data into training and testing
## 80% of the sample size
smp_size <- floor(0.8 * nrow(lung))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(lung)), size = smp_size)
# training and testing data
train.lung <- lung[train_ind, ]
test.lung <- lung[-train_ind, ]
# time and failure event
s <- Surv(train.lung$time, train.lung$status)
# create model
cox.ph2 <- coxph(s~age+meal.cal+wt.loss, data=train.lung)
# predict
pred <- predict(cox.ph2, newdata = train.lung)
# evaluate
sbrier(s, pred)
As the outcome of the prediction I would expect a time (as in "when does this individual experience failure"). Instead I get values like this:
[1] 0.017576359 -0.135928959 -0.347553969 0.112509137 -0.229301199 -0.131861582 0.044589175 0.002634008
[9] 0.345966978 0.209488560 0.002418358
What does that mean?
Furthermore, sbrier does not work. Apparently it cannot work with the prediction pred (no surprise there).
How do I solve this? How do I make a prediction with cox.ph2? How can I evaluate the model afterwards?
The predict() function won't return a time value. Its type argument, one of "lp" (the default), "risk", "expected", "terms", or "survival", controls what it returns.
If you want to get the hazard ratios:
predict(cox.ph2, newdata = test.lung, type = "risk")
Note that you want to predict the values on the test set not the training set.
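If you want survival probabilities rather than hazard ratios, one option is survfit() on the fitted Cox model. A minimal sketch, assuming the lung data from the survival package and a simpler model (age and sex only, chosen here just to avoid missing values); the two newdata rows are hypothetical patients:

```r
library(survival)

# Fit a small Cox model; age and sex have no missing values in lung.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Two hypothetical patients to predict for.
new.patients <- data.frame(age = c(50, 70), sex = c(1, 1))

# survfit() gives one predicted survival curve per row of newdata.
sf <- survfit(fit, newdata = new.patients)

# Predicted survival probabilities at 180 and 365 days
# (rows are times, columns are patients).
surv.prob <- summary(sf, times = c(180, 365))$surv
print(surv.prob)
```

Probabilities at a fixed time point like these are the kind of prediction a Brier score needs, unlike the linear predictor returned by default.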
I have read that you can use AFT models in your case:
https://stats.stackexchange.com/questions/79362/how-to-get-predictions-in-terms-of-survival-time-from-a-cox-ph-model
You can also read this post:
Calculate the Survival prediction using Cox Proportional Hazard model in R
Hope it helps.
I am sorry for posting this question again but I really need help on this now.
I am trying to calculate the AUC on the training set for a randomForest model in R. There are two ways to calculate it, but they give different results. The following is a reproducible example of my question. I would really appreciate it if someone could help!
library(randomForest)
library(pROC)
library(ROCR)
# prep training to binary outcome
train <- iris[iris$Species %in% c('virginica', 'versicolor'),]
train$Species <- droplevels(train$Species)
# build model
rfmodel <- randomForest(Species~., data=train, importance=TRUE, ntree=2)
#the first way to calculate training auc
rf_p_train <- predict(rfmodel, type="prob",newdata = train)[,2]
rf_pr_train <- prediction(rf_p_train, train$Species)
r_auc_train1 <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
r_auc_train1 #0.9888
#the second way to calculate training auc
rf_p_train <- as.vector(rfmodel$votes[,2])
rf_pr_train <- prediction(rf_p_train, train$Species);
r_auc_train2 <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
r_auc_train2 #0.9175
To get the same result from both approaches, omit the newdata argument from the first predict call. As explained in the package documentation for the predict function, predict then returns the out-of-bag (OOB) predictions instead of predictions on the rows the trees were trained on:
rf_p_train <- predict(rfmodel, type="prob")[,2]
rf_pr_train <- prediction(rf_p_train, train$Species)
r_auc_train1 <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
r_auc_train1
returns,
[1] 0.8655172
The second function returns the OOB votes as explained in the package documentation of the randomForest function,
rf_p_train <- as.vector(rfmodel$votes[,2])
rf_pr_train <- prediction(rf_p_train, train$Species);
r_auc_train2 <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
r_auc_train2
returns (the same result),
[1] 0.8655172
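A quick check of this, as a sketch (ntree=50 and the seed are arbitrary choices here, not from the question): without newdata, predict() returns the out-of-bag class probabilities, which match the vote fractions stored in the model, while predicting on the training data itself also uses the trees that were trained on each row.

```r
library(randomForest)

train <- droplevels(iris[iris$Species %in% c("virginica", "versicolor"), ])

set.seed(1)
rfmodel <- randomForest(Species ~ ., data = train, ntree = 50)

# Out-of-bag probabilities: predict() without newdata.
oob.prob <- predict(rfmodel, type = "prob")[, 2]

# OOB vote fractions stored in the fitted object.
vote.prob <- rfmodel$votes[, 2]

# Resubstitution probabilities: every tree votes on each row,
# including the trees that were trained on it.
resub.prob <- predict(rfmodel, newdata = train, type = "prob")[, 2]

print(all.equal(as.vector(oob.prob), as.vector(vote.prob)))
print(head(cbind(oob = oob.prob, resubstitution = resub.prob)))
```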
I'm trying to use R's gbm regression model.
I want to compute the coefficient of determination (R squared) between the cross-validation predicted response values and the true response values. However, the cv.fitted values of the gbm.object only provide the predicted response values for 1-train.fraction. So, to get what I want, I need to find which observations correspond to the cv.fitted values.
Any idea how to get that information?
You can use the predict function to easily get at model predictions, if I'm understanding your question correctly.
dat <- data.frame(y = runif(1000), x=rnorm(1000))
gbmMod <- gbm::gbm(y~x, data=dat, n.trees=5000, cv.folds=0)
summary(lm(predict(gbmMod, n.trees=5000) ~ dat$y))$adj.r.squared
But shouldn't we hold data to the side and assess model accuracy on test data? This would correspond to the following, where I partition the data into a training set (70%) and testing set (30%):
inds <- sample(1:nrow(dat), 0.7*nrow(dat))
train <- dat[inds, ]
test <- dat[-inds, ]
gbmMod2 <- gbm::gbm(y~x, data=train, n.trees=5000)
preds <- predict(gbmMod2, newdata = test, n.trees=5000)
summary(lm(preds ~ test[,1]))$adj.r.squared
It's also worth noting that the number of trees in the gbm can be tuned using the gbm.perf function and the cv.folds argument to the gbm function. This helps avoid overfitting.
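A minimal sketch of that tuning step, assuming simulated data and arbitrary settings (500 trees, 5 folds, shrinkage 0.05, none of which come from the question): gbm.perf() with method="cv" returns the number of trees that minimizes the cross-validated error, and that number is then passed to predict().

```r
library(gbm)

set.seed(42)
dat <- data.frame(x = rnorm(500))
dat$y <- sin(dat$x) + rnorm(500, sd = 0.3)

# Fit with 5-fold cross-validation so a CV error curve is available.
gbmMod <- gbm(y ~ x, data = dat, distribution = "gaussian",
              n.trees = 500, shrinkage = 0.05, cv.folds = 5)

# Number of trees with the lowest cross-validated error.
best.iter <- gbm.perf(gbmMod, method = "cv", plot.it = FALSE)
print(best.iter)

# Predict with the tuned number of trees; this squared correlation is
# in-sample, so a held-out test set is still needed for an honest estimate.
preds <- predict(gbmMod, newdata = dat, n.trees = best.iter)
print(cor(preds, dat$y)^2)
```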