What is the "linear.predictors" in the result of cox? - r

I analyzed cox proportional hazard regression with example data.
My R code is below.
install.packages("randomForestSRC")
install.packages("survival")
library(randomForestSRC)
library(survival)
data(pbc, package="randomForestSRC")
pbc.na <- na.omit(pbc)
surv.f <- as.formula(Surv(days, status) ~ .)
cox.obj <- coxph(surv.f, data=pbc.na)
cox.obj
cox.obj$linear.predictors
What I want to know is that "linear.predictors".
All of the objects in my data seem to have this index, but I have no idea exactly.
Please let me know what it is.
Thans always.

Related

How can I calculate the mean square error in R of a regression tree?

I am working with the wine quality database.
I am studying regression trees depending on different variables as:
library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)
vinos <- read.csv(file = 'Wine.csv', header = T)
arbol0<-rpart(formula=quality~chlorides, data=vinos, method="anova")
fancyRpartPlot(arbol0)
arbol1<-rpart(formula=quality~chlorides+density, data=vinos, method="anova")
fancyRpartPlot(arbol1)
I want to calculate the mean square error to see if arbol1 is better than arbol0. I will use my own dataset since no more data is available. I have tried to do it as
aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")
and then substract manually the last column of the dataframe from aaa and bbb. However, I am getting an error. Can someone please help me?
This website could be useful for you. It's very important to split your dataset into train and test subsets before training your models. In the following code, I've done it with base functions, but there's another function called sample.split from the caTools package that does the same procedure. I attach you this website where you can see all the ways to split data in R.
Remember that the function of the Mean Squared Error (MSE) is the following one:
So, it's very simple to apply it with R. You just have to compute the mean of the squared difference between the observed (i.e, the response variable from your test subset) and predicted values (i.e, the values you have predicted from the model with the predict function).
A solution for your wine dataset could be this one, based on the previous website.
library(rpart)
library(dplyr)
library(data.table)
vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
# Split data into train and test subsets
sample_index <- sample(nrow(vinos), size = nrow(vinos)*0.75)
train <- vinos[sample_index, ]
test <- vinos[-sample_index, ]
# Train regression trees models
arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
# Make predictions for each model
pred0 <- predict(arbol0, newdata = test)
pred1 <- predict(arbol1, newdata = test)
# Calculate MSE for each model
mean((pred0 - test$quality)^2)
mean((pred1 - test$quality)^2)

Relative variable importance from CoxBoost

I am fitting time-to-event survival data using surv.CoxBoost in the mlr package. My question: is there any way to get relative importance for the variables in the fitted model? I have seen this post detailing variablke importance for cvglment but haven't seen any on CoxBoost.
Any idea?
below is an example of a model using CoxBoost`. You may need to install CoxBoost from here as seems no longer on CRAN.
library(randomForestSRC)
library(mlr)
library(survival)
library(CoxBoost)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
set.seed(9512)
train <- sample(1:nrow(data), round(nrow(data)*0.7))
data.train <- data[train, ]
data.test <- data[-train, ]
task = makeSurvTask( data=data.train, target=c('days', 'status'))
learner= makeLearner("surv.CoxBoost")
trained.learner=train(learner,task)
CoxBoostfit <- trained.learner$learner.model
CoxBoostfit$coefficients

How can I calculate survival function in gbm package analysis?

I would like to analysis my data based on the gradient boosted model.
On the other hand, as my data is a kind of cohort, I have a trouble understanding the result of this model.
Here's my code. Analysis was performed based on the example data.
install.packages("randomForestSRC")
install.packages("gbm")
install.packages("survival")
library(randomForestSRC)
library(gbm)
library(survival)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
set.seed(9512)
train <- sample(1:nrow(data), round(nrow(data)*0.7))
data.train <- data[train, ]
data.test <- data[-train, ]
set.seed(9741)
gbm <- gbm(Surv(days, status)~.,
data.train,
interaction.depth=2,
shrinkage=0.01,
n.trees=500,
distribution="coxph")
summary(gbm)
set.seed(9741)
gbm.pred <- predict.gbm(gbm,
n.trees=500,
newdata=data.test,
type="response")
As I read the package documnet, "gbm.pred" is the result of cox's partial likelihood.
set.seed(9741)
lambda0 = basehaz.gbm(t=data.test$days,
delta=data.test$status,
t.eval=sort(data.test$days),
cumulative = FALSE,
f.x=gbm.pred,
smooth=T)
hazard=lambda0*exp(gbm.pred)
In this code, lambda0 is a baseline hazard fuction.
So, according to formula: h(t/x)=lambda0(t)*exp(f(x))
"hazard" is hazard function.
However, what I've wanted to calculte was the "survival function".
Because, I would like to compare the outcome of original data (data$status) to the prediction result (survival function).
Please let me know how to calculate survival function.
Thank you
Actually, the returns is cumulative baseline hazard function(integral part: \int^t\lambda(z)dz), and survival function can be computed as below:
s(t|X)=exp{-e^f(X)\int^t\lambda(z)dz}
f(X) is prediction of gbm, which is equal to log-hazard proportion.
I think this tutorial about gbm-based survival analysis would help to u!
https://github.com/liupei101/Tutorial-Machine-Learning-Based-Survival-Analysis/blob/master/Tutorial_Survival_GBM.ipynb

Why are the predict values of gbm (R package) negative?

I analyzed my data with 'gbm' R package. My data is based on a cohort study. Therefore, I ran 'gbm' model based on the 'coxph' results.
After constructing a model, I would like to see how this model can predict well. On the other hand, like the code below, the values of prediction are negative. So, I have a trouble understanding this phenomenon.
Please let me know how to interpret this value.
Here's my code.
install.packages("survival")
install.packages("randomForestSRC")
install.packages("gbm")
library(survival)
library(randomForestSRC)
library(gbm)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
exposure <- names(data[, names(data.model) !=c("days", "status")])
formula <- as.formula(paste("Surv(days, status)~", paste(exposure, collapse="+")))
set.seed(123)
ex <- gbm(Surv(days, status)~.,
data=data,
distribution="coxph",
cv.folds=5,
shrinkage=.01,
n.trees=1000)
set.seed(123)
pred <- predict(ex, n.trees=1000, type="response")
Read the ?predict.gbm help page, particularly the parameter type. By default predictions are on the link scale.

Mice pool() function and coxph(): in mice.df (...) : large sample assumed?

I used the coxph() function from the survival package in multiply imputed dataset and encountered a warning when trying to pool the results. The warning message states: "In mice.df(m, lambda, dfcom, method) : Large sample assumed.
A reproducible example is below (with publically available data, without worrying to much about the appropriateness of using both mice() and coxph() with these data):
library(mice)
library(survival)
#load publically available data
data(pbc)
#select variables for the reproducable example
pbc.select <- pbc[pbc$status %in% c(0,1) , c("id", "time", "status", "trt")]
imp <- mice(pbc.select) #impute trt
fit <- with(imp, coxph(Surv(time, status) ~ trt)) #fit coxph in each imp
pool(fit) #pool the models; get's the error
This warning seems to result from the pool() function trying to require dfcom from
dfcom <- df.residual(object)
where df.residual() is not applicable to the object referred to in this context, which is of class coxph
class(fit) # "mira" "matrix"
class(fit$analyses[[1]]) "coxph"
My questions are
am I using the correct syntax for purpose
if so, is there a way to supply pool() with the appropriate information?
how does this assumption influence the results?

Resources