I am trying to use a Cox model to predict the probability of failure after time 3 (the time variable is named stop).
bladder1 <- bladder[bladder$enum < 5, ]
coxmodel = coxph(Surv(stop, event) ~ (rx + size + number) +
cluster(id), bladder1)
range(predict(coxmodel, bladder1, type = "lp"))
range(predict(coxmodel, bladder1, type = "risk"))
range(predict(coxmodel, bladder1, type = "terms"))
range(predict(coxmodel, bladder1, type = "expected"))
However, none of the outputs of the predict function fall in the 0-1 range. Is there a function for this, or how can I use the lp prediction and the baseline hazard function to calculate a probability?
Please read the help page for predict.coxph. None of those are supposed to be probabilities. The linear predictor for a specific set of covariates is the log-hazard-ratio relative to a hypothetical (and very possibly non-existent) case with the mean of all the predictor values. The 'expected' type comes closest to a probability, since it is a predicted number of events, but it would require specifying the time and then dividing by the number at risk at the beginning of observation.
In the case of the example offered on that help page for predict, you can see that the sum of predicted events is close to the actual number:
> sum(predict(fit,type="expected"), na.rm=TRUE)
[1] 163
> sum(lung$status==2)
[1] 165
I suspect you may want to be working with the survfit function instead, since the probability of an event is 1 - the probability of survival.
?survfit.coxph
The code for a similar question appears here: Adding column of predicted Hazard Ratio to dataframe after Cox Regression in R
Since you suggested using the bladder1 dataset, this would be the code for a specification of time = 5:
summary(survfit(coxmodel), time=5)
#------------------
Call: survfit(formula = coxmodel)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 302 26 0.928 0.0141 0.901 0.956
That returns a list with the survival prediction as a list element named $surv:
> str(summary(survfit(coxmodel), time=5))
List of 14
$ n : int 340
$ time : num 5
$ n.risk : num 302
$ n.event : num 26
$ conf.int: num 0.95
$ type : chr "right"
$ table : Named num [1:7] 340 340 340 112 NA 51 NA
..- attr(*, "names")= chr [1:7] "records" "n.max" "n.start" "events" ...
$ n.censor: num 19
$ surv : num 0.928
$ std.err : num 0.0141
$ lower : num 0.901
$ upper : num 0.956
$ cumhaz : num 0.0744
$ call : language survfit(formula = coxmodel)
- attr(*, "class")= chr "summary.survfit"
> summary(survfit(coxmodel), time=5)$surv
[1] 0.9282944
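Putting the pieces together for the time point in the question (t = 3), a minimal sketch; the choice of the first row of bladder1 as the covariate profile is purely illustrative:
sf <- survfit(coxmodel, newdata = bladder1[1, ])  # survival curve for one covariate profile
1 - summary(sf, times = 3)$surv                   # P(failure by time 3) = 1 - S(3)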
I am trying to model the performance of a portfolio consisting of a basket of ETFs. To do this, I am using a t copula. For now, I have specified the marginals (i.e. the performance of the individual ETFs) as normal; however, I want to use a Student t-distribution instead of a normal distribution.
I have looked into the fit.st() method from the QRM package, but I am unsure how to combine this with the copula package.
I know how to implement normally distributed margins:
mv.NE <- mvdc(normalCopula(0.75), c("norm", "norm"),
              list(list(mean = 0, sd = 2), list(mean = 0, sd = 2)))
How can I do the same thing, but with a t-distribution?
All you need to do is use tCopula instead of normalCopula, setting its correlation parameter and degrees of freedom. You also need to specify the margins.
Hence, here we replace normalCopula with tCopula, where df = 5 is the degrees of freedom. Both margins are normal (as you want).
mv.NE <- mvdc(tCopula(0.75, df = 5), c("norm", "norm"),
              list(list(mean = 0, sd = 2), list(mean = 0, sd = 2)))
The result is:
Multivariate Distribution Copula based ("mvdc")
@ copula:
t-copula, dim. d = 2
Dimension: 2
Parameters:
rho.1 = 0.75
df = 5.00
@ margins:
[1] "norm" "norm"
with 2 (not identical) margins; with parameters (@ paramMargins)
List of 2
 $ :List of 2
  ..$ mean: num 0
  ..$ sd  : num 2
 $ :List of 2
  ..$ mean: num 0
  ..$ sd  : num 2
For t margins, use this (note that tCopula's df defaults to 4 here, while each margin has df = 5):
mv.NE <- mvdc(tCopula(0.75), c("t","t"),list(t=5,t=5))
Multivariate Distribution Copula based ("mvdc")
@ copula:
t-copula, dim. d = 2
Dimension: 2
Parameters:
rho.1 = 0.75
df = 4.00
@ margins:
[1] "t" "t"
with 2 (not identical) margins; with parameters (@ paramMargins)
List of 2
$ t: Named num 5
..- attr(*, "names")= chr "df"
$ t: Named num 5
..- attr(*, "names")= chr "df"
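To sanity-check the resulting object you can draw samples from it; a minimal sketch, assuming a recent copula version that provides rMvdc (older versions called it rmvdc), with the df = 5 values simply mirroring the example above:
library(copula)
set.seed(1)  # reproducible draws
mv.t <- mvdc(tCopula(0.75, df = 5), c("t", "t"),
             list(list(df = 5), list(df = 5)))
x <- rMvdc(1000, mv.t)                    # 1000 bivariate draws
cor(x[, 1], x[, 2], method = "kendall")   # rank correlation implied by rho = 0.75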
While fitting a smooth coefficient kernel regression with npscoef {np} in R, I cannot output the standard errors for the regression estimates.
The help page states that if errors = TRUE, asymptotic standard errors should be computed and returned in the resulting smoothcoefficient object.
Based on the example provided by the authors of the np package:
library("np")
data(wage1)
NP.Ydata <- wage1$lwage
NP.Xdata <- wage1[c("educ", "tenure", "exper", "expersq")]
NP.Zdata <- wage1[c("female", "married")]
NP.bw.scoef <- npscoefbw(xdat=NP.Xdata, ydat=NP.Ydata, zdat=NP.Zdata)
NP.scoef <- npscoef(NP.bw.scoef,
betas = TRUE,
residuals = TRUE,
errors = TRUE)
The coefficients are available via coef(NP.scoef) because betas = TRUE was set:
> str(coef(NP.scoef))
num [1:526, 1:5] 0.146 0.504 0.196 0.415 0.415 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "Intercept" "educ" "tenure" "exper" ...
But shouldn't the standard errors for the estimates be saved when errors = TRUE? I see only a single vector, not five columns for the intercept + 4 explanatory variables.
> str(se(NP.scoef))
num [1:526] 0.015 0.0155 0.0155 0.0268 0.0128 ...
I am confused and hope for a clarification.
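As a first step in tracking this down, you can list what the fitted object actually stores; this is plain R introspection, not an np-specific interface:
str(NP.scoef, max.level = 1)  # one line per component of the smoothcoefficient object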
I would like to calculate how well my cluster analysis solution fits the actual distance scores. To do that, I need to extract the distances between the stimuli I am clustering. I know that I can read a distance off the dendrogram, for example the height at which stimuli 5 and 14 are joined is 0.219, but is there an automatic way of extracting these distances from the information in the hclust object?
List of 7
$ merge : int [1:14, 1:2] -5 -1 -6 -4 -10 -2 1 -9 -12 -3 ...
$ height : num [1:14] 0.219 0.228 0.245 0.266 0.31 ...
$ order : int [1:15] 3 11 5 14 4 1 8 12 10 15 ...
$ labels : chr [1:15] "1" "2" "3" "4" ...
$ method : chr "ward.D"
$ call : language hclust(d = as.dist(full_naive_eucAll, diag = F, upper = F), method = "ward.D")
$ dist.method: NULL
- attr(*, "class")= chr "hclust"
Yes.
You are asking about the cophenetic distance.
d_USArrests <- dist(USArrests)
hc <- hclust(d_USArrests, "ave")
par(mfrow = c(1,2))
plot(hc)
plot(cophenetic(hc) ~ d_USArrests)
cor(cophenetic(hc), d_USArrests)
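If you want the model-implied distance between two specific items, coerce the cophenetic dist object to a matrix and index it by label; a small sketch using the USArrests row names from the example above:
coph <- as.matrix(cophenetic(hc))
coph["Alabama", "Alaska"]  # height at which these two leaves are merged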
The same method can also be applied to compare two hierarchical clustering results, and it is implemented in the dendextend R package (the function makes sure the two distance matrices are ordered to match). For example:
# install.packages('dendextend')
library("dendextend")
d_USArrests <- dist(USArrests)
hc1 <- hclust(d_USArrests, "ave")
hc2 <- hclust(d_USArrests, "single")
cor_cophenetic(hc1, hc2)
# 0.587977
I have the following code, resulting in a table-like output:
lvs <- c("normal", "abnormal")
truth <- factor(rep(lvs, times = c(86, 258)),
levels = rev(lvs))
pred <- factor(
c(
rep(lvs, times = c(54, 32)),
rep(lvs, times = c(27, 231))),
levels = rev(lvs))
xtab <- table(pred, truth)
library(caret)
confusionMatrix(xtab)
confusionMatrix(pred, truth)
confusionMatrix(xtab, prevalence = 0.25)
I would like to export the following part of the output as a .csv table:
Accuracy : 0.8285
95% CI : (0.7844, 0.8668)
No Information Rate : 0.75
P-Value [Acc > NIR] : 0.0003097
Kappa : 0.5336
Mcnemar's Test P-Value : 0.6025370
Sensitivity : 0.8953
Specificity : 0.6279
Pos Pred Value : 0.8783
Neg Pred Value : 0.6667
Prevalence : 0.7500
Detection Rate : 0.6715
Detection Prevalence : 0.7645
Balanced Accuracy : 0.7616
Attempting to write it as a .csv table results in the error message:
write.csv(confusionMatrix(xtab),file="file.csv")
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ""confusionMatrix"" to a data.frame
Doing the whole job manually is, for obvious reasons, impractical and prone to human error.
Any suggestions on how to export it as a .csv?
Using the caret package:
results <- confusionMatrix(pred, truth)
as.table(results) gives
Reference
Prediction X1 X0
X1 36 29
X0 218 727
as.matrix(results,what="overall") gives
Accuracy 7.554455e-01
Kappa 1.372895e-01
AccuracyLower 7.277208e-01
AccuracyUpper 7.816725e-01
AccuracyNull 7.485149e-01
AccuracyPValue 3.203599e-01
McnemarPValue 5.608817e-33
and
as.matrix(results, what = "classes") gives
Sensitivity 0.8953488
Specificity 0.6279070
Pos Pred Value 0.8783270
Neg Pred Value 0.6666667
Precision 0.8783270
Recall 0.8953488
F1 0.8867562
Prevalence 0.7500000
Detection Rate 0.6715116
Detection Prevalence 0.7645349
Balanced Accuracy 0.7616279
Using these together with the write.csv command, you can export the entire confusionMatrix info.
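For instance, a minimal sketch writing each piece to its own file (the file names are arbitrary):
write.csv(as.table(results), "cm_table.csv")
write.csv(as.matrix(results, what = "overall"), "cm_overall.csv")
write.csv(as.matrix(results, what = "classes"), "cm_classes.csv")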
OK, so if you inspect the output of confusionMatrix(xtab, prevalence = 0.25), it's a list:
cm <- confusionMatrix(pred, truth)
str(cm)
List of 5
$ positive: chr "abnormal"
$ table : 'table' int [1:2, 1:2] 231 27 32 54
..- attr(*, "dimnames")=List of 2
.. ..$ Prediction: chr [1:2] "abnormal" "normal"
.. ..$ Reference : chr [1:2] "abnormal" "normal"
$ overall : Named num [1:7] 0.828 0.534 0.784 0.867 0.75 ...
..- attr(*, "names")= chr [1:7] "Accuracy" "Kappa" "AccuracyLower" "AccuracyUpper" ...
$ byClass : Named num [1:8] 0.895 0.628 0.878 0.667 0.75 ...
..- attr(*, "names")= chr [1:8] "Sensitivity" "Specificity" "Pos Pred Value" "Neg Pred Value" ...
$ dots : list()
- attr(*, "class")= chr "confusionMatrix"
From here on, select the objects you want in the csv and build a data.frame with a column for each variable. In your case, this would be:
tocsv <- data.frame(cbind(t(cm$overall),t(cm$byClass)))
# You can then use
write.csv(tocsv,file="file.csv")
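If you prefer the metrics as rows rather than columns (often easier to read in a spreadsheet), a small base-R variation:
tocsv_long <- data.frame(metric = c(names(cm$overall), names(cm$byClass)),
                         value  = c(cm$overall, cm$byClass))
write.csv(tocsv_long, file = "file_long.csv", row.names = FALSE)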
I found that capture.output works best for me.
It simply writes your printed output to a .csv file (you can also use .txt). Note that the file contains the printed text, not strictly comma-separated values.
capture.output(
confusionMatrix(xtab, prevalence = 0.25),
file = "F:/Home Office/result.csv")
The absolute easiest solution is simply to write out using readr::write_rds. You can export and import everything while keeping the confusionMatrix structure intact.
If A is a caret::confusionMatrix object, then:
broom::tidy(A) %>% writexl::write_xlsx("mymatrix.xlsx")
Optionally, replace writexl::write_xlsx() with write.csv().
To also include the table on a separate sheet:
broom::tidy(A) %>% list(as.data.frame(A$table)) %>% writexl::write_xlsx("mymatrix.xlsx")
I have a very big data set (ds). One of its columns is popularity, of type factor ('High' / 'Low').
I split the data to 70% and 30% in order to create a training set (ds_tr) and a test set (ds_te).
I have created the following model using a Logistic regression:
mdl <- glm(formula = popularity ~ . -url , family= "binomial", data = ds_tr )
Then I created a prediction vector (and will do the same for ds_te):
y_hat <- predict(mdl, newdata = ds_tr, type = 'response')
I want to find the precision value which corresponds to a cutoff threshold of 0.5 and find the recall value which corresponds to a cutoff threshold of 0.5, so I did:
library(ROCR)
pred <- prediction(y_hat, ds_tr$popularity)
perf <- performance(pred, "prec", "rec")
The result is an object holding many values:
str(perf)
Formal class 'performance' [package "ROCR"] with 6 slots
..@ x.name      : chr "Recall"
..@ y.name      : chr "Precision"
..@ alpha.name  : chr "Cutoff"
..@ x.values    :List of 1
.. ..$ : num [1:27779] 0.00 7.71e-05 7.71e-05 1.54e-04 2.31e-04 ...
..@ y.values    :List of 1
.. ..$ : num [1:27779] NaN 1 0.5 0.667 0.75 ...
..@ alpha.values:List of 1
.. ..$ : num [1:27779] Inf 0.97 0.895 0.89 0.887 ...
How do I find the specific precision and recall values corresponding to a cutoff threshold of 0.5?
Access the slots of the performance object (through the combination of @ and [[, since each slot holds a list).
We create a data frame with all the cutoff, precision, and recall values:
probab.cuts <- data.frame(cut = perf@alpha.values[[1]],
                          prec = perf@y.values[[1]],
                          rec = perf@x.values[[1]])
You can view all associated values
probab.cuts
If you want to select the requested values, it is trivial to do:
tail(probab.cuts[probab.cuts$cut > 0.5,], 1)
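Alternatively, a small base-R sketch that picks the row whose cutoff is closest to 0.5:
i <- which.min(abs(probab.cuts$cut - 0.5))
probab.cuts[i, ]  # precision and recall at the cutoff nearest 0.5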
Manual check
tab <- table(ds_tr$popularity, y_hat > 0.5)
tab[4]/(tab[4]+tab[2]) # recall
tab[4]/(tab[4]+tab[3]) # precision