I have a table that looks like this:
ID Survival Event Allele
2 5 1 WildType
2 0 1 WildType
3 3 1 WildType
4 38 0 Variant
I want to make a Kaplan-Meier plot and tell whether the wild type or the variant tends to survive longer.
I have this code:
library(survival)
Table <- read.table("Table1", header = TRUE)
fit <- survfit(Surv(Table$Survival, Table$Event) ~ Table$Allele)
plot(fit, lty = 2:3, col = 3:4)
From the p-value below, I can see that the two groups have significantly different survival curves.
survdiff(formula = Surv(dat$Death, dat$Event) ~ dat$Allele, rho = 0)
# N Observed Expected (O-E)^2/E (O-E)^2/V
# dat$Allele=Variant 5592 3400 3503 3.00 8.63
# dat$Allele=WildType 3232 2056 1953 5.39 8.63
# Chisq= 8.6 on 1 degrees of freedom, p= 0.0033
The plot looks as expected (i.e. two curves).
All I want to do is put a legend on the plot, so that I can see which group is represented by the black line and which by the red line, i.e. whether the wild type or the variant survives longer.
I have tried these two commands:
lab <-gsub("x=","",names(fit$strata))
legend("top",legend=lab,col=3:4,lty=2:3,horiz=FALSE,bty='n')
The first command works (i.e. I get no error). With the second command, I get this error:
Error in strwidth(legend, units = "user", cex = cex, font = text.font) :
plot.new has not been called yet
I've tried reading forums etc., but none of the answers seem to work for me (for example, changing between "top"/"topright"/"topleft" etc. makes no difference).
Edit 1: This is an example of a table for which I get this error:
ID Survival Event Allele
25808 5 1 WTHomo
22196 0 1 Variant
22518 3 1 Variant
25013 38 0 Variant
27354 5 1 Variant
27223 4 1 Variant
22700 5 1 Variant
22390 24 1 Variant
17586 1 1 Variant
What exactly happens is: when I type the very last command (legend("top", legend=lab, col=3:4, lty=2:3, horiz=FALSE, bty='n')), the X11 window opens, except it's completely blank.
But then if I just type plot(fit, lty=2:3, col=3:4), the X11 window and the plot appear.
Edit 2: Also, this graph will have two lines; how do I tell which line is which group? Would the easiest way be to type summary(fit), which gives me two tables, and then put whichever group comes first in the table first in the legend?
Many thanks
Eva
You can also do this using ggsurvplot() from survminer.
Here is an example:
library(survminer)  # contains ggsurvplot()
library(survival)   # contains survfit()

ggsurvplot(
  fit = survfit(Surv(time, censor) ~ Allele, data = your_data,
                type = "kaplan-meier"),   # the fitted model
  xlab = "Years",
  ylab = "Overall survival probability",
  legend.labs = c("WildType", "Variant"), # names for the groups shown in the plot
  conf.int = TRUE,      # adds a 95% confidence interval
  pval = TRUE,          # displays the p-value in the plot
  pval.method = TRUE    # shows the statistical method used to obtain the p-value
)
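One caveat: legend.labs is matched to the strata positionally. With data like the example above, survfit() orders the strata by factor level, so "Variant" would come before "WildType"; it is worth checking names(fit$strata) before setting the labels, to be sure they are in the right order.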
I too have had repeated problems with the "plot.new has not been called yet" error! Strangely, the error was intermittent, and repeating the identical commands did not always reproduce it! In my case, I found that preceding the plotting command with
plot.new()
stopped the error from appearing! I have no idea why. As an aside, I also had no problem adding a legend to the survival plot using your command.
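For reference, a minimal sequence that works (a sketch reusing the file from the question, with the data= form of survfit() so the strata get clean names; the legend order follows names(fit$strata), which also answers Edit 2):

library(survival)
Table <- read.table("Table1", header = TRUE)
fit <- survfit(Surv(Survival, Event) ~ Allele, data = Table)

plot.new()                                     # the safety net described above
plot(fit, lty = 2:3, col = 3:4)                # draw the curves first
lab <- gsub("Allele=", "", names(fit$strata))  # e.g. "Allele=Variant" -> "Variant"
legend("top", legend = lab, col = 3:4, lty = 2:3, horiz = FALSE, bty = "n")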
I'm trying to make an adjusted survival curve based on a weighted Cox regression performed on a case-cohort data set in R, but unfortunately I can't make it work. I was therefore hoping that some of you may be able to figure out why it isn't working.
In order to illustrate the problem, I have used (and adjusted a bit) the example from the survival package manual, which means I'm working with:
data("nwtco")
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
ccoh.data$age <- ccoh.data$age/12 # Age in years
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data,subcoh = ~subcohort, id=~seqno, cohort.size=4028, method="LinYing")
The data set is looking like this:
seqno instit histol stage study rel edrel age in.subcohort subcohort
4 4 2 1 4 3 0 6200 2.333333 TRUE TRUE
7 7 1 1 4 3 1 324 3.750000 FALSE FALSE
11 11 1 2 2 3 0 5570 2.000000 TRUE TRUE
14 14 1 1 2 3 0 5942 1.583333 TRUE TRUE
17 17 1 1 2 3 1 960 7.166667 FALSE FALSE
22 22 1 1 2 3 1 93 2.666667 FALSE FALSE
Then I'm trying to illustrate the effect of stage in an adjusted survival curve, using the ggadjustedcurves() function from the survminer package:
library(survminer)
ggadjustedcurves(fit.ccSP, variable = ccoh.data$stage, data = ccoh.data)
#Error in survexp(as.formula(paste("~", variable)), data = ndata, ratetable = fit) :
# Invalid rate table
But unfortunately, this is not working. Can anyone figure out why? And can this somehow be fixed or done in another way?
Essentially, I'm looking for a way to graphically illustrate the effect of a continuous variable in a weighted Cox regression performed on a case-cohort data set, so I would generally also be interested in hearing whether there are alternatives to adjusted survival curves.
There are two reasons it is throwing errors:
The ggadjustedcurves() function is not being given a coxph object, which its help page indicates is the intended class for the first argument.
The specification of the variable argument is incorrect. The correct way to specify a column is with a length-1 character vector that matches one of the names in the formula. You gave it a vector of length 1154.
This code succeeds:
fit.ccSP <- coxph(Surv(edrel, rel) ~ stage + histol + age,
                  data = ccoh.data)
ggadjustedcurves(fit.ccSP, variable = "stage", data = ccoh.data)
It might not answer everything you want (note that the coxph() refit drops the case-cohort weighting that cch() applied), but it does answer the "why-error" part of your question. You might want to review the methods used by Therneau, Cynthia S. Crowson, and Elizabeth J. Atkinson in their vignette on adjusted curves:
https://cran.r-project.org/web/packages/survival/vignettes/adjcurve.pdf
I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: Here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially with a mean nearly identical to that of the second cluster. This only appears when specifically asking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving the default, does not produce the problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike on the histogram):
library(mclust)
set.seed(111)
data <- read.csv("example.csv", header = TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames = "E", G = 1:7)
hist(data$value, br = 50)
abline(v = fit$parameters$mean,
       col = c("#FF000080", "#0000FF80", "#BEBEBE80", "#BEBEBE80"), lty = 8)
Briefly, mclust fits a Gaussian mixture model (GMM): a probabilistic model that estimates the mean and variance of each cluster, and also the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The likelihood of the model is then built from the probabilities of each data point belonging to each cluster; you can check this out in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are close, but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if you have a data point around the means of 1 or 2, it will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability, so in the same example everything is assigned to 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; it's a pitfall, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can simply keep only the clusters that actually have members, as sketched below.
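For instance, a minimal sketch using the fitted object from above:

occupied <- sort(unique(fit$classification))  # components that received observations
fit$parameters$mean[occupied]                 # means of the non-empty clusters
fit$parameters$pro[occupied]                  # their mixing proportions (renormalize if needed)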
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames = "V", G = 1:7)
plot(fitv, what = "classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need a Gaussian mixture with spherical (equal-variance) components, consider using a fuzzy k-means instead:
library(ClusterR)
# fuzzy k-means on the same data (3 clusters assumed); fuzzy = TRUE also
# returns membership probabilities
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim = range(data$value), ylim = c(0, 4),
     ylab = "cluster", yaxt = "n", xlab = "values")
points(data$value, fit_kmeans$clusters, pch = 19, cex = 0.1,
       col = factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
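A minimal sketch of that route (the choice of three components is an assumption; predict_GMM() is used here to get the per-cluster probabilities and hard labels):

library(ClusterR)
gmm_fit <- GMM(as.matrix(data$value), gaussian_comps = 3)  # variances not forced equal
pr <- predict_GMM(as.matrix(data$value), gmm_fit$centroids,
                  gmm_fit$covariance_matrices, gmm_fit$weights)
table(pr$cluster_labels)  # hard assignments from the maximum posterior probability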
I have a series of hazard rates at two points (a low and a high point) on the curve, with corresponding standard errors. I calculate the hazard ratio by dividing the high-point hazard rate by the low-point hazard rate; this is the hratio column. In the next column I would like to show the p-value from a Wald test of whether the ratio is significantly different from 1.
I have tried doing this using wald.test() from the aods3 package, but I keep getting an error message. It seems that the function is designed for testing the coefficients of a fitted regression model.
How would you go about doing this?
> wald
fit.low se.low fit.high se.high hratio
1 0.09387638 0.002597817 0.09530283 0.002800329 0.9850324
2 0.10941588 0.002870383 0.10831292 0.003061924 1.0101831
3 0.02549611 0.001054303 0.02857411 0.001368525 0.8922802
4 0.02818208 0.000917136 0.02871669 0.000936373 0.9813833
5 0.04857652 0.000554676 0.04897211 0.000568229 0.9919222
6 0.05121328 0.000565592 0.05142951 0.000554893 0.9957956
> library(aods3)
> wald$pv <- wald.test(b=wald$hratio)
Error in wald.test(b = wald$hratio) :
One of the arguments Terms or L must be used.
wald.test() cannot work from a bare vector of ratios: its defaults are L = NULL and Terms = NULL, so you must supply one of Terms or L to define the hypothesis, together with a coefficient vector b and its variance-covariance matrix (e.g. Sigma = vcov(fit) from a fitted model).
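If all you need is a p-value for each row of the table, you can compute the Wald statistic by hand on the log scale (a minimal sketch, assuming the two rate estimates are independent, so the delta-method variances add):

wald$se.log <- sqrt((wald$se.low / wald$fit.low)^2 + (wald$se.high / wald$fit.high)^2)
wald$z  <- log(wald$hratio) / wald$se.log  # Wald statistic for H0: ratio = 1
wald$pv <- 2 * pnorm(-abs(wald$z))         # two-sided p-value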
I am having difficulties understanding what the multiclass.roc parameters should look like.
Here a snapshot of my data:
> head(testing.logist$cut.rank)
[1] 3 3 3 3 1 3
Levels: 1 2 3
> head(mnm.predict.test.probs)
1 2 3
9 1.013755e-04 3.713862e-02 0.96276001
10 1.904435e-11 3.153587e-02 0.96846413
12 6.445101e-23 1.119782e-11 1.00000000
13 1.238355e-04 2.882145e-02 0.97105472
22 9.027254e-01 7.259787e-07 0.09727389
26 1.365667e-01 4.034372e-01 0.45999610
I tried calling multiclass.roc with:
multiclass.roc(
  response = testing.logist$cut.rank,
  predictor = mnm.predict.test.probs,
  formula = response ~ predictor
)
but naturally I get an error:
Error in roc.default(response, predictor, levels = X, percent = percent, :
Predictor must be numeric or ordered.
When it's a binary classification problem, I know that 'predictor' should contain probabilities (one per observation). However, in my case I have 3 classes, so my predictor is a matrix in which each row has 3 columns corresponding to the probabilities of the classes.
Does anyone know what my 'predictor' should look like, rather than what it currently looks like?
The pROC package is not really designed to handle this case, where you get multiple predictions (as probabilities for each class). Typically you would assess your predictions one class at a time, e.g. with P(class = 1):
multiclass.roc(
  response = testing.logist$cut.rank,
  predictor = mnm.predict.test.probs[, 1])
And then do it again with P(class = 2) and P(class = 3). Or better, determine the most likely class:
predicted.class <- apply(mnm.predict.test.probs, 1, which.max)
multiclass.roc(
  response = testing.logist$cut.rank,
  predictor = predicted.class)
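(As an aside, which.max returns the column index of the largest probability, so predicted.class is a plain integer vector, which satisfies the "Predictor must be numeric or ordered" requirement from the error above.)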
Consider multiclass.roc as a toy that can sometimes be helpful but most likely won't really fit your needs.
In my RDA triplot I would like to display 'sites', 'species' and their constraints which in my case are Field and Trt. The problem is that not all levels of the constraints are displayed in the plot. There are two levels of each factor.
My RDA code is:
Dummy.rda <- rda(species.rda ~ Field + Trt, RDA.env, scale = TRUE)

# Here I see only one level of each factor under "Biplot scores for constraining
# variables"; however, all levels appear under "Centroids for factor constraints".
summary(Dummy.rda, scaling = 3)

# Degrees of freedom are correct for both factors (df = 1).
anova.cca(Dummy.rda, step = 100, by = "margin")

# This displays all levels of Field and Trt, but only one of each has an arrow.
plot(Dummy.rda, scaling = 3)

plot(Dummy.rda, display = "species", xlim = xlims, ylim = ylims,
     scaling = 3)
# I want to customize the RDA plot, but this text() call only displays one
# level each of Field and Trt.
text(Dummy.rda, scaling = 3, display = "bp")
The missing levels arise because you are trying to view the factor variables as if they were continuous ones; strictly, they should not be displayed as biplot arrows, I guess. Anyway, just as in regression with dummy variables, one of the levels of the factor cannot be included because it is linearly dependent on the dummy variables for the remaining levels in the model matrix. Consider this example:
require("vegan")
data(dune)
data(dune.env)
mod <- rda(dune ~ Management, data = dune.env)
> model.matrix(mod)
ManagementHF ManagementNM ManagementSF
2 0 0 0
13 0 0 1
4 0 0 1
16 0 0 1
6 1 0 0
1 0 0 1
8 1 0 0
5 1 0 0
....<truncated>
What you see in the output from model.matrix() are the variables that went into the ordination. Notice that there are three variables in the model matrix but there are four levels in the Management factor:
> with(dune.env, levels(Management))
[1] "BF" "HF" "NM" "SF"
The convention in R is to use the first level of a factor as the reference level. In a regression this would be absorbed into the intercept, but we don't have one of those in RDA. Notice how in the first row of the model.matrix() output all the values are 0; this indicates that the row was in the BF management group. But as only three variables went into the model proper, we can only represent them by three biplot arrows; that is just the way the maths works.
What we can do is plot the group centroids; that is what is shown in the summary() output you refer to, and they can be extracted using scores():
> scores(mod, display = "cn")
RDA1 RDA2
ManagementBF -1.2321245 1.9945903
ManagementHF -1.1847246 0.7128810
ManagementNM 2.1149031 0.4258579
ManagementSF -0.5115703 -2.0172205
attr(,"const")
[1] 6.322924
Hence to add the centroids to the existing plot do:
text(mod, scaling = 3, display = "cn")
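For example, to build up the dune plot with both the centroids and the three available arrows (a sketch; the colours are arbitrary):

plot(mod, display = c("sites", "species"), scaling = 3)
text(mod, display = "cn", scaling = 3, col = "blue")  # centroids for all four levels
text(mod, display = "bp", scaling = 3, col = "red")   # arrows for the three dummies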
No matter what you do, you can't add a biplot arrow for the reference group.
I hope that explains the behaviour you are seeing?