Plotting results of several Bland-Altman analyses in R

I compared several diagnostic methods to a gold standard using Bland-Altman plots. Now I would like to graphically represent the difference in agreement between each method and the gold standard in one single plot. I'm trying to plot the means, confidence intervals, and variances derived from the various Bland-Altman plots as horizontal boxplots, but I don't know how to do that.
I have a dataframe like this:
Method LCL mean UCL var
A       -5    4  15  27
B       -9    2  13  33
C       -8    4  16  36
Thank you very much for your help!
Corrado

You need to realize that a "true" boxplot is a specific type of plot based on non-parametric statistics, none of which you have offered. If you want to call it something else, you are free to do so, and you can use the bxp function to do the plotting. You need to create a matrix with 5 rows and 3 columns holding the values for the whisker and box parameters. Perhaps you were thinking that the variance could be used to construct a standard deviation?
dat <- read.table(text="Method LCL mean UCL var
A -5 4 15 27
B -9 2 13 33
C -8 4 16 36
", header=TRUE)
dat$sdpd <- dat$mean + dat$var^0.5  # mean plus one standard deviation
dat$sdmd <- dat$mean - dat$var^0.5  # mean minus one standard deviation
dat
#------
Method LCL mean UCL var sdpd sdmd
1 A -5 4 15 27 9.196152 -1.196152
2 B -9 2 13 33 7.744563 -3.744563
3 C -8 4 16 36 10.000000 -2.000000
#----------
# stack LCL, mean - sd, mean, mean + sd, UCL into the 5 x 3 "stats" matrix bxp expects
bxpm <- with(dat, t(matrix(c(LCL, sdmd, mean, sdpd, UCL), 3, 5)))
bxpm
#----------
[,1] [,2] [,3]
[1,] -5.000000 -9.000000 -8
[2,] -1.196152 -3.744563 -2
[3,] 4.000000 2.000000 4
[4,] 9.196152 7.744563 10
[5,] 15.000000 13.000000 16
bxp(list(stats = bxpm, names = dat$Method),
    main = "Not a real boxplot\nPerhaps a double dynamite plot?")

I can't provide you with working R code because you didn't supply raw data (which are needed for boxplots), and it is not clear what you want to display: nothing in the aggregated data indicates where your gold standard comes into play (are these repeated measurements with different instruments?), unless the reported means stand for the difference between the ith method and the reference method (in which case I don't see how you could use a boxplot). A basic plot of your data might look like this:
dfrm <- data.frame(method=LETTERS[1:3], lcl=c(-5,-9,-8),
                   mean=c(4,2,4), ucl=c(15,13,16), var=c(27,33,36))
# stripchart avoids the axis relabeling and the casting of factor to numeric
# that the default plot function would do
stripchart(mean ~ seq(1,3), data=dfrm, vertical=TRUE, ylim=c(-10,20),
           group.names=dfrm$method, pch=19)
# error bars running from the lower to the upper limit of agreement
with(dfrm, arrows(1:3, lcl, 1:3, ucl, angle=90, code=3, length=.1))
abline(h=0, lty=2)
However, I would recommend taking a look at the MethComp package, which is specifically designed to help you compare several methods to a gold standard, with or without replicates, and to display the results. The companion textbook is:
Carstensen, B. Comparing Clinical Measurement Methods. John Wiley & Sons Ltd, 2010.
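A minimal sketch of what that might look like, assuming raw paired measurements rather than the aggregated limits above (the data frame and its column names are invented for illustration):
library(MethComp)
# hypothetical raw data: 20 items measured by a gold standard and by method A
raw <- data.frame(meth = rep(c("Gold", "A"), each = 20),
                  item = rep(1:20, 2),
                  y = rnorm(40, mean = 10, sd = 2))
m <- Meth(raw, meth = "meth", item = "item", y = "y")
BA.plot(m)  # Bland-Altman plot of method A against the gold standard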

Have you tried using R's boxplot() command?
I think by default it assumes you are supplying the raw data and specifying a factor with which to segment them. It will compute its own bounds for the box, which may or may not correspond to what you are using. If you want to be able to fine-tune R graphics easily, and you have a little bit of time to learn, check out Hadley Wickham's ggplot2. It's powerful, flexible, and pretty!
Good luck!
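For the aggregated data in the question, a hedged sketch of the ggplot2 route could use geom_pointrange() to draw each method's limits around its mean (again, not a true boxplot):
library(ggplot2)
dfrm <- data.frame(method = LETTERS[1:3], lcl = c(-5, -9, -8),
                   mean = c(4, 2, 4), ucl = c(15, 13, 16))
ggplot(dfrm, aes(x = method, y = mean, ymin = lcl, ymax = ucl)) +
  geom_pointrange() +                         # mean with limits of agreement
  geom_hline(yintercept = 0, linetype = 2) +  # line of perfect agreement
  coord_flip()                                # horizontal, as the question asked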

Related

Adjusted survival curve based on weighted Cox regression

I'm trying to make an adjusted survival curve based on a weighted Cox regression performed on a case-cohort data set in R, but unfortunately I can't make it work. I was therefore hoping that some of you might be able to figure out why it isn't working.
In order to illustrate the problem, I have used (and adjusted a bit) the example from the "Package 'survival'" document, which means I'm working with:
data("nwtco")
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
ccoh.data$age <- ccoh.data$age/12 # Age in years
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data,subcoh = ~subcohort, id=~seqno, cohort.size=4028, method="LinYing")
The data set is looking like this:
seqno instit histol stage study rel edrel age in.subcohort subcohort
4 4 2 1 4 3 0 6200 2.333333 TRUE TRUE
7 7 1 1 4 3 1 324 3.750000 FALSE FALSE
11 11 1 2 2 3 0 5570 2.000000 TRUE TRUE
14 14 1 1 2 3 0 5942 1.583333 TRUE TRUE
17 17 1 1 2 3 1 960 7.166667 FALSE FALSE
22 22 1 1 2 3 1 93 2.666667 FALSE FALSE
Then I'm trying to illustrate the effect of stage in an adjusted survival curve, using the ggadjustedcurves function from the survminer package:
library(survminer)
ggadjustedcurves(fit.ccSP, variable = ccoh.data$stage, data = ccoh.data)
#Error in survexp(as.formula(paste("~", variable)), data = ndata, ratetable = fit) :
#  Invalid rate table
But unfortunately, this is not working. Can anyone figure out why? And can this somehow be fixed or done in another way?
Essentially, I'm looking for a way to graphically illustrate the effect of a continuous variable in a weighted cox regression performed on a case cohort data set, so I would, generally, also be interested in hearing if there are other alternatives than the adjusted survival curves?
There are two reasons it is throwing errors:
1. The ggadjustedcurves function is not being given a coxph object, which its help page indicates should be the first argument.
2. The specification of the variable argument is incorrect. The correct way to specify a column is with a length-1 character vector that matches one of the names in the formula. You gave it a vector of length 1154.
This code succeeds:
fit.ccSP <- coxph(Surv(edrel, rel) ~ stage + histol + age,
                  data = ccoh.data)
ggadjustedcurves(fit.ccSP, variable = "stage", data = ccoh.data)
It might not be everything you want, but it does answer the "why the error" part of your question. You might want to review the methods used by Therneau, Cynthia S. Crowson, and Elizabeth J. Atkinson in their vignette on adjusted curves:
https://cran.r-project.org/web/packages/survival/vignettes/adjcurve.pdf
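One approach from that vignette, sketched here with illustrative covariate values (not taken from the question), is to predict survival curves at fixed covariate levels with survfit() on the coxph fit:
# one predicted curve per row of newdata; covariate values are illustrative
nd <- data.frame(stage = 1:4, histol = 1, age = 3)
plot(survfit(fit.ccSP, newdata = nd), col = 1:4)
legend("topright", legend = paste("stage", 1:4), col = 1:4, lty = 1)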

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: Here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially one whose mean is nearly identical to that of the second cluster. This only happens when specifically asking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving the default, does not produce this problem.
This thread: Cluster contains no observations describes a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike on the histogram):
set.seed(111)
library(mclust)
data <- read.csv("example.csv", header = TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames = "E", G = 1:7)
hist(data$value, br = 50)
abline(v = fit$parameters$mean,
       col = c("#FF000080", "#0000FF80", "#BEBEBE80", "#BEBEBE80"), lty = 8)
Briefly, mclust fits a Gaussian mixture model (GMM): a probabilistic model that estimates the mean and variance of each cluster, as well as the probability of each point belonging to each cluster. This is unlike k-means, which provides a hard assignment. The likelihood of the model is built from each data point's weighted sum of densities across clusters; you can check this out in mclust's publication.
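As a sketch of that idea, the log-likelihood can be reconstructed by hand from the fitted parameters of the equal-variance model above:
p <- fit$parameters
# weighted normal density of every data point under each cluster
dens <- sapply(seq_along(p$pro), function(k)
  p$pro[k] * dnorm(data$value, p$mean[k], sqrt(p$variance$sigmasq)))
sum(log(rowSums(dens)))  # should be close to fit$loglik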
In this model, the means of cluster 1 and cluster 2 are near each other, but their expected proportions differ:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that a data point lying around the means of clusters 1 and 2 will be consistently assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
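A quick by-hand equivalent of that step:
z <- predict(fit, 1350:1400)$z
head(apply(z, 1, which.max))  # same as predict(fit, 1350:1400)$classification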
To answer your question: no, you did not do anything wrong; it's a shortcoming, at least of this implementation of GMM. I would say it's a bit of overfitting, but you can basically keep only the clusters that have a non-empty membership.
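For instance, tabulating the hard assignments (as in the clustering table of the summary) shows which clusters actually receive points:
table(factor(fit$classification, levels = 1:4))  # cluster 1 is empty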
If you use model="V", i see the solution is equally problematic:
fitv <- Mclust(Data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need a Gaussian mixture with spherical (equal-variance) components, consider using a fuzzy k-means:
library(ClusterR)
# fit_kmeans was not defined in the original post; a plain ClusterR k-means
# fit is assumed here for illustration
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3)
plot(NULL, xlim = range(data$value), ylim = c(0, 4), ylab = "cluster",
     yaxt = "n", xlab = "values")
points(data$value, fit_kmeans$clusters, pch = 19, cex = 0.1,
       col = factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
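A hedged sketch of that route (component count and inputs are illustrative):
library(ClusterR)
gmm_fit <- GMM(as.matrix(data$value), gaussian_comps = 3)
pred <- predict_GMM(as.matrix(data$value), gmm_fit$centroids,
                    gmm_fit$covariance_matrices, gmm_fit$weights)
table(pred$cluster_labels)  # cluster sizes with unequal variances allowed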

R multiclass/multinomial classification ROC using multiclass.roc (Package ‘pROC’)

I am having difficulty understanding what the multiclass.roc parameters should look like.
Here is a snapshot of my data:
> head(testing.logist$cut.rank)
[1] 3 3 3 3 1 3
Levels: 1 2 3
> head(mnm.predict.test.probs)
1 2 3
9 1.013755e-04 3.713862e-02 0.96276001
10 1.904435e-11 3.153587e-02 0.96846413
12 6.445101e-23 1.119782e-11 1.00000000
13 1.238355e-04 2.882145e-02 0.97105472
22 9.027254e-01 7.259787e-07 0.09727389
26 1.365667e-01 4.034372e-01 0.45999610
I tried calling multiclass.roc with:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs,
formula=response~predictor
)
but naturally I get an error:
Error in roc.default(response, predictor, levels = X, percent = percent, :
Predictor must be numeric or ordered.
When it's a binary classification problem, I know that 'predictor' should contain probabilities (one per observation). However, in my case I have 3 classes, so my predictor is a matrix with one row per observation and 3 columns, corresponding to the probability of each class.
Does anyone know what my 'predictor' should look like, rather than what it currently looks like?
The pROC package is not really designed to handle this case, where you get multiple predictions (as probabilities for each class). Typically you would assess your P(class = 1):
multiclass.roc(
  response = testing.logist$cut.rank,
  predictor = mnm.predict.test.probs[, 1])
And then do it again with P(class = 2) and P(class = 3). Or better, determine the most likely class:
predicted.class <- apply(mnm.predict.test.probs, 1, which.max)
multiclass.roc(
  response = testing.logist$cut.rank,
  predictor = predicted.class)
Consider multiclass.roc as a toy that can sometimes be helpful but most likely won't really fit your needs.
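If you do want a per-class view, a common workaround (sketched here with the objects from the question) is one binary ROC curve per class, one-vs-rest:
library(pROC)
for (k in 1:3) {
  r <- roc(response = testing.logist$cut.rank == k,   # class k vs. the rest
           predictor = mnm.predict.test.probs[, k])
  cat("class", k, "AUC:", auc(r), "\n")
}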

Getting percentile values from gamlss centile curves

This question is related to: Selecting Percentile curves using gamlss::lms in R
I can get a centile curve from the following data and code:
library(gamlss)  # provides lms() and centiles()
age <- sample(5:15, 500, replace = TRUE)
yvar <- rnorm(500, age, 20)
mydata <- data.frame(age, yvar)
head(mydata)
age yvar
1 12 13.12974
2 14 -18.97290
3 10 42.11045
4 12 27.89088
5 11 48.03861
6 5 24.68591
h <- lms(yvar, age, data = mydata, n.cyc = 30)
centiles(h, xvar = mydata$age, cent = c(90), points = FALSE)
How can I now get the yvar value on the curve for each x value (5:15), which would represent the 90th percentile for the data after smoothing?
I tried to read the help pages and found fitted(h) and fv(h) to get fitted values for the entire data. But how do I get the value on the 90th centile curve for each age level? Thanks for your help.
Edit: The following figure shows what I need:
I tried the following, but it is not correct, since the values are wrong:
mydata$fitted = fitted(h)
aggregate(fitted~age, mydata, function(x) quantile(x,.9))
age fitted
1 5 6.459680
2 6 6.280579
3 7 6.290599
4 8 6.556999
5 9 7.048602
6 10 7.817276
7 11 8.931219
8 12 10.388048
9 13 12.138104
10 14 14.106250
11 15 16.125688
The values are very different from the 90th quantile taken directly from the data:
> aggregate(yvar~age, mydata, function(x) quantile(x,.9))
age yvar
1 5 39.22938
2 6 35.69294
3 7 25.40390
4 8 26.20388
5 9 29.07670
6 10 32.43151
7 11 24.96861
8 12 37.98292
9 13 28.28686
10 14 43.33678
11 15 44.46269
See if this makes sense. The 90th percentile of a normal distribution with mean 'smn' and sd 'ssd' is qnorm(.9, smn, ssd). So this seems to deliver (somewhat) sensible results, albeit not the full hack of centiles that I suggested:
plot(h$xvar, qnorm(.9, fitted(h), h$sigma.fv))
(Note the massive overplotting from only a few distinct xvars but 500 points. And you may want to set the ylim so that the full range can be appreciated.)
The caveat here is that you need to check the other parts of the model to see if it is really just an ordinary Normal model. In this case it seems to be:
> h$mu.formula
y ~ pb(x)
<environment: 0x10275cfb8>
> h$sigma.formula
~1
<environment: 0x10275cfb8>
> h$nu.formula
NULL
> h$tau.formula
NULL
So the model is just a mean estimate with a fixed variance (the ~1) across the range of the xvar, and there are no complications from higher-order parameters like a Box-Cox model. (And I'm unable to explain why this is not the same as the plotted centiles. For that you probably need to correspond with the package authors.)
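Given that the model reduces to a plain Normal, a sketch of getting the smoothed 90th centile at each integer age with predictAll() (assuming the lms fit h and mydata from above) would be:
nd <- data.frame(age = 5:15)
p <- predictAll(h, newdata = nd, data = mydata)
cbind(age = 5:15, p90 = qnorm(0.9, p$mu, p$sigma))  # fitted 90th centile per age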

R Survival Curve Plot Legend

I have a table that looks like this:
ID Survival Event Allele
2 5 1 WildType
2 0 1 WildType
3 3 1 WildType
4 38 0 Variant
I want to do a Kaplan-Meier plot, and to determine whether wild type or variant alleles tend to survive longer.
I have this code:
library(survival)
Table <- read.table("Table1", header = TRUE)
fit <- survfit(Surv(Table$Survival, Table$Event) ~ Table$Allele)
plot(fit, lty = 2:3, col = 3:4)
From the p-value, I can see that these two groups have significantly different survival curves.
survdiff(formula = Surv(dat$Death, dat$Event) ~ dat$Allele, rho = 0)
# N Observed Expected (O-E)^2/E (O-E)^2/V
# dat$Allele=Variant 5592 3400 3503 3.00 8.63
# dat$Allele=WildType 3232 2056 1953 5.39 8.63
# Chisq= 8.6 on 1 degrees of freedom, p= 0.0033
The plot looks as expected (i.e. two curves).
All I want to do is put a legend on the plot so that I can see which data are represented by each of the two lines, i.e. whether the Wild Type or the Variant survives longer.
I have tried these two commands:
lab <-gsub("x=","",names(fit$strata))
legend("top",legend=lab,col=3:4,lty=2:3,horiz=FALSE,bty='n')
The first command works (i.e. I get no error). With the second command, I get this error:
Error in strwidth(legend, units = "user", cex = cex, font = text.font) :
plot.new has not been called yet
I've tried reading forums etc., but none of the answers seem to work for me (for example, changing between top/topright/topleft etc. doesn't matter).
Edit 1: This is an example of a table for which I get this error:
ID Survival Event Allele
25808 5 1 WTHomo
22196 0 1 Variant
22518 3 1 Variant
25013 38 0 Variant
27354 5 1 Variant
27223 4 1 Variant
22700 5 1 Variant
22390 24 1 Variant
17586 1 1 Variant
What exactly happens is: when I type the very last command (legend("top", legend=lab, col=3:4, lty=2:3, horiz=FALSE, bty='n')), the X11 window opens, except it's completely blank.
But then if I just type plot(fit, lty=2:3, col=3:4), the X11 window and the plot appear.
Edit 2: Also, this graph will have two lines; how do I tell which line is which variable? Would the easiest way be to type summary(fit), which gives me two tables, and then put whichever variable comes first in the tables first in the legend?
Many thanks
Eva
You can also do this using ggsurvplot() from survminer.
Here is an example
library(survminer)  # contains ggsurvplot()
library(survival)   # contains survfit()
ggsurvplot(
  fit = survfit(Surv(time, censor) ~ Allele, data = your_data,
                type = "kaplan-meier"),    # the model
  xlab = "Years",
  ylab = "Overall survival probability",
  legend.labs = c("WildType", "Variant"), # names shown for the groups in the plot
                                          # (check they match the alphabetical strata order)
  conf.int = TRUE,     # adds a 95% confidence interval
  pval = TRUE,         # displays the p-value in the plot
  pval.method = TRUE   # shows the statistical method used for obtaining the p-value
)
I too have had repeated problems with the "plot.new has not been called yet" error! Strangely, the error was intermittent, and repeating the identical commands did not always reproduce it! In my case, I found that preceding the plotting command with
plot.new()
stopped the error from appearing! I have no idea why. Just as an aside, I also had no problem adding a legend to the survival plot using your command.
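For completeness, a minimal sketch that avoids the error with the data from the question: draw the plot first, then add the legend (using the data= form so the strata names are clean, which also addresses Edit 2, since the strata order matches the lty/col order):
library(survival)
fit <- survfit(Surv(Survival, Event) ~ Allele, data = Table)
plot(fit, lty = 2:3, col = 3:4)  # the plot must exist before legend() is called
# strata names ("Allele=Variant", "Allele=WildType") are in the same order
# as the lty and col vectors
legend("top", legend = names(fit$strata), col = 3:4, lty = 2:3, bty = "n")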
