creating multi-panel ROC curve plots with lattice - r

I would like to create a plot that contains two panels, each containing two ROC curves. To introduce my (failing) approach, I generate a data frame containing the true labels and the predictions for four methods (each method corresponds to one ROC curve):
N <- 20
TF <- rep(c(0,1),each=N/2)
pred <- method <- true <- NULL
for (imethod in 1 : 4){
pred <- c(pred,seq(-1,1,length.out=N) + rnorm(N) )
method <- c(method,rep(imethod,N))
true <- c(true,TF)
}
dat.roc <-
data.frame(true=true,pred=pred,method=method,panel=rep(1:2,each=length(method)/2))
xyplot(true ~ pred|panel, data=dat.roc,groups=method,
xlim=c(0,1),xlab="1-specificity",
ylab="sensitivity",
panel=function(x,y,...){
DD <- table(-x,y)
sens <- cumsum(DD[,2])/sum(DD[,2])
mspec <- cumsum(DD[,1])/sum(DD[,1])
panel.xyplot(mspec,sens,type="l",...)
panel.abline(0,1)
})
The plot has two panels, each of which has only ONE ROC curve (with two colors)! How can I correctly specify lattice to return two ROC curves in each panel?

Since you're using cumsum in your panel function, you want to make sure that you are computing a separate curve for each group, not just for each panel. One way to do this is to use the panel.superpose panel function together with a panel.groups function. So you would change your code to
xyplot(true ~ pred|panel, data=dat.roc,groups=method,
xlim=c(0,1),xlab="1-specificity",
ylab="sensitivity",
panel=panel.superpose,
panel.groups=function(x,y,type, ...){
DD <- table(-x,y)
sens <- cumsum(DD[,2])/sum(DD[,2])
mspec <- cumsum(DD[,1])/sum(DD[,1])
panel.xyplot(mspec,sens,type="l",...)
panel.abline(0,1)
})
which produces a plot with two ROC curves in each panel.
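To see what the panel.groups function computes for a single method, here is a minimal base R sketch. The helper name roc_points is hypothetical (it is not part of lattice); it only repeats the table/cumsum logic from the answer on one method's predictions.

```r
# Hypothetical helper: ROC coordinates for ONE method, mirroring the panel code.
roc_points <- function(pred, true) {
  # Negating pred orders the table rows from highest to lowest prediction.
  DD <- table(-pred, true)
  data.frame(
    mspec = unname(cumsum(DD[, 1]) / sum(DD[, 1])),  # 1 - specificity (true == 0)
    sens  = unname(cumsum(DD[, 2]) / sum(DD[, 2]))   # sensitivity (true == 1)
  )
}

set.seed(1)
N <- 20
true <- rep(c(0, 1), each = N / 2)
pred <- seq(-1, 1, length.out = N) + rnorm(N)
roc <- roc_points(pred, true)
head(roc)
```

If the predictions of two methods are passed in together, the cumulative sums run over both at once, which is why the original panel function drew a single curve per panel.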

Related

How to add labels to original data given clustering result using hclust

Say I have some unlabeled data which I know should be clustered into six categories, for example this dataset:
library(tidyverse)
ts <- read_table(url("http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.data"), col_names = FALSE)
If I create an hclust object with a sample of 60 from the original dataset like so:
n <- 10
s <- sample(1:100, n)
idx <- c(s, 100+s, 200+s, 300+s, 400+s, 500+s)
ts.samp <- ts[idx,]
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
# compute DTW distances
library(dtw)#Dynamic Time Warping (DTW)
distMatrix <- dist(ts.samp, method= 'DTW')
# hierarchical clustering
hc <- hclust(distMatrix, method='average')
I know that I can then add the labels to the dendrogram for viewing like this:
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
plot(hc, labels=observedLabels, main="")
However, I would like to add the correct labels to the initial data frame that was clustered. So for ts.samp I would like to add an extra column with the label of the cluster that each observation has been assigned to.
It would seem that ts.samp$cluster <- hc$label should add the cluster to the data frame, however hc$label returns NULL.
Can anyone help with extracting this information?
You need to define a level at which you cut your dendrogram; this will form the groups.
Use:
labels <- cutree(hc, k = 3) # you set the number of k that's more appropriate, see how to read a dendrogram
ts.samp$grouping <- labels
Let's look at the dendrogram in order to find the best number for k:
plot(hc, main="")
abline(h=500, col = "red") # cut at height 500 forms 2 groups
abline(h=300, col = "blue") # cut at height 300 forms 3/4 groups
It looks like either 2 or 3 might be good. You need to find the largest jump in the vertical lines (Height).
Draw a horizontal line at that height and count the clusters it forms.
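As a self-contained illustration of attaching cutree() output to the clustered data, here is a sketch on simulated data. The toy matrix, the Euclidean distance (instead of DTW) and the choice of three groups are all assumptions made for the example; with the real ts.samp you would pass the k you expect, for instance 6.

```r
set.seed(42)
# Toy stand-in for ts.samp: 3 well-separated groups of 10 series, 5 time points each.
centers <- rep(c(0, 10, 20), each = 10)
toy <- matrix(rnorm(30 * 5, mean = centers), nrow = 30)
observed <- rep(1:3, each = 10)

# Euclidean distance here; the question uses DTW distances from the dtw package.
hc_toy <- hclust(dist(toy), method = "average")

# cutree() returns one cluster id per row, in the original row order.
toy_df <- as.data.frame(toy)
toy_df$cluster <- cutree(hc_toy, k = 3)

# Cross-tabulate the known labels against the assigned clusters.
table(observed, cluster = toy_df$cluster)
```

Because cutree() keeps the row order of the input, the result can be assigned directly as a new column.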

R plotting multiple survival curves in the same plot

I am trying to plot multiple survival curves in the same plot. Using plot I can easily do this by
plot(sr_fit_0, col = 'red' , conf.int=TRUE, xlim=c(0, max_m))
par(new=TRUE)
plot(sr_fit_1, col ='blue', conf.int=TRUE, xlim=c(0, max_m))
But now I want to use ggsurv to plot the survival curves, and I don't know how to have both of them in the same plot (not subplots). Any help is appreciated.
Below I generated some data for the lifetimes of hamsters and gerbils. You can use the survfit() function like other model-fitting functions and define a data frame column that splits the population. When you create the plot with ggsurv(), I think it will display what you are looking for.
## Make some data for varmint life
set.seed(1); l1 <- rnorm(120, 2.5, 1)
gerbils <- data.frame(life = l1[l1>0])
set.seed(3); l2 <- rnorm(120, 3, 1)
hamsters <- data.frame(life = l2[l2>0])
## Load required packages
require('survival'); require('GGally')
## Generate fits for survival curves
## (Note that Surv(x) creates a Survival Object)
sf.gerbils <- survfit(Surv(life) ~ 1, data = gerbils)
sf.hamsters <- survfit(Surv(life) ~ 1, data = hamsters)
ggsurv(sf.gerbils) #Survival plot for gerbils
ggsurv(sf.hamsters) #Survival plot for hamsters
## Combine gerbils and hamsters while adding column for identification
varmints <- rbind((cbind(gerbils, type = 'gerbil')),
(cbind(hamsters, type = 'hamster')))
## Generate survival for fit for all varmints as a function of type
sf.varmints <- survfit(Surv(life) ~ type, data = varmints)
## Plot the survival curves on one chart
ggsurv(sf.varmints)

R superimposing bivariate normal density (ellipses) on scatter plot

There are similar questions on the website, but I could not find an answer to this seemingly very simple problem. I fit a mixture of two gaussians on the Old Faithful Dataset:
if(!require("mixtools")) { install.packages("mixtools"); require("mixtools") }
data_f <- faithful
plot(data_f$waiting, data_f$eruptions)
data_f.k2 = mvnormalmixEM(as.matrix(data_f), k=2, maxit=100, epsilon=0.01)
data_f.k2$mu # estimated mean coordinates for the 2 multivariate Gaussians
data_f.k2$sigma # estimated covariance matrix
I simply want to super-impose two ellipses for the two Gaussian components of the model described by the mean vectors data_f.k2$mu and the covariance matrices data_f.k2$sigma. To get something like:
For those interested, here is the MatLab solution that created the plot above.
If you are interested in the colors as well, you can use the posterior to get the appropriate groups. I did it with ggplot2, but first I show the colored solution using @Julian's code.
# group data for coloring
data_f$group <- factor(apply(data_f.k2$posterior, 1, which.max))
# plotting
plot(data_f$eruptions, data_f$waiting, col = data_f$group)
for (i in 1: length(data_f.k2$mu)) ellipse(data_f.k2$mu[[i]],data_f.k2$sigma[[i]], col=i)
And for my version using ggplot2.
# needs ggplot2 package
require("ggplot2")
# ellipse data
ell <- cbind(data.frame(group=factor(rep(1:length(data_f.k2$mu), each=250))),
do.call(rbind, mapply(ellipse, data_f.k2$mu, data_f.k2$sigma,
npoints=250, SIMPLIFY=FALSE)))
# plotting command
p <- ggplot(data_f, aes(color=group)) +
geom_point(aes(waiting, eruptions)) +
geom_path(data=ell, aes(x=`2`, y=`1`)) +
theme_bw(base_size=16)
print(p)
You can use the ellipse function from package mixtools. The initial problem was that this function swaps x and y relative to your plot. I'll try to figure this out and update the answer. (I'll leave the colors to somebody else...)
plot( data_f$eruptions,data_f$waiting)
for (i in 1: length(data_f.k2$mu)) ellipse(data_f.k2$mu[[i]],data_f.k2$sigma[[i]])
Using mixtools internal plotting function:
plot.mixEM(data_f.k2, whichplots=2)
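For readers who want to see how such an ellipse is obtained from a mean vector and a covariance matrix, here is a base R sketch. The helper ellipse_points is hypothetical (it is not the mixtools function), and the mu and sigma values are made-up illustration values, not the fitted estimates.

```r
# Hypothetical helper: points on the ellipse that contains a fraction `level`
# of a bivariate normal with mean mu and covariance sigma.
ellipse_points <- function(mu, sigma, level = 0.95, npoints = 250) {
  theta  <- seq(0, 2 * pi, length.out = npoints)
  circle <- rbind(cos(theta), sin(theta))   # unit circle, 2 x npoints
  r <- sqrt(qchisq(level, df = 2))          # radius in Mahalanobis units
  L <- t(chol(sigma))                       # lower Cholesky factor, L %*% t(L) == sigma
  t(mu + r * L %*% circle)                  # npoints x 2 matrix of (x, y) points
}

# Made-up values for illustration only.
mu    <- c(70, 3.5)
sigma <- matrix(c(180, 13, 13, 1.3), nrow = 2)
pts   <- ellipse_points(mu, sigma)

plot(pts, type = "l", xlab = "waiting", ylab = "eruptions")
points(mu[1], mu[2], pch = 3)
```

Every returned point has the same Mahalanobis distance from mu, which is what makes the curve a density contour.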

using R to plot interaction plot

I have created a model using the following data:
age hrs charges
530.6071 792.10 3474.60
408.6071 489.70 1247.06
108.0357 463.00 1697.07
106.6071 404.15 1676.33
669.4643 384.65 1701.13
556.4643 358.15 1630.30
665.4643 343.85 2468.83
508.4643 342.35 3366.44
106.0357 335.25 2876.82
interaction_model <- rlm( charges~age+hrs+age*hrs, age_vs_hrs_charges_cleaned);
Any idea how I can plot this in 3D?
I already plotted using
library(effects);
plot(effect(term="age:hrs", mod=interaction_model,default.levels=20),multiline=TRUE);
but this is not very clear visualization.
Any help?
There are several ways to do this.
model <- lm( charges~age+hrs+age*hrs, df)
# set up grid of (x,y) values
age <- seq(0,1000, by=20)
hrs <- seq(0,1000, by=20)
gg <- expand.grid(age=age, hrs=hrs)
# prediction from the linear model
gg$charges <-predict(model,newdata=gg)
# contour plot
library(ggplot2)
library(colorRamps)
library(grDevices)
jet.colors <- colorRampPalette(matlab.like(9))
ggplot(gg, aes(x=age, y=hrs, z=charges))+
stat_contour(aes(color=..level..),binwidth=200, size=2)+
scale_color_gradientn(colours=jet.colors(8))
# 3D scatterplot
library(scatterplot3d)
scatterplot3d(gg$age, gg$hrs, gg$charges)
# interactive 3D scatterplot (just a screen shot here)
library(rgl)
plot3d(gg$age,gg$hrs,gg$charges)
# interactive 3D surface plot with shading (screen shot)
colorzjet <- jet.colors(100)
open3d()
rgl.surface(x=age, z=hrs, y=0.05*gg$charges,
color=colorzjet[ findInterval(gg$charges, seq(min(gg$charges), max(gg$charges), length=100))] )
axes3d()
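If you would rather avoid extra packages, a static 3D surface can also be sketched with base R's persp(). This is only an illustration: it uses the nine rows from the question, an ordinary lm() fit rather than rlm(), and a 30 x 30 grid chosen arbitrarily.

```r
# Data copied from the question.
dat <- data.frame(
  age     = c(530.6071, 408.6071, 108.0357, 106.6071, 669.4643,
              556.4643, 665.4643, 508.4643, 106.0357),
  hrs     = c(792.10, 489.70, 463.00, 404.15, 384.65,
              358.15, 343.85, 342.35, 335.25),
  charges = c(3474.60, 1247.06, 1697.07, 1676.33, 1701.13,
              1630.30, 2468.83, 3366.44, 2876.82)
)
model <- lm(charges ~ age * hrs, data = dat)

# Grid restricted to the observed range to avoid extrapolating.
age <- seq(min(dat$age), max(dat$age), length.out = 30)
hrs <- seq(min(dat$hrs), max(dat$hrs), length.out = 30)

# outer() evaluates the prediction on every (age, hrs) pair.
z <- outer(age, hrs, function(a, h)
  predict(model, newdata = data.frame(age = a, hrs = h)))

persp(age, hrs, z, theta = 40, phi = 25,
      ticktype = "detailed", zlab = "charges")
```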
A little while ago I wrote a couple of functions to display the results of a (general) linear model, together with colour coded data points, in either 3D (interactive, using rgl) or 2D (using a contour plot) :
# plot predictions of a (general) linear model as a function of two explanatory variables as an image / contour plot
# together with the actual data points
# mean value is used for any other variables in the model
plotImage=function(model=NULL,plotx=NULL,ploty=NULL,plotPoints=T,plotContours=T,plotLegend=F,npp=1000,xlab=NULL,ylab=NULL,zlab=NULL,xlim=NULL,ylim=NULL,pch=16,cex=1.2,lwd=0.1,col.palette=NULL) {
n=npp
require(rockchalk)
require(aqfig)
require(colorRamps)
require(colorspace)
require(MASS)
mf=model.frame(model);emf=rockchalk::model.data(model)
if (is.null(xlab)) xlab=plotx
if (is.null(ylab)) ylab=ploty
if (is.null(zlab)) zlab=names(mf)[[1]]
if (is.null(col.palette)) col.palette=rev(rainbow_hcl(1000,c=100))
x=emf[,plotx];y=emf[,ploty];z=mf[,1]
if (is.null(xlim)) xlim=c(min(x)*0.95,max(x)*1.05)
if (is.null(ylim)) ylim=c(min(y)*0.95,max(y)*1.05)
preds=predictOMatic(model,predVals=c(plotx,ploty),n=npp,divider="seq")
zpred=matrix(preds[,"fit"],npp,npp)
zlim=c(min(c(preds$fit,z)),max(c(preds$fit,z)))
par(mai=c(1.2,1.2,0.5,1.2),fin=c(6.5,6))
graphics::image(x=seq(xlim[1],xlim[2],len=npp),y=seq(ylim[1],ylim[2],len=npp),z=zpred,xlab=xlab,ylab=ylab,col=col.palette,useRaster=T,xaxs="i",yaxs="i")
if (plotContours) graphics::contour(x=seq(xlim[1],xlim[2],len=npp),y=seq(ylim[1],ylim[2],len=npp),z=zpred,xlab=xlab,ylab=ylab,add=T,method="edge")
if (plotPoints) {cols1=col.palette[(z-zlim[1])*999/diff(zlim)+1]
pch1=rep(pch,length(n))
cols2=adjustcolor(cols1,offset=c(-0.3,-0.3,-0.3,1))
pch2=pch-15
points(c(rbind(x,x)),c(rbind(y,y)), cex=cex,col=c(rbind(cols1,cols2)),pch=c(rbind(pch1,pch2)),lwd=lwd) }
box()
if (plotLegend) vertical.image.legend(zlim=zlim,col=col.palette) # TO DO: add z axis label, maybe make legend a bit smaller?
}
# plot predictions of a (general) linear model as a function of two explanatory variables as an interactive 3D plot
# mean value is used for any other variables in the model
plotPlaneFancy=function(model=NULL,plotx1=NULL,plotx2=NULL,plotPoints=T,plotDroplines=T,npp=50,x1lab=NULL,x2lab=NULL,ylab=NULL,x1lim=NULL,x2lim=NULL,cex=1.5,col.palette=NULL,segcol="black",segalpha=0.5,interval="none",confcol="lightgrey",confalpha=0.4,pointsalpha=1,lit=T,outfile="graph.png",aspect=c(1,1,0.3),zoom=1,userMatrix=matrix(c(0.80,-0.60,0.022,0,0.23,0.34,0.91,0,-0.55,-0.72,0.41,0,0,0,0,1),ncol=4,byrow=T),windowRect=c(0,29,1920,1032)) { # or library(colorRamps);col.palette <- matlab.like(1000)
require(rockchalk)
require(rgl)
require(colorRamps)
require(colorspace)
require(MASS)
mf=model.frame(model);emf=rockchalk::model.data(model)
if (is.null(x1lab)) x1lab=plotx1
if (is.null(x2lab)) x2lab=plotx2
if (is.null(ylab)) ylab=names(mf)[[1]]
if (is.null(col.palette)) col.palette=rev(rainbow_hcl(1000,c=100))
x1=emf[,plotx1]
x2=emf[,plotx2]
y=mf[,1]
if (is.null(x1lim)) x1lim=c(min(x1),max(x1))
if (is.null(x2lim)) x2lim=c(min(x2),max(x2))
preds=predictOMatic(model,predVals=c(plotx1,plotx2),n=npp,divider="seq",interval=interval)
ylim=c(min(c(preds$fit,y)),max(c(preds$fit,y)))
open3d(zoom=zoom,userMatrix=userMatrix,windowRect=windowRect)
if (plotPoints) plot3d(x=x1,y=x2,z=y,type="s",col=col.palette[(y-min(y))*999/diff(range(y))+1],size=cex,aspect=aspect,xlab=x1lab,ylab=x2lab,zlab=ylab,lit=lit,alpha=pointsalpha)
if (!plotPoints) plot3d(x=x1,y=x2,z=y,type="n",col=col.palette[(y-min(y))*999/diff(range(y))+1],size=cex,aspect=aspect,xlab=x1lab,ylab=x2lab,zlab=ylab)
if ("lwr" %in% names(preds)) persp3d(x=unique(preds[,plotx1]),y=unique(preds[,plotx2]),z=matrix(preds[,"lwr"],npp,npp),color=confcol, alpha=confalpha, lit=lit, back="lines",add=TRUE)
ypred=matrix(preds[,"fit"],npp,npp)
cols=col.palette[(ypred-min(ypred))*999/diff(range(ypred))+1]
persp3d(x=unique(preds[,plotx1]),y=unique(preds[,plotx2]),z=ypred,color=cols, alpha=0.7, lit=lit, back="lines",add=TRUE)
if ("upr" %in% names(preds)) persp3d(x=unique(preds[,plotx1]),y=unique(preds[,plotx2]),z=matrix(preds[,"upr"],npp,npp),color=confcol, alpha=confalpha, lit=lit, back="lines",add=TRUE)
if (plotDroplines) segments3d(x=rep(x1,each=2),y=rep(x2,each=2),z=matrix(t(cbind(y,fitted(model))),nc=1),col=segcol,lty=2,alpha=segalpha)
if (!is.null(outfile)) rgl.snapshot(outfile, fmt="png", top=TRUE)
}
Here is what you get as output with your model :
data=data.frame(age=c(530.6071,408.6071,108.0357,106.6071,669.4643,556.4643,665.4643,508.4643,106.0357),
hrs=c(792.10,489.70,463.00,404.15,384.65,358.15,343.85,342.35,335.25),
charges=c(3474.60,1247.06,1697.07,1676.33,1701.13,1630.30,2468.83,3366.44,2876.82))
library(MASS)
fit1=rlm( charges~age+hrs+age*hrs, data)
plotPlaneFancy(fit1, plotx1 = "age", plotx2 = "hrs")
plotPlaneFancy(fit1, plotx1 = "age", plotx2 = "hrs",interval="confidence")
(or interval="prediction" to show 95% prediction intervals)
plotImage(fit1,plotx="age",ploty="hrs",plotContours=T,plotLegend=T)

Utilise Surv object in ggplot or lattice

Anyone knows how to take advantage of ggplot or lattice in doing survival analysis? It would be nice to do a trellis or facet-like survival graphs.
So in the end I played around and sort of found a solution for a Kaplan-Meier plot. I apologize for the messy code that moves the list elements into a data frame, but I couldn't figure out another way.
Note: It only works with two levels of strata. If anyone knows how I can use x<-length(stratum) to do this, please let me know (in Stata I could append to a macro; I am unsure how this works in R).
ggkm<-function(time,event,stratum) {
m2s<-Surv(time,as.numeric(event))
fit <- survfit(m2s ~ stratum)
# collect the survfit components in a data frame for ggplot
f <- data.frame(time = fit$time, surv = fit$surv,
strata = c(rep(names(fit$strata[1]),fit$strata[1]),
rep(names(fit$strata[2]),fit$strata[2])),
upper = fit$upper, lower = fit$lower)
# the trailing + keeps the layers attached to the ggplot() call
r <- ggplot (f, aes(x=time, y=surv, fill=strata, group=strata)) +
geom_line()+geom_ribbon(aes(ymin=lower,ymax=upper),alpha=0.3)
return(r)
}
I have been using the following code in lattice. The first function draws KM curves for one group and would typically be used as the panel.groups function, while the second adds the log-rank test p-value for the entire panel:
km.panel <- function(x,y,type,mark.time=T,...){
na.part <- is.na(x)|is.na(y)
x <- x[!na.part]
y <- y[!na.part]
if (length(x)==0) return()
fit <- survfit(Surv(x,y)~1)
if (mark.time){
cens <- which(fit$time %in% x[y==0])
panel.xyplot(fit$time[cens], fit$surv[cens], type="p",...)
}
panel.xyplot(c(0,fit$time), c(1,fit$surv),type="s",...)
}
logrank.panel <- function(x,y,subscripts,groups,...){
lr <- survdiff(Surv(x,y)~groups[subscripts])
otmp <- lr$obs
etmp <- lr$exp
df <- (sum(1 * (etmp > 0))) - 1
p <- 1 - pchisq(lr$chisq, df)
p.text <- paste("p=", signif(p, 2))
grid.text(p.text, 0.95, 0.05, just=c("right","bottom"))
panel.superpose(x=x,y=y,subscripts=subscripts,groups=groups,...)
}
The censoring indicator has to be 0-1 for this code to work. The usage would be along the following lines:
library(survival)
library(lattice)
library(grid)
data(colon) #built-in example data set
xyplot(status~time, data=colon, groups=rx, panel.groups=km.panel, panel=logrank.panel)
If you just use 'panel=panel.superpose' then you won't get the p-value.
I started out following almost exactly the approach you use in your updated answer. But the thing that's irritating about the survfit is that it only marks the changes, not each tick - e.g., it will give you 0 - 100%, 3 - 88% instead of 0 - 100%, 1 - 100%, 2 - 100%, 3 - 88%. If you feed that into ggplot, your lines will slope from 0 to 3, rather than remaining flat and dropping straight down at 3. That might be fine depending on your application and assumptions, but it's not the classic KM plot. This is how I handled the varying numbers of strata:
groupvec <- c()
for(i in seq_along(x$strata)){
groupvec <- append(groupvec, rep(x = names(x$strata[i]), times = x$strata[i]))
}
f$strata <- groupvec
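A possible alternative to the loop, and to the sloping lines mentioned above, is sketched here: rep() with a times vector labels any number of strata in one call, and geom_step() draws the curve as steps. This assumes the survival and ggplot2 packages are installed and uses the colon data set only as an example; note that the plotted curves start at the first event time rather than at time 0.

```r
library(survival)
library(ggplot2)

# Example fit with three strata (the rx treatment arms of the colon data).
fit <- survfit(Surv(time, status) ~ rx, data = colon)

# rep() with a `times` vector repeats each stratum name by its number of rows,
# so this works for any number of strata.
km <- data.frame(
  time   = fit$time,
  surv   = fit$surv,
  strata = rep(names(fit$strata), times = fit$strata)
)

# geom_step() keeps the curve flat between event times instead of interpolating.
p <- ggplot(km, aes(x = time, y = surv, colour = strata)) + geom_step()
print(p)
```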
For what it's worth, this is how I ended up doing it - but this isn't really a KM plot, either, because I'm not calculating out the KM estimate per se (although I have no censoring, so this is equivalent... I believe).
survcurv <- function(surv.time, group = NA) {
#Must be able to coerce surv.time and group to vectors
if(!is.vector(as.vector(surv.time)) | !is.vector(as.vector(group))) {stop("surv.time and group must be coercible to vectors.")}
#Make sure that the surv.time is numeric
if(!is.numeric(surv.time)) {stop("Survival times must be numeric.")}
#Group can be just about anything, but must be the same length as surv.time
if(length(surv.time) != length(group)) {stop("The vectors passed to the surv.time and group arguments must be of equal length.")}
#What is the maximum number of ticks recorded?
max.time <- max(surv.time)
#What is the number of groups in the data?
n.groups <- length(unique(group))
#Use the number of ticks (plus one for t = 0) times the number of groups to
#create an empty skeleton of the results.
curves <- data.frame(tick = rep(0:max.time, each = n.groups), group = NA, surv.prop = NA)
#Add the group names - R will reuse the vector so that equal numbers of rows
#are labeled with each group.
curves$group <- unique(group)
#For each row, calculate the number of survivors in group[i] at tick[i]
for(i in seq_len(nrow(curves))){
curves$surv.prop[i] <- sum(surv.time[group %in% curves$group[i]] > curves$tick[i]) /
length(surv.time[group %in% curves$group[i]])
}
#Return the results, ordered by group and tick - easier for humans to read.
return(curves[order(curves$group, curves$tick), ])
}
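To make the calculation inside the loop concrete, here is a tiny worked example for a single group with no censoring. The four survival times are made up for illustration.

```r
# Made-up survival times for one group, no censoring.
surv.time <- c(1, 2, 3, 3)
ticks <- 0:max(surv.time)

# Proportion of subjects still alive strictly after each tick.
surv.prop <- sapply(ticks, function(t) mean(surv.time > t))

data.frame(tick = ticks, surv.prop = surv.prop)
# surv.prop is 1.00, 0.75, 0.50, 0.00 for ticks 0, 1, 2, 3
```

Without censoring, this proportion coincides with the Kaplan-Meier estimate at each tick, which is the equivalence the answer relies on.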
