Different data in upper and lower panel of scatterplot matrix - r

I want to plot two different data sets in a scatterplot matrix.
I know that I can use upper.panel and lower.panel to differentiate the plot function. However, I don’t succeed in putting my data in a suitable format to harness this.
Assume I have two tissues (“brain” and “heart”) and four conditions (1–4). Now I can use e.g. pairs(data$heart) to get a scatterplot matrix for one of the data sets. Assume I have the following data:
conditions <- 1 : 4
noise <- rnorm(100)
data <- list(brain = sapply(conditions, function (x) noise + 0.1 * rnorm(100)),
heart = sapply(conditions, function (x) noise + 0.3 * rnorm(100)))
How do I get this into a format so that pairs(data, …) plots one data set above and one below the diagonal, as shown here (green = brain, violet = heart):
Just using
pairs(data, upper.panel = something, lower.panel = somethingElse)
Doesn’t work because that will plot all conditions versus all conditions without regard for different tissue – it essentially ignores the list, and the same when reordering the hierarchy (i.e. having data = (A=list(brain=…, heart=…), B=list(brain=…, heart=…), …)).

This is the best I seem to be able to do via passing arguments (dat is the data frame built in the first attempt further down):
foo.upper <- function(x,y,ind.upper,col.upper,ind.lower,col.lower,...){
points(x[ind.upper],y[ind.upper],col = col.upper,...)
}
foo.lower <- function(x,y,ind.lower,col.lower,ind.upper,col.upper,...){
points(x[ind.lower],y[ind.lower],col = col.lower,...)
}
pairs(dat[,-5],
lower.panel = foo.lower,
upper.panel = foo.upper,
ind.upper = dat$type == 'brain',
ind.lower = dat$type == 'heart',
col.upper = 'blue',
col.lower = 'red')
Note that each panel function needs all of the arguments. ... is a cruel mistress. If you include only the panel-specific arguments in each function, it appears to work, but you get lots and lots of warnings, because R passes the leftover arguments on to the regular plotting functions, which don't recognise them.
This was my quick first attempt, but it seems ugly:
dat <- as.data.frame(do.call(rbind,data))
dat$type <- rep(c('brain','heart'),each = 100)
foo.upper <- function(x,y,...){
points(x[dat$type == 'brain'],y[dat$type == 'brain'],col = 'red',...)
}
foo.lower <- function(x,y,...){
points(x[dat$type == 'heart'],y[dat$type == 'heart'],col = 'blue',...)
}
pairs(dat[,-5],lower.panel = foo.lower,upper.panel = foo.upper)
I'm abusing R's scoping in this first attempt in a somewhat ugly way: the panel functions look up dat in the global environment. (Of course, you could probably do this more cleanly in lattice, but you probably knew that.)
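A possible middle ground between the two versions above is to build each panel function with a closure, so that the row selection and colour travel with the function instead of going through ... or the global environment. This is only a sketch: make_panel is a hypothetical helper name, and the colours are arbitrary.

```r
set.seed(1)
conditions <- 1:4
noise <- rnorm(100)
data <- list(brain = sapply(conditions, function(x) noise + 0.1 * rnorm(100)),
             heart = sapply(conditions, function(x) noise + 0.3 * rnorm(100)))

# Stack both tissues; rows 1-100 are brain, rows 101-200 are heart
dat <- as.data.frame(rbind(data$brain, data$heart))
type <- rep(c("brain", "heart"), each = 100)

# make_panel() is a hypothetical helper: it returns a panel function that
# remembers which rows to draw and which colour to use.
make_panel <- function(keep, col) {
  force(keep); force(col)
  function(x, y, ...) points(x[keep], y[keep], col = col, ...)
}

pairs(dat,
      upper.panel = make_panel(type == "brain", "darkgreen"),
      lower.panel = make_panel(type == "heart", "purple"))
```

Because pairs computes the axis limits from all rows, both triangles end up on the same scale.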
The only other option I can think of is to design your own scatter plot matrix using layout, but that's probably quite a bit of work.
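For what it's worth, a bare-bones version of the layout idea could look roughly like the following. It is a sketch only: pairs_two is a hypothetical name, the panels have no axes, and each panel picks its own axis range.

```r
set.seed(1)
noise <- rnorm(100)
data <- list(brain = sapply(1:4, function(x) noise + 0.1 * rnorm(100)),
             heart = sapply(1:4, function(x) noise + 0.3 * rnorm(100)))

# pairs_two() is a hypothetical helper: 'upper' and 'lower' are numeric
# matrices with one column per condition.
pairs_two <- function(upper, lower,
                      col.upper = "darkgreen", col.lower = "purple") {
  k <- ncol(upper)
  stopifnot(ncol(lower) == k)
  layout(matrix(seq_len(k * k), nrow = k, byrow = TRUE))  # filled row by row
  old <- par(mar = c(1, 1, 1, 1))
  on.exit({ par(old); layout(1) })
  used <- matrix("diag", k, k)   # records which data set each cell shows
  for (i in seq_len(k)) for (j in seq_len(k)) {
    if (i == j) {
      plot.new(); box()
      text(0.5, 0.5, paste("cond", i))
    } else {
      src <- if (i < j) upper else lower
      plot(src[, j], src[, i], col = if (i < j) col.upper else col.lower,
           xlab = "", ylab = "", axes = FALSE)
      box()
      used[i, j] <- if (i < j) "upper" else "lower"
    }
  }
  invisible(used)
}

used <- pairs_two(data$brain, data$heart)
```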
Lattice Edit
Here's at least a start on a lattice solution. It should handle varying x,y axis ranges better, but I haven't tested that.
library(lattice)
dat <- do.call(rbind,data)
dat <- as.data.frame(dat)
dat$grp <- rep(letters[1:2],each = 100)
plower <- function(x,y,grp,...){
panel.xyplot(x[grp == 'a'],y[grp == 'a'],col = 'red',...)
}
pupper <- function(x,y,grp,...){
panel.xyplot(x[grp == 'b'],y[grp == 'b'],...)
}
splom(~dat[,1:4],
data = dat,
lower.panel = plower,
upper.panel = pupper,
grp = dat$grp)

Related

R-package beeswarm generates same x-coordinates

I am working on a script where I need to calculate the coordinates for a beeswarm plot without immediately plotting. When I use beeswarm, I get x-coordinates that aren't swarmed, and more or less the same value:
But if I generate the same plot again it swarms correctly:
And if I use dev.off() I again get no swarming:
The code I used:
n <- 250
df = data.frame(x = floor(runif(n, 0, 5)),
y = rnorm(n = n, mean = 500, sd = 100))
#Plot 1:
A = with(df, beeswarm(y ~ x, do.plot = F))
plot(x = A$x, y=A$y)
#Plot 2:
A = with(df, beeswarm(y ~ x, do.plot = F))
plot(x = A$x, y=A$y)
dev.off()
#Plot 3:
A = with(df, beeswarm(y ~ x, do.plot = F))
plot(x = A$x, y=A$y)
It seems to me like beeswarm uses something like the current plot parameters (or however it is called) to do the swarming and therefore chokes when a plot isn't showing. I have tried to play around with beeswarm parameters such as spacing, breaks, corral, corralWidth, priority, and xlim, but it does not make a difference. FYI: If do.plot is set to TRUE the x-coordinates are calculated correctly, but this is not helpful as I don't want to plot immediately.
Any tips or comments are greatly appreciated!
You're right; beeswarm uses the current plot parameters to calculate the amount of space to leave between points. It seems that setting "do.plot=FALSE" does not do what one would expect, and I'm not sure why I included this parameter.
If you want to control the parameters manually, you could use the functions swarmx or swarmy instead. These functions must be applied to each group separately, e.g.
dfsplitswarmed <- by(df, df$x, function(aa) swarmx(aa$x, aa$y, xsize = 0.075, ysize = 7.5, cex = 1, log = ""))
dfswarmed <- do.call(rbind, dfsplitswarmed)
plot(dfswarmed)
In this case, I set the xsize and ysize values based on what the function would default to for this particular data set. If you can find a set of xsize/ysize values that work for your data, this approach might work for you.
Otherwise, perhaps a simpler approach would be to leave do.plot=TRUE, and then discard the plots.
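One way to "discard the plots" is to draw into a null PDF device, which writes nothing to disk, and keep only the returned data frame. A minimal sketch, assuming the beeswarm package is installed; swarm_coords is a hypothetical wrapper name:

```r
library(beeswarm)

# swarm_coords() is a hypothetical wrapper: it plots into a throwaway
# device so that beeswarm has real plot parameters to work with.
swarm_coords <- function(formula, data, ...) {
  pdf(NULL)             # null PDF device, no file is created
  on.exit(dev.off())    # close it again, even if beeswarm fails
  beeswarm(formula, data = data, do.plot = TRUE, ...)
}

set.seed(1)
n <- 250
df <- data.frame(x = floor(runif(n, 0, 5)),
                 y = rnorm(n = n, mean = 500, sd = 100))
A <- swarm_coords(y ~ x, df)
plot(x = A$x, y = A$y)
```

The spacing is then computed for the size of the throwaway device, so it may look slightly different when replotted on a device with other dimensions.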

Add elements to a previous subplot within an active base R graphics device?

Let's say I generate 9 groups of data in a list data and plot them each with a for loop. I could use *apply here too, whichever you prefer.
data = list()
layout(mat = matrix(1:9, nrow = 3))
for(i in 1:9){
data[[i]] = rnorm(n = 100, mean = i, sd = 1)
plot(data[[i]])
}
After creating all the data, I want to decide which one is best:
best_data = which.min(sapply(data, sd))
Now I want to highlight that best data on the plot to distinguish it. Is there a plotting function that lets me go back to a specified sub-plot in the active device and add an element (maybe a title)?
I know I could make a second for loop: for loop 1 generates the data, then I assess which is best, then for loop 2 creates the plots, but this seems less efficient and more verbose.
Does such a plotting function exist for base R graphics?
@rawr's answer is simple and easy. But I thought I'd point out another option that allows you to select the "best" data set before you plot, in case you want more flexibility to plot the "best" data set differently from the rest.
For example:
# Create the data
data = lapply(1:9, function(i) rnorm(n = 100, mean = i, sd = 1))
par(mar=c(4,4,1,1))
layout(mat = matrix(1:9, nrow = 3))
rng = range(data)
# Select the data set with the lowest SD (computed once, before plotting)
best = which.min(sapply(data, sd))
# Plot each data set, highlighting the one with the lowest SD in red
invisible(lapply(1:9, function(i) {
plot(data[[i]], col=ifelse(best==i,"red","black"), pch=ifelse(best==i, 3, 1), ylim=rng)
}))
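As for going back to a panel that has already been drawn, one possibility is par(mfg = c(row, col)), which selects a figure in the grid. The sketch below assumes the grid is set up with par(mfcol = ...), which fills column by column like the layout call in the question; panel_position is a hypothetical helper. Margin text and boxes are safe to add this way, but points would use the coordinate system of the last plot drawn rather than that of the selected panel.

```r
set.seed(1)
nr <- 3; nc <- 3
par(mfcol = c(nr, nc), mar = c(4, 4, 2, 1))   # column-wise 3 x 3 grid

data <- vector("list", nr * nc)
for (i in seq_along(data)) {
  data[[i]] <- rnorm(n = 100, mean = i, sd = 1)
  plot(data[[i]])
}
best_data <- which.min(sapply(data, sd))

# panel_position() is a hypothetical helper: plot index -> c(row, col)
# for a grid that is filled column by column.
panel_position <- function(k, nr) c((k - 1) %% nr + 1, (k - 1) %/% nr + 1)

par(mfg = panel_position(best_data, nr))      # jump back to that panel
box("figure", col = "red", lwd = 2)
mtext("best", side = 3, line = 0.5, col = "red")
```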

How to color different groups in qqplot?

I'm plotting some Q-Q plots using the qqplot function. It's very convenient to use, except that I want to color the data points based on their IDs. For example:
library(qualityTools)
n=(rnorm(n=500, m=1, sd=1) )
id=c(rep(1,250),rep(2,250))
myData=data.frame(x=n,y=id)
qqPlot(myData$x, "normal",confbounds = FALSE)
So the plot looks like:
I need to color the dots based on their "id" values, for example blue for the ones with id=1, and red for the ones with id=2. I would greatly appreciate your help.
You can try setting col = myData$y. I'm not sure how the qqPlot function works from that package, but if you're not stuck with using that function, you can do this in base R.
Using base R functions, it would look something like this:
# The example data, as generated in the question
n <- rnorm(n=500, m=1, sd=1)
id <- c(rep(1,250), rep(2,250))
myData <- data.frame(x=n,y=id)
# The plot
qqnorm(myData$x, col = myData$y)
qqline(myData$x, lty = 2)
Not sure how helpful the colors will be due to the overplotting in this particular example.
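To get the specific colours asked for (blue for id 1, red for id 2), the ids can be used to index a colour vector. This works because qqnorm plots the points in the original data order, so the colour vector lines up with the rows. A small sketch; pal is just an illustrative name:

```r
set.seed(1)
myData <- data.frame(x = rnorm(n = 500, mean = 1, sd = 1),
                     y = rep(1:2, each = 250))

pal <- c("blue", "red")                       # colour for id 1 and id 2
qq <- qqnorm(myData$x, col = pal[myData$y])   # returns the plotted x/y invisibly
qqline(myData$x, lty = 2)
legend("topleft", legend = paste("id", 1:2), col = pal, pch = 1)
```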
I haven't used qqPlot before, but if you want to use it, there is a way to achieve what you want. It looks like the function invisibly passes back the data used in the plot. That means we can do something like this:
# Use qqPlot - it generates a graph, but ignore that for now
plotData <- qqPlot(myData$x, "normal",confbounds = FALSE, col = sample(colors(), nrow(myData)))
# Given that you have the data generated, you can create your own plot instead ...
with(plotData, {
plot(x, y, col = ifelse(id == 1, "red", "blue"))
abline(int, slope)
})
Hope that helps.

superpose a histogram and an xyplot

I'd like to superpose a histogram and an xyplot representing the cumulative distribution function using r's lattice package.
I've tried to accomplish this with custom panel functions, but can't seem to get it right--I'm getting hung up on one plot being univariate and one being bivariate I think.
Here's an example with the two plots I want stacked vertically:
set.seed(1)
x <- rnorm(100, 0, 1)
discrete.cdf <- function(x, decreasing=FALSE){
x <- x[order(x,decreasing=FALSE)]
result <- data.frame(rank=1:length(x),x=x)
result$cdf <- result$rank/nrow(result)
return(result)
}
my.df <- discrete.cdf(x)
chart.hist <- histogram(~x, data=my.df, xlab="")
chart.cdf <- xyplot(100*cdf~x, data=my.df, type="s",
ylab="Cumulative Percent of Total")
graphics.off()
trellis.device(width = 6, height = 8)
print(chart.hist, split = c(1,1,1,2), more = TRUE)
print(chart.cdf, split = c(1,2,1,2))
I'd like these superposed in the same frame, rather than stacked.
The following code doesn't work, nor do any of the simple variations of it that I have tried:
xyplot(cdf~x,data=cdf,
panel=function(...){
panel.xyplot(...)
panel.histogram(~x)
})
You were on the right track with your custom panel function. The trick is passing the correct arguments to the panel.* functions. For panel.histogram, this means not passing a formula and supplying an appropriate value to the breaks argument:
EDIT Proper percent values on y-axis and type of plots
xyplot(100*cdf~x,data=my.df,
panel=function(...){
panel.histogram(..., breaks = do.breaks(range(x), nint = 8),
type = "percent")
panel.xyplot(..., type = "s")
})
This answer is just a placeholder until a better answer comes.
The hist() function from the graphics package has an option called add. The following does what you want in the "classical" way:
plot( my.df$x, my.df$cdf * 100, type= "l" )
hist( my.df$x, add= T )
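In the example above the histogram is drawn in counts, which only matches the percent scale because there happen to be exactly 100 observations. A sketch of the same base-graphics idea with the bar heights converted to percent explicitly (overwriting h$counts is a shortcut used here for illustration):

```r
set.seed(1)
x <- rnorm(100, 0, 1)

h <- hist(x, plot = FALSE)                    # compute the bins only
h$counts <- 100 * h$counts / sum(h$counts)    # counts -> percent of total

plot(h, freq = TRUE, ylim = c(0, 100), main = "",
     xlab = "x", ylab = "Percent of total")
# Empirical CDF in percent, drawn as a step function on top
lines(sort(x), 100 * seq_along(x) / length(x), type = "s")
```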

Utilise Surv object in ggplot or lattice

Does anyone know how to take advantage of ggplot or lattice in doing survival analysis? It would be nice to do trellis or facet-like survival graphs.
So in the end I played around and sort of found a solution for a Kaplan-Meier plot. I apologize for the messy code in taking the list elements into a data frame, but I couldn't figure out another way.
Note: It only works with two levels of strata. If anyone knows how I can use x<-length(stratum) to do this, please let me know (in Stata I could append to a macro; unsure how this works in R).
library(survival)
library(ggplot2)
ggkm <- function(time, event, stratum) {
  m2s <- Surv(time, as.numeric(event))
  fit <- survfit(m2s ~ stratum)
  # Collect the survfit components in a data frame for ggplot
  f <- data.frame(time = fit$time, surv = fit$surv,
                  upper = fit$upper, lower = fit$lower)
  f$strata <- c(rep(names(fit$strata[1]), fit$strata[1]),
                rep(names(fit$strata[2]), fit$strata[2]))
  # The trailing "+" keeps all layers in a single expression
  r <- ggplot(f, aes(x = time, y = surv, fill = strata, group = strata)) +
    geom_line() + geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3)
  return(r)
}
I have been using the following code in lattice. The first function draws KM curves for one group and would typically be used as the panel.groups function, while the second adds the log-rank test p-value for the entire panel:
km.panel <- function(x,y,type,mark.time=T,...){
na.part <- is.na(x)|is.na(y)
x <- x[!na.part]
y <- y[!na.part]
if (length(x)==0) return()
fit <- survfit(Surv(x,y)~1)
if (mark.time){
cens <- which(fit$time %in% x[y==0])
panel.xyplot(fit$time[cens], fit$surv[cens], type="p",...)
}
panel.xyplot(c(0,fit$time), c(1,fit$surv),type="s",...)
}
logrank.panel <- function(x,y,subscripts,groups,...){
lr <- survdiff(Surv(x,y)~groups[subscripts])
otmp <- lr$obs
etmp <- lr$exp
df <- (sum(1 * (etmp > 0))) - 1
p <- 1 - pchisq(lr$chisq, df)
p.text <- paste("p=", signif(p, 2))
grid.text(p.text, 0.95, 0.05, just=c("right","bottom"))
panel.superpose(x=x,y=y,subscripts=subscripts,groups=groups,...)
}
The censoring indicator has to be 0-1 for this code to work. The usage would be along the following lines:
library(survival)
library(lattice)
library(grid)
data(colon) #built-in example data set
xyplot(status~time, data=colon, groups=rx, panel.groups=km.panel, panel=logrank.panel)
If you just use 'panel=panel.superpose' then you won't get the p-value.
I started out following almost exactly the approach you use in your updated answer. But the thing that's irritating about the survfit is that it only marks the changes, not each tick - e.g., it will give you 0 - 100%, 3 - 88% instead of 0 - 100%, 1 - 100%, 2 - 100%, 3 - 88%. If you feed that into ggplot, your lines will slope from 0 to 3, rather than remaining flat and dropping straight down at 3. That might be fine depending on your application and assumptions, but it's not the classic KM plot. This is how I handled the varying numbers of strata:
groupvec <- c()
for(i in seq_along(x$strata)){
groupvec <- append(groupvec, rep(x = names(x$strata[i]), times = x$strata[i]))
}
f$strata <- groupvec
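Since rep accepts a vector for times, the loop can probably be collapsed into a single call, which also handles any number of strata. A sketch using the colon data from the survival package; km_frame is a hypothetical helper name:

```r
library(survival)

# km_frame() is a hypothetical helper: it flattens a stratified survfit
# object into one data frame, with one row per event/censoring time.
km_frame <- function(fit) {
  data.frame(time   = fit$time,
             surv   = fit$surv,
             lower  = fit$lower,
             upper  = fit$upper,
             strata = rep(names(fit$strata), times = fit$strata))
}

fit <- survfit(Surv(time, status) ~ rx, data = colon)   # three treatment arms
f <- km_frame(fit)
head(f)
```

If this data frame is fed to ggplot2, geom_step() rather than geom_line() should keep the curve flat between event times and drop vertically at each one, which addresses the sloping-line problem described above.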
For what it's worth, this is how I ended up doing it - but this isn't really a KM plot, either, because I'm not calculating out the KM estimate per se (although I have no censoring, so this is equivalent... I believe).
survcurv <- function(surv.time, group = NA) {
#Must be able to coerce surv.time and group to vectors
if(!is.vector(as.vector(surv.time)) | !is.vector(as.vector(group))) {stop("surv.time and group must be coercible to vectors.")}
#Make sure that the surv.time is numeric
if(!is.numeric(surv.time)) {stop("Survival times must be numeric.")}
#Group can be just about anything, but must be the same length as surv.time
if(length(surv.time) != length(group)) {stop("The vectors passed to the surv.time and group arguments must be of equal length.")}
#What is the maximum number of ticks recorded?
max.time <- max(surv.time)
#What is the number of groups in the data?
n.groups <- length(unique(group))
#Use the number of ticks (plus one for t = 0) times the number of groups to
#create an empty skeleton of the results.
curves <- data.frame(tick = rep(0:max.time, n.groups), group = NA, surv.prop = NA)
#Add the group names - each group is repeated once per tick, so that every
#tick/group combination appears exactly once.
curves$group <- rep(unique(group), each = max.time + 1)
#For each row, calculate the number of survivors in group[i] at tick[i]
for(i in seq_len(nrow(curves))){
curves$surv.prop[i] <- sum(surv.time[group %in% curves$group[i]] > curves$tick[i]) /
length(surv.time[group %in% curves$group[i]])
}
#Return the results, ordered by group and tick - easier for humans to read.
return(curves[order(curves$group, curves$tick), ])
}
