multiple histograms on top of each other without bins - r

Let's say I've got this data frame with 2 levels, LC and HC.
Now I want to get 2 plots like the one below, on top of each other.
data <- data.frame(
welltype=c("LC","LC","LC","LC","LC","HC","HC","HC","HC","HC"),
value=c(1,2,1,2,1,5,4,5,4,5))
The code to get the following plot is:
x <- rnorm(1000)
y <- hist(x)
plot(y$breaks,
c(y$counts,0),
type="s",col="blue")
(with thanks to Joris Meys)
So, how do I even start on this? Since I'm used to Java I was thinking of a for loop, but I've been told not to do it this way.

Besides the method provided by Aaron, there's a ggplot solution as well (see below),
but I would strongly advise you to use densities, as they give nicer plots and are a whole lot easier to construct:
# make data
wells <- c("LC","HC","BC")
Data <- data.frame(
welltype=rep(wells,each=100),
value=c(rnorm(100),rnorm(100,2),rnorm(100,3))
)
library(ggplot2)
ggplot(Data,aes(value,fill=welltype)) + geom_density(alpha=0.2)
gives :
For the plot you requested :
# make hists dataframe
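hists <- tapply(Data$value,Data$welltype,
function(i){
tmp <- hist(i, plot=FALSE) # compute the bins only, don't draw
data.frame(br=tmp$breaks,co=c(tmp$counts,0))
})
ll <- sapply(hists,nrow)
hists <- do.call(rbind,hists)
# tapply() returns the groups in sorted order, so label by its names, not by 'wells'
hists$fac <- rep(names(ll),ll)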
# make plot
require(ggplot2)
qplot(br,co,data=hists,geom="step",colour=fac)
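With a reasonably recent ggplot2 it should also be possible to skip the hand-built histogram data frame and let stat_bin() do the counting. This is only a sketch: the bin count of 15 is an arbitrary choice, and direction = "mid" (which centres each step on its bin) assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(1)
wells <- c("LC", "HC", "BC")
Data <- data.frame(
  welltype = rep(wells, each = 100),
  value = c(rnorm(100), rnorm(100, 2), rnorm(100, 3))
)

# stat_bin() counts per bin; geom = "step" draws only the outline.
# position = "identity" overlays the groups instead of stacking them.
p <- ggplot(Data, aes(value, colour = welltype)) +
  stat_bin(geom = "step", bins = 15, position = "identity", direction = "mid")
p
```

Because the binning happens inside ggplot2, all groups share the same bins, which is not guaranteed when hist() is called once per group.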

You can use the same code, but call points() instead of plot() to add further lines to an existing plot.
Making up some data:
set.seed(5)
d <- data.frame(x=c(rnorm(1000)+3, rnorm(1000)),
g=rep(1:2, each=1000) )
And doing it in a fairly straightforward way:
x1 <- d$x[d$g==1]
x2 <- d$x[d$g==2]
y1 <- hist(x1, plot=FALSE)
y2 <- hist(x2, plot=FALSE)
plot(y1$breaks, c(y1$counts,0), type="s",col="blue",
xlim=range(c(y1$breaks, y2$breaks)), ylim=range(c(0,y1$counts, y2$counts)))
points(y2$breaks, c(y2$counts,0), type="s", col="red")
Or in a more R-ish way:
col <- c("blue", "red")
ds <- split(d$x, d$g)
hs <- lapply(ds, hist, plot=FALSE)
plot(0,0,type="n",
ylim=range(c(0,unlist(lapply(hs, function(x) x$counts)))),
xlim=range(unlist(lapply(hs, function(x) x$breaks))) )
for(i in seq_along(hs)) {
points(hs[[i]]$breaks, c(hs[[i]]$counts,0), type="s", col=col[i])
}
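One caveat with calling hist() separately per group is that each group can end up with different break points. A minimal sketch of a variant that forces shared breaks follows; the helper name overlay_step_hist and its nbreaks and cols arguments are made up for illustration.

```r
# Hypothetical helper: step outlines of per-group histograms on shared breaks
overlay_step_hist <- function(x, g, nbreaks = 20, cols = NULL) {
  groups <- split(x, g)
  if (is.null(cols)) cols <- seq_along(groups)
  # Common break points covering the full range of the data
  breaks <- pretty(range(x), n = nbreaks)
  hs <- lapply(groups, hist, breaks = breaks, plot = FALSE)
  ymax <- max(unlist(lapply(hs, function(h) h$counts)))
  # Empty plot with the right limits, then one stepped line per group
  plot(range(breaks), c(0, ymax), type = "n", xlab = "x", ylab = "count")
  for (i in seq_along(hs)) {
    lines(breaks, c(hs[[i]]$counts, 0), type = "s", col = cols[i])
  }
  invisible(hs)
}

set.seed(5)
d <- data.frame(x = c(rnorm(1000) + 3, rnorm(1000)),
                g = rep(1:2, each = 1000))
hs <- overlay_step_hist(d$x, d$g, cols = c("blue", "red"))
```

The function returns the list of histogram objects invisibly, so the counts remain available for inspection after plotting.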
EDIT: Inspired by Joris's answer, I'll note that lattice can also easily do overlapping density plots.
library(lattice)
densityplot(~x, groups=g, data=d)

Related

How can I create a list of plots to be rendered with ggplot?

I am trying to construct a list of ggplot graphics, which will be plotted later. What I have so far, using Anscombe's quartet for an example, is:
library(ggplot2)
library(gridExtra)
base <- ggplot() + xlim(4,19)
plots = vector(mode = "list", length = 4)
for(i in 1:4) {
x <- anscombe[,i]
y <- anscombe[,i+4]
p <- geom_point(aes(x,y),colour="blue")
q <- geom_smooth(aes(x,y),method="lm",colour="red",fullrange=T)
plots[[i]] <- base+p+q
}
grid.arrange(grobs = plots,ncol=2)
As I travel through the loop, I want the current values of the plots p and q to be added with the base plot, into the i-th value of the list. That is, so that list element number i contains the plots relating to the i-th x and y columns from the dataset.
However, what happens is that only the last plot is drawn, four times. I've done something very similar with base R, using mfrow, plot and abline, so I believe my logic is correct but my implementation isn't. I suspect that the issue is with these lines:
plots = vector(mode = "list", length = 4)
plots[[i]] <- base+p+q
How can I create a list of ggplot graphics; starting with an empty list?
(If this is a trivial and stupid question, I apologise. I am very new both to R and to the Grammar of Graphics.)
The code works properly if lapply() is used instead of a for loop.
plots <- lapply(1:4, function(i) {
# create plot number i
})
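The reason for this issue is lazy evaluation: aes() stores the expressions x and y rather than their values, and they are only evaluated when the plot is rendered. In a for loop every iteration overwrites the same x and y, so by the time the plots are drawn all four refer to the values from i=4, and the last plot is displayed four times. With lapply(), each call of the function has its own environment, so each plot keeps its own x and y.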
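The same effect can be reproduced in base R without ggplot2, which may make it easier to see. This is only an analogy, not ggplot2's actual internals: functions created in a for loop all look up the same variable i later, while functions created inside lapply() each have their own copy.

```r
# Functions created in a for loop share the loop variable
fs_loop <- list()
for (i in 1:3) {
  fs_loop[[i]] <- function() i   # 'i' is looked up only when the function is called
}
loop_vals <- sapply(fs_loop, function(f) f())

# Functions created inside lapply() each get their own 'i'
fs_apply <- lapply(1:3, function(i) function() i)
apply_vals <- sapply(fs_apply, function(f) f())

loop_vals   # every function sees the final value of i
apply_vals  # every function sees its own value of i
```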
Full working example:
library(ggplot2)
library(gridExtra)
base <- ggplot() + xlim(4,19)
plots <- lapply(1:4, function(i) {
x <- anscombe[,i]
y <- anscombe[,i+4]
p <- geom_point(aes(x,y),colour="blue")
q <- geom_smooth(aes(x,y),method="lm",colour="red",fullrange=T)
base+p+q
})
grid.arrange(grobs = plots,ncol=2)
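To force evaluation there's a simple solution: change aes(...) into aes_(...), which stores the values when the plot is created, and your code works. Note that aes_() is soft-deprecated in recent ggplot2 versions.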
library(ggplot2)
library(gridExtra)
base <- ggplot() + xlim(4,19)
plots <- lapply(1:4, function(i) {
x <- anscombe[,i]
y <- anscombe[,i+4]
p <- geom_point(aes_(x,y),colour="blue")
q <- geom_smooth(aes_(x,y),method="lm",colour="red",fullrange=T)
base+p+q
})
grid.arrange(grobs = plots,ncol=2)
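If the for loop itself should be kept, another option is to give each plot its own data frame, so that nothing has to be looked up outside the plot at render time. A minimal sketch, assuming the same Anscombe columns as in the question; the explicit formula = y ~ x is only there to silence the smoothing message.

```r
library(ggplot2)

plots <- vector(mode = "list", length = 4)
for (i in 1:4) {
  # Store this iteration's columns inside the plot object itself
  dat <- data.frame(x = anscombe[, i], y = anscombe[, i + 4])
  plots[[i]] <- ggplot(dat, aes(x, y)) +
    geom_point(colour = "blue") +
    geom_smooth(method = "lm", formula = y ~ x, colour = "red", fullrange = TRUE) +
    xlim(4, 19)
}
# The list can then be drawn as in the question:
# gridExtra::grid.arrange(grobs = plots, ncol = 2)
```

Here aes(x, y) refers to columns of the data frame attached to each plot, so later iterations cannot change earlier plots.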

loop lm predictions for plotting multiple lines

I want to plot linear-model-lines for each ID.
How can I create predictions for multiple lms (or glms) using sequences of different length? I tried:
#some fake data
res<-runif(60,1,20)
var<-runif(60,10,50)
ID<-rep(c("A","B","B","C","C","C"),10)
data<- data.frame(ID,res,var)
#lm
library(data.table)
dt <- data.table(data,key="ID")
fits <- lapply(unique(data$ID),
function(z)lm(res~var, data=dt[J(z),], y=T))
#sequence for each ID of length var(ID)
mins<-matrix(with(data, tapply(var,ID,min)))
mins1<-mins[,1]
maxs<-matrix(with(data,tapply(var,ID,max)))
maxs1<-maxs[,1]
my_var<-list()
for(i in 1:3){
my_var[[i]]<- seq(from=mins1[[i]],to=maxs1[[i]],by=1)
}
# predict on sequences
predslist<- list()
predslist[[i]] <- for(i in 1:3){
dat<-fits[[i]]
predict(dat,newdata= data.frame("var"= my_var,type= "response", se=TRUE))
}
The predict step results in an error.
Plotting lm lines only for var[i] ranges works in ggplot:
library(ggplot2)
# create ID, x, y as coded by Matt
p <- qplot(x, y)
p + geom_smooth(aes(group=ID), method="lm", size=1, se=F)
Is something like this what you're after?
# generating some fake data
ID <- rep(letters[1:4],each=10)
x <- rnorm(40,mean=5,sd=10)
y <- as.numeric(as.factor(ID))*x + rnorm(40)
# plotting in base R
plot(x, y, col=as.factor(ID), pch=16)
# calling lm() and adding lines
lmlist <- lapply(sort(unique(ID)), function(i) lm(y[ID==i]~x[ID==i]))
for(i in 1:length(lmlist)) abline(lmlist[[i]], col=i)
Don't know if the plotting part is where you're stuck, but the abline() function will draw a least-squares line if you pass in an object returned from lm().
If you want the least-squares lines to begin & end with the min & max x values, here's a workaround. It's not pretty, but seems to work.
plot(x, y, col=as.factor(ID), pch=16)
IDnum <- as.numeric(as.factor(ID))
for(i in 1:length(lmlist)) lines(x[IDnum==i], predict(lmlist[[i]]), col=i)
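For the original goal of predicting over a separate sequence per ID, the following base R sketch may help. It assumes two things about the attempt in the question: that the result of a for loop cannot be assigned (a for loop always returns NULL), and that type and se were meant as arguments to predict() rather than columns of newdata. The names by_id, grids and preds are made up for illustration.

```r
set.seed(1)
# Fake data shaped like the question's: unequal group sizes
dat <- data.frame(ID  = rep(c("A", "B", "B", "C", "C", "C"), 10),
                  res = runif(60, 1, 20),
                  var = runif(60, 10, 50))

# One model and one prediction grid per ID
by_id <- split(dat, dat$ID)
fits  <- lapply(by_id, function(d) lm(res ~ var, data = d))
grids <- lapply(by_id, function(d)
  data.frame(var = seq(min(d$var), max(d$var), by = 1)))

# Map() pairs each fit with its own grid; se.fit is an argument of predict()
preds <- Map(function(fit, grid) predict(fit, newdata = grid, se.fit = TRUE),
             fits, grids)

# Draw each fitted line only over the range of its own group
plot(dat$var, dat$res, col = as.integer(factor(dat$ID)), pch = 16,
     xlab = "var", ylab = "res")
for (i in seq_along(preds)) {
  lines(grids[[i]]$var, preds[[i]]$fit, col = i)
}
```

Each element of preds also carries se.fit, which could be used to add confidence bands.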

Graphing a large number of plots

I'm fitting a dose-response curve to many data sets that I want to plot to a single file.
Here's what one data set looks like:
df <- data.frame(dose=c(10,0.625,2.5,0.156,0.0391,0.00244,0.00977,0.00061,10,0.625,2.5,0.156,0.0391,0.00244,0.00977,0.00061,10,0.625,2.5,0.156,0.0391,0.00244,0.00977,0.00061),viability=c(6.12,105,57.9,81.9,86.5,98.3,96.4,81.8,27.3,85.2,80.8,92,82.5,110,90.2,76.6,11.9,89,35.4,79,95.8,117,82.1,95.1),stringsAsFactors=F)
Here's the dose-response fit:
library(drc)
fit <- drm(viability~dose,data=df,fct=LL.4(names=c("Slope","Lower Limit","Upper Limit","ED50")))
Now I'm predicting values in order to plot the curve:
pred.df <- expand.grid(dose=exp(seq(log(max(df$dose)),log(min(df$dose)),length=100)))
pred <- predict(fit,newdata=pred.df,interval="confidence")
pred.df$viability <- pred[,1]
pred.df$viability.low <- pred[,2]
pred.df$viability.high <- pred[,3]
And this is what a single plot looks like:
library(ggplot2)
p <- ggplot(df,aes(x=dose,y=viability))+geom_point()+geom_ribbon(data=pred.df,aes(x=dose,y=viability,ymin=viability.low,ymax=viability.high),alpha=0.2)+labs(y="viability")+
geom_line(data=pred.df,aes(x=dose,y=viability))+coord_trans(x="log")+theme_bw()+scale_x_continuous(name="dose",breaks=sort(unique(df$dose)),labels=format(signif(sort(unique(df$dose)),3),scientific=T))+ggtitle(label="all doses")
Adding a few parameter estimates to the plot:
params <- signif(summary(fit)$coefficient[-1,1],3)
names(params) <- c("lower","upper","ed50")
p <- p + annotate("text",size=3,hjust=0,x=2.4e-3,y=5,label=paste(sapply(1:length(params),function(p) paste0(names(params)[p],"=",params[p])),collapse="\n"),colour="black")
Which gives:
Now suppose I have 20 of these that I want to cram in a single figure file.
I thought that a reasonable solution would be to use grid.arrange:
As an example I'll loop 20 times on this example data set:
plot.list <- vector(mode="list",20)
for(i in 1:20){
plot.list[[i]] <- ggplot(df,aes(x=dose,y=viability))+geom_point()+geom_ribbon(data=pred.df,aes(x=dose,y=viability,ymin=viability.low,ymax=viability.high),alpha=0.2)+labs(y="viability")+
geom_line(data=pred.df,aes(x=dose,y=viability))+coord_trans(x="log")+theme_bw()+scale_x_continuous(name="dose",breaks=sort(unique(df$dose)),labels=format(signif(sort(unique(df$dose)),3),scientific=T))+ggtitle(label="all doses")+
annotate("text",size=3,hjust=0,x=2.4e-3,y=5,label=paste(sapply(1:length(params),function(p) paste0(names(params)[p],"=",params[p])),collapse="\n"),colour="black")
}
And then plot using:
library(grid)
library(gridExtra)
grid.arrange(grobs=plot.list,ncol=3,nrow=ceiling(length(plot.list)/3))
Which is obviously poorly scaled. So my question is how to create this figure with better scaling, meaning that all objects are compressed proportionally in a way that produces a figure that is still visually interpretable.
You should set the device size so that the plots remain readable, e.g.
pl = replicate(11, qplot(1,1), simplify = FALSE)
g = arrangeGrob(grobs = pl, ncol=3)
ggsave("plots.pdf", g, width=15, height=20)
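One way to choose the device size is to fix a size per panel and scale by the grid dimensions. A small sketch follows; the helper grid_device_size and the default panel size of 5 by 4 inches are assumptions for illustration, not values from the answer above.

```r
# Hypothetical helper: device size (in inches) for n plots laid out in a grid
grid_device_size <- function(n, ncol = 3, panel_width = 5, panel_height = 4) {
  nrow <- ceiling(n / ncol)
  c(width = ncol * panel_width, height = nrow * panel_height)
}

size <- grid_device_size(20, ncol = 3)
size

# With the question's plot.list, the size could then be passed on like this:
# g <- gridExtra::arrangeGrob(grobs = plot.list, ncol = 3)
# ggplot2::ggsave("plots.pdf", g,
#                 width = size[["width"]], height = size[["height"]])
```

Keeping the per-panel size constant means text and points stay the same relative size no matter how many plots are arranged.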

How to put ggplot2 legend in two columns for an area plot

I would like to put a long legend into two columns and I am not having any success. Here's the code that I'm using with the solution found elsewhere which does not work for geom='area', though it works for my other plots. The plot that I do get from the code below looks like:
So how do I plot Q1 with the legend in two columns please?
NVER <- 10
NGRID <- 20
MAT <- matrix(NA, nrow=NVER, ncol=NGRID)
gsd <- 0.1 # standard deviation of the Gaussians
verlocs <- seq(from=0, to=1, length.out=NVER)
thegrid <- seq(from=0, to=1, length.out=NGRID)
# create a mixture of Gaussians with modes spaced evenly on 0 to 1
# i.e. the first mode is at 0 and the last mode is at 1
for (i in 1:NVER) {
# add the shape of gaussian i
MAT[i,] <- dnorm(thegrid, verlocs[[i]], sd=gsd)
}
M2 <- MAT/rowSums(MAT)
colnames(M2) <- as.character(thegrid)
# rownames(M2) <- as.character(verlocs)
library(reshape2)
D2 <- melt(M2)
# head(D2)
# str(D2)
D2$Var1 <- ordered(D2$Var1)
library(ggplot2)
Q1 <- qplot(Var2, value, data=D2, order=Var1, fill=Var1, geom='area')
Q1
# ggsave('sillyrainbow.png')
# now try the stackoverflow guide() solution
Q1 + guides(col=guide_legend(ncol=2)) # try but fail to put the legend in two columns!
Note that the solution from "creating columns within a legend list while using ggplot in R" is incorporated above, and unfortunately it does not work!
You are referring to the wrong guide. The plot maps Var1 to fill, not colour, so the legend to adjust is the fill guide:
Q1 + guides(fill=guide_legend(ncol=2))
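The general rule is that the name given to guides() has to match the aesthetic that produces the legend. A self-contained sketch with made-up data (the data frame d and its columns are hypothetical, not the question's D2):

```r
library(ggplot2)

set.seed(1)
# Hypothetical data: 10 groups observed at 20 grid points
d <- expand.grid(grid = seq(0, 1, length.out = 20), grp = factor(1:10))
d$value <- runif(nrow(d))

# The legend comes from the 'fill' aesthetic, so that is the guide to adjust
p <- ggplot(d, aes(grid, value, fill = grp)) +
  geom_area() +
  guides(fill = guide_legend(ncol = 2))
p
```

Had the groups been mapped with colour = grp instead, guides(colour = guide_legend(ncol = 2)) would be the matching call.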

Adjusting the limits of x and y axis, when adding new curves to a plot in R

I have two datasets (df1 and df2) that are plotted.
df1 = data.frame(x=c(1:10), y=c(1:10))
df2 = data.frame(x=c(0:13), y=c(0:13)^1.2)
# plot
plot(df1)
# add lines of another dataset
lines(df2)
Some values of df2 are out of the plot-range and thus not visible. (In this example I could just plot df2 first). I usually try to find out the ranges of my data, as shown below.
# manual solution
minX = min(df1$x, df2$x)
minY = min(df1$y, df2$y)
maxX = max(df1$x, df2$x)
maxY = max(df1$y, df2$y)
plot (df1, xlim=c(minX, maxX), ylim=c(minY, maxY))
lines(df2)
When having many datasets, this becomes annoying. I was wondering, if there is an easier way of adjusting the ranges of the axis.
In the first step R finds axis ranges itself. Is there also a way that R adjusts the axis-ranges, when new datasets are added?
You could use range to calculate the limits.
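A minimal sketch of that idea, using the two data frames from the question: range() accepts several vectors at once and returns the overall minimum and maximum, which is the form xlim and ylim expect.

```r
df1 <- data.frame(x = 1:10, y = 1:10)
df2 <- data.frame(x = 0:13, y = (0:13)^1.2)

# range() over all data sets gives c(min, max) in one call
xlim <- range(df1$x, df2$x)
ylim <- range(df1$y, df2$y)

plot(df1, xlim = xlim, ylim = ylim)
lines(df2)
```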
Imho, a better solution:
df1 <- data.frame(x=c(1:10), y=c(1:10))
df2 <- data.frame(x=c(0:13), y=c(0:13)^1.2)
ll <- list(df1,df2)
ll <- lapply(1:length(ll),function(i) {res <- ll[[i]]; res$grp <- i; res})
df <- do.call("rbind",ll)
df$grp <- factor(df$grp)
library(ggplot2)
p1 <- ggplot(df,aes(x=x,y=y,group=grp,col=grp)) + geom_line()
p1
I like @Roland's solution, but here is an extension of @Glen_b's solution that works for an arbitrary number of data sets, if you have them all in a list.
(warning: untested!)
dflist <- list(df1,df2,df3,...) ## dots are not literal!
plotline <- function(L,...) { ## here the dots are literal
## use them to specify (e.g.) xlab, ylab, other arguments to plot()
allX <- unlist(lapply(L,"[[","x"))
allY <- unlist(lapply(L,"[[","y"))
plot (L[[1]], xlim=range(allX), ylim=range(allY),type="n",...) ## use the list argument, not a global df1
invisible(lapply(L,lines))
}
This assumes that you want all the data sets drawn as lines.
If you want to start specifying separate colours, point types, etc., you could extend this function -- but you would be starting to re-invent the lattice and ggplot2 packages at that point.
(If all your data sets are the same size, you should consider matplot)
You could always write a function:
plotline <- function(df1,df2) {
minX = min(df1$x, df2$x)
minY = min(df1$y, df2$y)
maxX = max(df1$x, df2$x)
maxY = max(df1$y, df2$y)
plot (df1, xlim=c(minX, maxX), ylim=c(minY, maxY))
lines(df2)
}
Then you just do this:
plotline(firstdf,seconddf)
If you want to get fancy, you can even include the argument ... and pass it to the plot call.
Look at the matplot function: it will accept a matrix as x, y, or both and do all the automatic range calculations for you. If you have the data in multiple data frames, you can use sapply to extract the key pieces and form the matrices.
This approach is often even simpler than using the lines function multiple times:
df1 <- data.frame(x=1:10, y=1:10)
df2 <- data.frame(x=0:13, y=(0:13)^1.2)
df3 <- data.frame(x= -3:5, y= 5:(-3))
mylist <- list( df1, df2, df3 )
max.n <- max(sapply(mylist,nrow))
tmpfun <- function(df, which.col, n) {
tmp <- df[[which.col]]
c(tmp, rep(NA, n-length(tmp)))
}
matplot( sapply(mylist, tmpfun, which.col='x', n=max.n),
sapply(mylist, tmpfun, which.col='y', n=max.n), type='b' )
The above is even simpler if all the data frames have the same number of rows.
The other approach as mentioned in the comments is to combine the datasets into a single dataset and use tools like lattice graphics or ggplot2:
lengths <- sapply(mylist, nrow)
df.all <- do.call(rbind, mylist)
df.all$group <- rep( seq_along(lengths), lengths )
library(lattice)
xyplot( y~x, data=df.all, groups=group, type='b' )
library(ggplot2)
qplot(x,y, colour=factor(group), data=df.all, geom=c('point','path') )
If all else fails you can use the zoomplot function from the TeachingDemos package to change the limits of base graphics after the fact, but the above methods are much better.

Resources