Multiple boxplots for multiple labels in one graph - r

I have something like the following:
x <- 1:5
y <- 2:6
A <- matrix(NA,nrow=100,ncol=5)
for(i in 1:5){A[,i] <- rnorm(100,x[i],y[i])}
B <- matrix(NA,nrow=100,ncol=5)
for(i in 1:5){B[,i] <- runif(100,min=x[i],max=y[i])}
The following command creates a boxplot for the 5 columns of matrix A:
boxplot(A[,1:5])
What I would like to do now is to have a boxplot like this, where each boxplot of a column of A is plotted next to a boxplot of the corresponding column of B. The boxplots should be directly next to each other, and between pairs of boxplots of the columns 1 to 5 there should be a small distance.
Thanks in advance!

Bind your matrices together columnwise, inserting NA columns:
C <- cbind(A[,1],B[,1])
for ( ii in 2:5 ) C <- cbind(C,NA,A[,ii],B[,ii])
(Yes, this is certainly not the most elegant way - but probably the simplest and easiest to understand.)
Then boxplot and add axis labels:
boxplot(C,xaxt="n")
axis(1,at=1+3*(0:4),labels=rep("A",5),tick=FALSE)
axis(1,at=2+3*(0:4),labels=rep("B",5),tick=FALSE)
axis(1,at=1.5+3*(0:4),labels=1:5,line=2,tick=FALSE)
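A variation on the same idea, offered as a sketch rather than a replacement: boxplot() also accepts an at argument, so the gaps can come from the box positions instead of from NA columns. The names AB and pos and the fill colours are illustrative choices that are not part of the answer above.

```r
set.seed(1)
x <- 1:5
y <- 2:6
A <- sapply(1:5, function(i) rnorm(100, x[i], y[i]))
B <- sapply(1:5, function(i) runif(100, min = x[i], max = y[i]))

# Interleave the columns: A1, B1, A2, B2, ...
AB <- cbind(A, B)[, c(rbind(1:5, 6:10))]

# Box positions 1,2, 4,5, 7,8, ... leave one empty slot between pairs
pos <- c(rbind(1 + 3 * (0:4), 2 + 3 * (0:4)))

boxplot(AB, at = pos, xaxt = "n", col = c("grey80", "white"))
axis(1, at = 1.5 + 3 * (0:4), labels = 1:5, tick = FALSE)
legend("topleft", fill = c("grey80", "white"), legend = c("A", "B"))
```

The fill colours are recycled across the boxes, so every A box is grey and every B box is white.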

An implementation with dplyr and tidyr:
# needed libraries
library(dplyr)
library(tidyr)
library(ggplot2)
# converting to dataframes
Aa <- as.data.frame(A)
Bb <- as.data.frame(B)
# melting the dataframes & creating a 'set' variable
mA <- Aa %>% gather(var,value) %>% mutate(set="A")
mB <- Bb %>% gather(var,value) %>% mutate(set="B")
# combining them into one dataframe
AB <- rbind(mA,mB)
# creating the plot
ggplot(AB, aes(x=var, y=value, fill=set)) +
geom_boxplot() +
theme_bw()
EDIT: To change the order of the boxes, you can use:
ggplot(AB, aes(x=var, y=value, fill=factor(set, levels=c("B","A")))) +
geom_boxplot() +
theme_bw()
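As a hedged aside: in newer tidyr releases gather() is superseded by pivot_longer(). A minimal sketch of the same reshaping, assuming tidyr 1.0.0 or later is installed; the helper to_long() is a made-up name for illustration, and the data are regenerated here so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
x <- 1:5
y <- 2:6
A <- sapply(1:5, function(i) rnorm(100, x[i], y[i]))
B <- sapply(1:5, function(i) runif(100, min = x[i], max = y[i]))

# Hypothetical helper: matrix -> long data frame with a 'set' label
to_long <- function(m, set) {
  as.data.frame(m) %>%
    pivot_longer(everything(), names_to = "var", values_to = "value") %>%
    mutate(set = set)
}

AB <- bind_rows(to_long(A, "A"), to_long(B, "B"))

p <- ggplot(AB, aes(x = var, y = value, fill = set)) +
  geom_boxplot() +
  theme_bw()
print(p)
```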

Related

ggplot with points and plots of certain columns

Let's say I have a data.frame of three columns:
x <- seq(1,10)
y <- 0.1*x^2
z <- y+rnorm(10,0,10)
d <- data.frame(x,y,z)
I now want a ggplot that plots the points (x,z) and somewhat smooth lines going through (x,y).
How can I achieve that?
`%>%` <- magrittr::`%>%`
d %>%
ggplot2::ggplot(ggplot2::aes(x=x)) +
ggplot2::geom_point(ggplot2::aes(y=z)) +
ggplot2::geom_smooth(ggplot2::aes(y=y))

How to find out if two cells in a dataframe belong to the same pre-specified factor-level

I really don't know how to title this question better, so please bear with me.
library(reshape)
library(ggplot2)
library(dplyr)
dist1 <- matrix(runif(16),4,4)
dist2 <- matrix(runif(16),4,4)
rownames(dist1) <- colnames(dist1) <- paste0("A",1:4)
rownames(dist2) <- colnames(dist2) <- paste0("A",1:4)
m1 <- melt(dist1)
m2 <- melt(dist2)
final <- full_join(m1,m2, by=c("Var1","Var2"))
ggplot(final, aes(value.x,value.y)) + geom_point()
To illustrate my problem: I have one matrix with ecological distances (m1) and one with genetic distances (m2) for a number of biological species. I have merged both matrices and want to plot the two distances against each other.
Here is the twist:
The biological species belong to certain groups, which are given in the dataframe species. I want to check whether an x,y pair (as in final$Var1, final$Var2) belongs to the same group of species (here "cat" or "dog"), and then color it accordingly.
So I need an R translation for:
species <- data.frame(spcs=as.character(paste0("A",1:4)),
grps=as.factor(c(rep("cat",2),(rep("dog",2)))))
final$group <- If (final$Var1,final$Var2) belongs to the same group as specified
in species, then assign the species group here, else do nothing or assign NA
so i can proceed with
ggplot(final, aes(value.x,value.y, col=group)) + geom_point()
Thank you very much!
Here's one approach that works. I made some changes to the code you provided. Full working example code given below.
library(reshape)
library(ggplot2)
library(dplyr)
dist1 <- matrix(runif(16), 4, 4)
dist2 <- matrix(runif(16), 4, 4)
rownames(dist1) <- colnames(dist1) <- paste0("A", 1:4)
rownames(dist2) <- colnames(dist2) <- paste0("A", 1:4)
m1 <- melt(dist1)
m2 <- melt(dist2)
# I changed the by= argument here
final <- full_join(m1, m2, by=c("X1", "X2"))
# I made some changes to keep spcs character and grps factor
species <- data.frame(spcs=paste0("A", 1:4),
grps=as.factor(c(rep("cat", 2), (rep("dog", 2)))), stringsAsFactors=FALSE)
# define new variables for final indicating group membership
final$g1 <- species$grps[match(final$X1, species$spcs)]
final$g2 <- species$grps[match(final$X2, species$spcs)]
final$group <- as.factor(with(final, ifelse(g1==g2, as.character(g1), "dif")))
# plot just the rows with matching groups
ggplot(final[final$group!="dif", ], aes(value.x, value.y, col=group)) +
geom_point()
# plot all the rows
ggplot(final, aes(value.x, value.y, col=group)) + geom_point()
One way to do this is to set up the species data frame with two columns that correspond to X1 and X2 in final, then merge based on those two columns:
species <- data.frame(X1=paste0("A",1:4),
X2=paste0("A",1:4),
grps=as.factor(c(rep("cat",2),(rep("dog",2)))))
final = merge(final, species, by=c("X1","X2"), all.x=TRUE)
Now you can plot the data using grps as the colour aesthetic:
ggplot(final, aes(value.x,value.y, colour=grps)) + geom_point()
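One caveat with a lookup table built this way: it lists only identical pairs (A1 with A1, and so on), so pairs of different species from the same group come out as NA after the merge. A possible sketch that lists every within-group pair by joining species to itself; final is rebuilt here with random placeholder values so the block runs on its own, and pair_groups is an illustrative name.

```r
library(ggplot2)

# Stand-in for the melted distance table (values are random placeholders)
set.seed(1)
final <- expand.grid(X1 = paste0("A", 1:4), X2 = paste0("A", 1:4),
                     stringsAsFactors = FALSE)
final$value.x <- runif(nrow(final))
final$value.y <- runif(nrow(final))

species <- data.frame(spcs = paste0("A", 1:4),
                      grps = c("cat", "cat", "dog", "dog"),
                      stringsAsFactors = FALSE)

# Join species to itself on grps: one row per within-group pair
pair_groups <- merge(species, species, by = "grps")
names(pair_groups) <- c("grps", "X1", "X2")

# Left join: pairs from different groups get grps = NA
final <- merge(final, pair_groups, by = c("X1", "X2"), all.x = TRUE)

p <- ggplot(final, aes(value.x, value.y, colour = grps)) + geom_point()
print(p)
```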

adding layer to a plot in R

Taking some generic data
A <- c(1997,2000,2000,1998,2000,1997,1997,1997)
B <- c(0,0,1,0,0,1,0,0)
df <- data.frame(A,B)
counts <- t(table(A,B))
frac <- counts[1,]/(counts[2,]+counts[1,])
C <- c(1998,2001,2000,1995,2000,1996,1998,1999)
D <- c(1,0,1,0,0,1,0,1)
df2 <- data.frame(C,D)
counts2 <- t(table(C,D))
frac2 <- counts2[1,]/(counts2[2,]+counts2[1,])
If we then want to create a scatterplot of the two datasets on one scale, we can do:
plot(frac, pch=22)
points(frac2, pch=19)
But we see that we have two problems:
First, we want to put our year values (which appear as df$A and df2$C) along the x axis.
Second, we want the x axis to adjust its scale automatically when the second dataset is added.
A solution using ggplot2 or base R would be welcome.
ggplot will do the scaling for you. You can convert each frac vector to a data.frame and use it with ggplot:
library(ggplot2)
ggplot(data.frame(y=frac, x=names(frac)), aes(x, y)) +
geom_point(col="salmon") +
geom_point(data=data.frame(y=frac2, x=names(frac2)), aes(x, y), col="steelblue") +
theme_bw()
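Since the question also allows base R, here is a rough sketch of that route: the years are recovered from the names of frac and frac2, and the axis limits are computed from both series before plotting. The variable names yrs1 and yrs2 and the axis labels are illustrative.

```r
A <- c(1997, 2000, 2000, 1998, 2000, 1997, 1997, 1997)
B <- c(0, 0, 1, 0, 0, 1, 0, 0)
counts <- t(table(A, B))
frac <- counts[1, ] / (counts[2, ] + counts[1, ])

C <- c(1998, 2001, 2000, 1995, 2000, 1996, 1998, 1999)
D <- c(1, 0, 1, 0, 0, 1, 0, 1)
counts2 <- t(table(C, D))
frac2 <- counts2[1, ] / (counts2[2, ] + counts2[1, ])

# The table names hold the years as character strings
yrs1 <- as.numeric(names(frac))
yrs2 <- as.numeric(names(frac2))

# Set the limits from both series up front, because points() does not rescale
plot(yrs1, frac, pch = 22,
     xlim = range(yrs1, yrs2), ylim = range(frac, frac2),
     xlab = "year", ylab = "fraction")
points(yrs2, frac2, pch = 19)
```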

ggplot2 and cumsum()

I have a set of UNIX timestamps and URIs and I'm trying to plot the cumulative count of requests for each URI. I managed to do that for one URI at a time using a dummy column:
x.df$count <- apply(x.df,1,function(row) 1) # Create a dummy column for cumsum
x.df <- x.df[order(x.df$time, decreasing=FALSE),] # Sort
ggplot(x.df, aes(x=time, y=cumsum(count))) + geom_line()
However, that would make roughly 30 plots in my case.
ggplot2 does allow you to plot multiple lines into one plot (I copied this piece of code from here):
ggplot(data=test_data_long, aes(x=date, y=value, colour=variable)) +
geom_line()
The problem is that, this way, cumsum() would count on and on.
Does anybody have an idea?
Here's some test data. It uses plyr's ddply together with transform to calculate the cumulative sum within each group first, and then plots that data with ggplot2:
set.seed(45)
DF <- data.frame(grp = factor(rep(1:5, each=10)), x=rep(1:10, 5))
DF <- transform(DF, y=runif(nrow(DF)))
# use plyr to calculate cumsum within each grp
require(plyr)
DF.t <- ddply(DF, .(grp), transform, cy = cumsum(y))
# plot
require(ggplot2)
ggplot(DF.t, aes(x=x, y=cy, colour=grp, group=grp)) + geom_line()
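Closer to the setup in the question, a per-URI running count can also be computed with base R's ave(), with no dummy column and no plyr. This is only a sketch: the x.df below is made-up data with hypothetical time and uri columns, and cum_count is an illustrative name.

```r
library(ggplot2)

# Made-up request log: timestamps and three hypothetical URIs
set.seed(45)
x.df <- data.frame(time = sample(1e5, 60),
                   uri = sample(c("/a", "/b", "/c"), 60, replace = TRUE),
                   stringsAsFactors = FALSE)

# Sort by time, then count up within each URI
x.df <- x.df[order(x.df$time), ]
x.df$cum_count <- ave(rep(1, nrow(x.df)), x.df$uri, FUN = cumsum)

p <- ggplot(x.df, aes(x = time, y = cum_count, colour = uri)) + geom_step()
print(p)
```

Because the count restarts for each URI, each line ends at the total number of requests for that URI.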

Stacked histogram from already summarized counts using ggplot2

I would like some help coloring a ggplot2 histogram generated from already-summarized count data.
The data are something like counts of # males and # females living in a number of different areas. It's easy enough to plot the histogram for the total counts (i.e. males + females):
set.seed(1)
N=100;
X=data.frame(C1=rnbinom(N,15,0.1), C2=rnbinom(N,15,0.1),C=rep(0,N));
X$C=X$C1+X$C2;
ggplot(X,aes(x=C)) + geom_histogram()
However, I'd like to color each bar according to the relative contribution from C1 and C2, so that I get the same histogram (i.e. overall bar heights) as in the above example, plus I see the proportion of type "C1" and "C2" individuals as in a stacked bar chart.
Suggestions for a clean way to do this with ggplot2, using data like "X" in the example?
Very quickly, you can do what the OP wants using the stat="identity" option and the plyr package to manually calculate the histogram, like so:
library(plyr)
X$mid <- floor(X$C/20)*20+10
X_plot <- ddply(X, .(mid), summarize, total=length(C), split=sum(C1)/sum(C)*length(C))
ggplot(data=X_plot) + geom_histogram(aes(x=mid, y=total), fill="blue", stat="identity") + geom_histogram(aes(x=mid, y=split), fill="deeppink", stat="identity")
We basically just make a 'mid' column that says where to place each bar, and then draw two layers: one with the count for the total (C) and one with the bars scaled down to the share contributed by one of the columns (C1). You should be able to customize from here.
Update 1: I realized I made a small error in calculating the mids. Fixed now. Also, I don't know why I used a 'ddply' statement to calculate the mids. That was silly. The new code is clearer and more concise.
Update 2: I returned to view a comment and noticed something slightly horrifying: I was using sums as the histogram frequencies. I have cleaned up the code a little and also added suggestions from the comments concerning the coloring syntax.
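The same binning can also be drawn as a true stacked chart rather than two overlaid layers. The sketch below assumes a ggplot2 version that provides geom_col(); bins, part and n are illustrative names, and the bin width of 20 follows the mid calculation above.

```r
library(ggplot2)

set.seed(1)
N <- 100
X <- data.frame(C1 = rnbinom(N, 15, 0.1), C2 = rnbinom(N, 15, 0.1))
X$C <- X$C1 + X$C2
X$mid <- floor(X$C / 20) * 20 + 10

# Per bin: number of rows, and the share of the bin's total due to C1
total <- tapply(X$C, X$mid, length)
share1 <- tapply(X$C1, X$mid, sum) / tapply(X$C, X$mid, sum)

# Long format: one row per bin and part, heights summing to the bin count
bins <- data.frame(mid = rep(as.numeric(names(total)), 2),
                   part = rep(c("C1", "C2"), each = length(total)),
                   n = as.vector(c(total * share1, total * (1 - share1))))

p <- ggplot(bins, aes(x = mid, y = n, fill = part)) + geom_col(width = 20)
print(p)
```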
Here's a hack using ggplot_build. The idea is to first get your old/original plot:
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
stored in p. Then use ggplot_build(p)$data[[1]] to extract the data, specifically the columns xmin and xmax (to get the same breaks/binwidths as the histogram) and the count column (to normalize the percentage by count). Here's the code:
# get old plot
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
# get data of old plot: cols = count, xmin and xmax
d <- ggplot_build(p)$data[[1]][c("count", "xmin", "xmax")]
# add an id column for ddply
d$id <- seq(nrow(d))
How to generate data now? What I understand from your post is this. Take for example the first bar in your plot. It has a count of 2 and it extends from xmin = 147 to xmax = 156.8. When we check X for these values:
X[X$C >= 147 & X$C <= 156.8, ] # count = 2 as shown below
# C1 C2 C
# 19 91 63 154
# 75 86 70 156
Here, I compute (91+86)/(154+156)*(count=2) = 1.141935 and (63+70)/(154+156) * (count=2) = 0.8580645 as the two normalised values for each bar we'll generate.
require(plyr)
dd <- ddply(d, .(id), function(x) {
t <- X[X$C >= x$xmin & X$C <= x$xmax, ]
if(nrow(t) == 0) return(c(0,0))
p <- colSums(t)[1:2]/colSums(t)[3] * x$count
})
# then it's just normal plotting
require(reshape2)
dd <- melt(dd, id.var="id")
ggplot(data = dd, aes(x=id, y=value)) +
geom_bar(aes(fill=variable), stat="identity", group=1)
And this is the original plot:
And this is what I get:
Edit: If you also want the breaks to be right, you can take the corresponding x coordinates from the old plot and use them here instead of id:
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
d <- ggplot_build(p)$data[[1]][c("count", "x", "xmin", "xmax")]
d$id <- seq(nrow(d))
require(plyr)
dd <- ddply(d, .(id), function(x) {
t <- X[X$C >= x$xmin & X$C <= x$xmax, ]
if(nrow(t) == 0) return(c(x$x,0,0))
p <- c(x=x$x, colSums(t)[1:2]/colSums(t)[3] * x$count)
})
require(reshape2)
dd.m <- melt(dd, id.var="V1", measure.var=c("V2", "V3"))
ggplot(data = dd.m, aes(x=V1, y=value)) +
geom_bar(aes(fill=variable), stat="identity", group=1)
How about:
library("reshape2")
mm <- melt(X[,1:2])
ggplot(mm,aes(x=value,fill=variable))+geom_histogram(position="stack")
