r label plots with fractions - r

I would like to create 3 plots each containing a plot of 2 lines from different data frames, and then label each plot with a specific fraction.
So for example I have the 3 data frames:
df1 <- data.frame(x=c(1,2,3,4),y=c(2,3,4,5), z=c(3,3,6,8))
df2 <- data.frame(x=c(3,4,5,6),y=c(1,3,6,7), z=c(2,4,4,8))
df3 <- data.frame(x=c(1,2,2,3),y=c(2,5,6,9), z=c(2,5,6,7))
And I would like to:
1) Create 3 different plots for each data frame, each with one red and one blue line;
2) Add an annotation over the blue line of each plot using a different fraction for each plot.
For example the plot for data frame 1 is something like this:
p1 <- ggplot(data = df1) + geom_line(aes(x=x,y=y, colour="blue")) + geom_line(aes(x=x,y=z, colour="red")) + scale_colour_manual(name="data", values=c("red", "blue"))
Then to add the labels over the blue line I have tried:
p1 + geom_text(aes(x=df1$x[which.max(df1$y)]+1, y = max(df1$y)+4, label = "{\frac{23 22 22}{44 28 32}}", size=2, parse=TRUE))
But this does not work, and I have searched so many hours and cannot find how to use fractions (and brackets enclosing the fraction) in the annotations. Any help is deeply appreciated!
-fra

It is not clear exactly what you want, but here is an attempt:
I use mapply to loop over plots and fractions and generate a list of plots.
I create fractions using frac(x,y)
I set limits of plots using scale_y_continuous
I use gridExtra to arrange plots in the same plot (optional)
Here the complete code:
## a generic function that takes a data.frame and a fraction as inputs
## and generates a plot
library(ggplot2)
plot.frac <- function(dat,frac){
  p <- ggplot(dat) +
    geom_line(aes(x=x,y=y, colour="blue")) +
    geom_line(aes(x=x,y=z, colour="red")) +
    ## name the values so that each label gets the matching colour
    scale_colour_manual(name="data", values=c(blue="blue", red="red"))+
    geom_text(x=dat$x[which.max(dat$y)]-0.05, y = max(dat$y)+4,
              label = frac, size=5,parse=TRUE)+
    ## Note the use of limits here to display the annotation
    scale_y_continuous(limits = c(min(dat$y), max(dat$y)+5))
  p
}
## create a list of data.frames for mapply
df.list <- list(df1,df2,df3)
## ggplot2 uses plotmath, so for a fraction you use frac(x,y)
## here I construct the 2 terms using paste
frac.func <- function(num,den) paste('frac("',num,'","',den,'")',sep='')
num1 <- "line1:23 22 22"
den1 <- "line2: 44 28 32"
num2 <- "line1:23 50 22"
den2 <- "line2: 44 50 32"
num3 <- "line1:23 80 22"
den3 <- "line2: 44 80 32"
## create a list of fractions for mapply
frac.list <- list(frac.func(num1,den1),
frac.func(num2,den2),
frac.func(num3,den3))
## use mapply to call the plot over the 2 lists of data.frame and fractions
ll <- mapply(plot.frac,df.list,frac.list,SIMPLIFY=FALSE)
library(gridExtra)
do.call(grid.arrange,ll)
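The question also asked for brackets around the fraction. A minimal sketch, assuming plotmath's bgroup() accepts "{" and "}" as delimiters (as listed in ?plotmath); braced_frac is a hypothetical helper name, and annotate() is swapped in for geom_text() so the label is drawn only once:

```r
library(ggplot2)

# Hypothetical helper: wrap a plotmath fraction in curly braces via bgroup()
braced_frac <- function(num, den) {
  sprintf('bgroup("{", frac("%s", "%s"), "}")', num, den)
}

lab <- braced_frac("23 22 22", "44 28 32")

df1 <- data.frame(x = c(1, 2, 3, 4), y = c(2, 3, 4, 5), z = c(3, 3, 6, 8))

p <- ggplot(df1) +
  geom_line(aes(x = x, y = y), colour = "blue") +
  geom_line(aes(x = x, y = z), colour = "red") +
  # parse = TRUE makes ggplot2 interpret the label as a plotmath expression
  annotate("text", x = 3, y = 9, label = lab, parse = TRUE, size = 5)
```

The label string is ordinary text until it is parsed, so it can be built with sprintf() or paste() for each plot in turn.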

Related

R - ggplot2 - Get histogram of difference between two groups

Let's say I have a histogram with two overlapping groups. Here's a possible command from ggplot2 and a pretend output graph.
ggplot(data, aes(x=Variable1, fill=BinaryVariable)) + geom_histogram(position="identity")
So what I have is the frequency or count of each event. What I'd like to do instead is to get the difference between the two events in each bin. Is this possible? How?
For example, if we do RED minus BLUE:
Value at x=2 would be ~ -10
Value at x=4 would be ~ 40 - 200 = -160
Value at x=6 would be ~ 190 - 25 = 155
Value at x=8 would be ~ 10
I'd prefer to do this using ggplot2, but another way would be fine. My dataframe is set up with items like this toy example (dimensions are actually 25000 rows x 30 columns) EDITED: Here is example data to work with GIST Example
ID Variable1 BinaryVariable
1 50 T
2 55 T
3 51 N
.. .. ..
1000 1001 T
1001 1944 T
1002 1042 N
As you can see from my example, I'm interested in a histogram to plot Variable1 (a continuous variable) separately for each BinaryVariable (T or N). But what I really want is the difference between their frequencies.
So, in order to do this we need to make sure that the "bins" we use for the histograms are the same for both levels of your indicator variable. Here's a somewhat naive solution (in base R):
df = data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
x = rep(c(0,1), each = 50))
#full hist
fullhist = hist(df$y, breaks = 20) #specify more breaks than probably necessary
#create histograms for 0 & 1 using breaks from full histogram
zerohist = with(subset(df, x == 0), hist(y, breaks = fullhist$breaks))
oneshist = with(subset(df, x == 1), hist(y, breaks = fullhist$breaks))
#combine the hists
combhist = fullhist
combhist$counts = zerohist$counts - oneshist$counts
plot(combhist)
So we specify how many breaks should be used (based on values from the histogram on the full data), and then we compute the differences in the counts at each of those breaks.
PS It might be helpful to examine what the non-graphical output of hist() is.
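The same per-bin difference can be computed without drawing the intermediate histograms. This is only a sketch on simulated data, with the break points chosen by pretty() as an assumption; cut() assigns every value to a shared bin and table() counts per group:

```r
set.seed(1)
df <- data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
                 x = rep(c(0, 1), each = 50))

# One set of breaks covering the full data, shared by both groups
breaks <- pretty(range(df$y), n = 20)
bins   <- cut(df$y, breaks = breaks, include.lowest = TRUE)

# Rows are bins, columns are the two levels of x
counts   <- table(bins, df$x)
bin_diff <- counts[, "0"] - counts[, "1"]

barplot(bin_diff, las = 2, ylab = "count(x == 0) - count(x == 1)")
```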
Here's a solution that uses ggplot as requested.
The key idea is to use ggplot_build to get the rectangles computed by stat_histogram. From that you can compute the differences in each bin and then create a new plot using geom_rect.
setup and create a mock dataset with lognormal data
library(ggplot2)
library(data.table)
theme_set(theme_bw())
n1<-500
n2<-500
k1 <- exp(rnorm(n1,8,0.7))
k2 <- exp(rnorm(n2,10,1))
df <- data.table(k=c(k1,k2),label=c(rep('k1',n1),rep('k2',n2)))
Create the first plot
p <- ggplot(df, aes(x=k,group=label,color=label)) + geom_histogram(bins=40) + scale_x_log10()
Get the rectangles using ggplot_build
p_data <- as.data.table(ggplot_build(p)$data[[1]])[,.(count,xmin,xmax,group)]
p1_data <- p_data[group==1]
p2_data <- p_data[group==2]
Join on the x-coordinates to compute the differences. Note that the y-values aren't the counts, but the y-coordinates of the first plot.
newplot_data <- merge(p1_data, p2_data, by=c('xmin','xmax'), suffixes = c('.p1','.p2'))
newplot_data <- newplot_data[,diff:=count.p1 - count.p2]
setnames(newplot_data, old=c('count.p1','count.p2'), new=c('k1','k2'))
df2 <- melt(newplot_data,id.vars =c('xmin','xmax'),measure.vars=c('k1','diff','k2'))
make the final plot
ggplot(df2, aes(xmin=xmin,xmax=xmax,ymax=value,ymin=0,group=variable,color=variable)) + geom_rect()
Of course the scales and legends still need to be fixed, but that's a different topic.
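A shorter variant of the same idea, as a sketch on made-up data: layer_data(p, 1) returns the computed data of the first layer (the same table that ggplot_build exposes), and because both groups share the bin edges their counts can be subtracted directly. The names toy and bin_diff are illustrative only.

```r
library(ggplot2)

set.seed(42)
toy <- data.frame(
  k     = c(rnorm(100), rnorm(100, mean = 1)),
  label = rep(c("a", "b"), each = 100)
)

p <- ggplot(toy, aes(x = k, fill = label)) +
  geom_histogram(bins = 10, position = "identity", alpha = 0.5)

# Computed bins for layer 1, ordered so the groups line up bin by bin
built  <- layer_data(p, 1)
built  <- built[order(built$group, built$xmin), ]
counts <- split(built$count, built$group)

bin_diff <- data.frame(
  xmin = built$xmin[built$group == 1],
  xmax = built$xmax[built$group == 1],
  diff = counts[["1"]] - counts[["2"]]
)

diff_plot <- ggplot(bin_diff,
                    aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = diff)) +
  geom_rect()
```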

Merge data.frames for grouped boxplot r

I have two data frames z (1 million observations) and b (500k observations).
z= Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X
b= Tracer time treatment
2 0 S
35 0 S
10 0 X
04 0 X
20 15 S
11 15 S
12 15 X
25 15 X
I'd like to create grouped boxplots using time as a factor and treatment as colour. Essentially I need to bind them together and then differentiate between them but not sure how. One way I tried was using:
zz<-factor(rep("Z", nrow(z)))
bb<-factor(rep("B",nrow(b)))
dumZ<-merge(z,zz) #this won't work because it says it's too big
dumB<-merge(b,bb)
total<-rbind(dumB,dumZ)
But z and zz merge won't work because it says it's 10G in size (which can't be right)
The end plot might be similar to this example: Boxplot with two levels and multiple data.frames
Any thoughts?
Cheers,
EDIT: Added boxplot
I would approach it as follows:
# create a list of your data.frames
l <- list(z,b)
# assign names to the dataframes in the list
names(l) <- c("z","b")
# bind the dataframes together with rbindlist from data.table
# the idcol parameter will create a variable with the names of the dataframes
# you could also use 'bind_rows(l, .id="id")' from 'dplyr' for this
library(data.table)
library(ggplot2)
zb <- rbindlist(l, idcol="id")
# create the plot
ggplot(zb, aes(x=factor(time), y=Tracer, color=treatment)) +
geom_boxplot() +
facet_wrap(~id) +
theme_bw()
which gives:
Other alternatives for creating your plot:
# facet by 'time'
ggplot(zb, aes(x=id, y=Tracer, color=treatment)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
# facet by 'time' & color by 'id' instead of 'treatment'
ggplot(zb, aes(x=treatment, y=Tracer, color=id)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
In response to your last comment: to get everything in one plot, you can use interaction to distinguish between the different groupings as follows:
ggplot(zb, aes(x=treatment, y=Tracer, color=interaction(id, time))) +
geom_boxplot(width = 0.7, position = position_dodge(width = 0.7)) +
theme_bw()
which gives:
The key is that you do not need to perform a merge, which is computationally expensive on large tables. Instead, add a new variable (source, with value "b" or "z", in my code below) to each data frame and then rbind them. After that it is straightforward; my solution is very similar to @Jaap's, just with different faceting.
library(ggplot2)
#Create some mock data
t<-seq(1,55,by=2)
z<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
b<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
#Add a variable to each table to id itself
b$source<-"b"
z$source<-"z"
#concatenate the tables together
all<-rbind(b,z)
ggplot(all, aes(source, tracer, group=interaction(treatment,source), fill=treatment)) +
geom_boxplot() + facet_grid(~time)
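To illustrate why the merge in the question grows so large: when two data frames share no column names, merge() returns their Cartesian product. The sketch below uses tiny made-up tables (not the real data) to show this, and then tags and stacks a named list of data frames in base R:

```r
# Toy stand-ins for the two large tables
z <- data.frame(Tracer = c(15, 20, 25, 4), time = c(0, 0, 15, 15),
                treatment = c("S", "X", "S", "X"))
b <- data.frame(Tracer = c(2, 35, 10, 4), time = c(0, 0, 15, 15),
                treatment = c("S", "X", "S", "X"))

# No shared columns, so merge() pairs every row with every row
cross <- merge(z, data.frame(tag = rep("Z", nrow(z))))
nrow(cross)  # 4 * 4 rows, not 4

# Tag each table with its own name, then stack them
frames   <- list(z = z, b = b)
tagged   <- Map(function(d, nm) cbind(d, source = nm), frames, names(frames))
all_rows <- do.call(rbind, unname(tagged))
```

With a million rows on each side, the same Cartesian product would need 10^12 rows, which would account for the size error reported in the question.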

In R: Get multiple barplots from a single output

I have a data frame (100 x 4). The first column is a set of "bins" 0-100, the remaining columns are the counts for each variable of events within each bin (0 to the maximum number of events).
What I'm trying to do is to plot each of the three columns of data (2:4), alongside each other. Because the counts in each of the bins for each of the data sets is close to identical, the data are overlapped in the histogram/barplots I've created, despite my use of beside=true, and position = dodge.
I've set the first column as both numeric and character, but the results are identical- the bars are overlayed on top of each other. (semi-transparent density plots don't work because I want counts not the distribution densities).
The attached code, based on both R and other documentation produced the attached chart.
barplot(BinCntDF$preT,main=NewMain_Trigger, plot=TRUE,
xlab="sample frequency interval counts (0-100 msec bins)",
names.arg=BinCntDF$dT, las=0,
ylab="bin counts", axes=TRUE, xlim=c(0,100),
ylim=c(0,1000), col="red")
geom_bar(position="dodge")
barplot(BinCntDF$postT, beside=TRUE, add=TRUE)
geom_bar()
The goal is to be able to compare the two (or more) data sets side by side on the same axes, without either overlapping the other(s).
I think you have confused barplot with ggplot2. ggplot2 is a library where the function geom_bar comes from and isn't compatible with barplot which comes with Base R.
Simply compare ?barplot and ?geom_bar, and you will see that geom_bar is from the ggplot2 library. To achieve what you're after I have used the ggplot2 library and reshape2.
Step 1
Based on your description, I have assumed that your data looks roughly like this:
df <- data.frame(x = 1:10,
c1 = sample(0:100, replace=TRUE, size=10),
c2 = sample(0:50, replace=TRUE, size=10),
c3 = sample(0:70, replace=TRUE, size=10))
To plot it using ggplot2 you first have to transform the data to a long format instead of a wide format. You can do this using melt function from reshape2.
library(reshape2)
a <- melt(df, id=c("x"))
The output would look something like this
> head(a)
x variable value
1 1 c1 62
2 2 c1 47
3 3 c1 20
4 4 c1 64
5 5 c1 4
6 6 c1 52
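If reshape2 is not available, the same wide-to-long step can be sketched in base R with stack(). The small data frame here is made up for illustration, and the columns are renamed to match what melt produces:

```r
df <- data.frame(x = 1:3, c1 = c(5, 2, 9), c2 = c(1, 4, 3))

# stack() returns one row per original cell, with columns 'values' and 'ind'
stacked <- stack(df[c("c1", "c2")])

long <- data.frame(
  x        = rep(df$x, times = 2),  # repeat x once per stacked column
  variable = stacked$ind,
  value    = stacked$values
)
```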
Step 2
There are plenty of tutorials online on what ggplot2 does and what its arguments mean. I would recommend Googling or searching through the many threads on SO to understand them.
ggplot(a, aes(x=x, y=value, group=variable, fill=variable)) +
geom_bar(stat='identity', position='dodge')
Which gives you the output:
In a nutshell:
group groups the variables of interest
stat=identity ensures that no additional aggregations are made on your data
With that many bins (100) and groups (3) the plot will look messy, but try this:
set.seed(123)
myDF <- data.frame(bins=1:100, x=sample(1:100, replace=T), y=sample(1:100, replace=T), z=sample(1:100, replace=T))
myDF.m <- melt(myDF, id.vars='bins')
ggplot(myDF.m, aes(x=bins, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge')
You could also try plotting w/ facets:
ggplot(myDF.m, aes(x=bins, y=value, fill=variable)) + geom_bar(stat='identity') + facet_wrap(~ variable)
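For completeness, base R can also draw the bars side by side, but beside = TRUE only takes effect when barplot() is given a matrix of heights (one row per series), not a single vector. A minimal sketch with made-up counts; the column names preT and postT follow the question, while the values are random:

```r
set.seed(123)
counts <- data.frame(
  bins  = 1:10,
  preT  = sample(0:100, 10, replace = TRUE),
  postT = sample(0:100, 10, replace = TRUE)
)

# One row per series, one column per bin
heights <- t(as.matrix(counts[, c("preT", "postT")]))

mids <- barplot(heights, beside = TRUE, names.arg = counts$bins,
                col = c("red", "grey"), legend.text = rownames(heights),
                xlab = "bin", ylab = "bin counts")
```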

adding layer to a plot in R

Taking some generic data
A <- c(1997,2000,2000,1998,2000,1997,1997,1997)
B <- c(0,0,1,0,0,1,0,0)
df <- data.frame(A,B)
counts <- t(table(A,B))
frac <- counts[1,]/(counts[2,]+counts[1,])
C <- c(1998,2001,2000,1995,2000,1996,1998,1999)
D <- c(1,0,1,0,0,1,0,1)
df2 <- data.frame(C,D)
counts2 <- t(table(C,D))
frac2 <- counts2[1,]/(counts2[2,]+counts2[1,])
If we then want to create a scatterplot for the two datasets on the one scale
We can:
plot(frac, pch=22)
points(frac2, pch=19)
But we see we have two problems
first we want to put our year values (which appear as df$A and df$C) along the x axis
We want the x axis to automatically adjust the scale when the second data is added.
A solution using ggplot2 or base R would be desired
ggplot will do the scaling for you. You can convert the fracs to data.frames and use them with ggplot:
library(ggplot2)
ggplot(data.frame(y=frac, x=names(frac)), aes(x, y)) +
geom_point(col="salmon") +
geom_point(data=data.frame(y=frac2, x=names(frac2)), aes(x, y), col="steelblue") +
theme_bw()
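A base R sketch of the same thing, keeping the years numeric so that the x axis is a true scale: the axis limits are taken from both data sets before the first plot() call. frac_by_year is a hypothetical helper, and the fraction is the share of zeros per year, as in the question.

```r
A <- c(1997, 2000, 2000, 1998, 2000, 1997, 1997, 1997)
B <- c(0, 0, 1, 0, 0, 1, 0, 0)
C <- c(1998, 2001, 2000, 1995, 2000, 1996, 1998, 1999)
D <- c(1, 0, 1, 0, 0, 1, 0, 1)

# Hypothetical helper: fraction of zeros per year, with year as a number
frac_by_year <- function(year, flag) {
  counts <- table(flag, year)
  data.frame(year = as.numeric(colnames(counts)),
             frac = as.numeric(counts["0", ] / colSums(counts)))
}

d1 <- frac_by_year(A, B)
d2 <- frac_by_year(C, D)

# Compute the limits from both data sets so nothing is cut off
xlim <- range(d1$year, d2$year)
plot(d1$year, d1$frac, pch = 22, xlim = xlim, ylim = c(0, 1),
     xlab = "year", ylab = "fraction")
points(d2$year, d2$frac, pch = 19)
```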

Adding ggplot() objects in a for loop

I would like to create 4 plots which show 4 different conditions in a simulation. The 4 conditions in the simulation are iterated using a for loop. What I would like to do is:
for (cond in 1:4){
1.RUN SIMULATION
2.PLOT RESULTS
}
In the end I would like to have 4 plots arranged on a grid. With plot() I can just use par(mfrow) and the plots would be added automatically. Is there a way to do the same with ggplot?
I am aware that I could use grid.arrange(), but that would require storing the plots in separate objects, plot1...plot4. But it's not possible to do:
for (cond in 1:4){
1. run simulation
2. plot[cond]<-ggplot(...)
}
I cannot give separate names to the plots, like plot1, plot2, plot3 within the loop.
You could use gridExtra package:
library(gridExtra)
library(ggplot2)
p <- list()
for(i in 1:4){
p[[i]] <- ggplot(YOUR DATA, ETC.)
}
do.call(grid.arrange,p)
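A filled-in version of the loop above, as a sketch only: run_simulation is a hypothetical stand-in for the real simulation, lapply() builds the list of plots, and the list is handed to grid.arrange() through its grobs argument.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical stand-in for the real simulation
run_simulation <- function(cond) {
  data.frame(x = 1:50, y = cumsum(rnorm(50, mean = cond)))
}

# One plot per condition, each built from its own data frame
plots <- lapply(1:4, function(cond) {
  ggplot(run_simulation(cond), aes(x = x, y = y)) +
    geom_line() +
    ggtitle(paste("Condition", cond))
})

# Arrange the four plots on a 2 x 2 grid
grid.arrange(grobs = plots, ncol = 2)
```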
I would use faceting in this case. In my experience, explicitly arranging sub-plots is rarely needed in ggplot2. A mockup example will probably illustrate my point better:
run_model = function(id) {
data.frame(x_values = 1:1000,
y_values = runif(1000),
id = sprintf('Plot %d', id))
}
df = do.call('rbind', lapply(1:4, run_model))
head(df)
x_values y_values id
1 1 0.7000696 Plot 1
2 2 0.3992786 Plot 1
3 3 0.2718229 Plot 1
4 4 0.4049928 Plot 1
5 5 0.4158864 Plot 1
6 6 0.1457746 Plot 1
Here, id is the column that specifies which model run a value belongs to. Plotting can then be done simply with:
library(ggplot2)
ggplot(df, aes(x = x_values, y = y_values)) + geom_point() + facet_wrap(~ id)
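Another option is to use the multiplot function. It is not part of ggplot2; its definition is given on the Cookbook for R page linked below and has to be sourced first: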
library(ggplot2)
p <- list()
for(i in 1:4){
p[[i]] <- ggplot(YOUR DATA, ETC.)
}
do.call(multiplot,p)
More information about that - http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_%28ggplot2%29/

Resources