Stacked histogram from already summarized counts using ggplot2 - r

I would like some help coloring a ggplot2 histogram generated from already-summarized count data.
The data are something like counts of # males and # females living in a number of different areas. It's easy enough to plot the histogram for the total counts (i.e. males + females):
library(ggplot2)
set.seed(1)
N <- 100
X <- data.frame(C1=rnbinom(N,15,0.1), C2=rnbinom(N,15,0.1), C=rep(0,N))
X$C <- X$C1 + X$C2
ggplot(X, aes(x=C)) + geom_histogram()
However, I'd like to color each bar according to the relative contribution from C1 and C2, so that I get the same histogram (i.e. overall bar heights) as in the above example, plus I see the proportion of type "C1" and "C2" individuals as in a stacked bar chart.
Suggestions for a clean way to do this with ggplot2, using data like "X" in the example?

Very quickly, you can do what the OP wants using the stat="identity" option and the plyr package to manually calculate the histogram, like so:
library(plyr)
X$mid <- floor(X$C/20)*20+10
X_plot <- ddply(X, .(mid), summarize, total=length(C), split=sum(C1)/sum(C)*length(C))
ggplot(data=X_plot) + geom_histogram(aes(x=mid, y=total), fill="blue", stat="identity") + geom_histogram(aes(x=mid, y=split), fill="deeppink", stat="identity")
We basically just make a 'mid' column that locates each bar and then draw two layers: one with the count for the total (C) and, on top of it, one with the bar heights scaled to the share contributed by one of the columns (C1). You should be able to customize from here.
Update 1: I realized I made a small error in calculating the mids. Fixed now. Also, I don't know why I used a 'ddply' statement to calculate the mids. That was silly. The new code is clearer and more concise.
Update 2: I returned to view a comment and noticed something slightly horrifying: I was using sums as the histogram frequencies. I have cleaned up the code a little and also added suggestions from the comments concerning the coloring syntax.
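For a version with a legend, the same per-bin numbers can be reshaped so that each bar is split into a C1 part and a C2 part and drawn with geom_col(). This is only a sketch under the same assumptions as the answer above (bin width 20, heights split by each bin's share of C1 in C); the names binwidth, share1 and X_long are made up for illustration.

```r
library(ggplot2)

set.seed(1)
N <- 100
X <- data.frame(C1 = rnbinom(N, 15, 0.1), C2 = rnbinom(N, 15, 0.1))
X$C <- X$C1 + X$C2

binwidth <- 20  # assumed bin width, as in the answer above
X$mid <- floor(X$C / binwidth) * binwidth + binwidth / 2

# Per-bin row count and per-bin share of C1 in C (bins ordered by midpoint)
mids   <- sort(unique(X$mid))
n_bin  <- as.numeric(tapply(X$C, X$mid, length))
share1 <- as.numeric(tapply(X$C1, X$mid, sum) / tapply(X$C, X$mid, sum))

# Long format: one row per bin and part, heights summing to the bin count
X_long <- data.frame(
  mid    = rep(mids, times = 2),
  part   = rep(c("C1", "C2"), each = length(mids)),
  height = c(n_bin * share1, n_bin * (1 - share1))
)

p <- ggplot(X_long, aes(x = mid, y = height, fill = part)) +
  geom_col(width = binwidth)  # geom_col() stacks the two parts by default
```

Printing p draws the stacked bars. Because the two parts of each bar add up to the bin count, the overall outline matches a histogram of C with the same bins.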

Here's a hack using ggplot_build. The idea is to first get your old/original plot:
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
stored in p. Then, use ggplot_build(p)$data[[1]] to extract the data, specifically the columns xmin and xmax (to get the same breaks/binwidths as the histogram) and the count column (to scale each bar's proportions by its count). Here's the code:
# get old plot
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
# get data of old plot: cols = count, xmin and xmax
d <- ggplot_build(p)$data[[1]][c("count", "xmin", "xmax")]
# add an id column for ddply
d$id <- seq(nrow(d))
How to generate data now? What I understand from your post is this. Take for example the first bar in your plot. It has a count of 2 and it extends from xmin = 147 to xmax = 156.8. When we check X for these values:
X[X$C >= 147 & X$C <= 156.8, ] # count = 2 as shown below
# C1 C2 C
# 19 91 63 154
# 75 86 70 156
Here, I compute (91+86)/(154+156)*(count=2) = 1.141935 and (63+70)/(154+156) * (count=2) = 0.8580645 as the two normalised values for each bar we'll generate.
require(plyr)
dd <- ddply(d, .(id), function(x) {
  t <- X[X$C >= x$xmin & X$C <= x$xmax, ]
  if(nrow(t) == 0) return(c(0,0))
  p <- colSums(t)[1:2]/colSums(t)[3] * x$count
})
# then it's just normal plotting
require(reshape2)
dd <- melt(dd, id.var="id")
ggplot(data = dd, aes(x=id, y=value)) +
  geom_bar(aes(fill=variable), stat="identity", group=1)
[Figures omitted: the original histogram, and the stacked version produced by the code above.]
Edit: If you also want the x-axis breaks to be correct, you can take the corresponding x coordinates from the old plot and use them here instead of id:
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
d <- ggplot_build(p)$data[[1]][c("count", "x", "xmin", "xmax")]
d$id <- seq(nrow(d))
require(plyr)
dd <- ddply(d, .(id), function(x) {
  t <- X[X$C >= x$xmin & X$C <= x$xmax, ]
  if(nrow(t) == 0) return(c(x$x,0,0))
  p <- c(x=x$x, colSums(t)[1:2]/colSums(t)[3] * x$count)
})
require(reshape2)
dd.m <- melt(dd, id.var="V1", measure.var=c("V2", "V3"))
ggplot(data = dd.m, aes(x=V1, y=value)) +
  geom_bar(aes(fill=variable), stat="identity", group=1)
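The same per-bin split can also be computed without ggplot_build(), using base R's hist() and cut(). The sketch below is an alternative, not the method used in this answer: cut() puts every value in exactly one bin, whereas a test like X$C >= xmin & X$C <= xmax can place a value that sits exactly on a break into two neighbouring bins. The breaks = 30 argument is an arbitrary choice that hist() treats only as a suggestion.

```r
set.seed(1)
N <- 100
X <- data.frame(C1 = rnbinom(N, 15, 0.1), C2 = rnbinom(N, 15, 0.1))
X$C <- X$C1 + X$C2

# Let hist() choose the breaks, then assign each row to exactly one bin
h   <- hist(X$C, breaks = 30, plot = FALSE)
bin <- cut(X$C, breaks = h$breaks, include.lowest = TRUE)

counts <- as.numeric(table(bin))            # rows per bin, same as h$counts
c1_sum <- as.numeric(tapply(X$C1, bin, sum)) # NA for empty bins
c_sum  <- as.numeric(tapply(X$C,  bin, sum)) # NA for empty bins

# Share of C1 in each bin; empty bins get a share of 0
share1 <- ifelse(counts > 0, c1_sum / c_sum, 0)

# One row per bin: midpoint plus the C1 and C2 parts of the bar height
splits <- data.frame(x  = h$mids,
                     C1 = counts * share1,
                     C2 = counts * (1 - share1))
```

The splits data frame can then be melted and plotted with stat="identity" in the same way as dd.m above.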

How about:
library("reshape2")
mm <- melt(X[,1:2])
ggplot(mm,aes(x=value,fill=variable))+geom_histogram(position="stack")

Related

Again, I have 4 graphs on R, different x axis, but similar trend profile. I tried to overlay them but they are not aligned

I was previously helped with overlaying two graphs with different x axes in the question "I have 2 graphs on R. They have different x axis, but similar trend profile. how do I overlay them on r?".
However, I am now trying to overlay 4 graphs. I tried to overlay them, but they are not aligned.
I need assistance to overlay these four graphs.
My raw data is at the following link: https://drive.google.com/drive/folders/1ZZQAATkbeV-Nvq1YYZMYdneZwMvKVUq1?usp=sharing.
The code from my initial attempt is as follows:
first <- ggplot(data = first,
aes(x, y)) +
geom_line(pch = 1)
second <- ggplot(data = second,
aes(x, y)) +
geom_line(pch = 1)
third <- ggplot(data = third,
aes(x, y)) +
geom_line(pch = 1)
fourth <- ggplot(data = fourth,
aes(x, y)) +
geom_line(pch = 1)
first$match <- first$x
second$match <- second$x - second$x[second$y == max(second$y)] + first$x[first$y == max(first$y)]
third$match <- third$x
fourth$match <- fourth$x
first$series = "first"
second$series = "second"
third$series = "third"
fourth$series = "fourth"
all_data <- rbind(first, second, third, fourth)
ggplot(all_data) + geom_line(aes(x = match, y, color = series)) +
scale_x_continuous(name = "X, arbitrary units") +
theme(axis.text.x = element_blank())
Would greatly appreciate the help indeed.
OP, I thought I would propose a solution for your question. OP has 4 datasets with x and y columns, and wants to align the peaks in each dataset so that they stack on top of one another. Here's what it looks like when we plot all datasets together:
p <- ggplot(mapping=aes(x=x, y=y)) + theme_bw() +
  geom_line(data=first, aes(color="first")) +
  geom_line(data=second, aes(color="second")) +
  geom_line(data=third, aes(color="third")) +
  geom_line(data=fourth, aes(color="fourth"))
The approach will be as follows:
1. Find the peak x value for each dataset
2. Shift each dataset's x values so that its peak lands on the smallest peak x value among the four
3. Combine the datasets into one data frame, following Tidy Data principles, and plot them together
Finding peaks and adjusting x values
To find the peaks, I like to use the findpeaks() function from the pracma library. You feed the function your dataset's y values (arranged by increasing x value), and the function will return a matrix with each row representing a "peak" and the columns give you height of peak in y value, index or row of dataset for the peak, where the peak begins, and where the peak ends. As an example, here's how we can apply this principle and what the result looks like on one of the datasets:
library(pracma)
library(dplyr) # provides arrange()
first <- arrange(first, x) # arrange first by increasing x
findpeaks(first$y, sortstr = TRUE, npeaks=1)
[,1] [,2] [,3] [,4]
[1,] 1047.54 402 286 515
The argument sortstr = TRUE sorts the peaks from highest to lowest, and npeaks = 1 keeps only the first (highest) one. In this case, we can see that 402 is the row index in first of the x,y value for the peak. So we can access that x value via first[index,]$x (here, first[402,]$x).
The one concern we may have here is that this may not work for fourth, since the max value of y is actually not the peak of interest; however, if we run the function and test this out, using the findpeaks() method where we return the highest peak works fine: apparently the function does not find there is a "peak" at the right since it has an "up", but not a "down".
The function below handles all the steps to do what we need to: arranging, finding peaks, and adjusting peaks.
# sort the remaining datasets by increasing x so the peak indices point at
# the right rows (first was already arranged above)
second <- arrange(second, x)
third <- arrange(third, x)
fourth <- arrange(fourth, x)
# find the minimum peak. We know it's from third, but here's
# how you do it if you don't "know" that
peaks_first <- findpeaks(first$y, sortstr = TRUE, npeaks=1)
peaks_second <- findpeaks(second$y, sortstr = TRUE, npeaks=1)
peaks_third <- findpeaks(third$y, sortstr = TRUE, npeaks=1)
peaks_fourth <- findpeaks(fourth$y, sortstr = TRUE, npeaks=1)
# minimum peak x value
peak_x <- min(c(first[peaks_first[2],]$x,
                second[peaks_second[2],]$x,
                third[peaks_third[2],]$x,
                fourth[peaks_fourth[2],]$x))
# function to use to fix each dataset
fix_x <- function(peak_x, dataset) {
  dataset <- arrange(dataset, x)
  d_peak <- findpeaks(dataset$y, sortstr = TRUE, npeaks=1)
  d_peak_x <- dataset[d_peak[2],]$x
  x_adj <- peak_x - d_peak_x
  dataset$x <- dataset$x + x_adj
  return(dataset)
}
# apply and fix each dataset
fix_first <- fix_x(peak_x, first)
fix_second <- fix_x(peak_x, second)
fix_third <- fix_x(peak_x, third)
fix_fourth <- fix_x(peak_x, fourth)
# combine datasets
fix_first$measure <- 'First'
fix_second$measure <- 'Second'
fix_third$measure <- 'Third'
fix_fourth$measure <- 'Fourth'
fixed <- rbind(fix_first, fix_second, fix_third, fix_fourth)
fixed$measure <- factor(fixed$measure, levels=c('First','Second','Third','Fourth'))
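The same steps can be written more compactly by keeping the datasets in a named list. The sketch below is only an illustration: the four curves are synthetic stand-ins generated by a made-up helper, make_curve(), because the real data are not reproduced here, and peak_x_of() is likewise a hypothetical name. It relies on findpeaks() from pracma, as above.

```r
library(pracma)

# Synthetic stand-ins for the four datasets: one bell-shaped peak each
make_curve <- function(centre) {
  x <- seq(0, 100, by = 0.5)
  data.frame(x = x, y = 1000 * exp(-(x - centre)^2 / 50))
}
datasets <- list(first = make_curve(40), second = make_curve(55),
                 third = make_curve(30), fourth = make_curve(62))

# x position of the highest peak in one dataset
peak_x_of <- function(d) {
  d  <- d[order(d$x), ]                       # findpeaks() expects y ordered by x
  pk <- findpeaks(d$y, sortstr = TRUE, npeaks = 1)
  d$x[pk[1, 2]]                               # column 2 holds the peak's row index
}

peak_xs <- sapply(datasets, peak_x_of)
target  <- min(peak_xs)                       # align everything to the smallest peak x

# Shift each dataset so that its peak lands on the target
aligned <- Map(function(d, px) { d$x <- d$x + (target - px); d },
               datasets, peak_xs)

# Stack into one long data frame with a label column
fixed_list <- Map(function(d, nm) { d$measure <- nm; d }, aligned, names(aligned))
fixed_demo <- do.call(rbind, fixed_list)
fixed_demo$measure <- factor(fixed_demo$measure, levels = names(datasets))
```

With real data, the list would hold the four data frames read from file, and the rest of the code would stay the same.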
Plot Together
Now fixed contains all the data, and we can plot them all together:
ggplot(fixed, aes(x=x, y=y, color=measure)) + theme_bw() +
geom_line()
Alternate Plotting Methods
If you want to "stack" the lines on top of one another, this is what is known as a ridgeline plot. There are two methods I can show for how to create the ridgeline plot: faceting or using ggridges and geom_ridgeline(). I can demonstrate both.
# Using facets
ggplot(fixed, aes(x=x, y=y, color=measure)) + theme_bw() +
geom_line(show.legend = FALSE) +
facet_grid(measure~.)
Note I chose not to show the legend, since the strip text indicates this same information.
# Using ggridges and geom_ridgeline
library(ggridges)
ggplot(fixed, aes(x=x, y=measure, color=measure)) + theme_bw() +
  geom_ridgeline(aes(height=y), fill=NA, scale=0.001)
When using geom_ridgeline(), you'll notice that the y= aesthetic becomes the column used for the stacking, and your original y value is instead mapped to the height= aesthetic. I also had to play around with scale=, since for a discrete y variable, the levels of measure are placed at integer positions (1, 2, 3, 4). Your height= values are far larger than that, so we have to scale them down by about 1000 to bring them into this range.

boxplots with missing values in R - ggplot

I am trying to make boxplots for a matrix (athTp) with 6 variables (columns) but with many missing values:
ggplot(athTp)+geom_boxplot()
But maybe I am doing something wrong...
I also tried making a separate boxplot for each column and then arranging them in a grid, but the final plot was very small (at the desired dimensions), losing many details.
q1 <- ggplot(athTp,aes(x="V1", y=athTp[,1]))+ geom_boxplot()
..continue with other 5 columns
grid.arrange(q1,q2,q3,q4,q5,q6, ncol=6)
ggsave("plot.pdf",plot = qq, width = 8, height = 8, units = "cm")
Do you have any ideas?
Thanks in advance!
# ok so your data has 6 columns like this
set.seed(666)
dat <- data.frame(matrix(runif(60,1,20),ncol=6))
names(dat) <- letters[1:6]
head(dat)
# so let's get in long format like ggplot likes
library(reshape2)
longdat <- melt(dat)
head(longdat)
# and try your plot call again specifying that we want a box plot per column
# which is now indicated by the "variable" column
# [remember you should specify the x and y axes with `aes()`]
library(ggplot2)
ggplot(longdat, aes(x=variable, y=value)) + geom_boxplot(aes(colour = variable))
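Since the question mentions a matrix with many missing values, here is a variant of the same idea that starts from a matrix and drops the NA cells before plotting. It is a sketch only: the matrix m is a made-up stand-in for athTp, and the column names V1 to V6 are assumptions.

```r
library(ggplot2)

set.seed(666)
# Hypothetical stand-in for athTp: a 10 x 6 numeric matrix with missing values
m <- matrix(runif(60, 1, 20), ncol = 6,
            dimnames = list(NULL, paste0("V", 1:6)))
m[sample(length(m), 15)] <- NA

# Long format in base R: one row per matrix cell, labelled by its column
longdat_na <- data.frame(
  variable = factor(colnames(m)[col(m)], levels = colnames(m)),
  value    = as.vector(m)
)

# Drop missing cells explicitly so the boxplot layer sees complete data
longdat_na <- longdat_na[!is.na(longdat_na$value), ]

p <- ggplot(longdat_na, aes(x = variable, y = value)) +
  geom_boxplot(aes(colour = variable))
```

Because all six boxplots live in a single plot object, p can be passed directly to ggsave() at the desired size instead of arranging six separate plots in a grid.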

adding layer to a plot in R

Taking some generic data
A <- c(1997,2000,2000,1998,2000,1997,1997,1997)
B <- c(0,0,1,0,0,1,0,0)
df <- data.frame(A,B)
counts <- t(table(A,B))
frac <- counts[1,]/(counts[2,]+counts[1,])
C <- c(1998,2001,2000,1995,2000,1996,1998,1999)
D <- c(1,0,1,0,0,1,0,1)
df2 <- data.frame(C,D)
counts2 <- t(table(C,D))
frac2 <- counts2[1,]/(counts2[2,]+counts2[1,])
If we then want to create a scatterplot for the two datasets on the same scale, we can:
plot(frac, pch=22)
points(frac2, pch=19)
But we see we have two problems:
First, we want to put our year values (which appear as df$A and df2$C) along the x axis.
Second, we want the x axis to automatically adjust its scale when the second dataset is added.
A solution using ggplot2 or base R would be desired.
ggplot will do the scaling for you. You can convert the fracs to data frames and use them with ggplot:
library(ggplot2)
ggplot(data.frame(y=frac, x=names(frac)), aes(x, y)) +
geom_point(col="salmon") +
geom_point(data=data.frame(y=frac2, x=names(frac2)), aes(x, y), col="steelblue") +
theme_bw()
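In the code above, names(frac) is a character vector, so the x axis is discrete. If a continuous year axis is preferred, the years can be converted to numbers and the two series combined into one data frame. The following is a sketch of that variant; the helper frac_df() and the series labels are made up for illustration.

```r
library(ggplot2)

A <- c(1997, 2000, 2000, 1998, 2000, 1997, 1997, 1997)
B <- c(0, 0, 1, 0, 0, 1, 0, 0)
C <- c(1998, 2001, 2000, 1995, 2000, 1996, 1998, 1999)
D <- c(1, 0, 1, 0, 0, 1, 0, 1)

# Fraction of zeros per year, returned as a data frame with numeric years
frac_df <- function(year, flag, label) {
  counts <- table(flag, year)                 # rows: flag 0/1, columns: years
  data.frame(year   = as.numeric(colnames(counts)),
             frac   = as.numeric(counts[1, ] / colSums(counts)),
             series = label)
}

both <- rbind(frac_df(A, B, "df"), frac_df(C, D, "df2"))

p <- ggplot(both, aes(x = year, y = frac, colour = series)) +
  geom_point() +
  theme_bw()
```

With year stored as a number, the axis range covers both series automatically, and years missing from one series leave a gap rather than being squeezed together.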

R plot two series of means with 95% confidence intervals

I am trying to plot the following data
factor <- as.factor(c(1,2,3))
V1_mean <- c(100,200,300)
V2_mean <- c(350,150,60)
V1_stderr <- c(5,9,3)
V2_stderr <- c(12,9,10)
plot <- data.frame(factor,V1_mean,V2_mean,V1_stderr,V2_stderr)
I want to create a plot with factor on the x-axis, value on the y-axis and separate lines for V1 and V2 (hence the points are the values of V1_mean on one line and V2_mean on the other). I would also like to add error bars for these means based on V1_stderr and V2_stderr.
Many thanks
I'm not sure about your desired output, but here's a possible solution.
First of all, I wouldn't call your data plot, as this is the name of a commonly used built-in R function.
Second, when you want to plot two lines in ggplot you'll usually have to tidy your data using functions such as melt (from the reshape2 package) or gather (from the tidyr package).
Here's a possible approach:
library(ggplot2)
library(reshape2)
dat <- data.frame(factor, V1_mean, V2_mean, V1_stderr, V2_stderr)
mdat <- cbind(melt(dat[1:3], "factor"), melt(dat[c(1, 4:5)], "factor"))
names(mdat) <- make.names(names(mdat), unique = TRUE)
ggplot(mdat, aes(factor, value, color = variable)) +
  geom_line(aes(group = variable)) + # You can also add `geom_point(aes(group = variable)) + ` if you want to see the actual points
  geom_errorbar(aes(ymin = value - value.1, ymax = value + value.1))
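The error bars above span one standard error on each side of the mean. Since the question title asks for 95% confidence intervals, one option is to multiply the standard error by the normal quantile qnorm(0.975), which is about 1.96. The sketch below assumes a normal approximation is acceptable for these means, which the question does not confirm; the names long_ci, lower and upper are made up.

```r
library(ggplot2)

dat <- data.frame(factor    = factor(c(1, 2, 3)),
                  V1_mean   = c(100, 200, 300),
                  V2_mean   = c(350, 150, 60),
                  V1_stderr = c(5, 9, 3),
                  V2_stderr = c(12, 9, 10))

# Long format built by hand: one row per factor level and series
long_ci <- data.frame(
  factor   = rep(dat$factor, times = 2),
  variable = rep(c("V1", "V2"), each = nrow(dat)),
  mean     = c(dat$V1_mean, dat$V2_mean),
  stderr   = c(dat$V1_stderr, dat$V2_stderr)
)

# Assumption: normal approximation, so the 95% CI is mean +/- z * stderr
z <- qnorm(0.975)
long_ci$lower <- long_ci$mean - z * long_ci$stderr
long_ci$upper <- long_ci$mean + z * long_ci$stderr

p <- ggplot(long_ci, aes(x = factor, y = mean, colour = variable, group = variable)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1)
```

If the means come from small samples, a t quantile with the appropriate degrees of freedom would be the more usual multiplier, but the sample sizes are not given in the question.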

ggplot2 and cumsum()

I have a set of UNIX timestamps and URIs and I'm trying to plot the cumulative count of requests for each URI. I managed to do that for one URI at a time using a dummy column:
x.df$count <- apply(x.df,1,function(row) 1) # Create a dummy column for cumsum
x.df <- x.df[order(x.df$time, decreasing=FALSE),] # Sort
ggplot(x.df, aes(x=time, y=cumsum(count))) + geom_line()
However, that would make roughly 30 plots in my case.
ggplot2 does allow you to plot multiple lines into one plot (I copied this piece of code from here):
ggplot(data=test_data_long, aes(x=date, y=value, colour=variable)) +
geom_line()
The problem is that, this way, cumsum() would count on and on.
Does anybody have an idea?
Here's some test data. It uses plyr's ddply with transform to calculate the cumulative sum per group first, and then plots the result with ggplot2:
set.seed(45)
DF <- data.frame(grp = factor(rep(1:5, each=10)), x=rep(1:10, 5))
DF <- transform(DF, y=runif(nrow(DF)))
# use plyr to calculate the cumsum of y within each grp
require(plyr)
DF.t <- ddply(DF, .(grp), transform, cy = cumsum(y))
# plot
require(ggplot2)
ggplot(DF.t, aes(x=x, y=cy, colour=grp, group=grp)) + geom_line()
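Applied to the request log from the question, the same per-group idea could look like the sketch below, which uses base R's ave() instead of plyr. The data frame here is invented for illustration: the column name uri and the three example URIs are assumptions, as the question only says that x.df holds timestamps and URIs.

```r
library(ggplot2)

set.seed(45)
# Hypothetical request log: UNIX timestamps and the URI requested
x.df <- data.frame(
  time = sort(sample(1.6e9 + 0:9999, 200)),
  uri  = sample(c("/a", "/b", "/c"), 200, replace = TRUE)
)

# Sort by time, then count requests cumulatively within each URI
x.df <- x.df[order(x.df$time), ]
x.df$cum_count <- ave(rep(1, nrow(x.df)), x.df$uri, FUN = cumsum)

p <- ggplot(x.df, aes(x = time, y = cum_count, colour = uri)) +
  geom_line()
```

Because the counter restarts for every URI, each line ends at that URI's total number of requests instead of continuing from the previous group.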
