ggplot2: multiple variables on x-axis at multiple times - r

I have a data frame for observation numbers (3 observations for same id), height, weight and fev that looks like this (just for example):
id obs height weight fev
1 1 160 80 90
1 2 150 70 85
1 3 155 76 87
2 1 140 67 91
2 2 189 78 71
2 3 178 86 89
I need to plot this data using ggplot2 such that the x-axis shows the 3 variables height, weight, and fev, and the observation numbers are displayed as 3 vertical lines for each variable (color coded), where each line shows the median as a solid circle and the 25th and 75th percentiles as caps at the upper and lower ends of the line (no minimum or maximum needed). I have so far tried many variations of box plots but I am not even getting close. Any suggestions on how to approach or solve this?
Thanks

OK, instead what I did below was make three graphs and then piece them together with gridExtra. Read more about the package here: http://www.sthda.com/english/wiki/wiki.php?id_contents=7930
I took the common legend code from that site to produce the following, starting with our existing longdf2. With the graphs pieced together, the title of each graph tells you which observation it shows.
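library(reshape2) # melt()
library(dplyr) # %>% and filter()
library(ggplot2) # median_hilow also needs the Hmisc package installed
id <- rep(1:12, each = 3)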
obs <- rep(1:3, 12)
height <- seq(140,189, length.out = 36)
weight <- seq(67,86, length.out = 36)
fev <- seq(71,91, length.out = 36)
df <- as.data.frame(cbind(id,obs,height, weight, fev))
obsonly <- melt(df, id.vars = c('id'), measure.vars = 'obs')
obsonly <- rbind(obsonly,obsonly,obsonly)
newvars <- melt(df[-2],id.vars = 'id')
longdf2 <- cbind(obsonly,newvars)
longdf2 <- longdf2[-4] #dropping second id column
colnames(longdf2)[c(2:5)] <- c('obs', 'obsnum', 'variable', 'value')
#Make graph 1 of observation 1
g1 <- longdf2 %>%
dplyr::filter(obsnum == 1) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 1") +
theme(plot.title = element_text(hjust = 0.5)) #has a legend
g2 <- longdf2 %>%
dplyr::filter(obsnum == 2) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 2") +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')
#specified as none to make common legend at end
g3 <- longdf2 %>%
dplyr::filter(obsnum == 3) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 3") +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')
library(gridExtra)
get_legend<-function(myggplot){
tmp <- ggplot_gtable(ggplot_build(myggplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)
}
# Save legend
legend <- get_legend(g1)
# Remove legend from 1st graph
g1 <- g1 + theme(legend.position = 'none')
# Combine graphs
grid.arrange(g1, g2, g3, legend, ncol=4, widths=c(2.3, 2.3, 2.3, 0.8))
Plenty of other little tweaks you could make along the way

Try putting the data into long format prior to graphing. I generated some more data, 12 subjects, each with 3 observations.
id <- rep(1:12, each = 3)
obs <- rep(1:3, 12)
height <- seq(140,189, length.out = 36)
weight <- seq(67,86, length.out = 36)
fev <- seq(71,91, length.out = 36)
df <- as.data.frame(cbind(id,obs,height, weight, fev))
library(reshape2) #use to melt data from wide to long format
longdf <- melt(df,id.vars = c('id', 'obs'))
There is no need to define measure variables here: since the id.vars are defined, the remaining columns default to measure variables. If you have more variables in your data set, you'll want to define the measure variables in that same call with measure.vars = c("height", "weight", "fev"):
longdf <- melt(df,id.vars = c('id', 'obs'), measure.vars = c("height", "weight", "fev"))
Apologies, haven't earned enough votes to put figures into my responses
ggplot(data = longdf, aes(x = variable, y = value, fill = factor(obs))) +
geom_boxplot(notch = T, notchwidth = .25, width = .25, position = position_dodge(.5))
This does not produce the exact graph you described -- which sounded like it was geom_linerange or something similar? -- those geoms require an x, ymin, and ymax to draw. Otherwise a regular ol' boxplot has your 1st and 3rd quartiles and median marked. I made the boxplots thinner and notched with the width and notch parameters, and separated them slightly with position_dodge(.5).
after reading your response, I edited my original answer
You could try facet_wrap -- and watch the difference between "fill" and "color" in ggplot. A geom that has no interior to fill (unlike a boxplot or a density) has to be colored with "color" instead, so use color in the original aes():
ggplot(data = longdf, aes(x = variable, y = value, color = factor(obs))) +
stat_summary(fun.data=median_hilow) + facet_wrap(.~obs)
This gives you observation 1 - height, weight, fev side by side, observation 2- height, ....
If that still isn't what you want perhaps more like height observation 1,2,3; weight observation 1,2,3...then you'll need to modify your melting to have two variable and two value columns. Essentially make two melted dataframes, then cbind. Annnnd because each observation has three variables, you'll need to rbind to make sure both data frames have the same number of rows:
obsonly <- melt(df, id.vars = c('id'), measure.vars = 'obs')
obsonly <- rbind(obsonly,obsonly,obsonly) #making rows equal
longvars <- melt(df[-2],id.vars = 'id') #dropping obs from melt
longdf2 <- cbind(obsonly,longvars)
longdf2 <- longdf2[-4] #dropping second id column
colnames(longdf2)[c(2:5)] <- c('obs', 'obsnum', 'variable', 'value')
ggplot(data = longdf2, aes(x = obsnum, y = value,
color = factor(variable))) +
stat_summary(fun.data=median_hilow) +
facet_wrap(.~variable)
From here you can play around with the x axis marks (probably isn't useful to have a 1.5 observation marked) and the spacing of the lines from each other
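One thing to keep in mind: median_hilow() defaults to conf.int = 0.95, so the lines above run from the 2.5th to the 97.5th percentile rather than from the 25th to the 75th that the question asks for. Below is a minimal sketch of the plot as I understand the description (median as a solid point, quartiles as caps, one dodged line per observation within each variable). median_iqr is a made-up helper name, not part of ggplot2, and the dodge and cap widths are arbitrary choices.

```r
library(ggplot2)

# Same simulated data as above, reshaped to long format with base R
df <- data.frame(id = rep(1:12, each = 3), obs = rep(1:3, 12),
                 height = seq(140, 189, length.out = 36),
                 weight = seq(67, 86, length.out = 36),
                 fev = seq(71, 91, length.out = 36))
vars <- c("height", "weight", "fev")
longdf <- data.frame(
  id = rep(df$id, times = 3),
  obs = factor(rep(df$obs, times = 3)),
  variable = factor(rep(vars, each = nrow(df)), levels = vars),
  value = c(df$height, df$weight, df$fev)
)

# Hypothetical helper: median plus 25th and 75th percentiles,
# in the y / ymin / ymax layout that stat_summary(fun.data = ) expects
median_iqr <- function(y) {
  data.frame(y = median(y),
             ymin = quantile(y, 0.25, names = FALSE),
             ymax = quantile(y, 0.75, names = FALSE))
}

dodge <- position_dodge(width = 0.5)
p <- ggplot(longdf, aes(x = variable, y = value, color = obs)) +
  # vertical line with caps at the quartiles
  stat_summary(fun.data = median_iqr, geom = "errorbar",
               width = 0.2, position = dodge) +
  # solid circle at the median
  stat_summary(fun.data = median_iqr, geom = "point",
               size = 3, position = dodge)
print(p)
```

Swapping x = variable and color = obs gives the other grouping (observations on the x-axis, one line per variable).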

Related

matching of shape, color and legend in bubble plot with subset of variable

I have some data
library(data.table)
wide <- data.table(id=c("A","C","B"), var1=c(1,6,1), var2=c(2,6,5), size1=c(11,12,13), size2=c(10,12,10), flag=c(FALSE,TRUE,FALSE))
> wide
id var1 var2 size1 size2 flag
1: A 1 2 11 10 FALSE
2: C 6 6 12 12 TRUE
3: B 1 5 13 10 FALSE
which I would like to plot as bubble plots where id is ordered by var2, and bubbles are as follows:
ID A and B: var1 is plotted in size1 and "empty bubbles" and var2 is plotted in size2 with "filled" bubbles.
ID C is flagged because there is only one value (this is why var1=var2) and it should have a "filled bubble" of a different color.
I have tried this as follows:
cols <- c("v1"="blue", "v2"="red", "flags"="green")
shapes <- c("v1"=16, "v2"=21, "flags"=16)
p1 <- ggplot(data = wide, aes(x = reorder(id,var2))) + scale_size_continuous(range=c(5,15))
p1 <- p1 + geom_point(aes(size=size1, y = var1, color = "v1", shape = "v1"))
p1 <- p1 + geom_point(aes(size=size2, y = var2, color = "v2", shape = "v2", stroke=1.5))
p1 <- p1 + geom_point(data=subset(wide,flag), aes(size=size2[flag], y=var2[flag], color= "flags", shape="flags"))
p1 <- p1 + scale_color_manual(name = "test",
values = cols,
labels = c("v1", "v2", "flags"))
p1 <- p1 + scale_shape_manual(name = "test",
values = shapes,
labels = c("v1", "v2", "flags"))
which gives (in my theme)
but two questions remain:
What happened to the order in the legend? I have followed the recipe of the bottom solution in Two geom_points add a legend but somehow the order does not match.
How to get rid of the stroke around the green bubble and why is it there?
Overall, something appears to go wrong in matching shape and color.
I admit, it took me a while to understand your slightly convoluted plot. Forgive me, but I have allowed myself to change the way to plot, and make (better?) use of ggplot.
The data shape is less than ideal. ggplot works extremely well with long data.
It was a bit of a guesswork to reshape your data, and I decided to go the quick and dirty way to simply bind the rows from selected columns.
Now you can see, that you can achieve the new plot with a single call to geom_point. The rest is "scale_aesthetic" magic...
In order to combine the shape and color legend, safest is to use override.aes. But beware! It does not take named vectors, so the order of the values needs to be in the exact order given by your legend keys - which is usually alphabetic, if you don't have the factor levels defined.
update re: request to order x labels
This hugely depends on the actual data structure. If it is originally as you have presented, I'd first make id a factor with the levels ordered based on your var2. Then, do the data shaping.
library(tidyverse)
# data reshape
wide <- data.frame(id=c("C","B","A"), var1=c(1,6,1), var2=c(2,6,5), size1=c(11,12,13), size2=c(10,12,10), flag=c(FALSE,TRUE,FALSE))
wide <- wide %>% mutate(id = reorder(id, var2))
wide1 <- wide %>% filter(!flag) %>% select(id, var = var1, size = size1)
wide2 <- wide %>% filter(!flag) %>% select(id, var = var2, size = size2)
wide3 <- wide %>% filter(flag) %>% select(id, var = flag, size = size2) %>%
mutate(var = 6)
long <- bind_rows(list(v1 = wide1, v2 = wide2, flag = wide3), .id = "var_id")
# rearrange the vectors for scales aesthetic
cols <- c(flag="green", v1 ="blue", v2="red" )
shapes <- c(flag=16, v1=16, v2 =21 )
ggplot(data = long, aes(x = id, y = var)) +
geom_point(aes(size=size, shape = var_id, color = var_id), stroke=1.5) +
scale_size_continuous(limits = c(5,15),breaks = seq(5,15,5)) +
scale_shape_manual(name = "test", values = shapes) +
scale_color_manual(values = cols, guide = "none") + # guide = FALSE is deprecated in recent ggplot2
guides(shape = guide_legend(override.aes = list(color = cols)))
P.S. the reason for the red stroke around the green bubble in your plot is that you also plotted the 'var2' behind your flag.
Created on 2020-04-08 by the reprex package (v0.3.0)
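To make the point about override.aes concrete, here is a small sketch of how the unnamed vector could be built so that it follows the legend key order. It assumes var_id is a character column (so the keys are sorted alphabetically); key_order and override_cols are names made up for this illustration.

```r
# Named colours, written in any order
cols <- c(v1 = "blue", v2 = "red", flag = "green")

# Legend keys of a character variable are sorted alphabetically ...
key_order <- sort(names(cols))

# ... so build the unnamed vector for override.aes in that same order
override_cols <- unname(cols[key_order])
print(key_order)
print(override_cols)
```

The result could then be passed as guides(shape = guide_legend(override.aes = list(color = override_cols))). If var_id were a factor, the factor levels would give the key order instead.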

Fill area between two lines, with high/low and dates

Foreword: I provide a reasonably satisfactory answer to my own question. I understand this is acceptable practice. Naturally, my hope is to invite suggestions and improvements.
My purpose is to plot two time series (stored in a dataframe with dates stored as class 'Date') and to fill the area between the data points with two different colors according to whether one is above the other. For instance, to plot an index of Bonds and an index of Stocks, and to fill the area in red when the Stock index is above the bond index, and to fill the area in blue otherwise.
I have used ggplot2 for this purpose, because I am reasonably familiar with the package (author: Hadley Wickham), but feel free to suggest other approaches. I wrote a custom function based on the geom_ribbon() function of the ggplot2 package. Early on I faced problems related to my lack of experience in handling the geom_ribbon() function and objects of class 'Date'. The function below represents my effort to solve these problems; almost surely it is roundabout, unnecessarily complicated, clumsy, etc. So my question is: please suggest improvements and/or alternative approaches. Ultimately, it would be great to have a general-purpose function made available here.
Data:
set.seed(123456789)
df <- data.frame(
Date = seq.Date(as.Date("1950-01-01"), by = "1 month", length.out = 12*10),
Stocks = 100 + c(0, cumsum(runif(12*10-1, -30, 30))),
Bonds = 100 + c(0, cumsum(runif(12*10-1, -5, 5))))
library('reshape2')
df <- melt(df, id.vars = 'Date')
Custom Function:
## Function to plot geom_ribbon for class Date
geom_ribbon_date <- function(data, group, N = 1000) {
# convert column of class Date to numeric
x_Date <- as.numeric(data[, which(sapply(data, class) == "Date")])
# append numeric date to dataframe
data$Date.numeric <- x_Date
# ensure fill grid is as fine as data grid
N <- max(N, length(x_Date))
# generate a grid for fill
seq_x_Date <- seq(min(x_Date), max(x_Date), length.out = N)
# ensure the grouping variable is a factor
group <- factor(group)
# create a dataframe of min and max
area <- Map(function(z) {
d <- data[group == z,];
approxfun(d$Date.numeric, d$value)(seq_x_Date);
}, levels(group))
# create a categorical variable for the max
maxcat <- apply(do.call('cbind', area), 1, which.max)
# output a dataframe with x, ymin, ymax, is. max 'dummy', and group
df <- data.frame(x = seq_x_Date,
ymin = do.call('pmin', area),
ymax = do.call('pmax', area),
is.max = levels(group)[maxcat],
group = cumsum(c(1, diff(maxcat) != 0))
)
# convert back numeric dates to column of class Date
df$x <- as.Date(df$x, origin = "1970-01-01")
# create and return the geom_ribbon
gr <- geom_ribbon(data = df, aes(x, ymin = ymin, ymax = ymax, fill = is.max, group = group), inherit.aes = FALSE)
return(gr)
}
Usage:
ggplot(data = df, aes(x = Date, y = value, group = variable, colour = variable)) +
geom_ribbon_date(data = df, group = df$variable) +
theme_bw() +
xlab(NULL) +
ylab(NULL) +
ggtitle("Bonds Versus Stocks (Fake Data!)") +
scale_fill_manual('is.max', breaks = c('Stocks', 'Bonds'),
values = c('darkblue','darkred')) +
theme(legend.position = 'right', legend.direction = 'vertical') +
theme(legend.title = element_blank()) +
theme(legend.key = element_blank())
Result:
While there are related questions and answers on stackoverflow, I haven't found one that was sufficiently detailed for my purpose. Here is a selection of useful exchanges:
create-geom-ribbon-for-min-max-range: Asks a similar question, but provides less detail than I was looking for.
possible-bug-in-geom-ribbon: Closely related, but intermediate steps on how to compute max/min are missing.
fill-region-between-two-loess-smoothed-lines-in-r-with-ggplot: Closely related, but focuses on loess lines. Excellent.
ggplot-colouring-areas-between-density-lines-according-to-relative-position : Closely related, but focuses on densities. This post greatly inspired me.
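The core of geom_ribbon_date() above is the interpolation onto a fine grid followed by the run-length grouping, and it may be easier to follow in isolation. The sketch below uses two made-up series with only two dates each and a five-point grid; the variable names are chosen for this illustration only.

```r
# Two dates as numbers, as in the function above
x <- as.numeric(as.Date(c("1950-01-01", "1950-02-01")))

# Linear interpolators for each series (made-up values that cross once)
stocks <- approxfun(x, c(100, 130))
bonds <- approxfun(x, c(110, 112))

# Evaluate both series on a common grid
grid <- seq(min(x), max(x), length.out = 5)
area <- list(Stocks = stocks(grid), Bonds = bonds(grid))

# Which series is on top at each grid point (1 = Stocks, 2 = Bonds)
maxcat <- unname(apply(do.call(cbind, area), 1, which.max))

# A new ribbon group starts every time the top series changes
ribbon_group <- cumsum(c(1, diff(maxcat) != 0))
print(maxcat)
print(ribbon_group)
```

Each ribbon group is drawn as its own polygon, which is what lets the fill colour change where the series cross. The finer the grid (N), the closer the colour change sits to the true crossing.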
Perhaps I'm not understanding your full problem, but it seems that a fairly direct approach would be to define a third line as the minimum of the two time series at each time point. geom_ribbon is then drawn once for each unique value of Assets, giving the ribbons formed by each of the series and the minimum line. Code could look like:
set.seed(123456789)
df <- data.frame(
Date = seq.Date(as.Date("1950-01-01"), by = "1 month", length.out = 12*10),
Stocks = 100 + c(0, cumsum(runif(12*10-1, -30, 30))),
Bonds = 100 + c(0, cumsum(runif(12*10-1, -5, 5))))
library(reshape2)
library(ggplot2)
df <- cbind(df,min_line=pmin(df[,2],df[,3]) )
df <- melt(df, id.vars=c("Date","min_line"), variable.name="Assets", value.name="Prices")
sp <- ggplot(data=df, aes(x=Date, fill=Assets))
sp <- sp + geom_ribbon(aes(ymax=Prices, ymin=min_line))
sp <- sp + scale_fill_manual(values=c(Stocks="darkred", Bonds="darkblue"))
sp <- sp + ggtitle("Bonds Versus Stocks (Fake Data!)")
plot(sp)
This produces the following chart:
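To see why the minimum line works, here is the same arithmetic on three made-up time points. At each point the series that is lower gets a ribbon of height zero, so only the higher one is visible.

```r
# Made-up values at three time points
stocks <- c(100, 120, 90)
bonds <- c(100, 105, 110)

# Lower envelope of the two series
min_line <- pmin(stocks, bonds)

# Ribbon heights (ymax - ymin) for each series
ribbon_height <- data.frame(Stocks = stocks - min_line,
                            Bonds = bonds - min_line)
print(ribbon_height)
```

One caveat: when the series cross between two time points, both ribbons taper linearly across that whole interval, so they overlap there instead of switching exactly at the crossing. With dense data this is barely visible.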
I actually had the same question some time ago and here is the related post. It defines a function that finds the intersections between two lines and another function that takes a dataframe as input and then colors the space between the two columns using matplot and polygon.
EDIT
Here is the code, modified a bit to allow the last polygon to be plotted
set.seed(123456789)
dat <- data.frame(
Date = seq.Date(as.Date("1950-01-01"), by = "1 month", length.out = 12*10),
Stocks = 100 + c(0, cumsum(runif(12*10-1, -30, 30))),
Bonds = 100 + c(0, cumsum(runif(12*10-1, -5, 5))))
intersects <- function(x1, x2) {
seg1 <- which(!!diff(x1 > x2)) # location of first point in crossing segments
above <- x2[seg1] > x1[seg1] # which curve is above prior to crossing
slope1 <- x1[seg1+1] - x1[seg1]
slope2 <- x2[seg1+1] - x2[seg1]
x <- seg1 + ((x2[seg1] - x1[seg1]) / (slope1 - slope2))
y <- x1[seg1] + slope1*(x - seg1)
data.frame(x=x, y=y, pindex=seg1, pabove=(1:2)[above+1L])
# pabove is greater curve prior to crossing
}
fillColor <- function(data, addLines=TRUE) {
## Find points of intersections
ints <- intersects(data[,2], data[,3]) # because the first column is for Dates
intervals <- findInterval(1:nrow(data), c(0, ints$x))
## Make plot
matplot(data, type="n", col=2:3, lty=1, lwd=4,xaxt='n',xlab='Date')
axis(1,at=seq(1,dim(data)[1],length.out=12),
labels=data[,1][seq(1,dim(data)[1],length.out=12)])
legend("topright", c(colnames(data)[2], colnames(data)[3]), col=3:2, lty=1, lwd=2)
## Draw the polygons
for (i in seq_along(table(intervals))) {
xstart <- ifelse(i == 1, 0, ints$x[i-1])
ystart <- ifelse(i == 1, data[1,2], ints$y[i-1])
xend <- ints$x[i]
yend <- ints$y[i]
x <- seq(nrow(data))[intervals == i]
polygon(c(xstart, x, xend, rev(x)), c(ystart, data[x,2], yend, rev(data[x,3])),
col=ints$pabove[i]%%2+2)
}
# add end of plot
xstart <- ints[dim(ints)[1],1]
ystart <- ints[dim(ints)[1],2]
xend <- nrow(data)
yend <- data[dim(data)[1],2]
x <- seq(nrow(data))[intervals == max(intervals)]
polygon(c(xstart, x, xend, rev(x)), c(ystart, data[x,2], yend, rev(data[x,3])),
col=ints[dim(ints)[1]-1,4]%%2+2)
## Add lines for curves
if (addLines)
invisible(lapply(1:2, function(x) lines(seq(nrow(data)), data[,x], col=x%%2+2, lwd=2)))
}
## Plot the data
fillColor(dat,FALSE)
and the final result is this (with the same data used for the question)
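The crossing-point formula used in intersects() can be checked by hand on a tiny case. The sketch below repeats the same arithmetic for two made-up curves that are observed at index 1 and 2 and cross in between; cross_x and cross_y are names chosen for this illustration.

```r
x1 <- c(0, 2) # first curve at index 1 and 2
x2 <- c(1, 1) # second curve at index 1 and 2

# Index of the first point of the segment where the curves swap order
seg <- which(diff(x1 > x2) != 0)

# Change per index step of each curve on that segment
slope1 <- x1[seg + 1] - x1[seg]
slope2 <- x2[seg + 1] - x2[seg]

# Solve x1[seg] + slope1 * t == x2[seg] + slope2 * t for t, then shift by seg
cross_x <- seg + (x2[seg] - x1[seg]) / (slope1 - slope2)
cross_y <- x1[seg] + slope1 * (cross_x - seg)
print(c(cross_x, cross_y))
```

Here the first curve rises from 0 to 2 while the second stays at 1, so they meet halfway along the segment, at index 1.5 and height 1.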
@walts answer should remain the winner, but while implementing his solution I gave it a tidy update.
library(tidyverse)
set.seed(2345)
# fake data
raw_data <-
tibble(
date = as.Date("2020-01-01") + (1:40),
a = 95 + cumsum(runif(40, min = -20, max = 20)),
b = 55 + cumsum(runif(40, min = -1, max = 1))
)
# the steps
# the 'y' + 'min_line' + 'group' is the right granularity (by date) to
# create 2 separate ribbons
df <-
raw_data %>%
# find min of the two columns
mutate(min_line = pmin(a, b)) %>%
pivot_longer(c(a, b), names_to = "group", values_to = "y") %>%
print()
# the result
ggplot(data = df, aes(x = date, fill = group)) +
geom_ribbon(aes(ymax = y, ymin = min_line)) +
theme_classic()
another option using ggh4x - requires the data to be wide with y for lines 1 and 2 in different columns.
library(ggh4x)
#> Loading required package: ggplot2
set.seed(123456789)
df <- data.frame(
Date = seq.Date(as.Date("1950-01-01"), by = "1 month", length.out = 12*10),
Stocks = 100 + c(0, cumsum(runif(12*10-1, -30, 30))),
Bonds = 100 + c(0, cumsum(runif(12*10-1, -5, 5))))
## The data frame is NOT made long!!
ggplot(data = df, aes(x = Date)) +
stat_difference(aes(ymin = Stocks, ymax = Bonds)) +
scale_fill_brewer(palette = "Set1")
Created on 2022-11-24 with reprex v2.0.2

R, ggplot, separate mean by range of x value

I have a data set that looks like this:
CHROM POS GT DIFF
1 chr01 14653 CT 254
2 chr01 14907 AG 254
3 chr01 14930 AG 23
4 chr01 15190 GA 260
5 chr01 15211 TG 21
6 chr01 16378 TC 1167
Where POS range from 1xxxx to 1xxxxxxx.
And CHROM is a categorical variable that contains values of "chr01" to "chr22" and "chrX".
I want to plot a scatterplot:
y(DIFF) vs. X(POS)
having panels separated by CHROM
grouped by GT (different colors by GT)
I'm creating a ggplot with running average (though not time series data).
What I want is to get average for every 1,000,000 range of POS by GT.
For example,
for x in range(1 ~ 1,000,000) , DIFF average = _____
for x in range(1,000,001 ~ 2,000,000), DIFF average = _____
and I want to plot horizontal lines on the ggplot (coloured by GT).
What I have so far, before applying your function:
After applying your function:
I tried to apply your solution to what I already have, and ran into some problems:
There are different panels, so the mean values differ between panels, but when I apply your code the horizontal mean lines in every panel are identical to those of the first panel.
The panels have different x-axis ranges, so when I apply your function it fills the extra range with the previous horizontal mean line.
Here is my code before:
ggplot(data1, aes(x=POS,y=DIFF,colour=GT)) +
geom_point() +
facet_grid(~ CHROM,scales="free_x",space="free_x") +
theme(strip.text.x = element_text(size=40),
strip.background = element_rect(color='lightblue',fill='lightblue'),
legend.position="top",
legend.title = element_text(size=40,colour="darkblue"),
legend.text = element_text(size=40),
legend.key.size = unit(2.5, "cm")) +
guides(fill = guide_legend(title.position="top",
title = "Legend:GT='REF'+'ALT'"),
shape = guide_legend(override.aes=list(size=10))) +
scale_y_log10(breaks=trans_breaks("log10", function(x) 10^x, n=10)) +
scale_x_continuous(breaks = pretty_breaks(n=3))
This should get you started:
# It saves a lot of headaches to just make factors as you need them
options(stringsAsFactors = FALSE)
library(ggplot2)
library(plyr)
# Here's some made-up data - it always helps if you can post a subset of
# your real data, though. The dput() function is really useful for that.
dat <- data.frame(POS = seq(1, 1e7, by = 1e4))
# Add random GT value
dat$GT <- sample(x = c("CT", "AG", "GA", "TG", "TC"),
size = nrow(dat),
replace = TRUE)
# Group by millions - there are several ways to do this that I can
# never remember, but here's a simple way to split by millions
dat$POSgroup <- floor(dat$POS / 1e6)
# Add an arbitrary DIFF value
dat$DIFF <- rnorm(n = nrow(dat),
mean = 200 * dat$POSgroup,
sd = 300)
# Aggregate the data by GT and POS-group
# Ideally, you'd do this inside of the plot using stat_summary,
# but I couldn't get that to work. Using two datasets in a plot
# is okay, though.
datsum <- ddply(dat, .var = "POSgroup", .fun = function(x) {
# Calculate the mean DIFF value for each GT group in this POSgroup
meandiff <- ddply(x, .var = "GT", .fun = summarise, ymean = mean(DIFF))
# Add the center of the POSgroup range as the x position
meandiff$center <- (x$POSgroup[1] * 1e6) + 0.5e6
# Return the results
meandiff
})
# On the plot, these results will be grouped by both POS and GT - but
# ggplot will only accept one vector for grouping. So make a combination.
datsum$combogroup <- paste(datsum$GT, datsum$POSgroup)
# Plot it
ggplot() +
# First, a layer for the points themselves
# Large numbers of points can get pretty slow - you might try getting
# the plot to work with a subsample (~1000) and then add in the rest of
# your data
geom_point(data = dat,
aes(x = POS, y = DIFF, color = as.factor(GT))) +
# Then another layer for the means. There are a variety of geoms you could
# use here, but crossbar with ymin and ymax set to the group mean
# is a simple one
geom_crossbar(data = datsum, aes(x = center,
y = ymean,
ymin = ..y..,
ymax = ..y..,
color = as.factor(GT),
group = combogroup),
size = 1) +
# Some other niceties
scale_x_continuous(breaks = seq(0, 1e7, by = 1e6)) +
labs(x = "POS", y = "DIFF", color = "GT") +
theme_bw()
Which results in this:
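Regarding the two follow-up problems in the question (identical mean lines in every panel, and lines running past the data): one possible way around both is to include CHROM in the aggregation and to draw each mean as a segment that spans only its own 1,000,000-wide bin. This is only a sketch on made-up data with two chromosomes; the column names bin, xstart and xend are my own.

```r
library(ggplot2)

set.seed(1)
# Made-up data: two chromosomes with different POS ranges
dat <- data.frame(
  CHROM = rep(c("chr01", "chr02"), times = c(300, 200)),
  POS = c(sample(3e6, 300), sample(2e6, 200)),
  GT = sample(c("CT", "AG"), 500, replace = TRUE),
  DIFF = round(rexp(500, rate = 1 / 200)) + 1
)

# Bin 0 covers POS 1..1,000,000, bin 1 covers 1,000,001..2,000,000, ...
dat$bin <- (dat$POS - 1) %/% 1e6

# Mean DIFF per chromosome, genotype and bin
means <- aggregate(DIFF ~ CHROM + GT + bin, data = dat, FUN = mean)
means$xstart <- means$bin * 1e6 + 1
means$xend <- (means$bin + 1) * 1e6

p <- ggplot(dat, aes(x = POS, y = DIFF, colour = GT)) +
  geom_point(alpha = 0.4) +
  # one horizontal segment per CHROM / GT / bin, limited to that bin
  geom_segment(data = means,
               aes(x = xstart, xend = xend, y = DIFF, yend = DIFF)) +
  facet_grid(~ CHROM, scales = "free_x", space = "free_x") +
  scale_y_log10()
print(p)
```

Because means carries a CHROM column, each panel only receives its own segments. If a chromosome ends in the middle of a bin, xend could additionally be clipped to the largest POS of that chromosome.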

How to Create a Graph of Statistical Time Series

I have data in the following format:
Date Year Month Day Flow
1 1953-10-01 1953 10 1 530
2 1953-10-02 1953 10 2 530
3 1953-10-03 1953 10 3 530
I would like to create a graph like this:
Here is my current image and code:
library(ggplot2)
library(plyr)
library(reshape2)
library(scales)
library(lubridate) # month() and day() are used below
## Read Data
df <- read.csv("Salt River Flow.csv")
## Convert Date column to R-recognized dates
df$Date <- as.Date(df$Date, "%m/%d/%Y")
## Finds Water Years (Oct - Sept)
df$WY <- as.POSIXlt(as.POSIXlt(df$Date)+7948800)$year+1900
## Normalizes Water Years so stats can be applied to just months and days
df$w <- ifelse(month(df$Date) %in% c(10,11,12), 1903, 1904)
##Creates New Date (dat) Column
df$dat <- as.Date(paste(df$w,month(df$Date),day(df$Date), sep = "-"))
## Creates new data frame with summarised data by MonthDay
PlotData <- ddply(df, .(dat), summarise,
Min = min(Flow),
Tenth = quantile(Flow, p = 0.10), # was p = 0.05, which is the 5th percentile
TwentyFifth = quantile(Flow, p = 0.25),
Median = quantile(Flow, p = 0.50),
Mean = mean(Flow),
SeventyFifth = quantile(Flow, p = 0.75),
Ninetieth = quantile(Flow, p = 0.90),
Max = max(Flow))
## Melts data so it can be plotted with ggplot
m <- melt(PlotData, id="dat")
## Plots
p <- ggplot(m, aes(x = dat)) +
geom_ribbon(aes(min = TwentyFifth, max = Median), data = PlotData, fill = alpha("black", 0.1), color = NA) +
geom_ribbon(aes(min = Median, max = SeventyFifth), data = PlotData, fill = alpha("black", 0.5), color = NA) +
scale_x_date(labels = date_format("%b"), breaks = date_breaks("month"), expand = c(0,0)) +
geom_line(data = subset(m, variable == "Mean"), aes(y = value), size = 1.2) +
theme_bw() +
geom_line(data = subset(m, variable %in% c("Min","Max")), aes(y = value, group = variable)) +
geom_line(data = subset(m, variable %in% c("Ninetieth","Tenth")), aes(y = value, group = variable), linetype = 2) +
labs(x = "Water Year", y = "Flow (cfs)")
p
I am very close but there are some issues I'm having. First, if you can see a way to improve my code, please let me know. The main problem I ran into was that I needed two dataframes to make this graph: one melted, and one not. The unmelted dataframe was necessary (I think) to create the ribbons. I tried many ways to use the melted dataframe for the ribbons, but there was always a problem with the aesthetic length.
Second, I know to have a legend - and I want one, I need to have something in the aesthetics of each line/ribbon, but I am having trouble getting that to work. I think it would involve scale_fill_manual.
Third, and I don't know if this is possible, I would like to have each month label in between the tick marks, not on them (like in the above image).
Any help is greatly appreciated (especially with creating more efficient code).
Thank you.
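A side note on the water-year step in the code above: 7948800 seconds is 92 days, which pushes October 1 onto January 1 of the following year. A possibly more readable equivalent, sketched here in base R with three made-up dates, is to add one to the calendar year for the months October to December.

```r
# Made-up dates around the start of a water year (October 1)
dates <- as.Date(c("1953-09-30", "1953-10-01", "1954-01-15"))

yr <- as.integer(format(dates, "%Y"))
mo <- as.integer(format(dates, "%m"))

# October to December belong to the next water year
water_year <- yr + (mo >= 10)
print(water_year)
```

This avoids the detour through POSIXlt and does not depend on time zones.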
Something along these lines might get you close with base:
library(lubridate)
library(reshape2)
# simulating data...
Date <- seq(as.Date("1953-10-01"),as.Date("2010-10-01"),by="day")
Year <- year(Date)
Month <- month(Date)
Day <- day(Date)
set.seed(1)
Flow <- rpois(length(Date), 2000)
Data <- data.frame(Date=Date,Year=Year,Month=Month,Day=Day,Flow=Flow)
# use acast to get it in a convenient shape:
PlotData <- acast(Data,Year~Month+Day,value.var="Flow")
# apply for quantiles
Quantiles <- apply(PlotData,2,function(x){
quantile(x,probs=c(1,.9,.75,.5,.25,.1,0),na.rm=TRUE)
})
Mean <- colMeans(PlotData, na.rm=TRUE)
# ugly way to get month tick separators
MonthTicks <- cumsum(table(unlist(lapply(strsplit(names(Mean),split="_"),"[[",1))))
# and finally your question:
plot(1:366,seq(0,max(Flow),length=366),type="n",xlab = "Water Year",ylab="Discharge",axes=FALSE)
polygon(c(1:366,366:1),c(Quantiles["50%",],rev(Quantiles["75%",])),border=NA,col=gray(.6))
polygon(c(1:366,366:1),c(Quantiles["50%",],rev(Quantiles["25%",])),border=NA,col=gray(.4))
lines(1:366,Quantiles["90%",], col = gray(.5), lty=4)
lines(1:366,Quantiles["10%",], col = gray(.5))
lines(1:366,Quantiles["100%",], col = gray(.7))
lines(1:366,Quantiles["0%",], col = gray(.7), lty=4)
lines(1:366,Mean,lwd=3)
axis(1,at=MonthTicks, labels=NA)
text(MonthTicks-15,-100,1:12,pos=1,xpd=TRUE)
axis(2)
The plotting code really isn't that tricky. You'll need to clean up the aesthetics, but polygon() is usually my strategy for shaded regions in plots (confidence bands, whatever).
Perhaps this will get you closer to what you're looking for, using ggplot2 and plyr:
library(ggplot2)
library(plyr)
library(lubridate)
library(scales)
df$MonthDay <- df$Date - years( year(df$Date) + 100 ) #Normalize points to same year
df <- ddply(df, .(Month, Day), mutate, MaxDayFlow = max(Flow) ) #Max flow on day
df <- ddply(df, .(Month, Day), mutate, MinDayFlow = min(Flow) ) #Min flow on day
p <- ggplot(df, aes(x=MonthDay) ) +
geom_smooth(size=2,level=.8,color="black",aes(y=Flow)) + #80% conf. interval
geom_smooth(size=2,level=.5,color="black",aes(y=Flow)) + #50% conf. interval
geom_line( linetype="longdash", aes(y=MaxDayFlow) ) +
geom_line( linetype="longdash", aes(y=MinDayFlow) ) +
labs(x="Month",y="Flow") +
scale_x_date( labels = date_format("%b") ) +
theme_bw()
Edit: Fixed X scale and X scale label
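On the legend question raised above: a common pattern is to map fill (and linetype) to a constant label inside aes() and then assign the actual colours in scale_fill_manual(). The following is a minimal sketch on made-up flows for 30 days, not the asker's data; the labels and grey levels are arbitrary.

```r
library(ggplot2)

set.seed(1)
# Made-up flows: 50 "years" (rows) for each of 30 days (columns)
flows <- matrix(rpois(50 * 30, lambda = 2000), nrow = 50)
stats <- data.frame(
  dat = seq(as.Date("1903-10-01"), by = "day", length.out = 30),
  q25 = apply(flows, 2, quantile, probs = 0.25),
  q50 = apply(flows, 2, quantile, probs = 0.50),
  q75 = apply(flows, 2, quantile, probs = 0.75),
  mean = colMeans(flows)
)

p <- ggplot(stats, aes(x = dat)) +
  # a constant string inside aes() creates a legend key for that layer
  geom_ribbon(aes(ymin = q25, ymax = q50, fill = "25th to 50th")) +
  geom_ribbon(aes(ymin = q50, ymax = q75, fill = "50th to 75th")) +
  geom_line(aes(y = mean, linetype = "Mean")) +
  scale_fill_manual(name = NULL,
                    values = c("25th to 50th" = "grey85",
                               "50th to 75th" = "grey55")) +
  scale_linetype_manual(name = NULL, values = c(Mean = "solid")) +
  theme_bw()
print(p)
```

This works from the unmelted summary data frame alone, so the melted copy is only needed if you prefer to draw all the lines in one layer.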
(Partial answer with base plotting functions, not including the min, max, or mean.) I suspect you will need to construct a dataset before passing it to ggplot, since that is typical for that function. I already do something similar and then pass the resulting matrix to matplot. (It doesn't do that kewl highlighting, but maybe ggplot can do it.)
HDL.mon.mat <- aggregate(dfrm$Flow,
list( dfrm$Year + dfrm$Month/12),
quantile, prob=c(0.1,0.25,0.5,0.75, 0.9), na.rm=TRUE)
matplot(HDL.mon.mat[,1], HDL.mon.mat$x, type="pl")

How to prevent two labels to overlap in a barchart?

The image below shows a chart that I created with the code below. I highlighted the missing or overlapping labels. Is there a way to tell ggplot2 to not overlap labels?
week = c(0, 1, 1, 1, 1, 2, 2, 3, 4, 5)
statuses = c('Shipped', 'Shipped', 'Shipped', 'Shipped', 'Not-Shipped', 'Shipped', 'Shipped', 'Shipped', 'Not-Shipped', 'Shipped')
dat <- data.frame(Week = week, Status = statuses)
p <- qplot(factor(Week), data = dat, geom = "bar", fill = factor(Status))
p <- p + geom_bar()
# Below is the most important line, that's the one which displays the value
p <- p + stat_bin(aes(label = ..count..), geom = "text", vjust = -1, size = 3)
p
You can use a variant of the well-known population pyramid.
Some sample data (code inspired by Didzis Elferts' answer):
set.seed(654)
week <- sample(0:9, 3000, rep=TRUE, prob = rchisq(10, df = 3))
status <- factor(rbinom(3000, 1, 0.15), labels = c("Shipped", "Not-Shipped"))
data.df <- data.frame(Week = week, Status = status)
Compute count scores for each week, then convert one category to negative values:
library("plyr")
plot.df <- ddply(data.df, .(Week, Status), nrow)
plot.df$V1 <- ifelse(plot.df$Status == "Shipped",
plot.df$V1, -plot.df$V1)
Draw the plot. Note that the y-axis labels are adapted to show positive values on either side of the baseline.
library("ggplot2")
ggplot(plot.df) +
aes(x = as.factor(Week), y = V1, fill = Status) +
geom_bar(stat = "identity", position = "identity") +
scale_y_continuous(breaks = 100 * -1:5,
labels = 100 * c(1, 0:5)) +
geom_text(aes(y = sign(V1) * max(V1) / 30, label = abs(V1)))
The plot:
For production purposes you'd need to determine the appropriate y-axis tick labels dynamically.
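One way to get those tick labels dynamically, as far as I can tell, is to pass a function to the labels argument, so that whatever breaks ggplot picks are shown as absolute values. A minimal sketch with made-up counts (geom_col() is used here as shorthand for geom_bar(stat = "identity")):

```r
library(ggplot2)

# Made-up counts; Not-Shipped is already stored as negative values
plot.df <- data.frame(
  Week = rep(0:2, each = 2),
  Status = rep(c("Shipped", "Not-Shipped"), times = 3),
  V1 = c(120, -20, 300, -45, 80, -10)
)

p <- ggplot(plot.df, aes(x = factor(Week), y = V1, fill = Status)) +
  geom_col(position = "identity") +
  # show positive tick labels on both sides of the baseline
  scale_y_continuous(labels = abs) +
  # labels just outside the end of each bar
  geom_text(aes(label = abs(V1), vjust = ifelse(V1 >= 0, -0.3, 1.3)))
print(p)
```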
I made new sample data (inspired by the code of @agstudy).
week <- sample(0:5,1000,rep=TRUE,prob=c(0.2,0.05,0.15,0.5,0.03,0.1))
statuses <- gl(2,1000,labels=c('Not-Shipped', 'Shipped'))
dat <- data.frame(Week = week, Status = statuses)
Using ddply() from the plyr package, I made a new data frame text.df for the labels. The column count contains the number of observations for each combination of Week and Status. I then added a column ypos that contains the cumulative sum of count within each Week, plus 15; this is used as the y position. For Not-Shipped, ypos is replaced with -10.
library(plyr)
text.df<-ddply(dat,.(Week,Status),function(x) data.frame(count=nrow(x)))
text.df<-ddply(text.df,.(Week),transform,ypos=cumsum(count)+15)
text.df$ypos[text.df$Status=="Not-Shipped"]<- -10
Now labels are plotted with geom_text() using new data frame.
ggplot(dat,aes(as.factor(Week),fill=Status))+geom_bar()+
geom_text(data=text.df,aes(x=as.factor(Week),y=ypos,label=count))
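The ypos calculation can be followed on a tiny hand-made table. This sketch uses base R's ave() in place of ddply(), which is my substitution and not what the answer uses, with made-up counts for two weeks.

```r
# Made-up counts for two weeks
text.df <- data.frame(
  Week = c(0, 0, 1, 1),
  Status = c("Not-Shipped", "Shipped", "Not-Shipped", "Shipped"),
  count = c(10, 90, 5, 40)
)

# Cumulative count within each week, plus an offset of 15
text.df$ypos <- ave(text.df$count, text.df$Week, FUN = cumsum) + 15

# The Not-Shipped label goes below the axis instead
text.df$ypos[text.df$Status == "Not-Shipped"] <- -10
print(text.df)
```

The Shipped label ends up 15 units above the top of the stacked bar (10 + 90 + 15 = 115 for week 0), where nothing else is drawn.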
One solution to avoid overlaps is to dodge the positions of the bars and the text labels. To avoid missing labels you can set ylim. Here is an example.
## I create some more realistic data similar to your picture
week <- sample(0:5,1000,rep=TRUE)
statuses <- gl(2,1000,labels=c('Not-Shipped', 'Shipped'))
dat <- data.frame(Week = week, Status = statuses)
## for dodging
dodgewidth <- position_dodge(width=0.9)
## get max y to set ylim
ymax <- max(table(dat$Week,dat$Status))+20
ggplot(dat,aes(x = factor(Week),fill = factor(Status))) +
geom_bar( position = dodgewidth ) +
# recent ggplot2 versions need stat_count() instead of stat_bin() for a discrete x
stat_count(geom="text", position= dodgewidth, aes( label=..count..),
vjust=-1,size=5)+
ylim(0,ymax)
Based on Didzis plot you could also increase readability by keeping the position on the y axis constant and by colouring the text in the same colour as the legend.
library(ggplot2)
week <- sample(0:5,1000,rep=TRUE,prob=c(0.2,0.05,0.15,0.5,0.03,0.1))
statuses <- gl(2,1000,labels=c('Not-Shipped', 'Shipped'))
dat <- data.frame(Week = week, Status = statuses)
library(plyr)
text.df<-ddply(dat,.(Week,Status),function(x) data.frame(count=nrow(x)))
text.df$ypos[text.df$Status=="Not-Shipped"]<- -15
text.df$ypos[text.df$Status=="Shipped"]<- -55
p <- ggplot(dat,aes(as.factor(Week),fill=Status))+geom_bar()+
geom_text(data=text.df,aes(x=as.factor(Week),y=ypos,label=count),colour=ifelse(text.df$Status=="Not-Shipped","#F8766D","#00BFC4"))
