Plot standard deviation - r

I want to plot the standard deviation for one line (one flow series; the plot will have two) as lines or a shaded area. I've seen and applied code from other examples of plotting sd, but it's not working for me.
My original data has several flow values for the same day, from which I've calculated the daily mean and sd. I'm stuck here: I don't know whether the daily sd can be drawn from the column I created (called "sd") or whether I should use the original data.
The code below is a general example of what I'll apply to my data. The flow, flow1 and sd columns stand in for the daily mean and sd calculated from the original data.
library(gridExtra)
library(ggplot2)
library(grid)
x <- data.frame(
date = seq(as.Date("2012-01-01"),as.Date("2012-12-31"), by="week"),
rain = sample(0:20,53,replace=T),
flow1 = sample(50:150,53,replace=T),
flow = sample(50:200,53,replace=T),
sd = sample (0:10,53, replace=T))
g.top <- ggplot(x, aes(x = date, y = rain, ymin=0, ymax=rain)) +
geom_linerange() +
scale_y_continuous(limits=c(22,0),expand=c(0,0), trans="reverse")+
theme_classic() +
theme(plot.margin = unit(c(5,5,-32,6),units="points"),
axis.title.y = element_text(vjust = 0.3))+
labs(y = "Rain (mm)")
g.bottom <- ggplot(x, aes(x = date)) +
geom_line(aes(y = flow, colour = "flow")) +
geom_line(aes(y = flow1, colour = "flow1")) +
stat_summary(geom="ribbon", fun.ymin="min", fun.ymax="max", aes(fill=sd), alpha=0.3) +
theme_classic() +
theme(plot.margin = unit(c(0,5,1,1),units="points"),legend.position="bottom") +
labs(x = "Date", y = "River flow (m/s)")
grid.arrange(g.top, g.bottom , heights = c(1/5, 4/5))
The above code gives Error: stat_summary requires the following missing aesthetics: y
Another option is geom_smooth, but as far as I understand it requires some line equation (I may be wrong, I'm new to R).

Something like this maybe?
library(dplyr)  # %>%, select(), mutate()
library(tidyr)  # gather()
g.bottom <- x %>%
select(date, flow1, flow, sd) %>%
gather(key, value, c(flow, flow1)) %>%
mutate(min = value - sd, max = value + sd) %>%
ggplot(aes(x = date)) +
geom_ribbon(aes(ymin = min, ymax = max, fill = key)) +
geom_line(aes(y = value, colour = key)) +
scale_fill_manual(values = c("grey", "grey")) +
theme_classic() +
theme(plot.margin = unit(c(0,5,1,1),units="points"),legend.position="bottom") +
labs(x = "Date", y = "River flow (m/s)")
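On the question of whether to start from the original data: a minimal sketch, assuming the raw readings sit in a data frame with one row per reading (the names raw, daily and p_sd and the simulated values are made up for illustration). It computes the daily mean and sd, then draws the sd band for that one series only with geom_ribbon:

```r
library(ggplot2)

# Hypothetical raw data: five flow readings per day for 30 days
set.seed(1)
raw <- data.frame(
  date = rep(seq(as.Date("2012-01-01"), by = "day", length.out = 30), each = 5),
  flow = runif(150, min = 50, max = 200)
)

# Daily mean and standard deviation, one row per date
daily <- aggregate(flow ~ date, data = raw, FUN = mean)
names(daily)[2] <- "mean"
daily$sd <- aggregate(flow ~ date, data = raw, FUN = sd)$flow

# Shaded band of mean +/- sd behind the mean line
p_sd <- ggplot(daily, aes(x = date)) +
  geom_ribbon(aes(ymin = mean - sd, ymax = mean + sd),
              fill = "grey70", alpha = 0.5) +
  geom_line(aes(y = mean)) +
  theme_classic() +
  labs(x = "Date", y = "River flow")
```

A second series without a band can then be added with another geom_line() layer that is given its own data argument.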

Related

How to smooth out a time-series geom_area with fill in ggplot?

I have the following graph and code:
Graph
ggplot(long2, aes(x = DATA, y = value, fill = variable)) + geom_area(position="fill", alpha=0.75) +
scale_y_continuous(labels = scales::comma,n.breaks = 5,breaks = waiver()) +
scale_fill_viridis_d() +
scale_x_date(date_labels = "%b/%Y",date_breaks = "6 months") +
ggtitle("Proporcions de les visites, només 9T i 9C") +
xlab("Data") + ylab("% visites") +
theme_minimal() + theme(legend.position="bottom") + guides(fill=guide_legend(title=NULL)) +
annotate("rect", fill = "white", alpha = 0.3,
xmin = as.Date("2020-03-16"), xmax = as.Date("2020-06-22"),
ymin = 0, ymax = 1)
But the areas show a sawtooth pattern. How can I smooth it out?
I believe your situation is roughly analogous to the following, wherein we have missing x-positions for one group, but not the other at the same position. This causes spikes if you set position = "fill".
library(ggplot2)
x <- seq_len(100)
df <- data.frame(
x = c(x[-c(25, 75)], x[-50]),
y = c(cos(x[-c(25, 75)]), sin(x[-50])) + 5,
group = rep(c("A", "B"), c(98, 99))
)
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "fill")
To smooth out these spikes, it has been suggested to linearly interpolate the data at the missing positions.
# Find all used x-positions
ux <- unique(df$x)
# Split data by group, interpolate data groupwise
df <- lapply(split(df, df$group), function(xy) {
approxed <- approx(xy$x, xy$y, xout = ux)
data.frame(x = ux, y = approxed$y, group = xy$group[1])
})
# Recombine data
df <- do.call(rbind, df)
# Now without spikes :)
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "fill")
Created on 2022-06-17 by the reprex package (v2.0.1)
P.S. I would also have expected a red spike at x=50, but for some reason this didn't happen.
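To see what approx() contributes here, a tiny sketch with made-up numbers (xy and filled are illustrative names): one x-position is missing, and xout asks for values at every position, so the gap is filled by linear interpolation.

```r
# Toy series with no observation at x = 3
xy <- data.frame(x = c(1, 2, 4, 5), y = c(10, 20, 40, 50))

# Interpolate linearly at every position 1..5
filled <- approx(xy$x, xy$y, xout = 1:5)
filled$y
#> [1] 10 20 30 40 50
```

By default approx() returns NA for xout values outside the observed range of a group (rule = 1), so a group that starts later or ends earlier than the others may still need separate handling.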

sankey/alluvial diagram with percentage and partial fill in R

I would like to modify an existing Sankey plot made with ggplot2 and ggalluvial to make it more appealing.
My example is from https://corybrunson.github.io/ggalluvial/articles/ggalluvial.html
library(ggplot2)
library(ggalluvial)
data(vaccinations)
levels(vaccinations$response) <- rev(levels(vaccinations$response))
ggplot(vaccinations,
aes(x = survey, stratum = response, alluvium = subject,
y = freq,
fill = response, label = response)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(alpha = .5) +
geom_text(stat = "stratum", size = 3) +
theme(legend.position = "none") +
ggtitle("vaccination survey responses at three points in time")
Created on 2020-10-01 by the reprex package (v0.3.0)
Now I would like to change this plot so that it looks similar to a plot from https://sciolisticramblings.wordpress.com/2018/11/23/sankey-charts-the-new-pie-chart/, i.e. (1) change absolute to relative values (percentages), (2) add percentage labels, and (3) apply partial fill (e.g. "missing" and "never").
My approach:
I think I could change the axis to percentage with something like: scale_y_continuous(label = scales::percent_format(scale = 100))
However, I am not sure about step 2. and 3.
This could be achieved like so:
Changing to percentages could be achieved by adding a new column to your df with the percentage shares by survey, which can then be mapped on y instead of freq.
To get nice percentage labels you can make use of scale_y_continuous(label = scales::percent_format())
For the partial filling you can map e.g. response %in% c("Missing", "Never") on fill (which gives TRUE for "Missing" and "Never") and set the fill colors via scale_fill_manual
The percentages of each stratum can be added to the label via label = paste0(..stratum.., "\n", scales::percent(..count.., accuracy = .1)) in geom_text, which uses the variables ..stratum.. and ..count.. computed by stat_stratum. (From ggplot2 3.4.0 on, after_stat(stratum) and after_stat(count) are the preferred spelling of the same variables.)
library(ggplot2)
library(ggalluvial)
library(dplyr)
data(vaccinations)
levels(vaccinations$response) <- rev(levels(vaccinations$response))
vaccinations <- vaccinations %>%
group_by(survey) %>%
mutate(pct = freq / sum(freq))
ggplot(vaccinations,
aes(x = survey, stratum = response, alluvium = subject,
y = pct,
fill = response %in% c("Missing", "Never"),
label = response)) +
scale_x_discrete(expand = c(.1, .1)) +
scale_y_continuous(label = scales::percent_format()) +
scale_fill_manual(values = c(`TRUE` = "cadetblue1", `FALSE` = "grey50")) +
geom_flow() +
geom_stratum(alpha = .5) +
geom_text(aes(label = paste0(..stratum.., "\n", scales::percent(..count.., accuracy = .1))), stat = "stratum", size = 3) +
theme(legend.position = "none") +
ggtitle("vaccination survey responses at three points in time")
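As a quick sanity check on the percentage step, a small sketch with a made-up data frame (toy and its values are assumptions, not the vaccinations data): after group_by(survey), the shares within each survey should sum to 1.

```r
library(dplyr)

# Hypothetical counts for two surveys and three responses
toy <- data.frame(
  survey   = rep(c("s1", "s2"), each = 3),
  response = rep(c("a", "b", "c"), times = 2),
  freq     = c(10, 30, 60, 5, 5, 10)
)

# Share of each response within its survey
toy <- toy %>%
  group_by(survey) %>%
  mutate(pct = freq / sum(freq)) %>%
  ungroup()

# Each survey's shares add up to 1
tapply(toy$pct, toy$survey, sum)
```

Calling ungroup() afterwards is optional for plotting, but it avoids surprises if the data frame is summarised again later.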

How to add value labels on the flows of an alluvial/Sankey plot (ggalluvial in R)?

I'm looking to label the "flow" portion of an alluvial/Sankey chart in R.
The strata (columns) can easily be labelled, but not the flows connecting them. Reading the documentation and experimenting have both been to no avail.
In the sample below, "freq" should appear as a label on the flows.
library(ggplot2)
library(ggalluvial)
data(vaccinations)
levels(vaccinations$response) <- rev(levels(vaccinations$response))
ggplot(vaccinations,
aes(x = survey, stratum = response, alluvium = subject,
y = freq,
fill = response, label = freq)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(alpha = .5) +
geom_text(stat = "stratum", size = 3) +
theme(legend.position = "bottom") +
ggtitle("vaccination survey responses at three points in time")
There is an option to take the raw numbers and use these as labels for the flow part:
g <- ggplot(vaccinations,
aes(x = survey, stratum = response, alluvium = subject,
y = freq,
fill = response, label = freq)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(alpha = .5) +
geom_text(stat = "stratum", size = 3) +
geom_text(stat = "flow", nudge_x = 0.2) +
theme(legend.position = "bottom") +
ggtitle("vaccination survey responses at three points in time")
g  # print the plot; the object is reused further down
If you want more control over how to label these points, you can extract the layer data and do computations on that. For example we can compute the fractions for only the starting positions as follows:
# Assume 'g' is the previous plot object saved under a variable
newdat <- layer_data(g)
newdat <- newdat[newdat$side == "start", ]
split <- split(newdat, interaction(newdat$stratum, newdat$x))
split <- lapply(split, function(dat) {
dat$label <- dat$label / sum(dat$label)
dat
})
newdat <- do.call(rbind, split)
ggplot(vaccinations,
aes(x = survey, stratum = response, alluvium = subject,
y = freq,
fill = response, label = freq)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(alpha = .5) +
geom_text(stat = "stratum", size = 3) +
geom_text(data = newdat, aes(x = xmin + 0.4, y = y, label = format(label, digits = 1)),
inherit.aes = FALSE) +
theme(legend.position = "bottom") +
ggtitle("vaccination survey responses at three points in time")
Where exactly to place the labels is still something of a judgement call. Placing them at the start is the easy way, but if you want the labels roughly in the middle and dodging one another, that would require some further processing.

ggplot2 and facet_grid : add highest value for each plot

I am using facet_grid() to draw multiple plots, one per group of data. For each plot, I want to add the highest Y value in the corner. I've tried several hacks, but none gives the expected result. This answer partially helps, but the value I want to add will keep changing, so I don't see how to apply it.
Here is a minimal example, I'd like to add the red numbers on the graph below:
library(ggplot2)
data <- data.frame('group'=rep(c('A','B'),each=4),'hour'=rep(c(1,2,3,4),2),'value'=c(5,4,2,3,6,7,4,5))
ggplot(data,aes(x = hour, y = value)) +
geom_line() +
geom_point() +
theme(aspect.ratio=1) +
scale_x_continuous(name ="hours", limits=c(1,4)) +
scale_y_continuous(limits=c(1,10),breaks = seq(1, 10, by = 2))+
facet_grid( ~ group)
Thanks for your help!
library(dplyr)
data2 <- data %>% group_by(group) %>% summarise(Max = max(value))
ggplot(data,aes(x = hour, y = value)) +
geom_line() +
geom_point() +
geom_text(aes(label = Max), data = data2, x = Inf, y = Inf,
hjust = 2, vjust = 2, col = 'red') +
theme(aspect.ratio=1) +
scale_x_continuous(name ="hours", limits=c(1,4)) +
scale_y_continuous(limits=c(1,10),breaks = seq(1, 10, by = 2))+
facet_grid( ~ group)
This does the trick. If you always have fixed ranges you can position the text manually.
library(ggplot2)
data <- data.frame('group'=rep(c('A','B'),each=4),'hour'=rep(c(1,2,3,4),2),'value'=c(5,4,2,3,6,7,4,5))
ggplot(data,aes(x = hour, y = value)) +
geom_line() +
geom_point() +
geom_text(
aes(x, y, label=lab),
data = data.frame(
x=Inf,
y=Inf,
lab=tapply(data$value, data$group, max),
group=unique(data$group)
),
vjust="inward",
hjust = "inward"
) +
theme(aspect.ratio=1) +
scale_x_continuous(name ="hours", limits=c(1,4)) +
scale_y_continuous(limits=c(1,10),breaks = seq(1, 10, by = 2))+
facet_grid( ~ group)
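A variant of the same idea, sketched under the assumption that keeping the group and its maximum in one keyed data frame is preferable to building the two columns separately (labs_df and p_max are made-up names). aggregate() returns one row per group, so each label is tied to its facet by name rather than by position:

```r
library(ggplot2)

data <- data.frame(
  group = rep(c("A", "B"), each = 4),
  hour  = rep(1:4, times = 2),
  value = c(5, 4, 2, 3, 6, 7, 4, 5)
)

# One row per group with its maximum value
labs_df <- aggregate(value ~ group, data = data, FUN = max)

# The label layer uses labs_df, so facet_grid matches it on 'group'
p_max <- ggplot(data, aes(x = hour, y = value)) +
  geom_line() +
  geom_point() +
  geom_text(
    data = labs_df,
    aes(x = Inf, y = Inf, label = value),
    hjust = "inward", vjust = "inward", colour = "red"
  ) +
  facet_grid(~ group)
```

Because the maximum is recomputed from data each time, the labels follow the data when the values change.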

Wrong location of errorbars

I am trying to make a plot with multiple lines, with stat_summary marking the mean values. When I add geom_errorbar(), some of the error bars are drawn at a distance from the mean points, so they appear to be 'flying'. What is happening?
Thanks!
My code:
library(plyr)  # provides ddply()
#First I build another data set with the mean, SD and SE.
cdata <- ddply(data2, c("OGTT","Treatment"), summarise,
N = sum(!is.na(Glucose)),
mean = mean(Glucose, na.rm=TRUE),
sd = sd(Glucose, na.rm=TRUE),
se = sd / sqrt(N))
#Then I merge it with my original data
totalglu<-merge(data2,cdata)
#Then I make the ggplot
p<-ggplot(data=totalglu, aes(x = factor(OGTT), y = Glucose, group = StudyID, color=StudyID)) +
geom_line() +
facet_grid(End.start ~Treatment)+
stat_summary(aes(group = Treatment), geom = "point", fun.y = mean, shape = 16, size = 2) +
theme(legend.position="none") +
labs(x = "OGTT time points (min)",y= "Glucose (mmol/l)")+
geom_errorbar(aes(ymin=mean-se,ymax=mean+se), width=.1, colour="black")
p
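It appears that End.start is not among the grouping variables when you calculate the bars, yet the plot is faceted by it, so stat_summary computes its means separately for each End.start panel. After the merge, both panels receive the same mean and se pooled over the End.start levels, and the error bars are centred on that pooled mean rather than on the points drawn in the panel.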
Try:
cdata <- ddply(data2, c("OGTT","Treatment","End.start"), summarise,
N = sum(!is.na(Glucose)),
mean = mean(Glucose, na.rm=TRUE),
sd = sd(Glucose, na.rm=TRUE),
se = sd / sqrt(N))
#Then I merge it with my original data
totalglu<-merge(data2,cdata)
#Then I make the ggplot
p<-ggplot(data=totalglu, aes(x = factor(OGTT), y = Glucose, group = StudyID, color=StudyID)) +
geom_line() +
facet_grid(End.start ~Treatment)+
stat_summary(aes(group = Treatment), geom = "point", fun.y = mean, shape = 16, size = 2) +
theme(legend.position="none") +
labs(x = "OGTT time points (min)",y= "Glucose (mmol/l)")+
geom_errorbar(aes(ymin=mean-se,ymax=mean+se), width=.1, colour="black")
p
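Though, without the actual starting data, I am not quite sure what data2 looks like, or how ddply is affecting things. Instead, I might suggest skipping cdata altogether and letting stat_summary compute the summary within each panel. Note that mean_cl_normal draws a 95% confidence interval and needs the Hmisc package installed; mean_se gives mean ± 1 SE instead.
ggplot(data=data2, aes(x = factor(OGTT), y = Glucose, group = StudyID, color=StudyID)) +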
geom_line() +
facet_grid(End.start ~Treatment)+
stat_summary(aes(group = Treatment), fun.data = mean_cl_normal) +
theme(legend.position="none") +
labs(x = "OGTT time points (min)",y= "Glucose (mmol/l)")
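Why the bars drift can be seen with a few made-up numbers (d, pooled and per_panel are illustrative names, not the asker's data): a summary that leaves out the faceting variable pools both panels, so its mean sits between the per-panel means that stat_summary draws.

```r
# Hypothetical measurements: two panels, two time points, two subjects each
d <- data.frame(
  panel   = rep(c("start", "end"), each = 4),
  time    = rep(c(0, 30), times = 4),
  glucose = c(5, 7, 5.4, 7.4, 6, 9, 6.4, 9.4)
)

# Summary without the facet variable: one pooled mean per time point
pooled <- aggregate(glucose ~ time, data = d, FUN = mean)

# Summary including the facet variable: one mean per time point and panel
per_panel <- aggregate(glucose ~ time + panel, data = d, FUN = mean)

pooled     # means 5.7 and 8.2
per_panel  # means 5.2 and 7.2 (start), 6.2 and 9.2 (end)
```

Error bars built from pooled would be centred at 5.7 and 8.2 in both panels, which matches neither panel's points.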

Resources