How to stack partially matched time periods with geom_area (ggplot2)?

With the following example, I get a plot where the areas are not stacked. I would like to stack them. This should be a partial stack, intensity starting at 0.5, then reaching 0.8 where stacked, then reaching 0.3 at the end.
I assume that the position argument does not work as the start and end date are not the same.
Am I missing an argument that could solve this issue? Or maybe another geom?
Do I have to break the data down into days to get the desired output? If so, how can I achieve that?
Thanks in advance,
# Library
library(tidyverse)
library(lubridate)
# Data
df <- tibble(date_debut = as_date(c("2022-09-28", "2022-10-05")),
intensity = c(0.5, 0.3),
duration = days(c(14, 10)),
type = (c("a", "b")))
# Adjustment
df <- df %>%
mutate(date_fin = date_debut + duration) %>%
pivot_longer(cols = c(date_debut, date_fin),
names_to = "date_type",
values_to = "date")
# Plot
df %>%
ggplot(aes(x = date, y = intensity, fill = type))+
geom_area(position = "stack")

This is a tough data wrangling problem. The area plots only stack where the points in the two series have the same x values. The following will achieve that, though it's quite a profligate approach.
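# df here is the original tibble from the question, before the pivot_longer step
df %>%
# Store each period as a lubridate interval
mutate(interval = interval(date_debut, date_debut + duration)) %>%
group_by(type) %>%
# Put both types on one shared per-minute grid covering all periods.
# Note: dplyr >= 1.1.0 deprecates multi-row summarize() results in favour of reframe().
summarize(time = seq(as.POSIXct(min(df$date_debut)),
as.POSIXct(max(df$date_debut + df$duration)), by = 'min'),
# Intensity inside the type's own period, 0 elsewhere
intensity = ifelse(time %within% interval, intensity, 0)) %>%
ggplot(aes(x = time, y = intensity, fill = type)) +
geom_area(position = position_stack())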

Allan Cameron's answer inspired me to look further into complete.
The proposed answer solved my problem, so I accepted it. However, it is indeed more complex than needed.
I solved it this way:
# Adjustment
df <- df %>%
mutate(date_fin = date_debut + duration) %>%
group_by(type) %>%
complete(date_debut = seq(min(date_debut), max(date_fin), by = "1 day")) %>%
fill(intensity) %>%
select(date_debut, intensity, type)
ggplot(df, aes(x = date_debut, y = intensity, fill = type)) +
geom_area()+
scale_x_date(date_labels = "%d",
date_breaks = "1 day")
To avoid the weird empty space, it is fine for me to use geom_col (the question was about geom_area, so no worries).
ggplot(df, aes(x = date_debut, y = intensity, fill = type, colour = type)) +
geom_col(width = 0.95)+
scale_x_date(date_labels = "%d",
date_breaks = "1 day")
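A minimal sketch of another way to line the two series up, assuming it is acceptable to treat each type as having intensity 0 outside its own period: build one daily grid shared by all types, then stack. The names periods, grid, daily and totals are made up for this illustration, and the data is re-created with plain Date arithmetic instead of lubridate.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data from the question, rebuilt with base Date arithmetic
periods <- tibble(
  type = c("a", "b"),
  date_debut = as.Date(c("2022-09-28", "2022-10-05")),
  date_fin = date_debut + c(14, 10),
  intensity = c(0.5, 0.3)
)

# One daily grid shared by every type
grid <- seq(min(periods$date_debut), max(periods$date_fin), by = "day")

daily <- expand_grid(type = periods$type, date = grid) %>%
  left_join(periods, by = "type") %>%
  # keep the intensity inside the type's period, use 0 outside it
  mutate(intensity = if_else(date >= date_debut & date <= date_fin, intensity, 0)) %>%
  select(type, date, intensity)

# Stacked height per day, to compare with the levels expected in the question
totals <- daily %>%
  group_by(date) %>%
  summarise(total = sum(intensity))

p <- ggplot(daily, aes(x = date, y = intensity, fill = type)) +
  geom_area(position = "stack")
```

Printing p draws the plot. In totals, the days where only a is active, the overlapping days, and the final days covered only by b correspond to the three levels asked for in the question. At daily resolution the area is still drawn as a line between a zero day and the next non-zero day, so the edges will be sloped rather than vertical.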

Related

Difference between geom_col() and geom_point() for same value

So, I'm trying to plot missing values here over time (longitudinal data).
I would prefer placing them in a geom_col() to fill up with colours of certain treatments afterwards. But for some weird reason, geom_col() gives me weird values, while geom_point() gives me the correct values using the same function. I'm trying to wrap my head around why this is happening. Take a look at the y-axis.
Disclaimer:
I know the missing values disappear on days 19-20. This is why I'm making the plot.
Sorry about the layout of the plot. It is not polished yet.
For the geom_point:
gaussian_transformed %>% group_by(factor(time)) %>% mutate(missing = sum(is.na(Rose_width))) %>% ggplot(aes(x = factor(time), y = missing)) + geom_point()
Picture: geom_point
For the geom_col:
gaussian_transformed %>% group_by(factor(time)) %>% mutate(missing = sum(is.na(Rose_width))) %>% ggplot(aes(x = factor(time), y = missing)) + geom_col()
Picture: geom_col
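The problem is that you're using mutate, which keeps every row of each group, so the same count is repeated on several rows. You cannot see it, but plenty of points overlap in your geom_point plot. geom_col, in contrast, stacks one bar segment per row, so the repeated counts add up.
One way to fix this is to use summarise; another is to use distinct.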
Compare
library(tidyverse)
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
ggplot(aes(x = order, y = missing)) +
geom_point()
The points look ugly because there is a lot of overplotting.
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
distinct(order, .keep_all = TRUE) %>%
ggplot(aes(x = order, y = missing)) +
geom_col()
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
ggplot(aes(x = order, y = missing)) +
geom_col()
Created on 2021-06-02 by the reprex package (v2.0.0)
So after some digging:
What happened was that geom_col() sums up all the missing values while geom_point() does not, hence the large values for y. Why this happens, I do not know. However, doing the following worked fine for me:
gaussian_transformed$time <- as.factor(gaussian_transformed$time)
gaussian_transformed %>% group_by(time) %>% summarise(missing = sum(is.na(Rose_width))) -> gaussian_transformed
gaussian_transformed %>% ggplot(aes(x = time, y = missing)) + geom_col(fill = "blue", alpha = 0.5) + theme_minimal() + labs(title = "Missing values in Gaussian Outcome over the days", x = "Time (in days)", y = "Amount of missing values") + scale_y_continuous(breaks = seq(0, 10, 1))
With the plot: GaussianMissing
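The summing is consistent with geom_col() using position = "stack" by default: every row at the same x becomes its own bar segment, and the segments are piled on top of each other. A small sketch with made-up data (the toy tibble and its column names are hypothetical) that reproduces the arithmetic without plotting:

```r
library(dplyr)

# Hypothetical data: three rows on day 1 (two missing), two rows on day 2 (one missing)
toy <- tibble(
  time = c(1, 1, 1, 2, 2),
  width = c(NA, 2, NA, 3, NA)
)

# mutate() keeps all five rows and repeats the group count on each of them
with_mutate <- toy %>%
  group_by(time) %>%
  mutate(missing = sum(is.na(width))) %>%
  ungroup()

# summarise() returns one row per group
with_summarise <- toy %>%
  group_by(time) %>%
  summarise(missing = sum(is.na(width)))

# Height a stacked bar would reach if every mutate() row were drawn as a segment
bar_height <- with_mutate %>%
  group_by(time) %>%
  summarise(bar_height = sum(missing))
```

Here with_summarise holds the actual counts, while bar_height is the count multiplied by the number of rows in each group, which is the kind of inflated value seen on the y-axis of the geom_col plot.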

show gap for missing date in geom area

I would like to plot the time series of my data. However, there are some gaps in the dates, as in the example below. The following code produces the plot but disregards the missing dates. How can I show the missing dates, i.e. show a gap between 2021-01-02 and 2021-01-04, and similarly between 2021-01-06 and 2021-01-08?
library(tidyverse)
fake.data <- data.frame(
varA = c(0.6,0.5,0.2,0.3,0.7),
varB = c(0.1,0.2,0.4,0.6,0.2),
varC = c(0.3,0.3,0.4,0.1,0.1),
start_date = as.Date(c('2021-01-01','2021-01-02','2021-01-04','2021-01-06','2021-01-08')),
stringsAsFactors = FALSE
)
fake.data %>%
gather(variable, value,varA:varC) %>%
ggplot(aes(x = start_date, y = value, fill = variable)) +
geom_area()
I guess the easiest would be to fake the gaps, e.g., with geom_rect.
Consider that "gaps in data" are inherent to most uses of line and area graphs. Some purists might even object to drawing lines or areas for non-continuous measurements at all, because doing so suggests continuous measurement. Since the values between observations are interpolated anyway, you could argue that there is no need to show those gaps.
library(tidyverse)
fake.data <- data.frame(
varA = c(0.6,0.5,0.2,0.3,0.7),
varB = c(0.1,0.2,0.4,0.6,0.2),
varC = c(0.3,0.3,0.4,0.1,0.1),
start_date = as.Date(c('2021-01-01','2021-01-02','2021-01-04','2021-01-06','2021-01-08'))
) %>% pivot_longer(cols = matches("^var"), names_to = "variable", values_to = "value" )
ls_data <- setNames(fake.data %>%
complete(start_date = full_seq(start_date, 1)) %>%
split(., is.na(.$variable)), c("vals", "missing"))
ggplot(ls_data$vals, aes(x = start_date, y = value, fill = variable)) +
geom_area() +
geom_rect(data = ls_data$missing, aes(xmin = start_date-.5, xmax = start_date+.5,
ymin = 0, ymax = Inf), fill = "white") +
theme_classic()
Created on 2021-04-21 by the reprex package (v2.0.0)
Considering the above, I'd possibly favour not showing the gaps explicitly, and instead showing the measurements more explicitly, e.g. with geom_point.
fake.data %>%
ggplot(aes(x = start_date, y = value, fill = variable)) +
geom_area() +
geom_point(position = "stack") +
geom_line(position = "stack")
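To see what the complete() and full_seq() step earlier in this answer produces, here is a small sketch on the same data. The object names long, completed and missing_days are only for illustration.

```r
library(dplyr)
library(tidyr)

fake.data <- data.frame(
  varA = c(0.6, 0.5, 0.2, 0.3, 0.7),
  varB = c(0.1, 0.2, 0.4, 0.6, 0.2),
  varC = c(0.3, 0.3, 0.4, 0.1, 0.1),
  start_date = as.Date(c("2021-01-01", "2021-01-02", "2021-01-04",
                         "2021-01-06", "2021-01-08"))
)

# Long format: one row per date and variable
long <- fake.data %>%
  pivot_longer(cols = matches("^var"), names_to = "variable", values_to = "value")

# full_seq() fills in every day between the first and last date;
# complete() adds a row for each date that was absent, with NA elsewhere
completed <- long %>%
  complete(start_date = full_seq(start_date, 1))

# The dates that were added, i.e. the gaps
missing_days <- completed %>%
  filter(is.na(variable)) %>%
  pull(start_date)
```

The rows with an NA variable are the ones the answer splits off into ls_data$missing and covers with white rectangles.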
Is this close to what you want?
todateseq<-fake.data %>%
select(start_date) %>%
pull
first <- min(todateseq)
last <- max(todateseq)
date_seq <- seq.Date(first,last,by='day')
fake.data2 <- data.frame(start_date=date_seq) %>%
left_join(fake.data,by='start_date')
fake.data2 %>%
gather(variable, value,varA:varC) %>%
mutate(value=ifelse(is.na(value),0,value)) %>%
ggplot(aes(x = start_date, y = value, fill = variable)) +
geom_area(na.rm = F,position = position_stack())

How to put the text for each point of plan in r

I am new to R and I have a problem adding text to each point in the xy-plane. Assume that I have the data frame below:
library (dplyr)
library(ggplot2)
dat <- data.frame(
time = factor(c("Breakfast","Breakfast","Breakfast","Lunch","Lunch","Lunch","Dinner","Dinner","Dinner"), levels=c("Breakfast","Lunch","Dinner")),
total_bill_x = c(12.75,14.89,20.5,17.23,30.3,27.8,20.7,32.3,25.4), total_bill_y= c(20.75,15.29,18.52,19.23,27.3,23.6,19.75,27.3,21.48)
)
and here is my code:
dat %>%
group_by(time) %>%
summarise(
x = sum(total_bill_x),
y = sum(total_bill_y)
)%>%
ggplot(.,aes(x,y, col = time)) +
geom_point()
I know that we should use geom_text, but I don't know which argument to add so that it shows which point represents breakfast, lunch, or dinner.
Any help for this would be much appreciated.
You can use geom_text(aes(label = time), nudge_y = 0.5). nudge_y shifts the labels vertically. If you want to move them horizontally, use nudge_x.
dat %>%
group_by(time) %>% # group your data
summarise(
x = sum(total_bill_x),
y = sum(total_bill_y) # sum per group (a sum, not a median)
)%>%
ggplot(.,aes(x,y, col = time)) +
geom_point() +
geom_text(aes(label = time), nudge_y = 0.5)

how to accumulated multiple columns of a data.frame in R?

I am trying to find the accumulated values for each year of variables A to Z in myData. I have tried a few things but didn't succeed. Once I have done that, I then need to compute the maximum, minimum, median, and upper and lower quartiles across all those years. Here is my laborious code so far, but I don't have any idea how to proceed further. In fact, the current code is also not giving me what I am after.
library(tidyverse)
mydate <- as.data.frame(seq(as.Date("2000-01-01"), to= as.Date("2019-12-31"), by="day"))
colnames(mydate) <- "Date"
Data <- data.frame(A = runif(7305,0,10),
J = runif(7305,0,8),
X = runif(7305,0,12),
Z = runif(7305,0,10))
DF <- data.frame(mydate, Data)
myData <- DF %>% separate(Date, into = c("Year","Month","Day")) %>%
sapply(as.numeric) %>%
as.data.frame() %>%
mutate(Date = DF$Date) %>%
filter(Month > 4 & Month < 11) %>%
mutate(DOY = format(Date, "%j")) %>%
group_by(Year) %>%
mutate(cumulativeSum = accumulate(DOY))
I am trying to get a figure like the one below for A, J, X, Z. Any help would be appreciated.
Update (EDIT)
My question is pretty confusing, so I decided to break it down into steps using Excel. Here I am using only one variable, which in this case is A (note: in my question I have multiple variables). In step 1, I accumulate the data from May to October of each year, which is reflected in the cumulative sum column. In step 2, I rearrange the data by day of the year (May to October). In step 3, I take the statistics I mentioned earlier across all the years for every day of the year. I tried to clarify as much as I could, but this is probably a bit of a strange question.
Ultimate Figure
Here is an example Figure that i would like to derive as a result of this exercise.
So, if I understand correctly, you are trying to plot descriptive statistics of the cumulative values of each variable between May and October for the years 2000 to 2019.
Here is a possible solution that first calculates descriptive statistics for each variable (using the dplyr, lubridate, and tidyr packages). I encourage you to break this code into several parts in order to understand all the steps.
Basically, I extract the month and year from the date, pivot the data frame into a longer format, filter to keep only the values in the period of interest (May to October), and calculate the cumulative sum of the values grouped by variable and year. Then I create a fake date (by pasting a fixed year together with the real month and day) so that the descriptive statistics can be calculated per date and variable.
Altogether, it gives something like this:
library(lubridate)
library(dplyr)
library(tidyr)
mydata <- DF %>% mutate(Year = year(Date), Month = month(Date)) %>%
pivot_longer(-c(Date,Year,Month), names_to = "variable", values_to = "values") %>%
filter(between(Month,5,10)) %>%
group_by(Year, variable) %>%
mutate(Cumulative = cumsum(values)) %>%
mutate(NewDate = ymd(paste("2020", Month,day(Date), sep = "-"))) %>%
ungroup() %>%
group_by(variable, NewDate) %>%
summarise(Median = median(Cumulative),
Maximum = max(Cumulative),
Minimum = min(Cumulative),
Upper = quantile(Cumulative,0.75),
Lower = quantile(Cumulative, 0.25))
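To make the two grouping steps above easier to follow, here is a tiny sketch on made-up numbers. The toy tibble and its Day column are hypothetical stand-ins for the real data and the fake NewDate, and the rows are assumed to be in date order within each year, since cumsum() accumulates in row order.

```r
library(dplyr)

# Hypothetical data: two years, three days each
toy <- tibble(
  Year = rep(c(2000, 2001), each = 3),
  Day = rep(1:3, times = 2),
  values = c(1, 2, 3, 10, 20, 30)
)

# Step 1: the running total restarts at the beginning of every year
cumulated <- toy %>%
  group_by(Year) %>%
  mutate(Cumulative = cumsum(values)) %>%
  ungroup()

# Step 2: statistics across years for each day of the season
per_day <- cumulated %>%
  group_by(Day) %>%
  summarise(Median = median(Cumulative),
            Maximum = max(Cumulative))
```

cumulated$Cumulative is 1, 3, 6 for the first year and 10, 30, 60 for the second, and per_day then summarises each day across the two years.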
Then, you can get a similar plot to your example by doing:
library(ggplot2)
ggplot(mydata, aes(x = NewDate))+
geom_ribbon(aes(ymin = Lower, ymax = Upper), color = "grey", alpha =0.5)+
geom_line(aes(y = Median), color = "darkblue")+
geom_line(aes(y = Maximum), color = "red", linetype = "dashed", size = 1.5)+
geom_line(aes(y = Minimum), color ="red", linetype = "dashed", size = 1.5)+
facet_wrap(~variable, scales = "free")+
scale_x_date(date_labels = "%b", date_breaks = "month", name = "Month")+
ylab("Daily Cumulative Precipitation (mm)")
Does it look like what you are trying to achieve?
EDIT: Adding Legends
Adding a legend here is not easy, as you are using different geoms (ribbon, line) with different colors, shapes, ...
So, one way is to regroup the statistics that can be plotted with the same geom and do:
mydata %>% pivot_longer(cols = c(Median, Minimum,Maximum), names_to = "Statistic",values_to = "Value") %>%
ggplot(aes(x = NewDate))+
geom_ribbon(aes(ymin = Lower, ymax = Upper, fill = "Upper / Lower"), alpha =0.5)+
geom_line(aes(y = Value, color = Statistic, linetype = Statistic, size = Statistic))+
facet_wrap(~variable, scales = "free")+
scale_x_date(date_labels = "%b", date_breaks = "month", name = "Month")+
ylab("Daily Cumulative Precipitation (mm)")+
scale_size_manual(values = c(1.5,1,1.5))+
scale_linetype_manual(values = c("dashed","solid","dashed"))+
scale_color_manual(values = c("red","darkblue","red"))+
scale_fill_manual(values = "grey", name = "")
So, it looks good, but as you can see, it's a little bit weird as the Upper / Lower entry sits slightly apart from the main legend.
Another solution is to add legends as labeling on the last date. For that, you can create a second dataframe by subsetting only the last date of your first dataframe:
mydata_label <- mydata %>% filter(NewDate == max(NewDate)) %>%
pivot_longer(cols = Median:Lower, names_to = "Stat",values_to = "val")
Then, without changing much the plotting part, you can do:
ggplot(mydata, aes(x = NewDate))+
geom_ribbon(aes(ymin = Lower, ymax = Upper), alpha =0.5)+
geom_line(aes(y = Median), color = "darkblue")+
geom_line(aes(y = Maximum), color = "red", linetype = "dashed", size = 1.5)+
geom_line(aes(y = Minimum), color ="red", linetype = "dashed", size = 1.5)+
facet_wrap(~variable, scales = "free")+
scale_x_date(date_labels = "%b", date_breaks = "month", name = "Month", limits = c(min(mydata$NewDate),max(mydata$NewDate)+25))+
ylab("Daily Cumulative Precipitation (mm)")+
geom_text(data = mydata_label,
aes(x = NewDate+5, y = val, label = Stat, color = Stat), size = 2, hjust = 0, show.legend = FALSE)+
scale_color_manual(values = c("Median" = "darkblue","Maximum" = "red","Minimum" = "red","Upper" = "black", "Lower" = "black"))
I reduced the size of the text labels on purpose, due to space issues, so that you can see all of them. But based on the figure you attached to your question, you should have plenty of space to make it work.

bar chart of row freq ggplot2

I have the following data:
dataf <- read.table(text = "index,group,taxa1,taxa2,taxa3,total
s1,g1,2,5,3,10
s2,g1,3,4,3,10
s3,g2,1,2,7,10
s4,g2,0,4,6,10", header = T, sep = ",")
I'm trying to make a stacked bar plot of the frequencies in the data so that it counts across each row (not down a column) for each index (s1, s2, s3, s4) and then for each group (g1, g2) of each taxa. I have only been able to figure out how to graph the species of one taxa, but not all three stacked on each other.
Here are some examples of what I'm trying to make:
These were made on google sheets so they don't look like ggplot but it would be easier to make in r with ggplot2 because the real data set is larger.
You would need to reshape the data.
Here is my solution (broken down by plot)
For first plot
library(tidyverse)
##For first plot
prepare_data_1 <- dataf %>% select(index, taxa1:taxa3) %>%
gather(taxa,value, -index) %>%
mutate(index = str_trim(index)) %>%
group_by(index) %>% mutate(prop = value/sum(value))
##Plot 1
prepare_data_1 %>%
ggplot(aes(x = index, y = prop, fill = fct_rev(taxa))) + geom_col()
For second plot
##For second plot
prepare_data_2 <- dataf %>% select(group, taxa1:taxa3) %>%
gather(taxa,value, -group) %>%
mutate(group = str_trim(group)) %>%
group_by(group) %>% mutate(prop = value/sum(value))
##Plot 2
prepare_data_2 %>%
ggplot(aes(x = group, y = prop, fill = fct_rev(taxa))) + geom_col()
## You need to reshape the data before plotting.
library(reshape2) # melt() is assumed to come from reshape2
library(ggplot2)
# Long format: one row per index/taxa combination
dfm = melt(dataf, id.vars=c("index","group"),
measure.vars=c("taxa1","taxa2","taxa3"),
variable.name="variable", value.name="values")
ggplot(dfm, aes(x = index, y = values, group = variable)) +
geom_col(aes(fill=variable)) +
# Apply the complete theme first so it does not reset the axis text settings below
theme_gray() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.25)) +
geom_text(aes(label = values), position = position_stack(vjust = .5), size = 3)
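For comparison, a sketch that swaps in pivot_longer() and position = "fill", neither of which is used in the answers above. It assumes proportions are wanted in both plots, and the names long, p_index and p_group are made up here.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

dataf <- read.table(text = "index,group,taxa1,taxa2,taxa3,total
s1,g1,2,5,3,10
s2,g1,3,4,3,10
s3,g2,1,2,7,10
s4,g2,0,4,6,10", header = TRUE, sep = ",")

# Long format: one row per sample and taxa
long <- dataf %>%
  pivot_longer(cols = starts_with("taxa"), names_to = "taxa", values_to = "count")

# position = "fill" rescales each stacked bar to a total height of 1
p_index <- ggplot(long, aes(x = index, y = count, fill = taxa)) +
  geom_col(position = "fill")

p_group <- ggplot(long, aes(x = group, y = count, fill = taxa)) +
  geom_col(position = "fill")
```

Printing p_index gives one bar per sample and p_group one bar per group. Letting ggplot2 do the rescaling avoids computing the prop column by hand, at the cost of hiding the raw counts.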
