How to accumulate multiple columns of a data.frame in R?


I am trying to find the accumulated values of variables A to Z in myData for each year. I have tried a few things without success. Once that works, I need to compute the maximum, minimum, median, and upper and lower quartiles across all those years. Here is my laborious code so far; I have no idea how to proceed further, and in fact the current code does not give me what I am after either.
library(tidyverse)
mydate <- as.data.frame(seq(as.Date("2000-01-01"), to= as.Date("2019-12-31"), by="day"))
colnames(mydate) <- "Date"
Data <- data.frame(A = runif(7305,0,10),
J = runif(7305,0,8),
X = runif(7305,0,12),
Z = runif(7305,0,10))
DF <- data.frame(mydate, Data)
myData <- DF %>% separate(Date, into = c("Year","Month","Day")) %>%
sapply(as.numeric) %>%
as.data.frame() %>%
mutate(Date = DF$Date) %>%
filter(Month > 4 & Month < 11) %>%
mutate(DOY = format(Date, "%j")) %>%
group_by(Year) %>%
mutate(cumulativeSum = accumulate(DOY))
I am trying to get a figure like the one below for A, J, X, and Z. Any help would be appreciated.
Update (EDIT)
My question was confusing, so I broke it down into steps using Excel. Here I use only one variable, A (note: my real data has multiple variables). In Step 1, I accumulate the data from May to October of each year, which is shown in the cumulative sum column. In Step 2, I rearrange the data by day of the year (May to October). In Step 3, I compute the statistics mentioned earlier across all years for every day of the year. I have tried to clarify as much as I can, but this is probably a somewhat unusual question.
Ultimate Figure
Here is an example figure that I would like to produce as the result of this exercise.

So, if I understand correctly, you are trying to plot descriptive statistics of the cumulative values of each variable between May and October for the years 2000 to 2019.
Here is a possible solution that first calculates descriptive statistics for each variable (using the dplyr, lubridate, and tidyr packages). I encourage you to break this code into several parts in order to understand each step.
Basically, I extract the month and year from the date, pivot the data frame into a longer format, filter to keep only values in the period of interest (May to October), and calculate the cumulative sum of the values grouped by variable and year. Then I create a fake date (by pasting one fixed year together with the real month and day) so that descriptive statistics can be calculated per date and variable.
Altogether, it gives something like this:
library(lubridate)
library(dplyr)
library(tidyr)
mydata <- DF %>% mutate(Year = year(Date), Month = month(Date)) %>%
pivot_longer(-c(Date,Year,Month), names_to = "variable", values_to = "values") %>%
filter(between(Month,5,10)) %>%
group_by(Year, variable) %>%
mutate(Cumulative = cumsum(values)) %>%
mutate(NewDate = ymd(paste("2020", Month,day(Date), sep = "-"))) %>%
ungroup() %>%
group_by(variable, NewDate) %>%
summarise(Median = median(Cumulative),
Maximum = max(Cumulative),
Minimum = min(Cumulative),
Upper = quantile(Cumulative,0.75),
Lower = quantile(Cumulative, 0.25))
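To see the two grouping steps in isolation, here is a minimal sketch on a tiny made-up data frame. The names toy, toy_cum and toy_stats and the values are assumptions for illustration only; they are not part of your data.

```r
library(dplyr)

# Toy data (hypothetical): two years, three days each, one variable.
toy <- data.frame(
  Year   = rep(c(2000, 2001), each = 3),
  Day    = rep(1:3, times = 2),
  values = c(1, 2, 3, 2, 2, 2)
)

# Step 1: running total within each year.
toy_cum <- toy %>%
  group_by(Year) %>%
  mutate(Cumulative = cumsum(values)) %>%
  ungroup()

# Step 2: statistics across years for each day of the season.
toy_stats <- toy_cum %>%
  group_by(Day) %>%
  summarise(Minimum = min(Cumulative),
            Median  = median(Cumulative),
            Maximum = max(Cumulative),
            .groups = "drop")

print(toy_stats)
```

The first grouping restarts the running total every year; the second one compares the same day across years, which is what the fake NewDate column is for in the full code.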
Then, from mydata, you can get a plot similar to your example with:
library(ggplot2)
ggplot(mydata, aes(x = NewDate))+
geom_ribbon(aes(ymin = Lower, ymax = Upper), color = "grey", alpha =0.5)+
geom_line(aes(y = Median), color = "darkblue")+
geom_line(aes(y = Maximum), color = "red", linetype = "dashed", size = 1.5)+
geom_line(aes(y = Minimum), color ="red", linetype = "dashed", size = 1.5)+
facet_wrap(~variable, scales = "free")+
scale_x_date(date_labels = "%b", date_breaks = "month", name = "Month")+
ylab("Daily Cumulative Precipitation (mm)")
Does it look like what you are trying to achieve?
EDIT: Adding Legends
Adding a legend here is not easy because you are using different geoms (ribbon, line) with different colors, line types, and so on.
One way is to regroup the statistics that can be plotted with the same geom and do:
mydata %>% pivot_longer(cols = c(Median, Minimum,Maximum), names_to = "Statistic",values_to = "Value") %>%
ggplot(aes(x = NewDate))+
geom_ribbon(aes(ymin = Lower, ymax = Upper, fill = "Upper / Lower"), alpha =0.5)+
geom_line(aes(y = Value, color = Statistic, linetype = Statistic, size = Statistic))+
facet_wrap(~variable, scales = "free")+
scale_x_date(date_labels = "%b", date_breaks = "month", name = "Month")+
ylab("Daily Cumulative Precipitation (mm)")+
scale_size_manual(values = c(1.5,1,1.5))+
scale_linetype_manual(values = c("dashed","solid","dashed"))+
scale_color_manual(values = c("red","darkblue","red"))+
scale_fill_manual(values = "grey", name = "")
It looks good, but as you can see, it is a little odd because the Upper / Lower key sits slightly apart from the main legend.
Another solution is to add the legend as text labels at the last date. For that, you can create a second data frame by keeping only the last date of the first one:
mydata_label <- mydata %>% filter(NewDate == max(NewDate)) %>%
pivot_longer(cols = Median:Lower, names_to = "Stat",values_to = "val")
Then, with little change to the plotting code, you can do:
ggplot(mydata, aes(x = NewDate))+
geom_ribbon(aes(ymin = Lower, ymax = Upper), alpha =0.5)+
geom_line(aes(y = Median), color = "darkblue")+
geom_line(aes(y = Maximum), color = "red", linetype = "dashed", size = 1.5)+
geom_line(aes(y = Minimum), color ="red", linetype = "dashed", size = 1.5)+
facet_wrap(~variable, scales = "free")+
scale_x_date(date_labels = "%b", date_breaks = "month", name = "Month", limits = c(min(mydata$NewDate),max(mydata$NewDate)+25))+
ylab("Daily Cumulative Precipitation (mm)")+
geom_text(data = mydata_label,
aes(x = NewDate+5, y = val, label = Stat, color = Stat), size = 2, hjust = 0, show.legend = FALSE)+
scale_color_manual(values = c("Median" = "darkblue","Maximum" = "red","Minimum" = "red","Upper" = "black", "Lower" = "black"))
I deliberately reduced the size of the text labels so that all of them fit in the available space. Based on the figure attached to your question, you should have plenty of room to make this work.

Related

How to stack partially matched time periods with geom_area (ggplot2)?

With the following example, I get a plot where the areas are not stacked. I would like to stack them. This should be a partial stack, intensity starting at 0.5, then reaching 0.8 where stacked, then reaching 0.3 at the end.
I assume that the position argument does not work as the start and end date are not the same.
Am I missing an argument that could solve this issue? Or maybe another geom?
Do I have to split the data into days to get the desired output? If so, how can I achieve that?
Thanks in advance,
# Library
library(tidyverse)
library(lubridate)
# Data
df <- tibble(date_debut = as_date(c("2022-09-28", "2022-10-05")),
intensity = c(0.5, 0.3),
duration = days(c(14, 10)),
type = (c("a", "b")))
# Adjustment
df <- df %>%
mutate(date_fin = date_debut + duration) %>%
pivot_longer(cols = c(date_debut, date_fin),
names_to = "date_type",
values_to = "date")
# Plot
df %>%
ggplot(aes(x = date, y = intensity, fill = type))+
geom_area(position = "stack")

This is a tough data wrangling problem. The area plots only stack where the points in the two series have the same x values. The following will achieve that, though it's quite a profligate approach.
df %>%
mutate(interval = interval(date_debut, date_debut + duration)) %>%
group_by(type) %>%
summarize(time = seq(as.POSIXct(min(df$date_debut)),
as.POSIXct(max(df$date_debut + df$duration)), by = 'min'),
intensity = ifelse(time %within% interval, intensity, 0)) %>%
ggplot(aes(x = time, y = intensity, fill = type)) +
geom_area(position = position_stack())
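The alignment idea can be checked without plotting. The following is a rough base R sketch, using a daily grid instead of the per-minute grid above; the names periods, grid, daily and stack_height are made up for illustration. It puts both types on the same x values and sums the intensities per day:

```r
# Periods from the question: start date, length in days, intensity.
periods <- data.frame(
  type      = c("a", "b"),
  start     = as.Date(c("2022-09-28", "2022-10-05")),
  days      = c(14, 10),
  intensity = c(0.5, 0.3)
)
periods$end <- periods$start + periods$days

# One shared daily grid covering both periods.
grid <- seq(min(periods$start), max(periods$end), by = "day")

# For every type, the intensity on each grid day (0 outside its own period).
daily <- do.call(rbind, lapply(seq_len(nrow(periods)), function(i) {
  inside <- grid >= periods$start[i] & grid <= periods$end[i]
  data.frame(date = grid, type = periods$type[i],
             intensity = ifelse(inside, periods$intensity[i], 0))
}))

# Height of the stack on each day: sum over types that share the same date.
stack_height <- tapply(daily$intensity, daily$date, sum)
print(stack_height)
```

Because every type now has a row for every date, a stacked geom has matching x values to add up, which is the 0.5, then 0.8, then 0.3 profile asked for in the question.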

Allan Cameron's answer inspired me to look further into complete().
The proposed answer solved my question, so I accepted it. However, it is indeed more complex than needed.
I solved it this way:
# Adjustment
df <- df %>%
mutate(date_fin = date_debut + duration) %>%
group_by(type) %>%
complete(date_debut = seq(min(date_debut), max(date_fin), by = "1 day")) %>%
fill(intensity) %>%
select(date_debut, intensity, type)
ggplot(df, aes(x = date_debut, y = intensity, fill = type)) +
geom_area()+
scale_x_date(date_labels = "%d",
date_breaks = "1 day")
To avoid the weird empty space, it is fine for me to use geom_col (the question was about geom_area, so no worries).
ggplot(df, aes(x = date_debut, y = intensity, fill = type, colour = type)) +
geom_col(width = 0.95)+
scale_x_date(date_labels = "%d",
date_breaks = "1 day")

Is it possible to adjust a second graph to a second y-axis in ggplot?

I am trying to make several bar plots with their standard errors added to the plot. Adding a second y-axis was not that hard, but now I also want the standard errors to be drawn against this new y-axis. I know that I can manipulate the y-axis, but that is not really what I want: I want the standard errors to fit the scale of the new y-axis. To illustrate, this is the plot I have now, where I simply divided the first y-axis by 100.
What I want is something more like this
How it should look using Excel
for all bar plots (this was done for the first bar plot using Excel). Here is my code:
df_bar <- as.data.frame(
rbind(
c('g1', 0.945131015, 1.083188828, 1.040164338,
1.115716593, 0.947886795),
c('g2', 1.393211286, 1.264193745, 1.463434395,
1.298126006, 1.112718796),
c('g3', 1.509976099, 1.450923745, 1.455102201,
1.280102338, 1.462689245),
c('g4', 1.591697668, 1.326292649, 1.767207296,
1.623619341, 2.528108183),
c('g5', 2.625114848, 2.164050167, 2.092843287,
2.301950359, 2.352736806)
)
)
colnames(df_bar)<-c('interval', 'lvl3.Mellem.Høj', 'lvl1.Lav', 'TOM',
',lvl4.Høj', 'lvl2.Lav.Mellem')
df_bar <- melt(df_bar, id.vars = "interval",
variable.name = "name",
value.name = "value")
df_line <- as.data.frame(
rbind(
c('g1', 0.0212972, 0.0164494, 0.0188898, 0.01888982,
0.03035883),
c('g2', 0.0195600, 0.0163811, 0.0188747, 0.01887467,
0.03548092),
c('g3', 0.0192249, 0.0161914, 0.02215852, 0.02267605,
0.03426538),
c('g4', 0.0187961, 0.0180842, 0.01962371, 0.02103450,
0.03902890),
c('g5', 0.0209987, 0.0164596, 0.01838280, 0.02282300,
0.03516818)
)
)
colnames(df_line)<-c('interval', 'lvl3.Mellem.Høj', 'lvl1.Lav', 'TOM',
',lvl4.Høj', 'lvl2.Lav.Mellem')
df_line <- melt(df_line, id.vars = "interval",
variable.name = "name",
value.name = "sd")
df <- inner_join(df_bar,df_line, by=c("interval", "name"))
df %>%
mutate(value = as.numeric(value)) %>%
mutate(sd = as.numeric(sd)) %>%
mutate(interval = as.factor(interval)) %>%
mutate(name = as.factor(name)) %>%
ggplot() +
geom_bar(aes(x = interval, y = value, fill = interval), stat = "identity") +
geom_line(aes(x = interval, y = sd, group = 1),
color = "black", size = .75) +
scale_y_continuous("Value", sec.axis = sec_axis(~ . /100, name = "sd")) +
facet_grid(~name, scales = "free") +
theme_bw() + theme(legend.position = "none") +
xlab("Interval") + ylab("Value") +
labs(caption = "Black line indicates standard deviation.")
Thanks in advance..

As described in this example, you also have to transform your sd values to match the scale of the second axis. In your example you divided by 100, so you have to multiply sd by 100, as shown below:
library(tidyverse)
library(data.table)
df %>%
mutate(value = as.numeric(value)) %>%
mutate(sd = as.numeric(sd)) %>%
mutate(interval = as.factor(interval)) %>%
mutate(name = as.factor(name)) %>%
ggplot() +
geom_bar(aes(x = interval, y = value, fill = interval), stat = "identity") +
scale_y_continuous("Value", sec.axis = sec_axis(~ ./100, name = "sd"))+
geom_line(aes(x = interval, y = sd*100, group = 1),
color = "black", size = .75)+
facet_grid(~name, scales = "free")+
theme_bw() + theme(legend.position = "none") +
xlab("Interval") + ylab("Value") +
labs(caption = "Black line indicates standard deviation.")
You can also use a different value to scale your second axis. In this example I used 50 as a scaling factor, which in my opinion looks a bit better:
Created on 2022-08-25 with reprex v2.0.2
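One way to keep the two transformations in sync is to store the factor in a single variable and use it in both places. This is only a sketch on a small made-up data frame; the names toy and scale_factor, and the rounded values, are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical factor relating the sd scale to the primary (bar) scale.
scale_factor <- 50

# Toy data (rounded, made-up values in the same range as the question).
toy <- data.frame(
  interval = c("g1", "g2", "g3"),
  value    = c(0.95, 1.39, 1.51),
  sd       = c(0.021, 0.020, 0.019)
)

p <- ggplot(toy, aes(x = interval)) +
  geom_col(aes(y = value, fill = interval)) +
  # The line is drawn on the primary axis, so sd is scaled up here ...
  geom_line(aes(y = sd * scale_factor, group = 1)) +
  # ... and the secondary axis applies the inverse to label the original units.
  scale_y_continuous("Value",
                     sec.axis = sec_axis(~ . / scale_factor, name = "sd")) +
  theme_bw() + theme(legend.position = "none")

print(p)
```

The secondary axis is only a relabelling of the primary one, so whatever factor multiplies sd in the geom must be the one divided out in sec_axis().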

Here is what it should look like for the first barplot using Excel.

R | ggplot2 | bar in barplot does not start at the right value

I want to visualize the reference ranges of several liver enzymes (for example GOT and GPT) that I calculated with two programs "kosmic" and "RLE" using ggplot2.
I do not understand why the bars always start at 0, even if the lower range is for example 16.02.
How do I need to change my code so the minimum and maximum values of the bars look like that:
[16.02,45.46] [9.16,60.52] [16.10,68.90] and [9.30,64.40].
Thank you in advance!
#install.packages("ggplot2")
library(ggplot2)
program <- c(rep("kosmic",4),rep("RLE",4))
value <- c(16.02,45.46,9.16,60.52,16.1,48.9,9.3,64.4)
parameter <- c(rep("GOT",2),rep("GPT",2),rep("GOT",2),rep("GPT",2))
table1 <- data.frame(program,value,parameter)
p <- ggplot(table1, aes(parameter,value, fill = program))+
geom_bar(position="dodge", stat="identity")
print(p)
I am looking for something like this:

Are you looking for something like this?
library(dplyr)
table1 %>%
group_by(parameter, program) %>%
summarize(min = min(value),
median = median(value),
max = max(value), .groups = "drop") %>%
ggplot(aes(interaction(parameter,program), fill = program))+
geom_tile(aes(y = median, height = max-min), width = 0.6)
Edit:
Okay this is hacky, but:
library(forcats)  # provides fct_relevel()
table1 %>%
# example of reordering the parameters
mutate(parameter = fct_relevel(parameter, "GPT", after = 0)) %>%
# forcats offers a variety of fct_*** functions to change factors
# (factors are a data type that can separately store labels and ordering)
group_by(parameter, program) %>%
summarize(min = min(value),
median = median(value),
mean = mean(value),
max = max(value), .groups = "drop") %>%
ggplot(aes(parameter, mean, color = program))+
geom_errorbar(aes(ymin = min, ymax = max),
position = position_dodge(width = 0.3), size = 10,
width = 0) +
# control the legend so the key squares aren't gigantic to match the error bar widths
guides(colour = guide_legend(override.aes = list(size=8))) +
# example of assigning different colors.
# a variety of scale_color_* functions are available
scale_color_manual(values = c("kosmic" = "#cc5588", "RLE" = "#779988"))
A downside of this is that the width and spacing of the bars will vary with the aspect ratio of your graphic output, so it might take some fiddling to get it the way you want.
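If the varying bar width is a problem, a geom that takes explicit lower and upper bounds may be worth a try. The sketch below swaps in geom_crossbar(), which is a different geom from the ones used above, and the column names lower, upper and mid are made up for illustration. Note that geom_crossbar() also draws a line at y, here the midpoint of the range.

```r
library(dplyr)
library(ggplot2)

# Data from the question.
table1 <- data.frame(
  program   = c(rep("kosmic", 4), rep("RLE", 4)),
  value     = c(16.02, 45.46, 9.16, 60.52, 16.1, 48.9, 9.3, 64.4),
  parameter = c(rep("GOT", 2), rep("GPT", 2), rep("GOT", 2), rep("GPT", 2))
)

# One row per parameter/program with the bounds of the reference range.
ranges <- table1 %>%
  group_by(parameter, program) %>%
  summarise(lower = min(value), upper = max(value), .groups = "drop") %>%
  mutate(mid = (lower + upper) / 2)

# Boxes span lower..upper instead of starting at 0.
p <- ggplot(ranges, aes(x = parameter, fill = program)) +
  geom_crossbar(aes(y = mid, ymin = lower, ymax = upper),
                position = position_dodge(width = 0.7), width = 0.6)

print(p)
```

Here the box width is given in x-axis units, so it should not depend on the size of the output device.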

Based on what you want, I'd suggest a box plot instead of a bar plot:
ggplot(table1, aes(x = parameter, y = value, fill = program, color = program)) +
geom_point(position = position_jitterdodge()) +
geom_boxplot(outlier.shape = NA, color = 'black')

How do you make a line graph with multiple lines from multiple variables in R

I have two dataframes and I want to plot a comparison between them. The plot and dataframes look like so
df2019 <- data.frame(Role = c("A","B","C"),
Women_percent = c(65,50,70), Men_percent = c(35,50,30),
Women_total = c(130,100,140), Men_total = c(70,100,60))
df2016 <- data.frame(Role = c("A","B","C"),
Women_percent = c(70,45,50), Men_percent = c(30,55,50),
Women_total = c(140,90,100), Men_total = c(60,110,100))
all_melted <- reshape2::melt(
rbind(cbind(df2019, year=2019), cbind(df2016, year=2016)),
id=c("year", "Role"))
There's no reason I need the data in melted form; I only did it because I was plotting bar graphs with it. Now I need a line graph, and I don't know how to make line graphs from melted data, nor how to keep the 2019/2016 tag if the data is not melted. When I try to make a line graph, I don't know how to specify which "variable" will be used. I want the lines to show the Women and Men percent values, and the labels to show the totals. (In this picture the geom_text labels are the percent values; I want them to use the total values.)
Crucially, I want the linetype to be dotted for 2016, and the legend to show that.

I think it would be simplest to rbind the two frames after labelling them with their year, then reshape the result so that you have columns for role, year, gender, percent and total.
I would then use a bit of alpha scale trickery to hide the points and labels from 2016:
library(tidyverse)  # provides %>%, pivot_longer(), pivot_wider() and ggplot()
df2016$year <- 2016
df2019$year <- 2019
rbind(df2016, df2019) %>%
pivot_longer(cols = 2:5, names_sep = "_", names_to = c("Gender", "Type")) %>%
pivot_wider(names_from = Type) %>%
ggplot(aes(Role, percent, color = Gender,
linetype = factor(year),
group = paste(Gender, year))) +
geom_line(size = 1.3) +
geom_point(size = 10, aes(alpha = year)) +
geom_text(aes(label = total, alpha = year), colour = "black") +
scale_colour_manual(values = c("#07aaf6", "#ef786f")) +
scale_alpha(range = c(0, 1), guide = guide_none()) +
scale_linetype_manual(values = c(2, 1)) +
labs(y = "Percent", color = "Gender", linetype = "Year")
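To inspect the reshaped table before it goes into ggplot, the first part of the pipeline can be run on its own. This is a small sketch that copies the data from the question; the name long is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Data from the question.
df2019 <- data.frame(Role = c("A", "B", "C"),
                     Women_percent = c(65, 50, 70), Men_percent = c(35, 50, 30),
                     Women_total = c(130, 100, 140), Men_total = c(70, 100, 60))
df2016 <- data.frame(Role = c("A", "B", "C"),
                     Women_percent = c(70, 45, 50), Men_percent = c(30, 55, 50),
                     Women_total = c(140, 90, 100), Men_total = c(60, 110, 100))
df2016$year <- 2016
df2019$year <- 2019

long <- rbind(df2016, df2019) %>%
  # "Women_percent" is split at "_" into Gender = "Women" and Type = "percent".
  pivot_longer(cols = 2:5, names_sep = "_", names_to = c("Gender", "Type")) %>%
  # Spread Type back out so percent and total become separate columns.
  pivot_wider(names_from = Type)

print(long)
```

Each row of long is one Role, year and Gender combination, with percent available for the line position and total for the label.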

ggplot2: Adding fill aesthetic to smooth geom inside stat_summary

I have what I think is a version of "remove data points when using stat_summary to generate mean and confidence band" or "How to set multiple colours in a ggplot2 stat_summary plot?", and it may also relate to this bug report about the SE parameter, https://github.com/tidyverse/ggplot2/issues/1546, but I can't figure out what I am doing wrong.
I have weekly data and I am trying to plot current year, previous year, 5 year average, and 5 year range. I can get the plot and all the elements that I want, but I can't get the fill in the range to relate to my scale_fill command.
Here is the code I am using:
library(plyr)
require(dplyr)
require(tidyr)
library(ggplot2)
library(lubridate)
library(zoo)
library(viridis)
ggplot(df1,aes(week,value)) +
geom_point(data=subset(df1,year(date)==year(Sys.Date()) ),size=1.7,aes(colour="1"))+
geom_line(data=subset(df1,year(date)==year(Sys.Date()) ),size=1.7,aes(colour="1"))+
geom_line(data=subset(df1,year(date)==year(Sys.Date())-1 ),size=1.7,aes(colour="2"))+
geom_point(data=subset(df1,year(date)==year(Sys.Date())-1 ),size=1.7,aes(colour="2"))+
#stat_summary(data=subset(df1,year(date)<year(Sys.Date()) &year(date)>year(Sys.Date())-6),geom = 'smooth', alpha = 0.2,size=1.7,
# fun.data = median_hilow,aes(colour=c("1","2","3"),fill="range"))+
stat_summary(data=subset(df1,year(date)<year(Sys.Date()) &year(date)>year(Sys.Date())-6),geom="smooth",fun.y = mean, fun.ymin = min, fun.ymax = max,size=1.7,aes(colour="c",fill="b"))+
#stat_summary(fun.data=mean_cl_normal, geom='smooth', color='black')+
scale_color_viridis("",discrete=TRUE,option="C",labels=c(year(Sys.Date()), year(Sys.Date())-1,paste(year(Sys.Date())-6,"-",year(Sys.Date())-1,"\naverage",sep ="")))+
scale_fill_viridis("",discrete=TRUE,option="C",labels=paste(year(Sys.Date())-6,"-",year(Sys.Date())-1,"\nrange",sep =""))+
#scale_fill_continuous()+
scale_x_continuous(limits=c(min(df1$week),max(df1$week)),expand=c(0,0))+
theme_minimal()+theme(
legend.position = "bottom",
legend.margin=margin(c(0,0,0,0),unit="cm"),
legend.text = element_text(colour="black", size = 12),
plot.caption = element_text(size = 14, face = "italic"),
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(size = 14, face = "italic"),
#panel.grid.minor = element_blank(),
text = element_text(size = 14,face = "bold"),
axis.text.y =element_text(size = 14,face = "bold", colour="black"),
axis.text.x=element_text(size = 14,face = "bold", colour="black",angle=90, hjust=1),
)+
labs(y="Crude Oil Imports \n(Weekly, Thousands of Barrels per Day)",x="Week",
title=paste("US Imports of Crude Oil",sep=""),
caption="Source: EIA API, graph by Andrew Leach.")
I have placed a test.Rdata file here with the df1 data frame: https://drive.google.com/file/d/1aMt4WQaOi1vFJcMlgXFY7dzF_kjbgBiU/view?usp=sharing
Ideally, I'd like to have a fill legend item that looks like this, only with the text as I have it in my graph:
Any help would be much appreciated.

The short answer is that you seem to be misunderstanding how ggplot's scale_xx_xx commands are meant to be used (this trips up a lot of people). Whenever possible, the intention is for the aesthetics (the aes() bit inside most geoms) to be mapped to the scale functions. For example, the following code maps year to line color:
plot.simple <- ggplot(data = df1, aes(x = week, y = value, color = as.factor(year(date)))) +
geom_line()
print(plot.simple)
Since we specified that year (converted to a factor) should be used to define line color, ggplot defaults to using scale_color_hue. We could use a different scale:
plot.gray <- ggplot(data = df1, aes(x = week, y = value, color = as.factor(year(date)))) +
geom_line() +
scale_color_grey()
print(plot.gray)
If we don't want to tie aesthetics such as color or fill to values in the data, we can just specify them outside of the call to aes(). Typically you only do this if you don't have multiple values for an aesthetic:
plot.simple <- ggplot(data = df1, aes(x = week, y = value, color = as.factor(year(date)))) +
geom_line(alpha = 0.2)
print(plot.simple)
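The difference between mapping and setting can be seen side by side in a small sketch. The toy data frame and the names p_mapped and p_set are made up for illustration; they are not part of df1.

```r
library(ggplot2)

# Toy data (hypothetical): two years, three weeks each.
toy <- data.frame(
  week  = rep(1:3, times = 2),
  value = c(5, 6, 7, 4, 6, 8),
  yr    = rep(c("2017", "2018"), each = 3)
)

# Mapped: colour comes from the data, so a scale controls it and a legend appears.
p_mapped <- ggplot(toy, aes(week, value)) +
  geom_line(aes(colour = yr)) +
  scale_colour_manual(values = c("2017" = "grey40", "2018" = "purple"))

# Set: colour is a constant outside aes(); scales are ignored and no legend is drawn.
p_set <- ggplot(toy, aes(week, value, group = yr)) +
  geom_line(colour = "purple")

print(p_mapped)
print(p_set)
```

A scale_colour_* call only has an effect on the first kind of plot, since it needs a mapped aesthetic to act on.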
But you're in the unenviable position of wanting both of these things at once. For your 2017 and 2018 lines, color is meaningful. For the summary ribbon and its associated line, color is just decorative. In such cases, I usually avoid ggplot's built-in summary functions, since they can often "help" in ways that end up confusing or cumbersome.
I would suggest creating two data sets, one containing the 2017 and 2018 years, and the other containing the summary statistics for the ribbon:
df.years <- df1 %>%
mutate(year = year(date)) %>%
filter(year >= year(Sys.Date()) - 1)
df.year.range <- df1 %>%
mutate(year = year(date)) %>%
filter(year >= year(Sys.Date()) - 6 & year <= year(Sys.Date()) - 1) %>%
group_by(week) %>%
summarize(mean = mean(value), min = min(value), max = max(value))
We can then trick ggplot into printing a nice title for the fill on the legend, by setting fill inside aes to the intended string. Because fill is set in aes(), we control its color with scale_fill_manual.
the.plot <- ggplot() +
geom_ribbon(data = df.year.range, aes(x = week, ymin = min, ymax = max, fill = 'Previous 5 Year Range\nof Weekly Exports')) +
geom_line(data = df.year.range, aes(x = week, y = mean), color = 'purple') +
geom_line(data = df.years, aes(x = week, y = value, color = as.factor(year))) +
geom_point(data = filter(df.years, year == year(Sys.Date())), aes(x = week, y = value, color = as.factor(year))) +
scale_fill_manual(values = '#ffccff')
print(the.plot)
This is still rather cumbersome, because you have quite a few different elements tied to various different sources of data (lines for some years, points for others, a ribbon for a summary, etc). But it gets the job done!
