Visualizing average sentiment by day&year (ggplot) - r

I would like to visualize consumer sentiment by day within each year, so that I can compare the same day across years. For example, I am interested in comparing consumer sentiment on Dec 18th, 2011 to Dec 18th, 2012.
Currently, I have been able to do so by month and year, but I want to visualize the data at a more granular level.
#Creating a month-year variable
valences_by_post<- valences_by_post %>%
mutate(month_year = zoo::as.yearmon(date))
#2011 & 2012
valence_11_12<-valences_by_post %>%
filter(year == 2011 | year ==2012)%>%
group_by(month_year) %>%
summarize(mean_valence= mean(valence), n=n())
ggplot(valence_11_12, aes(x =factor(month_year), y = mean_valence, group=1)) +
geom_point() +
geom_line()+
geom_smooth()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
Which produces:
However, to compute sentiment by day&year, and visualize across different years, I ran the following:
valences_by_post<- valences_by_post %>%
mutate(year_day = paste(lubridate::year(date), lubridate::yday(date), sep = "-"))
head(valences_by_post$year_day)
valence_day<-valences_by_post %>%
filter(year == 2011| year == 2012)%>%
group_by(year_day) %>%
summarize(mean_valence= mean(valence), n=n())
Then I try to draw the graph, but I receive "Error: Discrete value supplied to continuous scale" because the year_day variable is stored as character. Is there a workaround for this, or an equivalent of zoo::as.yearmon(date) in another package that works at the day level?
ggplot(valence_day, aes(x =year_day, y = mean_valence)) +
geom_point() +
geom_line()+
scale_x_continuous(breaks=seq(1,365,1)) +
geom_smooth()
Here are data samples:
dput(head(valence_day,5))
structure(list(year_day = c("2011-175", "2011-176", "2011-177",
"2011-182", "2011-189"), mean_valence = c(0, 0.0806100217864924,
0.0714285714285714, 0, 0.5), n = c(1L, 9L, 1L, 1L, 1L)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
And
dput(head(valences_by_post,5))
structure(list(document = c("1", "2", "3", "4", "5"), positive = c(1,
0, 2, 1, 1), negative = c(1, 1, 0, 0, 1), total_words = c(34,
13, 4, 3, 6), valence = c(0, -0.0769230769230769, 0.5, 0.333333333333333,
0), date = structure(c(1308873600, 1308960000, 1308960000, 1308960000,
1308960000), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
year = c(2011, 2011, 2011, 2011, 2011), month = c(6, 6, 6,
6, 6), year_day = c("2011-175", "2011-176", "2011-176", "2011-176",
"2011-176"), month_year = structure(c(2011.41666666667, 2011

IMHO there is no need to add a year_day column, since it carries the same information as the date. You could instead do your computations after converting date (which is a datetime object) to a Date. To show the year and day of year in the plot, use the labels argument of scale_x_date():
library(dplyr)
library(ggplot2)
valence_day <- valences_by_post %>%
filter(year %in% c(2011, 2012)) %>%
group_by(date = as.Date(date)) %>%
summarize(mean_valence = mean(valence), n = n())
ggplot(valence_day, aes(x = date, y = mean_valence)) +
geom_point() +
geom_line() +
scale_x_date(labels = ~ paste(lubridate::year(.x), lubridate::yday(.x), sep = "-")) +
geom_smooth()
DATA
valences_by_post <- structure(list(
document = c("1", "2", "3", "4", "5"), positive = c(
1,
0, 2, 1, 1
), negative = c(1, 1, 0, 0, 1), total_words = c(
34,
13, 4, 3, 6
), valence = c(
0, -0.0769230769230769, 0.5, 0.333333333333333,
0
), date = structure(c(
1308873600, 1308960000, 1308960000, 1308960000,
1308960000
), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
year = c(2011, 2011, 2011, 2011, 2011), month = c(
6, 6, 6,
6, 6
), month_year = structure(c(
2011.41666666667, 2011.41666666667,
2011.41666666667, 2011.41666666667, 2011.41666666667
), class = "yearmon")
), row.names = c(
NA,
5L
), class = "data.frame")
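To put the same calendar day of different years at the same x position (for example Dec 18th, 2011 against Dec 18th, 2012), one option is to map every date onto a common dummy year and colour by the real year. This is a minimal sketch, not part of the answer above: the data frame toy, the column month_day and the dummy year 2000 are illustrative assumptions. Note that lubridate::yday() alone would not line the days up, because 2012 is a leap year: Dec 18th is day 352 in 2011 but day 353 in 2012.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Hypothetical toy data standing in for valences_by_post
toy <- data.frame(
  date = as.POSIXct(
    c("2011-12-17", "2011-12-18", "2011-12-18", "2012-12-17", "2012-12-18"),
    tz = "UTC"
  ),
  valence = c(0.1, 0.2, 0.4, -0.1, 0.3)
)

by_day <- toy %>%
  mutate(
    year = factor(year(date)),
    # Map each date onto the dummy leap year 2000 so Feb 29 stays valid
    month_day = make_date(2000, month(date), day(date))
  ) %>%
  group_by(year, month_day) %>%
  summarize(mean_valence = mean(valence), n = n(), .groups = "drop")

# One line per year; the x axis shows only month and day
p <- ggplot(by_day, aes(x = month_day, y = mean_valence, colour = year)) +
  geom_point() +
  geom_line() +
  scale_x_date(date_labels = "%b %d")
```

Printing p draws one coloured line per year on a shared month-day axis, so the same calendar day sits at the same x position in every year.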

Related

How do I combine data taken separately into a single dataset?

I have a dataset comprised of leaves which I've weighed individually in order of emergence (first emerged through final emergence), and I'd like to combine these masses so that I have the entire mass of all the leaves for each individual plant.
How would I add these up using R programming language, or what would I need to google to get started on figuring this out?
structure(list(Tray = c(1, 1, 1, 1, 1, 1), Plant = c(2, 2, 2,
2, 3, 3), Treatment = structure(c(4L, 4L, 4L, 4L, 4L, 4L), .Label = c("2TLH",
"E2TL", "EH", "WL"), class = "factor"), PreSwitch = c("Soil",
"Soil", "Soil", "Soil", "Soil", "Soil"), PostSwitch = c("Soil",
"Soil", "Soil", "Soil", "Soil", "Soil"), Pellet = c(1, 1, 1,
1, 1, 1), Rep = c(1, 1, 1, 1, 1, 1), Date = structure(c(1618963200,
1618963200, 1618963200, 1618963200, 1618963200, 1618963200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), DAP = c(60, 60, 60, 60, 60, 60), Position = c(2,
1, 3, 4, 4, 3), Whorl = structure(c(1L, 1L, 2L, 2L, 2L, 2L), .Label = c("1",
"2", "3", "4", "5"), class = "factor"), PetioleLength = c(1.229,
1.365, 1.713, 1.02, 0, 1.408), BladeLength = c(1.604, 1.755,
2.466, 2.672, 0.267, 2.662), BladeWidth = c(1.023, 1.185, 1.803,
1.805, 0.077, 1.771), BladeArea = c(1.289, 1.634, 3.492, 3.789,
0.016, 3.704), BladePerimeter = c(6.721, 7.812, 11.61, 12.958,
1.019, 14.863), BladeCircularity = c(0.359, 0.336, 0.326, 0.284,
0.196, 0.211), BPR = c(1.30512611879577, 1.28571428571429, 1.43957968476357,
2.61960784313725, NA, 1.890625), Leaf.Mass = c(9, 11, 31, 33,
32, 33), BladeAR = c(1.56793743890518, 1.48101265822785, 1.36772046589018,
1.4803324099723, 3.46753246753247, 1.50310559006211), Subirrigation = c(0,
0, 0, 0, 0, 0), Genotype = c(1, 1, 1, 1, 1, 1), Location = c(0,
0, 0, 0, 0, 0)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I may be missing something but isn't this a sum by Plant?
The first option below sums the leaf masses for each plant into a separate table that holds only the totals; the second computes the same total and adds it back to the main data set as a new column. Here df stands for your data frame.
library(tidyverse)
# summary data set: one row per plant with its total leaf mass
plant_total <- df %>% group_by(Plant) %>% summarize(plant_weight = sum(Leaf.Mass, na.rm = TRUE))
# add a plant_weight column to the df data set (every leaf row keeps its plant's total)
df <- df %>% group_by(Plant) %>% mutate(plant_weight = sum(Leaf.Mass, na.rm = TRUE)) %>% ungroup()
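To make the difference between the two approaches concrete, here is a small sketch using only the Tray, Plant and Leaf.Mass values from the question's sample data. Grouping by Tray as well as Plant is an assumption on my part, in case plant numbers repeat across trays; the names leaves, totals and with_totals are illustrative.

```r
library(dplyr)

# Tray, Plant and Leaf.Mass values copied from the question's dput() output
leaves <- data.frame(
  Tray = c(1, 1, 1, 1, 1, 1),
  Plant = c(2, 2, 2, 2, 3, 3),
  Leaf.Mass = c(9, 11, 31, 33, 32, 33)
)

# summarize(): collapses to one row per plant
totals <- leaves %>%
  group_by(Tray, Plant) %>%
  summarize(plant_weight = sum(Leaf.Mass, na.rm = TRUE), .groups = "drop")

# mutate(): keeps all six leaf rows and repeats the total on each of them
with_totals <- leaves %>%
  group_by(Tray, Plant) %>%
  mutate(plant_weight = sum(Leaf.Mass, na.rm = TRUE)) %>%
  ungroup()
```

totals has two rows (plants 2 and 3), while with_totals still has one row per leaf.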

Adding p-values to ggplot; ggsignif says it can only handle data with groups that are plotted on the x-axis

I have data as follows, to which I am trying to add p-values:
library(ggplot2)
library(ggsignif)
library(dplyr)
data <- structure(list(treatment = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1), New_Compare_Truth = c(57,
61, 12, 14, 141, 87, 104, 90, 12, 14), total_Hy = c(135,
168, 9, 15, 103, 83, 238, 251, 9, 15), total = c(285, 305, 60,
70, 705, 435, 520, 450, 60, 70), ratio = c(47.3684210526316,
55.0819672131148, 15, 21.4285714285714, 14.6099290780142, 19.0804597701149,
45.7692307692308, 55.7777777777778, 15, 21.4285714285714), Type = structure(c(2L,
2L, 1L, 1L, 3L, 3L, 5L, 5L, 4L, 4L), .Label = c("A1. Others \nMore \nH",
"A2. Similar \nNorm", "A3. Others \nLess \nH", "B1. Others \nMore \nH",
"B2. Similar \nNorm or \nHigher"), class = "factor"), `Sample Selection` = c("Answers pr",
"Answers pu", "Answers pr", "Answers pu", "Answers pr",
"Answers pu", "Answers pr", "Answers pu", "Answers pr",
"Answers pu"), p_value = c(0.0610371842601616, 0.0610371842601616,
0.346302201593934, 0.346302201593934, 0.0472159407450147, 0.0472159407450147,
0.0018764377521242, 0.0018764377521242, 0.346302201593934, 0.346302201593934
), x = c(2, 2, 1, 1, 3, 3, 5.5, 5.5, 4.5, 4.5)), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
breaks_labels <- structure(list(Type = structure(c(2L, 1L, 3L, 5L, 4L), .Label = c("A1. Others \nMore \nH",
"A2. Similar \nNorm", "A3. Others \nLess \nH", "B1. Others \nMore \nH",
"B2. Similar \nNorm or \nHigher"), class = "factor"), x = c(2,
1, 3, 5.5, 4.5)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
data %>%
ggplot(aes(x = x, y = ratio)) +
geom_col(aes(fill = `Sample Selection`), position = position_dodge(preserve = "single"), na.rm = TRUE) +
geom_text(position = position_dodge(width = .9), # move to center of bars
aes(label=sprintf("%.02f %%", round(ratio, digits = 1)), group = `Sample Selection`),
vjust = -1.5, # nudge above top of bar
size = 4,
na.rm = TRUE) +
# geom_text(position = position_dodge(width = .9), # move to center of bars
# aes(label= paste0("(", ifelse(variable == "Crime = 0", `Observation for Crime = 0`, `Observation for Crime = 1`), ")"), group = `Sample Selection`),
# vjust = -0.6, # nudge above top of bar
# size = 4,
# na.rm = TRUE) +
scale_fill_grey(start = 0.8, end = 0.5) +
scale_y_continuous(expand = expansion(mult = c(0, .1))) +
scale_x_continuous(breaks = breaks_labels$x, labels = breaks_labels$Type) +
theme_bw(base_size = 15) +
xlab("Norm group for corporate Hy") +
ylab("Percentage Compliant Decisions") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_signif(annotation=c("p=0.35", "p=0.06", "p=0.05", "p=0.34", "p=0.00"), y_position = c(30, 40, 55 ,75, 90), xmin=c(0.75,1.75,2.75,3.75,4.75),
xmax=c(1.25,2.25,3.25,4.25,5.25))
For some reason, the last line causes the following error:
Error in f(...) :
Can only handle data with groups that are plotted on the x-axis
Since I am just putting in text and not referring to any variable, I don't really understand why this happens. Can anyone help me out? Without the last line it looks like this:
EDIT: Please note that I would like to keep the space between the third and the fourth column (which is apparently also what caused the problem, see Jared's answer).
Edit
Thanks for clarifying your expected outcome. Here is one way to include geom_signif() annotations without altering the original plot:
library(tidyverse)
library(ggsignif)
data <- structure(list(treatment = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1), New_Compare_Truth = c(57,
61, 12, 14, 141, 87, 104, 90, 12, 14), total_Hy = c(135,
168, 9, 15, 103, 83, 238, 251, 9, 15), total = c(285, 305, 60,
70, 705, 435, 520, 450, 60, 70), ratio = c(47.3684210526316,
55.0819672131148, 15, 21.4285714285714, 14.6099290780142, 19.0804597701149,
45.7692307692308, 55.7777777777778, 15, 21.4285714285714), Type = structure(c(2L,
2L, 1L, 1L, 3L, 3L, 5L, 5L, 4L, 4L), .Label = c("A1. Others \nMore \nH",
"A2. Similar \nNorm", "A3. Others \nLess \nH", "B1. Others \nMore \nH",
"B2. Similar \nNorm or \nHigher"), class = "factor"), `Sample Selection` = c("Answers pr",
"Answers pu", "Answers pr", "Answers pu", "Answers pr",
"Answers pu", "Answers pr", "Answers pu", "Answers pr",
"Answers pu"), p_value = c(0.0610371842601616, 0.0610371842601616,
0.346302201593934, 0.346302201593934, 0.0472159407450147, 0.0472159407450147,
0.0018764377521242, 0.0018764377521242, 0.346302201593934, 0.346302201593934
), x = c(2, 2, 1, 1, 3, 3, 5.5, 5.5, 4.5, 4.5)), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
breaks_labels <- structure(list(Type = structure(c(2L, 1L, 3L, 5L, 4L), .Label = c("A1. Others \nMore \nH",
"A2. Similar \nNorm", "A3. Others \nLess \nH", "B1. Others \nMore \nH",
"B2. Similar \nNorm or \nHigher"), class = "factor"), x = c(2,
1, 3, 5.5, 4.5)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
annotation_df <- data.frame(signif = c("p=0.35", "p=0.06", "p=0.05", "p=0.34", "p=0.00"),
y_position = c(30, 40, 55 ,75, 90),
xmin = c(0.75,1.75,2.75,4.25,5.25),
xmax = c(1.25,2.25,3.25,4.75,5.75),
group = c(1,2,3,4,5))
data %>%
ggplot(aes(x = x, y = ratio, group = `Sample Selection`)) +
geom_col(aes(fill = `Sample Selection`),
position = position_dodge(preserve = "single"), na.rm = TRUE) +
geom_text(position = position_dodge(width = .9), # move to center of bars
aes(label=sprintf("%.02f %%", round(ratio, digits = 1))),
vjust = -1.5, # nudge above top of bar
size = 4,
na.rm = TRUE) +
scale_fill_grey(start = 0.8, end = 0.5) +
scale_y_continuous(expand = expansion(mult = c(0, .1))) +
scale_x_continuous(breaks = breaks_labels$x, labels = breaks_labels$Type) +
theme_bw(base_size = 15) +
xlab("Norm group for corporate Hy") +
ylab("Percentage Compliant Decisions") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_signif(aes(xmin = xmin,
xmax = xmax,
y_position = y_position,
annotations = signif,
group = group),
data = annotation_df, manual = TRUE)
#> Warning: Ignoring unknown aesthetics: xmin, xmax, y_position, annotations
Created on 2021-07-20 by the reprex package (v2.0.0)
Previous answer
One potential solution to your problem is to plot "Type" on the x axis instead of "x", e.g.
data %>%
ggplot(aes(x = Type, y = ratio)) +
geom_col(aes(fill = `Sample Selection`),
position = position_dodge(preserve = "single"), na.rm = TRUE) +
geom_text(position = position_dodge(width = .9), # move to center of bars
aes(label=sprintf("%.02f %%", round(ratio, digits = 1)),
group = `Sample Selection`),
vjust = -1.5,
size = 4,
na.rm = TRUE) +
scale_fill_grey(start = 0.8, end = 0.5) +
scale_y_continuous(expand = expansion(mult = c(0, .1))) +
theme_bw(base_size = 15) +
xlab("Norm group for corporate Hy") +
ylab("Percentage Compliant Decisions") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_signif(annotation=c("p=0.35", "p=0.06", "p=0.05", "p=0.34", "p=0.00"),
y_position = c(30, 40, 55 ,75, 90),
xmin=c(0.75,1.75,2.75,3.75,4.75),
xmax=c(1.25,2.25,3.25,4.25,5.25))

How to create a bar plot that considers 2 variables years and months on the x axis and that considers 2 variables on the y axis?

I have a data frame called 'm' with 4 columns and about 100 rows.
How can I create a bar plot where I get the right chronological order on the x-axis with months 1-12 repeating for every year?
Update on OP's request (see comments):
I have used make_date() from the lubridate package to create a new column date. This time date is of class Date, with the day set to the first of the month.
With pivot_longer() we bring the data into the long shape needed to plot the bars.
Then we use scale_x_date() to show a label every 4 months.
library(lubridate)
library(tidyverse)
library(scales)
m1 <- m %>%
mutate(date = make_date(YEAR, MONTH)) %>%
pivot_longer(
cols = c(AUTO, IMP),
names_to = "names",
values_to = "values"
)
ggplot(m1, aes(x=date, y=values, fill=names)) +
geom_bar(position='dodge', stat='identity') +
scale_x_date(date_breaks = "4 months" , date_labels = "%b-%y")+
theme_bw()
Output:
Original answer:
We create a character column Date. To be a date we would need a day, therefore we leave it as character class.
With sprintf() we put the two columns YEAR and MONTH together:
library(ggplot2)
m$Date <- with(m, sprintf("%d-%02d", YEAR, MONTH))
ggplot(m, aes(fill=AUTO, y=IMP, x=Date)) +
geom_bar(position='dodge', stat='identity')
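The zero padding in "%02d" is what keeps the character labels in chronological order, since a discrete axis sorts character values alphabetically. A small sketch with made-up year and month values (not taken from m) shows the difference:

```r
# Hypothetical year/month values for illustration
year  <- c(2009, 2009, 2010)
month <- c(2, 10, 1)

# Without padding, "2009-10" sorts before "2009-2"
unpadded <- paste(year, month, sep = "-")
sorted_unpadded <- sort(unpadded, method = "radix")

# With zero padding, alphabetical order equals chronological order
padded <- sprintf("%d-%02d", year, month)
sorted_padded <- sort(padded, method = "radix")
```

Here method = "radix" is used only to make the sort independent of the locale.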
data:
m <- structure(list(YEAR = c(2009, 2009, 2009, 2009, 2009, 2009, 2009,
2009, 2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2011), MONTH = c(1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 1), AUTO = c(0, 0.003130344, 0.01565172, 0.338077161,
0.01565172, 0, 0, 0.003130344, 0, 0, 0, 0, 0.012277821, 0.021486187,
0.009208366, 0.009208366, 0, 0, 0, 0, 0, 0, 0, 0, 0), IMP = c(0,
0, 0.037564129, 0.062606882, 0.006260688, 0, 0, 0, 0, 0, 0, 0,
0, 0.006138911, 0, 0.003069455, 0, 0, 0, 0, 0, 0, 0, 0, 0)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -25L), spec = structure(list(
cols = list(YEAR = structure(list(), class = c("collector_double",
"collector")), MONTH = structure(list(), class = c("collector_double",
"collector")), AUTO = structure(list(), class = c("collector_double",
"collector")), IMP = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
set.seed(1)
m <- data.frame(year = rep(2019:2021,each=12),
month = rep(1:12,3),
AUTO = rnorm(36,30,1),
IMP = rnorm(36,20,4)
)
library(dplyr)
library(tidyr)
library(ggplot2)
m %>%
mutate(date = as.Date(paste0(year,"-",month,"-01"))) %>%
pivot_longer(IMP:AUTO, names_to = "group", values_to = "value") %>%
ggplot(., aes(fill=group, y=value, x=date)) +
geom_bar(position='dodge', stat='identity')

Annotate ggplot based on a second data frame

I have a faceted plot made with ggplot that is already working, it shows data about river altitude against years. I'm trying to add arrows based on a second dataframe which details when floods occurred.
Here's the current plot:
I would like to draw arrows in the top part of each graph based on date information in my second dataframe where each row corresponds to a flood and contains a date.
The link between the two dataframes is the Station_code column, each river has one or more stations which is indicated by this data (in this case only the Var river has two stations).
Here is the dput of the data frame used to create the original plot:
structure(list(River = c("Durance", "Durance", "Durance", "Durance",
"Roya", "Var"), Reach = c("La Brillanne", "Les Mées", "La Brillanne",
"Les Mées", "Basse vallée", "Basse vallée"), Area_km = c(465,
465, 465, 465, 465, 465), Type = c("restored", "target", "restored",
"target", "witness", "restored"), Year = c(2017, 2017, 2012,
2012, 2018, 2011), Restoration_year = c(2013, 2013, 2013, 2013,
NA, 2009), Station_code = c("X1130010", "X1130010", "X1130010",
"X1130010", "Y6624010", "Y6442015"), BRI_adi_moy_sstransect = c(0.00375820736746399,
0.00244752138003355, 0.00446807607783864, 0.0028792618981479,
0.00989200896930529, 0.00357247516596474), SD_sstransect = c(0.00165574247612667,
0.0010044634990875, 0.00220534492332107, 0.00102694633805149,
0.00788573233793128, 0.00308489160008849), min_BRI_sstransect = c(0.00108123849595469,
0.00111493913953216, 0.000555500340370182, 0.00100279590198288,
0, 0), max_BRI_sstransect = c(0.0127781240385231, 0.00700537285706352,
0.0210216858227621, 0.00815151653110584, 0.127734814926934, 0.0223738711013954
), Nb_sstr_unique_m = c(0.00623321576795815, 0.00259754717331206,
0.00117035034437559, 0.00209845092352825, 0.0458628969163946,
3.60620609570031), BRI_adi_moy_transect = c(0.00280232169999531,
0.00173868254527501, 0.00333818552810438, 0.00181398859573415,
0.00903651639185542, 0.00447856455432537), SD_transect = c(0.00128472161839638,
0.000477209421076879, 0.00204050725984513, 0.000472466654940182,
0.00780731734792112, 0.00310039904793707), min_BRI_transect = c(0.00108123849595469,
0.00106445386542223, 0.000901992689363725, 0.000855135344651009,
0.000944414463851629, 0.000162012161197014), max_BRI_transect = c(0.00709151795418251,
0.00434366293208643, 0.011717024999411, 0.0031991369873946, 0.127734814926934,
0.0187952134332499), Nb_tr_unique_m = c(0, 0, 0, 0, 0, 0), Error_reso = c(0.0011,
8e-04, 0.0018, 0.0011, 0.0028, 0.0031), W_BA = c(296.553323029366,
411.056574923547, 263.944186046512, 363.32874617737, 88.6420798065296,
158.66866970576), W_BA_sd = c(84.1498544481585, 65.3909073242282,
100.067554749308, 55.5534084807705, 35.2337070278364, 64.6978349498119
), W_BA_min = c(131, 206, 33, 223, 6, 45), W_BA_max = c(472,
564, 657, 513, 188, 381), W_norm = c(5.73271228619998, 7.9461900926133,
5.10234066090722, 7.02355699765464, 5.09378494746752, 4.81262001531126
), W_norm_sd = c(1.62671218635823, 1.2640804493236, 1.93441939783807,
1.07391043231191, 2.02469218788178, 1.96236658443141), W_norm_min = c(2.53237866910643,
3.98221378500706, 0.637927450996277, 4.31084307794454, 0.344787822572658,
1.36490651299098), W_norm_max = c(9.12429566273463, 10.9027600715727,
12.7005556152895, 9.91687219276031, 10.8033517739433, 11.5562084766569
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
And here is the dput of the data frame containing the flood dates:
structure(list(Station_code = c("Y6042010", "Y6042010", "Y6042010",
"Y6042010", "Y6042010", "Y6042010"), Date = structure(c(12006,
12007, 12016, 12017, 13416, 13488), class = "Date"), Qm3s = c(156,
177, 104, 124, 125, 90.4), Qual = c(5, 5, 5, 5, 5, 5), Year = c(2002,
2002, 2002, 2002, 2006, 2006), Month = c(11, 11, 11, 11, 9, 12
), Station_river = c("Var#Entrevaux", "Var#Entrevaux", "Var#Entrevaux",
"Var#Entrevaux", "Var#Entrevaux", "Var#Entrevaux"), River = c("Var",
"Var", "Var", "Var", "Var", "Var"), Mod_inter = c(13.32, 13.32,
13.32, 13.32, 13.32, 13.32), Qm3s_norm = c(11.7117117117117,
13.2882882882883, 7.80780780780781, 9.30930930930931, 9.38438438438438,
6.78678678678679), File_name = c("Var#Entrevaux.dat", "Var#Entrevaux.dat",
"Var#Entrevaux.dat", "Var#Entrevaux.dat", "Var#Entrevaux.dat",
"Var#Entrevaux.dat"), Station_name = c("#Entrevaux", "#Entrevaux",
"#Entrevaux", "#Entrevaux", "#Entrevaux", "#Entrevaux"), Reach = c("Daluis",
"Daluis", "Daluis", "Daluis", "Daluis", "Daluis"), Restauration_year = c(2009,
2009, 2009, 2009, 2009, 2009), `Area_km[BH]` = c(676, 676, 676,
676, 676, 676), Starting_year = c(1920, 1920, 1920, 1920, 1920,
1920), Ending_year = c("NA", "NA", "NA", "NA", "NA", "NA"), Accuracy = c("good",
"good", "good", "good", "good", "good"), Q2 = c(86, 86, 86, 86,
86, 86), Q5 = c(120, 120, 120, 120, 120, 120), Q10 = c(150, 150,
150, 150, 150, 150), Q20 = c(170, 170, 170, 170, 170, 170), Q50 = c(200,
200, 200, 200, 200, 200), Data_producer = c("DREAL_PACA", "DREAL_PACA",
"DREAL_PACA", "DREAL_PACA", "DREAL_PACA", "DREAL_PACA"), Coord_X_L2e_Z32 = c(959313,
959313, 959313, 959313, 959313, 959313), Coord_Y_L2e_Z32 = c(1893321,
1893321, 1893321, 1893321, 1893321, 1893321), Coord_X_L93 = c(1005748.88,
1005748.88, 1005748.88, 1005748.88, 1005748.88, 1005748.88),
Coord_Y_L93 = c(6324083.97, 6324083.97, 6324083.97, 6324083.97,
6324083.97, 6324083.97), New_FN = c("Var#Entrevaux.csv",
"Var#Entrevaux.csv", "Var#Entrevaux.csv", "Var#Entrevaux.csv",
"Var#Entrevaux.csv", "Var#Entrevaux.csv"), NA_perc = c(14.92,
14.92, 14.92, 14.92, 14.92, 14.92), Q2_norm = c(6.45645645645646,
6.45645645645646, 6.45645645645646, 6.45645645645646, 6.45645645645646,
6.45645645645646), Q5_norm = c(9.00900900900901, 9.00900900900901,
9.00900900900901, 9.00900900900901, 9.00900900900901, 9.00900900900901
), Q10_norm = c(11.2612612612613, 11.2612612612613, 11.2612612612613,
11.2612612612613, 11.2612612612613, 11.2612612612613), Q20_norm = c(12.7627627627628,
12.7627627627628, 12.7627627627628, 12.7627627627628, 12.7627627627628,
12.7627627627628), Q50_norm = c(15.015015015015, 15.015015015015,
15.015015015015, 15.015015015015, 15.015015015015, 15.015015015015
)), row.names = c(NA, -6L), groups = structure(list(Station_code = "Y6042010",
.rows = structure(list(1:6), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = 1L, class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
EDIT
Here is an example of what I would like to do on the plot:
This is the code I use currently to do the plot:
ggplot(data = tst_formule[tst_formule$River != "Roya",], aes(x = Year, y = BRI_adi_moy_transect, shape = River, col = Type)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = BRI_adi_moy_transect - SD_transect, ymax = BRI_adi_moy_transect + SD_transect), size = 0.7, width = 0.3) +
geom_errorbar(aes(ymin = BRI_adi_moy_transect - Error_reso, ymax = BRI_adi_moy_transect + Error_reso, linetype = "Error due to resolution"), size = 0.3, width = 0.3, colour = "black") +
scale_linetype_manual(name = NULL, values = 2) +
scale_shape_manual(values = c(15, 18, 17, 16)) +
scale_colour_manual(values = c("chocolate1", "darkcyan")) +
new_scale("linetype") +
geom_vline(aes(xintercept = Restoration_year, linetype = "Restoration"), colour = "chocolate1") +
scale_linetype_manual(name = NULL, values = 5) +
new_scale("linetype") +
geom_hline(aes(yintercept = 0.004, linetype = "Threshold"), colour= 'black') +
scale_linetype_manual(name = NULL, values = 4) +
scale_y_continuous("BRI*", limits = c(min(tst_formule$BRI_adi_moy_transect - tst_formule$SD_transect, tst_formule$BRI_adi_moy_transect - tst_formule$Error_reso ), max(tst_formule$BRI_adi_moy_transect + tst_formule$SD_transect, tst_formule$BRI_adi_moy_transect + tst_formule$Error_reso))) +
scale_x_continuous(limits = c(min(tst_formule$Year - 1),max(tst_formule$Year + 1)), breaks = scales::breaks_pretty(n = 6)) +
theme_bw() +
facet_wrap(vars(River)) +
theme(legend.spacing.y = unit(-0.01, "cm")) +
guides(shape = guide_legend(order = 1),
colour = guide_legend(order = 2),
line = guide_legend(order = 3))
After some tests and more research, I managed to do it by passing the second data frame to geom_segment():
new_scale("linetype") +
geom_segment(data = Flood_plot, aes(x = Date, xend = Date, y = 0.025, yend = 0.020, linetype = "Morphogenic flood"), arrow = arrow(length = unit(0.2, "cm")), inherit.aes = F) +
scale_linetype_manual(name = NULL, values = 1, guide = guide_legend(order = 6)) +
new_scale() (from the ggnewscale package) starts a new linetype scale after the ones I created before. geom_segment() draws the arrows I wanted (geom_text() can take a second data frame in the same way), and scale_linetype_manual() adds the arrow to the legend without the "linetype" title above it. The second data frame has the same River column as the first one, so the arrows are placed in the matching facet panels.
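The same idea in a self-contained form, as a sketch under stated assumptions: obs and floods are made-up data frames, value and flood_year are illustrative column names, and the flood dates are converted to decimal years with lubridate::decimal_date() so they fit a numeric Year axis. Whether that conversion is needed depends on how Flood_plot stores its dates, which the post does not show.

```r
library(ggplot2)
library(lubridate)

# Hypothetical main data: one observation per river and year
obs <- data.frame(
  River = c("Durance", "Durance", "Var", "Var"),
  Year = c(2012, 2017, 2011, 2016),
  value = c(0.0033, 0.0028, 0.0045, 0.0030)
)

# Hypothetical flood data sharing the faceting column River
floods <- data.frame(
  River = c("Durance", "Var"),
  Date = as.Date(c("2014-06-15", "2013-03-10"))
)
# Express each flood date as a decimal year to match the numeric x axis
floods$flood_year <- decimal_date(floods$Date)

p <- ggplot(obs, aes(x = Year, y = value)) +
  geom_point(size = 3) +
  # The second layer uses its own data; inherit.aes = FALSE stops it from
  # looking for the y aesthetic of the main plot in the flood data
  geom_segment(
    data = floods,
    aes(x = flood_year, xend = flood_year, y = 0.006, yend = 0.005),
    arrow = arrow(length = unit(0.2, "cm")),
    inherit.aes = FALSE
  ) +
  facet_wrap(vars(River))
```

Because floods contains the River column, facet_wrap() sends each arrow to the panel of its own river.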

Create new column with percentages in data frame

I have the following dataframe:
dput(df1)
structure(list(month = c(1, 1, 2, 2, 3, 4), transaction_type = c("AAA",
"BBB", "BBB", "CCC",
"DDD", "AAA"), max_wt_per_month = c(54.9,
51.6833333333333, 52.3333333333333, 49.4666666666667, 49.85,
48.5833333333333), min_wt_per_month = c(0, 0, 0, 0, 0, 0), avg_wt_per_month = c(8.41701333107861,
7.65211141060198, 6.44184012508551, 7.74798927613941, 7.4360566888844,
7.50611319574734), prop = c(Inf, Inf, Inf, Inf, Inf, Inf)), .Names = c("month",
"transaction_type", "max_wt_per_month", "min_wt_per_month", "avg_wt_per_month",
"prop"), row.names = c(NA, -6L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = list(month), drop = TRUE, indices = list(
0:5), group_sizes = 6L, biggest_group_size = 6L, labels = structure(list(
month = 1), row.names = c(NA, -1L), class = "data.frame", vars = list(
month), drop = TRUE, .Names = "month"))
I want to create column prop that would contain the percentage of maximum waiting time with respect to each month. If I run this code, then I get Inf values in most of the rows... (especially it is evident in the real dataset):
my_fun=function(vec){
100*as.numeric(vec[3]) /
sum(with(data_merged_transactions, ifelse(month == vec[1], max_wt_per_month, 0))) }
data_merged_transactions$prop=apply(data_merged_transactions , 1 , my_fun)
I then finally need to create the filled area chart so that each area would be a percentage out of 100%:
ggplot(data_merged_transactions, aes(x=month, y=prop, fill=transaction_type)) +
geom_area(alpha=0.6 , size=1, colour="black")
Why do I get Inf if the sum is not equal to 0?
Moreover, is it possible to create filled area chart with months being factors (Jan, Feb,etc.), not numbers? I tried to substitute month id's by month names, but then I got very thin bars instead of a filled area.
Is this what you were looking for?
library(tidyverse)
df1_tidy <- df1 %>%
group_by(month) %>%
summarise(SUM = sum(max_wt_per_month)) %>%
full_join(df1) %>%
mutate(prop = max_wt_per_month / SUM)
ggplot(data = df1_tidy,
aes(x = month,
y = prop,
fill = transaction_type)) +
geom_area(alpha = 0.6,
size = 1,
colour = "black") +
scale_x_continuous(breaks = 1:4, labels = c("Jan", "Feb", "Mar", "Apr"))
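As for why the original code returned Inf: one likely explanation (my inference from how apply() works, not something stated above) is that apply() converts the data frame to a character matrix first. Numeric columns are then formatted to a common width, so month 1 becomes " 1" once a two-digit month such as 12 is present, and the comparison month == vec[1] never matches. The sum in the denominator is then 0. A sketch with made-up values follows; toy and the other object names are illustrative, and multiplying by 100 is my choice to get percentages.

```r
library(dplyr)

# Hypothetical toy data: month 12 forces two-character month strings
toy <- data.frame(
  month = c(1, 1, 12),
  transaction_type = c("AAA", "BBB", "AAA"),
  max_wt_per_month = c(30, 10, 20)
)

# apply() calls as.matrix() on a data frame; with a character column
# present, numeric columns are passed through format() and padded
as_chr <- as.matrix(toy)
month_as_seen_by_apply <- as_chr[1, "month"]

# The padded string matches no numeric month, so the denominator is 0
matched_sum <- sum(ifelse(toy$month == month_as_seen_by_apply,
                          toy$max_wt_per_month, 0))
broken_prop <- 100 * toy$max_wt_per_month[1] / matched_sum

# Group-wise alternative that avoids apply() altogether
toy_prop <- toy %>%
  group_by(month) %>%
  mutate(prop = 100 * max_wt_per_month / sum(max_wt_per_month)) %>%
  ungroup()
```

The grouped mutate() computes each row's share of its month's total directly, so no row-wise string comparison is involved.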
