Converting month_year variable into week_year (dplyr) & (lubridate)

I have a dataset structured as follows, where I am tracking collective action mentions by subreddit by month, relative to a policy treatment that was introduced on Feb 17th, 2012. As a result, the period "Feb 2012" appears twice in my dataset: the "pre" row covers the days of Feb 2012 before treatment, and the "post" row covers the days after it.
treatment_status month_year collective_action_percentage
pre Dec 2011 5%
pre Jan 2012 8%
pre Feb 2012 10%
post Feb 2012 3%
post March 2012 10%
I am not sure how best to visualize this indicator by month. I made the following graph, but since I am interested in showing how collective action mentions decline after treatment, I was wondering whether presenting the variable on a week-and-year basis, rather than month-and-year, would be clearer.
ggplot(data = df1, aes(x = as.Date(month_year), fill = collective_action_percentage ,y = collective_action_percentage)) +
geom_bar(stat = "identity", position=position_dodge()) +
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
xlab("Criticism by individuals active before and after treatment") +
theme_classic()+
theme(plot.title = element_text(size = 10, face = "bold"),
axis.text.x = element_text(angle = 90, vjust = 0.5))
I created the month_year variable as follows using the zoo package:
df<- df %>%
mutate(month_year = zoo::as.yearmon(date))
Finally, I tried aggregating the data on a weekly basis as follows. However, since my dataset spans multiple years, I would ideally like to aggregate by week and year, not simply by week:
df2 %>% group_by(week = isoweek(time)) %>% summarise(value = mean(values))
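One way to aggregate by week and year rather than by week alone is to group on the ISO year together with the ISO week. The sketch below is only an illustration: the small df2 is made-up data standing in for the real one, and it assumes that time is a Date column and values is numeric. isoyear() is used instead of year() so that the days around New Year stay with the ISO week they belong to.

```r
library(dplyr)
library(lubridate)

# Made-up daily data crossing a year boundary (stand-in for the real df2)
df2 <- data.frame(
  time   = seq(as.Date("2011-12-26"), as.Date("2012-01-08"), by = "day"),
  values = 1:14
)

weekly <- df2 %>%
  # ISO year + ISO week identify a week uniquely across years
  group_by(year = isoyear(time), week = isoweek(time)) %>%
  summarise(
    week_start = min(time),    # first observed day of the week, usable as an x value
    value      = mean(values),
    .groups    = "drop"
  )

weekly
```

The week_start column can then be mapped to the x-axis of a plot so that weeks from different years do not overlap.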

Plot a point for each row and connect them with a line so that it is clear what the order is. We also color the pre and post points differently and make treatment status a factor so that we can order the pre level before the post level.
library(ggplot2)
library(zoo)
df2 <- transform(df1, month_year = as.yearmon(month_year, "%b %Y"),
treatment_status = factor(treatment_status, c("pre", "post")))
ggplot(df2, aes(month_year, collective_action_percentage)) +
geom_point(aes(col = treatment_status), cex = 4) +
geom_line()
Note
We assume df1 is as follows. We have already removed the % signs.
df1 <-
structure(list(treatment_status = c("pre", "pre", "pre", "post",
"post"), month_year = c("Dec 2011", "Jan 2012", "Feb 2012", "Feb 2012",
"March 2012"), collective_action_percentage = c(5L, 8L, 10L,
3L, 10L)), class = "data.frame", row.names = c(NA, -5L))


How to create timberplots in R

I am looking for a way to display the estimates of a meta-analysis with lots of comparisons in a wide format instead of a forest plot. I came across the timberplot shown in figure 1 of this publication:
https://www.researchgate.net/publication/283078594_Translational_failure_of_anti-inflammatory_compounds_for_myocardial_infarction_A_meta-Analysis_of_large_animal_models
So far, I have not been able to find any R code to create timberplots. Any hints would be highly appreciated.
As an example, here is a snippet of my current data:
structure(list(Author = c("Zuloaga 2014", "Kelly-Cobbs 2013",
"Kurita 2020", "Li (a) 2010", "Li (b) 2010", "Luo 2017", "Zhang 2016",
"Chen 2011", "Iwata 2015", "Guan 2011", "Mishiro 2014", "Zhang 2016",
"Rewell 2010", "Desilles 2017", "Cai 2018", "Yang 2015", "Augestad 2020",
"Kumas 2016", "Li 2004", "Pintana 2019", "Gao 2010", "Zhu 2016",
"Li 2013", "Chen 2019", "Iwata 2014"), Effect.size = c(35.200386286818,
-83.4784185709104, 36.1567339277335, -67.2836145890038, -66.2782956058588,
50.6942625098245, 2.16606498194945, 34.0909090909091, 34.6207954981455,
-75.7847533632287, 3.79249627522687, 33.8242513500245, 20.4,
53.381981476284, 55.8256496227997, 37.7068384829404, 35.7624831309042,
34.2436848134081, 44.0740740740741, 11.3382899628253, 78.1728075845723,
43.7891335083821, 32.0754716981132, 24.8822975517891, 56.9998933755769
), Standard.error = c(12.4780629739639, 35.8172017746254, 2.51216141038517,
45.4714925944508, 14.9052728665095, 15.9630454594002, 12.7738671567103,
7.27627754260179, 6.95739967875146, 6.46735654871385, 6.32805324709443,
4.51368516355712, 11.6488966431553, 12.4958199880194, 13.0017602415415,
12.1147303263766, 33.7832025707735, 21.5383168322688, 13.0893311456905,
21.8148377078391, 17.226146227274, 2.16584647411636, 6.82104394943358,
17.2913669783741, 4.81056206059614)), row.names = c(NA, 25L), class = "data.frame")
I ran a meta-analysis using the metagen() command from the meta package with the following code:
ma_results <- metagen(
`Effect.size`,
`Standard.error`,
sm = "NMD",
data = df,
studlab = Author,
random = TRUE,
method.tau = "REML",
prediction = TRUE
)
In the following metagen() object, the effect sizes are stored in ma_results$TE and the lower and upper bounds in ma_results$lower and ma_results$upper.
Following the suggestion of Alan Cameron (see below) my current code looks like:
ggplot(within(ma_results[order(ma_results$TE), ], id <- seq(nrow(25))), aes(id, TE)) +
geom_point(size = 0.5) +
geom_linerange(aes(ymin = lower, ymax = upper)) +
geom_hline(yintercept = TE.random, linetype = 2) +
theme_bw()
Here I get an error because of wrong number of dimensions within ma_results[order(ma_results$TE),].
It's fairly easy to create a plot like this using geom_linerange in ggplot. Here's an example with made up data. Whether you will be able to do this with your own data can't be known without a reproducible example:
library(ggplot2)
set.seed(1)
df <- data.frame(mean = runif(200), CI = runif(200))
ggplot(within(df[order(df$mean), ], id <- seq(nrow(df))), aes(id, mean)) +
geom_point(size = 0.5) +
geom_linerange(aes(ymin = mean - CI, ymax = mean + CI)) +
geom_hline(yintercept = mean(df$mean), linetype = 2) +
theme_bw()
EDIT
With the sample data, we can now do the following:
Make the papers a factor variable, with the ordering of the factor being from the lowest to highest effect size
Add upper and lower columns representing one standard error above and one standard error below the effect size. If you want this to be a 95% confidence interval instead, do effect size +/- 1.96 times the standard error.
First, we need to make sure every paper is uniquely identified. At the moment, your sample data contains two different papers with the same name (Zhang 2016), so we need to change one of them to mark it as unique:
df$Author[12] <- "Zhang (b) 2016"
Now let's get the papers arranged by effect size, and add our lower and upper bounds for each paper:
df$Author <- factor(df$Author, df$Author[order(df$Effect.size)])
df$lower <- df$Effect.size - df$Standard.error
df$upper <- df$Effect.size + df$Standard.error
The plot itself is then just:
ggplot(df, aes(Author, Effect.size)) +
geom_point() +
geom_linerange(aes(ymin = lower, ymax = upper)) +
geom_hline(yintercept = mean(df$Effect.size), linetype = 2) +
annotate(geom = 'text', x = 1, y = mean(df$Effect.size), vjust = -0.5,
label = paste('Mean =', round(mean(df$Effect.size), 1)), hjust = 0) +
theme_light() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
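If the intervals should come from the fitted meta-analysis object rather than from the raw data frame, one option is to copy the relevant elements into a data frame first, since the object itself cannot be indexed with [rows, ] like a data frame. The following is a minimal sketch under assumptions: the ma_results list below is a hand-made stand-in with made-up numbers, using the element names mentioned in the question (TE, lower, upper, TE.random) plus a studlab element that is assumed to hold the study labels.

```r
library(ggplot2)

# Hand-made stand-in for the metagen() result (made-up numbers, assumed element names)
ma_results <- list(
  studlab   = c("Study A", "Study B", "Study C"),
  TE        = c(35.2, -83.5, 36.2),
  lower     = c(10.7, -153.7, 31.2),
  upper     = c(59.7, -13.3, 41.1),
  TE.random = 20
)

# Copy the per-study estimates into an ordinary data frame
plot_df <- data.frame(
  study = ma_results$studlab,
  TE    = ma_results$TE,
  lower = ma_results$lower,
  upper = ma_results$upper
)

# Order by effect size and add a running id for the x-axis
plot_df <- plot_df[order(plot_df$TE), ]
plot_df$id <- seq_len(nrow(plot_df))

p <- ggplot(plot_df, aes(id, TE)) +
  geom_point(size = 0.5) +
  geom_linerange(aes(ymin = lower, ymax = upper)) +
  geom_hline(yintercept = ma_results$TE.random, linetype = 2) +
  theme_bw()
p
```

Note that seq_len(nrow(plot_df)) is used for the id; nrow() applied to a bare number such as 25 returns NULL, so it cannot be used to build the sequence.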

ggplot: multiple time periods on same plot by month

I am trying to plot multiple time periods on the same time-series graph by month. This is my data: https://pastebin.com/458t2YLg. I was trying to avoid a dput() example because I think reducing the sample while keeping the structure of the original data would have caused confusion. Here is a glimpse of what it looks like:
date fl_all_cumsum
671 2015-11-02 0.785000
672 2015-11-03 1.046667
673 2015-11-04 1.046667
674 2015-11-05 1.099000
675 2015-11-06 1.099000
676 2015-11-07 1.099000
677 2015-11-08 1.151333
Basically, it is daily data that spans several years. My goal is to compare the cumulative snow gliding (fl_all_cumsum) of several winter seasons (
It is very similar to this: ggplot: Multiple years on same plot by month however, there are some differences, such as: 1) the time periods are not years but winter seasons (1.10.xxxx - 6.30.xxxx+1); 2) Because I care only about the winter periods I would like the x-axis to go only from October to end of June the following year; 3) the data is not consistent (there are a lot of NA gaps during the months).
I managed to produce this:
library(zoo)
library(lubridate)
library(ggplot2)
library(scales)
library(patchwork)
library(dplyr)
library(data.table)
startTime <- as.Date("2016-10-01")
endTime <- as.Date("2017-06-30")
start_end <- c(startTime,endTime)
ggplot(data = master_dataset, aes(x = date, y = fl_all_cumsum))+
geom_line(size = 1, na.rm=TRUE)+
ggtitle("Cumulative Seasonal Gliding Distance")+
labs(color = "")+
xlab("Month")+
ylab("Accumulated Distance [mm]")+
scale_x_date(limits=start_end,breaks=date_breaks("1 month"),labels=date_format("%d %b"))+
theme(axis.text.x = element_text(angle = 50, size = 10 , vjust = 0.5),
axis.text.y = element_text(size = 10, vjust = 0.5),
panel.background = element_rect(fill = "gray100"),
plot.background = element_rect(fill = "gray100"),
panel.grid.major = element_line(colour = "lightblue"),
plot.margin = unit(c(1, 1, 1, 1), "cm"),
plot.title = element_text(hjust = 0.5, size = 22))
This actually works well visually, as the x-axis goes from October to June as desired; however, I did it by setting limits,
startTime <- as.Date("2016-10-01")
endTime <- as.Date("2017-06-30")
start_end <- c(startTime,endTime)
and then setting breaks of 1 month.
scale_x_date(limits=start_end,breaks=date_breaks("1 month"),labels=date_format("%d %b"))+
It is needless to say that this technique will not work if I would like to include other winter seasons and a legend.
I also tried to assign a season to certain time periods and then use them as a factor:
master_dataset <- master_dataset %>%
mutate(season = case_when(date>=as.Date('2015-11-02')&date<=as.Date('2016-06-30')~"season 2015-16",
date>=as.Date('2016-11-02')&date<=as.Date('2017-06-30')~"season 2016-17",
date>=as.Date('2017-10-13')&date<=as.Date('2018-06-30')~"season 2017-18",
date>=as.Date('2018-10-18')&date<=as.Date('2019-06-30')~"season 2018-19"))
ggplot(master_dataset, aes(month(date, label=TRUE, abbr=TRUE), fl_all_cumsum, group=factor(season),colour=factor(season)))+
geom_line()+
labs(x="Month", colour="Season")+
theme_classic()
As you can see, I managed to include the other seasons in the graph but there are several issues now:
grouped by month it aggregates the daily values and I lose the daily dynamic in the graph (look how it is based on monthly steps)
the x-axis goes in chronological order which messes up my visualization (remember I care for the winter season development so I need the x-axis to go from October-End of June; see the first graph I produced)
Not a big issue, but because the data has NA gaps, the legend also shows a factor level "NA"
I am not a programmer so I can't wrap my mind around on how to code for such an issue. In a perfect world, I would like to have something like the first graph I produced but with all winter seasons included and a legend. Does someone have a solution for this? Thanks in advance.
Zorin
This is indeed kind of a pain and rather fiddly. I create "fake dates" that are the same as your date column, but the year is set to 2015/2016 (using 2016 for the dates that will fall in February so leap days are not lost). Then we plot all the data, telling ggplot that it's all 2015-2016 so it gets plotted on the same axis, but we don't label the year. (The season labels are used and are not "fake".)
## Configure some constants:
start_month = 10 # first month on x-axis
end_month = 6 # last month on x-axis
fake_year_start = 2015 # year we'll use for start_month-December
fake_year_end = fake_year_start + 1 # year we'll use for January-end_month
fake_limits = c( # x-axis limits for plot
ymd(paste(fake_year_start, start_month, "01", sep = "-")),
ceiling_date(ymd(paste(fake_year_end, end_month, "01", sep = "-")), unit = "month")
)
df = df %>%
mutate(
## add (real) year and month columns
year = year(date),
month = month(date),
## add the year for the season start and end
season_start = ifelse(month >= start_month, year, year - 1),
season_end = season_start + 1,
## create season label
season = paste(season_start, substr(season_end, 3, 4), sep = "-"),
## add the appropriate fake year
fake_year = ifelse(month >= start_month, fake_year_start, fake_year_end),
## make a fake_date that is the same as the real date
## except set all the years to the fake_year
fake_date = date,
fake_date = "year<-"(fake_date, fake_year)
) %>%
filter(
## drop irrelevant data
month >= start_month | month <= end_month,
!is.na(fl_all_cumsum)
)
ggplot(df, aes(x = fake_date, y = fl_all_cumsum, group = season,colour= season))+
geom_line()+
labs(x="Month", colour = "Season")+
scale_x_date(
limits = fake_limits,
breaks = scales::date_breaks("1 month"),
labels = scales::date_format("%d %b")
) +
theme_classic()
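To see what the fake-date trick does to individual rows, here is a small sketch on four hand-picked dates (they are illustrative, not taken from the real dataset). It repeats the same logic as above outside of mutate(): every date keeps its month and day, the year is replaced by 2015 or 2016, and the season label is still built from the real year.

```r
library(lubridate)

# Four illustrative dates from two different winter seasons
dates <- as.Date(c("2015-11-02", "2016-02-29", "2017-10-13", "2018-06-30"))
start_month <- 10

# October-December get the fake start year, January-June the fake end year
fake_year <- ifelse(month(dates) >= start_month, 2015, 2016)
fake_date <- dates
year(fake_date) <- fake_year

# The season label uses the real years
season_start <- ifelse(month(dates) >= start_month, year(dates), year(dates) - 1)
season <- paste(season_start, substr(season_start + 1, 3, 4), sep = "-")

data.frame(date = dates, fake_date = fake_date, season = season)
```

Because the fake end year (2016) is a leap year, the 29 February row survives the conversion.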

ggplot line chart does not show data correctly

I am trying to be as specific as possible.
The data I am working with looks like:
dates bsheet mro ciss
1 2008 Oct 490509 3.751000 0.8579982
2 2008 Nov 513787 3.434333 0.9153926
3 2008 Dec 570591 2.718742 0.9145012
4 2009 Jan 534985 2.323581 0.8811410
5 2009 Feb 528390 2.001000 0.8551557
6 2009 Mar 551730 1.662290 0.8286146
7 2009 Apr 514041 1.309333 0.7460113
8 2009 May 486151 1.097774 0.5925725
9 2009 Jun 484629 1.001000 0.5412631
10 2009 Jul 454379 1.001000 0.5398128
11 2009 Aug 458111 1.001000 0.3946989
12 2009 Sep 479956 1.001000 0.2232348
13 2009 Oct 448080 1.001000 0.2961637
14 2009 Nov 427756 1.001000 0.3871220
15 2009 Dec 448548 1.001000 0.3209175
and can be produced via
structure(list(dates = c("2008 Oct", "2008 Nov", "2008 Dec",
"2009 Jan", "2009 Feb", "2009 Mar", "2009 Apr", "2009 May", "2009 Jun",
"2009 Jul", "2009 Aug", "2009 Sep", "2009 Oct", "2009 Nov", "2009 Dec"
), bsheet = c(490509, 513787, 570591, 534985, 528390, 551730,
514041, 486151, 484629, 454379, 458111, 479956, 448080, 427756,
448548), mro = c(3.751, 3.43433333333333, 2.71874193548387, 2.32358064516129,
2.001, 1.66229032258065, 1.30933333333333, 1.09777419354839,
1.001, 1.001, 1.001, 1.001, 1.001, 1.001, 1.001), ciss = c(0.857998173913043,
0.9153926, 0.914501173913044, 0.881140954545454, 0.85515565,
0.828614636363636, 0.746011318181818, 0.592572476190476, 0.541263136363636,
0.539812782608696, 0.394698857142857, 0.223234772727273, 0.296163727272727,
0.387122047619048, 0.32091752173913)), row.names = c(NA, 15L), class = "data.frame")
The line chart I created using the following code
ciss_plot = ggplot(data = example) +
geom_line(aes(x = dates, y = ciss, group = 1)) +
labs(x = 'Time', y = 'CISS') +
scale_x_discrete(breaks = dates_breaks, labels = dates_labels) +
scale_y_continuous(limits = c(0, 1), breaks = c(seq(0, 0.8, by = 0.2)), expand = c(0, 0)) +
theme_bw() +
theme(axis.text.x = element_text(hjust = c(rep(0.5, 11), 0.8, 0.2)))
ciss_plot
for ggplot2 looks like:
whereas if I plot the same data with R's standard built-in plot() function using
plot(example$ciss, type = 'l')
results in
which obviously is NOT identical!
Could someone please help me out? These plots have already taken me forever and I cannot figure out where the problem is. I suspect something is wrong either with "group = 1" or with the data type of the example$dates column!
I am thankful for any constructive input!!
Thank you all in advance!
Manuel
Your date column is in character format. This means that ggplot will by default convert it to a factor and arrange it in alphabetical order, which is why the plot appears in a different shape. One way to fix this is to ensure you have the levels in the correct order before plotting, like this:
library(dplyr)
library(ggplot2)
dates_breaks <- as.character(example$dates)
ggplot(data = example %>% mutate(dates = factor(dates, levels = dates))) +
geom_line(aes(x = dates, y = ciss, group = 1)) +
labs(x = 'Time', y = 'CISS') +
scale_x_discrete(breaks = dates_breaks, labels = dates_breaks,
guide = guide_axis(n.dodge = 2)) +
scale_y_continuous(limits = c(0, 1), breaks = c(seq(0, 0.8, by = 0.2)),
expand = c(0, 0)) +
theme_bw()
A smarter way would be to convert the date column to actual date times, which allows greater freedom of plotting and prevents you having to use a grouping variable at all:
example <- example %>%
mutate(dates = as.POSIXct(strptime(paste(dates, "01"), "%Y %b %d")))
ggplot(example) +
geom_line(aes(x = dates, y = ciss)) +
labs(x = 'Time', y = 'CISS') +
scale_y_continuous(limits = c(0, 1), breaks = c(seq(0, 0.8, by = 0.2)),
expand = c(0, 0)) +
scale_x_datetime(breaks = seq(min(example$dates), max(example$dates), "year"),
labels = function(x) strftime(x, "%Y\n%b")) +
theme_bw() +
theme(panel.grid.minor.x = element_blank())
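The reordering described at the start of this answer can be checked directly in base R. A short sketch using the first five labels from the question: the default factor levels are sorted alphabetically, while passing levels explicitly keeps the order in which the rows appear.

```r
dates <- c("2008 Oct", "2008 Nov", "2008 Dec", "2009 Jan", "2009 Feb")

# Default: levels are the sorted unique values, so the months are shuffled
default_levels <- levels(factor(dates))
default_levels

# Explicit levels: the original (chronological) row order is preserved
ordered_levels <- levels(factor(dates, levels = dates))
ordered_levels
```

The first call puts "2008 Dec" before "2008 Nov" and "2008 Oct", which is the order ggplot uses on a discrete axis when it is handed plain character values.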

Changing Date Labels From Odd to Even Years

I want to make a seemingly trivial adjustment to the chart pictured below:
I would like the labels along the x-axis to be even years, rather than odd years. So instead of going from 2009 -> 2011 -> 2013, they should go from 2008 -> 2010 -> 2012, and so forth...
How do I go about doing this?
Here is the code:
germany_yields <- read.csv(file = "Germany 10-Year Yield Weekly (2007-2020).csv", stringsAsFactors = F)
italy_yields <- read.csv(file = "Italy 10-Year Yield Weekly (2007-2020).csv", stringsAsFactors = F)
germany_yields <- germany_yields[, -(3:6)]
italy_yields <- italy_yields[, -(3:6)]
colnames(germany_yields)[1] <- "Date"
colnames(germany_yields)[2] <- "Germany.Yield"
colnames(italy_yields)[1] <- "Date"
colnames(italy_yields)[2] <- "Italy.Yield"
combined <- join(germany_yields, italy_yields, by = "Date")
combined <- na.omit(combined)
combined$Date <- as.Date(combined$Date,format = "%B %d, %Y")
combined["Spread"] <- combined$Italy.Yield - combined$Germany.Yield
fl_dates <- c(tail(combined$Date, n=1), head(combined$Date, n=1))
ggplot(data=combined, aes(x = Date, y = Spread)) + geom_line() +
scale_x_date(limits = fl_dates,
expand = c(0, 0),
date_breaks = "2 years",
date_labels = "%Y")
A way to do this, though not a very elegant one, would be to put these arguments in your scale_x_date():
scale_x_date(date_labels = "%Y",
breaks = ymd(unique(year(combined$Date)[year(combined$Date) %% 2 == 0]), truncated = 2L))
(we define the breaks manually, by taking the years present in the Date column and keeping only the even ones)
That's actually fairly simple. Just set the lower limit to a date in an even year, and set the upper limit to NA. As you haven't provided a reproducible example, here is an illustration on some fake data.
library(tidyverse)
mydates <- seq(as.Date("2007/1/1"), by = "3 months", length.out =100)
df <- tibble(
myvalue = rnorm(length(mydates))
)
# without limits argument
ggplot(df ) +
aes(x = mydates, y = myvalue) +
geom_line(size = 1L, colour = "#0c4c8a") +
scale_x_date(date_breaks = "2 years",
date_labels = "%Y")
# with limits argument
ggplot(df ) +
aes(x = mydates, y = myvalue) +
geom_line(size = 1L, colour = "#0c4c8a") +
scale_x_date(date_breaks = "2 years",
date_labels = "%Y",
limits = c(as.Date("2006/1/1"), NA))
Created on 2020-04-29 by the reprex package (v0.3.0)
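A third possibility, if the axis limits should stay untouched, is to build the even-year breaks explicitly with seq() and pass them to breaks. This is a sketch on fake data, not the original yield series; the column names Date and Spread simply mirror the ones in the question.

```r
library(ggplot2)

set.seed(1)
# Fake quarterly series from 2007 to 2020 (stand-in for the real spread data)
df <- data.frame(
  Date = seq(as.Date("2007-01-01"), by = "3 months", length.out = 56)
)
df$Spread <- rnorm(nrow(df))

# One break every two years, starting from an even year
even_breaks <- seq(as.Date("2008-01-01"), max(df$Date), by = "2 years")

p <- ggplot(df, aes(Date, Spread)) +
  geom_line() +
  scale_x_date(breaks = even_breaks, date_labels = "%Y")
p
```

Here the first break is hard-coded to 2008; for other data it would need to be set to the first even year in the range.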

Time series with ggplot2: Using days and hours from different columns

I am trying to plot a time series using ggplot, having the day and time stored in different data frame columns. How can I tell ggplot to take into account both the date and time in the plot?
My data looks like this
Date Hour_min Tair Tflower Tbud
Day1 8:35 24,73 29,79 31,41
Day1 8:36 24,29 29,99 31,82
... .. .. ... ...
Day2 00:00 23,62 30,37 32,59
One can load a small sample of the dataset with this:
#Tagua <- read.table(file = "TIMESERIE_OTO32.txt", header = TRUE,dec = ",")
Tagua <- structure(
list(
Date = structure(c(1L, 1L, 2L, 2L), .Label = c("Day1", "Day2"), class = "factor"),
Hour_min = structure(c(1L, 2L, 1L, 2L), .Label = c("8:35", "8:36"), class = "factor"),
Tair = c(24.73, 24.29, 23.62, 24.29),
Tflower = c(29.79, 29.99, 30.37, 29.99),
Tbud = c(31.41, 31.82, 32.59, 31.82)
),
.Names = c("Date", "Hour_min", "Tair", "Tflower", "Tbud"),
class = "data.frame",
row.names = c(NA, -4L))
The columns are the day, the hour and minute, and three temperatures measured on different parts of the flower.
I have 1400 minutes for 2 days.
I wrote this script:
library(ggplot2)
ggplot(aes(x = (Hour_min), group=1), data = Tagua) +
geom_line(aes(y = Tair, colour = "var1")) +
geom_line(aes(y = Tbud, colour = "var2")) +
geom_line(aes(y = Tflower, colour = "var3"))
The problem is that R plots from 00:00 to 23 (of course), without considering the days.
How can I solve this problem?
If possible, I would like to set the x-axis tick just corresponding to the hour (eg. 2:00, 3:00,...).
This may not be the shortest solution, but you can run it step by step and see how it works.
library(lubridate)
library(dplyr)
library(tidyr)
library(ggplot2)
Tagua <- read.table(file = "TIMESERIE_OTO32.txt", header = TRUE, dec = ",")
Tagua_clean <- Tagua %>%
# Separate hours and minutes (convert = TRUE turns them into integers):
separate(Hour_min, into = c("Hour", "Minute"), sep = ":", convert = TRUE) %>%
# Convert Day1 -> 0
# Day2 -> 1
mutate(Day = as.numeric(gsub("Day", "", Date)) - 1) %>%
# Create a Period:
mutate(time_period = period(days = Day, hours = Hour, minutes = Minute)) %>%
# Create a Date, using the beginning of the experiment (if you know it):
mutate(Date = as.POSIXct("2017-01-01") + time_period) %>%
# Option 2: Convert the time period to hours:
mutate(Hours = as.numeric(time_period)/3600) %>%
select(Date, Hours, Tair, Tflower, Tbud)
# Option 1: With real dates:
ggplot(aes(x = Date), data = Tagua_clean) +
geom_line(aes(y = Tair, colour = "var1")) +
geom_line(aes(y = Tbud, colour = "var2"))+
geom_line(aes(y = Tflower, colour = "var3"))
# Option 2: With hours:
ggplot(aes(x = Hours), data = Tagua_clean) +
geom_line(aes(y = Tair, colour = "var1")) +
geom_line(aes(y = Tbud, colour = "var2"))+
geom_line(aes(y = Tflower, colour = "var3"))
Update: Restart the hours to 0 every day. Here we use Dates but we customize how they are shown.
scale_x_datetime has the argument date_labels that can be set to "%H" to show the hour of the day or can be set to "Day %d \n Hour: %H" for a combination of day and hour. See ?strptime for more format options. Another argument that can be used is date_breaks to specify "1 hour" if you want a label every hour.
ggplot(aes(x = Date), data = Tagua_clean) +
geom_line(aes(y = Tair, colour = "var1")) +
geom_line(aes(y = Tbud, colour = "var2"))+
geom_line(aes(y = Tflower, colour = "var3")) +
scale_x_datetime(date_labels = "Day %d \n Hour: %H")
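The same date-time column can also be built without lubridate or tidyr. The sketch below uses base R only, on four rows shaped like the sample data; the start of the experiment (2017-01-01, in UTC) is an assumed placeholder, as in the code above.

```r
# Four rows shaped like the sample data
Date     <- c("Day1", "Day1", "Day2", "Day2")
Hour_min <- c("8:35", "8:36", "8:35", "8:36")

# Day1 -> 0 days after the start, Day2 -> 1 day after the start
day_offset <- as.numeric(sub("Day", "", Date)) - 1

# Split "8:35" into hours and minutes
parts   <- strsplit(Hour_min, ":", fixed = TRUE)
hours   <- as.numeric(sapply(parts, `[`, 1))
minutes <- as.numeric(sapply(parts, `[`, 2))

# Assumed start of the experiment; POSIXct arithmetic is done in seconds
origin   <- as.POSIXct("2017-01-01 00:00:00", tz = "UTC")
datetime <- origin + day_offset * 86400 + hours * 3600 + minutes * 60

format(datetime, "%Y-%m-%d %H:%M")
```

The resulting datetime vector can be used as the x aesthetic together with scale_x_datetime() in the same way as the Date column above.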
