Plotting histogram for data with start and end date - r

I have a data set that is something like this:
start_date end_date outcome
1 2014-07-18 2014-08-20 TRUE
2 2014-08-04 2014-09-23 TRUE
3 2014-08-01 2014-09-03 TRUE
4 2014-08-01 2014-09-03 TRUE
5 2014-12-10 2014-12-10 TRUE
6 2014-10-11 2014-11-07 TRUE
7 2015-04-27 2015-05-20 TRUE
8 2014-11-22 2014-12-25 TRUE
9 2015-03-24 2015-04-26 TRUE
10 2015-03-12 2015-04-10 FALSE
11 2014-05-29 2014-06-28 FALSE
12 2015-03-19 2015-04-20 TRUE
13 2015-03-25 2015-04-26 TRUE
14 2015-03-25 2015-04-26 TRUE
15 2014-07-09 2014-08-10 TRUE
16 2015-03-26 2015-04-26 TRUE
17 2014-07-09 2014-08-10 TRUE
18 2015-03-30 2015-04-28 TRUE
19 2014-03-13 2014-04-13 TRUE
20 2015-04-01 2015-04-29 TRUE
I want to plot a histogram where each bar corresponds to a month and it contains the proportion of FALSE / ALL = (FALSE + TRUE) in that month.
What is the easiest way to do this in R preferably using ggplot?

Here is one way; there may be better ones, but I will leave what I tried. The main job was to create a new data frame for the graphic. Using your data above, I first converted the factors to date objects; if your data already has date objects, you do not need this step. Then I counted the rows per month for start_date and for end_date using count(), bound the two data frames together, and calculated the proportion of FALSE for each month.
library(zoo)
library(dplyr)
library(ggplot2)
library(lubridate)
library(scales)  # provides date_format() and date_breaks()

# convert both date columns to dates, then to "year-month" strings
mydf %>%
  mutate(across(-outcome, ~ as.Date(as.character(.x)))) %>%
  mutate(across(-outcome, ~ paste(year(.x), "-", month(.x), sep = ""))) -> foo1

# count outcomes per start month
count(foo1, start_date, outcome) %>% rename(date = start_date) -> foo2

# count outcomes per end month, stack both counts, and compute
# the proportion of FALSE among all outcomes in each month
count(foo1, end_date, outcome) %>%
  rename(date = end_date) %>%
  bind_rows(foo2) %>%
  group_by(date, outcome) %>%
  summarize(total = sum(n)) %>%
  summarize(prop = sum(total[outcome == FALSE]) / sum(total)) %>%
  mutate(date = as.Date(as.yearmon(date))) -> foo3

ggplot(data = foo3, aes(x = date, y = prop)) +
  geom_bar(stat = "identity") +
  scale_x_date(labels = date_format("%Y-%m"), breaks = date_breaks("month")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 1))
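As a cross-check on the proportions, the same monthly ratio can be sketched in base R. This is only a minimal sketch, not the method used in the answer: it bins by start_date alone, and the toy data frame is invented for illustration.

```r
# Toy data in the shape of the question (values invented for illustration)
mydf <- data.frame(
  start_date = as.Date(c("2014-07-18", "2014-07-09", "2015-03-12",
                         "2015-03-24", "2015-03-25", "2015-03-26")),
  outcome    = c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Label each row with its "YYYY-MM" month
month <- format(mydf$start_date, "%Y-%m")

# The mean of a logical vector is the share of TRUE values, so the mean of
# !outcome is FALSE / (FALSE + TRUE) within each month
prop_false <- tapply(!mydf$outcome, month, mean)
print(prop_false)
```

The resulting named vector could be passed to barplot(prop_false), or turned into a data frame for ggplot.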

Related

Determine if the day of a month is in a date range, independent from its year

Given time ranges with a start and an end date, I can easily determine whether a specific date falls in a range. How can I determine whether a specific month/day combination lies in a time range, independent of its year?
Example
Suppose I would like to know whether any first of July (07-01) lies in a time range:
2020-01-30 - 2020-06-15 --> NO
2020-06-16 - 2021-03-20 --> YES
2013-04-26 - 2019-02-13 --> YES (multiple)
R Code Example
# set seed for sampling
set.seed(1)
# number of time ranges
cases <- 10
# time gaps in days
gaps <- sort(sample(x = 1:5000, size = cases, replace = TRUE))
# data frame with time ranges
df <- data.frame(dates_start = rev(Sys.Date() - gaps[2:cases] + 1),
dates_end = rev(Sys.Date() - gaps[1:(cases-1)]))
df
#> dates_start dates_end
#> 1 2009-06-26 2010-01-19
#> 2 2010-01-20 2011-06-05
#> 3 2011-06-06 2011-06-20
#> 4 2011-06-21 2013-04-21
#> 5 2013-04-22 2016-02-17
#> 6 2016-02-18 2016-08-05
#> 7 2016-08-06 2018-05-11
#> 8 2018-05-12 2019-10-09
#> 9 2019-10-10 2021-10-25
# Is specific date in date range
df$date_in_range <- df$dates_start <= lubridate::ymd("2019-07-01") &
lubridate::ymd("2019-07-01") < df$dates_end
# specific day of a month in date range
# pseudo code
data.table::between(x = month_day("07-01"),
lower = dates_start,
upper = dates_end)
#> Error in month_day("07-01"): could not find function "month_day"
# expected output
df$monthday_in_range <- c(T, T, F, T, T, T, T, T, T)
df
#> dates_start dates_end date_in_range monthday_in_range
#> 1 2009-06-26 2010-01-19 FALSE TRUE
#> 2 2010-01-20 2011-06-05 FALSE TRUE
#> 3 2011-06-06 2011-06-20 FALSE FALSE
#> 4 2011-06-21 2013-04-21 FALSE TRUE
#> 5 2013-04-22 2016-02-17 FALSE TRUE
#> 6 2016-02-18 2016-08-05 FALSE TRUE
#> 7 2016-08-06 2018-05-11 FALSE TRUE
#> 8 2018-05-12 2019-10-09 TRUE TRUE
#> 9 2019-10-10 2021-10-25 FALSE TRUE
Update 2
dplyr/data.table independent function
md_in_interval <- function(md, start, end) {
  # make lubridate's %within% available without attaching the package
  `%within%` <- lubridate::`%within%`
  # do the start and end years differ by more than one?
  # Then any month/day falls in this interval and hence the result is TRUE
  helper <- (lubridate::year(end) - lubridate::year(start)) > 1
  # lubridate time interval (built from the function arguments)
  interval <- lubridate::interval(start, end)
  # helper dates with month/day combination and start year
  my_date1 <- lubridate::mdy(paste0(md, "-", lubridate::year(start)))
  # helper dates with month/day combination and end year
  my_date2 <- lubridate::mdy(paste0(md, "-", lubridate::year(end)))
  # check if month/day combination falls within the interval
  out <- my_date1 %within% interval |
    my_date2 %within% interval |
    helper
  return(out)
}
Usage with data.table
library(data.table)
dt <- data.table::as.data.table(df)
dt[, isin := md_in_interval("06-05", dates_start, dates_end)][]
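For comparison, the same check can be sketched without any packages by enumerating every year a range touches and testing the candidate date in each. This is an illustrative alternative under stated assumptions, not the answer's implementation; the name md_in_range and the sample ranges (taken from the question's example) are only for demonstration.

```r
# Hypothetical helper: is the month/day `md` ("MM-DD") inside [start, end]
# for at least one year? Works element-wise on vectors of dates.
md_in_range <- function(md, start, end) {
  start <- as.Date(start)
  end <- as.Date(end)
  vapply(seq_along(start), function(i) {
    s <- start[i]
    e <- end[i]
    # every calendar year the range touches
    years <- seq(as.integer(format(s, "%Y")), as.integer(format(e, "%Y")))
    # candidate dates; an impossible date such as 02-29 in a non-leap year
    # becomes NA and is ignored through na.rm = TRUE
    candidates <- as.Date(paste0(years, "-", md), format = "%Y-%m-%d")
    any(candidates >= s & candidates <= e, na.rm = TRUE)
  }, logical(1))
}

res <- md_in_range(
  "07-01",
  start = c("2020-01-30", "2020-06-16", "2013-04-26"),
  end   = c("2020-06-15", "2021-03-20", "2019-02-13")
)
print(res)
```

Both endpoints are treated as inclusive here; whether that matches the intended semantics of a range is an assumption.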
Update
To handle ranges that span more than one year, we could use a helper column:
df %>%
  mutate(across(everything(), ymd),
         helper = ifelse(year(dates_end) - year(dates_start) > 1, 1, 0),
         interval = interval(dates_start, dates_end)) %>%
  mutate(my_date1 = mdy(paste0("07-01-", year(dates_start))),
         my_date2 = mdy(paste0("07-01-", year(dates_end)))) %>%
  mutate(check = my_date1 %within% interval | my_date2 %within% interval | helper == 1) %>%
  select(dates_start, dates_end, check)
dates_start dates_end check
1 2009-06-26 2010-01-19 TRUE
2 2010-01-20 2011-06-05 TRUE
3 2011-06-06 2011-06-20 FALSE
4 2011-06-21 2013-04-21 TRUE
5 2013-04-22 2016-02-17 TRUE
6 2016-02-18 2016-08-05 TRUE
7 2016-08-06 2018-05-11 TRUE
8 2018-05-12 2019-10-09 TRUE
9 2019-10-10 2021-10-25 TRUE
First answer:
We could use lubridate for this.
We create an interval with interval(), then check with %within% whether the day lies in the interval.
First, though, we have to build a full month-day-year date for 07-01. We do this with mdy(paste0("07-01-",year(dates_start)))
library(dplyr)
library(lubridate)
df %>%
mutate(across(, ymd),
interval = interval(dates_start, dates_end)) %>%
mutate(my_date = mdy(paste0("07-01-",year(dates_start)))) %>%
mutate(check = my_date %within% interval)
dates_start dates_end interval my_date check
1 2009-06-26 2010-01-19 2009-06-26 UTC--2010-01-19 UTC 2009-07-01 TRUE
2 2010-01-20 2011-06-05 2010-01-20 UTC--2011-06-05 UTC 2010-07-01 TRUE
3 2011-06-06 2011-06-20 2011-06-06 UTC--2011-06-20 UTC 2011-07-01 FALSE
4 2011-06-21 2013-04-21 2011-06-21 UTC--2013-04-21 UTC 2011-07-01 TRUE
5 2013-04-22 2016-02-17 2013-04-22 UTC--2016-02-17 UTC 2013-07-01 TRUE
6 2016-02-18 2016-08-05 2016-02-18 UTC--2016-08-05 UTC 2016-07-01 TRUE
7 2016-08-06 2018-05-11 2016-08-06 UTC--2018-05-11 UTC 2016-07-01 FALSE
8 2018-05-12 2019-10-09 2018-05-12 UTC--2019-10-09 UTC 2018-07-01 TRUE
9 2019-10-10 2021-10-25 2019-10-10 UTC--2021-10-25 UTC 2019-07-01 FALSE
You may try
library(lubridate)
library(dplyr)
df %>%
rowwise %>%
mutate(monthday_in_range = 7 %in% month(seq(floor_date(dates_start, "month"), dates_end, by = "month")))
dates_start dates_end monthday_in_range
<date> <date> <lgl>
1 2009-06-26 2010-01-19 TRUE
2 2010-01-20 2011-06-05 TRUE
3 2011-06-06 2011-06-20 FALSE
4 2011-06-21 2013-04-21 TRUE
5 2013-04-22 2016-02-17 TRUE
6 2016-02-18 2016-08-05 TRUE
7 2016-08-06 2018-05-11 TRUE
8 2018-05-12 2019-10-09 TRUE
9 2019-10-10 2021-10-25 TRUE
add
df %>%
rowwise %>%
mutate(monthday_in_range = 7 %in% month(seq(ymd(paste0(substr(dates_start, 1, 8), "13")), dates_end, by = "month")))

Missing data in R - How to skip grouping of days with missing information?

I have hourly values of temperature measurements and I wish to calculate the average per day only for complete (i.e. with 24 measurements) days. Incomplete days would then be summarized as "NA".
I have grouped the values together per year, month and day and call summarize().
I have three months of data missing, which appear as a gap in my ggplot, and that is what I want to achieve for the rest as well. The problem is that when I call summarize() to calculate the mean of my values, days with only 1 or 2 measurements also get averaged. Only days where all 24 values are missing appear as "NA".
Date TempUrb TempRur UHI
1 2011-03-21 22:00:00 10.1 11.67000 -1.570000
2 2011-03-21 23:00:00 9.9 11.67000 -1.770000
3 2011-03-22 00:00:00 10.9 11.11000 -0.210000
4 2011-03-22 01:00:00 10.7 10.56000 0.140000
5 2011-03-22 02:00:00 9.7 10.00000 -0.300000
6 2011-03-22 03:00:00 9.5 10.00000 -0.500000
7 2011-03-22 04:00:00 9.4 8.89000 0.510000
8 2011-03-22 05:00:00 8.4 8.33500 0.065000
9 2011-03-22 06:00:00 8.2 7.50000 0.700000
AvgUHI <- UHI %>% group_by(year(Date), add = TRUE) %>%
group_by(month(Date), add = TRUE) %>%
group_by(day(Date), add = TRUE, .drop = TRUE) %>%
summarize(AvgUHI = mean(UHI, na.rm = TRUE))
# A tibble: 2,844 x 4
# Groups: year(Date), month(Date) [95]
`year(Date)` `month(Date)` `day(Date)` AvgUHI
<int> <int> <int> <dbl>
1476 2015 4 4 0.96625000
1477 2015 4 5 -0.11909722
1478 2015 4 6 -0.60416667
1479 2015 4 7 -0.92916667
1480 2015 4 8 NA
1481 2015 4 9 NA
AvgUHI<- AvgUHI %>% group_by(`year(Date)`, add = TRUE) %>%
group_by(`month(Date)`, add = TRUE) %>%
summarize(AvgUHI= mean(AvgUHI, na.rm = TRUE))
# A tibble: 95 x 3
# Groups: year(Date) [9]
`year(Date)` `month(Date)` AvgUHI
<int> <int> <dbl>
50 2015 4 0.580887346
51 2015 5 0.453815051
52 2015 6 0.008479618
As you can see in the final table above, I get an average for 04-2015 even though I am missing data in that month (08 - 09/04/2015 in this example, shown in the second table).
The same happens when I calculate the daily AvgUHI and hourly data is missing.
I would simply like the AvgUHI for 04-2015 in the last table to be NA.
[Figure: example of my graph]
The following will give a data frame aggregated by day, where only the complete days, with 24 observations, are not NA. Then you can group by month to get the final data frame.
UHI %>%
mutate(Day = as.Date(Date)) %>%
group_by(Day) %>%
mutate(n = n(), tmpUHI = if_else(n == 24, UHI, NA_real_)) %>%
summarize(AvgUHI = mean(tmpUHI)) %>%
full_join(data.frame(Day = seq(min(.$Day), max(.$Day), by = "day"))) %>%
arrange(Day) -> AvgUHI
For hours look at Rui Barradas' answer. For months the following code worked:
AvgUHI %>%
  group_by(year(Day), month(Day)) %>%
  # keep a month only if at most 10 of its daily means are missing
  mutate(sum = sum(is.na(AvgUHI)), tmpUHI = if_else(sum <= 10, AvgUHI, NA_real_)) %>%
  summarise(AvgUHI = mean(tmpUHI, na.rm = TRUE)) -> AvgUHI
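The idea of "average only complete days, and let missing days propagate to the month" can also be sketched in base R. This is a minimal illustration on invented data, assuming a strict rule (any incomplete day makes the whole month NA) rather than the 10-day tolerance used above.

```r
# Invented hourly data: 2015-04-07 is complete (24 values),
# 2015-04-08 has only two measurements
times <- c(
  seq(as.POSIXct("2015-04-07 00:00:00", tz = "UTC"), by = "hour", length.out = 24),
  as.POSIXct(c("2015-04-08 00:00:00", "2015-04-08 01:00:00"), tz = "UTC")
)
uhi <- data.frame(Date = times, UHI = c(rep(1, 24), 5, 7))

# Calendar day of each measurement
day <- format(uhi$Date, "%Y-%m-%d", tz = "UTC")

# Daily mean only when all 24 hourly values are present, otherwise NA
daily_mean <- tapply(uhi$UHI, day, function(x) {
  if (length(x) == 24) mean(x) else NA_real_
})

# Monthly mean WITHOUT na.rm, so one incomplete day turns the month into NA
month <- substr(names(daily_mean), 1, 7)
monthly_mean <- tapply(as.vector(daily_mean), month, mean)
print(daily_mean)
print(monthly_mean)
```

Leaving out na.rm = TRUE in the monthly step is what makes the NA propagate; adding it back would average over the complete days only.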

plot activity of different departments

I have a dataset over some departments (dep. number), and in which timeframe a certain survey was made in that department. It looks like this
dep type inDate outDate
1 14 AA 2015-01-16 2015-04-25
2 10 AB 2014-05-01 2017-01-01
3 14 BA 2013-01-04 2015-04-06
4 11 CA 2016-09-10 2017-12-01
5 10 DD 2013-01-01 2013-12-01
...
I also have a startYear = 2013 and an endYear = 2017, marking when the surveys started and ended globally.
I want a plot for each of the departments. Each plot should show how many surveys were active at each point in the period between startYear and endYear. So for department 14, the plot should look like this:
Can someone point me in the right direction? I don't even know where to start.
df = read.table(text = "
dep type inDate outDate
1 14 AA 2015-01-16 2015-04-25
2 10 AB 2014-05-01 2017-01-01
3 14 BA 2013-01-04 2015-04-06
4 11 CA 2016-09-10 2017-12-01
5 10 DD 2013-01-01 2013-12-01
", header=T, stringsAsFactors=F)
library(tidyverse)
library(lubridate)
df %>%
  mutate_at(vars(inDate, outDate), ymd) %>% # update date columns to date format (if needed)
  mutate(dep = factor(dep)) %>% # update dep to factor (if it is not)
  group_by(dep, id = row_number()) %>% # for every row
  nest() %>% # nest data
  mutate(dates = map(data, ~seq(.x$inDate, .x$outDate, "1 day"))) %>% # create a sequence of dates
  unnest(dates) %>% # add that sequence of dates as column
  ungroup() %>% # drop the per-row grouping so counts are per department
  count(dep, dates) %>% # count live projects each day
  complete(dep, dates, fill = list(n = 0L)) %>% # add zeros to days that surveys weren't live
  ggplot(aes(dates, n, group = dep, col = dep)) + # plot
  geom_line() + # add line
  facet_wrap(~dep) # one plot for each department
You can remove +facet_wrap(~dep) if you want all departments in the same plot.
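The counting step at the heart of this answer can be illustrated for a single department in base R. This is a rough sketch rather than a replacement for the pipeline above; the data are the department 14 rows from the question, and the global period 2013-01-01 to 2017-12-31 is an assumption derived from startYear and endYear.

```r
# Rows for department 14, taken from the question
dep14 <- data.frame(
  type    = c("AA", "BA"),
  inDate  = as.Date(c("2015-01-16", "2013-01-04")),
  outDate = as.Date(c("2015-04-25", "2015-04-06"))
)

# Every day in the assumed global survey period
days <- seq(as.Date("2013-01-01"), as.Date("2017-12-31"), by = "day")

# For each day, count the surveys whose [inDate, outDate] range contains it
active <- vapply(seq_along(days), function(i) {
  sum(dep14$inDate <= days[i] & days[i] <= dep14$outDate)
}, integer(1))

print(max(active))
```

A step plot such as plot(days, active, type = "s") would then show the number of active surveys over time for that department.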

Validate time series index

I am using a dataset which is grouped with the group_by function of the dplyr package.
Each group has its own time index, which is supposed to consist of a sequence of 12 months.
This means that it can start in January and end in December, or in other cases start in June of one year and end in May of the next.
Here is the dataset example:
ID DATE
8 2017-01-31
8 2017-02-28
8 2017-03-31
8 2017-04-30
8 2017-05-31
8 2017-06-30
8 2017-07-31
8 2017-08-31
8 2017-09-30
8 2017-10-31
8 2017-11-30
8 2017-12-31
32 2017-01-31
32 2017-02-28
32 2017-03-31
32 2017-04-30
32 2017-05-31
32 2017-06-30
32 2017-07-31
32 2017-08-31
32 2017-09-30
32 2017-10-31
32 2017-11-30
32 2017-12-31
45 2016-09-30
45 2016-10-31
45 2016-11-30
45 2016-12-31
45 2017-01-31
45 2017-02-28
45 2017-03-31
45 2017-04-30
45 2017-05-31
45 2017-06-30
45 2017-07-31
45 2017-08-31
The problem is that, because of the size of the dataset, I can't validate visually whether there are so-called "jumps", in other words whether the dates are consistent. Is there a simple way to do that in R, perhaps some modification or combination of functions from the tibbletime package?
Any help will be appreciated.
Thank you in advance.
Here's how I would typically approach this problem using data.table. The cut.Date() and seq.Date() functions from base are the meat of the logic, so you can use the same approach with dplyr if desired.
library(data.table)
## Convert to data.table
setDT(df)
## Convert DATE to a date in case it wasn't already
df[,DATE := as.Date(DATE)]
## Order by ID and Date
setkey(df,ID,DATE)
## Create a column with the month of each date
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
## Generate a sequence of Dates by month for the number of observations
## in each group -- .N
df[,ExpectedMonth := seq.Date(from = min(Month),
by = "months",
length.out = .N), by = .(ID)]
## Create a summary table to test whether an ID had 12 observations where
## the actual month was equal to the expected month
Test <- df[Month == ExpectedMonth, .(Valid = ifelse(.N == 12L,TRUE,FALSE)), by = .(ID)]
print(Test)
# ID Valid
# 1: 8 TRUE
# 2: 32 TRUE
# 3: 45 TRUE
## Do a no-copy join of Test to df based on ID
## and create a column in df based on the 'Valid' column in Test
df[Test, Valid := i.Valid, on = "ID"]
## The final output:
head(df)
# ID DATE Month ExpectedMonth Valid
# 1: 8 2017-01-31 2017-01-01 2017-01-01 TRUE
# 2: 8 2017-02-28 2017-02-01 2017-02-01 TRUE
# 3: 8 2017-03-31 2017-03-01 2017-03-01 TRUE
# 4: 8 2017-04-30 2017-04-01 2017-04-01 TRUE
# 5: 8 2017-05-31 2017-05-01 2017-05-01 TRUE
# 6: 8 2017-06-30 2017-06-01 2017-06-01 TRUE
You could also do things a little more compactly, if you really wanted to, by using a self-join and skipping the creation of Test:
setDT(df)
df[,DATE := as.Date(DATE)]
setkey(df,ID,DATE)
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
df[,ExpectedMonth := seq.Date(from = min(Month), by = "months", length.out = .N), keyby = .(ID)]
df[df[Month == ExpectedMonth,.(Valid = ifelse(.N == 12L,TRUE,FALSE)),keyby = .(ID)], Valid := i.Valid]
You can use the summarise function from dplyr to return a logical value of whether there are any day differences greater than 31 within each ID. You do this by first constructing a temporary date using only the year and month and attaching "-01" as the fake day:
library(dplyr)
library(lubridate)
df %>%
group_by(ID) %>%
mutate(DATE2 = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
DATE_diff = c(0, diff(DATE2))) %>%
summarise(Valid = !any(DATE_diff > 31))
Result:
# A tibble: 3 x 2
ID Valid
<int> <lgl>
1 8 TRUE
2 32 TRUE
3 45 TRUE
You can also visually check if there are any gaps by plotting your dates for each ID:
library(ggplot2)
df %>%
mutate(DATE = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
ID = as.factor(ID)) %>%
ggplot(aes(x = DATE, y = ID, group = ID)) +
geom_point(aes(color = ID)) +
scale_x_date(date_breaks = "1 month",
date_labels = "%b-%Y") +
labs(title = "Time Line by ID")
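The gap check itself does not depend on data.table or dplyr. One way to sketch it in base R is to map each date to a running month number (year * 12 + month) and require that consecutive numbers differ by exactly one. The helper name month_index and the toy data are illustrative assumptions, not part of the answers above.

```r
# Hypothetical helper: running month number, so that consecutive calendar
# months always differ by exactly 1 (also across a year boundary)
month_index <- function(d) {
  d <- as.Date(d)
  as.integer(format(d, "%Y")) * 12L + as.integer(format(d, "%m"))
}

# Toy data: ID 8 has 12 consecutive months, ID 45 skips 2017-08
complete <- seq(as.Date("2016-09-15"), by = "month", length.out = 12)
with_gap <- c(complete[1:11], as.Date("2017-09-15"))
toy <- data.frame(ID = rep(c(8, 45), each = 12), DATE = c(complete, with_gap))

# An ID is valid if it has 12 rows and no jump between consecutive months
toy$idx <- month_index(toy$DATE)
valid <- tapply(toy$idx, toy$ID, function(i) {
  length(i) == 12L && all(diff(sort(i)) == 1L)
})
print(valid)
```

Unlike a day-difference threshold, this comparison is not affected by month lengths, since it works on month numbers only.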

Issue to have correct scale with date ggplot

I have two data frames, tur_e and tur_w. Below you can see both of them:
tur_e:
Time_f turbidity_E
1 2014-12-12 00:00:00 87
2 2014-12-12 00:15:00 87
3 2014-12-12 00:30:00 91
4 2014-12-12 00:45:00 84
5 2014-12-12 01:00:00 92
6 2014-12-12 01:15:00 89
tur_w:
Time_f turbidity_w
47 2015-06-04 11:45:00 8.4
48 2015-06-04 12:00:00 10.5
49 2015-06-04 12:15:00 9.2
50 2015-06-04 12:30:00 9.1
51 2015-06-04 12:45:00 8.7
52 2015-06-04 13:00:00 8.4
I then create a single data frame combining turbidity_E and turbidity_w. I match on the date (Time_f) and use melt to reshape the data:
dplr <- left_join(tur_e, tur_w, by=c("Time_f"))
dt.df <- melt(dplr, measure.vars = c("turbidity_E", "turbidity_w"))
I plotted a series of box plots over time. The code is below:
dt.df%>% mutate(Time_f = ymd_hms(Time_f)) %>%
ggplot(aes(x = cut(Time_f, breaks="month"), y = value)) +
geom_boxplot(outlier.size = 0.3) + facet_wrap(~variable, ncol=1)+labs(x = "time")
I obtain the following graph:
I would like to reduce the number of dates that appear on my x-axis. I added this line of code:
scale_x_date(breaks = date_breaks("6 months"),labels = date_format("%b"))
and got the following error:
Error: Invalid input: date_trans works with objects of class Date only
I tried a lot of different solutions, but none of them worked. Any help would be appreciated! Thanks!
Two things. First, you need to use scale_x_datetime (you don't have only dates, but also time!). Secondly, when you cut x, it actually just becomes a factor, losing any sense of time altogether. If you want a boxplot of each month, you can group by that cut instead:
dt.df %>% mutate(Time_f = lubridate::ymd_hms(Time_f)) %>%
ggplot(aes(x = Time_f, y = value, group = cut(Time_f, breaks="month"))) +
geom_boxplot(outlier.size = 0.3) +
facet_wrap(~variable, ncol = 1) +
labs(x = "time") +
scale_x_datetime(date_breaks = '1 month')
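The point about cut() can be verified in isolation. The following small base R sketch, using three invented timestamps, shows that binning a date-time vector by month returns a factor, which is why a date scale can no longer be applied to it.

```r
# Three invented timestamps spanning several months
times <- as.POSIXct(c("2014-12-12 00:00:00",
                      "2015-01-05 12:00:00",
                      "2015-06-04 11:45:00"), tz = "UTC")

# Binning by month turns the date-times into a factor of month labels
binned <- cut(times, breaks = "month")

print(class(times))   # date-time class
print(class(binned))  # factor, so the time information is gone
```

Keeping the original Time_f on the x-axis and using the cut only for grouping, as in the answer above, avoids this loss.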

Resources