Split a row into two when a date range spans a change in calendar year - r

I am trying to figure out how to add a row when a date range spans a calendar year. Below is a minimal reprex:
I have a data frame like this:
have <- data.frame(
  from = c(as.Date('2018-12-15'), as.Date('2019-12-20'), as.Date('2019-05-13')),
  to = c(as.Date('2019-06-20'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
have
#>         from         to
#> 1 2018-12-15 2019-06-20
#> 2 2019-12-20 2020-01-25
#> 3 2019-05-13 2019-09-10
I want a data.frame that splits a row into two when from and to span a calendar-year boundary.
want <- data.frame(
  from = c(as.Date('2018-12-15'), as.Date('2019-01-01'), as.Date('2019-12-20'), as.Date('2020-01-01'), as.Date('2019-05-13')),
  to = c(as.Date('2018-12-31'), as.Date('2019-06-20'), as.Date('2019-12-31'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
want
#>         from         to
#> 1 2018-12-15 2018-12-31
#> 2 2019-01-01 2019-06-20
#> 3 2019-12-20 2019-12-31
#> 4 2020-01-01 2020-01-25
#> 5 2019-05-13 2019-09-10
I want to do this because, for a particular row, I want to know how many days fall in each year.
want$time_diff_by_year <- difftime(want$to, want$from)
Created on 2020-05-15 by the reprex package (v0.3.0)
Any base R or tidyverse solutions would be much appreciated.

You can determine the additional years needed for each date interval with map2(), then unnest() to create an additional row per year.
Then you can intersect each interval with the matching full calendar year; this trims the partial years so they start on Jan 1 or end on Dec 31 of their year.
library(tidyverse)
library(lubridate)
have %>%
  mutate(date_int = interval(from, to),
         year = map2(year(from), year(to), seq)) %>%
  unnest(year) %>%
  mutate(year_int = interval(as.Date(paste0(year, '-01-01')),
                             as.Date(paste0(year, '-12-31'))),
         year_sect = intersect(date_int, year_int),
         from_new = as.Date(int_start(year_sect)),
         to_new = as.Date(int_end(year_sect))) %>%
  select(from_new, to_new)
Output
# A tibble: 5 x 2
  from_new   to_new
  <date>     <date>
1 2018-12-15 2018-12-31
2 2019-01-01 2019-06-20
3 2019-12-20 2019-12-31
4 2020-01-01 2020-01-25
5 2019-05-13 2019-09-10
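Since the question also asks for base R, here is a minimal base R sketch of the same idea (the helper name split_years is mine): expand each row across the years it touches, then clamp each piece to that year's Jan 1 and Dec 31.
# Minimal base R sketch: one output row per calendar year touched by an input row
split_years <- function(from, to) {
  yrs <- seq(as.integer(format(from, "%Y")), as.integer(format(to, "%Y")))
  data.frame(
    from = pmax(from, as.Date(paste0(yrs, "-01-01"))),  # clamp start to Jan 1
    to   = pmin(to,   as.Date(paste0(yrs, "-12-31")))   # clamp end to Dec 31
  )
}
do.call(rbind, lapply(seq_len(nrow(have)),
                      function(i) split_years(have$from[i], have$to[i])))
From there, the per-year day counts follow from difftime(to, from, units = 'days'), as in the question.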

Related

Is there a way to group data according to time in R?

I'm working with trip ticket data and it includes a column with dates and times. I want to group trips into Morning (05:00-10:59), Lunch (11:00-12:59), Afternoon (13:00-17:59), Evening (18:00-23:59), and Dawn/Graveyard (00:00-04:59), and then count the number of trips (by counting the unique values in the trip_id column) for each of those categories.
Only I don't know how to group/summarize according to time values. Is this possible in R?
           trip_id          start_time            end_time day_of_week
1 CFA86D4455AA1030 2021-03-16 08:32:30 2021-03-16 08:36:34     Tuesday
2 30D9DC61227D1AF3 2021-03-28 01:26:28 2021-03-28 01:36:55      Sunday
3 846D87A15682A284 2021-03-11 21:17:29 2021-03-11 21:33:53    Thursday
4 994D05AA75A168F2 2021-03-11 13:26:42 2021-03-11 13:55:41    Thursday
5 DF7464FBE92D8308 2021-03-21 09:09:37 2021-03-21 09:27:33      Sunday
Here's a solution with hour() and case_when().
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
trip <- tibble(start_time = mdy_hm("1/1/2022 1:00") + minutes(seq(0, 700, 15)))
trip <- trip %>%
  mutate(
    hr = hour(start_time),
    time_of_day = case_when(
      hr >= 5 & hr < 11 ~ "morning",
      hr >= 11 & hr < 13 ~ "afternoon",
      TRUE ~ "fill in the rest yourself :)"
    )
  )
print(trip)
#> # A tibble: 47 x 3
#>    start_time             hr time_of_day
#>    <dttm>              <int> <chr>
#>  1 2022-01-01 01:00:00     1 fill in the rest yourself :)
#>  2 2022-01-01 01:15:00     1 fill in the rest yourself :)
#>  3 2022-01-01 01:30:00     1 fill in the rest yourself :)
#>  4 2022-01-01 01:45:00     1 fill in the rest yourself :)
#>  5 2022-01-01 02:00:00     2 fill in the rest yourself :)
#>  6 2022-01-01 02:15:00     2 fill in the rest yourself :)
#>  7 2022-01-01 02:30:00     2 fill in the rest yourself :)
#>  8 2022-01-01 02:45:00     2 fill in the rest yourself :)
#>  9 2022-01-01 03:00:00     3 fill in the rest yourself :)
#> 10 2022-01-01 03:15:00     3 fill in the rest yourself :)
#> # ... with 37 more rows
trips <- trip %>%
  count(time_of_day)
print(trips)
#> # A tibble: 3 x 2
#>   time_of_day                      n
#>   <chr>                        <int>
#> 1 afternoon                        7
#> 2 fill in the rest yourself :)    16
#> 3 morning                         24
Created on 2022-03-21 by the reprex package (v2.0.1)
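If you want all five categories filled in, a sketch following the question's own period definitions might look like this (the label strings are mine):
trip <- trip %>%
  mutate(
    time_of_day = case_when(
      hr >= 5  & hr < 11 ~ "morning",          # 05:00-10:59
      hr >= 11 & hr < 13 ~ "lunch",            # 11:00-12:59
      hr >= 13 & hr < 18 ~ "afternoon",        # 13:00-17:59
      hr >= 18           ~ "evening",          # 18:00-23:59
      TRUE               ~ "dawn/graveyard"    # 00:00-04:59
    )
  )
On your real data, counting unique trips per category as the question asks would then be trip %>% group_by(time_of_day) %>% summarise(n = n_distinct(trip_id)).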

Sum time across different continuous time events across date and time combinations in R

I am having trouble figuring out how to account for and sum continuous time observations across multiple dates and time events in my dataset. A similar question is found here, but it only accounts for one instance of a continuous time event. I have a dataset with multiple date and time combinations. Here is an example from that dataset, which I am manipulating in R:
date.1 <- c("2021-07-21", "2021-07-21", "2021-07-21", "2021-07-29", "2021-07-29", "2021-07-30", "2021-08-01","2021-08-01","2021-08-01")
time.1 <- c("15:57:59", "15:58:00", "15:58:01", "15:46:10", "15:46:13", "18:12:10", "18:12:10","18:12:11","18:12:13")
df <- data.frame(date.1, time.1)
df
      date.1   time.1
1 2021-07-21 15:57:59
2 2021-07-21 15:58:00
3 2021-07-21 15:58:01
4 2021-07-29 15:46:10
5 2021-07-29 15:46:13
6 2021-07-30 18:12:10
7 2021-08-01 18:12:10
8 2021-08-01 18:12:11
9 2021-08-01 18:12:13
I tried the following script from the link above:
df$missingflag <- c(1, diff(as.POSIXct(df$time.1, format="%H:%M:%S", tz="UTC"))) > 1
df
      date.1   time.1 missingflag
1 2021-07-21 15:57:59       FALSE
2 2021-07-21 15:58:00        TRUE
3 2021-07-21 15:58:01       FALSE
4 2021-07-29 15:46:10       FALSE
5 2021-07-29 15:46:13        TRUE
6 2021-07-30 18:12:10        TRUE
7 2021-08-01 18:12:10       FALSE
8 2021-08-01 18:12:11       FALSE
9 2021-08-01 18:12:13        TRUE
But it did not work as anticipated and did not get me closer to my answer; at best it would have been an intermediate step.
The GOAL is to account for all the continuous time observations and put them into a new table like this:
      date.1   time.1 secs
1 2021-07-21 15:57:59    3
4 2021-07-29 15:46:10    1
5 2021-07-29 15:46:13    1
6 2021-07-30 18:12:10    1
7 2021-08-01 18:12:10    2
9 2021-08-01 18:12:13    1
You will see that the start time of each continuous observation is recorded, along with the total number of seconds (secs) observed from that start. The script should account for date.1, as there are multiple dates in the dataset.
Thank you in advance.
You can create a datetime object by combining the date and time columns, take the difference of consecutive values, and create groups where all times one second apart belong to the same group. For each group, count the number of rows and take its first datetime value.
library(dplyr)
library(tidyr)
df %>%
  unite(datetime, date.1, time.1, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime)) %>%
  group_by(grp = cumsum(difftime(datetime,
                                 lag(datetime, default = first(datetime)),
                                 units = 'secs') > 1)) %>%
  summarise(datetime = first(datetime),
            secs = n(), .groups = 'drop') %>%
  select(-grp)
#   datetime             secs
#   <dttm>              <int>
# 1 2021-07-21 15:57:59     3
# 2 2021-07-29 15:46:10     1
# 3 2021-07-29 15:46:13     1
# 4 2021-07-30 18:12:10     1
# 5 2021-08-01 18:12:10     2
# 6 2021-08-01 18:12:13     1
I have kept datetime as a single combined column here, but if needed you can separate it again into two columns using
%>% separate(datetime, c('date', 'time'), sep = ' ')
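For reference, a base R sketch of the same grouping idea (assuming, as in the example, that rows are already sorted by date and time):
dt  <- as.POSIXct(paste(df$date.1, df$time.1), tz = "UTC")
grp <- cumsum(c(TRUE, diff(as.numeric(dt)) > 1))  # new group when gap > 1 second
data.frame(datetime = dt[!duplicated(grp)],       # first timestamp of each group
           secs     = as.integer(table(grp)))     # number of rows in each group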

Is there a quick way of extracting data over several years corresponding to today's date

I have a dataframe with a date column spanning 2014-01-01 to today (2021-04-29), plus other columns of associated data.
What I would like to do is filter for the current day and month across all years, so that data for 04-29 is returned for every year from 2014 to 2021.
What would be the most efficient / tidiest way of doing this?
Do something like this:
set.seed(2)
df <- data.frame(d = as.Date('2015-01-01') + sample(1:3000, 1000),
                 o = runif(1000))
head(df)
#>            d         o
#> 1 2017-09-02 0.1481156
#> 2 2016-12-11 0.3957168
#> 3 2022-09-23 0.2654405
#> 4 2016-02-21 0.3482240
#> 5 2016-01-28 0.6943241
#> 6 2015-10-01 0.5069469
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df[month(df$d) == month(Sys.Date()) & day(df$d) == day(Sys.Date()), ]
#>              d         o
#> 24  2016-04-29 0.2431883
#> 131 2017-04-29 0.9359659
#> 383 2022-04-29 0.2703415
Created on 2021-04-29 by the reprex package (v2.0.0)
The dplyr method is similar:
df %>% filter(month(d) == month(Sys.Date()) & day(d) == day(Sys.Date()))
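A lubridate-free base R alternative is to compare month-day strings directly:
df[format(df$d, "%m-%d") == format(Sys.Date(), "%m-%d"), ]
(Note that on Feb 29 this matches only leap years.)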

Grouping data over time interval

I have a large dataset of sales information from multiple stores over a period of a few weeks. I need to calculate revenues and average prices over fixed minute intervals, and I can't figure out a smart way to do this. For example, for the data below I'd want to calculate the revenues and average prices over 10-minute periods: the 10-minute period on 2019-02-11 from 09:10:00 to 09:20:00 would give a revenue of 2 * 14 + 5 * 9. I've considered labeling each interval with a number and adding a column with the labels, but I don't really know how to implement this. Another option I thought of was to create a separate dataframe with the intervals and then somehow map information from the original data to the interval dataframe, but I didn't get far with this either. Any help on this would be much appreciated!
Example data:
Time                 Quantity  Price
2019-02-11 09:15:23  2         14
2019-02-11 09:18:01  5         9
2019-02-11 10:15:23  1         12
2019-02-11 09:28:01  5         9
library(tidyverse)
library(lubridate)
df <- read.table(textConnection("time;quantity;unit_price
2019-02-11 09:15:23;2;14
2019-02-11 09:18:01;5;9
2019-02-11 10:15:23;1;12
2019-02-11 09:28:01;5;9"),
sep = ";",
header = TRUE)
df1 <- df %>%
  mutate(
    time = lubridate::ymd_hms(time),
    time_10min = floor_date(time, "hour") + minutes(minute(time) %/% 10 * 10)
  )
df1
#>                  time quantity unit_price          time_10min
#> 1 2019-02-11 09:15:23        2         14 2019-02-11 09:10:00
#> 2 2019-02-11 09:18:01        5          9 2019-02-11 09:10:00
#> 3 2019-02-11 10:15:23        1         12 2019-02-11 10:10:00
#> 4 2019-02-11 09:28:01        5          9 2019-02-11 09:20:00
df1 %>%
  group_by(time_10min) %>%
  summarise(avg_price = mean(unit_price),
            revenue = sum(quantity * unit_price))
#> # A tibble: 3 x 3
#>   time_10min          avg_price revenue
#>   <dttm>                  <dbl>   <int>
#> 1 2019-02-11 09:10:00      11.5      73
#> 2 2019-02-11 09:20:00       9        45
#> 3 2019-02-11 10:10:00      12        12
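As an aside, lubridate's floor_date() also supports rounding to a multiple of a unit, so the binning step can likely be written more directly (same 10-minute bins, assuming a reasonably recent lubridate version):
df1 <- df %>%
  mutate(
    time = lubridate::ymd_hms(time),
    time_10min = floor_date(time, "10 minutes")
  )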

Is x between two dates?

I have another question in the same project scope (see: pandas dataframe groupby datetime month); however, I fear the data structure might be too complicated, so I am trying an alternative approach. I am hoping this achieves the same result.
I am ideally looking to build a matrix of phone numbers as rows and start and end dates as columns and identify the period in which a telephone call was made.
This will be achieved by transforming a dataset of dates and phone numbers to a complete list of dates, identifying an end day match, and then seeing if the date the telephone call was made falls within that period.
The original data looks like:
Date = as.Date(c("2019-03-01", "2019-03-15","2019-03-29", "2019-04-10","2019-03-05","2019-03-20"))
Phone = c("070000001","070000001","070000001","070000001","070000002","070000002")
df <- data.frame(Date, Phone)
df
##         Date     Phone
## 1 2019-03-01 070000001
## 2 2019-03-15 070000001
## 3 2019-03-29 070000001
## 4 2019-04-10 070000001
## 5 2019-03-05 070000002
## 6 2019-03-20 070000002
Ideally I would want it to look like this:
## Date Phone INT_1 INT_2 INT_3 INT_4 INT_5
## 1 2019-03-01 070000001 X X X X X
## 2 2019-03-15 070000002 X X X
Where each INT column is a date range (a start date plus 30 days) and X indicates that the telephone number appeared in that rolling period.
To do this I assume you need two datasets: the one above, of telephone numbers by date called, and a second which is the complete list of days and their +30 day counterparts.
dates <- as.data.frame(seq(as.Date("2016/7/1"), as.Date("2019/7/1"), "days"),
                       responseName = c('start'))
dates$end <- dates$start + 30
## INT start end
## 1 2016-07-01 2016-07-31
## 2 2016-07-02 2016-08-01
## 3 2016-07-03 2016-08-02
## 4 2016-07-04 2016-08-03
But how do I get the two to evaluate together? I am assuming some kind of merge and expand of the telephone data into the date list then spread the dates by the row index/ INT?
I think that to match the two dataframes you could use a fuzzyjoin. For example, if I define a dataframe of phone numbers and usage dates as:
library(dplyr)
library(fuzzyjoin)
fake_phone_data <- tibble(
  date = as.Date(c("2019-01-03", "2019-01-27", "2019-02-12", "2019-02-25", "2019-02-26")),
  phone = c("1", "1", "2", "2", "2")
)
and a dataframe of starting/ending dates (plus an ID column) as:
id_dates <- tibble(
  ID = c("1", "2", "3", "4"),
  starting_date = as.Date(c("2019-01-01", "2019-01-16", "2019-02-01", "2019-02-16")),
  ending_date = as.Date(c("2019-01-15", "2019-01-31", "2019-02-15", "2019-02-27"))
)
then I can join the two dataframes using a fuzzyjoin, i.e. two rows are matched if the date of the phone call happens between the starting date and the end date of the corresponding period:
fuzzy_left_join(
  fake_phone_data,
  id_dates,
  by = c(
    "date" = "starting_date",
    "date" = "ending_date"
  ),
  match_fun = list(`>=`, `<`)
)
#> # A tibble: 5 x 5
#>   date       phone ID    starting_date ending_date
#>   <date>     <chr> <chr> <date>        <date>
#> 1 2019-01-03 1     1     2019-01-01    2019-01-15
#> 2 2019-01-27 1     2     2019-01-16    2019-01-31
#> 3 2019-02-12 2     3     2019-02-01    2019-02-15
#> 4 2019-02-25 2     4     2019-02-16    2019-02-27
#> 5 2019-02-26 2     4     2019-02-16    2019-02-27
Created on 2019-07-19 by the reprex package (v0.3.0)
Does it solve your problem?
This approach is very similar to this question.
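If you then want the wide "X" matrix from the question, one possible sketch (the INT_ column names here are derived from the ID column above) is to flag each match and spread it with pivot_wider():
library(tidyr)
fuzzy_left_join(
  fake_phone_data, id_dates,
  by = c("date" = "starting_date", "date" = "ending_date"),
  match_fun = list(`>=`, `<`)
) %>%
  transmute(phone, col = paste0("INT_", ID), flag = "X") %>%  # one flag per match
  distinct() %>%                                              # drop repeat calls in a period
  pivot_wider(names_from = col, values_from = flag)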
