Calculate number of pending tasks at given time points (ideally with dplyr)

I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
  eventid = sample(1:100, 25, replace = TRUE),
  start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by = "day"), 25),
  completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by = "day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!

You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from the start month to the end month (determined with floor_date). The result is nested for each row of data (since one row can span multiple months), so unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones), using substr to extract the year and month only (no day): the first through seventh characters of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed so the data below is reproducible.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
  mutate(month = map2(floor_date(start_date, "month"),
                      floor_date(completed_date, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  transmute(month_year = substr(month, 1, 7)) %>%
  group_by(month_year) %>%
  summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
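If you would rather keep the first-of-month Date column from your expected output (instead of a "YYYY-MM" string), a small variation is to count the month column directly. A sketch, assuming dplyr >= 0.8 for count()'s name argument:
dataset %>%
  mutate(month = map2(floor_date(start_date, "month"),
                      floor_date(completed_date, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  count(month, name = "count")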
If you want to exclude the completed month (except when the start month and completed month are the same, if that can occur), you can subtract one day from the floored completion date: that lands on the last day of the previous month, so the monthly sequence stops one month earlier. Wrapping it in pmax ensures that when the start and completed months are the same, that month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
                    pmax(floor_date(completed_date, "month") - 1,
                         floor_date(start_date, "month")),
                    seq.Date,
                    by = "month"))
Data
set.seed(123)
dataset <- tibble(
  eventid = sample(1:100, 25, replace = TRUE),
  start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by = "day"), 25),
  completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by = "day"), 25)
)

Related

How to select the earliest date in a month from a Date series in R?

I have a database containing the values of different indices at different frequencies (weekly, monthly, daily). I want to calculate monthly returns by extracting the beginning-of-month value from the time series.
I have tried using a loop to partition the time series month by month and then min() to get the earliest date in each month. However, I am wondering whether there is a more efficient way to speed up the calculation.
library(data.table)
df<-fread("statistic_date index_value funds_number
2013-1-1 1000.000 0
2013-1-4 996.096 21
2013-1-11 1011.141 21
2013-1-18 1057.344 21
2013-1-25 1073.376 21
2013-2-1 1150.479 22
2013-2-8 1150.288 19
2013-2-22 1112.993 18
2013-3-1 1148.826 20
2013-3-8 1093.515 18
2013-3-15 1092.352 17
2013-3-22 1138.346 18
2013-3-29 1107.440 17
2013-4-3 1101.897 17
2013-4-12 1093.344 17")
I expect to filter to get the rows of the earliest date of each month, such as:
2013-1-1 1000.000 0
2013-2-1 1150.479 22
2013-3-1 1148.826 20
2013-4-3 1101.897 17
Your help will be much appreciated!
Using the tidyverse and lubridate packages,
library(lubridate)
library(tidyverse)
df %>%
  mutate(statistic_date = ymd(statistic_date), # convert statistic_date to date format
         month = month(statistic_date),        # create month and year columns
         year = year(statistic_date)) %>%
  group_by(month, year) %>%      # group by month and year
  arrange(statistic_date) %>%    # make sure the df is sorted by date
  filter(row_number() == 1)      # select first row within each group
# A tibble: 4 x 5
# Groups: month, year [4]
# statistic_date index_value funds_number month year
# <date> <dbl> <int> <dbl> <dbl>
#1 2013-01-01 1000 0 1 2013
#2 2013-02-01 1150. 22 2 2013
#3 2013-03-01 1149. 20 3 2013
#4 2013-04-03 1102. 17 4 2013
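A slightly more compact variant replaces the arrange() plus filter(row_number() == 1) pair with slice_min(), which keeps the row with the smallest value per group. A sketch, assuming dplyr >= 1.0 where slice_min() was introduced:
df %>%
  mutate(statistic_date = ymd(statistic_date)) %>%
  group_by(year = year(statistic_date), month = month(statistic_date)) %>%
  slice_min(statistic_date, n = 1)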
First make statistic_date a Date:
df$statistic_date <- as.Date(df$statistic_date)
Then you can use nth_day to find the first day of every month in statistic_date.
library("datetimeutils")
dates <- nth_day(df$statistic_date, period = "month", n = "first")
## [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-03"
df[statistic_date %in% dates] # data.table syntax: df is a data.table because it was read with fread()
## statistic_date index_value funds_number
## 1: 2013-01-01 1000.000 0
## 2: 2013-02-01 1150.479 22
## 3: 2013-03-01 1148.826 20
## 4: 2013-04-03 1101.897 17

How to assign day of year values starting from an arbitrary date and take care of missing values?

I have an R dataframe df_demand with a date column (depdate) and a dependent variable column bookings. The duration is 365 days starting from 2017-11-02 and ending at 2018-11-01, sorted in ascending order.
We have booking data for only 279 days in the year.
dplyr::arrange(df_demand, depdate)
depdate bookings
1 2017-11-02 43
2 2017-11-03 27
3 2017-11-05 27
4 2017-11-06 22
5 2017-11-07 39
6 2017-11-08 48
.
.
279 2018-11-01 60
I want to introduce another column day_of_year in the following way:
depdate day_of_year bookings
1 2017-11-02 1 43
2 2017-11-03 2 27
3 2017-11-04 3 NA
4 2017-11-05 4 27
.
.
.
365 2018-11-01 365 60
I am trying to find the best possible way to do this.
In Python, I could use something like :
df_demand['day_of_year'] = df_demand['depdate'].sub(df_demand['depdate'].iat[0]).dt.days + 1
I wanted to know about an R equivalent of the same.
When I run
typeof(df_demand_2$depdate)
the output is
"double"
Am I missing something?
You can create a row for every date using the complete function from the tidyr package.
First, I'm creating a data frame with some sample data:
df <- data.frame(
  depdate = as.Date(c('2017-11-02', '2017-11-03', '2017-11-05')),
  bookings = c(43, 27, 27)
)
Next, I'm performing two operations. First, using tidyr::complete, I'm specifying all the dates I want in my analysis. I can do that using seq.Date, creating a sequence from the first to the last day.
Once that is done, the day_of_year column is simply equal to the row number.
df_complete <- tidyr::complete(df,
  depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)
)
df_complete$day_of_year <- 1:nrow(df_complete)
> df_complete
#> # A tibble: 4 x 3
#> depdate bookings day_of_year
#> <date> <dbl> <int>
#> 1 2017-11-02 43 1
#> 2 2017-11-03 27 2
#> 3 2017-11-04 NA 3
#> 4 2017-11-05 27 4
An equivalent solution with the pipe operator from dplyr:
df %>%
  complete(depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)) %>%
  mutate(day_of_year = row_number())
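For a literal translation of the Python idiom, subtracting the first date from the date column gives the day offsets directly; note that, unlike complete(), this does not insert rows for the missing dates. A sketch:
df_demand$day_of_year <- as.integer(df_demand$depdate - df_demand$depdate[1]) + 1
As for typeof() returning "double": that is expected, since Date objects are stored as doubles counting days since 1970-01-01. Use class(df_demand_2$depdate) to confirm the column is a Date.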

R - How to add the sum of a specific occurrence in one column to another column

Using R.
This is a small subset of my dataset, simplified to only show the relevant columns. The data is taken from Capital Bikeshare. The Start.date column below has the exact rental time for each bike.
Start.date Member.type
2018-11-01 00:00:45 Member
2018-11-01 00:00:52 Casual
2018-11-01 00:01:46 Member
2018-11-01 01:00:02 Casual
2018-11-01 01:03:36 Member
What I'm trying to do is group all of the data by date and hour of day, with the number of each member type and the total number of rentals (Casual + Member) for any given hour of any given day. So, in the end, I'll just have "Day - Hour - Number of rentals per member type" so I can predict trends by hour of the day.
Here is my relevant code:
library(dplyr)
library(lubridate) # needed for wday(), and for ymd_hms() and hour() below
bikeData <- read.csv("2011data.csv")
bikeData <- bikeData %>%
  mutate(Hour = format(strptime(Start.date, "%Y-%m-%d %H:%M:%S"), "%m-%d %H")) %>%
  mutate(day = wday(Start.date, label = TRUE))
groupData <- bikeData %>%
  mutate(Start.date = ymd_hms(Start.date)) %>%
  count(date1 = as.Date(Start.date), Hour1 = hour(Start.date),
        member = Member.type) %>%
  group_by(date1, Hour1) %>%
  arrange(date1, Hour1) %>%
  summarise(total = sum(n))
What this gives me is the following new dataset, groupData
date1 Hour1 total
2018-11-01 0 82
2018-11-01 1 43
2018-11-01 2 17
2018-11-01 3 4
2018-11-02 0 5
2018-11-02 1 24
So I was able to get the total number of Member + Casual rentals for all 24 hours of each day of my dataset, but how do I get another two columns, one showing the total number of Casual rentals and another showing the total number of Member rentals? Thanks!
Desired below:
date1 Hour1 total Casual Member
2018-11-01 0 82 40 42
2018-11-01 1 43 20 23
2018-11-01 2 17 10 7
2018-11-01 3 4 1 3
2018-11-02 0 5 1 4
2018-11-02 1 24 20 4
groupData <- bikeData %>%
  mutate(Start.date = ymd_hms(Start.date)) %>%
  count(date1 = as.Date(Start.date), Hour1 = hour(Start.date),
        member = Member.type) %>%
  group_by(date1, Hour1) %>%
  arrange(date1, Hour1) %>%
  summarise(total = sum(n),
            Member = sum(n[member == "Member"]),
            Casual = sum(n[member == "Casual"]))
You can simply add two variables to the summarise call that pick out the counts for each of the member types. Note that count() has already renamed Member.type to member and tallied the rows, so we sum the n values where member matches each type rather than testing Member.type itself (that column no longer exists at that point).
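As a quick check, here is that pipeline run on the five sample rows from the question (a sketch, with the timestamps typed in from the post):
library(dplyr)
library(lubridate)
bikeData <- tibble(
  Start.date = c("2018-11-01 00:00:45", "2018-11-01 00:00:52",
                 "2018-11-01 00:01:46", "2018-11-01 01:00:02",
                 "2018-11-01 01:03:36"),
  Member.type = c("Member", "Casual", "Member", "Casual", "Member")
)
# Running the pipeline on this sample gives:
# date1      Hour1 total Member Casual
# 2018-11-01     0     3      2      1
# 2018-11-01     1     2      1      1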

Count number of rows for each row that meet a logical condition

So I have some data with a time stamp, and for each row, I want to count the number of rows that fall within a certain time window. For example, if I have the data below with a time stamp in h:mm (column ts), I want to count the number of rows that occur from that time stamp to five minutes in the past (column count). The first n rows that are less than five minutes from the first data point should be NAs.
ts data count
1:01 123 NA
1:02 123 NA
1:03 123 NA
1:04 123 NA
1:06 123 5
1:07 123 5
1:10 123 3
1:11 123 4
1:12 123 4
This is straightforward to do with a for loop, but I've been trying to implement it with the apply() family and haven't had any success. Any suggestions?
EDIT: modified to account for the potential for multiple readings per minute, raised in a comment.
Data with new mid-minute reading:
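A data frame matching the question's readings plus the new 1:06:30 observation (reconstructed here as an assumption; it reproduces the output below):
df <- data.frame(
  ts = c("1:01:00", "1:02:00", "1:03:00", "1:04:00", "1:06:00",
         "1:06:30", "1:07:00", "1:10:00", "1:11:00", "1:12:00"),
  data = 123
)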
library(dplyr)
library(tidyr) # for replace_na()
df %>%
  # Take the text above and convert to datetime
  mutate(ts = lubridate::ymd_hms(paste(Sys.Date(), ts))) %>%
  # Count how many observations per minute
  group_by(ts_min = lubridate::floor_date(ts, "1 minute")) %>%
  summarize(obs_per_min = sum(!is.na(data))) %>%
  # Add rows for any missing minutes, count as zero observations
  padr::pad(interval = "1 min") %>%
  replace_na(list(obs_per_min = 0)) %>%
  # Count cumulative observations, and calc how many in window that
  # begins 5 minutes ago and ends at end of current minute
  mutate(cuml_count = cumsum(obs_per_min),
         prior_cuml = lag(cuml_count) %>% replace_na(0),
         in_window = cuml_count - lag(prior_cuml, 5)) %>%
  # Exclude unneeded columns and rows
  select(-cuml_count, -prior_cuml) %>%
  filter(obs_per_min > 0)
Output (now reflects the additional reading at 1:06:30):
# A tibble: 9 x 3
ts_min obs_per_min in_window
<dttm> <dbl> <dbl>
1 2018-09-26 01:01:00 1 NA
2 2018-09-26 01:02:00 1 NA
3 2018-09-26 01:03:00 1 NA
4 2018-09-26 01:04:00 1 NA
5 2018-09-26 01:06:00 2 6
6 2018-09-26 01:07:00 1 6
7 2018-09-26 01:10:00 1 4
8 2018-09-26 01:11:00 1 5
9 2018-09-26 01:12:00 1 4
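Since the question mentioned the apply() family, here is a base-R sketch of the same idea that works on the raw timestamps rather than minute bins, counting rows in the closed window from five minutes before each reading up to the reading itself (it reproduces the counts from the question's table):
ts <- lubridate::ymd_hms(paste(Sys.Date(),
                               c("1:01:00", "1:02:00", "1:03:00", "1:04:00",
                                 "1:06:00", "1:07:00", "1:10:00", "1:11:00", "1:12:00")))
# for each reading i, count readings within [ts[i] - 5 min, ts[i]]
counts <- sapply(seq_along(ts), function(i) sum(ts >= ts[i] - 5 * 60 & ts <= ts[i]))
# readings less than five minutes after the first one have no full window
counts[difftime(ts, min(ts), units = "mins") < 5] <- NA
counts
## [1] NA NA NA NA  5  5  3  4  4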

"split" dataframe per month based on columns Start/End

I need to "split" a 15 million line df of the following form:
library(lubridate)
dateStart <- c(lubridate::ymd("2010-01-01"))
dateEnd <- c(lubridate::ymd("2010-03-06"))
length <- c(65)
Amt <- c(348.80)
df1 <- data.frame(dateStart, dateEnd, length, Amt)
df1
# dateStart dateEnd length Amt
# 1 2010-01-01 2010-03-06 65 348.8
into something like:
dateStart dateEnd length Amt
1 2010-01-01 2010-01-31 31 166.35
2 2010-02-01 2010-02-28 28 150.55
3 2010-03-01 2010-03-06 6 32.19
Where length is the number of days and Amt is the pro-rata amount for the number of days. Does anybody know how to do this? Someone mentioned the padr package to me but I do not know how to use it for this specific purpose.
Thank you in advance
I'm going to assume you have some sort of unique id field in your data set, so that each row is a unique record; otherwise this is not going to work. I also added one extra record so we can see that everything works on multiple records.
Data:
library(lubridate)
id <- c(1:2) # added id field needed for unique record and needed for grouping
dateStart <- c(lubridate::ymd("2010-01-01", "2011-01-09"))
dateEnd <- c(lubridate::ymd("2010-03-06", "2011-04-09"))
length <- c(65, 91)
Amt <- c(348.80, 468.70)
df1 <- data.frame(id , dateStart, dateEnd, length, Amt)
First create a data.frame which has the id and the missing months. We need dplyr, tidyr and padr. Group by the unique id, then gather the dates so the start and end dates are in one column. For padr to extend by month we first need to thicken the data.frame; then drop the columns we no longer need and fill in the missing months with pad.
library(dplyr)
library(tidyr)
library(padr)
# create last_day function for later use
last_day <- function(date) {
  ceiling_date(date, "month") - days(1)
}
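A quick check of the helper:
last_day(as.Date("2010-02-10"))
## [1] "2010-02-28"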
dates <- df1 %>%
  select(id, dateStart, dateEnd) %>%
  group_by(id) %>%
  gather(names, dates, -id) %>%
  arrange(id, dates) %>%
  thicken(interval = "month") %>% # need to thicken first for month interval
  select(-c(names, dates)) %>%
  pad(interval = "month")
dates
# A tibble: 7 x 2
# Groups: id [2]
id dates_month
<int> <date>
1 1 2010-01-01
2 1 2010-02-01
3 1 2010-03-01
4 2 2011-01-01
5 2 2011-02-01
6 2 2011-03-01
7 2 2011-04-01
Next, join the expanded dates back to the original data.frame:
df_extended <- inner_join(dates, df1, by = "id")
df_extended
# A tibble: 7 x 6
# Groups: id [2]
id dates_month dateStart dateEnd length Amt
<int> <date> <date> <date> <dbl> <dbl>
1 1 2010-01-01 2010-01-01 2010-03-06 65 349.
2 1 2010-02-01 2010-01-01 2010-03-06 65 349.
3 1 2010-03-01 2010-01-01 2010-03-06 65 349.
4 2 2011-01-01 2011-01-09 2011-04-09 91 469.
5 2 2011-02-01 2011-01-09 2011-04-09 91 469.
6 2 2011-03-01 2011-01-09 2011-04-09 91 469.
7 2 2011-04-01 2011-01-09 2011-04-09 91 469.
Now to get to the end result we need case_when; ifelse strips the Date class from its result, so it doesn't work here. The case_when calls set the correct start and end dates (I assume you need the exact start date, not the first of the month; otherwise adjust the code to use dates_month instead). I create an amount-per-day variable (amt_pd) that can be multiplied by the number of days in the month to get the pro-rata amount for that month.
df_end <- df_extended %>%
  mutate(dateEnd = case_when(last_day(dates_month) <= dateEnd ~ last_day(dates_month),
                             TRUE ~ dateEnd),
         dateStart = case_when(dates_month <= dateStart ~ dateStart,
                               TRUE ~ dates_month),
         amt_pd = Amt / length,
         length = dateEnd - dateStart + 1,
         Amt = amt_pd * length) %>%
  select(-c(dates_month, amt_pd)) # drop the columns that are no longer needed
df_end
# A tibble: 7 x 5
# Groups: id [2]
id dateStart dateEnd length Amt
<int> <date> <date> <time> <time>
1 1 2010-01-01 2010-01-31 31 166.350769230769
2 1 2010-02-01 2010-02-28 28 150.252307692308
3 1 2010-03-01 2010-03-06 6 32.1969230769231
4 2 2011-01-09 2011-01-31 23 118.462637362637
5 2 2011-02-01 2011-02-28 28 144.215384615385
6 2 2011-03-01 2011-03-31 31 159.667032967033
7 2 2011-04-01 2011-04-09 9 46.354945054945
All of this could be done in one go, but if you have 15 million rows it might be better to check that the intermediate steps work. Also note that pad has a break_above option:
This is a numeric value that indicates the number of rows in millions
above which the function will break. Safety net for situations where
the interval is different than expected and padding yields a very
large dataframe, possibly overflowing memory.
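With 15 million input rows the padded result can legitimately exceed that default of 1 million, so you could raise the threshold explicitly, e.g. by replacing the pad() call above with something like the following (a sketch; pick a limit your memory can handle):
pad(interval = "month", break_above = 30) # allow up to 30 million rows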
