Calculating business days between two dataframe columns - r

I have a data frame that contains two POSIXct columns. How can I go about calculating the number of weekdays between these two columns?
df <- data.frame(
  StartDate = as.POSIXct(c("2017-05-17 12:53:00", "2017-08-31 21:16:00", "2017-08-25 13:54:00",
                           "2017-09-06 15:47:00", "2017-10-15 05:11:00"), format = "%Y-%m-%d %H:%M:%S"),
  EndDate   = as.POSIXct(c("2017-06-09 11:57:00", "2017-11-29 16:51:00", "2017-09-06 15:13:00",
                           "2018-01-03 16:22:00", "2017-11-17 11:51:00"), format = "%Y-%m-%d %H:%M:%S"))

Using dplyr:
df %>%
  dplyr::rowwise() %>%
  dplyr::mutate(wdays = sum(!weekdays(seq(StartDate, EndDate, by = "day")) %in% c("Saturday", "Sunday")))
Source: local data frame [5 x 3]
Groups: <by row>
# A tibble: 5 x 3
StartDate EndDate wdays
<dttm> <dttm> <int>
1 2017-05-17 12:53:00 2017-06-09 11:57:00 17
2 2017-08-31 21:16:00 2017-11-29 16:51:00 64
3 2017-08-25 13:54:00 2017-09-06 15:13:00 9
4 2017-09-06 15:47:00 2018-01-03 16:22:00 86
5 2017-10-15 05:11:00 2017-11-17 11:51:00 25
This makes use of the fact that dates can easily be sequenced, and that because TRUE is equal to one, we can just sum up all of the non-weekend days.
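rowwise() can be slow on large data; here is a vectorised sketch of the same idea using purrr::map2_int() (an alternative I am adding for illustration, not part of the original answer):
library(dplyr)
library(purrr)

df %>%
  mutate(wdays = map2_int(StartDate, EndDate, function(s, e) {
    # count the days in the start-to-end sequence that are not Saturday/Sunday
    sum(!weekdays(seq(s, e, by = "day")) %in% c("Saturday", "Sunday"))
  }))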

Try the bizdays package:
library(bizdays) # Load the package
## Make a calendar that excludes Saturdays and Sundays
create.calendar("Workdays",weekdays = c("saturday", "sunday"))
## Calculate difference in days using the new Workdays calendar
df$bizdays <- bizdays(df$StartDate,df$EndDate,"Workdays")
df$bizdays
[1] 17 63 8 85 24
That returns 17, 63, 8, 85, and 24 business days between the start and end dates you provided. This looks right when I checked the 8 business days between 2017-08-25 and 2017-09-06. (Note that the counts can differ by one from the sequence-based approach above, depending on whether both endpoints, and the time of day, are counted.)

Related

Time interval between several dates (days and hours)

I know a lot of questions have been asked on the same subject, but I have not found an answer to this particular one, despite trying to adapt other code to my problem.
My data frame "v1" has more than 300 thousand rows, with the variable "Date" in the following format:
Date
2015-07-27 17:35:00
2015-07-27 17:40:00
2015-07-27 17:45:00
First, I want to know whether all the "Date" values are spaced at 5-minute intervals. If not, I would like to track down where the different intervals are.
Second, I intend to create a new column showing the time difference between consecutive rows, e.g. a "time_int" column containing "00:05:00", "00:05:00", ...
Any help will be appreciated. Thank you in advance.
Here is an option to calculate the difference using lag. If you'd like, you could create another column showing hours with units = "hours".
library(tidyverse)
library(lubridate)
df <- data.frame(date = ymd_hms(c("2015-07-27 17:35:00", "2015-07-27 17:40:00",
                                  "2015-07-27 17:49:00", "2015-07-27 19:49:00")))
df %>%
  mutate(diff = date - lag(date),
         diff_minutes = as.numeric(diff, units = "mins"),
         time_int = format(.POSIXct(diff_minutes * 60, "UTC"), "%H:%M:%S")) %>%
  select(date, diff_minutes, time_int) %>%
  # Filter the data for a range of minutes
  filter(diff_minutes >= 5 & diff_minutes < 10)
# OUTPUT:
#> date diff_minutes time_int
#> 1 2015-07-27 17:40:00 5 00:05:00
#> 2 2015-07-27 17:49:00 9 00:09:00
Created on 2021-03-09 by the reprex package (v0.3.0)
Original Data
date
<S3: POSIXct>
2015-07-27 17:35:00
2015-07-27 17:40:00
2015-07-27 17:49:00
2015-07-27 19:49:00
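As the answer notes, a column in hours can be built the same way with units = "hours"; a minimal sketch using the same df as above:
library(dplyr)

df %>%
  mutate(diff = date - lag(date),
         diff_hours = as.numeric(diff, units = "hours"))  # same difference, reported in hours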
You can use rollapplyr to find the time difference between two consecutive rows, and then use which to find the rows where the time difference is not 5 minutes.
library(lubridate)
library(dplyr)
library(zoo)
dt <- read.table(text = text, header = TRUE)   # `text` is defined under "Data" below
dt <- mutate(dt, Date = ymd_hms(Date)) %>%
  mutate(Dif = rollapplyr(Date, 2, function(x) {
    difftime(x[2], x[1])
  }, fill = NA))
dt
Date Dif
1 2015-07-27 17:35:00 NA
2 2015-07-27 17:40:00 5
3 2015-07-27 17:45:00 5
4 2015-07-27 17:49:00 4
dt[which(dt$Dif != as.difftime(5, units="mins")),]
Date Dif
4 2015-07-27 17:49:00 4
Lastly, to format the times in your desired format:
dt %>% mutate(DifString=format(.POSIXct(Dif*60, tz="GMT"), "%H:%M:%S"))
Date Dif DifString
1 2015-07-27 17:35:00 NA <NA>
2 2015-07-27 17:40:00 5 00:05:00
3 2015-07-27 17:45:00 5 00:05:00
4 2015-07-27 17:49:00 4 00:04:00
Data
text="Date
'2015-07-27 17:35:00'
'2015-07-27 17:40:00'
'2015-07-27 17:45:00'
'2015-07-27 17:49:00'"
dt=read.table(text=text, header=TRUE)
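For completeness, base R's diff() gives the same consecutive differences without zoo; a sketch, assuming dt$Date has already been converted with ymd_hms() as in the answer above:
d <- diff(dt$Date)                 # difftime vector of length nrow(dt) - 1
units(d) <- "mins"                 # report the gaps in minutes
which(as.numeric(d) != 5) + 1      # rows whose gap from the previous row is not 5 minutes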

Calculate number of pending tasks at given time points (ideally with dplyr)

I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from the start to the end month (determined from floor_date). The result is nested for each row of data (since one row can span multiple months), so unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day). This is the first through seventh character of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed to reproduce from data below.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

dataset %>%
  mutate(month = map2(floor_date(start_date, "month"),
                      floor_date(completed_date, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  transmute(month_year = substr(month, 1, 7)) %>%
  group_by(month_year) %>%
  summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when the start month and completed month are the same, if that can occur), you can stop the sequence one day before the floored completed date, so that it ends in the previous month. Wrapping the end date in pmax ensures that when the start and completed months are the same, that month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
                    pmax(floor_date(completed_date, "month") - 1, floor_date(start_date, "month")),
                    seq.Date,
                    by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
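Note that the answer above counts an event in every month during which it was open, while the expected output in the question looks like a snapshot taken on the first day of each month (an event that started on 2011-01-14 is not counted until 2011-02-01). A sketch of that reading, under the assumption that an event completed exactly on a snapshot date no longer counts as pending:
library(dplyr)
library(purrr)
library(lubridate)

# one snapshot date per month across the observed range
months <- seq(floor_date(min(dataset$start_date), "month"),
              floor_date(max(dataset$completed_date), "month"),
              by = "month")

tibble(month = months) %>%
  mutate(count = map_int(month,
                         ~ sum(dataset$start_date <= .x & dataset$completed_date > .x)))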

How to select the earliest date in a month from a Date series in R?

I have a database containing the values of different indices at different frequencies (weekly, monthly, daily). I want to calculate monthly returns by extracting the beginning-of-month value from the time series.
I have tried using a loop to partition the time series month by month and then min() to get the earliest date in each month, but I am wondering whether there is a more efficient way to speed up the calculation.
library(data.table)
df<-fread("statistic_date index_value funds_number
2013-1-1 1000.000 0
2013-1-4 996.096 21
2013-1-11 1011.141 21
2013-1-18 1057.344 21
2013-1-25 1073.376 21
2013-2-1 1150.479 22
2013-2-8 1150.288 19
2013-2-22 1112.993 18
2013-3-1 1148.826 20
2013-3-8 1093.515 18
2013-3-15 1092.352 17
2013-3-22 1138.346 18
2013-3-29 1107.440 17
2013-4-3 1101.897 17
2013-4-12 1093.344 17")
I expect to filter to get the rows of the earliest date of each month, such as:
2013-1-1 1000.000 0
2013-2-1 1150.479 22
2013-3-1 1148.826 20
2013-4-3 1101.897 17
Your help will be much appreciated!
Using the tidyverse and lubridate packages,
library(lubridate)
library(tidyverse)
df %>%
  mutate(statistic_date = ymd(statistic_date),  # convert statistic_date to date format
         month = month(statistic_date),         # create month and year columns
         year = year(statistic_date)) %>%
  group_by(month, year) %>%                     # group by month and year
  arrange(statistic_date) %>%                   # make sure the df is sorted by date
  filter(row_number() == 1)                     # select first row within each group
# A tibble: 4 x 5
# Groups: month, year [4]
# statistic_date index_value funds_number month year
# <date> <dbl> <int> <dbl> <dbl>
#1 2013-01-01 1000 0 1 2013
#2 2013-02-01 1150. 22 2 2013
#3 2013-03-01 1149. 20 3 2013
#4 2013-04-03 1102. 17 4 2013
First make statistic_date a Date:
df$statistic_date <- as.Date(df$statistic_date)
Then you can use nth_day to find the first day of every month in statistic_date.
library("datetimeutils")
dates <- nth_day(df$statistic_date, period = "month", n = "first")
## [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-03"
df[statistic_date %in% dates]
## statistic_date index_value funds_number
## 1: 2013-01-01 1000.000 0
## 2: 2013-02-01 1150.479 22
## 3: 2013-03-01 1148.826 20
## 4: 2013-04-03 1101.897 17
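Since the question already reads the data with data.table::fread, a pure data.table variant is also possible; a minimal sketch (the yr/mon grouping names are just illustrative):
library(data.table)
df[, statistic_date := as.IDate(statistic_date)]          # convert the character dates
df[, .SD[which.min(statistic_date)],                      # earliest row within each group
   by = .(yr = year(statistic_date), mon = month(statistic_date))]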

Finding each time of daily max variable in climate data

I have a large dataset over many years with several variables, but the ones I am interested in are wind speed and dateTime. I want to find the time of the max wind speed for every day in the data set. I have hourly data in POSIXct format, with WS as a numeric with occasional NAs. Below is a short data set that should hopefully illustrate my point; my dateTime didn't work out to be exactly hourly, but it provides enough for a sample.
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20, length(dateTime), rep = TRUE)    # length(dateTime) is 1738, so match it
WD <- sample(0:390, length(dateTime), rep = TRUE)
Temp <- sample(0:40, length(dateTime), rep = TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I have previously tried creating a new column with just a POSIX date (minus time) to allow for day isolation; however, everything I tried only returned a shortened data frame with date and WS (aggregate, splitting, xts). Aggregate was the only one that didn't do this, but it gave me 23:00:00 as a constant time, which isn't correct.
I have looked at How to calculate daily means, medians, from weather variables data collected hourly in R?, https://stats.stackexchange.com/questions/7268/how-to-aggregate-by-minute-data-for-a-week-into-hourly-means and others but none have answered this question, or the solutions have not returned an ideal result.
I need to compare the results of this analysis with another data frame, so hence the reason I need the actual time when the max wind speed occurred for each day in the dataset. I have a feeling there is a simple solution, however, this has me frustrated.
A dplyr solution may be:
library(dplyr)
df %>%
  mutate(date = as.Date(dateTime)) %>%
  left_join(
    df %>%
      mutate(date = as.Date(dateTime)) %>%
      group_by(date) %>%
      summarise(max_ws = max(WS, na.rm = TRUE)) %>%
      ungroup(),
    by = "date"
  ) %>%
  select(-date)
# dateTime WS WD Temp max_ws
# 1 2011-01-01 00:00:00 NA 313 2 15
# 2 2011-01-01 00:24:00 7 376 1 15
# 3 2011-01-01 00:48:00 3 28 28 15
# 4 2011-01-01 01:12:00 15 262 24 15
# 5 2011-01-01 01:36:00 1 149 34 15
# 6 2011-01-01 02:00:00 4 319 33 15
# 7 2011-01-01 02:24:00 15 280 22 15
# 8 2011-01-01 02:48:00 NA 110 23 15
# 9 2011-01-01 03:12:00 12 93 15 15
# 10 2011-01-01 03:36:00 3 5 0 15
Dee asked for: "I want to find the time of the max wind speed for every day in the data set." Other answers have calculated the max(WS) for every day, but not at which hour that occurred.
So I propose the following solution with dplyr:
library(dplyr)
library(lubridate)  # for hour()
set.seed(12345)
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1738,rep=TRUE)
WD <- sample(0:390,1738,rep=TRUE)
Temp <- sample(0:40,1738,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
df %>%
  group_by(Date = as.Date(dateTime)) %>%
  mutate(Hour = hour(dateTime),
         Hour_with_max_ws = Hour[which.max(WS)])
I want to point out that if several hours share the same maximal wind speed (in the example below: 15), only the first hour with max(WS) will be shown as the result, although the wind speed of 15 was reached on that date at hours 0, 3, 4, 21 and 22! So you might need more specific logic.
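If those ties matter, one sketch is to keep every row that reaches the daily maximum instead of only the first one (days where WS is entirely NA would need extra handling):
library(dplyr)
library(lubridate)

df %>%
  group_by(Date = as.Date(dateTime)) %>%
  filter(WS == max(WS, na.rm = TRUE)) %>%   # keep all rows tied for the daily maximum
  mutate(Hour = hour(dateTime)) %>%
  ungroup()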
For the sake of completeness (and because I like the concise code) here is a "one-liner" using data.table:
library(data.table)
setDT(df)[, max.ws := max(WS, na.rm = TRUE), by = as.IDate(dateTime)][]
dateTime WS WD Temp max.ws
1: 2011-01-01 00:00:00 NA 293 22 15
2: 2011-01-01 00:24:00 15 55 14 15
3: 2011-01-01 00:48:00 NA 186 24 15
4: 2011-01-01 01:12:00 4 300 22 15
5: 2011-01-01 01:36:00 0 120 36 15
---
1734: 2011-01-29 21:12:00 12 249 5 15
1735: 2011-01-29 21:36:00 9 282 21 15
1736: 2011-01-29 22:00:00 12 238 6 15
1737: 2011-01-29 22:24:00 10 127 21 15
1738: 2011-01-29 22:48:00 13 297 0 15
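A data.table variant of the same idea that returns the row (and therefore the time) of each day's maximum instead of broadcasting the maximum to every row; a sketch, assuming days that are entirely NA can simply be dropped:
library(data.table)
setDT(df)[!is.na(WS), .SD[which.max(WS)], by = .(date = as.IDate(dateTime))]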

convert day-number within the year to month/day format

I am trying to convert a day-number within the year to month/day format.
With this df:
set.seed(123)
df1 <- data.frame(Year = rep(15,100), DayNum = seq(78,177,1), Hour = sample(0:23,100,replace = T))
df2 <- data.frame(Year = rep(16,100), DayNum = seq(78,177,1), Hour = sample(0:23,100,replace = T))
df <- rbind(df1, df2)
> head(df)
Year DayNum Hour
1 15 78 6
2 15 79 18
3 15 80 9
4 15 81 21
5 15 82 22
6 15 83 1
> tail(df)
Year DayNum Hour
195 16 172 22
196 16 173 11
197 16 174 9
198 16 175 15
199 16 176 3
200 16 177 13
which has 100 records for 2015 and 2016, how can I make a POSIXct date/time column?
While there are a number of related posts with a Julian date from a beginning origin (usually 1970-01-01), I could not find any posts with a day-number within a year and with a variable year (i.e. 2015 and 2016).
The as.POSIXct function has an option to specify the origin date when converting from a "Julian" date to the date/time object:
# calculate the origin date based on the Year column
df$origin <- as.Date(paste0("20", df$Year, "-01-01"))
# convert the day-of-year number to a date/time object
# (as.POSIXct() treats the number as seconds, so scale DayNum to seconds;
#  subtract 1 because January 1 is day 1)
as.POSIXct((df$DayNum - 1) * 86400, origin = df$origin)
One may need to consider adding the timezone option for completeness:
as.POSIXct((df$DayNum - 1) * 86400, origin = df$origin, tz = "GMT")
You might need something like this, use %j to specify the day of the year:
strptime(with(df, paste(Year, DayNum, Hour)), "%y %j %H")
# [1] "2015-03-19 06:00:00 EDT"
# [2] "2015-03-20 18:00:00 EDT"
# [3] "2015-03-21 09:00:00 EDT"
# [4] "2015-03-22 21:00:00 EDT"
# [5] "2015-03-23 22:00:00 EDT"
