How to count the number of days between each day in R - r

Given a Date column in a data frame:
Date (note: it's in year-month-day format)
2020-01-01
2020-02-01
2020-03-03
2020-04-04
How do I get the running total of the number of days between each date?
Count:
0
30
58
87

Just convert the character strings to Date objects and subtract the first date from each one.
dates <- as.Date(c("2020-01-01", "2020-02-01", "2020-03-03", "2020-04-04"))
dates - dates[1]
# Time differences in days
# [1] 0 31 62 94

You can convert your character strings to Date format using as.Date and then use the lag function to get the number of days between consecutive dates:
df <- data.frame(date = c("2020-01-01", "2020-02-02", "2020-03-03", "2020-04-04"))
df$ndays <- as.numeric(as.Date(df$date) - dplyr::lag(as.Date(df$date), n = 1, default = as.Date(df$date)[1]))
> df
date ndays
1 2020-01-01 0
2 2020-02-02 32
3 2020-03-03 30
4 2020-04-04 32
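The two answers above compute different things: the first gives days elapsed since the first date, the second gives days between consecutive dates. As a minimal sketch in base R (the variable names gaps, running and since_first are only illustrative), the running total of the consecutive gaps reproduces the offsets from the first date:

```r
dates <- as.Date(c("2020-01-01", "2020-02-01", "2020-03-03", "2020-04-04"))

# Days between consecutive dates, with 0 for the first row
gaps <- c(0, as.numeric(diff(dates)))

# Running total of those gaps
running <- cumsum(gaps)

# Same result, computed directly as the offset from the first date
since_first <- as.numeric(dates - dates[1])

print(gaps)     # 0 31 31 32
print(running)  # 0 31 62 94
```

Which of the two you want depends on whether "count" means the gap to the previous row or the total since the start.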

Related

R: date format with just year and month

I have a dataframe with monthly data, one column containing the year and one column containing the month. I'd like to combine them into one column with Date format, going from this:
Year Month Data
2020 1 54
2020 2 58
2020 3 78
2020 4 59
To this:
Date Data
2020-01 54
2020-02 58
2020-03 78
2020-04 59
I think you can't represent a Date format in R without showing the day. If you want a character column, like in your example, you can do:
> x <- data.frame(Year = c(2020,2020,2020,2020), Month = c(1,2,3,4), Data = c(54,58,78,59))
> x$Month <- ifelse(nchar(x$Month) == 1, paste0(0, x$Month), x$Month) # pad single-digit months with a leading 0
> x$Date <- paste(x$Year, x$Month, sep = '-')
> x
Year Month Data Date
1 2020 01 54 2020-01
2 2020 02 58 2020-02
3 2020 03 78 2020-03
4 2020 04 59 2020-04
> class(x$Date)
[1] "character"
If you want a Date type column you will have to add:
x$Date <- paste0(x$Date, '-01')
x$Date <- as.Date(x$Date, format = '%Y-%m-%d')
x
class(x$Date)
Maybe the simplest way would be to arbitrarily set a day (e.g. 01) for all your dates? That way date intervals are preserved.
data<-data.frame(Year=c(2020,2020,2020,2020), Month=c(1,2,3,4), Data=c(54,58,78,59))
data$Date<-gsub(" ","",paste(data$Year,"-",data$Month,"-","01"))
data$Date<-as.Date(data$Date,format="%Y-%m-%d")
You can use sprintf -
sprintf('%d-%02d', data$Year, data$Month)
#[1] "2020-01" "2020-02" "2020-03" "2020-04"
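The answers above can be combined: keep a real Date column (pinned to the first of the month) for calculations, and derive the year-month label from it only for display. This is a rough sketch; the YearMonth column name is an assumption, and the sample data includes a two-digit month to show the padding.

```r
x <- data.frame(Year = c(2020, 2020), Month = c(1, 12), Data = c(54, 59))

# Build a proper Date by pinning every row to the first day of its month
x$Date <- as.Date(sprintf("%d-%02d-01", as.integer(x$Year), as.integer(x$Month)))

# Derive a character label without the day, for printing or axis labels
x$YearMonth <- format(x$Date, "%Y-%m")

print(x$YearMonth)  # "2020-01" "2020-12"
print(class(x$Date))  # "Date"
```

Sorting and date arithmetic should use the Date column; the label is character and only sorts correctly because of the zero padding.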

How to select the earliest date in a month from a Date series in R?

I have a database containing the values of different indices, recorded at different frequencies (weekly, monthly, daily). I want to calculate monthly returns by extracting the beginning-of-month value from each time series.
I have tried using a loop to partition the time series month by month and then min() to get the earliest date in each month. However, I am wondering whether there is a more efficient way to speed up the calculation.
library(data.table)
df<-fread("statistic_date index_value funds_number
2013-1-1 1000.000 0
2013-1-4 996.096 21
2013-1-11 1011.141 21
2013-1-18 1057.344 21
2013-1-25 1073.376 21
2013-2-1 1150.479 22
2013-2-8 1150.288 19
2013-2-22 1112.993 18
2013-3-1 1148.826 20
2013-3-8 1093.515 18
2013-3-15 1092.352 17
2013-3-22 1138.346 18
2013-3-29 1107.440 17
2013-4-3 1101.897 17
2013-4-12 1093.344 17")
I expect to filter to get the rows of the earliest date of each month, such as:
2013-1-1 1000.000 0
2013-2-1 1150.479 22
2013-3-1 1148.826 20
2013-4-3 1101.897 17
Your help will be much appreciated!
Using the tidyverse and lubridate packages,
library(lubridate)
library(tidyverse)
df %>% mutate(statistic_date = ymd(statistic_date), # convert statistic_date to date format
month = month(statistic_date), #create month and year columns
year= year(statistic_date)) %>%
group_by(month,year) %>% # group by month and year
arrange(statistic_date) %>% # make sure the df is sorted by date
filter(row_number()==1) # select first row within each group
# A tibble: 4 x 5
# Groups: month, year [4]
# statistic_date index_value funds_number month year
# <date> <dbl> <int> <dbl> <dbl>
#1 2013-01-01 1000 0 1 2013
#2 2013-02-01 1150. 22 2 2013
#3 2013-03-01 1149. 20 3 2013
#4 2013-04-03 1102. 17 4 2013
First make statistic_date a Date:
df$statistic_date <- as.Date(df$statistic_date)
Then you can use nth_day to find the first day of every month in statistic_date.
library("datetimeutils")
dates <- nth_day(df$statistic_date, period = "month", n = "first")
## [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-03"
df[statistic_date %in% dates]
## statistic_date index_value funds_number
## 1: 2013-01-01 1000.000 0
## 2: 2013-02-01 1150.479 22
## 3: 2013-03-01 1148.826 20
## 4: 2013-04-03 1101.897 17
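If you would rather avoid extra packages, the same idea can be sketched in base R: sort by date, then keep the first row seen for each year-month key. The small data frame below is a shortened, deliberately unsorted stand-in for the question's data, and first_rows is an illustrative name.

```r
df <- data.frame(
  statistic_date = as.Date(c("2013-01-04", "2013-01-01", "2013-02-08",
                             "2013-02-01", "2013-03-01")),
  index_value = c(996.096, 1000.000, 1150.288, 1150.479, 1148.826)
)

# Sort by date so the earliest row of each month comes first
df <- df[order(df$statistic_date), ]

# Keep only the first occurrence of each year-month key
first_rows <- df[!duplicated(format(df$statistic_date, "%Y-%m")), ]

print(first_rows)
```

Because duplicated() is vectorised, this avoids the month-by-month loop described in the question.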

Calculating business days between two dataframe columns

I have a data frame that contains two POSIXct columns. How can I go about calculating the number of weekdays between these two columns?
df <- data.frame(StartDate=as.POSIXct(c("2017-05-17 12:53:00","2017-08-31 21:16:00","2017-08-25 13:54:00","2017-09-06 15:47:00","2017-10-15 05:11:00"), format = "%Y-%m-%d %H:%M:%S"),
EndDate=as.POSIXct(c("2017-06-09 11:57:00","2017-11-29 16:51:00","2017-09-06 15:13:00","2018-01-03 16:22:00","2017-11-17 11:51:00"), format = "%Y-%m-%d %H:%M:%S"))
Using dplyr:
library(dplyr) # provides %>%, rowwise() and mutate()
df %>%
dplyr::rowwise() %>%
dplyr::mutate(wdays = sum(!weekdays(seq(StartDate, EndDate, by="day")) %in% c("Saturday", "Sunday")))
Source: local data frame [5 x 3]
Groups: <by row>
# A tibble: 5 x 3
StartDate EndDate wdays
<dttm> <dttm> <int>
1 2017-05-17 12:53:00 2017-06-09 11:57:00 17
2 2017-08-31 21:16:00 2017-11-29 16:51:00 64
3 2017-08-25 13:54:00 2017-09-06 15:13:00 9
4 2017-09-06 15:47:00 2018-01-03 16:22:00 86
5 2017-10-15 05:11:00 2017-11-17 11:51:00 25
This makes use of the fact that dates can easily be sequenced, and that because TRUE is equal to one, we can just sum up all of the non-weekend days.
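The same sequence-and-sum idea can be sketched without dplyr. In the sketch below, count_weekdays is a hypothetical helper name; it assumes Date (not POSIXct) inputs and counts both endpoints, so its results can differ by one from approaches that step through date-times. It uses format(x, "%u") (ISO weekday number, Monday = 1) rather than weekdays(), so it does not depend on the locale's day names.

```r
# Hypothetical helper: number of Monday-to-Friday days from start to end, inclusive
count_weekdays <- function(start, end) {
  vapply(seq_along(start), function(i) {
    days <- seq(start[i], end[i], by = "day")
    # "%u" gives "1" (Monday) through "7" (Sunday); drop 6 and 7
    sum(!format(days, "%u") %in% c("6", "7"))
  }, integer(1))
}

# Friday 2017-08-25 through Wednesday 2017-09-06
print(count_weekdays(as.Date("2017-08-25"), as.Date("2017-09-06")))  # 9
```

Public holidays are not handled here; a calendar-aware package is the better fit if those matter.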
Try the bizdays package:
library(bizdays) # Load the package
## Make a calendar that excludes Saturdays and Sundays
create.calendar("Workdays",weekdays = c("saturday", "sunday"))
## Calculate difference in days using the new Workdays calendar
df$bizdays <- bizdays(df$StartDate,df$EndDate,"Workdays")
df$bizdays
[1] 17 63 8 85 24
That returned 17, 63, 8, 85, and 24 business days between the start and end dates you provided. This looks right when I checked the 8 business days between 8/25/2017 and 9/6/2017.

R: extract hour from variable format timestamp

My dataframe has timestamp with and without seconds, and a random use of 0 in front of months and hours, i.e. 01 or 1
library(tidyverse)
df <- tibble(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 01:03', '12/30/2015 23:04:25')) # tibble() replaces the deprecated data_frame()
cust timestamp
A 5/31/2016 1:03:12
A 05/25/2016 01:06
B 6/16/2016 01:03
B 12/30/2015 23:04:25
How to extract hours into a separate column? The desired output:
cust timestamp hours
A 5/31/2016 1:03:12 1
A 05/25/2016 01:06 1
B 6/16/2016 9:03 9
B 12/30/2015 23:04:25 23
I prefer the answer with tidyverse and mutate, but my attempt fails to extract hours correctly:
df %>% mutate(hours=strptime(timestamp, '%H') %>% as.character() )
# A tibble: 4 × 3
cust timestamp hours
<chr> <chr> <chr>
1 A 5/31/2016 1:03:12 2016-10-31 05:00:00
2 A 05/25/2016 01:06 2016-10-31 05:00:00
3 B 6/16/2016 01:03 2016-10-31 06:00:00
4 B 12/30/2015 23:04:25 2016-10-31 12:00:00
Try this:
library(lubridate)
df <- data.frame(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 09:03', '12/30/2015 23:04:25'))
df %>% mutate(hours=hour(strptime(timestamp, '%m/%d/%Y %H:%M')) %>% as.character() )
cust timestamp hours
1 A 5/31/2016 1:03:12 1
2 A 05/25/2016 01:06 1
3 B 6/16/2016 09:03 9
4 B 12/30/2015 23:04:25 23
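A single format string copes with the mixed input here because strptime reads only as much of each string as the format needs and ignores trailing characters (the optional seconds), and it accepts months and hours with or without a leading zero. A minimal base R sketch, using the question's original timestamps and an explicit UTC time zone as an assumption to avoid daylight-saving surprises:

```r
timestamp <- c("5/31/2016 1:03:12", "05/25/2016 01:06",
               "6/16/2016 01:03", "12/30/2015 23:04:25")

# Parse month/day/year hour:minute; any trailing ":SS" is ignored
parsed <- strptime(timestamp, "%m/%d/%Y %H:%M", tz = "UTC")

# POSIXlt objects expose the hour as a plain integer component
hours <- parsed$hour

print(hours)  # 1 1 1 23
```

Inside mutate() it is usually safer to wrap the result in as.POSIXct() or extract the hour immediately, since POSIXlt values are lists and do not store well in data frame columns.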
Here is a solution that appends 00 for the seconds when they are missing, then converts to a date using lubridate and extracts the hours using format. Note, if you don't want the 00:00 at the end of the hours, you can just eliminate them from the output format in format:
df %>%
mutate(
cleanTime = ifelse(grepl(":[0-9][0-9]:", timestamp)
, timestamp
, paste0(timestamp, ":00")) %>% mdy_hms
, hour = format(cleanTime, "%H:00:00")
)
returns:
cust timestamp cleanTime hour
<chr> <chr> <dttm> <chr>
1 A 5/31/2016 1:03:12 2016-05-31 01:03:12 01:00:00
2 A 05/25/2016 01:06 2016-05-25 01:06:00 01:00:00
3 B 6/16/2016 01:03 2016-06-16 01:03:00 01:00:00
4 B 12/30/2015 23:04:25 2015-12-30 23:04:25 23:00:00
Your timestamp is a character string (<chr>), so you need to parse it as a date-time (with strptime or as.POSIXct, for example; as.Date would drop the time of day) before you can extract the hour.
You may have to go through some string manipulation to get consistently formatted data before you can convert it. Prepend a zero to months with a single digit and append :00 to timestamps with missing seconds. Use strsplit() and other regex functions. Afterwards do as.POSIXct(df$timestamp, format = '%m/%d/%Y %H:%M:%S'), then you will be able to use format(..., '%H') to extract the hours.

convert day-number within the year to month/day format

I am trying to convert a day-number within the year to month/day format.
With this df:
set.seed(123)
df1 <- data.frame(Year = rep(15,100), DayNum = seq(78,177,1), Hour = sample(0:23,100,replace = T))
df2 <- data.frame(Year = rep(16,100), DayNum = seq(78,177,1), Hour = sample(0:23,100,replace = T))
df <- rbind(df1, df2)
> head(df)
Year DayNum Hour
1 15 78 6
2 15 79 18
3 15 80 9
4 15 81 21
5 15 82 22
6 15 83 1
> tail(df)
Year DayNum Hour
195 16 172 22
196 16 173 11
197 16 174 9
198 16 175 15
199 16 176 3
200 16 177 13
which has 100 records for 2015 and 2016, how can I make a POSIXct date/time column?
While there are a number of related posts with a Julian date from a beginning origin (usually 1970-01-01), I could not find any posts with a day-number within a year and with a variable year (i.e. 2015 and 2016).
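The as.POSIXct function has an option to specify the origin date when converting a number to a date/time object. Note that the number is interpreted as seconds (not days) from the origin, so convert the day number and hour to seconds first:
#calculate the origin date based on the year column
df$origin <- as.Date(paste0("20", df$Year, "-01-01"))
#day 1 is January 1st, so subtract 1 before converting days and hours to seconds
as.POSIXct((df$DayNum - 1) * 86400 + df$Hour * 3600, origin = df$origin)
Consider adding the timezone option as well, otherwise the result is displayed in your local time zone and the hours may appear shifted:
as.POSIXct((df$DayNum - 1) * 86400 + df$Hour * 3600, origin = df$origin, tz = "GMT")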
You might need something like this, use %j to specify the day of the year:
strptime(with(df, paste(Year, DayNum, Hour)), "%y %j %H")
# [1] "2015-03-19 06:00:00 EDT"
# [2] "2015-03-20 18:00:00 EDT"
# [3] "2015-03-21 09:00:00 EDT"
# [4] "2015-03-22 21:00:00 EDT"
# [5] "2015-03-23 22:00:00 EDT"
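As a sanity check on the day-of-year arithmetic, including the leap year 2016, here is a small sketch that builds the same date-times two ways and compares them. The time zone is fixed to UTC as an assumption (the output above was produced in a US Eastern locale), and the names from_j and from_origin are illustrative only.

```r
year   <- c(15, 16)   # two-digit years, as in the question
daynum <- c(78, 78)   # day number within the year (1 = January 1st)
hour   <- c(6, 6)

# Approach 1: let strptime interpret the day of the year via %j
from_j <- as.POSIXct(strptime(paste(year, daynum, hour), "%y %j %H", tz = "UTC"))

# Approach 2: start at January 1st of each year and add (daynum - 1) days in seconds
from_origin <- ISOdatetime(2000 + year, 1, 1, hour, 0, 0, tz = "UTC") +
  (daynum - 1) * 86400

print(from_j)       # 2015-03-19 06:00:00 UTC, 2016-03-18 06:00:00 UTC
print(from_origin)  # same values
```

Day 78 falls on March 19 in 2015 but on March 18 in 2016, because of February 29.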
