Fast manipulation of Dates in R

I have around 34000 vectors of dates in which I have to change the day and shift the month. I have tried this with a loop and with the mapply function, but both are extremely slow.
This is an example of what I have:
library(lubridate)
list_dates = replicate(34000,seq(as.Date("2019-03-14"),length.out = 208,by = "months"),simplify = F)
new_day = round(runif(34000,1,30))
new_day[sample(1:34000,10000)] = NA
new_dates = mapply(FUN = function(dates,day_change){
day(dates) = ifelse(is.na(rep(day_change,length(dates))),day(dates),rep(day_change,length(dates)))
dates = as.Date(ifelse(is.na(rep(day_change,length(dates))),dates,dates%m-%months(1)),origin = "1970-01-01")
return(dates)
},dates = list_dates,day_change = as.list(new_day),SIMPLIFY = F)
The variable new_dates should contain a list of the original dates, shifted according to the variable new_day. The function inside works like this:
if new_day is not NA, it changes the day of the dates to the new one
if new_day is not NA, it also moves the dates back by one month.
I'm open to any solution that will increase the speed, regardless of the packages used (as long as they are on CRAN).
EDIT
Based on the comments, I reduced the example to a list of 2 vectors of dates, each containing 2 dates, and created a manual vector of new days:
list_dates = replicate(2,seq(as.Date("2019-03-14"),length.out = 2,by = "months"),simplify = F)
new_day = c(9,NA)
This is the original input (variable list_dates):
[[1]]
[1] "2019-03-14" "2019-04-14"
[[2]]
[1] "2019-03-14" "2019-04-14"
and the expected output of the mapply function is:
[[1]]
[1] "2019-02-09" "2019-03-09"
[[2]]
[1] "2019-03-14" "2019-04-14"
As you can see, the first vector of dates was changed to day 9 and each date was moved back one month. The second vector of dates did not change because new_day is NA for that element.

Here is a lubridate solution
library(lubridate)
mapply(
function(x, y) { if (!is.na(y)) {
day(x) <- y;
month(x) <- month(x) - 1
}
return(x) },
list_dates, new_day, SIMPLIFY = F)
#[[1]]
#[1] "2019-02-09" "2019-03-09"
#
#[[2]]
#[1] "2019-03-14" "2019-04-14"
Or using purrr
library(purrr)
library(lubridate)
map2(list_dates, new_day, function(x, y) {
if (!is.na(y)) {
day(x) <- y
month(x) <- month(x) - 1
}
x })

In addition to Maurits' solution, if you want to speed up the computation further, you may want to consider using multiple cores with doParallel:
library(data.table)
library(doParallel)
library(plyr) # provides mlply
library(lubridate) # provides day<- and %m-%
registerDoParallel(3)
df <- data.table(new_day,list_dates)
mlply(df,
function(new_day,list_dates){
list_dates <- list_dates[[1]]
if(!is.na(new_day)){
day(list_dates) <- new_day
list_dates <- list_dates %m-% months(1)
}
return(list_dates)
}, .parallel = T, .paropts = list(.packages='lubridate')
)
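
Another direction, sketched here under assumptions rather than taken from the answers above, is to avoid the per-vector loop altogether: flatten the list into one Date vector, apply the change once, and split the result back into a list. The sketch uses base R only, and the names lens, all_dates, day_rep and idx are illustrative. One caveat: the POSIXlt arithmetic normalises an impossible date such as February 30 forward into March, while %m-% rolls back to the last day of the month, so the two approaches can disagree for days above 28.

```r
# Small example input from the question
list_dates <- replicate(2, seq(as.Date("2019-03-14"), length.out = 2, by = "months"),
                        simplify = FALSE)
new_day <- c(9, NA)

lens      <- lengths(list_dates)           # length of each date vector
all_dates <- do.call(c, list_dates)        # one long Date vector
day_rep   <- rep(new_day, lens)            # new day repeated for every date
idx       <- !is.na(day_rep)               # only these dates are modified

lt <- as.POSIXlt(all_dates[idx])
lt$mday <- as.integer(day_rep[idx])        # set the new day of the month
lt$mon  <- lt$mon - 1L                     # move back one month
all_dates[idx] <- as.Date(lt)              # as.Date() normalises the fields

# Split back into a list with the original grouping
new_dates <- unname(split(all_dates, rep(seq_along(list_dates), lens)))
new_dates
```

Whether this is actually faster on the full 34000 x 208 input would need to be timed.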

Related

R lubridate find non overlapping periods between a continuous time frame and a set of intervals

I've got the following time frame:
A <- c('2016-01-01', '2019-01-05')
B <- c('2017-05-05','2019-06-05')
X_Period <- interval("2015-01-01", "2019-12-31")
Y_Periods <- interval(A, B)
I'd like to find the non overlapping periods between X_Period and Y_Periods so that the result would be:
[1]'2015-01-01'--'2015-12-31'
[2]'2017-05-06'--'2019-01-04'
[3]'2019-06-06'--'2019-12-31'
I'm trying to use setdiff but it does not work
setdiff(X_Period, Y_Periods)

Here is an option:
library(lubridate)
seq_X <- as.Date(seq(int_start(X_Period), int_end(X_Period), by = "1 day"))
seq_Y <- as.Date(do.call("c", sapply(Y_Periods, function(x)
seq(int_start(x), int_end(x), by = "1 day"))))
unique_dates_X <- seq_X[!seq_X %in% seq_Y]
lst <- aggregate(
unique_dates_X,
by = list(cumsum(c(0, diff.Date(unique_dates_X) != 1))),
FUN = function(x) c(min(x), max(x)),
simplify = F)$x
lapply(lst, function(x) interval(x[1], x[2]))
#[[1]]
#[1] 2015-01-01 UTC--2015-12-31 UTC
#
#[[2]]
#[1] 2017-05-06 UTC--2019-01-04 UTC
#
#[[3]]
#[1] 2019-06-06 UTC--2019-12-31 UTC
The strategy is to convert the intervals to by-day sequences (one for X_Period and one for Y_Periods); then we find all days that are part of X_Period only (and not part of Y_Periods). We then aggregate to determine the first and last date in every sub-sequence of consecutive dates. The resulting lst is a list with those start/end dates. To convert to intervals, we simply loop through the list and turn each start/end pair into an interval.
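
The same strategy can be sketched in base R with plain Date vectors instead of lubridate intervals. This is only an illustration of the run-detection step; x_days, covered, free, run_id and gaps are made-up names, and the result is a data frame of start/end dates rather than interval objects.

```r
x_days  <- seq(as.Date("2015-01-01"), as.Date("2019-12-31"), by = "day")
y_start <- as.Date(c("2016-01-01", "2019-01-05"))
y_end   <- as.Date(c("2017-05-05", "2019-06-05"))

# All days covered by any of the Y periods
covered <- do.call(c, lapply(seq_along(y_start),
                             function(i) seq(y_start[i], y_end[i], by = "day")))

# Days of the X period that no Y period covers
free <- x_days[!x_days %in% covered]

# A new run starts wherever the gap to the previous free day is not exactly 1
run_id <- cumsum(c(0, diff(as.numeric(free)) != 1))

# First and last day of each run of consecutive free days
gaps <- data.frame(
  start = free[!duplicated(run_id)],
  end   = free[!duplicated(run_id, fromLast = TRUE)]
)
gaps
```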

Selecting multiple columns using Regular Expressions

I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select the variables whose names start with r5 through r12 and create a data frame from them in R.
The best code that I could write to get this done is,
data %>% select(grep("r[5-9][^0-9]" , names(data), value = TRUE ),
grep("r1[0-2]", names(data), value = TRUE))
Given that my experience with regular expressions spans only a day, I was wondering if anyone could help me write better and more compact code for this!

Here's a regex that gets all the columns at once:
data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))
The vertical bar represents an 'or'.
As the comments have pointed out, this will also match names such as r51, and the select call can be shortened by using matches(). To exclude names like r51, you need a slightly longer regex:
data %>% select(matches("r([5-9]|1[0-2])([^0-9]|$)"))
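
To see what the longer pattern accepts and rejects, it can be tried on a plain character vector with grepl. This is a small check rather than part of the answer: the name r51z is a hypothetical addition to the question's names, and the leading ^ anchor is an extra assumption so that the r must be at the start of the name.

```r
# Names from the question plus one hypothetical troublemaker, "r51z"
x <- c("r1a", "r3c", "r5e", "r7g", "r9i", "r11k", "r13g", "r15i", "r51z")

# r, then 5-9 or 10-12, then a non-digit or the end of the name
pattern <- "^r([5-9]|1[0-2])([^0-9]|$)"

selected <- x[grepl(pattern, x)]
selected
```

The trailing ([^0-9]|$) is what keeps r51z out: after r5 the next character must not be another digit.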

Suppose that in the code below x represents your names(data). Then the following will do what you want.
# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
EDIT.
You can generalize the above code into a function. It takes three arguments: the first is the vector of variable names, and the second and third are the lower and upper limits of the numbers you want to keep.
var_names <- function(x, from = 1, to = Inf){
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.integer(y[sapply(y, `!=`, "")])
x[from <= y & y <= to]
}
var_names(x, 5)
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"

Remove the non-digits, scan the remainder in and check whether each is in 5:12 :
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data
DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
##   r5e r7g r9i r11k
## 1   3   4   5    6
Using magrittr it could also be written like this:
library(magrittr)
DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
##   r5e r7g r9i r11k
## 1   3   4   5    6

Why does lubridate's parse_date_time work with lapply, but fail with sapply?

Given: the following 4x2 dataframe
df <- as.data.frame(
stringsAsFactors = FALSE,
matrix(
c("2014-01-13 12:08:02", "2014-01-13 12:19:46",
"2014-01-14 09:59:09", "2014-01-14 10:05:09",
"6-18-2016 17:43:42", "6-18-2016 18:06:59",
"6-27-2016 12:16:47", "6-27-2016 12:29:05"),
nrow = 4, ncol = 2, byrow = TRUE
)
)
colnames(df) <- c("starttime", "stoptime")
Goal: the same dataframe but with all the values replaced by the return value of the following lubridate function call:
f <- function(column) {
parse_date_time(column, orders = c ("ymd_hms", "mdy_hms"), tz = "ETZ")
}
Here's the sapply call, whose result contains strange integers:
df2 <- sapply(df, FUN = f) # has values like `1467030545`
And here's the lapply call, that works as expected:
df2 <- lapply(df, FUN = f) # has values like `2016-06-27 12:29:05`
I understand sapply returns the simplest data structure it can while lapply returns a list. I was prepared to follow up the sapply call with df2 <- data.frame(df2) to end up with a data frame as desired. My question is:
Why does the parse_date_time function behave as expected in the lapply but not in the sapply?

The reason is that sapply has simplify = TRUE by default, so when the list elements all have the same length or dimensions, it simplifies the result to a vector or matrix. Internally, date-time classes are stored as numeric,
typeof(parse_date_time(df$starttime, orders = c("ymd_hms", "mdy_hms"), tz = "ETZ"))
#[1] "double"
while the class is 'POSIXct'
class(parse_date_time(df$starttime, orders = c("ymd_hms", "mdy_hms"), tz = "ETZ"))
#[1] "POSIXct" "POSIXt"
so the matrix conversion drops the class attribute and leaves only the underlying numbers, whereas a list preserves the class of each element.
If we are interested in a data.frame, we can create a copy of 'df' and assign into it with [] to keep the same structure as 'df':
df2 <- df
df2[] <- lapply(df, FUN = function(column) {
parse_date_time(column, orders = c("ymd_hms", "mdy_hms"), tz = "ETZ")
})
df2
# starttime stoptime
#1 2014-01-13 12:08:02 2014-01-13 12:19:46
#2 2014-01-14 09:59:09 2014-01-14 10:05:09
#3 2016-06-18 17:43:42 2016-06-18 18:06:59
#4 2016-06-27 12:16:47 2016-06-27 12:29:05
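
The effect can be reproduced without lubridate. The sketch below substitutes base R's as.POSIXct for parse_date_time (an assumption made only to keep the example self-contained) and uses a cut-down two-row version of the data with a single date format.

```r
df <- data.frame(
  starttime = c("2014-01-13 12:08:02", "2014-01-14 09:59:09"),
  stoptime  = c("2014-01-13 12:19:46", "2014-01-14 10:05:09"),
  stringsAsFactors = FALSE
)

# Stand-in for the lubridate parser: returns a POSIXct vector
f <- function(column) as.POSIXct(column, tz = "UTC")

as_matrix <- sapply(df, f)   # simplified to a plain numeric matrix
as_list   <- lapply(df, f)   # list of POSIXct vectors, class kept

class(as_matrix)             # no POSIXct class left here
class(as_list$starttime)     # still POSIXct
```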

Calculate time difference in R

I'm currently struggling with R and calculating the time difference in days.
I have a data.frame with around 60 000 rows. In this data frame there are two columns called "start" and "end". Both columns contain data in UNIX time format with milliseconds, as you can see from the last three digits.
Start <- c("1470581434000", "1470784954000", "1470811368000", "1470764345000")
End <- c("1470560601000", "1470581549000", "1470785452000", "1470764722000")
d <- data.frame(Start, End)
My desired output is an extra column called timediff, where the time difference is given in days.
I tried it with timediff and strptime, which I found here, but nothing worked out.
Maybe one of you worked with calculation of time differences in the past.
Thanks a lot

There is a very small and fast solution:
Start_POSIX <- as.POSIXct(as.numeric(Start)/1000, origin="1970-01-01")
End_POSIX <- as.POSIXct(as.numeric(End)/1000, origin="1970-01-01")
difftime(Start_POSIX, End_POSIX)
Time differences in mins
[1] 347.216667 3390.083333 431.933333 -6.283333
or if you want another unit:
difftime(Start_POSIX, End_POSIX, unit = "sec")
Time differences in secs
[1] 20833 203405 25916 -377
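
Since the question asks for days and for a new column, the same idea can be written as follows. This is a sketch: to_posix is a made-up helper name, tz = "UTC" is an assumption, and the sign convention (Start minus End) simply follows the calls above.

```r
Start <- c("1470581434000", "1470784954000", "1470811368000", "1470764345000")
End   <- c("1470560601000", "1470581549000", "1470785452000", "1470764722000")
d <- data.frame(Start, End, stringsAsFactors = FALSE)

# Hypothetical helper: millisecond UNIX timestamps (as strings) to POSIXct
to_posix <- function(ms) {
  as.POSIXct(as.numeric(ms) / 1000, origin = "1970-01-01", tz = "UTC")
}

# Difference in days, stored as a plain numeric column
d$timediff <- as.numeric(difftime(to_posix(d$Start), to_posix(d$End), units = "days"))
d
```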

You have a few steps you'll need to take:
# 1. Separate the milliseconds.
# To do this, insert a period in front of the last three digits
Start <-
sub(pattern = "(\\d{3}$)", # get the pattern of three digits at the end of the string
replacement = ".\\1", # replace with a . and then the pattern
x = Start)
# 2. Convert to numeric
Start <- as.numeric(Start)
# 3. Convert to POSIXct
Start <- as.POSIXct(Start,
origin = "1970-01-01")
For convenience, it would be good to put these all into a function
# Bundle all three steps into one function
unixtime_to_posixct <- function(x)
{
x <- sub(pattern = "(\\d{3}$)",
replacement = ".\\1",
x = x)
x <- as.numeric(x)
as.POSIXct(x,
origin = "1970-01-01")
}
And with that, you can get your differences in days
#* Put it all together.
library(dplyr)
library(magrittr)
Start <- c("1470581434000", "1470784954000", "1470811368000", "1470764345000")
End <- c("1470560601000", "1470581549000", "1470785452000", "1470764722000")
d <- data.frame(Start,
End,
stringsAsFactors = FALSE)
lapply(
X = d,
FUN = unixtime_to_posixct
) %>%
as.data.frame() %>%
mutate(diff = difftime(Start, End, units = "days"))

R: efficient ways to add months to dates?

I have a data.table with millions of rows, and one of its columns is a date column. I would like to add 12 months to all the dates in that column and store the result in a new column. I use the dplyr and lubridate packages, e.g.:
library(dplyr)
library(lubridate)
new_data <- data %>% mutate(date12m = date %m+% months(12))
This works, but it is very slow for large datasets. Am I missing something? How can this be sped up? I generally don't expect R to run for more than 10 minutes on such a simple task.
Edit:
I note that my solution is already more efficient than using as.yearmon. Thanks to Colonel Beauvel for the solution
a <- data.frame(date = rep(today(),1000000))
func = function(u) {
d = as.Date(as.yearmon(u)+1, frac=1)
if(day(u)>day(d)) return(d)
day(d) = day(u)
d
}
pt <- proc.time()
a <- a %>% mutate(date12m = func(date))
data.table::timetaken(pt)
pt <- proc.time()
a <- a %>% mutate(date12m = date %m+% months(12))
data.table::timetaken(pt)

Just add 1 with month:
x=seq.Date(from=as.Date("2007-01-01"), to=as.Date("2014-12-12"), by="day")
month(x) = month(x) + 1
#> head(x)
#[1] "2007-02-01" "2007-02-02" "2007-02-03" "2007-02-04" "2007-02-05" "2007-02-06"
Edit: as per @akrun's comment, here is a solution using as.yearmon from the zoo package. The trick is to take the last day of the next month and do a quick check against the day of the original date:
library(zoo)
func = function(u)
{
d = as.Date(as.yearmon(u)+1/12, frac=1)
if(day(u)>day(d)) return(d)
day(d) = day(u)
d
}
x=as.Date(c("2014-01-31","2015-02-28","2013-03-02"))
#> as.Date(sapply(x, func))
#[1] "2014-02-28" "2015-03-28" "2013-04-02"
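
The function above handles one date at a time, which is why it is called through sapply. A vectorized variant of the same end-of-month check can be sketched in base R. The function name add_months is illustrative and this is not the zoo or lubridate implementation: it works out the first day of the target month, the number of days in that month, and then clamps the original day to that length.

```r
# Illustrative helper: add n months to a Date vector, clamping to month end
add_months <- function(d, n) {
  lt <- as.POSIXlt(d)
  total <- (lt$year + 1900) * 12 + lt$mon + n          # months since year 0
  first <- as.Date(sprintf("%04d-%02d-01", total %/% 12, total %% 12 + 1))
  next_first <- as.Date(sprintf("%04d-%02d-01", (total + 1) %/% 12, (total + 1) %% 12 + 1))
  days_in_month <- as.integer(next_first - first)      # length of the target month
  first + pmin(lt$mday, days_in_month) - 1             # keep the day, or use month end
}

x <- as.Date(c("2014-01-31", "2015-02-28", "2013-03-02"))
add_months(x, 1)
```

No timing is claimed here; whether it beats %m+% on millions of rows would have to be benchmarked.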

I am also working with big data frames in R. You can use the DescTools package, which has a function named AddMonths(date, NoOfMonths).
It works quite well for me.
> a <- ymd("2011-09-9")
> b <- AddMonths(a,1)
> b
[1] "2011-10-09"