R: efficient ways to add months to dates?

I have a data.table of millions of rows, and one of the columns is a date column. I would like to add 12 months to all the dates in that column and create a new column, so I use the dplyr and lubridate packages. E.g.
library(dplyr)
library(lubridate)
new_data <- data %>% mutate(date12m = date %m+% months(12))
This works; however, it is very slow for large datasets. Am I missing something? How can this be sped up? I generally don't expect R to run for more than 10 minutes for such a simple task.
Edit:
I note that my solution is already more efficient than using as.yearmon. Thanks to Colonel Beauvel for the solution:
library(zoo)  # for as.yearmon()
a <- data.frame(date = rep(today(), 1000000))
func = function(u) {
  # NB: if() on a vector warns pre-R 4.2 and errors from 4.2 on;
  # it happens to work here only because all the dates are identical
  d = as.Date(as.yearmon(u) + 1, frac = 1)
  if (day(u) > day(d)) return(d)
  day(d) = day(u)
  d
}
pt <- proc.time()
a <- a %>% mutate(date12m = func(date))
data.table::timetaken(pt)
pt <- proc.time()
a <- a %>% mutate(date12m = date %m+% months(12))
data.table::timetaken(pt)

Just add 1 to the month:
x=seq.Date(from=as.Date("2007-01-01"), to=as.Date("2014-12-12"), by="day")
month(x) = month(x) + 1
#> head(x)
#[1] "2007-02-01" "2007-02-02" "2007-02-03" "2007-02-04" "2007-02-05" "2007-02-06"
Edit: as per @akrun's comment, here is the solution using as.yearmon from the zoo package. The trick is to do a quick check against the day of the last date of the next month:
library(zoo)
func = function(u)
{
d = as.Date(as.yearmon(u)+1/12, frac=1)
if(day(u)>day(d)) return(d)
day(d) = day(u)
d
}
x=as.Date(c("2014-01-31","2015-02-28","2013-03-02"))
#> as.Date(sapply(x, func))
#[1] "2014-02-28" "2015-03-28" "2013-04-02"

I am also working with big data frames in R. You can use the DescTools package; it has a function named AddMonths(date, NoOfMonths).
It works quite well for me.
> library(DescTools)
> library(lubridate)  # for ymd()
> a <- ymd("2011-09-9")
> b <- AddMonths(a, 1)
> b
[1] "2011-10-09"

Related

Fast manipulation of Dates in R

I have around 34000 vectors of dates for which I have to change the day and shift the month. I have tried this with a loop and with the mapply function, but it is extremely slow.
This is an example of what I have:
library(lubridate)
list_dates = replicate(34000, seq(as.Date("2019-03-14"), length.out = 208, by = "months"), simplify = F)
new_day = round(runif(34000, 1, 30))
new_day[sample(1:34000, 10000)] = NA
new_dates = mapply(FUN = function(dates, day_change) {
  day(dates) = ifelse(is.na(rep(day_change, length(dates))), day(dates), rep(day_change, length(dates)))
  dates = as.Date(ifelse(is.na(rep(day_change, length(dates))), dates, dates %m-% months(1)), origin = "1970-01-01")
  return(dates)
}, dates = list_dates, day_change = as.list(new_day), SIMPLIFY = F)
The variable new_dates should contain a list of the original dates moved according to the variable new_day. The function inside works like this:
if new_day is not NA, it will change the day of the dates to the new one
if new_day is not NA, it will also move the months of the dates one behind
I'm open to any solution that will increase the speed, regardless of the packages used (as long as they are on CRAN).
EDIT
So, based on the comments, I reduced the example to a list of 2 vectors of dates, each containing 2 dates, and created a manual vector of new days:
list_dates = replicate(2,seq(as.Date("2019-03-14"),length.out = 2,by = "months"),simplify = F)
new_day = c(9,NA)
This is the original input (variable list_dates):
[[1]]
[1] "2019-03-14" "2019-04-14"
[[2]]
[1] "2019-03-14" "2019-04-14"
and the expected output of the mapply function is:
[[1]]
[1] "2019-02-09" "2019-03-09"
[[2]]
[1] "2019-03-14" "2019-04-14"
As you can see, the first vector of dates was changed to day 9 and each date was lagged one month. The second vector of dates did not change because new_day is NA for that value.
Here is a lubridate solution
library(lubridate)
mapply(
  function(x, y) {
    if (!is.na(y)) {
      day(x) <- y
      month(x) <- month(x) - 1
    }
    return(x)
  },
  list_dates, new_day, SIMPLIFY = F)
#[[1]]
#[1] "2019-02-09" "2019-03-09"
#
#[[2]]
#[1] "2019-03-14" "2019-04-14"
Or using purrr
library(purrr)
library(lubridate)
map2(list_dates, new_day, function(x, y) {
  if (!is.na(y)) {
    day(x) <- y
    month(x) <- month(x) - 1
  }
  x
})
In addition to Maurits' solution, if you want to further increase the speed of computation, you may want to consider using multiple cores with doParallel:
library(data.table)
library(doParallel)
library(plyr)       # for mlply()
library(lubridate)  # for day<- and %m-%
registerDoParallel(3)
df <- data.table(new_day, list_dates)
mlply(df,
  function(new_day, list_dates) {
    list_dates <- list_dates[[1]]
    if (!is.na(new_day)) {
      day(list_dates) <- new_day
      list_dates <- list_dates %m-% months(1)
    }
    return(list_dates)
  }, .parallel = T, .paropts = list(.packages = 'lubridate')
)
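Alternatively, here is a rough sketch using base R's parallel package (fork-based, so not available on Windows); this is my adaptation of the lubridate answer above, not part of the original answer:
library(parallel)
library(lubridate)
new_dates <- mcmapply(function(dates, new_day) {
  if (!is.na(new_day)) {
    day(dates) <- new_day
    dates <- dates %m-% months(1)
  }
  dates
}, list_dates, new_day, SIMPLIFY = FALSE, mc.cores = 3)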

Given only vector of dates, expand data in between (unequal) date points

Other questions have centered around having a start and end date (see the following for examples):
Given start date and end date, reshape/expand data for each day between (each day on a row)
Expand rows by date range using start and end date
My question is different in that I only have one date column, and I would like to convert the unequal date ranges to daily counts. This specific example deals with the number of workers on a job site at one time; different crews of people come on different dates.
A brief data frame provided is as follows:
dd <- data.frame(date = as.Date(c("1999-03-22", "1999-03-29", "1999-04-08")),
                 work = c(43, 95, 92),
                 cumwork = c(43, 138, 230))
I would like the data to look like this:
dw <- data.frame(date=c(seq(as.Date("1999-03-22"),as.Date("1999-04-10"),by= "day")),
work=c(rep(43,7),rep(95,10),rep(92,3)),
cumwork=c(rep(43,7),rep(138,10),rep(230,3)))
I have been stuck on this for some time. Any help would be appreciated!
UPDATE (7/5/2017): As pointed out by @Scarabee, the dates in the dataframe 'dd' should be in Date format. I have updated the code to reflect this.
A possible way:
First, create the sequence of dates you're interested in as a one-column dataframe:
v <- data.frame(date = seq(min(dd$date), as.Date("1999-04-10"), by="day"))
Next, join with your original dataframe and fill the missing values, for instance using dplyr and zoo:
library(dplyr)
library(zoo)
v %>%
left_join(dd, by = "date") %>%
na.locf
NB: I suppose that your dataframe dd actually contains dates (and not factors).
A similar solution, with base R (and the zoo package), starting from the same data:
dd <- data.frame(date = as.Date(c("1999-03-22", "1999-03-29", "1999-04-08")),
                 work = c(43, 95, 92),
                 cumwork = c(43, 138, 230))
dd$date <- as.Date(as.character(dd$date))
my.seq <- data.frame(date=seq.Date(from=range(dd$date)[1], to=range(dd$date)[2], by="day"))
output <- merge(my.seq, dd, all.x=TRUE)
output <- zoo::na.locf(output)
You first have to transform your date into a Date format. Then separately create a vector of complete dates and merge it with the original data. Eventually, run a "last observation carried forward" algorithm.
Here is a really fast pure base R solution:
ExpandDates <- function(df, lastColRepeat) {
  myDiff <- diff(df$date)                # day gaps between consecutive dates
  dfOut <- data.frame(df$date[1] + 0:(sum(myDiff) + lastColRepeat - 1L),
                      stringsAsFactors = FALSE)
  myDiff <- c(myDiff, lastColRepeat)     # how many times to repeat each row
  for (i in 2:3) {dfOut[, i] <- rep(df[, i], times = myDiff)}
  names(dfOut) <- names(df)
  dfOut
}
The last argument determines the number of times the last value should be repeated; as it stands, there is nothing in the original data.frame that would give this value. I'm also assuming that the "date" field is actually a date, as pointed out by @Scarabee.
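Hypothetical usage on the small dd example above; repeating the last row 3 times reproduces the desired dw through 1999-04-10:
ExpandDates(dd, lastColRepeat = 3)
#          date work cumwork
# 1  1999-03-22   43      43
# 2  1999-03-23   43      43
# ...
# 20 1999-04-10   92     230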
Here is some test data:
set.seed(123)
workVec <- sample(5000, 3000)
testDF <- data.frame(date = as.Date(sort(sample(12000, 3000)),
origin = "1970-01-01"), work = workVec,
cumwork = cumsum(workVec))
DplyrTest <- function(dd) { ## from @Scarabee
v <- data.frame(date = seq(min(dd$date), max(dd$date), by="day"))
v %>%
left_join(dd, by = "date") %>%
na.locf
}
a <- ExpandDates(testDF, 1)
b <- DplyrTest(testDF)
Test for equality:
identical(a$cumwork, as.integer(b$cumwork))
[1] TRUE
identical(a$work, as.integer(b$work))
[1] TRUE
identical(a$date, as.Date(b$date))
[1] TRUE
Benchmarks:
library(microbenchmark)
microbenchmark(DplyrTest(testDF), ExpandDates(testDF,1))
Unit: milliseconds
                   expr       min        lq      mean    median        uq       max neval cld
      DplyrTest(testDF) 80.909303 84.337006 91.315057 86.320883 88.818739 173.69395   100   b
 ExpandDates(testDF, 1)  1.122384  1.208184  2.521693  1.355564  1.486317  72.23444   100  a

Vectorizing for-loop in R for creating strings with different length

I have created a sample R script to show my question:
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df[,'start_date'] <- as.Date(test.df[,'start_date'])
test.df[,'end_date'] <- as.Date(test.df[,'end_date'])
for (loop in (1:nrow(test.df))) {
test.df[loop,'output'] <- paste(seq(test.df[loop,'start_date'],test.df[loop,'end_date'],by = 1),collapse=';')
}
I need to create date strings of different lengths. I can only think of using a for-loop for my problem, but I have about 70K cases that need to process the string; is there any way of speeding it up?
Update 01
Thanks to @akrun for the answer; I have further modified my question as below:
library(dplyr)
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df[,'start_date'] <- as.Date(test.df[,'start_date'])
test.df[,'end_date'] <- as.Date(test.df[,'end_date'])
# Part A
for (loop in (1:nrow(test.df))) {
test.df[loop,'output'] <- paste(seq(test.df[loop,'start_date'],test.df[loop,'end_date'],by = 1),collapse=';')
}
# Part B
test.mod <- group_by(test.df,uid) %>%
do({df <- data.frame(.)
output.df <- data.frame(uid=df[1,'uid'],
date=unlist(strsplit(df[,'output'],';')))
data.frame(output.df)
})
Now Part A is fixed, but is there any way to speed up Part B? Or should I combine Part A and Part B together? Please enlighten me, as data.table is new to me.
We could convert 'test.df' to a 'data.table' (setDT(test.df)); grouped by 'uid', we get the seq of 'start_date' and 'end_date' and then paste the elements together.
library(data.table)
setDT(test.df)[,paste(seq(start_date, end_date, by = '1 day'), collapse=';') , uid]
Update
For Part B, if we don't paste, it is a two-column dataset:
setDT(test.df)[,seq(start_date, end_date, by = '1 day') , uid]
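For the example data, this yields one row per uid and day; the head of the result looks roughly like this (my rendering, with data.table's default V1 column name):
#     uid         V1
#  1: x001 2015-01-02
#  2: x001 2015-01-03
#  3: x001 2015-01-04
#  4: x001 2015-01-05
#  5: x001 2015-01-06
#  6: x001 2015-01-07
#  7: x002 2015-03-05
# ...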
Here is how you can do it with apply:
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df$output <- apply(test.df, 1, function(x) paste(seq(as.Date(x[2]), as.Date(x[3]), by = 1), collapse=';'))
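Note that apply coerces each row of the data frame to a character vector, which is why start_date and end_date are re-converted with as.Date inside the function; x[2] and x[3] index those two columns by position.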

Convert factor into time in R

I have a data frame df like:
ID  time
a   121:24:30
b   130:30:00
The time column is a factor after importing the data.
I want to convert the values of the time column into minutes. At first, I tried:
df$time <- times(df$time)  # times() from the chron package
But I got a warning message:
"out of day time entry"
I notice that the value in the hour position is more than 24 in my dataset.
So what am I supposed to do now?
Thanks in advance!
You could use the lubridate package for this.
library(lubridate)
x <- hms(df$time)
(hour(x) * 60) + minute(x) + (second(x) / 60)
# [1] 7284.5 7830.0
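Equivalently (my addition), lubridate can convert the whole period to seconds in one step:
period_to_seconds(hms(df$time)) / 60
# [1] 7284.5 7830.0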
Assuming your data is saved as dat, use the following:
#convert to character
dat$time <- as.character(dat$time)
#split by ":"
times <- strsplit(dat$time, ":")
# get minutes
dat$time <- sapply(times, function(x){
  x <- as.numeric(x)
  x[1]*60 + x[2] + x[3]/60
})
Another option (just for fun) is to play around with the gsubfn package:
s <- factor(c("121:24:30", "130:30:00"))
library(gsubfn)
as.numeric(gsubfn("(\\d+):(\\d+):(\\d+)",
~ as.numeric(x)*60 + as.numeric(y) + as.numeric(z)/60,
as.character(s)))
## [1] 7284.5 7830.0

Calculate time difference between two events (given date and time) in R

I'm currently struggling with a beginner's issue regarding the calculation of a time difference between two events.
I want to take a column consisting of date and time (both values in one column) into consideration and calculate a time difference with the value of the previous/next row with the same ID (A or B in this example).
ID = c("A", "A", "B", "B")
time = c("08.09.2014 10:34","12.09.2014 09:33","13.08.2014 15:52","11.09.2014 02:30")
d = data.frame(ID,time)
My desired output is in the format Hours:Minutes
time difference = c("94:59","94:59","682:38","682:38")
The format Days:Hours:Minutes or anything similar would also work, as long as it could be conveniently implemented. I am flexible regarding the format of the output, the above is just an idea that crossed my mind.
For each ID, I always have two rows (in the example, 2xA and 2xB). I don't have a convincing idea how to avoid the repetition of the difference.
I've tried some examples I found on Stack Overflow. Most of them used POSIXt and strptime. However, I didn't manage to apply those ideas to my data set.
Here's my attempt using dplyr
library(dplyr)
d %>%
mutate(time = as.POSIXct(time, format = "%d.%m.%Y %H:%M")) %>%
group_by(ID) %>%
mutate(diff = paste0(gsub("[.].*", "", diff(time)*24), ":",
round(as.numeric(gsub(".*[.]", ".", diff(time)*24))*60)))
# Source: local data frame [4 x 3]
# Groups: ID
#
# ID time diff
# 1 A 2014-09-08 10:34:00 94:59
# 2 A 2014-09-12 09:33:00 94:59
# 3 B 2014-08-13 15:52:00 682:38
# 4 B 2014-09-11 02:30:00 682:38
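A more direct variant of the same computation (a sketch of mine using difftime rather than the gsub tricks; the minute rounding may need care right at a :59/:60 boundary):
library(dplyr)
d %>%
  mutate(time = as.POSIXct(time, format = "%d.%m.%Y %H:%M")) %>%
  group_by(ID) %>%
  mutate(hours = as.numeric(difftime(max(time), min(time), units = "hours")),
         diff  = sprintf("%d:%02d", as.integer(hours), as.integer(round((hours %% 1) * 60))))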
A very (to me) hack-ish base solution:
ID <- c("A", "A", "B", "B")
time <- c("08.09.2014 10:34", "12.09.2014 09:33", "13.08.2014 15:52","11.09.2014 02:30")
d <- data.frame(ID, time)
d$time <- as.POSIXct(d$time, format="%d.%m.%Y %H:%M")
unlist(unname(lapply(split(d, d$ID), function(d) {
sapply(abs(diff(c(d$time[2], d$time))), function(x) {
sprintf("%s:%s", round(((x*24)%/%1)), round(((x*24)%%1 *60)))
})
})))
## [1] "94:59" "94:59" "682:38" "682:38"
I have to believe this function exists somewhere already, though.
Similar to the attempts of David and hrmbrmstr, I found that this solution using difftime works.
I use a rowShift function I found on Stack Overflow:
rowShift <- function(x, shiftLen = 1L) {
  # shift a vector by shiftLen positions, padding with NA at the boundary
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA
  return(x[r])
}
d$time.c <- as.POSIXct(d$time, format = "%d.%m.%Y %H:%M")
d$time.prev <- rowShift(d$time.c,-1)
d$diff <- difftime(d$time.c,d$time.prev, units="hours")
Every other row of d$diff has positive/negative values in the result. I remove all the rows with negative values, which leaves the difference between the first and the last time for every ID.
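For instance (my addition), the negative and NA rows could be dropped like this, leaving one difference per ID:
d_keep <- d[which(d$diff > 0), c("ID", "diff")]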
