Issues in writing and reading datetime in R

Below is an example where I calculate the travel time between two points. I create a tibble with start_trip, duration, end_trip and time_diff = end_trip - start_trip. I wrote the tibble using write.csv() and read it again with read_csv().
library(sf)
library(tidygeocoder)
library(osrm)
library(lubridate)
library(dplyr)   # provides %>%
library(tibble)  # provides tibble()
library(readr)   # provides read_csv()
# 1. One World Trade Center, NYC
# 2. Madison Square Park, NYC
addresses <- c("285 Fulton St, New York, NY 10007",
               "11 Madison Ave, New York, NY 10010")
# geocode the two addresses & transform to {sf} data structure
data <- tidygeocoder::geo(addresses, method = "osm") %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326)
rownames(data) <- c("One World Trade Center", "Madison Square Park")
# calculate travel time from "One World Trade Center" to "Madison Square Park"
osroute <- osrmTable(src = data["One World Trade Center", ],
                     dst = data["Madison Square Park", ])
tbl_out <- tibble(start_trip = ymd_hms("2018-01-01 08:00:00", tz = "America/New_York"),
                  duration = osroute$durations,
                  end_trip = start_trip + 60 * osroute$durations,
                  time_diff = difftime(end_trip, start_trip, units = "mins"))
write.csv(tbl_out,
          "sample.csv")
tbl_in <- read_csv("sample.csv")
Here is the screenshot of tbl_out (used to write the data)
Here is the screenshot of tbl_in (read using read_csv)
Can someone help me fix some issues:
1. The datetime changed from EST to UTC after writing to the csv file and reading it back.
2. time_diff and duration have the same values, but time_diff is more readable. How can I add the unit mins to duration? I don't want to convert it to character and paste "min".
3. Is there a way to convert duration and time_diff to HH:MM and still perform basic operations like addition?
4. Follow-up on 3: in case we convert time to HH:MM, can I write 74.5 mins as 74:30?
5. time_diff in tbl_in, after reading from the csv file, does not have mins.

1. The time zone is not written to the file by write.csv, so when the file is read back in there is no clue what TZ it should be. I suggest you always write in one time zone (e.g., "UTC") and then explicitly cast it when you read it in. You can do this with attr<-:
now <- Sys.time()
now
# [1] "2021-08-27 14:09:59 EDT"
attr(now, "tzone") <- "UTC"
now
# [1] "2021-08-27 18:09:59 UTC"
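To make the round trip concrete, here is a minimal base-R sketch. It assumes the file is always written as UTC text; the one-row data frame, the column name and the temporary file are made up for illustration.

```r
# Hypothetical one-row table with a New York start time
tbl <- data.frame(
  start_trip = as.POSIXct("2018-01-01 08:00:00", tz = "America/New_York")
)

# Write the instant as UTC text so the file has one known time zone
out <- tbl
out$start_trip <- format(tbl$start_trip, "%Y-%m-%d %H:%M:%S", tz = "UTC")
csv_path <- tempfile(fileext = ".csv")
write.csv(out, csv_path, row.names = FALSE)

# Read it back, declare the text as UTC, then display it in New York time
back <- read.csv(csv_path, stringsAsFactors = FALSE)
back$start_trip <- as.POSIXct(back$start_trip, tz = "UTC")
attr(back$start_trip, "tzone") <- "America/New_York"

back$start_trip == tbl$start_trip  # TRUE: same instant as before the round trip
```

Changing the "tzone" attribute only changes how the instant is displayed; the underlying number of seconds is untouched.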
2. What you're seeing is a "difftime"-class object. You can convert any numeric vector into this; essentially all that the class does is change the print method used for it.
times <- c(10.4, 19)
times + 1
# [1] 11.4 20.0
times <- structure(times, class="difftime", units="mins")
times
# Time differences in mins
# [1] 10.4 19.0
times + 1
# Time differences in mins
# [1] 11.4 20.0
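As a sketch of the same idea with the documented constructor, base R's as.difftime() attaches the units for you, and units<- rescales the numbers when the unit changes. The duration values below are the made-up ones from the example above.

```r
duration <- c(10.4, 19)                     # plain numbers, taken to be minutes
d <- as.difftime(duration, units = "mins")  # same numbers, now printed with units

d + d                # arithmetic keeps the class and the units
units(d) <- "hours"  # switching units rescales the stored numbers
mins_again <- as.numeric(d, units = "mins")  # back to plain minutes
mins_again
```

Keeping the column as plain minutes in the CSV and applying as.difftime() after reading is one way to get the unit back.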
3. One implementation could be to use data.table::as.ITime:
library(data.table)
times <- c(10.41, 19.01)
as.ITime(60*times) # ITime assumes seconds for all numbers
# [1] "00:10:24" "00:19:00"
as.ITime(60*times) + 60
# [1] "00:11:24" "00:20:00"
Frankly, I think that may be the best way-forward, even if it gives you HH:MM:SS instead of just HH:MM (perhaps there's a way to customize that ... I don't know offhand). One benefit of this method is that it writes to CSVs as the HH:MM:SS format (which may or may not be perfect):
write.csv(data.frame(x = as.ITime(60*times)), "foo.csv")
readLines("foo.csv")
# [1] "\"\",\"x\"" "\"1\",00:10:24" "\"2\",00:19:00"
However, if that doesn't fit your needs, then below is a hack that might be sufficient for your needs. (Note that this, when saved to a CSV, is not saved in that format, it is saved as a number.)
format.my_difftime <- function(x, ..., digits = getOption("digits.my_difftime", 0)) {
  if (is.null(digits)) digits <- getOption("digits.my_difftime", 0) # idk why this is needed
  units <- attr(x, "units")
  if (!is.null(units)) {
    # multiplier that converts the stored number to minutes
    mult <- switch(units,
                   sec=, secs=1/60,
                   min=, mins=1,
                   hour=, hours=60,
                   day=, days=1440,
                   NA)
    if (is.na(mult)) {
      warning("Unrecognized units, ignoring: ", sQuote(units, q = FALSE))
      mult <- 1
    }
    x <- x * mult
  } # else assume 'mins'
  # whole minutes, then the fractional minute expressed in seconds
  fmt <- paste0("%02i:%0", digits+2+(digits>0), ".0", digits, "f")
  sprintf(fmt, as.integer(x), 60 * (x %% 1))
}
print.my_difftime <- function(x, ...) cat(format(x), "\n")
as.data.frame.my_difftime <- as.data.frame.difftime
units.my_difftime <- function(x) attr(x, "units")
`units<-.my_difftime` <- function(x, value) {
  attr(x, "units") <- value
  x
}
Demonstration:
times <- c(10.41, 19.01)
structure(times, class = "my_difftime", units = "sec")
# 00:10 00:19
structure(times, class = "my_difftime", units = "min")
# 10:25 19:01
structure(times, class = "my_difftime", units = "hour")
# 624:36 1140:36
options(digits.my_difftime = 3)
times <- structure(times, class = "my_difftime", units = "min")
times
# 10:24.600 19:00.600
data.frame(x = times)
# x
# 1 10:24.600
# 2 19:00.600
options(digits.my_difftime = 0)
data.frame(x = times)
# x
# 1 10:25
# 2 19:01
dput(data.frame(x = times))
# structure(list(x = structure(c(10.41, 19.01), class = "my_difftime", units = "min")), class = "data.frame", row.names = c(NA, -2L))
4. (Resolved in 3.)
5. Similar to #1, write.csv does not include units when it writes to a file; in fact, if it did, then read.csv (and most other CSV-reading functions) might presume that the column is character instead.
write.csv(data.frame(x = times), "foo.csv") # using my_difftime, not ITime
readLines("foo.csv")
# [1] "\"\",\"x\"" "\"1\",10.41" "\"2\",19.01"
The numbers are preserved.
In the case of this naïve implementation for my_difftime, though, the numbers never change, so the "units" attribute is merely for presentation. That is, when you calculate times, make sure that its units are always "minutes" (in my assumption of the OP/question) or something known.
From there, for TZ and for my_difftime, once somebody read.csv's the file, they are responsible for properly classing the column.
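For question 4 specifically (writing 74.5 mins as 74:30), a minimal base-R sketch could look like the following. The helper name fmt_min_sec is hypothetical and not part of any package; it rounds to whole seconds first, so a value such as 9.999 minutes carries over into the minutes field.

```r
# Hypothetical helper: format decimal minutes as "MM:SS"
fmt_min_sec <- function(mins) {
  total_sec <- round(mins * 60)  # round once, on whole seconds
  sprintf("%02d:%02d",
          as.integer(total_sec %/% 60),  # whole minutes
          as.integer(total_sec %% 60))   # leftover seconds
}

fmt_min_sec(c(74.5, 10.41, 19.01))
# [1] "74:30" "10:25" "19:01"
fmt_min_sec(9.999)
# [1] "10:00"
```

The result is character, so it is meant for display only; keep the numeric minutes for arithmetic.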

Related

Generating sequence of all days of a given year [duplicate]

Is there a simple R idiom for getting a sequence of all days in a given year? I can do the following which does ok, except for leap years:
dtt <- as.Date( paste( as.character(year), "-1-1", sep="") ) + seq( 0,364 )
I could, obviously, add a line to filter out any values in (year + 1) but I'm guessing there's a much shorter way to do this.
What about this:
R> length(seq( as.Date("2004-01-01"), as.Date("2004-12-31"), by="+1 day"))
[1] 366
R> length(seq( as.Date("2005-01-01"), as.Date("2005-12-31"), by="+1 day"))
[1] 365
R>
This uses nuttin' but base R to compute correctly on dates to give you your vector. If you want higher-level operators, look e.g. at lubridate or even my more rudimentary RcppBDT which wraps parts of the Boost Time_Date library.
Using Dirk's guidance I've settled on this:
getDays <- function(year){
  seq(as.Date(paste(year, "-01-01", sep="")),
      as.Date(paste(year, "-12-31", sep="")),
      by="+1 day")
}
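A small variant of the same idea, as a sketch: instead of hard-coding December 31, stop one day before January 1 of the following year. The function name days_of_year is made up for this example.

```r
# Hypothetical variant of getDays(): stop one day before next year's January 1
days_of_year <- function(year) {
  first <- as.Date(sprintf("%d-01-01", as.integer(year)))
  next_first <- as.Date(sprintf("%d-01-01", as.integer(year) + 1L))
  seq(first, next_first - 1, by = "day")
}

length(days_of_year(2004))  # 366 (leap year)
length(days_of_year(2005))  # 365
```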
I'd be interested to know if it would be faster to invert the sequencing and the casting as.Date:
# My function getDays
getDays_1 <- function(year) {
  d1 <- as.Date(paste(year, '-01-01', sep = ''))
  d2 <- as.Date(paste(year, '-12-31', sep = ''))
  as.Date(d1:d2, origin = '1970-01-01')
}
# other getDays
getDays_2 <- function(year) {
  seq(as.Date(paste(year, '-01-01', sep='')),
      as.Date(paste(year, '-12-31', sep='')),
      by = '+1 day')
}
test_getDays_1 <- function(n = 10000) {
  for(i in 1:n) {
    getDays_1(2000)
  }
}
test_getDays_2 <- function(n = 10000) {
  for(i in 1:n) {
    getDays_2(2000)
  }
}
system.time(test_getDays_1())
#    user  system elapsed
#    4.80    0.00    4.81
system.time(test_getDays_2())
#    user  system elapsed
#    4.52    0.00    4.53
I guess not... it appears that sequencing Date objects is slightly faster than converting a vector of integers to Dates.
I needed something similar; however, for a range of dates I wanted to know the number of days in each date's year. I came up with the following function, which returns a vector with the same length as the dates in the input.
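library(lubridate)  # for year()
days_in_year <- function(dates) {
  years <- year(dates)
  # count the days of every year between the earliest and the latest input year
  days <- table(year(seq(as.Date(paste0(min(years), '-01-01')),
                         as.Date(paste0(max(years), '-12-31')),
                         by = '+1 day')))
  as.vector(days[as.character(years)])
}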
It works similarly to Dirk's solution; however, it uses the lubridate::year function to get the year part of all dates twice. Using table does the same as length, but for all unique years at once. It might use more memory than strictly necessary if the dates are not in consecutive years.
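If memory is a concern, one alternative sketch is to skip the date sequence altogether and apply the Gregorian leap-year rule to the year number. The function name days_in_year_rule is hypothetical, and this uses only base R.

```r
# Hypothetical alternative to days_in_year(): apply the Gregorian leap-year rule
days_in_year_rule <- function(dates) {
  yr <- as.integer(format(as.Date(dates), "%Y"))
  # leap if divisible by 4 but not by 100, unless also divisible by 400
  is_leap <- (yr %% 4 == 0 & yr %% 100 != 0) | (yr %% 400 == 0)
  ifelse(is_leap, 366L, 365L)
}

days_in_year_rule(as.Date(c("1900-03-01", "2000-03-01", "2003-07-04", "2004-07-04")))
# [1] 365 366 365 366
```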

Converting date to numeric but limiting to number of days in year

I want to create around 10K date objects between 2008-01-01 and 2010-12-31. I wrote the code for that, but I actually want the days numbered 1-366 in 2008 (a leap year, because of 2008-02-29) and then to restart at 1 on 2009-01-01. I could generate 2008, 2009 and 2010 separately, but that isn't convenient. I was reading about lubridate but could not figure it out. I could also filter 1 to 366 and then 367 to 731, but that won't be efficient either. Does anyone know a better way to do it?
set.seed(123)
tim1=sample(365*3+1,10000,replace = TRUE) ### that plus 1 from feb 29 in 2008
dat1=as.Date(tim1,origin="2007-12-31") # then 1 will be 2008-01-01
You can create a vector of all the target dates and sample from it. To create the vector, there is seq.Date, the seq method for objects of class "Date".
start <- as.Date("2008-01-01")
end <- as.Date("2010-12-31")
s <- seq(start, end, by = "days")
The vector s includes all days between start and end. Now sample from it.
set.seed(123)
dat1 <- sample(s, 10000, TRUE)
Transform the sample into day-of-the-year. See help("strptime")
as.numeric(format(dat1, format = "%j"))
In the end, remove s, it's no longer needed.
rm(s) # tidy up
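As a quick cross-check of the "%j" approach, the sketch below compares it with the yday field of POSIXlt, which counts days from 0 and therefore needs + 1. It rebuilds the same sample as above; the variable names doy_format and doy_lt are made up.

```r
set.seed(123)
s <- seq(as.Date("2008-01-01"), as.Date("2010-12-31"), by = "days")
dat1 <- sample(s, 10000, TRUE)

doy_format <- as.numeric(format(dat1, format = "%j"))  # text "001".."366" -> number
doy_lt <- as.POSIXlt(dat1)$yday + 1                    # yday counts from 0

all(doy_format == doy_lt)  # TRUE
range(doy_format)          # stays within 1..366
```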
Edit.
The following two functions do what the question asks for but with two different methods.
f1 is the code above wrapped in a function; f2 uses ave/seq_along/match and is a bit more complicated. The tests show function f2 to be twice as fast as f1.
f1 <- function(start_date, end_date, n){
  start <- as.Date(start_date)
  end <- as.Date(end_date)
  s <- seq(start, end, by = "days")
  y <- sample(s, n, replace = TRUE)
  as.numeric(format(y, format = "%j"))
}
f2 <- function(start_date, end_date, n){
  start <- as.Date(start_date)
  end <- as.Date(end_date)
  s <- seq(start, end, by = "days")
  y <- sample(s, n, replace = TRUE)
  z <- ave(as.integer(s), lubridate::year(s), FUN = seq_along)
  z[match(y, s)]
}
set.seed(123)
x1 <- f1("2008-01-01", "2010-12-31", 100)
set.seed(123)
x2 <- f2("2008-01-01", "2010-12-31", 100)
all.equal(x1, x2)
#[1] TRUE
Now the tests.
library(microbenchmark)
mb <- microbenchmark(
  f1 = f1("2008-01-01", "2010-12-31", 1e4),
  f2 = f2("2008-01-01", "2010-12-31", 1e4),
  times = 50
)
print(mb, order = "median")
ggplot2::autoplot(mb)

Daily Arithmetic and daily Geometric Averages for each year (AAPL)

I am trying to obtain the daily arithmetic and daily geometric averages for each year for the AAPL stock data using R. My attempt uses the periodReturn function in the last few lines, but it doesn't work and gives the error: '...' used in an incorrect context.
How can I modify my code such that I can get the desired output? Some help will be deeply appreciated.
library(quantmod)
# Get historical price data (daily)
getSymbols('AAPL', from = "2005-01-01")
AAPLdaily <- as.data.frame(AAPL)
head(AAPLdaily)
?to.period
# AAPLweekly <- to.weekly(to.weekly(AAPL, indexAt = 'endof'))
# AAPLweekly <- as.data.frame(AAPLweekly)
# Better to do it in one step like this:
AAPLweekly <- as.data.frame( to.weekly(AAPL, indexAt = 'endof') )
head(AAPLweekly)
AAPLmonthly <- as.data.frame( to.monthly(AAPL, indexAt = 'endof') )
head(AAPLmonthly)
AAPLyearly <- as.data.frame( to.yearly(AAPL, indexAt = 'endof') )
AAPLyearly
# Another way to do this
AAPLweekly1 <- as.data.frame(to.period(AAPL, period = 'weeks', indexAt = 'endof'))
head(AAPLweekly1)
AAPLmonthly1 <- as.data.frame(to.period(AAPL, period = 'months', indexAt = 'endof'))
head(AAPLmonthly1)
AAPLyearly1 <- as.data.frame(to.period(AAPL, period = 'years', indexAt = 'endof'))
head(AAPLyearly1)
########## Another possible method #########
# Change to data.frames
AAPL = as.data.frame(AAPL)
head(AAPL)
# Get Dates
dates <- as.Date(row.names(AAPL))
head(dates)
# Create a column in AAPL data frame with the dates
AAPL$dates <- as.Date(row.names(AAPL))
?aggregate
?substr
# Last Day of month
lastDayofMonth <- aggregate(AAPL["dates"], list(month = substr(AAPL$dates, 1, 7)), max)
head(lastDayofMonth)
AAPLmonth <- AAPL[dates %in% lastDayofMonth$dates, ]
head(AAPLmonth)
# Last day of year
lastDayofYear <- aggregate(AAPL["dates"], list(month = substr(AAPL$dates, 1, 4)), max)
head(lastDayofYear)
AAPLyear <- AAPL[dates %in% lastDayofYear$dates, ]
AAPLmonth
AAPLdaily <- as.data.frame( to.daily(AAPL, indexAt = 'endof') )
AAPLdaily
dailyReturn(AAPLdaily)
periodReturn(AAPL,
period='daily',
subset=NULL,
type='arithmetic',
leading=TRUE,
...
)
If what you are asking for is the yearly, monthly or weekly arithmetic/geometric return, all you have to do is:
library(quantmod)
getSymbols('AAPL', from = '2010-01-01')
ROC(AAPL[endpoints(AAPL, on = 'years'), "AAPL.Adjusted"], type = 'discrete')
2012-12-31 0.32566879
2013-12-31 0.08069493
2014-12-31 0.40622488
2015-12-31 -0.03013708
2016-12-30 0.12480425
2017-09-20 0.36428706
For the geometric (log) return, change the type argument of ROC to 'continuous':
ROC(AAPL[endpoints(AAPL, on = 'years'), "AAPL.Adjusted"], type = 'continuous')
2012-12-31 0.28191708
2013-12-31 0.07760429
2014-12-31 0.34090873
2015-12-31 -0.03060053
2016-12-30 0.11760902
2017-09-20 0.31063199
For other periods, change the on argument of endpoints to 'months' or 'weeks'.
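The question also asks for the average of the daily returns within each year, which is a slightly different quantity from the year-on-year change. A minimal base-R sketch of that calculation follows; the prices are made up for illustration (not real AAPL data), and with real data the price vector would presumably come from the adjusted close column.

```r
# Made-up daily closing prices for illustration (not real AAPL data)
dates <- as.Date(c("2015-12-29", "2015-12-30", "2015-12-31",
                   "2016-01-04", "2016-01-05", "2016-01-06"))
price <- c(100, 102, 101, 103, 104, 102)

# Daily simple returns; the first day has no previous price
ret <- price[-1] / price[-length(price)] - 1
yr <- format(dates[-1], "%Y")

# Arithmetic mean of the daily returns, per year
arith <- tapply(ret, yr, mean)
# Geometric mean of the daily returns, per year: prod(1 + r)^(1/n) - 1
geom <- tapply(ret, yr, function(r) prod(1 + r)^(1 / length(r)) - 1)

rbind(arith, geom)
```

The geometric mean is never larger than the arithmetic mean, and the gap grows with the volatility of the daily returns.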

Determine season from Date using lubridate in R

I have a very big dataset with a DateTime Column containing POSIXct-Values. I need to determine the season (Winter - Summer) based on the DateTime column. I've created a function which works fine on a small dataset, but crashes when I use it on the large one. Can anybody see my mistake?
I've created 4 functions:
- 3 subfunctions so that I can do logical comparisons and selection using *apply functions
- 1 function to determine the season
Here are the functions:
require(lubridate)
# function for logical comparison (to be used in *apply)
greaterOrEqual <- function(x,y){
  ifelse(x >= y,T,F)
}
# function for logical comparison (to be used in *apply)
less <- function(x,y){
  ifelse(x < y,T,F)
}
# function for logical comparison (to be used in *apply)
selFromLogic <- function(VecLogic,VecValue){
  VecValue[VecLogic]
}
# Main Function to determine the season
getTwoSeasons <- function(input.date) {
  Winter1Start <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")
  Winter1End <- as.POSIXct("2000-04-15 23:59:59", tz = "UTC")
  SummerStart <- Winter1End + 1
  SummerEnd <- as.POSIXct("2000-10-15 23:59:59", tz = "UTC")
  Winter2Start <- SummerEnd + 1
  Winter2End <- as.POSIXct("2000-12-31 00:00:00", tz = "UTC")
  year(input.date) <- year(Winter1Start)
  attr(input.date, "tzone") <- attr(Winter1Start, "tzone")
  SeasonStart <- c(Winter1Start,SummerStart,Winter2Start)
  SeasonsEnd <- c(Winter1End,SummerEnd,Winter2End)
  Season_names <- as.factor(c("WinterHalfYear","SummerHalfYear","WinterHalfYear"))
  Season_select <- sapply(SeasonStart, greaterOrEqual, x = input.date) &
    sapply(SeasonsEnd, less, x = input.date)
  Season_return <- apply(Season_select,MARGIN = 1,selFromLogic,VecValue = Season_names)
  return(Season_return)
}
And here's a way to test the function:
dates <- Sys.time() + seq(0,10000,10)
getTwoSeasons(dates)
I would be thankful for any help, this is driving me crazy!
And if you're interested in getting back four seasons, here's code to do that:
library(lubridate)
getSeason <- function(input.date){
  numeric.date <- 100*month(input.date)+day(input.date)
  ## input Seasons upper limits in the form MMDD in the "breaks =" option:
  cuts <- base::cut(numeric.date, breaks = c(0,319,0620,0921,1220,1231))
  # rename the resulting groups (could've been done within cut(...levels=) if "Winter" wasn't double
  levels(cuts) <- c("Winter","Spring","Summer","Fall","Winter")
  return(cuts)
}
Unit Test:
getSeason(as.POSIXct("2016-01-01 12:00:00")+(0:365)*(60*60*24))
For completeness, worth noting that lubridate now has a quarter (and a semester) function. quarter splits the year into fourths and semester into halves:
library(lubridate)
quarter(x, with_year = FALSE, fiscal_start = 1)
semester(x, with_year = FALSE)
For more, see: https://www.rdocumentation.org/packages/lubridate/versions/1.7.4/topics/quarter
I packaged @Lars Arne Jordanger's much more elegant approach into a function:
getTwoSeasons <- function(input.date){
  numeric.date <- 100*month(input.date)+day(input.date)
  ## input Seasons upper limits in the form MMDD in the "breaks =" option:
  cuts <- base::cut(numeric.date, breaks = c(0,415,1015,1231))
  # rename the resulting groups (could've been done within cut(...levels=) if "Winter" wasn't double
  levels(cuts) <- c("Winter", "Summer","Winter")
  return(cuts)
}
Testing it on some sample data seems to work fine:
getTwoSeasons(as.POSIXct("2016-01-01 12:00:00")+(0:365)*(60*60*24))
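For comparison, here is a base-R sketch of the same two-season split that avoids lubridate and the duplicated factor level. The function name two_seasons_base is hypothetical; the cut-offs (summer from April 16 through October 15) are the ones used above.

```r
# Hypothetical base-R variant: compare an MMDD number against the two cut-offs
two_seasons_base <- function(input.date) {
  mmdd <- as.integer(format(input.date, "%m%d"))  # e.g. April 15 -> 415
  ifelse(mmdd > 415 & mmdd <= 1015, "Summer", "Winter")
}

two_seasons_base(as.Date(c("2016-04-15", "2016-04-16", "2016-10-15", "2016-10-16")))
# [1] "Winter" "Summer" "Summer" "Winter"
```

It returns a character vector rather than a factor; wrap the result in factor() if a factor is needed.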
After several hours of debugging I've found my mistake, and it's quite absurd really:
If a season for a DateTime value was not found, apply returned a list object instead of a vector (this was the case when the DateTime value equalled 2000-12-31 00:00:00). Returning a list caused a disproportionate increase in computation time and the described crashes. Here's the fixed code:
# input date and return 2 season
getTwoSeasons <- function(input.date) {
  Winter1Start <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")
  Winter1End <- as.POSIXct("2000-04-15 23:59:59", tz = "UTC")
  SummerStart <- Winter1End + 1
  SummerEnd <- as.POSIXct("2000-10-15 23:59:59", tz = "UTC")
  Winter2Start <- SummerEnd + 1
  Winter2End <- as.POSIXct("2001-01-01 00:00:01", tz = "UTC")
  SeasonStart <- c(Winter1Start,SummerStart,Winter2Start)
  SeasonsEnd <- c(Winter1End,SummerEnd,Winter2End)
  Season_names <- factor(c("WinterHalf","SummerHalf","WinterHalf"))
  year(input.date) <- year(Winter1Start)
  attr(input.date, "tzone") <- attr(Winter1Start, "tzone")
  Season_selectStart <- vapply(X = SeasonStart,
                               function(x,y){x <= input.date},
                               FUN.VALUE = logical(length(input.date)),
                               y = input.date)
  Season_selectEnd <- vapply(X = SeasonsEnd,
                             function(x,y){x > input.date},
                             FUN.VALUE = logical(length(input.date)),
                             y = input.date)
  Season_selectBoth <- Season_selectStart & Season_selectEnd
  Season_return <- apply(Season_selectBoth,MARGIN = 1,function(x,y){y[x]}, y = Season_names)
  return(Season_return)
}
The "sub"-functions are now integrated in the main function and two sapply functions replaced with vapply.
PS: There is still an issue with the timezone, since c() strips the timezone away. I'll update the code when I fix it.
The following strategy can also be used. The basic observation is that substr can extract the month and day information we need in order to decide whether it's summer or winter. The idea is then to convert this to numbers of the form month.day, and the test for being summer then boils down to having a number larger than 4.15 but smaller than 10.16. The example below shows how this can be done: a vector of dates is first transformed into the alternative representation described above, and then a logical vector is created that tells whether it is summer (TRUE) or winter (FALSE).
DateTime <- as.POSIXct(x = "2000-01-01 00:00:00",
                       tz = "UTC") +
  (0:1000)*(60*60*24)
DateTime_2 <- as.numeric(paste(
  substr(x = DateTime,
         start = 6,
         stop = 7),
  substr(x = DateTime,
         start = 9,
         stop = 10),
  sep = "."))
.season <- (DateTime_2 > 4.15) & (DateTime_2 < 10.16)
Use POSIXlt instead of POSIXct.
I made my own function depending on the definition of seasons that I am using. I created vectors named normal for a non-leap year and leap for a leap year, with each season name repeated the number of times it appears starting from Jan 1, and created the following function.
library(lubridate)  # for leap_year() and year()
SEASON <- function(datee){
  datee <- as.POSIXlt(datee)
  season <- vector()
  # season label for every day of the year, starting from Jan 1
  normal <- rep(c("Winter","Spring","Summer","Monsoon","Autumn","Winter"), c(46,44,91,77,76,31))
  leap <- rep(c("Winter","Spring","Summer","Monsoon","Autumn","Winter"), c(46,45,91,77,76,31))
  if(leap_year(year(datee)) == FALSE){
    season <- normal[datee$yday+1]
  } else {
    season <- leap[datee$yday+1]
  }
  return(season)
}
Let's put it to the test on some data.
Dates <- seq(as.POSIXct("2000-01-01"), as.POSIXct("2010-01-01"), by= "day")
sapply(Dates, SEASON)
It works.
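SEASON handles one date per call because of the if() test, which is why sapply is needed. A vectorised sketch of the same lookup, using the same day counts and a base-R leap-year rule in place of lubridate, might look like this; the function name season_vec is made up.

```r
# Hypothetical vectorised variant of SEASON(): one lookup for a whole vector
season_vec <- function(dates) {
  lt <- as.POSIXlt(dates)
  yr <- lt$year + 1900  # POSIXlt stores years since 1900
  is_leap <- (yr %% 4 == 0 & yr %% 100 != 0) | (yr %% 400 == 0)
  labels <- c("Winter", "Spring", "Summer", "Monsoon", "Autumn", "Winter")
  normal <- rep(labels, c(46, 44, 91, 77, 76, 31))  # 365 entries
  leap   <- rep(labels, c(46, 45, 91, 77, 76, 31))  # 366 entries
  idx <- lt$yday + 1    # yday counts from 0
  ifelse(is_leap, leap[idx], normal[idx])
}

season_vec(as.Date(c("2000-02-15", "2000-02-16", "2001-03-31", "2001-04-01")))
# [1] "Winter" "Spring" "Spring" "Summer"
```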
