Determine season from Date using lubridate in R

I have a very big dataset with a DateTime Column containing POSIXct-Values. I need to determine the season (Winter - Summer) based on the DateTime column. I've created a function which works fine on a small dataset, but crashes when I use it on the large one. Can anybody see my mistake?
I've created 4 functions:
3 subfunctions so that I can do logical comparisons and selection using *apply functions
1 function to determine the season
Here are the functions:
require(lubridate)
# function for logical comparison (to be used in *apply)
greaterOrEqual <- function(x,y){
ifelse(x >= y,T,F)
}
# function for logical comparison (to be used in *apply)
less <- function(x,y){
ifelse(x < y,T,F)
}
# function for logical comparison (to be used in *apply)
selFromLogic <- function(VecLogic,VecValue){
VecValue[VecLogic]
}
# Main Function to determine the season
getTwoSeasons <- function(input.date) {
Winter1Start <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")
Winter1End <- as.POSIXct("2000-04-15 23:59:59", tz = "UTC")
SummerStart <- Winter1End + 1
SummerEnd <- as.POSIXct("2000-10-15 23:59:59", tz = "UTC")
Winter2Start <- SummerEnd + 1
Winter2End <- as.POSIXct("2000-12-31 00:00:00", tz = "UTC")
year(input.date) <- year(Winter1Start)
attr(input.date, "tzone") <- attr(Winter1Start, "tzone")
SeasonStart <- c(Winter1Start,SummerStart,Winter2Start)
SeasonsEnd <- c(Winter1End,SummerEnd,Winter2End)
Season_names <- as.factor(c("WinterHalfYear","SummerHalfYear","WinterHalfYear"))
Season_select <- sapply(SeasonStart, greaterOrEqual, x = input.date) & sapply(SeasonsEnd, less, x = input.date)
Season_return <- apply(Season_select,MARGIN = 1,selFromLogic,VecValue = Season_names)
return(Season_return)
}
And here's a way to test the function:
dates <- Sys.time() + seq(0,10000,10)
getTwoSeasons(dates)
I would be thankful for any help, this is driving me crazy!

And if you're interested in getting back four seasons, here's code to do that:
library(lubridate)
getSeason <- function(input.date){
numeric.date <- 100*month(input.date)+day(input.date)
## input Seasons upper limits in the form MMDD in the "break =" option:
cuts <- base::cut(numeric.date, breaks = c(0,319,0620,0921,1220,1231))
# rename the resulting groups (could've been done within cut(...levels=) if "Winter" wasn't double
levels(cuts) <- c("Winter","Spring","Summer","Fall","Winter")
return(cuts)
}
Unit Test:
getSeason(as.POSIXct("2016-01-01 12:00:00")+(0:365)*(60*60*24))

For completeness, worth noting that lubridate now has a quarter (and a semester) function. quarter splits the year into fourths and semester into halves:
library(lubridate)
quarter(x, with_year = FALSE, fiscal_start = 1)
semester(x, with_year = FALSE)
For more, see: https://www.rdocumentation.org/packages/lubridate/versions/1.7.4/topics/quarter

I packaged @Lars Arne Jordanger's much more elegant approach into a function:
getTwoSeasons <- function(input.date){
numeric.date <- 100*month(input.date)+day(input.date)
## input Seasons upper limits in the form MMDD in the "break =" option:
cuts <- base::cut(numeric.date, breaks = c(0,415,1015,1231))
# rename the resulting groups (could've been done within cut(...levels=) if "Winter" wasn't double
levels(cuts) <- c("Winter", "Summer","Winter")
return(cuts)
}
Testing it on some sample data seems to work fine:
getTwoSeasons(as.POSIXct("2016-01-01 12:00:00")+(0:365)*(60*60*24))
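The boundary behaviour of the cut() approach is worth checking, because cut() uses right-closed intervals by default, so the break value itself (for example 415, April 15) falls into the lower group. The following is a minimal base-R sketch of the same idea; the name two_seasons and the sample dates are made up for illustration, and it uses as.POSIXlt instead of lubridate only to stay self-contained.

```r
# Base-R variant of the cut() approach; two_seasons() is a hypothetical name.
two_seasons <- function(x) {
  lt <- as.POSIXlt(x, tz = "UTC")
  mmdd <- 100 * (lt$mon + 1) + lt$mday   # e.g. 415 for April 15
  # cut() intervals are right-closed by default:
  # (0,415] -> Winter, (415,1015] -> Summer, (1015,1231] -> Winter
  idx <- cut(mmdd, breaks = c(0, 415, 1015, 1231), labels = FALSE)
  c("Winter", "Summer", "Winter")[idx]
}

# Dates on either side of the two season boundaries
boundary <- as.POSIXct(c("2016-04-15 12:00:00", "2016-04-16 12:00:00",
                         "2016-10-15 12:00:00", "2016-10-16 12:00:00"),
                       tz = "UTC")
two_seasons(boundary)
```

With these breaks, April 15 and October 16 are classified as winter, while April 16 and October 15 are classified as summer, which matches the season limits in the question.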

After several hours of debugging I've found my mistake, and it's quite absurd really:
If a season for a DateTime value was not found, apply returned a list object instead of a vector (this was the case when the DateTime value equalled 2000-12-31 00:00:00). Returning a list caused a disproportionate increase in computation time and the described crashes. Here's the fixed code:
# input date and return 2 season
getTwoSeasons <- function(input.date) {
Winter1Start <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")
Winter1End <- as.POSIXct("2000-04-15 23:59:59", tz = "UTC")
SummerStart <- Winter1End + 1
SummerEnd <- as.POSIXct("2000-10-15 23:59:59", tz = "UTC")
Winter2Start <- SummerEnd + 1
Winter2End <- as.POSIXct("2001-01-01 00:00:01", tz = "UTC")
SeasonStart <- c(Winter1Start,SummerStart,Winter2Start)
SeasonsEnd <- c(Winter1End,SummerEnd,Winter2End)
Season_names <- factor(c("WinterHalf","SummerHalf","WinterHalf"))
year(input.date) <- year(Winter1Start)
attr(input.date, "tzone") <- attr(Winter1Start, "tzone")
Season_selectStart <- vapply(X = SeasonStart,function(x,y){x <= input.date},FUN.VALUE = logical(length(input.date)),y = input.date)
Season_selectEnd <- vapply(X = SeasonsEnd,function(x,y){x > input.date},FUN.VALUE = logical(length(input.date)),y = input.date)
Season_selectBoth <- Season_selectStart & Season_selectEnd
Season_return <- apply(Season_selectBoth,MARGIN = 1,function(x,y){y[x]}, y = Season_names)
return(Season_return)
}
The "sub"-functions are now integrated in the main function and two sapply functions replaced with vapply.
PS: There is still an issue with the timezone, since c() strips the timezone away. I'll update the code when I fix it.
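The list-versus-vector behaviour described in this answer can be reproduced on a tiny example. This is only an illustrative sketch: the matrices and the helper pick are made up, and character labels are used instead of a factor to keep the output simple.

```r
# Each row of the logical matrix selects entries from `labels`.
labels <- c("WinterHalf", "SummerHalf", "WinterHalf")
pick <- function(sel) labels[sel]

# Every row matches exactly one season, so all results have length 1
# and apply() simplifies them to a character vector.
all_match <- rbind(c(TRUE, FALSE, FALSE),
                   c(FALSE, TRUE, FALSE))
res_vec <- apply(all_match, 1, pick)

# The second row matches nothing, so pick() returns character(0) for it.
# The results now differ in length and apply() falls back to a list.
one_missing <- rbind(c(TRUE, FALSE, FALSE),
                     c(FALSE, FALSE, FALSE))
res_list <- apply(one_missing, 1, pick)

is.character(res_vec)
is.list(res_list)
```

A single unmatched row is enough to change the return type for the whole input, which is why the problem only showed up on the large dataset.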

The following strategy can also be used: The basic observation is that
substr can extract the month and day information we need in order to
decide if it's summer or winter. The idea is then to convert this to
numbers of the form month.day, and the test for being summer then
boils down to having a number larger than 4.15 but smaller than 10.16.
The example below shows how this can be done: a vector of dates is
first transformed into the alternative representation described
above, and then a logical vector that tells if it is summer (TRUE) or
winter (FALSE) is created based on this.
DateTime <- as.POSIXct(x = "2000-01-01 00:00:00",
tz = "UTC") +
(0:1000)*(60*60*24)
DateTime_2 <- as.numeric(paste(
substr(x = DateTime,
start = 6,
stop = 7),
substr(x = DateTime,
start = 9,
stop = 10),
sep = "."))
.season <- (DateTime_2 > 4.15) & (DateTime_2 < 10.16)

Use POSIXlt instead of POSIXct.
I made my own function depending on the definition of seasons that I am using. I created vectors named normal for a non-leap year and leap for leap year with each season name repeated the no. of times it appears starting from Jan 1. And created the following function.
SEASON <- function(datee){
datee <- as.POSIXlt(datee)
season <- vector()
normal <- rep(c("Winter","Spring","Summer","Monsoon","Autumn","Winter"), c(46,44,91,77,76,31))
leap <- rep(c("Winter","Spring","Summer","Monsoon","Autumn","Winter"), c(46,45,91,77,76,31))
if(leap_year(year(datee)) == FALSE){
season <- normal[datee$yday+1]
} else {
season <- leap[datee$yday+1]
}
return(season)
}
Let's put it to test for some dataset.
Dates <- seq(as.POSIXct("2000-01-01"), as.POSIXct("2010-01-01"), by= "day")
sapply(Dates, SEASON)
It works.

Related

Generating sequence of all days of a given year [duplicate]

Is there a simple R idiom for getting a sequence of all days in a given year? I can do the following which does ok, except for leap years:
dtt <- as.Date( paste( as.character(year), "-1-1", sep="") ) + seq( 0,364 )
I could, obviously, add a line to filter out any values in (year + 1) but I'm guessing there's a much shorter way to do this.
What about this:
R> length(seq( as.Date("2004-01-01"), as.Date("2004-12-31"), by="+1 day"))
[1] 366
R> length(seq( as.Date("2005-01-01"), as.Date("2005-12-31"), by="+1 day"))
[1] 365
R>
This uses nuttin' but base R to compute correctly on dates to give you your vector. If you want higher-level operators, look e.g. at lubridate or even my more rudimentary RcppBDT which wraps parts of the Boost Time_Date library.
Using Dirk's guidance I've settled on this:
getDays <- function(year){
seq(as.Date(paste(year, "-01-01", sep="")), as.Date(paste(year, "-12-31", sep="")), by="+1 day")
}
I'd be interested to know if it would be faster to invert the sequencing and the casting as.Date:
# My function getDays
getDays_1 <- function(year) {
d1 <- as.Date(paste(year, '-01-01', sep = ''));
d2 <- as.Date(paste(year, '-12-31', sep = ''));
as.Date(d1:d2, origin = '1970-01-01');
};
# other getDays
getDays_2 <- function(year) {
seq(as.Date(paste(year, '-01-01', sep='')),
as.Date(paste(year, '-12-31', sep='')),
by = '+1 day');
};
test_getDays_1 <- function(n = 10000) {
for(i in 1:n) {
getDays_1(2000);
};
};
test_getDays_2 <- function(n = 10000) {
for(i in 1:n) {
getDays_2(2000);
};
};
system.time(test_getDays_1());
# user system elapsed
# 4.80 0.00 4.81
system.time(test_getDays_2());
# user system elapsed
# 4.52 0.00 4.53
I guess not . . . it appears that sequencing Date objects is slightly faster than converting a vector of integers to Dates.
I needed something similar, however for a range of dates I want to know the number of days in that year. I came up with the following function, which returns a vector with the same length as the dates in the input.
days_in_year <- function(dates) {
years <- year(dates)
days <- table(year(seq(as.Date(paste0(min(years), '-01-01')),
as.Date(paste0(max(years), '-12-31')),
by = '+1 day')))
as.vector(days[as.character(years)])
}
It works similarly to Dirk's solution, but it calls lubridate::year twice to get the year part of the dates. Using table does the same as length, but for all unique years at once. It might use more memory than strictly necessary if the dates are not in consecutive years.
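If only the number of days per year is needed, the full daily sequence can be avoided by subtracting January 1 of the year from January 1 of the following year. This is an alternative sketch in base R, not the approach used above; the name days_in_year_base is hypothetical.

```r
# Days in the calendar year of each input date, without building a sequence.
# days_in_year_base() is a hypothetical name used for illustration.
days_in_year_base <- function(dates) {
  yrs <- as.integer(format(as.Date(dates), "%Y"))
  first_this <- as.Date(paste0(yrs, "-01-01"))
  first_next <- as.Date(paste0(yrs + 1, "-01-01"))
  as.integer(first_next - first_this)   # difference of Dates is in days
}

n_days <- days_in_year_base(as.Date(c("2004-03-01", "2005-03-01")))
n_days
```

For a leap year such as 2004 this gives 366, and for 2005 it gives 365.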

Issues in writing and reading datetime in R

Below is an example where I calculate the travel time between two points. I create a tibble with start time, duration, end_time and time_diff = end_time - start_time. I wrote the tibble using write.csv() and read it again with read_csv().
library(sf)
library(tidygeocoder)
library(osrm)
library(lubridate)
library(dplyr)   # provides %>%
library(tibble)  # provides tibble()
library(readr)   # provides read_csv()
# 1. One World Trade Center, NYC
# 2. Madison Square Park, NYC
adresses <- c("285 Fulton St, New York, NY 10007",
              "11 Madison Ave, New York, NY 10010")
# geocode the two addresses & transform to {sf} data structure
data <- tidygeocoder::geo(adresses, method = "osm") %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326)
# row names must match the names used for subsetting below
rownames(data) <- c("One World Trade Center", "Madison Square Park")
# calculate travel time from "One World Trade Center" to "Madison Square Park"
osroute <- osrmTable(src = data["One World Trade Center", ],
                     dst = data["Madison Square Park", ])
tbl_out <- tibble(start_trip = ymd_hms("2018-01-01 08:00:00", tz = "America/New_York"),
                  duration = osroute$durations,
                  end_trip = start_trip + 60 * osroute$durations,
                  time_diff = difftime(end_trip, start_trip, units = "mins"))
write.csv(tbl_out,
          "sample.csv")
tbl_in <- read_csv("sample.csv")
Here is the screenshot of tbl_out (used to write the data)
Here is the screenshot of tbl_in (read using read_csv)
Can someone help me fix some issues:
1. The datetime changed from EST to UTC after writing to the csv file and reading it back.
2. time_diff and duration have the same values, but time_diff is more readable. How can I add the unit mins to duration? I don't want to convert it to character and paste "min".
3. Is there a way to convert duration and time_diff to HH:MM and still perform basic operations like addition?
4. Follow-up on 3: in case we convert the time to HH:MM, can I write 74.5 mins as 74:30?
5. time_diff in tbl_in, after reading from the csv file, does not have the unit min.
The time's timezone is not written to file when you write.csv, so when reading it in there is no clue what TZ it should be. I suggest you always write in one timezone (e.g., "UTC"), and then explicitly cast it when you read it in. You can do this with attr<-:
now <- Sys.time()
now
# [1] "2021-08-27 14:09:59 EDT"
attr(now, "tzone") <- "UTC"
now
# [1] "2021-08-27 18:09:59 UTC"
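The write-in-UTC, cast-on-read suggestion can be sketched in base R as follows. The temporary file and the column name start_trip are placeholders, and the sketch assumes the "America/New_York" time zone is available on the system.

```r
start_trip <- as.POSIXct("2018-01-01 08:00:00", tz = "America/New_York")

# Write the instant as UTC text so the file content is unambiguous.
path <- tempfile(fileext = ".csv")
utc_text <- format(start_trip, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
write.csv(data.frame(start_trip = utc_text), path, row.names = FALSE)

# Read it back as text, parse it as UTC, then change only the display zone.
tbl_in <- read.csv(path, stringsAsFactors = FALSE)
restored <- as.POSIXct(tbl_in$start_trip, tz = "UTC")
attr(restored, "tzone") <- "America/New_York"

restored == start_trip   # same instant as before the round trip
```

Setting the "tzone" attribute does not change the stored instant; it only changes how the value is printed.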
What you're seeing is a "difftime"-class object. You can convert any numeric vector into this; all that that class does is change the print method used for it.
times <- c(10.4, 19)
times + 1
# [1] 11.4 20.0
times <- structure(times, class="difftime", units="mins")
times
# Time differences in mins
# [1] 10.4 19.0
times + 1
# Time differences in mins
# [1] 11.4 20.0
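Instead of setting the class by hand, base R also offers as.difftime() and units<- for the same purpose. A small sketch with made-up durations:

```r
# Attach the unit "mins" to plain numbers.
duration <- as.difftime(c(10.4, 74.5), units = "mins")
duration + 1                  # arithmetic keeps the difftime class and unit

# Changing the unit converts the underlying numbers (minutes -> hours).
units(duration) <- "hours"
duration

# Extract plain numbers in a chosen unit.
mins_again <- as.numeric(duration, units = "mins")
mins_again
```

Unlike structure(), units<- rescales the values, so switching between "mins" and "hours" keeps the represented duration unchanged.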
One implementation could be to use data.table::as.ITime:
times <- c(10.41, 19.01)
as.ITime(60*times) # ITime assumes seconds for all numbers
# [1] "00:10:24" "00:19:00"
as.ITime(60*times) + 60
# [1] "00:11:24" "00:20:00"
Frankly, I think that may be the best way-forward, even if it gives you HH:MM:SS instead of just HH:MM (perhaps there's a way to customize that ... I don't know offhand). One benefit of this method is that it writes to CSVs as the HH:MM:SS format (which may or may not be perfect):
write.csv(data.frame(x = as.ITime(60*times)), "foo.csv")
readLines("foo.csv")
# [1] "\"\",\"V1\"" "\"1\",00:10:24" "\"2\",00:19:00"
However, if that doesn't fit your needs, then below is a hack that might be sufficient for your needs. (Note that this, when saved to a CSV, is not saved in that format, it is saved as a number.)
format.my_difftime <- function(x, ..., digits = getOption("digits.my_difftime", 0)) {
if (is.null(digits)) digits <- getOption("digits.my_difftime", 0) # idk why this is needed
units <- attr(x, "units")
if (!is.null(units)) {
mult <- switch(units,
sec=, secs=1/60,
min=, mins=1,
hour=, hours=60,
day=, days=1440,  # 24 * 60 minutes per day
NA)
if (is.na(mult)) {
warning("Unrecognized units, ignoring: ", sQuote(units, q = FALSE))
mult <- 1
}
x <- x * mult
} # else assume 'mins'
fmt <- paste0("%02i:%0", digits+2+(digits>0), ".0", digits, "f")
sprintf(fmt, as.integer(x), 60 * (x %% 1))
}
print.my_difftime <- function(x, ...) cat(format(x), "\n")
as.data.frame.my_difftime <- as.data.frame.difftime
units.my_difftime <- function(x) attr(x, "units")
`units<-.my_difftime` <- function(x, value) {
attr(x, "units") <- value
x
}
Demonstration:
times <- c(10.41, 19.01)
structure(times, class = "my_difftime", units = "sec")
# 00:10 00:19
structure(times, class = "my_difftime", units = "min")
# 10:25 19:01
structure(times, class = "my_difftime", units = "hour")
# 624:36 1140:36
options(digits.my_difftime = 3)
times <- structure(times, class = "my_difftime", units = "min")
times
# 10:24.600 19:00.600
data.frame(x = times)
# x
# 1 10:24.600
# 2 19:00.600
options(digits.my_difftime = 0)
data.frame(x = times)
# x
# 1 10:25
# 2 19:01
dput(data.frame(x = times))
# structure(list(x = structure(c(10.41, 19.01), class = "my_difftime", units = "min")), class = "data.frame", row.names = c(NA, -2L))
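For the narrower question of printing 74.5 minutes as 74:30, a plain formatting helper may already be enough. The sketch below is an assumption-laden alternative to the class above: fmt_mmss is a hypothetical name, it returns character strings (so arithmetic has to be done on the numeric minutes before formatting), and it rounds to whole seconds first so that the seconds part never prints as 60.

```r
# Format minutes (numeric) as "MM:SS"; fmt_mmss() is a hypothetical helper.
fmt_mmss <- function(mins) {
  total_sec <- round(mins * 60)          # round once, on the total
  sprintf("%02d:%02d",
          as.integer(total_sec %/% 60),  # whole minutes
          as.integer(total_sec %% 60))   # remaining seconds, always 0..59
}

out <- fmt_mmss(c(74.5, 10.41))
out
```

Here 74.5 minutes is 4470 seconds, which prints as 74:30, and 10.41 minutes rounds to 625 seconds, which prints as 10:25.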
(Resolved in 3.)
Similar to #1, write.csv does not include units when it writes to a file; in fact, if it did, then read.csv (and most other CSV-reading functions) might presume that the column is character instead.
write.csv(data.frame(x = times), "foo.csv") # using my_difftime, not ITime
readLines("foo.csv")
# [1] "\"\",\"x\"" "\"1\",10.41" "\"2\",19.01"
The numbers are preserved.
In the case of this naïve implementation for my_difftime, though, the numbers never change, so the "units" attribute is merely for presentation. That is, when you calculate times, make sure that its units are always "minutes" (in my assumption of the OP/question) or something known.
From there, for TZ and for my_difftime, once somebody read.csv's the file, they are responsible for properly classing the column.

Subsetting a spacetime::STFDF by time

How can I subset a spacetime::STFDF (spatio-temporal data with full space-time grid) by time?
So far, I tried:
library("maps")
library("maptools")
library("spacetime")
library("plm")
states.m <- map("state", plot = FALSE, fill = TRUE)
IDs <- sapply(strsplit(states.m$names, ":"), function(x) x[1])
states <- map2SpatialPolygons(states.m, IDs = IDs)
yrs <- 1970:1986
time <- as.POSIXct(paste(yrs, "-01-01", sep = ""), tz = "GMT")
data("Produc")
Produc.st <- STFDF(states[-8], time, Produc[order(Produc[2], Produc[1]),])
Produc.st@time[c(1,5,17)]
Produc.st[Produc.st@time[c(1,5,17)]]
But that gives me the error: ncol(i) == 2 is not TRUE.
Any ideas?
Please try
Produc.st[,index(Produc.st@time[c(1,5,17)])]
i.e., time selection is done after the comma, and don't select with an xts object, which is what Produc.st@time[c(1,5,17)] is, but with a time (POSIXct) vector.

Aggregate data by week or few days

I'm trying to aggregate a data frame to obtain a table with weekly averages of a variable. I found that the following package provides a nice solution, and I've been using it for aggregating data yearly and monthly. However, the function to aggregate data weekly is simply not working as described. Does anyone have an idea how I can fix this?
For instance, following the manual:
require(TSAgg)
#Load the data:
data(foo)
##Format the data using the timeSeries function.
foo.ts<-timeSeries(foo[,1], "%d/%m/%Y %H:%M",foo[,3])
##Aggregate the data into 6 days blocks using max
(mean.month <- monthsAgg(foo.ts,mean,6))
#Aggregate the data into weeks, using 7 days and mean:
(foo.week<-daysAgg(foo.ts,mean,7) )
The last command doesn't work. The function is the following:
daysAgg <-
function (data, process, multiple = NULL, na.rm = FALSE)
{
if (is.null(multiple)) {
multiple = 1
}
if (multiple == 1) {
day <- aggregate(data[, 8:length(data)], list(day = data$day,
month = data$month, year = data$year), process, na.rm = na.rm)
days <- ymd(paste(day$year, day$month, day$day))
data2 <- data.frame(date = days, data = day[, 4:length(day)])
names(data2) <- c("Date", names(data[8:length(data)]))
return(data2)
}
temp <- data
day <- aggregate(list(data[, 8:length(data)], count = 1),
list(day = data$day, month = data$month, year = data$year),
process, na.rm = na.rm)
days <- ymd(paste(day$year, day$month, day$day))
data <- data.frame(date = days, day[, 5:length(day) - 1],
count = day[length(day)])
days = paste(multiple, "days")
all.dates <- seq.Date(as.Date(data$date[1]), as.Date(data$date[length(data[,
1])]), by = "day")
dates <- data.frame(date = all.dates)
aggreGated <- merge(dates, data, by = "date", all.x = TRUE)
aggreGated$date <- rep(seq.Date(as.Date(data$date[1]), as.Date(data$date[length(data[,
1])]), by = days), each = multiple, length = length(all.dates))
results <- aggregate(list(aggreGated[2:length(aggreGated)]),
list(date = aggreGated$date), process, na.rm = TRUE)
results <- subset(results, results$count != 0)
results <- results[, -length(results)]
names(results) <- c("Date", names(temp[8:length(temp)]))
return(results)
}
The problem in the code stems from its usage of the function ymd, which attaches " UTC" to the end of all dates it outputs. It is possible to overload the function by defining ymd again using
ymd <- function(x) {
as.Date(x, "%Y %m %d")
}
before you call daysAgg.
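Independently of TSAgg, averaging over fixed 7-day blocks can also be sketched in base R. This is not the package's method, only an illustration on made-up data; the column names date, value and block_start are assumptions.

```r
# Made-up daily data: 21 consecutive days with values 1..21.
df <- data.frame(date  = seq(as.Date("2020-01-01"), by = "day", length.out = 21),
                 value = 1:21)

# Assign each row to the first day of its 7-day block, counted from the
# earliest date in the data.
offset <- as.integer(df$date - min(df$date))
df$block_start <- min(df$date) + 7 * (offset %/% 7)

# Mean of `value` within each block.
weekly <- aggregate(value ~ block_start, data = df, FUN = mean)
weekly
```

Changing the 7 to another number gives blocks of a different length, similar in spirit to the multiple argument of daysAgg.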
