Create a vector of all dates in a given year - r

Is there a simple R idiom for getting a sequence of all days in a given year? I can do the following, which works fine except for leap years:
dtt <- as.Date( paste( as.character(year), "-1-1", sep="") ) + seq( 0,364 )
I could, obviously, add a line to filter out any values in (year + 1) but I'm guessing there's a much shorter way to do this.

What about this:
R> length(seq( as.Date("2004-01-01"), as.Date("2004-12-31"), by="+1 day"))
[1] 366
R> length(seq( as.Date("2005-01-01"), as.Date("2005-12-31"), by="+1 day"))
[1] 365
R>
This uses nuttin' but base R to compute correctly on dates to give you your vector. If you want higher-level operators, look e.g. at lubridate or even my more rudimentary RcppBDT, which wraps parts of the Boost Date_Time library.

Using Dirk's guidance I've settled on this:
getDays <- function(year){
seq(as.Date(paste(year, "-01-01", sep="")), as.Date(paste(year, "-12-31", sep="")), by="+1 day")
}
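A quick check of the idiom above, as a minimal sketch. The helper name days_of_year is hypothetical, and sprintf is used in place of paste purely for illustration; the seq() call on Date objects is the same technique as in getDays.

```r
# Hypothetical helper: all dates in `year`, built with seq() on Date objects.
days_of_year <- function(year) {
  first <- as.Date(sprintf("%d-01-01", as.integer(year)))
  last  <- as.Date(sprintf("%d-12-31", as.integer(year)))
  seq(first, last, by = "day")
}

length(days_of_year(2004))  # 366 (leap year)
length(days_of_year(2005))  # 365
range(days_of_year(2005))   # first and last day of 2005
```

Because the end point is given as a date rather than a day count, leap years need no special handling.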

I'd be interested to know if it would be faster to swap the order of the sequencing and the cast with as.Date:
# My function getDays
getDays_1 <- function(year) {
d1 <- as.Date(paste(year, '-01-01', sep = ''));
d2 <- as.Date(paste(year, '-12-31', sep = ''));
as.Date(d1:d2, origin = '1970-01-01');
};
# other getDays
getDays_2 <- function(year) {
seq(as.Date(paste(year, '-01-01', sep='')),
as.Date(paste(year, '-12-31', sep='')),
by = '+1 day');
};
test_getDays_1 <- function(n = 10000) {
for(i in 1:n) {
getDays_1(2000);
};
};
test_getDays_2 <- function(n = 10000) {
for(i in 1:n) {
getDays_2(2000);
};
};
system.time(test_getDays_1());
# user system elapsed
# 4.80 0.00 4.81
system.time(test_getDays_2());
# user system elapsed
# 4.52 0.00 4.53
I guess not... it appears that sequencing Date objects is slightly faster than converting a vector of integers to Dates.
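Timing differences this small are likely to vary by machine and R version, so it may be worth confirming first that the two variants return the same dates. The sketch below inlines both approaches for the year 2000; the variable names are illustrative only.

```r
d1 <- as.Date("2000-01-01")
d2 <- as.Date("2000-12-31")

# Variant 1: sequence the underlying day counts, then convert back to Date.
by_int <- as.Date(d1:d2, origin = "1970-01-01")

# Variant 2: sequence the Date objects directly.
by_seq <- seq(d1, d2, by = "day")

# Compare element-wise; identical() could be too strict here because the two
# results may differ in storage type (integer vs double) while holding the
# same dates.
same_dates <- length(by_int) == length(by_seq) && all(by_int == by_seq)
same_dates
```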

I needed something similar: for a vector of dates, I want to know the number of days in each date's year. I came up with the following function, which returns a vector of the same length as the input dates.
library(lubridate)  # provides year()
days_in_year <- function(dates) {
years <- year(dates)
days <- table(year(seq(as.Date(paste0(min(years), '-01-01')),
as.Date(paste0(max(years), '-12-31')),
by = '+1 day')))
as.vector(days[as.character(years)])
}
It works similarly to Dirk's solution, but it uses the lubridate::year function twice to get the year part of all dates. Using table does the same as length, but for all unique years at once. It might use more memory than strictly necessary if the dates are not in consecutive years.
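If memory is a concern, one alternative is to avoid building the day sequence at all and subtract 1 January of each year from 1 January of the following year. This is a base-R sketch, not the answer's method; the name days_in_year_base is hypothetical.

```r
# Hypothetical base-R variant: days in the year of each input date.
days_in_year_base <- function(dates) {
  yrs <- as.integer(format(dates, "%Y"))
  this_start <- as.Date(paste0(yrs, "-01-01"))
  next_start <- as.Date(paste0(yrs + 1L, "-01-01"))
  # Date differences are in days; as.integer() drops the difftime class.
  as.integer(next_start - this_start)
}

days_in_year_base(as.Date(c("2004-06-01", "2005-06-01", "1900-03-01")))
# 366 365 365
```

The work done here scales with the number of input dates rather than with the span of years they cover.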

Related

R - How to create a vector with propagated dates

I am trying to create a vector (dateVec) which contains the dates in the column Date propagated by the number of days in column Days. I cannot understand why the code that I created is not working. Dates are in Date format.
> for ( i in mydata[,1] ) {
> dateVec = mydata [,1] + 0 : mydata [,2] }
The data has many more rows; here is a sample as an example:
Date (mydata[,1]) -- Days (mydata[,2])
10/05/2017 ---------- 3
05/05/2017 ---------- 2
The result that I would expect for dateVec would be:
(10/05/2017, 11/05/2017, 12/05/2017, 13/05/2017, 05/05/2017, 06/05/2017, 07/05/2017, ...)
There are a few reasons why your code isn't working.
For loop: i should run over a series of integers that index the rows. As you have it now, i takes the value of each date in turn, and the loop body never uses it. A more useful way is to use seq_along to generate a sequence of integers from 1 to the length of the object passed to seq_along.
Indexing: dateVec is not indexed, so you are overwriting dateVec on each iteration of your loop.
Variable length of days: for the first date you are generating a sequence spanning 3 days and, for the second date, 2 days. You will need a data structure that can handle variable-length elements, such as a list.
To modify your existing code:
mydata <- data.frame(Date = as.Date(c("10/05/2017", "05/05/2017"),
format = "%d/%m/%Y"), Days = c(3, 2))
dateVec <- list()
for (i in seq_along(mydata[, 1])) {
dateVec[[i]] = mydata [i, 1] + 0 : mydata [i, 2]
}
res <- do.call("c", dateVec)
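A more idiomatic R approach is to pass the starting dates and the number of days in parallel using mapply to return a list, which is then concatenated into a vector of dates:
# length.out = y + 1 matches the loop above (the start date plus y further days);
# SIMPLIFY = FALSE makes mapply always return a list, even if all lengths are equal
res <- do.call("c", mapply(function(x, y) seq(from = x, length.out = y + 1,
by = "1 day"), x = mydata[["Date"]], y = mydata[["Days"]], SIMPLIFY = FALSE))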
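The same result can also be sketched without any loop or apply-family call, using rep and sequence. This assumes, as in the expected output in the question, that each start date is followed by Days further days; the variable names are illustrative.

```r
start <- as.Date(c("10/05/2017", "05/05/2017"), format = "%d/%m/%Y")
days  <- c(3, 2)

n <- days + 1                      # dates to generate per start date
# Repeat each start date n times, then add offsets 0, 1, ..., n - 1.
res <- rep(start, times = n) + (sequence(n) - 1)

format(res, "%d/%m/%Y")
# "10/05/2017" "11/05/2017" "12/05/2017" "13/05/2017"
# "05/05/2017" "06/05/2017" "07/05/2017"
```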
Here's a clunky solution:
library("lubridate")
mydata = data.frame(Date = dmy(c("10/05/2017", "05/05/2017")),
Days = c(3,2))
dateVec = dmy(character())
for(i in 1:length(mydata$Date)){
dateVec = c(dateVec,mydata$Date[i])
for(j in 1:mydata$Days[i]){
dateVec = c(dateVec, mydata$Date[i]+j)
}
}
Note that this uses the lubridate package and doesn't format the dates quite how you did. I also found it interesting that I had to initialize dateVec as a date object. Initially I tried dateVec = c() but R tried to coerce to numeric.
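The coercion mentioned above can be reproduced in base R. As a small sketch: c() dispatches on its first argument, so starting from NULL (which is what c() returns) falls back to the default method and drops the Date class, while starting from an empty Date keeps it.

```r
d <- as.Date("2017-05-10")

from_null <- c(NULL, d)                    # default c(): class is lost
from_date <- c(as.Date(character(0)), d)   # c.Date(): class is kept

class(from_null)  # "numeric"
class(from_date)  # "Date"
```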

Determine season from Date using lubridate in R

I have a very big dataset with a DateTime column containing POSIXct values. I need to determine the season (Winter - Summer) based on the DateTime column. I've created a function which works fine on a small dataset, but crashes when I use it on the large one. Can anybody see my mistake?
I've created 4 functions:
3 subfunctions so that I can do logical comparisons and selection using *apply functions
1 function to determine the season
Here are the functions:
require(lubridate)
# function for logical comparison (to be used in *apply)
greaterOrEqual <- function(x,y){
ifelse(x >= y,T,F)
}
# function for logical comparison (to be used in *apply)
less <- function(x,y){
ifelse(x < y,T,F)
}
# function for logical comparison (to be used in *apply)
selFromLogic <- function(VecLogic,VecValue){
VecValue[VecLogic]
}
# Main Function to determine the season
getTwoSeasons <- function(input.date) {
Winter1Start <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")
Winter1End <- as.POSIXct("2000-04-15 23:59:59", tz = "UTC")
SummerStart <- Winter1End + 1
SummerEnd <- as.POSIXct("2000-10-15 23:59:59", tz = "UTC")
Winter2Start <- SummerEnd + 1
Winter2End <- as.POSIXct("2000-12-31 00:00:00", tz = "UTC")
year(input.date) <- year(Winter1Start)
attr(input.date, "tzone") <- attr(Winter1Start, "tzone")
SeasonStart <- c(Winter1Start,SummerStart,Winter2Start)
SeasonsEnd <- c(Winter1End,SummerEnd,Winter2End)
Season_names <- as.factor(c("WinterHalfYear","SummerHalfYear","WinterHalfYear"))
Season_select <- sapply(SeasonStart, greaterOrEqual, x = input.date) & sapply(SeasonsEnd, less, x = input.date)
Season_return <- apply(Season_select,MARGIN = 1,selFromLogic,VecValue = Season_names)
return(Season_return)
}
And here's a way to test the function:
dates <- Sys.time() + seq(0,10000,10)
getTwoSeasons(dates)
I would be thankful for any help, this is driving me crazy!
And if you're interested in getting back four seasons, here's code to do that:
library(lubridate)
getSeason <- function(input.date){
numeric.date <- 100*month(input.date)+day(input.date)
## input Seasons upper limits in the form MMDD in the "break =" option:
cuts <- base::cut(numeric.date, breaks = c(0,319,0620,0921,1220,1231))
# rename the resulting groups (could've been done within cut(...levels=) if "Winter" wasn't double
levels(cuts) <- c("Winter","Spring","Summer","Fall","Winter")
return(cuts)
}
Unit Test:
getSeason(as.POSIXct("2016-01-01 12:00:00")+(0:365)*(60*60*24))
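To spot-check the boundaries, here is a base-R sketch of the same cut-based idea that builds the MMDD number with format() instead of lubridate. The name season_of is hypothetical; the breaks are the ones used above.

```r
# Hypothetical base-R variant of the cut() approach.
season_of <- function(x) {
  mmdd <- as.integer(format(x, "%m%d"))   # e.g. 20 March -> 320
  s <- cut(mmdd, breaks = c(0, 319, 620, 921, 1220, 1231))
  # Assigning a repeated level name merges the first and last bins.
  levels(s) <- c("Winter", "Spring", "Summer", "Fall", "Winter")
  s
}

as.character(season_of(as.Date(c("2016-01-15", "2016-03-20", "2016-07-04",
                                 "2016-10-31", "2016-12-25"))))
# "Winter" "Spring" "Summer" "Fall" "Winter"
```

Each break is an inclusive upper limit, so 19 March (319) is still Winter and 20 March (320) is the first day of Spring.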
For completeness, worth noting that lubridate now has a quarter (and a semester) function. quarter splits the year into fourths and semester into halves:
library(lubridate)
quarter(x, with_year = FALSE, fiscal_start = 1)
semester(x, with_year = FALSE)
For more, see: https://www.rdocumentation.org/packages/lubridate/versions/1.7.4/topics/quarter
I packaged @Lars Arne Jordanger's much more elegant approach into a function:
getTwoSeasons <- function(input.date){
numeric.date <- 100*month(input.date)+day(input.date)
## input Seasons upper limits in the form MMDD in the "break =" option:
cuts <- base::cut(numeric.date, breaks = c(0,415,1015,1231))
# rename the resulting groups (could've been done within cut(...levels=) if "Winter" wasn't double
levels(cuts) <- c("Winter", "Summer","Winter")
return(cuts)
}
Testing it on some sample data seems to work fine:
getTwoSeasons(as.POSIXct("2016-01-01 12:00:00")+(0:365)*(60*60*24))
After several hours of debugging I've found my mistake, and it's quite absurd really:
If a season for a DateTime value was not found, apply returned a list object instead of a vector (this was the case when the DateTime value equalled 2000-12-31 00:00:00). Returning a list caused a disproportionate increase in computation time and the crashes described. Here's the fixed code:
# input date and return 2 season
getTwoSeasons <- function(input.date) {
Winter1Start <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")
Winter1End <- as.POSIXct("2000-04-15 23:59:59", tz = "UTC")
SummerStart <- Winter1End + 1
SummerEnd <- as.POSIXct("2000-10-15 23:59:59", tz = "UTC")
Winter2Start <- SummerEnd + 1
Winter2End <- as.POSIXct("2001-01-01 00:00:01", tz = "UTC")
SeasonStart <- c(Winter1Start,SummerStart,Winter2Start)
SeasonsEnd <- c(Winter1End,SummerEnd,Winter2End)
Season_names <- factor(c("WinterHalf","SummerHalf","WinterHalf"))
year(input.date) <- year(Winter1Start)
attr(input.date, "tzone") <- attr(Winter1Start, "tzone")
Season_selectStart <- vapply(X = SeasonStart,function(x,y){x <= input.date},FUN.VALUE = logical(length(input.date)),y = input.date)
Season_selectEnd <- vapply(X = SeasonsEnd,function(x,y){x > input.date},FUN.VALUE = logical(length(input.date)),y = input.date)
Season_selectBoth <- Season_selectStart & Season_selectEnd
Season_return <- apply(Season_selectBoth,MARGIN = 1,function(x,y){y[x]}, y = Season_names)
return(Season_return)
}
The "sub"-functions are now integrated in the main function and the two sapply calls are replaced with vapply.
PS: There is still an issue with the timezone, since c() strips the timezone away. I'll update the code when I fix it.
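The failure mode described in this answer can be reproduced in isolation. In this minimal sketch (toy logical matrices, not the original data), apply() returns a plain vector when every row selects exactly one season and silently switches to a list as soon as one row selects none.

```r
season_names <- c("WinterHalf", "SummerHalf")

# Every row selects exactly one season: apply() simplifies to a character vector.
all_matched <- rbind(c(TRUE, FALSE), c(FALSE, TRUE))
res_ok <- apply(all_matched, 1, function(x, y) y[x], y = season_names)
is.list(res_ok)   # FALSE

# Second row selects nothing: results differ in length, so apply() returns a list.
one_missing <- rbind(c(TRUE, FALSE), c(FALSE, FALSE))
res_bad <- apply(one_missing, 1, function(x, y) y[x], y = season_names)
is.list(res_bad)  # TRUE
lengths(res_bad)  # 1 0
```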
The following strategy can also be used. The basic observation is that
substr can extract the month and day information we need in order to
decide whether it's summer or winter. The idea is then to convert this to
numbers of the form month.day, so that the test for being summer
boils down to having a number larger than 4.15 but smaller than 10.16.
The example below shows how this can be done: a vector of dates
is first transformed into the alternative representation described
above, and then a vector that tells whether it is summer (TRUE) or winter
(FALSE) is created from it.
DateTime <- as.POSIXct(x = "2000-01-01 00:00:00",
tz = "UTC") +
(0:1000)*(60*60*24)
DateTime_2 <- as.numeric(paste(
substr(x = DateTime,
start = 6,
stop = 7),
substr(x = DateTime,
start = 9,
stop = 10),
sep = "."))
.season <- (DateTime_2 > 4.15) & (DateTime_2 < 10.16)
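A boundary check of the month.day encoding, as a sketch. Here format() with "%m.%d" stands in for the two substr calls, and the name is_summer is hypothetical.

```r
# Hypothetical helper: TRUE for dates from 16 April to 15 October inclusive.
is_summer <- function(x) {
  md <- as.numeric(format(x, "%m.%d"))   # e.g. 16 April -> 4.16
  md > 4.15 & md < 10.16
}

is_summer(as.Date(c("2000-04-15", "2000-04-16", "2000-10-15", "2000-10-16")))
# FALSE TRUE TRUE FALSE
```

The day part is always two digits, so 4.2 (20 April) still sorts after 4.15 as intended.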
Use POSIXlt instead of POSIXct.
I made my own function based on the definition of seasons that I am using. I created two vectors, normal for a non-leap year and leap for a leap year, in which each season name is repeated once for every day it covers, starting from Jan 1. Then I created the following function.
SEASON <- function(datee){
datee <- as.POSIXlt(datee)
season <- vector()
normal <- rep(c("Winter","Spring","Summer","Monsoon","Autumn","Winter"), c(46,44,91,77,76,31))
leap <- rep(c("Winter","Spring","Summer","Monsoon","Autumn","Winter"), c(46,45,91,77,76,31))
if(leap_year(year(datee)) == FALSE){
season <- normal[datee$yday+1]
} else {
season <- leap[datee$yday+1]
}
return(season)
}
Let's put it to the test on some data.
Dates <- seq(as.POSIXct("2000-01-01"), as.POSIXct("2010-01-01"), by= "day")
sapply(Dates, SEASON)
It works.
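Since the function above handles one date at a time (hence the sapply), a vectorized variant might look like the sketch below. It keeps the same day-of-year lookup tables but computes the leap-year test in base R instead of calling lubridate; the name season_by_yday is hypothetical.

```r
# Hypothetical vectorized variant of the day-of-year lookup.
season_by_yday <- function(x) {
  lt <- as.POSIXlt(x)
  yr <- lt$year + 1900
  is_leap <- (yr %% 4 == 0 & yr %% 100 != 0) | yr %% 400 == 0
  labels <- c("Winter", "Spring", "Summer", "Monsoon", "Autumn", "Winter")
  normal <- rep(labels, c(46, 44, 91, 77, 76, 31))  # 365 entries
  leap   <- rep(labels, c(46, 45, 91, 77, 76, 31))  # 366 entries
  # yday is zero-based, so add 1 to index the lookup tables.
  ifelse(is_leap, leap[lt$yday + 1], normal[lt$yday + 1])
}

season_by_yday(as.Date(c("2001-02-15", "2001-02-16", "2000-03-31",
                         "2001-04-01", "2000-12-31")))
# "Winter" "Spring" "Spring" "Summer" "Winter"
```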

R - plot overlapping time intervals

I have a list of people and their working start and end times during a day. I want to plot a curve showing the total number of people working at any given minute in the day. What I could do is just add 1440 additional conditional boolean variables, one for each minute of the day, and sum them up, but that seems very inelegant. I'm wondering if there is a better way to do it (integrals?).
Here's the code to generate a df with my sample data:
sample_wt <- function() {
require(lubridate)
set.seed(10)
worktime <- data.frame(
ID = c(1:100),
start = now()+abs(rnorm(100,4800,2400))
)
worktime$end <- worktime$start + abs(rnorm(100,20000,10000))
worktime$length <- difftime(worktime$end, worktime$start, units="mins")
worktime
}
To create the sample data, you can do something like:
DF <- sample_wt()
Here is one option using the IRanges package from Bioconductor.
library(IRanges)
## generate sample
DF <- sample_wt()
## create the range from the sample data
rangesA <- IRanges(as.numeric(DF$start), as.numeric(DF$end))
## create one minute range
xx = seq(min(DF$start),max(DF$end),60)
rangesB <- IRanges(as.numeric(xx),as.numeric(xx+60))
## count the overlaps
ov <- countOverlaps(rangesB, rangesA, type="within")
## plot the result
plot(xx,ov,type='l')
Surely it can be improved, but this seems to do it:
time_range <- seq(min(DF$start), max(DF$end), 60)
result <- integer(length(time_range))
for (t in seq_along(time_range)) {
result[t] <- sum(DF$start <= time_range[t] & DF$end >= time_range[t])
}
I don't have lubridate installed, so I produced the data.frame through Sys.time instead of now (I guess they should be similar). This should do the trick:
minutes<-seq(as.POSIXct(paste(sep="",Sys.Date()," 00:00:00")),by="min",length.out=24*60)
rowSums(outer(minutes,worktime$start,">") & outer(minutes,worktime$end,"<"))
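Another way to get the same counts without comparing every minute against every interval is to count how many intervals have started by time t and subtract how many ended before t. The sketch below uses findInterval on sorted start and end times; the function name count_active and the toy numbers are assumptions for illustration. With the sample data, as.numeric(DF$start), as.numeric(DF$end) and a numeric minute grid could be passed in.

```r
# Hypothetical helper: number of intervals [start, end] that contain each `at`.
count_active <- function(start, end, at) {
  started <- findInterval(at, sort(start))                  # count of start <= t
  ended   <- findInterval(at, sort(end), left.open = TRUE)  # count of end < t
  started - ended
}

start <- c(0, 5, 10)
end   <- c(10, 15, 20)
at    <- c(0, 5, 10, 12, 20, 21)

count_active(start, end, at)
# 1 2 3 2 1 0
```

This treats both end points as inclusive, matching the start <= t & end >= t test used in the loop version above.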

Calculate the last n weekdays

I have a function in R that, given n days, returns a list of the last n weekdays. My solution works fine, but it feels inelegant, and I was wondering if there were any easy ways to improve it.
WeekdayList <- function(n) {
Today <- as.Date(Sys.time())
days <- c(Today)
i <- 1
while (length(days) < n) {
NewDay <- as.Date(Today-i)
if (!weekdays(NewDay) %in% c("Saturday", "Sunday")) {
days <- c(days,NewDay)
}
i <- i+1
}
days
}
WeekdayList(30)
WeekdayList(2)
Exclusion of holidays would be a nice feature too.
Vectorizing code is essential in R. Here is an example:
WeekdayList2 <- function(n) {
Today <- as.Date(Sys.time())
dayz <- seq(Today, Today - 2 * n, "-1 days")
dayz <- dayz[!(weekdays(dayz) %in% c("Saturday", "Sunday"))]
dayz <- dayz[seq_len(n)]
return(dayz)
}
identical(WeekdayList2(1000), WeekdayList(1000))
# [1] TRUE
system.time(WeekdayList2(10000))
# user system elapsed
# 0 0 0
system.time(WeekdayList(10000))
# user system elapsed
# 4.90 0.00 4.91
As you can see, even though my function creates a vector almost twice the size it needs to be (and then deletes the weekends), it is much faster than growing the result one day at a time in a loop. My computer cannot even run your function with n = 100000 (not that you'd care about that many days back anyway), but WeekdayList2 runs it almost instantly.
Since holidays depend on where you are, you'll probably need to supply a list of dates manually and add another criterion to filter those dates out of the data.
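A minimal sketch of that idea, assuming the holidays are supplied by the caller as a Date vector. The function name last_weekdays, its arguments and the holiday date in the usage example are hypothetical. It uses format(x, "%u") (ISO weekday number, Monday = 1) rather than weekdays(), so the weekend test does not depend on the locale's day names.

```r
# Hypothetical helper: the last n weekdays up to `today`, skipping `holidays`.
last_weekdays <- function(n, today = Sys.Date(),
                          holidays = as.Date(character(0))) {
  # Candidate window, newest first; the extra week is a small buffer and may
  # be too short if many holidays fall inside the window.
  cand <- today - 0:(2 * n + 7)
  is_weekend <- format(cand, "%u") %in% c("6", "7")
  cand <- cand[!is_weekend & !(cand %in% holidays)]
  head(cand, n)
}

# 2017-05-10 was a Wednesday; treat 2017-05-08 as a made-up holiday.
last_weekdays(5, today = as.Date("2017-05-10"),
              holidays = as.Date("2017-05-08"))
# "2017-05-10" "2017-05-09" "2017-05-05" "2017-05-04" "2017-05-03"
```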
I added a holiday calculation to Rguy's code.
WeekdayList3 <- function(n) {
library(timeDate)
Today <- as.Date(Sys.time())
dayz <- rev(seq(Today - 2 * n, Today, "days"))
years <- as.numeric(unique(format(dayz,'%Y')))
holidays <- as.Date(holidayNYSE(years))
dayz <- dayz[!(weekdays(dayz) %in% c("Saturday", "Sunday"))]
dayz <- dayz[!(dayz %in% holidays)]
dayz <- dayz[1 : n]
return(dayz)
}
WeekdayList3(100)
