Selecting and grouping similar dates from vectors of dates - r

I have three vectors of dates in POSIX format that correspond to data collection times from three large datasets. Each vector has a different length, and the vectors contain similar (but not identical) dates.
I would like to:
group these dates into specified time ranges, e.g. group dates from each vector that fall in a 30-day window and
reduce the number of date groupings to reflect the dataset with the smallest number of collection times, e.g. if "dataset A" has three sampling dates and "dataset B" has five sampling dates then there would only be three groupings of dates (unless the two extra dates in "dataset B" fall within 30 days of the dates in "dataset A").
An example with three vectors of dates in POSIX format (I want to group similar dates between the vectors, allowing a time window of 30 days):
A.dates = as.POSIXlt(c("1998-07-24 08:00","1999-07-24 08:00","2000-07-24 08:00"),
tz = "America/Los_Angeles")
B.dates = as.POSIXlt(c("1998-07-25 08:00","1999-07-25 08:00","2000-07-25 08:00"),
tz = "America/Los_Angeles")
C.dates = as.POSIXlt(c("1998-07-26 08:00","1999-07-26 08:00","2000-07-26 08:00","2000-08-29"),
tz = "America/Los_Angeles")
Specifying a time window of 30 days, there would be three date groupings (the sampling dates from July 1998, 1999, and 2000). The C.dates vector has a fourth collection date of August 29th, 2000 which would be excluded from the groupings because:
it is not within 30 days of the July dates in the other vectors and
there are no dates in the other two vectors that fall within 30 days of August 29th, 2000.

You could loop over each element of each vector and create sequences ± 15 days
L <- list(A.dates, B.dates, C.dates)
tmp <- lapply(L, function(x) lapply(x, function(x)
do.call(seq, c(as.list(as.Date(x) + c(-15, 15)), "day"))))
and take their union within each list element.
tmp <- lapply(tmp, function(x) as.Date(Reduce(union, x), origin="1970-01-01"))
Then simply find the intersection
i <- Reduce(function(...) as.Date(intersect(...), origin="1970-01-01"), tmp)
and select the dates accordingly.
tmp <- lapply(L, function(x) x[as.Date(x) %in% i])
tmp
# [[1]]
# [1] "1998-07-24 08:00:00 PDT" "1999-07-24 08:00:00 PDT"
# [3] "2000-07-24 08:00:00 PDT"
#
# [[2]]
# [1] "1998-07-25 08:00:00 PDT" "1999-07-25 08:00:00 PDT"
# [3] "2000-07-25 08:00:00 PDT"
#
# [[3]]
# [1] "1998-07-26 PDT" "1999-07-26 PDT" "2000-07-26 PDT"
To sort them by year, as requested in your comment, we first unlist them. Unfortunately this converts the dates into numerics (i.e. seconds since January 1, 1970), so we need to convert them back.
tmp <- as.POSIXlt(unlist(lapply(tmp, as.POSIXct)), origin="1970-01-01",
tz="America/Los_Angeles")
Finally we split the vector by its first four characters, which are the year (we could also do split(tmp, strftime(tmp, "%Y")) though).
res <- split(tmp, substr(tmp, 1, 4))
res
# $`1998`
# [1] "1998-07-24 08:00:00 PDT" "1998-07-25 08:00:00 PDT"
# [3] "1998-07-26 00:00:00 PDT"
#
# $`1999`
# [1] "1999-07-24 08:00:00 PDT" "1999-07-25 08:00:00 PDT"
# [3] "1999-07-26 00:00:00 PDT"
#
# $`2000`
# [1] "2000-07-24 08:00:00 PDT" "2000-07-25 08:00:00 PDT"
# [3] "2000-07-26 00:00:00 PDT"
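As a complementary sketch (not the method used above), the same kind of selection can be written as a direct distance check. The helper name has_partner and its days argument are hypothetical. Note that the ± 15 day sequences above keep a date only if it lies within 15 days of a date in every other vector, whereas this sketch lets you choose the tolerance explicitly.

```r
# Hypothetical helper: TRUE for each date in x that has at least one date
# within `days` days in every vector listed in `others`.
has_partner <- function(x, others, days = 30) {
  x_day <- as.numeric(as.Date(x))
  keep <- rep(TRUE, length(x_day))
  for (o in others) {
    o_day <- as.numeric(as.Date(o))
    # smallest absolute gap (in days) from each element of x to any element of o
    gap <- vapply(x_day, function(d) min(abs(d - o_day)), numeric(1))
    keep <- keep & gap <= days
  }
  keep
}

A <- as.Date(c("1998-07-24", "1999-07-24", "2000-07-24"))
C <- as.Date(c("1998-07-26", "1999-07-26", "2000-07-26", "2000-08-29"))

# 2000-08-29 is 36 days after 2000-07-24, so it has no partner in A
has_partner(C, list(A))
# [1]  TRUE  TRUE  TRUE FALSE
```

Plain Date vectors are used here to keep the example independent of the time zone; POSIX inputs would be passed through as.Date first, as in the answer above.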


How to convert time series dates into data frame dates

I have a time series of weekly data, beginning on Jan. 1, 2016. I've tried using the method in this question, but I'm getting dates from 1970.
This is what I'm doing below:
# Creating this df of dates used later on
index.date <- data.frame(start.date=seq(from=as.Date("01/01/2016",format="%m/%d/%Y"),
to=as.Date("10/30/2021",format="%m/%d/%Y"),
by='week'))
# Create a ts, specifying start date and frequency=52 for weekly
weekly.ts <- ts(rnorm(305,0,1),start=min(index.date$start.date), frequency = 52)
# Look at the min and max dates in the ts
as.Date(as.numeric(time(min(weekly.ts))))
[1] "1970-01-02"
as.Date(as.numeric(time(max(weekly.ts))))
[1] "1970-01-02"
I plan to place the ts into a df with dates shown in a date format with the following:
# Place ts dates and values into a df
output.df <-data.frame(date=as.Date(as.numeric(time(weekly.ts))),
y=as.matrix(weekly.ts))
Is this a matter of me specifying the dates incorrectly in the ts, or am I converting them incorrectly with as.Date(as.numeric(time(weekly.ts)))? I would expect the min date to be Jan. 1, 2016, and the maximum Oct. 29, 2021 (as it is for index.date).
ts series do not understand Date class but you can encode the dates into numbers and then decode them back. Assuming that you want a series with frequency 52 the first week in 2016 will be represented by 2016, the second by 2016+1/52, ..., the last by 2016+51/52.
For example,
tt <- ts(rnorm(305), start = 2016, freq = 52)
Now decode the dates.
toDate <- function(tt) {
yr <- as.integer(time(tt))
week <- as.integer(cycle(tt)) # first week of year is 1, etc.
as.Date(ISOdate(yr, 1, 1)) + 7 * (week - 1)
}
data.frame(dates = toDate(tt), series = c(tt))
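To make the decoding concrete, here is a small usage sketch on a three-point series. The values 10, 20, 30 are made up for illustration; toDate is repeated from above so the block runs on its own.

```r
# Decoder from above, repeated so this block is self-contained
toDate <- function(tt) {
  yr <- as.integer(time(tt))
  week <- as.integer(cycle(tt))  # first week of year is 1, etc.
  as.Date(ISOdate(yr, 1, 1)) + 7 * (week - 1)
}

# Three weekly observations starting in the first week of 2016
tt_small <- ts(c(10, 20, 30), start = 2016, frequency = 52)

data.frame(dates = toDate(tt_small), series = c(tt_small))
#        dates series
# 1 2016-01-01     10
# 2 2016-01-08     20
# 3 2016-01-15     30
```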
We can also convert from Date class to year/week number
# input is a Date class object
to_yw <- function(date) {
yr <- as.numeric(format(date, "%Y"))
yday <- as.POSIXlt(date)$yday # jan 1st is 0
week <- pmin(floor(yday / 7), 51) + 1 # 1st week of yr is 1
yw <- yr + (week - 1) / 52
list(yw = yw, year = yr, yday = yday, week = week)
}
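One way to use this encoding when building the series is to pass the year and week number to ts as start = c(year, week). The sketch below assumes the series begins on 2016-01-01 and recomputes the week number inline with the same rule as to_yw.

```r
first <- as.Date("2016-01-01")

yr   <- as.numeric(format(first, "%Y"))
yday <- as.POSIXlt(first)$yday           # Jan 1st is 0
week <- pmin(floor(yday / 7), 51) + 1    # first week of the year is 1

# start = c(year, period) is the two-number form accepted by ts()
set.seed(1)
weekly.ts <- ts(rnorm(305), start = c(yr, week), frequency = 52)

start(weekly.ts)
# [1] 2016    1
```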
Try this
weekly.ts <- ts(rnorm(305,0,1),
start=min(index.date$start.date),
end=max(index.date$start.date), frequency=2)
# look at plot to see if it works
plot(stl(weekly.ts, s.window=2))
# get time
head(as.POSIXlt.Date(time(weekly.ts)))
[1] "2016-01-01 UTC" "2016-01-01 UTC" "2016-01-02 UTC" "2016-01-02 UTC"
[5] "2016-01-03 UTC" "2016-01-03 UTC"
tail(as.POSIXlt.Date(time(weekly.ts)))
[1] "2021-10-26 UTC" "2021-10-27 UTC" "2021-10-27 UTC" "2021-10-28 UTC"
[5] "2021-10-28 UTC" "2021-10-29 UTC"
You get each date twice because of frequency=2, which decompose or stl require for meaningful results.

How to calculate time difference in milliseconds using R when formats are different?

I have a problem in R that is killing me! Can you help me?
I found a question in StackOverflow that gave me a very good explanation.
Here is the link: How to parse milliseconds?
I was able to implement the following code that works very well.
z2 <- strptime("10/2/20 11:16:17.682", "%d/%m/%y %H:%M:%OS")
z1 <- strptime("10/2/20 11:16:16.683", "%d/%m/%y %H:%M:%OS")
When I calculate z2-z1, I get
Time difference of 0.9989998 secs
Similarly, when I use
z3 <- strptime("130 11:16:16.683", "%j %H:%M:%OS")
z4 <- strptime("130 11:16:18.682", "%j %H:%M:%OS")
When I calculate z4-z3, I get
Time difference of 1.999 secs
What is my problem?
The first column has the format 130 18:25:50.408, with millions of rows!!!
The second column has the format 2020 130 18:25:51.357 that is like the first column but has the year 2020.
The first column is also from 2020, but as the year is not there R uses the current year.
First question,
How can I subtract one column from the other? I know how to subtract columns in general.
What I do not know is how to subtract these two times.
For example, second time is 2020 130 18:25:51.357
and first time is 130 18:25:50.408
I guess that I can do it programmatically converting it to a string, and eliminating the 2020. However, I am hoping that a quicker solution is available using base R or the lubridate package.
Second question,
"%j %H:%M:%OS" is the format for 130 11:16:16.683
What is the format for 2020 130 18:25:51.357?
As explained before this is working very well:
z3 <- strptime("130 11:16:16.683", "%j %H:%M:%OS")
But, this is NOT working.
z7 <- strptime("2020 130 11:16:16.683", "%y %j %H:%M:%OS")
UPDATE 1
I solved the second question!
However, I have not figured out yet the first question.
For the second question, the mistake in the format was that instead of %y, I need to write %Y with upper case.
Here is one example:
later <- strptime("2020 130 11:16:17.683", "%Y %j %H:%M:%OS")
earlier <- strptime("2020 130 11:16:16.684", "%Y %j %H:%M:%OS")
difftime(later,earlier,units="secs")
The R result is:
Time difference of 0.9990001 secs
UPDATE 2
At this point, what is pending is the following:
I need to subtract two times that were recorded on the same day in 2020.
The second time does have the year, the first time does not.
later <- strptime("2020 130 11:16:17.683", "%Y %j %H:%M:%OS")
earlier <- strptime("130 11:16:16.684", "%j %H:%M:%OS")
difftime(later,earlier,units="secs")
R produces the following result:
Time difference of -31622399 secs
Why? As we are in 2021, R assigns the current year, 2021, to the vector earlier because the year is missing.
My columns have millions of rows.
At this point, my guess is that I would need to add 2020 with a concatenation or something like that. Is there any other method?
Thank you for your help!
Your object z2 is a POSIXlt ("list time") object. This means that it is a list holding the individual components of your date-time.
print.default(z2)
# $sec
# [1] 17.682
#
# $min
# [1] 16
#
# $hour
# [1] 11
#
# $mday
# [1] 10
#
# $mon
# [1] 1
#
# $year
# [1] 120
#
# $wday
# [1] 1
#
# $yday
# [1] 40
#
# $isdst
# [1] 0
#
# $zone
# [1] "GMT"
#
# $gmtoff
# [1] NA
#
# attr(,"class")
# [1] "POSIXlt" "POSIXt"
When you do a subtraction, z2 - z1, R dispatches the operation to the method -.POSIXt, which itself calls difftime. This function converts z2 to a POSIXct object, i.e. a count of seconds since the beginning of the epoch, 1970-01-01 00:00:00 UTC.
options("digits" = 16)
print.default(as.POSIXct(z2))
# [1] 1581333377.682
# attr(,"class")
# [1] "POSIXct" "POSIXt"
# attr(,"tzone")
# [1] ""
difftime(z2, z1)
# Time difference of 0.9989998340606689 secs
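The two representations can be compared on a value that is easy to check by hand. This is a minimal sketch; the timestamp is chosen only for illustration and UTC is set explicitly so the result does not depend on the local time zone.

```r
# One day after the epoch, in UTC
x  <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
lt <- as.POSIXlt(x)

as.numeric(x)      # POSIXct: seconds since 1970-01-01 00:00:00 UTC
# [1] 86400

lt$mday            # POSIXlt: day of the month as a list element
# [1] 2
lt$year + 1900     # POSIXlt stores years since 1900
# [1] 1970
```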
R, like most software, works with double-precision floating-point numbers. This means that arithmetic is only approximately exact. Most software will try to hide this imprecision by reducing the number of digits shown. The size of the error depends on the magnitude of the numbers involved, so you might prefer referring directly to the sec element of z2, which is far smaller than the seconds-since-epoch count.
print.default(z2$sec - z1$sec)
# [1] 0.9989999999999988
You could therefore apply the time difference using your favourite data.frame tools.
options("digits" = 6)
# character columns
df1 <- data.frame(
col1 = c("10/2/20 11:16:17.682", "10/2/20 11:16:16.683"),
col2 = c("130 11:16:16.683", "130 11:16:18.682"),
stringsAsFactors = FALSE)
library(dplyr)
# convert columns to POSIXlt
df2 <- mutate(df1,
col1 = strptime(col1, "%d/%m/%y %H:%M:%OS"),
col2 = strptime(stringr::str_c("2020 ", col2), "%Y %j %H:%M:%OS"),
diff_days = unclass(difftime(col2, col1, units = "days")))
df2
# col1 col2 diff_days
# 1 2020-02-10 11:16:17 2020-05-09 11:16:16 88.9583
# 2 2020-02-10 11:16:16 2020-05-09 11:16:18 88.9584
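For the original two-column problem, a base R sketch along the same lines could look like the following. It assumes every row of the year-less column was recorded in 2020; the sample strings are taken from the question, the variable names are made up, and UTC is set explicitly so daylight saving time cannot affect the difference.

```r
# Assumption: all rows without a year belong to 2020
no_year   <- c("130 11:16:16.684", "130 18:25:50.408")
with_year <- c("2020 130 11:16:17.683", "2020 130 18:25:51.357")

fmt <- "%Y %j %H:%M:%OS"
earlier <- as.POSIXct(paste("2020", no_year), format = fmt, tz = "UTC")
later   <- as.POSIXct(with_year, format = fmt, tz = "UTC")

# Difference in whole milliseconds; rounding absorbs floating-point noise
diff_ms <- round(as.numeric(difftime(later, earlier, units = "secs")) * 1000)
diff_ms
# [1] 999 949
```

Both paste and as.POSIXct are vectorised, so the same lines apply to whole columns.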

Convert date to day-of-week in R

I have a date in this format in my data frame:
"02-July-2015"
And I need to convert it to a day number (i.e. 183, the day of the year). Something like:
df$day_of_week <- weekdays(as.Date(df$date_column))
But this doesn't understand the format of the dates.
You could use lubridate to convert to day of week or day of year.
library(lubridate)
# "02-July-2015" is Thursday
date_string <- "02-July-2015"
dt <- dmy(date_string)
dt
## [1] "2015-07-02 UTC"
### Day of week : (1-7, Sunday is 1)
wday(dt)
## [1] 5
### Day of year (1-366; for 2015, only 365)
yday(dt)
## [1] 183
### Or a little shorter to do the same thing for Day of year
yday(dmy("02-July-2015"))
## [1] 183
day = as.POSIXct("02-July-2015",format="%d-%b-%Y")
# see ?strptime for more information on date-time conversions
# Day of year as decimal number (001–366).
format(day,format="%j")
[1] "183"
#Weekday as a decimal number (1–7, Monday is 1).
format(day,format="%u")
[1] "4"
This builds on anotherFishGuy's answer and additionally converts the values with as.numeric so they can be fed to a classifier.
# day <- Sys.time()
as.num.format <- function(day, ...){
as.numeric(format(day, ...))
}
doy <- as.num.format(day,format="%j") # day of year
dow <- as.num.format(day,format="%u") # day of week, Monday is 1
hour <- as.num.format(day, "%H")
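For the asker's own as.Date attempt, a minimal base R sketch is to pass the format explicitly. It assumes an English locale, because %B matches month names in the current locale.

```r
# Assumption: month names are in English (%B is locale-dependent)
d <- as.Date("02-July-2015", format = "%d-%B-%Y")

as.integer(format(d, "%j"))   # day of the year
# [1] 183

as.integer(format(d, "%u"))   # day of the week, Monday is 1
# [1] 4
```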

Parsing ambiguous timestamps

I have been provided a dataset with an ambiguous date format, e.g:
d_raw <- c("1102001 23:00", "1112001 0:00")
I would like to try to parse this date into a POSIXlt object in R. The source of the file assures me that the file is in chronological order, that the date format is month, then day, then year, and that there are no gaps in the time series.
Is there any way to parse this date format, using the ordering to resolve ambiguities? E.g. the vector above should parse to c("2001-01-10 23:00:00", "2001-01-11 00:00:00") rather than c("2001-01-10 23:00:00", "2001-11-01 00:00:00").
How about this (using regular expressions)
d_raw <- c("192001 16:00", "1102001 23:00", "1112001 0:00")
re <- "^(.+?)([1-9]|[1-3][0-9])(\\d{4}) (\\d{1,2}):(\\d{2})$"
m <- regexec(re, d_raw)
parts <- regmatches(d_raw, m)
lapply(parts, function(x) {
x<-as.numeric(x[-1])
ISOdate(x[3], x[1], x[2], x[4], x[5])
})
# [[1]]
# [1] "2001-01-09 16:00:00 GMT"
#
# [[2]]
# [1] "2001-01-10 23:00:00 GMT"
#
# [[3]]
# [1] "2001-01-11 GMT"
If you had more test cases, that would help to make sure the regular expression works correctly.
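The regular expression above always prefers the shortest possible month. A sketch that actually uses the chronological ordering is to list every valid reading of each string and keep the earliest one that is not before the previous timestamp. The helper names candidates and resolve are hypothetical, and the sketch assumes month and day are never zero-padded.

```r
# Hypothetical helper: every valid reading of one "mdYYYY H:MM" string
candidates <- function(s) {
  parts  <- strsplit(s, " ", fixed = TRUE)[[1]]
  digits <- parts[1]
  n      <- nchar(digits)
  md     <- substr(digits, 1, n - 4)   # month and day digits (2 to 4 chars)
  year   <- substr(digits, n - 3, n)
  out <- character(0)
  for (k in 1:2) {                     # the month uses 1 or 2 digits
    if (k >= nchar(md)) next
    m <- substr(md, 1, k)
    d <- substr(md, k + 1, nchar(md))
    # no zero padding in the raw data, and a day has at most 2 digits
    if (nchar(d) > 2 || startsWith(m, "0") || startsWith(d, "0")) next
    out <- c(out, paste0(year, "-", m, "-", d, " ", parts[2]))
  }
  res <- as.POSIXct(out, format = "%Y-%m-%d %H:%M", tz = "UTC")
  res[!is.na(res)]                     # drop impossible dates such as month 13
}

# Hypothetical helper: walk through the series in order and pick, for each
# string, the earliest reading that does not go back in time
resolve <- function(raw) {
  secs <- numeric(length(raw))
  prev <- -Inf
  for (i in seq_along(raw)) {
    cand <- as.numeric(candidates(raw[i]))
    cand <- cand[cand >= prev]
    if (length(cand) == 0) stop("no reading of '", raw[i], "' keeps the order")
    secs[i] <- min(cand)
    prev <- secs[i]
  }
  as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
}

resolve(c("1102001 23:00", "1112001 0:00"))
# [1] "2001-01-10 23:00:00 UTC" "2001-01-11 00:00:00 UTC"
```

With c("10312001 23:00", "1112001 0:00") the second string resolves to November 1st instead, because January 11th would lie before October 31st. The first element of a series has no predecessor, so its earliest reading is used.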
I pity you for your horrible data vendor, so I decided to try and fix this for you.
# make up some horrid data
d_bad <- as.POSIXlt(seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by=1))
d_raw <- paste0(d_bad$mon+1, d_bad$mday, d_bad$year+1900)
d_new <- d_raw
# not ambiguous when nchar is 6
d_new <- ifelse(nchar(d_new)==6,
paste0("0", substr(d_new,1,1), "0", substr(d_new,2,nchar(d_new))), d_new)
# now not ambiguous when nchar is 7 and it doesn't begin with a "1"
d_new <- ifelse(nchar(d_new)==7 & substr(d_new,1,1) != "1",
paste0("0",d_new), d_new)
# now guess a leading zero and parse
d_new <- ifelse(nchar(d_new)==7, paste0("0",d_new), d_new)
d_try <- as.Date(d_new, "%m%d%Y")
# now only days in October, November, and December might be wrong
bad <- cumsum(c(1L,as.integer(diff(d_try)))-1L) < 0L
# put the leading zero in the day, but remember "bad" rows have an
# extra leading zero, so make sure to skip it
d_try2 <- ifelse(bad,
paste0(substr(d_new,2,3),"0", substr(d_new,4,nchar(d_new))), d_new)
# convert to Date, POSIXlt, whatever and do a happy dance
d_YAY <- as.Date(d_try2, "%m%d%Y")
data.frame(d_raw, d_new, d_try, bad, d_try2, d_YAY)
# d_raw d_new d_try bad d_try2 d_YAY
# 1 112014 01012014 2014-01-01 FALSE 01012014 2014-01-01
# 2 122014 01022014 2014-01-02 FALSE 01022014 2014-01-02
# 3 132014 01032014 2014-01-03 FALSE 01032014 2014-01-03
# 4 142014 01042014 2014-01-04 FALSE 01042014 2014-01-04
# 5 152014 01052014 2014-01-05 FALSE 01052014 2014-01-05
# 6 162014 01062014 2014-01-06 FALSE 01062014 2014-01-06
I only did this with Dates in order to keep the example data set small. Doing this for POSIXlt would be very similar, except you would need to change the as.Date calls to as.POSIXlt and adjust the format accordingly.

R some time stamps changing to NA

My code reads in a CSV file and converts the timestamp column to the R time format
DF <- read.csv("DF.CSV",head=TRUE,sep=",")
DF[51082,1]
[1] 03/01/2012 19:29
DF[1,1]
[1] 02/24/12 00:29
It reads it in properly and the above 2 rows are displayed as expected
DF$START <- as.POSIXct(strptime(paste(DF$START),format="%m/%d/%y %H:%M"))
DF[1,1]
[1] "2012-02-24 00:29:00 GMT"
DF[51082,1]
[1] NA
After converting them to the R time format using strptime and displaying them again, some of the values are NA. No error message was displayed, and I cannot figure out the reason.
You have (at least) two different date formats,
one in %Y (4-digit years), one in %y (2-digit years).
Unless 12 really means 12AD, you need to try both.
DF <- data.frame(
START = c(
"03/01/2012 19:29",
"02/24/12 00:29"
),
stringsAsFactors = FALSE
)
coalesce <- function (x, ...) {
z <- class(x)
for (y in list(...)) {
x <- ifelse(is.na(x), y, x)
}
class(x) <- z
x
}
DF$START <- coalesce(
as.POSIXct(strptime(DF$START, format="%m/%d/%y %H:%M")),
as.POSIXct(strptime(DF$START, format="%m/%d/%Y %H:%M"))
)
# START
# 1 2012-03-01 19:29:00
# 2 2012-02-24 00:29:00
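An alternative sketch is to decide the format per row from the number of year digits, so each string is parsed with the pattern that fits it. The variable names are made up for illustration, and UTC is an assumption; use the time zone of your data.

```r
x <- c("03/01/2012 19:29", "02/24/12 00:29")

# TRUE where the year is written with four digits
four_digit <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4} ", x)

# Parse everything as 2-digit years first (4-digit rows become NA) ...
parsed <- as.POSIXct(x, format = "%m/%d/%y %H:%M", tz = "UTC")
# ... then overwrite the 4-digit rows with the matching format
parsed[four_digit] <- as.POSIXct(x[four_digit], format = "%m/%d/%Y %H:%M", tz = "UTC")

parsed
# [1] "2012-03-01 19:29:00 UTC" "2012-02-24 00:29:00 UTC"
```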
Try to use this:
> DF$START <- as.POSIXct(strptime(paste(DF$START),format="%m/%d/%Y %H:%M"))
%Y matches the year with century (four digits).
