Extract POSIXct information from large vector - r

I have a large POSIXct vector v2 with 438000 elements, created as follows:
t.start <- as.POSIXct("2016-08-16 15:00:00 CEST")
v1 <- seq(from = t.start, length.out = 2920, by = "3 hours")
v2 <- rep(v1, each = 150)
From v2, I would like to extract the 12 elements that mark the first appearance of the first day of each month. Specifically, I am looking for:
The numeric position in v2 these 12 elements have
The actual date of these elements in %d %b format, e.g. "01 Sep"
These two things should be extracted separately, i.e. stored in two different vectors afterwards. I think v1 and v2 contain daylight saving POSIXct elements but that should not affect the general operation. Any hint on how I can bypass the daylight savings would be a nice little add-on!
Any idea on how to do that?

We can start by extracting the day number from each element with format(v2, "%d"). Then, to determine where the first days of the month are, we can compare that to "01". Next we take the diff() of that logical vector, remembering to concatenate 0L out front to account for the missing first element. Wrap that in which(), and you have the indices of the first element of each first day.
w <- which(c(0L, diff(format(v2, "%d") == "01")) == 1L)
w
# [1] 18451 54451 91651 127801 165001 202201 235801 272851
# [9] 308851 346051 382051 419251
Now w holds the locations of the 12 elements we need. Let's take a look at those elements of v2, just to confirm we've got it right.
v2[w]
# [1] "2016-09-01 00:00:00 PDT" "2016-10-01 00:00:00 PDT"
# [3] "2016-11-01 00:00:00 PDT" "2016-12-01 02:00:00 PST"
# [5] "2017-01-01 02:00:00 PST" "2017-02-01 02:00:00 PST"
# [7] "2017-03-01 02:00:00 PST" "2017-04-01 00:00:00 PDT"
# [9] "2017-05-01 00:00:00 PDT" "2017-06-01 00:00:00 PDT"
# [11] "2017-07-01 00:00:00 PDT" "2017-08-01 00:00:00 PDT"
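Looks good. Note that we've got some 2am entries there, which is expected: once Daylight Saving Time ends, the 3-hour steps land on clock times shifted by one hour, so the first element on those days falls at 02:00 rather than 00:00. Now let's get to your desired format ...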
format(v2[w], "%d %b")
# [1] "01 Sep" "01 Oct" "01 Nov" "01 Dec" "01 Jan" "01 Feb"
# [7] "01 Mar" "01 Apr" "01 May" "01 Jun" "01 Jul" "01 Aug"
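On the daylight saving add-on from the question: one option, sketched here under the assumption that the timestamps can be treated as UTC (which has no daylight saving shifts), is to build the sequence with an explicit tz = "UTC". The names is_first, w_utc and labels are illustrative and not from the question.

```r
# Build the same vectors, but pinned to UTC (assumption: UTC is acceptable for the data)
t.start <- as.POSIXct("2016-08-16 15:00:00", tz = "UTC")
v1 <- seq(from = t.start, length.out = 2920, by = "3 hours")
v2 <- rep(v1, each = 150)

# TRUE wherever the day of the month is 01
is_first <- format(v2, "%d") == "01"

# Positions where is_first switches from FALSE to TRUE
w_utc <- which(c(0L, diff(is_first)) == 1L)

# Labels such as "01 Sep" (month abbreviations depend on the locale)
labels <- format(v2[w_utc], "%d %b")
```

With the vector in UTC, every selected element falls at 00:00, so the 2am entries no longer appear.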

Related

Working with difficult AM/PM formats and REGEX with lubridate in R

Hello guys, I hope everyone is having a good one. I am trying to work with some AM/PM formats in lubridate in R, but I can't seem to come up with a proper solution. I hope you can correct me and help me out, please!
I have a HUGE dataset that has date-times in a very unusual format, which goes as follows:
First a number that represents the day, then an abbreviation of the month OR the month fully spelled out, then a 12-hour time, and finally the strings "a. m." OR "p. m.", possibly with extra spaces or missing dots, such as "a. m". As an example, please take a look at this vector:
dates<-c("02 dec 05:47 a. m",
"7 November 09:47 p. m.",
"3 jul 12:28 a.m.",
"23 sept 08:53 a m.",
"7 may 09:05 PM")
These make up more than 95% of the unusual date-time formats in the data set. I have been trying to use lubridate in R, specifically the function
ydm_hm(paste(2021,dates))
because all dates are from 2021, but I always get:
[1] NA NA NA
[4] NA "2021-05-07 21:05:00 UTC"
Warning message:
4 failed to parse.
The 4 that fail to parse give me NAs, and the only one that parses is correct. I notice that this one has PM or AM as uppercase letters without dots, but most of the time my formats will be like this:
ydm_hm("7 may 09:05 p.m.")
and this gives me NAs...
So I feel the only way to get these dates to work is to change their structure using regex, converting all "a. m." variants into "AM" and all "p. m." variants into "PM". After analyzing the data I realized that all "p.m" or "a. m." strings come ONE or TWO spaces after the 12-hour time, which always has a length of 5 characters, so the regex pattern should account for the following:
The string begins with one or two digits, then spaces, then letters (the month, abbreviated or fully spelled out), then spaces, then 5 characters (the 12-hour time), and then letters, spaces and dots covering all possible a.m. and p.m. formats. I have tried with no luck to convert the structure of the dates. If you could help me I would be so thankful. I don't know if there is another way, or another package in R, that would resolve this without regex, so thank you everyone for your help!
My desired output is:
"2021-12-02 05:47:00 UTC"
"2021-11-07 09:47:00 UTC"
"2021-07-03 12:28:00 UTC"
"2021-09-23 08:53:00 UTC"
"2021-05-07 21:05:00 UTC"
In this case, parse_date from parsedate works
library(parsedate)
parse_date(paste(2021, dates))
Output
[1] "2021-12-02 05:47:00 UTC"
[2] "2021-11-07 09:47:00 UTC"
[3] "2021-07-03 12:28:00 UTC"
[4] "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
Or, if the second value should be PM, use str_remove_all to strip the dots and spaces between the letters of the am/pm marker
library(stringr)
parse_date(paste(2021, str_remove_all(dates,
"(?<=[A-Za-z])[. ]+(?=[A-Za-z])")))
[1] "2021-12-02 05:47:00 UTC"
[2] "2021-11-07 21:47:00 UTC"
[3] "2021-07-03 00:28:00 UTC"
[4] "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
With ydm_hm, the issue is that some of the am/pm markers contain spaces or dots between the letters, and these do not get parsed. We can normalize the strings by removing those characters
library(lubridate)
library(stringr)
ydm_hm(paste(2021, str_remove_all(dates,
"(?<=[A-Za-z])[. ]+(?=[A-Za-z])")))
[1] "2021-12-02 05:47:00 UTC"
[2] "2021-11-07 21:47:00 UTC"
[3] "2021-07-03 00:28:00 UTC"
[4] "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
Since you raised the issue of regular expression, I thought I might try one way to do that
library(stringr)
library(lubridate)
# get boolean for pm dates
pm = str_detect(dates, "(?<=\\d\\d:\\d\\d\\s{1,2})[pP]")
# convert dates to dates without am/pm
dates = str_extract(dates, "^.*:\\d\\d")
# add pm back to pm dates and am to am dates
dates[pm] <- paste(dates[pm], "PM")
dates[!pm] <- paste(dates[!pm], "AM")
# now your original approach works
ydm_hm(paste(2021, dates))
Output
[1] "2021-12-02 05:47:00 UTC" "2021-11-07 21:47:00 UTC" "2021-07-03 00:28:00 UTC" "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
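The same normalization idea can also be sketched in base R with a single sub() call. This is only an illustration that assumes the am/pm marker always sits at the end of the string; the name normalized is made up here. The \\U in the replacement upper-cases the captured letter and requires perl = TRUE.

```r
dates <- c("02 dec 05:47 a. m",
           "7 November 09:47 p. m.",
           "3 jul 12:28 a.m.",
           "23 sept 08:53 a m.",
           "7 may 09:05 PM")

# Match optional spaces, an "a" or "p", any dots/spaces, an "m",
# and any trailing dots/spaces at the end of the string;
# replace all of that with a single space plus "AM" or "PM".
normalized <- sub("\\s*([ap])[. ]*m[. ]*$", " \\U\\1M", dates,
                  ignore.case = TRUE, perl = TRUE)
normalized
```

The cleaned strings can then be handed to ydm_hm(paste(2021, normalized)) in the same way as in the answers above.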

Can I apply a function over a vector using base tryCatch?

I'm trying to parse dates (using lubridate functions) from a vector which has mixed date formats.
departureDate <- c("Aug 17, 2020 12:00:00 AM", "Nov 19, 2019 12:00:00 AM", "Dec 21, 2020 12:00:00 AM",
"Dec 24, 2020 12:00:00 AM", "Dec 24, 2020 12:00:00 AM", "Apr 19, 2020 12:00:00 AM", "28/06/2019",
"16/08/2019", "04/02/2019", "10/04/2019", "28/07/2019", "26/07/2019",
"Jun 22, 2020 12:00:00 AM", "Apr 5, 2020 12:00:00 AM", "May 1, 2021 12:00:00 AM")
Not noticing this at first, I tried to parse with lubridate::mdy_hms(departureDate), which resulted in NA values for the dates whose format differs from that of the parser.
As the format may change at random positions in the vector, I tried to use the following statement:
departureDate <- tryCatch(mdy_hms(departureDate),
warning = function(w){return(dmy(departureDate))})
Which brought even more NA's as it only applied the warning function call. Is there a way to solve this by using my approach?
Thanks in advance
We can use lubridate::parse_date_time which can take multiple formats.
lubridate::parse_date_time(departureDate, c('%b %d, %Y %I:%M:%S %p', '%d/%m/%Y'))
#[1] "2020-08-17 UTC" "2019-11-19 UTC" "2020-12-21 UTC" "2020-12-24 UTC"
#[5] "2020-12-24 UTC" "2020-04-19 UTC" "2019-06-28 UTC" "2019-08-16 UTC"
#[9] "2019-02-04 UTC" "2019-04-10 UTC" "2019-07-28 UTC" "2019-07-26 UTC"
#[13] "2020-06-22 UTC" "2020-04-05 UTC" "2021-05-01 UTC"
Since the month names in departureDate are in English, the locale needs to be English as well.
Refer to "How to change the locale of R?" if you have a non-English locale.
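As for why the tryCatch() attempt in the question returned even more NAs: tryCatch() wraps the whole vectorised call, so a single warning abandons the entire result and the handler's value replaces it for every element. A minimal base R sketch of that behaviour, using as.numeric() as a stand-in for the lubridate parsers (the vector x and the fallback value -1 are made up for illustration):

```r
x <- c("1", "two", "3")

# One warning anywhere discards the whole result:
# the handler's return value is used for all elements.
whole <- tryCatch(as.numeric(x), warning = function(w) rep(-1, length(x)))

# Wrapping each element separately keeps the values that parsed cleanly.
each <- vapply(
  x,
  function(s) tryCatch(as.numeric(s), warning = function(w) -1),
  numeric(1),
  USE.NAMES = FALSE
)
```

Element-wise handling works, but it gives up vectorisation and is slow on long vectors, which is why the vectorised approaches in the answers are usually preferable.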
The ideal situation is that the code should be able to deal with every format on its own, without letting it fall to an exception.
Another issue to take into account is that the mdy_hms() function returns dates in the POSIXct data type, whereas dmy() returns the Date type, so they wouldn't mix well together.
The code below applies mdy_hms(), then converts it to Date. It then tests for NA's and applies the second function dmy() on the missing values. More rules can be added in the pipeline at will if more formats are to be recognized.
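library(dplyr)
library(lubridate)
dates.converted <-
  mdy_hms(departureDate) %>%                      # NA (with a warning) where the format differs
  as.Date() %>%
  ifelse(!is.na(.), ., dmy(departureDate)) %>%    # fall back to dmy() for the NAs
  structure(class = "Date")                       # ifelse() drops the class, so restore it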
print(dates.converted)
Output
[1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28" "2019-08-16"
[9] "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05" "2021-05-01"
One method would be to iterate through a list of candidate formats and apply it only to dates not previously parsed correctly.
fmts <- c("%b %d, %Y %I:%M:%S %p", "%d/%m/%Y")  # %I (12-hour clock) pairs with %p; with %H the AM/PM marker is not applied
dates <- rep(Sys.time()[NA], length(departureDate))
for (fmt in fmts) {
  isna <- is.na(dates)
  if (!any(isna)) break
  dates[isna] <- as.POSIXct(departureDate[isna], format = fmt)
}
dates
# [1] "2020-08-17 PDT" "2019-11-19 PST" "2020-12-21 PST"
# [4] "2020-12-24 PST" "2020-12-24 PST" "2020-04-19 PDT"
# [7] "2019-06-28 PDT" "2019-08-16 PDT" "2019-02-04 PST"
# [10] "2019-04-10 PDT" "2019-07-28 PDT" "2019-07-26 PDT"
# [13] "2020-06-22 PDT" "2020-04-05 PDT" "2021-05-01 PDT"
as.Date(dates)
# [1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28"
# [8] "2019-08-16" "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05"
# [15] "2021-05-01"
I encourage you to put the most-likely formats first in the fmts vector.
The way this is set up, as soon as all elements are correctly found, no further formats are attempted (i.e., break).
Edit: if there is a difference in LOCALE where AM/PM are not locally recognized, then one method would be to first remove them from the strings:
departureDate <- gsub("\\s[AP]M$", "", departureDate)
departureDate
# [1] "Aug 17, 2020 12:00:00" "Nov 19, 2019 12:00:00" "Dec 21, 2020 12:00:00"
# [4] "Dec 24, 2020 12:00:00" "Dec 24, 2020 12:00:00" "Apr 19, 2020 12:00:00"
# [7] "28/06/2019" "16/08/2019" "04/02/2019"
# [10] "10/04/2019" "28/07/2019" "26/07/2019"
# [13] "Jun 22, 2020 12:00:00" "Apr 5, 2020 12:00:00" "May 1, 2021 12:00:00"
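and then use a simpler format (this discards the AM/PM information, so the times are no longer reliable, but the dates are unaffected):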
fmts <- c("%b %d, %Y %H:%M:%S", "%d/%m/%Y")

Floor datetime with custom start time (lubridate)

Is there a way to floor dates using a custom start time instead of the earliest possible time?
For example, flooring hours in a day into 2 12-hour intervals starting at 8am and 8pm rather than 12am and 12pm.
Example:
x <- ymd_hms("2009-08-03 21:00:00")
y <- ymd_hms("2009-08-03 09:00:00")
floor_date(x, '12 hours')
floor_date(y, '12 hours')
# default lubridate output:
[1] "2009-08-03 12:00:00 UTC"
[1] "2009-08-03 UTC"
# what i would like to have:
[1] "2009-08-03 20:00:00 UTC"
[1] "2009-08-03 08:00:00 UTC"
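You could program a small switch (without lubridate, though). Note that it picks whichever of 08:00 and 20:00 is nearest on the same date, which matches the two examples but is not a strict floor for every hour (15:00 would map to 20:00).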
FUN <- function(x) {
s <- switch(which.min(abs(mapply(`-`, c(8, 20), as.numeric(substr(x, 12, 13))))),
"08:00:00", "20:00:00")
as.POSIXct(paste(as.Date(x), s))
}
FUN("2009-08-03 21:00:00")
# [1] "2009-08-03 20:00:00 CEST"
FUN("2009-08-03 09:00:00")
# [1] "2009-08-03 08:00:00 CEST"
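A strict floor with a custom start can be sketched by shifting the times so that the desired boundary lands on midnight, flooring, and shifting back. The helper below is a hypothetical base R illustration (floor_with_offset and its parameters are not from lubridate) and assumes POSIXct input with bins aligned in UTC.

```r
# Floor date-times to fixed-width bins whose boundaries are shifted by an offset.
# Assumes x is POSIXct and that the bins should be aligned in UTC.
floor_with_offset <- function(x, period_secs = 12 * 3600, offset_secs = 8 * 3600) {
  secs <- as.numeric(x)                 # seconds since 1970-01-01 00:00:00 UTC
  shifted <- secs - offset_secs         # move the 08:00 boundary to 00:00
  floored <- floor(shifted / period_secs) * period_secs
  as.POSIXct(floored + offset_secs, origin = "1970-01-01", tz = "UTC")
}

x <- as.POSIXct("2009-08-03 21:00:00", tz = "UTC")
y <- as.POSIXct("2009-08-03 09:00:00", tz = "UTC")
z <- as.POSIXct("2009-08-03 03:00:00", tz = "UTC")

format(floor_with_offset(x), "%Y-%m-%d %H:%M:%S")  # "2009-08-03 20:00:00"
format(floor_with_offset(y), "%Y-%m-%d %H:%M:%S")  # "2009-08-03 08:00:00"
format(floor_with_offset(z), "%Y-%m-%d %H:%M:%S")  # "2009-08-02 20:00:00"
```

With lubridate loaded, the same shift-floor-shift idea can be written as floor_date(x - hours(8), "12 hours") + hours(8), which should behave the same way for UTC times.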

Best way to deal with differing date data [duplicate]

I am trying to do some simple operations in R. After loading a table I encountered a date column which has several formats combined.
**Date**
1/28/14 6:43 PM
1/29/14 4:10 PM
1/30/14 12:09 PM
1/30/14 12:12 PM
02-03-14 19:49
02-03-14 20:03
02-05-14 14:33
I need to convert this to format like 28-01-2014 18:43 i.e. %d-%m-%y %h:%m
I tried this
tablename$Date <- as.Date(as.character(tablename$Date), "%d-%m-%y %h:%m")
but this fills the entire column with NA. Please help me get this right!
The lubridate package makes quick work of this:
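library(lubridate)
# assuming the raw strings live in tablename$Date, as in the question
dates <- as.character(tablename$Date)
d <- parse_date_time(dates, names(guess_formats(dates, c("mdy HM", "mdy IMp"))))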
d
## [1] "2014-01-28 18:43:00 UTC" "2014-01-29 16:10:00 UTC"
## [3] "2014-01-30 12:09:00 UTC" "2014-01-30 12:12:00 UTC"
## [5] "2014-02-03 19:49:00 UTC" "2014-02-03 20:03:00 UTC"
## [7] "2014-02-05 14:33:00 UTC"
# put in desired format (day-month-year, 24-hour clock, as asked in the question)
format(d, "%d-%m-%Y %H:%M")
## [1] "28-01-2014 18:43" "29-01-2014 16:10" "30-01-2014 12:09"
## [4] "30-01-2014 12:12" "03-02-2014 19:49" "03-02-2014 20:03"
## [7] "05-02-2014 14:33"
You'll need to adjust the vector in guess_formats if you come across other format variations.
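On why the original attempt produced only NA: in strptime-style formats %h is an abbreviated month name and %m is the month number, while hours and minutes are %H and %M; the format also has to match the separators actually present in the strings, and as.Date() drops the time of day. A base R sketch of a two-pass parse, assuming (like the lubridate answer) that both layouts are month-day-year, and that the session locale recognises AM/PM for %p; the names parsed, todo and out are illustrative:

```r
dates <- c("1/28/14 6:43 PM", "1/30/14 12:09 PM", "02-03-14 19:49", "02-05-14 14:33")

# First pass: month/day/year with a 12-hour clock and AM/PM marker
parsed <- as.POSIXct(dates, format = "%m/%d/%y %I:%M %p", tz = "UTC")

# Second pass: only for the entries the first format could not read
todo <- is.na(parsed)
parsed[todo] <- as.POSIXct(dates[todo], format = "%m-%d-%y %H:%M", tz = "UTC")

# Day-month-year output with a 24-hour clock
out <- format(parsed, "%d-%m-%Y %H:%M")
out
```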

Changing dates in different time zones by adding to POSIXlt

I am running into an error when I try to localize times for "date" (a variable of class=POSIXlt) in my dataset. Example code is as follows:
# All dates are coded by survey software in EST(not local time)
date <- c("2011-07-26 07:23", "2011-07-29 07:34", "2011-07-29 07:40")
region <-c("USA-EST", "UK", "Singapore")
#Change the times based on time-zone differences
start_time<-strptime(date,"%Y-%m-%d %h:%m")
localtime=as.POSIXlt(start_time)
localtime<-ifelse(region=="UK",start_time+6,start_time)
localtime<-ifelse(region=="Singapore",start_time+12,start_time)
#Then, I need to extract the hour and weekday
weekday<-weekdays(localtime)
hour<-factor(localtime)
There must be something wrong with my "ifelse" statement, because I get the error: number of items to replace is not a multiple of replacement length. Please help!
How about using R's native time code? The trick is that you can't have more than one time-zone in a POSIX vector, so use a list instead:
region <- c("EST","Europe/London","Asia/Singapore")
(localtime <- lapply(seq(date),function(x) as.POSIXlt(date[x],tz=region[x])))
[[1]]
[1] "2011-07-26 07:23:00 EST"
[[2]]
[1] "2011-07-29 07:34:00 Europe/London"
[[3]]
[1] "2011-07-29 07:40:00 Asia/Singapore"
And to convert to a vector in a single timezone:
Reduce("c",localtime)
[1] "2011-07-26 13:23:00 BST" "2011-07-29 07:34:00 BST"
[3] "2011-07-29 00:40:00 BST"
Note that my system timezone is BST, but if yours is EST it will convert to that.
You can use the timezone handling built into POSIXct:
> start_time <- as.POSIXct(date, format = "%Y-%m-%d %H:%M", tz = "America/New_York")
> start_time
[1] "2011-07-26 07:23:00 EDT" "2011-07-29 07:34:00 EDT" "2011-07-29 07:40:00 EDT"
> format(start_time, tz="Europe/London", usetz=TRUE)
[1] "2011-07-26 12:23:00 BST" "2011-07-29 12:34:00 BST" "2011-07-29 12:40:00 BST"
> format(start_time, tz="Asia/Singapore", usetz=TRUE)
[1] "2011-07-26 19:23:00 SGT" "2011-07-29 19:34:00 SGT" "2011-07-29 19:40:00 SGT"
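To get the local hour per row, as the question asks, one possible sketch is to parse everything once in the survey's time zone and then format each element in the time zone of its region. This avoids ifelse(), which drops the date-time class, and avoids adding 6 or 12 directly, since arithmetic on POSIXct values is in seconds rather than hours. The vector tz_by_region and the name local_hour are assumptions introduced for illustration.

```r
date <- c("2011-07-26 07:23", "2011-07-29 07:34", "2011-07-29 07:40")

# Assumed mapping from each row's region to an Olson time zone name
tz_by_region <- c("America/New_York", "Europe/London", "Asia/Singapore")

# Parse once, in the time zone the survey software used
start_time <- as.POSIXct(date, format = "%Y-%m-%d %H:%M", tz = "America/New_York")

# Format each element in its own region's time zone and keep the hour
local_hour <- vapply(
  seq_along(start_time),
  function(i) format(start_time[i], "%H", tz = tz_by_region[i]),
  character(1)
)
local_hour
```

Swapping "%H" for "%A" would give the local weekday name instead (in the session's locale).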
