parse multiple date formats by index in R - r

I'm trying to parse multiple date formats based on their position in a vector of dates. At some the data switched the format it used from y/m/d to y/d/m. This is annoying for dates like 2010/07/03 where specifying the order in lubridate .
This is an example of dates
datevec <- c("2011/07/01", "2011/07/02", "2011/07/03", "2011/02/07" )
The dates are set up so before a certain row the dates are one format and after another row the dates are another format, so I'm trying to provide an index to the function
when I tried to parse them using this plus lubridate it only returned 3 dates.
lapply(datevec, function(x, i) ifelse( x[i] <4, parse_date_time(x, "%Y-%m-%d"), parse_date_time(x,"%Y-%d-%m" )) )

1) If we changed the ifelse in the question to a plain if then the basic idea in the question works with appropriate modifications. Note that it gives a list L so assuming we really want a vector we add the last line of code.
f <- function(x, i) if (i < 4)
parse_date_time(x, "ymd") else parse_date_time(x, "ydm")
L <- Map(f, datevec, seq_along(datevec), USE.NAMES = FALSE)
do.call("c", L)
## [1] "2011-07-01 UTC" "2011-07-02 UTC" "2011-07-03 UTC" "2011-02-07 UTC"
2) Use the ifelse on the format part rather than on the date part and use as.Date instead of parse_date_time:
ix <- seq_along(datevec)
as.Date(datevec, ifelse(ix < 4, "%Y/%m/%d", "%Y/%d/%m"))
## [1] "2011-07-01" "2011-07-02" "2011-07-03" "2011-07-02"
3) Convert the first 3 using ymd and the rest using ydm and then concatenate.
c(ymd(head(datevec, 3)), ydm(tail(datevec, -3)))
## [1] "2011-07-01" "2011-07-02" "2011-07-03" "2011-07-02"
4) or with only base R:
c(as.Date(head(datevec, 3)), as.Date(tail(datevec, -3), "%Y/%d/%m"))
## [1] "2011-07-01" "2011-07-02" "2011-07-03" "2011-07-02"
5) Another approach is to convert the later dates using string manipulation so that all the dates are in the same format and then use as.Date or ymd:
ix <- seq_along(datevec)
swap <- sub("(..)/(..)$", "\\2/\\1", datevec)
as.Date(ifelse(ix < 4, datevec, swap))
## [1] "2011-07-01" "2011-07-02" "2011-07-03" "2011-07-02"
6) The above codes return Date class, which is more appropriate for dates without times but if for some reason you really need POSIXct use as.POSIXct on the above or else use parse_date_time like this:
c(parse_date_time(head(datevec, 3), "ymd"), parse_date_time(tail(datevec, -3), "ydm"))
## [1] "2011-07-01 UTC" "2011-07-02 UTC" "2011-07-03 UTC" "2011-07-02 UTC"

Related

R reading in short date instead of long date [duplicate]

If a date vector has two-digit years, mdy() turns years between 00 and 68 into 21st Century years and years between 69 and 99 into 20th Century years. For example:
library(lubridate)
mdy(c("1/2/54","1/2/68","1/2/69","1/2/99","1/2/04"))
gives the following output:
Multiple format matches with 5 successes: %m/%d/%y, %m/%d/%Y.
Using date format %m/%d/%y.
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC" "2004-01-02 UTC"
I can fix this after the fact by subtracting 100 from the incorrect dates to turn 2054 and 2068 into 1954 and 1968. But is there a more elegant and less error-prone method of parsing two-digit dates so that they get handled correctly in the parsing process itself?
Update: After #JoshuaUlrich pointed me to strptime I found this question, which deals with an issue similar to mine, but using base R.
It seems like a nice addition to date handling in R would be some way to handle century selection cutoffs for two-digit dates within the date parsing functions.
Here is a function that allows you to do this:
library(lubridate)
x <- mdy(c("1/2/54","1/2/68","1/2/69","1/2/99","1/2/04"))
foo <- function(x, year=1968){
m <- year(x) %% 100
year(x) <- ifelse(m > year %% 100, 1900+m, 2000+m)
x
}
Try it out:
x
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"
foo(x)
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"
foo(x, 1950)
[1] "1954-01-02 UTC" "1968-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"
The bit of magic here is to use the modulus operator %% to return the fraction part of a division. So 1968 %% 100 yields 68.
I just experienced this exact same bug / feature.
I ended up writing the following two quick functions to help convert from excel-type dates (which is where i get this most) to something R can use.
There's nothing wrong with the accepted answer -- it's just that i prefer not to load up on packages too much.
First, a helper to split and replace the years ...
year1900 <- function(dd_y, yrFlip = 50)
{
dd_y <- as.numeric(dd_y)
dd_y[dd_y > yrFlip] <- dd_y[dd_y > yrFlip] + 1900
dd_y[dd_y < yrFlip] <- dd_y[dd_y < yrFlip] + 2000
return(dd_y)
}
which is used by a function that 'fixes' your excel dates, depending on type:
XLdate <- function(Xd, type = 'b-Y')
{
switch(type,
'b-Y' = as.Date(paste0(substr(Xd, 5, 9), "-", substr(Xd, 1, 3), "-01"), format = "%Y-%b-%d"),
'b-y' = as.Date(paste0(year1900(substr(Xd, 5, 6)), "-", substr(Xd, 1, 3), "-01"),
format = "%Y-%b-%d"),
'Y-b' = as.Date(paste0(substr(Xd, 1, 3), "-", substr(Xd, 5, 9), "-01"), format = "%Y-%b-%d")
)
}
Hope this helps.
Another option would be:
xxx <- c("01-Jan-54","01-Feb-68","01-Aug-69","01-May-99","01-Jun-04", "
31-Dec-68","01-Jan-69", "31-Dec-99")
.
dmy(paste0(sub("\\d\\d$","",xxx) , ifelse( (tt <-
sub("\\d\\d-\\D\\D\\D-","",xxx) ) > 20 ,paste0("19",tt),paste0("20",tt))))
Though no solution is elegant nor short.
I think it would be better if lubridate just added an option to specify the cutoff date.

Extract two combined dates and times from a string

Could you please let me know how I can extract date and time from ("2015-08-11 03:14:00 UTC--2015-08-11 04:14:00 UTC"). Note that this string contains a time interval with two dates and two times. I would like to break it down into 4 individual strings such as Date 1, Time 1, Date 2, Time 2 and then store them in 4 separate vectors.
Thanks.
Try the following.
x <- "2015-08-11 03:14:00 UTC--2015-08-11 04:14:00 UTC"
y <- strsplit(x, "--")[[1]]
dates <- as.Date(y)
times <- strftime(y, format = "%H:%M:%S")
You never mentioned whether you need functional dates and times from your input string. If you need to simply parse each portion of your timestamp then using gsub is one option.
x <- "2015-08-11 03:14:00 UTC--2015-08-11 04:14:00 UTC"
y <- unlist(strsplit(x, "--"))
dates <- sapply(y, function(x) gsub("(\\d{4}-\\d{2}-\\d{2}).*", "\\1", x))
times <- sapply(y, function(x) gsub(".*(\\d{2}:\\d{2}:\\d{2}.*)", "\\1", x))
dates
[1] "2015-08-11" "2015-08-11"
times
[1] "03:14:00 UTC" "04:14:00 UTC"
Demo here:
Rextester

Most efficient way to have R recognize " 41520092010" as "04/15/2009 20:10"

I need to find the duration of a large number of events by using the start and end time variables in a dataset, but both the variables encode the time in the annoying format "mmddyyyyhhmm," with the cherry on top being that the first nine months are encoded as single digits (January is " 1" rather than "01"). At least the time uses a twenty-four clock (assuming the people filling out each event did it right).
I know there has to be a fairly simple way to do this, but I can't think of one and suspect one of you fine folks have it memorized and can write it out in a couple of seconds.
One possibility consists in using the stringr library in combination with the lubridate library:
DatesAndTimes <- c("41520092010","121520092010")
library(stringr)
library(lubridate)
mdy_hm(str_pad(DatesAndTimes, 12, side="left", pad="0"))
#[1] "2009-04-15 20:10:00 UTC" "2009-12-15 20:10:00 UTC"
If you have a vector x with character values for conversion ...
x <- c("41520092010", "11520092010", "121520092010")
... you can check this vector for 11 characters (or whatever). If an element has 11 characters, we paste a zero on the front, then convert the whole vector to POSIXt.
as.POSIXct(
ifelse(nchar(x) == 11, paste0("0", x), x),
format = "%m%d%Y%H%M",
tz = "UTC"
)
# [1] "2009-04-15 20:10:00 UTC" "2009-01-15 20:10:00 UTC"
# [3] "2009-12-15 20:10:00 UTC"
If you don't like ifelse(), you can use replace().
replace(x, nchar(x) == 11, paste0("0", x[nchar(x) == 11]))
or formatC()
formatC(as.numeric(x), digits = 12, width = 12, flag = "0")
The most efficient of these is likely formatC().

Parse "next day" time

Is there a function (built-in or packaged) that would allow parsing a time like "25:15:00" as "1:15 on the next day"? Unfortunately, as.POSIXct doesn't like it with the %X specification (equivalent to %H:%M:%S),
> as.POSIXct('25:15:00', format='%X')
[1] NA
> as.POSIXct('15:15:00', format='%X')
[1] "2013-05-24 15:15:00 CEST"
and I can't find a suitable conversion specification in the strptime docs.
Not thoroughly tested but you can try this function
parse_time <- function(x, format = "%X") {
hour <- as.numeric(substr(x, 1, 2))
delta <- ifelse(hour >= 24, 24 * 3600, 0)
hour <- hour %% 24
date <- paste0(hour, substr(x, 3, nchar(x)))
strptime(date, format = format) + delta
}
parse_time(c('25:15:00', "23:10:00"))
##[1] "2013-05-25 01:15:00 GMT" "2013-05-24 23:10:00 GMT"
Now there is:
library(devtools)
install_github('kimisc', 'krlmlr')
library(kimisc)
hms.to.seconds('25:15:00')
It uses a slightly different approach than dickoa's code: The argument is filtered by gsub using a suitable regular expression, and the actual conversion doesn't involve strptime at all. See the code.

Is there a more elegant way to convert two-digit years to four-digit years with lubridate?

If a date vector has two-digit years, mdy() turns years between 00 and 68 into 21st Century years and years between 69 and 99 into 20th Century years. For example:
library(lubridate)
mdy(c("1/2/54","1/2/68","1/2/69","1/2/99","1/2/04"))
gives the following output:
Multiple format matches with 5 successes: %m/%d/%y, %m/%d/%Y.
Using date format %m/%d/%y.
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC" "2004-01-02 UTC"
I can fix this after the fact by subtracting 100 from the incorrect dates to turn 2054 and 2068 into 1954 and 1968. But is there a more elegant and less error-prone method of parsing two-digit dates so that they get handled correctly in the parsing process itself?
Update: After #JoshuaUlrich pointed me to strptime I found this question, which deals with an issue similar to mine, but using base R.
It seems like a nice addition to date handling in R would be some way to handle century selection cutoffs for two-digit dates within the date parsing functions.
Here is a function that allows you to do this:
library(lubridate)
x <- mdy(c("1/2/54","1/2/68","1/2/69","1/2/99","1/2/04"))
foo <- function(x, year=1968){
m <- year(x) %% 100
year(x) <- ifelse(m > year %% 100, 1900+m, 2000+m)
x
}
Try it out:
x
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"
foo(x)
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"
foo(x, 1950)
[1] "1954-01-02 UTC" "1968-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"
The bit of magic here is to use the modulus operator %% to return the fraction part of a division. So 1968 %% 100 yields 68.
I just experienced this exact same bug / feature.
I ended up writing the following two quick functions to help convert from excel-type dates (which is where i get this most) to something R can use.
There's nothing wrong with the accepted answer -- it's just that i prefer not to load up on packages too much.
First, a helper to split and replace the years ...
year1900 <- function(dd_y, yrFlip = 50)
{
dd_y <- as.numeric(dd_y)
dd_y[dd_y > yrFlip] <- dd_y[dd_y > yrFlip] + 1900
dd_y[dd_y < yrFlip] <- dd_y[dd_y < yrFlip] + 2000
return(dd_y)
}
which is used by a function that 'fixes' your excel dates, depending on type:
XLdate <- function(Xd, type = 'b-Y')
{
switch(type,
'b-Y' = as.Date(paste0(substr(Xd, 5, 9), "-", substr(Xd, 1, 3), "-01"), format = "%Y-%b-%d"),
'b-y' = as.Date(paste0(year1900(substr(Xd, 5, 6)), "-", substr(Xd, 1, 3), "-01"),
format = "%Y-%b-%d"),
'Y-b' = as.Date(paste0(substr(Xd, 1, 3), "-", substr(Xd, 5, 9), "-01"), format = "%Y-%b-%d")
)
}
Hope this helps.
Another option would be:
xxx <- c("01-Jan-54","01-Feb-68","01-Aug-69","01-May-99","01-Jun-04", "
31-Dec-68","01-Jan-69", "31-Dec-99")
.
dmy(paste0(sub("\\d\\d$","",xxx) , ifelse( (tt <-
sub("\\d\\d-\\D\\D\\D-","",xxx) ) > 20 ,paste0("19",tt),paste0("20",tt))))
Though no solution is elegant nor short.
I think it would be better if lubridate just added an option to specify the cutoff date.

Resources