I have the following character vector of date/time strings. I want to convert it to a proper date-time class, and I tried the methods below: as.Date() and as.POSIXlt().
time <- c("Oct 01,2015 15:38:31 ", "Oct 05,2015 11:07:14", "Oct 11,2015 14:15:51 ", "Oct 11,2015 14:19:53 ", "Oct 12,2015 11:23:28", "Oct 19,2015 16:32:51 ")
#as.Date() is skipping the time part
time_1<-as.Date(time,"%b %d,%Y %H:%M:%S")
time_1
[1] "2015-10-01" "2015-10-05" "2015-10-11" "2015-10-11" "2015-10-12" "2015-10-19"
#as.POSIXlt() is throwing an error
time_2<-as.POSIXlt(time,"%b %d,%Y %H:%M:%S")
The as.Date() function drops the time part, and as.POSIXlt() throws an error.
How do I convert the above strings to a proper date + time format?
We can specify the format in both as.POSIXlt and as.POSIXct by naming the format argument:
as.POSIXlt(time,format="%b %d,%Y %H:%M:%S")
#[1] "2015-10-01 15:38:31 IST" "2015-10-05 11:07:14 IST"
#[3] "2015-10-11 14:15:51 IST" "2015-10-11 14:19:53 IST"
#[5] "2015-10-12 11:23:28 IST" "2015-10-19 16:32:51 IST"
When the format string is passed positionally instead of by name, the function takes "%b %d,%Y %H:%M:%S" to be the time zone (tz), because tz is the second argument in the usage shown in ?as.POSIXct:
as.POSIXct(x, tz = "", ...)
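A minimal sketch of the same point, passing both format and tz by name so neither can be mistaken for the other. It assumes an English (or C) LC_TIME locale so that %b matches "Oct"; the tz value "UTC" and the trimws() call are illustrative choices, not part of the original answer.

```r
# Assumes an English (or C) LC_TIME locale so that %b matches "Oct".
time <- c("Oct 01,2015 15:38:31 ", "Oct 05,2015 11:07:14")

# format and tz are both named; trimws() only removes the stray trailing blank.
parsed <- as.POSIXct(trimws(time), format = "%b %d,%Y %H:%M:%S", tz = "UTC")

class(parsed)                        # a POSIXct vector, time part retained
format(parsed, "%Y-%m-%d %H:%M:%S")
```

Naming tz explicitly also makes the result independent of the machine's local time zone, which is why the outputs above show IST on the answerer's system.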
Related
I'm trying to parse dates (using lubridate functions) from a vector which has mixed date formats.
departureDate <- c("Aug 17, 2020 12:00:00 AM", "Nov 19, 2019 12:00:00 AM", "Dec 21, 2020 12:00:00 AM",
"Dec 24, 2020 12:00:00 AM", "Dec 24, 2020 12:00:00 AM", "Apr 19, 2020 12:00:00 AM", "28/06/2019",
"16/08/2019", "04/02/2019", "10/04/2019", "28/07/2019", "26/07/2019",
"Jun 22, 2020 12:00:00 AM", "Apr 5, 2020 12:00:00 AM", "May 1, 2021 12:00:00 AM")
I didn't notice the mixed formats at first, so I tried to parse with lubridate::mdy_hms(departureDate), which resulted in NA values for the dates whose format differs from the one the parser expects.
As the format may change at random positions of the vector, I tried the following statement:
departureDate <- tryCatch(mdy_hms(departureDate),
warning = function(w){return(dmy(departureDate))})
This produced even more NAs, because once a warning was raised only the result of the warning handler was returned. Is there a way to solve this using my approach?
Thanks in advance
We can use lubridate::parse_date_time which can take multiple formats.
lubridate::parse_date_time(departureDate, c('%b %d, %Y %I:%M:%S %p', '%d/%m/%Y'))
#[1] "2020-08-17 UTC" "2019-11-19 UTC" "2020-12-21 UTC" "2020-12-24 UTC"
#[5] "2020-12-24 UTC" "2020-04-19 UTC" "2019-06-28 UTC" "2019-08-16 UTC"
#[9] "2019-02-04 UTC" "2019-04-10 UTC" "2019-07-28 UTC" "2019-07-26 UTC"
#[13] "2020-06-22 UTC" "2020-04-05 UTC" "2021-05-01 UTC"
Since the month names in departureDate are in English, the locale needs to be English as well.
See "How to change the locale of R?" if you have a non-English locale.
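One quick way to check whether the current session would recognise English month abbreviations is to compare what the locale produces for %b with R's built-in month.abb. This is only a diagnostic sketch in base R; the variable names are made up for illustration.

```r
# Month abbreviations as the current LC_TIME locale renders them.
locale_months <- format(ISOdate(2020, 1:12, 1), "%b")

# TRUE in an English or C locale; FALSE suggests %b will not match "Aug", "Dec", ...
english_months <- identical(locale_months, month.abb)

Sys.getlocale("LC_TIME")
english_months
```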
The ideal situation is that the code should be able to deal with every format on its own, without letting it fall to an exception.
Another issue to take into account is that the mdy_hms() function returns dates in the POSIXct data type, whereas dmy() returns the Date type, so they wouldn't mix well together.
The code below applies mdy_hms(), then converts it to Date. It then tests for NA's and applies the second function dmy() on the missing values. More rules can be added in the pipeline at will if more formats are to be recognized.
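library(dplyr)      # for the %>% pipe
library(lubridate)  # for mdy_hms() and dmy()
dates.converted <-
  mdy_hms(departureDate) %>%                     # POSIXct; NA where the format does not match
  as.Date() %>%
  ifelse(!is.na(.), ., dmy(departureDate)) %>%   # fill the NAs from the second parser
  structure(class = "Date")                      # ifelse() drops the class, so restore it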
print(dates.converted)
Output
[1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28" "2019-08-16"
[9] "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05" "2021-05-01"
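The same "parse, then fill the NAs" idea can be sketched in base R with subset assignment instead of ifelse(), which keeps the Date class without having to restore it. This is an illustration under assumptions: a shortened three-element input, an English or C LC_TIME locale for %b, and reliance on strptime ignoring the trailing time text after the date format has been consumed.

```r
# Shortened sample of the mixed-format input (illustrative subset).
departureDate <- c("Aug 17, 2020 12:00:00 AM", "28/06/2019", "Apr 5, 2020 12:00:00 AM")

# First pass: only the date part is matched; trailing characters are ignored.
dates <- as.Date(departureDate, format = "%b %d, %Y")

# Second pass: fill only the elements the first format could not parse.
unparsed <- is.na(dates)
dates[unparsed] <- as.Date(departureDate[unparsed], format = "%d/%m/%Y")

dates
```

Further formats can be handled by repeating the last two statements with another format string.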
One method would be to iterate through a list of candidate formats and apply each one only to the dates not yet parsed correctly.
fmts <- c("%b %d, %Y %I:%M:%S %p", "%d/%m/%Y")  # %I (12-hour clock) is required for %p to take effect
dates <- rep(Sys.time()[NA], length(departureDate))
for (fmt in fmts) {
  isna <- is.na(dates)
  if (!any(isna)) break
  dates[isna] <- as.POSIXct(departureDate[isna], format = fmt)
}
dates
# [1] "2020-08-17 PDT" "2019-11-19 PST" "2020-12-21 PST" "2020-12-24 PST"
# [5] "2020-12-24 PST" "2020-04-19 PDT" "2019-06-28 PDT" "2019-08-16 PDT"
# [9] "2019-02-04 PST" "2019-04-10 PDT" "2019-07-28 PDT" "2019-07-26 PDT"
# [13] "2020-06-22 PDT" "2020-04-05 PDT" "2021-05-01 PDT"
as.Date(dates)
# [1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28"
# [8] "2019-08-16" "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05"
# [15] "2021-05-01"
I encourage you to put the most-likely formats first in the fmts vector.
The way this is set up, as soon as all elements are correctly found, no further formats are attempted (i.e., break).
Edit: if there is a difference in LOCALE where AM/PM are not locally recognized, then one method would be to first remove them from the strings:
departureDate <- gsub("\\s[AP]M$", "", departureDate)
departureDate
# [1] "Aug 17, 2020 12:00:00" "Nov 19, 2019 12:00:00" "Dec 21, 2020 12:00:00"
# [4] "Dec 24, 2020 12:00:00" "Dec 24, 2020 12:00:00" "Apr 19, 2020 12:00:00"
# [7] "28/06/2019" "16/08/2019" "04/02/2019"
# [10] "10/04/2019" "28/07/2019" "26/07/2019"
# [13] "Jun 22, 2020 12:00:00" "Apr 5, 2020 12:00:00" "May 1, 2021 12:00:00"
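and then use a simpler format (note that the hour is then read on a 24-hour clock, so "12:00:00" becomes noon; that is harmless if only the date is needed):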
fmts <- c("%b %d, %Y %H:%M:%S", "%d/%m/%Y")
I have a timestamp vector like
time_stamp <- c("7/1/2013", "7/1/2013 12:00:30 AM", "7/1/2013 12:01:00 AM", "7/1/2013 12:01:30 AM", "8/1/2013","8/1/2013 11:02:30 PM")
I want to format this to date class. I tried
strptime(time_stamp, format = "%d/%m/%Y %H:%M:%S", tz = "GMT")
but since two timestamps have no time part, this results in NAs; the missing time should instead default to 12:00:00 AM.
I can run a loop such as:
for (i in 1:length(time_stamp))
{
if(nchar(time_stamp[i])<11)
{
time_stamp[i] <- paste(time_stamp[i], " 12:00:00 AM")
}
}
time_stamp <- format(strptime(time_stamp, format = "%d/%m/%Y %I:%M:%S %p", tz = "GMT"), "%d/%m/%Y %H:%M:%S", tz = "GMT")
Is there a faster and cleaner way to accomplish this? The vector is a part of large dataset so I don't want to loop over it.
lubridate::parse_date_time can take multiple token orders, with or without the %:
lubridate::parse_date_time(time_stamp, orders = c("dmy IMS p", "dmy"))
## [1] "2013-01-07 00:00:00 UTC" "2013-01-07 00:00:30 UTC" "2013-01-07 00:01:00 UTC"
## [4] "2013-01-07 00:01:30 UTC" "2013-01-08 00:00:00 UTC" "2013-01-08 23:02:30 UTC"
Or use its truncated parameter:
lubridate::parse_date_time(time_stamp, orders = 'dmy IMS p', truncated = 4)
which returns the same thing.
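To illustrate what truncated does: it is the number of trailing components of the format that are allowed to be missing. The sketch below follows the pattern in the lubridate documentation for ymd_hms(), using made-up ISO-style strings rather than the question's data.

```r
library(lubridate)

# Each string drops one more trailing component (seconds, minutes, hours).
x <- c("2011-12-31 12:59:59", "2010-01-01 12:11", "2010-01-01 12", "2010-01-01")

# truncated = 3 allows up to three trailing components to be absent;
# the missing parts are filled with zeros.
res <- ymd_hms(x, truncated = 3)
res
```

With truncated = 0 (the default) only the first string would parse and the rest would be NA.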
Or use a bit of regex replacement and then process as normal:
as.POSIXct(sub("(\\d{4}$)", "\\1 12:00:00 AM", time_stamp),
format = "%d/%m/%Y %I:%M:%S %p", tz = "GMT")
#[1] "2013-01-07 00:00:00 GMT" "2013-01-07 00:00:30 GMT" "2013-01-07 00:01:00 GMT"
#[4] "2013-01-07 00:01:30 GMT" "2013-01-08 00:00:00 GMT" "2013-01-08 23:02:30 GMT"
I cannot get R to format POSIXlt objects in the desired timezone. POSIXct works as expected. Is this a bug or am I missing something?
date.str = "2015-12-09 13:30"
from = "Europe/London"
to = "America/Los_Angeles"
lt = as.POSIXlt(date.str, tz=from)
format(lt, tz=to, usetz=TRUE)
#[1] "2015-12-09 13:30:00 GMT"
ct = as.POSIXct(date.str, tz=from)
format(ct, tz=to, usetz=TRUE)
#[1] "2015-12-09 05:30:00 PST"
The tzone attributes are the same:
attributes(ct)$tzone
#[1] "Europe/London"
attributes(lt)$tzone
#[1] "Europe/London"
Solution
As pointed out by @nicola, format.POSIXlt has no tz parameter. To print a POSIXlt date in another timezone, one can use the lubridate package to convert the POSIXlt object to the desired timezone first:
require(lubridate)
lt.changed = with_tz(lt, tz=to)
format(lt.changed, usetz=TRUE)
#[1] "2015-12-09 05:30:00 PST"
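A base-R alternative, sketched here under the assumption that going through POSIXct is acceptable: as.POSIXct() keeps the instant, and as.POSIXlt() with an explicit tz then re-expresses it in the target zone. The name lt.la is made up for illustration.

```r
date.str <- "2015-12-09 13:30"
from <- "Europe/London"
to   <- "America/Los_Angeles"

lt <- as.POSIXlt(date.str, tz = from)

# POSIXct stores the instant itself, so it can be re-expressed in any zone.
ct    <- as.POSIXct(lt)
lt.la <- as.POSIXlt(ct, tz = to)

format(lt.la, usetz = TRUE)
```

London is on GMT in December and Los Angeles is eight hours behind, so 13:30 becomes 05:30, matching the POSIXct result shown in the question.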
I have a vector of character strings that looks like this. I want to convert them to dates. The time-zone abbreviations are causing trouble.
> a
[1] "07/17/2014 5:01:22 PM EDT" "7/17/2014 2:01:05 PM PDT" "07/17/2014 4:00:48 PM CDT" "07/17/2014 3:05:16 PM MDT"
If I use: strptime(a, "%d/%m/%Y %I:%M:%S %p %Z") I get [1] NA
If I omit the "%Z" for the time zone and use this:
strptime(a, "%m/%d/%Y %I:%M:%S %p", tz = "EST5EDT") I get
[1] "2014-07-17 17:01:22 EDT"
Since my strings contain various time zones (PDT, CDT, EDT, MDT), I can't default all of them to EST5EDT. One workaround is to split the vector into separate vectors for each time zone, remove the letters PDT / EDT etc., and apply the right time zone with strptime ("EST5EDT", "CST6CDT", etc.). Is there any other way to solve this?
If the date is always the first part of the elements of the character vector and it is always followed by the time, splitting the elements by the whitespaces is a possibility. If only the date is needed:
dates <- sapply(a, function(x) strsplit(x, split = " ")[[1]][1])
dates <- as.Date(as.character(dates), format = "%m/%d/%Y")
[1] "2014-07-17" "2014-07-17" "2014-07-17" "2014-07-17"
If the time is also needed:
datetime <- sapply(a, function(x) paste(strsplit(x, split = " ")[[1]][1:3],
collapse = " "))
datetime <- strptime(as.character(datetime), format = "%m/%d/%Y %I:%M:%S %p")
[1] "2014-07-17 17:01:22 CEST" "2014-07-17 14:01:05 CEST"
You can set a different timezone using the tz argument here.
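If the instants themselves matter, one possible approach is to map each abbreviation to a zone name and parse every string in its own zone, collecting the results in UTC. The mapping below (tz_map) is an assumption made for illustration, covers only the four abbreviations in the question, and abbreviations are not unique in general. The sketch also assumes an English or C LC_TIME locale so that %p matches "PM".

```r
a <- c("07/17/2014 5:01:22 PM EDT", "7/17/2014 2:01:05 PM PDT",
       "07/17/2014 4:00:48 PM CDT", "07/17/2014 3:05:16 PM MDT")

# Assumed mapping from abbreviation to Olson zone name (illustrative only).
tz_map <- c(EDT = "America/New_York", CDT = "America/Chicago",
            MDT = "America/Denver",   PDT = "America/Los_Angeles")

abbr  <- sub("^.* ", "", a)         # last token, e.g. "EDT"
stamp <- sub(" [A-Z]{3}$", "", a)   # string without the abbreviation

# Parse each element in its own zone; keep seconds since the epoch.
secs <- mapply(function(s, z) {
  as.numeric(as.POSIXct(s, format = "%m/%d/%Y %I:%M:%S %p", tz = z))
}, stamp, tz_map[abbr], USE.NAMES = FALSE)

utc <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(utc, "%Y-%m-%d %H:%M:%S")
```

A single POSIXct vector can carry only one display time zone, which is why the results are collected in UTC here.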
Actual question
How can I temporarily change/specify the locale settings to be used for certain function calls (e.g. strptime())?
Background
I just ran the following rvest demo:
demo("tripadvisor", package = "rvest")
When it comes to the part where the dates are to be scraped, I run into some problems that are most likely caused by my locale settings: the dates are in a US format while I'm on a German locale:
require("rvest")
url <- "http://www.tripadvisor.com/Hotel_Review-g37209-d1762915-Reviews-JW_Marriott_Indianapolis-Indianapolis_Indiana.html"
reviews <- url %>%
html() %>%
html_nodes("#REVIEWS .innerBubble")
date <- reviews %>%
html_node(".rating .ratingDate") %>%
html_attr("title")
> date
[1] "December 9, 2014" "December 9, 2014" "December 8, 2014" "December 8, 2014"
[5] "December 6, 2014" "December 5, 2014" "December 5, 2014" "December 3, 2014"
[9] "December 3, 2014" "December 3, 2014"
Based on this output, I would use the following format: %B %e, %Y (or %B%e, %Y, depending on what "with a leading space for a single-digit number" actually means for the leading space; see ?strptime).
Yet both fail:
strptime(date, "%B %e, %Y")
strptime(date, "%B%e, %Y")
I suppose it's due to the fact that %B expects the month names to be in German instead of English:
Full month name in the current locale. (Also matches abbreviated name on input.)
EDIT
Sys.setlocale() lets you change your locale settings. But it seems that this is not possible after a function relying on the locale settings has been called, i.e., you need to start with a fresh R session for the locale change to take effect. This makes temporary changes a bit cumbersome. Any ideas how to work around this?
This is my locale:
> Sys.getlocale(category = "LC_ALL")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
When I change it before running strptime() for the first time, everything works just fine:
Sys.setlocale(category = "LC_ALL", locale = "us")
> strptime(date, "%B %e, %Y")
[1] "2014-12-09 CET" "2014-12-09 CET" "2014-12-08 CET" "2014-12-08 CET" "2014-12-06 CET"
[6] "2014-12-05 CET" "2014-12-05 CET" "2014-12-03 CET" "2014-12-03 CET" "2014-12-03 CET"
However, if I change it after having run strptime(), the change does not seem to be recognized:
> Sys.setlocale(category = "LC_ALL", locale = "German")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> strptime(date, "%B %e, %Y")
[1] "2014-12-09 CET" "2014-12-09 CET" "2014-12-08 CET" "2014-12-08 CET" "2014-12-06 CET"
[6] "2014-12-05 CET" "2014-12-05 CET" "2014-12-03 CET" "2014-12-03 CET" "2014-12-03 CET"
This should actually result in a vector of NAs if the change back to a German locale had been carried out.
parse_date_time() from the lubridate package is what you are looking for. It has an explicit locale option for parsing strings according to a specific locale.
parse_date_time(date, orders = "B d, Y", locale = "us")
gives you:
[1] "2016-02-26 UTC" "2016-02-26 UTC" "2016-02-26 UTC" "2016-02-24 UTC" "2016-02-23 UTC" "2016-02-21 UTC"
[7] "2016-02-21 UTC" "2016-02-21 UTC" "2016-02-20 UTC" "2016-02-20 UTC"
Note that you give the parsing format without the leading % that you would use in strptime().
You can also use readr::locale("en") inside readr::parse_date():
readr::parse_date(date, format = "%B %e, %Y",
# vector of strings to be interpreted as missing values:
na = c("", "NA"),
locale = readr::locale("en"),
# to trim leading and trailing whitespaces:
trim_ws = TRUE)
From the docs: "The locale controls defaults that vary from place to place. The default locale is US-centric (like R), but you can use locale() to create your own locale that controls things like the default time zone, encoding, decimal mark, big mark, and day/month names."
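For a base-R take on the original "temporarily change the locale" question, one possible sketch is a small wrapper that switches LC_TIME, evaluates an expression, and restores the previous setting on exit. The helper name with_time_locale is hypothetical. The example uses the "C" locale, which is available everywhere and uses English month names, and it sets LC_TIME rather than LC_ALL; whether that avoids the caching behaviour described in the question is an assumption worth checking on your own platform.

```r
# Hypothetical helper: evaluate `expr` with LC_TIME temporarily set to `locale`.
with_time_locale <- function(locale, expr) {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)   # always restore
  if (!nzchar(Sys.setlocale("LC_TIME", locale))) {
    stop("locale '", locale, "' is not available on this system")
  }
  force(expr)   # lazy argument: evaluated only now, under the new locale
}

date <- c("December 9, 2014", "December 8, 2014")
parsed <- with_time_locale("C", strptime(date, "%B %d, %Y", tz = "UTC"))
format(parsed, "%Y-%m-%d")
```

Because the expression is a lazily evaluated argument, it runs only at force(expr), after the locale has been switched and before on.exit() restores it.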