Actual question
How can I temporarily change/specify the locale settings to be used for certain function calls (e.g. strptime())?
Background
I just ran the following rvest demo:
demo("tripadvisor", package = "rvest")
When it comes to the part where the dates are to be scraped, I run into some problems that are most likely caused by my locale settings: the dates are in a US American format while I'm on a German locale:
require("rvest")
url <- "http://www.tripadvisor.com/Hotel_Review-g37209-d1762915-Reviews-JW_Marriott_Indianapolis-Indianapolis_Indiana.html"
reviews <- url %>%
html() %>%
html_nodes("#REVIEWS .innerBubble")
date <- reviews %>%
html_node(".rating .ratingDate") %>%
html_attr("title")
> date
[1] "December 9, 2014" "December 9, 2014" "December 8, 2014" "December 8, 2014"
[5] "December 6, 2014" "December 5, 2014" "December 5, 2014" "December 3, 2014"
[9] "December 3, 2014" "December 3, 2014"
Based on this output, I would use the following format: %B %e, %Y (or %B%e, %Y, depending on what "with a leading space for a single-digit number" actually means with respect to the leading space; see ?strptime).
Yet both fail:
strptime(date, "%B %e, %Y")
strptime(date, "%B%e, %Y")
I suppose it's due to the fact that %B expects the month names to be in German instead of English:
Full month name in the current locale. (Also matches abbreviated name on input.)
EDIT
Sys.setlocale() lets you change your locale settings. But it seems that it's not possible to do so after a function relying on locale settings has been called. That is, you need to start with a fresh R session in order for the locale change to take effect. This makes temporary changes a bit cumbersome. Any ideas how to work around this?
This is my locale:
> Sys.getlocale(category = "LC_ALL")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
When I change it before running strptime() for the first time, everything works just fine:
Sys.setlocale(category = "LC_ALL", locale = "us")
> strptime(date, "%B %e, %Y")
[1] "2014-12-09 CET" "2014-12-09 CET" "2014-12-08 CET" "2014-12-08 CET" "2014-12-06 CET"
[6] "2014-12-05 CET" "2014-12-05 CET" "2014-12-03 CET" "2014-12-03 CET" "2014-12-03 CET"
However, if I change it after having run strptime(), the change does not seem to be recognized:
> Sys.setlocale(category = "LC_ALL", locale = "German")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> strptime(date, "%B %e, %Y")
[1] "2014-12-09 CET" "2014-12-09 CET" "2014-12-08 CET" "2014-12-08 CET" "2014-12-06 CET"
[6] "2014-12-05 CET" "2014-12-05 CET" "2014-12-03 CET" "2014-12-03 CET" "2014-12-03 CET"
This should actually result in a vector of NAs if the change back to a German locale had been carried out.
parse_date_time() from the lubridate package is what you are looking for. It has an explicit locale option for parsing strings according to a specific locale.
parse_date_time(date, orders = "B d, Y", locale = "us")
gives you:
[1] "2016-02-26 UTC" "2016-02-26 UTC" "2016-02-26 UTC" "2016-02-24 UTC" "2016-02-23 UTC" "2016-02-21 UTC"
[7] "2016-02-21 UTC" "2016-02-21 UTC" "2016-02-20 UTC" "2016-02-20 UTC"
Note that you give the parsing format without the leading % that you would use in strptime().
You can also use readr::locale("en") inside readr::parse_date():
readr::parse_date(date, format = "%B %e, %Y",
# vector of strings to be interpreted as missing values:
na = c("", "NA"),
locale = readr::locale("en"),
# to trim leading and trailing whitespaces:
trim_ws = TRUE)
From the docs: "The locale controls defaults that vary from place to place. The default locale is US-centric (like R), but you can use locale() to create your own locale that controls things like the default time zone, encoding, decimal mark, big mark, and day/month names."
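For a base R route to the original question, here is a minimal sketch. It assumes that switching only the LC_TIME category is enough for strptime() and that the "C" locale (which uses English month names) is acceptable; with_time_locale is a hypothetical helper name, not part of any package. Whether the switch takes effect after strptime() has already been called may depend on the platform and R version, as the EDIT in the question suggests.

```r
# Hypothetical helper: evaluate `expr` with LC_TIME temporarily set to `locale`,
# restoring the previous setting on exit (even if `expr` fails).
with_time_locale <- function(locale, expr) {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", locale)
  force(expr)  # `expr` is lazy, so it is evaluated only after the switch
}

date <- c("December 9, 2014", "December 3, 2014")

before <- Sys.getlocale("LC_TIME")
parsed <- with_time_locale("C", as.Date(strptime(date, "%B %e, %Y", tz = "UTC")))
after <- Sys.getlocale("LC_TIME")

format(parsed)             # ISO dates, independent of the session locale
identical(before, after)   # the session locale is left as it was
```

Because the parsing call is passed as a lazy argument, nothing outside the helper has to remember to switch the locale back.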
Related
I'm trying to parse dates (using lubridate functions) from a vector which has mixed date formats.
departureDate <- c("Aug 17, 2020 12:00:00 AM", "Nov 19, 2019 12:00:00 AM", "Dec 21, 2020 12:00:00 AM",
"Dec 24, 2020 12:00:00 AM", "Dec 24, 2020 12:00:00 AM", "Apr 19, 2020 12:00:00 AM", "28/06/2019",
"16/08/2019", "04/02/2019", "10/04/2019", "28/07/2019", "26/07/2019",
"Jun 22, 2020 12:00:00 AM", "Apr 5, 2020 12:00:00 AM", "May 1, 2021 12:00:00 AM")
I didn't notice this at first, so I tried to parse with lubridate::mdy_hms(departureDate), which resulted in NA values for the dates whose format differs from the one the parser expects.
As the format may change at random positions in the vector, I tried the following statement:
departureDate <- tryCatch(mdy_hms(departureDate),
warning = function(w){return(dmy(departureDate))})
This produced even more NAs, as only the warning handler was applied. Is there a way to solve this using my approach?
Thanks in advance
We can use lubridate::parse_date_time which can take multiple formats.
lubridate::parse_date_time(departureDate, c('%b %d, %Y %I:%M:%S %p', '%d/%m/%Y'))
#[1] "2020-08-17 UTC" "2019-11-19 UTC" "2020-12-21 UTC" "2020-12-24 UTC"
#[5] "2020-12-24 UTC" "2020-04-19 UTC" "2019-06-28 UTC" "2019-08-16 UTC"
#[9] "2019-02-04 UTC" "2019-04-10 UTC" "2019-07-28 UTC" "2019-07-26 UTC"
#[13] "2020-06-22 UTC" "2020-04-05 UTC" "2021-05-01 UTC"
Since the month names in departureDate are in English, the locale needs to be English as well.
Refer to "How to change the locale of R?" if you have a non-English locale.
The ideal situation is that the code should be able to deal with every format on its own, without letting it fall to an exception.
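To illustrate why the tryCatch() attempt from the question cannot work element by element, here is a small base R sketch. parse_strict is a hypothetical stand-in for a parser that warns when any element fails; it is not a lubridate function.

```r
# Hypothetical stand-in for a parser that warns if any element fails to parse
parse_strict <- function(x) {
  out <- as.Date(x, format = "%Y-%m-%d")
  if (anyNA(out)) warning("some strings failed to parse")
  out
}

x <- c("2020-08-17", "28/06/2019")

# The handler replaces the WHOLE result, so the first element is lost
res <- tryCatch(parse_strict(x),
                warning = function(w) as.Date(x, format = "%d/%m/%Y"))
res

# Element-wise fallback: parse with both formats, then fill only the gaps
first <- as.Date(x, format = "%Y-%m-%d")
second <- as.Date(x, format = "%d/%m/%Y")
combined <- first
combined[is.na(first)] <- second[is.na(first)]
combined
```

A condition handler sees one warning for the whole call, so it has no way of knowing which elements failed; the fallback has to be applied by position instead.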
Another issue to take into account is that the mdy_hms() function returns dates in the POSIXct data type, whereas dmy() returns the Date type, so they wouldn't mix well together.
The code below applies mdy_hms(), then converts it to Date. It then tests for NA's and applies the second function dmy() on the missing values. More rules can be added in the pipeline at will if more formats are to be recognized.
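library(dplyr)
library(lubridate) # provides mdy_hms() and dmy()
dates.converted <-
mdy_hms(departureDate) %>% # warns about the entries that fail to parse
as.Date() %>%
ifelse(!is.na(.), ., dmy(departureDate)) %>% # fill the gaps with the second format
structure(class = "Date") # ifelse() drops the class, so restore it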
print(dates.converted)
Output
[1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28" "2019-08-16"
[9] "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05" "2021-05-01"
One method would be to iterate through a list of candidate formats and apply it only to dates not previously parsed correctly.
fmts <- c("%b %d, %Y %I:%M:%S %p", "%d/%m/%Y") # %p only takes effect together with %I, not %H
dates <- rep(Sys.time()[NA], length(departureDate))
for (fmt in fmts) {
isna <- is.na(dates)
if (!any(isna)) break
dates[isna] <- as.POSIXct(departureDate[isna], format = fmt)
}
dates
# [1] "2020-08-17 PDT" "2019-11-19 PST" "2020-12-21 PST"
# [4] "2020-12-24 PST" "2020-12-24 PST" "2020-04-19 PDT"
# [7] "2019-06-28 PDT" "2019-08-16 PDT" "2019-02-04 PST"
# [10] "2019-04-10 PDT" "2019-07-28 PDT" "2019-07-26 PDT"
# [13] "2020-06-22 PDT" "2020-04-05 PDT" "2021-05-01 PDT"
as.Date(dates, tz = "") # convert in the local time zone rather than UTC
# [1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28"
# [8] "2019-08-16" "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05"
# [15] "2021-05-01"
I encourage you to put the most-likely formats first in the fmts vector.
The way this is set up, as soon as all elements are correctly found, no further formats are attempted (i.e., break).
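The same idea can be wrapped in a reusable function. This is only a sketch: parse_with_formats is a hypothetical name, the default time zone of "UTC" is an assumption chosen for reproducibility, and the example uses purely numeric formats so that it does not depend on the locale.

```r
# Hypothetical helper: try each format in turn, only on still-unparsed elements
parse_with_formats <- function(x, fmts, tz = "UTC") {
  out <- as.POSIXct(rep(NA_real_, length(x)), origin = "1970-01-01", tz = tz)
  for (fmt in fmts) {
    todo <- is.na(out)
    if (!any(todo)) break          # everything parsed, stop early
    out[todo] <- as.POSIXct(x[todo], format = fmt, tz = tz)
  }
  out
}

x <- c("2020-08-17 12:30:00", "28/06/2019", "not a date")
res <- parse_with_formats(x, c("%Y-%m-%d %H:%M:%S", "%d/%m/%Y"))
format(res[1:2], "%Y-%m-%d %H:%M")
is.na(res)   # elements that matched no format stay NA
```

Elements that match none of the candidate formats remain NA, which makes them easy to inspect afterwards.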
Edit: if your locale does not recognize AM/PM, one method is to first remove them from the strings:
departureDate <- gsub("\\s[AP]M$", "", departureDate)
departureDate
# [1] "Aug 17, 2020 12:00:00" "Nov 19, 2019 12:00:00" "Dec 21, 2020 12:00:00"
# [4] "Dec 24, 2020 12:00:00" "Dec 24, 2020 12:00:00" "Apr 19, 2020 12:00:00"
# [7] "28/06/2019" "16/08/2019" "04/02/2019"
# [10] "10/04/2019" "28/07/2019" "26/07/2019"
# [13] "Jun 22, 2020 12:00:00" "Apr 5, 2020 12:00:00" "May 1, 2021 12:00:00"
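and then use a simpler format (note that without the AM/PM marker the hours are read on a 24-hour clock, so "12:00:00 AM" is parsed as noon):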
fmts <- c("%b %d, %Y %H:%M:%S", "%d/%m/%Y")
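An alternative that keeps the AM/PM information is to parse under a locale that recognizes it. The sketch below assumes that temporarily switching LC_TIME to "C" is acceptable in your session; it uses %I (12-hour clock) so that %p is honored, and a fixed "UTC" time zone purely for reproducibility.

```r
x <- c("Aug 17, 2020 12:00:00 AM", "Apr 5, 2020 1:30:00 PM")

old <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))   # English month names and AM/PM
parsed <- as.POSIXct(x, format = "%b %d, %Y %I:%M:%S %p", tz = "UTC")
invisible(Sys.setlocale("LC_TIME", old))   # restore the previous setting

format(parsed, "%Y-%m-%d %H:%M:%S")
# 12:00:00 AM becomes 00:00:00 and 1:30:00 PM becomes 13:30:00
```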
I have the following character vector that contains date/time. I want to convert them to a date format and I tried the below methods : as.Date() and as.POSIXct()
time <- c("Oct 01,2015 15:38:31 ", "Oct 05,2015 11:07:14", "Oct 11,2015 14:15:51 ", "Oct 11,2015 14:19:53 ", "Oct 12,2015 11:23:28", "Oct 19,2015 16:32:51 ")
#as.Date() is skipping the time part
time_1<-as.Date(time,"%b %d,%Y %H:%M:%S")
time_1
[1] "2015-10-01" "2015-10-05" "2015-10-11" "2015-10-11" "2015-10-12" "2015-10-19"
#POSIXct is showing an error
time_2<-as.POSIXlt(time,"%b %d,%Y %H:%M:%S")
The as.Date() function drops the time part, and as.POSIXlt() throws an error (which is obvious).
How do I convert the above strings to a proper date+time format?
We can specify the format in both as.POSIXlt and as.POSIXct
as.POSIXlt(time,format="%b %d,%Y %H:%M:%S")
#[1] "2015-10-01 15:38:31 IST" "2015-10-05 11:07:14 IST"
#[3] "2015-10-11 14:15:51 IST" "2015-10-11 14:19:53 IST"
#[5] "2015-10-12 11:23:28 IST" "2015-10-19 16:32:51 IST"
If the format argument is not named, the function treats "%b %d,%Y %H:%M:%S" as the time zone (tz), because tz is the second argument in the usage shown in ?as.POSIXct:
as.POSIXct(x, tz = "", ...)
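A short sketch of the point about argument order, using two of the strings from the question. The explicit tz = "UTC" and the temporary switch of LC_TIME to "C" (for the English month abbreviations) are assumptions made for reproducibility.

```r
time <- c("Oct 01,2015 15:38:31 ", "Oct 05,2015 11:07:14")

# The second formal argument is tz, so an unnamed format string ends up there
names(formals(as.POSIXlt))[1:2]

old <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))   # so that %b matches "Oct"
parsed <- as.POSIXct(trimws(time), format = "%b %d,%Y %H:%M:%S", tz = "UTC")
invisible(Sys.setlocale("LC_TIME", old))

format(parsed, "%Y-%m-%d %H:%M:%S")
```

Naming both format and tz avoids any dependence on argument position.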
I’m having difficulties with a date time problem. My data frame looks like this and I want to find the duration that each person watches TV.
Start.Time <- c(193221,201231,152324,182243,123432,192245)
End.Time <- c(202013,211232,154521,183422,133121,201513)
cbind(Start.Time,End.Time)
I have tried different methods to convert them so that I can do calculations, but none produced useful results.
as.POSIXct(Start.Time , origin="2015-11-01")
My results are completely wrong
[1] "2015-11-03 05:40:21 GMT" "2015-11-03 07:53:51 GMT"
[3] "2015-11-02 18:18:44 GMT" "2015-11-03 02:37:23 GMT"
[5] "2015-11-02 10:17:12 GMT" "2015-11-03 05:24:05 GMT"
For example I want 193221 to become 19:32:21 HH:MM:SS
Is there a package out there that easily does the conversion? And if it's possible, I don't want the date displayed, just the time.
You can convert your numbers to actual time stamps (in POSIXct format) like this:
Start.Time <- c(193221,201231,152324,182243,123432,192245)
Start.POSIX <- as.POSIXct(as.character(Start.Time), format = "%H%M%S")
Start.POSIX
## [1] "2015-12-19 19:32:21 CET" "2015-12-19 20:12:31 CET" "2015-12-19 15:23:24 CET"
## [4] "2015-12-19 18:22:43 CET" "2015-12-19 12:34:32 CET" "2015-12-19 19:22:45 CET"
As you can see, as.POSIXct assumes the times to belong to the current date. A POSIXct value always denotes a specific moment in time and thus contains not only a time but also a date. You can now easily do calculations with these:
End.Time <- c(202013,211232,154521,183422,133121,201513)
End.POSIX <- as.POSIXct(as.character(End.Time), format = "%H%M%S")
End.POSIX - Start.POSIX
## Time differences in mins
## [1] 47.86667 60.01667 21.95000 11.65000 56.81667 52.46667
When you print POSIXct objects (as I did above with Start.POSIX), they are actually converted to character strings, and these are printed. You can see this from the quotation marks around the dates. You can control the format used for printing, so you could print only the times as follows:
format(Start.POSIX, "%H:%M:%S")
## [1] "19:32:21" "20:12:31" "15:23:24" "18:22:43" "12:34:32" "19:22:45"
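If no date is wanted at all, the durations can also be computed with plain arithmetic. This is a sketch that assumes the numbers are always HHMMSS values on the same day; to_seconds is a hypothetical helper name. It also sidesteps a pitfall of the string route: a time such as 93221 has only five digits, so it would need zero-padding before being parsed with "%H%M%S".

```r
# Hypothetical helper: split an HHMMSS number into seconds since midnight
to_seconds <- function(hhmmss) {
  h <- hhmmss %/% 10000
  m <- (hhmmss %/% 100) %% 100
  s <- hhmmss %% 100
  h * 3600 + m * 60 + s
}

Start.Time <- c(193221, 201231, 152324, 182243, 123432, 192245)
End.Time <- c(202013, 211232, 154521, 183422, 133121, 201513)

duration_sec <- to_seconds(End.Time) - to_seconds(Start.Time)
duration_min <- duration_sec / 60
duration_min

# Time-only display without any date component
sprintf("%02d:%02d:%02d",
        as.integer(Start.Time %/% 10000),
        as.integer((Start.Time %/% 100) %% 100),
        as.integer(Start.Time %% 100))
```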
For output, the specification is %Z (see ?strptime). But for input, how does that work?
To clarify, it'd be great for the time zone abbreviation to be parsed into useful information by as.POSIXct(), but more central to the question is how to get the function to at least ignore the time zone.
Here is my best workaround, but is there a particular format code to pass to as.POSIXct() that will work for all time zones?
times <- c("Fri Jul 03 00:15:00 EDT 2015", "Fri Jul 03 00:15:00 GMT 2015")
as.POSIXct(times, format="%a %b %d %H:%M:%S %Z %Y") # nope! strptime can't handle %Z in input
formats <- paste("%a %b %d %H:%M:%S", gsub(".+ ([A-Z]{3}) [0-9]{4}$", "\\1", times),"%Y")
as.POSIXct(times, format=formats) # works
Edit: Here is the output from the last line, as well as its class (from a separate call); the output is as expected. From the console:
> as.POSIXct(times, format=formats)
[1] "2015-07-03 00:15:00 EDT" "2015-07-03 00:15:00 EDT"
> attributes(as.POSIXct(times, format=formats))
$class
[1] "POSIXct" "POSIXt"
$tzone
[1] ""
The short answer is, "no, you can't." Those are abbreviations and they are not guaranteed to uniquely identify a specific timezone.
For example, is "EST" Eastern Standard Time in the US or Australia? Is "CST" Central Standard Time in the US or Australia, or is it China Standard Time, or is it Cuba Standard Time?
I just noticed that you're not trying to parse the timezone abbreviation, you are simply trying to avoid it. I don't know of a way to tell strptime to ignore arbitrary characters. I do know that it will ignore anything in the character representation of the time after the end of the format string. For example:
R> # The year is not parsed, so the current year is used
R> as.POSIXct(times, format="%a %b %d %H:%M:%S")
[1] "2015-07-03 00:15:00 UTC" "2015-07-03 00:15:00 UTC"
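A locale-independent sketch of the same behavior, assuming the abbreviation comes last in the string: the trailing text after the part covered by the format is simply ignored. The explicit tz = "UTC" is an assumption made for reproducibility.

```r
x <- c("2015-07-03 00:15:00 EDT", "2015-07-03 00:15:00 GMT")

# " EDT" and " GMT" lie beyond the end of the format string and are ignored
parsed <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
format(parsed, "%Y-%m-%d %H:%M:%S")
```

Note that both strings end up as the same instant: the abbreviation is dropped, not interpreted.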
Other than that, a regular expression is the only thing I can think of that solves this problem. Unlike your example, I would use the regex on the input character vector to remove all 3-5 character timezone abbreviations.
R> times_no_tz <- gsub(" [[:upper:]]{3,5} ", " ", times)
R> as.POSIXct(times_no_tz, format="%a %b %d %H:%M:%S %Y")
[1] "2015-07-03 00:15:00 UTC" "2015-07-03 00:15:00 UTC"
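If the abbreviations in your data are known in advance, one possible extension is to map them to time zone names yourself. The sketch below is built on assumptions: tz_lookup is a hypothetical, hand-written table (the abbreviations remain ambiguous in general, as noted above), the Olson name "America/New_York" must be available on the system, and LC_TIME is switched to "C" temporarily so that the English day and month names match.

```r
times <- c("Fri Jul 03 00:15:00 EDT 2015", "Fri Jul 03 00:15:00 GMT 2015")

# Hypothetical lookup table: you decide what each abbreviation means in your data
tz_lookup <- c(EDT = "America/New_York", GMT = "UTC")

abbr <- sub(".* ([[:upper:]]{3,5}) [0-9]{4}$", "\\1", times)
stripped <- sub(" [[:upper:]]{3,5} ", " ", times)
fmt <- "%a %b %d %H:%M:%S %Y"

old <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))
# as.POSIXct() takes a single tz, so parse element by element
secs <- mapply(
  function(s, z) as.numeric(as.POSIXct(s, format = fmt, tz = z)),
  stripped, tz_lookup[abbr], USE.NAMES = FALSE
)
invisible(Sys.setlocale("LC_TIME", old))

parsed <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(parsed, "%Y-%m-%d %H:%M:%S")   # both instants expressed in UTC
```

Every abbreviation that occurs in the data needs an entry in the table; the sketch does not handle missing entries.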
I'm currently playing around a lot with dates and times for a package I'm building.
Stumbling across this post reminded me again that it's generally not a bad idea to check out if something can be done with basic R features before turning to contrib packages.
Thus, is it possible to round a date of class POSIXct with base R functionality?
I checked
methods(round)
which "only" gave me
[1] round.Date round.timeDate*
Non-visible functions are asterisked
This is what I'd like to do (Pseudo Code)
x <- as.POSIXct(Sys.time())
[1] "2012-07-04 10:33:55 CEST"
round(x, atom="minute")
[1] "2012-07-04 10:34:00 CEST"
round(x, atom="hour")
[1] "2012-07-04 11:00:00 CEST"
round(x, atom="day")
[1] "2012-07-04 CEST"
I know this can be done with timeDate, lubridate etc., but I'd like to keep package dependencies down. So before going ahead and checking out the source code of the respective packages, I thought I'd ask if someone has already done something like this.
base has round.POSIXt to do this. Not sure why it doesn't come up with methods.
x <- as.POSIXct(Sys.time())
x
[1] "2012-07-04 10:01:08 BST"
round(x,"mins")
[1] "2012-07-04 10:01:00 BST"
round(x,"hours")
[1] "2012-07-04 10:00:00 BST"
round(x,"days")
[1] "2012-07-04"
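Two details are worth a sketch here: round() and trunc() on date-times return POSIXlt objects, so wrap them in as.POSIXct() if you want to stay in that class, and rounding to other intervals can be done on the underlying seconds. round_to_seconds is a hypothetical helper name; the fixed "UTC" time zone is an assumption made for reproducibility.

```r
x <- as.POSIXct("2012-07-04 10:33:55", tz = "UTC")

class(round(x, "mins"))                        # POSIXlt, not POSIXct
rounded_hour <- as.POSIXct(round(x, "hours"))  # nearest hour
floored_hour <- as.POSIXct(trunc(x, "hours"))  # start of the hour

# Hypothetical helper: round to the nearest multiple of `secs` seconds.
# Multiples are counted from the epoch in UTC, so this suits intervals
# that divide the time zone offset evenly (e.g. 15 minutes in most zones).
round_to_seconds <- function(x, secs) {
  tz <- attr(x, "tzone")
  if (is.null(tz)) tz <- ""
  as.POSIXct(round(as.numeric(x) / secs) * secs, origin = "1970-01-01", tz = tz)
}
quarter <- round_to_seconds(x, 15 * 60)

format(rounded_hour, "%H:%M:%S")
format(floored_hour, "%H:%M:%S")
format(quarter, "%H:%M:%S")
```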
On this theme with lubridate, also look into the ceiling_date() and floor_date() functions:
x <- as.POSIXct("2009-08-03 12:01:59.23")
ceiling_date(x, "second")
# "2009-08-03 12:02:00 CDT"
ceiling_date(x, "hour")
# "2009-08-03 13:00:00 CDT"
ceiling_date(x, "day")
# "2009-08-04 CDT"
ceiling_date(x, "week")
# "2009-08-09 CDT"
ceiling_date(x, "month")
# "2009-09-01 CDT"
If, like me, you don't want to call external libraries and want to keep POSIXct, here is one idea (inspired by this question): use strptime and paste in a fake month and day. It should be possible to do this in a more straightforward way, as noted in this comment:
"For strptime the input string need not specify the date completely:
it is assumed that unspecified seconds, minutes or hours are zero, and
an unspecified year, month or day is the current one."
Thus it seems that you have to use strftime to output a truncated string, paste in the missing part, and convert back to POSIXct.
This is how an updated answer could look:
x <- as.POSIXct(Sys.time())
x
[1] "2018-12-27 10:58:51 CET"
round(x,"mins")
[1] "2018-12-27 10:59:00 CET"
round(x,"hours")
[1] "2018-12-27 11:00:00 CET"
round(x,"days")
[1] "2018-12-27 CET"
as.POSIXct(paste0(strftime(x,format="%Y-%m"),"-01")) #trunc by month
[1] "2018-12-01 CET"
as.POSIXct(paste0(strftime(x,format="%Y"),"-01-01")) #trunc by year
[1] "2018-01-01 CET"
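A variant that avoids the round trip through strings is to reset the fields of a POSIXlt object. This is a sketch only: floor_to is a hypothetical helper name, and the fixed "UTC" time zone in the example is an assumption made for reproducibility.

```r
# Hypothetical helper: truncate to the start of the month or year
floor_to <- function(x, unit = c("month", "year")) {
  unit <- match.arg(unit)
  lt <- as.POSIXlt(x)
  lt$hour <- 0L
  lt$min <- 0L
  lt$sec <- 0
  lt$mday <- 1L                      # first day of the month
  if (unit == "year") lt$mon <- 0L   # months are 0-based: 0 is January
  lt$isdst <- -1L                    # let R work out daylight saving again
  as.POSIXct(lt)
}

x <- as.POSIXct("2018-12-27 10:58:51", tz = "UTC")
format(floor_to(x, "month"), "%Y-%m-%d %H:%M:%S")
format(floor_to(x, "year"), "%Y-%m-%d %H:%M:%S")
```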