Converting multiple date formats into one in R

I am working with a messy Excel file that contains multiple date formats:
2016-10-17T12:38:41Z
Mon Oct 17 08:03:08 GMT 2016
10-Sep-15
13-Oct-09
18-Oct-2016 05:42:26 UTC
I want to convert all of the above to yyyy-mm-dd format. I am using the following code for the conversion, but a lot of values come out as NA.
as.Date(parse_date_time(df$date,c('mdy', 'ymd_hms','a b d HMS y','d b y HMS')))
How can I convert all of them together? I have read other threads on similar cases, but nothing seems to work for mine.
Please help

If I add 'dmy' to the list then at least all of the cases in your example are successfully parsed:
z <- c("2016-10-17T12:38:41Z", "Mon Oct 17 08:03:08 GMT 2016",
"10-Sep-15", "13-Oct-09", "18-Oct-2016 05:42:26 UTC")
library(lubridate)
parse_date_time(z,c('mdy', 'dmy', 'ymd_HMS','a b d HMS y','d b y HMS'))
## [1] "2016-10-17 12:38:41 UTC" "2016-10-17 08:03:08 UTC"
## [3] "2015-09-10 00:00:00 UTC" "2009-10-13 00:00:00 UTC"
## [5] "2016-10-18 05:42:26 UTC"
Your big problem will be the third and fourth elements: are these actually meant to be 'ymd' and 'dmy' respectively? I'm not sure how any logic will let you auto-detect these differences ... out of context, "15 Sep 2010" and "10 September 2015" both seem perfectly reasonable possibilities ...
For what it's worth I also tried the new anytime package - it only handled the first and last element.
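To make the ambiguity concrete, here is a minimal base R sketch (not part of the original answer) that parses the two short dates under both readings. It assumes an English LC_TIME locale so that %b matches "Sep" and "Oct"; the variable names are just for illustration.

```r
# The two short dates from the question
ambiguous <- c("10-Sep-15", "13-Oct-09")

# Reading 1: day-month-year
as_dmy <- as.Date(ambiguous, format = "%d-%b-%y")
# Reading 2: year-month-day
as_ymd <- as.Date(ambiguous, format = "%y-%b-%d")

# Both readings give a valid date, so nothing in the string itself
# can decide between them
data.frame(input = ambiguous, as_dmy, as_ymd,
           both_valid = !is.na(as_dmy) & !is.na(as_ymd))
```

Since both columns are filled in for every row, the choice has to come from knowledge of where the data came from.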

Removing the times first makes it possible to specify only three alternatives in orders to parse the sample data in the question. This interprets 10-Sep-15 and 13-Oct-09 as dmy but if you want them interpreted as ymd then uncomment the commented out line:
orders <- c("dmy", "mdy", "ymd")
# orders <- c("ymd", "dmy", "mdy")
as.Date(parse_date_time(gsub("..:..:..", " ", x), orders = orders))
giving:
[1] "2016-10-17" "2016-10-17" "2015-09-10" "2009-10-13" "2016-10-18"
or if the commented out line is uncommented then:
[1] "2016-10-17" "2016-10-17" "2010-09-15" "2013-10-09" "2016-10-18"
Note: The input is:
x <- c("2016-10-17T12:38:41Z ", "Mon Oct 17 08:03:08 GMT 2016", "10-Sep-15",
"13-Oct-09", "18-Oct-2016 05:42:26 UTC")

Related

How to convert a string into a date-time object

The data is character and I want it to be date-time. I have the cheat sheet with me but there isn't any format that I can use that satisfies the weird date format. Any suggestions?
x <- 'Fri Dec 11 12:10:51 PST 2020'
You can use the anytime package
> library(anytime)
> anytime("Fri Dec 11 12:10:51 PST 2020")
[1] "2020-12-11 12:10:51 CST"
> class(anytime("Fri Dec 11 12:10:51 PST 2020"))
[1] "POSIXct" "POSIXt"
It has three key advantages:
it can guess the format (as here)
it converts all sorts of input formats (incl. character, factor, ...)
it is pretty fast (as the parser is C++ from Boost)
It is pretty standard for most methods to ignore the timezone attribute. So the PST became my local time, i.e. Central.
In base R, you could do:
x <- 'Fri Dec 11 12:10:51 PST 2020'
as.POSIXct(x, format = '%a %b %d %T PST %Y')
See ?strptime for detailed format specifications.
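If you know which zone the abbreviation stands for, a possible variation on the base R line is to pass it through the tz argument so the clock time is not silently read as local time. This is only a sketch: it assumes "PST" here means US Pacific time (Olson name "America/Los_Angeles"), that your system has the time zone database, and that LC_TIME is English so %a and %b match.

```r
x <- "Fri Dec 11 12:10:51 PST 2020"

# Assumption: "PST" is US Pacific time; the literal "PST" in the format
# only skips over the abbreviation, the zone itself comes from tz
y <- as.POSIXct(x, format = "%a %b %d %T PST %Y", tz = "America/Los_Angeles")

# Show the same instant in UTC to confirm the offset was applied
format(y, tz = "UTC", usetz = TRUE)
# [1] "2020-12-11 20:10:51 UTC"
```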

Change character to date with different format (Win and Mac)

I have several date variables in a data.frame.
They look for example like this:
[1] "10/14/18 17:55:28" "10/15/18 19:27:56"
[3] "11/04/18 15:47:46" "Thu Feb 7 14:51:55 2019"
[5] "Thu Feb 7 17:14:15 2019" "Thu Feb 7 15:46:09 2019"
[7] "Thu Feb 7 11:42:27 2019" "Thu Feb 7 13:24:16 2019"
[9] "Thu Feb 7 18:02:29 2019" "Mon Oct 15 08:48:43 2018"
[11] "10/17/18 17:08:38" "12/08/18 08:08:11"
[13] "10/11/18 21:25:30" "10/14/18 19:15:30"
[15] "10/16/18 11:18:01" "10/16/18 18:19:27"
[17] "Tue Oct 16 19:49:24 2018" "Wed Oct 17 21:36:32 2018"
[19] "Sat Oct 13 11:22:35 2018" "Fri Dec 7 17:12:33 2018"
At the moment this is a character variable. I want to change it with as.Date to subtract the variables from each other.
I already found this:
as.Date( DATE$Sess1, format = "%m/%d/%y")
I would prefer to keep not only the date but also the time.
The real problem is that they include Apple and Windows format which makes it even more complicated.
I would prefer dplyr solutions ;)
You can use lubridate's parse_date_time and include all the formats that it could take.
x <- c("10/14/18 17:55:28" , "10/15/18 19:27:56" ,
"11/04/18 15:47:46" , "Thu Feb 7 14:51:55 2019",
"Thu Feb 7 17:14:15 2019", "Thu Feb 7 15:46:09 2019")
lubridate::parse_date_time(x,c('mdyT', 'amdTY'))
#[1] "2018-10-14 17:55:28 UTC" "2018-10-15 19:27:56 UTC" "2018-11-04 15:47:46 UTC"
#[4] "2019-02-07 14:51:55 UTC" "2019-02-07 17:14:15 UTC" "2019-02-07 15:46:09 UTC"
Read ?parse_date_time to know different format details.
To get the dates, you can wrap as.Date around it.
as.Date(lubridate::parse_date_time(x,c('mdyT', 'amdTY')))
#[1] "2018-10-14" "2018-10-15" "2018-11-04" "2019-02-07" "2019-02-07" "2019-02-07"
For keeping the time, it's best to use a date-time class, e.g. POSIXlt or POSIXct, rather than Date. You can also extend the format string to include the time (e.g. format = "%m/%d/%y %H:%M:%S") - see https://astrostatistics.psu.edu/su07/R/html/base/html/strptime.html for more details on these codes.
as.POSIXlt(DATE$Sess1, format = "%m/%d/%y %H:%M:%S")
As for handling different formats, because the ones you have aren't unambiguous on their own, I suggest having a vector of possible formats, then trying each in turn until one works.
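A rough base R sketch of that idea, in case it helps. The helper name parse_multi is made up for this example, tz = "UTC" is an arbitrary default, and the weekday/month names assume an English LC_TIME locale.

```r
# Hypothetical helper: try each format in turn, filling in only the
# elements that are still NA from the earlier attempts
parse_multi <- function(x, formats, tz = "UTC") {
  out <- .POSIXct(rep(NA_real_, length(x)), tz = tz)
  for (fmt in formats) {
    todo <- is.na(out)
    if (!any(todo)) break
    out[todo] <- as.POSIXct(x[todo], format = fmt, tz = tz)
  }
  out
}

x <- c("10/14/18 17:55:28", "Thu Feb 7 14:51:55 2019")
formats <- c("%m/%d/%y %H:%M:%S", "%a %b %d %H:%M:%S %Y")
res <- parse_multi(x, formats)
res
```

The order of formats matters when a string could match more than one of them; the first match wins. Anything that matches none of the formats stays NA, which makes leftovers easy to spot.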
If you're using the tidyverse, use {lubridate} to reformat. There are two different date/time formats in your example, so you'll need to format them twice.
lubridate::as_datetime(DATE$Sess1, format = "%a %b %e %H:%M:%S %Y")
and then for all the NA results...
lubridate::as_datetime(DATE$Sess1, format = "%m/%d/%y %H:%M:%S")

How to locate and convert the all date format for a file.txt?

Suppose I have a diary.txt file. I import it as a string. All the dates in this string appear to be YYYY.MM.DD, and I want to locate and convert it into DDMMYYYY. What should I do?
For example, here is a diary.txt,
2018.01.01
It's a nice day.
2018.01.02
Today is a rainy day.
It should be converted into
Jan 01 2018
It's a nice day.
Jan 02 2018
Today is a rainy day.
You first need to coerce your dates to proper date objects (as.Date) and then replace them with a newly formatted date. See ?strptime for the syntax on how to specify the new format.
# import data
diary <- tempfile(fileext = ".txt")
cat("2018.01.01
It's a nice day.
2018.01.02
Today is a rainy day.", file = diary)
xy <- readLines(con = diary)
# coerce to proper date format
dates <- as.Date(xy, format = "%Y.%m.%d")
# replace valid dates with new dates formatted using format()
# date should be all non-NAs
xy[!is.na(dates)] <- format(dates[!is.na(dates)], format = "%b %d %Y") # %b will depend on your locale, see ?strptime
# write to file
writeLines(xy, con = "result.txt")
# contents of result.txt
jan. 01 2018
It's a nice day.
jan. 02 2018
Today is a rainy day.
Notice that it doesn't say Jan, but jan. This is due to my locale, which may not match what you are used to.
> Sys.getlocale()
[1] "LC_COLLATE=Slovenian_Slovenia.1250;LC_CTYPE=Slovenian_Slovenia.1250;LC_MONETARY=Slovenian_Slovenia.1250;LC_NUMERIC=C;LC_TIME=Slovenian_Slovenia.1250"
If I set the time locale to something else (this may only work on Windows)
> Sys.setlocale(category = "LC_TIME", locale = "English_United States.1252")
the result is
> xy
[1] "Jan 01 2018" "It's a nice day." "" "Jan 02 2018"
[5] "Today is a rainy day."
Try this out:
# Loading data
data <- readLines("diary.txt")
# Identifying lines with dates
date_lines <- grep("^[[:digit:]]", data)
# Creating dates
data[date_lines] <- format(as.POSIXct(data[date_lines], format = "%Y.%m.%d"), "%b %d %Y")
# Writing to new file
fileConn<-file("diary_fixed.txt")
writeLines(data, fileConn)
close(fileConn)
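Both answers above assume each date sits on a line of its own. If dates can also appear in the middle of a line, one possible approach is to locate them with a regular expression and rewrite only the matches. This is a sketch with made-up sample text; it uses the numeric DDMMYYYY output mentioned in the question, which also avoids the locale issue with %b.

```r
# Made-up sample: one date on its own line, one embedded in a sentence
txt <- c("2018.01.01", "It's a nice day, unlike 2017.12.31.")

# Pattern for YYYY.MM.DD
pat <- "[0-9]{4}\\.[0-9]{2}\\.[0-9]{2}"
m <- gregexpr(pat, txt)

# Reformat every match in place; lines without a match are left untouched
regmatches(txt, m) <- lapply(regmatches(txt, m), function(d) {
  format(as.Date(d, format = "%Y.%m.%d"), "%d%m%Y")
})
txt
# [1] "01012018"                          "It's a nice day, unlike 31122017."
```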

How do I specify POSIX (time) format for 3 letter tz in R, in order to ignore it?

For output, the specification is %Z (see ?strptime). But for input, how does that work?
To clarify, it'd be great for the time zone abbreviation to be parsed into useful information by as.POSIXct(), but more core to the question is how to get the function to at least ignore the time zone.
Here is my best workaround, but is there a particular format code to pass to as.POSIXct() that will work for all time zones?
times <- c("Fri Jul 03 00:15:00 EDT 2015", "Fri Jul 03 00:15:00 GMT 2015")
as.POSIXct(times, format="%a %b %d %H:%M:%S %Z %Y") # nope! strptime can't handle %Z in input
formats <- paste("%a %b %d %H:%M:%S", gsub(".+ ([A-Z]{3}) [0-9]{4}$", "\\1", times),"%Y")
as.POSIXct(times, format=formats) # works
Edit: Here is the output from the last line, as well as its class (from a separate call); the output is as expected. From the console:
> as.POSIXct(times, format=formats)
[1] "2015-07-03 00:15:00 EDT" "2015-07-03 00:15:00 EDT"
> attributes(as.POSIXct(times, format=formats))
$class
[1] "POSIXct" "POSIXt"
$tzone
[1] ""
The short answer is, "no, you can't." Those are abbreviations and they are not guaranteed to uniquely identify a specific timezone.
For example, is "EST" Eastern Standard Time in the US or Australia? Is "CST" Central Standard Time in the US or Australia, or is it China Standard Time, or is it Cuba Standard Time?
I just noticed that you're not trying to parse the timezone abbreviation, you are simply trying to avoid it. I don't know of a way to tell strptime to ignore arbitrary characters. I do know that it will ignore anything in the character representation of the time after the end of the format string. For example:
R> # The year is not parsed, so the current year is used
R> as.POSIXct(times, format="%a %b %d %H:%M:%S")
[1] "2015-07-03 00:15:00 UTC" "2015-07-03 00:15:00 UTC"
Other than that, a regular expression is the only thing I can think of that solves this problem. Unlike your example, I would use the regex on the input character vector to remove all 3-5 character timezone abbreviations.
R> times_no_tz <- gsub(" [[:upper:]]{3,5} ", " ", times)
R> as.POSIXct(times_no_tz, format="%a %b %d %H:%M:%S %Y")
[1] "2015-07-03 00:15:00 UTC" "2015-07-03 00:15:00 UTC"
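If you do want to use the abbreviation rather than drop it, the only safe route I can see is to supply the mapping yourself, precisely because the abbreviations are ambiguous. A sketch along those lines, where tz_lookup is a hand-written table reflecting my own assumption about what "EDT" and "GMT" mean in this data (it also assumes an English LC_TIME locale and an installed time zone database):

```r
times <- c("Fri Jul 03 00:15:00 EDT 2015", "Fri Jul 03 00:15:00 GMT 2015")

# Assumption: these abbreviations refer to these zones in *this* data set
tz_lookup <- c(EDT = "America/New_York", GMT = "UTC")

# Pull out the abbreviation, then remove it from the string
abbr <- sub(".+ ([A-Z]{3,5}) [0-9]{4}$", "\\1", times)
stripped <- sub(" [A-Z]{3,5} ", " ", times)

# Parse each element in its own zone; an abbreviation missing from
# tz_lookup raises an error here instead of being guessed
fmt <- "%a %b %d %H:%M:%S %Y"
secs <- vapply(seq_along(times), function(i) {
  as.numeric(as.POSIXct(stripped[i], format = fmt, tz = tz_lookup[[abbr[i]]]))
}, numeric(1))

# A POSIXct vector carries a single tzone attribute, so show everything in UTC
utc <- .POSIXct(secs, tz = "UTC")
utc
# [1] "2015-07-03 04:15:00 UTC" "2015-07-03 00:15:00 UTC"
```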

How to convert a character date into POSIX time without loss in precision

I have the following date which I want to convert into POSIX time. I followed this answer but there's a difference between the input and the output date if I convert the date back.
char_date <- "2012-04-27T20:48:14"
unix_date <- as.integer(as.POSIXct(char_date, origin="1970-01-01"))
unix_date
# [1] 1335448800
which translates back to Thu, 26 Apr 2012 14:00:00.
What am I messing up?
No need for sub and you should always define the time zone:
x <- as.POSIXct("2012-04-27T20:48:14", format="%Y-%m-%dT%H:%M:%S", tz="CET")
#[1] "2012-04-27 20:48:14 CEST"
as.numeric(x)
#[1] 1335552494
I think there are two issues in play here: the T character is affecting the character parser so it ignores the time part, and I assume your timezone is UTC+10, which is why your translation is at 2pm the previous day.
(as.POSIXct(char_date, origin="1970-01-01"))
[1] "2012-04-27 BST"
(as.POSIXct(sub("T"," ",char_date), origin="1970-01-01"))
[1] "2012-04-27 20:48:14 BST"
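A quick way to check that nothing is lost is to do the round trip with the format and the time zone given explicitly on both legs. A minimal sketch, using UTC as an arbitrary choice of zone (substitute whichever zone the timestamps were actually recorded in):

```r
char_date <- "2012-04-27T20:48:14"

# Parse with an explicit format (so the "T" is handled) and an explicit zone
x <- as.POSIXct(char_date, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
secs <- as.numeric(x)
secs
# [1] 1335559694

# Convert back using the same zone and format
back <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(back, "%Y-%m-%dT%H:%M:%S")
# [1] "2012-04-27T20:48:14"
```

The value differs by 7200 seconds from the CET result above because CEST is two hours ahead of UTC on that date.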

Resources