The dates in my dataset look like this: 20130501000000, and I'm trying to convert them to a proper datetime format in R:
data1$date <- as.Date(data1$date, format = "%Y-%m-%s-%h-%m-%s")
However, I get an error saying an origin is needed. After I supply the very first cell of the date column as the origin, every value in the column is converted to NA. Is this right, or should I try as.POSIXct()?
That is a somewhat degenerate format, but the anytime() and anydate() functions of the anytime package can help you, without requiring any explicit format strings:
R> anytime("20130501000000") ## returns POSIXct
[1] "2013-05-01 CDT"
R> anydate("20130501000000") ## returns Date
[1] "2013-05-01"
R>
Note that we parse from a character representation here -- parsing from numeric would be wrong, as a different heuristic is used to make sense of dates stored as numeric values.
So here your code would just become
data1$date <- anytime::anydate(data1$date)
provided data1$date is a character column; otherwise wrap it in as.character().
Lastly, if you actually want a datetime rather than a Date (as per your title), use anytime() instead of anydate().
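Putting that together for the data frame in the question (a minimal sketch, assuming the column is data1$date and may be stored as numeric, as the origin error suggests):
# sketch based on the question's data1$date; as.character() covers the numeric case
data1$date <- anytime::anytime(as.character(data1$date))    # POSIXct datetime
# or, if a plain Date is enough:
# data1$date <- anytime::anydate(as.character(data1$date))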
Before I get to my detailed answer, note that the format argument should describe the format your string is actually in. Your string "20130501000000" has no - separators between the date components, so you have to use:
as.Date("20130501000000", format = "%Y%m%d%H%M%S")
# [1] "2013-05-01"
which works just fine, does not produce any error, and will return an object of class Date:
as.Date("20130501000000", format = "%Y%m%d%H%M%S") |> class()
# [1] "Date"
So I think your issue is one of formatting, not of the date's origin.
Now to my detailed answer:
As far as I know, as.Date() converts the string to a Date (dropping the time), so if you want the time part of the string as well, you have to use as.POSIXct():
as.POSIXct("20130501000000", format = "%Y%m%d%H%M%S")
# [1] "2013-05-01 EEST"
as.POSIXct("20130501000000", format = "%Y%m%d%H%M%S") |> class()
# [1] "POSIXct" "POSIXt"
Note that the timezone is EEST, which is my local timezone; if you want a specific timezone, you have to set it explicitly with the tz argument. For example, to set the timezone to UTC:
as.POSIXct("20130501000000", format = "%Y%m%d%H%M%S", tz = "UTC")
# [1] "2013-05-01 UTC"
Using as.POSIXct() also lets you do date-time arithmetic with the resulting object:
times <- c("20130501000000",
           "20130501035001") # added 03:50:01 to the first element
class(times)
# [1] "character"
times <- as.POSIXct(times, format = "%Y%m%d%H%M%S", tz = "UTC")
class(times)
# [1] "POSIXct" "POSIXt"
times[2] - times[1]
# Time difference of 3.833611 hours
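If you want the difference in particular units rather than whatever difftime() picks automatically, pass units explicitly (a small sketch using the same times vector):
difftime(times[2], times[1], units = "mins")
# Time difference of 230.0167 mins
difftime(times[2], times[1], units = "secs")
# Time difference of 13801 secs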
I have a vector a = 40208.64507.
In Excel, I can automatically change a into a datetime, 2010/1/30 15:28:54, by clicking the Date type.
I have tried some methods but cannot get the same result in R as in Excel.
a = 40208.64507
# in Excel, a can change into: 2010/1/30 15:28:54
as.Date(a, origin = "1899-12-30")
lubridate::as_datetime(a, origin = "1899-12-30")
Is there any way to get the same results in R as in Excel?
Here are several ways. The chron class is the closest to Excel in terms of internal representation -- they are the same except for the origin -- and the simplest, so we list that one first. We also show how to use chron as an intermediate step to get POSIXct.
Base R provides an approach that avoids package dependencies, and lubridate can be used if you are already using it.
1) Add the appropriate origin using chron to get a chron datetime or convert that to POSIXct. Like Excel, chron works in days and fractions of a day, but chron uses the UNIX Epoch as origin whereas Excel uses the one shown below.
library(chron)
a <- 40208.64507
# chron date/time
ch <- as.chron("1899-12-30") + a; ch
## [1] (01/30/10 15:28:54)
# POSIXct date/time in local time zone
ct <- as.POSIXct(ch); ct
## [1] "2010-01-30 10:28:54 EST"
# POSIXct date/time in UTC
as.POSIXct(format(ct), tz = "UTC")
## [1] "2010-01-30 10:28:54 UTC"
2) Using only base R convert the number to Date class using the indicated origin and then to POSIXct.
# POSIXct with local time zone
ct <- as.POSIXct(as.Date(a, origin = "1899-12-30")); ct
## [1] "2010-01-30 10:28:54 EST"
# POSIXct with UTC time zone
as.POSIXct(format(ct), tz = "UTC")
## [1] "2010-01-30 15:28:54 UTC"
3) Using lubridate, it is similar to base R, so we can write
library(lubridate)
# local time zone
as_datetime(as_date(a, origin = "1899-12-30"), tz = "")
## [1] "2010-01-30 15:28:54 EST"
# UTC time zone
as_datetime(as_date(a, origin = "1899-12-30"))
## [1] "2010-01-30 15:28:54 UTC"
I am working with an external package that's converting columns of a dataframe with the lubridate date type Date into numeric type. (Confirmed by running as.numeric() on the columns).
I'm wondering if there's a way to convert it back?
For example, if I have the date "01-01-2021" then running as.numeric on it returns -719143. How can I turn that back into "01-01-2021"?
Note that Date class is part of base R, not lubridate.
You probably assumed that the data was year/month/day by mistake. Using base R to eliminate lubridate as a problem we can replicate the question's result like this:
as.numeric(as.Date("01-01-2021", "%Y-%m-%d"))
## [1] -719143
Had we used day/month/year we would have gotten:
as.numeric(as.Date("01-01-2021", "%d-%m-%Y"))
## [1] 18628
or using lubridate
library(lubridate)
as.numeric(dmy("01-01-2021"))
## [1] 18628
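For completeness, a numeric produced by a correct conversion, such as 18628 above, goes straight back to Date given the standard origin (a quick base R sketch):
as.Date(18628, origin = "1970-01-01")
## [1] "2021-01-01"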
It would be best if you fix the mistake that resulted in -719143 but if you don't control that and are faced with an input of
-719143 and want to get as.Date("2021-01-01") as the output then:
# input x is numeric; result is Date class
fixup <- function(x) as.Date(format(.Date(x), "%y-%m-%d"), "%d-%m-%y")
fixup(-719143)
## [1] "2020-01-01"
Note that we can't tell from the question whether 01-01-2021 is supposed to represent day-month-year or month-day-year, so we assumed the first; if it is meant to be the second, it should be obvious at this point how to proceed.
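To see which calendar date a suspicious numeric decodes to before deciding how to repair it, you can attach the Date class directly, just as fixup() does internally (a quick sketch):
.Date(-719143)  # decodes to 20 January of year 1
# equivalently: as.Date(-719143, origin = "1970-01-01")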
EDIT #2: It looks like the original data is being parsed as Jan 20, year 1, which might happen if the year-month-day columns were jumbled while being parsed:
as.numeric(as.Date("01-01-2021", format = "%Y-%m-%d", origin = "1970-01-01"))
[1] -719143
as.numeric(as.Date("0001-01-20", origin = "1970-01-01"))
[1] -719143
Is there a way to share an example of the raw data as you have it? e.g. dput(MY_DATA[1:10, DATE_COL])
EDIT: -719143 is about 1970 years of days, which can't be a coincidence, given that many date/time formats use 1970 as a baseline. I wonder if 01-01-2021 is being interpreted as the numeric formula equal to -2021 and so we're looking at perhaps -2021 seconds/days/[?] before year zero, which would be about -1970 years before the epoch...
-719143/(365)
[1] -1970.255
For instance, we can get something close with:
as.numeric(as.Date("0000-01-01", origin = "1970-01-01"))
[1] -719528
Original answer:
R treats a string describing a date as text:
x <- "01-01-2021"
class(x)
[1] "character"
We can convert it to a Date data type using these two equivalent commands:
base_dt <- as.Date(x, "%m-%d-%Y") # base R version
lubridt <- lubridate::mdy(x) # convenience lubridate function
identical(base_dt, lubridt)
[1] TRUE
Under the hood, a Date object in R is a numeric value with a flag telling R it's a date:
> typeof(lubridt) # What general type of data is it?
[1] "double" # --> numeric, stored as a double
> as.numeric(lubridt)
[1] 18628
> class(lubridt) # Does it have any special class attributes?
[1] "Date" # --> yes, it's a Date
> dput(lubridt) # How would we construct it from scratch?
structure(18628, class = "Date") # --> by giving 18628 a Date attribute
In R, a Date is encoded as the number of days since 1970 began:
> as.Date("1970-01-1") + as.numeric(lubridt)
[1] "2021-01-01"
We could convert it back to the original text using:
format(base_dt, "%m-%d-%Y")
[1] "01-01-2021"
identical(x, format(base_dt, "%m-%d-%Y"))
[1] TRUE
I am working with an input file where the string dates come in several different month, day, year formats.
Example input:
input <- c("2014-08-31 23:59:38" , "9/1/2014 00:00:25","2014-08-31 13:39:23", "12/1/2014 20:03:28")
How can I use a single function to convert these various date formats quickly? I am processing millions of lines.
So far I have written this function:
convert_date <- function(x){
  if (is.na(mdy_hms(x))){
    return(ymd_hms(x))
  }
  return(mdy_hms(x))
}
However, it is extremely slow; I am looking for a faster and more convenient method.
Thank you so much for your time.
If you can construct a vector of the possible formats that the dates could be in, you could use the clock package. For each date-time string, it stops on the first format that succeeds.
Note that this only works if your formats are unambiguous, i.e. it would probably give you faulty results if you had both %m/%d/%Y and %d/%m/%Y in the same vector, because many strings match either one.
library(clock)
input <- c(
"2014-08-31 23:59:38" , "9/1/2014 00:00:25",
"2014-08-31 13:39:23", "12/1/2014 20:03:28"
)
format <- c("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M:%S")
date_time_parse(input, zone = "UTC", format = format)
#> [1] "2014-08-31 23:59:38 UTC" "2014-09-01 00:00:25 UTC"
#> [3] "2014-08-31 13:39:23 UTC" "2014-12-01 20:03:28 UTC"
I have a Time column in my df with values like 1.01.2016 0:00:05. I want it without the seconds and therefore used df$Time <- as.POSIXct(df$Time, format = "%d.%m.%Y :%H:%M", tz = "Asia/Kolkata"). But I get NA values. What is the problem here?
I suspect there are two things at work here: the storage of a time object (POSIXt), and the representation of that object.
The string you present is (I believe) not a proper POSIXt (whether POSIXct or POSIXlt) object for R, which means it is just a character string. In that case, you can strip the seconds with:
gsub(':[^:]*$', '', '1.01.2016 0:00:05')
# [1] "1.01.2016 0:00"
However, that is still just a string, not a date or time object. If you parse it into a time-object that R knows about:
as.POSIXct("1.01.2016 0:00:05", format = "%d.%m.%Y %H:%M:%S", tz = "Asia/Kolkata")
# [1] "2016-01-01 00:00:05 IST"
then you now have a time object that R knows something about ... and R defaults to representing it (printing it on the console) with seconds precision. Typically, the only thing you can change about the console printing is the precision of the seconds, as in
options("digits.secs")
# $digits.secs
# NULL
Sys.time()
# [1] "2018-06-26 18:21:06 PDT"
options("digits.secs"=3)
Sys.time()
# [1] "2018-06-26 18:21:10.090 PDT"
then you can get more. But alas, I do not think there is an R option to say "always print my POSIXt objects in this way". So your only choice is (at the point where you no longer need it to be a time-like object) to change it back into a string formatted the way you want:
x <- as.POSIXct("1.01.2016 0:00:05", format = "%d.%m.%Y %H:%M:%S", tz = "Asia/Kolkata")
x
# [1] "2016-01-01 00:00:05 IST"
?strptime
# see that day-of-month can either be "%d" for 01-31 or "%e" for 1-31
format(x, format="%e.%m.%Y %H:%M")
# [1] " 1.01.2016 00:00"
(This works equally well for a vector.)
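If you prefer the zero-padded day mentioned in the comment above, use "%d" instead of "%e" (same x as before):
format(x, format = "%d.%m.%Y %H:%M")
# [1] "01.01.2016 00:00"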
Part of me suggests converting to POSIXt and back to a string, as opposed to my gsub example, because as.POSIXct will tell you (with NA) when a string does not match the date-time format you are expecting, whereas gsub will happily do something wrong or nothing.
Try as.POSIXlt():
> test <- "1.01.2016 0:00:05"
> as.POSIXlt(test, format = "%d.%m.%Y %H:%M:%S", tz = "Asia/Kolkata")
[1] "2016-01-01 00:00:05 IST"
In the R package lubridate, I can easily create a date with the following syntax:
> mdy("5/4/2015")
As expected, it produces the following result:
[1] "2015-05-04 UTC"
However, if I combine that very value into a vector with c(), it seems to change from UTC to my local time (EDT):
> c(mdy("5/4/2015"))
[1] "2015-05-03 20:00:00 EDT"
Since I don't care about times this wouldn't affect me much except that this results in the date shifting back by 1, as follows:
> day(mdy("5/4/2015"))
[1] 4
> day(c(mdy("5/4/2015")))
[1] 3
To me, the act of combining something into a vector should not change its value. Am I missing something here, and is there a way to resolve this issue?
That's because lubridate::mdy assumes UTC. When you wrap it in c(), the result is displayed in your local timezone (EDT) because c() does not pass on the timezone attribute:
> attr(mdy("5/4/2015", tz = "EDT"), "tzone")
# [1] "EDT"
> attr(c(mdy("5/4/2015", tz = "EDT")), "tzone")
# NULL
You can do:
Sys.setenv(TZ = "UTC")
To set your local timezone to UTC.
Alternatively, you can specify the timezone explicitly in mdy():
mdy("5/4/2015", tz = "UTC")
Apart from Steven's solution, you could also store your dates in a list
list(mdy("5/4/2015"))[[1]]
#[1] "2015-05-04 UTC"
This won't remove the timezone and you don't have to mess around with environment variables.
I agree with you: if you look at c() as some form of constructor for a "vector", and you come from a C++ or similar background, the removal of attributes (except for names) certainly seems strange.
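One more workaround, in case you are on an R/lubridate combination where c() drops the attribute (as it did when this question was asked): reattach "tzone" after combining. A minimal sketch:
library(lubridate)
x <- c(mdy("5/4/2015", tz = "UTC"))
attr(x, "tzone") <- "UTC"  # put back the "tzone" attribute if c() stripped it
x
# [1] "2015-05-04 UTC"
day(x)
# [1] 4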