Sometimes I am given data sets that have two different date formats but common variables, and they have to be joined into one data frame. Over the years, I've tried various solutions to get around this workflow hassle. Now that I've been using lubridate, it seems like many of these problems are easily solved. However, I am encountering some behaviour that seems weird to me, though I imagine there is a good explanation that is beyond me. Say I am given a data set with different date formats that I join into one data frame. This data frame looks like this:
library(lubridate)
library(dplyr)
df<-data.frame(Lab=c("A","B"),DATE=c("12/15/15","12/15/2013")); df
I want to convert this data to a date format with lubridate. However the following does not format consistently:
df %>%
mutate(mdy(DATE))
...but rather creates a 0015 date. If I filter just for Lab "A":
df %>%
filter(Lab=="A") %>%
mutate(mdy(DATE))
... or even group_by Lab:
df %>%
group_by(Lab) %>%
mutate(mdy(DATE))
Then I get the desired year format. Is this the correct behaviour of the lubridate family of date formatting functions? Is there a better way to accomplish what I am doing? I am sure that multiple date formats in one column is a relatively common (and annoying) occurrence.
Thanks in advance.
parse_date_time from the lubridate package can parse multiple date formats in one go.
Syntax:
df$date = parse_date_time(df$date, c(format1, format2, format3))
You need to specify all the possible format types.
Since lubridate does not always guess correctly between some format types (such as two-digit and four-digit years), you may need to define a custom format selection.
The help page for parse_date_time contains the illustration below. You can adapt it to suit your requirement.
## ** how to use `select_formats` argument **
## By default %Y has precedence:
parse_date_time(c("27-09-13", "27-09-2013"), "dmy")
## [1] "13-09-27 UTC" "2013-09-27 UTC"
## to give priority to %y format, define your own select_format function:
my_select <- function(trained){
n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
names(trained[ which.max(n_fmts) ])
}
parse_date_time(c("27-09-13", "27-09-2013"), "dmy", select_formats = my_select)
## [1] "2013-09-27 UTC" "2013-09-27 UTC"
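If format guessing is not wanted at all, the same result can be sketched in base R by choosing the format per element from the length of the year field. This is only an illustration, not lubridate's method: the variable names are made up here, and it assumes every value is month/day/year with either a two-digit or a four-digit year.

```r
# Mixed two-digit and four-digit years in one column (values from the question)
dates <- c("12/15/15", "12/15/2013")

# The year is whatever follows the last slash
year_part <- sub(".*/", "", dates)
is_long <- nchar(year_part) == 4

# Parse everything as %y first, then overwrite the four-digit rows with %Y
parsed <- as.Date(dates, format = "%m/%d/%y")
parsed[is_long] <- as.Date(dates[is_long], format = "%m/%d/%Y")

print(parsed)
# [1] "2015-12-15" "2013-12-15"
```

Note that base R maps two-digit years 00 to 68 onto 2000 to 2068, so this only suits data where that rule is acceptable.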
I have a data frame with a character column of date-times.
When I use as.Date, most of my strings are parsed correctly, except for a few instances. The example below will hopefully show you what is going on.
# my attempt to parse the string to Date -- uses the stringr package
prods.all$Date2 <- as.Date(str_sub(prods.all$Date, 1,
str_locate(prods.all$Date, " ")[1]-1),
"%m/%d/%Y")
# grab two rows to highlight my issue
temp <- prods.all[c(1925:1926), c(1,8)]
temp
# Date Date2
# 1925 10/9/2009 0:00:00 2009-10-09
# 1926 10/15/2009 0:00:00 0200-10-15
As you can see, the year of some of the dates is inaccurate. The pattern seems to occur when the day is double digit.
Any help you can provide will be greatly appreciated.
The easiest way is to use lubridate:
library(lubridate)
prods.all$Date2 <- mdy_hms(prods.all$Date)
mdy_hms() parses the full date-time string and returns objects of class POSIXct; it works with either factors or characters. Wrap the result in as.Date() if you only need the date.
You may be overcomplicating things. Is there any reason you need the stringr package? You can use as.Date and its format argument to specify the input format of your string.
df <- data.frame(Date = c("10/9/2009 0:00:00", "10/15/2009 0:00:00"))
as.Date(df$Date, format = "%m/%d/%Y %H:%M:%S")
# [1] "2009-10-09" "2009-10-15"
Note the Details section of ?as.Date:
Character strings are processed as far as necessary for the format specified: any trailing characters are ignored
Thus, this also works:
as.Date(df$Date, format = "%m/%d/%Y")
# [1] "2009-10-09" "2009-10-15"
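The same rule also suggests where the 0200 year in the question came from. A rough base R reproduction, assuming the first row of the real data has a single-digit day (the variable names here are illustrative): indexing the str_locate result with [1] keeps only the first string's space position, so every string is cut to that one length.

```r
x <- c("10/9/2009 0:00:00", "10/15/2009 0:00:00")

# Position of the first space in the FIRST string only,
# which is what `[1]` on the str_locate() result picks out
first_space <- as.integer(regexpr(" ", x[1], fixed = TRUE))

# Every string is truncated to the same length, cutting the second year short
truncated <- substr(x, 1, first_space - 1)
print(truncated)
# [1] "10/9/2009" "10/15/200"

# "%Y" then reads "200" as the year for the second element
bad <- as.Date(truncated, format = "%m/%d/%Y")

# Stripping the time part per element avoids the problem
date_part <- sub(" .*$", "", x)
good <- as.Date(date_part, format = "%m/%d/%Y")
print(good)
# [1] "2009-10-09" "2009-10-15"
```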
All the conversion specifications that can be used to specify the input format are found in the Details section in ?strptime. Make sure that the order of the conversion specification as well as any separators correspond exactly with the format of your input string.
More generally and if you need the time component as well, use as.POSIXct or strptime:
as.POSIXct(df$Date, format = "%m/%d/%Y %H:%M:%S")
strptime(df$Date, "%m/%d/%Y %H:%M:%S")
I'm guessing at what your actual data might look like from the partial results you give.
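A small sketch of the date-time route with named arguments. The second positional argument of as.POSIXct is tz, so format is named explicitly here, and the "UTC" time zone is an assumption made only to keep the result independent of the local machine.

```r
x <- c("10/9/2009 0:00:00", "10/15/2009 0:00:00")

# Name `format` explicitly; fix the time zone so results do not depend on locale
stamps <- as.POSIXct(x, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Drop the time component again when only the date is needed
dates_only <- as.Date(stamps, tz = "UTC")
print(dates_only)
# [1] "2009-10-09" "2009-10-15"
```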
If you don't know the format you could use anytime::anydate, which tries to match to common formats:
library(anytime)
date <- c("01/01/2000 0:00:00", "Jan 1, 2000 0:00:00", "2000-Jan-01 0:00:00")
anydate(date)
[1] "2000-01-01" "2000-01-01" "2000-01-01"
library(lubridate)
If your date format looks like '04/24/2017 05:35:00', first replace the slashes with hyphens:
prods.all$Date2<-gsub("/","-",prods.all$Date2)
Then parse the date-time:
parse_date_time(prods.all$Date2, orders="mdy hms")
I have character field in the following format
df
sd
10/12/2017 6:12
10/12/2017 6:14
I want to convert it into date format so that I can extract time from it.
Now having read this, I don't want to use regex as I want to keep it more generic as the formats might change. So I wanted to convert it to a date format and use lubridate to extract required fields.
So I used the following:
d1 <- strptime(df$sd[1], "%m/%d/%Y%H:%M")
and it gives me the following result
d1
[1] "2017-10-12 IST"
whereas I was expecting the hours and mins to be included as well.
Also on trying to use
format(dmy_hms(df$sd), "%H:%M:%S")
I get that "All formats failed to parse. No formats found"
Any suggestions on this?
Here's the lubridate solution:
library(lubridate)
df <- data.frame(sd = c('10/12/2017 6:12', '10/12/2017 6:14'), stringsAsFactors = FALSE)
dmy_hm(df$sd)
#[1] "2017-12-10 06:12:00 UTC" "2017-12-10 06:14:00 UTC"
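For comparison, a base R sketch that extracts the time, as the question asked. It assumes the strings are month/day/year, matching the %m/%d/%Y order in the question's own strptime attempt (the lubridate call above reads them as day/month/year instead), and it includes the space between date and time in the format. The "UTC" zone and the variable names are illustrative choices.

```r
sd <- c("10/12/2017 6:12", "10/12/2017 6:14")

# Month-first is an assumption; swap %m and %d if the data is day-first
stamps <- as.POSIXct(sd, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Pull out just the time of day as text
times <- format(stamps, "%H:%M")
print(times)
# [1] "06:12" "06:14"
```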
I have a monthly data file where dates are stored in %tm format of Stata like 2000m1. How I can convert it to dates?
I could do something like manipulate the strings into 2000-01-01 but I would like to avoid this if possible.
as.Date('2000m1') (unsurprisingly) returns NA.
1) yearmon Using the zoo package, this converts it to a "yearmon" class object which may make more sense than converting it to a "Date" given that you have no day of the month. Such objects are internally represented as a year + 0 for Jan, year + 1/12 for Feb, etc. so they sort properly.
library(zoo)
as.yearmon('2000m1', '%Ym%m')
## [1] "Jan 2000"
If you really want "Date" class then the following give the start and end of month respectively:
as.Date(as.yearmon('2000m1', '%Ym%m'))
## [1] "2000-01-01"
as.Date(as.yearmon('2000m1', '%Ym%m'), frac = 1)
## [1] "2000-01-31"
2) paste This does not use any packages and while it does use paste it's a fairly minimal use of string manipulation:
as.Date(paste("2000m1", 1), "%Ym%m %d")
## [1] "2000-01-01"
Note: Be sure not to use any solution that returns a POSIXct object rather than a "yearmon" or "Date" object since then you have introduced the possibility of future potential errors based on time zones into your code which can be completely avoided by using an appropriate class. See the R Help Desk article in R News 4/1.
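The paste approach extends to a whole vector. The sketch below stays in base R and also derives month ends by stepping to the next month and subtracting a day; the sample values and variable names are illustrative only.

```r
stata_months <- c("2001m1", "2010m3", "2015m12", "2009m8")

# Append a day-of-month so the string is a complete date
month_start <- as.Date(paste(stata_months, 1), format = "%Ym%m %d")
print(month_start)
# [1] "2001-01-01" "2010-03-01" "2015-12-01" "2009-08-01"

# Month end = first day of the following month, minus one day
month_end <- do.call(c, lapply(month_start, function(d) {
  seq(d, by = "month", length.out = 2)[2] - 1
}))
print(month_end)
# [1] "2001-01-31" "2010-03-31" "2015-12-31" "2009-08-31"
```

Both results are "Date" objects, so no time zone is involved.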
This can be done very easily with the amazing lubridate package:
data <- c("2001m1","2010m3","2015m12","2009m8")
library(lubridate)
parse_date_time(data,orders="%Y%m")
[1] "2001-01-01 UTC" "2010-03-01 UTC" "2015-12-01 UTC" "2009-08-01 UTC"
I'm trying to use as.Date in R.
I'm using the command:
as.Date("65-05-14", "%y-%m-%d")
I get:
"2065-05-14"
Is there any way to get it show 1965 instead? Or do I need to recode everything into long format -- eg add 1900 as a numeric?
Thanks!
I didn't see this simple solution in the linked questions, so I'm adding it here too.
In base R you can simply use as.POSIXlt, whose result has a year component (stored as years since 1900). You can then simply subtract 100 years.
Let's say this is your dates vector
(Date <- c("65-05-14", "15-05-14", "25-05-14", "34-05-14"))
## [1] "65-05-14" "15-05-14" "25-05-14" "34-05-14"
You can simply do
Date <- as.POSIXlt(Date, format = "%y-%m-%d")
Date$year <- Date$year - 100L
Date # Alternatively, you could also do `as.Date(Date)`
## [1] "1965-05-14 IDT" "1915-05-14 IDT" "1925-05-14 IDT" "1934-05-14 IDT"
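Subtracting 100 from every element also turns "15-05-14" into 1915. If only years beyond some point should move back a century, a variation along these lines might help. The cutoff date is a made-up value for illustration, and tz = "UTC" is used only to keep the result machine-independent.

```r
dates <- c("65-05-14", "15-05-14", "25-05-14", "34-05-14")
parsed <- as.POSIXlt(dates, format = "%y-%m-%d", tz = "UTC")

# Hypothetical cutoff: anything parsed as later than this is assumed to be 19xx
cutoff <- as.Date("2020-01-01")
too_late <- as.Date(parsed) > cutoff

# Shift only the affected elements back by a century
parsed$year[too_late] <- parsed$year[too_late] - 100L
result <- as.Date(parsed)
print(result)
# [1] "1965-05-14" "2015-05-14" "1925-05-14" "1934-05-14"
```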