Lubridate mdy function - r

I'm trying to convert the following dates and am not successful with the first one: "4/2/10" becomes "0010-04-02".
Is there a way to correct this?
thanks,
Vivek
library(lubridate)
data <- data.frame(initialDiagnose = c("4/2/10","14.01.2009", "9/22/2005",
"4/21/2010", "28.01.2010", "09.01.2009", "3/28/2005",
"04.01.2005", "04.01.2005", "Created on 9/17/2010", "03 01 2010"))
mdy <- mdy(data$initialDiagnose)
dmy <- dmy(data$initialDiagnose)
mdy[is.na(mdy)] <- dmy[is.na(mdy)] # some dates are ambiguous, here we give
data$initialDiagnose <- mdy # mdy precedence over dmy
data
initialDiagnose
1 0010-04-02
2 2009-01-14
3 2005-09-22
4 2010-04-21
5 2010-01-28
6 2009-09-01
7 2005-03-28
8 2005-04-01
9 2005-04-01
10 2010-09-17
11 2010-03-01

I think this happens because mdy() prefers to match the year with %Y (the full year, taken literally) over %y (the two-digit abbreviation, which is expanded to 19XX or 20XX).
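To make the difference between the two year codes concrete, here is a minimal base-R sketch. It does not use lubridate, so it only illustrates what the format codes do, not how mdy() guesses between them; the variable names are mine:

```r
# "%y" reads a two-digit year and expands it to a four-digit year;
# "%Y" reads the digits as the literal year.
two_digit <- as.Date("4/2/10", format = "%m/%d/%y")
literal   <- as.Date("4/2/10", format = "%m/%d/%Y")

format(two_digit, "%Y-%m-%d")       # "2010-04-02"
as.integer(format(literal, "%Y"))   # 10, i.e. the year 10 AD
```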
There is a workaround, though. I took a look at the help files for lubridate::parse_date_time (?parse_date_time), and near the bottom of the help file, there is an example for adding an argument that prefers matching with the %y format over the %Y format for the year. The relevant bit of code from the help file:
## ** how to use `select_formats` argument **
## By default %Y has precedence:
parse_date_time(c("27-09-13", "27-09-2013"), "dmy")
## [1] "13-09-27 UTC" "2013-09-27 UTC"
## to give priority to %y format, define your own select_format function:
my_select <- function(trained){
n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
names(trained[ which.max(n_fmts) ])
}
parse_date_time(c("27-09-13", "27-09-2013"), "dmy", select_formats = my_select)
## [1] "2013-09-27 UTC" "2013-09-27 UTC"
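To see why my_select ends up preferring %y, it helps to run the scoring by hand. This is only a sketch: the trained vector below is a made-up stand-in for what parse_date_time passes in (per the help file, a named integer vector whose names are the candidate formats and whose values are match counts):

```r
# Same function as in the help file excerpt above.
my_select <- function(trained){
  n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
  names(trained[ which.max(n_fmts) ])
}

# Hypothetical input for illustration only.
trained <- c("%d-%m-%Y" = 2L, "%d-%m-%y" = 2L)

# Each format scores one point per "%" token, plus 1.5 if it contains "%y".
scores <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained)) * 1.5
scores              # 3.0 4.5
my_select(trained)  # "%d-%m-%y"
```

Both formats have three "%" tokens, so the 1.5 bonus is what tips the choice toward the two-digit year.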
So, for your example, you can adapt this code and replace the mdy <- mdy(data$initialDiagnose) line with this:
# Define a select function that prefers %y over %Y. This is copied
# directly from the help files
my_select <- function(trained){
n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
names(trained[ which.max(n_fmts) ])
}
# Parse as mdy dates
mdy <- parse_date_time(data$initialDiagnose, "mdy", select_formats = my_select)
# [1] "2010-04-02 UTC" NA "2005-09-22 UTC" "2010-04-21 UTC" NA
# [6] "2009-09-01 UTC" "2005-03-28 UTC" "2005-04-01 UTC" "2005-04-01 UTC" "2010-09-17 UTC"
#[11] "2010-03-01 UTC"
Running the remaining lines of code from your question then gives me this data frame:
initialDiagnose
1 2010-04-02
2 2009-01-14
3 2005-09-22
4 2010-04-21
5 2010-01-28
6 2009-09-01
7 2005-03-28
8 2005-04-01
9 2005-04-01
10 2010-09-17
11 2010-03-01
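If you would rather not depend on format guessing at all, one alternative is to decide per string which year code applies. The following is only a sketch in base R, restricted to the slash-separated month/day/year strings from the question; the regular expression and variable names are my own illustration, not part of lubridate:

```r
x <- c("4/2/10", "9/22/2005", "4/21/2010", "3/28/2005")

# TRUE where the string ends in a two-digit year, e.g. "4/2/10"
short_year <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$", x)

# Parse everything with the four-digit code first, then redo the
# two-digit-year strings with "%y".
parsed <- as.Date(x, format = "%m/%d/%Y")
parsed[short_year] <- as.Date(x[short_year], format = "%m/%d/%y")

format(parsed)
# "2010-04-02" "2005-09-22" "2010-04-21" "2005-03-28"
```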

Related

R: Lubridate Failing to Convert Character to Numeric

I'm new to R. I searched old posts for an answer but did not come across anything that resolved my issue.
I pulled in a csv with the time a trip started in the mdy h:mm:ss format, but it is currently recognized as a character. I've tried to use mdy_hms(c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00"))
as well as
as.Date(parse_date_time(dc_biketrips$started_at, c(mdy_hms))) to no avail.
Does anyone have any suggestions for how I could fix this?
UPDATE: I also tried date <- mdy_hms(c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00")) followed by str(date), but this also did not work.
[Screenshot: attempt to use date <-mdy_hms(C("11/1/2020 0:05:00"etc]
[Screenshot: image of csv]
The first of your two options works:
library(lubridate)
date <-mdy_hms(c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00"))
str(date)
# POSIXct[1:3], format: "2020-11-01 00:05:00" "2020-11-01 07:29:00" "2020-11-01 14:04:00"
How were your data "pulled in"?
One option would be to use as.POSIXct:
started_at <- c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00")
as.POSIXct(started_at, format = "%m/%d/%Y %H:%M:%OS")
#> [1] "2020-11-01 00:05:00 CET" "2020-11-01 07:29:00 CET"
#> [3] "2020-11-01 14:04:00 CET"
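The CET in the output above comes from the session's local time zone. A sketch of the same conversion with the time zone fixed explicitly (UTC here is an assumption; use whichever zone the trips were actually recorded in):

```r
started_at <- c("11/1/2020 0:05:00", "11/1/2020 7:29:00", "11/1/2020 14:04:00")

# tz = "UTC" makes the result independent of the machine's local time zone
parsed <- as.POSIXct(started_at, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

class(parsed)                        # "POSIXct" "POSIXt"
format(parsed, "%Y-%m-%d %H:%M:%S")
# "2020-11-01 00:05:00" "2020-11-01 07:29:00" "2020-11-01 14:04:00"
```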
EDIT
library(lubridate)
library(dplyr)
started_at <- c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00")
Both as.POSIXct and lubridate::mdy_hms return an object of class "POSIXct" "POSIXt"
class(as.POSIXct(started_at, format = "%m/%d/%Y %H:%M:%OS"))
#> [1] "POSIXct" "POSIXt"
class(mdy_hms(started_at))
#> [1] "POSIXct" "POSIXt"
Not sure what you expect. When I run your code, everything works fine, except that we end up with 0 observations after filtering for week < 15, because all the dates in the example data are from week 44:
dc_biketrips <- data.frame(
started_at
)
dc_biketrips <- dc_biketrips %>%
mutate(started_at = as.POSIXct(started_at, format = "%m/%d/%Y %H:%M:%OS"),
interval60 = floor_date(started_at, unit = "hour"),
interval15 = floor_date(started_at, unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
dc_biketrips
#> started_at interval60 interval15 week dotw
#> 1 2020-11-01 00:05:00 2020-11-01 00:00:00 2020-11-01 00:00:00 44 So
#> 2 2020-11-01 07:29:00 2020-11-01 07:00:00 2020-11-01 07:15:00 44 So
#> 3 2020-11-01 14:04:00 2020-11-01 14:00:00 2020-11-01 14:00:00 44 So
dc_biketrips %>%
filter(week < 15)
#> [1] started_at interval60 interval15 week dotw
#> <0 rows> (or 0-length row.names)
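The empty result follows from how week() counts. According to the lubridate documentation, week() returns the number of complete seven-day periods between January 1st and the date, plus one. A base-R sketch of that arithmetic for the example date (the variable names are illustrative):

```r
d <- as.Date("2020-11-01")

# POSIXlt stores the day of the year zero-based: 1 January is 0
days_since_jan1 <- as.POSIXlt(d)$yday      # 305
week_number <- days_since_jan1 %/% 7 + 1   # 44

week_number < 15   # FALSE, so filter(week < 15) drops the row
```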

lubridate: inconsistent behavior with timezones

Consider the following example
library(lubridate)
library(tidyverse)
> hour(ymd_hms('2008-01-04 00:00:00'))
[1] 0
Now,
dataframe <- data_frame(time = c(ymd_hms('2008-01-04 00:00:00'),
ymd_hms('2008-01-04 00:01:00'),
ymd_hms('2008-01-04 00:02:00'),
ymd_hms('2008-01-04 00:03:00')),
value = c(1,2,3,4))
mutate(dataframe,hour = strftime(time, format="%H:%M:%S"),
hour2 = hour(time))
# A tibble: 4 × 4
time value hour hour2
<dttm> <dbl> <chr> <int>
1 2008-01-03 19:00:00 1 19:00:00 19
2 2008-01-03 19:01:00 2 19:01:00 19
3 2008-01-03 19:02:00 3 19:02:00 19
4 2008-01-03 19:03:00 4 19:03:00 19
What is going on here? Why are the dates converted into some local time which I don't even know?
This is not an issue with lubridate, but with the way POSIXct values are combined into a vector.
You have
> ymd_hms('2008-01-04 00:01:00')
[1] "2008-01-04 00:01:00 UTC"
But when combining into a vector you get
> c(ymd_hms('2008-01-04 00:01:00'), ymd_hms('2008-01-04 00:01:00'))
[1] "2008-01-03 19:01:00 EST" "2008-01-03 19:01:00 EST"
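The reason is that the tzone attribute gets lost when combining POSIXct values (see c.POSIXct). Note that newer versions of R (4.1.0 and later, as far as I know) keep the attribute when all inputs share the same time zone.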
> attributes(ymd_hms('2008-01-04 00:01:00'))
$tzone
[1] "UTC"
$class
[1] "POSIXct" "POSIXt"
but
> attributes(c(ymd_hms('2008-01-04 00:01:00')))
$class
[1] "POSIXct" "POSIXt"
What you can use instead is
> ymd_hms(c('2008-01-04 00:01:00', '2008-01-04 00:01:00'))
[1] "2008-01-04 00:01:00 UTC" "2008-01-04 00:01:00 UTC"
which will use the default tz = "UTC" for all arguments.
You also need to pass tz = "UTC" into strftime because its default is your current time zone (unlike ymd_hms which defaults to UTC).
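A small base-R sketch of both points, using as.POSIXct in place of ymd_hms so that it runs without lubridate (that substitution is mine):

```r
a <- as.POSIXct("2008-01-04 00:00:00", tz = "UTC")
b <- as.POSIXct("2008-01-04 00:01:00", tz = "UTC")

# Combine first, then set the time zone attribute explicitly so the
# result does not depend on how c() treats it.
times <- c(a, b)
attr(times, "tzone") <- "UTC"

# Pass tz when formatting so it does not fall back to the local time zone.
format(times, "%H:%M:%S", tz = "UTC")   # "00:00:00" "00:01:00"
as.POSIXlt(times, tz = "UTC")$hour      # 0 0
```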

How to deal with one column of two formats and single class?

I have one column with two different formats but the same class 'factor'.
D$date
2009-05-12 11:30:00
2009-05-13 11:30:00
2009-05-14 11:30:00
2009-05-15 11:30:00
42115.652
2876
8765
class(D$date)
factor
What I need is to convert the number to date.
D$date <- as.character(D$date)
D$date=ifelse(!is.na(as.numeric(D$date)),
as.POSIXct(as.numeric(D$date) * (60*60*24), origin="1899-12-30", tz="UTC"),
D$date)
Now the number is converted, but into a strange number, "1429630800", instead of a date.
I tried without ifelse:
as.POSIXct(as.numeric(42115.652) * (60*60*24), origin="1899-12-30", tz="UTC")
[1] "2015-04-21 15:38:52 UTC"
It was converted nicely.
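The problem is that you are mixing up classes in the true/false halves of your ifelse. ifelse() strips the POSIXct class from the converted values, so they come back as the underlying number of seconds since 1970-01-01, mixed in with the character dates. You can fix this by converting the POSIXct values with as.character, like this: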
D$date = ifelse(!is.na(as.numeric(D$date)),
as.character(as.POSIXct(as.numeric(D$date) * (60*60*24), origin="1899-12-30", tz="UTC")),
D$date)
#D
# date
#1 2009-05-12 11:30:00
#2 2009-05-13 11:30:00
#3 2009-05-14 11:30:00
#4 2009-05-15 11:30:00
#5 2015-04-21 15:38:52
#6 1907-11-15 00:00:00
#7 1923-12-30 00:00:00
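If you want real date-times rather than strings, one option is to avoid ifelse altogether and fill a POSIXct vector by index. This is a sketch in base R under the assumption that every purely numeric entry is an Excel-style serial day count; the variable names are mine:

```r
# ifelse() drops the class of its yes/no arguments:
class(ifelse(TRUE, as.POSIXct("2015-04-21", tz = "UTC"), NA))   # "numeric"

x <- c("2009-05-12 11:30:00", "42115.652", "2876")
num <- suppressWarnings(as.numeric(x))   # NA for the real date strings
is_serial <- !is.na(num)

# Start from an all-NA POSIXct vector and fill each group by index.
out <- as.POSIXct(rep(NA_real_, length(x)), origin = "1970-01-01", tz = "UTC")
out[is_serial]  <- as.POSIXct(num[is_serial] * 86400, origin = "1899-12-30", tz = "UTC")
out[!is_serial] <- as.POSIXct(x[!is_serial], tz = "UTC")

format(out, "%Y-%m-%d %H:%M:%S")
# "2009-05-12 11:30:00" "2015-04-21 15:38:52" "1907-11-15 00:00:00"
```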
You can also write a function that converts each value to POSIXct, and then apply it with lapply and combine the results with do.call.
b <- c("2009-05-12 11:30:00", "2009-05-13 11:30:00", "2009-05-14 11:30:00",
"2009-05-15 11:30:00", "42115.652", "2876", "8765")
foo <- function(x){
if(!is.na(as.numeric(x))){
as.POSIXct(as.numeric(x) * (60*60*24), origin="1899-12-30", tz="UTC")
}else{
as.POSIXct(x, origin="1899-12-30", tz="UTC")
}
}
do.call("c", lapply(b, foo))
[1] "2009-05-12 13:30:00 CEST" "2009-05-13 13:30:00 CEST" "2009-05-14 13:30:00 CEST" "2009-05-15 13:30:00 CEST"
[5] "2015-04-21 17:38:52 CEST" "1907-11-15 01:00:00 CET" "1923-12-30 01:00:00 CET"

Subsetting a data frame according to a factor date

I have a data frame (df) in which one of the columns is a date column. However, that column's type is factor:
> head(df$date)
[1] 2011-01-01 2011-01-01 2011-01-01 2011-01-01 2011-01-01 2011-01-01
1519 Levels: 2010-11-27 2010-11-28 2010-11-29 2010-11-30 2010-12-01 2010-12-02 2010-12-03 2010-12-04 ... 2015-02-07
I want to subset this data frame according to date. For example, I want to create a second data frame (df2) containing the rows of df whose dates are earlier than 2014-03-30.
How can I do that using R? I will be very glad for any help. Thanks a lot.
You could begin exploring the lubridate package. It makes working with dates very simple.
df <- data.frame(date = c("2013-01-01", "2014-04-01", "2014-01-01",
"2011-06-01", "2012-03-01", "2014-08-01"))
df
date
1 2013-01-01
2 2014-04-01
3 2014-01-01
4 2011-06-01
5 2012-03-01
6 2014-08-01
library(lubridate)
# ymd - year-month-day
df$date <- ymd(df$date)
with(df, df[date < ymd("2014-03-30"),])
[1] "2013-01-01 UTC" "2014-01-01 UTC" "2011-06-01 UTC" "2012-03-01 UTC"
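The same subset can also be sketched in base R, starting from a factor column as in the question. The drop = FALSE argument is my addition, so that df2 stays a data frame even though it has only one column:

```r
df <- data.frame(date = factor(c("2013-01-01", "2014-04-01", "2014-01-01",
                                 "2011-06-01", "2012-03-01", "2014-08-01")))

# Convert the factor labels to character, then to Date
df$date <- as.Date(as.character(df$date))

# Keep only rows before the cut-off date
df2 <- df[df$date < as.Date("2014-03-30"), , drop = FALSE]
format(df2$date)
# "2013-01-01" "2014-01-01" "2011-06-01" "2012-03-01"
```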

Text process using R

I am quite new to programming and to R.
My data set includes date-time variables as follows:
2007/11/0103
2007/11/0104
2007/11/0105
2007/11/0106
I need an operation that takes the first 10 characters, inserts a space, copies the last two characters, and then appends :00, for every value in the column.
Expected results:
2007/11/01 03:00
2007/11/01 04:00
2007/11/01 05:00
2007/11/01 06:00
If you want to actually turn your data into a "POSIXlt" "POSIXt" class in R (so you can add or subtract days, minutes, etc.), you could do
# Your data
temp <- c("2007/11/0103", "2007/11/0104", "2007/11/0105", "2007/11/0106")
temp2 <- strptime(temp, "%Y/%m/%d%H")
## [1] "2007-11-01 03:00:00 IST" "2007-11-01 04:00:00 IST" "2007-11-01 05:00:00 IST" "2007-11-01 06:00:00 IST"
You could then extract hours for example
temp2$hour
## [1] 3 4 5 6
Add hours
temp2 + 3600
## [1] "2007-11-01 04:00:00 IST" "2007-11-01 05:00:00 IST" "2007-11-01 06:00:00 IST" "2007-11-01 07:00:00 IST"
And so on. If you just want the format you mentioned in your question (which is just a character string), you can also do
format(strptime(temp, "%Y/%m/%d%H"), format = "%Y/%m/%d %H:%M")
#[1] "2007/11/01 03:00" "2007/11/01 04:00" "2007/11/01 05:00" "2007/11/01 06:00"
Try
library(lubridate)
dat <- read.table(text="2007/11/0103
2007/11/0104
2007/11/0105
2007/11/0106",header=F,stringsAsFactors=F)
dat$V1 <- format(ymd_h(dat$V1),"%Y/%m/%d %H:%M")
dat
# V1
# 1 2007/11/01 03:00
# 2 2007/11/01 04:00
# 3 2007/11/01 05:00
# 4 2007/11/01 06:00
Suppose your dates are a vector named dates
library(stringr)
paste0(paste(str_sub(dates, end=10), str_sub(dates, 11)), ":00")
paste and substr are your friends here. Type ? before either name to see the documentation.
my.parser <- function(a){
paste0(substr(a, 1,10),' ',substr(a,11,12),':00') # paste0 is like paste but does not add whitespace
}
a<- '2007/11/0103'
my.parser(a) # = "2007/11/01 03:00"
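The same transformation can also be written as a single regular-expression substitution. This is a base-R sketch that assumes every value is exactly 12 characters long, as in the question; values of any other length are returned unchanged:

```r
dates <- c("2007/11/0103", "2007/11/0104", "2007/11/0105", "2007/11/0106")

# Capture the first 10 characters and the final 2, then rebuild the string
result <- sub("^(.{10})(.{2})$", "\\1 \\2:00", dates)
result
# "2007/11/01 03:00" "2007/11/01 04:00" "2007/11/01 05:00" "2007/11/01 06:00"
```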
