Why are my CSV file dates not parsing via mdy (lubridate)?

I'm hoping someone can shed some light on why lubridate isn't parsing my dates correctly. I'm reading in a fairly large csv file to a data frame, so my issue isn't necessarily reproducible, but I'll show my steps:
require("lubridate")
pricingAddy = "C/DailyData.csv"
pricingData = as.data.frame(read.csv(pricingAddy, header = TRUE, stringsAsFactors = FALSE))
sampleHead = head(pricingData)
sampleHead
Symbol TradeDate PX_OPEN PX_HIGH PX_LOW PX_LAST PX_VOLUME MOV_AVG_20D MOV_AVG_50D MOV_AVG_100D MOV_AVG_200D
1 A 1/2/2014 57.10 57.100 56.15 56.21 1916160 56.0765 53.7096 51.5385 47.7321
2 A 1/3/2014 56.39 57.345 56.26 56.92 1866651 56.2435 53.8276 51.6432 47.8032
3 A 1/6/2014 57.40 57.700 56.56 56.64 1777472 56.4005 53.9474 51.7404 47.8781
4 A 1/7/2014 56.95 57.630 56.93 57.45 1463208 56.5315 54.0740 51.8498 47.9591
5 A 1/8/2014 57.33 58.540 57.17 58.39 2659468 56.6980 54.2044 51.9641 48.0454
6 A 1/9/2014 58.40 58.680 57.87 58.41 1757647 56.8515 54.3428 52.0803 48.1284
mdy(sampleHead["TradeDate"])
[1] NA
Warning message:
All formats failed to parse. No formats found.
dts = c("1/2/2014", "1/3/2014", "1/6/2014", "1/7/2014", "1/8/2014", "1/9/2014")
sampleHead["TradeDate"] == dts
TradeDate
1 TRUE
2 TRUE
3 TRUE
4 TRUE
5 TRUE
6 TRUE
mdy(dts)
[1] "2014-01-02 UTC" "2014-01-03 UTC" "2014-01-06 UTC" "2014-01-07 UTC" "2014-01-08 UTC" "2014-01-09 UTC"
Any takers? I haven't seen this before. Thanks in advance...
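A likely cause, sketched on the assumption that TradeDate was read in as a character column: sampleHead["TradeDate"] returns a one-column data frame rather than a character vector, and mdy() cannot parse that, whereas $ or [[ ]] extracts the column itself. The data frame below is a small hypothetical stand-in for the CSV.

```r
library(lubridate)

# Hypothetical stand-in for the first rows of the CSV
sampleHead <- data.frame(
  Symbol = "A",
  TradeDate = c("1/2/2014", "1/3/2014", "1/6/2014"),
  stringsAsFactors = FALSE
)

class(sampleHead["TradeDate"])    # "data.frame": single brackets keep the data frame
class(sampleHead[["TradeDate"]])  # "character": double brackets extract the column

# Pass the character vector, not the one-column data frame
parsed <- mdy(sampleHead$TradeDate)
format(parsed, "%Y-%m-%d")
```

This would also explain why the comparison with dts returns TRUE for every row while mdy() still fails: the values are identical, only the container differs.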

Related

Why does dplyr convert POSIXct objects

I have a date-time object of class POSIXct. I need to adjust the values by adding several hours. I understand that I can do this using basic addition. For example, I can add 5 hours to a POSIXct object like so:
x <- as.POSIXct("2009-08-02 18:00:00", format="%Y-%m-%d %H:%M:%S")
x
[1] "2009-08-02 18:00:00 PDT"
x + (5*60*60)
[1] "2009-08-02 23:00:00 PDT"
Now I have a data frame in which some times are ok and some are bad.
> df
set_time duration up_time
1 2009-05-31 14:10:00 3 2009-05-31 11:10:00
2 2009-08-02 18:00:00 4 2009-08-02 23:00:00
3 2009-08-03 01:20:00 5 2009-08-03 06:20:00
4 2009-08-03 06:30:00 2 2009-08-03 11:30:00
Note that the first data frame entry has an 'up_time' less than the 'set_time'. So in this context a 'good' time is one where the set_time < up_time. And a 'bad' time is one in which set_time > up_time. I want to leave the good entries alone and fix the bad entries. The bad entries should be fixed by creating an 'up_time' that is equal to the 'set_time' + duration. I do this with the following dplyr pipe:
df1 <- tbl_df(df) %>% mutate(up_time = ifelse(set_time > up_time, set_time +
(duration*60*60), up_time))
df1
# A tibble: 4 x 3
set_time duration up_time
<dttm> <dbl> <dbl>
1 2009-05-31 14:10:00 3. 1243815000.
2 2009-08-02 18:00:00 4. 1249279200.
3 2009-08-03 01:20:00 5. 1249305600.
4 2009-08-03 06:30:00 2. 1249324200.
Up time has been coerced to numeric:
> str(df1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 3 variables:
$ set_time: POSIXct, format: "2009-05-31 14:10:00" "2009-08-02 18:00:00" "2009-08-03 01:20:00" "2009-08-03 06:30:00"
$ duration: num 3 4 5 2
$ up_time : num 1.24e+09 1.25e+09 1.25e+09 1.25e+09
I can convert it back to the desired POSIXct format using:
> as.POSIXct(df1$up_time,origin="1970-01-01")
[1] "2009-05-31 17:10:00 PDT" "2009-08-02 23:00:00 PDT" "2009-08-03 06:20:00 PDT" "2009-08-03 11:30:00 PDT"
But I feel like this last step shouldn't be necessary. How can I keep dplyr from changing the class of my variable?
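The coercion comes from ifelse() rather than from dplyr: base ifelse() drops attributes such as the POSIXct class. A minimal sketch with two hypothetical rows (in UTC for simplicity) shows the effect and one way around it using logical indexing. dplyr::if_else(), which preserves the type, is another option.

```r
set_time <- as.POSIXct(c("2009-05-31 14:10:00", "2009-08-02 18:00:00"), tz = "UTC")
up_time  <- as.POSIXct(c("2009-05-31 11:10:00", "2009-08-02 23:00:00"), tz = "UTC")
duration <- c(3, 4)  # hours

# ifelse() returns a bare numeric vector (seconds since 1970-01-01)
coerced <- ifelse(set_time > up_time, set_time + duration * 3600, up_time)
class(coerced)  # "numeric"

# Logical indexing replaces only the bad entries and keeps the class
bad <- set_time > up_time
fixed <- up_time
fixed[bad] <- set_time[bad] + duration[bad] * 3600
class(fixed)  # "POSIXct" "POSIXt"
```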

lubridate: inconsistent behavior with timezones

Consider the following example
library(lubridate)
library(tidyverse)
> hour(ymd_hms('2008-01-04 00:00:00'))
[1] 0
Now,
dataframe <- data_frame(time = c(ymd_hms('2008-01-04 00:00:00'),
ymd_hms('2008-01-04 00:01:00'),
ymd_hms('2008-01-04 00:02:00'),
ymd_hms('2008-01-04 00:03:00')),
value = c(1,2,3,4))
mutate(dataframe,hour = strftime(time, format="%H:%M:%S"),
hour2 = hour(time))
# A tibble: 4 × 4
time value hour hour2
<dttm> <dbl> <chr> <int>
1 2008-01-03 19:00:00 1 19:00:00 19
2 2008-01-03 19:01:00 2 19:01:00 19
3 2008-01-03 19:02:00 3 19:02:00 19
4 2008-01-03 19:03:00 4 19:03:00 19
What is going on here? Why are the dates converted into some local time that I don't even know?
This is not an issue with lubridate, but with the way POSIXct values are combined into a vector.
You have
> ymd_hms('2008-01-04 00:01:00')
[1] "2008-01-04 00:01:00 UTC"
But when combining into a vector you get
> c(ymd_hms('2008-01-04 00:01:00'), ymd_hms('2008-01-04 00:01:00'))
[1] "2008-01-03 19:01:00 EST" "2008-01-03 19:01:00 EST"
The reason is that, in older versions of R, the tzone attribute gets lost when combining POSIXct values (see c.POSIXct); newer releases keep it when all inputs share the same time zone.
> attributes(ymd_hms('2008-01-04 00:01:00'))
$tzone
[1] "UTC"
$class
[1] "POSIXct" "POSIXt"
but
> attributes(c(ymd_hms('2008-01-04 00:01:00')))
$class
[1] "POSIXct" "POSIXt"
What you can use instead is
> ymd_hms(c('2008-01-04 00:01:00', '2008-01-04 00:01:00'))
[1] "2008-01-04 00:01:00 UTC" "2008-01-04 00:01:00 UTC"
which will use the default tz = "UTC" for all arguments.
You also need to pass tz = "UTC" into strftime because its default is your current time zone (unlike ymd_hms which defaults to UTC).
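Putting the two suggestions together, a short sketch: parse all strings in a single ymd_hms() call so the result carries a UTC tzone, then pass tz = "UTC" when formatting. The variable name stamps is illustrative only.

```r
library(lubridate)

stamps <- ymd_hms(c("2008-01-04 00:00:00", "2008-01-04 00:01:00"))

attr(stamps, "tzone")  # "UTC"
hour(stamps)           # 0 0

# strftime() would otherwise format in the session's local time zone
strftime(stamps, format = "%H:%M:%S", tz = "UTC")  # "00:00:00" "00:01:00"
```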

Lubridate mdy function

I'm trying to convert the following dates, but the first one is parsed incorrectly: "4/2/10" becomes "0010-04-02".
Is there a way to correct this?
thanks,
Vivek
data <- data.frame(initialDiagnose = c("4/2/10","14.01.2009", "9/22/2005",
"4/21/2010", "28.01.2010", "09.01.2009", "3/28/2005",
"04.01.2005", "04.01.2005", "Created on 9/17/2010", "03 01 2010"))
mdy <- mdy(data$initialDiagnose)
dmy <- dmy(data$initialDiagnose)
mdy[is.na(mdy)] <- dmy[is.na(mdy)] # some dates are ambiguous, here we give
data$initialDiagnose <- mdy # mdy precedence over dmy
data
initialDiagnose
1 0010-04-02
2 2009-01-14
3 2005-09-22
4 2010-04-21
5 2010-01-28
6 2009-09-01
7 2005-03-28
8 2005-04-01
9 2005-04-01
10 2010-09-17
11 2010-03-01
I think this is occurring because the mdy() function prefers to match the year with %Y (the actual year) over %y (2 digit abbreviation for the year, defaulting to 19XX or 20XX).
There is a workaround, though. I took a look at the help files for lubridate::parse_date_time (?parse_date_time), and near the bottom of the help file, there is an example for adding an argument that prefers matching with the %y format over the %Y format for the year. The relevant bit of code from the help file:
## ** how to use `select_formats` argument **
## By default %Y has precedence:
parse_date_time(c("27-09-13", "27-09-2013"), "dmy")
## [1] "13-09-27 UTC" "2013-09-27 UTC"
## to give priority to %y format, define your own select_format function:
my_select <- function(trained){
  n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
  names(trained[ which.max(n_fmts) ])
}
parse_date_time(c("27-09-13", "27-09-2013"), "dmy", select_formats = my_select)
## [1] "2013-09-27 UTC" "2013-09-27 UTC"
So, for your example, you can adapt this code and replace the mdy <- mdy(data$initialDiagnose) line with this:
# Define a select function that prefers %y over %Y. This is copied
# directly from the help files
my_select <- function(trained){
  n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
  names(trained[ which.max(n_fmts) ])
}
# Parse as mdy dates
mdy <- parse_date_time(data$initialDiagnose, "mdy", select_formats = my_select)
# [1] "2010-04-02 UTC" NA "2005-09-22 UTC" "2010-04-21 UTC" NA
# [6] "2009-09-01 UTC" "2005-03-28 UTC" "2005-04-01 UTC" "2005-04-01 UTC" "2010-09-17 UTC"
#[11] "2010-03-01 UTC"
And running the remaining lines of code from your question, it gives me this data frame as the result:
initialDiagnose
1 2010-04-02
2 2009-01-14
3 2005-09-22
4 2010-04-21
5 2010-01-28
6 2009-09-01
7 2005-03-28
8 2005-04-01
9 2005-04-01
10 2010-09-17
11 2010-03-01

Subsetting a data frame according to a factor date

I have a data frame (df) where one of the columns is a date column. However, that column's type is factor:
> head(df$date)
[1] 2011-01-01 2011-01-01 2011-01-01 2011-01-01 2011-01-01 2011-01-01
1519 Levels: 2010-11-27 2010-11-28 2010-11-29 2010-11-30 2010-12-01 2010-12-02 2010-12-03 2010-12-04 ... 2015-02-07
I want to subset this data frame by date. For example, I want to create a second data frame (df2) containing only the rows of df with dates earlier than 2014-03-30.
How can I do that using R? I will be very glad for any help. Thanks a lot.
You could begin exploring the lubridate library. It makes working with dates very simple.
df <- data.frame(date = c("2013-01-01", "2014-04-01", "2014-01-01",
"2011-06-01", "2012-03-01", "2014-08-01"))
df
date
1 2013-01-01
2 2014-04-01
3 2014-01-01
4 2011-06-01
5 2012-03-01
6 2014-08-01
library(lubridate)
# ymd - year-month-day
df$date <- ymd(df$date)
with(df, df[date < ymd("2014-03-30"),])
[1] "2013-01-01 UTC" "2014-01-01 UTC" "2011-06-01 UTC" "2012-03-01 UTC"
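The same subset can be sketched in base R without lubridate, assuming the factor levels are ISO-formatted (YYYY-MM-DD) strings. The value column is hypothetical and is only there to show that whole rows are kept.

```r
df <- data.frame(
  date  = factor(c("2013-01-01", "2014-04-01", "2014-01-01", "2011-06-01")),
  value = 1:4
)

# Convert via character so the labels, not the integer codes, are parsed
df$date <- as.Date(as.character(df$date))

# Keep rows dated before 2014-03-30
df2 <- df[df$date < as.Date("2014-03-30"), ]
df2
```

With a data frame that has only the date column, adding drop = FALSE to the subset keeps the result a data frame instead of a bare vector.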

Parsing dates in multiple formats in R using lubridate

I have data with dates in MM/DD/YY HH:MM format and others in plain old MM/DD/YY format. I want to parse all of them into the same format as "2010-12-01 12:12 EST." How should I go about doing that? I tried the following ifelse statement and it gave me a bunch of long integers and told me a large number of my data points failed to parse:
df_prime$date <- ifelse(!is.na(mdy_hm(df$date)), mdy_hm(df$date), mdy(df$date))
df_prime is a copy of the data frame df that I initially loaded:
IEN date admission_number KEY_PTF_45 admission_from discharge_to
1 12 3/3/07 18:05 1 252186 OTHER DIRECT
2 12 3/9/07 12:10 1 252186 RETURN TO COMMUNITY- INDEPENDENT
3 12 3/10/07 15:08 2 252382 OUTPATIENT TREATMENT
4 12 3/14/07 10:26 2 252382 RETURN TO COMMUNITY-INDEPENDENT
5 12 4/24/07 19:45 3 254343 OTHER DIRECT
6 12 4/28/07 11:45 3 254343 RETURN TO COMMUNITY-INDEPENDENT
...
1046334 23613488506 2/25/14 NA NA
1046335 23613488506 2/25/14 11:27 NA NA
1046336 23613488506 2/28/14 NA NA
1046337 23613488506 3/4/14 NA NA
1046338 23613488506 3/10/14 11:30 NA NA
1046339 23613488506 3/10/14 12:32 NA NA
Sorry if some of the formatting isn't right, but the date column is the most important one.
EDIT: Below is some code for a portion of my data frame via a dput command:
structure(list(IEN = c(23613488506, 23613488506, 23613488506, 23613488506, 23613488506, 23613488506), date = c("2/25/14", "2/25/14 11:27", "2/28/14", "3/4/14", "3/10/14 11:30", "3/10/14 12:32")), .Names = c("IEN", "date"), row.names = 1046334:1046339, class = "data.frame")
Have you tried the function guess_formats() in the lubridate package?
A reproducible example to build a dataframe like yours could be helpful!
The lubridate package's mdy_hm has a truncated parameter that allows trailing components to be missing from the input. With truncated = 2, entries without hours and minutes still parse. For your example:
> mdy_hm(df$date, truncated = 2)
[1] "2014-02-25 00:00:00 UTC" "2014-02-25 11:27:00 UTC"
[3] "2014-02-28 00:00:00 UTC" "2014-03-04 00:00:00 UTC"
[5] "2014-03-10 11:30:00 UTC" "2014-03-10 12:32:00 UTC"
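Another option, sketched on the assumption that only these two layouts occur in the column: parse_date_time() accepts several candidate orders in one call. This also avoids ifelse(), which strips the date-time class and is the source of the long integers in the question. The vector x copies the date values from the dput sample above.

```r
library(lubridate)

x <- c("2/25/14", "2/25/14 11:27", "2/28/14", "3/4/14",
       "3/10/14 11:30", "3/10/14 12:32")

# Candidate orders: month-day-year with hours and minutes, or the date alone
parsed <- parse_date_time(x, orders = c("mdy HM", "mdy"))
parsed
```

The result is in UTC by default; the tz argument of parse_date_time() can be set if the timestamps belong to another zone.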
