date import, incorrect century - r

I have a bunch of dates that I am parsing that are in the form "%m/%d/%y". as.Date(dates, format = "%m/%d/%y") converts a date like "1/01/64" to "2064-01-01" but I need that to be "1964-01-01." I suppose I can find instances where the year is in the future and then subtract a century, but that seems a little ridiculous.

Dates are stored internally as integer days, so formatting only happens at input or output time. As for input without century information, I think you are out of luck. Here's what ?strptime says about the %y format spec: "On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behaviour specified by the 2004 and 2008 POSIX standards, but they do also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change’."
as.Date( "01/01/64", "%m/%d/%y", origin="1970-01-01") -100*365.25
#[1] "1964-01-01"
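One caveat with the fixed offset, as a small sketch: because a Date is just a count of days since 1970-01-01, subtracting 100*365.25 = 36525 days only lands on the same calendar day when the 100-year span happens to contain exactly 25 leap days. That holds for 2064 -> 1964, but not for every input, since 1900 was not a leap year:

```r
# A Date is a day count since 1970-01-01
unclass(as.Date("1970-01-02"))
# [1] 1

# 36525 days before 2064-01-01 is the same calendar day in 1964 ...
as.Date("01/01/64", "%m/%d/%y") - 100 * 365.25
# [1] "1964-01-01"

# ... but 36525 days before 2000-01-01 is one day early
as.Date("01/01/00", "%m/%d/%y") - 100 * 365.25
# [1] "1899-12-31"
```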
It might be possible to start a bar fight about programmers who allow removal of century information given that Y2K is so recent in the past.
Since the default is to assume year 00-68 is 2000-2068, it is certainly possible to create an as.Dateshift

Another way to fix the dates is to change all years that occur in the future (relative to today's date using Sys.Date()) as starting with 19 instead of 20.
dates <- as.Date(c("01/01/64", "12/31/15"), format = "%m/%d/%y")
# [1] "2064-01-01" "2015-12-31" ## contains an incorrect date
## Now correct the dates that haven't yet occurred
as.Date(ifelse(dates > Sys.Date(), format(dates, "19%y-%m-%d"), format(dates)))
#[1] "1964-01-01" "2015-12-31"
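If this is needed in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: parse_2digit_year and its cutoff argument are names made up for illustration, not part of base R. It shifts by calendar years through POSIXlt rather than by a fixed number of days.

```r
# Hypothetical helper (not part of base R): parse two-digit years, then move
# anything later than `cutoff` back by exactly 100 calendar years.
parse_2digit_year <- function(x, format = "%m/%d/%y", cutoff = Sys.Date()) {
  d <- as.Date(x, format = format)
  lt <- as.POSIXlt(d)
  future <- !is.na(d) & d > cutoff          # unparseable inputs stay NA
  lt$year[future] <- lt$year[future] - 100  # POSIXlt stores years as offsets from 1900
  as.Date(lt)
}

parse_2digit_year(c("01/01/64", "12/31/15"), cutoff = as.Date("2020-01-01"))
# [1] "1964-01-01" "2015-12-31"
```

The cutoff defaults to today, which matches the assumption above that no date in the data lies in the future; pass a fixed date if the results need to be reproducible.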

Related

as.Date produces unexpected result in a sequence of week-based dates

I am working on the transformation of week based dates to month based dates.
When checking my work, I found the following problem in my data which is the result of a simple call to as.Date()
as.Date("2016-50-4", format = "%Y-%U-%u")
as.Date("2016-50-5", format = "%Y-%U-%u")
as.Date("2016-50-6", format = "%Y-%U-%u")
as.Date("2016-50-7", format = "%Y-%U-%u") # this is the problem
The previous code yields the correct dates for the first 3 lines:
"2016-12-15"
"2016-12-16"
"2016-12-17"
The last line of code however, goes back 1 week:
"2016-12-11"
Can anybody explain what is happening here?
Working with week of the year can become very tricky. You may try to convert the dates using the ISOweek package:
# create date strings in the format given by the OP
wd <- c("2016-50-4","2016-50-5","2016-50-6","2016-50-7", "2016-51-1", "2016-52-7")
# convert to "normal" dates
ISOweek::ISOweek2date(stringr::str_replace(wd, "-", "-W"))
The result
#[1] "2016-12-15" "2016-12-16" "2016-12-17" "2016-12-18" "2016-12-19" "2017-01-01"
is of class Date.
Note that the ISO week-based date format is yyyy-Www-d with a capital W preceding the week number. This is required to distinguish it from the standard month-based date format yyyy-mm-dd.
So, in order to convert the date strings provided by the OP using ISOweek2date() it is necessary to insert a W after the first hyphen which is accomplished by replacing the first - by -W in each string.
Also note that ISO weeks start on Monday and the days of the week are numbered 1 to 7. The year which belongs to an ISO week may differ from the calendar year. This can be seen from the sample dates above where the week-based date 2016-W52-7 is converted to 2017-01-01.
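For readers who cannot install a package, the ISO rule can also be sketched in base R. This assumes the usual ISO 8601 definition, in which week 1 is the week (Monday to Sunday) that contains January 4th; iso_week_to_date is a made-up name and this is not how ISOweek is implemented internally.

```r
# Hypothetical base-R helper: ISO year, ISO week and ISO weekday -> Date.
# Assumes ISO 8601: week 1 is the Monday-based week containing January 4th.
iso_week_to_date <- function(year, week, weekday) {
  jan4 <- as.Date(sprintf("%d-01-04", year))
  wd <- as.integer(format(jan4, "%u"))      # 1 = Monday ... 7 = Sunday
  week1_monday <- jan4 - (wd - 1)           # Monday of ISO week 1
  week1_monday + (week - 1) * 7 + (weekday - 1)
}

iso_week_to_date(2016, c(50, 51, 52), c(7, 1, 7))
# [1] "2016-12-18" "2016-12-19" "2017-01-01"
```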
About the ISOweek package
Back in 2011, the %G, %g, %u, and %V format specifications weren't available to strptime() in the Windows version of R. This was annoying as I had to prepare weekly reports including week-on-week comparisons. I spent hours to find a solution for dealing with ISO weeks, ISO weekdays, and ISO years. Finally, I ended up creating the ISOweek package and publishing it on CRAN. Today, the package still has its merits as the aforementioned formats are ignored on input (see ?strptime for details).
As @lmo said in the comments, %u stands for the weekdays as a decimal number (1–7, with Monday as 1) and %U stands for the week of the year as decimal number (00–53) using Sunday as the first day. Thus, as.Date("2016-50-7", format = "%Y-%U-%u") will result in "2016-12-11".
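A quick way to see this is to go the other direction and format the calendar days around that week with the same two specifiers (a small illustration, not part of the original comment):

```r
# Sunday 2016-12-11 through Sunday 2016-12-18
days <- as.Date("2016-12-11") + 0:7
data.frame(date = days, U = format(days, "%U"), u = format(days, "%u"))
```

Sunday 2016-12-11 comes out as week 50, day 7, while the Monday to Saturday that follow are week 50, days 1 to 6. The next Sunday, 2016-12-18, is already week 51, day 7.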
However, if that should give "2016-12-18", then you should use a week format that also has Monday as the starting day. According to the documentation of ?strptime you would expect that the format "%Y-%V-%u" thus gives the correct output, where %V stands for the week of the year as decimal number (01–53) with Monday as the first day.
Unfortunately, it doesn't:
> as.Date("2016-50-7", format = "%Y-%V-%u")
[1] "2016-01-18"
However, at the end of the explanation of %V it says "Accepted but ignored on input" meaning that it won't work.
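The restriction only applies to input. On output, %G (ISO year) and %V (ISO week) are handled by format(), at least on platforms whose C library supports them, which makes for a handy sanity check of what the week-based string for a given day should look like:

```r
# %G / %V / %u on output: ISO year, ISO week, ISO weekday
format(as.Date("2016-12-18"), "%G-W%V-%u")
# [1] "2016-W50-7"
```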
You can circumvent this behavior as follows to get the correct dates:
# create a vector of dates
d <- c("2016-50-4","2016-50-5","2016-50-6","2016-50-7", "2016-51-1")
# convert to the correct dates
as.Date(paste0(substr(d,1,8), as.integer(substring(d,9))-1), "%Y-%U-%w") + 1
which gives:
[1] "2016-12-15" "2016-12-16" "2016-12-17" "2016-12-18" "2016-12-19"
The issue is because for %u, 1 is Monday and 7 is Sunday of the week. The problem is further complicated by the fact that %U assumes week begins on Sunday.
For the given input and expected behavior of format = "%Y-%U-%u", the output of line 4 is consistent with the output of previous 3 lines.
That is, if you want to use format = "%Y-%U-%u", you should pre-process your input. In this case, the fourth line would have to be as.Date("2016-51-7", format = "%Y-%U-%u") as revealed by
format(as.Date("2016-12-18"), "%Y-%U-%u")
# "2016-51-7"
Instead, you are currently passing "2016-50-7".
A better way of doing it might be to use the approach suggested in Uwe Block's answer. Since you are happy with "2016-50-4" being transformed to "2016-12-15", I suspect that in your raw data, Monday is counted as 1 too. You could also create a custom function that changes the value of %U to count the week number as if the week begins on Monday, so that the output is as you expected.
#Function to change value of %U so that the week begins on Monday
pre_process = function(x, delim = "-"){
    y = unlist(strsplit(x, delim))
    # If the last day of the year is 7 (Sunday for %u),
    # add 1 to the week to make it the week 00 of the next year
    # I think there might be a better solution for this
    if (y[2] == "53" && y[3] == "7"){
        x = paste(as.integer(y[1]) + 1, "00", y[3], sep = delim)
    } else if (y[3] == "7"){
        # If the day is 7 (Sunday for %u), add 1 to the week
        x = paste(y[1], as.integer(y[2]) + 1, y[3], sep = delim)
    }
    return(x)
}
And usage would be
as.Date(pre_process("2016-50-7"), format = "%Y-%U-%u")
# [1] "2016-12-18"
I'm not quite sure how to handle when the year ends on a Sunday.

How can I change character class date variables to POSIXlt class when there are multiple date formats?

I'm struggling with converting character class dates of many different format types (e.g., yyyy/mm/dd; mm/dd/yyyy; yyyy-mm-dd; mm-dd-yyyy; yy-mm-dd; mm-dd-yy; etc.) to POSIXlt class. Ideally, I would like to convert all birth_dates to POSIXlt class with yyyy/mm/dd format (see sample data below). Is there any simple way to do this in R?
id birth_date start_date age
102 08/09/1993 2013/09/01 20
103 1995-02-21 2013/09/01 18
104 01-15-94 2013/09/01 19
105 88-12-30 2013/09/01 24
Here is what I have been doing thus far. Unfortunately, this doesn't seem to work (I wind up with more NAs than there should be) given all of the different ways in which the original date is formatted:
library(lubridate)
data$birth_date1<-as.Date(data$birth_date,format="%Y-%m-%d") #Convert character class to date class
data$birth_date2<-ymd(swc3$birth_date1) #Convert date class to POSIXlt class using lubridate pkg
That's horrible. Could be worse though. At least there are delimiters in there, like "-" and "/".
Short Answer
Yes, there's an easy way to parse that in R. Apply parse_date_time() separately to each birth date, giving it a decent orders list to choose from, and carefully set the order of the guesses. You'll need to convert the "integer-time" to a useful time when you're done.
See the Long Answer for details.
Long Answer
This is why the lubridate package has parse_date_time(). But there are problems. Let's see:
require(lubridate)
# WRONG! doesn't work as intended.
as.Date(
parse_date_time(data$birth_date,
orders=c("ymd", "mdy", "mdY", "Ymd")
)
)
[1] "1993-08-09" "1995-02-21" "1994-01-15" "0088-12-30"
That looks great, except for the last one. What's going on?
parse_date_time() is selecting a "best fit" set of orders and formats to use when parsing the dates, and the last element is the odd one out.
To make this work as intended, you'll need to apply parse_date_time() one-by-one to each date, because each date format was apparently selected more-or-less at random. This will be slower, but it will give more useful answers.
# RIGHT. Some conversion of results required.
parsed <- sapply(data[,"birth_date"],
parse_date_time,
orders=c("ymd", "mdy", "mdY", "Ymd") )
parsed
08/09/1993 1995-02-21 01-15-94 88-12-30
744854400 793324800 758592000 599443200
Ok, those look like Unix-time integers, which are the unclass()'d version of what parse_date_time() produces. And none are negative, so they must all have happened after 1970. This is encouraging. Convert:
# Conversion of results
parsed <- as.POSIXct(parsed, origin="1970-01-01", tz = "GMT")
as.Date(parsed)
08/09/1993 1995-02-21 01-15-94 88-12-30
"1993-08-09" "1995-02-21" "1994-01-15" "1988-12-30"
lubridate and parse_date_time() are very good at what they do.
Since you asked for POSIXlt, not Date types:
as.POSIXlt(parsed)
08/09/1993 1995-02-21
"1993-08-09 10:00:00 AEST" "1995-02-21 11:00:00 AEDT"
01-15-94 88-12-30
"1994-01-15 11:00:00 AEDT" "1988-12-30 11:00:00 AEDT"
Though I personally prefer only having dates when the actual time isn't important; these are assumed to be all happening at midnight UTC, and are converted to my time zone (Eastern Australia).
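For comparison, here is a base-R-only sketch of the same "try each format in turn" idea. It is not what lubridate does internally; parse_mixed_dates and its rule table are invented for illustration. It assumes the only delimiters are "/" and "-", that every field is zero-padded, and that for all-two-digit dates mm-dd-yy should be tried before yy-mm-dd. The patterns are there because strptime() would otherwise happily read "88" as the four-digit year 0088.

```r
# Hypothetical helper: pick a format per element, guarded by a shape pattern.
parse_mixed_dates <- function(x) {
  x <- gsub("/", "-", x, fixed = TRUE)      # normalise the delimiter
  rules <- data.frame(
    pattern = c("^[0-9]{4}-[0-9]{2}-[0-9]{2}$",
                "^[0-9]{2}-[0-9]{2}-[0-9]{4}$",
                "^[0-9]{2}-[0-9]{2}-[0-9]{2}$",
                "^[0-9]{2}-[0-9]{2}-[0-9]{2}$"),
    format  = c("%Y-%m-%d", "%m-%d-%Y", "%m-%d-%y", "%y-%m-%d"),
    stringsAsFactors = FALSE
  )
  out <- rep(as.Date(NA), length(x))
  for (i in seq_len(nrow(rules))) {
    # only touch elements that are still unparsed and have the right shape
    todo <- is.na(out) & grepl(rules$pattern[i], x)
    out[todo] <- as.Date(x[todo], format = rules$format[i])
  }
  out
}

birth <- parse_mixed_dates(c("08/09/1993", "1995-02-21", "01-15-94", "88-12-30"))
birth
# [1] "1993-08-09" "1995-02-21" "1994-01-15" "1988-12-30"
as.POSIXlt(birth)   # if POSIXlt is really what is needed
```

Strings like "01-02-03" remain ambiguous whatever the tool; here the order of the rules decides.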

Converting dates from imported CSV file

I'm importing time series data from a CSV file, and one of the columns contains dates in the format DD/MM/YYYY. The column's class is character, or factor if I set stringsAsFactors = TRUE. I convert the imported file to a data frame and then run the following:
df$Date <- as.Date(df$Date , "%d/%m/%y")
I get no error message, but the dates are all messed up in the format YYYYMMDD and all the YYYY are the year 2020...
Before:
10/09/2009
11/09/2009
14/09/2009
After:
2020-09-10
2020-09-11
2020-09-14
You are using %y when it should be %Y. See the documentation in ?strptime.
%y
Year without century (00–99). On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behaviour specified by the 2004 and 2008 POSIX standards, but they do also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change’.
%Y
Year with century. Note that whereas there was no zero in the original Gregorian calendar, ISO 8601:2004 defines it to be valid (interpreted as 1BC): see http://en.wikipedia.org/wiki/0_(year). Note that the standards also say that years before 1582 in its calendar should only be used with agreement of the parties involved.
Try running the code again so that the data frame is not modified by any previous attempt but this time use
df$Date <- as.Date(df$Date , "%d/%m/%Y")
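To see where the 2020 comes from, compare the two formats on one of the values from the question. With %y, only the first two digits of the year ("20") are read and, per the rule quoted above, expanded to 2020; the trailing "09" is silently ignored.

```r
as.Date("10/09/2009", "%d/%m/%y")   # reads "20" as the year, ignores "09"
# [1] "2020-09-10"
as.Date("10/09/2009", "%d/%m/%Y")   # reads all four digits
# [1] "2009-09-10"
```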
@Heroka is right.
If you ever need it, you could also use POSIXct objects (which store the time of day as well, down to seconds)
Try this:
df$Date.time <- as.POSIXct(df$Date , format="%d/%m/%Y")
If you want the date and time in strings you can try the following:
df$Date.time <- format(as.POSIXct(df$Date , format="%d/%m/%Y"),format="%Y-%m-%d %H:%M")
or
df$Date <- format(as.POSIXct(df$Date , format="%d/%m/%Y"),format="%Y-%m-%d")

R as.Date conversion century error

In my dataset, a column contains the dates of birth of many employees, so many of them lie in the range 1960 to 1980. I am trying to format them using as.Date, and for some of them the results are not what I expect.
Example:
as.Date("7/1/61","%m/%d/%y")
I want it to return "1961-07-01" but it returns "2061-07-01".
Read:
?strptime # where all the formatting details are available
%y
Year without century (00–99). On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behavior specified by the 2004 and 2008 POSIX standards, but they do also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change’.
So you need a regex to backdate the years, and it's probably better to do this as a string conversion before passing the result to as.Date:
dvec <- c("7/1/61", "7/1/79")
as.Date( sub("/(..$)", "/19\\1",dvec) , "%m/%d/%Y")
[1] "1961-07-01" "1979-07-01"
If this goes into production it will become an error waiting to happen when the age of your employees starts to creep above the last two digits of the current year.
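One way to make that failure mode at least explicit is to do the string conversion with a pivot year instead of a hard-coded "19". The following is only a sketch: expand_year and the pivot value of 30 are assumptions chosen for illustration, and the right pivot depends on the data.

```r
# Hypothetical helper: two-digit years above `pivot` become 19xx, the rest 20xx.
expand_year <- function(x, pivot = 30) {
  yy <- as.integer(sub(".*/", "", x))             # text after the last "/"
  century <- ifelse(yy > pivot, "19", "20")
  full <- paste0(sub("[0-9]{2}$", "", x), century, sprintf("%02d", yy))
  as.Date(full, "%m/%d/%Y")
}

expand_year(c("7/1/61", "7/1/79", "3/5/04"))
# [1] "1961-07-01" "1979-07-01" "2004-03-05"
```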

How to convert specific time format to timestamp in R? [duplicate]

I am working on "Localization Data for Person Activity Data Set" dataset from UCI and in this data set there is a column of date and time(both in one column) with following format:
27.05.2009 14:03:25:777
27.05.2009 14:03:25:183
27.05.2009 14:03:25:210
27.05.2009 14:03:25:237
...
I am wondering if there is any way to convert this column to timestamp using R.
First of all, we need to replace the colon separating the milliseconds from the seconds with a dot, otherwise the final step won't work (thanks to Dirk Eddelbuettel for this one). To be quicker, I'll just go ahead and replace all the colons with dots, which means the format string used for parsing has to use dots as separators as well:
x <- "27.05.2009 14:03:25:777" # this is a simplified version of your data
y <- gsub(":", ".", x) # this is your vector with the aforementioned substitution
By the way, this is how your vector should look after gsub:
> y
[1] "27.05.2009 14.03.25.777"
Now, in order to have it show the milliseconds, you first need to adjust an R option and then use a function called strptime, which will convert your date vector to POSIXlt (an R-friendly format). Just do the following:
> options(digits.secs = 3) # this tells R you want it to consider 3 digits for seconds.
> strptime(y, "%d.%m.%Y %H.%M.%OS") # the separators in the format must match those in y (dots after gsub)
[1] "2009-05-27 14:03:25.777"
I've learned this nice trick here. This other answer also says you can skip the options setting and use, for example, strptime(y, "%d.%m.%Y %H:%M:%OS3"), but it doesn't work for me. Henrik noted that the function's help page, ?strptime states that the %OS3 bit is OS-dependent. I'm using an updated Ubuntu 13.04 and using %OS3 yields NA.
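A slightly narrower variant, sketched here as an alternative, replaces only the final colon (the one in front of the three millisecond digits), so the time part keeps its usual colons. The tz = "UTC" argument is an assumption added to keep the example independent of the local time zone.

```r
x <- c("27.05.2009 14:03:25:777", "27.05.2009 14:03:25:183")

# swap only the last colon (before the 3-digit milliseconds) for a dot
y <- sub(":([0-9]{3})$", ".\\1", x)
y
# [1] "27.05.2009 14:03:25.777" "27.05.2009 14:03:25.183"

# %OS reads seconds including the fractional part on input
p <- strptime(y, "%d.%m.%Y %H:%M:%OS", tz = "UTC")
p$sec                               # fractional seconds live in the "sec" field
format(p, "%Y-%m-%d %H:%M:%OS3")    # %OSn with a digit is meant for output
```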
When using strptime (or other date-time functions such as as.Date), keep in mind some of the most common conversions used (edited for brevity, as suggested by DWin. Complete list at ?strptime):
%a Abbreviated weekday name in the current locale.
%A Full weekday name in the current locale.
%b Abbreviated month name in the current locale.
%B Full month name in the current locale.
%d Day of the month as decimal number (01–31).
%H Hours as decimal number (00–23). Times such as 24:00:00 are accepted for input.
%I Hours as decimal number (01–12).
%j Day of year as decimal number (001–366).
%m Month as decimal number (01–12).
%M Minute as decimal number (00–59).
%p AM/PM indicator in the locale. Used in conjunction with %I and not with %H.
%S Second as decimal number (00–61), allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).
%U Week of the year as decimal number (00–53) using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.
%w Weekday as decimal number (0–6, Sunday is 0).
%W Week of the year as decimal number (00–53) using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.
%y Year without century (00–99). On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19
%Y Year with century. Note that whereas there was no zero in the original Gregorian calendar, ISO 8601:2004 defines it to be valid (interpreted as 1BC)
