I have started using data.table. It is indeed very fast and has quite a nice syntax, but I am having trouble with dates. I like to use lubridate: many of my data sets contain dates or date-times, and I have used lubridate to manipulate them. Lubridate stores an instant as a POSIX class. I have seen answers here that create new variables just to extract, for instance, the year (e.g. 2005). I do not like that. Sometimes I will be analyzing by year, other times by quarter, by month, or by duration. I would like to do something simple such as this:
mydatatable[,length(medical.record.number),by=year(date.of.service)]
That should give me the number of patient encounters in a given year, but the by argument is not working:
Error in names(byval) = as.character(bysuborig) :
'names' attribute [2] must be the same length as the vector [1]
Can you please point me to vignettes where data.table is used with dates, and where those dates are manipulated and categorized on the fly?
This uses one of the examples on the help(IDateTime) page. It shows that you can change the syntax of the by= argument to a character value of the form "name = expression", or (after Matthew Dowle's comment below) you can try the functional form that you were using, although I have not been able to get that to work myself. I did get the preferred form, by=list(wday=wday(idate)), to work. Note that the key creation assumes an IDateTime class, since there is no idate or itime variable otherwise; those are attributes of the class.
datetime <- seq(as.POSIXct("2001-01-01"), as.POSIXct("2001-01-03"), by = "5 hour")
(af <- data.table(IDateTime(datetime), a = rep(1:2, 5), key = "a,idate,itime"))
af[, length(a), by = "wday = wday(idate)"]
wday V1
[1,] 2 4
[2,] 3 5
[3,] 4 1
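For the yearly count in the question, here is a minimal sketch, assuming a reasonably recent data.table (which exports its own year() and quarter() helpers). The table and its values are made up for illustration; only the column names come from the question.

```r
library(data.table)

# Hypothetical encounter data; column names follow the question.
visits <- data.table(
  medical.record.number = c(101L, 102L, 103L, 104L, 105L),
  date.of.service = as.Date(c("2005-01-15", "2005-06-30", "2006-02-01",
                              "2006-11-20", "2006-12-31"))
)

# Group on the fly by a derived value; no permanent year column is created.
by_year <- visits[, .(encounters = .N),
                  by = .(year = year(date.of.service))]

# The same pattern works for other granularities, e.g. year and quarter.
by_quarter <- visits[, .(encounters = .N),
                     by = .(year = year(date.of.service),
                            quarter = quarter(date.of.service))]

print(by_year)
print(by_quarter)
```

Naming the grouping expression inside by = .(...) keeps the derived value out of the stored table, which seems to be what the question is after.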
I could not find an answer to my following question via the search function:
Why does the given ifelse() condition not work the way I intend?
I got a dataset that wrongly had an open-text field for a date, so people filled in the date in a variety of ways. I am now close to something usable, but my intended next step, turning every mm/yy entry from before the year 2000 into a mm/19yy entry via ifelse(), does not give the correct result:
Dates <- c("10/19", "04/2019", "O5/1992", "03/92")
ifelse(str_length(Dates)==5 & str_sub(Dates,4,5)>20, stri_sub(Dates, 4, 3) <- 19, Dates)
The result looks like this:
[1] "10/1919" "04/192019" "O5/191992" "19"
While I would want it to look like this:
[1] "10/19" "04/2019" "O5/1992" "03/1992"
Any help is highly appreciated!
This does not give the expected output you have shown, but I think it is better to turn the dates into standard dates so that they are easier to use.
Dates <- c("10/19", "04/2019", "O5/1992", "03/92")
new_Date <- as.Date(lubridate::parse_date_time(paste0('1/', Dates), c('dmY', 'dmy')))
new_Date
#[1] "2019-10-01" "2019-04-01" "1992-05-01" "1992-03-01"
Then you can format these dates the way you want :
format(new_Date, '%Y-%m')
#[1] "2019-10" "2019-04" "1992-05" "1992-03"
Rather than do this in a single expression, I recommend splitting it apart for readability:
library(stringr)
library(purrr)
# Split each "mm/yy" or "mm/yyyy" entry into its month and year parts.
parts = str_split(Dates, '/')
year = as.integer(map_chr(parts, `[[`, 2L))
# A non-numeric month such as "O5" becomes NA here, with a warning.
months = as.integer(map_chr(parts, `[[`, 1L))
result = ifelse(
  str_length(Dates) == 5L & year > 20 & year < 100,
  sprintf('%02d/19%02d', months, year),
  Dates
)
This code also handles data type conversions explicitly, which makes the code more expressive and helps to find errors: for instance, your third date accidentally uses O (capital o) instead of 0, which I only noticed because my code complains about the invalid conversion.
Fundamentally I also agree with Ronak’s answer: the output you seem to want is inconsistent and should generally be avoided in favour of a uniform format, which incidentally leads to much simpler code, as Ronak’s answer shows.
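If you do want exactly the mixed output from the question, a base R sketch along these lines should work. The original attempt appears to fail because the assignment inside ifelse() modifies the whole Dates vector before the "no" branch is read. The 20 cutoff is taken from the question; the variable names are just illustrative.

```r
Dates <- c("10/19", "04/2019", "O5/1992", "03/92")

# Entries that are exactly two digits, a slash, and two digits.
two_digit <- grepl("^[0-9]{2}/[0-9]{2}$", Dates)

# Numeric year part (everything after the slash); NA if it is not a number.
yy <- suppressWarnings(as.integer(sub("^.*/", "", Dates)))

# Only two-digit years above the cutoff are treated as 19yy.
needs_century <- two_digit & !is.na(yy) & yy > 20

# Insert "19" after the slash for those entries, leave the rest untouched.
result <- ifelse(needs_century, sub("/", "/19", Dates, fixed = TRUE), Dates)
print(result)
```

Because nothing is assigned inside ifelse(), both branches are computed from the unmodified input.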
I have a string/character variable that contains a calendar date, e.g.
x <- "2018-10-31"
I also have a variable y that contains a time span, say 200 days.
y <- 200
How do I find out the calendar date for x + y?
I am not familiar with date types in R and am struggling with how to approach this.
As an add-on question: would this calculation be different if y = 4.3 months? Of course I can convert this into days, but I wonder whether there is a more direct way to handle months without converting.
You could utilise the lubridate package, which is specifically designed for handling date time data.
library(lubridate)
x <- ymd("2018-10-31")
x + days(200)
[1] "2019-05-19"
lubridate works with 'period' objects, which require integers, so you would need to convert "4.3" months into something interpretable beforehand. "4.3" months doesn't mean anything concrete in terms of date-time calculation anyway.
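The same day arithmetic also works in base R without any package, since adding a number to a Date adds days. For the 4.3 months case, the sketch below assumes an average month length of 365.25 / 12 days; that convention is my assumption, not something R defines.

```r
x <- as.Date("2018-10-31")

# Adding a number to a Date adds that many days.
plus_200_days <- x + 200
print(plus_200_days)

# Fractional months have no calendar meaning, so convert them to days first.
# Assumption: one month is treated as 365.25 / 12 days on average.
days_per_month <- 365.25 / 12
plus_4.3_months <- x + round(4.3 * days_per_month)
print(plus_4.3_months)
```

A different convention (for example 30 days per month) would give a slightly different date, which is why the conversion has to be chosen explicitly.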
I am using the new version of data.table and especially the AWESOME fread function. My files contain dates that are loaded as strings (because I don't know how to do it otherwise) looking like 01APR2008:09:00:00.
I need to sort the data.table on those datetimes and, for the sort to be efficient, cast them to the IDateTime format (or anything else I do not know about yet).
> strptime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S")
[1] "2008-04-01 09:00:00"
> IDateTime(strptime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S"))
idate itime
1: 2008-04-01 09:00:00
> IDateTime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S")
Error in charToDate(x) :
character string is not in a standard unambiguous format
It looks like I cannot do DT[ , newType := IDateTime(strptime(oldType, "%d%b%Y:%H:%M:%S"))].
My questions are then:
Is there a way to cast directly to IDateTime from fread, such that I can sort afterward efficiently?
If not, what is the most efficient way to go, knowing that I would like to be able to sort DT by this datetime column?
Unfortunately (for efficiency) strptime produces a POSIXlt type, which is unsupported by data.table and always will be due to its size (40 bytes per date!) and structure. Although as.POSIXct produces the much better POSIXct, it still does it via POSIXlt. More info here:
http://stackoverflow.com/a/12788992/403310
Looking to base functions such as as.Date, it uses strptime too, creating an integer offset from epoch (oddly) stored as double. The IDate (and friends) class in data.table aims to achieve integer epoch offsets stored as, um, integer. Suitable for fast sorting by base::sort.list(method = "radix") (which is really a counting sort). IDate doesn't really aim to be fast at (usually one off) conversion.
So to convert string dates/times, rightly or wrongly, I tend to roll my own helper function.
If the string date is "2012-12-24" I'd lean towards: as.integer(gsub("-", "", col)) and proceed with YYYYMMDD integer dates. Similarly times can be HHMMSS as an integer. Two columns, date and time separately, can be useful if you generally want to roll = TRUE within a day, but not to the previous day. Grouping by month is simple and fast: by = date %/% 100L. Adding and subtracting days is troublesome, but it is anyway because rarely do you want to add calendar days, rather weekdays or business days. So that's a lookup to your business day vector anyway.
In your case the character month would need a conversion to 1:12. There isn't a separator in your dates "01APR2008", so a substring would be one way followed by a match or fmatch on the month name. Are you in control of the file format? If so, numbers are better in an unambiguous format that sorts naturally such as %Y-%m-%d, or %Y%m%d.
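As a rough sketch of that substring-and-match idea for the "01APR2008:09:00:00" layout: the fixed character positions and the upper-case English month abbreviations are assumptions based on the single example in the question.

```r
x <- c("01APR2008:09:00:00", "15DEC2007:18:30:05", "02APR2008:00:00:01")

# Fixed-width pieces of the DDMMMYYYY:HH:MM:SS layout.
day  <- as.integer(substr(x, 1, 2))
mon  <- match(substr(x, 3, 5), toupper(month.abb))  # "APR" -> 4L
year <- as.integer(substr(x, 6, 9))

# YYYYMMDD and HHMMSS as plain integers, which sort naturally.
date_int <- year * 10000L + mon * 100L + day
time_int <- as.integer(substr(x, 11, 12)) * 10000L +
            as.integer(substr(x, 14, 15)) * 100L +
            as.integer(substr(x, 17, 18))

# Sort order by date, then time; YYYYMM for grouping by month.
ord <- order(date_int, time_int)
year_month <- date_int %/% 100L
print(x[ord])
```

match() against month.abb avoids the locale-dependent %b parsing in strptime, at the cost of hard-coding English abbreviations.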
I haven't yet got to how best to do this in fread, so date/times are left as character currently because I'm not yet sure how to detect the date format or which type to output. It does need to output either integer or double dates though, rather than inefficient character. I suspect that my use of YYYYMMDD integers is seen as unconventional, so I'm a little hesitant to make that the default. They have their place, and there are pros and cons of epoch based dates too. Dates don't have to be always epoch based is all I'm suggesting.
What do you think? Btw, thanks for encouragement on fread; was nice to see.
I don't know how your file is structured, but from your comment you want to use the date field as a key. Why not read it as a time series and format it while reading?
Here I use zoo to do it. (I assume that the date column is the first one; otherwise see the index.column argument.)
ff <- function(x) as.POSIXct(strptime(x,"%d%b%Y:%H:%M:%S"))
h <- read.zoo(text = "03avril2008:09:00:00 125
02avril2008:09:30:00 126
05avril2008:09:10:00 127
04avril2008:09:20:00 128
01avril2008:09:00:00 128"
,FUN=ff)
Your dates come out sorted and in the right format.
The conversion from POSIXct to IDateTime is then natural:
IDateTime(index(h))
idate itime
1: 2008-04-01 09:00:00
2: 2008-04-02 09:30:00
3: 2008-04-03 09:00:00
4: 2008-04-04 09:20:00
5: 2008-04-05 09:10:00
Admittedly you still do two conversions, but the first happens while reading the data, and the second involves no format handling at all.
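If you would rather stay inside data.table after fread, one possible sketch is to parse the strings to POSIXct once and assign the two IDateTime columns together. It assumes an English (or C) locale so that %b matches "APR", and the table below is made up for illustration.

```r
library(data.table)

# Hypothetical table as fread might return it: datetimes still as strings.
DT <- data.table(
  stamp = c("01APR2008:09:00:00", "15DEC2007:18:30:05", "01APR2008:08:59:59"),
  value = 1:3
)

# Parse once to POSIXct (assumes %b understands English month abbreviations).
parsed <- as.POSIXct(DT$stamp, format = "%d%b%Y:%H:%M:%S", tz = "UTC")

# IDateTime() returns two columns, so assign both names at once.
DT[, c("idate", "itime") := IDateTime(parsed)]

# Integer-backed columns can then be used as a key, which also sorts the table.
setkey(DT, idate, itime)
print(DT)
```

Assigning a single column name with := is what fails in the question, because IDateTime() produces a two-column result rather than one vector.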
How to convert between year,month,day and dates in R?
I know one can do this via strings, but I would prefer to avoid converting to strings, partly because there may be a performance hit, and partly because I worry about regionalization issues, where some of the world uses "year-month-day" and some uses "year-day-month".
It looks like ISOdate provides the direction year,month,day -> DateTime, although it first converts the numbers to a string, so if there is a way that doesn't go via a string I would prefer that.
I couldn't find anything that goes the other way, from datetimes to numerical values. I would prefer not to need strsplit or things like that.
Edit: just to be clear, what I have is, a data frame which looks like:
year month day hour somevalue
2004 1 1 1 1515353
2004 1 1 2 3513535
....
I want to be able to freely convert to this format:
time(hour units) somevalue
1 1515353
2 3513535
....
... and also be able to go back again.
Edit: to clear up some confusion about what 'time' (hour units) means, this is ultimately what I did, using information from How to find the difference between two dates in hours in R?:
forwards direction:
lh$time <- as.numeric( difftime(ISOdate(lh$year,lh$month,lh$day,lh$hour), ISOdate(2004,1,1,0), units="hours"))
lh$year <- NULL; lh$month <- NULL; lh$day <- NULL; lh$hour <- NULL
backwards direction:
... well, I didn't do backwards yet, but I imagine something like:
create difftime object out of lh$time (somehow...)
add ISOdate(2004,1,1,0) to difftime object
use one of the solution below to get the year,month,day, hour back
I suppose in the future, I could ask the exact problem I'm trying to solve, but I was trying to factorize my specific problem into generic reusable questions, but maybe that was a mistake?
Because there are so many ways in which a date can be passed in from files, databases etc and for the reason you mention of just being written in different orders or with different separators, representing the inputted date as a character string is a convenient and useful solution. R doesn't hold the actual dates as strings and you don't need to process them as strings to work with them.
Internally R is using the operating system to do these things in a standard way. You don't need to manipulate strings at all - just perhaps convert some things from character to their numerical equivalent. For example, it is quite easy to wrap up both operations (forwards and backwards) in simple functions you can deploy.
toDate <- function(year, month, day) {
ISOdate(year, month, day)
}
toNumerics <- function(Date) {
stopifnot(inherits(Date, c("Date", "POSIXt")))
day <- as.numeric(strftime(Date, format = "%d"))
month <- as.numeric(strftime(Date, format = "%m"))
year <- as.numeric(strftime(Date, format = "%Y"))
list(year = year, month = month, day = day)
}
I forgo a single call to strptime() and subsequent splitting on a separation character because you don't like that kind of manipulation.
> toDate(2004, 12, 21)
[1] "2004-12-21 12:00:00 GMT"
> toNumerics(toDate(2004, 12, 21))
$year
[1] 2004
$month
[1] 12
$day
[1] 21
Internally R's datetime code works well and is well tested and robust if a bit complex in places because of timezone issues etc. I find the idiom used in toNumerics() more intuitive than having a date time as a list and remembering which elements are 0-based. Building on the functionality provided would seem easier than trying to avoid string conversions etc.
I'm a bit late to the party, but one other way to convert from integers to date is the lubridate::make_date function. See the example below from R for Data Science:
library(lubridate)
library(nycflights13)
library(tidyverse)
a <- flights %>%
mutate(date = make_date(year, month, day))
Found one solution for going from date to year,month,day.
Let's say we have a date object, that we'll create here using ISOdate:
somedate <- ISOdate(2004,12,21)
Then, we can get the numerical components of this as follows:
unclass(as.POSIXlt(somedate))
Gives:
$sec
[1] 0
$min
[1] 0
$hour
[1] 12
$mday
[1] 21
$mon
[1] 11
$year
[1] 104
Then one can get what one wants for example:
unclass(as.POSIXlt(somedate))$mon
Note that $year is [actual year] - 1900, month is 0-based, mday is 1-based (as per the POSIX standard)
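Putting the pieces together for the hour-offset format in the question, a round trip might look like the sketch below. The data frame is invented, the origin ISOdate(2004, 1, 1, 0) is the one used in the question, and everything stays in GMT to avoid daylight-saving surprises.

```r
origin <- ISOdate(2004, 1, 1, 0)  # POSIXct in GMT

# Hypothetical input in the year/month/day/hour layout from the question.
lh <- data.frame(year = 2004, month = 1, day = c(1, 1, 2), hour = c(1, 2, 0),
                 somevalue = c(1515353, 3513535, 42))

# Forwards: hours elapsed since the origin.
lh$time <- as.numeric(difftime(ISOdate(lh$year, lh$month, lh$day, lh$hour),
                               origin, units = "hours"))

# Backwards: add the hours (as seconds) to the origin, then read the fields.
stamp <- origin + lh$time * 3600
lt <- as.POSIXlt(stamp, tz = "GMT")
back <- data.frame(year  = lt$year + 1900,  # years since 1900
                   month = lt$mon + 1,      # months are 0-based
                   day   = lt$mday,
                   hour  = lt$hour)
print(back)
```

No strings are involved in the backwards direction; the only bookkeeping is the 1900 and 0-based offsets of the POSIXlt fields.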
I have written a function which, when provided a range of dates, the name of a particular day of the week and the occurrence of that day in a given month (for instance, the second Friday of each month) will return the corresponding date. However, it isn't very fast and I'm not 100% convinced of its robustness. Is there a package or set of functions in R which can do these kinds of operations on POSIX objects? Thanks in advance!
Using the function nextfri, whose one-line source is shown in the zoo Quick Reference vignette in the zoo package, the following gives the second Friday of the month, where d is the "Date" of the first of the month:
library(zoo)
d <- as.Date(c("2011-09-01", "2011-10-01"))
nextfri(d) + 7
## [1] "2011-09-09" "2011-10-14"
(nextfri is not part of the zoo package, so you need to enter it yourself, but it's only one line.)
The following gives the day of the week where 0 is Sunday, 1 is Monday, etc.
as.POSIXlt(d)$wday
## [1] 4 6
If you really are dealing exclusively with dates rather than date-times then you ought to be using "Date" class rather than "POSIXt" classes in order to avoid time zone errors. See the article in R News 4/1.
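Since nextfri has to be typed in by hand, here is a base R sketch of a function with the same behaviour: the first Friday on or after each date. The name next_friday is hypothetical and this is not the vignette's exact source; it relies on 1970-01-02 (day number 1) having been a Friday.

```r
# First Friday on or after x: the smallest day number >= x of the form 7k + 1.
next_friday <- function(x) {
  as.Date(7 * ceiling((as.numeric(x) - 1) / 7) + 1, origin = "1970-01-01")
}

d <- as.Date(c("2011-09-01", "2011-10-01"))

# Second Friday of each month: first Friday on or after the 1st, plus a week.
second_friday <- next_friday(d) + 7
print(second_friday)

# Sanity check: weekday index 5 means Friday (0 is Sunday).
weekday_index <- as.POSIXlt(second_friday)$wday
print(weekday_index)
```

Other weekdays follow by shifting the constant 1 to the day number of any known date that falls on the wanted weekday.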
The timeDate package has some of that functionality; I based this little snippet on some code in that package. This version is for Dates, whereas timeDate has underlying POSIX types.
nthNdayInMonth <- function(date,nday = 1, nth = 1){
wday <- (as.integer(date) - 3) %% 7
r <- (as.integer(date) + (nth -1) * 7 + (nday - wday)%%7)
as.Date(r,"1970-01-01")
}
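A usage sketch for the function above, repeated here so the block runs on its own. From reading the arithmetic, date is expected to be the first of the month and nday uses 0 for Sunday through 6 for Saturday; the brute-force comparison is only there as a cross-check.

```r
nthNdayInMonth <- function(date, nday = 1, nth = 1) {
  wday <- (as.integer(date) - 3) %% 7   # 0 = Sunday, since 1970-01-01 was a Thursday
  r <- as.integer(date) + (nth - 1) * 7 + (nday - wday) %% 7
  as.Date(r, "1970-01-01")
}

first_of_month <- as.Date(c("2011-09-01", "2011-10-01"))

# Second Friday (nday = 5) of each month.
second_fri <- nthNdayInMonth(first_of_month, nday = 5, nth = 2)
print(second_fri)

# Cross-check: list the days of each month start and pick the second Friday.
brute <- as.Date(sapply(first_of_month, function(d) {
  days <- seq(d, by = "day", length.out = 31)
  as.numeric(days[as.POSIXlt(days)$wday == 5][2])
}), origin = "1970-01-01")
print(brute)
```

The closed-form version is vectorised over date, which should make it noticeably faster than scanning days when applied to long date ranges.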