From a data frame with timestamped rows (strptime results), what is the best method for aggregating statistics for intervals?
Intervals could be an hour, a day, etc.
There's the aggregate function, but that doesn't help with assigning each row to an interval. I'm planning on adding a column to the data frame that denotes interval and using that with aggregate, but if there's a better solution it'd be great to hear it.
Thanks for any pointers!
Example Data
Five rows with timestamps divided into 15-minute intervals starting at 03:00.
Interval 1
"2010-01-13 03:02:38 UTC"
"2010-01-13 03:08:14 UTC"
"2010-01-13 03:14:52 UTC"
Interval 2
"2010-01-13 03:20:42 UTC"
"2010-01-13 03:22:19 UTC"
Conclusion
Using a time series package such as xts should be the solution; however, I had no success using them and wound up using cut. As I presently only need to plot histograms, with rows grouped by interval, this was enough.
cut is used like so:
interv <- function(x, start, period, num.intervals) {
return(cut(x, as.POSIXlt(start)+0:num.intervals*period))
}
Standard functions to split vectors are cut and findInterval:
v <- as.POSIXct(c(
"2010-01-13 03:02:38 UTC",
"2010-01-13 03:08:14 UTC",
"2010-01-13 03:14:52 UTC",
"2010-01-13 03:20:42 UTC",
"2010-01-13 03:22:19 UTC"
))
# Your function returns a list:
interv(v, as.POSIXlt("2010-01-13 03:00:00 UTC"), 900)
# [[1]]
# [1] "2010-01-13 03:00:00"
# [[2]]
# [1] "2010-01-13 03:00:00"
# [[3]]
# [1] "2010-01-13 03:00:00"
# [[4]]
# [1] "2010-01-13 03:15:00 CET"
# [[5]]
# [1] "2010-01-13 03:15:00 CET"
# cut returns a factor; you must provide proper breaks:
cut(v, as.POSIXlt("2010-01-13 03:00:00 UTC")+0:2*900)
# [1] 2010-01-13 03:00:00 2010-01-13 03:00:00 2010-01-13 03:00:00
# [4] 2010-01-13 03:15:00 2010-01-13 03:15:00
# Levels: 2010-01-13 03:00:00 2010-01-13 03:15:00
# findInterval returns a vector of interval ids (breaks as in cut)
findInterval(v, as.POSIXlt("2010-01-13 03:00:00 UTC")+0:2*900)
# [1] 1 1 1 2 2
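To connect this back to the aggregation part of the question, here is a minimal base R sketch that feeds the interval ids from findInterval into tapply. The value column is made up for illustration; it is not part of the original data.

```r
# Timestamps from the question, parsed explicitly as UTC
v <- as.POSIXct(c("2010-01-13 03:02:38", "2010-01-13 03:08:14",
                  "2010-01-13 03:14:52", "2010-01-13 03:20:42",
                  "2010-01-13 03:22:19"), tz = "UTC")

# Hypothetical measurements, one per timestamp (assumption for illustration)
value <- c(1, 2, 3, 4, 5)

# Breaks every 900 seconds (15 minutes) starting at 03:00
breaks <- as.POSIXct("2010-01-13 03:00:00", tz = "UTC") + 0:2 * 900

# Interval id for each timestamp; as.numeric makes the comparison explicit
id <- findInterval(as.numeric(v), as.numeric(breaks))

# Mean of the measurements within each interval
interval_mean <- tapply(value, id, mean)
interval_mean
```

The same id vector could also be added as a column and passed to aggregate.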
For the record: cut has a method for the POSIXt type, but unfortunately there is no way to provide a start argument. The effect is:
cut(v,"15 min")
# [1] 2010-01-13 03:02:00 2010-01-13 03:02:00 2010-01-13 03:02:00
# [4] 2010-01-13 03:17:00 2010-01-13 03:17:00
# Levels: 2010-01-13 03:02:00 2010-01-13 03:17:00
As you can see, it starts at 03:02:00. You could mess with the labels of the output factor (convert labels to time, round somehow, and convert back to character).
Use a time series package. The xts package has functions designed specifically to do that. Or look at the aggregate and rollapply functions in the zoo package.
The rmetrics ebook has a useful discussion, including a performance comparison of the various packages: https://www.rmetrics.org/files/freepdf/TimeSeriesFAQ.pdf
Edit: Look at my answer to this question. Basically you need to truncate every timestamp into a specific interval and then do the aggregation using those new truncated timestamps as your grouping vector.
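The truncate-then-aggregate idea can be sketched in base R without any time series package. This is only one possible way to do it, assuming the intervals should line up with the clock (03:00, 03:15, ...) in UTC; the value column is hypothetical.

```r
# Example data: timestamps from the question plus a made-up value column
df <- data.frame(
  time = as.POSIXct(c("2010-01-13 03:02:38", "2010-01-13 03:08:14",
                      "2010-01-13 03:14:52", "2010-01-13 03:20:42",
                      "2010-01-13 03:22:19"), tz = "UTC"),
  value = c(10, 20, 30, 40, 50)
)

period <- 15 * 60  # interval length in seconds

# Truncate each timestamp down to the start of its 15-minute interval
df$interval <- as.POSIXct(floor(as.numeric(df$time) / period) * period,
                          origin = "1970-01-01", tz = "UTC")

# Aggregate using the truncated timestamps as the grouping vector
agg <- aggregate(value ~ interval, data = df, FUN = sum)
agg
```

Changing period to 3600 or 86400 gives hourly or daily (UTC) intervals with the same code.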
This is an interesting question; with the proliferation of the various time series packages and methods, there ought to be an approach for binning irregular time series other than by brute force that the OP suggests. Here is one "high-level" way to get the intervals that you can then use for aggregate et al, using a version of cut defined for chron objects.
require(chron)
require(timeSeries)
my.times <- "
2010-01-13 03:02:38 UTC
2010-01-13 03:08:14 UTC
2010-01-13 03:14:52 UTC
2010-01-13 03:20:42 UTC
2010-01-13 03:22:19 UTC
"
time.df <- read.delim(textConnection(my.times),header=FALSE,sep="\n",strip.white=FALSE)
time.seq <- seq(trunc(timeDate(time.df[1,1]),units="hours"),by=15*60,length=nrow(time.df))
intervals <- as.numeric(cut(as.chron(as.character(time.df$V1)),breaks=as.chron(as.character(time.seq))))
You get
intervals
[1] 1 1 1 2 2
which you can now append to the data frame and aggregate.
The coercion acrobatics above (from character to timeDate to character to chron) are a little unfortunate, so if there are cleaner solutions for binning irregular time data using xts or any of the other time series packages, I'd love to hear about them as well!
I am also curious to know what would be the most efficient approach for binning large high-frequency irregular time series, e.g. creating 1-minute volume bars on tick data for a very liquid stock.
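As a rough starting point for the volume-bar case, here is a base R sketch using rowsum, which is generally regarded as a quick way to compute grouped sums. The tick data is simulated (tick_time and volume are made-up names and values), and this is not a benchmarked recommendation.

```r
set.seed(1)
n <- 1000

# Simulated ticks: irregular times within a 10-minute window (assumption)
tick_time <- as.POSIXct("2010-01-13 09:30:00", tz = "UTC") +
  sort(runif(n, min = 0, max = 600))
volume <- sample(1:500, n, replace = TRUE)

# Integer minute index since the epoch for each tick
minute_id <- as.integer(floor(as.numeric(tick_time) / 60))

# Sum volume within each minute; row names hold the minute index
bars <- rowsum(volume, minute_id)

# Recover the start time of each bar from the row names
bar_time <- as.POSIXct(as.integer(rownames(bars)) * 60,
                       origin = "1970-01-01", tz = "UTC")

data.frame(bar_time = bar_time, volume = bars[, 1])
```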
Related
I have a csv file that consists of one column. The column presents the date of posting on a website. I want to plot a histogram to see how the number of posts varies over the years. The file contains the years (2012 to 2016) and consists of 11,000 rows.
sample of the file:
2 30/1/12 21:07
3 2/2/12 15:53
4 3/4/12 0:49
5 14/11/12 3:49
6 11/8/13 16:00
7 31/7/14 8:08
8 31/7/14 10:48
9 6/8/14 9:24
10 16/12/14 3:34
The data type is a data frame:
class(postsData)
[1] "data.frame"
I tried converting the data to text using strptime function as below:
formatDate <- strptime(as.character(postsData$Date),format="“%d/%m/%y")
then plot the histogram
hist(formatDate,breaks=10,xlab="year")
Any tip or suggestion would be useful. Thank you,
use lubridate::dmy_hm()
strptime() is overly complicated in my opinion compared to { lubridate }.
library(lubridate)
d <- c("30/1/12 21:07",
"2/2/12 15:53",
"3/4/12 0:49",
"14/11/12 3:49",
"11/8/13 16:00",
"31/7/14 8:08",
"31/7/14 10:48",
"6/8/14 9:24",
"16/12/14 3:34")
d2 <- dmy_hm(d)
d2
Returns:
[1] "2012-01-30 21:07:00 UTC"
[2] "2012-02-02 15:53:00 UTC"
[3] "2012-04-03 00:49:00 UTC"
[4] "2012-11-14 03:49:00 UTC"
[5] "2013-08-11 16:00:00 UTC"
[6] "2014-07-31 08:08:00 UTC"
[7] "2014-07-31 10:48:00 UTC"
[8] "2014-08-06 09:24:00 UTC"
[9] "2014-12-16 03:34:00 UTC"
As you can see, lubridate functions return POSIXct objects.
class(d2)
[1] "POSIXct" "POSIXt"
Next, you can use lubridate::year() to get the year of each POSIXct object returned by dmy_hm(), and plot that histogram.
hist(year(d2))
Here's one approach. I think your date conversion is fine but you need to count the number of dates that occur in each year then plot that count as a histogram.
library(tidyverse)
# generate some data
date.seq <- tibble(xdate = seq(from = lubridate::ymd_hms('2000-01-01 00:00:00'), to=lubridate::ymd_hms('2016-12-31 23:59:59'), length.out = 100))
date.seq %>%
mutate(xyear = lubridate::year(xdate)) %>% # add a column of years
group_by(xyear) %>%
summarise(date_count = length(xdate)) %>% # Count the number of dates that occur in each year
ggplot(aes(x = xyear, y = date_count)) +
geom_col(colour = 'black', fill = 'blue') # plot as a column graph
There's no problem with strptime()*; however, the format option is intended to specify how the input is formatted.
df1$date <- strptime(df1$date, format="%d/%m/%y %H:%M")
# [1] "2012-01-30 21:07:00 CET" "2012-02-02 15:53:00 CET"
# [3] "2012-04-03 00:49:00 CEST" "2012-11-14 03:49:00 CET"
# [5] "2013-08-11 16:00:00 CEST" "2014-07-31 08:08:00 CEST"
# [7] "2014-07-31 10:48:00 CEST" "2014-08-06 09:24:00 CEST"
# [9] "2014-12-16 03:34:00 CET"
What you probably want then is to convert to the Date class (hist() needs dates or numbers, not character strings), e.g. via the format() function
formatDate <- as.Date(format(df1$date, format="%F"))
(or in this case simpler with formatDate <- as.Date(df1$date))
and then
hist(formatDate, breaks=10, xlab="year")
* credits to @MikkoMarttila
Data
df1 <- structure(list(id = 2:10, date = c("30/1/12 21:07", "2/2/12 15:53",
"3/4/12 0:49", "14/11/12 3:49", "11/8/13 16:00", "31/7/14 8:08",
"31/7/14 10:48", "6/8/14 9:24", "16/12/14 3:34")), class = "data.frame", row.names = c(NA,
-9L))
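If the goal is simply the number of posts per year, a base R alternative is to count the years and draw a bar chart. This is a sketch that assumes the dates are day/month/two-digit year as in the sample and that UTC is an acceptable time zone.

```r
# Sample dates from the question
dates <- c("30/1/12 21:07", "2/2/12 15:53", "3/4/12 0:49",
           "14/11/12 3:49", "11/8/13 16:00", "31/7/14 8:08",
           "31/7/14 10:48", "6/8/14 9:24", "16/12/14 3:34")

# Parse with a format that covers both the date and the time part
parsed <- as.POSIXct(dates, format = "%d/%m/%y %H:%M", tz = "UTC")

# Count posts per four-digit year
posts_per_year <- table(format(parsed, "%Y"))
posts_per_year

# One bar per year
barplot(posts_per_year, xlab = "year", ylab = "number of posts")
```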
I have about 50 regular 5-minute-interval datetime data sets. POSIXt/lubridate functions convert my datetimes very nicely to a 24-hour format as required. But I would like to add another column in which a day runs from 6 am to 6 am (currently it runs from midnight to midnight). I am trying to do this to capture after-midnight activity as part of the current date rather than the next one.
I am currently trying to create a group every 288th row (there are 288 5-minute intervals in a day). But this creates a problem because my data sets don't necessarily start at the same time.
I do not want to create offsets because that tampers with the values corresponding to the time.
Any efficient ways around this problem? Thank you.
You can efficiently do it by first generating a sequence of date/times, then using cut to find the bin in which each value falls:
set.seed(2)
dat <- Sys.time() + sort(runif(10, min=0, max=5*24*60*60))
dat
# [1] "2017-07-29 15:43:10 PDT" "2017-07-29 20:23:12 PDT" "2017-07-29 22:24:22 PDT" "2017-07-31 08:22:57 PDT"
# [5] "2017-07-31 18:13:06 PDT" "2017-07-31 21:01:10 PDT" "2017-08-01 12:30:19 PDT" "2017-08-02 04:14:03 PDT"
# [9] "2017-08-02 17:26:14 PDT" "2017-08-02 17:28:52 PDT"
sixs <- seq(as.POSIXct("2017-07-29 06:00:00", tz = "UTC"), as.POSIXct("2017-08-03 06:00:00", tz = "UTC"), by = "day")
sixs
# [1] "2017-07-29 06:00:00 UTC" "2017-07-30 06:00:00 UTC" "2017-07-31 06:00:00 UTC" "2017-08-01 06:00:00 UTC"
# [5] "2017-08-02 06:00:00 UTC" "2017-08-03 06:00:00 UTC"
cut(dat, sixs, labels = FALSE)
# [1] 1 1 1 3 3 3 4 5 5 5
According to the help page (?seq.POSIXt), you might choose by="DSTday" instead.
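Another way to get a 6 am to 6 am day label, sketched here in base R, is to compute the label from a shifted copy of the timestamp while leaving the original column untouched. The sample times and the column name day6 are made up, and the sketch assumes UTC (around DST changes a fixed 6-hour shift in local time may be off by an hour).

```r
# Hypothetical timestamps around the 6 am boundary
x <- as.POSIXct(c("2017-07-29 05:55:00", "2017-07-29 06:00:00",
                  "2017-07-29 23:55:00", "2017-07-30 00:05:00",
                  "2017-07-30 06:05:00"), tz = "UTC")

# Label each timestamp with the date of the "day" that started at 6 am;
# x itself is not modified, only the derived label uses the shift
day6 <- as.Date(x - 6 * 3600, tz = "UTC")

data.frame(time = x, day6 = day6)
```

The day6 column can then be used as the grouping variable regardless of where each data set starts.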
Check out this question and the corresponding answer: How to manipulate the time part of a date column?
It illustrates a more robust solution, as it is independent of your data structure (e.g. repetition).
Following @meenaparam's solution:
Convert all date columns to dmy_hms format from lubridate package. Please explore other options like dmy_hm or ymd_hms etc, as per your specific need.
mutate(DATE = dmy_hms(DATE))
Now create a column to identify the data points that need to be modified in different ways. For example, your data points from 00:00:00 to 05:59:59 (hms) need to be part of the previous date.
mutate(DAY_PAST = case_when(hour(DATE) < 6 ~ "yup", TRUE ~ "nope"))
Now convert the day value of these "yup" dates to day(DATE)-1
mutate(NEW_DATE = case_when(DAY_PAST == "yup"
~ make_datetime(year(DATE-86400), month(DATE-86400), day = day(DATE-86400), hour = hour(DATE), min = minute(DATE), sec = second(DATE)), # keep minutes and seconds too
TRUE ~ DATE))
I am trying to do some simple operations in R. After loading a table, I encountered a date column which has several formats combined.
**Date**
1/28/14 6:43 PM
1/29/14 4:10 PM
1/30/14 12:09 PM
1/30/14 12:12 PM
02-03-14 19:49
02-03-14 20:03
02-05-14 14:33
I need to convert this to format like 28-01-2014 18:43 i.e. %d-%m-%y %h:%m
I tried this
tablename$Date <- as.Date(as.character(tablename$Date), "%d-%m-%y %h:%m")
but doing this fills the entire column with NA. Please help me to get this right!
The lubridate package makes quick work of this:
library(lubridate)
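# the dates from the question
dates <- c("1/28/14 6:43 PM", "1/29/14 4:10 PM", "1/30/14 12:09 PM", "1/30/14 12:12 PM",
"02-03-14 19:49", "02-03-14 20:03", "02-05-14 14:33")
d <- parse_date_time(dates, names(guess_formats(dates, c("mdy HM", "mdy IMp"))))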
d
## [1] "2014-01-28 18:43:00 UTC" "2014-01-29 16:10:00 UTC"
## [3] "2014-01-30 12:09:00 UTC" "2014-01-30 12:12:00 UTC"
## [5] "2014-02-03 19:49:00 UTC" "2014-02-03 20:03:00 UTC"
## [7] "2014-02-05 14:33:00 UTC"
# put in desired format
format(d, "%m-%d-%Y %H:%M:%S")
## [1] "01-28-2014 18:43:00" "01-29-2014 16:10:00" "01-30-2014 12:09:00"
## [4] "01-30-2014 12:12:00" "02-03-2014 19:49:00" "02-03-2014 20:03:00"
## [7] "02-05-2014 14:33:00"
You'll need to adjust the vector in guess_formats if you come across other format variations.
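For comparison, a base R sketch of the same idea is to try each known format in turn and fill in whatever the first one could not parse. It assumes only the two formats seen in the question occur, that both are month-first, and that the locale recognises AM/PM for %p.

```r
# A few of the mixed-format dates from the question
dates <- c("1/28/14 6:43 PM", "1/30/14 12:09 PM", "02-03-14 19:49")

# Attempt 1: slash-separated with a 12-hour clock and AM/PM
parsed <- as.POSIXct(dates, format = "%m/%d/%y %I:%M %p", tz = "UTC")

# Attempt 2: dash-separated with a 24-hour clock
alt <- as.POSIXct(dates, format = "%m-%d-%y %H:%M", tz = "UTC")

# Fill the entries the first format could not parse
missing <- is.na(parsed)
parsed[missing] <- alt[missing]

# Day-month-year output as requested in the question
out <- format(parsed, "%d-%m-%Y %H:%M")
out
```

Note that the output format uses %Y, %H and %M; the %y, %h and %m in the question mean two-digit year, abbreviated month name and month number respectively.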
When I put a single date to be parsed, it parses accurately
> ymd("20011001")
[1] "2001-10-01 UTC"
But when I try to create a vector of dates they all come out one day off:
> b=c(ymd("20111001"),ymd("20101001"),ymd("20091001"),ymd("20081001"),ymd("20071001"),ymd("20061001"),ymd("20051001"),ymd("20041001"),ymd("20031001"),ymd("20021001"),ymd("20011001"))
> b
[1] "2011-09-30 19:00:00 CDT" "2010-09-30 19:00:00 CDT" "2009-09-30 19:00:00 CDT"
[4] "2008-09-30 19:00:00 CDT" "2007-09-30 19:00:00 CDT" "2006-09-30 19:00:00 CDT"
[7] "2005-09-30 19:00:00 CDT" "2004-09-30 19:00:00 CDT" "2003-09-30 19:00:00 CDT"
[10] "2002-09-30 19:00:00 CDT" "2001-09-30 19:00:00 CDT"
how can I fix this??? Many thanks.
I don't claim to understand exactly what's going on here, but the proximal problem is that c() strips attributes, so using c() on a POSIX[c?]t vector strips the time zone attribute, messing it up (even if you set the time zone to agree with the one specified by your locale). On my system:
library(lubridate)
(y1 <- ymd("20011001"))
## [1] "2001-10-01 UTC"
(y2 <- ymd("20011002"))
c(y1,y2)
## now in EDT (and a day earlier/4 hours before UTC):
## [1] "2001-09-30 20:00:00 EDT" "2001-10-01 20:00:00 EDT"
(y12 <- ymd(c("20011001","20011002")))
## [1] "2001-10-01 UTC" "2001-10-02 UTC"
c(y12)
## back in EDT
## [1] "2001-09-30 20:00:00 EDT" "2001-10-01 20:00:00 EDT"
You can set the time zone explicitly ...
y3 <- ymd("20011001",tz="EDT")
## [1] "2001-10-01 EDT"
But c() is still problematic.
(y3c <- c(y3))
## [1] "2001-09-30 20:00:00 EDT"
So two solutions are:
(1) convert a character vector rather than combining the objects after converting them one by one, or
(2) restore the tzone attribute after combining.
For example:
attr(y3c,"tzone") <- attr(y3,"tzone")
@Joran points out that this is almost certainly a general property of applying c() to POSIX[c?]t objects, not specifically lubridate-related. I hope someone will chime in and explain whether this is a well-known design decision/infelicity/misfeature.
Update: there is some discussion of this on R-help in 2012, and Brian Ripley comments:
But in any case, the documentation (?c.POSIXct) is clear:
Using ‘c’ on ‘"POSIXlt"’ objects converts them to the current time
zone, and on ‘"POSIXct"’ objects drops any ‘"tzone"’ attributes
(even if they are all marked with the same time zone).
So the recommended way is to add a "tzone" attribute if you know what
you want it to be. POSIXct objects are absolute times: the timezone
merely affects how they are converted (including to character for
printing).
It might be nice if lubridate added a method to do this ...
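The recommendation in the quote can be sketched in base R alone, without lubridate. The variable names here are made up; setting the attribute explicitly is harmless if c() happened to keep it already.

```r
# Two absolute times created in UTC
y1 <- as.POSIXct("2001-10-01", tz = "UTC")
y2 <- as.POSIXct("2001-10-02", tz = "UTC")

# Combine them; depending on how c() treats "tzone", the result may
# print in the local time zone instead of UTC
both <- c(y1, y2)

# State explicitly which time zone should be used for printing
attr(both, "tzone") <- "UTC"

# The underlying instants never changed, only their display
shown <- format(both, "%Y-%m-%d")
shown
```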