split a data.table (or frame) by POSIXct factors - r

I am trying to split a data.table (an enhanced data.frame) by a POSIXct column, without success...
rangedt <- as.POSIXct(c("2012-10-01 06:00","2012-10-01 21:00"), tz='GMT'); N=1e2
dts <- as.POSIXct(runif(n=N, min=min(rangedt), max=max(rangedt)), tz='GMT', origin='1970-01-01')
DT <- data.table(x=rnorm(N), dts=dts) # put data.frame if you prefer
# x dts
#1: 0.938973900218328494383 2012-10-01 17:11:46.503828
#2: 0.582959687387282210480 2012-10-01 17:33:24.203815
#3: -1.492752410394331263888 2012-10-01 08:37:37.585960
#4: 0.677074458537853418605 2012-10-01 08:55:04.598939
#5: 0.012120685348577473275 2012-10-01 09:35:16.664197
#6: -1.353204371844073161668 2012-10-01 18:45:46.737178
f <- cut(rangedt, breaks = '10 min')
f
#[1] 2012-10-01 06:00:00 2012-10-01 21:00:00
#91 Levels: 2012-10-01 06:00:00 2012-10-01 06:10:00 ... 2012-10-01 21:00:00
DT.split <- split(DT, f = findInterval(DT$dts, f))
length(DT.split)
#[1] 1
This is because R makes only one class from the data, which I do not understand.
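Why it fails (a sketch): findInterval() coerces the factor f to its integer level codes, so the breaks become c(1, 91), while DT$dts is coerced to seconds since 1970; every timestamp therefore lands past the last break, and split() produces a single group.
as.numeric(f)          #[1]  1 91  (level codes, not times)
as.numeric(DT$dts[1])  # roughly 1.35e9 (seconds since 1970)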

I found the problem: factors and POSIXct don't work well together. Converting the factor levels back to POSIXct before calling findInterval fixes it (same setup as above):
f <- as.POSIXct(levels(f), tz = 'GMT', origin = '1970-01-01')
DT[, groups := findInterval(dts, f)]
DT.split <- split(DT, DT$groups)
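An alternative sketch that avoids findInterval entirely: cut() the data column itself, so the grouping factor already lines up with the rows (drop = TRUE discards empty 10-minute bins):
DT.split2 <- split(DT, cut(DT$dts, breaks = '10 min'), drop = TRUE)
length(DT.split2)  # one element per non-empty 10-minute bin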

Related

R and Data.table - applying rollapply over multiple columns

I would really appreciate it if you could help me apply rollapply to each column of the data.table:
time AUD/NZD EUR/GBP USD/AUD
1 2013-01-01 20:00 0.213 -0.30467 -0.127515
2 2013-01-01 20:05 0.21191 -0.30467 -0.127975
3 2013-01-01 20:10 0.212185 -0.304965 -0.127935
4 2013-01-01 20:15 0.212055 -0.30511 -0.1288
5 2013-01-01 20:20 0.211225 -0.30536 -0.12938
6 2013-01-01 20:25 0.211185 -0.30527 -0.129195
7 2013-01-01 20:30 0.21159 -0.3059 -0.13043
8 2013-01-01 20:35 0.21142 -0.304955 -0.13155
9 2013-01-01 20:40 0.21093 -0.30419 -0.132715
10 2013-01-01 20:45 0.2078 -0.30339 -0.13544
11 2013-01-01 20:50 0.208445 -0.30304 -0.135645
12 2013-01-01 20:55 0.208735 -0.30185 -0.1357
13 2013-01-01 21:00 0.20891 -0.303265 -0.13722
14 2013-01-01 21:05 0.20903 -0.30428 -0.137495
15 2013-01-01 21:10 0.209615 -0.305495 -0.13734
16 2013-01-01 21:15 0.20981 -0.30588 -0.13772
17 2013-01-01 21:20 0.209855 -0.306935 -0.13801
18 2013-01-01 21:25 0.209585 -0.30604 -0.138045
19 2013-01-01 21:30 0.210105 -0.3061 -0.137765
20 2013-01-01 21:35 0.210335 -0.30734 -0.138525
Code that works:
library("zoo")
library("data.table")
calculateAverage <- function(x, N) {
  rollapply(x, N, mean)  # was rollapply(out[, 1], N, mean), which ignored x
}
col1 <- out[, 2]
col2 <- out[, 3]
col3 <- out[, 4]
average1 <- calculateAverage(col1, 2)
average2 <- calculateAverage(col2, 2)
average3 <- calculateAverage(col3, 2)
combine <- cbind(average1, average2, average3)
tempMatrix <- matrix(NA, nrow = nrow(out), ncol = ncol(out))
tempMatrix[2:nrow(out), 1:3] <- combine
Suggestion from SO:
test <- lapply(out[, with = FALSE], function(x) rollapply(x, width = 2, FUN = mean))
Challenges:
1. The code I created works, but it feels inefficient and not generic: it needs to be modified whenever the number of columns changes.
2. The suggestion from SO outputs a list, which is not useful to me.
If an alternative method can be suggested, I would really appreciate it!
Thanks in advance
Edit:
Data table added
data <- cbind(mtcars, as.Date(c("2007-06-22", "2004-02-13")))
merge(rollapply(Filter(is.numeric, data), 2, mean),
      Filter(Negate(is.numeric), data))
The first line creates data containing some non-numeric values, purely to mimic your data, which is not available right now. The second line filters only the numeric columns and applies the rolling mean to each of them.
Suggestion from David Arenburg worked perfectly!
MaPrice <- function(x, N) {
  Mavg <- rollapply(x, N, mean)
  Mavg
}
SpreadMA <- out[, lapply(.SD, MaPrice, N = 20)]
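For completeness, a hedged variation on the same idea: fill = NA keeps the output the same length as the input, and .SDcols restricts the computation to the numeric columns (the column name "time" is an assumption based on the data shown above):
library(data.table)
library(zoo)
# pad with NA so the result has as many rows as 'out', and skip the
# non-numeric time column via .SDcols (column name assumed)
SpreadMA <- out[, lapply(.SD, rollapply, width = 20, FUN = mean,
                         fill = NA, align = "right"),
                .SDcols = setdiff(names(out), "time")]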

Count time stamps in different time intervals - issue with interval which spans midnight

I have a dataframe ("observations") with time stamps in H:M format ("Time"). In a second dataframe ("intervals"), I have time ranges defined by "From" and "Till" variables, also in H:M format.
I want to count the number of observations that fall within each interval. I have been using between from data.table, which has worked without any problem when dates are included.
However, now I only have time stamps, without dates. This causes problems for the times that occur in the interval spanning midnight (20:00 - 05:59): these times are not counted by the code I have tried.
Example below
interval.data <- data.frame(From = c("14:00", "20:00", "06:00"), Till = c("19:59", "05:59", "13:59"), stringsAsFactors = F)
observations <- data.frame(Time = c("14:32", "15:59", "16:32", "21:34", "03:32", "02:00", "00:00", "05:57", "19:32", "01:32", "02:22", "06:00", "07:50"), stringsAsFactors = F)
interval.data
# From Till
# 1: 14:00:00 19:59:00
# 2: 20:00:00 05:59:00 # <- interval including midnight
# 3: 06:00:00 13:59:00
observations
# Time
# 1: 14:32:00
# 2: 15:59:00
# 3: 16:32:00
# 4: 21:34:00 # Row 4-8 & 10-11 falls in 'midnight interval', but are not counted
# 5: 03:32:00 #
# 6: 02:00:00 #
# 7: 00:00:00 #
# 8: 05:57:00 #
# 9: 19:32:00
# 10: 01:32:00 #
# 11: 02:22:00 #
# 12: 06:00:00
# 13: 07:50:00
library(data.table)
library(plyr)
adply(interval.data, 1, function(x, y) sum(y[, 1] %between% c(x[1], x[2])), y = observations)
# From Till V1
# 1 14:00 19:59 4
# 2 20:00 05:59 0 # <- zero counts - wrong!
# 3 06:00 13:59 2
One approach is to use a non-equi join in data.table, together with its helper function as.ITime for working with time strings.
You'll have an issue with the interval that spans midnight, but there should only ever be one of those, and as you're interested in the number of observations per 'group' of intervals, you can treat this group as the complement of the others.
For example, first convert your data.frame to data.table
library(data.table)
## set your data.frames as `data.table`
setDT(interval.data)
setDT(observations)
Then use as.ITime to convert to an integer representation of time
## convert time stamps
interval.data[, `:=`(FromMins = as.ITime(From),
                     TillMins = as.ITime(Till))]
observations[, TimeMins := as.ITime(Time)]
## you could combine this step with the non-equi join directly, but I'm separating it for clarity
You can now use a non-equi join to find the interval each time falls within, noting that the times that return NA are actually those that fall inside the midnight-spanning interval.
interval.data[
  observations,
  on = .(FromMins <= TimeMins, TillMins > TimeMins)
]
# From Till FromMins TillMins Time
# 1: 14:00 19:59 872 872 14:32
# 2: 14:00 19:59 959 959 15:59
# 3: 14:00 19:59 992 992 16:32
# 4: NA NA 1294 1294 21:34
# 5: NA NA 212 212 03:32
# 6: NA NA 120 120 02:00
# 7: NA NA 0 0 00:00
# 8: NA NA 357 357 05:57
# 9: 14:00 19:59 1172 1172 19:32
# 10: NA NA 92 92 01:32
# 11: NA NA 142 142 02:22
# 12: 06:00 13:59 360 360 06:00
# 13: 06:00 13:59 470 470 07:50
Then, to get the number of observations for the groups of intervals, you just take .N grouped by From and Till, which can be chained onto the end of the statement above:
interval.data[
  observations,
  on = .(FromMins <= TimeMins, TillMins > TimeMins)
][,
  .N,
  by = .(From, Till)
]
# From Till N
# 1: 14:00 19:59 4
# 2: NA NA 7
# 3: 06:00 13:59 2
Here the NA group corresponds to the interval that spans midnight.
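If you would rather see the midnight-spanning labels than NA, one hedged tweak is to relabel that group afterwards (the values "20:00" and "05:59" are taken from interval.data above):
res <- interval.data[
  observations,
  on = .(FromMins <= TimeMins, TillMins > TimeMins)
][, .N, by = .(From, Till)]
# relabel the NA group as the one interval that spans midnight
res[is.na(From), `:=`(From = "20:00", Till = "05:59")]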
I just tweaked your code to get the desired result. Hope this helps!
adply(interval.data, 1, function(x, y)
  # endpoints must be quoted strings ("23:59", "00:00"); a bare 23:59
  # would be parsed by R as the integer sequence 23:59, not a time
  if (x[1] > x[2])
    return(sum(y[, 1] %between% c(x[1], "23:59"), y[, 1] %between% c("00:00", x[2])))
  else
    return(sum(y[, 1] %between% c(x[1], x[2]))),
  y = observations)
Output is:
From Till V1
1 14:00 19:59 4
2 20:00 05:59 7
3 06:00 13:59 2

R: how to get the rolling mean of a variable over the last few days, but only at a given hour?

Consider this
library(lubridate)  # for ymd_hms()

time <- seq(ymd_hms("2014-02-24 23:00:00"), ymd_hms("2014-06-25 08:32:00"), by = "hour")
group <- rep(LETTERS[1:20], each = length(time))
value <- sample(-10^3:10^3, length(time), replace = TRUE)
df2 <- data.frame(time, group, value)
str(df2)
> head(df2)
time group value
1 2014-02-24 23:00:00 A 246
2 2014-02-25 00:00:00 A -261
3 2014-02-25 01:00:00 A 628
4 2014-02-25 02:00:00 A 429
5 2014-02-25 03:00:00 A -49
6 2014-02-25 04:00:00 A -749
I would like to create a variable that contains, for each group, the rolling mean of value:
- over the last 5 days (not including the current observation),
- only considering observations that fall at the exact same hour as the current observation.
In other words:
At time 2014-02-24 23:00:00, df2['rolling_mean_same_hour'] contains the mean of the values of value observed at 23:00:00 during the last 5 days in the data (not including 2014-02-24 of course).
I would like to do that in either dplyr or data.table, but I confess I have no idea how.
Any ideas?
Many thanks!
You can calculate the rollmean() with your data grouped by the group variable and the hour of the time variable. Normally rollmean() includes the current observation, but you can use the shift() function to exclude it from the rolling mean:
library(data.table); library(zoo)
setDT(df2)
df2[, .(rolling_mean_same_hour = shift(
          rollmean(value, 5, na.pad = TRUE, align = 'right'),
          n = 1,
          type = 'lag'),
        time),
    .(hour(time), group)]
# hour group rolling_mean_same_hour time
# 1: 23 A NA 2014-02-24 23:00:00
# 2: 23 A NA 2014-02-25 23:00:00
# 3: 23 A NA 2014-02-26 23:00:00
# 4: 23 A NA 2014-02-27 23:00:00
# 5: 23 A NA 2014-02-28 23:00:00
# ---
#57796: 22 T -267.0 2014-06-20 22:00:00
#57797: 22 T -389.6 2014-06-21 22:00:00
#57798: 22 T -311.6 2014-06-22 22:00:00
#57799: 22 T -260.0 2014-06-23 22:00:00
#57800: 22 T -26.8 2014-06-24 22:00:00
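For newer data.table versions (1.12+), a hedged alternative sketch: the built-in frollmean() avoids the zoo dependency, and := assigns the result back by reference, preserving the original row order:
library(data.table)
setDT(df2)
# mean of the previous 5 same-hour values per group; shift() lags by
# one so the current observation is excluded
df2[, rolling_mean_same_hour := shift(frollmean(value, 5)),
    by = .(hour(time), group)]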

How to get NA returned from R aggregate over NA data?

I have a dataframe with a POSIXct datetime column and a column with a value.
The value column may contain periods of NA, and sometimes even gaps of several hours with no data at all, e.g.:
t v
2014-01-01 20:00:00 1000
2014-01-01 20:15:00 2300
2014-01-01 20:30:00 1330
2014-01-01 20:45:00 NA
2014-01-01 21:00:00 NA
2014-01-01 22:15:00 NA
2014-01-01 22:30:00 1330
2014-01-01 22:45:00 3333
One can easily see that there is a period where simply no data was written (between 21:00 and 22:15).
When I now apply
aggregate(data, list(t = cut(data$t, "1 hour")), FUN = sum)
it interprets anything missing as zero. When plotting it with ggplot2 and geom_line, the curve in that region will break down from 1000s to 10s.
I want that aggregate returns NA values for every hour that is not represented by the data (missing or NA itself), such that the values are not bent down to 0 and such that the line plot shows a gap in that period (disconnected data points).
Thanks to @JulienNavarre and @user20650, who each contributed part of the solution. Here is my final solution, which can additionally handle data at non-regular times and demands at least x valid values per hour for aggregation:
data$t <- as.POSIXct(strptime(data$t, "%Y-%m-%d %H:%M:%S"))
x <- 4  # data available x times per hour
h <- 1  # aggregate to every h hours
# aggregation puts NA if a bin does not have x*h valid values
dataagg <- aggregate(data$v, list(t = cut(data$t, paste(h, "hours"))),
                     function(z) ifelse(length(z) < x*h || any(is.na(z)),
                                        NA, sum(z, na.rm = TRUE)))
dataagg$t <- as.POSIXct(strptime(dataagg$t, "%Y-%m-%d %H:%M:%S"))
# now fill up missing datetimes with NA
t <- seq(min(dataagg$t), max(dataagg$t), by = paste(h, "hours"))
tdf <- as.data.frame(t)
tdf$t <- as.POSIXct(strptime(tdf$t, "%Y-%m-%d %H:%M:%S"))
dataaggfinal <- merge(dataagg, tdf, by = "t", all.y = TRUE)
What you want is not entirely clear, though; maybe you are looking for a right join, which you can do with merge and all.y = TRUE.
Afterwards you can do your grouped sum with aggregate.
> data$t <- as.POSIXct(data$t)
>
> time.seq <- seq(min(as.POSIXct(data$t)), max(as.POSIXct(data$t)), by = "min")[seq(1, 166, by = 15)]
>
> merge(data, as.data.frame(time.seq), by.x = "t", by.y = "time.seq", all.y = T)
t v
1 2014-01-01 20:00:00 1000
2 2014-01-01 20:15:00 2300
3 2014-01-01 20:30:00 1330
4 2014-01-01 20:45:00 NA
5 2014-01-01 21:00:00 NA
6 2014-01-01 21:15:00 NA
7 2014-01-01 21:30:00 NA
8 2014-01-01 21:45:00 NA
9 2014-01-01 22:00:00 NA
10 2014-01-01 22:15:00 NA
11 2014-01-01 22:30:00 1330
12 2014-01-01 22:45:00 3333
And the x argument of aggregate should be, in this case, the variable you want to sum, so it is data$v, not data.
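Putting the two steps together, a sketch (filled is just an illustrative name for the merged frame above):
# right-join first so the missing timestamps exist as NA rows, then
# aggregate filled$v (not filled): sum() propagates NA, so hours with
# any missing value come out as NA instead of collapsing toward 0
filled <- merge(data, as.data.frame(time.seq),
                by.x = "t", by.y = "time.seq", all.y = TRUE)
aggregate(filled$v, list(t = cut(filled$t, "1 hour")), FUN = sum)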

Add months to IDate column of data.table in R

I have been using data.table for practically everything I was using data.frames for, as it is much, much faster on big in-memory data (several million rows). However, I'm not quite sure how to add days or months to an IDate column without using apply (which is very slow).
A minimal example:
dates = c("2003-01-01", "2003-02-01", "2003-03-01", "2003-06-01", "2003-12-01",
"2003-04-01", "2003-05-01", "2003-07-01", "2003-09-01", "2003-08-01")
dt = data.table(idate1=as.IDate(dates))
Now, let's say I want to create a column with dates 6 months ahead. Normally, for a single IDate, I would do this:
seq(dt$idate1[1],by="6 months",length=2)[2]
But this won't work as from= must be of length 1:
dt[, idate2 := seq(idate1, by = "6 months", length = 2)[2]]
Is there an efficient way of doing it to create column idate2 in dt?
Thanks a lot,
RR
One way is to use the mondate package: add the months to the mondate and then convert it back to an IDate object.
require(mondate)
dt = data.table(idate1=as.IDate(dates))
dt[, idate2 := as.IDate(mondate(as.Date(idate1)) + 6)]
# idate1 idate2
# 1: 2003-01-01 2003-07-01
# 2: 2003-02-01 2003-08-02
# 3: 2003-03-01 2003-09-01
# 4: 2003-06-01 2003-12-02
# 5: 2003-12-01 2004-06-01
# 6: 2003-04-01 2003-10-02
# 7: 2003-05-01 2003-11-01
# 8: 2003-07-01 2004-01-01
# 9: 2003-09-01 2004-03-02
# 10: 2003-08-01 2004-02-01
Although I suppose there might be other, better solutions.
You can use lubridate:
library(lubridate)
dt[, idate2 := as.IDate(idate1 %m+% months(6))]
idate1 idate2
1: 2003-01-01 2003-07-01
2: 2003-02-01 2003-08-01
3: 2003-03-01 2003-09-01
4: 2003-06-01 2003-12-01
5: 2003-12-01 2004-06-01
6: 2003-04-01 2003-10-01
7: 2003-05-01 2003-11-01
8: 2003-07-01 2004-01-01
9: 2003-09-01 2004-03-01
10: 2003-08-01 2004-02-01
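Note the difference from the mondate output above: %m+% adds calendar months and only rolls the day back when the target day does not exist, which is why 2003-02-01 maps to 2003-08-01 here rather than mondate's 2003-08-02. A quick check, using values already shown:
library(data.table)
library(lubridate)
# same-day month arithmetic: Feb 1 + 6 months lands exactly on Aug 1
as.IDate(as.IDate("2003-02-01") %m+% months(6))
#[1] "2003-08-01"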
