The task appears simple, but I am struggling to figure out how to initialize a vector (array, collection, or similar) of ranges. That is, I need to do something like this:
vec_of_ranges = HOW TO INITIALIZE THIS?
for i=1:10
range = i:20
vec_of_ranges[i]=range
end
Can anyone give me a hint on how to do this? I need it so I can then evaluate a given array on that collection of ranges...
Thanks in advance!
If you type typeof(1:50) you get UnitRange{Int64}.
So you need an array of UnitRange{Int64}.
Try this:
vec_of_ranges = Array{UnitRange{Int64},1}(undef, 10)
for i=1:10
range = i:20
vec_of_ranges[i]=range
end
You might also want to use the fill() function; note that fill(1:20, 10) fills every slot with the same range 1:20, which you can then overwrite:
vec_of_ranges = fill(1:20, 10)
If you use a comprehension, you get both steps (i.e. getting the right element type and filling the contents) for free in one shot:
julia> [i:20 for i in 1:10]
10-element Array{UnitRange{Int64},1}:
1:20
2:20
3:20
4:20
5:20
6:20
7:20
8:20
9:20
10:20
Alternatively, you can use map:
julia> map(i -> i:20, 1:10)
10-element Array{UnitRange{Int64},1}:
1:20
2:20
3:20
4:20
5:20
6:20
7:20
8:20
9:20
10:20
If you want to use broadcasting, you can do:
julia> UnitRange.(1:10, 20)
10-element Array{UnitRange{Int64},1}:
1:20
2:20
3:20
4:20
5:20
6:20
7:20
8:20
9:20
10:20
Suppose I have a zoo object (or it could be a data.frame) that has an index on "time of day" and has some value (see sample data below):
val
...
2006-08-01 12:00 23
2006-08-01 12:01 24
2006-08-01 12:02 25
2006-08-01 12:03 26
2006-08-01 12:04 27
2006-08-01 12:05 28
2006-08-01 12:06 29
...
2006-08-02 12:00 123
2006-08-02 12:01 124
2006-08-02 12:02 125
2006-08-02 12:03 126
2006-08-02 12:04 127
...
I would like to call a custom function (call it custom.func(vals)) on the 12:01 - 12:03 interval (i.e. something similar to zoo::rollapply) every time that interval occurs, so in this example, daily. How would I do that?
NOTES (for robustness, it would also be great to take the following edge cases into account, but it's not necessary):
Don't assume that I have values for 12:01 - 12:03 every day
Don't assume that the entire range 12:01 - 12:03 is present every day. Some days I might only have 12:01 and 12:02 but might be missing 12:03
What if I wanted my custom.func(vals) to be called across day boundaries, e.g. using val from 23:58 - 00:12?
Suppose our input is the POSIXct zoo object z given in the Note at the end.
Create a character vector times which has one element per element of z and is in the form HH:MM. Then create a logical vector ok which indicates which times are between the indicated boundary values; z[ok] is then z reduced to those values. Finally, for each day, apply sum (you can use some other function if desired) using aggregate.zoo:
times <- format(time(z), "%H:%M")
ok <- times >= "12:01" & times <= "12:03"
aggregate(z[ok], as.Date, sum)
## 2006-08-01 2006-08-02
## 75 375
Times straddle midnight
This version is for the case where the times straddle midnight. Times at or after 23:58 are pushed to the next day's date, so an interval that crosses midnight is aggregated as a single group. Note that the order of values sent to the function is not the original order, but if the function is symmetric that does not matter.
times <- format(time(z), "%H:%M")
ok <- times >= "23:58" | times <= "00:12"
aggregate(z[ok], (as.Date(format(time(z))) + (times >= "23:58"))[ok], sum)
## 2006-08-02
## 41
Variation
The prior code chunk works if the function is symmetric in the components of its argument (which is the case for many functions such as mean and sum), but if the function were not symmetric we would need a slightly different approach. We define to.sec, which translates an HH:MM string to numeric seconds, and subtract to.sec("23:58") from each POSIXct datetime. Then the components of z to keep are those whose transformed times, converted to HH:MM character strings, are at most "00:14".
to.sec <- function(x) with(read.table(text = x, sep = ":"), 3600 * V1 + 60 * V2)
times <- format(time(z) - to.sec("23:58"), "%H:%M")
ok <- times <= "00:14"
aggregate(z[ok], as.Date(time(z)[ok] - to.sec("23:58")), sum)
## 2006-08-01
## 41
Note
Lines <- "datetime val
2006-08-01T12:00 23
2006-08-01T12:01 24
2006-08-01T12:02 25
2006-08-01T12:03 26
2006-08-01T12:04 27
2006-08-01T12:05 28
2006-08-01T12:06 29
2006-08-01T23:58 20
2006-08-02T00:01 21
2006-08-02T12:00 123
2006-08-02T12:01 124
2006-08-02T12:02 125
2006-08-02T12:03 126
2006-08-02T12:04 127"
library(zoo)
z <- read.zoo(text = Lines, tz = "", header = TRUE, format = "%Y-%m-%dT%H:%M")
EDIT
Have revised the non-symmetric code and simplified all code chunks.
I recommend the runner package, which allows you to compute any rolling function on irregular time series. The runner function is an equivalent of rollapply, with the distinction that the window can depend on dates. runner applies any R function over a window of length defined by k, indexed by idx (a date or any integer). The example below calculates a regression on a 5-minute (5 * 60 sec) window span. The algorithm doesn't care whether the window crosses a day boundary; it just computes over 5 minutes each time (for example 23:56 - 00:01).
Create data:
set.seed(1)
x <- cumsum(rnorm(1000))
y <- 3 * x + rnorm(1000)
time <- as.POSIXct(cumsum(sample(60:120, 1000, replace = TRUE)),
origin = Sys.Date()) # unequally spaced time series
data <- data.frame(time, y, x)
Custom function to be called on sliding windows:
library(runner)
running_regression <- function(idx) {
  # fit the regression only on the rows inside the current window
  model <- lm(y ~ x, data = data[idx, ])
  # return the fitted value at the window's last observation
  tail(predict(model), 1)
}
data$pred <- runner(seq_along(x),
k = 60 * 5,
idx = time,
f = running_regression)
Once we have created the dataset with rolling 5-minute predictions, we can keep only particular windows; here, the first minute of each hour. That means we always keep {hh}:56 - {hh+1}:01.
library(dplyr)
library(lubridate)
filtered <-
data %>%
filter(minute(time) == 1)
plot(data$time, data$y, type = "l", col = "red")
points(filtered$time, filtered$pred, col = "blue")
There are some other examples of how to do this with runner in the package vignette.
Mr. Ulrich's xts package is always phenomenal. I have always used split for the ordinary 5-minute, 15-minute, and 30-minute splits. No problems ever.
Now I'm stuck.
#Setup test data
library(xts)
my.time <- seq(from = as.POSIXct('2000-01-01 00:00:00'),
to = as.POSIXct('2000-01-01 1:00:00'),
by = '1 sec')
my.data <- rep(10, length = length(my.time))
my.xts <- as.xts(my.data, order.by = my.time)
#Now splitting and checking end times of the first split
tail((split(my.xts, f="minutes", k=20))[[1]])
tail((split(my.xts, f="minutes", k=30))[[1]])
#2000-01-01 00:19:59 10 #All good
#2000-01-01 00:29:59 10 #All good
tail((split(my.xts, f="minutes", k=22))[[1]])
#2000-01-01 00:11:59 10 #Hmmm, what am I missing? Expecting 00:21:59
#As endpoints is used by split I also checked this behaviour
endpoints(my.xts, on="minutes", k=20)
#[1] 0 1200 2400 3600 3601 #All good
endpoints(my.xts, on="minutes", k=30)
#[1] 0 1800 3600 3601 #All good
endpoints(my.xts, on="minutes", k=22)
#[1] 0 720 2040 3360 3601 #Hmmm
Trying to understand this, I dug further into the xts code at https://github.com/joshuaulrich/xts/blob/master/src/endpoints.c
There I found that this is supposed to be more efficient:
c(0,which(diff(_x%/%on%/%k+1) != 0),NROW(_x))
So I tried this
which(diff(.index(my.xts) %/% 60 %/% 20 +1) != 0)
#[1] 1200 2400 3600 #As expected
which(diff(.index(my.xts) %/% 60 %/% 21 +1) != 0)
#[1] 1080 2340 3600 #Expecting 1260 2520...
which(diff(.index(my.xts) %/% 60 %/% 22 +1) != 0)
#[1] 720 2040 3360 #Expecting 1320 2640...
which(diff(.index(my.xts) %/% 60 %/% 23 +1) != 0)
#[1] 720 2100 3480 #Expecting 1380 2760...
which(diff(.index(my.xts) %/% 60 %/% 24 +1) != 0)
#[1] 1440 2880 #As expected
which(diff(.index(my.xts) %/% 60 %/% 30 +1) != 0)
#[1] 1800 3600 #As expected
This is where my brain overheated and I posted here instead. I'm sure there is something I'm simply missing, so I haven't reported this as a bug on GitHub. Please help me understand what is going on. Why am I not getting the expected results?
EDIT:
So, after a quick think, I'm guessing this has to do with all of these functions being based on the start of Unix time, while I am using a bin width that does not divide an hour evenly. Is this a correct lead in my understanding?
EDIT2:
Posted my answer below after finally understanding how endpoints and split are supposed to work...
Of course, split (and endpoints) work exactly as they are supposed to within xts, i.e. splitting with 1970-01-01 00:00:00 as the start point when deciding the intervals.
And yes, I was misusing split as an easy way to split at an arbitrary start point of an hour within my time series xts data.
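To see the arithmetic (a quick sketch, assuming a UTC index so the numbers match the endpoints output above): 2000-01-01 00:00:00 sits 600 seconds into an epoch-aligned 22-minute bin, so the first boundary arrives after only 720 more seconds, i.e. at 00:11:59.
s0 <- as.numeric(as.POSIXct("2000-01-01 00:00:00", tz = "UTC"))
bin <- 22 * 60   # 22-minute bin width in seconds
s0 %% bin        # 600 -> seconds already elapsed in the current epoch-aligned bin
bin - s0 %% bin  # 720 -> seconds until the first boundary, i.e. 00:11:59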
Anyhow, I solved my little issue by writing this short function which "resets" the first timestamp to 1970-01-01 00:00:00.
## continuation from the code snippet above
#So, I realised it was all about resetting the first time
#to 1970-01-01 00:00:00,
#so that I could easily do my "strange" splitting.
#Please note that this is not to be used for dates,
#only when working on hours, mins, or secs,
#and when you don't care that you break the index.
startMovedSplit <- function(x, ...) {
startIndex <- head(.index(x), n=1)
.index(x) <- .index(x) - startIndex
split(x, ...)
}
#New Try
tail((startMovedSplit(my.xts, f="minutes", k=20))[[1]])
tail((startMovedSplit(my.xts, f="minutes", k=30))[[1]])
#1970-01-01 00:19:59 10 #Good enough for my purposes
#1970-01-01 00:29:59 10 #Good enough for my purposes
tail((startMovedSplit(my.xts, f="minutes", k=22))[[1]])
#1970-01-01 00:21:59 10 #Good enough for my purposes
And the best part of all my misunderstanding of the xts library? I now know how to handle the ellipsis (...) within functions and how to pass it on to subfunctions!
I wrote a formula based on a weekday-counting algorithm (found on Stack Exchange as well, great job guys). Here is the code snippet:
countwd <- function(start, end, day){
x <- seq(start, end, by=1)
y <- weekdays(x, TRUE)
sum(y==day)
}
x$OFFDAY <- NULL
for(i in 1:nrow(x)){
x$OFFDAY[i] <- countwd(x$PICK_DATE[i], x$SHIP_DATE[i], "Mon")
}
This is way too slow (the loop proceeds at 2-4 rows per second!), and I have millions of entries for each month.
Here is my attempt at vectorising the function:
x$OFFDAY <- countwd(x$PICK_DATE, x$SHIP_DATE, "Mon")
Shows this error:
Error in seq.POSIXt(start, end, by = 1) : 'from' must be of length 1
I cannot understand how to apply the "apply" family of functions in this case, as I have two vectors to compare (yes, I am really new to this).
Sample Data:
PICK_DATE SHIP_DATE
01-APR-2017 00:51 02-APR-2017 06:55 AM
01-APR-2017 00:51 02-APR-2017 12:11 PM
01-APR-2017 00:51 02-APR-2017 12:11 PM
01-APR-2017 00:51 02-APR-2017 09:39 AM
I have converted these to POSIXct, and the formula works well for individual values (it returns the second value though, no idea why; however, I can work around that):
> countwd(x$PICK_DATE[1], x$SHIP_DATE[1], "Mon")
[1] 0
An easy way to vectorize a function of multiple varying inputs is to use mapply:
mapply(countwd, x$PICK_DATE, x$SHIP_DATE, "Mon")
Or, alternatively, you can use sapply and pass a sequence of indices as the first argument (this way the syntax is very similar to a for loop):
sapply(1:nrow(x), function(i) countwd(x$PICK_DATE[i], x$SHIP_DATE[i], "Mon"))
The main inefficiency in your case, however, stems from the countwd function. Notice that you are passing POSIXct vectors to the function. Thus, when seq is called in the first line of the function, the by argument is taken to be seconds instead of days! This leads to generating needlessly large vectors (see ?seq.POSIXt for details).
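To see the difference (a quick sketch; any two POSIXct values two days apart behave the same, UTC used here to avoid DST effects):
a <- as.POSIXct("2017-04-01", tz = "UTC")
b <- as.POSIXct("2017-04-03", tz = "UTC")
length(seq(a, b, by = 1))      # 172801 elements -- one per second
length(seq(a, b, by = "day"))  # 3 elements -- one per day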
Changing countwd in the following way should greatly improve performance:
countwd <- function(start, end, day) {
x <- seq(start, end, by="day")
y <- weekdays(x, TRUE)
sum(y==day)
}
Also note that weekdays is locale-specific and may not work as intended depending on your locale settings.
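If locale settings are a concern, a locale-independent variant (a sketch, not part of the original answer) can compare POSIXlt's numeric wday field instead of weekday names:
countwd2 <- function(start, end, wday) {
  x <- seq(start, end, by = "day")
  # POSIXlt$wday is numeric and locale-independent: 0 = Sunday ... 6 = Saturday
  sum(as.POSIXlt(x)$wday == wday)
}
countwd2(x$PICK_DATE[1], x$SHIP_DATE[1], 1)  # Mondays -> wday == 1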
Based on @demirev's answer and my comments above, here is a worked example using the improved countwd function and mapply. I put in a few helper columns using lubridate to check the solution, and changed some of the dates so that df$off_days would contain values other than zero.
library(lubridate)
df <- data.frame(pick_date = c(rep("01-APR-2017 00:51", 4)), ship_date = c("05-APR-2017 06:55", "09-APR-2017 12:11", "30-APR-2017 12:11", "02-MAY-2017 12:11"))
df$pick_date <- lubridate::dmy_hm(df$pick_date)
df$ship_date <- lubridate::dmy_hm(df$ship_date)
df$pick_day <- wday(df$pick_date, label = T)
df$ship_day <- wday(df$ship_date, label = T)
df$days_between <- interval(df$pick_date, df$ship_date) %/% days()
countwd <- function(start, end, day) {
x <- seq(start, end, by="day")
y <- weekdays(x, TRUE)
sum(y==day)
}
df$off_days <- mapply(countwd, df$pick_date, df$ship_date, "Mon")
df
pick_date ship_date pick_day ship_day days_between off_days
1 2017-04-01 00:51:00 2017-04-05 06:55:00 Sat Wed 4 1
2 2017-04-01 00:51:00 2017-04-09 12:11:00 Sat Sun 8 1
3 2017-04-01 00:51:00 2017-04-30 12:11:00 Sat Sun 29 4
4 2017-04-01 00:51:00 2017-05-02 12:11:00 Sat Tues 31 5
This question asks about aggregation by time period in R, which pandas calls resampling. The most useful answer uses the xts package to group by a given time period, applying some function such as sum() or mean().
One of the comments suggested there was something similar in lubridate, but didn't elaborate. Can someone provide an idiomatic example using lubridate? I've read through the lubridate vignette a couple times and can imagine some combination of lubridate and plyr, however I want to make sure there isn't an easier way that I'm missing.
To make the example more real, let's say I want the daily sum of bicycles traveling northbound from this dataset:
library(lubridate)
library(reshape2)
bikecounts <- read.csv(url("http://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD"), header=TRUE, stringsAsFactors=FALSE)
names(bikecounts) <- c("Date", "Northbound", "Southbound")
Data looks like this:
> head(bikecounts)
Date Northbound Southbound
1 10/02/2012 12:00:00 AM 0 0
2 10/02/2012 01:00:00 AM 0 0
3 10/02/2012 02:00:00 AM 0 0
4 10/02/2012 03:00:00 AM 0 0
5 10/02/2012 04:00:00 AM 0 0
6 10/02/2012 05:00:00 AM 0 0
I don't know why you'd use lubridate for this. If you're just looking for something less awesome than xts, you could try this:
tapply(bikecounts$Northbound, as.Date(bikecounts$Date, format="%m/%d/%Y"), sum)
Basically, you just need to split by Date, then apply a function.
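For example, here are the same daily sums via an explicit split-then-apply (a sketch, equivalent to the tapply call above):
# one vector of Northbound counts per calendar day, then sum each group
groups <- split(bikecounts$Northbound,
                as.Date(bikecounts$Date, format = "%m/%d/%Y"))
daily_nb <- sapply(groups, sum)
head(daily_nb)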
lubridate could be used for creating a grouping factor for split-apply problems. So, for example, if you want the sum for each month (ignoring year)
tapply(bikecounts$Northbound, month(mdy_hms(bikecounts$Date)), sum)
But it's just using wrappers for base R functions, and in the case of the OP, I think the base R function as.Date is the easiest (as evidenced by the fact that the other answers also ignored your request to use lubridate ;-) ).
Something that wasn't covered by the Answer to the other Question linked to in the OP is split.xts. period.apply splits an xts at endpoints and applies a function to each group. You can find endpoints that are useful for a given task with the endpoints function. For example, if you have an xts object, x, then endpoints(x, "months") would give you the row numbers that are the last row of each month. split.xts leverages that to split an xts object -- split(x, "months") would return a list of xts objects where each component was for a different month.
Although split.xts() and endpoints() are primarily intended for xts objects, they also work on some other objects, including plain time-based vectors. Even if you don't want to use xts objects, you may still find uses for endpoints() because of its convenience or its speed (it's implemented in C):
> split.xts(as.Date("1970-01-01") + 1:10, "weeks")
[[1]]
[1] "1970-01-02" "1970-01-03" "1970-01-04"
[[2]]
[1] "1970-01-05" "1970-01-06" "1970-01-07" "1970-01-08" "1970-01-09"
[6] "1970-01-10" "1970-01-11"
> endpoints(as.Date("1970-01-01") + 1:10, "weeks")
[1] 0 3 10
I think lubridate's best use in this problem is for parsing the "Date" strings into POSIXct objects, i.e. the mdy_hms function in this case.
Here's an xts solution that uses lubridate to parse the "Date" strings.
library(xts)
x <- xts(bikecounts[, -1], mdy_hms(bikecounts$Date))
period.apply(x, endpoints(x, "days"), sum)
apply.daily(x, sum) # identical to above
For this specific task, xts also has an optimized period.sum function (written in Fortran) that is very fast
period.sum(x, endpoints(x, "days"))
Using ddply from plyr package:
library(plyr)
bikecounts$Date<-with(bikecounts,as.Date(Date, format = "%m/%d/%Y"))
x<-ddply(bikecounts,.(Date),summarise, sumnorth=sum(Northbound),sumsouth=sum(Southbound))
> head(x)
Date sumnorth sumsouth
1 2012-10-02 1165 773
2 2012-10-03 1761 1760
3 2012-10-04 1767 1708
4 2012-10-05 1590 1558
5 2012-10-06 926 1080
6 2012-10-07 951 1191
> tail(x)
Date sumnorth sumsouth
298 2013-07-26 1964 1999
299 2013-07-27 1212 1289
300 2013-07-28 902 1078
301 2013-07-29 2040 2048
302 2013-07-30 2314 2226
303 2013-07-31 2008 2076
Here is an option using data.table
after importing the csv:
library(data.table)
# convert the data.frame to data.table
bikecounts <- data.table(bikecounts)
# Calculate
bikecounts[, list(NB=sum(Northbound), SB=sum(Southbound)), by=as.Date(Date, format="%m/%d/%Y")]
as.Date NB SB
1: 2012-10-02 1165 773
2: 2012-10-03 1761 1760
3: 2012-10-04 1767 1708
4: 2012-10-05 1590 1558
5: 2012-10-06 926 1080
---
299: 2013-07-27 1212 1289
300: 2013-07-28 902 1078
301: 2013-07-29 2040 2048
302: 2013-07-30 2314 2226
303: 2013-07-31 2008 2076
Note, you can also use fread() ("fast read") from the data.table package to read the CSV into a data.table in one step.
The only drawback is that you have to convert the date/time from a string manually, e.g.:
bikecounts <- fread("http://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD", header=TRUE, stringsAsFactors=FALSE)
setnames(bikecounts, c("Date", "Northbound", "Southbound"))
bikecounts[, Date := as.POSIXct(Date, format="%m/%d/%Y %I:%M:%S %p")]
Here is the requested lubridate solution, which I also added to the linked question. It uses a combination of lubridate and zoo's aggregate() for these operations:
ts.month.sum <- aggregate(zoo.ts, month, sum)
ts.daily.mean <- aggregate(zoo.ts, day, mean)
ts.mins.mean <- aggregate(zoo.ts, minute, mean)
Obviously, you need to first convert your data to a zoo() object, which is easy enough. You can also use yearmon() or yearqtr(), or custom functions for both split and apply. This method is as syntactically sweet as that of pandas.
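For instance, a minimal sketch of that conversion using the bikecounts data from above (assuming lubridate's mdy_hms for parsing, as in the earlier answer):
library(zoo)
library(lubridate)
# index the Northbound counts by their parsed timestamps
zoo.ts <- zoo(bikecounts$Northbound, mdy_hms(bikecounts$Date))
# monthly sums, as in the first line above (month() ignores the year)
ts.month.sum <- aggregate(zoo.ts, month, sum)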
I am familiar with the zoo function rollapply, which allows you to do rolling computations on zoo or xts objects, and lets you specify the rolling increment via the by parameter. I am specifically interested in applying a function every month, but using all of the past daily data in the computation. For example, say my data set looks like this:
dte, val
1/01/2001, 10
1/02/2001, 11
...
1/31/2001, 2
2/01/2001, 54
2/02/2001, 34
...
2/28/2001, 29
I would like to select the end of each month and apply a function that uses all the daily data up to that point. This doesn't seem like it would work with rollapply, since the by argument would be 31 for some months, 30 for others, etc. My current idea is:
f <- function(xts_obj) { coef(lm(a ~ b, data=as.data.frame(xts_obj)))[1] }
month_ends <- index(my_xts)[endpoints(my_xts, on="months", k=1)]
rslt <- sapply(month_ends, function(d) f(my_xts[paste0("/", d)]))
Surely there is a better way to do this that would be quicker, no?
To clarify: I would like to use overlapping periods; only the rolling should be done monthly.
If I understand correctly, you can get the dates of your endpoints, then for each endpoint (i.e. using lapply or a for loop), call rollapply using data up to that point.
getSymbols("SPY", src='yahoo', from='2012-01-01', to='2012-08-01')
idx <- index(SPY)[endpoints(SPY, 'months')]
out <- lapply(idx, function(i) {
as.xts(rollapplyr(as.zoo(SPY[paste0("/", i)]), 5,
function(x) coef(lm(x[, 4] ~ x[, 1]))[2], by.column=FALSE))
})
sapply(out, NROW)
#[1] 16 36 58 78 100 121 142 143
I temporarily coerce to zoo for rollapplyr to make sure the rollapply.zoo method is used (as opposed to the unexported rollapply.xts method), then coerce back to xts.
As an answer to "Is the zoo/xts conversion needed?": it isn't needed in this case, but rollapply won't work if you send it a data frame, as I recently discovered from this StackOverflow answer.
You want period.apply(), or its convenience helper apply.monthly(), both in xts.
Example:
R> foo <- xts(1:100, order.by=Sys.Date()+0:99)
R> apply.monthly(foo, sum)
[,1]
2012-08-31 105
2012-09-30 885
2012-10-31 1860
2012-11-25 2200
R>
or equally
R> apply.monthly(foo, quantile)
0% 25% 50% 75% 100%
2012-08-31 1 4.25 7.5 10.75 14
2012-09-30 15 22.25 29.5 36.75 44
2012-10-31 45 52.50 60.0 67.50 75
2012-11-25 76 82.00 88.0 94.00 100
R>
just to prove that functions returning more than one value can be used too.
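To tie this back to the original expanding-window requirement (all past daily data at each month end), a minimal sketch using endpoints() on the foo object from above:
# expanding window: apply a function to all data up to each month end
ep <- endpoints(foo, on = "months")  # row numbers of month ends, starting with 0
res <- sapply(ep[-1], function(i) sum(foo[1:i]))
res  # one value per month end, each computed on all data so far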