Let's say I have a set of partly overlapping intervals:
require(lubridate)
date1 <- as.POSIXct("2000-03-08 01:59:59")
date2 <- as.POSIXct("2001-02-29 12:00:00")
date3 <- as.POSIXct("1999-03-08 01:59:59")
date4 <- as.POSIXct("2002-02-29 12:00:00")
date5 <- as.POSIXct("2000-03-08 01:59:59")
date6 <- as.POSIXct("2004-02-29 12:00:00")
int1 <- interval(date1, date2) # interval() supersedes the deprecated new_interval()
int2 <- interval(date3, date4)
int3 <- interval(date5, date6)
Does anyone have an idea how one could construct a time series plot that provides, for every point in time, the number of overlapping intervals at that point?
So, for instance, to take the above example: for a given date in January 2000, the function I'm looking for would return "1" (the date falls only within int2), while for a date in January 2001 it would return "3" (since that date falls within int1, int2, and int3). Etc.
Any ideas?
Here's one way using the foverlaps() function from the data.table package:
Please install the development version (1.9.5 or later) by following the installation instructions, as a bug that affected overlap joins on numeric types has been fixed there.
require(data.table) ## 1.9.5+
intervals = data.table(start = c(date1, date3, date5),
end = c(date2, date4, date6))
# assuming your query is:
query = as.POSIXct(c("2000-01-01 00:00:00", "2001-01-01 00:00:00"))
We'll construct the query data.table with both start and end intervals as well:
querydt = data.table(start=query, end=query) # identical start,end
Then we can use foverlaps() as follows:
setkeyv(intervals, c("start", "end"))
ans = foverlaps(querydt, intervals, which=TRUE, nomatch=0L, type="within")
# xid yid
# 1: 1 1
# 2: 2 1
# 3: 2 2
# 4: 2 3
We first set the key, which sorts the data.table intervals by the given columns in increasing order and marks those columns as the key columns on which we want to perform the overlap join.
Then we use foverlaps() to find which intervals in querydt fall within (type = "within") the intervals in intervals. In this case, querydt consists of just points, since its start and end values are identical. nomatch=0L removes rows with no matches and which=TRUE returns matching indices instead of the merged result, so we get all matching index pairs for the rows in querydt that fall within intervals.
Now all we have to do is aggregate by xid and count the number of observations:
ans[, .N, by=xid]
# xid N
# 1: 1 1
# 2: 2 3
Check ?foverlaps for more info.
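To get the full time series the question asks for, here's a minimal sketch (assuming the intervals above) that counts overlaps on a regular grid with lubridate's %within% operator and plots the result as a step function:
# count, at each grid point, how many of int1..int3 contain it
grid <- seq(date3, date6, by = "1 day")
counts <- (grid %within% int1) + (grid %within% int2) + (grid %within% int3)
plot(grid, counts, type = "s", xlab = "time", ylab = "number of overlapping intervals")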
Each row records the start time and end time of a time period. To simplify, we assume all of them fall on the same day, so we don't need to deal with trickier issues.
I need to calculate how long each row's time period overlaps with the other rows, and the number of rows it overlaps with. Suppose I have a data set; I want columns D and E (num_overlap and time_overlap below).
There is no real sample data, so I use this fake data set as an example.
dat <- data.frame(id=1:4,
Start = c("02:50","02:55","03:15","03:25"),
End = c("03:10", "03:05", "03:20", "03:30"),
num_overlap = c(1,1,0,0),
time_overlap = c(10,10,0,0))
dat
Here's a solution using the foverlaps() function from the data.table package:
Packages:
library(data.table)
library(chron) # To convert the times provided into a `times` format
Your data:
dat <- data.frame(id=1:4,
Start = c("02:50","02:55","03:15","03:25"),
End = c("03:10", "03:05", "03:20", "03:30"))
I assume your times represent hours and minutes, so we need to append the seconds (":00") and convert to a times object:
dat$Start = times(paste0(dat$Start, ":00"))
dat$End = times(paste0(dat$End,":00"))
We will use the data.table package, so we coerce the data frame to a data.table object:
setDT(dat)
Set the Start and End time columns as keys for merging, then call the foverlaps function:
setkey(dat, "Start", "End")
# Merge onto itself to find overlaps
overlaps = foverlaps(dat, dat, type = "any")
(See more on foverlaps here at https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/foverlaps)
foverlaps keeps cases where a row matched itself (id == i.id), so we remove them:
overlaps = overlaps[id != i.id]
Next, compute the amount of overlap for each overlap that occurred:
overlaps[, time_overlap := pmin(i.End, End) - pmax(i.Start, Start)]
Convert to minutes per https://stackoverflow.com/a/29067748/3674399
overlaps[, time_overlap := 60 * 24 * as.numeric(time_overlap)]
Keep only id and time_overlap, and summarize by id
overlaps = overlaps[, list(id, time_overlap)]
overlaps = overlaps[, list(num_overlap = .N, time_overlap = sum(time_overlap)), by = "id"]
Finally, merge with the original dataset and fill any missing values:
result = merge(dat, overlaps, by = "id", all.x = TRUE)
result[is.na(num_overlap), num_overlap := 0]
result[is.na(time_overlap), time_overlap := 0]
The result is as follows:
id Start End num_overlap time_overlap
1: 1 02:50:00 03:10:00 1 10
2: 2 02:55:00 03:05:00 1 10
3: 3 03:15:00 03:20:00 0 0
4: 4 03:25:00 03:30:00 0 0
If you are new to the data.table syntax, please see for example, this intro here: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
library(data.table)
library(QuantTools)
date_from <- '2018-11-01'
date_to <- '2018-11-30'
ticker <- 'SPFB.RTS'
# get days
dataDaily <- get_finam_data(ticker, date_from, date_to, 'day')
# get hours
dataHourly <- get_finam_data(ticker, date_from, date_to, 'hour')
# percent change of the day
dataDaily$pc <- ((dataDaily$close - dataDaily$open)/dataDaily$open)*100
# mark days with > 2 percent change
dataDaily$isBigCh <- dataDaily$pc > 2
So, I have the code above, which downloads daily/hourly OHLC data for a futures contract.
Questions:
1) How can I move the marks from dataDaily$isBigCh to dataHourly? It doesn't seem easy, because these data frames have different time formats and different numbers of rows.
dataHourly$time # has a format like this 2018-11-09 23:00:00
dataDaily$date # has a format like this 2018-11-09
2) How can I select the first bar of the day in dataHourly$time?
Here is your code, slightly modified for readability:
# percent change of the day
dataDaily[, price_change := ( close / open - 1 ) * 100 ]
# mark days with > 2 percent change
dataDaily[, isBigCh := price_change > 2 ]
Question 1
# add date column to hourly data
# note that 00:00 time corresponds to 23:00-00:00 candle
dataHourly[, date := as.Date( time - as.difftime( '01:00:00' ) ) ]
# copy dataDaily isBigCh to dataHourly isBigChDaily
dataHourly[ dataDaily, isBigChDaily := isBigCh, on = 'date' ]
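As a quick check (a sketch, not part of the question), you can now filter the hourly bars that fall on big-change days:
# keep only hourly bars belonging to days with > 2 percent change
dataHourly[ isBigChDaily == TRUE ]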
Question 2
# select first bar of the day
dataHourly[, .SD[1], by = date ]
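An equivalent sketch using row indices, which is often faster on large tables:
# .I[1] gives the global row number of each group's first row
dataHourly[ dataHourly[, .I[1], by = date ]$V1 ]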
Optionally
# remove date column from hourly data
dataHourly[, date := NULL ]
Note
library(data.table) is not necessary, as QuantTools loads it automatically
please read the data.table manual; it will save you lots of time when figuring out simple manipulations like the ones you asked about
I have two columns of dates. Two example dates are:
Date1= "2015-07-17"
Date2="2015-07-25"
I am trying to count the number of Saturdays and Sundays between two dates, each of which is in its own column (columns 5 and 7 in this example code). I need to repeat this process for each row of my dataframe. The end result will be one column that represents the number of Saturdays and Sundays within the date range defined by the two date columns.
I can get the code to work for one row:
sum(weekdays(seq(Date1[1,5], Date2[1,7], "days")) %in% c("Saturday", "Sunday"))
The answer to this will be 3. But if I take the "1" out of the row position of Date1 and Date2, I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go row by row and get one vector that lists the number of Saturdays and Sundays between the two dates in columns 5 and 7, without using a loop? Another issue is that I have 2 million rows, so I'm looking for something with a little more speed than a loop.
Thank you!!
The map2* functions from the purrr package are a good way to go. They take two vector inputs (e.g., two date columns) and apply a function to them in parallel. They're pretty fast too!
Here's an example. Note that the _int suffix requests an integer vector back.
library(purrr)
# Example data
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
# Wrapper function to compute number of weekend days between dates
n_weekend_days <- function(date_1, date_2) {
sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday",'Sunday'))
}
# Iterate row wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#> Date1 Date2 end_days
#> 1 2015-07-17 2015-07-25 3
#> 2 2015-07-28 2015-08-14 4
#> 3 2015-08-15 2015-08-20 2
Here is a solution that uses dplyr to clean things up. It's not too difficult to use with() to assign the columns in the dataframe directly.
Essentially, pick a reference date and count the number of full weeks elapsed since it (using floor), plus any partial-week remainder that covers a Saturday or Sunday; then take the difference between the counts for the two dates. Note the code does not handle cases in which the start date or end date itself falls on a Saturday or Sunday.
# weekdays(as.Date(0,"1970-01-01")) -> "Friday"
require(dplyr)
startDate = as.Date(0,"1970-01-01") # this is a friday
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start, format = "%Y-%m-%d")
df$end <- as.Date(df$end, format = "%Y-%m-%d")
# you can use with to define the columns directly instead of %>%
df <- df %>%
mutate(originDate = startDate) %>%
mutate(startDayDiff = as.numeric(start-originDate), endDayDiff = as.numeric(end-originDate)) %>%
mutate(startWeekDiff = floor(startDayDiff/7),endWeekDiff = floor(endDayDiff/7)) %>%
mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7>=1,1,0),
NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7>=2,1,0),
NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 1,1,0),
NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2,1,0)
) %>%
mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
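For the example row, this yields NumSats = 2 and NumSuns = 1 (assuming the Friday origin above), i.e. 3 weekend days in total, matching the other answers:
df[, c("NumSats", "NumSuns")]
#   NumSats NumSuns
# 1       2       1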
Dates are number of days since 1970-01-01, a Thursday.
So the following is the number of Saturdays or Sundays since that date
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
These are all vectorized functions so you can just call directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
Apply it as follows:
within(d, end_days<-g(Date1,Date2))
# Date1 Date2 end_days
# 1 2015-07-17 2015-07-25 3
# 2 2015-07-28 2015-08-14 4
# 3 2015-08-15 2015-08-20 2
Suppose I have a series of observations representing date intervals, e.g.
library(dplyr)
library(magrittr)
df <-
tibble(start = as.Date(c('2000-01-01', '2000-01-03', '2000-01-08',
'2000-01-20', '2000-01-22')),
end = as.Date(c('2000-01-02', '2000-01-05', '2000-01-10',
'2000-01-21', '2000-02-10')))
I would like to group these observations such that the start time of observation n occurs within some specified interval following the end date of observation n-1. For instance, if we set that interval to be 5 days, we would see something like:
# start end group
# (date) (date) (dbl)
# 1 2000-01-01 2000-01-02 1
# 2 2000-01-03 2000-01-05 1
# 3 2000-01-08 2000-01-10 1
# 4 2000-01-20 2000-01-21 2
# 5 2000-01-22 2000-02-10 2
(For the sake of simplicity, I'm assuming no overlap in dates, although this isn't necessarily the case in the data). I thought about using igraph to create a weighted edgelist, but that seemed overly complicated. Efficiency is, I believe, important: I'll be running this on roughly 4 million groups of data of about 5-10 rows each.
While my solution does work, to me it seems error-prone, slow, and clunky. I'm thinking using a package or some vectorization would really improve matters.
group_dates <- function(df, interval){
# assign first date to first group
df %<>% arrange(start, end)
df[1, 'group'] <- 1
# for each start date, determine if it is within `interval` days of the
# closest end date
lapply(df$start[-1], function(cur_start){
earlier_data <- df[df$end <= cur_start, ]
diffs <- cur_start - earlier_data$end
min_interval <- diffs[which.min(diffs)]
closest_group <- earlier_data$group[which.min(diffs)]
if(min_interval <= interval){
df[df$start == cur_start, 'group'] <<- closest_group
} else {
df[df$start == cur_start, 'group'] <<- closest_group + 1
}
})
return(df)
}
You can do that relatively easily with dplyr.
The idea is the following:
Lag the end date (shifting it down by one row)
Calculate the difference between the start date and the lagged end date
Add a breakpoint indicator (new_interval in the code below): TRUE when the difference is more than 5 days and FALSE otherwise
Calculate the cumulative sum of this indicator; it increments by 1 at every breakpoint, i.e. every time a new interval should start
Something like this should work for you:
df %>%
mutate(lagged_end = lag(end),
diff = start - lagged_end,
new_interval = diff > 5,
new_interval = ifelse(is.na(new_interval), FALSE, new_interval),
interval_number = cumsum(new_interval))
This should also be quite quick, since it's all in dplyr. (If you want groups numbered from 1 as in the example output, add 1 to interval_number.)
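For reference, here's the same idea as a compact sketch (assuming df is sorted by start), with groups numbered from 1:
df %>%
  mutate(group = cumsum(coalesce(start - lag(end) > 5, FALSE)) + 1)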
This isn't as elegant as Lorenzo Rossi's solution, but offers a slightly different approach using cut.Date and 2 lines of code:
breakpoints <- c(FALSE, sapply(2:nrow(df), function(x) df$start[x] - df$end[x - 1]) > 5)
clusterLabels <- as.numeric(cut.Date(df$start, c(min(df$start), df$start[breakpoints], max(df$start) + 1)))
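A quick check against the expected output above (a sketch; tibble printing will look slightly different):
df$group <- clusterLabels
df
#        start        end group
# 1 2000-01-01 2000-01-02     1
# 2 2000-01-03 2000-01-05     1
# 3 2000-01-08 2000-01-10     1
# 4 2000-01-20 2000-01-21     2
# 5 2000-01-22 2000-02-10     2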
I have the following example:
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"),
to = as.POSIXct("2010-10-10 22:00"), by = 3600)
Dat <- data.frame(DateTime = Date1,
t = rnorm(length(Date1)))
I would like to find the range of values in a given day (i.e. maximum - minimum).
First, I've defined additional columns which define the unique days in terms of the date and in terms of the day of year (doy).
Dat$date <- format(Dat$DateTime, format = "%Y-%m-%d") # find the unique days
Dat$doy <- as.numeric(format(Dat$DateTime, format="%j")) # find the unique days
To then find the range I tried
by(Dat$t, Dat$doy, function(x) range(x))
but this returns the range as two values, not a single value. So my question is: how do I find the calculated range for each day and return it in a data.frame of the form
new_data <- data.frame(date = unique(Dat$date),
range = ...)
Can anyone suggest a method for doing this?
I tend to use tapply for this kind of thing. ave is also useful sometimes. Here:
> dr = tapply(Dat$t,Dat$doy,function(x){diff(range(x))})
Always check tricksy stuff:
> dr[1]
121
3.084317
> diff(range(Dat$t[Dat$doy==121]))
[1] 3.084317
Use the names attribute to get the day-of-year and the values to make a data frame:
> new_data = data.frame(date=names(dr),range=dr)
> head(new_data)
date range
121 121 3.084317
122 122 4.204053
Did you want to convert the day-of-year numbers back to date objects?
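If so, here's a sketch, assuming all observations fall in 2010 as in the example (day-of-year 1 corresponds to 2010-01-01):
> new_data$date = as.Date(as.numeric(names(dr)) - 1, origin = "2010-01-01")
> head(new_data$date, 2)
[1] "2010-05-01" "2010-05-02"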
# Use the data.table package
require(data.table)
# Set seed so data is reproducible
set.seed(42)
# Create data.table
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"), to = as.POSIXct("2010-10-10 22:00"), by = 3600)
DT <- data.table(date = as.IDate(Date1), t = rnorm(length(Date1)))
# Set key on data.table so that it is sorted by date
setkey(DT, "date")
# Make a new data.table with the required information (can be used as a data.frame)
new_data <- DT[, diff(range(t)), by = date]
# date V1
# 1: 2010-05-01 4.943101
# 2: 2010-05-02 4.309401
# 3: 2010-05-03 4.568818
# 4: 2010-05-04 2.707036
# 5: 2010-05-05 4.362990
# ---
# 159: 2010-10-06 2.659115
# 160: 2010-10-07 5.820803
# 161: 2010-10-08 4.516654
# 162: 2010-10-09 4.010017
# 163: 2010-10-10 3.311408
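If you'd rather have a named column than the default V1, one option is:
# name the computed column explicitly
new_data <- DT[, .(range = diff(range(t))), by = date]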