Extracting highest value in some unequal periods/time-series - r

I have two data frames: period_example (with columns Beg and End) and price_example (with columns Date and High). I want the highest value of High within each Beg-End period. How can I do this? Thank you.
Here is the data:
period_example <- data.frame(Beg = as.Date(c("2000-01-01","2000-01-04","2000-01-09")),
End = as.Date(c("2000-01-03","2000-01-08","2000-01-12")))
price_example <- data.frame(Date = seq(as.Date("2000-01-01"), as.Date("2000-01-12"), by="days"),
High = c(100,105,104,103,102,106,107,108,109,110,115,114))
The result should be like this:
result <- data.frame(Beg = as.Date(c("2000-01-01","2000-01-04","2000-01-09")),
End = as.Date(c("2000-01-03","2000-01-08","2000-01-12")),
High = c(105,108,115))

I think I found a solution for this problem: apply a function to each row of period_example and take the maximum of High between those dates in the other data frame:
period_example <- data.frame(Beg = as.Date(c("2000-01-01","2000-01-04","2000-01-09")),End = as.Date(c("2000-01-03","2000-01-08","2000-01-12")))
price_example <- data.frame(Date = seq(as.Date("2000-01-01"), as.Date("2000-01-12"),by="days"), High = c(100,105,104,103,102,106,107,108,109,110,115,114))
period_example$High <- apply(period_example,1 , function(x) max(price_example[price_example$Date >= x[1] & price_example$Date <= x[2], "High"]))
> period_example
Beg End High
1 2000-01-01 2000-01-03 105
2 2000-01-04 2000-01-08 108
3 2000-01-09 2000-01-12 115
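Note that apply() converts each row to a character vector before calling the function, so the comparison above relies on R turning those strings back into dates. A minimal sketch of a variant that passes the Date columns directly through mapply() is shown below; the helper name max_high_between is hypothetical and not part of the original answer.

```r
period_example <- data.frame(
  Beg = as.Date(c("2000-01-01", "2000-01-04", "2000-01-09")),
  End = as.Date(c("2000-01-03", "2000-01-08", "2000-01-12"))
)
price_example <- data.frame(
  Date = seq(as.Date("2000-01-01"), as.Date("2000-01-12"), by = "days"),
  High = c(100, 105, 104, 103, 102, 106, 107, 108, 109, 110, 115, 114)
)

# Hypothetical helper: highest High with Date between beg and end (inclusive)
max_high_between <- function(beg, end) {
  in_period <- price_example$Date >= beg & price_example$Date <= end
  max(price_example$High[in_period])
}

# mapply() walks Beg and End in parallel, one period per call
period_example$High <- mapply(max_high_between,
                              period_example$Beg, period_example$End)
period_example
```

If a period contains no price rows, max() returns -Inf with a warning, so real data may need an explicit check for empty periods.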

data.table has a fast function for this: foverlaps.
library(data.table)
x = setDT(period_example)
y = setDT(price_example)
y[, `:=` (Beg = Date, End = Date)]
setkey(x, Beg, End)
z = foverlaps(y, x)
z[, .(High = max(High)), by = .(Beg, End)]
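As an alternative to foverlaps, data.table also supports non-equi joins, which avoid adding the dummy Beg and End columns to the price table. The following is only a sketch under the assumption that each period should include both of its end dates; the names period_dt, price_dt and highs are illustrative.

```r
library(data.table)

period_dt <- data.table(
  Beg = as.Date(c("2000-01-01", "2000-01-04", "2000-01-09")),
  End = as.Date(c("2000-01-03", "2000-01-08", "2000-01-12"))
)
price_dt <- data.table(
  Date = seq(as.Date("2000-01-01"), as.Date("2000-01-12"), by = "days"),
  High = c(100, 105, 104, 103, 102, 106, 107, 108, 109, 110, 115, 114)
)

# For each row of period_dt, match the price rows with Beg <= Date <= End
# and take the maximum High within that match (by = .EACHI)
highs <- price_dt[period_dt,
                  on = .(Date >= Beg, Date <= End),
                  .(High = max(High)),
                  by = .EACHI]

# The join result has one row per period, in the order of period_dt
period_dt[, High := highs$High]
period_dt
```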

This should work
period_example <- data.frame(Beg = as.Date(c("2000-01-01","2000-01-04","2000-01-09")),End = as.Date(c("2000-01-03","2000-01-08","2000-01-12")))
price_example <- data.frame(Date = seq(as.Date("2000-01-01"), as.Date("2000-01-12"),by="days"), High = c(100,105,104,103,102,106,107,108,109,110,115,114))
betweenDates <- function(target,beg,end){
beg <- as.Date(beg)
end <- as.Date(end)
target <- as.Date(target)
return(target>=beg&target<=end)
}
selecteDates <- sapply(price_example$Date,function(x) betweenDates(x,period_example$Beg,period_example$End))
highValues <- sapply(1:nrow(period_example),function(x) max(price_example$High[selecteDates[x,]]))
result <- data.frame(period_example,High=highValues)

Related

Avoid for loop using data.table

I have a simulation over time (dev_quarters) that looks like this, which is a data.table :
simulation <- data.table(`Scenario ID` = 1, dev_quarter = seq(1:80), brand = 1, proportion = runif(80))
For each scenario, we have n_brand, n_scenario and a proportion.
I am trying to code the following: for each scenario and each brand, compute the difference in proportion between the beginning and the end of the year, for each year.
I built the following lookup table to recover the dev_quarters corresponding to each year:
x <- 2002:2021
lookup_T <- as.integer(format(Sys.Date(), "%Y"))
lookup_period <- data.table(years = lookup_T-x+1, quarters_t = (lookup_T-x+1)*4, quarters_t1 = (lookup_T-x+2)*4)
With a small example
n_scenario <- 1
n_brand <- 10
Here is my ugly code, which uses for loops:
result <- data.table(`Scenario ID` = numeric(), years = numeric(), brand = numeric(), proportion = numeric())
for(i in 1:n_scenario){
for(j in 1:n_brand){
prop_per_year <- c()
# for each year
for(k in 1:length(x)){
year <- lookup_period[k, ]
quarter_start_year <- year[["quarters_t"]]
quarter_end_year <- year[["quarters_t1"]]
end_year_prop <- simulation[`Scenario ID`==i & brand==j & dev_quarter==quarter_end_year]
start_year_prop <- simulation[`Scenario ID`==i & brand==j & dev_quarter==quarter_start_year]
prop_this_year <- max(end_year_prop[["proportion"]] - start_year_prop[["proportion"]], 0)
prop_per_year <- append(prop_per_year, prop_this_year)
}
result_temp <- data.table(`Scenario ID` = i, years = x, brand = j, proportion = prop_per_year)
result <- rbind(result, result_temp)
}
}
I considered filtering my data.table to keep only the rows where dev_quarter is a multiple of 4, but the issue with the for loops remains the same.
How can I avoid them using data.table?
Thanks.
The absolute change in proportion between the 4th and 1st quarter can be calculated much more easily.
simulation[, year := 2002 + (dev_quarter-1) %/% 4] # Easier way to calculate the year
simulation[, .(change = last(proportion) - first(proportion)), by = c("Scenario ID", "brand", "year")]
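The loop in the question also floors each yearly change at zero with max(..., 0). Assuming that floor is wanted, a small sketch of the grouped approach on hand-picked proportions (chosen here only so the result is easy to check) could look like this:

```r
library(data.table)

# Two years of quarterly data for one scenario and one brand (made-up values)
simulation <- data.table(
  `Scenario ID` = 1,
  dev_quarter = 1:8,
  brand = 1,
  proportion = c(0.10, 0.20, 0.30, 0.40, 0.50, 0.45, 0.40, 0.35)
)

# Map quarters 1-4 to 2002, 5-8 to 2003, and so on
simulation[, year := 2002 + (dev_quarter - 1) %/% 4]

# Within each scenario/brand/year: last quarter minus first quarter,
# floored at zero as in the question's loop
yearly <- simulation[order(dev_quarter),
                     .(change = max(last(proportion) - first(proportion), 0)),
                     by = c("Scenario ID", "brand", "year")]
yearly
```

Here the first year rises from 0.10 to 0.40 and the second falls from 0.50 to 0.35, so the floored changes are 0.3 and 0.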

Is it possible to check whether a time (without a date) falls within a specific datetime interval?

Example: I want to check whether 03:00 lies inside the interval from 2015-01-05 00:52 to 2015-01-05 05:52. In this case the desired outcome is TRUE.
Thanks in advance
I propose this solution for a data frame with several max and min datetimes (solution is shorter than the number of lines for creating a reproducible example):
# Reproducible example
set.seed(666)
timeLimits <- seq(from = (as.POSIXct("1996-6-6 06:06:06")),
to = (as.POSIXct("1997-6-6 06:06:06")),
by = "hour")
timeLimits <- matrix(data = sort(sample(x = timeLimits, size = 66)),
ncol = 2, byrow = TRUE)
timeLimits <- data.frame(min = as.POSIXct(x = timeLimits[,1], origin = "1970-1-1 00:00:00"),
max = as.POSIXct(x = timeLimits[,2], origin = "1970-1-1 00:00:00"))
# The solution
library(lubridate)
# apply() passes each row as a character vector, so convert back to POSIXct before building the hourly sequence
answer <- apply(timeLimits, 1, function(x) is.element(3, hour(seq(as.POSIXct(x[1]), as.POSIXct(x[2]), "hour"))))
update
library(data.table)
as.ITime("03:00") %in% as.ITime( seq( as.POSIXct("2015-01-04 22:52"),
as.POSIXct("2015-01-05 05:52"),
by = 1 ) )
#[1] TRUE
old answer
as.ITime("03:00") %between% c(as.ITime("2015-01-05 00:52"),
as.ITime("2015-01-05 05:52"))
#[1] TRUE
You can try the code below by playing some tricks on strings
> match(x, sort(c(x, gsub(".*\\s", "", c(start, end))))) == 2
[1] TRUE
where
start <- "2015-01-05 00:52"
end <- "2015-01-05 05:52"
x <- "03:00" # zero-padded, otherwise the string sort puts "3:00" after "05:52"
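The approaches above either enumerate every instant in the interval or depend on string ordering. A base R sketch that compares times of day as minutes since midnight, and also handles intervals that cross midnight, might look like the following. The function names to_minutes and time_in_interval are hypothetical, and the sketch assumes times are given as "HH:MM" and interpreted in UTC.

```r
# Hypothetical helper: minutes since midnight for an "HH:MM" string
to_minutes <- function(hm) {
  parts <- as.integer(strsplit(hm, ":", fixed = TRUE)[[1]])
  parts[1] * 60 + parts[2]
}

# Hypothetical helper: does the time of day `hm` occur between start and end?
time_in_interval <- function(hm, start, end, tz = "UTC") {
  start <- as.POSIXct(start, tz = tz)
  end <- as.POSIXct(end, tz = tz)
  # An interval of 24 hours or more contains every time of day
  if (difftime(end, start, units = "hours") >= 24) return(TRUE)
  t <- to_minutes(hm)
  s <- to_minutes(format(start, "%H:%M"))
  e <- to_minutes(format(end, "%H:%M"))
  if (s <= e) {
    t >= s && t <= e   # interval stays within one day
  } else {
    t >= s || t <= e   # interval wraps past midnight
  }
}

time_in_interval("03:00", "2015-01-05 00:52", "2015-01-05 05:52")
time_in_interval("03:00", "2015-01-04 22:52", "2015-01-05 05:52")
time_in_interval("06:00", "2015-01-05 00:52", "2015-01-05 05:52")
```

The first two calls should return TRUE (the second interval crosses midnight) and the third FALSE, since 06:00 is after the end of the interval.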

Converting date to numeric but limiting to number of days in year

I want to create around 10K Date objects between 2008-01-01 and 2010-12-31. I wrote the code for that, but I also want the day numbers to run from 1 to 366 in 2008 (a leap year, because of 2008-02-29) and then restart at 1 on 2009-01-01. I could generate 2008, 2009 and 2010 separately, but that is not convenient. I could also filter 1 to 366, then 367 to 731, but that would not be efficient either. I was reading about lubridate but could not figure it out. Does anyone know a better way to do it?
set.seed(123)
tim1=sample(365*3+1,10000,replace = TRUE) ### that plus 1 from feb 29 in 2008
dat1=as.Date(tim1,origin="2007-12-31") # then 1 will be 2008-01-01
You can create a vector of all the target dates and sample from it. To create the vector, there is seq.Date, the seq method for objects of class "Date".
start <- as.Date("2008-01-01")
end <- as.Date("2010-12-31")
s <- seq(start, end, by = "days")
The vector s includes all days between start and end. Now sample from it.
set.seed(123)
dat1 <- sample(s, 10000, TRUE)
Transform the sample into day-of-the-year. See help("strptime")
as.numeric(format(dat1, format = "%j"))
In the end, remove s, it's no longer needed.
rm(s) # tidy up
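As a quick check that the "%j" format restarts the count each year and accounts for the leap day, here is a small sketch on a few hand-picked dates (the vector names dates and doy are illustrative):

```r
dates <- as.Date(c("2008-02-29", "2008-12-31", "2009-01-01", "2010-12-31"))

# "%j" is the day of the year as a zero-padded string, "001" to "366"
doy <- as.integer(format(dates, "%j"))
doy
```

The leap day is day 60, the last day of 2008 is day 366, and the count restarts at 1 on 2009-01-01.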
Edit.
The following two functions do what the question asks for but with two different methods.
f1 is the code above wrapped in a function; f2 uses ave/seq_along/match and is a bit more complicated. The tests show f2 to be about twice as fast as f1.
f1 <- function(start_date, end_date, n){
start <- as.Date(start_date)
end <- as.Date(end_date)
s <- seq(start, end, by = "days")
y <- sample(s, n, replace = TRUE)
as.numeric(format(y, format = "%j"))
}
f2 <- function(start_date, end_date, n){
start <- as.Date(start_date)
end <- as.Date(end_date)
s <- seq(start, end, by = "days")
y <- sample(s, n, replace = TRUE)
z <- ave(as.integer(s), lubridate::year(s), FUN = seq_along)
z[match(y, s)]
}
set.seed(123)
x1 <- f1("2008-01-01", "2010-12-31", 100)
set.seed(123)
x2 <- f2("2008-01-01", "2010-12-31", 100)
all.equal(x1, x2)
#[1] TRUE
Now the tests.
library(microbenchmark)
mb <- microbenchmark(
f1 = f1("2008-01-01", "2010-12-31", 1e4),
f2 = f2("2008-01-01", "2010-12-31", 1e4),
times = 50
)
print(mb, order = "median")
ggplot2::autoplot(mb)

How to efficiently collapse a vector of integers into a data.table of sequences, using R?

Given a large vector. For example:
set.seed(1)
in_vec <- sample(1:10000, 5000, replace = F)
How can I efficiently collapse this into a data.table that provides the start and end coordinates of every run of sequential integers? I am currently using the following code:
in_vec <- sort(in_vec) # sort by sequence
library(data.table)
interval_id <- findInterval(in_vec, in_vec[which(c(1, diff(in_vec)) > 1)]) # add unique IDs for sequences
dt <- data.table(vec = in_vec, # make data.table
int_id = interval_id)
long_to_short <- function(sub){ data.table(start = sub$vec[1], end = sub$vec[nrow(sub)]) } # custom function
library(plyr)
output <- ddply(dt, "int_id", long_to_short)
output$int_id <- NULL
However, the vector I am applying this to is very large, and I therefore need to maximise performance. Is there a data.table method? Any help will be greatly appreciated!
Using rleid() from data.table is helpful:
library(data.table)
set.seed(1)
dt <- data.table(in_vec = sample(1:10000, 5000, replace = F))
dt[order(in_vec),
.(start = min(in_vec),
end = max(in_vec)),
by = .(grp = rleid(c(0, cumsum(diff(in_vec) > 1))))
]
grp start end
1: 1 4 4
2: 2 6 7
3: 3 14 16
4: 4 19 19
5: 5 26 27
---
2483: 2483 9980 9980
2484: 2484 9988 9988
2485: 2485 9991 9992
2486: 2486 9994 9994
2487: 2487 9997 9998
For a completely base solution, this should be the most performant as it is not a grouping operation:
set.seed(1)
in_vec <- sample(1:10000, 5000, replace = F)
in_vec <- sort(in_vec)
grp <- c(0, cumsum(diff(in_vec) > 1))
data.frame(grp = unique(grp),
start = in_vec[!duplicated(grp)],
end = in_vec[!duplicated(grp, fromLast = T)]
)
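To see how the cumsum(diff(...) > 1) grouping key works, here is the same base R idea on a vector small enough to check by hand (the values are made up for illustration):

```r
in_vec <- sort(c(8L, 1L, 3L, 2L, 10L, 7L))   # 1 2 3 7 8 10

# diff() > 1 marks the first element of each new run; cumsum() turns
# those marks into a run id: 0 0 0 1 1 2
grp <- c(0, cumsum(diff(in_vec) > 1))

runs <- data.frame(
  start = in_vec[!duplicated(grp)],                  # first element of each run
  end   = in_vec[!duplicated(grp, fromLast = TRUE)]  # last element of each run
)
runs
```

The three runs are 1-3, 7-8 and the single value 10.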
Something like this?
dt[, .(start = first(vec), end = last(vec)), int_id]
Edit: I think the following will do what you need within data.table. Adjust fill = -1 depending on the actual range of your values.
set.seed(1)
in_vec <- sample(1:10000, 5000, replace = F)
dt <- data.table(vec = in_vec, key = 'vec')
dt[, int_id := cumsum(!shift(vec, 1, fill = -1) == vec - 1)]
dt[,.(start = first(vec), end = last(vec)), int_id]
You are almost there, just need to use the difference between sorted vectors to create a group. Then do range on them.
set.seed(1)
in_vec <- sample(1:10000, 5000, replace = F)
in_vec <- sort(in_vec)
grps <- cumsum(c(1,diff(in_vec)>1))
output <- data.frame(do.call(rbind,tapply(in_vec,grps,range)))
names(output) <- c("start","end")
And a dplyr solution
library(dplyr)
set.seed(1)
in_vec <- sample(1:10000, 5000, replace = F)
data.frame(x=in_vec) %>%
arrange(x) %>%
mutate(grps=cumsum(c(1,diff(x)>1))) %>%
group_by(grps) %>%
summarise(start=min(x),end=max(x)) %>%
select(start,end)

how to split data frame by time interval

I have two data frames, first is the daily return of 3 securities, second is the weights of the securities, as the following:
daily.return <- data.frame(date = seq.Date(from = as.Date("2015-01-01"),
by = "days",
length.out = 100),
a = runif(100,-0.1,0.1),
b = runif(100,-0.1,0.1),
c = runif(100,-0.1,0.1))
weights <- data.frame(startDate = c(as.Date("2015-01-01"),
as.Date("2015-02-10"),
as.Date("2015-03-15")),
endDate = c(as.Date("2015-02-09"),
as.Date("2015-03-14"),
as.Date("2015-04-10")),
a = c(0.3,0.5,0.2),
b = c(0.4,0.2,0.1),
c = c(0.3,0.3,0.7)
)
I know how to split a data frame by weeks etc. if we convert it to xts, but how do I split daily.return according to startDate and endDate in weights?
Suppose a fund holds these three securities. How do I calculate the fund NAV and daily return?
This should do the job.
daily.return <- data.frame(date = seq.Date(from = as.Date("2015-01-01"),
by = "days",
length.out = 100),
a = runif(100,-0.1,0.1),
b = runif(100,-0.1,0.1),
c = runif(100,-0.1,0.1))
weights <- data.frame(startDate = c(as.Date("2015-01-01"),
as.Date("2015-02-10"),
as.Date("2015-03-15")),
endDate = c(as.Date("2015-02-09"),
as.Date("2015-03-14"),
as.Date("2015-04-10")),
a = c(0.3,0.5,0.2),
b = c(0.4,0.2,0.1),
c = c(0.3,0.3,0.7)
)
library(quantmod)
daily.xts <- as.xts(daily.return[,-1],daily.return[,1])
# Assuming that the total period is the same in both the data frames
weights.xts <- xts(matrix(NA,nrow(daily.xts),3),order.by=index(daily.xts))
names(weights.xts) <- c("a","b","c")
for (i in 1:nrow(weights)){
temp.inputs <- weights[i,]
temp.period <- paste(temp.inputs[,1],temp.inputs[,2],sep="/")
len <- nrow(weights.xts[temp.period])
weights.xts[temp.period,1:3] <- matrix(rep(as.numeric(temp.inputs[,3:5]),len),len,byrow=T)
}
weighted.returns <- daily.xts * weights.xts
weighted.returns <- as.xts(rowSums(weighted.returns),index(weighted.returns))
names(weighted.returns) <- "Weighted Returns"
weighted.returns$Cumulative <- cumsum(weighted.returns)
plot(weighted.returns$Cumulative)
You can split daily.return according to the start and end dates in weights using apply, which performs a row-wise operation:
apply(weights, 1, function(x) daily.return[daily.return$date >= x[1]
& daily.return$date <= x[2], ])
This will give a list of 3 data frames, split according to the date ranges in weights.
EDIT
If I have understood correctly, you want each value in the column a, b, c of the daily.return to multiply with respective columns in the weights.
apply(weights, 1, function(x) {
A <- daily.return[daily.return$date >= x[1] & daily.return$date <= x[2], ]
t(t(A[, 2:4]) * as.numeric(x[3:5]))
}
)
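The question also asks for the fund NAV and daily return. A base R sketch of one way to get there is shown below. It assumes the weight periods are sorted and contiguous, that the fund is rebalanced to the stated weights every day, and that NAV starts at 1; none of these is stated in the question. The names period, fund_return, nav and by_period are illustrative.

```r
set.seed(42)
daily.return <- data.frame(
  date = seq.Date(from = as.Date("2015-01-01"), by = "days", length.out = 100),
  a = runif(100, -0.1, 0.1),
  b = runif(100, -0.1, 0.1),
  c = runif(100, -0.1, 0.1)
)
weights <- data.frame(
  startDate = as.Date(c("2015-01-01", "2015-02-10", "2015-03-15")),
  endDate   = as.Date(c("2015-02-09", "2015-03-14", "2015-04-10")),
  a = c(0.3, 0.5, 0.2),
  b = c(0.4, 0.2, 0.1),
  c = c(0.3, 0.3, 0.7)
)

# Row of `weights` in force on each date (periods assumed sorted and contiguous)
period <- findInterval(as.numeric(daily.return$date),
                       as.numeric(weights$startDate))

# Split the returns into one data frame per weight period
by_period <- split(daily.return, period)

# Weighted daily return of the fund, then NAV compounded from 1
w <- as.matrix(weights[period, c("a", "b", "c")])
r <- as.matrix(daily.return[, c("a", "b", "c")])
fund_return <- rowSums(r * w)
nav <- cumprod(1 + fund_return)
```

Compounding with cumprod(1 + r) treats the values as simple returns; a cumulative sum, as in the xts answer above, would instead suit log returns.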
