I am looking for a way to quickly calculate realized volatility on a rolling, FORWARD-looking basis. That is, for each day I want the standard deviation over the next n days, using today as the first observation of the window.
At the moment, I calculate realized volatility in the backward direction with the following code:
index.realized <- xts(apply(index.ret,2,runSD,n=125), index(index.ret))*sqrt(252)
index.realized <- na.locf(index.realized, fromLast=TRUE)
I tried setting n = -125 but unsurprisingly, that doesn't work.
Thank you.
EDIT
To clarify what I am trying to do, here is the for loop I am using to accomplish this:
for (i in 1:nrow(index.ret)) {
  # near the end of the series the window i:(i+124) runs past the last row
  bear.realized[i,] = sd(bear.ret[i:(i+124),]) * sqrt(252)
  index.realized[i,] = sd(index.ret[i:(i+124),]) * sqrt(252)
}
For the last 124 observations where I don't have enough data to compute the volatility, I want it to take the last "correct" calculation and use it for the rest of the series.
One way to do it is to "lag" your series with a negative k (note that k is interpreted differently in lag.xts than lag.ts and lag.zoo).
getSymbols("SPY")
spy <- ROC(Cl(SPY))
# note that k is interpreted differently from lag.ts and lag.zoo
spy$SPY.Lag <- lag(spy,-125)
# remove the leading NA from ROC and the trailing NAs from the lag
spy <- na.omit(spy)
rv <- runSD(spy$SPY.Lag,n=125)*sqrt(252)
OK, I solved it. It's actually quite simple; I was just thinking about this completely the wrong way.
index.realized <- xts(apply(index.ret,2,runSD,n=125), index(index.ret))*sqrt(252)
index.realized <- lag(index.realized, -124)
index.realized <- na.locf(index.realized)
Just calculate the realized volatility as per normal, and then lag it by the appropriate number so that it is "forward looking".
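As a sanity check on the alignment, here is a minimal sketch (assuming the xts and TTR packages and a single-column return series) comparing the lagged runSD against an explicit forward window:
library(xts)
library(TTR)
set.seed(1)
ret <- xts(rnorm(300), Sys.Date() - 300:1)   # toy daily returns
rv <- xts(runSD(as.numeric(ret), n = 125), index(ret)) * sqrt(252)
rv <- lag(rv, -124)                          # rv[i] now covers days i:(i+124)
all.equal(as.numeric(rv[10]), sd(as.numeric(ret[10:134])) * sqrt(252))
# should be TRUE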
Sorry if this is unclear, had trouble titling this.
Basically I have a linear model that predicts sales and one of the factors is the previous 10 days of sales. So, when predicting for the next month, I need an estimated number for what the "previous 10 days of sales" is for each day in the month.
I want to use the model to generate these numbers - so, for the first day I'm trying to predict, I have the actual number for the last 10 days of sales. For the day after that, I have 9 days of real data, plus the one predicted number generated. For the day after that, 8 days of real data and two generated, etc.
Not quite sure how to implement this and would appreciate any help. Thanks so much.
The first thing that came to mind would be a moving average using the predicted data. This gets hard to defend once you're averaging only predicted data, but it's a place to start.
moving.average <- numeric(30)
test.dat <- rnorm(100, 10, 2)
for (i in 1:30) {
  # note the parentheses: i:i+10 would pick the single element i + 10
  moving.average[i] <- mean(test.dat[i:(i + 9)])   # 10-day window
}
Hope this is helpful
Kathy, get your first 10 data points from... wherever. Seed your prediction with them.
initialization <- c(9.463, 9.704, 10.475, 8.076, 8.221, 8.509,
10.083, 9.572, 8.447, 10.081)
prediction = initialization
Here's a silly prediction function that uses the last 10 values:
predFn <- function(vec10){
stopifnot(length(vec10) == 10)
round(mean(vec10) + 1 , 3)
}
Although I usually like to use the map family, this one seems like it wants to be a loop
for(i in 11:20){
lo = i - 10
hi = i - 1
prediction[i] <- predFn(prediction[lo:hi])
}
What did we get?
prediction
# [1] 9.463 9.704 10.475 8.076 8.221 8.509 10.083 9.572 8.447 10.081 10.263 10.343 10.407 10.400 10.633 10.874 11.110 11.213
# [19] 11.377 11.670
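When you move from the toy predFn to your real model, the same recursion works; predFn just wraps predict() on your fitted object. A sketch (sales_lm and the predictor name prev10_mean are stand-ins for whatever your model actually uses, and any other predictors would also need columns in newdata):
predFn <- function(vec10){
  stopifnot(length(vec10) == 10)
  # 'sales_lm' and 'prev10_mean' are hypothetical names for your fitted lm
  # and its lagged-sales predictor
  predict(sales_lm, newdata = data.frame(prev10_mean = mean(vec10)))
}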
Here's the relevant code from the vignette, altered slightly to fit on the page here and to make it easy to reproduce. Code for visualizations is omitted. Comments are from the vignette author.
(Full vignette: https://cran.r-project.org/web/packages/pbo/vignettes/pbo.html)
library(pbo)
#First, we assemble the trials into an NxT matrix where each column
#represents a trial and each trial has the same length T. This example
#is random data so the backtest should be overfit.
set.seed(765)
n <- 100
t <- 2400
m <- data.frame(matrix(rnorm(n*t),nrow=t,ncol=n,
dimnames=list(1:t,1:n)), check.names=FALSE)
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
#We can use any performance evaluation function that can work with the
#reassembled sub-matrices during the cross validation iterations.
#Following the original paper we can use the Sharpe ratio as
sharpe <- function(x,rf=0.03/252) {
sr <- apply(x,2,function(col) {
er = col - rf
return(mean(er)/sd(er))
})
return(sr)
}
#Now that we have the trials matrix we can pass it to the pbo function
#for analysis.
my_pbo <- pbo(m,s=8,f=sharpe,threshold=0)
summary(my_pbo)
Here's the portion I'm curious about:
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
Why is the data transformed within the for loop, and does this kind of re-scaling and re-centering need to be done with real returns? Or is this just something the author is doing to make his simulated returns look more like the real thing?
Googling and searching through stackoverflow turned up some articles and posts regarding scaling volatility to the square root of time, but this doesn't look quite like what I've seen. Usually they involve multiplying some short term (i.e. daily) measure of volatility by the root of time, but this isn't quite that. Also, the documentation for the package doesn't include this chunk of re-scaling and re-centering code. Documentation: https://cran.r-project.org/web/packages/pbo/pbo.pdf
So:
Why is the data transformed in this way, and what is the result of this transformation?
Is it only necessary for this simulated data, or do I need to similarly transform real returns?
I posted this question on the r-help mailing list and got the following answer:
"Hi Joe,
The centering and re-scaling is done for the purposes of his example, and
also to be consistent with his definition of the sharpe function.
In particular, note that the sharpe function has the rf (riskfree)
parameter with a default value of .03/252 i.e. an ANNUAL 3% rate converted
to a DAILY rate, expressed in decimal.
That means that the other argument to this function, x, should be DAILY
returns, expressed in decimal.
Suppose he wanted to create random data from a distribution of returns with
ANNUAL mean MU_A and ANNUAL std deviation SIGMA_A, both stated in decimal.
The equivalent DAILY returns would have mean MU_D = MU_A / 252 and standard
deviation SIGMA_D = SIGMA_A/SQRT(252).
He calls MU_D by the name mu_base and SIGMA_D by the name sigma_base.
His loop now converts the random numbers in his matrix so that each column
has mean MU_D and std deviation SIGMA_D.
HTH,
Eric"
I followed up with this:
"If I'm understanding correctly, if I’m wanting to use actual returns from backtests rather than simulated returns, I would need to make sure my risk-adjusted return measure, sharpe ratio in this case, matches up in scale with my returns (i.e. daily returns with daily sharpe, monthly with monthly, etc). And I wouldn’t need to transform returns like the simulated returns are in the vignette, as the real returns are going to have whatever properties they have (meaning they will have whatever average and std dev they happen to have). Is that correct?"
I was told this was correct.
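To make the effect of that loop concrete, here is a quick check on a single simulated column (a minimal sketch using the vignette's mu_base = 0 and sigma_base = 1/sqrt(252)); after the two lines the column has the daily-scale mean and standard deviation:
set.seed(765)
x <- rnorm(2400)
mu_base <- 0
sigma_base <- 1/sqrt(252)
x <- x * sigma_base / sd(x)   # re-scale: sd(x) is now exactly sigma_base
x <- x + mu_base - mean(x)    # re-center: mean(x) is now (numerically) mu_base
c(mean = mean(x), sd = sd(x)) # ~0 and ~0.063, i.e. 0 and 1/sqrt(252)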
Assuming we have a random sample from some distribution, we can calculate and plot the associated ecdf as follows:
set.seed(1)
t1 <- rnorm(10000,mean=20)
t1 <- sort(t1)
t1[1:1000] <- t1[1:1000]*(-100)
t1[1001:7499] <- t1[1001:7499]*50
t1[7500:10000] <- t1[7500:10000]*100
cdft1 <- ecdf(t1)
plot(cdft1)
Now in this case, there are jumps (created intentionally) in the empirical distribution. By jumps I mean that the value increases by a lot, let's say by more than 100% of the value before. This happens in the example at position 7,500. My question is: how can I find these 'jump' indices most effectively?
You can get close to what you want just by looking at diff of the sorted t1 values.
St1 = sort(t1)
which(diff(St1) > abs(St1[-length(St1)]))
[1] 1000 7499
At point 1000, St1 switches from -1632.8700 to 934.6916, which technically meets your criterion of "more than 100% change". It does not seem clear to me what is wanted when there is a sign change like this.
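If you want to treat sign changes separately rather than count them as jumps, one variation (just a sketch of one possible interpretation) is to also require that consecutive values share the same sign:
St1 <- sort(t1)
prev <- St1[-length(St1)]
nxt <- St1[-1]
which(diff(St1) > abs(prev) & sign(prev) == sign(nxt))
# [1] 7499   (index 1000 is dropped because of the sign change)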
I am trying to write a function which takes a positive real number and keeps adding terms of the harmonic series until the total sum exceeds the initial argument.
I need my function to display the total number of terms of the series that were added.
Here's my code so far:
harmonic<-function(n){
x<-c(0,1)
while (length(x) < n) {
position <- length(x)
new <- 1/(x[position] + x[position-1])
x <- c(x,new)
}
return(x)
}
I apologise for the errors in my code; unfortunately I have been working with R for only a month, this is the first time I am using a while loop, and I couldn't find any useful information on this.
Thank you, I'd really appreciate your help.
Here's an attempt based on some info from this post at maths.stackexchange: https://math.stackexchange.com/q/496116
I can't speak as to whether it is highly accurate in all circumstances or even the best or an appropriate way to go about this. Caveat emptor.
harmsum.cnt <- function(x,tol=1e-09) {
em.cons <- 0.577215664901533
difffun <- function(x,n) x - (log(n) + em.cons + 1/(2*n) - 1/(12*n^2))
ceiling(uniroot(difffun, c(1, 1e10), tol = tol, x = x)$root)
}
Seems to work alright though:
harmsum.cnt(7)
#[1] 616
harmsum.cnt(15)
#[1] 1835421
Compare:
tail(cumsum(1/1:616),1); tail(cumsum(1/1:615),1)
#7.001274
#6.999651
dput(tail(cumsum(1/1:1835421),1)); dput(tail(cumsum(1/1:1835420),1))
#15.0000003782678
#14.9999998334336
This is a partial answer, which I'll try to fill in later. On the assumption that you want an exact answer, rather than the excellent approximation formula thelatemail found, there are a few tools to consider.
First, use of a hash-table or memoise methods will allow you to save previous calculations, thus saving a lot of time.
Second, since the sum of a (finite) sequence is independent of the grouping, you can calculate, e.g., the first N terms and the next (N+1):2N terms independently; the parallel package can then divide and conquer (a chunked sketch follows after the next point).
Third, before you get too deep into the morass, check the limits of floating-point accuracy via .Machine$double.eps. Once your 1/n term comes close to that, you'll need to switch over to gmp and Rmpfr to get full accuracy in your calculations.
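For instance, here is a chunked exact count that exploits that independence of grouping (the function name and chunk size are just illustrative; it stays in ordinary doubles):
harm_count_exact <- function(target, chunk = 1e6) {
  total <- 0
  k <- 0
  repeat {
    cs <- total + cumsum(1 / (k + seq_len(chunk)))   # partial sums for this block
    hit <- which(cs >= target)
    if (length(hit) > 0) return(k + hit[1])
    total <- cs[chunk]
    k <- k + chunk
  }
}
harm_count_exact(7)    # 616, matching the approximation above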
Now, just to clarify what you "should" be doing, a correct loop is:
mylimit <- 7                 # pick a target value
N <- 1e7                     # a generous upper bound on the number of terms
harmsum <- 0
for (k in 1:N) {
  harmsum <- harmsum + 1/k
  if (harmsum >= mylimit) break
}
k                            # number of terms added
(or similar setup using while)
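For example, the while version of the same idea, reusing mylimit from above:
harmsum <- 0
k <- 0
while (harmsum < mylimit) {
  k <- k + 1
  harmsum <- harmsum + 1/k
}
k    # number of terms added; 616 when mylimit is 7, matching the cumsum check earlier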
I have a big data set (around 200k rows) where each row is a loan. I have the loan amount, the number of payments, and the loan payment.
I'm trying to get the loan rate.
R doesn't have a function for calculating this (at least base R doesn't have it, and I couldn't find it).
It isn't that hard to write both an NPV and an IRR function:
Npv <- function(i, cf, t=seq(from=0, by=1, along.with=cf)) sum(cf/(1+i)^t)
Irr <- function(cf) { uniroot(Npv, c(0, 100000), cf=cf)$root }  # note: Npv, not npv (R is case-sensitive)
And you can just do
rate = Irr(c(-amt, rep(pmt, times=n)))  # the initial outflow must be negative for a root to exist
The problem is when you try to calculate the rate for a lot of payments. Because uniroot is not vectorized, and because rep takes a surprising amount of time, you end up with a slow calculation. You can make it faster if you do some math and figure out that you are looking for the roots of the following equation
zerome <- function(r) amt/pmt-(1-1/(1+r)^n)/r
and then use that as input for uniroot. This, on my PC, takes around 20 seconds to run for my 200k database.
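For reference, the per-loan step described above looks roughly like this (a sketch; the loans data frame and its columns amt, pmt and n are assumptions about how your data is laid out):
rate_one <- function(amt, pmt, n) {
  zerome <- function(r) amt/pmt - (1 - 1/(1 + r)^n)/r
  # assumes each loan's rate lies strictly between 0 and 100% per period
  uniroot(zerome, c(1e-6, 1), tol = 1e-8)$root
}
loans$r <- mapply(rate_one, loans$amt, loans$pmt, loans$n)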
The problem is that I'm trying to do some optimization, and this is a step of the optimization, so I'm trying to speed it up even more.
I've tried vectorization, but because uniroot is not vectorized, I can't go further that way. Is there any root finding method that is vectorized?
Thanks
Instead of using a root finder, you could use a linear interpolator. You will have to create one interpolator for each value of n (the number of remaining payments). Each interpolator will map (1-1/(1+r)^n)/r to r. Of course you will have to build a grid fine enough so it will return r to an acceptable precision level. The nice thing with this approach is that linear interpolators are fast and vectorized: you can find the rates for all loans with the same number of remaining payments (n) in a single call to the corresponding interpolator.
Now some code that proves it is a viable solution:
First, we create interpolators, one for each possible value of n:
n.max <- 360L # 30 years
one.interpolator <- function(n) {
r <- seq(from = 0.0001, to = 0.1500, by = 0.0001)
y <- (1-1/(1+r)^n)/r
approxfun(y, r)
}
interpolators <- lapply(seq_len(n.max), one.interpolator)
Note that I used a precision of 1/100 of a percent (1bp).
Then we create some fake data:
n.loans <- 200000L
n <- sample(n.max, n.loans, replace = TRUE)
amt <- 1000 * sample(100:500, n.loans, replace = TRUE)
pmt <- amt / (n * (1 - runif(n.loans)))
loans <- data.frame(n, amt, pmt)
Finally, we solve for r:
library(plyr)
system.time(ddply(loans, "n", transform, r = interpolators[[n[1]]](amt / pmt)))
# user system elapsed
# 2.684 0.423 3.084
It's fast. Note that some of the output rates are NA but it is because my random inputs made no sense and would have returned rates outside of the [0 ~ 15%] grid I selected. Your real data won't have that problem.