Why are simulated stock returns re-scaled and re-centered in the “pbo” vignette in the pbo (probability of backtest overfitting) package in R?

Here's the relevant code from the vignette, altered slightly to fit it on the page here, and make it easy to reproduce. Code for visualizations omitted. Comments are from vignette author.
(Full vignette: https://cran.r-project.org/web/packages/pbo/vignettes/pbo.html)
library(pbo)
#First, we assemble the trials into an NxT matrix where each column
#represents a trial and each trial has the same length T. This example
#is random data so the backtest should be overfit.
set.seed(765)
n <- 100
t <- 2400
m <- data.frame(matrix(rnorm(n*t),nrow=t,ncol=n,
dimnames=list(1:t,1:n)), check.names=FALSE)
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
#We can use any performance evaluation function that can work with the
#reassembled sub-matrices during the cross validation iterations.
#Following the original paper we can use the Sharpe ratio as
sharpe <- function(x,rf=0.03/252) {
sr <- apply(x,2,function(col) {
er = col - rf
return(mean(er)/sd(er))
})
return(sr)
}
#Now that we have the trials matrix we can pass it to the pbo function
#for analysis.
my_pbo <- pbo(m,s=8,f=sharpe,threshold=0)
summary(my_pbo)
Here's the portion I'm curious about:
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
Why is the data transformed within the for loop, and does this kind of re-scaling and re-centering need to be done with real returns? Or is this just something the author is doing to make his simulated returns look more like the real thing?
Googling and searching through Stack Overflow turned up some articles and posts about scaling volatility by the square root of time, but this doesn't look quite like what I've seen. Usually they involve multiplying some short-term (e.g. daily) measure of volatility by the root of time, but this isn't quite that. Also, the documentation for the package doesn't include this chunk of re-scaling and re-centering code. Documentation: https://cran.r-project.org/web/packages/pbo/pbo.pdf
So:
1. Why is the data transformed in this way, and what is the result of this transformation?
2. Is it only necessary for this simulated data, or do I need to similarly transform real returns?

I posted this question on the r-help mailing list and got the following answer:
"Hi Joe,
The centering and re-scaling is done for the purposes of his example, and
also to be consistent with his definition of the sharpe function.
In particular, note that the sharpe function has the rf (riskfree)
parameter with a default value of .03/252 i.e. an ANNUAL 3% rate converted
to a DAILY rate, expressed in decimal.
That means that the other argument to this function, x, should be DAILY
returns, expressed in decimal.
Suppose he wanted to create random data from a distribution of returns with
ANNUAL mean MU_A and ANNUAL std deviation SIGMA_A, both stated in decimal.
The equivalent DAILY returns would have mean MU_D = MU_A / 252 and standard
deviation SIGMA_D = SIGMA_A/SQRT(252).
He calls MU_D by the name mu_base and SIGMA_D by the name sigma_base.
His loop now converts the random numbers in his matrix so that each column
has mean MU_D and std deviation SIGMA_D.
HTH,
Eric"
I followed up with this:
"If I'm understanding correctly, if I’m wanting to use actual returns from backtests rather than simulated returns, I would need to make sure my risk-adjusted return measure, sharpe ratio in this case, matches up in scale with my returns (i.e. daily returns with daily sharpe, monthly with monthly, etc). And I wouldn’t need to transform returns like the simulated returns are in the vignette, as the real returns are going to have whatever properties they have (meaning they will have whatever average and std dev they happen to have). Is that correct?"
I was told this was correct.
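To see what the loop does to a single column, here is a minimal sketch (the variable x and the printed summary are illustrative and not part of the vignette). Multiplying by sigma_base / sd(x) forces the sample standard deviation to exactly 1/sqrt(252), and adding mu_base - mean(x) then forces the sample mean to exactly mu_base without changing the standard deviation:

```r
set.seed(765)
x <- rnorm(2400)                 # one simulated trial of daily "returns"

sr_base    <- 0
mu_base    <- sr_base / 252      # target daily mean
sigma_base <- 1 / sqrt(252)      # target daily sd (annualized sd of 1)

x <- x * sigma_base / sd(x)      # re-scale: sample sd becomes sigma_base
x <- x + mu_base - mean(x)       # re-center: sample mean becomes mu_base

c(daily_mean = mean(x), daily_sd = sd(x), annual_sd = sd(x) * sqrt(252))
```

Real backtest returns would skip both steps, as discussed above; the only requirement is that the frequency of the returns matches the frequency assumed by rf in the sharpe function.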

Related

R for loop with panel data for z-score calculation

I am currently working on creating some functions in RStudio with a dataset on roughly 100,000 individuals that are observed from 2005-2013. I have an unbalanced panel with two variables of interest; let's call them x and y for the sake of simplicity.
The function I am specifying takes the form of:
z = (mean(x) + mean(y)) / sd(x)
As you can see, it is a standard z-score-type measure that is often used as a normalisation technique during the pre-processing stage of model estimation.
The goal of specifying the function is to compute z for each individual i in the dataset whilst taking into account that there are different periods T = 1,2...,t observed for the different individuals. In other words, in some cases I have data from 2008-2013, and for others I have data from say 2006-2010.
At the moment I have specified my function as follows:
z1 <- function(x,y) {
(mean(x) + mean(y))/sd(x)
}
when I execute it as:
z1(x,y)
I only get one number as an output representing the calculation from the total number of observations (about 150,000 rows). How should I edit my code to make sure I get one number for each individual in my dataset?
I am assuming that I must use a for loop that iterates and computes the z score for one individual at a time, but I am not sure how to specify this when writing my function.
It's returning a single value because mean(x), mean(y) and sd(x) are each single numeric values and you're not asking it to do anything else.
The following code simulates two vectors and does what I think you want. It would help if you were more descriptive about your task, though.
x <- rbinom(100,3,(2/5))
y <- rpois(100,2.5)
f <- function(mvL,mvR){
answer = NULL;
vector <- readline('Which?: ')
if (vector=='Left'){
for (i in 1:length(mvL)){
answer[i] = mvL[i] - ((mean(mvL) + mean(mvR)) / sd(mvL));
}
}
else{
for (i in 1:length(mvR)){
answer[i] = mvR[i] - ((mean(mvL) + mean(mvR)) / sd(mvL));
}
}
return (answer);
}
f(x,y)
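If the goal is one z value per individual, a loop is not strictly needed. The sketch below assumes the panel is a data frame with an identifier column, here called id (a hypothetical name, since the question does not show the data), splits the rows by individual, and applies z1 to each piece:

```r
set.seed(42)
# Toy unbalanced panel: three individuals observed for 4, 6 and 5 periods
panel <- data.frame(
  id = rep(1:3, times = c(4, 6, 5)),
  x  = rnorm(15, mean = 10),
  y  = rnorm(15, mean = 5)
)

z1 <- function(x, y) (mean(x) + mean(y)) / sd(x)

# One value per individual, named by id
z_by_id <- sapply(split(panel, panel$id), function(d) z1(d$x, d$y))
z_by_id
```

Individuals observed in only one period would get NA, because sd() of a single value is NA.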

What is the formula to calculate the gini with sample weight

I need help understanding how I can obtain the same result as this function:
gini(x, weights=rep(1,length=length(x)))
http://cran.r-project.org/web/packages/reldist/reldist.pdf --> page 2. Gini
Let's say we need to measure the income of a population N. To do that, we can divide the population into K subgroups, and in the k-th subgroup we sample nk individuals and ask for their income. As a result, we get each individual's income, and each individual has a particular sample weight representing their contribution to the population N. Here is an example that I took from the link above; the dataset is from the NLS:
rm(list=ls())
cat("\014")
library(reldist)
data(nls);data
help(nls)
# Convert the wage growth from (log. dollar) to (dollar)
y <- exp(recent$chpermwage);y
# Compute the unweighted estimate
gini_y <- gini(y)
# Compute the weighted estimate
gini_yw <- gini(y,w=recent$wgt)
> --- Here is the result----
> gini_y = 0.3418394
> gini_yw = 0.3483615
I know how to compute the Gini without weights with my own code, so I am happy to keep the command gini(y) in my code. The only thing I am concerned about is how gini(y,w) operates to obtain the result 0.3483615. I tried another calculation to see whether I could come up with the same result as gini_yw. Here is the code, which I based on the CDF in Section 9.5 of the book "Relative Distribution Methods in the Social Sciences" by Mark S. Handcock:
#-------------------------
# test how gini computes with the sample weights
z <- exp(recent$chpermwage) * recent$wgt
gini_z <- gini(z)
# Result gini_z = 0.3924161
As you can see, my calculation gini_z differs from the result of gini(y, weights). If anyone knows the correct computation to obtain exactly gini_yw = 0.3483615, please advise.
Thanks a lot, friends.
function (x, weights = rep(1, length = length(x)))
{
ox <- order(x)
x <- x[ox]
weights <- weights[ox]/sum(weights)
p <- cumsum(weights)
nu <- cumsum(weights * x)
n <- length(nu)
nu <- nu/nu[n]
sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}
This is the source code for the function gini which can be seen by entering gini into the console. No parentheses or anything else.
EDIT:
This can be done for any function or object really.
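Reading that source, the weights enter as population shares (the cumulative sum p) rather than as multipliers on income. A small sketch of what this implies, using a local copy of the function under the hypothetical name weighted_gini and made-up numbers: with integer weights, the weighted Gini matches the unweighted Gini of the data with each person repeated as many times as their weight, whereas multiplying income by the weight (as in gini_z in the question) gives a different number.

```r
# Local copy of the gini source shown above (the name is hypothetical)
weighted_gini <- function(x, weights = rep(1, length(x))) {
  ox <- order(x)
  x <- x[ox]
  weights <- weights[ox] / sum(weights)
  p  <- cumsum(weights)          # cumulative population share
  nu <- cumsum(weights * x)      # cumulative weighted income
  n  <- length(nu)
  nu <- nu / nu[n]               # cumulative income share (Lorenz curve)
  sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}

income <- c(1, 2, 3)
w      <- c(2, 1, 1)             # made-up integer sample weights

g_weighted   <- weighted_gini(income, w)
g_replicated <- weighted_gini(rep(income, times = w))  # each person repeated w times
g_scaled     <- weighted_gini(income * w)              # the approach tried in the question

c(weighted = g_weighted, replicated = g_replicated, scaled = g_scaled)
```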
This is a bit late, but one may be interested in the concentration/diversity measures contained in the SciencesPo package.

cost function in cv.glm of boot library in R

I am trying to use the crossvalidation cv.glm function from the boot library in R to determine the number of misclassifications when a glm logistic regression is applied.
The function has the following signature:
cv.glm(data, glmfit, cost, K)
where the first two arguments denote the data and the model, and K specifies the number of folds.
My problem is the cost parameter which is defined as:
cost: A function of two vector arguments specifying the cost function
for the crossvalidation. The first argument to cost should correspond
to the observed responses and the second argument should correspond to
the predicted or fitted responses from the generalized linear model.
cost must return a non-negative scalar value. The default is the
average squared error function.
I guess for classification it would make sense to have a function which returns the rate of misclassification something like:
nrow(subset(data, (predict >= 0.5 & data$response == "no") |
(predict < 0.5 & data$response == "yes")))
which is of course not even syntactically correct.
Unfortunately, my limited R knowledge has made me waste hours, and I was wondering if someone could point me in the right direction.
It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:
# [...] Since the response is a binary variable an
# appropriate cost function is
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, let's say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the mean misclassification rate:
## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <- rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1
## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333 ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
## (assuming you use 0.5 as the cutoff for your predictions).
I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was for me, for the longest time, the hardest part of using the boot package, which requires that move in a fair number of places.)
Added on 2016-03-22:
The function cost(), given above is in my opinion unnecessarily obfuscated; the following alternative does exactly the same thing but in a more expressive way:
cost <- function(r, pi = 0) {
mean((pi < 0.5) & r==1 | (pi > 0.5) & r==0)
}
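A quick way to convince yourself that the two definitions agree is to evaluate both on the same simulated data. This is only a sketch: the names cost_abs and cost_logic are made up here so that both versions can coexist, and agreement on one random sample is a sanity check rather than a proof.

```r
cost_abs   <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cost_logic <- function(r, pi = 0) mean((pi < 0.5) & r == 1 | (pi > 0.5) & r == 0)

set.seed(1)
r  <- rbinom(1000, size = 1, prob = 0.5)  # simulated 0/1 responses
pi <- runif(1000)                         # simulated predicted probabilities

c(abs_version = cost_abs(r, pi), logic_version = cost_logic(r, pi))
```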
I will try to explain the cost function in simple words. Let's take
cv.glm(data, glmfit, cost, K) arguments step by step:
data
The data consists of many observations. Think of it as a series of numbers.
glmfit
It is the generalized linear model fitted to the above data. But there is a catch: cv.glm splits the data into K parts and refits the model K times, each time holding one part out as the test set and using the rest as the training set. The predictions for each test set have the same number of elements as that test set.
cost
Cost function. It takes two arguments: first the observed responses for the test set, and second the model's predictions for that test set. The default is the mean squared error function.
Conceptually, it runs over the test set (observed and predicted values must have the same number of elements), takes the difference between each observed and predicted data point, squares it, and averages the results.
K
The number of parts into which the data should be split. The default gives leave-one-out cross-validation.
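Judging from your cost function description, your input (x) would be a set of numbers between 0 and 1 (0-0.5 = no and 0.5-1 = yes) and your output (y) is 'yes' or 'no'. So the mean squared error between the predictions (x) and the observations (y) would be:
cost <- function(x, y){
# x: predicted probabilities, y: observed labels ('yes' or 'no')
sse = 0
for (i in seq_along(x)){
if (x[i] > 0.5) {
# predicted 'yes': no error if observed 'yes', else distance past the cutoff
e = if (y[i] == 'yes') 0 else x[i] - 0.5
} else {
# predicted 'no': no error if observed 'no', else distance below the cutoff
e = if (y[i] == 'no') 0 else 0.5 - x[i]
}
sse = sse + e*e # accumulate the squared error
}
return (sse / length(x)) # mean squared error
}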
Sources : http://www.cs.cmu.edu/~schneide/tut5/node42.html
The cost function can optionally be defined if there is one you prefer over the default average squared error. If you wanted to do so, you would write a function that returns the cost you want to minimize using two inputs: (1) the vector of known labels that you are predicting, and (2) the vector of predicted probabilities from your model for those corresponding labels. So for the cost function that (I think) you described in your post, you are looking for a function that returns the proportion of accurate classifications, which would look something like this:
cost <- function(labels,pred){
mean(labels==ifelse(pred > 0.5, 1, 0))
}
With that function defined you can then pass it into your cv.glm() call, although I wouldn't recommend using your own cost function over the default one unless you have reason to. Your example isn't reproducible, so here is another example:
> library(boot)
>
> cost <- function(labels,pred){
+ mean(labels==ifelse(pred > 0.5, 1, 0))
+ }
>
> #make model
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> #run cv with your cost function
> (nodal.glm.err <- cv.glm(nodal, nodal.glm, cost, nrow(nodal)))
$call
cv.glm(data = nodal, glmfit = nodal.glm, cost = cost, K = nrow(nodal))
$K
[1] 53
$delta
[1] 0.8113208 0.8113208
$seed
[1] 403 213 -2068233650 1849869992 -1836368725 -1035813431 1075589592 -782251898
...
The cost function defined in the example for cv.glm clearly assumes that the predictions are probabilities, which would require the type="response" argument in the predict function. The documentation from library(boot) should state this explicitly. I would otherwise be forced to assume that the default type="link" is used inside the cv.glm function, in which case the cost function would not work as intended.
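The difference between the two prediction scales can be checked with base R alone. The sketch below uses the built-in mtcars data and an arbitrary logistic model chosen only for illustration; it says nothing about what cv.glm does internally, only why a 0.5 cutoff makes sense on the response scale and not on the link scale.

```r
fit <- glm(am ~ wt, family = binomial, data = mtcars)

eta <- predict(fit, type = "link")      # log-odds, unbounded
p   <- predict(fit, type = "response")  # probabilities in [0, 1]

range(eta)
range(p)
all.equal(plogis(eta), p)               # the inverse logit maps one scale to the other
```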

Fast loan rate calculation for a big number of loans

I have a big data set (around 200k rows) where each row is a loan. I have the loan amount, the number of payments, and the loan payment.
I'm trying to get the loan rate.
R doesn't have a function for calculating this (at least base R doesn't have it, and I couldn't find it).
It isn't that hard to write both an NPV and an IRR function:
Npv <- function(i, cf, t=seq(from=0,by=1,along.with=cf)) sum(cf/(1+i)^t)
Irr <- function(cf) { uniroot(Npv, c(0,100000), cf=cf)$root }
And you can just do
# the loan amount enters with the opposite sign to the payments, otherwise the NPV never crosses zero
rate = Irr(c(-amt,rep(pmt,times=n)))
The problem is when you try to calculate the rate for a lot of payments. Because uniroot is not vectorized, and because rep takes a surprising amount of time, you end up with a slow calculation. You can make it faster if you do some math and figure out that you are looking for the roots of the following equation
zerome <- function(r) amt/pmt-(1-1/(1+r)^n)/r
and then use that as input for uniroot. On my PC, this takes around 20 seconds to run for my 200k-row database.
The problem is that I'm trying to do some optimization, and this is a step of the optimization, so I'm trying to speed it up even more.
I've tried vectorization, but because uniroot is not vectorized, I can't go further that way. Is there any root finding method that is vectorized?
Thanks
Instead of using a root finder, you could use a linear interpolator. You will have to create one interpolator for each value of n (the number of remaining payments). Each interpolator will map (1-1/(1+r)^n)/r to r. Of course you will have to build a grid fine enough so it will return r to an acceptable precision level. The nice thing with this approach is that linear interpolators are fast and vectorized: you can find the rates for all loans with the same number of remaining payments (n) in a single call to the corresponding interpolator.
Now some code that proves it is a viable solution:
First, we create interpolators, one for each possible value of n:
n.max <- 360L # 30 years
one.interpolator <- function(n) {
r <- seq(from = 0.0001, to = 0.1500, by = 0.0001)
y <- (1-1/(1+r)^n)/r
approxfun(y, r)
}
interpolators <- lapply(seq_len(n.max), one.interpolator)
Note that I used a precision of 1/100 of a percent (1bp).
Then we create some fake data:
n.loans <- 200000L
n <- sample(n.max, n.loans, replace = TRUE)
amt <- 1000 * sample(100:500, n.loans, replace = TRUE)
pmt <- amt / (n * (1 - runif(n.loans)))
loans <- data.frame(n, amt, pmt)
Finally, we solve for r:
library(plyr)
system.time(ddply(loans, "n", transform, r = interpolators[[n[1]]](amt / pmt)))
# user system elapsed
# 2.684 0.423 3.084
It's fast. Note that some of the output rates are NA but it is because my random inputs made no sense and would have returned rates outside of the [0 ~ 15%] grid I selected. Your real data won't have that problem.
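If a grid is not precise enough, another option is to vectorize the root search itself. The sketch below is a plain bisection applied to all loans at once; it is not from the answer above, and the names annuity_factor and solve_rates as well as the bracket [1e-8, 1] are assumptions made for illustration. It relies on the fact that (1-1/(1+r)^n)/r decreases as r increases.

```r
# Present value of an annuity of 1 per period for n periods at rate r
annuity_factor <- function(r, n) (1 - (1 + r)^(-n)) / r

# Hypothetical helper: solve amt/pmt = annuity_factor(r, n) for every loan at once
solve_rates <- function(amt, pmt, n, lo = 1e-8, hi = 1, iters = 60) {
  target <- amt / pmt
  lo <- rep(lo, length(target))
  hi <- rep(hi, length(target))
  for (k in seq_len(iters)) {
    mid <- (lo + hi) / 2
    # The factor falls as r rises, so a factor above target means r is too low
    too_low <- annuity_factor(mid, n) > target
    lo <- ifelse(too_low, mid, lo)
    hi <- ifelse(too_low, hi, mid)
  }
  (lo + hi) / 2
}

# Check against loans whose true per-period rate is known
r_true <- c(0.01, 0.005, 0.02)
n      <- c(12, 360, 60)
amt    <- c(1000, 200000, 5000)
pmt    <- amt / annuity_factor(r_true, n)
solve_rates(amt, pmt, n)
```

Loans whose rate lies outside the bracket come back pinned at lo or hi, so the bracket should be chosen to cover the rates expected in the data.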

Root mean square deviation on binned GAM results using R

Background
A PostgreSQL database uses PL/R to call R functions. An R call to calculate Spearman's correlation looks as follows:
cor( rank(x), rank(y) )
Also in R, a naïve calculation of a fitted generalized additive model (GAM):
data.frame( x, fitted( gam( y ~ s(x) ) ) )
Here x represents the years from 1900 to 2009 and y is the average measurement (e.g., minimum temperature) for that year.
Problem
The fitted trend line (using GAM) is reasonably accurate, as you can see in the following picture:
The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.
Possible Solution
One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.
Questions
Q.1. How would you implement the RMSE calculation on the binned data to get a correlation (between 0 and 1) of GAM's fit to the measurements, in the R language?
Q.2. Is there a better way to find the accuracy of GAM's fit to the data, and if so, what is it (e.g., root mean square deviation)?
Attempted Solution 1
Call the PL/R function using the observed amounts and the model (GAM) amounts:
correlation_rmse := climate.plr_corr_rmse( v_amount, v_model );
Define plr_corr_rmse as follows (where o and m represent the observed and modelled data):
CREATE OR REPLACE FUNCTION climate.plr_corr_rmse(
o double precision[], m double precision[])
RETURNS double precision AS
$BODY$
sqrt( mean( ( o - m ) ^ 2 ) )
$BODY$
LANGUAGE 'plr' VOLATILE STRICT
COST 100;
The o - m is wrong. I'd like to bin both data sets by calculating the mean of every 5 data points (there will be at most 110 data points). For example:
omean <- c( mean(o[1:5]), mean(o[6:10]), ... )
mmean <- c( mean(m[1:5]), mean(m[6:10]), ... )
Then correct the RMSE calculation as:
sqrt( mean( ( omean - mmean ) ^ 2 ) )
How do you calculate c( mean(o[1:5]), mean(o[6:10]), ... ) for an arbitrary length vector in an appropriate number of bins (5, for example, might not be ideal for only 67 measurements)?
I don't think hist is suitable here, is it?
Attempted Solution 2
The following code will solve the problem, however it drops data points from the end of the list (to make the list divisible by 5). The solution isn't ideal as the number "5" is rather magical.
while( length(o) %% 5 != 0 ) {
o <- o[-length(o)]
}
omean <- apply( matrix(o, 5), 2, mean )
What other options are available?
Thanks in advance.
You say that:
The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.
You could calculate the correlation between the fitted values and the measured values:
cor(y,fitted(gam(y ~ s(x))))
I don't see why you want to bin your data, but you could do it as follows:
mean.binned <- function(y,n = 5){
apply(matrix(c(y,rep(NA,(n - (length(y) %% n)) %% n)),n),
2,
function(x)mean(x,na.rm = TRUE))
}
It looks a bit ugly, but it should handle vectors whose length is not a multiple of the binning length (i.e. 5 in your example).
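An alternative sketch that avoids the NA padding groups the indices with ceiling() and lets tapply() average each group, so a short final bin is simply averaged over fewer points. The names bin_means and rmse are made up for this example; rmse squares the differences before averaging, which is the usual definition of root mean square error.

```r
# Mean of each consecutive block of `width` values; the last block may be shorter
bin_means <- function(v, width = 5) {
  groups <- ceiling(seq_along(v) / width)
  as.numeric(tapply(v, groups, mean))
}

# Root mean square error between observed and modelled values
rmse <- function(obs, mod) sqrt(mean((obs - mod)^2))

o <- c(1, 2, 3, 4, 5, 6, 7)          # toy observed values
m <- c(1, 2, 3, 4, 5, 8, 9)          # toy modelled values
bin_means(o)                          # bins: 1..5 and 6..7
rmse(bin_means(o), bin_means(m))
```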
You also say that:
One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.
I don't understand what you mean by this. The correlation is a factor in determining the mean squared error - for example, see equation 10 of Murphy (1988, Monthly Weather Review, v. 116, pp. 2417-2424). But please explain what you mean.
