Portfolio loss estimation using pairwise correlations in R

I'm trying to estimate portfolio losses using a pairwise correlation matrix. The current code runs fine using two 'for' loops, but it takes a fair amount of time. Is there any way to optimize the code to reduce the run time?
corp here is the number of obligors or counterparties (1000 in this case).
Edit: Trying to explain the logic a little better here
So there is a data frame df that contains borrower details (Client_ID, Sector, Exposure, Probability of default (PD)). For example:
Client_ID runs 1, 2, 3, 4, 5 up to 1000
Sector ID can be anything from 1 to 21
Exposure is any amount from 0 to 99999999
I need to identify all Client_ID pairs, multiply their exposures, and, here is where it gets complicated, also multiply by their correlation value. This correlation value is stored in a separate 21x21 matrix and is chosen on the basis of the sectors to which the two clients belong. What is the most efficient way to do this? The code below does the job, but it takes a while, and I feel there must be an easier way: merge? expand.grid?
# requires pmvnorm() from the mvtnorm package
library(mvtnorm)

RC_UL <- function(N_EAD, N_PD, N_LGD, N_LGD_VAR, N_CO_FAC_LOAD, V_SECTOR_ID, N_DFLT_PT) {
  # rho is the correlation matrix across sectors (not shown in this snippet):
  # a 21x21 matrix corresponding to the 21 sectors the counterparties belong to.
  # N_CO_FAC_LOAD is an input that is specific to a sector.
  # corp (the number of counterparties, 1000) and rho are taken from the calling environment.
  UL <- numeric(corp)
  N_RC_UL <- numeric(corp)
  for (i in 1:corp) {
    for (j in 1:corp) {
      if (i == j) {
        # a borrower's correlation with himself is always 1
        sigma.ijk <- cbind(c(1, 1), c(1, 1))
      } else {
        rho.ij <- N_CO_FAC_LOAD[i] * N_CO_FAC_LOAD[j] * rho[V_SECTOR_ID[i], V_SECTOR_ID[j]]
        sigma.ijk <- cbind(c(1, rho.ij), c(rho.ij, 1))
      }
      # UL contribution of the (i, j) pair, accumulated on borrower i
      UL[i] <- UL[i] + N_LGD[i] * N_LGD[j] *
        N_EAD[i] * N_EAD[j] *
        (pmvnorm(upper = c(N_DFLT_PT[i], N_DFLT_PT[j]), mean = c(0, 0), sigma = sigma.ijk) -
           N_PD[i] * N_PD[j])
    }
  }
  for (i in 1:corp) {
    N_RC_UL[i] <- UL[i] / sqrt(sum(UL))
  }
  return(N_RC_UL)
}
b <- RC_UL(EAD, PD, LGD, N_LGD_VAR, fac_weight_s, sector, def_pt1)
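
As for speeding it up: below is a sketch of one possible restructuring, under the assumption that rho and corp are exactly as above. The sector-correlation lookup and the LGD x EAD products can be vectorised in one shot, and because the pairwise term is symmetric in i and j, pmvnorm() only needs to be evaluated once per unordered pair, roughly halving the number of expensive calls (which remain the bottleneck). This is an untested sketch of the idea, not a drop-in replacement:
library(mvtnorm)

RC_UL_fast <- function(N_EAD, N_PD, N_LGD, N_CO_FAC_LOAD, V_SECTOR_ID, N_DFLT_PT) {
  corp <- length(N_EAD)
  # pairwise correlations for all (i, j) at once via matrix indexing of rho
  cor_mat <- outer(N_CO_FAC_LOAD, N_CO_FAC_LOAD) * rho[V_SECTOR_ID, V_SECTOR_ID]
  diag(cor_mat) <- 1                      # self-correlation is always 1
  w <- N_LGD * N_EAD                      # per-borrower weight
  weight_mat <- outer(w, w)               # all pairwise LGD * EAD products
  UL <- numeric(corp)
  for (i in 1:corp) {
    for (j in i:corp) {                   # upper triangle only
      sigma.ijk <- matrix(c(1, cor_mat[i, j], cor_mat[i, j], 1), 2, 2)
      term <- weight_mat[i, j] *
        (pmvnorm(upper = c(N_DFLT_PT[i], N_DFLT_PT[j]),
                 mean = c(0, 0), sigma = sigma.ijk) - N_PD[i] * N_PD[j])
      UL[i] <- UL[i] + term
      if (i != j) UL[j] <- UL[j] + term   # symmetric contribution to borrower j
    }
  }
  UL / sqrt(sum(UL))
}
If that is still too slow, the remaining options are to parallelise the outer loop or to swap pmvnorm() for a faster bivariate normal CDF (e.g. the pbivnorm package), since everything else is already vectorised.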

Related

Why are simulated stock returns re-scaled and re-centered in the “pbo” vignette in the pbo (probability of backtest overfitting) package in R?

Here's the relevant code from the vignette, altered slightly to fit on the page here and to make it easy to reproduce. Code for visualizations is omitted. Comments are from the vignette author.
(Full vignette: https://cran.r-project.org/web/packages/pbo/vignettes/pbo.html)
library(pbo)
# First, we assemble the trials into an NxT matrix where each column
# represents a trial and each trial has the same length T. This example
# is random data so the backtest should be overfit.
set.seed(765)
n <- 100
t <- 2400
m <- data.frame(matrix(rnorm(n*t), nrow=t, ncol=n,
                       dimnames=list(1:t, 1:n)), check.names=FALSE)
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
  m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
  m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
# We can use any performance evaluation function that can work with the
# reassembled sub-matrices during the cross validation iterations.
# Following the original paper we can use the Sharpe ratio as
sharpe <- function(x, rf=0.03/252) {
  sr <- apply(x, 2, function(col) {
    er = col - rf
    return(mean(er)/sd(er))
  })
  return(sr)
}
# Now that we have the trials matrix we can pass it to the pbo function
# for analysis.
my_pbo <- pbo(m, s=8, f=sharpe, threshold=0)
summary(my_pbo)
Here's the portion I'm curious about:
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
  m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
  m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
Why is the data transformed within the for loop, and does this kind of re-scaling and re-centering need to be done with real returns? Or is this just something the author is doing to make his simulated returns look more like the real thing?
Googling and searching through Stack Overflow turned up some articles and posts regarding scaling volatility to the square root of time, but this doesn't look quite like what I've seen. Usually they involve multiplying some short-term (i.e. daily) measure of volatility by the root of time, but this isn't quite that. Also, the documentation for the package doesn't include this chunk of re-scaling and re-centering code. Documentation: https://cran.r-project.org/web/packages/pbo/pbo.pdf
So:
1. Why is the data transformed in this way / what is the result of this transformation?
2. Is it only necessary for this simulated data, or do I need to similarly transform real returns?
I posted this question on the r-help mailing list and got the following answer:
"Hi Joe,
The centering and re-scaling is done for the purposes of his example, and
also to be consistent with his definition of the sharpe function.
In particular, note that the sharpe function has the rf (riskfree)
parameter with a default value of .03/252 i.e. an ANNUAL 3% rate converted
to a DAILY rate, expressed in decimal.
That means that the other argument to this function, x, should be DAILY
returns, expressed in decimal.
Suppose he wanted to create random data from a distribution of returns with
ANNUAL mean MU_A and ANNUAL std deviation SIGMA_A, both stated in decimal.
The equivalent DAILY returns would have mean MU_D = MU_A / 252 and standard
deviation SIGMA_D = SIGMA_A/SQRT(252).
He calls MU_D by the name mu_base and SIGMA_D by the name sigma_base.
His loop now converts the random numbers in his matrix so that each column
has mean MU_D and std deviation SIGMA_D.
HTH,
Eric"
I followed up with this:
"If I'm understanding correctly, if I’m wanting to use actual returns from backtests rather than simulated returns, I would need to make sure my risk-adjusted return measure, sharpe ratio in this case, matches up in scale with my returns (i.e. daily returns with daily sharpe, monthly with monthly, etc). And I wouldn’t need to transform returns like the simulated returns are in the vignette, as the real returns are going to have whatever properties they have (meaning they will have whatever average and std dev they happen to have). Is that correct?"
I was told this was correct.
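For reference, here is a minimal check (my own addition, not part of the vignette) that the loop really does leave every column with the target daily moments:
# after the re-scale / re-center loop every column of m has sd == sigma_base
# and mean == mu_base (here 0), i.e. daily-scale simulated returns
all.equal(unname(apply(m, 2, sd)), rep(sigma_base, n))   # TRUE
all.equal(unname(apply(m, 2, mean)), rep(mu_base, n))    # TRUE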

R for loop with panel data for z-score calculation

I am currently working on creating some functions in RStudio with a dataset on roughly 100,000 individuals that are observed from 2005 to 2013. I have an unbalanced panel with two variables of interest; let's call them x and y for the sake of simplicity.
The function I am specifying takes the form of:
z = (mean(x) + mean(y)) / sd(x)
As you can see, it is a z-score-style measure that is often used as a normalisation technique during the pre-processing stage of model estimation.
The goal of specifying the function is to compute z for each individual i in the dataset whilst taking into account that there are different periods T = 1,2...,t observed for the different individuals. In other words, in some cases I have data from 2008-2013, and for others I have data from say 2006-2010.
At the moment I have specified my function as follows:
z1 <- function(x,y) {
(mean(x) + mean(y))/sd(x)
}
when I execute it as:
z1(x,y)
I only get one number as an output representing the calculation from the total number of observations (about 150,000 rows). How should I edit my code to make sure I get one number for each individual in my dataset?
I am assuming that I must use a for loop that iterates and computes the z score for one individual at the time, but I am not sure how to specify this when writing my function.
It's returning a single value because the mean(x), mean(y) and sd(x) are all numeric values and you're not asking it to do anything else.
The following code simulates two vectors and does what (I think) you want. It would help, though, if you were more descriptive about your task.
x <- rbinom(100, 3, (2/5))
y <- rpois(100, 2.5)
f <- function(mvL, mvR) {
  answer <- NULL
  vector <- readline('Which?: ')
  if (vector == 'Left') {
    for (i in 1:length(mvL)) {
      answer[i] <- mvL[i] - ((mean(mvL) + mean(mvR)) / sd(mvL))
    }
  } else {
    for (i in 1:length(mvR)) {
      answer[i] <- mvR[i] - ((mean(mvL) + mean(mvR)) / sd(mvL))
    }
  }
  return(answer)
}
f(x, y)
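A grouped calculation gives one z per individual without an explicit loop. A minimal sketch, assuming the panel is a data frame called df with an individual identifier column id and the two variables x and y (these names are assumptions, not from the question):
# split the panel by individual, then apply the formula to each piece
z_by_id <- sapply(split(df[, c("x", "y")], df$id),
                  function(d) (mean(d$x) + mean(d$y)) / sd(d$x))
head(z_by_id)   # a named vector: one z-score per individual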

Plot a table of binomial distributions in R

For a game design issue, I need to better inspect binomial distributions. Using R, I need to build a two-dimensional table that, given fixed parameters 'pool' (the number of dice rolled) and 'sides' (the number of sides of the die), has:
In rows --> the minimum roll for a success (ranging from 0 to sides; it's a discrete distribution)
In columns --> the number of successes (ranging from 0 to pool)
I know how to calculate it as a single task, but I'm not sure how to iterate to fill the entire table.
EDIT: I forgot to say that I want to calculate the probability p of gaining at least that number of successes.
OK, I think this could be a simple solution. It has the ratio of successes on rows and the success thresholds on the dice roll (p) on columns.
poolDistribution <- function(n, sides = 10, digits = 2, roll.Under = FALSE) {
  m <- 1:sides
  names(m) <- paste(m, ifelse(roll.Under, "-", "+"), sep = "")
  s <- 1:n
  names(s) <- paste(s, n, sep = "/")
  sapply(m, function(m.value)
    round((if (roll.Under) (1 - pbinom(s - 1, n, m.value / sides)) * 100
           else (1 - pbinom(s - 1, n, (sides - m.value + 1) / sides)) * 100),
          digits = digits))
}
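For example, a call like the following (my own illustration) gives a 5 x 10 table: rows are the number of successes out of 5 dice, columns are the success thresholds 1+ to 10+, and the entries are the probabilities, in percent, of getting at least that many successes.
poolDistribution(5)              # 5 ten-sided dice
poolDistribution(5, sides = 6)   # the same table for six-sided dice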
This gets you half of the way.
If you are new to R, you might not realise that a very powerful feature is the ability to use a vector of values as an index into another vector. This makes part of the problem trivially easy:
pool <- 3
sides <- 20 # <cough>D&D<cough>
# you need to store the values somewhere, use a vector
NumberOfRollsPerSide <- rep(0, sides)
names(NumberOfRollsPerSide) <- 1:sides # this will be useful in table
## Repeat so long as there are still zeros,
## i.e. so long as there is a side that has not come up yet
while (any(NumberOfRollsPerSide == 0)) {
  # roll once
  oneRoll <- sample(1:sides, pool, TRUE)
  # add (+1) to each side's total rolls
  # note that you can use the roll outcome to index the vector. R is great.
  NumberOfRollsPerSide[oneRoll] <- NumberOfRollsPerSide[oneRoll] + 1
}
# These are your results:
NumberOfRollsPerSide
All you have left to do now is count, for each side, in which roll number it first came up.

Loop structure for basic simulation model in R

I'm trying to write a basic model that simulates the growth of a population (whose initial size is drawn randomly from a normal distribution) and then grows by a user-defined amount each 'year' (5 individuals in the code below, for argument's sake). The output only shows the results of one simulation, and within this simulation the population hasn't grown at all, i.e. each 'year' doesn't add to the previous year's population. I'm assuming that I've stuffed something up in the loop structure and am keen for any advice!
n.years <- 3
n.sim <- 5
store.growth <- matrix(ncol = 3, nrow = (n.years * n.sim))
for (i in 1:n.sim) {
  init.pop.size <- rnorm(1, 100, 10)
  for (j in 1:n.years) {
    # grow population
    grow.pop <- init.pop.size + 5
    store.growth[j, ] <- cbind(grow.pop, n.years, n.sim)
  }
}
store.growth
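Two things seem to produce the symptoms described: grow.pop is recomputed from init.pop.size every year instead of from the previous year's size, and store.growth[j, ] only ever writes to rows 1 to n.years, so each simulation overwrites the previous one. A sketch of a corrected loop (my own suggestion; it also stores the year and simulation index instead of the constants n.years and n.sim, which is an assumption about what was intended):
store.growth <- matrix(ncol = 3, nrow = n.years * n.sim,
                       dimnames = list(NULL, c("pop.size", "year", "sim")))
for (i in 1:n.sim) {
  pop.size <- rnorm(1, 100, 10)            # initial size for this simulation
  for (j in 1:n.years) {
    pop.size <- pop.size + 5               # grow from last year's size
    row <- (i - 1) * n.years + j           # a unique row per (simulation, year)
    store.growth[row, ] <- c(pop.size, j, i)
  }
}
store.growth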

Fast loan rate calculation for a big number of loans

I have a big data set (around 200k rows) where each row is a loan. I have the loan amount, the number of payments, and the loan payment.
I'm trying to get the loan rate.
R doesn't have a function for calculating this (at least base R doesn't have it, and I couldn't find it).
It isn't that hard to write both NPV and IRR functions:
Npv <- function(i, cf, t = seq(from = 0, by = 1, along.with = cf)) sum(cf / (1 + i)^t)
Irr <- function(cf) { uniroot(Npv, c(0, 100000), cf = cf)$root }
And you can just do
rate = Irr(c(amt,rep(pmt,times=n)))
The problem is when you try to calculate the rate for a lot of payments. Because uniroot is not vectorized, and because rep takes a surprising amount of time, you end up with a slow calculation. You can make it faster if you do some math and figure out that you are looking for the roots of the following equation
zerome <- function(r) amt/pmt-(1-1/(1+r)^n)/r
and then use that as input for uniroot. This, on my PC, takes around 20 seconds to run for my 200k database.
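Wired up per loan, that looks roughly like the following (my reading of the setup; amt, pmt and n are assumed to be vectors with one element per loan, and the search interval assumes per-period rates between roughly 0 and 100%):
# solve the zerome() equation once per loan; still one uniroot() call per row
rate_one <- function(amt, pmt, n) {
  uniroot(function(r) amt / pmt - (1 - 1 / (1 + r)^n) / r,
          interval = c(1e-6, 1))$root
}
rates <- mapply(rate_one, amt, pmt, n)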
The problem is that I'm trying to do some optimization, and this is a step of the optimization, so I'm trying to speed it up even more.
I've tried vectorization, but because uniroot is not vectorized, I can't go further that way. Is there any root finding method that is vectorized?
Thanks
Instead of using a root finder, you could use a linear interpolator. You will have to create one interpolator for each value of n (the number of remaining payments). Each interpolator will map (1-1/(1+r)^n)/r to r. Of course you will have to build a grid fine enough so it will return r to an acceptable precision level. The nice thing with this approach is that linear interpolators are fast and vectorized: you can find the rates for all loans with the same number of remaining payments (n) in a single call to the corresponding interpolator.
Now some code that proves it is a viable solution:
First, we create interpolators, one for each possible value of n:
n.max <- 360L # 30 years
one.interpolator <- function(n) {
  r <- seq(from = 0.0001, to = 0.1500, by = 0.0001)
  y <- (1 - 1/(1+r)^n)/r
  approxfun(y, r)
}
interpolators <- lapply(seq_len(n.max), one.interpolator)
Note that I used a precision of 1/100 of a percent (1bp).
Then we create some fake data:
n.loans <- 200000L
n <- sample(n.max, n.loans, replace = TRUE)
amt <- 1000 * sample(100:500, n.loans, replace = TRUE)
pmt <- amt / (n * (1 - runif(n.loans)))
loans <- data.frame(n, amt, pmt)
Finally, we solve for r:
library(plyr)
system.time(ddply(loans, "n", transform, r = interpolators[[n[1]]](amt / pmt)))
# user system elapsed
# 2.684 0.423 3.084
It's fast. Note that some of the output rates are NA but it is because my random inputs made no sense and would have returned rates outside of the [0 ~ 15%] grid I selected. Your real data won't have that problem.
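As a quick spot check (my own addition, not part of the answer), a loan constructed with a known rate recovers that rate from the interpolator to within the 1bp grid:
# 100,000 repaid over 360 periods at a true rate of 0.5% per period
amt.chk <- 100000; n.chk <- 360L; r.true <- 0.005
pmt.chk <- amt.chk * r.true / (1 - (1 + r.true)^(-n.chk))
interpolators[[n.chk]](amt.chk / pmt.chk)   # ~0.005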
