Fast loan rate calculation for a big number of loans - r

I have a big data set (around 200k rows) where each row is a loan. I have the loan amount, the number of payments, and the loan payment.
I'm trying to get the loan rate.
R doesn't have a function for calculating this (at least base R doesn't have it, and I couldn't find it).
It isn't that hard to write both an NPV and an IRR function:
Npv <- function(i, cf, t=seq(from=0,by=1,along.with=cf)) sum(cf/(1+i)^t)
Irr <- function(cf) { uniroot(Npv, c(0,100000), cf=cf)$root }
And you can just do
rate = Irr(c(-amt,rep(pmt,times=n)))
The problem is when you try to calculate the rate for a lot of payments. Because uniroot is not vectorized, and because rep takes a surprising amount of time, you end up with a slow calculation. You can make it faster if you do some math and figure out that you are looking for the roots of the following equation
zerome <- function(r) amt/pmt-(1-1/(1+r)^n)/r
and then use that as input for uniroot. On my PC, this takes around 20 seconds to run for my 200k-row data set.
The problem is that I'm trying to do some optimization, and this is a step of the optimization, so I'm trying to speed it up even more.
I've tried vectorization, but because uniroot is not vectorized, I can't go further that way. Is there any root finding method that is vectorized?
Thanks

Instead of using a root finder, you could use a linear interpolator. You will have to create one interpolator for each value of n (the number of remaining payments). Each interpolator will map (1-1/(1+r)^n)/r to r. Of course you will have to build a grid fine enough so it will return r to an acceptable precision level. The nice thing with this approach is that linear interpolators are fast and vectorized: you can find the rates for all loans with the same number of remaining payments (n) in a single call to the corresponding interpolator.
Now some code that proves it is a viable solution:
First, we create interpolators, one for each possible value of n:
n.max <- 360L # 30 years
one.interpolator <- function(n) {
  r <- seq(from = 0.0001, to = 0.1500, by = 0.0001)
  y <- (1-1/(1+r)^n)/r
  approxfun(y, r)
}
interpolators <- lapply(seq_len(n.max), one.interpolator)
Note that I used a precision of 1/100 of a percent (1bp).
Then we create some fake data:
n.loans <- 200000L
n <- sample(n.max, n.loans, replace = TRUE)
amt <- 1000 * sample(100:500, n.loans, replace = TRUE)
pmt <- amt / (n * (1 - runif(n.loans)))
loans <- data.frame(n, amt, pmt)
Finally, we solve for r:
library(plyr)
system.time(ddply(loans, "n", transform, r = interpolators[[n[1]]](amt / pmt)))
# user system elapsed
# 2.684 0.423 3.084
It's fast. Note that some of the output rates are NA, but that is because my random inputs made no sense and would have returned rates outside the grid I selected (0.01% to 15%). Your real data won't have that problem.
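As an alternative to the interpolation grid, the question's request for a vectorized root finder could be met with a bisection that updates every loan's bracket at once. This is only a minimal sketch: the names annuity_factor and solve_rates, the bracket (1e-9 to 1 per period) and the 50 iterations are assumptions chosen for illustration, not part of the original answer. It relies on the fact that (1-1/(1+r)^n)/r decreases as r increases.

```r
# Present value of n payments of 1 at per-period rate r.
# expm1/log1p keep the result accurate when r is very small.
annuity_factor <- function(r, n) -expm1(-n * log1p(r)) / r

# Vectorized bisection: all loans are bisected simultaneously.
# lower, upper and iters are illustrative defaults (assumptions).
solve_rates <- function(amt, pmt, n, lower = 1e-9, upper = 1, iters = 50L) {
  target <- amt / pmt
  lo <- rep(lower, length(target))
  hi <- rep(upper, length(target))
  for (k in seq_len(iters)) {
    mid <- (lo + hi) / 2
    # The factor decreases in r, so a factor above the target means r is too low
    too_low <- annuity_factor(mid, n) > target
    lo[too_low] <- mid[too_low]
    hi[!too_low] <- mid[!too_low]
  }
  r <- (lo + hi) / 2
  # Loans whose rate lies outside the bracket get NA
  outside <- target >= annuity_factor(lower, n) | target <= annuity_factor(upper, n)
  r[outside] <- NA
  r
}

# Example call with made-up loans; the result is a per-period rate
amt <- c(100000, 250000)
n <- c(360L, 120L)
pmt <- c(600, 2800)
rates <- solve_rates(amt, pmt, n)
```

Each pass costs one vectorized evaluation over all loans, so the total work is iters evaluations regardless of how many distinct values of n there are.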

Related

Simulating a process n times in R

I've written an R script (sourced from here) simulating the path of a geometric Brownian motion of a stock price, and I need the simulation to run 1000 times so that I generate 1000 paths of the process U_t = S_t e^(-mu*t), by discretizing the law of motion derived from U_t, which is the bottom line of the solution to the question posted here.
The process also has n = 252 steps and a discretization step of 1/252, a volatility (risk) of sigma = 0.4, and an instantaneous drift mu, which I've treated as zero, although I'm not sure about this. I'm struggling to simulate 1000 paths of the process but am able to generate one single path. I'm unsure which variables I need to change or whether there's an issue in my for loop that's restricting me from generating all 1000 paths. Could it also be that the script is simulating each individual point for 252 realizations instead of simulating the full process? If so, would this restrict me from generating all 1000 paths? Is it also possible that the array I'm generating, defined as U, hasn't been correctly generated? U[0] must equal 1 and so too must the first realization U(1) = 1. The code is below. I'm pretty stuck trying to figure this out, so any help is appreciated.
#Simulating Geometric Brownian motion (GBM)
tau <- 1 #time to expiry
N <- 253 #number of sub intervals
dt <- tau/N #length of each time sub interval
time <- seq(from=0, to=N, by=dt) #time moments in which we simulate the process
length(time) #it should be N+1
mu <- 0 #GBM parameter 1
sigma <- 0.4 #GBM parameter 2
s0 <- 1 #GBM parameter 3
#simulate Geometric Brownian motion path
dwt <- rnorm(N, mean = 0, sd = 1) #standard normal sample of N elements
dW <- dwt*sqrt(dt) #Brownian motion increments
W <- c(0, cumsum(dW)) #Brownian motion at each time instant N+1 elements
#Define U Array and set initial values of U
U <- array(0, c(N,1)) #array of U
U[0] = 1
U[1] <- s0 #first element of U is s0. with the for loop we find the other N elements
for(i in 2:length(U)){
U[i] <- (U[1]*exp(mu - 0.5*sigma^2*i*dt + sigma*W[i-1]))*exp(-mu*i)
}
#Plot
plot(ts(U), main = expression(paste("Simulation of Ut")))
This question is quite difficult to answer since there are a lot of unclear things, at least to me.
To begin with, length(time) is equal to 64010, not N + 1, which would be 254. That is because seq() runs from 0 to N in steps of dt, rather than from 0 to tau.
If I understand correctly, the Brownian motion function returns the position in one dimension at a given time. Hence, to calculate this position for each time, the following can be enough (the standard deviation of Brownian motion at time t is sqrt(t)):
s0*exp((mu - 0.5*sigma^2)*time + sigma*rnorm(length(time),0,sqrt(time)))
However, this calculates 64010 points, not 253. If you replicate it 1000 times, it gives 64010000 points, which is quite a lot.
> B <- 1000
> res <- replicate(B, {
+ s0*exp((mu - 0.5*sigma^2)*time + sigma*rnorm(length(time),0,sqrt(time)))
+ })
> length(res)
[1] 64010000
> dim(res)
[1] 64010 1000
I know I'm missing the second part, the one explained here, but I actually don't fully understand what you need there. If you can draw the formula maybe I can help you.
In general, avoid programming in R using for loops to iterate over vectors. R is a vectorized language, and there is no need for that. If you want to run the same code B times, the replicate(B, { your code }) function is your friend.
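A minimal sketch of how the 1000 full paths asked for in the question could be generated without a for loop. It assumes the intended grid is 252 steps of size 1/252 on [0, 1], and that U_t = S_t e^(-mu*t) with S_t a geometric Brownian motion. Each path is built by cumulatively summing Brownian increments, so consecutive points of a path are correlated. The names n_steps, n_paths and times, and the seed, are chosen here for illustration only.

```r
set.seed(1)            # arbitrary seed, only for reproducibility
n_steps <- 252L        # number of time steps (assumed from the question)
n_paths <- 1000L       # number of simulated paths
dt <- 1 / n_steps      # discretization step
mu <- 0                # drift, treated as zero as in the question
sigma <- 0.4           # volatility
s0 <- 1                # initial value
times <- (0:n_steps) * dt   # n_steps + 1 time points from 0 to 1

# Brownian increments: one column per path, one row per step
dW <- matrix(rnorm(n_steps * n_paths, mean = 0, sd = sqrt(dt)), nrow = n_steps)

# Brownian motion: cumulative sums down each column, starting at 0
W <- rbind(0, apply(dW, 2, cumsum))

# GBM paths; times is recycled down each column of W
S <- s0 * exp((mu - 0.5 * sigma^2) * times + sigma * W)

# Discounted process U_t = S_t * exp(-mu * t); the first row is all s0
U <- S * exp(-mu * times)
```

U then has one row per time point and one column per path, so something like matplot(times, U[, 1:20], type = "l", lty = 1) would plot the first 20 paths.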

How to find the probability of extinction = 1 using Galton-Watson process in R?

I am simulating a basic Galton-Watson process (GWP) using a geometric distribution. I'm using this to find the probability of extinction for each generation. My question is, how do I find the generation at which the probability of extinction is equal to 1?
For example, I can create a function for the GWP like so:
# Galton-Watson Process for geometric distribution
GWP <- function(n, p) {
Sn <- c(1, rep(0, n))
for (i in 2:(n + 1)) {
Sn[i] <- sum(rgeom(Sn[i - 1], p))
}
return(Sn)
}
where, n is the number of generations.
Then, if I set the geometric distribution parameter p = 0.25... then to calculate the probability of extinction for, say, generation 10, I just do this:
N <- 10 # Number of elements in the initial population.
GWn <- replicate(N, GWP(10, 0.25)[10])
probExtinction <- sum(GWn==0)/N
probExtinction
This will give me the probability of extinction for generation 10... to find the probability of extinction for each generation I have to change the index value (to the corresponding generation number) when creating GWn... But what I'm trying to do is find at which generation will the probability of extinction = 1.
Any suggestions as to how I might go about solving this problem?
I can tell you how you would do this problem in principle, but I'm going to suggest that you may run into some difficulties (if you already know everything I'm about to say, just take it as advice to the next reader ...):
- Theoretically, the Galton-Watson process extinction probability never goes exactly to 1 (unless prob==1, or in the infinite-time limit).
- Of course, for any given replicate and random-number seed you can compute the first time point (if any) at which all of your lineages have gone extinct. This will be highly variable across runs, depending on the random-number seed ...
- The distribution of extinction times is extremely skewed; lineages that don't go extinct immediately will last a loooong time ...
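To illustrate the first point, here is a minimal sketch of the theoretical extinction probability per generation. It assumes the parameterization used by rgeom (number of failures before the first success), for which the offspring probability generating function is G(s) = p/(1-(1-p)s), and uses the standard recursion q_k = G(q_(k-1)) with q_0 = 0. The function name extinction_by_generation is hypothetical.

```r
# Probability that a lineage started by one individual is extinct
# by generation 0, 1, ..., n_gen, for geometric offspring with parameter p
extinction_by_generation <- function(p, n_gen) {
  pgf <- function(s) p / (1 - (1 - p) * s)   # offspring generating function
  q <- numeric(n_gen + 1)                    # q[1] is generation 0, where q = 0
  for (k in seq_len(n_gen)) {
    q[k + 1] <- pgf(q[k])                    # q_k = G(q_{k-1})
  }
  q
}

q <- extinction_by_generation(p = 0.5, n_gen = 100)
max(q) < 1   # the probability approaches 1 but does not reach it
```

For p = 0.5 the recursion gives q_k = k/(k+1), which gets arbitrarily close to 1 but never equals it at any finite generation.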
I modified your GWP function in two ways to make it more efficient: (1) stop the simulation when the lineage goes extinct; (2) replace the sum of geometric deviates with a single negative binomial deviate (see here)
GWP <- function(n, p) {
  Sn <- c(1, rep(0, n))
  for (i in 2:(n + 1)) {
    Sn[i] <- rnbinom(1, size=Sn[i - 1], prob=p)
    if (Sn[i]==0) break ## extinct, bail out
  }
  return(Sn)
}
The basic strategy now is: (1) run the simulations for a while, keep the entire trajectory; (2) compute extinction probability in every generation; (3) find the first generation such that p==1.
set.seed(101)
N <- 10 # Number of elements in the initial population.
maxgen <- 100
GWn <- replicate(N, GWP(maxgen, 0.5), simplify="array")
probExtinction <- rowSums(GWn==0)/N
which(probExtinction==1)[1]
(Subtract 1 from the last result if you want to start indexing from generation 0.) In this case the answer is NA, because 1 of the 10 lineages manages to stay alive (and indeed gets very large, so it will probably persist almost forever).
plot(0:maxgen, probExtinction, type="s") ## plot extinction probability
matplot(1+GWn,type="l",lty=1,col=1,log="y") ## plot lineage sizes (log(1+x) scale)
## demonstration that sum(rgeom(n, ...)) is equivalent to rnbinom(1, size=n, ...)
nmax <- 70
plot(prop.table(table(replicate(10000, sum(rgeom(10, prob=0.3))))),
xlim=c(0,nmax))
points(0:nmax,dnbinom(0:nmax, size=10, prob=0.3), col=2,pch=16)

Data perturbation - How to perform it?

I am doing some projects related to statistical simulation using R, based on "Introduction to Scientific Programming and Simulation Using R". In the student projects section (chapter 24) I am doing the "The pipe spiders of Brunswick" problem, but I am stuck on one part of an evolutionary algorithm, where you need to perform some data perturbation according to the sentence below:
"With probability 0.5 each element of the vector is perturbed, independently
of the others, by an amount normally distributed with mean 0 and standard
deviation 0.1"
What does being "perturbed" really mean here? I don't really know which operation I should perform on my vector to make this perturbation happen, and I'm not finding any answers to this problem.
Thanks in advance!
# using the most important features, we create a ML model:
m1 <- lm(PREDICTED_VALUE ~ PREDICTER_1 + PREDICTER_2 + PREDICTER_N )
#summary(m1)
#anova(m1)
# after creating the model, we perturb as follows:
#install.packages("perturb") #install the package
library(perturb)
set.seed(1234) # for same results each time you run the code
p1_new <- perturb(m1, pvars=c("PREDICTER_1","PREDICTER_N") , prange = c(1,1),niter=200) # you can change the number of iterations to any value n; the total number of iterations would come to n+1
p1_new # check the values of p1
summary(p1_new)
Perturbing just means adding a small, noisy shift to a number. Your code might look something like this.
x = sample(10, 10)
ind = rbinom(length(x), 1, 0.5) == 1
x[ind] = x[ind] + rnorm(sum(ind), 0, 0.1)
rbinom gets the elements to be modified with probability 0.5 and rnorm adds the perturbation.

Why are simulated stock returns re-scaled and re-centered in the “pbo” vignette in the pbo (probability of backtest overfitting) package in R?

Here's the relevant code from the vignette, altered slightly to fit it on the page here, and make it easy to reproduce. Code for visualizations omitted. Comments are from vignette author.
(Full vignette: https://cran.r-project.org/web/packages/pbo/vignettes/pbo.html)
library(pbo)
#First, we assemble the trials into an NxT matrix where each column
#represents a trial and each trial has the same length T. This example
#is random data so the backtest should be overfit.
set.seed(765)
n <- 100
t <- 2400
m <- data.frame(matrix(rnorm(n*t),nrow=t,ncol=n,
dimnames=list(1:t,1:n)), check.names=FALSE)
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
#We can use any performance evaluation function that can work with the
#reassembled sub-matrices during the cross validation iterations.
#Following the original paper we can use the Sharpe ratio as
sharpe <- function(x,rf=0.03/252) {
sr <- apply(x,2,function(col) {
er = col - rf
return(mean(er)/sd(er))
})
return(sr)
}
#Now that we have the trials matrix we can pass it to the pbo function
#for analysis.
my_pbo <- pbo(m,s=8,f=sharpe,threshold=0)
summary(my_pbo)
Here's the portion I'm curious about:
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
Why is the data transformed within the for loop, and does this kind of re-scaling and re-centering need to be done with real returns? Or is this just something the author is doing to make his simulated returns look more like the real thing?
Googling and searching through stackoverflow turned up some articles and posts regarding scaling volatility to the square root of time, but this doesn't look quite like what I've seen. Usually they involve multiplying some short term (i.e. daily) measure of volatility by the root of time, but this isn't quite that. Also, the documentation for the package doesn't include this chunk of re-scaling and re-centering code. Documentation: https://cran.r-project.org/web/packages/pbo/pbo.pdf
So:
1. Why is the data transformed in this way, and what is the result of this transformation?
2. Is it only necessary for this simulated data, or do I need to similarly transform real returns?
I posted this question on the r-help mailing list and got the following answer:
"Hi Joe,
The centering and re-scaling is done for the purposes of his example, and
also to be consistent with his definition of the sharpe function.
In particular, note that the sharpe function has the rf (riskfree)
parameter with a default value of .03/252 i.e. an ANNUAL 3% rate converted
to a DAILY rate, expressed in decimal.
That means that the other argument to this function, x, should be DAILY
returns, expressed in decimal.
Suppose he wanted to create random data from a distribution of returns with
ANNUAL mean MU_A and ANNUAL std deviation SIGMA_A, both stated in decimal.
The equivalent DAILY returns would have mean MU_D = MU_A / 252 and standard
deviation SIGMA_D = SIGMA_A/SQRT(252).
He calls MU_D by the name mu_base and SIGMA_D by the name sigma_base.
His loop now converts the random numbers in his matrix so that each column
has mean MU_D and std deviation SIGMA_D.
HTH,
Eric"
I followed up with this:
"If I'm understanding correctly, if I’m wanting to use actual returns from backtests rather than simulated returns, I would need to make sure my risk-adjusted return measure, sharpe ratio in this case, matches up in scale with my returns (i.e. daily returns with daily sharpe, monthly with monthly, etc). And I wouldn’t need to transform returns like the simulated returns are in the vignette, as the real returns are going to have whatever properties they have (meaning they will have whatever average and std dev they happen to have). Is that correct?"
I was told this was correct.
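The effect of the re-scaling and re-centering loop can be checked directly. The sketch below assumes a plain numeric matrix rather than the vignette's data frame, uses a smaller number of columns, and borrows the names mu_d and sigma_d from the quoted reply. It compares the loop with base R's scale(), which standardizes each column to mean 0 and standard deviation 1.

```r
set.seed(765)
n <- 5                       # number of trials (smaller than the vignette's 100)
t <- 2400                    # observations per trial
m <- matrix(rnorm(n * t), nrow = t, ncol = n)

mu_d <- 0 / 252              # target daily mean (annual mean of 0)
sigma_d <- 1 / sqrt(252)     # target daily sd (annual sd of 1)

# Loop as in the vignette: re-scale, then re-center, column by column
m_loop <- m
for (i in 1:n) {
  m_loop[, i] <- m_loop[, i] * sigma_d / sd(m_loop[, i])
  m_loop[, i] <- m_loop[, i] + mu_d - mean(m_loop[, i])
}

# Same transformation in one step: standardize, then set the target sd and mean
m2 <- scale(m) * sigma_d + mu_d

colMeans(m_loop)             # each column mean is mu_d (up to rounding)
apply(m_loop, 2, sd)         # each column sd is sigma_d (up to rounding)
max(abs(m2 - m_loop))        # difference between the two approaches
```

After the transformation every column has exactly the sample mean and sample standard deviation that were requested, which is why it only makes sense for simulated data.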

Plot a table of binomial distributions in R

For a game design issue, I need to better inspect binomial distributions. Using R, I need to build a two-dimensional table that, given the fixed parameters 'pool' (the number of dice rolled) and 'sides' (the number of sides of the die), has:
In rows --> minimum for a success (ranging from 0 to sides, it's a discrete distribution)
In columns --> number of successes (ranging from 0 to pool)
I know how to calculate it as a single task, but I'm not sure on how to iterate to fill the entire table
EDIT: I forgot to say that I want to calculate the probability p of gaining at least the number of successes.
OK, I think this could be a simple solution. It has the ratio of successes in rows and the success thresholds on the die roll (p) in columns.
poolDistribution <- function(n, sides=10, digits=2, roll.Under=FALSE){
  m <- 1:sides
  names(m) <- paste(m,ifelse(roll.Under,"-", "+"),sep="")
  s <- 1:n
  names(s) <- paste(s,n,sep="/")
  # percent chance of at least s successes, for each success threshold m.value
  sapply(m, function(m.value) {
    p <- if(roll.Under) m.value/sides else (sides - m.value + 1)/sides
    round((1 - pbinom(s - 1, n, p))*100, digits=digits)
  })
}
This gets you half of the way.
If you are new to R, you might miss out on the fact that a very powerful feature is that you can use a vector of values as an index to another vector. This makes part of the problem trivially easy:
pool <- 3
sides <- 20 # <cough>D&D<cough>
# you need to store the values somewhere, use a vector
NumberOfRollsPerSide <- rep(0, sides)
names(NumberOfRollsPerSide) <- 1:sides # this will be useful in table
## Repeat so long as there are still zeros
## ie, so long as there is a side that has not come up yet
while (any(NumberOfRollsPerSide == 0)) {
  # roll once
  oneRoll <- sample(1:sides, pool, TRUE)
  # add (+1) to each side's total rolls
  # note that you can use the roll outcome to index the vector. R is great.
  NumberOfRollsPerSide[oneRoll] <- NumberOfRollsPerSide[oneRoll] + 1
}
# These are your results:
NumberOfRollsPerSide
All you have left to do now is count, for each side, in which roll number it first came up.
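For comparison with the simulation, the exact table described in the question (success thresholds in rows, number of successes in columns, probability of at least that many successes in each cell) could be filled in one call with outer() and pbinom(). This is a minimal sketch: the function name success_table is hypothetical, and it assumes a die succeeds when it shows the threshold or higher, with thresholds running from 1 to sides.

```r
success_table <- function(pool, sides) {
  thresholds <- 1:sides                         # minimum face that counts as a success
  successes <- 0:pool                           # number of successes wanted
  p_single <- (sides - thresholds + 1) / sides  # P(one die >= threshold)
  # P(X >= k) for X ~ Binomial(pool, p), computed as P(X > k - 1)
  tab <- outer(p_single, successes,
               function(p, k) pbinom(k - 1, pool, p, lower.tail = FALSE))
  dimnames(tab) <- list(paste0(thresholds, "+"), as.character(successes))
  tab
}

# Example: three six-sided dice, rounded for display
round(success_table(pool = 3, sides = 6), 3)
```

outer() expands every threshold/success combination into vectors and pbinom() is vectorized over them, so no explicit iteration is needed to fill the table.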

Resources