I would like to run a discrete-time simulation (simplified version below). I generate a data frame of population members (one member per row) with their timestamps for entering and exiting a website. I then wish to count at each time interval how many members are on the site.
Currently I am looping through time and at each second counting how many members have entered and not yet exited. (I have also tried destructive iteration by removing exited members at each interval, which takes even longer. I also understand that I can use larger time intervals in the loop.)
How do I use linear algebra to eliminate the for-loop and excess runtime? My current approach does not scale well as population increases, and of course it is linear with respect to duration.
popSize = 10000
simDuration = 10000
enterTimestamp <- rexp(n = popSize, rate = .001)
exitTimestamp <- enterTimestamp + rexp(n = popSize, rate = .001)
popEvents <- data.frame(cbind(enterTimestamp,exitTimestamp))
visitorLoad <- integer(length = simDuration)
for (i in 1:simDuration) {
  visitorLoad[i] <- sum(popEvents$enterTimestamp <= i &
                        popEvents$exitTimestamp > i)
  if (i %% 100 == 0) {
    print(paste('Sim at', i, 'out of', simDuration, 'seconds.', sep = ' '))
  }
}
plot(visitorLoad, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')
You can obtain the counts of visitors entering and exiting at different times and then use the cumulative sum to compute the number of visitors there at a particular time. This seems to meet your requirement of the code running quickly, though it does not use linear algebra.
diffs = rep(0, simDuration+1)
# Store the number of times a visitor enters and exits at each timestep. The table
# will contain headers that are the timesteps and values that are the number of
# people entering or exiting at the timestep.
tabEnter = table(pmax(1, ceiling(enterTimestamp)))
tabExit = table(pmin(simDuration+1, ceiling(exitTimestamp)))
# For each time index, add the number of people entering and subtract the number of
# people exiting. For instance, if in period 20, 3 people entered and 4 exited, then
# diffs[20] equals -1. as.numeric(names(tabEnter)) is the periods for which at least
# one person entered, and tabEnter is the number of people in each of those periods.
diffs[as.numeric(names(tabEnter))] = diffs[as.numeric(names(tabEnter))] + tabEnter
diffs[as.numeric(names(tabExit))] = diffs[as.numeric(names(tabExit))] - tabExit
# cumsum() sums the diffs vector through a particular time point.
visitorLoad2 = head(cumsum(diffs), simDuration)
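If you do want something closer to the literal "linear algebra" in the question, here is a rough matrix-based sketch (my addition, not part of the answer above). It is fully vectorized, but it allocates popSize-by-simDuration logical matrices, so it trades a lot of memory for speed and will not scale to very large inputs:
times <- 1:simDuration
presence <- outer(enterTimestamp, times, "<=") & outer(exitTimestamp, times, ">")
visitorLoad3 <- colSums(presence)  # should match visitorLoad from the original loop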
How about this for simplicity:
vl <- unlist(lapply(1:simDuration, function(i) sum((enterTimestamp <= i) * (exitTimestamp > i))))
plot(vl, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')
It's about twice as fast as the current loop, but if performance is more important then #josilber's solution is better, or maybe something with data.table(); will have a think...
EDIT - how about this for speed:
require(data.table)
require(plyr) # for count() function
system.time({
  enter <- data.table(count(ceiling(enterTimestamp)))  # entries grouped by second
  exit  <- data.table(count(ceiling(exitTimestamp)))   # exits grouped by second
  sim   <- data.table(x = 1:simDuration)               # time index
  merged <- merge(merge(sim, enter, by = "x", all.x = T), exit, by = "x", all.x = T)
  mat <- data.matrix(merged[, list(freq.x, freq.y)])   # make matrix to remove NAs
  mat[is.na(mat)] <- 0    # remove NAs, there are quicker ways but more complicated
  vl <- cumsum(mat[, 1] - mat[, 2])                    # cumsum() to roll up the movements
})
user system elapsed
0.02 0.00 0.02
plot(vl, typ = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')
FURTHER EDIT - balance of performance and simplicity:
system.time(cumsum(data.frame(table(cut(enterTimestamp,0:10000))-table(cut(exitTimestamp,0:10000)))[,2]))
user system elapsed
0.09 0.00 0.10
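As a quick sanity check (my addition, not part of the answer), the one-liner should reproduce the original loop's result exactly:
vl4 <- cumsum(data.frame(table(cut(enterTimestamp, 0:10000)) -
              table(cut(exitTimestamp, 0:10000)))[, 2])
all.equal(as.numeric(visitorLoad), as.numeric(vl4))  # should be TRUE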
Related
I got a question related to probability theory and I tried to solve it by simulating it in R. However, I ran into a problem as the while loop does not seem to break.
The question is asking: How many people are needed such that there is at least a 70% chance that one of them is born on the last day of December?
Here is my code:
prob <- 0
people <- 1
while (prob <= 0.7) {
  people <- people + 1  # start the iteration with 2 people in the room and increase 1 for every iteration
  birthday <- sample(365, size = people, replace = TRUE)
  prob <- length(which(birthday == 365)) / people
}
return(prob)
My guess is that it can never hit 70%, and that is why the while loop never breaks. Am I right? If so, did I interpret the question wrongly?
I did not want to post this on stats.stackexchange.com because I thought this is more related to the code than to the math itself, but I will move it if necessary. Thanks.
This is a case where an analytical solution based on probability is easier and more accurate than trying to simulate. I agree with Harshvardhan that your formulation is solving the wrong problem.
The probability of having at least one person in a pool of n have their birthday on a particular target date is 1-P{all n miss the target date}. This probability is at least 0.7 when P{all n miss the target date} < 0.3. The probability of each individual missing the target is assumed to be P{miss} = 1-1/365 (365 days per year, all birthdates equally likely). If the individual birthdays are independent, then P{all n miss the target date} = P{miss}^n.
I am not an R programmer, but the following Ruby should translate pretty easily:
# Use rationals to avoid cumulative float errors.
# Makes it slower but accurate.
P_MISS_TARGET = 1 - 1/365r
p_all_miss = P_MISS_TARGET
threshold = 3r / 10 # seeking P{all miss target} < 0.3
n = 1
while p_all_miss > threshold
  p_all_miss *= P_MISS_TARGET
  n += 1
end
puts "With #{n} people, the probability all miss is #{p_all_miss.to_f}"
which produces:
With 439 people, the probability all miss is 0.29987476838793214
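For R users, a rough translation of the same calculation (my sketch, not part of the original answer; plain doubles instead of rationals, which is accurate enough here):
p_miss <- 1 - 1/365
p_all_miss <- p_miss
n <- 1
while (p_all_miss > 0.3) {
  p_all_miss <- p_all_miss * p_miss
  n <- n + 1
}
n                                    # 439
# Or directly, solving (1 - 1/365)^n <= 0.3 for n:
ceiling(log(0.3) / log(1 - 1/365))   # 439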
Addendum
I got curious, since my answer differs from the accepted one, so I wrote a small simulation. Again, I think it's straightforward enough to understand even though it's not in R:
require 'quickstats' # Stats "gem" available from rubygems.org
def trial
  n = 1
  # Keep adding people to the count until one of them hits the target
  n += 1 while rand(1..365) != 365
  return n
end

def quantile(percentile = 0.7, number_of_trials = 1_000)
  # Create an array containing results from specified number of trials.
  # Defaults to 1000 trials
  counts = Array.new(number_of_trials) { trial }
  # Sort the array and determine the empirical target percentile.
  # Defaults to 70th percentile
  return counts.sort[(percentile * number_of_trials).to_i]
end
# Tally the statistics of 100 quantiles and report results,
# including margin of error, formatted to 3 decimal places.
stats = QuickStats.new
100.times { stats.new_obs(quantile) }
puts "#{"%.3f" % stats.avg}+/-#{"%.3f" % (1.96*stats.std_err)}"
Five runs produce outputs such as:
440.120+/-3.336
440.650+/-3.495
435.820+/-3.558
439.500+/-3.738
442.290+/-3.909
which are strongly consistent with the analytical result derived earlier and seem to differ significantly from the other responders' answers.
Note that on my machine the simulation takes roughly 40 times longer than the analytical calculation, is more complex, and introduces uncertainty. To increase the precision you would need larger sample sizes, and thus longer run times. Given these considerations, I would reiterate my advice to go for the direct solution in this case.
Indeed, your probability will (almost) never reach 0.7, because what you compute is the fraction of people whose birthday is day 365, which hovers around 1/365. As people gets larger, more people will have birthday = 365 in absolute terms, but the fraction you compute will not grow towards 0.7.
Furthermore, to calculate the probability for a given number of people, you should draw many samples and then compute the proportion of successes. Here is a way to achieve that:
N = 450                      # max. number of people being tried
probs = array(numeric(), N)  # empty array to store found probabilities
# try for all people numbers in range 1:N
for (people in 1:N) {
  # do 200 samples to calculate the proportion
  samples = 200
  successes = 0
  for (i in 1:samples) {
    birthday <- sample(365, size = people, replace = TRUE)
    total_last_day <- sum(birthday == 365)
    if (total_last_day >= 1) {
      successes <- successes + 1
    }
  }
  # store found proportion in array
  probs[people] = successes / samples
}
# output of those people numbers that achieved a probability of > 0.7
which(probs > 0.7)
As this is a simulation, the result depends on the run. Increasing the number of samples would make the result more stable.
You are solving the wrong problem. The question is, "How many people are needed such that there is at least a 70% chance that one of them is born on the last day of December?". What you are finding now is "How many people are needed such that 70% have their birthdays on the last day of December?". The answer to the second question is close to zero. But the first one is much simpler.
Replace prob <- length(which(birthday == 365)) / people with check = any(birthday == 365) in your logic because at least one of them has to be born on Dec 31. Then, you will be able to find if that number of people will have at least one person born on Dec 31.
After that, you will have to rerun the simulation multiple times to generate an empirical probability distribution (a kind of Monte Carlo). Only then can you check the probability.
Simulation Code
people_count = function(i)
{
  set.seed(i)
  for (people in 1:10000)
  {
    birthday = sample(365, size = people, replace = TRUE)
    check = any(birthday == 365)
    if (check == TRUE)
    {
      pf = people
      break
    }
  }
  return(pf)
}
The people_count() function returns the number of people required so that at least one of them was born on Dec 31. Then I rerun the simulation 10,000 times.
# Number of simulations
nsim = 10000
library(magrittr)  # provides the %>% pipe used below
l = lapply(1:nsim, people_count) %>%
  unlist()
Let's see the distribution of the number of people required.
To find actual probability, I'll use cumsum().
> cdf = cumsum(l/nsim)
> which(cdf>0.7)[1]
[1] 292
So, on average, you would need 292 people to have more than a 70% chance.
In addition to #pjs's answer, I would like to provide one myself, written in R. I attempted to solve this question by simulation rather than an analytical approach, and I am sharing it in case it is helpful for someone else with the same problem. It's not that well written, but the idea is there:
# create a function which will find if anyone is born on last day
last_day <- function(x){
  birthdays <- sample(365, size = x, replace = TRUE)  # randomly get everyone's birthdays
  if (length(which(birthdays == 365)) >= 1) {
    TRUE   # find the number of people born on the last day and return TRUE if >= 1
  } else {
    FALSE
  }
}
# find out how many people are needed to get 70%
people <- 0  # set number of people to zero
prob <- 0    # set prob to zero
while (prob <= 0.7) {   # loop does not stop until it hits 70%
  people <- people + 1  # increase the number of people every iteration
  prob <- mean(replicate(10000, last_day(people)))  # run last_day 10000 times to find the mean probability
}
print(people)
last_day() only returns TRUE or FALSE, so I run last_day() 10000 times on every iteration of the loop to find out how many of those 10000 runs have one or more people born on the last day (this gives the probability). I then keep the loop running until the probability is 70% or more, then print the number of people.
The answer I get from running the loop once is 440, which is quite close to the answer provided by #pjs.
R and probability noob here. I'm looking to create a histogram that shows the distribution of how many attempts it took to return a heads, repeated over 1000+ simulated runs on the equivalent of an unfairly weighted coin (0.1 heads, 0.9 tails).
From my understanding, this is not a geometric distribution or binomial distribution (but might make use of either of these to create the simulated results).
The real-world (ish) scenario I am looking to model this for is a speedrun of the game Zelda: Ocarina of Time. One of the goals in this speedrun is to obtain an item from a character that has a 1 in 10 chance of giving the player the item each attempt. As such, the player stops attempting once they receive the item (which they have a 1/10 chance of receiving each attempt). Every run, runners/viewers will keep track of how many attempts it took to receive the item during that run, as this affects the time it takes the runner to complete the game.
This is an example of what I'm looking to create:
(though with more detailed labels on the x axis if possible). In this example, I manually flipped a virtual coin with a 1/10 chance of heads over and over. Once I got a successful result, I recorded how many attempts it took into a vector in R, and repeated this about 100 times. I then mapped this vector onto a histogram to visualise the distribution of the number of attempts it usually takes to get a successful result. Basically, I'd like to automate this simulation instead of manually flipping the virtual unfair coin, writing down how many attempts it took before heads, and entering it into R myself.
I'm not sure if this is quite what you're looking for, but if you create a function for your manual coin flipping, you can just use replicate() to call it many times:
foo <- function(p = 0.1) {
  i <- 0
  failure <- TRUE
  while ( failure ) {
    i <- i + 1
    if ( sample(x = c(TRUE, FALSE), size = 1, prob = c(p, 1-p)) ) {
      failure <- FALSE
    }
  }
  return(i)
}
set.seed(42)
number_of_attempts <- replicate(1000, foo())
hist(number_of_attempts, xlab = "Number of Attempts Until First Success")
As I alluded to in my comment though, I'm not sure why you think the geometric distribution is inappropriate.
It "is used for modeling the number of failures until the first success" (from the Wikipedia on it).
So, we can just sample from it and add one; the approaches are equivalent, but this will be faster when your number of samples is high:
number_of_attempts2 <- rgeom(1000, 0.1) + 1
hist(number_of_attempts2, xlab = "Number of Attempts Until First Success")
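As a quick check (my addition), both simulated vectors should have a mean near 1/p = 10, the expected number of attempts for success probability 0.1:
mean(number_of_attempts)   # roughly 10
mean(number_of_attempts2)  # roughly 10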
I would use the 'rle' function since you can make a lot of simulations in a short period of time. Use this to count the run of tails before a head:
> n <- 1e6
> # generate a long string of flips with unfair coin
> flips <- sample(0:1,
+ n,
+ replace = TRUE,
+ .... [TRUNCATED]
> counts <- rle(flips)
> # now pull out the "lengths" of "0" which will be the tails before
> # a head is flipped
> runs <- counts$lengths[counts$value == 0]
> sprintf("# of simulations: %d max run of tails: %d mean: %.1f\n",
+ length(runs),
+ max(runs),
+ mean(runs))
[1] "# of simulations: 90326 max run of tails: 115 mean: 10.0\n"
> ggplot()+
+ geom_histogram(aes(runs),
+ binwidth = 1,
+ fill = 'blue')
and you get a chart like this:
Histogram of runs
I would tabulate the cumsum.
p=.1
N <- 1e8
set.seed(42)
tosses <- sample(0:1, N, T, prob=c(1-p, p))
attempts <- tabulate(cumsum(tosses))
length(attempts)
# [1] 10003599
hist(attempts, freq=F, col="#F48024")
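A brief note on why this works (my addition): cumsum(tosses) labels every toss with the index of the most recent head, and tabulate() counts how many tosses carry each label, which is exactly the gap from one head to the next. A quick check:
mean(attempts)  # should be close to 1/p = 10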
I am trying to simulate monthly panels of data where one variable depends on lagged values of that variable in R. My solution is extremely slow. I need around 1000 samples of 2545 individuals, each of whom is observed monthly over many years, but the first sample took my computer 8.5 hours to construct. How can I make this faster?
I start by creating an unbalanced panel of people with different birth dates, monthly ages, and variables xbsmall and error that will be compared to determine the Outcome. All of the code in the first block is just data setup.
# Setup:
library(plyr)
# Would like to have 2545 people (nPerson).
#Instead use 4 for testing.
nPerson = 4
# Minimum and maximum possible ages and birth dates
AgeMin = 10
AgeMax = 50
BornMin = 1950
BornMax = 1963
# Person-specific characteristics
ind = data.frame(
  id = 1:nPerson,
  BornYear = floor(runif(length(1:nPerson), min=BornMin, max=BornMax+1)),
  BornMonth = ceiling(runif(length(1:nPerson), min=0, max=12))
)
# Make an unbalanced panel of people over age 10 up to year 1986
# panel = ddply(ind, ~id, transform, AgeMonths = BornMonth)
panel = ddply(ind, ~id, transform, AgeMonths = (AgeMin*12):((1986-BornYear)*12 + 12-BornMonth))
# Set up some random variables to approximate the data generating process
panel$xbsmall = rnorm(dim(panel)[1], mean=-.3, sd=.45)
# Standard normal error for probit
panel$error = rnorm(dim(panel)[1])
# Placeholders
panel$xb = rep(0, dim(panel)[1])
panel$Outcome = rep(0, dim(panel)[1])
Now that we have data, here is the part that is slow (around a second on my computer for only 4 observations but hours for thousands of observations). Each month, a person gets two draws (xbsmall and error) from two different normal distributions (these were done above), and Outcome == 1 if xbsmall > error. However, if Outcome equals 1 in the previous month, then Outcome in the current month equals 1 if xbsmall + 4.47 > error. I use xb = xbsmall+4.47 in the code below (xb is the "linear predictor" in a probit model). I ignore the first month for each person for simplicity. For your information, this is simulating a probit DGP (but that is not necessary to know to solve the problem of computation speed).
# Outcome == 1 if and only if xb > -error
# The hard part: xb includes information about the previous month's outcome
start_time = Sys.time()
for (i in 1:nPerson) {
  # Determine the range of monthly ages to loop over for this person
  AgeMonthMin = min(panel$AgeMonths[panel$id==i], na.rm=T)
  AgeMonthMax = max(panel$AgeMonths[panel$id==i], na.rm=T)
  # Loop over the monthly ages for this person and determine the outcome
  for (t in (AgeMonthMin+1):AgeMonthMax) {
    # Indicator for whether Outcome was 1 last month
    panel$Outcome1LastMonth[panel$id==i & panel$AgeMonths==t] =
      panel$Outcome[panel$id==i & panel$AgeMonths==t-1]
    # xb = xbsmall + 4.47 if Outcome was 1 last month; otherwise xb = xbsmall
    panel$xb[panel$id==i & panel$AgeMonths==t] =
      with(panel[panel$id==i & panel$AgeMonths==t,], xbsmall + 4.47*Outcome1LastMonth)
    # Outcome == 1 if xb > -error
    panel$Outcome[panel$id==i & panel$AgeMonths==t] =
      ifelse(panel$xb[panel$id==i & panel$AgeMonths==t] > -panel$error[panel$id==i & panel$AgeMonths==t], 1, 0)
  }
}
end_time = Sys.time()
end_time - start_time
My thoughts for reducing computer time:
Something with cumsum()
Some wonderful panel data function that I do not know about
Find a way to make the t loop go through the same starting and ending points for each individual and then somehow use plyr::ddply() or dplyr::gather_by()
Iterative solution: make an educated guess about the value of Outcome at each monthly age (say, the mode) and somehow adjust values that do not match the previous month. This would work better in my real application because xbsmall has a very clear trend in age.
Do the simulation only for smaller samples and then estimate the effect of sample size on the values I need (the distributions of regression coefficient estimates not calculated here)
One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:
generate_outcome <- function(x) {
  AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
  AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
  for (i in 2:(AgeMonthMax - AgeMonthMin + 1)) {
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > -x$error[i], 1, 0)
  }
  x
}
where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do
out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)
and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
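If you want to squeeze out a bit more, the inner month loop itself can be replaced with Reduce(), which carries last month's Outcome forward without explicit row indexing. This is a sketch of my own, under the same assumptions as generate_outcome above (consecutive AgeMonths within each person):
generate_outcome2 <- function(x) {
  # Outcome in month i depends only on this month's draws and last month's Outcome
  x$Outcome <- Reduce(
    function(prev, i) as.integer(x$xbsmall[i] + 4.47 * prev > -x$error[i]),
    seq_len(nrow(x))[-1],
    init = x$Outcome[1],
    accumulate = TRUE
  )
  x
}
out2 <- do.call(rbind, lapply(split(panel, panel$id), generate_outcome2))
# all.equal(out$Outcome, out2$Outcome) should return TRUE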
I need to simulate a stock's daily returns. I am given r = (P(t+1) - P(t))/P(t), normally distributed with mean µ = 1% and sd σ = 5%. P(t) is the stock price at the end of day t. Simulate 100,000 instances of such daily returns.
Since I am a new R user, how do I set up t for this example? I am assuming P should be set up as:
P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)
You are getting it wrong: from what you wrote, the mean and the sd apply to the return, not to the price. I furthermore assume that the mean is set on an annual basis (a 1% rate of return from one day to the next is just ...huge!) and that t runs over 252 trading days per year.
With these hypotheses, you can get a series of daily returns in R with:
r = rnorm(100000, .01/252, .005)
Assuming the model you mentioned, you can get the series of prices P (containing 100,001 elements; I will take P[1] = 100, change it to your own value if needed):
factor = 1 + r
temp = 100
P = c(100, sapply(1:100000, function(u){
  p = factor[u]*temp
  temp <<- p
  p
}))
The configuration you mention for the return (mean = 0.01 and sd = 0.05) will however lead to an exploding stock price (unrealistic model and parameters). Be careful to check that the cumulative product of the factors does not return Inf.
Here is the result for the first 1000 values of P, representing 4 years:
plot(1:1000, P[1:1000])
One of the classical models (which does not mean the model is realistic) assumes the observed log returns follow a normal distribution.
Hope this helps.
I see you already have an answer, and ColonelBeauvel might have more domain knowledge than I do (assuming this is business or finance homework). I approached it a bit differently and am going to post a commented transcript. His method uses the <<- operator, which is considered a somewhat suspect strategy in R, although I must admit it seems quite elegant in this application. I suspect my method will probably be a lot faster if you ever get into doing large-scale simulations.
Starting with your code:
P <- rnorm(100000, .01, .05)
# r=(P(t+1)-P(t))/P(t) definition, not R code
# inference: P_t+1 = r_t*P_t + P_t = P_t*(1+r_t)
# So, all future P's will be determined by P_1 and r_t
Since P_2 will be P_1*(1+r_1), P_3 will be P_1*(1+r_1)*(1+r_2), i.e. a continued product of the vector (1+r), for which there is a vectorized function.
P <- P_1*cumprod(1+r)
#Error: object 'P_1' not found
P_1 <- 100
P <- P_1*cumprod(1+r)
#Error: object 'r' not found
# So the random simulation should have been for `r`, not P
r <- rnorm(100000, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
#Error in plot.window(...) : infinite axis extents [GEPretty(-inf,inf,5)]
str(P)
This occurred because the cumulative product went above the limits of numerical capacity and got assigned to Inf (infinity). Let's be a little more careful:
r <- rnorm(300, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
The strategy below iteratively updates the price at time t as 'temp' and multiplies it by a single value. It's likely to be a lot slower.
r = rnorm(100000, .01/252, .005)
factor = 1 + r
temp = 100
P = c(100, sapply(1:300, function(u){
  p = factor[u]*temp
  temp <<- p
  p
}))
> system.time( {r <- rnorm(10000, .01/250, .05)
+ P <- P_1*cumprod(1+r)
+ })
user system elapsed
0.001 0.000 0.002
> system.time({r = rnorm(10000, .01/252, .05)
+ factor = 1 + r
+ temp = 100
+ P = c(100, sapply(1:300, function(u){
+ p = factor[u]*temp
+ temp<<-p
+ p
+ }))})
user system elapsed
0.079 0.004 0.101
To simulate the log returns of a daily stock series, use the following method:
Consider working with 256 days of daily stock price data.
Load the original data into R.
Create another data.frame for the simulated log returns.
Code:
logr <- data.frame(Date=gati$Date[1:255], Shareprice=gati$Adj.Close[1:255], LogReturn=log(gati$Adj.Close[1:255]/gati$Adj.Close[2:256]))
gati is the dataset; Date and Adj.Close are the variables. Note the index ranges in the brackets.
P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)
The second line translates directly into:
r <- (P[-1] - P[-length(P)]) / P[-length(P)]  # (1:5)[-1] gives 2:5 and (1:5)[-5] gives 1:4
Stock returns are not normally distributed for simple returns ("R"), given their -1 lower bound per compounded period. However, log returns ("r") generally are. The below is adapted from #42's post above. There don't seem to be any solutions for simulating from a log mean ("expected return") and log stdev ("risk") in #Rstats, so I've included one here for those looking for "Monte Carlo simulation using log expected return and log standard deviation", which are normally distributed and have no lower bound at -1. Note: from this single example, it would take looping thousands of times to simulate a portfolio, i.e., stacking 100k paths like the one below and averaging a single slice to calculate a portfolio's average expected return at a chosen forward month. The below should give a good basis for doing so.
startPrice = 100
forwardPeriods = 12*10 # 10 years * 12 months with Month-over-Month E[r]
factor = exp(rnorm(forwardPeriods, .04, .10)) # Monthly Expected Ln Return = .04 and Expected Monthly Risk = .1
temp = startPrice
P = c(startPrice, sapply(1:forwardPeriods, function(u){p = factor[u]*temp; temp <<- p; p}))
plot(P, type = "b", xlab = "Forward End of Month Prices", ylab = "Expected Price from Log E[r]", ylim = c(0,max(P)))
n <- length(P)
logRet <- log(P[-1]/P[-n])
# Notice, with many samples this nearly matches our initial log E[r] and stdev(r)
mean(logRet)
# [1] 0.04540838
sqrt(var(logRet))
# [1] 0.1055676
If tested with a negative log expected return, the price should not fall below zero. The other examples will return negative prices with negative expected returns. The code I've shared here can be tested to confirm that negative prices do not occur in the simulation.
min(P)
# [1] 100
max(P)
# [1] 23252.67
Horizontal axis is number of days, and vertical axis is price.
n_prices <- 1000
volatility <- 0.2
amplitude <- 10
chng <- amplitude * rnorm(n_prices, 0, volatility)
prices <- cumsum(chng)
plot(prices, type='l')
I have a big data set (around 200k rows) where each row is a loan. I have the loan amount, the number of payments, and the loan payment.
I'm trying to get the loan rate.
R doesn't have a function for calculating this (at least base R doesn't have it, and I couldn't find it).
It isn't that hard to write both npv and irr functions:
Npv <- function(i, cf, t=seq(from=0,by=1,along.with=cf)) sum(cf/(1+i)^t)
Irr <- function(cf) { uniroot(Npv, c(0,100000), cf=cf)$root }
And you can just do
rate = Irr(c(amt,rep(pmt,times=n)))
The problem is when you try to calculate the rate for a lot of payments. Because uniroot is not vectorized, and because rep takes a surprising amount of time, you end up with a slow calculation. You can make it faster if you do some math and figure out that you are looking for the roots of the following equation
zerome <- function(r) amt/pmt-(1-1/(1+r)^n)/r
and then use that as input for uniroot. This takes around 20 seconds to run on my PC for my 200k database.
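For reference, here is a sketch of how that per-loan uniroot step might look (my own illustration; the loans data frame and its columns amt, pmt and n are hypothetical names, not from the question):
rate_one <- function(amt, pmt, n) {
  # For a typical loan (n * pmt > amt > pmt) the root lies between roughly 0 and 100% per period
  uniroot(function(r) amt / pmt - (1 - 1 / (1 + r)^n) / r,
          interval = c(1e-6, 1), tol = 1e-8)$root
}
rates <- mapply(rate_one, loans$amt, loans$pmt, loans$n)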
The problem is that I'm trying to do some optimization, and this is a step of the optimization, so I'm trying to speed it up even more.
I've tried vectorization, but because uniroot is not vectorized, I can't go further that way. Is there any root finding method that is vectorized?
Thanks
Instead of using a root finder, you could use a linear interpolator. You will have to create one interpolator for each value of n (the number of remaining payments). Each interpolator will map (1-1/(1+r)^n)/r to r. Of course you will have to build a grid fine enough so it will return r to an acceptable precision level. The nice thing with this approach is that linear interpolators are fast and vectorized: you can find the rates for all loans with the same number of remaining payments (n) in a single call to the corresponding interpolator.
Now some code that proves it is a viable solution:
First, we create interpolators, one for each possible value of n:
n.max <- 360L # 30 years
one.interpolator <- function(n) {
  r <- seq(from = 0.0001, to = 0.1500, by = 0.0001)
  y <- (1-1/(1+r)^n)/r
  approxfun(y, r)
}
interpolators <- lapply(seq_len(n.max), one.interpolator)
Note that I used a precision of 1/100 of a percent (1bp).
Then we create some fake data:
n.loans <- 200000L
n <- sample(n.max, n.loans, replace = TRUE)
amt <- 1000 * sample(100:500, n.loans, replace = TRUE)
pmt <- amt / (n * (1 - runif(n.loans)))
loans <- data.frame(n, amt, pmt)
Finally, we solve for r:
library(plyr)
system.time(ddply(loans, "n", transform, r = interpolators[[n[1]]](amt / pmt)))
# user system elapsed
# 2.684 0.423 3.084
It's fast. Note that some of the output rates are NA, but that is because my random inputs made no sense and would have produced rates outside the [0%, 15%] grid I selected. Your real data won't have that problem.