Calculating compound interest with vector of rates - r

I'm trying to see if there is a way to vectorize a calculation I performed. I searched for this answer and couldn't find what I needed.
I have a vector of growth rates. Each one represents one period (one year in my case). I want to apply this vector to some principal amount. Then, after the first growth rate is applied to the principal, use the result from the first iteration and apply the second growth element to the new value.
Here's some code for reproduction (all in base):
# Vector of interest or inflation rates
RateVector <- c(0.02, 0.03, 0.04, 0.05, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01) # forecasted rates
Principal <- data.frame(Principal = 1000000) # actual value of indicator in most recent period as data frame (list)
Here was my attempt to vectorize:
sapply(Principal, "*", 1 + cumsum(RateVector))
The problem with this is that the sapply function does not save the new amount and instead applies the vector of rates to the same initial principal. This is actually what I expected from this code. I don't know how to go about saving the new value after each iteration from element to element.
This is how I solved the problem, using a loop:
AmountVector <- Principal # initialize output vector
# Compound growth calculation loop
for (i in 1:length(RateVector)) {
  Principal <- Principal * (1 + RateVector[i])
  AmountVector <- rbind(AmountVector, Principal)
}
# Print result
AmountVector

This is a "cumulative product", so ?cumprod is what you need:
1000000 * cumprod(1 + RateVector)
# [1] 1020000 1050600 1092624 1147255 1216091 1276895 1327971 1367810 1395166
#[10] 1409118
Comparing with your loop result (a leading 1 in the cumulative product keeps the starting principal as the first row):
cbind(AmountVector, newresult = 1000000 * c(1, cumprod(1 + RateVector)))
#    Principal newresult
# 1    1000000   1000000
# 2    1020000   1020000
# 3    1050600   1050600
# 4    1092624   1092624
# 5    1147255   1147255
# 6    1216091   1216091
# 7    1276895   1276895
# 8    1327971   1327971
# 9    1367810   1367810
# 10   1395166   1395166
# 11   1409118   1409118

Related

calculate new random number considering distribution of already existing numbers in r

I have a dataframe with participants and I want to randomly assign them to a group (0,1). Each group should have approximately the same amount of participants.
My problem: I will keep adding participants. So, when I calculate a new random number for that participant, it should take into account the distribution of the random numbers I already have.
This is my code:
groupData <- data.frame(participant = c(1), Group = floor(runif(1, min=0, max=2)))
groupData[nrow(groupData) + 1,] = c(2,floor(runif(1, min=0, max=2))) # with this I will be adding participants
I think what you're saying is that when iteratively adding participants to groupData, you want to randomly assign them to a group such that over time, the groups will be evenly distributed.
N.B., iteratively adding rows to a frame scales horribly, so if you're doing this with a lot of data, it will slow down a lot. See "Growing Objects" in The R Inferno.
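As an aside, one common way to avoid that cost is to collect rows in a list and bind once at the end. A minimal sketch (not part of the original answer, and it assumes the batch size is known up front):
# Collect rows in a pre-allocated list, then bind once
rows <- vector("list", 300)
for (pa in seq_along(rows)) {
  rows[[pa]] <- data.frame(participant = pa, Group = sample(0:1, 1))
}
groupData <- do.call(rbind, rows) # single bind instead of 300 rbind calls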
We can weight the groups inversely by their relative sizes, so that a new participant has a slightly higher likelihood of being assigned to an under-populated group.
For instance, if we already have 100 participants with unbalanced groups:
set.seed(42)
groupData <- data.frame(participant = 1:100, Group = sample(c(rep(0, 70), rep(1, 30))))
head(groupData)
# participant Group
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 1
# 5 5 0
# 6 6 1
table(groupData$Group)
# 0 1
# 70 30
then we can prioritize the under-filled group using
100 / (table(c(0:1, groupData$Group))-1)
# 0 1
# 1.428571 3.333333
which can be used with sample as in
sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group)) - 1) )
I use table(c(0:1, ..)) - 1 because I want this to work even when one of the groups has no participants yet; by concatenating 0:1, I ensure each group is counted at least once, and the "minus one" removes that artificial count, keeping the ratios unbiased.
To "prove" that this eventually rounds out ...
for (pa in 101:400) {
  newgroup <- sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group)) - 1))
  groupData <- rbind(groupData, data.frame(participant = pa, Group = newgroup))
}
library(ggplot2)
transform(groupData, GroupDiff = cumsum(Group == 0) - cumsum(Group == 1)) |>
  ggplot(aes(participant, y = GroupDiff)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 100) +
  geom_text(
    data = data.frame(participant = 101, GroupDiff = c(-Inf, -1, 1),
                      vjust = c(-0.5, 0.5, -0.5),
                      label = c("Start of group-balancing", "Group0-heavy", "Group1-heavy")),
    aes(label = label, vjust = vjust), hjust = 0
  )
It is possible (even likely) that the balance will sway from side-to-side, but in general (asymptotically) it should stay balanced.
It occurs to me that the simplest method is just to assign people in pairs: draw a random number (0 or 1), assign person N to the group associated with that value, and assign person N+1 to the other group. That guarantees random assignment as well as perfectly equal group sizes.
Whether this properly simulates the situation you want to analyze is a separate issue.
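A minimal sketch of that pairing idea (assign_pairs is a hypothetical helper, not from the answer above):
# Assign n participants in pairs: each random draw fixes two consecutive
# participants to opposite groups, so group sizes differ by at most one.
assign_pairs <- function(n) {
  firsts <- sample(0:1, ceiling(n / 2), replace = TRUE)
  interleaved <- as.vector(rbind(firsts, 1 - firsts)) # pair each draw with its complement
  interleaved[seq_len(n)]
}
table(assign_pairs(101)) # counts differ by at most 1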

Using R to sample 3 proportion variables so that the three samples add to 1

I have a data set that is split into 3 profiles
Profile 1 = 0.478 (95% confidence interval: 0.4, 0.56)
Profile 2 = 0.415 (95% confidence interval: 0.34, 0.49)
Profile 3 = 0.107 (95% confidence interval: 0.06, 0.15)
Profile 1 + Profile 2 + Profile 3 = 1
I want to create a stochastic model that selects a value for each profile from each proportion's confidence interval, while ensuring the three values still add up to one. I have been using
pro1_prop <- rpert(1, 0.4, 0.478, 0.56)
pro2_prop <- rpert(1, 0.34, 0.415, 0.49)
pro3_prop <- 1 - (pro1_prop + pro2_prop)
But this does not seem robust enough. Also, on some iterations (pro1_prop + pro2_prop) > 1, which results in a negative value for pro3_prop. Is there a better way of doing this? Thank you!
It is straightforward to sample from the posterior distributions of the proportions using Bayesian methods. I'll assume a multinomial model, where each observation is one of the three profiles.
Say the counts data for the three profiles are 76, 66, and 17.
Using a Dirichlet prior distribution, Dir(1/2, 1/2, 1/2), the posterior is also Dirichlet-distributed: Dir(76.5, 66.5, 17.5), which can be sampled using normalized random gamma variates.
x <- c(76, 66, 17) # observations
# take 1M samples of the proportions from the posterior distribution
theta <- matrix(rgamma(3e6, rep(x + 1/2, each = 1e6)), ncol = 3)
theta <- theta/rowSums(theta)
head(theta)
#> [,1] [,2] [,3]
#> [1,] 0.5372362 0.3666786 0.09608526
#> [2,] 0.4008362 0.4365053 0.16265852
#> [3,] 0.5073144 0.3686412 0.12404435
#> [4,] 0.4752601 0.4367119 0.08802793
#> [5,] 0.4428575 0.4520680 0.10507456
#> [6,] 0.4494075 0.4178494 0.13274311
# compare the Bayesian credible intervals with the frequentist confidence intervals
cbind(
  t(mapply(function(i) quantile(theta[, i], c(0.025, 0.975)), seq_along(x))),
  t(mapply(function(y) setNames(prop.test(y, sum(x))$conf.int, c("2.5%", "97.5%")), x))
)
#> 2.5% 97.5% 2.5% 97.5%
#> [1,] 0.39994839 0.5537903 0.39873573 0.5583192
#> [2,] 0.33939396 0.4910900 0.33840295 0.4959541
#> [3,] 0.06581214 0.1614677 0.06535702 0.1682029
If samples within the individual 95% CIs are needed, simply reject samples that fall outside the desired interval.
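For instance, a simple rejection step on the theta samples above (interval bounds copied from the question) might look like:
lower <- c(0.40, 0.34, 0.06)
upper <- c(0.56, 0.49, 0.15)
# keep only the rows where all three proportions fall inside their CIs
ok <- theta[, 1] >= lower[1] & theta[, 1] <= upper[1] &
  theta[, 2] >= lower[2] & theta[, 2] <= upper[2] &
  theta[, 3] >= lower[3] & theta[, 3] <= upper[3]
theta_kept <- theta[ok, ]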
TL;DR: Sample all three values (for example from a PERT distribution, as you did) and normalize them afterwards so that they add up to one.
Sampling all three values independently and then dividing each by their sum, so that the normalized values add up to one, seems to be the easiest option, as it is quite hard to sample directly from the set of legal values.
Legal values:
The downside of my approach is that the normalized values are not necessarily legal any more (i.e. within the confidence intervals). However, with a PERT distribution this only happens about 0.5% of the time.
Code:
library(plotly)
library(freedom)
library(data.table)
# define lower (L) and upper (U) bounds and expected values (E)
prof1L <- 0.4
prof1E <- 0.478
prof1U <- 0.56
prof2L <- 0.34
prof2E <- 0.415
prof2U <- 0.49
prof3L <- 0.06
prof3E <- 0.107
prof3U <- 0.15
dt <- as.data.table(expand.grid(
  Profile1 = seq(prof1L, prof1U, by = 0.002),
  Profile2 = seq(prof2L, prof2U, by = 0.002),
  Profile3 = seq(prof3L, prof3U, by = 0.002)
))
# color based on how far the points are away from the center
dt[, color := abs(Profile1 - prof1E) + abs(Profile2 - prof2E) + abs(Profile3 - prof3E)]
# only keep those points that (almost) add up to one
dt <- dt[abs(Profile1 + Profile2 + Profile3 - 1) < 0.01]
# plot the legal values
fig <- plot_ly(dt, x = ~Profile1, y = ~Profile2, z = ~Profile3,
               color = ~color, colors = c('#BF382A', '#0C4B8E')) %>%
  add_markers()
fig
# try to simulate the legal values:
# first sample without considering the condition that the profiles need to add up to 1
nSample <- 100000
dtSample <- data.table(
  Profile1Sample = rpert(nSample, prof1L, prof1U, prof1E),
  Profile2Sample = rpert(nSample, prof2L, prof2U, prof2E),
  Profile3Sample = rpert(nSample, prof3L, prof3U, prof3E)
)
# we want to norm the samples by dividing by their sum
dtSample[, SampleSums := Profile1Sample + Profile2Sample + Profile3Sample]
dtSample[, Profile1SampleNormed := Profile1Sample / SampleSums]
dtSample[, Profile2SampleNormed := Profile2Sample / SampleSums]
dtSample[, Profile3SampleNormed := Profile3Sample / SampleSums]
# now get rid of the cases where the normed values are not legal any more
# (e.g. Profile 1 = 0.56, Profile 2 = 0.38, Profile 3 = 0.06 => dividing by their sum
# will make Profile 3 have an illegal value)
dtSample <- dtSample[
  prof1L <= Profile1SampleNormed & Profile1SampleNormed <= prof1U &
  prof2L <= Profile2SampleNormed & Profile2SampleNormed <= prof2U &
  prof3L <= Profile3SampleNormed & Profile3SampleNormed <= prof3U
]
# see if the sampled values follow the desired distribution
hist(dtSample$Profile1SampleNormed)
hist(dtSample$Profile2SampleNormed)
hist(dtSample$Profile3SampleNormed)
[Histogram of the normalized sampled values for Profile 1]
Ok, some thoughts on the matter.
Let's think about the Dirichlet distribution, since it provides random variates that sum to 1.
We're talking about Dir(a1, a2, a3), and we have to find suitable ai.
From the expression for the mean, E[Xi] = ai/(a1+a2+a3), we can get the three ratios by solving the equations
a1/(a1+a2+a3) = 0.478
a2/(a1+a2+a3) = 0.415
a3/(a1+a2+a3) = 0.107
Note that we have only solved for the RATIOS. In other words, if we multiply every ai by the same value, the means stay the same. So we are free to choose a multiplier m, and what changes is the variance/std. dev.: a larger multiplier means smaller variance, i.e. sampled values tighter around the means.
We could try to choose m to satisfy the three 95% CI conditions, but that gives three equations for the variances with only one degree of freedom, so in general it is not possible.
One could play with the numbers and the code.
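For instance, a minimal sketch of that multiplier idea (m = 100 is an arbitrary choice, and the gamma trick mirrors the Bayesian answer above):
means <- c(0.478, 0.415, 0.107) # target ratios
m <- 100                        # free multiplier; larger m = tighter samples
alpha <- m * means
n <- 10000
# Dirichlet draws via normalized gamma variates
draws <- matrix(rgamma(3 * n, shape = rep(alpha, each = n)), ncol = 3)
draws <- draws / rowSums(draws)
colMeans(draws)     # close to 0.478, 0.415, 0.107 for any m
apply(draws, 2, sd) # shrinks as m grows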

Declining a value w/o a for loop

I would like to decline (i.e. multiply) a value (first.value) against a vector of percentage decline values (decline.vector), where the first value is declined against the starting percentage decline value, then that output value is declined against the second percentage decline value, and so on. I assume there is a more elegant way to do so in R than writing a for loop to reassign the new value and cbind to create the new vector, but I remain a novice.
The decline vectors are not sequences like the one below; this is just an example.
That said, is it possible to use seq() where 'by=' is a vector? I did not find anything in ?seq suggesting it is.
Whereby:
first.value <- 100
decline.vector <- c(0.85, 0.9, 0.925, 0.95, 0.975)
Desired output:
[100] 85, 76.5, 70.763, 67.224, 65.544
You can do this with the Reduce function in base R
first.value <- 100
decline.vector <- c(0.85, 0.9, 0.925, 0.95, 0.975)
Reduce(`*`, decline.vector, first.value, accumulate = TRUE)
# [1] 100.00000 85.00000 76.50000 70.76250 67.22437 65.54377
You could also use cumprod
first.value * cumprod(c(1, decline.vector))
# [1] 100.00000 85.00000 76.50000 70.76250 67.22438 65.54377
If you don't want first.value to be the first element of the output, then do
first.value * cumprod(decline.vector)
# [1] 85.00000 76.50000 70.76250 67.22438 65.54377

Optimisation of matrix in R

I'm new to optimisation/calibration of models in R, but I'm eager to learn and really need some help. My question relates to demographic modelling.
I've done some research and found help here and here but neither have quite answered my question.
I have a matrix of scalars (propensities) where each column must total to 1. These propensities are used to estimate the number of households that would arise from a given population (persons by age). The propensities model tends to overestimate the number of households in history (for which I know the true number of households).
I want to calibrate the model to minimise the error in the number of households by tweaking the propensities such that the columns still add to 1 and propensities with an initial value of zero must remain zero.
Simple example:
# Propensities matrix
mtx <- matrix(c(0.00, 0.00, 0.85, 0.00, 0.15, 0.35, 0.45, 0.00,
                0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Population by age cohort
pop <- c(2600, 16200, 13400)
# True number of households
target <- c(7000, 4500, 5500)
# Function to optimise
hh <- function(mtx, pop, target) {
  # Estimate living arrangements
  x <- mtx %*% pop
  # Estimate number of households using parent cohorts (1, 2 and 4)
  x <- c(x[1, 1] / 2, x[2, 1] / 2, x[4, 1]) - target
  return(x)
}
I haven't included any of my code for the optimisation/calibration step as it would be embarrassing and I haven't been able to get anything to work!
Ideally, at the end of this process, I will have one set of propensities that generalises well for lots of different regions. Any advice on how I should go about achieving it? Helpful links?
Update
The snippet of code below executes the local search method as suggested by Enrico.
library(tidyverse)
library(NMOF)
data <- list(mtx = matrix(c(0.00, 0.00, 0.90, 0.00, 0.10, 0.25, 0.50, 0.00,
                            0.25, 0.00, 0.60, 0.20, 0.00, 0.20, 0.00), ncol = 3),
             pop = c(2600, 16200, 13400),
             target = c(7190, 4650, 5920))
# True mtx
mtx.true <- matrix(c(0.00, 0.00, 0.75, 0.00, 0.25, 0.35, 0.45, 0.00,
                     0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Function to optimise
households <- function(x, data) {
  # Estimate living arrangements
  z <- x %*% data$pop
  # Estimate number of households using parent cohorts (1, 2 and 4)
  z <- c(z[1, 1] / 2, z[2, 1] / 2, z[4, 1]) - data$target
  sum(abs(z))
}
# Local search function to perturb propensities
neighbour <- function(x, data) {
  # Choose a random column from mtx
  i <- sample(1:ncol(x), 1)
  # Select two non-zero propensities from that column
  j <- which(x[, i] != 0) %>% sample(2, replace = FALSE)
  # Randomly select one to perturb positively
  x[j[1], i] <- 0.1 * (1 - x[j[1], i]) + x[j[1], i]
  # Perturb the second propensity so the column still adds to 1
  x[j[2], i] <- x[j[2], i] + (1 - sum(x[, i]))
  x
}
# Local search algorithm inputs
localsearch <- list(x0 = data$mtx,
                    neighbour = neighbour,
                    nS = 50000,
                    printBar = FALSE)
# Execute
now <- Sys.time()
solution <- LSopt(OF = households, algo = localsearch, data)
#>
#> Local Search.
#> Initial solution: 2695
#> Finished.
#> Best solution overall: 425.25
Sys.time() - now
#> Time difference of 6.33272 secs
# Inspect propensity matrices
print(solution$xbest)
#> [,1] [,2] [,3]
#> [1,] 0.0000000 0.3925 0.6
#> [2,] 0.0000000 0.4250 0.2
#> [3,] 0.2937976 0.0000 0.0
#> [4,] 0.0000000 0.1825 0.2
#> [5,] 0.7062024 0.0000 0.0
print(mtx.true)
#> [,1] [,2] [,3]
#> [1,] 0.00 0.35 0.65
#> [2,] 0.00 0.45 0.15
#> [3,] 0.75 0.00 0.00
#> [4,] 0.00 0.20 0.20
#> [5,] 0.25 0.00 0.00
Thanks!
I can only comment on the optimisation part.
The code you have provided is essentially sufficient; the one problem is that your objective function evaluates to a vector. You will need to transform this vector into a single number to be minimised, such as the sum of squares or the sum of absolute values.
When it comes to methods, I would try heuristics; in fact, I would try a Local-Search method. These methods operate on the solution through functions which you define, so you may code your solution as a matrix. More specifically, you need two functions: the objective function (which you essentially have) and a neighbourhood function, which takes a solution as input and modifies it. In your particular case, it could take the matrix, select two non-zero elements from one column, and increase one while decreasing the other, so the column sum remains unchanged.
Perhaps the tutorial http://enricoschumann.net/files/NMOF_Rmetrics2012.pdf is of interest, with R code http://enricoschumann.net/files/NMOF_Rmetrics2012.R .

Can't break out of while loop in R

The purpose of my code is to find the amount of people where the probability that at least 2 of them have the same birthday is 50%.
source('colMatches.r')
all_npeople = 1:300
days = 1:365
ntrials = 1000
sizematch = 2
N = length(all_npeople)
counter = 1
pmean = rep(0,N)
while (pmean[counter] <= 0.5)
{
  npeople = all_npeople[counter]
  x = matrix(sample(days, npeople*ntrials, replace=TRUE), nrow=npeople,
             ncol=ntrials)
  w = colMatches(x, sizematch)
  pmean[counter] = mean(w)
  counter = counter + 1
}
s3 = toString(pmean[counter])
s2 = toString(counter)
s1 = "The smallest value of n for which the probability of a match is at least 0.5 is equal to "
s4 = " (the test p value is "
s5 = "). This means when you have "
s6 = " people in a room the probability that two of them have the same birthday is 50%."
paste(s1, s2, s4, s3, s5, s2, s6, sep="")
When I run that code I get "The smallest value of n for which the probability of a match is at least 0.5 is equal to 301 (the test p value is NA). This means when you have 301 people in a room the probability that two of them have the same birthday is 50%." So the while statement isn't working properly for some reason. It's cycling all the way through all_npeople even though it should stop when pmean[counter] is no longer less than or equal to 0.5.
I know that pmean is updating correctly though because when I test it afterwards pmean[50] = 0.971. So that list is indeed correct but the while loop still won't end.
*colMatches is a function that determines whether a column has a certain number of matches, based on sizematch. So in this case it looks at the matrix defined in x and lists 1 for every column that has at least 2 equal values and 0 for every column with no matches.
I admire your attempt to program this question, but the beauty of R is that most of this work is done for you:
qbirthday(prob = 0.5, classes = 365, coincident = 2)
# answer is 23 people.
You may also be interested in:
pbirthday(n, classes = 365, coincident = 2)
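which returns the probability of a coincidence for a given number of people, e.g.:
pbirthday(23, classes = 365, coincident = 2)
# [1] 0.5072972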
If the purpose of the code is only to find the number of people at which the probability that at least two of them share a birthday exceeds 0.5, it is possible to write it in a much simpler way:
# note that probability below is the probability of NOT having a shared birthday
probability <- 1
people <- 1
days <- 365
while (probability >= 0.5) {
  people <- people + 1
  probability <- probability * (days + 1 - people) / days
}
print(people)
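# [1] 23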
