Create Simulation in R

I have the following problem:
A casualty insurance company has 1000 policyholders, each of whom will
independently present a claim in the next month with probability 5%.
Assume that the amounts of the claims made are independent exponential
random variables with mean 800 dollars.
Does anyone know how to create a simulation in R to estimate the probability
that the sum of those claims exceeds 50,000 dollars?

This sounds like a homework assignment, so it's probably best to consult with your teacher(s) if you're unsure about how to approach this. Bearing that in mind, here's how I'd go about simulating this:
First, create a function that generates values from an exponential distribution and sums those values, based on the values you give in your problem description.
get_sum_claims <- function(n_policies, prob_claim, mean_claim) {
  # Draw the expected number of claims (n_policies * prob_claim) from Exp(mean = mean_claim) and sum them
  sum(rexp(n = n_policies * prob_claim, rate = 1/mean_claim))
}
Next, make this function return the sum of all claims lots of times, and store the results. The line with map_dbl does this, essentially instructing R to return 100000 simulated sums of claims from the get_sum_claims function.
library(tidyverse)
claim_sums <- map_dbl(1:100000, ~ get_sum_claims(1000, 0.05, 800))
Finally, we can calculate the probability that the sum of claims is greater than 50000 by using the code below:
sum(claim_sums > 50000)/length(claim_sums)
This gives a fairly reliable estimate of ~ 0.046 as the probability that the sum of claims exceeds 50000 in a given month.
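As a quick cross-check (an addition here, not part of the original answer): because get_sum_claims() always draws exactly 1000 * 0.05 = 50 claims, the simulated total is a Gamma(shape = 50, scale = 800) random variable, so the tail probability under that fixed-count assumption can also be computed directly:
1 - pgamma(50000, shape = 50, scale = 800)  # exact tail probability when the claim count is fixed at 50
which should agree closely with the simulated ~ 0.046 under that assumption.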

I'm a bit inexperienced with R, but here is my solution.
First, construct a function which simulates a single trial. To do so, one first needs to determine how many claims, n, are filed; I hope it is clear that n ~ Binomial(1000, 0.05). Note that you cannot simply assume n = 1000 * 0.05 = 50: doing so would decrease the variance of the total, which results in a lower estimated probability. (I can explain why this is the case if needed.) Then generate and sum n values drawn from an exponential distribution with mean 800.
simulate_total_claims <- function() {
  # Number of claims filed this month: Binomial(1000, 0.05)
  n_claims <- rbinom(n = 1, size = 1000, prob = 0.05)
  # Individual claim amounts: independent exponentials with mean 800
  claim_amounts <- rexp(n_claims, rate = 1/800)
  sum(claim_amounts)
}
Now, all that needs to be done is run the above function a lot and determine the proportion of runs which have values greater than 50000.
library(purrr)  # rerun() comes from purrr
totals <- rerun(.n = 100000, simulate_total_claims())
estimated_prob <- mean(unlist(totals) > 50000)
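For reference (an addition, not part of the original answer): a rough analytic cross-check is possible via the law of total variance. With N ~ Binomial(1000, 0.05) claims of mean 800, the total has mean E[N]*800 and variance E[N]*800^2 + Var(N)*800^2, so a normal approximation gives a ballpark tail probability:
# Normal approximation to the compound-binomial total (rough cross-check only)
mean_n <- 1000 * 0.05                      # E[N] = 50
var_n  <- 1000 * 0.05 * 0.95               # Var(N) = 47.5
mu     <- mean_n * 800                     # expected total: 40,000
sigma2 <- mean_n * 800^2 + var_n * 800^2   # law of total variance (Var of Exp(mean 800) is 800^2)
1 - pnorm(50000, mean = mu, sd = sqrt(sigma2))
This lands noticeably above the ~ 0.046 obtained when the claim count is held fixed at 50, which illustrates the variance point made above.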

Related

Can an argument to a function be a random sample in R?

I'd like to test how precise the t-test is at detecting a mean for various distributions, but I don't want to have to define the sampling distribution each time I run the function. If I write function(data, mju) and then pass rnorm(n) (or any other random sample) as data, I obviously get the same results on every replication, because there is only the one "data" sample that was passed in the first place. To make clearer what I want, here is the code:
t_ci <- function(data, mju){
  prod(t.test(data)$conf.int - mju)
}
set.seed(NULL)
prec_t <- function(data, n, N, mju){
  sim <- replicate(N, t_ci(data, mju))
  sim[sim < 0]/N
}
The first function checks whether the real theoretical parameter "mju" is in the confidence interval. The second one replicates the function t_ci N times, to see how precise the t-test confidence intervals are for the selected data. I'd like to have an option to just indicate the distribution, and have it generate n-sized samples N times and calculate the precision. But as far as my code goes, it only replicates the same data over and over. Maybe there is a solution for this problem?
Also, it seems that something is wrong with the function prec_t, because I'd like to count the number of times t_ci produced a negative outcome and then divide that count by N.
Any help would be greatly appreciated! Thanks in advance.
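A minimal sketch of one way to do what is described above (an illustration, not a posted answer): pass the sampling function itself as an argument so that a fresh sample is drawn on every replication, and count the negative outcomes with sum(). The names sampler and prec_t2 are hypothetical; t_ci() is the function defined in the question.
# Sketch only: 'sampler' is any function that returns a sample of size n, e.g. rnorm
prec_t2 <- function(sampler, n, N, mju){
  sim <- replicate(N, t_ci(sampler(n), mju))  # fresh data drawn on each replicate
  sum(sim < 0) / N                            # proportion of intervals containing mju
}

# Example: coverage of the t-interval for N(0, 1) samples of size 20
prec_t2(rnorm, n = 20, N = 10000, mju = 0)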

Weighted Likelihood of an Event Occurring

I want to identify the probability of certain events occurring for a range.
Min = 600, Max = 50,000, most frequent outcome = 600.
I generated a sequence of events: numbers <- seq(600,50000,by=1)
This is where I get stuck. I'm not sure whether I'm using the wrong distribution or whether my attempt at executing it is going down the wrong path.
qpois(numbers, lambda = 600) produces NaNs
The desired outcome is an output of weighted probabilities (weighted towards the mean of 600), so that I can then assess the likelihood of an outlier event, e.g. whether a value above 30,000 has a 5% chance, or other cuts like that, by summing the probabilities for those numbers.
A bit rusty, haven't used this for a few years so any online resources to refresh is also appreciated!
Firstly, I think you're looking for ppois rather than qpois. The function qpois(p, 600) takes a vector p of probabilities. If you do qpois(0.75, 600) you will get 616, meaning that 75% of observations will be at or below 616.
ppois is the inverse of qpois. If you do ppois(616, 600) you will get (approximately) 0.75.
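To make the relationship concrete, the two calls side by side (the outputs quoted are the values stated above):
qpois(0.75, lambda = 600)  # quantile function: value with 75% of the mass at or below it (616, as above)
ppois(616, lambda = 600)   # cdf: probability of observing 616 or fewer (approximately 0.75)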
As for your specific distribution, it can't be a Poisson distribution. Let's see what a Poisson distribution with a mean of 600 looks like:
x <- 500:700
plot(x, dpois(x, 600), type = "h")
Getting a value greater than even 900 has (essentially) zero probability:
1 - ppois(900, 600)
#> [1] 0
So if your data contains values of 30,000 or 50,000 as well as 600, it's certainly not a Poisson distribution.
Without knowing more about your actual data, it's not really possible to say what distribution you have. Perhaps if you include a sample of it in your question we could be of more help.
EDIT
With the sample of numbers provided in the comments, we can have a look at the actual empirical distribution:
hist(numbers, 200)
and if we want to know the probability at any point, we can create the empirical cumulative distribution function like this:
get_probability_of <- ecdf(numbers)
This allows us to do:
number <- 1:50000
plot(number, get_probability_of(number), ylab = "probability", type = "l")
and
get_probability_of(30000)
#> [1] 0.83588
Which means that the probability of getting a number higher than 30,000 is
1 - get_probability_of(30000)
#> [1] 0.16412
However, in this case, we know how the distribution is generated, so we can calculate the exact theoretical cdf just using some simple geometry (I won't show my working here because although it is simple, it is rather lengthy, dull, and not applicable to other distributions):
cdf <- function(x) ifelse(x < 600, 0, 1 - ((49400 - (x - 600)) / 49400)^2)
and
cdf(30000)
#> [1] 0.8360898
which is very close to, but more theoretically accurate than, the empirical value.
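As an aside (an illustration added here, not part of the original answer): a cdf of this form is what you get from a triangular density that peaks at 600 and falls linearly to zero at 50,000, since P(X > x) is then the area of a shrinking similar triangle, ((50000 - x)/49400)^2. A quick inverse-transform simulation under that assumption reproduces the value above:
set.seed(42)
# Inverse-transform sampling from the assumed triangular distribution:
# F(x) = 1 - ((50000 - x)/49400)^2  =>  x = 50000 - 49400 * sqrt(1 - u)
u <- runif(1e6)
sim <- 50000 - 49400 * sqrt(1 - u)

mean(sim <= 30000)  # should be close to cdf(30000), i.e. about 0.836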

Calculating lm() within a loop

Objective: the overall objective of the problem is to calculate the confidence interval (CI) of the mean for various sample sizes (n = 2, 4, ..., 1024) of rnorm draws, 10,000 times each, and then count the number of times each CI fails (this likely requires a counter and an if/else statement). Finally, the results are to be plotted.
I am trying to calculate CIs of the means for simulations at several sample sizes; however, I am first trying to break down the code for one specific sample size, a = 8.
The problem I have is that I do not know how to generate a linear model for each row. Would anyone know how I can do this? Here is what I have so far:
a <- 8
n.sim.3 <- 10000
for (i in a) {
  r.mat <- matrix(rnorm(i * n.sim.3), nrow = n.sim.3, ncol = a)
  # The lm command below is where I'm stuck -- I don't think this is correct
  lm.tmp <- apply(r.mat, 1, lm(n.sim.3 ~ 1))
  confint.tmp <- confint(lm.tmp)
}
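A sketch of one way to do what the question asks (an illustration rather than a posted answer): apply() hands each row to a function, so the lm() call has to sit inside an anonymous function with the row as the response. For a single sample size a = 8 that might look like this:
a <- 8
n.sim.3 <- 10000
r.mat <- matrix(rnorm(a * n.sim.3), nrow = n.sim.3, ncol = a)

# Fit an intercept-only model to each row and keep the 95% CI for its mean
ci.mat <- t(apply(r.mat, 1, function(row) confint(lm(row ~ 1))))

# Proportion of intervals that fail to contain the true mean (0)
mean(ci.mat[, 1] > 0 | ci.mat[, 2] < 0)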

Preventing a Gillespie SSA Stochastic Model From Running Negative

I have produced a stochastic model of infection (parasitic worm) using a Gillespie SSA. The model uses the "GillespieSSA" package (https://cran.r-project.org/web/packages/GillespieSSA/index.html).
In short, the code models a population of discrete compartments. Movement between compartments is governed by user-defined rate equations. The SSA algorithm calculates the number of events produced by each rate equation for a given time step (tau) and updates the population accordingly; the process repeats up to a given time point. The problem is that the number of events is assumed to be Poisson distributed (Poisson(rate[i]*tau)), which produces an error when a rate is negative, including when population numbers become negative.
# Parameter Values
sir.parms <- c(deltaHinfinity=0.00299, CHi=0.00586, deltaH0=0.0854, aH=0.5,
muH=0.02, SigmaW=0.1, SigmaM =0.8, SigmaL=104, phi=1.15, f = 0.6674,
deltaVo=0.0166, CVo=0.0205, alphaVo=0.5968, beta=52, mbeta=7300 ,muV=52, g=0.0096, N=100)
# Initial Population Values
sir.x0 <- c(W=20,M=10,L=0.02)
# Rate Equations
sir.a <- c("((deltaH0+deltaHinfinity*CHi*mbeta*L)/(1+CHi*mbeta*L))*mbeta*L*N"
,"SigmaW*W*N", "muH*W*N", "((1/2)*phi*f)*W*N", "SigmaM*M*N", "muH*M*N",
"(deltaVo/(1+CVo*M))*beta*M*N", "SigmaL*L*N", "muV*L*N", "alphaVo*M*L*N", "(aH/g)*L*N")
# Population change for each event
sir.nu <- matrix(c(+0.01,0,0,
-0.01,0,0,
-0.01,0,0,
0,+0.01,0,
0,-0.01,0,
0,-0.01,0,
0,0,+0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/32),nrow=3,ncol=11,byrow=FALSE)
runs <- 10
set.seed(1)
# Data Frame of output
sir.out <- data.frame(time=numeric(),W=numeric(),M=numeric(),L=numeric())
# Multiple runs and combining data and SSA methods
for(i in 1:runs){
sim <- ssa(sir.x0,sir.a,sir.nu,sir.parms, method="ETL", tau=1/12, tf=140, simName="SIR")
sim.out <- data.frame(time=sim$data[,1],W=sim$data[,2],M=sim$data[,3],L=sim$data[,4])
sim.out$run <- i
sir.out <- rbind(sir.out,sim.out)
}
Thus, rates are computed and the model updates the population values at each time step, with the data stored in a data frame and then appended to the previous runs. However, when population levels get very low, the number of events reducing a compartment can exceed the number actually in that compartment. One workaround is to make the time step very small, but this makes the simulation take far longer.
My question: is there a way to augment the code so that, as the data is calculated at each time step, any population values that go negative are converted to 0?
I have tried working on this problem, but only seem to be able to come up with methods that alter the values once the simulation is complete, with the negative values still causing issues in the runs themselves.
E.g.
if (sir.out$L < 0) sir.out$L == 0
Any help would be appreciated
I believe the problem is the method you set ("ETL") in the ssa function. The ETL method will eventually produce negative numbers. You can try the "OTL" method, based on "Efficient step size selection for the tau-leaping simulation method", in which there are a few more parameters that you can tweak, but the basic command is:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="OTL", tf=140, simName="SIR")
Or the direct method, which will not produce negative numbers at all:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="D", tf=140, simName="SIR")

R optimize linear function

I'm new to R and need a little help with a simple optimization.
I want to apply a functional transformation to a variable (sales_revenue) over time (24 month forecast values 1 to 24). Basically I want to push sales revenue for products from later months into earlier month.
The functional transformation on time t is:
trans=D+(t/(A+B*t+C*t^2))
I will then want to solve:
1) sales_revenue=sales_revenue*trans
where total_sales_revenue=1,000,000 (or within +/- 2.5%)
total_sales_revenue is the sum of all sales_revenue over the 24 months forecast.
If trans has too many parameters I can fix most of them if required and leave B free to estimate.
I think the approach should be: fix all parameters except B, differentiate function (1) (I'm not sure what to differentiate by) and solve for a non-zero minimum (using constraints to make sure it's the right, non-zero minimum), then run an optimization on that function with the constraint that the total sum of sales_revenue*trans is equal (or close) to 1,000,000.
@user2138362, did you mean "1) sales_revenue=total_sales_revenue*trans"?
I'm supposing your parameters A, C and D are fixed, and you want to find B such that the distance between your observed values and your predicted values is minimized.
Let's say your time is in months. So we can write a function to give you the squared distance:
# A, C, D and sales_revenue are assumed to already exist in the environment
dist <- function(B) {
  t <- 1:length(sales_revenue)
  total_sales_revenue <- sum(sales_revenue)
  predicted <- total_sales_revenue * (D + (t / (A + B*t + C*t^2)))
  sum((sales_revenue - predicted)^2)
}
I'm using the squared Euclidean distance as the measure of distance; make the appropriate changes if that is not the case.
Now, dist is the function you have to minimize. You can use optim, as pointed out by @iTech. Even at the minimum of dist it probably won't be zero, as you have many (24) observations, but you can get the best fit, plot it, and see if it looks reasonable.
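A sketch of how the actual optimisation call might look (purely an illustration: the example sales_revenue series, the fixed values of A, C and D, and the search interval are made-up assumptions). Since only B is free, base R's one-dimensional optimiser is enough:
# Made-up inputs, purely for illustration -- replace with the real forecast and parameters
sales_revenue <- rep(1e6 / 24, 24)  # flat 24-month forecast summing to 1,000,000
A <- 1; C <- 0.01; D <- 0           # assumed fixed values

# dist() as defined above; interval chosen so the denominator A + B*t + C*t^2 stays positive
fit <- optimize(dist, interval = c(0, 10))
fit$minimum    # estimated B
fit$objective  # squared distance at that B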
