Birth/death model over time simulation in R

I'm trying to plot a graph that will show species number over time using a birth-death model. So far, all the packages and models I've seen will not allow me to input a starting number of species, which is problematic as my extinction rate is higher than my speciation rate.
The data I have for this question are:
speciation rate (or b) = 0.0157,
extinction rate (or d) = 53.3,
starting number of species = 250000.
Currently I've tried using rbdtree and simbdtree.
Thank you in advance

Assuming you don't need the phylogenetic tree of the species that are generated, and just want to know the number of species over time, you don't really need a package for this: the code is fairly simple (if you know what equations to implement).
I'm also assuming that you want a stochastic simulation model.
Let's say for simplicity that you only want to know the number of species at the times at which speciation or extinction events occur (this is a little bit easier than figuring it out at equally spaced times). Let's say the per capita mutation/speciation/birth rate is b and the per capita death/extinction rate is d (it's fairly standard to assume these rates are constant per capita; you could make other assumptions if you wanted). Then if there are currently s species present, the total rate of events is (b+d)*s, and the probability that the next event is a birth is b/(b+d). The waiting time to the next event is exponentially distributed with rate (b+d)*s.
num_events <- 1000000           # maximum number of events to simulate
s <- rep(NA, num_events+1)      # number of species after each event
t <- rep(NA, num_events+1)      # time of each event
t[1] <- 0
s[1] <- 250000                  # starting number of species
b <- 0.0157                     # per capita speciation (birth) rate
d <- 53.3                       # per capita extinction (death) rate
set.seed(101)
for (i in 1:num_events) {
    if (s[i] == 0) break        # stop if the clade has gone extinct
    ## waiting time to the next event: exponential with rate (b+d)*s
    delta_t <- rexp(1, rate = (b+d)*s[i])
    t[i+1] <- t[i] + delta_t
    ## the next event is a speciation with probability b/(b+d)
    if (runif(1) < b/(b+d)) s[i+1] <- s[i]+1 else s[i+1] <- s[i]-1
}
plot(t, s, type = "s", log = "y")
## overlay the deterministic expectation, s(t) = s(0)*exp((b-d)*t)
curve(s[1]*exp((b-d)*x), add = TRUE, lwd = 3, col = adjustcolor("red", alpha = 0.4))
You can see that until the population gets down to a few hundred species, it's barely worth bothering with a stochastic simulation: the theoretical exponential decay curve matches the population dynamics almost exactly while the numbers are large.
The code could easily be sped up a little, at some small cost to interpretability (though it should take no more than a few seconds as it is).
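For example, here is a minimal vectorized sketch of one possible speed-up (an illustration, not the only way to do it): draw all event directions and waiting times up front, use cumsum(), and truncate at the first extinction. It gives a different but statistically equivalent realization to the loop above.
set.seed(101)
steps <- ifelse(runif(num_events) < b/(b+d), 1L, -1L)        # +1 = speciation, -1 = extinction
s_vec <- c(s[1], s[1] + cumsum(steps))                       # species count after each event
ext <- which(s_vec == 0)[1]                                  # first extinction event, if any
if (!is.na(ext)) s_vec <- s_vec[1:ext]
dt <- rexp(length(s_vec) - 1, rate = (b+d)*head(s_vec, -1))  # waiting times depend on current s
t_vec <- c(0, cumsum(dt))
plot(t_vec, s_vec, type = "s", log = "y")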

Related

How can I add an amount of random error to a numerical variable in R?

I am working on investigating the relationship between body measurements and overall weight in a set of biological specimens using regression equations. I have been comparing my results to previous studies, which did not draw their measurement data and body weights from the same series of individuals. Instead, these studies used the mean values reported for each species from the previously published literature (with body measurements and weight drawn from different sets of individuals) or just took the midpoint of reported ranges of body measurements.
I am trying to figure out how to introduce a small amount of random error into my data to simulate the effects of drawing measurement and weight data from different sources. For example, altering every value by roughly +/- 5% of its actual value (which is close to the difference I get between my measurements and the literature measurements) and seeing how much that affects accuracy statistics. I know there is the jitter() command, but that only seems to work with plotting data.
There is a jitter() function in base R which allows you to add random noise to the data.
x <- 1:10
set.seed(123)
jitter(x)
#[1] 0.915 2.115 2.964 4.153 5.176 5.818 7.011 8.157 9.021 9.983
Check ?jitter which explains different ways to control the noise added.
It's straightforward if you know what the error looks like (i.e. how your error is distributed). Is the error normally distributed? Uniform?
set.seed(1) # for the reproducible output shown below
v1 <- rep(100, 10)            # measurements with no noise
v1_n <- v1 + rnorm(10, 0, 20) # error with mean 0 and sd 20, sampled from a normal distribution
v1_u <- v1 + runif(10, -5, 5) # error with min -5 and max 5, sampled from a uniform distribution
v1_n
#[1] 87.47092 103.67287 83.28743 131.90562 106.59016 83.59063 109.74858 114.76649 111.51563 93.89223
v1_u
#[1] 104.34705 97.12143 101.51674 96.25555 97.67221 98.86114 95.13390 98.82388 103.69691 98.40349
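For the roughly +/- 5% case described in the question, a multiplicative error may be a closer match than additive noise. Here is a minimal sketch (the measurement values are made up for illustration): each value is multiplied by a random factor near 1.
set.seed(42)
x <- c(12.3, 45.6, 78.9, 101.2)           # hypothetical measurements
x_err <- x * runif(length(x), 0.95, 1.05) # each value perturbed by up to +/- 5% of itself
x_err
A normally distributed factor, e.g. rnorm(length(x), 1, 0.025), would be another reasonable choice if the error is closer to Gaussian.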

Adding seasonal variations to wind speed time series

This follows up on an R blog post which is interesting and quite useful for simulating the wind speed time series of an unknown area from its Weibull parameters.
Although this method gives a reasonably good estimate of the time series as a whole, it suffers a great deal when we look at seasonal changes. To account for seasonal changes I want to employ seasonal maximum wind speeds and carry out the time series synthesis such that the yearly distribution, i.e. the annual shape and scale parameters, remains constant.
I want to employ seasonal maximum wind speeds in the code below by using 12 different maximum wind speeds, one for each month. This will allow greater wind speeds in certain months and lower ones in others, and should even out the resultant time series.
The code follows like this:
MeanSpeed<-7.29 ## Mean Yearly Wind Speed at the site.
Shape=2; ## Input Shape parameter (yearly).
Scale=8 ##Calculated Scale Parameter ( yearly).
MaxSpeed<-17 ##yearly
## $$$ 12 values of this maximum wind speed, one for each month, are to be used. The resultant time series should satisfy the yearly shape and scale parameters $$$ ##
nStates<-16
nRows<-nStates;
nColumns<-nStates;
LCateg<-MaxSpeed/nStates;
WindSpeed=seq(LCateg/2,MaxSpeed-LCateg/2,by=LCateg) ## Define the velocity vector, centered on the average value of each category.
##Determine the Weibull probability distribution.
wpdWind<-dweibull(WindSpeed,shape=Shape, scale=Scale); # Frequency distribution.
plot(wpdWind,type = "b", ylab= "frequency", xlab = "Wind Speed") ##Plot the Weibull probability distribution.
norm_wpdWind<-wpdWind/sum(wpdWind); ## Normalize the Weibull densities so they sum to 1.
## Correlation between states (Matrix G)
g<-function(x){2^(-abs(x))} ## decreasing correlation function between states.
G<-matrix(nrow=nRows,ncol=nColumns)
G <- row(G)-col(G)
G <- g(G)
##--------------------------------------------------------
## iterative process to calculate the matrix P (initial probability)
P0<-diag(norm_wpdWind); ## Initial value of the MATRIX P.
P1<-norm_wpdWind; ## Initial value of the VECTOR p.
## This iterative calculation should be repeated until the error falls below a certain tolerance
## Now, as something tentative, I set the number of iterations
steps=1000;
P=P0;
p=P1;
for (i in 1:steps){
r<-P%*%G%*%p;
r<-as.vector(r/sum(r)); ## The above result is in matrix form. I change it to vector
p=p+0.5*(P1-r)
P=diag(p)}
## $$ ----Markov Transition Matrix --- $$ ##
N=diag(1/as.vector(p%*%G));## normalization matrix
MTM=N%*%G%*%P ## Markov Transition Matrix
MTMcum<-t(apply(MTM,1,cumsum));## cumulative transition probabilities derived from the MTM
##-------------------------------------------
## Calculating the series from the MTMcum
##Insert number of data sets.
LSerie<-52560; ## Wind speed every 10 minutes for a year.
RandNum1<-runif(LSerie);## Random number to choose between states
State<-InitialState<-1;## assumes that the initial state is 1 (this must be changed when concatenating days)
StatesSeries=InitialState;
## Initialise ----
## The next state is the first one whose cumulative probability (from MTMcum) exceeds the random number
for (i in 2:LSerie) {
## i has to start on 2 !!
State=min(which(RandNum1[i]<=MTMcum[State,]));
## if (is.infinite (State)) {State = 1}; ## when the above condition is not met max -Inf
StatesSeries=c(StatesSeries,State)}
RandNum2<-runif(LSerie); ## Random number to choose between speeds within a state
SpeedSeries=WindSpeed[StatesSeries]-0.5*LCateg+RandNum2*LCateg;
##where the 0.5*LCateg correction is needed since the WindSpeed vector is centered on the mean value of each category.
library(MASS) ## for fitdistr
print(fitdistr(SpeedSeries, 'weibull')) ##MLE fitting of SpeedSeries
Can anyone suggest where and what changes I need to make to the code?
I don't know much about generating wind speed time series, but maybe these guidelines can help you improve your code's readability/reusability:
#1 You probably want to have a function which will generate a wind speed time series given a number of observations and a seasonal maximum wind speed. So first try to define your code inside a block like this one:
wind_time_serie <- function(nobs, max_speed){
#some code here
}
#2 While doing so, if some parts of your code are useful for generating wind speed time series but aren't specific to wind speed, try to put them into functions (e.g. the part where you compute norm_wpdWind, the part where you compute MTMcum, ...).
#3 Then the part of your code at the beginning where you define global variables should disappear; those values become default arguments of the functions.
#4 Avoid using end-of-line comments when the line is already long, and delete the trailing semicolons.
#This
State<-InitialState<-1;## assumes that the initial state is 1 (this must be changed when concatenating days)
#Would become this:
#Assumes that the initial state is 1 (this must be changed when concatenating days)
State<-InitialState<-1
Then your code should be more reusable/readable by other people. Below is an example of these guidelines applied to the norm_wpdWind part:
norm_distrib <- function(maxSpeed, states = 16, shape = 2, scale = 8){
  # Define the velocity vector, centered on the average value of each category.
  LCateg <- maxSpeed/states
  WindSpeed <- seq(LCateg/2, maxSpeed - LCateg/2, by = LCateg)
  # Determine the Weibull probability distribution.
  wpdWind <- dweibull(WindSpeed, shape = shape, scale = scale)
  # Normalize the densities so they sum to 1.
  return(wpdWind/sum(wpdWind))
}
#Plot normal distribution with the max speed you want (e.g. 17)
plot(norm_distrib(17),type = "b", ylab= "frequency", xlab = "Wind Speed")
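Following guideline #1, here is a minimal sketch of how the seasonal part could then be handled; the 12 monthly maximum wind speeds below are made-up placeholder values, not estimates.
monthly_max <- c(14, 15, 17, 16, 15, 13, 12, 12, 13, 15, 16, 17)  # hypothetical monthly maxima
monthly_distribs <- lapply(monthly_max, norm_distrib)             # one normalized distribution per month
# e.g. plot January's distribution
plot(monthly_distribs[[1]], type = "b", ylab = "frequency", xlab = "Wind speed state")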

Estimate the chance of at least one 1 among n fair six-sided dice

This is similar to the de Méré problem.
I want to run a Monte Carlo simulation to estimate the probability of rolling at least one 1 among n fair six-sided dice, using m simulated experiments.
My code:
m<-5000
n<-3
x<-replicate(m, sample(1:6,n,TRUE)==1)
p<-sum(x)/m
p is the estimated probability. Here I get the value 0.4822.
My questions:
1) Is there any other way without using sum to do it?
2) I suspect the code may be wrong, as the probability seems too high.
Although the question as stated is a little unclear, the code suggests you want to estimate the chance of obtaining at least one outcome of "1" among n independent dice and that you aim to estimate this by simulating the experiment m times.
Program simulations from the inside out. Begin with a single iteration. You started well, but to be perfectly clear let's redo it using a highly suggestive syntax. Try this:
1 %in% sample(1:6,n,TRUE)
This uses sample to realize the results of n independent fair dice and checks whether the outcome 1 appears among any of them.
Once you are satisfied that this emulates your experiment (run it a bunch of times), then indeed replicate will perform the simulation:
x <- replicate(m, 1 %in% sample(1:6,n,TRUE))
That produces m results. Each will be TRUE (interpreted as equal to 1) in the iterations where a 1 appeared and otherwise will be FALSE (interpreted as 0). Consequently, the proportion of iterations in which a 1 appeared can be obtained as
mean(x)
This empirical frequency is a good estimate of the theoretical probability.
As a check, note that 1 will not appear on a single die with a probability of 1-1/6 = 5/6 and therefore--because the n dice are independent--will not appear on any of them with a probability of (5/6)^n. Consequently the chance a 1 will appear must be 1 - (5/6)^n. Let us output those two values: the simulation mean and theoretical result. We might also include a Z score, which is a measure of how far away from the theoretical result the mean is. Typically, Z scores between -2 and 2 aren't significant evidence of any discrepancy.
Here's the full code. Although there are faster ways to write it, this is very fast already and is about as clear as one could make it.
m <- 5000 # Number of simulation iterations
n <- 3 # Number of dice per iteration
set.seed(17) # For reproducible results
x <- replicate(m, 1 %in% sample(1:6,n,TRUE))
# Compare to a theoretical result.
theory <- 1-(5/6)^n
avg <- mean(x)
Z <- (avg - theory) / sd(x) * sqrt(length(x))
c(Mean=signif(avg, 5), Theoretical=signif(theory, 5), Z.score=signif(Z, 3))
The output is
Mean Theoretical Z.score
0.4132 0.4213 -1.1600
Notice that neither result is anywhere near n/6, which would be 1/2 = 0.500.
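As for question 1, one alternative that avoids sum() and replicate() altogether (a sketch, not necessarily faster) is to draw all m*n dice at once as a matrix and check each row for a 1:
set.seed(17)
rolls <- matrix(sample(1:6, m*n, replace = TRUE), nrow = m)  # one row per simulated experiment
mean(apply(rolls, 1, function(r) any(r == 1)))               # proportion of experiments containing a 1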

Preventing a Gillespie SSA Stochastic Model From Running Negative

I have produced a stochastic model of infection (a parasitic worm) using a Gillespie SSA. The model uses the "GillespieSSA" package (https://cran.r-project.org/web/packages/GillespieSSA/index.html).
In short, the code models a population split across discrete compartments. Movement between compartments depends on user-defined rate equations. The SSA algorithm calculates the number of events produced by each rate equation in a given timestep (tau) and updates the population accordingly; the process repeats up to a given end time. The problem is that the number of events is assumed to be Poisson distributed (Poisson(rate[i]*tau)), which produces an error when a rate becomes negative, including when population numbers become negative.
library(GillespieSSA)
# Parameter Values
sir.parms <- c(deltaHinfinity=0.00299, CHi=0.00586, deltaH0=0.0854, aH=0.5,
muH=0.02, SigmaW=0.1, SigmaM =0.8, SigmaL=104, phi=1.15, f = 0.6674,
deltaVo=0.0166, CVo=0.0205, alphaVo=0.5968, beta=52, mbeta=7300 ,muV=52, g=0.0096, N=100)
# Initial Population Values
sir.x0 <- c(W=20,M=10,L=0.02)
# Rate Equations
sir.a <- c("((deltaH0+deltaHinfinity*CHi*mbeta*L)/(1+CHi*mbeta*L))*mbeta*L*N"
,"SigmaW*W*N", "muH*W*N", "((1/2)*phi*f)*W*N", "SigmaM*M*N", "muH*M*N",
"(deltaVo/(1+CVo*M))*beta*M*N", "SigmaL*L*N", "muV*L*N", "alphaVo*M*L*N", "(aH/g)*L*N")
# Population change for each event
sir.nu <- matrix(c(+0.01,0,0,
-0.01,0,0,
-0.01,0,0,
0,+0.01,0,
0,-0.01,0,
0,-0.01,0,
0,0,+0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/32),nrow=3,ncol=11,byrow=FALSE)
runs <- 10
set.seed(1)
# Data Frame of output
sir.out <- data.frame(time=numeric(),W=numeric(),M=numeric(),L=numeric())
# Multiple runs and combining data and SSA methods
for(i in 1:runs){
  sim <- ssa(sir.x0,sir.a,sir.nu,sir.parms, method="ETL", tau=1/12, tf=140, simName="SIR")
  sim.out <- data.frame(time=sim$data[,1],W=sim$data[,2],M=sim$data[,3],L=sim$data[,4])
  sim.out$run <- i
  sir.out <- rbind(sir.out,sim.out)
}
Thus, rates are computed and the model updates the population values at each time step, with the data stored in a data frame and then appended to the previous runs. However, when population levels get very low, the number of events reducing a compartment can exceed the number of individuals in that compartment. One option is to make the time step very small, but this greatly increases the length of the simulation.
My question: is there a way to augment the code so that, as the data are calculated at each time step, any population values that go negative are converted to 0?
I have tried working on this problem, but only seem to be able to come up with methods that alter the values once the simulation is complete, with the negative values still causing issues in the runs themselves.
E.g.
sir.out$L[sir.out$L < 0] <- 0  # post-hoc only; negative values still cause problems during the runs
Any help would be appreciated
I believe the problem is the method you set ("ETL") in the ssa function. The ETL method will eventually produce negative numbers. You can try the "OTL" method (based on "Efficient step size selection for the tau-leaping simulation method"), in which there are a few more parameters that you can tweak, but the basic command is:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="OTL", tf=140, simName="SIR")
Or the direct method, which will not produce negative numbers at all:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="D", tf=140, simName="SIR")
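For example, a sketch of how the alternative method could be dropped into your existing multi-run loop (OTL and the direct method select their own step sizes, so the fixed tau argument is no longer needed; re-initialize sir.out first if you don't want to mix methods):
for(i in 1:runs){
  sim <- ssa(sir.x0, sir.a, sir.nu, sir.parms, method="OTL", tf=140, simName="SIR")
  sim.out <- data.frame(time=sim$data[,1], W=sim$data[,2], M=sim$data[,3], L=sim$data[,4])
  sim.out$run <- i
  sir.out <- rbind(sir.out, sim.out)
}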

Extremely high probability of being alive BTYD R

I am working with the BTYD R package, and the problem is that the values of the probability that a customer is alive at the end of the calibration period are extremely high. Even observations with only one transaction in the calibration period have this probability around 0.9999. I know that the parameter "s" (estimated by the package) is used in this calculation. My gamma is very low (almost 0). When I tried to change it manually to a higher value, the probabilities went down. Any idea how to deal with this problem? I attach my code below.
library(BTYD)
elog <- dc.MergeTransactionsOnSameDate(elog)
end.of.cal.period <- min(elog$date)+as.numeric((max(elog$date)-min(elog$date))/2)
data <- dc.ElogToCbsCbt(elog, per="week",
T.cal=end.of.cal.period,
merge.same.date=TRUE,
statistic = "freq")
cal2.cbs <- as.matrix(data[[1]][[1]])
## parameter estimation
params2 <- pnbd.EstimateParameters(cal2.cbs)
## log likelihood
(LL <- pnbd.cbs.LL(params2, cal2.cbs))
p.matrix <- c(params2, LL)
for (i in 1:20) {
  params2 <- pnbd.EstimateParameters(cal2.cbs, params2)
  LL <- pnbd.cbs.LL(params2, cal2.cbs)
  p.matrix.row <- c(params2, LL)
  p.matrix <- rbind(p.matrix, p.matrix.row)
}
(params2 <- p.matrix[dim(p.matrix)[1],1:4])
# set up parameter names for a more descriptive result
param.names <- c("r", "alpha", "s", "beta")
LL <- pnbd.cbs.LL(params2, cal2.cbs)
# PROBABILITY A CUSTOMER IS ALIVE AT END OF CALIBRATION / TRAINING
x <- cal2.cbs["123", "x"]     # x is frequency
t.x <- cal2.cbs["123", "t.x"] # t.x is recency, i.e. time of the last transaction
T.cal <- 26 # week of end of cal, i.e. present
pnbd.PAlive(params2, x, t.x, T.cal)
There is no "gamma" parameter being estimated - "s" and "beta" define the gamma distribution of dropout rate heterogeneity. I recommend editing your post to include the parameters, as well as the output of
pnbd.PlotDropoutRateHeterogeneity(params2)
Without seeing your parameter estimates or knowing the context of your data, there are at least two (not mutually exclusive) possibilities.
First, you could have a very low (e.g., essentially zero) dropout rate. If so, you can still fit a plain NBD model of the transaction rate and assume a zero dropout rate.
Second, you could be seeing the "increasing frequency paradox". From pages 17-19 of one of Peter Fader/Bruce Hardie's papers:
For low frequency customers, there is an almost linear relationship between recency and [expected transactions]. However, this relationship becomes highly nonlinear for high frequency customers. In other words, for customers who have made a relatively large number of transactions in the past, recency plays a much bigger role in determining [value] than for an infrequent past purchaser.
According to the authors, a customer such as you describe, with few (or even just a single) transaction, receives a high probability of being "alive" with less dependence on recency. This is because, by definition, a low frequency customer can have long "gaps" between purchases. Therefore we should assign less risk to a lower frequency customer even if they have not transacted for some time. Compare this to a high frequency customer: the longer we go without seeing a transaction, the sooner we can conclude that the customer is "dead", since we know they would ordinarily be making many transactions.
