How to speed up simulation in R

Below is code that attempts to find an unknown equilibrium price distribution (I hope the details are not important; this is a game-theoretic economics problem) using a crude evolutionary algorithm (found in the main loop).
However, the code runs relatively slowly, and I'm not sure how to make it more efficient. I suspect that the function profit_p1 could be written in a vectorized manner so that the sapply call in profit_all can be avoided (see the sketch after the code).
I'm not a programmer and mainly use R for writing quick simulations, so please bear with me. If anybody could give me a hint, I would greatly appreciate it. Many thanks in advance!
al <- 0.4 # Parameter constraints: al >= ah > 0; al + ah < 1
ah <- 0.3
R0 <- al*(1-al)/(ah*(1-ah)) # Relevant boundaries for R = PH/PL: R0 and R1
R1 <- (1-al)/ah
grid <- seq(0.01, 1, 0.01) # Grid of 100 points on the relevant interval (0, 1]
iter <- 10000000 # Maximal number of iterations
l <- length(grid) # Number of points in the grid
pfreq <- rep(1/l, l) # Initial guess for the (symmetric) first-period price distribution: uniform over the grid
#--------------------
profit_p1 <- function(p1) # Function that gives expected profit for an arbitrary price in the grid
{
  x <- sum(pfreq[grid < p1/R0])*ah*p1 +
    (1-al)*al/(1-ah) * sum(pfreq[(grid >= p1/R0 & grid < p1)]*grid[(grid >= p1/R0 & grid < p1)]) +
    (al+ah)/2 * p1 * sum(pfreq[abs(grid-p1)<0.0001]) +
    al*p1*sum(pfreq[grid > p1 & grid <= p1*R0]) +
    (1-ah)*ah/(1-al) * sum(pfreq[(grid > p1*R0 & grid <= p1*R1)]*grid[(grid > p1*R0 & grid <= p1*R1)]) +
    (1-ah)*p1 * sum(pfreq[grid > p1*R1])
  return(x)
}
profit_all <- function() # Function that gives expected profit for all prices in the grid
{
  return(sapply(grid, profit_p1))
}
#--------------------
for(count in 1:iter) # Main loop
{
  best <- which.max(profit_all()) # Grid point yielding the highest expected profit (computed once per iteration)
  pfreq[best] <- pfreq[best] + 0.0001 # Its frequency is increased slightly
  pfreq <- pfreq / sum(pfreq) # But of course, the total probability mass must sum to 1
  #--------------------
  if(count %% 100 == 0) # Display price distribution and expected profits after every 100 iterations
  {
    plot(grid, profit_all(), ylim = c(0, 0.5), type = "l")
    lines(grid, pfreq / max(pfreq) * 0.4, col = "orange")
  }
  #--------------------
}
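A hedged sketch of the vectorization suspected above (untested against the original; it reuses al, ah, R0, R1, grid, l and pfreq from the code): every term in profit_p1 is linear in pfreq, so the payoffs can be precomputed once as a matrix A, with A[i, j] the contribution of grid point i to the expected profit at price grid[j]. profit_all() then collapses to a single vector-matrix product per iteration.
g  <- grid                                            # rival's price, varying down rows
p1 <- matrix(grid, nrow = l, ncol = l, byrow = TRUE)  # own price, varying across columns
A  <- ah * p1 * (g < p1/R0) +
  (1-al)*al/(1-ah) * g * (g >= p1/R0 & g < p1) +
  (al+ah)/2 * p1 * (abs(g - p1) < 0.0001) +
  al * p1 * (g > p1 & g <= p1*R0) +
  (1-ah)*ah/(1-al) * g * (g > p1*R0 & g <= p1*R1) +
  (1-ah) * p1 * (g > p1*R1)
profit_all_fast <- function() drop(pfreq %*% A) # expected profit at every grid price
Since A never changes inside the loop, each iteration costs one l-by-l matrix product instead of l separate subset-and-sum passes.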

Related

Trying to write a gradient descent algorithm in R

Implement the gradient descent algorithm in this question. Let {X_1, ..., X_n} be a dataset and g(x) = (1/n) * sum_{i=1}^n (x - X_i)^2. It is known that the mean of the dataset is the solution to the following minimization problem: min over x in R of g(x).
To minimize g(x), you are going to use a while loop to implement the gradient descent algorithm, as follows.
Step 0. Initialize x_1 = 0.
Step 1. In the kth step, where k = 1, 2, ..., set x_{k+1} = x_k - 0.99^k * g'(x_k).
Step 2. Repeat Step 1 until |g'(x_k)| is smaller than a small tolerance level tol (e.g., set it to 1e-5) or k exceeds the maximum number of iterations Kmax (e.g., set it to 1000).
You are going to implement the gradient descent algorithm to find the mean. Use the dataset cars$speed for {X_1, ..., X_n}. You don't have to write the algorithm into a function in this question; you are going to do this in the next.
Could someone help me with this?
Here is what I have so far
data(cars)
x1 <- 0
k <- 1
toleranceLevel <- 0.00005
X <- cars$speed
kmax <- 10000
while(x1 > toleranceLevel){
  gxprime <- 2 * mean(x1 - X)
  gxprime
  x1 <- (x1 - .99^k) * gxprime
  if(x1 < toleranceLevel){
    k <- k + 1
  } else {
  }
  if(k == kmax){
    break
  }
  print(k)
}
A working version tracks the change in x between iterations and stops once it falls below the tolerance or k reaches kmax:
data(cars)
x_old <- 0
k <- 1
toleranceLevel <- 0.00005
X <- cars$speed
kmax <- 10000
err <- 1
while(err > toleranceLevel & k < kmax){
  x_new <- x_old - .99^k * 2 * mean(x_old - X)
  err <- abs(x_new - x_old)
  x_old <- x_new
  k <- k + 1
}
x_new
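Since the problem statement says the minimizer of g is the sample mean, the result is easy to sanity-check (not part of the original post):
mean(cars$speed) # 15.4; x_new should converge to this value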

Converting the outputs of a for loop from a list to a data frame

I have constructed a discrete-time SIR model using a loop within a function (I have added my code below).
Currently the results of the iterations come out as a list, which appears to show all the S values first, followed by the I values and then the R values (which I deduced from the nature of the values).
I need the output as a data frame with the column names 'Iteration', 'S', 'I' and 'R' from left to right and the corresponding values underneath, such that each row tells you the iteration and the values of S, I and R at that iteration.
I do not know how to construct a data frame that returns the output values in this way. I only started learning R a few weeks ago and am not yet proficient, so any help would be HUGELY appreciated.
Thank you in advance.
#INITIAL CONDITIONS
S = 999
I = 1
R = 0
#PARAMETERS
beta = 0.003  # infectious contact rate (/person/day)
gamma = 0.2   # recovery rate (/day)
#SIR MODEL WITH POISSON SAMPLING
discrete_SIR_model <- function(){
  for(i in 1:30){ # the number of iterations of the loop indicates the duration
                  # of the model in days, i.e. 'i in 1:30' constitutes 30 days
    deltaI <- rpois(1, beta * I * S) # rate at which individuals in the population are becoming infected
    deltaR <- rpois(1, gamma * I)    # rate at which infected individuals are recovering
    S[i+1] <- S[i] - deltaI
    I[i+1] <- I[i] + deltaI - deltaR
    R[i+1] <- R[i] + deltaR
  }
}
output <- list(c(S, I, R))
output
If a for loop is used, one can define vectors or a data frame beforehand in which the results are stored:
beta <- 0.001 # infectious contact rate (/person/day)
gamma <- 0.2 # recovery rate (/day)
S <- I <- R <- numeric(31)
S[1] <- 999
I[1] <- 1
R[1] <- 0
set.seed(123) # makes the example reproducible
for(i in 1:30){
  deltaI <- rpois(1, beta * I[i] * S[i])
  deltaR <- rpois(1, gamma * I[i])
  S[i+1] <- S[i] - deltaI
  I[i+1] <- I[i] + deltaI - deltaR
  R[i+1] <- R[i] + deltaR
}
output <- data.frame(S, I, R)
output
matplot(output)
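The question also asked for an 'Iteration' column. Assuming the initial state counts as iteration 0, a small addition (not in the original answer) produces the requested layout:
output <- data.frame(Iteration = 0:30, S = S, I = I, R = R) # one row per day
head(output) # each row: the iteration and the S, I, R values at that iteration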
As an alternative, it is also possible to employ a package for this. Package deSolve is intended for differential equations, but it can also solve the discrete case with method "euler":
library(deSolve)
discrete_SIR_model <- function(t, y, p) {
  with(as.list(c(y, p)), {
    deltaI <- rpois(1, beta * I * S)
    deltaR <- rpois(1, gamma * I)
    list(as.double(c(-deltaI, deltaI - deltaR, deltaR)))
  })
}
y0 <- c(S = 999.0, I = 1, R = 0)
p <- c(
  beta = 0.001, # infectious contact rate (/person/day)
  gamma = 0.2   # recovery rate (/day)
)
times <- 1:30
set.seed(576) # to make the example reproducible
output <- ode(y0, times, discrete_SIR_model, p, method="euler")
plot(output, mfrow=c(1,3))
Note: I reduced beta, otherwise the discrete model would become unstable.
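The ode() result is a matrix-like deSolve object; if a plain data frame with the requested column names is wanted, it converts directly (a small addition, assuming the output object above):
df <- as.data.frame(output) # columns: time, S, I, R
names(df)[1] <- "Iteration" # rename 'time' to match the requested layout
head(df)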

Store last variable in a for loop equation into an array

I'm trying to calculate the probability of extinction of a fictional lizard population. To do this, I am running a for loop of 100 simulations over a period of 30 years and recording whether each simulation goes extinct. At the end of the 100 simulations, I need to plot a histogram of the final population size at the end of the 30-year interval. I figured the easiest way to produce the histogram would be to create a separate vector (pop) and store the final population size of each simulation in it. However, I have no idea how to code this and have not found an answer online for my predicament.
I am using the following code:
tmax <- 31
runmax <- 100
Year <- 0:(tmax-1)
N <- numeric(tmax) # vector for the population size
N <- N + 1
epsilon <- numeric(tmax)
rmax <- 0.87992 # maximum growth rate (a value previously calculated)
K <- 34.64252 # carrying capacity (a value previously calculated)
N[1] <- K
extinct <- 0
for(t in 2:tmax){
  sdr <- 0.9469428
  epsilon[t-1] <- rnorm(1, 0, sdr) # random population stochasticity (random chance a population will go extinct)
  N[t] <- exp(rmax*(1-(N[t-1]/K))+epsilon[t-1])*N[t-1]
  if(N[t] < 1.0) {
    N[t] <- 0.0; break
  }
  pop = numeric(runmax)
  pop[1] = N[30]
}
extinct <- extinct + ifelse(N[tmax]<=1, 1, 0)
plot(Year, N, type='l', ylim=c(0,200))
for(i in 1:runmax){
  N <- numeric(tmax)
  N <- N + 1
  N[1] <- K
  for(t in 2:tmax){
    sdr <- 0.9469428
    epsilon[t-1] <- rnorm(1, 0, sdr)
    N[t] <- exp(rmax*(1-(N[t-1]/K))+epsilon[t-1])*N[t-1]
    if(N[t] < 1.0) {
      N[t] <- 0.0
      break
    }
    for(w in 2:runmax){
      pop[w] <- N[30]
    }
  }
  extinct <- extinct + ifelse(N[tmax]<=1, 1, 0)
  lines(Year, N, col=i)
}
So in the above code, pop is the vector where I'm storing the population at N[30]. The idea is then to use hist(pop) to plot the histogram.
Thanks in advance!
You can get the results in a matrix like this:
pop = matrix(rep(0, runmax*tmax), ncol = tmax)
for(i in 1:runmax){
  N <- numeric(tmax)
  N <- N + 1 # this can be removed
  N[1] <- K
  for(t in 2:tmax){
    sdr <- 0.9469428 # this could be placed outside the loops
    epsilon[t-1] <- rnorm(1, 0, sdr)
    N[t] <- exp(rmax*(1-(N[t-1]/K))+epsilon[t-1])*N[t-1]
    if(N[t] < 1.0) {N[t] <- 0.0}
    pop[i,t] = N[t]
    if(N[t] == 0) {break}
  }
  extinct <- extinct + ifelse(N[tmax]<=1, 1, 0)
  lines(Year, N, col=i)
}
hist(pop[,tmax]) # simulation results for tmax
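If only the final sizes are wanted, as in the question's original plan, a plain vector indexed by the simulation number also works. A minimal sketch, reusing tmax, runmax, rmax, K, sdr and epsilon from above:
pop <- numeric(runmax) # one final population size per simulation
for(i in 1:runmax){
  N <- numeric(tmax)
  N[1] <- K
  for(t in 2:tmax){
    epsilon[t-1] <- rnorm(1, 0, sdr)
    N[t] <- exp(rmax*(1-(N[t-1]/K))+epsilon[t-1])*N[t-1]
    if(N[t] < 1.0) { N[t] <- 0.0; break } # extinct; N[tmax] stays 0
  }
  pop[i] <- N[tmax] # final size of simulation i
}
hist(pop)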

Vectorizing code and stuck but good

Here are some sample starting values for variables in the code below.
sd <- 2
sdtheory <- 1.5
meanoftheory <- 0.6
obtained <- 0.8
tails <- 2
I'm trying to vectorize the following code. It is a component of a Bayes factor calculator that was originally written by Dienes and adapted to R by Danny Kaye & Thom Baguley. This part calculates the likelihood for the theory. I've got the thing massively sped up by vectorizing, but I can't match the output of the bit below.
area <- 0
theta <- meanoftheory - 5 * sdtheory
incr <- sdtheory / 200
for (A in -1000:1000){
  theta <- theta + incr
  dist_theta <- dnorm(theta, meanoftheory, sdtheory)
  if(identical(tails, 1)){
    if (theta <= 0){
      dist_theta <- 0
    } else {
      dist_theta <- dist_theta * 2
    }
  }
  height <- dist_theta * dnorm(obtained, theta, sd)
  area <- area + height * incr
}
area
And below is the vectorized version.
incr <- sdtheory / 200
newLower <- meanoftheory - 5 * sdtheory + incr
theta <- seq(newLower, by = incr, length.out = 2001)
dist_theta <- dnorm(theta, meanoftheory, sdtheory)
if (tails == 1){
  dist_theta <- dist_theta[theta > 0] * 2
  theta <- theta[theta > 0]
}
height <- dist_theta * dnorm(obtained, theta, sd)
area <- sum(height * incr)
area
This code exactly reproduces the results of the original if tails <- 2. Everything here so far should copy, paste, and give the exact same results. However, once tails <- 1, the second version no longer matches exactly, even though, as near as I can tell, the new if statement does the equivalent of what happens in the original. Any help would be appreciated.
(I did try to create a more minimal example, stripping it down to just the loop and if statements and a tiny number of slices, and I just couldn't get the code to fail.)
You're dropping observations where theta==0. That's a problem because the output of dnorm is not zero when theta==0. You need those observations in your output.
Rather than drop observations, a better solution would be to set those elements to zero.
incr <- sdtheory / 200
newLower <- meanoftheory - 5 * sdtheory + incr
theta <- seq(newLower, by = incr, length.out = 2001)
dist_theta <- dnorm(theta, meanoftheory, sdtheory)
if (tails == 1){
  dist_theta <- ifelse(theta < 0, 0, dist_theta) * 2
  theta[theta < 0] <- 0
}
height <- dist_theta * dnorm(obtained, theta, sd)
area <- sum(height * incr)
area
The original calculation has an error due to floating-point arithmetic; adding incr each time causes theta to actually equal 7.204654e-14 when it should equal zero. So it's not actually doing the right thing on that pass through the loop; it's not running the <= branch when it should be. Your code is doing the right thing there (at least, it did with these starting values on my machine).
Your code isn't necessarily guaranteed to do the right thing every time either; what seq does is better than adding an increment over and over again, but it's still floating point arithmetic. You really should probably be checking to within machine tolerance of zero, perhaps using all.equal or something similar.
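For instance, a minimal sketch of such a tolerance check (the exact tolerance is a judgment call, not something from the original code):
tol <- .Machine$double.eps^0.5 # a conventional numerical tolerance
if (tails == 1){
  dist_theta[theta <= tol] <- 0 # treat near-zero theta as zero, matching the loop's intent
  dist_theta[theta > tol] <- dist_theta[theta > tol] * 2
}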

Explain the quantile() function in R

I've been mystified by the R quantile function all day.
I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.
From the docs:
Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],
I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x[j+1], based on some mysterious constant gamma.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
So, how do I calculate j? And m?
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1),
where alpha and beta are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
Now I'm really lost. p, which was a constant before, is now apparently a function.
So for Type 7 quantiles, the default...
Type 7
p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.
Anyone want to help me out? In particular, I'm confused by the notation of p being both a function and a constant, what the heck m is, and how to calculate j for some particular p.
I hope that based on the answers here, we can submit some revised documentation that better explains what is going on here.
quantile.R source code
or type: quantile.default
You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". American Statistician 50 (4): 361–365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
The first part comes straight from the paper, but what the documentation writers omitted was that j = int(pn+m). This means Q[i](p) only depends on the two order statistics closest to being p fraction of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations is the sorted series.)
Also, that last sentence is just wrong. It should read
Here gamma depends on the fractional part of np+m, g = np+m-j
As for m, that's straightforward: m depends on which of the 9 algorithms was chosen. So just like Q[i] is the quantile function, m should be considered m[i]. For algorithms 1 and 2, m is 0; for 3, m is -1/2; and for the others, that's in the next part.
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1), where alpha and beta are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as pk, which helps. Especially since in the expression for m, the p is the original p, and the m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (pk, x[k]) are interpolated to get the solution (p, Q[i](p)). Each algorithm only differs in the algorithm for the pk.
As for the last bit, R is just stating what S uses.
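To make the j/gamma machinery concrete, here is a quick worked check of the type 7 case (my own example, not from the paper): with alpha = beta = 1, m = 1 - p, so np + m = 1 + (n-1)p.
x <- c(10, 20, 30, 40) # sorted sample, n = 4
p <- 0.3
index <- 1 + (length(x) - 1) * p # np + m = 1.9
j <- floor(index); g <- index - j # j = 1, gamma = 0.9
(1 - g) * x[j] + g * x[j + 1] # 19, per Q[i](p) = (1 - gamma) x[j] + gamma x[j+1]
quantile(x, p, type = 7, names = FALSE) # also 19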
The original paper gives a list of 6 "desirable properties for a sample quantile" function, and states a preference for #8, which satisfies all but one. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stats geeks like myself would consider the quantiles, and is what's described in Wikipedia.
BTW, in response to dreeves' answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Here's MathWorld's Quantile page, which states Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)
There are various ways of computing quantiles when you give it a vector, and don't have a known CDF.
Consider the question of what to do when your observations don't fall on quantiles exactly.
The "types" are just determining how to do that. So, the methods say, "use a linear interpolation between the k-th order statistic and p(k)".
So, what's p(k)? One guy says, "well, I like to use k/n". Another guy says, "I like to use (k-1)/(n-1)", etc. Each of these methods has different properties that are better suited for one problem or another.
The \alpha's and \beta's are just ways to parameterize the functions p. In one case, they're 1 and 1. In another case, they're 3/8 and -1/4. I don't think the p's are ever a constant in the documentation. They just don't always show the dependency explicitly.
See what happens with the different types when you put in vectors like 1:5 and 1:6.
(also note that even if your observations fall exactly on the quantiles, certain types will still use linear interpolation).
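For instance (a quick illustration, not part of the original answer), the types can disagree even on a tiny sample:
x <- 1:5
sapply(1:9, function(t) quantile(x, probs = 0.25, type = t, names = FALSE))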
I believe the R help documentation is clear after the revisions noted in #RobHyndman's comment, but I found it a bit overwhelming. I am posting this answer in case it helps someone move quickly through the options and their assumptions.
To get a grip on quantile(x, probs=probs), I wanted to check out the source code. Doing this in R was trickier than I anticipated, so I actually just grabbed it from a GitHub repo that looked recent enough to run with. I was interested in the default (type 7) behavior, so I annotated that some, but didn't do the same for each option.
You can see how the "type 7" method interpolates, step by step, both in the code and also I added a few lines to print some important values as it goes.
quantile.default <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
                             type = 7, ...){
  if(is.factor(x)) { # worry about non-numeric data
    if(!is.ordered(x) || ! type %in% c(1L, 3L))
      stop("factors are not allowed")
    lx <- levels(x)
  } else lx <- NULL
  if (na.rm){
    x <- x[!is.na(x)]
  } else if (anyNA(x)){
    stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
  }
  eps <- 100*.Machine$double.eps # this is to deal with rounding things sensibly
  if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
    stop("'probs' outside [0,1]")
  #####################################
  # here is where the terms really used in the default type == 7 situation get defined
  n <- length(x) # how many observations are in the sample?
  if(na.p <- any(!p.ok)) { # set aside NA & NaN
    o.pr <- probs
    probs <- probs[p.ok]
    probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
  }
  np <- length(probs) # how many quantiles are you computing?
  if (n > 0 && np > 0) { # have positive observations and # of quantiles to compute
    if(type == 7) { # be completely back-compatible
      index <- 1 + (n - 1) * probs # this gives the order statistic of the quantiles
      lo <- floor(index)   # the observed order statistic just below each quantile
      hi <- ceiling(index) # above
      x <- sort(x, partial = unique(c(lo, hi))) # the partial thing is to reduce time to sort;
      # it only guarantees that sorting is "right" at these order statistics, important for large vectors.
      # Ties are not broken and tied elements just stay in their original order.
      qs <- x[lo] # the values associated with the "floor" order statistics
      i <- which(index > lo) # which of the order statistics for the quantiles do not land on an order statistic for an observed value
      # this is the difference between the order statistic and the available ranks, i think
      h <- (index - lo)[i] # > 0 by construction
      ## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
      ## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
      qs[i] <- (1 - h) * qs[i] + h * x[hi[i]] # the interpolation step: assemble the estimated quantile by removing h*low and adding back in h*high.
      # h is the arithmetic difference between the desired order statistic and the available ranks.
      # Interpolation only occurs if the desired order statistic is not observed, e.g. the .5 quantile is the actual observed median if n is odd.
      # This means having a more extreme 99th observation doesn't matter when computing the .75 quantile.
      ###################################
      # print all of these things
      cat("floor pos=", c(lo))
      cat("\nceiling pos=", c(hi))
      cat("\nfloor values= ", c(x[lo]))
      cat("\nwhich floors not targets? ", c(i))
      cat("\ninterpolate between ", c(x[lo[i]]), ";", c(x[hi[i]]))
      cat("\nadjustment values= ", c(h))
      cat("\nquantile estimates:")
    } else if (type <= 3){ ## Types 1, 2 and 3 are discontinuous sample qs.
      nppm <- if (type == 3){ n * probs - .5 # n * probs + m; m = -0.5
              } else { n * probs } # m = 0
      j <- floor(nppm)
      h <- switch(type,
                  (nppm > j),                      # type 1
                  ((nppm > j) + 1)/2,              # type 2
                  (nppm != j) | ((j %% 2L) == 1L)) # type 3
    } else {
      ## Types 4 through 9 are continuous sample qs.
      switch(type - 3,
             {a <- 0; b <- 1}, # type 4
             a <- b <- 0.5,    # type 5
             a <- b <- 0,      # type 6
             a <- b <- 1,      # type 7 (unused here)
             a <- b <- 1/3,    # type 8
             a <- b <- 3/8)    # type 9
      ## need to watch for rounding errors here
      fuzz <- 4 * .Machine$double.eps
      nppm <- a + probs * (n + 1 - a - b) # n*probs + m
      j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
      h <- nppm - j
      if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
      x <- sort(x, partial =
                unique(c(1, j[j > 0L & j <= n], (j+1)[j > 0L & j < n], n)))
      x <- c(x[1L], x[1L], x, x[n], x[n])
      ## h can be zero or one (types 1 to 3), and infinities matter
      #### qs <- (1 - h) * x[j + 2] + h * x[j + 3]
      ## also h*x might be invalid ... e.g. Dates and ordered factors
      qs <- x[j+2L]
      qs[h == 1] <- x[j+3L][h == 1]
      other <- (0 < h) & (h < 1)
      if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]
    }
  } else {
    qs <- rep(NA_real_, np)
  }
  if(is.character(lx)){
    qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)
  }
  if(names && np > 0L) {
    names(qs) <- format_perc(probs) # note: format_perc is an unexported helper (stats:::format_perc) if you run this outside the stats package
  }
  if(na.p) { # do this more elegantly (?!)
    o.pr[p.ok] <- qs
    names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
    names(o.pr)[p.ok] <- names(qs)
    o.pr
  } else qs
}
####################
# fake data
x <- c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
y <- c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,9)
z <- c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7)
# quantiles "of interest"
probs <- c(0.5, 0.75, 0.95, 0.975)
# a tiny bit of illustrative behavior
quantile.default(x, probs = probs, names = FALSE)
quantile.default(y, probs = probs, names = FALSE) # only difference is the .975 quantile, since that is driven by the highest 2 observations
quantile.default(z, probs = probs, names = FALSE) # this shifts everything, b/c now none of the quantiles fall on an observation (and of course the distribution changed...),
# but the .75 quantile is still 5.0 b/c the observations just above and below the order statistic for that quantile are still 5. However, it got there for a different reason.
# how does rescaling affect quantile estimates?
sqrt(quantile.default(x^2, probs = probs, names = FALSE))
exp(quantile.default(log(x), probs = probs, names = FALSE))
