Speeding up Monte Carlo simulations? - r

I am currently doing research to predict where kudzu (an invasive vine) will spread in Oklahoma over a five-year period by using Monte Carlo simulation. I have created a raster with presence points and loaded it into R.
For each Monte Carlo simulation (each year), I am running 6000 iterations to provide accurate results. However, due to "for" loops, this is taking a long time. The first year usually finishes running in 3 days, however the second year has been running for over 3 weeks and still is not complete.
Is there any way to speed this process up?
Each year builds off the previous one. I have provided the code below for the first two years:
OK.rast16<-raster("OK_rast20161.tif")
p.a16<-as.matrix(OK.rast16)
table(p.a16)
# Set the random number seed so results can be reproduced if needed
set.seed(10)
drow.pa16 <- 133.7197873 # distance of grid cell (meters) in n-s direction
trow.pa16 <- length(p.a16[,1]) # total number of rows
dcol.pa16 <- 133.7197873 # distance of grid cell (meters) in e-w direction
tcol.pa16 <- length(p.a16[1,]) # total number of columns
#####Year 1 of infection in 2016#####
kudzu_sim1_16 <- matrix(0,trow.pa16,tcol.pa16)
for(m in 1:6000)
{
OK.kudzu_16 <- p.a16 # initialize matrix of annual dispersal
for(i in 1:trow.pa16)
{
for(j in 1:tcol.pa16)
{
if(!is.na(p.a16[i,j]) & p.a16[i,j] == 1)
{
for(k in 1:trow.pa16)
{
for(l in 1:tcol.pa16)
{
if(!is.na(OK.kudzu_16[k,l]) & OK.kudzu_16[k,l] == 0)
{
distcalc <- sqrt((abs(i-k)*drow.pa16)^2+(abs(j-l)*dcol.pa16)^2)
prob <- exp(0.0369599-0.00474401*distcalc)
if(prob>runif(1)) {OK.kudzu_16[k,l] <- 1}
}
}
}
}
}
}
kudzu_sim1_16 <- OK.kudzu_16+kudzu_sim1_16
}
#####Year 2 of infection in 2016####
kudzu_sim2_16 <- matrix(0,trow.pa16,tcol.pa16)
for(m in 1:6000)
{
OK.kudzu1_16 <- OK.kudzu_16 # initialize matrix of annual dispersal
for(i in 1:trow.pa16)
{
for(j in 1:tcol.pa16)
{
if(!is.na(OK.kudzu_16[i,j]) & OK.kudzu_16[i,j] == 1)
{
for(k in 1:trow.pa16)
{
for(l in 1:tcol.pa16)
{
if(!is.na(OK.kudzu1_16[k,l]) & OK.kudzu1_16[k,l] == 0)
{
distcalc <- sqrt((abs(i-k)*drow.pa16)^2+(abs(j-l)*dcol.pa16)^2)
prob <- exp(0.0369599-0.00474401*distcalc)
if(prob>runif(1)) {OK.kudzu1_16[k,l] <- 1}
}
}
}
}
}
}
kudzu_sim2_16 <- OK.kudzu1_16+kudzu_sim2_16
}
Here is the raster to load to start the code:
kudzu in OK

Originally a comment, but it rapidly exceeded the length limit:
1) 133m x 133m is a very small grid size for spatial simulations on something as large as an entire state. It might help to find a way to make resolution a parameter of your simulation (rather than a hard-wired number), streamline the code so that it runs well at a coarser resolution (larger cells), then refine the resolution. The raster package lets you control the resolution, for example via the res parameter when creating a raster, or via aggregate() to coarsen one that was loaded from a file.
2) While vectorization will surely help, it is unlikely to transform an algorithm which runs for weeks with no output into one which runs in a reasonable amount of time. Perhaps you need to fundamentally rethink your algorithm. You seem to be comparing every grid cell with every other grid cell. That doesn't strike me as biologically plausible. Kudzu spreads locally. Why should what happens next year to a given 133m x 133m cell in the Oklahoma panhandle depend on the current state of another cell over by Lake Texoma? If your simulation has any biological realism, exp(0.0369599-0.00474401*distcalc) should be negligibly small for two such cells, but your code doesn't neglect it. An algorithm which is in some sense localized might be better.
3) There are an awful lot of entries in your matrix which correspond to points outside of Oklahoma. Unless your model is designed to see how kudzu also diffuses over a large part of Texas, those might be irrelevant for your program. If so, a fundamentally different data structure (e.g. a list of locations) which only has entries for points in Oklahoma might be preferable. Or, maybe not. Just something to think about.
4) For more detailed help, it would help if you explain what your algorithm actually is (and not just what it intends to do). It isn't completely obvious in a quick read of your code.
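To make points 2) and 3) concrete, here is a minimal sketch of one localized dispersal step in base R. It is not the asker's algorithm: it assumes square cells, truncates the kernel where the probability drops below a hypothetical cutoff `min_prob` (which changes the model very slightly), and draws one vector of random numbers per source cell instead of one number per cell pair, so it will not reproduce the original random stream. The function name `spread_one_year` is made up for illustration.

```r
# One dispersal step: every presence cell in `pa` may infect empty cells
# within a limited window around it. NA cells (outside the study area) stay NA.
spread_one_year <- function(pa, cell_m = 133.7197873, min_prob = 1e-6) {
  a <- 0.0369599
  b <- 0.00474401
  # Assumption: beyond this many cells the probability is below min_prob
  radius <- ceiling(((a - log(min_prob)) / b) / cell_m)
  offs <- (-radius:radius) * cell_m
  # Distance from the window centre to every cell in the window
  d <- sqrt(outer(offs^2, offs^2, "+"))
  kern <- pmin(exp(a - b * d), 1)

  out <- pa
  nr <- nrow(pa)
  nc <- ncol(pa)
  # Only visit cells that are actually infected
  src <- which(!is.na(pa) & pa == 1, arr.ind = TRUE)
  for (s in seq_len(nrow(src))) {
    i <- src[s, 1]
    j <- src[s, 2]
    rows <- max(1, i - radius):min(nr, i + radius)
    cols <- max(1, j - radius):min(nc, j + radius)
    p <- kern[rows - i + radius + 1, cols - j + radius + 1, drop = FALSE]
    hit <- matrix(runif(length(p)), nrow(p)) < p
    block <- out[rows, cols, drop = FALSE]
    block[hit & !is.na(block) & block == 0] <- 1
    out[rows, cols] <- block
  }
  out
}

set.seed(10)
pa <- matrix(0, 60, 60)
pa[1:5, ] <- NA      # cells outside the study area
pa[30, 30] <- 1      # a single presence point
out <- spread_one_year(pa)
```

With the default cutoff the window is about 22 cells in each direction, so each infected cell is compared with roughly two thousand neighbours rather than with every cell in the state.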

Related

How can I shorten the runtime of for loops and if statements in R while using igraph for forest fire simulations

I am simulating forest fires in R and have to use the igraph package. My code currently works but is extremely slow. I read through ways of vectorizing my for loops or using seq_along or putting conditionals outside my loops. I was unable to figure out how to use these solutions in my specific code. As for the description of my code: I am simulating forest fires where I loop through 21 different percentages representing the likelihood of a blank vertex becoming a tree (0 through 1 by .05 intervals). In each of these loops I am running 100 full forest fires. Each forest fire is comprised of 50 time steps. In each time step, I check which vertices of my igraph need to be changed to empty, tree, and fire. For the specific problem I am working on, I am tracking the largest number of trees on fire during each forest fire so that I can later generate a graph of the average maximum fire for the 21 different percentages. Any tips on how to speed up this code would be much appreciated.
library(igraph)
OG <- graph.lattice(c(30,30))
V(OG)$color <- "black"
total.burning.tree.max <- matrix(nrow = 21, ncol = 100)
for (p in seq(0, 1, .05)) {
for (x in 1:100) {
fire.start <- sample(900, 1)
tree.start <- sample(900, (900*.7))
G <- OG
V(G)$color[tree.start] <- "green"
V(G)$color[fire.start] <- "red"
current.burning.tree.max <- 1
H <- G
for (h in 1:50) {
if (length(V(G)[color == "red"]) > current.burning.tree.max) {
current.burning.tree.max <- length(V(G)[color == "red"])
}
for (i in 1:length(V(G)[color == "black"])) {
if (runif(1) <= p) {
V(H)$color[V(G)[color == "black"][i]] <- "green"
}
}
if (length(V(G)[color == "red"]) > 0) {
for (d in 1:length(V(G)[color == "red"])) {
V(H)$color[V(G)[color == "red"][d]] <- "black"
potential.fires <- neighbors(G, V(G)[color == "red"][d])
for (z in 1:length(potential.fires)) {
if (V(G)$color[potential.fires[z]] == "green") {
V(H)$color[potential.fires[z]] <- "red"
}
}
}
}
G <- H
}
total.burning.tree.max[round(p*20) + 1, x] <- current.burning.tree.max # rows 1..21; round() guards against floating-point indices
print(current.burning.tree.max)
}
}
burn.numbers <- c()
for (c in 1:21) {
burn.numbers[c] <- mean(total.burning.tree.max[c, ])
}
plot(burn.numbers, type = "l")
General notes on optimising your code:
First of all, your code is full of nested loops where each simulation loops over nodes in igraph to change values one at a time. This is a bad idea, since igraph can change many vertices in a single vectorised assignment, which is much quicker.
Consider for example this loop over all nodes of a given colour like you do:
for (i in 1:length(V(G)[color == "red"])) {
V(H)$color[V(G)[color == "red"][i]] <- "black"
}
It would be better to store the subset of nodes, and use it to make changes all at once:
V(G)$color[ V(G)$color=="red" ] <- "black"
Note also that you need not place runif(1) inside a loop, but you can perform any number of probability comparisons if you let runif() output a vector like so:
runif(sum( V(G)$color=="red" ), 0, 1) <= p
Consider summarising boolean values when you don't need the actual value of a variable or igraph node attribute:
sum(V(G)$color=="red") == length( V(G)$color[ V(G)$color =="red" ] )
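As a rough illustration of these pointers without igraph, the sketch below uses a plain character vector `colour` as a stand-in for V(G)$color (an assumption for the example; the values and `p` are made up). The subset is computed once, and a single vectorised runif() call replaces the per-vertex loop.

```r
set.seed(1)
p <- 0.25
# Stand-in for V(G)$color on a 30 x 30 lattice
colour <- sample(c("black", "green", "red"), 900, replace = TRUE)

is_black <- colour == "black"   # compute the subset once
n_black <- sum(is_black)        # count via a boolean sum
regrow <- runif(n_black) <= p   # one draw per empty cell, all at once
colour[which(is_black)[regrow]] <- "green"
```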
In your example, as often when running simulations in general or in igraph in particular, computation speed depends on dynamics within the simulation. My script below, for example, executes much quicker for time-steps with few trees on fire. The function adjacent_vertices() is an obvious time-bandit here when it is instructed to return mode="total". Yet, that function should be quicker than you looping around on your own.
When you look for iterations that consume a lot of time, you'd find that your script suffers a lot from checking neighbours of burning trees with burning neighbours.
Introducing new behaviours to facilitate optimisation:
My optimizing solution is to introduce a new colour: "orange", for fires that have already spread. Since all trees with burning neighbours catch fire during each time-step, the simulation need not check for neighbours of trees that caught fire before the previous time-step. This significantly reduces the number of neighbour-tests performed by adjacent_vertices(), a function that would run 100*50*270 or so times for p=.05 alone. That's over a million neighbour-checks right there! If we don't need to check for neighbours of orange trees that already have all their neighbours alight, we save a lot of CPU cycles.
I hope I've provided some good general pointers. Next to your script above, the below script, I hope, can serve for pedagogical purposes.
In the script below, I've changed the way of storing simulation data, as well as a function in the simulation that I might have misunderstood. p below now states the probability that burning trees are put out each time-step, while neighbours of burning trees are sure to catch fire in the next time-step (as they were in your simulation).
Each level of p plots an example graph.
Note also that the line that sets new trees on fire can be ever so slightly optimised by removing the runif() that allows you to change values for a separate probability of neighbouring trees to catch fire.
tree_fires <- potential_fires[ runif(length(potential_fires), 0, 1) <= FIRE_PROBABILITY ]
As always in optimising: spend your efforts where they count! Removing the runif() for tree_fires probably saves you only around a millionth of the time compared to moving to orange trees to ease the work of adjacent_vertices().
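The "orange" idea can be sketched independently of igraph. The block below is an illustration in base R on a plain matrix, not the script from this answer: it assumes a 4-neighbour lattice without wrap-around, a fire probability of 1, and no extinguishing step, and the helper `neighbours_of` is a made-up name. Only the cells that ignited in the previous step (the frontier) are asked for their neighbours.

```r
set.seed(42)
n <- 30
forest <- matrix(ifelse(runif(n * n) < 0.7, "green", "gray"), n, n)
forest[n / 2, n / 2] <- "red"   # start the fire near the middle

# Logical matrix marking every cell that has a TRUE 4-neighbour in `mask`
neighbours_of <- function(mask) {
  nr <- nrow(mask)
  nc <- ncol(mask)
  out <- matrix(FALSE, nr, nc)
  out[-1, ]  <- out[-1, ]  | mask[-nr, ]  # cells directly below a TRUE cell
  out[-nr, ] <- out[-nr, ] | mask[-1, ]   # cells directly above a TRUE cell
  out[, -1]  <- out[, -1]  | mask[, -nc]  # cells to the right of a TRUE cell
  out[, -nc] <- out[, -nc] | mask[, -1]   # cells to the left of a TRUE cell
  out
}

max_red <- 1
for (step in 1:50) {
  frontier <- forest == "red"            # trees ignited in the last step
  if (!any(frontier)) break              # nothing left that can spread
  catch <- neighbours_of(frontier) & forest == "green"
  forest[frontier] <- "orange"           # these have now spread their fire
  forest[catch] <- "red"
  max_red <- max(max_red, sum(catch))
}
```

Because orange cells are never revisited, the work per step is proportional to the size of the fire front rather than to the total burnt area.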
A note on your approach:
I've done similar simulations of disease-spreading in social networks. It matters a lot where you put the initial fire. The maximum number of trees on fire in one iteration is capped a lot by the walls of your forest. This would result in significantly higher variation of your measurement within each level assumed by p. I very much suggest that you move to a model which places the initial fire in the middle of your forest. I've included configuration variables for this.
Suggestion summary:
library("igraph")
# Configurations
PROB_LEVELS <- 20 # How many probability levels?
FOREEST_SIMULATIONS <- 100 # How many simulations should occur for each probability level?
TIMESTEPS <- 50 # How many iterations should fires spread for in each simulation?
FIRE_PROBABILITY <- 1 # How likely is it that an adjacent tree will catch fire? (Lower values decrease speed of fire spreading)
FIXED_STARTING_POINT <- TRUE # Should the fire begin at the same place always?
PLAYGROUND <- 30 # The size of the forest (higher values decrease likelihood of hitting forest-walls)
FOREST_DENSITY <- .7 # The percentage of nodes that are trees in an unburnt forest. (higher values facilitate spread of fire)
# 900 trees
OG <- graph.lattice(c(PLAYGROUND, PLAYGROUND))
V(OG)$color <- "gray"
# Store simulation results in a list instead.
stat <- lapply(1:PROB_LEVELS, function(x) rep(NA,FOREEST_SIMULATIONS))
plotforest <- function(graph){plot(graph, vertex.label=NA, vertex.size=5, layout=layout_on_grid(graph) )}
# Make simulations using these probabilities
for (p in 1:PROB_LEVELS/PROB_LEVELS) {
cat("p =",p)
for (x in 1:FOREEST_SIMULATIONS) {
# Each iteration has a different random configuration of forest with a fixed tree-density
G <- OG
V(G)$color[ sample(PLAYGROUND^2, (PLAYGROUND^2 * FOREST_DENSITY )) ] <- "green"
# Fires could start at a random tree or in the "middle"
if(FIXED_STARTING_POINT){
V(G)$color[ round(PLAYGROUND^2/2)-(PLAYGROUND/2) ] <- "red" }
else{
V(G)$color[ sample(PLAYGROUND^2, 1) ] <- "red" }
# Collect simulation data over time-steps during which the fire spreads
burning_tree_max <- 1
for(h in 1:TIMESTEPS){
# Put out trees that are on fire using probability `p`
# This replaces your loop for (i in 1:length(V(G)[color == "red"])) {}
trees_on_fire <- V(G)[ V(G)$color=="red" ] # make this subset only once per iteration. Store it. You could use %in% c('red','orange' )
if(length(trees_on_fire) == 0){break} # Abort time-steps if there are no more contagious fires.
V(G)$color[ trees_on_fire[ runif(length(trees_on_fire), 0, 1) <= p ] ] <- "black"
# Set neighboring trees of burning trees on fire (only green trees can catch fire)
# This replaces your nested loop starting with for (d in 1:length(V(G)[color == "red"])) { }
last_egnited <- V(G)$color=="red"
potential_fires <- adjacent_vertices(G, last_egnited, mode="total")
potential_fires <- unique(unlist(potential_fires))
potential_fires
tree_fires <- potential_fires[ runif(length(potential_fires), 0, 1) <= FIRE_PROBABILITY ]
# Store last time-step's burning trees as orange, and ignite new neighbors.
V(G)$color[last_egnited] <- "orange"
V(G)$color[tree_fires][V(G)$color[tree_fires] == "green"] <- "red" # Set all green subsetted neighbors of flaming trees on fire at once
# No orange tree can have a green neighbour!
# Track maximum number of trees on fire.
burning_tree_max <- max(burning_tree_max, sum(V(G)$color=="red") )
}
# store simulation results as sum of currently burning trees
stat[[round(p*PROB_LEVELS)]][x] <- burning_tree_max # round() guards against floating-point indices such as 2.9999
}
cat(": averaging", round(mean(stat[[round(p*PROB_LEVELS)]], na.rm=T),1), "trees.", fill=T)
plotforest(G)
}
# Plot the simulation results
plot(sapply(stat, function(x) mean(x)), type="l",
ylab="Maximum number of trees on fire", xlab=NA,
main="Snapshot of fires during a simulation",
sub="50 time-cycles on a 30x30 sized forest ")

What is the difference between stats::kmeans and "naive" k-means

I am trying to understand what stats::kmeans does differently from the simple version explained e.g. on Wikipedia. I am honestly so supremely clueless.
Reading the help on kmeans I learned that the default algorithm is Hartigan–Wong, not the more basic method, so there should be a difference, but playing around with some normally distributed variables I couldn't find a case where they differed substantially and predictably.
For reference, this is my utterly horrible code I tested it against
## square of Euclidean metric
my_metric <- function(x=vector(),y=vector()) {
stopifnot(length(x)==length(y))
sum((x-y)^2)
}
## data: xy data
## k: number of groups
my_kmeans <- function(data, k, maxIt=10) {
##get length and check if data lengths are equal and if enough data is provided
l<-length(data[,1])
stopifnot(l==length(data[,2]))
stopifnot(l>k)
## generate the starting points
ms <- data[sample(1:l,k),]
##append the data with g column and initialize last
data$g<-0
last <- data$g
it<-0
repeat{
it<-it+1
##iterate through each data point and assign to cluster
for(i in 1:l){
distances <- rep(Inf, k) # one slot per cluster, not hard-coded to 3
for(j in 1:k){
distances[j]<-my_metric(data[i,c(1,2)],ms[j,])
}
data$g[i] <- which.min(distances)
}
##update cluster points
for(i in 1:k){
points_in_cluster <- data[data$g==i,1:2]
ms[i,] <- c(mean(points_in_cluster[,1]),mean(points_in_cluster[,2]))
}
##break condition: nothing changed
if(my_metric(last,data$g)==0 || it > maxIt){
break
}
last<-data$g
}
data
}
First off, this was a duplication (as I just found out) of this post.
But I will still try to give an example: when the clusters are well separated, Lloyd tends to leave the centers inside the clusters they start in, meaning that some clusters may end up partitioned while others are lumped together.
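One way to look for such a case is to give both algorithms the same deliberately poor starting centers and compare the results. The sketch below uses synthetic data and made-up starting rows. Whether the two fits actually differ depends on the data and the start, so no particular output is claimed.

```r
set.seed(1)
# Three well-separated 2-D clusters of 50 points each (synthetic data)
x <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)
# Poor start: two centers inside the first cluster, one in the second
start <- x[c(1, 2, 51), ]

fit_lloyd <- kmeans(x, centers = start, algorithm = "Lloyd", iter.max = 100)
fit_hw    <- kmeans(x, centers = start, algorithm = "Hartigan-Wong", iter.max = 100)

# Compare total within-cluster sum of squares and the two partitions
print(c(Lloyd = fit_lloyd$tot.withinss, HartiganWong = fit_hw$tot.withinss))
print(table(Lloyd = fit_lloyd$cluster, HartiganWong = fit_hw$cluster))
```

A cross-table with off-diagonal mass, or a clearly larger tot.withinss for one fit, would indicate that the two algorithms settled on different partitions from the same start.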

Why are simulated stock returns re-scaled and re-centered in the “pbo” vignette in the pbo (probability of backtest overfitting) package in R?

Here's the relevant code from the vignette, altered slightly to fit it on the page here, and make it easy to reproduce. Code for visualizations omitted. Comments are from vignette author.
(Full vignette: https://cran.r-project.org/web/packages/pbo/vignettes/pbo.html)
library(pbo)
#First, we assemble the trials into an NxT matrix where each column
#represents a trial and each trial has the same length T. This example
#is random data so the backtest should be overfit.
set.seed(765)
n <- 100
t <- 2400
m <- data.frame(matrix(rnorm(n*t),nrow=t,ncol=n,
dimnames=list(1:t,1:n)), check.names=FALSE)
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
#We can use any performance evaluation function that can work with the
#reassembled sub-matrices during the cross validation iterations.
#Following the original paper we can use the Sharpe ratio as
sharpe <- function(x,rf=0.03/252) {
sr <- apply(x,2,function(col) {
er = col - rf
return(mean(er)/sd(er))
})
return(sr)
}
#Now that we have the trials matrix we can pass it to the pbo function
#for analysis.
my_pbo <- pbo(m,s=8,f=sharpe,threshold=0)
summary(my_pbo)
Here's the portion I'm curious about:
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
Why is the data transformed within the for loop, and does this kind of re-scaling and re-centering need to be done with real returns? Or is this just something the author is doing to make his simulated returns look more like the real thing?
Googling and searching through stackoverflow turned up some articles and posts regarding scaling volatility to the square root of time, but this doesn't look quite like what I've seen. Usually they involve multiplying some short term (i.e. daily) measure of volatility by the root of time, but this isn't quite that. Also, the documentation for the package doesn't include this chunk of re-scaling and re-centering code. Documentation: https://cran.r-project.org/web/packages/pbo/pbo.pdf
So:
1) Why is the data transformed in this way/what is the result of this transformation?
2) Is it only necessary for this simulated data, or do I need to similarly transform real returns?
I posted this question on the r-help mailing list and got the following answer:
"Hi Joe,
The centering and re-scaling is done for the purposes of his example, and
also to be consistent with his definition of the sharpe function.
In particular, note that the sharpe function has the rf (riskfree)
parameter with a default value of .03/252 i.e. an ANNUAL 3% rate converted
to a DAILY rate, expressed in decimal.
That means that the other argument to this function, x, should be DAILY
returns, expressed in decimal.
Suppose he wanted to create random data from a distribution of returns with
ANNUAL mean MU_A and ANNUAL std deviation SIGMA_A, both stated in decimal.
The equivalent DAILY returns would have mean MU_D = MU_A / 252 and standard
deviation SIGMA_D = SIGMA_A/SQRT(252).
He calls MU_D by the name mu_base and SIGMA_D by the name sigma_base.
His loop now converts the random numbers in his matrix so that each column
has mean MU_D and std deviation SIGMA_D.
HTH,
Eric"
I followed up with this:
"If I'm understanding correctly, if I’m wanting to use actual returns from backtests rather than simulated returns, I would need to make sure my risk-adjusted return measure, sharpe ratio in this case, matches up in scale with my returns (i.e. daily returns with daily sharpe, monthly with monthly, etc). And I wouldn’t need to transform returns like the simulated returns are in the vignette, as the real returns are going to have whatever properties they have (meaning they will have whatever average and std dev they happen to have). Is that correct?"
I was told this was correct.
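The explanation above can be checked numerically. The sketch below repeats the vignette's loop on a smaller matrix (5 columns instead of 100, a choice made only for this example) and confirms that every column ends up with daily mean mu_base and daily standard deviation sigma_base. As a consequence, the full-sample Sharpe ratio with rf = 0.03/252 is the same for every column, namely -0.03/sqrt(252).

```r
set.seed(765)
n <- 5       # fewer trials than the vignette, for illustration only
t <- 2400
m <- matrix(rnorm(n * t), nrow = t, ncol = n)

sr_base <- 0
mu_base <- sr_base / 252        # annual mean converted to a daily mean
sigma_base <- 1 / sqrt(252)     # annual sd of 1 converted to a daily sd

for (i in 1:n) {
  m[, i] <- m[, i] * sigma_base / sd(m[, i])   # re-scale: sd becomes sigma_base
  m[, i] <- m[, i] + mu_base - mean(m[, i])    # re-center: mean becomes mu_base
}

col_means <- colMeans(m)
col_sds <- apply(m, 2, sd)

# Full-sample daily Sharpe ratio, using the vignette's default rf
rf <- 0.03 / 252
full_sample_sharpe <- (col_means - rf) / col_sds
```

Since all trials are identical by this measure over the full sample, any ranking that pbo finds on a sub-sample is noise, which is the point of the example.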

vectorize random failure of a graph in R

I have the following function that I intend to run on a large scale with 10000 random networks.
m.ci.fail<-function(graph){
fail <- function(net) {
vids<-sample(V(net),1)
net <- delete.vertices(net,vids)
return(net)
}
compsize <- function(net,graph) {
b <- clusters(graph)
c <- clusters(net)
S <- max(c$csize)/max(b$csize)
return(S)
}
nodes<-1:vcount(graph)
R<-10000
cpmat<-matrix(nrow=length(nodes), ncol=R)
for(i in 1:R){
gr<-rewire(graph,mode="simple",niter=10000)
E(gr)$weight<-E(graph)$weight
grr<-gr
E(grr)$weight<-E(graph)$weight
cp<-numeric(length(nodes))
for(t in 1:(length(nodes)-1)){
gr<-fail(gr)
grcp<-compsize(gr,grr)
cp[t]<-grcp
}
cpmat[,i]<-cp
}
return(cpmat)
}
EDIT: I'm trying to create a randomly re-sampled distribution based on the original graph so I can get a confidence interval and later compare the range of random failures to centrality-based sequential deletions. Testing it as it is took hours with a graph of small size (30 nodes). I figure that if I could find a way to vectorize the random failure function this would be faster. I'm trying to vectorize the two 'for' loops but the fail function is making it a pain. Any suggestion on how I can do that?
Thanks in advance,

Loop structure for basic simulation model in R

I'm trying to write a basic model that simulates the growth of a population (whose initial size is drawn randomly from a normal distribution) and then grows by a user defined amount each 'year' (currently 2 individuals in the code below for argument's sake). The output that is produced only shows the results of one simulation and, within this simulation, the population hasn't grown at all, i.e. for each 'year' the population doesn't add to the previous 'year's' population. I'm assuming that I've stuffed something up in the loop structure and I'm keen for any advice!
n.years <- 3
n.sim <- 5
store.growth <- matrix(ncol=3,nrow= (n.years * n.sim))
for (i in 1:n.sim) {
init.pop.size <- rnorm(1,100,10)
for (j in 1:n.years){
#grow population
grow.pop <- init.pop.size + 5
store.growth[j,] <- cbind(grow.pop, n.years, n.sim)
}
}
store.growth
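One possible fix, as a sketch: the loop above always adds to init.pop.size (so nothing accumulates), writes every simulation into rows 1 to n.years (so later simulations overwrite earlier ones), and stores the totals n.years and n.sim instead of the current year and simulation. The version below keeps a running population, gives each simulation its own block of rows, and records the indices. The variable `growth` and the column names are introduced here for illustration; 5 is the increment used in the code above.

```r
set.seed(1)
n.years <- 3
n.sim <- 5
growth <- 5   # individuals added per year

store.growth <- matrix(NA_real_, nrow = n.years * n.sim, ncol = 3,
                       dimnames = list(NULL, c("pop", "year", "sim")))

for (i in 1:n.sim) {
  pop <- rnorm(1, 100, 10)            # initial size for this simulation
  for (j in 1:n.years) {
    pop <- pop + growth               # grow from last year's size
    row <- (i - 1) * n.years + j      # separate rows for each simulation
    store.growth[row, ] <- c(pop, j, i)
  }
}
store.growth
```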
