Supplying a seed to Stan doesn't guarantee the same chains

I was trying to compare the run time of two essentially equivalent but slightly different models. But the chains always take very different amounts of time depending on the random numbers sampled, so I tried to fix this by supplying the same seed parameter to stan().
I ran the same model twice. Unfortunately, I found that even the same seed does not guarantee that the process produces the same samples! The run times of the chains also differ considerably between the two runs!
100 iterations, 50 warmup, 15 chains
1st run: total 57.42 seconds, chain 14: 4.515 seconds, chain 15: 2.782 seconds;
2nd run: total 76.99 seconds, chain 14: 2.703 seconds, chain 15: 4.125 seconds.
How is it possible that the seed doesn't guarantee the same chains in Stan?
How can this be guaranteed?
The following options were used to run Stan (rstan version 2.5.0):
t1 <- proc.time()
out7.5b <- out <- stan(fit = fit7.5b, data = win.data, init = inits,
                       pars = params,
                       iter = ni, warmup = nb, chains = nc, seed = 123)
t2 <- proc.time()
print(t2 - t1)
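For reference, a minimal sketch of how one might check whether two runs with the same seed actually reproduce each other (assuming fit7.5b, win.data, inits, params, ni, nb, and nc from the call above, and an rstan version that provides get_elapsed_time()):
run1 <- stan(fit = fit7.5b, data = win.data, init = inits, pars = params,
             iter = ni, warmup = nb, chains = nc, seed = 123)
run2 <- stan(fit = fit7.5b, data = win.data, init = inits, pars = params,
             iter = ni, warmup = nb, chains = nc, seed = 123)
get_elapsed_time(run1)   # warmup/sampling seconds per chain
get_elapsed_time(run2)
# compare the raw draws chain by chain
identical(extract(run1, permuted = FALSE), extract(run2, permuted = FALSE))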

Related

Is there a way in R to find a maximum value during a three-point estimate?

I am using the R programming language. Suppose I have the following three-point estimate data: Data
Here, Task 1 & Task 2 are done in parallel, whereas Task 3 and Task 4 are done in series, with Task 4 dependent on the completion of Task 3. So the minimum time from Task 1 & Task 2 is 10, the most likely is 20, and the maximum is 40, which is then added to Task 3 & Task 4 to give the total time.
When a three-point cost estimate is given, the min, most likely, and max costs are simply added together and a simulation (1000, 10000, ... trials) is run. But for time the general rule is: times for tasks in series are added, while the time for tasks in parallel equals the time of the longest task.
How is the time estimation done in R, given that we are adding up rows for multiple simulations in one go?
code:
inv_triangle_cdf <- function(P, vmin, vml, vmax){
  Pvml <- (vml - vmin)/(vmax - vmin)
  return(ifelse(P < Pvml,
                vmin + sqrt(P*(vml - vmin)*(vmax - vmin)),
                vmax - sqrt((1 - P)*(vmax - vml)*(vmax - vmin))))
}
# number of simulation trials
n <- 1000
# read in cost data
task_costs <- read.csv(file="task_costs.csv", stringsAsFactors = FALSE)
str(task_costs)
# set seed for reproducibility
set.seed(42)
# create data frame with rows = number of trials and cols = number of tasks
csim <- as.data.frame(matrix(nrow=n, ncol=nrow(task_costs)))
# for each task
for (i in 1:nrow(task_costs)){
  # set task costs
  vmin <- task_costs$cmin[i]
  vml  <- task_costs$cml[i]
  vmax <- task_costs$cmax[i]
  # generate n random numbers (one per trial)
  psim <- runif(n)
  # simulate n instances of task
  csim[,i] <- inv_triangle_cdf(psim, vmin, vml, vmax)
}
# sum costs for each trial
ctot <- csim[,1] + csim[,2] + csim[,3] + csim[,4] # costs add
ctot
How can I update this to accommodate the time durations from the data given above?
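A minimal sketch of one way to adapt the cost simulation to durations, assuming a hypothetical task_times.csv with columns tmin, tml, tmax (these names are illustrative, not from the post): simulate each task as before, then take the per-trial maximum of the parallel tasks with pmax() and add the serial tasks.
# assumes inv_triangle_cdf(), n, and set.seed() from the code above
task_times <- read.csv(file="task_times.csv", stringsAsFactors = FALSE)  # hypothetical file/column names
tsim <- as.data.frame(matrix(nrow=n, ncol=nrow(task_times)))
for (i in 1:nrow(task_times)){
  psim <- runif(n)
  tsim[,i] <- inv_triangle_cdf(psim, task_times$tmin[i], task_times$tml[i], task_times$tmax[i])
}
# parallel tasks: per-trial duration is the longest of Task 1 and Task 2
# serial tasks: durations add
ttot <- pmax(tsim[,1], tsim[,2]) + tsim[,3] + tsim[,4]
hist(ttot)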

Why does the processing time behave differently for these two functions using parallel?

Imagine I have two functions: one is a simple mean of a sum of squares, and the other, a little more elaborate, computes a regression. I want to apply both to the rows of a "big" matrix or data frame.
To take advantage of multiple cores (on Windows) I tried the parallel package and got very different results for the two functions using the same sequence of commands.
For the apparently more complex function (the regression), the time reduction from using more cores is significant (here I show results from a PC with 3 cores and a PC with 12 cores; the behavior is similar with up to 11 cores, and the time reduction shrinks as more cores are added).
But for the "simple" function, the mean of squares, the execution time is very variable, almost erratic (also tested with up to 11 cores).
First, is there a reason why this is happening? Second, I imagine there are other ways to do this task; can you suggest any?
Here is the code to generate the plots:
library(parallel)
nc <- detectCores() - 1                                # number of cores
myFun  <- function(z) coef(lm(rep(1, length(z)) ~ z))  # regression
myFun2 <- function(z) sum(z^2)/length(z)               # mean of squares
my.mat <- matrix(rnorm(1000000, .01, 0.4), ncol=100)   # data
# using FUN = myFun
# replicate 10 times
for(j in 1:10){
  ncor <- 2:nc
  timed <- c()
  for (i in seq_along(ncor)){
    cl <- makeCluster(mc <- getOption("cl.cores", ncor[i]))
    stime <- Sys.time()
    res <- parApply(cl = cl, X = my.mat, MARGIN = 1, FUN = myFun)
    tm <- Sys.time() - stime
    timed[i] <- tm
    stopCluster(cl)
  }
  # no cores (serial apply)
  stime <- Sys.time()
  res <- apply(my.mat, MARGIN = 1, FUN = myFun)
  tm <- Sys.time() - stime
  (dr <- data.frame(nc=c(1, ncor), ts=as.numeric(c(tm, timed))))
  plot(dr, type="l", col=3, main=j)
  #stopCluster(cl)
  if (j==1) fres1 <- dr else fres1 <- merge(fres1, dr, by="nc")
}
plot(fres1[,1:2], type="l", col=2, ylim=range(fres1[,-1]))
for(i in 3:11) lines(fres1[,i], col=i+1)
# For the second plot use the same code but change FUN = myFun2
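For the cheap mean-of-squares function, the cost of serializing the matrix to each worker likely dominates the computation itself, which is one plausible reason the parallel timings look erratic. A minimal vectorized alternative for that case (assuming my.mat and myFun2 from the code above) avoids parallelization entirely:
# vectorized alternative to apply/parApply for the mean-of-squares function
res_vec <- rowSums(my.mat^2) / ncol(my.mat)
# should match the serial result
all.equal(res_vec, apply(my.mat, 1, myFun2))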

How to efficiently parallelize brms::brm?

Problem summary
I am fitting a brms::brm_multiple() model to a large dataset in which missing data have been imputed using the mice package. The size of the dataset makes parallel processing very desirable. However, it isn't clear to me how to best use the compute resources, because I am unclear about how brms divides the sampling of the imputed datasets among cores.
How can I choose the following to maximize efficient use of compute resources?
number of imputations (m)
number of chains (chains)
number of cores (cores)
Conceptual example
Let's say that I naively (or deliberately foolishly, for the sake of example) choose m = 5, chains = 10, cores = 24. There are thus 5 x 10 = 50 chains to be allocated among the 24 cores reserved on the HPC. Without parallel processing, this would take ~50 time units (excluding compilation time).
I can imagine three parallelization strategies for brm_multiple(), but there may be others:
Scenario 1: Imputed datasets in parallel, associated chains in serial
Here, each of the 5 imputations is allocated to its own processor, which runs through its 10 chains in serial. The processing time is 10 units (a 5x speed improvement vs. non-parallel processing), but poor planning has wasted 19 cores x 10 time units = 190 core time units (ctu; about 80% of the reserved compute resources). The efficient solution would be to set cores = m.
Scenario 2: Imputed datasets in serial, associated chains in parallel
Here, the sampling begins by taking the first imputed dataset and running each of its chains on one of 10 different cores. This is then repeated for the remaining four imputed datasets. The processing takes 5 time units (a 10x speed improvement over serial processing and a 2x improvement over Scenario 1). However, here too compute resources are wasted: 14 cores x 5 time units = 70 ctu. The efficient solution would be to set cores = chains.
Scenario 3: Free-for-all, wherein each core takes on a pending imputation/chain combination when it becomes available until all are processed.
Here, the sampling begins by allocating all 24 cores, each to one of the 50 pending chains. After they finish their iterations, a second batch of 24 chains is processed, bringing the total processed to 48. Now only two chains remain, and 22 cores sit idle for 1 time unit. The total processing time is 3 time units, and the wasted compute resource is 22 ctu. The efficient solution would be to choose cores so that m x chains is an exact multiple of cores.
Minimal reproducible example
This code compares the compute time using an example modified from a brms vignette. Here we'll set m = 10, chains = 6, and cores = 4. This makes for a total of 60 chains to be processed. Under these conditions, I would expect the speed improvement (vs. serial processing) to be as follows*:
Scenario 1: 60/(6 chains x ceiling(10 m / 4 cores)) = 3.3x
Scenario 2: 60/(ceiling(6 chains / 4 cores) x 10 m) = 3.0x
Scenario 3: 60/ceiling((6 chains x 10 m) / 4 cores) = 4.0x
*(ceiling/rounding up is used because chains cannot be subdivided among cores)
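For what it's worth, the expected speedups above are plain arithmetic and can be checked with a small helper (scenario_speedup is a hypothetical name, base R only, not a brms API):
scenario_speedup <- function(m, chains, cores){
  total <- m * chains                  # total chains to process
  t1 <- chains * ceiling(m / cores)    # Scenario 1: imputations in parallel, chains in serial
  t2 <- ceiling(chains / cores) * m    # Scenario 2: imputations in serial, chains in parallel
  t3 <- ceiling(total / cores)         # Scenario 3: free-for-all
  c(scenario1 = total / t1, scenario2 = total / t2, scenario3 = total / t3)
}
round(scenario_speedup(m = 10, chains = 6, cores = 4), 1)
# scenario1 scenario2 scenario3
#       3.3       3.0       4.0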
library(brms)
library(mice)
library(tictoc) # convenience functions for timing
# Load data
data("nhanes", package = "mice")
# There are 10 imputations x 6 chains = 60 total chains to be processed
imp <- mice(nhanes, m = 10, print = FALSE, seed = 234023)
# Fit the model first to get compilation out of the way
fit_base <- brm_multiple(bmi ~ age*chl, data = imp, chains = 6,
                         iter = 10000, warmup = 2000)
# Use update() function to avoid re-compiling time
# Serial processing (127 sec on my machine)
tic() # start timing
fit_serial <- update(fit_base, .~., cores = 1L)
t_serial <- toc() # stop timing
t_serial <- diff(unlist(t_serial)[1:2]) # calculate seconds elapsed
# Parallel processing with 4 cores (82 sec)
tic()
fit_parallel <- update(fit_base, .~., cores = 4L)
t_parallel <- toc()
t_parallel <- diff(unlist(t_parallel)[1:2]) # calculate seconds elapsed
# Calculate speed up ratio
t_serial/t_parallel # 1.5x
Clearly I am missing something. I can't distinguish between the scenarios with this approach.
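One option worth trying, assuming your brms version supports the future argument to brm() (which brm_multiple() forwards via ...), is to let the future package schedule all m x chains chains across the reserved workers, which corresponds to the free-for-all Scenario 3. A sketch:
library(future)
plan(multisession, workers = 24)  # one R worker per reserved core (adjust to your HPC allocation)
# with future = TRUE, each chain of each imputed dataset is submitted as its own future,
# so the scheduler fills idle workers as chains finish (the cores argument is then ignored)
fit_future <- brm_multiple(bmi ~ age*chl, data = imp, chains = 6,
                           iter = 10000, warmup = 2000, future = TRUE)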

How to calculate perplexity for LDA with Gibbs sampling

I am fitting an LDA topic model in R on a collection of 200+ documents (65k words total). The documents have been preprocessed and are stored in the document-term matrix dtm. Theoretically, I should expect to find 5 distinct topics in the corpus, but I would like to calculate the perplexity score and see how the model fit changes with the number of topics. Below is the code I use. The problem is that it gives me an error when I try to calculate the perplexity score, and I am not sure how to fix it (I am new to R). The error is in the last line of code. I would appreciate any help.
burnin <- 4000   # burn-in parameter
iter   <- 2000   # number of iterations after burn-in
thin   <- 500    # take every 500th iteration for further use to avoid correlation between samples
seed   <- list(2003, 10, 100, 10005, 765)
nstart <- 5      # use 5 different starting points
best   <- TRUE   # return results of the run with the highest posterior probability
# number of topics (run the algorithm for different values of k and choose by inspecting the results)
k <- 5
# run LDA using Gibbs sampling
ldaOut <- LDA(dtm, k, method="Gibbs",
              control=list(nstart=nstart, seed=seed, best=best,
                           burnin=burnin, iter=iter, thin=thin))
perplexity(ldaOut, newdata = dtm)
Error in method(x, k, control, model, mycall, ...) : Need 1 seeds
It needs one more argument, estimate_theta; use the code below:
perplexity(ldaOut, newdata = dtm, estimate_theta = FALSE)
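To see how model fit changes with the number of topics, the same call can be wrapped in a loop over k. A minimal sketch, assuming dtm, the control settings above, and the topicmodels package:
library(topicmodels)
k_values <- c(2, 5, 8, 10)   # candidate numbers of topics
perp <- sapply(k_values, function(k){
  fit <- LDA(dtm, k, method = "Gibbs",
             control = list(nstart = nstart, seed = seed, best = best,
                            burnin = burnin, iter = iter, thin = thin))
  perplexity(fit, newdata = dtm, estimate_theta = FALSE)
})
plot(k_values, perp, type = "b", xlab = "number of topics k", ylab = "perplexity")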

RStan in RStudio: MCMC running time too long (limited use of available CPU and RAM)

I am a newbie to the RStan world, but I really need it for my thesis. I am using the script and a similar dataset from a researcher at NYU, who reports an estimated time of about 18 hours for a comparable dataset. However, when I try to run my model it won't get past 10% in 18 hours. So I am asking for a little help to understand what I am doing wrong and how to improve the efficiency.
I am running a model with 500 iterations, 100 warmup, and 2 chains, with a Bernoulli-logit likelihood over 5 parameters, trying to estimate 2 of them through a No-U-Turn sampling procedure. (At each step it draws each parameter from a random normal, then estimates y and compares it with the actual data to see whether the new parameters are a better fit.)
y[n] ~ bernoulli_logit( alpha[kk[n]] + beta[jj[n]] - gamma * square( theta[jj[n]] - phi[kk[n]] ) );
(n being about 10 million)
My data is a 10,000 x 1004 matrix of 0s and 1s. To sum it up, it is a matrix of people following politicians on Twitter, and I want to estimate their political positions given whom they follow. I run the model in RStudio with R x64 3.1.1 on Windows 8 Professional, 64-bit, an i7 quad core with 16 GB of RAM.
Checking the performance, rsession uses no more than 14% CPU and 6 GB of RAM, although 7 more GB are free. When I subsample to a 10,000 x 250 matrix, it uses less than 1.5 GB instead. However, I have tried the procedure with a 50 x 50 dataset and it worked just fine, so there is no mistake in the procedure itself.
Rsession opens 8 threads; I see activity on each core but none is fully occupied.
I wonder why my PC does not work at the best of its capabilities and whether there might be some bottleneck, cap, or setting that prevents it from doing so. R is 64-bit (just checked) and so RStan should be too (even though I had some difficulties installing it, which might have messed up some settings).
This is what happens when I compile it:
Iteration: 1 / 1 [100%] (Sampling)
# Elapsed Time: 0 seconds (Warm-up)
# 11.451 seconds (Sampling)
# 11.451 seconds (Total)
SAMPLING FOR MODEL 'stan.code' NOW (CHAIN 2).
Iteration: 1 / 1 [100%] (Sampling)
# Elapsed Time: 0 seconds (Warm-up)
# 12.354 seconds (Sampling)
# 12.354 seconds (Total)
while when I run the full model it works for hours but never gets beyond 10% of the first chain (mainly because I interrupted it when my PC was about to melt down):
Iteration: 1 / 500 [ 0%] (Warmup)
and it uses these settings:
stan.model <- stan(model_code=stan.code, data = stan.data, init=inits, iter=1, warmup=0, chains=2)
## running model
stan.fit <- stan(fit=stan.model, data = stan.data, iter=500, warmup=100, chains=2, thin=thin, init=inits)
Please help me find what is slowing down the procedure (and, if nothing wrong is happening, what I can manipulate to still get some reasonable result in a shorter time).
Thank you in advance,
ML
Here's the model (from Pablo Barbera, NYU):
n.iter <- 500
n.warmup <- 100
thin <- 2   ## this will give up to 200 effective samples for each chain and parameter
Adjmatrix <- read.csv("D:/TheMatrix/Adjmatrix_1004by10000_20150424.txt", header=FALSE)
## 10,000 x 1004 matrix of {0, 1} with the relationship "user i follows politician j"
StartPhi <- read.csv("D:/TheMatrix/StartPhi_20150424.txt", header=FALSE)
## 1004-vector of values in [-1, 1] that should be a good prior for the phi I want to estimate
start.phi <- ba <- c(do.call("cbind", StartPhi))
y <- Adjmatrix
J <- dim(y)[1]
K <- dim(y)[2]
N <- J * K
jj <- rep(1:J, times=K)
kk <- rep(1:K, each=J)
stan.data <- list(J=J, K=K, N=N, jj=jj, kk=kk, y=c(as.matrix(y)))
## rest of the starting values
colK <- colSums(y)
rowJ <- rowSums(y)
normalize <- function(x){ (x - mean(x)) / sd(x) }
inits <- rep(list(list(alpha=normalize(log(colK+0.0001)),
                       beta=normalize(log(rowJ+0.0001)),
                       theta=rnorm(J), phi=start.phi, mu_beta=0, sigma_beta=1,
                       gamma=abs(rnorm(1)), mu_phi=0, sigma_phi=1, sigma_alpha=1)), 2)
## alpha and beta are the popularity of politician j and the propensity of user i to follow people;
## phi and theta are the positions on the political spectrum of politician j and user i; phi has a prior given by expert surveys
## gamma is just a weight on the importance of political closeness
library(rstan)
stan.code <- '
data {
  int<lower=1> J;              // number of twitter users
  int<lower=1> K;              // number of elite twitter accounts
  int<lower=1> N;              // N = J x K
  int<lower=1,upper=J> jj[N];  // twitter user for observation n
  int<lower=1,upper=K> kk[N];  // elite account for observation n
  int<lower=0,upper=1> y[N];   // dummy if user i follows elite j
}
parameters {
  vector[K] alpha;
  vector[K] phi;
  vector[J] theta;
  vector[J] beta;
  real mu_beta;
  real<lower=0.1> sigma_beta;
  real mu_phi;
  real<lower=0.1> sigma_phi;
  real<lower=0.1> sigma_alpha;
  real gamma;
}
model {
  alpha ~ normal(0, sigma_alpha);
  beta ~ normal(mu_beta, sigma_beta);
  phi ~ normal(mu_phi, sigma_phi);
  theta ~ normal(0, 1);
  for (n in 1:N)
    y[n] ~ bernoulli_logit( alpha[kk[n]] + beta[jj[n]] -
                            gamma * square( theta[jj[n]] - phi[kk[n]] ) );
}
'
## compiling the model
stan.model <- stan(model_code=stan.code,
                   data = stan.data, init=inits, iter=1, warmup=0, chains=2)
## running the model
stan.fit <- stan(fit=stan.model, data = stan.data,
                 iter=n.iter, warmup=n.warmup, chains=2,
                 thin=thin, init=inits)
samples <- extract(stan.fit, pars=c("alpha", "phi", "gamma", "mu_beta",
                                    "sigma_beta", "sigma_alpha"))
First, my apologies: I would have introduced this as a comment, but I don't have enough reputation.
Here's the question you asked: "what can I manipulate to still get some reasonable result in a shorter time?"
The answer is: it depends. Instead of representing things as a binary matrix, have you tried reducing the size of the matrix by using counts? Based on the type of model you're trying to run, I imagine there is some non-identifiability in the posterior. Could you try reparameterizing?
Also, you may want to run in CmdStan if R is causing problems with memory management.
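Not part of the original answer, but one concrete thing to try if your Stan version supports multiple indexing (expressions like alpha[kk]) is to vectorize the observation loop in the model block, which often reduces gradient-evaluation time. A sketch of what the model block might look like, with the data and parameter blocks above unchanged:
model {
  alpha ~ normal(0, sigma_alpha);
  beta ~ normal(mu_beta, sigma_beta);
  phi ~ normal(mu_phi, sigma_phi);
  theta ~ normal(0, 1);
  // alpha[kk], beta[jj], theta[jj], phi[kk] are length-N vectors built by
  // multi-indexing, so the explicit loop over n is no longer needed
  y ~ bernoulli_logit(alpha[kk] + beta[jj] - gamma * square(theta[jj] - phi[kk]));
}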
