I'm using a topic modeling approach that works well on my computer in RStudio, except that it takes ages. So I'm using a linux cluster. However, also I seem to request a lot of capacity, it doesn't really speed up:
I'm sorry I'm a greenhorn... So this is what I'm using in the pageant shell:
salloc -N 240 --mem=61440 -t 06:00:00 -p med
#!/bin/sh
#SBATCH --nodes=200
#SBATCH --time=06:00:00
#SBATCH --partition=med
#SBATCH --mem=102400
#SBATCH --job-name=TestJobUSERNAME
#SBATCH --mail-user=username#ddomain.com
#SBATCH --mail-type=ALL
#SBATCH --cpus-per-task=100
squeue –u username
cd /work/username/data
module load R
export OMP_NUM_THREADS=100
echo "sbatch: START SLURM_JOB_ID $SLURM_JOB_ID (SLURM_TASK_PID $SLURM_TASK_PID) on $SLURMD_NODENAME"
echo "sbatch: SLURM_JOB_NODELIST $SLURM_JOB_NODELIST"
echo "sbatch: SLURM_JOB_ACCOUNT $SLURM_JOB_ACCOUNT"
Rscript myscript.R
I'm pretty sure there's sth. wrong with my inputs because:
it isn't really faster (but my R code of course could also be just slow - so I tried various R codes with different calculation types)
whether I'm using 1 oder 200 nodes, the calculation of the same R script takes almost exactly the same time (there should be at least 244 nodes, though)
the echo results do not give complete information and I do not receive e-Mail notifications
so these are my typical outcomes:
#just very small request to copy/paste the results, usually I request the one above
[username#gw02 ~]$ salloc -N 2 --mem=512 -t 00:10:00 -p short
salloc: Granted job allocation 1234567
salloc: Waiting for resource configuration
salloc: Nodes cstd01-[218-219] are ready for job
Disk quotas for user username (uid 12345):
-- disk space --
Filesystem limit used avail used
/home/user 32G 432M 32G 2%
/work/user 1T 219M 1024G 0%
[username#gw02 ~]$ squeue -u username
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234567 short bash username R 2:14 2 cstd01-[218-219]
#(directory, module load, etc.)
#missing outcomes for SLURM_TAST_PID and SLUMD_NODENAME:
[username#gw02 data]$ echo "sbatch: START SLURM_JOB_ID $SLURM_JOB_ID (SLURM_TASK_PID $SLURM_TASK_PID) on $SLURMD_NODENAME"
sbatch: START SLURM_JOB_ID 1314914 (SLURM_TASK_PID ) on
Can anybody help? Thank you so much!
EDIT:
As Ralf Stubner points out in his comment, I don't do parallelization in the R Code. I have absolutely no idea on how to do that.
Here is one example calculation:
# Create the data frame
col1 <- runif (12^5, 0, 2)
col2 <- rnorm (12^5, 0, 2)
col3 <- rpois (12^5, 3)
col4 <- rchisq (12^5, 2)
df <- data.frame (col1, col2, col3, col4)
# Original R code: Before vectorization and pre-allocation
system.time({
for (i in 1:nrow(df)) { # for every row
if ((df[i, "col1"] + df[i, "col2"] + df[i, "col3"] + df[i, "col4"]) > 4) { # check if > 4
df[i, 5] <- "greater_than_4" # assign 5th column
} else {
df[i, 5] <- "lesser_than_4" # assign 5th column
}
}
})
... and a shortened "real code":
library(NLP)
library(tm)
library(SnowballC)
library(topicmodels)
library(lda)
library(textclean)
# load data and create corups
filenames <- list.files(getwd(),pattern='*.txt')
files <- lapply(filenames,readLines)
docs <- Corpus(VectorSource(files))
# clean data (shortened, just two examples)
docs.adj <- tm_map(docs.adj, removeWords, stopwords('english'))
docs.adj <-tm_map(docs.adj,content_transformer(tolower))
# create document-term matrix
dtm <- DocumentTermMatrix(docs.adj)
dtm_stripped <- removeSparseTerms(dtm, 0.8)
rownames(dtm_stripped) <- filenames
freq <- colSums(as.matrix(dtm_stripped))
ord <- order(freq,decreasing=TRUE)
### find optimal number of k
burnin <- 10000
iter <- 250
thin <- 50
seed <-list(3)
nstart <- 1
best <- TRUE
seq_start <- 2
seq_end <- length(files)
iteration <- floor(length(files)/5)
best.model <- lapply(seq(seq_start,seq_end, by=iteration), function(k){LDA(dtm_stripped, k, method = 'Gibbs',control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))})
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
best.model.logLik.df <- data.frame(topics=c(seq(seq_start,seq_end, by=iteration)), LL=as.numeric(as.matrix(best.model.logLik)))
optimal_k <- best.model.logLik.df[which.max(best.model.logLik.df$LL),]
print(optimal_k)
### do topic modeling with more iterations on optimal_k
burnin <- 4000
iter <- 1000
thin <- 100
seed <-list(2003,5,63)
nstart <- 3
best <- TRUE
ldaOut <-LDA(dtm_stripped,optimal_k, method='Gibbs', control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
From a quick look at your R script, it looks like it is in:
best.model <- lapply(seq(seq_start,seq_end, by=iteration), function(k){
LDA(dtm_stripped, k, method = 'Gibbs', control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
})
where most of the processing time takes place. Here you could try to parallelize the code using future_lapply() instead of lapply(), i.e.
best.model <- future_lapply(seq(seq_start,seq_end, by=iteration), function(k){
LDA(dtm_stripped, k, method = 'Gibbs', control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
}, future.seed = TRUE)
I've also added future.seed = TRUE to make sure you random number generation is statistically sound when done in parallel. The future_lapply() function is in the future.apply package (*) so you need to do:
library(future.apply)
at the top of your script. Now there's one final thing you need to do - you need to tell it to run in parallel (the default is sequential) by adding:
plan(multiprocess)
also at the top (after attaching future.apply). The default is to use whatever cores are "available" where "available" means it is also agile to the number of cores the HPC scheduler (e.g. Slurm) allocates to your job. If you try the above on your local machine, it will default to using the number of cores it has. That is, you can verify your code also on your local machine and you should see some speedup. When you know it works, then you can rerun it on the cluster via your Slurm allocation and it should work there out of the box - but run with more parallel processes.
You might find my blog post on future.apply from 2018-06-23 useful - it has some FAQ at the end.
(*) Disclaimer: I'm the author of future.apply.
Related
I have an optimised time allocation lp code in R using LpSolverAPI. The code works fine with the given constraints which are :
The time allotted per job must be met
The time a worker works cannot exceed the paid work time
But, i need to add a "priority" while the time is allocated to each job. Meaning, there are jobs that need to be completed first/ must be given first priority while allocating workers.
The code i have is below.
The time must be allotted to job4, job2, job3 and finally job1 (as per priority_matrix)
If no time is workers are free then lower priority jobs can remain un allocated.
With the below code (without priority) no time is allocated to job 4.
library(lpSolveAPI)
jobs <- c(1,1,1,1)
workermax <- c(8,6,9)
jobmax <- c(6,6,5,6)
priority_matrix <- as.matrix(c(4,2,3,1),ncol = 1, bycol = T) ##priority of each job
rownames(priority_matrix) <- c("Job1","Job2","Job3","Job4")
scheduler_input2 <- matrix(c(8, 0, 0, 0,6,0,8,0,9,0,6,0), nrow=4 , byrow =T)
obj.vals <- scheduler_input2
colnames(obj.vals) <- c("worker1","worker2","worker3")
rownames(obj.vals) <- c("Job1","Job2","Job3","Job4")
nworkers <- ncol(scheduler_input2)
njobs <- nrow(scheduler_input2) #4
ncol = nworkers*njobs
distribution<- make.lp(ncol=ncol)
set.type(distribution, columns=1:ncol, type = c("integer"))
set.objfn(distribution, obj.vals)
lp.control(distribution,sense='max')
#cosntraint1
time_per_job <- function (job_index) {
skill_cols <- (0:(nworkers-1))*njobs + job_index
add.constraint(distribution, rep(1,nworkers), indices=skill_cols,type="<=", rhs=jobmax[job_index])
}
lapply(1:njobs, time_per_job)
#cosntraint2
max_hrs_by_worker <- function (DA_index) {
DA_cols <- (DA_index-1)*njobs + (1:njobs) #relevant columns for a given room
add.constraint(distribution, xt=jobs, indices=DA_cols,type="<=",rhs=workermax[DA_index])
}
lapply(1:nworkers, max_hrs_by_worker)
solve(distribution)
get.objective(distribution)
distribution<-matrix(get.variables(distribution), njobs,nworkers)
Thanks in advance.
The description is not precise enough to give a definite answer. Also, write the problem down in mathematical notation before starting to code. The code is not very easy to read and not at all structure-revealing. This is partly a problem with LpSolveAPI which has a somewhat medieval way to represent optimization models.
If you want to enforce that job2 can only be executed if job4 is executed then introduce binary variables:
y[j] = 1 if job j is executed
0 otherwise
and the precedence constraint:
y[job4] >= y[job2]
Context: I am running a simulation via R that each repetition takes too long and it is memory consuming. Therefore, I need that for every repetition another session in R starts, so that I will not run into memory issues.
My problem: After executing my bash script, I exit from the terminal, the current repetition finishes successfully, but the next one does not start (I am running it on a server via ssh).
What I have done:
compoisson.sh bash script:
#!/bin/bash
for rho in $(seq 1 3); do
for rep in $(seq 1 200); do
Rscript Teste.R $rho $rep
done
done
In the terminal (after entering via ssh user#domain...):
chmod +x compoisson.sh
sh compoisson.sh &
exit
My Teste.R script (the content is not important, it could be an empty file):
rm(list=ls())
library(TMB)
# library(TMB, lib.loc = "/home/est/bonat/nobackup/github")
model <- "04_compoisson_bi" #1st try
compile(paste0(model, ".cpp"), flags = "-O0 -ggdb")
dyn.load(dynlib(model))
## Data simulation -------------------------------------------------------------
nresp <- 2; beta1 <- log(7);beta2 <- log(1.5);nu1 <- .7
nu2 <- .7;n <- 50;s2_1 <- 0.3;s2_2 <- 0.15;true_rho <- 0
sample_size <- 1000
openmp(4)
args <- commandArgs(TRUE)
rhos <- c(-.5,0,.5)
true_rho <- rhos[abs(as.numeric(args[1]))]
j <- abs(as.numeric(args[2]))
seed <- 2109+j
res_neg <- simulacao(nresp, beta1, beta2, true_rho, s2_1, s2_2, seed, sample_size = sample_size, model, nu1=nu1, nu2=nu2, j = j) # 1 by time
saveRDS(res_neg, file = paste0(getwd(), "/Output/output_cmp_rho", true_rho, "n", sample_size, "j", j, ".rds"))
An important detail is that I need to run it on a external server via ssh.
I did a small test with an empty .R file on my PC, and I was able too see different processes being created via htop. On server, it did not happened.
I also tried nohup to run my compoisson.sh file (question1, question2), but I did not have any success. My test:
nohup compoisson.sh &
ignoring the entrance and attaching the output to 'nohup.out'
nohup: failt to execute the command 'compoisson.sh': File or directory does not exists.
What am I doing wrong?
Solved with nohup ./compoisson.sh & instead of sh compoisson.sh &
In the book "Machine Learning - A Probabilistic Perspective" by Kevin P. Murphy the first task reads:
Exercise 1.1 KNN classifier on shuffled MNIST data
Run mnist1NNdemo
and verify that the misclassification rate (on the first 1000 test
cases) of MNIST of a 1-NN classifier is 3.8%. (If you run it all on
all 10,000 test cases, the error rate is 3.09%.) Modify the code so
that you first randomly permute the features (columns of the training
and test design matrices), as in shuffledDigitsDemo, and then apply
the classifier. Verify that the error rate is not changed.
My simple understanding is that the exercise is looking for the 1-NN after loading the files(kNN() in R).
The files:
train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
are taken from the The MNIST DATABASE
I found a popular template for loading the files:
# for the kNN() function
library(VIM)
load_mnist <- function() {
load_image_file <- function(filename) {
ret = list()
f = file(filename,'rb')
readBin(f,'integer',n=1,size=4,endian='big')
ret$n = readBin(f,'integer',n=1,size=4,endian='big')
nrow = readBin(f,'integer',n=1,size=4,endian='big')
ncol = readBin(f,'integer',n=1,size=4,endian='big')
x = readBin(f,'integer',n=ret$n*nrow*ncol,size=1,signed=F)
ret$x = matrix(x, ncol=nrow*ncol, byrow=T)
close(f)
ret
}
load_label_file <- function(filename) {
f = file(filename,'rb')
readBin(f,'integer',n=1,size=4,endian='big')
n = readBin(f,'integer',n=1,size=4,endian='big')
y = readBin(f,'integer',n=n,size=1,signed=F)
close(f)
y
}
train <<- load_image_file("train-images.idx3-ubyte")
test <<- load_image_file("t10k-images.idx3-ubyte")
train$y <<- load_label_file("train-labels.idx1-ubyte")
test$y <<- load_label_file("t10k-labels.idx1-ubyte")
}
show_digit <- function(arr784, col=gray(12:1/12)) {
image(matrix(arr784, nrow=28)[,28:1], col=col)
}
According to the comment, in the command line this should work:
# Error "Error in matrix(arr784, nrow = 28) : object 'train' not found"
show_digit(train$x[5,])
The question is how can I use the show_digit function ?
Edit Remove extra question
What I figured out for the problem:
First run the whole file in R Studio or ESS, then call the load_mnist() from the console.
After that execute show_digit(train$x[3,]) in the console again and it works.
Finding the KNN classifier can be done on the whole data set:
a <- knn(train, test, train$y) but it would be a very slow process.
Predictions for the result can be done like table(test$y, a), test$y is predicted, a is the actual result.
I am trying to figure out how i could use any of the parallel processing packages like foreach or doParallel in this random forest loop i have created:
ModelInfo <- data.frame ( model=as.numeric()
,Nodesize=as.numeric()
,Mrty=as.numeric()
,Maxdepth=as.numeric()
,Cp=as.numeric()
,Accuracy_Training=as.numeric()
,AUC_Training=as.numeric())
w=1
set.seed(1809)
NumberOfSamples=1
# Number of iterations
rfPred=list()
pred=list()
roundpred=list()
cTab=list()
Acc=list()
pred.to.roc=list()
pred.rocr=list()
perf.rocr=list()
AUC=list()
Var_imp=list()
rf_model_tr = list()
length(rf_model_tr) <- NumberOfSamples
for (i in 1:NumberOfSamples)
{
rf_model_tr[[i]] = list()
rfPred[[i]]=list()
pred[[i]]=list()
roundpred[[i]]=list()
cTab[[i]]=list()
Acc[[i]]=list()
pred.to.roc[[i]]=list()
pred.rocr[[i]]=list()
perf.rocr[[i]]=list()
AUC[[i]]=list()
Var_imp[[i]]=list()
## Tune nodesize
nodesize =c(10,20,50,80,100,200)
n=length(nodesize)
length(rf_model_tr[[i]]) <- n
for ( j in 1: length (nodesize))
{
rf_model_tr[[i]][[j]] = list()
rfPred[[i]][[j]]=list()
pred[[i]][[j]]=list()
roundpred[[i]][[j]]=list()
cTab[[i]][[j]]=list()
Acc[[i]][[j]]=list()
pred.to.roc[[i]][[j]]=list()
pred.rocr[[i]][[j]]=list()
perf.rocr[[i]][[j]]=list()
AUC[[i]][[j]]=list()
Var_imp[[i]][[j]]=list()
## Tune mrty
mrtysize =c(2,3,4)
m=length(mrtysize)
length(rf_model_tr[[i]][[j]]) <- m
for ( k in 1: length (mrtysize))
{
rf_model_tr[[i]][[j]][[k]] = list()
rfPred[[i]][[j]][[k]]=list()
pred[[i]][[j]][[k]]=list()
roundpred[[i]][[j]][[k]]=list()
cTab[[i]][[j]][[k]]=list()
Acc[[i]][[j]][[k]]=list()
pred.to.roc[[i]][[j]][[k]]=list()
pred.rocr[[i]][[j]][[k]]=list()
perf.rocr[[i]][[j]][[k]]=list()
AUC[[i]][[j]][[k]]=list()
Var_imp[[i]][[j]][[k]]=list()
## Tune maxdepth
maxdep =c(10,20,30)
z=length(maxdep)
length(rf_model_tr[[i]][[j]][[k]]) <- z
for (l in 1:length (maxdep))
{
rf_model_tr[[i]][[j]][[k]][[l]] = list()
rfPred[[i]][[j]][[k]][[l]]=list()
pred[[i]][[j]][[k]][[l]]=list()
roundpred[[i]][[j]][[k]][[l]]=list()
cTab[[i]][[j]][[k]][[l]]=list()
Acc[[i]][[j]][[k]][[l]]=list()
pred.to.roc[[i]][[j]][[k]][[l]]=list()
pred.rocr[[i]][[j]][[k]][[l]]=list()
perf.rocr[[i]][[j]][[k]][[l]]=list()
AUC[[i]][[j]][[k]][[l]]=list()
Var_imp[[i]][[j]][[k]][[l]]=list()
## Tune cp
cp =c(0,0.01,0.001)
p=length(cp)
length(rf_model_tr[[i]][[j]][[k]][[l]]) <- p
for (m in 1:length (cp))
{
rf_model_tr[[i]][[j]][[k]][[l]][[m]]= randomForest (as.factor(class) ~.
, data=train,mtry=mrtysize[[k]],maxDepth = maxdep[[l]], replace=F, importance=T, do.trace=10, ntree=200,nodesize=nodesize[j],cp=cp[[m]])
#Accuracy
rfPred[[i]][[j]][[k]][[l]][[m]] <- predict(rf_model_tr[[i]][[j]][[k]][[l]][[m]], train, type = "prob")
pred[[i]][[j]][[k]][[l]][[m]] <- colnames(rfPred[[i]][[j]][[k]][[l]][[m]] )[apply(rfPred[[i]][[j]][[k]][[l]][[m]] ,1,which.max)]
cTab[[i]][[j]][[k]][[l]][[m]] = table(pred[[i]][[j]][[k]][[l]][[m]],train$class)
Acc[[i]][[j]][[k]][[l]][[m]]<- sum(diag(cTab[[i]][[j]][[k]][[l]][[m]])) / sum(cTab[[i]][[j]][[k]][[l]][[m]])
#AUC
pred.to.roc[[i]][[j]][[k]][[l]][[m]]<-rfPred[[i]][[j]][[k]][[l]][[m]][,2]
pred.rocr[[i]][[j]][[k]][[l]][[m]]<-prediction(pred.to.roc[[i]][[j]][[k]][[l]][[m]],as.factor(train$class))
perf.rocr[[i]][[j]][[k]][[l]][[m]]<-performance(pred.rocr[[i]][[j]][[k]][[l]][[m]],measure="auc",x.measure="cutoff")
AUC[[i]][[j]][[k]][[l]][[m]]<-as.numeric(perf.rocr[[i]][[j]][[k]][[l]][[m]]#y.values)
#Variable Importance
Var_imp[[i]][[j]][[k]][[l]][[m]]<-(importance(rf_model_tr[[i]][[j]][[k]][[l]][[m]],type=2))
ModelInfo[w,1]<-w
ModelInfo[w,2]<-nodesize[[j]]
ModelInfo[w,3]<-mrtysize[[k]]
ModelInfo[w,4]<-maxdep[[l]]
ModelInfo[w,5]<-cp[[m]]
ModelInfo[w,6]<-Acc[[i]][[j]][[k]][[l]][[m]]
ModelInfo[w,7]<-AUC[[i]][[j]][[k]][[l]][[m]]
w=w+1
}
}
}
}
}
Basically ,what i am doing is that i am creating all possible model variations with one dataset based on the available tuning parameters for a random forest (nodesize,cp ect) and storing that information to the table model info as every iteration goes by. In addition i add measures like accuracy and AUC, so as to compare the different models created in the end and make a pick.
The reason i am looking for an alternative, is that the caret package offers me only to tune the mtry allthough there i do have the chance to run parRF which could solve my problem, but i prefer to incorporate something here, how would that be possible?
I have read about the foreach and doParallel packages but i dont quite get how this could be syntaxed here.
If the initial data is needed please let me know, i just thought at this point to show the part that neeeds to be parallel computed.
Thank you in advance
Hi I normally just code everything manually. In linux/mac I use parallel package and mclapply which can use memory forking. Forking processes use less memory and are faster to start up. Windows do not support forking thus I use doParallel package (other packages could do also). the foreach() function is a user friendly parallel mapper. I find myself to spend more time setting up single PC parallel computing than saving from speed-up. Still fun :)
If you work on a university, you may have access to a large cluster. The BatchJobs package is another mapper which can use many different backends, e.g. a Torque/PBS que system. I can borrow 80 nodes with 4 CPU's giving me a potential 320 times speedup (more like 150 times in practice). I learned about BatchJobs from this great introduction. I like that BatchJobs also can run single or multi-core locally, which is much easier to debug.
The code below introduces how to create a list of jobs with both foreach and BatchJobs. Each job is a set of arguments. The job arguments are fused with standard arguments and a model is trained. Some statistics is returned and all results and arguments are combined into a data.frame.
useForeach = FALSE #If FALSE, will run as batchjobs. Only faster for cluster computing.
library(randomForest)
#load a data set
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
download.file(url,destfile="winequality-white.csv",mode="w")
wwq = read.csv(file="winequality-white.csv",header=T,sep=";")
X = wwq[,names(wwq) != "quality"]
y = wwq[,"quality"]
#2 - make jobs
pars = expand.grid(
mtry = c(1:3),
sampsize = floor(seq(1000,1898,length.out = 3)),
nodesize = c(1,3)
)
jobs = lapply(1:dim(pars)[1], function(i) pars[i,])
#3 - make node function there will excute a number of jobs
test.pars = function(someJobs,useForeach=TRUE) {
#if running cluster, global environment imported manually
if(!useForeach) load(file="thisState.rda")
do.call(rbind,lapply(someJobs,function(aJob){ #do jobs and bind results by rows
print(aJob)
merged.args = c(alist(x=X,y=y),as.list(aJob)) #merge std. and job args
run.time = system.time({rfo = do.call(randomForest,merged.args)}) #run a job
data.frame(accuracy=tail(rfo$rsq,1),run.time=run.time[3],mse=tail(rfo$mse,1))
}))
}
##test function single core
jobsToTest = 1:5
out = test.pars(jobs[jobsToTest])
print(cbind(out,do.call(rbind,jobs[jobsToTest])))
#4a execute jobs with foreach package:
if(useForeach) {
library(foreach)
library(doParallel)
CPUs=4
cl = makeCluster(CPUs)#works both for windows and linux, otherwise forking is better
registerDoParallel(cl)
nodes=min(CPUs,length(jobs)) #how many splits of jobList, not so important for foreach...
job.array = suppressWarnings(split(jobs,1:nodes)) #split warns if each core cannot get same amount of jobs
out = foreach(i=job.array,.combine=rbind,.packages="randomForest") %dopar% test.pars(i)
stopCluster(cl)
} else {
library(BatchJobs)
#4b - execute jobs with BatchJobs package (read manual how to set up on cluster)
nodes=min(80,length(jobs)) # how many nodes to split job onto
job.array = split(jobs,1:nodes)
save(list=ls(),file="thisState.rda") #export this state(global environment) to every node
#initiate run
reg = makeRegistry(id ="myFirstBatchJob",packages="randomForest")
batchMap(reg,fun=test.pars,someJobs = job.array,more.args=list(useForeach=FALSE))
submitJobs(reg)
waitForJobs(reg)
out = loadResults(reg)
#6- wrap up save filnalResults to user
finalResult = cbind(do.call(rbind,jobs),do.call(rbind,out))
save(out,file="finalResult.rda")
removeRegistry(reg,ask="no")
}
#7- print final result
print(cbind(do.call(rbind,jobs),out))
I have a basic question on trying to make OpenBUGS322 run with R using R2OpenBUGS.
When running several trial scripts (see one example below) I get an error
Error in matrix(, n.sims, n.parameters) :
invalid 'nrow' value (too large or NA)
Through a search I found that other people experienced this with OpenBUGS322 and OpenBUGS321 should work fine. HOWEVER, could anybody advice where could I find older version.
I need OpenBUGS321setup.exe and cannot find it anywhere.
(I have fixed several other issues such as dbus config, and the OpenBUGS opens now when prompted from XQuartz)
Here is more info to find out whether I am on the right path.
I am running versions:
Mac OS X 10.5.8
Wine 1.4.1
XQuartz 2.6.3
OpenBUGS322
Trial code from WinBUGS textbook with added path for WINE and OpenBUGS (not 100% sure if correct)
Thanks a lot for any help.
Script:
y10<-rnorm(n=10, mean=600, sd=30)
y1000<-rnorm(n=1000, mean=600, sd=30)
summary(lm(y1000~1))
library(R2OpenBUGS)
#setwd("/Users/Pavel/wine/ProgramFiles/OpenBUGS322")
setwd("/Users/Pavel/Documents/R/OpenBUGS")
sink("model.txt")
cat("
model{
population.mean~dunif(0,5000)
precision<-1/population.variance
population.variance<-population.sd* population.sd
population.sd~dunif(0,100)
for(i in 1:nobs){
mass[i]~dnorm(population.mean, precision)
}
}
", fill=TRUE)
sink()
WINE="/opt/local/bin/wine"
WINEPATH="/opt/local/bin/winepath"
OpenBUGS.pgm="/Users/Pavel/wine/ProgramFiles/OpenBUGS322/OpenBUGS.exe"
#Package the data to be handed to OpenBUGS
win.data<-list(mass=y1000, nobs=length(y1000))
#Function to generate starting values
inits<-function()
list(population.mean=rnorm(1,600), population.sd=runif(1,1,30))
#Parameters to be monitored
params<-c("population.mean", "population.sd", "population.variance")
#MCMC settings
nc<-3 #Number of chains
ni<-1000 #Number of draws for each chain
nb<-1 #number of draws to discard as burn -in
nt<-1 #Thinning rate
out<-bugs(data=win.data, inits=inits, parameters.to.save=params, model.file="model.txt", n.thin=nt, n.chains=nc, n.burnin=nb, n.iter=ni, OpenBUGS.pgm=OpenBUGS.pgm, WINE=WINE, WINEPATH=WINEPATH,useWINE=T)
Could you use rjags instead? Syntax is almost the same:
require(rjags)
modelstring <- "
model {
for(i in 1:nobs){
mass[i] ~ dnorm(m, prec) # precision
}
m ~ dunif(0, 5000)
prec <- 1/sqrt(SD) # convert to Std Deviation
SD ~ dunif(0, 100)
}
"
writeLines(modelstring,con="model.txt")
y1000 <- stats::rnorm(n=1000, mean=600, sd=30)
dataList <- list(
mass = y1000,
nobs = length(y1000)
)
initsList <- list(
m = stats::rnorm(n=1, mean=600, sd=1),
SD = stats::runif(n=1, min=1, max=30)
)
parameters <- c("m","SD") # to be monitored.
adaptSteps <- 100 # "tune" the samplers.
burnInSteps <- 100 # "burn-in" the samplers.
nChains <- 3 # Number of chains to run.
numSavedSteps <-2000 # Total number of steps in chains to save.
thinSteps <- 1 # Number of steps to "thin" (1=keep every step).
nPerChain <- ceiling(( numSavedSteps * thinSteps ) / nChains) # Steps per chain
jagsModel <- rjags::jags.model(
"model.txt", data=dataList,
inits=initsList, n.chains=nChains,
n.adapt=adaptSteps)
stats::update(jagsModel, n.iter=burnInSteps)
MCMC1 <- as.matrix(rjags::coda.samples(
jagsModel, variable.names=parameters,
n.iter=nPerChain, thin=thinSteps))
SDsample <- matrix(MCMC1[,grep("SD",colnames(MCMC1))],
ncol=length(grep("SD",colnames(MCMC1))))
You can then convert to variance with:
SDsample^2
If you're looking for OpenBUGS321setup.exe you should be able to find it here. I note that it hasn't been well tested under WINE. Is a Linux emulator a possibility?