I would like to do bootstrapping using the boot library. Since calculating the statistic from each sample is a lengthy process, the entire bootstrap calculation will take several days to complete. Because the computer I am using disconnects every few hours, I would like some checkpoint mechanism so that I do not have to start from scratch every time. Currently, I am running:
results <- boot(data=data, statistic=my_slow_function, R=10000, parallel='snow', ncpus=4, cl=cl)
but I would rather run it with R=100 multiple times so that I can save the intermediate results and retrieve them if the connection hangs up. How can I achieve that?
Thank you in advance
Maybe you can combine the results of several smaller bootstrap runs:
# simulating R=10000 as 100 runs of R=100
results_list <- lapply(1:100, function(x) {
  return(boot(data=data, statistic=my_slow_function, R=100, parallel='snow', ncpus=4)$t)
})
results_t <- unlist(results_list)
hist(results_t)
t0 = mean(results_t)
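To also survive a disconnect, each chunk can be written to disk as it finishes and skipped on restart. A minimal sketch of that idea (the chunk count and file names are illustrative):
n_chunks <- 100                                        # 100 chunks of R=100 ~ R=10000 overall
for (chunk in seq_len(n_chunks)) {
  chunk_file <- sprintf("boot_chunk_%03d.rds", chunk)
  if (file.exists(chunk_file)) next                    # chunk finished before a disconnect
  res <- boot(data=data, statistic=my_slow_function, R=100, parallel='snow', ncpus=4)
  saveRDS(res$t, chunk_file)                           # checkpoint this chunk
}
# after all chunks exist (possibly across several sessions), combine them
results_t <- unlist(lapply(sprintf("boot_chunk_%03d.rds", seq_len(n_chunks)), readRDS))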
I need to add a simulation counter to each simulation in my code, which uses parallel processing. For each simulation there should be an additional value, stored in an extra column, stating that this is "simulation x", the next simulation is "simulation x+1", and so on. The problem is that when I attempt to add a counter with a for loop, the counter only stores one value for each combination of beta, theta and delta, not one for each iteration as well. The pseudo code to help visualise this attempted solution is:
counter<-1
start parallelisation{
function
counter<-counter+1
}
end parallelisation
I've created a very simplified version of my code; hopefully, if you can find a solution to this problem, I can apply the same solution to the more complex script. Note that I am using 20 cores; you will of course need to choose a reasonable number of cores for your own machine. Below is the code:
library("parallel")
betavalues<-seq(from=50,to=150,length.out=3)
thetavalues<-seq(from=200,to=300,length.out=3)
deltavalues<-seq(from=20,to=140,length.out=3)
outputbind<-c()
iterations<-5
examplefunction <- function(i=NULL){
  for (j in betavalues){
    for (k in thetavalues){
      for (l in deltavalues){
        output<-data.frame(beta=j,theta=k,delta=l)
        outputbind<-rbind(outputbind,output)
      }
    }
  }
  data<-data.frame(beta=outputbind$beta,theta=outputbind$theta,delta=outputbind$delta)
}
cl <- makeCluster(mc <- getOption("cl.cores", 20))
clusterExport(cl=cl, varlist=ls())
par_results <- parLapply(1:iterations, fun=examplefunction, cl=cl)
clusterEvalQ(cl,examplefunction)
stopCluster(cl)
data <- do.call("rbind", par_results)
To clarify, I wish to add an additional column to data that will state the simulation number.
This problem has been bugging me for weeks, and a solution would be immensely appreciated!
Edit: Adding a sequence of numbers based on the length of data after the parallel processing is not a sufficient solution, as the length of each simulation will vary in the more complicated script. Therefore, the counter needs to be added within, or prior to, the creation of the object data.
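One possible way to get the counter in, sketched against the simplified example above: parLapply already hands the iteration index i to the function, so the function can write it into a column before the results are combined (all names come from the question; this is only a sketch):
library(parallel)

examplefunction <- function(i=NULL){
  outputbind <- c()
  for (j in betavalues){
    for (k in thetavalues){
      for (l in deltavalues){
        outputbind <- rbind(outputbind, data.frame(beta=j, theta=k, delta=l))
      }
    }
  }
  outputbind$simulation <- i          # tag every row with the simulation number
  outputbind
}

cl <- makeCluster(getOption("cl.cores", 20))
clusterExport(cl, varlist=c("betavalues","thetavalues","deltavalues"))
par_results <- parLapply(cl, 1:iterations, examplefunction)
stopCluster(cl)
data <- do.call("rbind", par_results)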
I have some code that requires several simulations, and I am hoping to run it across separate computers. Each simulation requires identifying a random subset of the data and then running the analyses on it. When I try to run this on separate computers at the same time, I notice that the same rows are selected for each simulation. So if I am running 3 simulations, each simulation will identify the same 'random' samples across the separate computers. I am not sure why this is; can anyone suggest any code to get around it?
I show the sample_n function from dplyr below, but the same thing happened using the sample() function in base R. Thanks in advance.
library(dplyr)
explanatory <- c(1,2,3,4,3,2,4,5,6,7,8,5,4,3)
response <- c(3,4,5,4,5,6,4,6,7,8,6,10,11,9)
A <- data.frame(explanatory,response)
B <- data.frame(explanatory,response)
C <- data.frame(explanatory,response)
for(i in 1:3)
{
Rand_A = sample_n(A,8)
Rand_B = sample_n(B,8)
Rand_C = sample_n(C,8)
Rand_All = rbind(Rand_A, Rand_B,Rand_C)
}
You can set the seed for each computer separately, as brb suggests above. You could also have this happen automatically by setting the seed from the computer's IP address, which would eliminate the need to edit your script for each computer. One implementation uses the ipify package:
library(devtools)
install_github("gregce/ipify")
library(ipify)
set.seed(as.numeric(gsub("[.]","",get_ip())) %% .Machine$integer.max)  # modulo keeps the seed within integer range
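If you would rather not rely on an extra package, a machine-specific seed can also be passed in when the script is launched; a small sketch (the script name and argument position are just illustrative):
# run as, e.g.:  Rscript my_simulation.R 1   (a different number on each computer)
args <- commandArgs(trailingOnly = TRUE)
set.seed(as.integer(args[1]))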
I think my problem is easy; I just could not find a solution for it. I am trying to do bootstrapping in R, and the results have been successful. However, I want to automate my code so it runs for 10000, 15000, etc. simulations without changing these variables every time.
My code is:
mydata # a time series object
port<-as.xts(mydata, order.by = as.Date(dates, "%d-%b-%Y"))
# created a function for bootstrapping
sim<-function(nsim,series,size){
result<-replicate(nsim, Return.cumulative(sample(series,size,replace=F), geometric=TRUE))
return(result)
}
output<-sim(10000,port,12) # running the function
mean(output) # finding the mean of the bootstrap output
So instead of changing nsim (=10000) or size (=12) every time I want to run the function, is there a way to loop the function so it runs for, let's say, 10000, 150000, 200000 simulations?
Thanks for your help!
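One way to do that, sketched with the question's own sim() and port objects, is to put the simulation counts in a vector and loop over it, e.g. with sapply (the counts are just the ones mentioned above):
nsims <- c(10000, 150000, 200000)                      # simulation counts to run
boot_means <- sapply(nsims, function(n) mean(sim(n, port, 12)))
names(boot_means) <- nsims
boot_means                                             # mean of the bootstrap output for each nsim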
I am curious if there is a way in R to run a function either forever or for some long time, which you can later terminate without losing the results.
For example, say I would like to fit lots of linear models to some randomly generated data like so
dat <- list()
for (i in 1:1e9){   # effectively runs "forever"
  x <- 1:10
  y <- cumsum(runif(10))
  dat[[i]] <- lm(y~x)
}
I would like to leave my computer for a long time and when I return, I will stop the function. I would like to keep all of the models that have been built.
Basically I want it to do as much as it can before I get back and then not lose its progress when I go to stop it.
Does anyone know a way of accomplishing this in R?
You run the loop and then hit the stop button when you get back; the models already stored in dat are kept.
I have R code that I need to get to a "parallelization" stage. I'm new at this, so please forgive me if I use the wrong terms. I have a process that has to chug through individuals one at a time and then average across individuals at the end. The process is exactly the same for each individual (it's a Brownian bridge); I just have to do this for >300 individuals. So I was hoping someone here might know how to change my code so that it can be spawned, or parallelized, or whatever the word is, so that the 48 CPUs I now have access to can help reduce the 58 days it will take to compute this on my little laptop. In my head I would just send one individual out to one processor, have it run through the script, and then send out another one, if that makes sense.
Below is my code. I have tried to comment in it and have indicated where I think the code needs to be changed.
for (n in 1:(length(IDNames))){ #THIS PROCESSES THROUGH EACH INDIVIDUAL
#THIS FIRST PART IS JUST EXTRACTING THE DATA FROM MY TWO INPUT FILES.
#I HAVE ONE FILE WITH ALL THE LOCATIONS AND THEN ANOTHER FILE WITH A DATE RANGE.
#EACH INDIVIDUAL HAS DIFFERENT DATE RANGES, THUS IT HAS TO PULL OUT EACH INDIVIDUALS
#DATA SET SEPARATELY AND THEN RUN THE FUNCTION ON IT.
IndivData = MovData[MovData$ID==IDNames[n],]
IndivData = IndivData[1:(nrow(IndivData)-1),]
if (UseTimeWindow==T){
IndivDates = dates[dates$ID==IDNames[n],]
IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]&IndivData$DateTime<IndivDates$End[1],]
}
IndivData$TimeDif[nrow(IndivData)]=NA
########################
#THIS IS THE PROCESS WHERE I THINK I NEED THAT HAS TO HAVE EACH INDIVIDUAL RUN THROUGH IT
BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
area.grid = Grid, time.step = 0.1)
#############################
# BELOW IS JUST CODE TO BIND THE RESULTS INTO A GRID DATA FRAME I ALREADY CREATED.
#I DO NOT UNDERSTAND HOW THE MULTICORE PROCESSED CODE WOULD JOIN THE DATA BACK
#WHICH IS WHY IVE INCLUDED THIS PART OF THE CODE.
if(n==1){ #creating a data frame with the x, y, and probabilities for the first individual
BBMMProbGrid = as.data.frame(1:length(BBMM[[2]]))
BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[2]],BBMM[[3]],BBMM[[4]])
colnames(BBMMProbGrid)=c("GrdId","X","Y",paste(IDNames[n],"_Prob", sep=""))
} else { #For every other individual just add the new information to the dataframe
BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[4]])
colnames(BBMMProbGrid)[n*2+2]=paste(IDNames[n],"_Prob", sep ="")
}# end if
} #end loop through individuals
Not sure why this has been voted down either. I think the foreach package is what you're after. Those first few PDFs have very clear, useful information in them. Basically, write what you want done for each person as a function. Then use foreach to send the data for one person out to a node to run the function (while sending another person's to another node, etc.), and it compiles all the results using something like rbind. I've used this a few times with great results.
Edit: I didn't look to rework your code, as I figure that, having got this far, you'll easily have the skills to wrap it into a function and then use the one-liner foreach.
Edit 2: This was too long for a comment to reply to you.
I thought that since you had got that far with the code, you would be able to get it into a function :) If you're still working on this, it might help to think of writing a for loop that loops over your subjects and does the calculations required for each subject. The body of that for loop is what you want in your function. I think in your code that is everything down to area.grid. Then you can get rid of most of your [n]'s, since the data is only subset once per iteration.
Perhaps:
pernode <- function(i) {
  IndivData = MovData[MovData$ID==IDNames[i],]
  IndivData = IndivData[1:(nrow(IndivData)-1),]
  if (UseTimeWindow==T){
    IndivDates = dates[dates$ID==IDNames[i],]
    IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]
                          &IndivData$DateTime<IndivDates$End[1],]
  }
  IndivData$TimeDif[nrow(IndivData)]=NA
  BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
    time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
    area.grid = Grid, time.step = 0.1)
  return(BBMM)
}
Then something like:
library(doMC)
library(foreach)
registerDoMC(cores=48) # or perhaps a few fewer than all you have
system.time(
  output <- foreach(i = 1:length(IDNames), .combine = "rbind", .multicombine = TRUE,
                    .inorder = FALSE) %dopar% { pernode(i) }
)
Hard to say whether that is it without some test data; let me know how you get on.
This is a general example, since I didn't have the patience to read through all of your code. One of the quickest ways to spread this across multiple processors would be to use the multicore library and mclapply (a parallelised version of lapply) to push a list through a function; the individual items on the list would be the data frames for each of the 300+ individuals in your case.
Example:
library(multicore)
result=mclapply(data_list, your_function,mc.preschedule=FALSE, mc.set.seed=FALSE)
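For the data in the question, such a list could be built by splitting the location data by individual; a sketch along those lines (the same call also works with parallel::mclapply in current R, and your_function stands for the per-individual work, like pernode above):
library(multicore)

# one data frame per individual; mclapply then processes each element on its own core
data_list <- split(MovData, MovData$ID)
result <- mclapply(data_list, your_function, mc.preschedule=FALSE, mc.set.seed=FALSE)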
As I understand your description, you have access to a distributed computer cluster, so the multicore package will not work. You have to use Rmpi, snow, or foreach. Based on your existing loop structure, I would advise using the foreach and doSNOW packages. It also looks like you have a lot of data, so you should probably check whether you can reduce what is sent to the nodes to only the data they actually need.
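A rough sketch of that setup, assuming pernode() from the earlier answer, placeholder node names, and that brownian.bridge() comes from the BBMM package (everything referenced inside the loop has to be exported to the workers explicitly):
library(foreach)
library(doSNOW)

# placeholder hostnames: replace with the machines in your cluster
cl <- snow::makeCluster(c("node1", "node2", "node3"), type = "SOCK")
registerDoSNOW(cl)

# export only what pernode() actually needs, to limit what is shipped to the nodes
snow::clusterExport(cl, c("pernode", "MovData", "IDNames", "dates",
                          "UseTimeWindow", "Grid"))

# one BBMM result per individual; .packages loads BBMM (brownian.bridge) on each worker
results <- foreach(i = seq_along(IDNames), .packages = "BBMM") %dopar% pernode(i)

stopCluster(cl)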