I think my problem is easy, but I could not find a solution for it. I am doing bootstrapping in R and the results have been fine. However, I want to automate my code so that it runs for 10000, 15000, etc. replications without changing these variables by hand every time.
My code is:
mydata #is a time series data
port<-as.xts(mydata, order.by = as.Date(dates, "%d-%b-%Y"))
# created a function for bootstrapping
sim<-function(nsim,series,size){
result<-replicate(nsim, Return.cumulative(sample(series,size,replace=F), geometric=TRUE))
return(result)
}
output<-sim(10000,port,12) # running the function
mean(output) # finding the mean of the bootstrap output
So instead of changing nsim (currently 10000) or size (currently 12) every time I want to run the function, is there a way to loop the function so that it runs for, let's say, 10000, 150000, 200000?
Thanks for your help!
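One way to do this is to put the replication counts in a vector and apply the function to each of them. Below is a minimal sketch under stated assumptions: cum_return is a hypothetical base R stand-in for Return.cumulative(..., geometric=TRUE), the returns vector is simulated rather than the real port series, and the counts are kept small so it runs quickly.

```r
# Hypothetical stand-in for Return.cumulative(x, geometric = TRUE):
# the geometric cumulative return of a vector of simple returns.
cum_return <- function(r) prod(1 + r) - 1

# Same shape as the sim() function in the question.
sim <- function(nsim, series, size) {
  replicate(nsim, cum_return(sample(series, size, replace = FALSE)))
}

set.seed(42)
returns <- rnorm(120, mean = 0.005, sd = 0.04)  # simulated data, not the real series

# All the replication counts to try, in one place.
nsims <- c(1000, 2000, 5000)

# Run sim() once per count and keep the mean of each run.
means <- sapply(nsims, function(n) mean(sim(n, returns, 12)))
names(means) <- nsims
print(means)
```

To vary size as well, the same idea should work with mapply() over two vectors, or with a nested loop over expand.grid(nsim = ..., size = ...).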
I would like to do bootstrapping with the boot library. Since calculating the statistic for each sample is a lengthy process, the entire bootstrap calculation is going to take several days. Since the computer I am using disconnects every few hours, I would like to use some checkpoint mechanism so that I do not have to start from scratch every time. Currently, I am running:
results <- boot(data=data, statistic=my_slow_function, R=10000, parallel='snow', ncpus=4, cl=cl)
but I would rather run it with R=100 multiple times, so that I can save the intermediate results and retrieve them if the connection hangs up. How can I achieve that?
Thank you in advance
Maybe you can combine results for the bootstrap replicates:
#simulating R=10000
results_list <- lapply(1:100, function(x) {
return(boot(data=data, statistic=my_slow_function, R=100, parallel='snow', ncpus=4)$t)
})
results_t <- unlist(results_list)
hist(results_t)
t0 = mean(results_t)
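To get the checkpointing itself, each batch can be written to disk as soon as it finishes, and batches whose file already exists can be skipped on restart. This is only a sketch under assumptions: run_batches, the file naming and the per-batch seeds are made up for illustration, my_slow_function here is a trivial stand-in, and the parallel arguments are left out to keep it short.

```r
library(boot)

# Trivial stand-in for the slow statistic; boot() calls it as statistic(data, indices).
my_slow_function <- function(d, idx) mean(d[idx])

# Hypothetical helper: run n_batches small bootstraps, saving each one to its own file.
run_batches <- function(data, statistic, n_batches, R_per_batch, dir) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  for (b in seq_len(n_batches)) {
    f <- file.path(dir, sprintf("batch_%04d.rds", b))
    if (file.exists(f)) next   # finished in an earlier session, so skip it
    set.seed(1000 + b)         # a different seed per batch, so batches are not identical
    res <- boot(data = data, statistic = statistic, R = R_per_batch)
    saveRDS(res$t, f)          # checkpoint: only the replicates are kept
  }
  # Collect everything that is on disk into one matrix of replicates.
  files <- list.files(dir, pattern = "^batch_[0-9]+\\.rds$", full.names = TRUE)
  do.call(rbind, lapply(files, readRDS))
}

set.seed(1)
x <- rnorm(50)
ckpt_dir <- file.path(tempdir(), "boot_ckpt")
all_t <- run_batches(x, my_slow_function, n_batches = 5, R_per_batch = 100, dir = ckpt_dir)
dim(all_t)
```

If the session dies, calling run_batches() again with the same dir should only compute the batches that are still missing. In real use dir would need to be a permanent folder rather than tempdir().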
I have performed a multiple comparison test in R and when I print the results, I can only see the results of 1000 rows. However, I need to see the whole outcome. My test is time dependent and I have 100 time points of 6 different measurements. You can find an example screenshot of the output from my console, which only shows the result from time point 45 to 100 but not the time points before.
Tukey<-emmeans(Data, list(pairwise ~ Types| Time), adjust = "tukey")
Tukey
I have already tried options(max.print=100000) but didn't change anything.
I am looking forward to your response.
EDIT: At Mr. Lahouir's request, I checked whether the print option was set properly, and it seems to be.
getOption('max.print')
[1] 100000
Best Regards,
summ <- summary(Tukey[[2]])
You now have essentially a data frame, so just pull out a piece at a time.
summ[1:500, ]
summ[501:1000, ]
Etc.
Or save it with, say, write.csv
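For reference, 6 measurements give 15 pairwise contrasts, so 100 time points produce 1500 rows in the summary. A rough sketch of both suggestions follows, using a made-up data frame in place of summary(Tukey[[2]]); the column names, chunk size and file name are assumptions for illustration only.

```r
# Made-up stand-in for summary(Tukey[[2]]): 15 contrasts x 100 time points = 1500 rows.
set.seed(1)
summ <- data.frame(
  Time     = rep(1:100, each = 15),
  contrast = rep(paste0("c", 1:15), times = 100),
  estimate = rnorm(1500)
)

# Split the rows into blocks of 500 and look at one block at a time.
chunks <- split(summ, ceiling(seq_len(nrow(summ)) / 500))
for (ch in chunks) {
  print(head(ch, 3))   # print(ch) would show the whole block
}

# Or write everything to a file and open it elsewhere.
out_file <- file.path(tempdir(), "tukey_summary.csv")
write.csv(summ, out_file, row.names = FALSE)
```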
I am curious whether there is a way in R to run some function either forever or for some long time, which you can later terminate without losing the results.
For example, say I would like to fit lots of linear models to some randomly generated data like so
dat <-list()
for (i in 1:1e99){
x <- 1:10
y <- cumsum(runif(10))
dat[[i]] <- lm(y~x)
}
I would like to leave my computer for a long time and when I return, I will stop the function. I would like to keep all of the models that have been built.
Basically I want it to do as much as it can before I get back and then not lose its progress when I go to stop it.
Does anyone know a way of accomplishing this in R?
You run the loop then hit the stop button when you get back.
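A small sketch of that, with two caveats. First, as far as I know 1:1e99 itself fails because R cannot build a sequence that long, so an open-ended loop with its own counter is used here instead. Second, the deadline is only there so that the example stops by itself; for the real thing you could use repeat and interrupt it by hand.

```r
dat <- list()
i <- 0

# Short time limit so this example terminates on its own.
deadline <- Sys.time() + 0.2

while (Sys.time() < deadline) {
  i <- i + 1
  x <- 1:10
  y <- cumsum(runif(10))
  dat[[i]] <- lm(y ~ x)   # stored right away, so finished models are kept
}

length(dat)   # number of models fitted before the loop stopped
```

Because dat is assigned at the top level on every iteration, whatever was fitted before an interrupt is still in dat afterwards. If the loop is wrapped in a function, the local list would be lost on interrupt, so in that case it is probably safer to save to disk every so often with saveRDS().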
I have a dataset of 80 variables, and I want to loop through a subset of 50 of them and construct returns. I have a list of the names of the variables for which I want to construct returns, and am attempting to use the dplyr command mutate to construct the variables in a loop. Specifically, my code is:
for (i in returnvars) {
alldta <- mutate(alldta,paste("r",i,sep="") = (i - lag(i,1))/lag(i,1))}
where returnvars is my list, and alldta is my dataset. When I run this code outside the loop with just one of the `i' values, it works fine. The code for that looks like this:
alldta <- mutate(alldta,rVar = (Var- lag(Var,1))/lag(Var,1))
However, when I run it in the loop (e.g., attempting to do the previous line of code 50 times for 50 different variables), I get the following error:
Error: unexpected '=' in:
"for (i in returnvars) {
alldta <- mutate(alldta,paste("r",i,sep="") ="
I am unsure why this issue is coming up. I have looked into a number of ways to try and do this, and have attempted solutions that use lapply as well, without success.
Any help would be much appreciated! If there is an easy way to do this with one of the apply commands as well, that would be great. I did not provide a dataset because my question is not data specific, I'm simply trying to understand, as a relative R beginner, how to construct many transformed variables at once and add them to my data frame.
EDIT: As per Frank's comment, I updated the code to the following:
for (i in returnvars) {
varname <- paste("r",i,sep="")
alldta <- mutate(alldta,varname = (i - lag(i,1))/lag(i,1))}
This fixes the previous error, but I am still not referencing the variable correctly, so I get the error
Error in "Var" - lag("Var", 1) :
non-numeric argument to binary operator
Which I assume is because R sees my variable name Var as a string, rather than as a variable. How would I correctly reference the variable in my dataset alldta? I tried get(i) and alldta$get(i), both without success.
I'm also still open to (and actively curious about), more R-style ways to do this entire process, as opposed to using a loop.
Using mutate inside a loop might not be a good idea either. I am not sure whether mutate makes a copy of the data frame, but it's generally not good practice to grow a data frame inside a loop. Instead, create a separate data frame with the output and then name the columns based on your logic.
result = as.data.frame(do.call(cbind,lapply(returnvars,function(i) {...})))
names(result) = paste("r",returnvars,sep="")
After playing around with this more, I discovered (thanks to Frank's suggestion), that the following works:
extended <- alldta # Make a copy of my dataset
for (i in returnvars) {
varname <- paste("r",i,sep="")
extended[[varname]] = (extended[[i]] - lag(extended[[i]],1))/lag(extended[[i]],1)}
This is still not very R-styled in that I am using a loop, but for a task that is only repeating about 50 times, this shouldn't be a large issue.
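For a loop-free version, one option is to lapply over the selected columns and bind the result back on. The sketch below uses a tiny made-up alldta and a hypothetical helper simple_return written in base R, so it does not depend on which lag() is in scope.

```r
# Tiny made-up dataset; C is a column we do not want returns for.
alldta <- data.frame(A = c(100, 110, 121),
                     B = c(50, 55, 44),
                     C = c("x", "y", "z"))
returnvars <- c("A", "B")

# Hypothetical helper: simple return (x_t - x_{t-1}) / x_{t-1}, NA for the first row.
simple_return <- function(x) c(NA, diff(x) / head(x, -1))

# Compute all return columns at once, then prefix their names with "r".
rets <- lapply(alldta[returnvars], simple_return)
names(rets) <- paste0("r", returnvars)

extended <- cbind(alldta, as.data.frame(rets))
print(extended)
```

Selecting columns with alldta[returnvars] is also what gets around the original problem: the names stay character strings and are only ever used for indexing, never as bare variable names.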
I have R code that I need to get to a "parallelization" stage. I'm new at this, so please forgive me if I use the wrong terms. I have a process that just has to chug through individual by individual, one at a time, and then average across individuals at the end. The process is exactly the same for each individual (it's a Brownian Bridge); I just have to do this for >300 individuals. So I was hoping someone here might know how to change my code so that it can be spawned, or parallelized, or whatever the word is, to make sure that the 48 CPUs I now have access to can help reduce the 58 days it would take to compute this on my little laptop. In my head I would just send one individual to one processor, have it run through the script, and then send another one, if that makes sense.
Below is my code. I have tried to comment in it and have indicated where I think the code needs to be changed.
for (n in 1:(length(IDNames))){ #THIS PROCESSES THROUGH EACH INDIVIDUAL
#THIS FIRST PART IS JUST EXTRACTING THE DATA FROM MY TWO INPUT FILES.
#I HAVE ONE FILE WITH ALL THE LOCATIONS AND THEN ANOTHER FILE WITH A DATE RANGE.
#EACH INDIVIDUAL HAS DIFFERENT DATE RANGES, THUS IT HAS TO PULL OUT EACH INDIVIDUALS
#DATA SET SEPARATELY AND THEN RUN THE FUNCTION ON IT.
IndivData = MovData[MovData$ID==IDNames[n],]
IndivData = IndivData[1:(nrow(IndivData)-1),]
if (UseTimeWindow==T){
IndivDates = dates[dates$ID==IDNames[n],]
IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]&IndivData$DateTime<IndivDates$End[1],]
}
IndivData$TimeDif[nrow(IndivData)]=NA
########################
#THIS IS THE PROCESS THAT I THINK EACH INDIVIDUAL HAS TO BE RUN THROUGH
BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
area.grid = Grid, time.step = 0.1)
#############################
# BELOW IS JUST CODE TO BIND THE RESULTS INTO A GRID DATA FRAME I ALREADY CREATED.
#I DO NOT UNDERSTAND HOW THE MULTICORE PROCESSED CODE WOULD JOIN THE DATA BACK
#WHICH IS WHY IVE INCLUDED THIS PART OF THE CODE.
if(n==1){ #creating a data frame with the x, y, and probabilities for the first individual
BBMMProbGrid = as.data.frame(1:length(BBMM[[2]]))
BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[2]],BBMM[[3]],BBMM[[4]])
colnames(BBMMProbGrid)=c("GrdId","X","Y",paste(IDNames[n],"_Prob", sep=""))
} else { #For every other individual just add the new information to the dataframe
BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[4]])
colnames(BBMMProbGrid)[n*2+2]=paste(IDNames[n],"_Prob", sep ="")
}# end if
} #end loop through individuals
Not sure why this has been voted down either. I think the foreach package is what you're after. Those first few pdfs have very clear, useful information in them. Basically, write what you want done for each person as a function. Then use foreach to send the data for one person out to a node to run the function (while sending another person's data to another node, etc.), and it then compiles all the results using something like rbind. I've used this a few times with great results.
Edit: I didn't look to rework your code as I figure given you've got that far you'll easily have the skills to wrap it into a function and then use the one liner foreach.
Edit 2: This was too long for a comment to reply to you.
I thought since you had got that far with the code that you would be able to get it into a function :) If you're still working on this, it might help to think of writing a for loop to loop over your subjects and do the calculations required for that subject. Then, that for loop is what you want in your function. I think in your code that is everything down to 'area.grid'. Then you can get rid of most of your [n]'s since the data is only subset once per iteration.
Perhaps:
pernode <- function(i) { # i is the index of one individual in IDNames
IndivData = MovData[MovData$ID==IDNames[i],]
IndivData = IndivData[1:(nrow(IndivData)-1),]
if (UseTimeWindow==T){
IndivDates = dates[dates$ID==IDNames[i],]
IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]
&IndivData$DateTime<IndivDates$End[1],]
}
IndivData$TimeDif[nrow(IndivData)]=NA
BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
area.grid = Grid, time.step = 0.1)
return(BBMM)
}
Then something like:
library(doMC)
library(foreach)
registerDoMC(cores=48) # or perhaps a few less than all you have
system.time(
output <- foreach(i = 1:length(IDNames), .combine = "rbind", .multicombine=T,
.inorder = FALSE) %dopar% {pernode(i)}
)
Hard to say whether that is it without some test data, let me know how you get on.
This is a general example since I didn't have the patience to read through all of your code. One of the quickest ways to spread this across multiple processors would be to use the multicore library and the mclapply (a parallelized version of lapply) to push a list (individual items on the list would be dataframes for each of the 300+ individuals in your case) through a function.
Example:
library(multicore)
result=mclapply(data_list, your_function,mc.preschedule=FALSE, mc.set.seed=FALSE)
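In current versions of R, mclapply is also available from the parallel package that ships with R, so a self-contained version of the same idea might look like the sketch below. The data and function are made-up stand-ins (mtcars split by cyl instead of one data frame per individual), and the core count is an assumption; mclapply relies on forking, so on Windows it only runs with one core.

```r
library(parallel)

# Made-up stand-in: one data frame per "individual".
data_list <- split(mtcars, mtcars$cyl)

# Made-up stand-in for the per-individual computation.
your_function <- function(df) mean(df$mpg)

# Forking is not available on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

result <- mclapply(data_list, your_function,
                   mc.cores = n_cores, mc.preschedule = FALSE)
str(result)
```

The result is a list with one element per input data frame, in the same order as data_list, so it can be combined afterwards with do.call(rbind, result) or similar.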
As I understand your description, you have access to a distributed computer cluster, so the multicore package will not work. You have to use Rmpi, snow or foreach. Based on your existing loop structure, I would advise using the foreach and doSNOW packages. But your code looks as though you have a lot of data. You should probably check whether you can reduce the data that is sent to the nodes to only what is required.
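To illustrate the point about sending only the required data, here is a rough sketch using a snow-style cluster from the parallel package that ships with R, rather than foreach/doSNOW. Everything in it is a made-up stand-in: the MovData columns, the two local workers, and colMeans in place of the real per-individual computation.

```r
library(parallel)

# Made-up stand-in data: three individuals with four locations each.
MovData <- data.frame(ID = rep(c("a", "b", "c"), each = 4),
                      x  = 1:12,
                      y  = (1:12)^2)

# Split once on the master and keep only the columns the workers need,
# so each worker receives just one individual's slice.
per_individual <- split(MovData[c("x", "y")], MovData$ID)

cl <- makeCluster(2)   # two local workers; a real cluster would list its hosts here
res <- parLapply(cl, per_individual, function(d) colMeans(d))
stopCluster(cl)

# Join the per-individual results back into one table, one row per individual.
out <- do.call(rbind, res)
print(out)
```

parLapply returns the results in the same order as the input list, which is what makes the final rbind line up with the individual IDs.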