I am curious whether there is a way in R to run a function either forever or for some long time, which you can later terminate without losing the results.
For example, say I would like to fit lots of linear models to some randomly generated data, like so:
dat <- list()
for (i in 1:1e9) {  # 1:1e99 errors ("result would be too long a vector"); 1e9 is effectively forever here
  x <- 1:10
  y <- cumsum(runif(10))
  dat[[i]] <- lm(y ~ x)
}
I would like to leave my computer for a long time, and when I return I will stop the function, keeping all of the models that have been built. Basically, I want it to do as much as it can before I get back, and then not lose its progress when I stop it.
Does anyone know a way of accomplishing this in R?
You run the loop, then hit the stop button when you get back. Because the loop assigns into the global environment, everything stored in dat up to the interruption is kept.
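If you also want insurance against crashes, not just manual interruption, you can checkpoint to disk every so often. A minimal sketch of the same loop; the file name models.rds is my own choice:

dat <- list()
for (i in 1:1e9) {
  x <- 1:10
  y <- cumsum(runif(10))
  dat[[i]] <- lm(y ~ x)
  if (i %% 1000 == 0) saveRDS(dat, "models.rds")  # periodic checkpoint; restore later with readRDS()
}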
I'm trying to figure out how to set up a for loop in R when I want it to run over two or more parameters at once. Below I have posted sample code that runs and fills a matrix with two values per row. In the 2nd line of the for loop I have
R <- ARMA.var(length(x_global_sample), ar = c(tt[i], -.7))
And what I would like to do is replace the -.7 with another tt[i] (example below), so that my for loop would run through the value pairs starting at (-1, -1), then (-1, -.99), (-1, -.98), ..., (1, .98), (1, .99), (1, 1), where the result matrix would then be populated by the output of Q and sigma.
R <- ARMA.var(length(x_global_sample), ar = c(tt[i], tt[i]))
or something similar to
R <- ARMA.var(length(x_global_sample), ar = c(tt[i], ss[i]))
It may well be that this would be better handled by two for loops; however, I'm not 100% sure how I would set that up so that the first parameter stays fixed while the code runs through the sequence of the second parameter, and once that finishes, the first parameter moves to its next value and stays fixed there while the second parameter runs through again.
I've posted some sample code down below; the ARMA.var function comes from the ts.extend package. Any insight into this would be great.
Thank you
tt <- seq(-1, 1, 0.01)
Result <- matrix(NA, nrow = length(tt) * length(tt), ncol = 2)

for (i in seq_along(tt)) {
  R <- ARMA.var(length(x_global_sample), ar = c(tt[i], -.7))
  Q <- t(y - X %*% beta_est_d) %*% solve(R) %*% (y - X %*% beta_est_d) +
    lam * t(beta_est_d) %*% D %*% beta_est_d
  RSS <- sum((y - X %*% solve(t(X) %*% solve(R) %*% X + lam * D) %*% t(X) %*% solve(R) %*% y)^2)
  Denom <- n - sum(diag(X %*% solve(t(X) %*% solve(R) %*% X + lam * D) %*% t(X) %*% solve(R)))
  sigma <- RSS / Denom
  Result[i, 1] <- Q
  Result[i, 2] <- sigma
  rm(Q, R, sigma)
}
Edit: I realize that what I have posted above is quite unclear, so to simplify things, consider the following code:
x <- seq(1, 20, 1)
y <- seq(1, 20, 2)
Result <- matrix(NA, nrow = length(x) * length(y), ncol = 2)

for (i in seq_along(x)) {
  z1 <- x[i] + y[i]
  z2 <- z1 + y[i]
  Result[i, 1] <- z1
  Result[i, 2] <- z2
}
So the results table would appear with the following rows:
Row1: 1+1=2, 2+1=3
Row2: 1+3=4, 4+3=7
Row3: 1+5=6, 6+5=11
Row4: 1+7=8, 8+7=15
And this pattern would continue, with x staying fixed until the last value of y is reached; then x would move to 2 and cycle through the calculations over y again, all the way down to the last row:
RowN: 20+19=39, 39+19=58.
So I just want to know whether there is a way to do it in one loop, or whether it is easier to run it as two loops.
I hope this makes it clearer what my question was asking. I realize this is not the optimal way to do this; however, for now it is just for testing purposes, to see how long my initial process takes so that it can be streamlined down the road.
Thank you
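For what it's worth, one common way to cover every (x, y) pair in a single loop is to enumerate the grid first, e.g. with expand.grid. A minimal sketch against the simplified example above (the grid object and its column names are my own):

x <- seq(1, 20, 1)
y <- seq(1, 20, 2)
grid <- expand.grid(y = y, x = x)  # y varies fastest, so x stays fixed within each block
Result <- matrix(NA, nrow = nrow(grid), ncol = 2)
for (i in seq_len(nrow(grid))) {
  z1 <- grid$x[i] + grid$y[i]
  z2 <- z1 + grid$y[i]
  Result[i, 1] <- z1
  Result[i, 2] <- z2
}

The same idea would apply to the original code: put tt[i] and tt[j] in the two columns of a grid and index both from the single loop counter.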
I would like to use bootstrapping via the boot library. Since calculating the statistic from each sample is a lengthy process, the entire bootstrap calculation is going to take several days to complete. Since the computer I am using disconnects every several hours, I would like to use some checkpoint mechanism so that I will not have to start from scratch every time. Currently, I am running:
results <- boot(data=data, statistic=my_slow_function, R=10000, parallel='snow', ncpus=4, cl=cl)
but I would rather run it with R=100 multiple times, so that I can save the intermediate results and retrieve them if the connection hangs up. How can I achieve that?
Thank you in advance
Maybe you can combine results for the bootstrap replicates:
# simulating R=10000 as 100 chunks of R=100
results_list <- lapply(1:100, function(x) {
  return(boot(data=data, statistic=my_slow_function, R=100, parallel='snow', ncpus=4)$t)
})
results_t <- unlist(results_list)
hist(results_t)
t0 = mean(results_t)
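To get actual checkpointing on top of that, you could save each chunk to disk as it finishes and skip chunks that already exist on restart. A sketch along the same lines (the chunk_%d.rds file names are my own):

for (i in 1:100) {
  f <- sprintf("chunk_%d.rds", i)
  if (file.exists(f)) next  # this chunk survived the last disconnect; skip it
  b <- boot(data=data, statistic=my_slow_function, R=100, parallel='snow', ncpus=4)
  saveRDS(b$t, f)
}
results_t <- unlist(lapply(1:100, function(i) readRDS(sprintf("chunk_%d.rds", i))))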
I want to run the following regressions. The problematic variable is EP, a dummy variable, and I must check different cases; z (length = 1000) is the threshold variable. I want to create 1000 different versions of EP from the z variable and save the coefficients. I use a loop within a loop, but the results are completely wrong. The code runs properly and does not raise an error; the code below is what I run. The problem is that there is a huge delay: after two hours it is still running.
I reduced the sample by 99% and again did not get a result, although the code ran without problems.
I do not want anything special: just, for each value of z, to run a different regression and end up storing the estimates. I cannot understand why it takes so long. Any ideas?
for (k in 1:1000){
z<-u[k]
for (i in 1:length(dS)){
if (dS[i]>=z) {
EP[i]=1
} else {
EP[i]=0
}
fitT <- dynlm(dR ~ L(dR,1)+L(EN)+L(EP)+L(ΚΜ,1)
prob[[k]] <- summary(fitT)$coefficients[1, 2]
}
You don't have a closing } for the i-loop; you also don't have a closing ) for dynlm.
Note, you can really replace your i-loop by
EP <- as.integer(dS >= z)
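Putting both fixes together, the outer loop might look like this (a sketch using the question's own variable names; untested, since the data is not shown):

prob <- numeric(1000)
for (k in 1:1000) {
  z <- u[k]
  EP <- as.integer(dS >= z)  # vectorized replacement for the whole inner i-loop
  fitT <- dynlm(dR ~ L(dR, 1) + L(EN) + L(EP) + L(ΚΜ, 1))
  prob[k] <- summary(fitT)$coefficients[1, 2]
}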
Next time when asking a question, be clear and specific. What do you mean by "I use a loop in loop but the results are completely wrong"? Is there an error message, etc.?
I am running some matrix algebra on a large data set. Each iteration of the outermost loop populates one row of two different vectors that are pre-allocated with 64,797 rows. I am printing a counter to the screen for the outer loop to check progress, which might not be ideal. R is still working according to Task Manager, and is using a good bit of memory and processor time. However, the R console is not responding, and I can only read that I have reached at least row 31,000 or so (there is scroll space, but I cannot scroll down to see the last number printed). I do not know whether the program is hung (no longer iterating the outer loop) and I am wasting my time waiting, or whether I should stick it out. The machine has been running for a few days. Given the program's structure, I can end the process and restart from the last row populated. However, if I end the process, will I lose the data previously assigned in the vectors I am populating? That would be bad, as I'd have to start all over. Here is the code below; the end goals are the vectors save.trace and save.trace2.
for (i in 1:nrow(coor.cal)) {
  print(i)
  for (j in 1:nrow(coor.cal)) {
    # finding distances between observations i and j
    dist <- ((coor.cal[i, 1] - coor.cal[j, 1])^2 + (coor.cal[i, 2] - coor.cal[j, 2])^2)^.5
    w[j] <- exp(-0.5 * ((dist / bw)^2))  # computing weight matrix for observation i
    if (dist > bw) { w[j] <- 0 }         # truncate weights beyond the bandwidth
  }
  for (k in 1:27) {
    xv <- xmat[, k]
    xtw[k, ] <- xv * w  # row k of t(X) %*% diag(w)
  }
  xtwx <- xtw %*% xmat
  xtwx.inv <- ginv(xtwx)
  xtwx.inv.xtw <- xtwx.inv %*% xtw
  xrow <- xmat[i, ]
  temp <- xrow %*% xtwx.inv.xtw  # row i of the hat matrix
  save.trace[i] <- temp[i]
  save.trace2[i] <- sum(temp * temp)
}
Here's a better example.
saved <- 0
for(i in 1:100)
{
saved <- i
Sys.sleep(0.1)
}
Run this code, and press escape sometime in the next 10 seconds (before the loop completes).
Take a look at the value of saved. It should be more than 0, indicating that your progress has been stored.
I did not have the memory available to risk an experiment to answer my own question, but I just borrowed another machine, tried it, and indeed you CAN end a process and still retain the previously stored information. I had not run into this problem before. I attempted to delete my question but could not, so I'll leave this in case it helps someone else.
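For completeness, a hedged sketch of the restart-from-last-row idea mentioned in the question, assuming save.trace was pre-allocated with NA (an assumption on my part; the checkpoint file name is also my own):

start <- sum(!is.na(save.trace)) + 1  # first row not yet populated
for (i in start:nrow(coor.cal)) {
  # ... same loop body as above ...
  if (i %% 1000 == 0) saveRDS(list(save.trace, save.trace2), "trace_checkpoint.rds")
}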
I have R code that I need to get to a "parallelization" stage. I'm new at this, so please forgive me if I use the wrong terms. I have a process that just has to chug through individuals one at a time and then average across individuals at the end. The process is exactly the same for each individual (it's a Brownian bridge); I just have to do this for >300 individuals. So I was hoping someone here might know how to change my code so that it can be spawned, or parallelized, or whatever the word is, to make sure that the 48 CPUs I now have access to can help reduce the 58 days it will take to compute this with my little laptop. In my head, I would just send out one individual to one processor, have it run through the script, and then send out another one, if that makes sense.
Below is my code. I have tried to comment it and have indicated where I think the code needs to be changed.
for (n in 1:(length(IDNames))){ #THIS PROCESSES THROUGH EACH INDIVIDUAL
#THIS FIRST PART IS JUST EXTRACTING THE DATA FROM MY TWO INPUT FILES.
#I HAVE ONE FILE WITH ALL THE LOCATIONS AND THEN ANOTHER FILE WITH A DATE RANGE.
#EACH INDIVIDUAL HAS DIFFERENT DATE RANGES, THUS IT HAS TO PULL OUT EACH INDIVIDUALS
#DATA SET SEPARATELY AND THEN RUN THE FUNCTION ON IT.
IndivData = MovData[MovData$ID==IDNames[n],]
IndivData = IndivData[1:(nrow(IndivData)-1),]
if (UseTimeWindow==T){
IndivDates = dates[dates$ID==IDNames[n],]
IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]&IndivData$DateTime<IndivDates$End[1],]
}
IndivData$TimeDif[nrow(IndivData)]=NA
########################
#THIS IS THE PROCESS WHERE I THINK I NEED THAT HAS TO HAVE EACH INDIVIDUAL RUN THROUGH IT
BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
area.grid = Grid, time.step = 0.1)
#############################
# BELOW IS JUST CODE TO BIND THE RESULTS INTO A GRID DATA FRAME I ALREADY CREATED.
#I DO NOT UNDERSTAND HOW THE MULTICORE PROCESSED CODE WOULD JOIN THE DATA BACK
#WHICH IS WHY IVE INCLUDED THIS PART OF THE CODE.
if(n==1){ #creating a data frame with the x, y, and probabilities for the first individual
BBMMProbGrid = as.data.frame(1:length(BBMM[[2]]))
BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[2]],BBMM[[3]],BBMM[[4]])
colnames(BBMMProbGrid)=c("GrdId","X","Y",paste(IDNames[n],"_Prob", sep=""))
} else { #For every other individual just add the new information to the dataframe
BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[4]])
colnames(BBMMProbGrid)[n*2+2]=paste(IDNames[n],"_Prob", sep ="")
}# end if
} #end loop through individuals
Not sure why this has been voted down either. I think the foreach package is what you're after. The first few PDFs in its documentation have very clear, useful information. Basically, write what you want done for each person as a function, then use foreach to send the data for one person out to a node to run the function (while sending another person's data to another node, etc.); it then compiles all the results using something like rbind. I've used this a few times with great results.
Edit: I didn't look to rework your code, as I figured that, given you've got that far, you'll easily have the skills to wrap it into a function and then use the one-liner foreach.
Edit 2: This was too long for a comment to reply to you.
I thought that since you had got that far with the code, you would be able to get it into a function :) If you're still working on this, it might help to think of writing a for loop that loops over your subjects and does the calculations required for each subject. That loop body is then what you want in your function; I think in your code that is everything down to area.grid. You can also get rid of most of your [n]'s, since the data is only subset once per iteration.
Perhaps:
pernode <- function(i) {  # i indexes one individual; MovData, IDNames, dates, Grid come from the workspace
  IndivData = MovData[MovData$ID == IDNames[i], ]
  IndivData = IndivData[1:(nrow(IndivData) - 1), ]
  if (UseTimeWindow == T) {
    IndivDates = dates[dates$ID == IDNames[i], ]
    IndivData = IndivData[IndivData$DateTime > IndivDates$Start[1]
                          & IndivData$DateTime < IndivDates$End[1], ]
  }
  IndivData$TimeDif[nrow(IndivData)] = NA
  BBMM <- brownian.bridge(x = IndivData$x, y = IndivData$y,
                          time.lag = IndivData$TimeDif[1:(nrow(IndivData) - 1)], location.error = 20,
                          area.grid = Grid, time.step = 0.1)
  return(BBMM)
}
Then something like:
library(doMC)
library(foreach)
registerDoMC(cores = 48)  # or perhaps a few less than all you have
system.time(
  output <- foreach(i = 1:length(IDNames), .combine = "rbind", .multicombine = TRUE,
                    .inorder = FALSE) %dopar% { pernode(i) }
)
Hard to say whether that is it without some test data, let me know how you get on.
This is a general example, since I didn't have the patience to read through all of your code. One of the quickest ways to spread this across multiple processors would be to use the multicore library and mclapply (a parallelized version of lapply) to push a list through a function; here, the individual items on the list would be data frames for each of the 300+ individuals in your case.
Example:
library(multicore)
result <- mclapply(data_list, your_function, mc.preschedule = FALSE, mc.set.seed = FALSE)
As I understand your description, you have access to a distributed computer cluster, so the multicore package will not work; you have to use Rmpi, snow, or foreach. Based on your existing loop structure, I would advise using the foreach and doSNOW packages. But it looks like you have a lot of data, so you should probably check whether you can reduce what gets sent to the nodes to only what is required.
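A minimal skeleton of that setup, reusing the pernode() function from the answer above (the cluster size, the explicit exports, and the assumption that brownian.bridge comes from the BBMM package are mine):

library(foreach)
library(doSNOW)  # loads snow as well

cl <- snow::makeCluster(48, type = "SOCK")
registerDoSNOW(cl)

# each worker runs pernode() for one individual; results come back as a list
output <- foreach(i = seq_along(IDNames),
                  .export = c("pernode", "MovData", "IDNames", "dates", "UseTimeWindow", "Grid"),
                  .packages = "BBMM") %dopar% {
  pernode(i)
}

snow::stopCluster(cl)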