Stopping computation in R; will I lose results up to that point? - r

I am running some matrix algebra on a large data set. Each iteration of the outermost loop populates one row of two different vectors, each pre-allocated with 64,797 rows. I am printing a counter to the screen for the outer loop to check progress, which might not be ideal. According to Task Manager, R is still working and is using a good bit of memory and processor time. However, the R console is not responding, and I can only tell from the end of the output that I am at least to row 31,000 or so (there is scroll space, but I cannot scroll down to see the last number printed). I do not know whether the program is hung (no longer iterating the outer loop) and I am wasting my time waiting, or whether I should stick it out. The machine has been running for a few days.

Given the program's structure, I can end the process and restart from the last row populated. However, if I end the process, will I lose the data already assigned to the vectors I am populating? That would be bad, as I'd have to start all over. Here is the code; the end goal is the two vectors save.trace and save.trace2.
for (i in 1:nrow(coor.cal)) {
  print(i)
  for (j in 1:nrow(coor.cal)) {
    # distance between observations i and j
    dist <- ((coor.cal[i, 1] - coor.cal[j, 1])^2 +
             (coor.cal[i, 2] - coor.cal[j, 2])^2)^0.5
    # weight vector for observation i
    w[j] <- exp(-0.5 * ((dist / bw)^2))
    if (dist > bw) w[j] <- 0
  }
  for (k in 1:27) {
    xv <- xmat[, k]
    xtw[k, ] <- xv * w
  }
  xtwx <- xtw %*% xmat
  xtwx.inv <- ginv(xtwx)
  xtwx.inv.xtw <- xtwx.inv %*% xtw
  xrow <- xmat[i, ]
  temp <- xrow %*% xtwx.inv.xtw
  save.trace[i] <- temp[i]
  save.trace2[i] <- sum(temp * temp)
}

Here's a better example.
saved <- 0
for (i in 1:100) {
  saved <- i
  Sys.sleep(0.1)
}
Run this code, and press escape sometime in the next 10 seconds (before the loop completes).
Take a look at the value of saved. It should be more than 0, indicating that your progress has been stored.
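If you are worried about a full crash (power loss, killed session) rather than a manual interrupt, you could also checkpoint to disk periodically. Below is a minimal sketch, assuming the outer loop and vectors from the question; trace_checkpoint.rds is just a hypothetical file name.

n <- 64797                                   # rows to populate, as in the question
save.trace  <- numeric(n)
save.trace2 <- numeric(n)
start <- 1
if (file.exists("trace_checkpoint.rds")) {   # resume if a checkpoint exists
  chk <- readRDS("trace_checkpoint.rds")
  save.trace  <- chk$save.trace
  save.trace2 <- chk$save.trace2
  start <- chk$last.row + 1
}
for (i in start:n) {                         # assumes start <= n, i.e. an unfinished job
  # ... per-row computation exactly as in the question ...
  if (i %% 1000 == 0) {                      # checkpoint every 1000 rows
    saveRDS(list(last.row = i,
                 save.trace = save.trace,
                 save.trace2 = save.trace2),
            "trace_checkpoint.rds")
  }
}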

I did not have the memory to risk an experiment to answer my question, so I borrowed another machine, tried it, and indeed you CAN end a process and still retain the previously stored information. I had not run into this problem before. I attempted to delete my question but could not; I'll leave it here in case it helps someone else.

Related

For loop setup with multiple parameters in R

I'm trying to figure out how to set up a for loop in R when I want it to run over two or more parameters at once. Below I have posted sample code that runs and fills a matrix with two values per iteration. In the second line of the for loop I have
R <- ARMA.var(length(x_global_sample), ar = c(tt[i], -.7))
What I would like to do is replace the -.7 with another tt[i], as in the examples below, so that my for loop runs through the pairs starting at (-1, -1), then (-1, -.99), (-1, -.98), ..., (1, .98), (1, .99), (1, 1), with the result matrix populated by the output of Q and sigma.
R <- ARMA.var(length(x_global_sample), ar = c(tt[i], tt[i]))
or something similar to
R <- ARMA.var(length(x_global_sample), ar = c(tt[i], ss[i]))
It may well be that this is better handled by two for loops; however, I'm not 100% sure how to set that up so that the first parameter stays fixed while the code runs through the sequence of the second parameter, and once that finishes, the first parameter advances one step and stays fixed again while the second parameter runs through once more.
I've posted some sample code below; the ARMA.var function comes from the ts.extend package. Any insight into this would be great.
Thank you
tt <- seq(-1, 1, 0.01)
Result <- matrix(NA, nrow = length(tt) * length(tt), ncol = 2)
for (i in seq_along(tt)) {
  R <- ARMA.var(length(x_global_sample), ar = c(tt[i], -.7))
  Q <- t(y - X %*% beta_est_d) %*% solve(R) %*% (y - X %*% beta_est_d) +
    lam * t(beta_est_d) %*% D %*% beta_est_d
  RSS <- sum((y - X %*% solve(t(X) %*% solve(R) %*% X + lam * D) %*%
                t(X) %*% solve(R) %*% y)^2)
  Denom <- n - sum(diag(X %*% solve(t(X) %*% solve(R) %*% X + lam * D) %*%
                          t(X) %*% solve(R)))
  sigma <- RSS / Denom
  Result[i, 1] <- Q
  Result[i, 2] <- sigma
  rm(Q, R, sigma)
}
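One way to keep a single loop, as a sketch (assuming every (AR1, AR2) pair is wanted), is to enumerate the parameter pairs up front with expand.grid and index into that grid:

# expand.grid cycles its first argument fastest, so listing ar2 first
# matches the (-1,-1), (-1,-.99), ... ordering described above
grid <- expand.grid(ar2 = tt, ar1 = tt)
Result <- matrix(NA, nrow = nrow(grid), ncol = 2)
for (i in seq_len(nrow(grid))) {
  R <- ARMA.var(length(x_global_sample), ar = c(grid$ar1[i], grid$ar2[i]))
  # ... compute Q, RSS, Denom, and sigma from R exactly as in the loop above ...
  Result[i, 1] <- Q
  Result[i, 2] <- sigma
}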
Edit: I realize that what I posted above is quite unclear, so to simplify things, consider the following code:
x <- seq(1, 20, 1)
y <- seq(1, 20, 2)
Result <- matrix(NA, nrow = length(x) * length(y), ncol = 2)
for (i in seq_along(x)) {
  z1 <- x[i] + y[i]
  z2 <- z1 + y[i]
  Result[i, 1] <- z1
  Result[i, 2] <- z2
}
So the results table would contain the following rows:
Row1: 1+1=2, 2+1=3
Row2: 1+3=4, 4+3=7
Row3: 1+5=6, 6+5=11
Row4: 1+7=8, 8+7=15
And this pattern would continue with x staying fixed until the last value of y is reached; then x would start at 2 and cycle through the calculations over y again, and so on, until the last row is
RowN: 20+19=39, 39+19=58.
So I just want to know whether there is a way to do this in one loop, or whether it is easier to run it as two loops.
I hope this makes clearer what my question is asking. I realize this is not the optimal way to do this; for now it is just for testing purposes, to see how long my initial process takes so that it can be streamlined down the road.
Thank you
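For the simplified example, here is a minimal two-loop sketch that produces exactly the rows described above: the outer index stays fixed while the inner one cycles, and a running row index places each pair in the result matrix.

x <- seq(1, 20, 1)
y <- seq(1, 20, 2)
Result <- matrix(NA, nrow = length(x) * length(y), ncol = 2)

for (i in seq_along(x)) {            # x[i] stays fixed...
  for (j in seq_along(y)) {          # ...while y[j] cycles through
    z1 <- x[i] + y[j]
    z2 <- z1 + y[j]
    row <- (i - 1) * length(y) + j   # running row index for the pair (i, j)
    Result[row, 1] <- z1
    Result[row, 2] <- z2
  }
}

Result[1:2, ]   # rows 1 and 2: (2, 3) and (4, 7), matching Row1 and Row2 above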

R Updating A Column In a Large Dataframe

I've got a dataframe, stored in a csv, with 63 columns and 1.3 million rows. Each row is a chess game; each column is a detail about the game (e.g. who played in the game, what their rankings were, the time it was played, etc.). I have a column called "Analyzed", which records whether someone later analyzed the game, so it's a yes/no variable.
I need to use the API offered by chess.com to check whether a game is analyzed. That part is easy. However, how do I systematically update the csv file without wasting huge amounts of time reading it in and writing it out, while accounting for the fact that this is going to take a huge amount of time and I need to do it in stages? I believe a best practice for chess.com's API is to Sys.sleep after every call so that you lower the likelihood of accidentally making concurrent requests, which the API doesn't handle very well, so I sleep for a quarter of a second per call. Even if the API call itself took no time, 1.3 million calls at 0.25 seconds each means the program would need to run for about 90 hours on the sleep time alone. My goal is to be able to run this program in chunks, so that I don't need to run it for 90 hours in a row.
The code below works to determine whether a game has been analyzed, but I don't know how to intelligently update the original csv file. I think my best bet is to rewrite the updated dataframe, replacing the old Games.csv, every 1000 or so API calls; see the commented code below.
My overall question is: when I need to update a column in a large csv, what is the smart way to update that column incrementally?
library(bigchess)
library(rjson)
library(jsonlite)

df <- read.csv("Games.csv")

for (i in 1:nrow(df)) {
  data <- read_json(df$urls[i])
  if (data$analysisLogExists == TRUE) {
    df$Analyzed[i] <- 1
  } else {
    df$Analyzed[i] <- 0
  }
  Sys.sleep(0.25)
  ## This won't work, because the second time I run it I'll just reread the
  ## original rows. If I try to account for that by subsetting only the rows
  ## that haven't been updated, it still doesn't work, because then the write
  ## command below no longer writes the whole dataset to the csv.
  if (i %% 1000 == 0) {  # checkpoint every 1000 calls
    write.csv(df, "Games.csv", row.names = FALSE)
  }
}
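One sketch of a resumable version, assuming Analyzed starts out as NA for rows that have not been checked yet: each run processes only the remaining rows, and every checkpoint rewrites the whole frame, so earlier results are preserved and the script can be stopped at any point, losing at most 1000 calls of work.

library(jsonlite)

df <- read.csv("Games.csv")
todo <- which(is.na(df$Analyzed))  # rows not yet processed in earlier runs

done <- 0
for (i in todo) {
  data <- read_json(df$urls[i])
  df$Analyzed[i] <- as.integer(isTRUE(data$analysisLogExists))
  Sys.sleep(0.25)
  done <- done + 1
  if (done %% 1000 == 0) {
    write.csv(df, "Games.csv", row.names = FALSE)  # checkpoint
  }
}
write.csv(df, "Games.csv", row.names = FALSE)      # final write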

How to modify elements of a vector based on other elements in parallel?

I'm trying to parallelize part of my code, but I can't figure out how, since everywhere I read about parallelization the object is completely split into chunks and each chunk is processed using only the information it contains. My problem cannot be split into independent chunks, but the updates can be done independently.
Here's a simplified version
a <- runif(10000)
indexes <- c(seq(1, 9999, 2), seq(2, 10000, 2))
for (i in indexes) {
  .prev <- ifelse(i > 1, a[i - 1], 0)
  .next <- ifelse(i < 10000, a[i + 1], 1)
  a[i] <- runif(1, min(.prev, .next), max(.prev, .next))
}
While each iteration of the for loop depends on the current values in a, the order defined in indexes makes the problem parallelizable within the odd indexes and within the even indexes: e.g., if indexes = seq(1,9999,2), the dependence is not a problem, since the values read in each iteration will never be modified. On the other hand, it is impossible to execute this on the subset a[indexes] alone, so the splitting strategy in the guides I read cannot perform this operation.
How can I parallelize a problem like this, where I need the object to be modified and to "look" at the whole vector every time? Instead of each worker returning some output, can each worker modify a "shared" object in memory?
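A minimal sketch of the two-phase idea (assuming a Unix-alike, since parallel::mclapply relies on forking and is not available on Windows): within one phase, the neighbours of every updated index are never written, so workers can read the vector and simply return the new values, which the master writes back between the phases.

library(parallel)

a <- runif(10000)
n <- length(a)

# Replacement values for a batch of indexes; reads only neighbours,
# which lie outside the batch, so no writes are needed on the workers.
update_batch <- function(batch, a, n) {
  vapply(batch, function(i) {
    .prev <- if (i > 1) a[i - 1] else 0
    .next <- if (i < n) a[i + 1] else 1
    runif(1, min(.prev, .next), max(.prev, .next))
  }, numeric(1))
}

run_phase <- function(idx, a, n, cores = 4) {
  batches <- split(idx, cut(seq_along(idx), cores))  # one chunk per core
  unlist(mclapply(batches, update_batch, a = a, n = n, mc.cores = cores),
         use.names = FALSE)
}

odd  <- seq(1, n - 1, 2)
even <- seq(2, n, 2)

a[odd]  <- run_phase(odd, a, n)   # even entries are read, never written
a[even] <- run_phase(even, a, n)  # sees the already-updated odd entries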

Appending text file for parallel simulation output

I am running an R simulation on a multi-core system. The simulation result I am monitoring is a vector of length 900. My plan was to append this vector (row-wise) to a text file, using write.table, as each simulation ends. My simulations run from 1 to 1000. While I was working on my laptop the results were fine, because the work is sequential. When I work on a cluster, the simulations are divided among workers, and there can be a conflict over who writes first. The basis for my claim is that I am even getting impossible values in the first column of my text file (the column used to store the simulation index). If you need sample code, I can attach it.
There is no way to write to a single text file from parallel threads in a way that respects order. Each thread has its own buffer and no indication of when it is appropriate to write, because there is no cross-communication. So they will all try to write to the same file at the same time, even in the middle of another thread's write, which is why you are getting impossible values in your first column.
The solution is to write to a separate file per thread, or to return the vector as the output of the multithreaded apply loop, and then combine the results sequentially at the end.
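A minimal sketch of the second option, where run_one_sim is a hypothetical stand-in for the real simulation: each worker returns its row instead of writing it, and the master combines and writes once, in order.

library(parallel)

# Hypothetical stand-in for the real simulation: returns the simulation
# index followed by the length-900 result vector.
run_one_sim <- function(sim_index) {
  c(sim_index, runif(900))
}

cl <- makeCluster(detectCores() - 1)
results <- parLapply(cl, 1:1000, run_one_sim)
stopCluster(cl)

out <- do.call(rbind, results)  # 1000 x 901 matrix, rows in simulation order
write.table(out, "simulation_results.txt",
            row.names = FALSE, col.names = FALSE)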

How to overcome an infinite loop?

I am totally new to R; hopefully you can help. I am trying to simulate from a Hawkes process using R. The main idea is this: first I simulate some events from a homogeneous Poisson process, and then each of these events creates its own children using a non-homogeneous Poisson process. The code is as below:
SimulateHawkesprocess <- function(n, tmax, lambda, lambda2) {
  times <- Simulatehomogeneousprocess(n, lambda)
  count <- 1
  while (count < n) {
    newevent <- times[count] +
      Simulateinhomogeneousprocess(lambda2, tmax, lambdamax = NA)
    times <- c(times, newevent)
    count <- count + 1
    n <- length(times)
  }
  return(times)
}
But the R code is producing an infinite loop (probably because of the last line, n <- length(times)). How can I overcome this problem? How can I add a stopping condition?
This is not an R-specific problem: you need to get your algorithm working correctly first. Compare the code you have written against what you want it to do; if you need help with the algorithm, tag the question as such. Moreover, the call to Simulateinhomogeneousprocess is hard to assess without some insight into that function: what does it return, a number or a vector?

Within the loop you are increasing the value of n by at least 1 each time, so you never reach the end:
newevent <- times[count] + Simulateinhomogeneousprocess(lambda2, tmax, lambdamax = NA)
This creates a non-empty variable.
times <- c(times, newevent)
This grows the times vector by at least one element (since newevent is non-empty).
count <- count + 1
n <- length(times)
You increase count by 1 but also increase n by at least 1, creating a never-ending loop. One of these things has to change for the loop to stop.
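A minimal sketch of one possible fix, reusing the question's own Simulatehomogeneousprocess and Simulateinhomogeneousprocess functions: freeze the loop bound before the loop, so only the original parent events spawn children. If descendants should also produce offspring, the loop would instead need a stopping condition on newevent exceeding tmax.

SimulateHawkesprocess <- function(n, tmax, lambda, lambda2) {
  times <- Simulatehomogeneousprocess(n, lambda)
  n.parents <- length(times)  # fixed bound: do not grow it inside the loop
  count <- 1
  while (count < n.parents) {
    newevent <- times[count] +
      Simulateinhomogeneousprocess(lambda2, tmax, lambdamax = NA)
    times <- c(times, newevent)
    count <- count + 1
  }
  return(times)
}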
