Can I stop and start a script in R - r

I am working with R and my script is taking a very long time. I was thinking I can stop it and then start it again by changing my counters.
My code is this
NC <- MLOA
for (i in 1:313578) {
  len_mods <- length(MLOA[[i]])
  for (j in 1:2090) {
    for (k in 1:len_mods) {
      temp_match <- matchv[j]
      temp_rep <- replacev[j]
      temp_mod <- MLOA[[i]][k]
      is_found <- match(temp_mod, temp_match, nomatch = 0, incomparables = 0)
      if (is_found[1] == 1) NC[[i]][k] <- temp_rep
      rm(temp_match, temp_rep, temp_mod)
    }
  }
}
I am thinking that I can stop my script, then re-start it by checking what values of i,j and k are and changing the counts to start at their current values. So instead of counting "for (i in 1:313578)" if i is up to 100,000 I could do (i in 100000:313578).
I don't want to stop my script though before checking that my logic about restarting it is solid.
Thanks in anticipation

I'm a bit confused about what you are doing. Generally on this forum it is a good idea to greatly simplify your code and present only the core of the problem in a very simple example. That notwithstanding, this might help. Put your for loop in a function whose parameters are the first elements of the sequences you loop over. For example:
myloop <- function(x, ...) {
  for (i in seq(x, 313578, 1)) {
    # ... loop body ...
  }
}
This way you can easily control where your loop starts.
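As a minimal sketch of that idea, assuming the work for each i is independent of the other iterations: the loop below takes its start and end index as arguments, so a stopped run can be resumed from the last index reached. The names run_from and results, and the squaring step, are hypothetical stand-ins for the real loop body.

```r
# Hypothetical stand-in for the real loop: fills results[i] for i = start..end.
run_from <- function(start, end, results) {
  for (i in seq(start, end, 1)) {
    results[i] <- i^2  # placeholder for the real per-item work
  }
  results
}

n_total <- 10
results <- numeric(n_total)

# First run, "stopped" after item 4 (simulated here by a shorter end point).
results <- run_from(1, 4, results)

# Resume from item 5; items 1 to 4 are not recomputed.
results <- run_from(5, n_total, results)
```

Note that this only works if the partial results are kept somewhere that survives the interruption, as the results vector does here.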
The more important question is, however, why are you using for loops in the first place? In R, explicit for loops over individual elements are usually much slower than vectorized code and are best avoided whenever a vectorized alternative exists. By vectorizing your code you can greatly increase its speed. I have seen speed increases of a factor of 500!
In general, the only reason you use a for loop in R is if current iterations of the for loop depend on previous iterations. If this is the case then you are likely bound to the slow for loop.
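To illustrate what vectorizing can look like here, the sketch below does the same lookup twice, once with nested loops and once with a single call to match(). The small vectors are hypothetical stand-ins for the real data.

```r
matchv   <- c("a", "b", "c")       # hypothetical values to search for
replacev <- c("A", "B", "C")       # hypothetical replacements, in the same order
mods     <- c("b", "x", "a", "b")  # hypothetical input vector

# Loop version: compare every element against every entry of matchv.
out_loop <- mods
for (k in seq_along(mods)) {
  for (j in seq_along(matchv)) {
    if (mods[k] == matchv[j]) out_loop[k] <- replacev[j]
  }
}

# Vectorized version: one call to match() finds all positions at once.
pos <- match(mods, matchv)  # NA where there is no match
out_vec <- mods
hit <- !is.na(pos)
out_vec[hit] <- replacev[pos[hit]]
```

Both versions give the same result; the second avoids the inner loop entirely.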
Depending on your computer skills, however, even for loops can be made faster in R. If you know C, or are willing to learn a bit, interfacing with C can dramatically increase the speed of your code.
An easier way to increase the speed of your code, which unfortunately will not yield the same speed-up as interfacing with C, is to use R's byte compiler. Check out the cmpfun function in the compiler package.
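A minimal sketch of cmpfun, using a hypothetical loop-heavy function called slow_sum. Since R 3.4.0 functions are byte-compiled automatically by the JIT compiler, so on a recent R the difference may be small or absent.

```r
library(compiler)  # ships with R

# Hypothetical loop-heavy function, used only for illustration.
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

fast_sum <- cmpfun(slow_sum)  # byte-compiled version with the same behaviour

# Timings vary by machine and R version, so none are claimed here.
system.time(slow_sum(1e6))
system.time(fast_sum(1e6))
```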
One final thing on speeding up code: the line temp_match <- matchv[j] looks innocuous enough, but it can really slow things down. Every time you assign matchv[j] to temp_match, R creates a new object, which means your computer needs to find somewhere to store it in RAM. R is smart: as you create more and more objects, it cleans up after you and throws away those you are no longer using via the garbage collector. However, finding places to store the objects and running the garbage collector both take time. Read this if you want to learn more: http://adv-r.had.co.nz/memory.html.

You could also use while loops for your 3 loops to maintain the counters yourself. In the following, you can stop the script at any time (and view the intermediate results), then restart it either by setting continue to TRUE or by simply re-running the loop part of the script:
n <- 6
res <- array(NaN, dim = rep(n, 3))
continue <- FALSE
if (!continue) {
  i <- 1
  j <- 1
  k <- 1
}
while (k <= n) {
  while (j <= n) {
    while (i <= n) {
      res[i, j, k] <- as.numeric(paste0(i, j, k))
      Sys.sleep(0.1)
      i <- i + 1
    }
    j <- j + 1
    i <- 1
  }
  k <- k + 1
  j <- 1
}
i; j; k
res

This is what I got to....
for (i in 1:313578) {
  mp <- match(MLOA[[i]], matchv, nomatch = 0, incomparables = 0)
  lgic <- which(as.logical(mp), arr.ind = FALSE, useNames = TRUE)
  NC[[i]][lgic] <- replacev[mp]
}
Thanks to those who responded. Jacob H, you are right, I am definitely a newbie with R, and your response was useful. Frank, your pointers helped.
My solution probably still isn't an optimal one. All I wanted to do was a find and replace. matchv was the vector in which I was searching for a match for each MLOA[[i]], with replacev being the vector of replacement information.
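One possible next step, offered only as a sketch on hypothetical toy data: flatten the list, do a single match() over all values, and split the result back into a list. This removes the remaining loop over i. It assumes the list elements are plain atomic vectors of the same type.

```r
# Hypothetical small stand-ins for the real data.
MLOA     <- list(c("a", "b"), c("c"), c("b", "d"))
matchv   <- c("b", "c")
replacev <- c("B", "C")

flat <- unlist(MLOA, use.names = FALSE)  # one long vector

# Remember which list element each value came from.
grp <- factor(rep(seq_along(MLOA), lengths(MLOA)), levels = seq_along(MLOA))

mp <- match(flat, matchv, nomatch = 0)
flat[mp > 0] <- replacev[mp]             # zeros in mp are dropped when indexing

NC <- unname(split(flat, grp))           # back to a list with the original shape
```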

Related

Problem with checking logical within for loop

Inspired by the LeetCode two-sum challenge, I wanted to solve it in R. But while trying to solve it by brute force I ran into an issue with my for loop.
The basic idea is: given a vector of integers, find which two integers in the vector sum to a given target integer.
First I create 10000 integers:
set.seed(1234)
n_numbers <- 10000
nums <- sample(-10^4:10^4, n_numbers, replace = FALSE)
Then I do a for loop within a for loop to check every single element against each other.
# ensure that it is actually solvable
target <- nums[11] + nums[111]
test <- 0
for (i in 1:(length(nums)-1)) {
  for (j in 1:(length(nums)-1)) {
    j <- j + 1
    test <- nums[i] + nums[j]
    if (test == target) {
      print(i)
      print(j)
      break
    }
  }
}
My problem is that it starts wildly printing numbers before ever getting to the right condition of test == target. And I cannot seem to figure out why.
I think there are several issues with your code:
First, you don't have to increase j manually; you can do this in the for statement. So if you really want j to be shifted up by 1 in every step, you can just write:
for (j in 2:(length(nums)))
Second, break only exits the inner for loop; the outer loop keeps running. See the question "Breaking out of nested loops in R" for further information on that.
Third, several pairs of entries in nums give the "right" result target. Therefore, your if condition works as intended and prints every combination of nums[i] + nums[j] that equals target.
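A minimal sketch of one way to stop at the first matching pair, using a flag to leave both loops. The flag name found is an assumption, and the vector is smaller than in the question to keep the run short.

```r
set.seed(1234)
nums <- sample(-10^4:10^4, 200, replace = FALSE)  # smaller than the question's 10000
target <- nums[11] + nums[111]                    # guarantees at least one solution

found <- FALSE
for (i in 1:(length(nums) - 1)) {
  for (j in (i + 1):length(nums)) {  # start after i so each pair is checked once
    if (nums[i] + nums[j] == target) {
      found <- TRUE
      break                          # leaves the inner loop only
    }
  }
  if (found) break                   # leaves the outer loop as well
}

print(c(i, j))
```

The pair printed is the first one found, which is not necessarily positions 11 and 111, since other pairs may sum to the same target.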

Iteration in r for loop

I want to write a for loop that iterates over a vector or list while I add values to it in each iteration. I came up with the following code, but it does not run for more than one iteration. I don't want to use a while loop to write this program. I want to know how I can control a for loop's iterator. Thanks in advance.
steps <- 1
random_number <- c(sample(20, 1))
for (item in random_number) {
  if (item < 18) {
    random_number <- c(random_number, sample(20, 1))
    steps <- steps + 1
  }
}
print(paste0("It took ", steps, " steps."))
It really depends on what you want to achieve. Either way, I am afraid you cannot change the iterator on the fly. A while loop seems reasonable in this context. Alternatively, if you know a plausible maximum number of iterations, you could loop over that many and handle the needless iterations with an if statement. Based on your code, something more like:
steps <- 1
for (item in 1:100) {
  random_number <- c(sample(20, 1))
  if (random_number < 18) {
    random_number <- c(random_number, sample(20, 1))
    steps <- steps + 1
  }
}
print(paste0("It took ", steps, " steps."))
Which to be honest is not really different from a while() combined with an if statement to make sure it doesn't run forever.
This can't be done. The vector used in the for loop is evaluated at the start of the loop and any changes to that vector won't affect the loop. You will have to use a while loop or some other type of iteration.
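A short sketch of both points. The first part shows that growing the vector inside the loop does not add iterations. The second uses repeat with an explicit break as one alternative form of iteration; max_steps is a hypothetical safety cap, not something from the question.

```r
# The sequence is fixed when the loop starts: growing v inside adds no iterations.
v <- 1:3
iterations <- 0
for (item in v) {
  v <- c(v, item)
  iterations <- iterations + 1
}
iterations  # 3, although length(v) is now 6

# repeat with an explicit break, capped so it cannot run forever.
set.seed(1)
max_steps <- 1000  # hypothetical safety cap
steps <- 0
repeat {
  steps <- steps + 1
  if (sample(20, 1) >= 18 || steps >= max_steps) break
}
print(paste0("It took ", steps, " steps."))
```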

Most efficient way to remove elements from vector, do a calculation and put them back in R

I'm trying to find the best way to cut a vector into 12 clusters, where the first cluster must contain the values from 0 up to a specific value (onePercentQuantile). One way to do that is to remove all values at or below onePercentQuantile from the vector while keeping track of their indexes, then run cut on the remaining values with only 11 breaks, and finally put the removed indexes back as cluster 1 in the new object, starting from the leftmost index.
I did not find a way to make the cut function ignore values under a specific threshold, so I did it myself.
Here is an example with the most efficient way I could find:
vector_gene <- sample(1:100, 10000, replace=TRUE)
onePercentQuantile <- 5
indexUnderPercentQuantile <- which(vector_gene <= onePercentQuantile)
overPercentQuantile <- vector_gene[vector_gene > onePercentQuantile]
tmp <- as.numeric(cut(c(t(overPercentQuantile)), breaks=11)) + 1
if (1 == indexUnderPercentQuantile[1]) {
  tmp <- insert(tmp, 1, values=1)
  indexUnderPercentQuantile <- indexUnderPercentQuantile[2:length(indexUnderPercentQuantile)]
}
for (i in 1:length(vector_gene)) {
  if (i > length(tmp)) {
    tmp <- c(tmp, rep(1, length(vector_gene)-i+1))
    break
  }
  else if (i == indexUnderPercentQuantile[1]) {
    tmp <- c(tmp[1:i-1], 1, tmp[i:length(tmp)])
    if (length(indexUnderPercentQuantile) > 1) {
      indexUnderPercentQuantile <- indexUnderPercentQuantile[2:length(indexUnderPercentQuantile)]
    }
  }
}
Using the profvis package I have traced memory usage and running time.
With up to 10,000 elements, the result is instantaneous and uses barely any memory.
With up to 100,000 elements, memory usage is already around 9 GB and the run takes 3.5 s.
With up to 1,000,000 elements, the running time is so long that I shut it down.
The bottleneck is the reinsertion of the removed indexes:
tmp <- c(tmp[1:i-1],1,tmp[i:length(tmp)])
I've tried the insert function in R.utils, but it was worse in both memory usage and running time.
Is there a better way to do this, especially in terms of memory usage? Thanks!
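One possible approach, sketched under the assumption that only the final cluster labels are needed: preallocate the result with cluster 1 everywhere and overwrite only the positions above the threshold, using a logical index. Nothing is removed or reinserted, so the vector is never rebuilt element by element. The names clusters and above are hypothetical.

```r
set.seed(1)
vector_gene <- sample(1:100, 10000, replace = TRUE)
onePercentQuantile <- 5

# Start with every element in cluster 1.
clusters <- rep(1L, length(vector_gene))

# Positions above the threshold get clusters 2 to 12 from cut() with 11 breaks.
above <- vector_gene > onePercentQuantile
clusters[above] <- as.integer(cut(vector_gene[above], breaks = 11)) + 1L
```

As in the original code, cut() derives its 11 intervals from the range of the values that remain above the threshold.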

Parallel recursive function in R?

I’ve been racking my brain over this problem all week and could really use an outside perspective. Basically I’ve built a recursive tree function where the output of each node in one layer is used as the input for a node in the subsequent layer. I’ve generated a toy example here where each call generates a large matrix, splits it into submatrices, and then passes those submatrices to subsequent calls. The key difference from similar questions on Stack Overflow is that each call of tree_search doesn't actually return anything; it just appends results to a CSV file.
Now I'd like to parallelize this function. However, when I run it with mclapply and mc.cores=2, the runtime increases! The same happens when I run it on a multicore cluster with mc.cores=12. What’s going on here? Are the parent nodes waiting for the child nodes to return some output? Does this have something to do with fork/socket parallelization?
For background, this is part of an algorithm that models gene activation in white blood cells in response to viral infection. I’m a biologist and self-taught programmer so I’m a little out of my depth here - any help or leads would be really appreciated!
# Load libraries.
library(data.table)
library(parallel)
# Recursive tree search function.
tree_search <- function(submx = NA, loop = 0) {
  # Terminate on fifth loop.
  message(paste("Started loop", loop))
  if(loop == 5) {return(TRUE)}
  # Create large matrix and do some operation.
  bigmx <- matrix(rnorm(10), 50000, 250)
  bigmx <- sin(bigmx^2)
  # Aggregate matrix and save output.
  agg <- colMeans(bigmx)
  append <- file.exists("output.csv")
  fwrite(t(agg), file = "output.csv", append = append, row.names = F)
  # Split matrix in submatrices with 100 columns each.
  ind <- ceiling(seq_along(1:ncol(bigmx)) / 100)
  lapply(unique(ind), function(i) {
    submx <- bigmx[, ind == i]
    # Pass each submatrix to subsequent call.
    loop <- loop + 1
    tree_search(submx, loop) # sub matrix is used to generate big matrix in subsequent call (not shown)
  })
}
# Initiate tree search.
tree_search()
After a lot more brain wracking and experimentation, I ended up answering my own question. I’m not going to refer to the original example since I've changed up my approach quite a bit. Instead I’ll share some general observations that might help people in similar situations.
1.) For loops are more memory efficient than lapply and recursive functions
When you use lapply, each call of the function runs in its own environment, so assignments inside it create local variables instead of changing the ones outside. That’s why you can do this:
x <- 5
lapply(1:10, function(i) {
  x <- x + 1
  x == 6 # TRUE
})
x == 5 # ALSO TRUE
At the end x is still 5, which means that each call inside lapply was working on its own local x. That’s not good if, say, x is actually a large data frame with 10,000 variables and every call modifies it, because each call then ends up with its own modified copy. for loops, on the other hand, allow you to overwrite the variables on each iteration.
x <- 5
for(i in 1:10) {x <- x + 1}
x == 5 # FALSE
2.) Parallelize once
Distributing tasks to different nodes takes a lot of computational overhead and can cancel out any gains you make from parallelizing your script. Therefore, you should use mclapply with discretion. In my case, that meant NOT putting mclapply inside a recursive function where it was getting called tens to hundreds of times. Instead, I split the starting point into 16 parts and ran 16 different tree searches on separate nodes.
3.) You can use mclapply to throttle memory usage
If you split a job into 10 parts and process them with mclapply and mc.preschedule=F, each core will only process 10% of your job at a time. If mc.cores was set to two, for example, the other 8 "nodes" would wait until one part finished before starting a new one. This is useful if you are running into memory issues and want to prevent each loop from taking on more than it can handle.
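Points 2 and 3 can be sketched together as follows. This is an illustration under stated assumptions, not the code from the actual project: search_branch is a hypothetical stand-in for one independent tree search, and 4 starting points are used instead of 16 to keep it small.

```r
library(parallel)

# Hypothetical stand-in for one independent tree search started from seed_value.
search_branch <- function(seed_value, depth = 3) {
  if (depth == 0) return(as.numeric(seed_value))
  sum(vapply(1:2, function(k) search_branch(seed_value + k, depth - 1), numeric(1)))
}

starts <- 1:4

# mclapply relies on forking, which is unavailable on Windows, so use one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One parallel call at the top level; the recursion inside each branch stays serial.
# mc.preschedule = FALSE starts a new part only when a core becomes free.
res_par <- mclapply(starts, search_branch, mc.cores = n_cores, mc.preschedule = FALSE)

# Serial reference run for comparison.
res_seq <- lapply(starts, search_branch)
```

The design choice is that parallel overhead is paid once per starting point rather than once per recursive call.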
Final Note
This is one of the more interesting problems I’ve worked on so far. However, recursive tree functions are complicated. Draw out the algorithm and force yourself to spend a few days away from your code so that you can come back with a fresh perspective.

Speed up for loop assigning data to matrix in R

I am simulating data and filling a matrix using a for loop in R. Currently the loop is running slower than I would like. I've done some work to vectorize some of the variables to improve the loop's speed, but it is still taking some time. I believe the
mat[j,year] <- sum(vec==1)/x
part of the loop is slowing things down. I've looked into filling matrices more efficiently but could not find anything to help with my current problem. Eventually this will be used as part of a Shiny app, so all of the variables I assign will need to be easily assigned different values.
Any advice to speed up the loop or more efficiently write this loop would be greatly appreciated.
Here is the loop:
#These variables are all specified because they need to change with different simulations
num.sims <- 20
time <- 50
mat <- matrix(nrow = num.sims, ncol = time)
x <- 1000
init <- 0.5*x
vec <- vector(length = x)
ratio <- 1
freq <- -0.4
freq.vec <- numeric(nrow(mat))
## start a loop
for (j in 1:num.sims) {
  vec[1:init] <- 1; vec[(init+1):x] <- 2
  year <- 2
  freq.vec[j] <- sum(vec==1)/x
  for (i in 1:(x*(time-1))) {
    freq.1 <- sum(vec==1)/x; freq.2 <- 1 - freq.1
    fit.ratio <- exp(freq*(freq.1-0.5) + log(ratio))
    Pr.1 <- fit.ratio*freq.1/(fit.ratio*freq.1 + freq.2)
    vec[ceiling(x*runif(1))] <- sample(c(1,2), 1, prob=c(Pr.1,1-Pr.1))
    ## record data
    if (i %% x == 0) {
      mat[j,year] <- sum(vec==1)/x
      year <- year + 1
    }
  }
}
The inner loop is what is slowing you down. You're doing x iterations to update each cell in the matrix. Since each trip to modify vec depends on the previous iteration, this would be difficult to simplify. @Andrew Feierman is probably correct that this would benefit from being moved to C++, at least the four lines before the if statement.
Alternatively, this only takes 10-20 seconds to run. Unless you're going to scale this up or run it many times, it might not be worth the trouble to speed it up. If you do keep it as is, you could put a progress bar in Shiny to let the user know things are still working.
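One R-level change that may help before reaching for C++: sum(vec==1) rescans all x elements on every inner iteration, but only one element changes per step, so the count of 1s can be updated incrementally. The sketch below does this for a single simulation. The names simulate_run and n1 are hypothetical, and unlike the original it stores the initial frequency in the first slot of the output.

```r
# Hypothetical single-run version of the loop that keeps a running count of 1s.
simulate_run <- function(x = 1000, time = 50, ratio = 1, freq = -0.4) {
  init <- 0.5 * x
  vec <- c(rep(1, init), rep(2, x - init))
  n1 <- init                 # running count of elements equal to 1
  out <- numeric(time)
  out[1] <- n1 / x           # initial frequency
  year <- 2
  for (i in 1:(x * (time - 1))) {
    freq.1 <- n1 / x
    freq.2 <- 1 - freq.1
    fit.ratio <- exp(freq * (freq.1 - 0.5) + log(ratio))
    Pr.1 <- fit.ratio * freq.1 / (fit.ratio * freq.1 + freq.2)
    pos <- ceiling(x * runif(1))
    new_val <- sample(c(1, 2), 1, prob = c(Pr.1, 1 - Pr.1))
    # Adjust the count by the difference between the new and the old value.
    n1 <- n1 + (new_val == 1) - (vec[pos] == 1)
    vec[pos] <- new_val
    ## record data
    if (i %% x == 0) {
      out[year] <- n1 / x
      year <- year + 1
    }
  }
  out
}

set.seed(1)
res <- simulate_run(x = 100, time = 5)
```

Each row of mat could then be filled with one call to simulate_run. How much time this saves depends on x and has not been measured here.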
Depending on how often you will need to call this loop, it could be worth rewriting it in C++. Much of R itself is implemented in C, and compiled C++ code typically runs far faster than an equivalent element-by-element loop in R.
sourceCpp, a function in the Rcpp package, is a good place to start: https://www.rdocumentation.org/packages/Rcpp/versions/0.12.11/topics/sourceCpp
