Why does loading a saved R file increase CPU usage?

I have an R script that I want to run frequently. A few months ago, when I wrote it and first ran it, there was no problem.
Now the script consumes almost all (99%) of the CPU and is slower than it used to be. I am running it on a server, and other users experience slow responses from the server while the script is running.
I tried to pin down the piece of code where it is slow. The following loop takes almost all of the time and CPU used by the script.
for (i in 1:100) {
  load(paste(saved_file, i, ".RData", sep = ""))
  # do something (which is fast)
  assign(paste("var", i, sep = ""), vector)
}
The loaded data is about 11 MB in each iteration. When I run the loop body for an arbitrary i, the load() step takes far longer than the other commands.
I spent a few hours reading forum posts but could not find any hint about my problem. It would be great if you could point out what I am missing or suggest a more effective way to load a file in R.
EDIT: Added spaces in the code to make it easier to read.

paste(saved_file, i, ".RData", sep = "")
This loads an object on each iteration, named xxx1, xxx2, and so on.
Did you try to rm() the object at the end of the loop? I suspect the object stays in memory even though the variable name is reused.
Just a tip: add spaces in your code (like I did); it's much easier to read and debug.
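Building on that comment, a minimal sketch of the loop with the loaded object removed at the end of each iteration might look like this (saved_file and vector are placeholders from the question; whether this cures the slowdown is not guaranteed):

for (i in 1:100) {
  loaded_names <- load(paste(saved_file, i, ".RData", sep = ""))  # load() returns the names it created
  # ... do something (which is fast) ...
  assign(paste("var", i, sep = ""), vector)
  rm(list = loaded_names)  # drop the loaded object(s) so they do not accumulate
  gc()                     # give memory back before the next iteration
}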

Related

R how to clear memory

Here's my problem:
I'm working on a Linux system and run R in a console.
Within a loop in an R program, I load a really huge data file of several GB.
Pseudocode:
for (i in 1:n) {
  data <- read(Hugefile[i])
  ...
  # do some stuff
  ...
  rm(data)
}
When R is started fresh from the console, the first iteration loads the data successfully. But in the second iteration, I get an allocation error. Even when I clear everything using rm(list=ls()) and gc(), I get the same error when trying to load the file manually. Only after closing R and opening it again can I load another file of that size.
Does anyone know how to clear R's memory within a loop, without restarting R?
Thanks for your help :)

R occupying virtual Memory completely

I rewrote my program many times to avoid hitting any memory limits, yet it again takes up the full VIRT, which does not make any sense to me. I do not save any objects; I write to disk each time I am done with a calculation.
The code (simplified) looks like:
library(parallel)  # for makeCluster(), mclapply(), stopCluster()

lapply(foNames,    # these are just folder names like c("~/datastes/xyz", "~/datastes/xyy")
       function(foName) {
         Filepath <- paste(foName, "somefile.rds", sep = "")
         CleanDataObject <- readRDS(Filepath)   # reads the data
         cl <- makeCluster(CONF$CORES2USE)      # spins up a cluster (it does not matter whether
                                                # I use the cluster or not; the problem is
                                                # independent of it, imho)
         mclapply(1:noOfDataSets2Generate, function(x, CleanDataObject) {
           bootstrapper(CleanDataObject)
         }, CleanDataObject)
         stopCluster(cl)
       })
The bootstrap function simply samples the data and saves the sampled data to disk.
bootstrapper <- function(CleanDataObject) {
  newCPADataObject <- sample(CleanDataObject)
  newCPADataObject$sha1 <- digest::sha1(newCPADataObject, algo = "sha1")
  saveRDS(newCPADataObject, paste(newCPADataObject$sha1, ".rds", sep = ""))
  return(newCPADataObject)
}
I do not get how this can accumulate to over 60 GB of RAM. The code is highly simplified, but imho there is nothing else that could be problematic. I can paste more code details if needed.
How does R manage to successively eat up my memory, even though I already rewrote the software to store the generated objects on disk?
I have had this problem with loops in the past. It is more complicated to address inside functions and apply calls.
But what I have done is use two things in combination to fix the problem.
Within each function that generates temporary objects, use rm() on the temporary object and then run gc(), which forces a garbage collection before the function exits. This will slow the process somewhat, but it reduces memory pressure. This way each iteration of apply purges memory before moving on to the next step. You may have to go back to your first function in a chain of nested functions to accomplish this well. It takes experimentation to figure out where the system is getting backed up.
I find this to be especially necessary if you use ANY methods from packages built on top of rJava: they are extremely wasteful of resources, R has no way of running garbage collection on the Java heap, and most authors of Java packages do not seem to account for the need to collect in their methods.
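Applying that pattern to the bootstrapper above might look roughly like the sketch below. This is not the original author's code: dropping the return value and returning invisibly is my assumption, made so that no large object is handed back to the calling apply.

bootstrapper <- function(CleanDataObject) {
  newCPADataObject <- sample(CleanDataObject)
  newCPADataObject$sha1 <- digest::sha1(newCPADataObject, algo = "sha1")
  saveRDS(newCPADataObject, paste(newCPADataObject$sha1, ".rds", sep = ""))
  rm(newCPADataObject)  # remove the temporary object before exiting
  gc()                  # force a garbage collection so memory is released each iteration
  invisible(NULL)       # do not return the large object
}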

Restart R session without interrupting the for loop

In my for loop, I need to free RAM. So I delete some objects with the rm() command and then call gc(), but the RAM usage stays the same.
So I use .rs.restartR() instead of gc() and it works: a sufficient part of my RAM is freed after the restart of the R session.
My problem is that the for loop is interrupted by the R restart. Do you have an idea how to automatically continue the for loop after the .rs.restartR() command?
I just stumbled across this post as I'm having a similar problem with rm() not clearing memory as expected. Like you, if I kill the script, remove everything using rm(list=ls(all.names=TRUE)) and restart, the script takes longer than it did initially. However, restarting the session using .rs.restartR() then sourcing again works as expected. As you say, there is no way to 'refresh' the session while inside a loop.
My solution was to write a simple bash script which calls my .r file.
Say you have a loop in R that runs from 1 to 3 and you want to restart the session after each iteration. My bash script 'runR.sh' could read as follows:
#!/bin/bash
for i in {1..3}
do
  echo "Rscript myRcode.r $i"  # check call to R script is as expected
  Rscript myRcode.r $i
done
Then at the top of 'myRcode.r':
args <- commandArgs()
print(args) #list the command line arguments.
myvar <- as.numeric(args[6])
and remove your for (myvar in...){}, keeping just the contents of the loop.
You'll see from print(args) that your input from your shell script is the 6th element of the array, hence args[6] in the following line when assigning your variable. If you're passing in a string, e.g. a file name, then you don't need as.numeric of course.
Running ./runR.sh will then call your script and hopefully solve your memory problem. The only minor issue is that you have to reload your packages each time, unlike when using .rs.restartR(), and may have to repeat other bits that ordinarily would only be run once.
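For completeness, a minimal myRcode.r along these lines might look like the sketch below; the computation and the output file name are placeholders, not part of the original answer.

# myRcode.r -- run once per iteration by runR.sh
args <- commandArgs()
print(args)                    # check where the shell argument lands
myvar <- as.numeric(args[6])   # iteration number passed in from the bash loop
# (reload any packages you need here; nothing carries over between calls)
result <- myvar^2              # placeholder for the real body of the old for loop
saveRDS(result, paste0("result_", myvar, ".rds"))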
It works in my case, I would be interested to hear from other more seasoned R/bash users whether there are any issues with this solution...
By saving the iteration as an external file and writing an Rscript which calls itself, the session can be restarted within a for loop from within RStudio. This example requires the following steps.
# Save the iteration as a separate .RData file in the working directory.
iter <- 1
save(iter, file="iter.RData")
Create a script which calls itself for a certain number of iterations. Save the following script as "test_script.R"
### load iteration
library(rstudioapi)
load("iter.RData")

### insert function here.
time_now <- Sys.time()

### save output of function to a file.
save(time_now, file = paste0("time_", iter, ".RData"))

### update iteration
iter <- iter + 1
save(iter, file = "iter.RData")

### restart session calling the script again
if (iter < 5) {
  restartSession(command = 'source("test_script.R")')
}
Do you have an idea how to automatically continue the for loop after the .rs.restartR() command?
It is not possible.
Okay, you could configure your R system to do something like this, but it sounds like a bad idea. I'm not really sure whether you want to restart the for loop from the beginning or pick it up where it left off. (I'm also very confused that you seem to have been able to enter commands in the R console while a for loop was executing. I think there's more that you are not telling us.)
You can use your Rprofile.site file to automatically run commands when R starts. You could set it up to automatically run your for loop code whenever R starts. But this seems like a bad idea. I think you should find a different sort of fix for your problem.
Some of the things you could do to help the situation: have your for loop write output for each iteration to disk and also write some sort of log to disk so you know where you left off. Maybe write a function around your for loop that takes an argument of where to start, so that you can "jump in" at any point.
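A rough sketch of that idea, with hypothetical file names and a placeholder computation:

run_from <- function(start = 1, n = 100) {
  for (i in start:n) {
    result <- i^2                                        # placeholder for the real work
    saveRDS(result, paste0("output_", i, ".rds"))        # write each iteration's output to disk
    cat(i, "\n", file = "progress.log", append = TRUE)   # log how far we got
  }
}
# After a restart, read the log and jump back in where you left off:
# last_done <- max(scan("progress.log", quiet = TRUE))
# run_from(start = last_done + 1)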
With this approach, rather than "restarting R and automatically picking up the loop", a better bet would be to use Rscript (or similar) and use R or the command line to sequentially run each iteration (or batch of iterations) in its own R session.
The best fix would be to solve the memory issue without restarting. There are several questions on SO about memory management - try the answers out and if they don't work, make a reproducible example and ask a new question.
You can make your script recursive by sourcing itself after restarting the session.
Make sure the script takes into account the current status of the loop: save the loop status to an .rds file before restarting the session, then read that .rds file inside the loop after the session restarts. This lets you resume the loop where it was before the R session restarted.
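A sketch of that pattern; the script name, file name, and loop bound are made up, and the work itself is a placeholder:

# this_script.R
i <- if (file.exists("loop_state.rds")) readRDS("loop_state.rds") else 1
# ... do the work for iteration i here ...
saveRDS(i + 1, "loop_state.rds")   # save the loop status before restarting
if (i + 1 <= 10) {
  rstudioapi::restartSession(command = 'source("this_script.R")')
}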
I just found out about this command 'restartSession'. I'm using it because I was also running into memory consumption issues as the garbage collector will not give back the RAM to the OS (Linux).
library(rstudioapi)
restartSession(command = "print('x')")
An approach independent of RStudio:
If you want to run this in RStudio, use the terminal rather than the R console; otherwise use rstudioapi::restartSession() as in the other answers (not recommended, as it crashes).
Create the iterator and load the script (in a system terminal this would be):
R -e 'saveRDS(1,"i.rds"); source("Script.R")'
Script.R file:
# read files and iterator
i <- readRDS("i.rds")
print(i)

# open process id of previous loop to kill it
tryCatch(pid <- readRDS(file = "pid.rds"), error = function(e) NA)
if (exists("pid")) {
  library(tools)
  tools::pskill(pid, SIGKILL)
}

# update objects and iterator
i <- i + 1

# process
pid <- Sys.getpid()

# save files and iterator
saveRDS(i, file = "i.rds")

# process ID to close it in next loop
saveRDS(pid, file = "pid.rds")

### restart session calling the script again
if (i <= 20) {
  print(paste("Processing of", i - 1, "ended, restarting"))
  assign('.Last', function() { system('Rscript Script.R') })
  q(save = 'no')
}

R console unexpectedly slow, long after the job (PDF output) is finished

When I run a large R script (it works as expected and produces a correct PDF at the end; base plotting plus beeswarm, with dev.off() as the last line), I notice that the PDF is finished after ~3 seconds and can even be opened in other applications, long before the console output (merely a few integer values and the echo of ~400 lines of code) finishes (~20 seconds). There are no errors reported. In between, the echo stops and does nothing for seconds at a time.
I work with R Studio V0.97.551, R version 3.0.1, on Win-7.
gc() and closing and restarting R did not help, and the data structures used are not big anyway (5 data frames with up to 60 observations and 64 numeric or short character variables). The available memory should be sufficient (around 4 GB throughout, according to the task manager), but the CPU is busy during that time.
I agree this is not reproducible for other people without the script, which is unfortunately too large to post, but maybe someone has experienced the same problem, or even has an explanation or a suggestion of what to check? Thanks in advance!
EDIT:
I ran exactly the same code directly in R 3.0.1 (without RStudio), and the problem was gone, suggesting the problem is related to RStudio. I added the RStudio tag, but I am not sure whether I am now supposed to move this question somewhere else.
Recently I came across a similar problem: running from RStudio becomes very slow, even when executing something as simple as example('plot'). After searching around, this post pointed me to the right place and eventually led to a workaround: resetting RStudio by renaming the "RStudio-Desktop" directory. The exact way to do so depends on the OS you are using, and you can find detailed instructions here. I just tried it, and it works.

write.table(...,append=T) : Cannot open the connection

I'm wondering if anyone else has ever encountered this problem. I'm writing a fairly small amount of data to a csv file. It's about 30 lines, 50 times.
I'm using a for loop to write data to the file.
It seems "finicky" sometimes the operation completes successfully, and other times it stops after the first ten times (300 lines) other times 3, or 5... by telling me
"cannot open connection".
I imagine it is some type of timeout. Is there a way to tell R to "slow down" when writing tables?
Before you ask: there's just too much code to provide an example here.
Code would help, despite your objections. R has a fixed-size connection pool and I suspect you are running out of connections.
So make sure you follow the three steps of
open the connection (and check for errors as a bonus)
write using the connection
close the connection
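A minimal sketch of those three steps with a basic error check; the path and the data here are just examples:

myData <- data.frame(x = 1:3, y = letters[1:3])
con <- tryCatch(
  file("C:/Mydir/Myfile.txt", open = "at"),
  error = function(e) stop("could not open the connection: ", conditionMessage(e))
)
write.table(myData, file = con, col.names = FALSE)
close(con)   # always release the connection when you are done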
I can't reproduce it with R 2.11.1 32-bit on Windows 7 64-bit. For this kind of thing, please provide more info on your system (see e.g. ?R.version, ?Sys.info).
Memory is a lot faster than disk access. 1500 lines are easily manageable in memory and can be written to the file in one go. If the data come from different sets, add an extra factor variable indicating the set (set1 to set50). All your data then fits in one data frame, and you avoid accessing the disk many times.
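A sketch of that "collect in memory, write once" idea; the data and path below are made up:

all_sets <- lapply(1:50, function(i) {
  chunk <- data.frame(value = rnorm(30))   # stands in for the ~30 lines of each set
  chunk$set <- paste0("set", i)            # extra variable indicating the set
  chunk
})
combined <- do.call(rbind, all_sets)
write.table(combined, "C:/Mydir/Myfile.txt", row.names = FALSE)   # a single disk write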
In case it really has to happen in 50 separate writes, this code illustrates the valuable advice of Dirk:
for (i in 1:50) {
  ...
  ff <- file("C:/Mydir/Myfile.txt", open = "at")
  write.table(myData, file = ff)
  close(ff)
}
See also the help: ?file
EDIT: you should use open="at" instead of open="wt". "at" is append mode; "wt" is write mode. append=T is the same as open="at".
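So the same append behaviour can also be had without managing the connection yourself, e.g. (a sketch; column names are suppressed to avoid the repeated-header warning):

write.table(myData, "C:/Mydir/Myfile.txt", append = TRUE, col.names = FALSE)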
