How to clear memory in R

Here's my problem:
I'm working on a Linux system and run R in a console.
Within a loop in an R program, I load a really huge data file of several GB.
Pseudocode:
for (i in 1:n) {
  data <- read(Hugefile[i])
  ...
  # do some stuff
  ...
  rm(data)
}
When R is freshly started from the console, the first iteration loads the data successfully, but in the second iteration I get an allocation error. Even after clearing everything with rm(list=ls()) and gc(), I get the same error when trying to load the file manually. Only after closing R and opening it again can I load another file of that size.
Does anyone know how to clear the memory of R within a loop and without restarting R?
Thanks for your help :)
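One common workaround (my suggestion, not from this thread) is to do each heavy load in a fresh child R process: when the child exits, the OS reclaims all of its memory, which gc() in the parent often cannot do because of heap fragmentation. This sketch uses a trivial stand-in computation; the real loop body and file names would replace it.

```r
# Each iteration runs in its own Rscript child process, which computes a
# small stand-in result, saves it to disk, and exits (freeing its memory).
# The parent session then reads back only the small result.
for (i in 1:2) {
  expr <- sprintf('saveRDS(sum(seq_len(%d)), "summary_%d.rds")', i * 10, i)
  system2("Rscript", c("-e", shQuote(expr)))
  print(readRDS(sprintf("summary_%d.rds", i)))
}
```

This requires Rscript to be on the PATH; the per-iteration result must be small enough to pass back through a file.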

Related

read.csv crashes RStudio

Help me figure out what I am doing wrong!
I have about 20 .csv files (product feeds) online. I used to be able to fetch them all. But now they crash R if I fetch more than one or two. File size is about 50K rows / 30 columns each.
I guess it's a memory issue but I've tried on a different computer with the exact same result.
Could it be some formatting in the files that make R use too much memory? Or what can it be?
If I run one of these, everything is fine. Two sometimes work. With three, it almost certainly crashes:
a <- read.csv("URL1")
b <- read.csv("URL2")
c <- read.csv("URL3")
I have tried specifying all sorts of stuff like:
d <- read.csv("URL4",skipNul=TRUE,sep=",",stringsAsFactors=FALSE,header=TRUE)
I keep getting this message:
R session aborted.
R encountered a fatal error.
The session was terminated.
We have some commercial software where I can fetch the same files without issues, so the files themselves should be fine. And my script had been running twice daily for several months without issues.
R version 3.6.1
Platform: x86_64-apple-darwin15.6.0 (64-bit)
I have had this issue as well but with read_csv(). I haven't figured out what the exact cause is yet, but my best guess is that trying to read a file and write that file to a variable at the same time is too much for memory or CPU to handle.
Stemming from that guess, I tried this method and it has worked perfectly for me:
library(dplyr)
a <- read.csv("URL1") %>% as_tibble()
# you can convert to other data types instead of a tibble; this is just my example
The whole idea is to split the reading process from the writing process by separating them using a pipe. This makes sure that one must be finished before the next can start.

cannot allocate vector but my environment is empty

I found lots of questions here asking how to deal with "cannot allocate vector of size **" and tried the suggestions, but I still can't figure out why RStudio crashes every time.
I'm using 64bit R in Windows 10, and my memory.limit() is 16287.
I'm working with a bunch of large data files (mass spectra) that take up 6-7GB memory each, so I've been calling individual files one at a time and saving it as a variable with the XCMS package like below.
msdata <- xcmsRaw(datafile1,profstep=0.01,profmethod="bin",profparam=list(),includeMSn=FALSE,mslevel=NULL, scanrange=NULL)
I do a series of additional operations to clean up data and make some plots using rawEIC (also in XCMS package), which increases my memory.size() to 7738.28. Then I removed all the variables I created that are saved in my global environment using rm(list=ls()). But when I try to call in a new file, it tells me it cannot allocate vector of size **Gb. With the empty environment, my memory.size() is 419.32, and I also checked with gc() to confirm that the used memory (on the Vcells row) is on the same order with when I first open a new R session.
I couldn't find any information on why R still thinks that something is taking up a bunch of memory space when the environment is completely empty. But if I terminate the session and reopen the program, I can import the data file - I just have to re-open the session every single time one data file processing is done, which is getting really annoying. Does anyone have suggestions on this issue?

Restart R session without interrupting the for loop

In my for loop, I need to free up RAM. So I delete some objects with the rm() command. Then I call gc(), but the RAM usage stays the same.
So I used .rs.restartR() instead of gc() and it works: a sufficient part of my RAM is freed after the restart of the R session.
My problem is that the for loop is interrupted by the R restart. Do you have an idea how to automatically continue the for loop after the .rs.restartR() command?
I just stumbled across this post as I'm having a similar problem with rm() not clearing memory as expected. Like you, if I kill the script, remove everything using rm(list=ls(all.names=TRUE)) and restart, the script takes longer than it did initially. However, restarting the session using .rs.restartR() then sourcing again works as expected. As you say, there is no way to 'refresh' the session while inside a loop.
My solution was to write a simple bash script which calls my .r file.
Say you have a loop in R that runs from 1 to 3 and you want to restart the session after each iteration. My bash script 'runR.sh' could read as follows:
#!/bin/bash
for i in {1..3}
do
  echo "Rscript myRcode.r $i" # check the call to the R script is as expected
  Rscript myRcode.r $i
done
Then at the top of 'myRcode.r':
args <- commandArgs()
print(args) #list the command line arguments.
myvar <- as.numeric(args[6])
and remove your for (myvar in...){}, keeping just the contents of the loop.
You'll see from print(args) that your input from your shell script is the 6th element of the array, hence args[6] when assigning your variable. If you're passing in a string, e.g. a file name, then you don't need as.numeric, of course.
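A small aside (my addition, not part of the original answer): commandArgs(trailingOnly = TRUE) drops the interpreter's own arguments, so you don't have to count to the 6th element.

```r
# With trailingOnly = TRUE, only the user-supplied arguments are returned,
# so the value passed by "Rscript myRcode.r 1" is args[1], not args[6].
args <- commandArgs(trailingOnly = TRUE)
myvar <- if (length(args) >= 1) as.numeric(args[1]) else NA
print(myvar)
```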
Running ./runR.sh will then call your script and hopefully solve your memory problem. The only minor issue is that you have to reload your packages each time, unlike when using .rs.restartR(), and may have to repeat other bits that ordinarily would only be run once.
It works in my case, I would be interested to hear from other more seasoned R/bash users whether there are any issues with this solution...
By saving the iteration as an external file, and writing an rscript which calls itself, the session can be restarted within a for loop from within rstudio. This example requires the following steps.
# Save the iteration counter as a separate .RData file in the working directory.
iter <- 1
save(iter, file="iter.RData")
Create a script which calls itself for a certain number of iterations. Save the following script as "test_script.R"
###load iteration
library(rstudioapi)
load("iter.RData")
###insert function here.
time_now <- Sys.time()
###save output of function to a file.
save(time_now, file=paste0("time_", iter, ".Rdata"))
###update iteration
iter <- iter+1
save(iter, file="iter.RData")
###restart session calling the script again
if (iter < 5) {
  restartSession(command = 'source("test_script.R")')
}
Do you have an idea to automatically go on the for loop after the .rs.restartR() command ?
It is not possible.
Okay, you could configure your R system to do something like this, but it sounds like a bad idea. I'm not really sure if you want to restart the for loop from the beginning or pick it up where it left off. (I'm also very confused that you seem to have been able to enter commands in the R console while a for loop was executing. I think there's more that you are not telling us.)
You can use your Rprofile.site file to automatically run commands when R starts. You could set it up to automatically run your for loop code whenever R starts, but this seems like a bad idea. I think you should find a different sort of fix for your problem.
Some of the things you could do to help the situation: have your for loop write output for each iteration to disk and also write some sort of log to disk so you know where you left off. Maybe write a function around your for loop that takes an argument of where to start, so that you can "jump in" at any point.
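A minimal sketch of that checkpointing idea (the file names and the body of the loop are illustrative, not from the answer):

```r
# Resume-able loop: each iteration saves its output and records progress,
# so a freshly started session can pick up where the last one stopped.
run_from <- function(start_i, n = 5) {
  if (start_i > n) return(invisible(NULL))          # nothing left to do
  for (i in start_i:n) {
    result <- sqrt(i)                               # stand-in for the real work
    saveRDS(result, sprintf("out_%02d.rds", i))     # per-iteration output
    writeLines(as.character(i), "progress.log")     # record where we left off
  }
}

# "Jump in" at the first iteration that has not been completed yet
last <- if (file.exists("progress.log")) as.numeric(readLines("progress.log")) else 0
run_from(last + 1)
```

The guard at the top matters in R: without it, start_i:n counts downwards when start_i exceeds n, silently re-running iterations.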
With this approach, rather than "restarting R and automatically picking up the loop", a better bet would be to use Rscript (or similar) and use R or the command line to sequentially run each iteration (or batch of iterations) in its own R session.
The best fix would be to solve the memory issue without restarting. There are several questions on SO about memory management - try the answers out and if they don't work, make a reproducible example and ask a new question.
You can make your script recursive by having it source itself after restarting the session.
Make sure the script takes into account the initial status of the loop. You might have to save the current status of the loop in an .rds file before restarting the session, then read that .rds file from inside the loop after the restart. This lets the loop resume where it was before the R session restarted.
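A minimal sketch of that idea (the file name and counter value are illustrative):

```r
# Before restarting the session, persist the loop counter...
i <- 7                                # wherever the loop currently is
saveRDS(i, "loop_state.rds")

# ...and at the top of the script, resume from the saved state if present
i <- if (file.exists("loop_state.rds")) readRDS("loop_state.rds") else 1
print(i)
```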
I just found out about this command 'restartSession'. I'm using it because I was also running into memory consumption issues as the garbage collector will not give back the RAM to the OS (Linux).
library(rstudioapi)
restartSession(command = "print('x')")
An approach independent from Rstudio:
If you want to run this in RStudio, use the terminal rather than the R console; otherwise use rstudioapi::restartSession() as in the other answers (not recommended, as it crashes).
Create the iterator and load the script (in a system terminal this would be):
R -e 'saveRDS(1,"i.rds"); source("Script.R")'
Script.R file:
# read files and iterator
i<-readRDS("i.rds")
print(i)
# open process id of previous loop to kill it
tryCatch(pid <- readRDS(file="pid.rds"), error=function(e){NA} )
if (exists("pid")) {
  library(tools)
  tools::pskill(pid, SIGKILL)
}
# update objects and iterator
i <- i+1
# process
pid <- Sys.getpid()
# save files and iterator
saveRDS(i, file="i.rds")
# process ID to close it in next loop
saveRDS(pid, file="pid.rds")
### restart session calling the script again
if (i <= 20) {
  print(paste("Processing of", i - 1, "ended, restarting"))
  assign('.Last', function() { system('Rscript Script.R') })
  q(save = 'no')
}

Extract what was printed deep in the R console

When I execute commands in R, the output is printed in the console. After some threshold (I guess, some maximum number of lines), the R console no longer shows the first commands and their output. I cannot scroll up that far because it is simply no longer there.
How can I access this "early" output if it has disappeared from the console?
I care mostly about error messages and messages generated by my own script. I do use a script file and save my results to a file, if anyone wonders, but this still does not help solve my problem.
(I have tried saving the R workspace and R history and then loading it again, but did not know what to do next and was not able to find what I needed...)
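One preventative approach (my suggestion, not given in the thread) is to duplicate console output to a file as the script runs, using sink(), so early messages survive after the console buffer scrolls away:

```r
# split = TRUE sends regular output both to the console and to the file.
sink("console_log.txt", split = TRUE)
print("this message is also saved to console_log.txt")
sink()  # restore normal console-only output
```

Note that errors and message() output go to stderr; capturing those needs a second sink(..., type = "message") call, which does not support split = TRUE.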

Why does loading saved R file increase CPU usage?

I have an R script that I want to run frequently. A few months ago, when I wrote it and first ran it, there was no problem.
Now the script consumes almost all (99%) of the CPU and is slower than it used to be. I am running it on a server, and other users experience slow response from the server while the script is running.
I tried to find the piece of code where it is slow. The following loop takes almost all of the time and CPU used by the script.
for (i in 1:100) {
  load(paste(saved_file, i, ".RData", sep = ""))
  # do something (which is fast)
  assign(paste("var", i, sep = ""), vector)
}
The loaded data is about 11 MB in each iteration. When I run the above script for an arbitrary i, the file-loading step takes longer than the other commands.
I spent a few hours reading forum posts but could not get any hint about my problem. It would be great if you could point out if there's something I am missing or suggest a more effective way to load a file in R.
EDIT: Added spaces in the code to make it easier to read.
paste(saved_file, i, ".RData", sep = "")
loads an object at each iteration, with names xxx1, xxx2, and so on.
Did you try to rm() the object at the end of the loop? I guess the object stays in memory, regardless of your variable being reused.
Just a tip: add spaces in your code (like I did); it's much easier to read and debug.
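For what it's worth, the rm() tip above can be combined with loading each file into a scratch environment and collecting results in a list instead of assign() (my sketch, not from the answer; the saved files here are dummies standing in for the real ones):

```r
# Create a few dummy .RData files to stand in for the saved files
dir <- tempdir()
for (i in 1:3) {
  vec <- seq_len(i)
  save(vec, file = file.path(dir, paste0("saved_", i, ".RData")))
}

# Load each file into its own environment, keep only the small result,
# and drop the environment so the loaded data can be garbage-collected
vars <- vector("list", 3)
for (i in 1:3) {
  env <- new.env()
  load(file.path(dir, paste0("saved_", i, ".RData")), envir = env)
  vars[[i]] <- sum(env$vec)   # keep only what you actually need
  rm(env)
}
gc()
print(unlist(vars))  # 1 3 6
```

A list also avoids scattering var1, var2, ... through the global environment, which makes later cleanup with rm() much simpler.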