R accumulating memory in each iteration with large input files

I am reading around 20,000 text files in a for loop for sentiment analysis. Each file is around 20-40 MB. In each iteration I extract sentiment counts (just two numbers) from the input text and store them in a data frame. The problem is that R keeps accumulating memory across iterations: after 10,000 files the task manager shows around 13 GB allocated to R. I tried gc() and rm() to delete objects after each iteration, but it does not help. My reasoning was that, since I reuse the same objects in every iteration, R should release the memory from previous iterations, yet it does not.
for (i in 1:20000) {
  filename <- paste0("file_", i, ".txt")
  text <- readLines(filename)
  # Sentiment analysis based on a dictionary approach
  # Store the sentiment counts in a data frame
  # Remove the objects used in this iteration
  rm(filename, text)
  gc()
}

You could start by checking which objects are taking up memory even though you no longer use them:
print(sapply(ls(), function(x) pryr::object_size(get(x))/1024/1024))
(EDIT: just saw the comment with this almost identical advice)
This line gives you the size in megabytes of every object present in the environment (in RAM).
Alternatively, if nothing stands out, you can call gc() several times instead of once, like this:
rm(filename, text)
for (i in 1:3) gc()
Calling it repeatedly is usually more effective.
If nothing works, it could mean the memory is fragmented: RAM is technically free but unusable because it sits scattered between data you still use.
The solution could then be to run your script in chunks of files, say 1000 at a time.
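A minimal sketch of that chunked approach, assuming a hypothetical count_sentiment() function that returns the two counts for one file; each chunk writes its counts to disk, so nothing large survives between chunks and you could even restart R between them:

chunk_size <- 1000
n_files <- 20000

for (start in seq(1, n_files, by = chunk_size)) {
  idx <- start:min(start + chunk_size - 1, n_files)
  counts <- t(vapply(idx, function(i) {
    text <- readLines(paste0("file_", i, ".txt"))
    count_sentiment(text)                      # hypothetical: returns c(positive, negative)
  }, numeric(2)))
  saveRDS(as.data.frame(counts), sprintf("counts_%05d.rds", start))
  rm(counts)
  gc()                                         # one collection per chunk is enough
}

# afterwards, combine the small per-chunk results
files <- list.files(pattern = "^counts_\\d+\\.rds$")
results <- do.call(rbind, lapply(files, readRDS))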

Related

R needs several hours to save very small objects. Why?

I am running several calculations and ML algorithms in R and store their results in four distinct tables.
For each calculation, I obtain four tables, which I store in a single list.
According to RStudio's environment pane (the upper-right quadrant where my objects, functions, etc. are displayed), each of my lists is labelled "Large List (4 elements, 971.2 kB)".
I have five of these lists and save them for later use with the save() function.
I use the function:
save(list1, list2, list3, list4, list5, file="mypath/mylists.RData")
For some reason, which I do not understand, R takes more than 24 hours to save these lists of only 971.2 kB each.
Maybe I should add that apparently more than 10 GB of my RAM is used by R at the time. However, the lists are as small as I indicated above.
Does anyone have an idea why it takes so long to save the lists to my hard drive and what I could do about it?
Thank you
This is just a guess, because we don't have your data.
Some objects in R contain references to environments; the most common examples are functions and formulas. If you save one of those, R may need to save the whole environment, which can drastically increase the size of what is being saved. If you are short of memory, that can take a very long time due to swapping.
Example:
F <- function() {
  X <- rnorm(1000000)
  Y ~ z
}
This function returns a small formula which references the environment holding X, so saving it will take a lot of space.
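To see the effect, compare the serialized size of the formula before and after detaching the captured environment; this is a quick sketch added for illustration:

fm <- F()
length(serialize(fm, NULL))       # on the order of 8 MB: X travels with the formula
environment(fm) <- globalenv()    # the global environment is saved by reference, not copied
length(serialize(fm, NULL))       # now only a few hundred bytes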
Thanks for your answers.
I solved my problem by writing a function that extracts the tables from the objects and saves them as .csv files in a folder. I then cleaned the environment and shut down the computer. After restarting the computer, I started R, loaded all the .csv files again, and saved the newly created objects with the familiar save() command.
It is probably not the most elegant way, but it worked and was quite quick.
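A minimal sketch of that kind of round trip, assuming each list element is a data frame (the actual structure of the lists is not shown in the question):

# session 1: dump every table of every list to csv
all_lists <- list(list1 = list1, list2 = list2, list3 = list3,
                  list4 = list4, list5 = list5)
dir.create("tables", showWarnings = FALSE)
for (ln in names(all_lists)) {
  for (tn in seq_along(all_lists[[ln]])) {
    write.csv(all_lists[[ln]][[tn]],
              file.path("tables", sprintf("%s_table%d.csv", ln, tn)),
              row.names = FALSE)
  }
}

# session 2 (after a restart): read them back and save normally
files <- list.files("tables", pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv)
names(tables) <- basename(files)
save(tables, file = "mypath/mylists.RData")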

R not releasing memory after filtering and reducing data frame size

I need to read a huge dataset, trim it down to a tiny one, and then use it in my program. After trimming, the memory is not released (regardless of calls to gc() and rm()). I am puzzled by this behaviour.
I am on Linux with R 4.2.1. I read a huge .Rds file (>10 GB), both with the base function and the readr version. Memory usage shows 14.58 GB. I then do operations that reduce it to 800 rows and 24.7 MB, but memory usage stays the same for the rest of the session regardless of what I do. I tried:
Piping readRDS directly into trimming functions and only storing the trimmed result;
First reading rds into a variable and then replacing it with the trimmed version;
Reading rds into a variable, storing the trimmed data in a new variable, and then removing the big dataset with rm() followed by garbage collection gc().
I understand what the workaround would be: a script that first creates a temporary file with the reduced dataset and then runs a separate R session to work with that dataset. But it feels like this shouldn't be happening?
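For reference, a sketch of that two-step workaround, with trim_data() standing in for whatever filtering is actually done:

# step1_trim.R -- run with: Rscript step1_trim.R
big <- readRDS("huge_data.rds")        # hypothetical file name
small <- trim_data(big)                # hypothetical trimming function
saveRDS(small, "trimmed_data.rds")
# the throwaway process exits here, so the ~14 GB are returned to the OS

# main analysis, in a fresh R session
small <- readRDS("trimmed_data.rds")   # only ~25 MB resident now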

R read.table() extremely slow

I have a simple piece of R code that looks like this:
for (B in 1:length(Files)) {
  InputDaten[, B] <- read.table(Files[B], header = FALSE, dec = ".", skip = 12,
                                sep = ",", colClasses = c("numeric"))
}
so I read 1.39 GB of files into memory and would like to process them. However, this takes about an hour to read. When I watch the occupied memory, it increases only every 10 minutes; only the last two minutes show a linear increase of memory over time. Why might that be? Can I make it faster?
Edit 1
InputDaten <- data.frame(c(1:15360), 444)
This is how I initialised InputDaten.
I have now tried fread, and the result looks the same. Here is a screenshot of the memory usage when I started fread; the memory usage doesn't increase at all for a while (fread started approximately in the middle of the time frame):
http://pic-hoster.net/upload/57790/Unbenannt.png
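For what it's worth, assigning into a data frame column by column (InputDaten[, B] <- ...) can copy the whole frame on each iteration. A common alternative, sketched here on the assumption that every file is a single numeric column of equal length, is to read everything into a list and combine once at the end:

library(data.table)

# read each file once into a list of numeric vectors
columns <- lapply(Files, function(f) {
  fread(f, header = FALSE, skip = 12, sep = ",", colClasses = "numeric")[[1]]
})

# combine in a single step instead of one assignment per file
InputDaten <- as.data.frame(columns)
names(InputDaten) <- paste0("V", seq_along(columns))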

Is there a package like bigmemory in R that can deal with large list objects?

I know that the R package bigmemory works great for dealing with large matrices and data frames. However, I was wondering if there is any package, or any way, to work efficiently with a large list.
Specifically, I created a list whose elements are vectors. I have a for loop, and during each iteration multiple values are appended to a selected element of that list (a vector). At first it runs fast, but after maybe 10,000 iterations it slows down gradually (one iteration takes about a second). I am going to go through about 70,000 to 80,000 iterations, and the list will be very large after that.
So I was just wondering if there is something like a big.list, analogous to big.matrix in the bigmemory package, that could speed up this whole process.
Thanks!
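As background for the slowdown: appending to a vector with c(old, new) copies the whole vector, so repeated appends do quadratically growing work. A small in-RAM sketch of the usual workaround (names here are made up), which collects per-iteration chunks and flattens once at the end:

n_iter <- 80000
chunks <- vector("list", n_iter)     # one slot per iteration

for (i in 1:n_iter) {
  chunks[[i]] <- rnorm(5)            # stand-in for the values appended in that iteration
}

result <- unlist(chunks, use.names = FALSE)   # single concatenation at the end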
I'm not really sure if this is a helpful answer, but you can interactively work with lists on disk using the filehash package.
For example, here's some code that creates a disk database, assigns a preallocated empty list to the database, and then runs a function (getting the current time) that fills the list in the database.
library(filehash)

# how many items in the list?
n <- 100000

# set up the database on disk
dbCreate("testDB")
db <- dbInit("testDB")

# preallocate an empty list in the database
db$time <- vector("list", length = n)

# fill the list on disk using the database object
for (i in 1:n) db$time[[i]] <- Sys.time()
There is hardly any use of RAM during this process; however, it is VERY slow (two orders of magnitude slower than doing it in RAM in some of my tests) due to constant disk I/O. So I'm not sure that this method is a good answer to the question of how to speed up working on big objects.
The DSL package might help. Its DList object works as a drop-in replacement for R's list. Further, it provides a distributed-list facility too.

mclapply with big objects - "serialization is too large to store in a raw vector"

I keep hitting an issue with the multicore package and big objects. The basic idea is that I'm using a Bioconductor function (readBamGappedAlignments) to read in large objects. I have a character vector of filenames, and I've been using mclapply to loop over the files and read them into a list. The function looks something like this:
objects <- mclapply(files, function(x) {
  on.exit(message(sprintf("Completed: %s", x)))
  message(sprintf("Started: '%s'", x))
  readBamGappedAlignments(x)
}, mc.cores = 10)
However, I keep getting the following error: Error: serialization is too large to store in a raw vector. Yet I can seemingly read the same files in on their own without this error. I've found mention of this issue here, without resolution.
Any parallel solution suggestions would be appreciated - this has to be done in parallel. I could look towards snow, but I have a very powerful server with 15 processors, 8 cores each, and 256 GB of memory that I can do this on. I'd rather do it on this machine across cores than use one of our clusters.
The integer limit is rumored to be addressed very soon in R. In my experience, that limit can block datasets approaching 2 billion cells (around the maximum integer value), because low-level functions like sendMaster in the multicore package rely on passing raw vectors. I had around 1 million processes representing about 400 million rows of data and 800 million cells in data.table format, and when mclapply was sending the results back it ran into this limit.
A divide-and-conquer strategy is not that hard, and it works. I realize this is a hack and one should be able to rely on mclapply.
Instead of one big list, create a list of lists, where each sub-list is smaller than the one that broke, and feed them into mclapply split by split. Call this file_map. The results are a list of lists, so you can then combine them with a double do.call("c", ...) concatenation. That way, each time mclapply finishes, the serialized raw vector stays a manageable size.
Just loop over the smaller pieces:
collector <- vector("list", length(file_map))   # preallocated; more complex than normal, but faster
for (index in 1:length(file_map)) {
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    readBamGappedAlignments(x)
  }, mc.cores = 10)
  collector[[index]] <- reduced_set
}
output <- do.call("c", do.call("c", collector)) # double concatenation of the list of lists
Alternatively, save the output to a database as you go, for example SQLite.
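A rough sketch of that incremental route with RSQLite, assuming each chunk of results can be flattened to a data frame (as_alignment_df() below is a hypothetical converter):

library(parallel)   # mclapply; the question used the older multicore package
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "alignments.db")

for (index in seq_along(file_map)) {
  reduced_set <- mclapply(file_map[[index]], readBamGappedAlignments, mc.cores = 10)
  chunk_df <- do.call(rbind, lapply(reduced_set, as_alignment_df))  # hypothetical converter
  dbWriteTable(con, "alignments", chunk_df, append = TRUE)
  rm(reduced_set, chunk_df)
  gc()   # keep the master process small between chunks
}

dbDisconnect(con)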
