Is there a package like bigmemory in R that can deal with large list objects?

I know that the R package bigmemory works great for large matrices and data frames. However, I was wondering whether there is any package, or any technique, for working efficiently with a large list.
Specifically, I created a list whose elements are vectors. I have a for loop, and during each iteration multiple values are appended to a selected element of that list (a vector). At first it runs fast, but once the iteration count passes roughly 10,000 it slows down noticeably (one iteration takes about a second). I will go through about 70,000 to 80,000 iterations, so the list will be very large by the end.
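For concreteness, the pattern looks roughly like this (the names and sizes are made up for illustration):
results <- vector("list", 10)                 # a list whose elements are vectors
for (i in 1:80000) {
  k <- sample(length(results), 1)             # pick one element of the list
  new_vals <- rnorm(5)                        # values produced in this iteration
  results[[k]] <- c(results[[k]], new_vals)   # append (this builds a new, longer vector each time)
}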
So I was wondering whether there is something like a big.list, analogous to big.matrix in the bigmemory package, that could speed up this whole process.
Thanks!

I'm not really sure if this is a helpful answer, but you can interactively work with lists on disk using the filehash package.
For example, here's some code that creates a database on disk, assigns a preallocated empty list to it, and then runs a loop that fills each element of that list with the current time.
library(filehash)
# how many items in the list?
n <- 100000
# setup database on disk
dbCreate("testDB")
db <- dbInit("testDB")
# preallocate an empty list in the database
db$time <- vector("list", length = n)
# run function using disk object
for(i in 1:n) db$time[[i]] <- Sys.time()
There is hardly any RAM used during this process; however, it is very slow (two orders of magnitude slower than doing it in RAM in some of my tests) because of the constant disk I/O. So I'm not sure this method is a good answer to the question of how to speed up work on big objects.
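For completeness, values can be read back from the database in the same way (a quick sketch; dbList() is the filehash function that lists the stored keys):
dbList(db)          # keys stored in "testDB" (here just "time")
length(db$time)     # the whole list can be pulled back into RAM
db$time[[1]]        # or a single element, though this still fetches the list from disk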

The DSL package might help. Its DList object works as a drop-in replacement for R's list. Further, it also provides a distributed-list facility.

Related

construct large "matrix" in R

I'm trying to construct a large matrix:
mat <- matrix(0,ncol=700000,nrow=700000)
I tried this on machines with a lot of RAM, but they don't seem to be able to handle it (a dense 700,000 × 700,000 matrix of doubles needs about 700,000² × 8 bytes ≈ 3.9 TB).
Is there another data structure I could use that is faster or less memory intensive?
I would still need the same number of rows and columns, filled with 0s.
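One standard structure for a matrix that is almost entirely zeros (not from the original thread, just a commonly used option) is a sparse matrix from the Matrix package, which stores only the non-zero entries:
library(Matrix)
# an all-zero 700,000 x 700,000 sparse matrix occupies only a few MB,
# because none of the zeros are stored explicitly
mat <- Matrix(0, nrow = 700000, ncol = 700000, sparse = TRUE)
format(object.size(mat), units = "MB")
This only helps, of course, if the matrix stays mostly zero as it is filled in.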

R needs several hours to save very small objects. Why?

I am running several calculations and ML algorithms in R and store their results in four distinct tables.
For each calculation, I obtain four tables, which I store in a single list.
According to RStudio's Environment pane (the upper-right panel where all my objects, functions, etc. are displayed), each of these lists is labelled "Large List (4 elements, 971.2 kB)".
I have five of these lists and save them for later use with the save() function.
I use the function:
save(list1, list2, list3, list4, list5, file="mypath/mylists.RData")
For some reason I do not understand, R takes more than 24 hours to save these five lists of only 971.2 kB each.
Maybe I should add that, apparently, more than 10 GB of my RAM is used by R at that time. However, the lists are as small as indicated above.
Does anyone have an idea why it takes so long to save the lists to my harddrive and what I could do about it?
Thank you
This is just a guess, because we don't have your data.
Some objects in R contain references to environments. The most common examples are functions and formulas. If you save one of those, R may need to save the whole environment, which can drastically increase the size of what is being saved. If you are short of memory, that can take a very long time due to swapping.
Example:
F <- function() {
  X <- rnorm(1000000)
  Y ~ z
}
This function returns a small formula which references the environment holding X, so saving it will take a lot of space.
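You can see the effect without writing anything to disk by serializing the formula in memory (a quick sketch, not part of the original answer):
f <- F()
object.size(f)              # looks tiny: object.size() does not follow environments
length(serialize(f, NULL))  # roughly 8 MB: the captured environment holding X comes along
# if the environment is not needed later, re-pointing the formula at the
# global environment before saving avoids dragging X along
environment(f) <- globalenv()
length(serialize(f, NULL))  # small again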
Thanks for your answers.
I solved my problem by writing a function that extracts the tables from the objects and saves them as .csv files in a folder. I then cleaned the environment and shut down the computer. Afterwards, I restarted the computer, started R, and loaded all the .csv files again. Finally, I saved the resulting objects with the familiar save() command.
It is probably not the most elegant way, but it worked and was quite quick.

R accumulating memory in each iteration with large input files

I am reading around 20,000 text files in a for loop for sentiment analysis. Each file is around 20-40 MB. In each iteration, I extract some sentiment counts (just two numbers) from the input text and store them in a data frame. The issue is that, with each iteration, R keeps accumulating memory: after 10,000 files, I see around 13 GB allocated to R in my task manager. I tried gc() and rm() to delete objects after each iteration, but it still does not help. My guess is that, since I reuse the same objects in every iteration, R is not releasing the memory used in previous iterations.
for (i in 1:20000) {
  filename <- paste0("file_", i, ".txt")
  text <- readLines(filename)
  # Doing sentiment analysis based on a dictionary approach
  # Storing sentiment counts in a data frame
  # Removing the objects used in this iteration
  rm(filename, text)
  gc()
}
You could start by checking which objects are taking up memory that you no longer use:
print(sapply(ls(), function(x) pryr::object_size(get(x))/1024/1024))
(EDIT: just saw the comment with this almost identical advice)
This line gives you the size, in megabytes, of every object present in the environment (in RAM).
Alternatively, if nothing suspicious shows up, you can call gc() several times instead of once, like:
rm(filename, text)
for (i in 1:3) gc()
It is usually more effective...
If nothing works, the memory may be fragmented: RAM is free but unusable because it is scattered between data you are still using.
In that case, the workaround could be to run your script in chunks of files, say 1,000 at a time.
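One way to organize that chunking within a single session might look like this (a sketch; the file names and the placeholder count are made up, and in stubborn cases you could even restart R between chunks):
files <- paste0("file_", 1:20000, ".txt")
chunks <- split(files, ceiling(seq_along(files) / 1000))   # 1000 files per chunk
results <- vector("list", length(chunks))
for (j in seq_along(chunks)) {
  results[[j]] <- vapply(chunks[[j]], function(f) {
    text <- readLines(f)
    # ... dictionary-based sentiment counting would go here ...
    length(text)   # placeholder for the real sentiment count
  }, numeric(1))
  gc()   # give R a chance to release memory between chunks
}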

Yet another apply Questions

I am totally convinced that an efficient R program should avoid loops whenever possible and instead use the big family of apply functions.
But this cannot happen without pain.
For example, I am facing a problem whose solution involves a sum inside the applied function; as a result, the list of results is reduced to a single value, which is not what I want.
To be concrete, I will try to simplify my problem.
Assume N = 100:
sapply(list(1:N), function(n) (
  choose(n, (floor(n/2) + 1):n) *
  eps^((floor(n/2) + 1):n) *
  (1 - eps)^(n - ((floor(n/2) + 1):n))))
As you can see, the function inside causes the length of the resulting vector to explode, whereas using a sum inside would collapse everything to a single value:
sapply(list(1:N), function(n) sum(
  choose(n, (floor(n/2) + 1):n) *
  eps^((floor(n/2) + 1):n) *
  (1 - eps)^(n - ((floor(n/2) + 1):n))))
What I would like to have is a list of length N.
So what do you think? How can I fix it?
Your question doesn't contain reproducible code (what's "eps"?), but on the general point about for loops and optimising code:
For loops are not incredibly slow. For loops are incredibly slow when used improperly because of how memory is assigned to objects. For primitive objects (like vectors), modifying a value in a field has a tiny cost - but expanding the length of the vector is fairly costly because what you're actually doing is creating an entirely new object, finding space for that object, copying the name over, removing the old object, etc. For non-primitive objects (say, data frames), it's even more costly because every modification, even if it doesn't alter the length of the data.frame, triggers this process.
But: there are ways to optimise a for loop and make them run quickly. The easiest guidelines are:
Do not run a for loop that writes to a data.frame. Use plyr or dplyr, or data.table, depending on your preference.
If you are using a vector and can know the length of the output in advance, it will work a lot faster. Specify the size of the output object before writing to it.
Do not twist yourself into knots avoiding for loops.
So in this case - if you're only producing a single value for each thing in N, you could make that work perfectly nicely with a vector:
#Create output object. We're specifying the length in advance so that writing to
#it is cheap
output <- numeric(length = length(N))
#Start the for loop
for(i in seq_along(output)){
output[i] <- your_computations_go_here(N[i])
}
This isn't actually particularly slow - because you're writing to a vector and you've specified the length in advance. And since data.frames are actually lists of equally-sized vectors, you can even work around some issues with running for loops over data.frames using this; if you're only writing to a single column in the data.frame, just create it as a vector and then write it to the data.frame via df$new_col <- output. You'll get the same output as if you had looped through the data.frame, but it'll work faster because you'll only have had to modify it once.
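Applied to the formula in the question, that pattern looks roughly like this (a sketch: eps is not defined in the question, so a value is assumed here):
eps <- 0.1   # assumed probability; not given in the question
N <- 100
output <- numeric(N)   # preallocated: one summed value per n
for (n in 1:N) {
  k <- (floor(n/2) + 1):n
  output[n] <- sum(choose(n, k) * eps^k * (1 - eps)^(n - k))
}
# the same thing without an explicit loop, iterating over 1:N rather than list(1:N)
output2 <- sapply(1:N, function(n) {
  k <- (floor(n/2) + 1):n
  sum(choose(n, k) * eps^k * (1 - eps)^(n - k))
})
all.equal(output, output2)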

mclapply with big objects - "serialization is too large to store in a raw vector"

I keep hitting an issue with the multicore package and big objects. The basic idea is that I'm using a Bioconductor function (readBamGappedAlignments) to read in large objects. I have a character vector of filenames, and I've been using mclapply to loop over the files and read them into a list. The function looks something like this:
objects <- mclapply(files, function(x) {
on.exit(message(sprintf("Completed: %s", x)))
message(sprintf("Started: '%s'", x))
readBamGappedAlignments(x)
}, mc.cores=10)
However, I keep getting the following error: Error: serialization is too large to store in a raw vector. Yet I can read the same files in individually without this error. I've found mention of this issue here, without resolution.
Any suggestions for a parallel solution would be appreciated - this has to be done in parallel. I could look towards snow, but I have a very powerful server with 15 processors, 8 cores each, and 256 GB of memory that I can do this on. I'd rather do it on this machine across cores than use one of our clusters.
The integer limit is rumored to be addressed very soon in R. In my experience that limit can block datasets with even fewer than 2 billion cells (around the maximum integer value), because low-level functions such as sendMaster in the multicore package rely on passing raw vectors. I had around 1 million processes representing about 400 million rows of data and 800 million cells in data.table format, and when mclapply was sending the results back it ran into this limit.
A divide-and-conquer strategy is not that hard, and it works. I realize this is a hack, and ideally you should be able to rely on mclapply alone.
Instead of one big list of files, create a list of lists. Each sub-list is smaller than the one that triggered the error, and you feed them into mclapply split by split. Call this file_map. The results are then a list of lists, which you can flatten with a double do.call("c", ...). That way, each time mclapply finishes, the serialized raw vector stays at a manageable size.
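For example, file_map can be built by splitting the vector of file names into chunks (a sketch; the chunk size is arbitrary and depends on how large each file's result is):
file_map <- split(files, ceiling(seq_along(files) / 100))   # sub-lists of at most 100 files each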
Just loop over the smaller pieces:
collector <- vector("list", length(file_map))  # preallocated rather than grown, for speed
for (index in seq_along(file_map)) {
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    readBamGappedAlignments(x)
  }, mc.cores = 10)
  collector[[index]] <- reduced_set
}
output <- do.call("c", do.call("c", collector))  # double concatenate of the list of lists
Alternatively, save the output to a database as you go, such as SQLite.
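A rough sketch of that idea with DBI/RSQLite, assuming each chunk's results can be coerced to data frames (the table and file names here are made up):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "alignments.sqlite")
for (index in seq_along(file_map)) {
  reduced_set <- mclapply(file_map[[index]], readBamGappedAlignments, mc.cores = 10)
  chunk_df <- do.call(rbind, lapply(reduced_set, as.data.frame))   # assumes as.data.frame() applies to these objects
  dbWriteTable(con, "alignments", chunk_df, append = TRUE)         # appended chunk by chunk
  rm(reduced_set, chunk_df); gc()
}
dbDisconnect(con)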
