r: managing memory allocation in loops - r

First, this question is NOT about
Error: cannot allocate vector of size n
I accept this error as a given and I am trying to avoid the error in code
I have a dataset of 3000+ variables and 120000 cases
All columns are numeric
I need to reset NA with zero
If I reassign values to 0 for the entire dataset, I get the memory
allocation error.
So I am reassigning the values to zero one column at a time:`
resetNA <- function(results)
{
for (i in 1:ncol(results))
{
if(i>10)
{
results[,i][is.na(results[,i])] <- 0
}
}
print(head(results))
}
After about 1000 columns, I still get the memory allocation error.
Now, this seems strange to me. Somehow memory allocation is incrementing after each loop. However, I don't see why this would be the case.
Also, I tried calling garbage collection function after each loop, I still got the memory allocation error.
Can someone explain to me how I can manage the variables to avoid the incremental increase in memory allocation (after all, the data frame size has not changed).

As noted in the comments above, the answer is here:
Fastest way to replace NAs in a large data.table
I tried it and it works very well

I have learned an important general principle about r memory usage.
See this discussion.
Whereever possible avoid looping through a dataframe. Use lapply. This converts a dataframe to a list and then runs the relevant function on the list. It then returns a list. Convert the list back to a dataframe.
The following example recodes numeric frequencies to a categorical variable. It is fast and does not increase memory usage.
list1<-lapply(mybigdataframe,function(x) ifelse( x>0,"Yes","No"))
newdf1<-as.data.frame(list1)

Related

What is the most memory efficient way to initialize a list before a loop in R?

I am wondering what the most memory efficient way to initialize a list is in R if that list is going to be used in a loop to store results. I know that growing an object in a loop can cause a serious hit in computational efficiency so I am trying to avoid that as much as possible.
My problem is as follows. I have several groups of data that I want to process individually. The gist of my code is I have a loop that runs through each group one at a time, does some t-tests, and then returns only the statistically significant results (thus variable length results for each group). So far I am initializing a list of length(groups) to store the results of each iteration.
My main question is how I should be initializing the list so that the object is not grown in the loop.
Is it good enough to do list = vector(mode = "list", length=length(groups)) for the initialization?
I am skeptical about this because it just creates a list of length(groups) but each entry is equal to NULL. My concern is that during each iteration of the loop when I go to store data into the list, it is going to recopy the object each time as the entry goes from NULL to my results vector, in which case initializing the list doesn't really do much good. I don't know how the internals of a list work, however, so it is possible that it just stores the reference to the vector being stored in the list, meaning recopying is not necessary.
The other option would be to initialize each element of the list to a vector of the maximum possible length the results could have.
This is not a big issue as the maximum number of possible valid results is known. If I took this approach I would just overwrite each vector with the results vector within the loop. Since the maximum amount of memory would already be reserved hopefully no recopying/growth would occur. I don't want to take this approach, however, if it is not necessary and the first option above is good enough.
Below is some psuedo code describing my problem
#initialize variables
results = vector(mode="list", length=length(groups)) #the line of code in question
y=1
tTests = vector(length = length(singleGroup))
#perform analysis on each group in groups
for(group in groups)
{
#returns a vector of p values with one entry per element in group
tTests = tTestFunction(group)
results[[y]] = tTests<=0.05
y=y+1
}
Your code does not work, so it is a bad example. Consider this:
x <- vector("list", length = 4)
tracemem(x) ## trace memory copies of "x"
for (i in 1:4) x[[i]] <- rnorm(4)
No extra copy of x is made during update. So there is nothing to worry.
As suggested by #lmo, even if you use x <- list() to initialize this list, no memory copy will be incurred, either.
Comment
The aim of my answer, is to refer you to the use of tracemem, when you want to trace (possible) memory copies made during code execution. Had you known this function, you would not ask us here.
Here is my other answer made, related to using tracemem. It is in a different context, though. There, you can see what tracemem would return when memory copies are made.

R: Why does it take so long to parse this data table

I have a data frame df that has 15 columns and 1000000 rows of all ints. My code is:
for(i in 1:nrow(df))
{
if(is.null(df$col1[i]) || .... || is.null(df$col9[i]))
df[-i,] #to delete the row if one of those columns is null
}
This has been running for an hour and is still going. Why? It seems like it should be relatively fast code to run. How can I speed it up?
The reason it is slow is that R is relatively slow at looping through vectors. Most functions in R are vectorized which means you can perform them on a vector at once much faster than it can loop through each element one by one. On a side note, I don't think you have NULLs in your data frame. I think you have NAs so I'm going to assume that is what you have. Even if you have NULLs then the following should still work.
This syntax should give you a nice speed boost.
This will take advantage of rowSums producing NA for every row that has missing values in it.
df<-subset(df, !is.na(rowSums(df[,1:10])))
This syntax should also work.
df<-df[rowSums(is.na(df[,1:10]))==0,]

R language: how to work with dynamically sized vector?

I'm learning R programming, and trying to understand the best approach to work with a vector when you don't know the final size it will end up being. For example, in my case I need to build the vector inside a for loop, but only for some iterations, which aren't know beforehand.
METHOD 1
I could run through the loop a first time to determine the final vector length, initialize the vector to the correct length, then run through the loop a second time to populate the vector. This would be ideal from a memory usage standpoint, since the vector memory would occupy the required amount of memory.
METHOD 2
Or, I could use one for loop, and simply append to the vector as needed, but this would be inefficient from a memory allocation standpoint since a new block may need to be assigned each time a new element is appended to the vector. If you're working with big data, this could be a problem.
METHOD 3
In C or Matlab, I usually initialize the vector length to the largest possible length that I know the final vector could occupy, then populate a subset of elements in the for loop. When the loop completes, I'll re-size the vector length appropriately.
Since R is used a lot in data science, I thought this would be a topic others would have encountered and there may be a best practice that was recommended. Any thoughts?
Canonical R code would use lapply or similar to run the function on each element, then combine the results in some way. This avoids the need to grow a vector or know the size ahead of time. This is the functional programming approach to things. For example,
set.seed(5)
x <- runif(10)
some_fun <- function(x) {
if (x > 0.5) {
return(x)
} else {
return(NULL)
}
}
unlist(lapply(x, some_fun))
The size of the result vector is not specified, but is determined automatically by combining results.
Keep in mind that this is a trivial example for illustration. This particular operation could be vectorized.
I think Method1 is the best approach if you have a very large amount of data. But in general you might want to read this chapter before you make a final decision:
http://adv-r.had.co.nz/memory.html

Yet another apply Questions

I am totally convinced that an efficient R programm should avoid using loops whenever possible and instead should use the big family of the apply functions.
But this cannot happen without pain.
For example I face with a problem whose solution involves a sum in the applied function, as a result the list of results is reduced to a single value, which is not what I want.
To be concrete I will try to simplify my problem
assume N =100
sapply(list(1:N), function(n) (
choose(n,(floor(n/2)+1):n) *
eps^((floor(n/2)+1):n) *
(1- eps)^(n-((floor(n/2)+1):n))))
As you can see the function inside cause length of the built vector to explode
whereas using the sum inside would collapse everything to single value
sapply(list(1:N), function(n) (
choose(n,(floor(n/2)+1):n) *
eps^((floor(n/2)+1):n) *
(1- eps)^(n-((floor(n/2)+1):n))))
What I would like to have is a the list of degree of N.
so what do you think? how can I repair it?
Your question doesn't contain reproducible code (what's "eps"?), but on the general point about for loops and optimising code:
For loops are not incredibly slow. For loops are incredibly slow when used improperly because of how memory is assigned to objects. For primitive objects (like vectors), modifying a value in a field has a tiny cost - but expanding the /length/ of the vector is fairly costly because what you're actually doing is creating an entirely new object, finding space for that object, copying the name over, removing the old object, etc. For non-primitive objects (say, data frames), it's even more costly because every modification, even if it doesn't alter the length of the data.frame, triggers this process.
But: there are ways to optimise a for loop and make them run quickly. The easiest guidelines are:
Do not run a for loop that writes to a data.frame. Use plyr or dplyr, or data.table, depending on your preference.
If you are using a vector and can know the length of the output in advance, it will work a lot faster. Specify the size of the output object before writing to it.
Do not twist yourself into knots avoiding for loops.
So in this case - if you're only producing a single value for each thing in N, you could make that work perfectly nicely with a vector:
#Create output object. We're specifying the length in advance so that writing to
#it is cheap
output <- numeric(length = length(N))
#Start the for loop
for(i in seq_along(output)){
output[i] <- your_computations_go_here(N[i])
}
This isn't actually particularly slow - because you're writing to a vector and you've specified the length in advance. And since data.frames are actually lists of equally-sized vectors, you can even work around some issues with running for loops over data.frames using this; if you're only writing to a single column in the data.frame, just create it as a vector and then write it to the data.frame via df$new_col <- output. You'll get the same output as if you had looped through the data.frame, but it'll work faster because you'll only have had to modify it once.

Removing lines in data.table and spiking memory usage

I have a data.table of a decent size 89M rows, 3.7Gb. Keys are in place so everything is set-up properly. However I am experiencing a problem when I remove rows based on a column's value. The memory usage just goes through the roof!
Just for the record I have read the other posts here about this, but they don't really help much. Also, I am using RStudio which I am pretty sure is not ideal but it helps while experimenting, however I notice the same behaviour in the R console. I am using Windows.
Let me post an example (taken from a similar question regarding removal of rows) of creating a very big data.table approx 1e6x100
rm(list=ls(all=TRUE)) #Clean stuff
gc(reset=TRUE) #Call gc (not really helping but whatever..)
dimension=1e6 #let's say a million
DT = data.table(col1 = 1:dimension)
cols = paste0('col', 2:100) #let these be conditions as columns
for (col in cols){ DT[, col := 1:dimension, with = F] }
DT.m<-melt(DT,id=c('col1','col2','col3'))
Ok so now we have a data.table with 97M rows, approx 1.8Gb. This is our starting point.
Let's remove all rows where the value column (after the melt) is e.g. 4
DT.m<-DT.m[value!=4]
The last line takes a huge amount of memory! Prior to executing this line, in my PC the memory usage is approx 4.3Gb, and just after the line is executed, it goes to 6.9Gb!
This is the correct way to remove the lines, right? (just checking). Has anyone come across this behaviour before?
I thought of looping for all parameters and keeping the rows I am interested in, in another data.table but somehow I doubt that this is a proper way of working.
I am looking forward to your help.
Thanks
Nikos
Update: With this commit, the logical vector is replaced by row indices to save memory (Read the post below for more info). Fixed in 1.9.5.
Doing sum(DT.m$value == 4L) gives me 97. That is, you're removing a total of 97 rows from 97 million. This in turn implies that the subset operation would return ~1.8GB data set as well.
Your memory usage was 4.3GB to begin with
The condition you provide value == 4 takes the space of a logical vector of size 97 million =~360MB.
data.table computes a which(that_value) to fetch indices = almost all the rows = another 360MB
The data that's being subset has to be allocated elsewhere first, and that's ~1.8GB.
Total comes to 4.3+1.8+0.72 =~ 6.8GB
And garbage collection hasn't happened yet. If you now do gc(), the memory corresponding to old DT.m should be released.
The only place where I can see we can save space is by replacing the logical vector with the integer vector (rather than storing the integer indices in another vector) to save the extra 360MB of space.
Usually which results in a much smaller (negligible) value - and therefore subset is faster - that being the reason for using which(). But in this case, you remove 97 rows.
But good to know that we can save a bit of memory. Could you please file an issue here?
Removing rows by reference, #635, when implemented, should both be fast and memory efficient.

Resources