Column slots in data.table

I have a dataset x with 350m rows and 4 columns. When joining two columns from a dataset i of 13m rows and 19 columns, I encounter the following error:
Internal logical error. DT passed to assign has not been allocated enough column slots. l=4, tl=4, adding 1
I have checked Not Enough Columns Slots but there the problem appears to be in the number of columns. Since I have only a few, I would be surprised if this was the issue.
Also, I found https://github.com/Rdatatable/data.table/issues/1830, where the error is related to "column slots", but I do not understand what they are. When checking truelength, I obtain
> truelength(x)
[1] 0
> truelength(i)
[1] 0
My understanding is that setting, for example, alloc.col(x,32) or alloc.col(i,32), or both, could solve the issue. However, I don't understand what this does and what the issue is. Can anyone offer an explanation?

Part of what makes data.table so efficient is it tries to be smart about memory usage (whereas base data.frames tend to end up getting copied left and right in regular usage, e.g., setting names(DF) = col_names can actually copy all of DF despite only manipulating an attribute of the object).
Part of this, in turn, is that a data.table is always allocated a certain size in memory to allow for adding/subtracting column pointers more fluidly (from a memory perspective).
So, while actual columns take memory greedily (when they're created, sufficient memory is claimed to store the nrow(DT)-size vector), the column pointers, which store the addresses where the actual data can be found (you can think of them roughly like column names, if you don't know the grittier details of pointers), have a fixed memory slot upon creation.
alloc.col forces the column pointer address reserve process; this is most commonly used in two cases:
Your data needs a lot of columns (by default, room is allocated for 1024 pointers more than there are columns at definition)
You've loaded your data from RDS (since readRDS/load don't know to allocate this memory for a data.table upon loading, we have to trigger this ourselves)
I assume Frank is right and that you're experiencing the latter. See ?alloc.col for some more details, but in most cases, you should just run alloc.col(x) and alloc.col(i) -- except on highly constrained machines, allocating 1024 column pointers requires relatively little memory, so you shouldn't spend too much effort skimping and trying to figure out the right quantity.
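A minimal sketch of the fix, assuming x was loaded from an RDS file (the file name here is hypothetical); alloc.col() (called setalloccol() in recent versions) re-allocates the spare column pointer slots in place:
library(data.table)
x <- readRDS("x.rds")   # hypothetical path; x was saved as a data.table
truelength(x)           # 0: the over-allocation was lost during serialization
alloc.col(x)            # re-allocate spare column slots, by reference
truelength(x)           # now larger than ncol(x), so columns can be added by reference again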

Related

For memory, what should be done when you need to constantly grow a vector to an unknown upper limit?

Suppose that you are dealing with a potentially infinite amount of data. Suppose further that you do not have this data stored in memory, but can generate individual terms at will. Finally, suppose that you want to do some experiment on this data that will involve checking a large but unknown number of terms in a way that necessitates keeping a great many of them in memory. Toy problems with Recamán's sequence, like "find the minimum number of terms needed in that sequence for the first 25 even numbers to have appeared", are what I have in mind as typical examples.
The obvious solution to this sort of problem would be to write some code like:
list <- c(first_term)                 # start with the first term
while (not_found_enough_terms_yet) {  # placeholder stopping condition
  next_term <- generate_next_term()   # placeholder generator for the sequence
  if (term_is_wanted(next_term)) {
    list <- c(list, next_term)        # grow the vector by one element each time
  }
}
However, building a big vector like this by adding one new term at a time is your memory's worst nightmare. The alternative that I often see suggested is to pre-allocate a big vector in memory by making the first line of your code something like list<-numeric(10^6), but those solutions suppose that we have some rough idea of how many terms we need to check, which isn't always the case. So what can we do when we are dealing with an ever-growing list of unknown required length?
This is a very popular subject in R; check this answer: https://stackoverflow.com/a/45195098/5442527
Summing up:
Do not use c() to append; assigning into a pre-allocated vector by index with [ is much faster. It might seem surprising that you can grow a pre-allocated vector: create an iterator variable before the while loop and increment it inside the if statement.
In Python, by contrast, you normally do not have to care about this when using append. Even starting with an empty list is not a problem, because the reserved memory grows roughly geometrically (x2, x2, x1.5, x1.2, ...) once the number of elements passes certain thresholds. Link: Over-allocating
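Below is a minimal sketch of the pre-allocate-and-grow pattern described above; the generator and the stopping/acceptance rules (next_term(), enough_terms(), term_is_wanted()) are placeholders for whatever your experiment actually does:
buf <- numeric(1024)                      # initial capacity, not the final length
n   <- 0L                                 # number of terms actually stored
while (!enough_terms(buf[seq_len(n)])) {  # placeholder stopping rule
  x <- next_term()                        # placeholder generator
  if (term_is_wanted(x)) {
    n <- n + 1L
    if (n > length(buf)) {
      length(buf) <- 2L * length(buf)     # double capacity; copies happen only O(log n) times
    }
    buf[n] <- x                           # assign by index instead of c(buf, x)
  }
}
result <- buf[seq_len(n)]                 # trim the unused tail once at the end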

Metadata of a Spark DataFrame (RDD)

I am benchmarking Spark in R via "sparklyr" and "SparkR". I test different functions on different test data. In two particular cases, where I count the number of zeros in a column and the number of NAs in a column, I realized that no matter how big the data is, the result arrives in less than a second. All the other computations scale with the size of the data.
So I don't think that Spark actually computes anything there; rather, I suspect those values are stored somewhere in the metadata and were computed while the data was loaded. I tested my functions and they always give me the right result.
Can anyone confirm whether the number of zeros and number of nulls in a column is stored in a dataframe's metadata, and if not, why does it return so quickly with the correct value?
There is no metadata associated with a Spark DataFrame that would contain columnar data; therefore, my guess is that the performance difference you measured is caused by something else. It is hard to tell without a reproducible example.
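For what it's worth, here is a sketch of the kind of reproducible timing check that would help (the connection settings, table name, and column x are placeholder choices); collect() forces the job to run, so the timing covers the actual computation rather than just the construction of the lazy query:
library(sparklyr)
library(dplyr)
sc  <- spark_connect(master = "local")                      # placeholder connection
tbl <- copy_to(sc, data.frame(x = rnorm(1e6)), "testdata",  # placeholder test data
               overwrite = TRUE)
system.time({
  res <- tbl %>%
    summarise(n_zero = sum(as.integer(x == 0), na.rm = TRUE),
              n_na   = sum(as.integer(is.na(x)), na.rm = TRUE)) %>%
    collect()                                               # forces execution
})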

Removing lines in data.table and spiking memory usage

I have a data.table of a decent size: 89M rows, 3.7GB. Keys are in place, so everything is set up properly. However, I am experiencing a problem when I remove rows based on a column's value. The memory usage just goes through the roof!
Just for the record I have read the other posts here about this, but they don't really help much. Also, I am using RStudio which I am pretty sure is not ideal but it helps while experimenting, however I notice the same behaviour in the R console. I am using Windows.
Let me post an example (taken from a similar question regarding removal of rows) of creating a very big data.table of approx. 1e6 x 100:
rm(list=ls(all=TRUE)) #Clean stuff
gc(reset=TRUE) #Call gc (not really helping but whatever..)
dimension=1e6 #let's say a million
DT = data.table(col1 = 1:dimension)
cols = paste0('col', 2:100) #let these be conditions as columns
for (col in cols){ DT[, (col) := 1:dimension] }
DT.m<-melt(DT,id=c('col1','col2','col3'))
Ok so now we have a data.table with 97M rows, approx 1.8Gb. This is our starting point.
Let's remove all rows where the value column (after the melt) is e.g. 4
DT.m<-DT.m[value!=4]
The last line takes a huge amount of memory! Prior to executing this line, the memory usage on my PC is approx. 4.3GB, and just after the line is executed, it goes to 6.9GB!
This is the correct way to remove the lines, right? (just checking). Has anyone come across this behaviour before?
I thought of looping for all parameters and keeping the rows I am interested in, in another data.table but somehow I doubt that this is a proper way of working.
I am looking forward to your help.
Thanks
Nikos
Update: With this commit, the logical vector is replaced by row indices to save memory (Read the post below for more info). Fixed in 1.9.5.
Doing sum(DT.m$value == 4L) gives me 97. That is, you're removing a total of 97 rows from 97 million. This in turn implies that the subset operation would return ~1.8GB data set as well.
Your memory usage was 4.3GB to begin with
The condition you provide, value != 4, takes the space of a logical vector of size 97 million =~ 360MB.
data.table computes which() on that logical vector to fetch the indices = almost all the rows = another ~360MB.
The data that's being subset has to be allocated elsewhere first, and that's ~1.8GB.
Total comes to 4.3+1.8+0.72 =~ 6.8GB
And garbage collection hasn't happened yet. If you now do gc(), the memory corresponding to old DT.m should be released.
The only place where I can see we could save space is by using the integer indices directly in place of the logical vector (rather than storing them in a second vector), saving the extra ~360MB.
Usually which() results in a much smaller (negligible-size) vector of indices, and the subset is therefore faster; that is the reason for using which(). But in this case you remove only 97 rows, so nearly all the indices are kept.
But good to know that we can save a bit of memory. Could you please file an issue here?
Removing rows by reference, #635, when implemented, should both be fast and memory efficient.
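For reference, here is a scaled-down sketch you could use to watch this accounting happen on your own machine (the table here is much smaller so it runs comfortably; gc()'s "max used" column shows the peak since the reset):
library(data.table)
DT.m <- data.table(value = sample(1:100, 1e7, replace = TRUE))  # ~10M rows
gc(reset = TRUE)           # record a clean baseline
DT.m <- DT.m[value != 4]   # as described above: the logical vector, the indices and
                           # the new table all coexist with the old table at the peak
gc()                       # the old table's memory is reclaimed only at this point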

Why does changing a column name take an extremely long time with a large data.frame?

I have a data.frame in R with 19 million rows and 90 columns. I have plenty of spare RAM and CPU cycles. It seems that changing a single column name in this data frame is a very intense operation for R.
system.time(colnames(my.df)[1] <- "foo")
user system elapsed
356.88 16.54 373.39
Why is this so? Does every row store the column name somehow? Is this creating an entirely new data frame? It seems this operation should complete in negligible time. I don't see anything obvious in the R manual entry.
I'm running build 7600 of R (64bit) on Windows 7, and in my current workspace, setting colnames on a small data.frame takes '0' time according to system.time().
Edit: I'm aware of the possibility of using data.table, and, honestly, I can wait 5 minutes for the rename to complete whilst I go get some tea. What I'm interested in is what is happening and why?
As several commenters have mentioned, renaming data frame columns is slow, because (depending on how you do it) it makes between 1 and 4 copies of the entire data.frame. Here, from data.table's ?setkey help page, is the nicest way of demonstrating this behavior that I've seen:
DF = data.frame(a=1:2,b=3:4) # base data.frame to demo copies
try(tracemem(DF)) # try() for non-Windows where R is
# faster without memory profiling
colnames(DF)[1] <- "A" # 4 copies of entire object
names(DF)[1] <- "A" # 3 copies of entire object
names(DF) <- c("A", "b") # 1 copy of entire object
`names<-`(DF,c("A","b")) # 1 copy of entire object
x=`names<-`(DF,c("A","b")) # still 1 copy (so not print method)
# What if DF is large, say 10GB in RAM. Copy 10GB just to change a column name?
To (start) understanding why things are done this way, you'll probably need to delve into some of the related discussions on R-devel. Here are a couple: R-devel: speeding up perception and R-devel: Confused about NAMES
My impressionistic reading of those threads is that:
At least one copy is made so that modifications to it can be 'tried out' before overwriting the original. Thus, if something is wrong with the value-to-be-reassigned, [<-.data.frame or names<- can 'back out' and deliver an error message without having done any damage to the original object.
Several members of R-core aren't completely satisfied with how things are working right now. Several folks explain that in some cases "R loses track"; Luke Tierney indicates that he's tried some modifications relating to this copying in the past "in a few cases and always had to back off"; and Simon Urbanek hints that "there may be some things coming up, too"
(As I said, though, that's just impressionistic: I'm simply not able to follow a full conversation about the details of R's internals!)
Also relevant, in case you haven't seen it, here's how something like names(z)[3] <- "c2" "really" works:
# From ?names<-
z <- "names<-"(z, "[<-"(names(z), 3, "c2"))
Note: Much of this answer comes from Matthew Dowle's answer to this other question. (I thought it was worth placing it here, and giving it some more exposure, since it's so relevant to your own question).
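For contrast, a small sketch of the by-reference rename that data.table offers (assuming converting my.df with setDT() is acceptable in your workflow):
library(data.table)
setDT(my.df)                              # convert to data.table in place, without copying the data
setnames(my.df, 1L, "foo")                # rename the first column by reference
system.time(setnames(my.df, 1L, "bar"))   # effectively instantaneous: no copies are made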

Out of memory when modifying a big R data.frame

I have a big data frame taking about 900MB ram. Then I tried to modify it like this:
dataframe[[17]][37544]=0
It seems that makes R use more than 3GB of RAM, and R complains "Error: cannot allocate vector of size 3.0 Mb" (I am on a 32-bit machine).
I found this way is better:
dataframe[37544, 17]=0
but R's footprint still doubled and the command takes quite some time to run.
From a C/C++ background, I am really confused by this behavior. I thought something like dataframe[37544, 17]=0 should complete in a blink without costing any extra memory (only one cell should be modified). What is R doing for those commands I posted? What is the right way to modify some elements of a data frame without doubling the memory footprint?
Thanks so much for your help!
Tao
Following up on Joran suggesting data.table, here are some links. Your object, at 900MB, is manageable in RAM even in 32bit R, with no copies at all.
When should I use the := operator in data.table?
Why has data.table defined := rather than overloading <-?
Also, data.table v1.8.0 (not yet on CRAN but stable on R-Forge) has a set() function which provides even faster assignment to elements, as fast as assignment to matrix (appropriate for use inside loops for example). See latest NEWS for more details and example. Also see ?":=" which is linked from ?data.table.
And, here are 12 questions on Stack Overflow with the data.table tag containing the word "reference".
For completeness :
require(data.table)
DT = as.data.table(dataframe)
# say column name 17 is 'Q' (i.e. LETTERS[17])
# then any of the following :
DT[37544, Q:=0] # using column name (often preferred)
DT[37544, 17:=0, with=FALSE] # using column number
col = "Q"
DT[37544, col:=0, with=FALSE] # variable holding name
col = 17
DT[37544, col:=0, with=FALSE] # variable holding number
set(DT,37544L,17L,0) # using set(i,j,value) in v1.8.0
set(DT,37544L,"Q",0)
But, please do see linked questions and the package's documentation to see how := is more general than this simple example; e.g., combining := with binary search in an i join.
Look up 'copy-on-write' in the context of R discussions related to memory. As soon as one part of a (potentially really large) data structure changes, a copy is made.
A useful rule of thumb is that if your largest object is N mb/gb/... large, you need around 3*N of RAM. Such is life with an interpreted system.
Years ago, when I had to handle large amounts of data on 32-bit machines with relatively little RAM (relative to the data volume), I got good use out of early versions of the bigmemory package. It uses the 'external pointer' interface to keep large gobs of memory outside of R. That saves you not only the '3x' factor, but possibly more, as you may get away with non-contiguous memory (contiguous allocation being the other thing R likes).
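A rough sketch of that approach with the bigmemory package (assuming your object can be represented as a single numeric matrix, which big.matrix requires; the indices are the ones from the question):
library(bigmemory)
bm <- as.big.matrix(as.matrix(dataframe))  # data lives outside R's heap, behind an external pointer
bm[37544, 17] <- 0                         # modifies one cell in place, no copy of the whole object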
Data frames are the worst structure you can choose to make modifications to. Due to the quite complex handling of all their features (such as keeping row names in sync, partial matching, etc.), which is done in pure R code (unlike most other objects, which can go straight to C), they tend to force additional copies, as you can't edit them in place. Check R-devel for the detailed discussions on this - it has been discussed at length several times.
The practical rule is to never use data frames for large data, unless you treat them read-only. You will be orders of magnitude more efficient if you either work on vectors or matrices.
There is a type of object called ffdf in the ff package, which is basically a data.frame stored on disk. In addition to the other tips above, you can try that.
You can also try the RSQLite package.
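A minimal sketch of the RSQLite route (the file name and the use of SQLite's implicit rowid to address row 37544 are assumptions; column 17 is called 'Q' as in the example further up):
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "big.sqlite")                   # hypothetical database file
dbWriteTable(con, "df", dataframe)                         # one-time copy of the data to disk
dbExecute(con, "UPDATE df SET Q = 0 WHERE rowid = 37544")  # modify a single cell on disk
dbDisconnect(con)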
