Make a shallow copy in data.table - r

I read, in an SO topic, an answer from Matt Dowle about a shallow function to make shallow copies in data.table. However, I can't find that topic again.
data.table does not export any function called shallow. There is an internal one, but it is not documented. Can I use it safely? What is its behavior?
What I would like is a memory-efficient copy of a big table. Let DT be a big table with n columns and f a function which adds a column in a memory-efficient way. Is something like this possible?
DT2 = f(DT)
with DT2 being a data.table whose n columns point to the original addresses (no deep copies), plus an extra column existing only for DT2. If so, what happens to DT if I do DT2[, col3 := NULL]?

You can't use data.table:::shallow safely, no. It is deliberately not exported and not meant for user use, either from the point of view of it continuing to work or of its name and arguments staying stable in future.
Having said this, you could decide to use it as long as you can either i) guarantee that := or set* won't be called on the result, either by you or by your users (if you're creating a package), or ii) accept that if := or set* is called on the result, then both objects will be changed by reference. When shallow is used internally by data.table, that's what we promise ourselves.
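For example, a minimal sketch of caveat (ii). The table and column names here are invented, and since shallow is unexported, its signature and behaviour may change in any release:
library(data.table)

DT  <- data.table(a = 1:3, b = 4:6)
DT2 <- data.table:::shallow(DT)  # column vectors are shared, not copied

DT2[, c := 7:9]  # adds a column by reference to DT2 ...
names(DT)        # ... which may or may not also show up in DT, depending on the
                 # version and on over-allocation -- exactly the uncertainty that
                 # makes relying on the unexported function unsafe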
More background in this answer from a few days ago:
https://stackoverflow.com/a/45891502/403310
In that question I asked for the bigger picture: why is this needed? Having that clear would help to raise the priority in either investigating ALTREP or perhaps doing our own reference count.
In your question you alluded to your bigger picture, and that is very useful. So you'd like to create a function which adds working columns to a big data.table inside the function but doesn't change the big data.table. Can you explain more about why you'd like to create a function like that? Why not load the big data.table, add the ephemeral working columns directly to it, and then proceed? Your R session is already a working copy in memory of the data, which is persistent somewhere else.
Note that I am not saying no. I'm not saying that you don't have a valid reason. I'm asking to discover more about that valid reason so the priority can be raised.
If that isn't the answer you had seen, there are currently 39 questions or answers returned by the search string "[data.table] shallow". Worst case, you could trawl through those to find it again.

Related

How to use daply (from plyr) on 2 billion rows using less memory

Does anyone know how one could apply the following function, which converts a 3-column table into a matrix, to a file that has 2 billion rows (with less than 10 GB of memory)?
Here x is the 1st, y the 2nd and z the 3rd column.
library(plyr)
daply(a, .(x, y), function(x) x$z)
If you cannot load all the tuples at once
I know this is not the answer you are looking for: use SQLite.
The problem with R is that it must load the entire frame at once. If you don't have enough memory, then it simply can't continue.
SQLite is way smarter than R at doing aggregates. Perhaps the most important feature is that it optimizes for the memory available and, if it can, it does not need to read all the elements at once. See this for details on how to do it:
http://www.r-bloggers.com/using-sqlite-in-r/
If SQLite does not support the aggregate you want, you can create it yourself (see user defined functions in SQLite).
Alternatively, you can try to partition your data (outside R) so you can aggregate in stages. But that will still require some sort of program that can read and process the files in less than the available memory. Unix/macOS/Linux sort is one of those utilities that can deal with more-than-available-memory data; it might be useful.
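A rough sketch of the SQLite route via the RSQLite package, in case it helps. The file name "triples.csv", the table name and the chunk size are hypothetical, and the aggregate shown is a plain GROUP BY rather than an exact replacement for daply:
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "triples.db")

# Stream the CSV into SQLite in chunks so R never holds all the rows at once
infile <- file("triples.csv", "r")
header <- strsplit(readLines(infile, n = 1), ",")[[1]]
first  <- TRUE
repeat {
  lines <- readLines(infile, n = 1e6)
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  if (first) {
    dbWriteTable(con, "triples", chunk)   # first chunk creates the table
    first <- FALSE
  } else {
    dbWriteTable(con, "triples", chunk, append = TRUE)
  }
}
close(infile)

# Let SQLite do the aggregation; only the (much smaller) result comes back to R
res <- dbGetQuery(con,
  "SELECT x, y, group_concat(z) AS z FROM triples GROUP BY x, y")
dbDisconnect(con)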

attr(*, "internal.selfref")=<externalptr> appearing in data.table Rstudio

I am a new user of the R data.table package, and I have noticed something unusual in my data.tables that I have not found explained in the documentation or elsewhere on this site.
When using data.table package within Rstudio, and viewing a specific data.table within the 'Environment' panel, I see the following string appearing at the end of the data.table
attr(*,"internal.selref")=<externalptr>
If I print the same data.table within the Console, this string does not appear.
Is this a bug, or just an inherent feature of data.table (or Rstudio)? Should I be concerned about whether this is affecting how these data are handled by downstream processes?
The versions I am running are as follows:
data.table Version 1.9.6
Rstudio Version 0.99.447
OSX 10.10.5
Apologies in advance if this is just me being an ignorant newbie.
I actually asked Matt Dowle, the primary author of the data.table package, this very question a little while ago.
Is this a bug, or just an inherent feature of data.table (or Rstudio)?
Apparently this attribute is used internally by data.table. It isn't a bug in RStudio; in fact, RStudio is just doing its job of showing the attributes of the object.
Should I be concerned about whether this is affecting how these data are handled by downstream processes?
No, this isn't going to affect anything.
For those who are curious about why this attribute is created, I believe it's explained in the data.table manual under the section for setkey():
In v1.7.8, the key<- syntax was deprecated. The <- method copies the whole table and we know of no way to avoid that copy without a change in R itself. Please use the set* functions instead, which make no copy at all. setkey accepts unquoted column names for convenience, whilst setkeyv accepts one vector of column names.
The problem (for data.table) with the copy by key<- (other than being slower) is that R doesn't maintain the over-allocated truelength, but it looks as though it has. Adding a column by reference using := after a key<- was therefore a memory overwrite and eventually a segfault; the over-allocated memory wasn't really there after key<-'s copy. data.tables now have an attribute .internal.selfref to catch and warn about such copies. This attribute has been implemented in a way that is friendly with identical() and object.size().
For the same reason, please use the other set* functions which modify objects by reference, rather than using the <- operator, which results in copying the entire object.
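A small illustration with made-up data of where the attribute lives and why set* is preferred (the pointer value printed will of course differ):
library(data.table)

DT <- data.table(id = c(2L, 1L, 3L), val = c(10, 20, 30))

# The over-allocation pointer RStudio displays is just an ordinary attribute:
attr(DT, ".internal.selfref")
# <pointer: 0x...>

# Preferred: set the key by reference; no copy of the table is made
setkey(DT, id)

# The deprecated key(DT) <- "id" form copies the whole table, which is the
# situation .internal.selfref exists to detect and warn about later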

Wrapper functions for data.table

I have a project that has already been written in terms of data.frame. In order to improve calc times, I'm trying to leverage the speed of data.table instead. My approach has been to construct wrapper functions that read in frames, convert them to tables, do the calculations and then convert back to frames. Here's one of the simple examples...
FastAgg <- function(x, FUN, aggFields, byFields = NULL, ...) {
  require('data.table')
  y <- setDT(x)
  y <- y[, lapply(X = .SD, FUN = FUN, ...), .SDcols = aggFields, by = byFields]
  y <- data.frame(y)
  y
}
The problem I'm having is that after running this function, x has been converted to a data.table, and lines of code that I have written using data.frame notation then fail. How do I make sure that the data.frame I feed in is unchanged by running the function?
For your case, I'd recommend (of course) using data.table throughout and not just inside a function :-).
But if that's not likely to happen, then I'd recommend the setDT + setDF setup. Use setDT outside the function (and provide the data.table as input) to convert your data.frame to a data.table by reference, and then, after finishing the operations you'd like, use setDF to convert the result back to a data.frame and return that from the function. Be aware, however, that doing setDT(x) changes x itself into a data.table, as it operates by reference.
If that is not ideal, then use as.data.table(.) inside your function, as it operates on a copy. Then, you can still use setDF() to convert the resulting data.table to data.frame and return that data.frame from your function.
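For example, a sketch of the wrapper along those lines, keeping the question's argument names (FUN, aggFields and byFields are assumed to be as in the original function):
library(data.table)

FastAgg <- function(x, FUN, aggFields, byFields = NULL, ...) {
  y <- as.data.table(x)  # copies, so the caller's data.frame 'x' is untouched
  y <- y[, lapply(.SD, FUN, ...), .SDcols = aggFields, by = byFields]
  setDF(y)               # convert the result back to a data.frame, by reference
  y
}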
These functions were introduced recently (mostly due to user requests). The idea, to avoid this confusion, is to export a shallow() function, keep track of objects whose columns need to be copied, and do it all internally (and automatically). It's all in very early stages right now. When we've managed it, I'll update this post.
Also have a look at ?copy, ?setDT and ?setDF. The first paragraph of these functions' help pages is:
In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=. Check out the See Also section below for other set* functions data.table provides.
And the example for setDT:
set.seed(45L)
X = data.frame(A = sample(3, 10, TRUE),
               B = sample(letters[1:3], 10, TRUE),
               C = sample(10), stringsAsFactors = FALSE)
# get the frequency of each "A,B" combination
setDT(X)[, .N, by = "A,B"][]
does no assignment (although I admit it could be explained slightly better here).
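Note that X itself has still been converted in place by setDT, even though the chained result was only printed; a quick check afterwards:
class(X)
# [1] "data.table" "data.frame"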
In setDF:
X = data.table(x=1:5, y=6:10)
## convert 'X' to data.frame, without any copy.
setDF(X)
I think this is pretty clear, but I'll try to provide more clarity. I'll also try to add to the documentation how best to use these functions.

R How to add multiple columns (:=) to data.table based on results of rollapplyr in .SD

I am trying to add some columns to a large data.table that are based on rolling calculations split by a unique identifier.
Based on Equivalent to ddply(...,transform,...) in data.table I have generated this statement:
alldefaults[, `:=`(.SD[, list(obsPaymentDownMax24m = rollapplyr(obsPaymentsDown, 24, max, partial = TRUE),
                              obsPaymentDownAvg24m = rollapplyr(obsPaymentsDown, 24, mean, partial = TRUE),
                              obsPaymentDownMax12m = rollapplyr(obsPaymentsDown, 12, max, partial = TRUE),
                              obsPaymentDownAvg12m = rollapplyr(obsPaymentsDown, 12, mean, partial = TRUE))]),
            by = OriginalApplicationID]
It produces an error
Error in `[.data.table`(alldefaults, , `:=`(.SD[, list(obsPaymentDownMax24m = rollapplyr(obsPaymentsDown, :
In `:=`(col1=val1, col2=val2, ...) form, all arguments must be named.
When I run this without the := call it works well, but it produces a new dataset, and joining that back on would then be required.
Inserting the assignment within .SD
alldefaults[, .SD[, `:=`(list(obsPaymentDownMax24m = rollapplyr(obsPaymentsDown, 24, max, partial = TRUE),
                              obsPaymentDownAvg24m = rollapplyr(obsPaymentsDown, 24, mean, partial = TRUE),
                              obsPaymentDownMax12m = rollapplyr(obsPaymentsDown, 12, max, partial = TRUE),
                              obsPaymentDownAvg12m = rollapplyr(obsPaymentsDown, 12, mean, partial = TRUE)))],
            by = OriginalApplicationID]
Produces this error
Error in `[.data.table`(.SD, , `:=`(list(obsPaymentDownMax24m = rollapplyr(obsPaymentsDown, :
.SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.
Is there a trick to making this update work that I've missed?
PS - Not sure if this is a hefty enough question to require a reproducible example, as it seems primarily syntax-oriented and it is hopefully easy to point out what the statement ought to be. Also, if anyone has recommendations for making this faster still, I'd be very appreciative!
I'm guessing (guessing, because the question is not well formed, which is probably why you got downvoted) you want to do this:
alldefaults[, `:=`(obsPaymentDownMax24m = rollapplyr(obsPaymentsDown, 24, max, partial = TRUE),
                   obsPaymentDownAvg24m = rollapplyr(obsPaymentsDown, 24, mean, partial = TRUE),
                   obsPaymentDownMax12m = rollapplyr(obsPaymentsDown, 12, max, partial = TRUE),
                   obsPaymentDownAvg12m = rollapplyr(obsPaymentsDown, 12, mean, partial = TRUE)),
            by = OriginalApplicationID]
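As a quick sanity check that the pattern works, a toy example (the table and column names are invented, and rollapplyr is assumed to come from the zoo package):
library(data.table)
library(zoo)

toy <- data.table(id = rep(1:2, each = 5), pay = rnorm(10))
toy[, `:=`(payMax3 = rollapplyr(pay, 3, max, partial = TRUE),
           payAvg3 = rollapplyr(pay, 3, mean, partial = TRUE)),
    by = id]
toy  # two new columns added by reference, computed within each id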

Out of memory when modifying a big R data.frame

I have a big data frame taking about 900MB ram. Then I tried to modify it like this:
dataframe[[17]][37544]=0
It seems that this makes R use more than 3 GB of RAM, and R complains "Error: cannot allocate vector of size 3.0 Mb" (I am on a 32-bit machine).
I found this way is better:
dataframe[37544, 17]=0
but R's footprint still doubled and the command takes quite some time to run.
Coming from a C/C++ background, I am really confused by this behavior. I thought something like dataframe[37544, 17]=0 should complete in a blink without costing any extra memory (only one cell should be modified). What is R doing for the commands I posted? And what is the right way to modify some elements of a data frame without doubling the memory footprint?
Thanks so much for your help!
Tao
Following up on Joran suggesting data.table, here are some links. Your object, at 900MB, is manageable in RAM even in 32bit R, with no copies at all.
When should I use the := operator in data.table?
Why has data.table defined := rather than overloading <-?
Also, data.table v1.8.0 (not yet on CRAN but stable on R-Forge) has a set() function which provides even faster assignment to elements, as fast as assignment to matrix (appropriate for use inside loops for example). See latest NEWS for more details and example. Also see ?":=" which is linked from ?data.table.
And, here are 12 questions on Stack Overflow with the data.table tag containing the word "reference".
For completeness :
require(data.table)
DT = as.data.table(dataframe)
# say column name 17 is 'Q' (i.e. LETTERS[17])
# then any of the following :
DT[37544, Q := 0]                  # using column name (often preferred)
DT[37544, 17 := 0, with = FALSE]   # using column number
col = "Q"
DT[37544, col := 0, with = FALSE]  # variable holding name
col = 17
DT[37544, col := 0, with = FALSE]  # variable holding number
set(DT, 37544L, 17L, 0)            # using set(i, j, value) in v1.8.0
set(DT, 37544L, "Q", 0)
But, please do see linked questions and the package's documentation to see how := is more general than this simple example; e.g., combining := with binary search in an i join.
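For instance, the binary-search form mentioned above might look like this (the looked-up value and the new flag column are invented for illustration):
setkey(DT, Q)            # sort by Q once; subsequent i joins use binary search
DT[.(0), flag := TRUE]   # add 'flag' by reference, only for rows where Q == 0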
Look up 'copy-on-write' in the context of R discussions related to memory. As soon as one part of a (potentially really large) data structure changes, a copy is made.
A useful rule of thumb is that if your largest object is N mb/gb/... large, you need around 3*N of RAM. Such is life with an interpreted system.
Years ago, when I had to handle large amounts of data on (relative to the data volume) relatively low-RAM 32-bit machines, I got good use out of early versions of the bigmemory package. It uses the 'external pointer' interface to keep large gobs of memory outside of R. That saves you not only the '3x' factor, but possibly more, as you may get away with non-contiguous memory (contiguity being the other thing R likes).
Data frames are the worst structure you can choose to make modifications to. Due to the quite complex handling of all their features (such as keeping row names in sync, partial matching, etc.), which is done in pure R code (unlike most other objects, which can go straight to C), they tend to force additional copies, as you can't edit them in place. Check R-devel for the detailed discussions on this; it has been discussed at length several times.
The practical rule is to never use data frames for large data unless you treat them read-only. You will be orders of magnitude more efficient if you work on either vectors or matrices.
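One way to see this for yourself (tracemem() is in base R; how many copies get reported depends on your R version):
df <- data.frame(x = runif(1e6), y = runif(1e6))
m  <- cbind(x = runif(1e6), y = runif(1e6))

tracemem(df)
df[1, 1] <- 0   # typically prints one or more tracemem[...] copy messages

tracemem(m)
m[1, 1] <- 0    # usually no copy is reported for the matrix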
There is a type of object called an ffdf in the ff package, which is basically a data.frame stored on disk. In addition to the other tips above, you can try that.
You can also try the RSQLite package.
