I am a new user of the R data.table package, and I have noticed something unusual in my data.tables that I have not found explained in the documentation or elsewhere on this site.
When using the data.table package within RStudio and viewing a specific data.table in the 'Environment' panel, I see the following string appearing at the end of the data.table:
attr(*, ".internal.selfref")=<externalptr>
If I print the same data.table within the Console, this string does not appear.
Is this a bug, or just an inherent feature of data.table (or RStudio)? Should I be concerned about whether this is affecting how these data are handled by downstream processes?
The versions I am running are as follows:
data.table Version 1.9.6
RStudio Version 0.99.447
OSX 10.10.5
Apologies in advance if this is just me being an ignorant newbie.
I actually asked Matt Dowle, the primary author of the data.table package, this very question a little while ago.
Is this a bug, or just an inherent feature of data.table (or Rstudio)?
Apparently this attribute is used internally by data.table; it isn't a bug, and RStudio is in fact just doing its job of showing the attributes of the object.
Should I be concerned about whether this is affecting how these data are handled by downstream processes?
No, this isn't going to affect anything.
For those who are curious about why this attribute is created, I believe it's explained in the data.table manual under the section for setkey():
In v1.7.8, the key<- syntax was deprecated. The <- method copies the whole table and we know of
no way to avoid that copy without a change in R itself. Please use the set* functions instead, which
make no copy at all. setkey accepts unquoted column names for convenience, whilst setkeyv
accepts one vector of column names.
The problem (for data.table) with the copy by key<- (other than being slower) is that R doesn’t
maintain the over allocated truelength, but it looks as though it has. Adding a column by reference
using := after a key<- was therefore a memory overwrite and eventually a segfault; the
over allocated memory wasn’t really there after key<-’s copy. data.tables now have an attribute
.internal.selfref to catch and warn about such copies. This attribute has been implemented in
a way that is friendly with identical() and object.size().
For the same reason, please use the other set* functions which modify objects by reference, rather
than using the <- operator which results in copying the entire object.
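For anyone who wants to see the attribute for themselves, here is a minimal sketch (the table and column names are just placeholders):
require(data.table)
DT = data.table(a = 1:3, b = 4:6)
attributes(DT)     # includes .internal.selfref = <externalptr> alongside names, row.names and class
setkey(DT, a)      # keys by reference; no copy is made
DT[, c := a + b]   # adds a column by reference into the over-allocated slots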
I read in an SO topic an answer from Matt Dowle about a shallow function to make shallow copies in data.table. However, I can't find the topic again.
data.table does not have any exported function called shallow. There is an internal one but not documented. Can I use it safely? What is its behavior?
What I would like to do is a memory-efficient copy of a big table. Let DT be a big table with n columns and f a function which adds a column in a memory-efficient way. Is something like this possible?
DT2 = f(DT)
with DT2 being a data.table whose n columns point to the original addresses (no deep copies), plus an extra column existing only for DT2. If yes, what happens to DT if I do DT2[, col3 := NULL]?
You can't use data.table:::shallow safely, no. It is deliberately not exported and not meant for user use, either from the point of view of it working as intended or of its name and arguments staying stable in future.
Having said this, you could decide to use it as long as you can either i) guarantee that := or set* won't be called on the result either by you or your users (if you're creating a package) or ii) if := or set* is called on the result then you're ok with both objects being changed by reference. When shallow is used internally by data.table, that's what we promise ourselves.
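To make the 'changed by reference' point concrete, here is a small sketch using plain assignment (which, like a shallow copy, shares the underlying columns) versus the exported copy():
require(data.table)
DT  = data.table(a = 1:3, b = 4:6)
DT2 = DT           # no copy at all; both names point to the same columns
DT2[, c := 7:9]    # := works by reference, so DT now has column c too
DT3 = copy(DT)     # deep copy; safe to modify independently
DT3[, d := 0]      # DT is untouched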
There is more background in this answer from a few days ago:
https://stackoverflow.com/a/45891502/403310
In that question I asked for the bigger picture: why is this needed? Having that clear would help to raise the priority in either investigating ALTREP or perhaps doing our own reference count.
In your question you alluded to your bigger picture and that is very useful. So you'd like to create a function which adds working columns to a big data.table inside the function but doesn't change the big data.table. Can you explain more about why you'd like to create a function like that? Why not load the big data.table, add the ephemeral working columns directly to it, and then proceed? Your R session is already a working copy in memory of the data, which is persistent somewhere else.
Note that I am not saying no. I'm not saying that you don't have a valid reason. I'm asking to discover more about that valid reason so the priority can be raised.
If that isn't the answer you had seen, there are currently 39 questions or answers returned by the search string "[data.table] shallow". Worst case, you could trawl through those to find it again.
The data.table package's documentation for setorder states:
x[order(.)] is now optimised internally to use data.table's fast
order by default. data.table always reorders in C-locale. To sort by
session locale, use x[base::order(.)] instead.
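For reference, the two forms being contrasted look like this (a minimal sketch; DT and the column x are placeholders):
require(data.table)
DT = data.table(x = c("b", "A", "a"), y = 1:3)
DT[order(x)]         # optimised internally to data.table's fast order; sorts in C-locale
DT[base::order(x)]   # explicitly base::order; sorts according to the session locale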
I am curious to know how simply writing order in this context causes a function different from R's own (base) order to be called.
I could have explained this behavior if data.table had exported its own function called order, thereby masking base::order, but this does not appear to be happening (in any case, there is no such thing as data.table::order). How, then, is the meaning of order changed?
Thanks in advance!
I am using the new data.table:::fread function (the fastest read function I've used in R so far) and I got the following (self-explanatory) exception:
R) fread(path)
Error in fread(path) : Coercing integer64 to real needs to be implemented
My file (which is a csv separated by tabs) indeed holds big integers like 902160000671352000. My question is then: can I tell fread NOT to read the second column (where those monster ints are)?
Good question. Not yet, but yes you will be able to. Agree with all comments.
The TO DO list is at the top of the readfile.c source. If there's anything missing please let me know. That list covers allowing type overrides, implementing the unimplemented coercions and allowing columns to be skipped. Hopefully will all be done for first release in 1.9.0.
fread is currently in v1.8.7, which is in development on R-Forge. When finished it'll be released as 1.9.0 to CRAN. The .0 indicates that new features might possibly change argument names and behaviour; i.e., don't be surprised if backwards-incompatible changes are made to fread in 1.9.1. Given its nature, it's hard to imagine anything major will change, though. But that's why I publicised its availability on R-Forge, to get it into the wild early and get things like this right.
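In later releases fread did gain these options; a minimal sketch of how they look, with the file name as a placeholder:
require(data.table)
DT = fread("myfile.tsv", drop = 2)                            # skip the second column entirely
DT = fread("myfile.tsv", colClasses = list(character = 2))    # or read it as character
DT = fread("myfile.tsv", integer64 = "numeric")               # or coerce integer64 columns to double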
I have a big data frame taking about 900MB ram. Then I tried to modify it like this:
dataframe[[17]][37544]=0
It seems that this makes R use more than 3 GB of RAM, and R complains "Error: cannot allocate vector of size 3.0 Mb" (I am on a 32-bit machine).
I found this way is better:
dataframe[37544, 17]=0
but R's memory footprint still doubles and the command takes quite some time to run.
From a C/C++ background, I am really confused about this behavior. I thought something like dataframe[37544, 17]=0 should be completed in a blink without costing any extra memory (only one cell should be modified). What is R doing for those commands I posted? What is the right way to modify some elements in a data frame then without doubling the memory footprint?
Thanks so much for your help!
Tao
Following up on Joran's suggestion of data.table, here are some links. Your object, at 900MB, is manageable in RAM even in 32bit R, with no copies at all.
When should I use the := operator in data.table?
Why has data.table defined := rather than overloading <-?
Also, data.table v1.8.0 (not yet on CRAN but stable on R-Forge) has a set() function which provides even faster assignment to elements, as fast as assignment to matrix (appropriate for use inside loops for example). See latest NEWS for more details and example. Also see ?":=" which is linked from ?data.table.
And, here are 12 questions on Stack Overflow with the data.table tag containing the word "reference".
For completeness :
require(data.table)
DT = as.data.table(dataframe)
# say column name 17 is 'Q' (i.e. LETTERS[17])
# then any of the following :
DT[37544, Q:=0] # using column name (often preferred)
DT[37544, 17:=0, with=FALSE] # using column number
col = "Q"
DT[37544, col:=0, with=FALSE] # variable holding name
col = 17
DT[37544, col:=0, with=FALSE] # variable holding number
set(DT,37544L,17L,0) # using set(i,j,value) in v1.8.0
set(DT,37544L,"Q",0)
But, please do see linked questions and the package's documentation to see how := is more general than this simple example; e.g., combining := with binary search in an i join.
Look up 'copy-on-write' in the context of R discussions related to memory. As soon as one part of a (potentially really large) data structure changes, a copy is made.
A useful rule of thumb is that if your largest object is N MB/GB/... in size, you need around 3*N of RAM. Such is life with an interpreted system.
Years ago, when I had to handle large amounts of data on 32-bit machines with (relative to the data volume) little RAM, I got good use out of early versions of the bigmemory package. It uses the 'external pointer' interface to keep large gobs of memory outside of R. That saves you not only the '3x' factor, but possibly more, as you may get away with non-contiguous memory (contiguous memory being the other thing R insists on).
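If you want to try that route, the basic pattern is something like this (a sketch; the dimensions are made up):
library(bigmemory)
X <- big.matrix(nrow = 1e6, ncol = 20, type = "double", init = 0)  # storage lives outside R's heap
X[37544, 17] <- 0   # touches only that cell; no 3x copy of the whole object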
Data frames are the worst structure you can choose to make modifications to. Because of the rather complex handling of their features (such as keeping row names in sync, partial matching, etc.), which is done in pure R code (unlike most other objects, which can go straight to C), they tend to force additional copies, as you can't edit them in place. Check R-devel for the detailed discussions on this; it has been discussed at length several times.
The practical rule is to never use data frames for large data, unless you treat them read-only. You will be orders of magnitude more efficient if you either work on vectors or matrices.
There is a type of object called an ffdf in the ff package which is basically a data.frame stored on disk. In addition to the other tips above, you can try that.
You can also try the RSQLite package.
I'm wondering if there's any documentation about the efficiency of operations in R, specifically those related to data manipulation.
For example:
I imagine it's efficient to add columns to a data frame, because I'm guessing you're just adding an element to a linked list.
I imagine adding rows is slower because vectors are held in arrays at the C level and you have to allocate a new array of length n+1 and copy all the elements over.
The developers probably don't want to tie themselves to a particular implementation, but it would be nice to have something more solid than guesses to go on.
Also, I know the main R performance hint is to use vectorized operations whenever possible, as opposed to loops.
What about the various flavors of apply?
Are those just hidden loops?
What about matrices vs. data frames?
Data I/O was one of the features I looked into before I committed to learning R. For better or worse, here are my observations and solutions/palliatives for these issues:
1. That R doesn't handle big data (>2 GB?). To me this is a misconception. By default, the common data-input functions load your data into RAM. Not to be glib, but to me this is a feature, not a bug: any time my data will fit in my available RAM, that's where I want it. Likewise, one of SQLite's most popular features is the in-memory option: the user has the easy option of loading the entire dB into RAM. If your data won't fit in memory, then R makes it astonishingly easy to persist it, via connections to the common RDBMS systems (RODBC, RSQLite, RMySQL, etc.), via no-frills options like the filehash package, and via systems that reflect current technology/practices (for instance, I can recommend ff). In other words, the R developers have chosen a sensible (and probably optimal) default, from which it is very easy to opt out.
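As one concrete example of opting out, the RSQLite route looks roughly like this (a sketch; the table and file names are invented):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")             # or ":memory:" for an in-memory dB
dbWriteTable(con, "big_table", my_data_frame)                    # persist the data frame once
chunk <- dbGetQuery(con, "SELECT * FROM big_table LIMIT 10000")  # pull back only what fits
dbDisconnect(con)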
2. The performance of read.table (read.csv, read.delim, et al.), the most common means for getting data into R, can be improved 5x (and often much more, in my experience) just by opting out of a few of read.table's default arguments; the ones having the greatest effect on performance are mentioned in R's help (?read.table). Briefly, the R developers tell us that if you provide values for the parameters 'colClasses', 'nrows', 'sep', and 'comment.char' (in particular, pass in '' for comment.char if you know your file begins with headers or data on line 1), you'll see a significant performance gain. I've found that to be true.
Here are the snippets i use for those parameters:
To get the number of rows in your data file (supply this snippet as an argument to the parameter, 'nrows', in your call to read.table):
as.numeric((gsub("[^0-9]+", "", system(paste("wc -l ", file_name, sep=""), intern=T))))
To get the classes for each column:
function(fname){sapply(read.table(fname, header=T, nrows=5), class)}
Note: You can't pass this snippet in as an argument; you have to call it first and then pass in the value returned. In other words, call the function, bind the returned value to a variable, and then pass in the variable as the value of the parameter 'colClasses' in your call to read.table:
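Something like the following (the file name is a placeholder, and the wc -l trick assumes the file name itself contains no digits):
colClassesVec <- sapply(read.table("mydata.tsv", header=T, nrows=5), class)
numRows <- as.numeric(gsub("[^0-9]+", "", system("wc -l mydata.tsv", intern=T)))
# nrows only needs to be an upper bound, so counting the header line as well is harmless
df <- read.table("mydata.tsv", header=T, sep="\t", comment.char="",
                 colClasses=colClassesVec, nrows=numRows)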
3. Using scan. With only a little more hassle, you can do better than that (i.e., better than an optimized 'read.table') by using 'scan' instead of 'read.table' ('read.table' is actually just a wrapper around 'scan'). Once again, this is very easy to do. I use 'scan' to input each column individually and then build my data.frame inside R, i.e., df = data.frame(cbind(col1, col2, ....)).
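A bare-bones version of that scan-based approach (the file name and column types are assumptions):
# declare the type of each field up front; scan returns one vector per column
cols <- scan("mydata.tsv", what=list(id=integer(), value=double()),
             sep="\t", skip=1, quiet=TRUE)
df <- data.frame(cols)   # data.frame() on the list keeps each column's own type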
4. Use R's containers for persistence in place of ordinary file formats (e.g., 'txt', 'csv'). R's native data file format, '.RData', is a binary format that is a little smaller than a compressed ('.gz') txt data file. You create such files with save(object, file) and load them back into the R namespace with load(file). The difference in load times compared with 'read.table' is dramatic. For instance, with a 25 MB file (uncompressed size):
system.time(read.table("tdata01.txt.gz", sep=","))
   user  system elapsed
  6.173   0.245   6.450

system.time(load("tdata01.RData"))
   user  system elapsed
  0.912   0.006   0.912
5. Paying attention to data types can often give you a performance boost and reduce your memory footprint. This point is probably more useful in getting data out of R. The key point to keep in mind here is that, by default, numbers in R expressions are interpreted as double-precision floating point; e.g., typeof(5) returns "double". Compare the object size of a reasonably sized array of each type and you can see the significance (use object.size()). So coerce to integer when you can.
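The doubling is easy to see with object.size():
x_int <- 1:1e6               # integer vector, 4 bytes per element
x_dbl <- as.numeric(x_int)   # double vector, 8 bytes per element
object.size(x_int)           # ~4 MB
object.size(x_dbl)           # ~8 MB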
Finally, the 'apply' family of functions (among others) are not "hidden loops" or loop wrappers. They are loops implemented in C -- a big difference performance-wise. [Edit: AWB has correctly pointed out that while 'sapply', 'tapply', and 'mapply' are implemented in C, 'apply' is simply a wrapper function.]
These things do pop up on the lists, in particular on r-devel. One fairly well-established nugget is that e.g. matrix operations tend to be faster than data.frame operations. Then there are add-on packages that do well -- Matt's data.table package is pretty fast, and Jeff has gotten xts indexing to be quick.
But it "all depends" -- so you are usually best adviced to profile on your particular code. R has plenty of profiling support, so you should use it. My Intro to HPC with R tutorials have a number of profiling examples.
I will try to come back and provide more detail. If you have any questions about the efficiency of one operation over another, you would do best to profile your own code (as Dirk suggests). The system.time() function is the easiest way to do this, although there are many more advanced utilities (e.g. Rprof, as documented here).
A quick response for the second part of your question:
What about the various flavors of apply? Are those just hidden loops?
For the most part, yes, the apply functions are just loops and can be slower than for statements. Their chief benefit is clearer code. The main exception that I have found is lapply, which can be faster because it is coded in C directly.
And what about matrices vs. data frames?
Matrices are more efficient than data frames because they require less memory for storage; data frames carry additional attribute data. From An Introduction to R:
A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes.
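You can check both claims on your own machine with a quick sketch like this:
m  <- matrix(0, nrow = 1e4, ncol = 100)
df <- as.data.frame(m)
object.size(m)    # the matrix is the smaller object
object.size(df)   # the data frame carries names and other attributes on top
system.time(for (i in 1:1e3) m[i, 1] <- 1)    # element assignment into a matrix is cheap
system.time(for (i in 1:1e3) df[i, 1] <- 1)   # the same loop on a data frame is far slower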