Access columns of an R data.table as a matrix, by reference - r

I recall there is a special (undocumented?) form that allows accessing selected columns of a data.table by reference as a matrix (in so doing allocating no memory) but for the life of me I can't find my note on the subject. Something analogous to setDF, but yielding a matrix, not a data.frame. I understand this might come with certain dangers, and would like to be reminded of how to do this, and what the dangers are.

It is not possible, because matrix has different layout in memory than list/data.frame/data.table, and we don't have any mapping that allows that.

Related

Is there a visual explanation of why data.table operations are faster than tidyverse operations when you need to group by a variable?

I understand from excellent resources here, here and here that data.table utilises automatic indexing (to create a key i.e. supercharged row names) and binary search based subset in contrast to tidyverse, which relies on vector scanning.
I understand that vector scanning requires scanning each individual row and the creation of nrow(dataset) length logical vectors, and that doing this repeatedly is not as efficient.
I'm wondering if someone can help me frame exactly how these two methods means that data.table operations run a lot faster compared to tidyverse when you need to group by a variable. I.e. is it because data.table automatically indexes the group_by column and breaks it into grouped subsets and runs operations on each subset, whilst a vector scanning approach would require the generation of n = unique groups of multiple logical vectors, and then run operations on each individual logical vector, before collating results?
Also, according to the data.table vignette,
We can set keys on multiple columns and the column can be of different
types...
Since the rows are reordered, a data.table can have at most one key
because it can not be sorted in more than one way.
What does it mean that we can set keys on multiple columns and yet a data.table can have at most one key? I.e. is it that during any moment when running an operation, there is only one reference key, but which column the reference key is set as can change as we progress to another component of the overall operation?
Thank you in advance!
There is no.
There are different ways to finding groups, and then to compute expression by groups. Each single thing can be differently implemented. They are not related to keys or index. Also data.table is not automatically creating key/index during group by (as of now).
data.table has very fast, carefully implemented, order function, it is being used to find groups. It was contributed to base R later on. There is an idea to use it in dplyr to speed up grouping: https://github.com/tidyverse/dplyr/issues/4406
Yet data.table order function got improved since then and now scales even better.
Aside from finding groups, there is a part about computing an expression. If we evaluate "user defined function" it will always be much slower. Many common functions are internally optimized, so they don't switch between R and C for every group. Here, data.table has also very carefully implemented "GForce" functions. Not sure but in dplyr they are called "hybrid evaluation".
It is always important to test on your particular data use case. If you have just 2 unique groups in data, then fast grouping algorithms will not shine much.
Also there is a community repository which meant to describe data.table algorithms https://github.com/asantucci/algo_data.table but it is not very active. I just recently posted there a comment about "groupby optimization", will paste it here as well. Answer was provided by data.table author Matt Dowle.
Q: does GForce allocate mem for biggest group, then copy there values of a group, to aggregate, so it can benefit from being contiguous in memory and will be more cache efficient? if so, do we check if groups aren't sorted already? so we can avoid doing allocation and copy?
A: gforce (gsum) assigns to many group results at once; it doesn't gather the groups together. You're describing non-gforce (dogroup.c) which copies to the largest group. See the branch in dogroups.c which knows whether groups are already grouped: it swithes to a memcpy. The memcpy is very fast (contiguous, pre-fetch) so it's pretty good already. We must copy because R's DATAPTR is not a pointer we can repoint, it's an offset from SEXP.

How to change to a less memory-hungry data type in R?

I am working with a big (500000*2000) matrix containing data that can be one of 4 values. Keeping it in the standard R data type is pushing the capabilities of my workstation.
Is there a data type in R that allows for more efficient memory usage by allocating only 2 bits to each one of these values? This would increase the efficiency of my code by a lot.
Thanks
Depends on what kind of analysis you are doing. Using the sparse matrix functions from package Matrix (as Shinobi_Atobe suggested above) might be helpful if your matrix is sparse, that is, contains "lots" of zero values, whereas the simplest operational definition of "lots of zero values" is: try it out (i.e., convert your data to a sparse matrix class) and see if it helps.
You can also make sure that your data is stored as (a) integer [check out 1L vs 1] or (b) factor [which is, technically, integer] but not character or "long" (i.e., non-integer but numeric). Integer seems to be R's least memory-hungry tata type, even truth values (TRUE vs FALSE) do not seem to occupy less memory than integers. (I'm not completely sure about that, have tried only a very simple comparison: object.size(rep(T, 100)) == object.size(rep(1L, 100)) but see ?storage.mode).
So converting your data to integer (using as.integer will disentangle your matrix so it's a little bit trickier than that) might help. At least a little.
Beyond that, the possibilities include increasing your memory allowance to R[*], dividing your matrix into sub-parts (if that does not ruin your analytic strategy; even a list of smaller matrices can be more efficient than a big matrix for some purposes; so instead of a single 500000*2000 mtx you could have, say, a list of 100 5000*2000 matrices), and doing some parts of analysis in another language within R (e.g., Rcpp) or completely without it (e.g., an external python script).
[*] Increasing (or decreasing) the memory available to R processes

Out of memory when modifying a big R data.frame

I have a big data frame taking about 900MB ram. Then I tried to modify it like this:
dataframe[[17]][37544]=0
It seems that makes R using more than 3G ram and R complains "Error: cannot allocate vector of size 3.0 Mb", ( I am on a 32bit machine.)
I found this way is better:
dataframe[37544, 17]=0
but R's footprint still doubled and the command takes quite some time to run.
From a C/C++ background, I am really confused about this behavior. I thought something like dataframe[37544, 17]=0 should be completed in a blink without costing any extra memory (only one cell should be modified). What is R doing for those commands I posted? What is the right way to modify some elements in a data frame then without doubling the memory footprint?
Thanks so much for your help!
Tao
Following up on Joran suggesting data.table, here are some links. Your object, at 900MB, is manageable in RAM even in 32bit R, with no copies at all.
When should I use the := operator in data.table?
Why has data.table defined := rather than overloading <-?
Also, data.table v1.8.0 (not yet on CRAN but stable on R-Forge) has a set() function which provides even faster assignment to elements, as fast as assignment to matrix (appropriate for use inside loops for example). See latest NEWS for more details and example. Also see ?":=" which is linked from ?data.table.
And, here are 12 questions on Stack Overflow with the data.table tag containing the word "reference".
For completeness :
require(data.table)
DT = as.data.table(dataframe)
# say column name 17 is 'Q' (i.e. LETTERS[17])
# then any of the following :
DT[37544, Q:=0] # using column name (often preferred)
DT[37544, 17:=0, with=FALSE] # using column number
col = "Q"
DT[37544, col:=0, with=FALSE] # variable holding name
col = 17
DT[37544, col:=0, with=FALSE] # variable holding number
set(DT,37544L,17L,0) # using set(i,j,value) in v1.8.0
set(DT,37544L,"Q",0)
But, please do see linked questions and the package's documentation to see how := is more general than this simple example; e.g., combining := with binary search in an i join.
Look up 'copy-on-write' in the context of R discussions related to memory. As soon as one part of a (potentially really large) data structure changes, a copy is made.
A useful rule of thumb is that if your largest object is N mb/gb/... large, you need around 3*N of RAM. Such is life with an interpreted system.
Years ago when I had to handle large amounts of data on machines with (relative to the data volume) relatively low-ram 32-bit machines, I got good use out of early versions of the bigmemory package. It uses the 'external pointer' interface to keep large gobs of memory outside of R. That save you not only the '3x' factor, but possibly more as you may get away with non-contiguous memory (which is the other thing R likes).
Data frames are the worst structure you can choose to make modification to. Due to quite the complex handling of all features (such as keeping row names in synch, partial matching, etc.) which is done in pure R code (unlike most other objects that can go straight to C) they tend to force additional copies as you can't edit them in place. Check R-devel on the detailed discussions on this - it has been discussed in length several times.
The practical rule is to never use data frames for large data, unless you treat them read-only. You will be orders of magnitude more efficient if you either work on vectors or matrices.
There is type of object called a ffdf in the ff package which is basically a data.frame stored on disk. In addition to the other tips above you can try that.
You can also try the RSQLite package.

Big Data convert to "transactions" from arules package

The arules package in R uses the class 'transactions'. So in order to use the function apriori() I need to convert my existing data. I've got a Matrix with 2 columns and roughly 1.6mm rows and tried to convert the data like this:
transaction_data <- as(split(original_data[,"id"], original_data[,"type"]), "transactions")
where original_data is my data matrix. Because of the amount of data I used the largest AWS Amazon machine with 64gb RAM. After a while I get
resulting vector exceeds vector length limit in 'AnswerType'
The Memory Usage of the machine was still 'only' at 60%. Is this a R-based limitation? Is there any way to work around this other than using sampling? When only using 1/4 of the data the transformation worked fine.
Edit: As pointed out, one of the variables was a factor instead of character. After changing the transformation was processed quickly and correct.
I suspect that your problem is arising because one of the functions uses integers (rather than, say, floats) to index values. In any case, the size isn't too big, so this is surprising. Maybe the data has some other issue, such as characters as factors?
In general, though, I'd really recommend using memory mapped files, via bigmemory, which you can also split and process via bigsplit or mwhich. If offloading the data works for you, then you can also use a much smaller instance size and save $$. :)

Efficiency of operations on R data structures

I'm wondering if there's any documentation about the efficiency of operations in R, specifically those related to data manipulation.
For example:
I imagine it's efficient to add columns to a data frame, because I'm guessing you're just adding an element to a linked list.
I imagine adding rows is slower because vectors are held in arrays at the C level and you have to allocate a new array of length n+1 and copy all the elements over.
The developers probably don't want to tie themselves to a particular implementation, but it would be nice to have something more solid than guesses to go on.
Also, I know the main R performance hint is to use vectored operations whenever possible as opposed to loops.
what about the various flavors of apply?
are those just hidden loops?
what about matrices vs. data frames?
Data IO was one of the features i looked into before i committed to learning R. For better or worse, here are my observations and solutions/palliatives on these issues:
1. That R doesn't handle big data (>2 GB?) To me this is a misnomer. By default, the common data input functions load your data into RAM. Not to be glib, but to me, this is a feature not a bug--anytime my data will fit in my available RAM, that's where i want it. Likewise, one of SQLite's most popular features is the in-memory option--the user has the easy option of loading the entire dB into RAM. If your data won't fit in memory, then R makes it astonishingly easy to persist it, via connections to the common RDBMS systems (RODBC, RSQLite, RMySQL, etc.), via no-frills options like the filehash package, and via systems that current technology/practices (for instance, i can recommend ff). In other words, the R developers have chosen a sensible (and probably optimal) default, from which it is very easy to opt out.
2. The performance of read.table (read.csv, read.delim, et al.), the most common means for getting data into R, can be improved 5x (and often much more in my experience) just by opting out of a few of read.table's default arguments--the ones having the greatest effect on performance are mentioned in the R's Help (?read.table). Briefly, the R Developers tell us that if you provide values for the parameters 'colClasses', 'nrows', 'sep', and 'comment.char' (in particular, pass in '' if you know your file begins with headers or data on line 1), you'll see a significant performance gain. I've found that to be true.
Here are the snippets i use for those parameters:
To get the number of rows in your data file (supply this snippet as an argument to the parameter, 'nrows', in your call to read.table):
as.numeric((gsub("[^0-9]+", "", system(paste("wc -l ", file_name, sep=""), intern=T))))
To get the classes for each column:
function(fname){sapply(read.table(fname, header=T, nrows=5), class)}
Note: You can't pass this snippet in as an argument, you have to call it first, then pass in the value returned--in other words, call the function, bind the returned value to a variable, and then pass in the variable as the value to to the parameter 'colClasses' in your call to read.table:
3. Using Scan. With only a little more hassle, you can do better than that (optimizing 'read.table') by using 'scan' instead of 'read.table' ('read.table' is actually just a wrapper around 'scan'). Once again, this is very easy to do. I use 'scan' to input each column individually then build my data.frame inside R, i.e., df = data.frame(cbind(col1, col2,....)).
4. Use R's Containers for persistence in place of ordinary file formats (e.g., 'txt', 'csv'). R's native data file '.RData' is a binary format that a little smaller than a compressed ('.gz') txt data file. You create them using save(, ). You load it back into the R namespace with load(). The difference in load times compared with 'read.table' is dramatic. For instance, w/ a 25 MB file (uncompressed size)
system.time(read.table("tdata01.txt.gz", sep=","))
=> user system elapsed
6.173 0.245 **6.450**
system.time(load("tdata01.RData"))
=> user system elapsed
0.912 0.006 **0.912**
5. Paying attention to data types can often give you a performance boost and reduce your memory footprint. This point is probably more useful in getting data out of R. The key point to keep in mind here is that by default, numbers in R expressions are interpreted as double-precision floating point, e.g., > typeof(5) returns "double." Compare the object size of a reasonable-sized array of each and you can see the significance (use object.size()). So coerce to integer when you can.
Finally, the 'apply' family of functions (among others) are not "hidden loops" or loop wrappers. They are loops implemented in C--big difference performance-wise. [edit: AWB has correctly pointed out that while 'sapply', 'tapply', and 'mapply' are implemented in C, 'apply' is simply a wrapper function.
These things do pop up on the lists, in particular on r-devel. One fairly well-established nugget is that e.g. matrix operations tend to be faster than data.frame operations. Then there are add-on packages that do well -- Matt's data.table package is pretty fast, and Jeff has gotten xts indexing to be quick.
But it "all depends" -- so you are usually best adviced to profile on your particular code. R has plenty of profiling support, so you should use it. My Intro to HPC with R tutorials have a number of profiling examples.
I will try to come back and provide more detail. If you have any question about the efficiency of one operation over another, you would do best to profile your own code (as Dirk suggests). The system.time() function is the easiest way to do this although there are many more advanced utilities (e.g. Rprof, as documented here).
A quick response for the second part of your question:
What about the various flavors of apply? Are those just hidden loops?
For the most part yes, the apply functions are just loops and can be slower than for statements. Their chief benefit is clearer code. The main exception that I have found is lapply which can be faster because it is coded in C directly.
And what about matrices vs. data frames?
Matrices are more efficient than data frames because they require less memory for storage. This is because data frames require additional attribute data. From R Introduction:
A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes

Resources