R - Order of the data table records from subsetting columns - r

I am currently learning data.table in R. A few questions have me confused:
1. Does subsetting columns always preserve the order of records (i.e. rows 1, 2, 3 stay as rows 1, 2, 3 instead of becoming rows 1, 3, 2)? Also, does the same conclusion apply to other expressions, such as DB[[1]], DB$V1, etc.?
2. When subsetting multiple columns, I know I need to use something like DB[, .(V1, V2)], but I am confused about the result of DB[, V1, V2]. The code runs and seems to produce a result, but the rows are not in the same order as in the original table. If someone could explain what the latter code means, that would be a great help.
Thanks a lot!

I want to start with a small suggestion: if you post a data-processing question on SO, it is far better to include reproducible code in the question, plus the expected output if it isn't obvious. You will reach a much bigger audience and gather better solutions. This is common practice on the r tag.
Subsetting preserves order. The underlying storage is column-oriented, unlike a regular SQL database (which is not aware of row order). It works exactly the same as subsetting a vector in base R, just much faster.
Regarding [[ and $, these are just methods for extracting a column from a data.table, or an element from a list in general. You can use DB[[1]], DB[["V1"]] or DB$V1. They behave differently depending on whether the column/list element exists.
The third argument inside the data.table [ operator is by, which expects the columns to group by, so you are querying column V1 grouped by V2 without using any aggregate function. This is very different from DB[, .(V1, V2)], DB[, c("V1","V2"), with=FALSE], DB[, list(V1,V2)], DB[, .SD, .SDcols=c("V1","V2")] and so on. Most of the API is borrowed from base R functions like subset() or with().
Finally, I recommend going through the data.table vignettes. There is also my recent longish post that goes through various data.table examples: Boost Your Data Munging with R.
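To make the difference concrete, here is a minimal sketch. The table DB and its values are made up for illustration and are not the asker's data.

```r
library(data.table)

# Hypothetical toy table, for illustration only
DB <- data.table(V1 = c(10, 20, 30, 40), V2 = c("b", "a", "b", "a"))

# Column subset: rows stay in their original order
cols <- DB[, .(V1, V2)]

# Positional third argument is `by`, so this is DB[, V1, by = V2]
grouped <- DB[, V1, V2]

# [[ and $ return the column as a plain vector, in row order
v_by_index <- DB[[1]]
v_by_name  <- DB$V1

print(cols)
print(grouped)
```

With by, the groups come back in order of first appearance (here "b", then "a") and the grouping column is placed first, which is why the second result looks reordered.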

Related

Is there a visual explanation of why data.table operations are faster than tidyverse operations when you need to group by a variable?

I understand from excellent resources here, here and here that data.table utilises automatic indexing (to create a key i.e. supercharged row names) and binary search based subset in contrast to tidyverse, which relies on vector scanning.
I understand that vector scanning requires scanning each individual row and the creation of nrow(dataset) length logical vectors, and that doing this repeatedly is not as efficient.
I'm wondering if someone can help me frame exactly how these two methods mean that data.table operations run a lot faster than tidyverse ones when you need to group by a variable. I.e. is it because data.table automatically indexes the group_by column, breaks it into grouped subsets and runs operations on each subset, whilst a vector scanning approach would require generating one logical vector per unique group, running operations on each individual logical vector, and then collating the results?
Also, according to the data.table vignette,
We can set keys on multiple columns and the column can be of different
types...
Since the rows are reordered, a data.table can have at most one key
because it can not be sorted in more than one way.
What does it mean that we can set keys on multiple columns and yet a data.table can have at most one key? I.e. is it that during any moment when running an operation, there is only one reference key, but which column the reference key is set as can change as we progress to another component of the overall operation?
Thank you in advance!
No, there isn't one.
There are different ways of finding groups and then computing an expression by group, and each step can be implemented differently. Neither is related to keys or indices. Also, data.table does not automatically create a key or index during a group by (as of now).
data.table has a very fast, carefully implemented order function, which is used to find groups. It was later contributed to base R. There is an idea to use it in dplyr to speed up grouping: https://github.com/tidyverse/dplyr/issues/4406
Since then the data.table order function has been improved further and now scales even better.
Aside from finding groups, there is the part about computing an expression. If we evaluate a user-defined function, it will always be much slower. Many common functions are internally optimized, so they don't switch between R and C for every group. Here, data.table also has very carefully implemented "GForce" functions. I'm not sure, but in dplyr I think this is called "hybrid evaluation".
It is always important to test on your particular data and use case. If you have just 2 unique groups in your data, fast grouping algorithms will not shine much.
There is also a community repository meant to describe data.table algorithms, https://github.com/asantucci/algo_data.table, but it is not very active. I recently posted a comment there about "groupby optimization" and will paste it here as well. The answer was provided by data.table author Matt Dowle.
Q: does GForce allocate mem for biggest group, then copy there values of a group, to aggregate, so it can benefit from being contiguous in memory and will be more cache efficient? if so, do we check if groups aren't sorted already? so we can avoid doing allocation and copy?
A: gforce (gsum) assigns to many group results at once; it doesn't gather the groups together. You're describing non-gforce (dogroups.c) which copies to the largest group. See the branch in dogroups.c which knows whether groups are already grouped: it switches to a memcpy. The memcpy is very fast (contiguous, pre-fetch) so it's pretty good already. We must copy because R's DATAPTR is not a pointer we can repoint, it's an offset from SEXP.
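As a rough sketch of the point about internally optimized functions, the snippet below groups the same made-up data twice: once with sum() directly and once through a hypothetical wrapper my_sum() that data.table cannot recognise. The data and names are assumptions for illustration. Timings depend on your machine and data, so none are quoted here.

```r
library(data.table)

set.seed(1)
# Made-up data: 100,000 rows in 5 groups
DT <- data.table(g = sample(letters[1:5], 1e5, replace = TRUE),
                 v = runif(1e5))

# sum() is one of the functions data.table can optimize internally
fast <- DT[, .(s = sum(v)), by = g]

# Hypothetical user-defined wrapper: evaluated in R once per group
my_sum <- function(x) sum(x)
slow <- DT[, .(s = my_sum(v)), by = g]

# Both give the same answer (up to floating point rounding)
print(all.equal(fast$s, slow$s))

# Adding verbose = TRUE prints data.table's diagnostics for a query,
# including whether j was optimized:
# DT[, .(s = sum(v)), by = g, verbose = TRUE]
```

Wrapping each call in system.time() on your own data is the simplest way to see whether the difference matters for your case.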

Replace subset of data table with other data table

I feel a bit silly asking this, but I only want to do something that I know how to do with a data.frame and have not yet found a nice way to do with a data.table. All other similar questions seem far more complicated than what I have in mind. I simply want to replace a subset of a data.table with another data.table, based only on a row index and a choice of columns.
MWE follows
library(data.table)
x.df <- data.frame(a=c(1,2,3),
                   b=c(2,NA,NA),
                   c=c(3,NA,NA))
x.dt <- data.table(x.df)
x.df.replace <- data.frame(b=c(10,11), c=c(22,21))
x.dt.replace <- data.table(x.df.replace)
This works like a charm with a data.frame
x.df[is.na(x.df$b),2:3]<-x.df.replace
On the other hand I would like to call the columns by name and I only know how to replace each column individually, but not jointly
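x.dt[is.na(b),]                           # the rows to be replaced
x.dt[is.na(b),c:=x.dt.replace[,c]]        # works, but only one column at a time
x.dt[is.na(b),b:=x.dt.replace[,b]]
x.dt[is.na(b), list(b,c)]<-x.dt.replace   # attempted joint replacement, does not work
x.dt[is.na(b), list(b,c):=x.dt.replace]   # attempted joint replacement, does not work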
I was having the same issue and came across this question with no answer. The comments above helped me find a solution to my problem, so I decided to post it here. The behaviour may simply differ between data.table versions (I am using version 1.11.8), since this is a relatively old question.
The solution wraps a character vector of column names in () instead of using .() or list() to declare the columns to be replaced:
colunas <- c("b","c")
x.dt[is.na(b), (colunas) := x.dt.replace]
Hope this is useful.
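For completeness, here is the whole thing as one self-contained sketch, using the data from the question. The variable name cols is my own choice.

```r
library(data.table)

# Data from the question
x.dt <- data.table(a = c(1, 2, 3), b = c(2, NA, NA), c = c(3, NA, NA))
x.dt.replace <- data.table(b = c(10, 11), c = c(22, 21))

# Character vector of target columns; the parentheses tell := to
# evaluate it rather than create a column literally named "cols"
cols <- c("b", "c")
x.dt[is.na(b), (cols) := x.dt.replace]

print(x.dt)
```

Note that the right-hand side is matched to the target columns by position, not by name, so the columns of x.dt.replace need to be in the same order as cols.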

What is the purpose of setting a key in data.table?

I am using data.table and there are many functions which require me to set a key (e.g. X[Y]). As such, I wish to understand what a key does in order to properly set keys in my data tables.
One source I read was ?setkey.
setkey() sorts a data.table and marks it as sorted. The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always. The table is changed by reference. No copy is made at all, other than temporary working memory as large as one column.
My takeaway here is that a key would "sort" the data.table, resulting in a very similar effect to order(). However, it doesn't explain the purpose of having a key.
The data.table FAQ 3.2 and 3.3 explains:
3.2 I don't have a key on a large table, but grouping is still really quick. Why is that?
data.table uses radix sorting. This is significantly faster than other
sort algorithms. Radix is specifically for integers only, see
?base::sort.list(x,method="radix"). This is also one reason why
setkey() is quick. When no key is set, or we group in a different order
from that of the key, we call it an ad hoc by.
3.3 Why is grouping by columns in the key faster than an ad hoc by?
Because each group is contiguous in RAM, thereby minimising page
fetches, and memory can be copied in bulk (memcpy in C) rather than
looping in C.
From here, I guess that setting a key somehow allows R to use "radix sorting" over other algorithms, and that's why it is faster.
The 10 minute quick start guide also has a guide on keys.
Keys
Let's start by considering data.frame, specifically rownames (or in
English, row names). That is, the multiple names belonging to a single
row. The multiple names belonging to the single row? That is not what
we are used to in a data.frame. We know that each row has at most one
name. A person has at least two names, a first name and a second name.
That is useful to organise a telephone directory, for example, which
is sorted by surname, then first name. However, each row in a
data.frame can only have one name.
A key consists of one or more
columns of rownames, which may be integer, factor, character or some
other class, not simply character. Furthermore, the rows are sorted by
the key. Therefore, a data.table can have at most one key, because it
cannot be sorted in more than one way.
Uniqueness is not enforced,
i.e., duplicate key values are allowed. Since the rows are sorted by
the key, any duplicates in the key will appear consecutively.
The telephone directory was helpful in understanding what a key is, but it seems that a key is no different from a factor column. Furthermore, it does not explain why a key is needed (especially to use certain functions) or how to choose the column to set as the key. Also, it seems that in a data.table with time as a column, setting any other column as the key would probably mess up the order of the time column too, which makes it even more confusing, as I do not know whether I am allowed to set any other column as the key. Can someone enlighten me please?
In addition to this answer, please refer to the vignettes Secondary indices and auto indexing and Keys and fast binary search based subset as well.
This issue highlights the other vignettes that we plan to.
I've updated this answer again (Feb 2016) in light of the new on= feature that allows ad-hoc joins as well. See history for earlier (outdated) answers.
What exactly does setkey(DT, a, b) do?
It does two things:
reorders the rows of the data.table DT by the column(s) provided (a, b) by reference, always in increasing order.
marks those columns as key columns by setting an attribute called sorted to DT.
The reordering is both fast (due to data.table's internal radix sorting) and memory efficient (only one extra column of type double is allocated).
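A small sketch of those two effects, on a made-up table (the column names and values are assumptions for illustration). It also shows how one key can span several columns.

```r
library(data.table)

# Hypothetical toy table, deliberately unsorted
DT <- data.table(a = c(2L, 1L, 2L, 1L),
                 b = c("y", "x", "x", "y"),
                 v = 1:4)

setkey(DT, a, b)   # reorders DT by reference: by a, then by b within a

print(DT)                     # rows are now sorted by (a, b)
print(key(DT))                # the single key, made of two columns
print(attr(DT, "sorted"))     # the attribute that stores the key
print(haskey(DT))
```

So "at most one key" refers to one sort order, and that one key may consist of one or more columns.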
When is setkey() required?
For grouping operations, setkey() was never an absolute requirement. That is, we can perform a cold-by or adhoc-by.
## "cold" by
require(data.table)
DT <- data.table(x=rep(1:5, each=2), y=1:10)
DT[, mean(y), by=x] # no key is set, order of groups preserved in result
However, prior to v1.9.6, joins of the form x[i] required key to be set on x. With the new on= argument from v1.9.6+, this is not true anymore, and setting keys is therefore not an absolute requirement here as well.
## joins using < v1.9.6
setkey(X, a) # absolutely required
setkey(Y, a) # not absolutely required as long as 'a' is the first column
X[Y]
## joins using v1.9.6+
X[Y, on="a"]
# or if the column names are x_a and y_a respectively
X[Y, on=c("x_a" = "y_a")]
Note that the on= argument can be specified explicitly even for keyed joins.
The only operation that absolutely requires a key to be set is the foverlaps() function. But we are working on some more features which, when done, would remove this requirement.
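Since X and Y above are not defined, here is a minimal runnable sketch of an ad hoc join with on=. The two tables and their contents are made up for illustration.

```r
library(data.table)

# Hypothetical tables; neither has a key
X <- data.table(a = c(3L, 1L, 2L), x_val = c("c", "a", "b"))
Y <- data.table(a = c(2L, 3L, 4L), y_val = c(20, 30, 40))

# For each row of Y, look up the matching row(s) of X on column a
res <- X[Y, on = "a"]

print(res)        # a = 4 has no match in X, so x_val is NA there
print(key(X))     # NULL: X was neither keyed nor reordered
```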
So what's the reason for implementing on= argument?
There are quite a few reasons.
It clearly distinguishes the operation as one involving two data.tables. Just doing X[Y] does not make this as clear, although naming the variables appropriately can help.
It also makes the columns on which the join/subset is performed obvious from that line of code alone (without having to trace back to the corresponding setkey() line).
In operations where columns are added or updated by reference, on= operations are much more performant because the entire data.table does not need to be reordered just to add/update column(s). For example,
## compare
setkey(X, a, b) # why physically reorder X to just add/update a column?
X[Y, col := i.val]
## to
X[Y, col := i.val, on=c("a", "b")]
In the second case, we did not have to reorder. It's not computing the order that's time consuming, but physically reordering the data.table in RAM, and by avoiding it, we retain the original order, and it is also performant.
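A runnable version of the second form, on made-up data (table contents are assumptions for illustration), showing that the row order of X is left untouched:

```r
library(data.table)

# Hypothetical tables, for illustration only
X <- data.table(a = c(2L, 1L, 2L), b = c("y", "x", "x"))
Y <- data.table(a = c(2L, 1L), b = c("x", "x"), val = c(100, 200))

# Add column `col` to X by reference; i.val refers to Y's val column.
# Rows of X without a match in Y get NA.
X[Y, col := i.val, on = c("a", "b")]

print(X)          # same row order as before, with the new column
print(key(X))     # NULL: no key was set along the way
```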
Even otherwise, unless you're performing joins repetitively, there should be no noticeable performance difference between keyed and ad hoc joins.
This leads to the question, what advantage does keying a data.table have anymore?
Is there an advantage to keying a data.table?
Keying a data.table physically reorders it in RAM based on those column(s). Computing the order is not usually the time-consuming part; the reordering itself is. However, once the data is sorted in RAM, the rows belonging to the same group are all contiguous, which makes access very cache efficient. It's the sortedness that speeds up operations on keyed data.tables.
It is therefore essential to figure out if the time spent on reordering the entire data.table is worth the time to do a cache-efficient join/aggregation. Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.
In most cases therefore, there shouldn't be a need to set keys anymore. We recommend using on= wherever possible, unless setting key has a dramatic improvement in performance that you'd like to exploit.
Question: What do you think would be the performance like in comparison to a keyed join, if you use setorder() to reorder the data.table and use on=? If you've followed thus far, you should be able to figure it out :-).
A key is basically an index into a dataset, which allows for very fast and efficient sort, filter, and join operations. These are probably the best reasons to use data tables instead of data frames (the syntax for using data tables is also much more user friendly, but that has nothing to do with keys).
If you don't understand indexes, consider this: a phone book is "indexed" by name. So if I want to look up someone's phone number, it's pretty straightforward. But suppose I want to search by phone number (e.g., look up who has a particular phone number)? Unless I can "re-index" the phone book by phone number, it will take a very long time.
Consider the following example: suppose I have a table, ZIP, of all the zip codes in the US (>33,000) along with associated information (city, state, population, median income, etc.). If I want to look up the information for a specific zip code, the search (filter) is about 1000 times faster if I setkey(ZIP, zipcode) first.
Another benefit has to do with joins. Suppose I have a list of people and their zip codes in a data table (call it "PPL"), and I want to append information from the ZIP table (e.g. city, state, and so on). The following code will do it:
setkey(ZIP, zipcode)
setkey(PPL, zipcode)
full.info <- PPL[ZIP, nomatch = 0]  # nomatch = 0 drops unmatched rows (inner join)
This is a "join" in the sense that I'm joining the information from 2 tables based on a common field (zipcode). Joins like this on very large tables are extremely slow with data frames, and extremely fast with data tables. In a real-life example I had to do more than 20,000 joins like this on a full table of zip codes. With data tables the script took about 20 min. to run. I didn't even try it with data frames because it would have taken more than 2 weeks.
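Here is a tiny, made-up version of the same idea that can actually be run. The three zip codes, the names and the city column are placeholders for illustration, not real data from my project.

```r
library(data.table)

# Hypothetical miniature tables
ZIP <- data.table(zipcode = c("10001", "60601", "94105"),
                  city    = c("New York", "Chicago", "San Francisco"))
PPL <- data.table(name    = c("Ann", "Bob", "Cy"),
                  zipcode = c("94105", "10001", "10001"))

setkey(ZIP, zipcode)
setkey(PPL, zipcode)

# Keyed lookup (filter): binary search on the key instead of a full scan
one.zip <- ZIP[.("60601")]

# Keyed inner join: zip codes with no person are dropped
full.info <- PPL[ZIP, nomatch = 0]

print(one.zip)
print(full.info)
```

On three rows there is obviously no speed difference to observe; the point is only the syntax. The speed gain shows up on large tables with repeated lookups.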
IMHO you should not just read but study the FAQ and Intro material. It's easier to grasp if you have an actual problem to apply this to.
[Response to @Frank's comment]
Re: sorting vs. indexing - Based on the answer to this question, it appears that setkey(...) does in fact rearrange the rows in the table (i.e., a physical sort), and does not create an index in the database sense. This has some practical implications: for one thing, if you set the key in a table with setkey(...) and then change any of the values in the key column, data.table merely declares the table to be no longer sorted (by turning off the sorted attribute); it does not dynamically re-index to maintain the proper sort order (as would happen in a database). Also, "removing the key" using setkey(DT, NULL) does not restore the table to its original, unsorted order.
Re: filter vs. join - the practical difference is that filtering extracts a subset from a single dataset, whereas join combines data from two datasets based on a common field. There are many different kinds of join (inner, outer, left). The example above is an inner join (only records with keys common to both tables are returned), and this does have many similarities to filtering.
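The two implications in the sorting vs. indexing note can be checked with a short sketch. The table below is made up for illustration.

```r
library(data.table)

# Hypothetical toy table in "original" order 3, 1, 2
DT <- data.table(id = c(3L, 1L, 2L), v = c("c", "a", "b"))

setkey(DT, id)           # physically sorts the rows by id
setkey(DT, NULL)         # removes the key...
order.after.unkey <- DT$id   # ...but the rows stay sorted
key.after.unkey   <- key(DT)

setkey(DT, id)
DT[, id := rev(id)]      # change values in the key column
key.after.update  <- key(DT) # the key is dropped, nothing is re-sorted

print(order.after.unkey)
print(key.after.update)
print(DT)
```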

R How to add multiple columns (:=) to data.table based on results of rollapplyr in .SD

I am trying to add some columns to a large data.table that are based on rolling calculations split by a unique identifier.
Based on Equivalent to ddply(...,transform,...) in data.table I have generated this statement:
alldefaults[,`:=`(.SD[,list(obsPaymentDownMax24m=rollapplyr(obsPaymentsDown,24,max,partial=TRUE)
,obsPaymentDownAvg24m=rollapplyr(obsPaymentsDown,24,mean,partial=TRUE)
,obsPaymentDownMax12m=rollapplyr(obsPaymentsDown,12,max,partial=TRUE)
,obsPaymentDownAvg12m=rollapplyr(obsPaymentsDown,12,mean,partial=TRUE))]),by=OriginalApplicationID]
It produces an error
Error in `[.data.table`(alldefaults, , `:=`(.SD[, list(obsPaymentDownMax24m = rollapplyr(obsPaymentsDown, :
In `:=`(col1=val1, col2=val2, ...) form, all arguments must be named.
When I run this without the := function it works well, but the result is a new dataset, which would then have to be joined back on.
Inserting the assignment within the .SD
alldefaults[,.SD[,`:=`(list(obsPaymentDownMax24m=rollapplyr(obsPaymentsDown,24,max,partial=TRUE)
,obsPaymentDownAvg24m=rollapplyr(obsPaymentsDown,24,mean,partial=TRUE)
,obsPaymentDownMax12m=rollapplyr(obsPaymentsDown,12,max,partial=TRUE)
,obsPaymentDownAvg12m=rollapplyr(obsPaymentsDown,12,mean,partial=TRUE)))],by=OriginalApplicationID]
Produces this error
Error in `[.data.table`(.SD, , `:=`(list(obsPaymentDownMax24m = rollapplyr(obsPaymentsDown, :
.SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.
Is there a trick to making this update work that I've missed?
PS - Not sure if this a hefty enough question to require a reproducible example as it seems primarily syntax oriented and is hopefully easy to point out what the statement ought to be. Also if anyone has recommendations for making this faster again, I'd be very appreciative!
I'm guessing (guessing, because the question is not well formed, which is probably why you got downvoted) you want to do this:
alldefaults[,`:=`(obsPaymentDownMax24m=rollapplyr(obsPaymentsDown,24,max,partial=TRUE)
,obsPaymentDownAvg24m=rollapplyr(obsPaymentsDown,24,mean,partial=TRUE)
,obsPaymentDownMax12m=rollapplyr(obsPaymentsDown,12,max,partial=TRUE)
,obsPaymentDownAvg12m=rollapplyr(obsPaymentsDown,12,mean,partial=TRUE)),by=OriginalApplicationID]
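Since there is no reproducible example, here is the same pattern on a made-up table with a window of 2 instead of 24 and 12. The names dt, id and pay are placeholders for illustration. rollapplyr() comes from the zoo package.

```r
library(data.table)
library(zoo)

# Hypothetical toy data: two IDs with four observations each
dt <- data.table(id  = rep(c("A", "B"), each = 4),
                 pay = c(1, 3, 2, 5, 4, 0, 6, 1))

# Add several rolling columns by reference, computed within each id.
# partial = TRUE lets the first window of each group be shorter than 2.
dt[, `:=`(max2 = rollapplyr(pay, 2, max,  partial = TRUE),
          avg2 = rollapplyr(pay, 2, mean, partial = TRUE)),
   by = id]

print(dt)
```

The key point is that the functional form of := takes named arguments directly, name = value, with no list() or .SD wrapped around them.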

Sort a data.frame by multiple columns whose names are contained in a single object?

I want to sort a data.frame by multiple columns, ideally using base R without any external packages (though if necessary, so be it). Having read How to sort a dataframe by column(s)?, I know I can accomplish this with the order() function as long as I either:
Know the explicit names of each of the columns.
Have a separate object representing each individual column by which to sort.
But what if I only have one vector containing multiple column names, of length that's unknown in advance?
Say the vector is called sortnames.
data[order(data[, sortnames]), ] won't work, because order() treats that as a single sorting argument.
data[order(data[, sortnames[1]], data[, sortnames[2]], ...), ] will work if and only if I specify the exact correct number of sortname values, which I won't know in advance.
Things I've looked at but not been totally happy with:
eval(parse(text=paste("data[with(data, order(", paste(sortnames, collapse=","), ")), ]"))). Maybe this is fine, but I've seen plenty of hate for using eval(), so asking for alternatives seemed worthwhile.
I may be able to use the Deducer library to do this with sortData(), but like I said, I'd rather avoid using external packages.
If I'm being too stubborn about not using external packages, let me know. I'll get over it. All ideas appreciated in advance!
You can use do.call:
data<-data.frame(a=rnorm(10),b=rnorm(10),c=rnorm(10))
sortnames <- c("a", "b")
data[do.call("order", data[sortnames]), ]
This trick is useful when you want to pass multiple arguments to a function and those arguments are conveniently held in a named list.
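With random data it is hard to see that the sort is right, so here is the same idea on a small fixed data frame (made up for illustration). I also drop the list names with unname(), as a precaution in case a column happens to share its name with an argument of order() such as decreasing or method.

```r
# Hypothetical fixed data so the result can be checked by eye
df <- data.frame(a = c(2, 1, 2, 1),
                 b = c(1, 2, 0, 1),
                 c = c("w", "x", "y", "z"),
                 stringsAsFactors = FALSE)

sortnames <- c("a", "b")   # any number of column names

# Each selected column becomes one positional argument of order()
idx    <- do.call(order, unname(as.list(df[sortnames])))
sorted <- df[idx, ]

print(sorted)
```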

Resources