R data.table J behavior

I am still puzzled by the behavior of data.table J.
> DT = data.table(A=7:3,B=letters[5:1])
> DT
A B
1: 7 e
2: 6 d
3: 5 c
4: 4 b
5: 3 a
> setkey(DT, A, B)
> DT[J(7,"e")]
A B
1: 7 e
> DT[J(7,"f")]
A B
1: 7 f # <- there is no such line in DT
But there is no such row in DT. Why do we get this result?

J(7, 'f') is literally a single-row data.table that you are joining your own data.table with. When you call x[i], data.table takes each row of i and finds all of its matches in x. By default, rows of i that match nothing are still returned, with NA in the columns that come from x. This is easier to see after adding another column to DT:
DT <- data.table(A=7:3,B=letters[5:1],C=letters[1:5])
setkey(DT, A, B)
DT[J(7,"f")]
# A B C
# 1: 7 f NA
What you are seeing is the single row of J(7,"f"), which has no match in DT. To stop data.table from returning non-matching rows, use nomatch=0:
DT[J(7,"f"), nomatch=0]
# Empty data.table (0 rows) of 3 cols: A,B,C
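The two behaviours can be compared side by side. The following is a minimal sketch, assuming a two-row lookup table (here called i, a name chosen for illustration) in place of J(): one row matches DT and one does not.

```r
library(data.table)

DT <- data.table(A = 7:3, B = letters[5:1], C = letters[1:5])
setkey(DT, A, B)

# Hypothetical lookup table: (7, "e") exists in DT, (7, "f") does not
i <- data.table(A = c(7L, 7L), B = c("e", "f"))

res_na    <- DT[i]                 # one row per row of i; C is NA where nothing matched
res_inner <- DT[i, nomatch = 0L]   # rows of i without a match are dropped
```

With the default, the result always has at least one row per row of i, so an NA in a column that comes from DT (C here) is the signal that the key was not found.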

Perhaps adding an additional column will shed some light on what is going on.
DT[, C:=paste0(A, B)]
DT[J(7,"e")]
### A B C
### 1: 7 e 7e
DT[J(7,"f")]
### A B C
### 1: 7 f NA
This is the same behavior as without J:
setkey(DT, B)
DT["a"]
### B A C
### 1: a 3 3a
DT["A"]
### B A C
### 1: A NA NA
You can use the nomatch argument to change this behavior.
DT[J(7,"f"), nomatch=0L]
### Empty data.table (0 rows) of 3 cols: A,B,C

Related

R: Examine to see if a Datatable is subset of another Datatable

How can I check whether a data table is a subset of another data table, regardless of row and column order? For instance, imagine someone used rbind on DT_x and DT_y, removed the duplicates, and created DT_Z. Now I want to compare DT_x and DT_Z and get a result which states that DT_x is a subset of DT_Z.
As a very simple example:
DT1 <- data.table(a= LETTERS[1:10], v=1:10)
DT2 <- data.table(a= LETTERS[1:6], v=1:6)
DT1
a v
1: A 1
2: B 2
3: C 3
4: D 4
5: E 5
6: F 6
7: G 7
8: H 8
9: I 9
10: J 10
DT2
a v
1: A 1
2: B 2
3: C 3
4: D 4
5: E 5
6: F 6
I am sure all.equal(DT1, DT2) will not answer my question.
I think you can use data.table's fintersect() and fsetequal():
is_df1_subset_of_df2 <- function(df1, df2) {
intersection <- data.table::fintersect(df1, df2)
data.table::fsetequal(df1, intersection)
}
The first line picks the rows of df1 that also exist in df2.
The second line checks if that set is all of df1.
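A variant of the same idea, sketched here under the assumption that both tables have the same column names, uses data.table::fsetdiff() instead: x is a subset of y when no row of x is missing from y. The helper name is_subset and the table DT2_shuffled are made up for this illustration. Columns of y are reordered to match x first, because the set functions expect the same column order.

```r
library(data.table)

# Hypothetical helper: TRUE when every row of x also appears in y
is_subset <- function(x, y) {
  y <- y[, names(x), with = FALSE]   # align column order with x
  nrow(fsetdiff(x, y)) == 0L         # no row of x is absent from y
}

DT1 <- data.table(a = LETTERS[1:10], v = 1:10)
DT2 <- data.table(a = LETTERS[1:6],  v = 1:6)

# Same rows as DT2, but with rows reversed and columns swapped
DT2_shuffled <- DT2[6:1, .(v, a)]

is_subset(DT2_shuffled, DT1)   # TRUE
is_subset(DT1, DT2)            # FALSE
```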

Expand data.table with combinations of two columns given condition in another column

I have a data.table that gives me the connections between locations (origin and destination) for different bus routes (route_id).
library(data.table)
library(magrittr)
# data for reproducible example
dt <- data.table( origin = c('A','B','C', 'F', 'G', 'H'),
destination = c('B','C','D', 'G', 'H', 'I'),
freq = c(2,2,2,10,10,10),
route_id = c(1,1,1,2,2,2), stringsAsFactors=FALSE )
# > dt
# origin destination freq route_id
# 1: A B 2 1
# 2: B C 2 1
# 3: C D 2 1
# 4: F G 10 2
# 5: G H 10 2
# 6: H I 10 2
For the purposes of what I'd want to do, if there is a route_id that gives a connection A-B and a connection B-C, then I want to add to the data a connection A-C for that same route_id and so on.
Problems: So far, I've created a simple code that does this job but:
it uses a for loop that takes a long time (my real data has hundreds of thousands observations)
it still does not cope well with direction. The direction of the connections matters here: although there is a B-C connection in the original data, there should be no C-B in the output.
My slow solution
# loop
# a) get a data subset corresponding to each route_id
# b) get all combinations of origin-destination pairs
# c) row bind the new pairs to original data
for (i in unique(dt$route_id)) {
temp <- dt[ route_id== i,]
subset_of_pairs <- expand.grid(temp$origin, temp$destination) %>% setDT()
setnames(subset_of_pairs, c("origin", "destination"))
dt <- rbind(dt, subset_of_pairs, fill=T)
}
# assign route_id and freq to new pairs
dt[, route_id := route_id[1L], by=origin]
dt[, freq := freq[1L], by=route_id]
# Keep only unique pairs whose origin and destination differ
dt[, origin := as.character(origin) ][, destination := as.character(destination) ]
dt <- dt[ origin != destination, ][order(route_id, origin, destination)]
dt <- unique(dt)
Desired output
origin destination freq route_id
1: A B 2 1
2: A C 2 1
3: A D 2 1
4: B C 2 1
5: B D 2 1
6: C D 2 1
7: F G 10 2
8: F H 10 2
9: F I 10 2
10: G H 10 2
11: G I 10 2
12: H I 10 2
One way:
res = dt[, {
stops = c(origin, last(destination))
pairs = combn(.N + 1L, 2L)
.(o = stops[pairs[1,]], d = stops[pairs[2,]])
}, by=route_id]
route_id o d
1: 1 A B
2: 1 A C
3: 1 A D
4: 1 B C
5: 1 B D
6: 1 C D
7: 2 F G
8: 2 F H
9: 2 F I
10: 2 G H
11: 2 G I
12: 2 H I
This is assuming that c(origin, last(destination)) is a full list of stops in order. If dt does not contain enough info to construct a complete order, the task becomes much more difficult.
If vars from dt are needed, an update join like res[dt, on=.(route_id), freq := i.freq] works.
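Putting the two steps together, a minimal end-to-end sketch on the example data might look like this. It assumes, as above, that the rows of each route are already in stop order and that freq is constant within a route; the helper table freq_by_route is introduced only for this illustration.

```r
library(data.table)

dt <- data.table(origin      = c("A", "B", "C", "F", "G", "H"),
                 destination = c("B", "C", "D", "G", "H", "I"),
                 freq        = c(2, 2, 2, 10, 10, 10),
                 route_id    = c(1, 1, 1, 2, 2, 2))

# All ordered (earlier stop, later stop) pairs per route
res <- dt[, {
  stops <- c(origin, last(destination))   # full stop sequence of the route
  pairs <- combn(.N + 1L, 2L)             # index pairs, first index < second
  .(origin = stops[pairs[1, ]], destination = stops[pairs[2, ]])
}, by = route_id]

# One freq value per route, then an update join to attach it to res
freq_by_route <- unique(dt[, .(route_id, freq)])
res[freq_by_route, on = .(route_id), freq := i.freq]
```

Because combn() only produces index pairs with the first index smaller than the second, the reverse direction (for example C-B) never appears.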
Tasks like this always risk running out of memory. In this case, the OP has up to a million rows containing groups of up to 341 stops, so the end result could be as large as 1e6/341*choose(341,2) = 170 million rows. That's manageable, but in general this sort of analysis does not scale.
How it works
Generally, data.table syntax can be treated just like a loop over groups:
DT[, {
...
}, by=g]
This has a few advantages over loops:
Nothing created in the ... body will pollute the workspace.
All columns can be referenced by name.
Special symbols .N, .SD, .GRP and .BY are available, along with .() for list().
In the code above, pairs finds pairs of indices taken from 1 .. #stops (=.N+1 where .N is the number of rows in the subset of the data associated with a given route_id). It is a matrix with the first row corresponding to the first element of a pair; and the second row with the second. The ... should evaluate to a list of columns; and here list() is abbreviated as .().
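As a small illustration of the loop-over-groups view, here is a sketch on a toy table (the table and the column names g, v, n, grp and total are made up for this example). The braces can hold several statements; the last expression must be a list of columns.

```r
library(data.table)

DT <- data.table(g = c("x", "x", "y"), v = 1:3)

out <- DT[, {
  total <- sum(v)     # temporary variable, local to the j expression
  .(n     = .N,       # number of rows in the current group
    grp   = .GRP,     # running group counter: 1, 2, ...
    total = total)
}, by = g]
```

Here out has one row per value of g, with n equal to 2 and 1, grp equal to 1 and 2, and total equal to 3 for both groups.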
Further improvements
I guess the time is mostly devoted to computing combn many times. If multiple routes have the same #stops, this can be addressed by computing beforehand:
Ns = dt[,.N, by=route_id][, unique(N)]
cb = lapply(setNames(Ns + 1L, Ns), combn, 2L) # named by .N; each entry holds the pairs for .N + 1 stops
Then grab pairs = cb[[as.character(.N)]] in the main code. Alternately, define a pairs function that uses memoization to avoid recomputing.
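The memoization idea could be sketched as follows. The names pairs_cache and pairs_memo are hypothetical, and the cache is a plain environment keyed by the number of stops; this is one possible way to do it, not the only one.

```r
library(data.table)

pairs_cache <- new.env()

# Hypothetical helper: combn(n, 2), computed at most once per distinct n
pairs_memo <- function(n) {
  key <- as.character(n)
  if (is.null(pairs_cache[[key]])) {
    pairs_cache[[key]] <- combn(n, 2L)
  }
  pairs_cache[[key]]
}

dt <- data.table(origin      = c("A", "B", "C", "F", "G", "H"),
                 destination = c("B", "C", "D", "G", "H", "I"),
                 route_id    = c(1, 1, 1, 2, 2, 2))

res <- dt[, {
  stops <- c(origin, last(destination))
  pairs <- pairs_memo(.N + 1L)   # number of stops = rows in the group + 1
  .(o = stops[pairs[1, ]], d = stops[pairs[2, ]])
}, by = route_id]
```

Both example routes have four stops, so combn() runs only once and the second group reuses the cached matrix.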

Group a data.table using a column which is list

I have a really big problem and looping through the data.table to do what I want is too slow, so I am trying to get around looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
i j k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6 b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
Right now I am getting the following error:
Error in `[.data.table`(a, , sum(i), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to group first all the rows having "a" in column k and calculate sum(j) and then all rows having "b" and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hints on how to do this efficiently? I can't melt the column k by repeating the rows, since the data.table would become too big in my case.
I think this might work:
a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]
k V1
1: a 4
2: b 8
3: c 2
If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j) ,k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes
unnest(a, k) %>%
group_by(k) %>%
summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating rows of cols i:j to match the unlisted k. The data should probably be kept in this long format rather than in a list column. From there, as in @MikeyMike's answer, we can dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
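If memory is the main concern, one option is to expand only the columns the aggregation needs rather than every column of a. The sketch below assumes only j and k are required; the intermediate table name long is made up for this example.

```r
library(data.table)

a <- data.table(i = c(1, 2, 3), j = c(2, 2, 6),
                k = list(c("a", "b"), c("a", "c"), c("b")))

# Long table with just the grouping value and the value to sum:
# each j is repeated once per element of its k entry
long <- data.table(k = unlist(a$k), j = rep(a$j, lengths(a$k)))

res <- long[, .(V1 = sum(j)), by = k]
```

The long table still has one row per element of k, but it carries two columns instead of the full width of a.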

Get the last row of a previous group in data.table

This is what my data table looks like:
library(data.table)
dt <- fread('
Product Group LastProductOfPriorGroup
A 1 NA
B 1 NA
C 2 B
D 2 B
E 2 B
F 3 E
G 3 E
')
The LastProductOfPriorGroup column is my desired column. I am trying to fetch the product from last row of the prior group. So in the first two rows, there are no prior groups and therefore it is NA. In the third row, the product in the last row of the prior group 1 is B. I am trying to accomplish this by
dt[,LastGroupProduct:= shift(Product,1), by=shift(Group,1)]
to no avail.
You could do
dt[, newcol := shift(dt[, last(Product), by = Group]$V1)[.GRP], by = Group]
This results in the following updated dt, where newcol matches your desired column with the unnecessarily long name. ;)
Product Group LastProductOfPriorGroup newcol
1: A 1 NA NA
2: B 1 NA NA
3: C 2 B B
4: D 2 B B
5: E 2 B B
6: F 3 E E
7: G 3 E E
Let's break the code down from the inside out. I will use ... to denote the accumulated code:
dt[, last(Product), by = Group]$V1 is getting the last values from each group as a character vector.
shift(...) shifts the character vector in the previous call
dt[, newcol := ...[.GRP], by = Group] groups by Group and uses the internal .GRP values for indexing
Update: Frank brings up a good point about my code above calculating the shift for every group over and over again. To avoid that, we can use either
shifted <- shift(dt[, last(Product), Group]$V1)
dt[, newcol := shifted[.GRP], by = Group]
so that we don't calculate the shift for every group. Or, we can take Frank's nice suggestion in the comments and do the following.
dt[dt[, last(Product), by = Group][, v := shift(V1)], on="Group", newcol := i.v]
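The update join can also be written in separate steps, which may be easier to follow. This is a sketch on the same example data; the names last_by_group, last_product and prior are chosen for illustration only.

```r
library(data.table)

dt <- data.table(Product = LETTERS[1:7], Group = c(1, 1, 2, 2, 2, 3, 3))

# Step 1: last product of each group (one row per group)
last_by_group <- dt[, .(last_product = last(Product)), by = Group]

# Step 2: shift down by one so each group sees the previous group's value
last_by_group[, prior := shift(last_product)]

# Step 3: update join back onto the original rows
dt[last_by_group, on = "Group", newcol := i.prior]
```

After the join, newcol is NA for group 1, "B" for group 2 and "E" for group 3.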
Another way is to save the last group's value in a variable.
this = NA_character_ # initialize
dt[, LastProductOfPriorGroup:={ last<-this; this<-last(Product); last }, by=Group]
dt
Product Group LastProductOfPriorGroup
1: A 1 NA
2: B 1 NA
3: C 2 B
4: D 2 B
5: E 2 B
6: F 3 E
7: G 3 E
NB: last() is a data.table function which returns the last item of a vector (of the Product column in this case).
This should also be fast since no logic is being invoked to fetch the last group's value; it just relies on the groups running in order (which they do).

Filter data.table on multiple criteria in the same column

I have the following data.table:
> dt= data.table(num=c(1,2,1,1,2, 3, 3,2), letters[1:8])
> dt
num V2
1: 1 a
2: 1 c
3: 1 d
4: 2 b
5: 2 e
6: 2 h
7: 3 f
8: 3 g
I want to keep all rows where num equals 1 or 2 and get the resulting data.table. I can do this with:
> dt[num==1 | num==2,]
num V2
1: 1 a
2: 1 c
3: 1 d
4: 2 b
5: 2 e
6: 2 h
Or:
rbind(setkey(dt, num)[J(1)],setkey(dt, num)[J(2)])
But is there any option with setkey so that the second expression is shorter like:
setkey(dt, num)[1|2]
Since setkey code is quicker for very large amount ... I would appreciate some help!
In addition to KFB's comment:
setkey(dt, num)[num %in% c(1,2)]
If the filtering values are integers in a sequence:
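setkey(dt,num)[J(1:2)] # OR, with the values stored in a vector (here called seq):
setkey(dt,num)[J(seq)] # a bare numeric vector in i would select row numbers, not key values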
If they are arbitrary:
setkey(dt,num)[J(c(1,2))]
NOTE 1: This may not work in older versions of data.table
NOTE 2: . is an alias for J which is more readable:
setkey(dt,num)[.(1:2)]
FWIW, I like using the magrittr package with data.table and making everything as clear as possible:
dt %>% setkey(num)
dt[ .(1:2) ]
The drawback is that you can't do this neatly on one line.
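For comparison, the same filter can be sketched without setting a key at all, using the on= argument for an ad hoc join. The result names by_in and by_join are made up for this example, and the column V2 is named explicitly here.

```r
library(data.table)

dt <- data.table(num = c(1, 2, 1, 1, 2, 3, 3, 2), V2 = letters[1:8])

# Vector scan: rows come back in their original order
by_in <- dt[num %in% c(1, 2)]

# Join on num: rows come back grouped by the order of the lookup values
by_join <- dt[.(c(1, 2)), on = "num"]
```

Both return the same six rows. The join form returns them in the order of the lookup values (all num == 1 rows, then all num == 2 rows) and leaves dt itself unsorted.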
