A strange result (bug?) with the 'combn' function and the 'data.table' package [all possible combinations by group]

I am trying to find all possible combinations by group, using the combn function with the data.table package as described in the post "Generate All ID Pairs, by group with data.table in R".
This gives me the expected result:
dat1 <- data.table(ids=1:4, groups=c("B","A","B","A"))
dat1
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 A
dat1[, as.data.table(t(combn(ids, 2))), .(groups)]
groups V1 V2
1: B 1 3
2: A 2 4
But this gives me a strange result. I spent about 3 hours trying to understand it and could not. Is it a bug?
dat2 <- data.table(ids=1:4, groups=c("B","A","B","C"))
dat2
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 C
dat2[, as.data.table(t(combn(ids, 2))), .( groups)]
groups V1 V2
1: B 1 3
2: A 1 2
3: C 1 2
4: C 1 3
5: C 1 4
6: C 2 3
7: C 2 4
8: C 3 4
I would really appreciate an explanation.
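A likely explanation, offered as a sketch rather than a confirmed diagnosis: the documentation for combn says that when x is a single positive integer, it is treated as seq_len(x). In dat2, groups A and C each contain exactly one id (2 and 4), so combn(ids, 2) is evaluated as combn(2, 2) and combn(4, 2), which matches the 1 row for A and the 6 rows for C shown above. The guard on .N below is an assumed fix (it is not from the original post) that skips groups with fewer than two ids.

```r
library(data.table)

# combn() treats a single positive integer x as seq_len(x),
# so this enumerates pairs from 1:4 rather than from the single id 4.
n_pairs <- ncol(combn(4L, 2))

dat2 <- data.table(ids = 1:4, groups = c("B", "A", "B", "C"))

# Assumed fix: only call combn() for groups that have at least two ids.
# Groups for which j evaluates to NULL are dropped from the result.
pairs <- dat2[, if (.N >= 2L) as.data.table(t(combn(ids, 2L))), by = groups]
pairs
```

With this guard only group B, the one group that actually has two ids, contributes a pair.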

Related

Merge multiple numeric column as list typed column in data.table [R]

I'm trying to find a way to merge multiple numeric columns into a new list-type column.
Data Table
dt <- data.table(
a=c(1,2,3),
b=c(4,5,6),
c=c(7,8,9)
)
Expected Result
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Attempt 1
I tried building the list with dt[,d:=list(c(a,b,c))], but that concatenates all values of all three columns into every row, which gives the incorrect result:
a b c d
1: 1 4 7 1,2,3,4,5,6,...
2: 2 5 8 1,2,3,4,5,6,...
3: 3 6 9 1,2,3,4,5,6,...
Group by row and place each row's elements in a list:
dt[, d := .(list(unlist(.SD, recursive = FALSE))), 1:nrow(dt)]
Output:
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Another option is paste and strsplit (note that the resulting list elements are character vectors, not numeric):
dt[, d := strsplit(do.call(paste, c(.SD, sep=",")), ",")]
Or use transpose:
dt[, d := lapply(data.table::transpose(unname(.SD)), unlist)]
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Or with purrr::pmap:
dt[, d := purrr::pmap(.SD, ~c(...))]
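One practical point about the options above: once d exists, .SD also contains d, so re-running any of them picks up the list column too. The sketch below restricts .SD with .SDcols and compares the element types of two of the approaches. The names dt2, cols, d_num and d_chr are made up for this illustration.

```r
library(data.table)

dt2 <- data.table(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
cols <- c("a", "b", "c")  # only these columns are combined

# Row-wise grouping: each list element is a numeric vector
# (unlist keeps the column names on the elements)
dt2[, d_num := .(list(unlist(.SD, recursive = FALSE))),
    by = seq_len(nrow(dt2)), .SDcols = cols]

# paste + strsplit: each list element is a character vector
dt2[, d_chr := strsplit(do.call(paste, c(.SD, sep = ",")), ","),
    .SDcols = cols]

sapply(dt2$d_num, class)
sapply(dt2$d_chr, class)
```

If the list column is meant for later arithmetic, the numeric variants are probably the safer choice.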

Subsetting data.table by values with repetition

I would like to subset a data.table in a way that returns some rows repeatedly. This works well with indices, but I am not sure how to do it simply with values, especially when a value appears in more than one row.
E.g.:
library(data.table)
dt<-data.table(x1=c('a','a','b','b','c','c'),x2=c(1,2,3,4,5,6))
xsel<-c('a','b','a','a','c','b')
dt[x1%in%xsel,]
will provide this output:
x1 x2
1: a 1
2: a 2
3: b 3
4: b 4
5: c 5
6: c 6
I would like to get it in the original order and with repetition exactly as it is in the xsel vector. Is it possible to do it in a reasonably simple way without looping? Thanks.
Using:
setkey(dt, x1) # set the key
dt[J(xsel)] # or: dt[.(xsel)]
gives:
x1 x2
1: a 1
2: a 2
3: b 3
4: b 4
5: a 1
6: a 2
7: a 1
8: a 2
9: c 5
10: c 6
11: b 3
12: b 4
Without setting the key, you could use:
dt[.(xsel), on = .(x1)]
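A small check of the join-based approach, as a sketch: the result follows the order of xsel, and each value in xsel contributes all of its matching rows. The value "z" and the names res and res_nomatch are made up for this illustration. By default a value with no match produces a row of NA, and nomatch = NULL drops such rows instead.

```r
library(data.table)

dt <- data.table(x1 = c("a", "a", "b", "b", "c", "c"),
                 x2 = c(1, 2, 3, 4, 5, 6))
xsel <- c("a", "b", "a", "a", "c", "b")

# Join on x1: one block of matching rows per element of xsel, in xsel order
res <- dt[.(xsel), on = .(x1)]

# Every x1 value has exactly two rows in dt, so each element of xsel appears twice
identical(res$x1, rep(xsel, each = 2))

# "z" is a hypothetical value that is not present in dt; nomatch = NULL drops it
res_nomatch <- dt[.(c("a", "z")), on = .(x1), nomatch = NULL]
```

Unlike setkey, the on = form leaves the row order of dt itself untouched.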

Keep only 'by' variables when collapsing data.table

I have a very large data.table:
DT <- data.table(a=c(1,1,1,1,2,2,2,2,3,3,3,3),b=c(1,1,2,2),c=1:12)
And I need to collapse it by several variables, e.g. list(a,b). Easy:
DT[,sum(c),by=list(a,b)]
a b V1
1: 1 1 3
2: 1 2 7
3: 2 1 11
4: 2 2 15
5: 3 1 19
6: 3 2 23
However, I don't want to perform any operation on c; I just want to drop it:
DT[,,by=list(a,b)] # includes a,b,c, thus does not collapse
DT[,list(),by=list(a,b)] # zero rows
DT[,a,by=list(a,b)] # what I want but adds extraneous column a after 'by' columns
How can I specify X below to get the indicated result?
DT[,X,by=list(a,b)]
a b
1: 1 1
2: 1 2
3: 2 1
4: 2 2
5: 3 1
6: 3 2
unique.data.table has a by argument; you can then subset the result to keep only the columns you want.
For example:
unique(DT, by = c('a', 'b'))[, c('a','b')]
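As a sketch, here is the answer's approach next to an equivalent alternative that selects the grouping columns first and then de-duplicates. The variable names res1 and res2 are made up, and b is written out with rep() so the example does not depend on how data.table() recycles shorter vectors.

```r
library(data.table)

DT <- data.table(a = rep(c(1, 2, 3), each = 4),
                 b = rep(c(1, 1, 2, 2), times = 3),
                 c = 1:12)

# De-duplicate on (a, b), then keep only those columns
res1 <- unique(DT, by = c("a", "b"))[, .(a, b)]

# Alternative: drop c first, then de-duplicate the remaining columns
res2 <- unique(DT[, .(a, b)])

all.equal(res1, res2)
```

Both forms return the six distinct (a, b) pairs in order of first appearance.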

Extract the best attributes from a data.table

I have a data.table:
> (a <- data.table(id=c(1,1,1,2,2,3),
attribute=c("a","b","c","a","b","c"),
importance=1:6,
key=c("id","importance")))
id attribute importance
1: 1 a 1
2: 1 b 2
3: 1 c 3
4: 2 a 4
5: 2 b 5
6: 3 c 6
I want:
--1-- sort it by the second key in the decreasing order (i.e., the most important attributes should come first)
--2-- select the top 2 (or 10) attributes for each id, i.e.:
id attribute importance
3: 1 c 3
2: 1 b 2
5: 2 b 5
4: 2 a 4
6: 3 c 6
--3-- pivot the above:
id attribute.1 importance.1 attribute.2 importance.2
1 c 3 b 2
2 b 5 a 4
3 c 6 NA NA
It appears that the last operation can be done with something like:
a[,{
tmp <- .SD[.N:1];
list(a1 = tmp$attribute[1],
i1 = tmp$importance[1])
}, by=id]
Is this The Right Way?
How do I do the first two tasks?
I'd do the first two tasks like this:
a[a[, .I[.N:(.N-1)], by=list(id)]$V1]
The inner a[, .I[.N:(.N-1)], by=list(id)] gives you the row indices in the order you require for every unique group in id. Then you subset a with the V1 column (which holds those indices).
You'll have to take care of zero or negative indices when a group has fewer rows than you request, maybe with something like:
a[a[, .I[seq.int(.N, max(.N-1L, 1L))], by=list(id)]$V1]
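An alternative sketch for all three tasks, not taken from the answer above: sort with order(), take the first n rows per id with head(), then pivot with dcast(), which accepts several value.var columns at once. The names n, top, pos and wide are made up, and the pivoted columns are named attribute_1, importance_1, and so on, rather than with the dotted names used in the question.

```r
library(data.table)

a <- data.table(id = c(1, 1, 1, 2, 2, 3),
                attribute = c("a", "b", "c", "a", "b", "c"),
                importance = 1:6)
n <- 2L  # how many top attributes to keep per id

# Tasks 1 and 2: most important first, then the first n rows of each id;
# head() copes with groups that have fewer than n rows
top <- a[order(id, -importance), head(.SD, n), by = id]

# Task 3: number the rows within each id, then pivot to one row per id
top[, pos := rowid(id)]
wide <- dcast(top, id ~ pos, value.var = c("attribute", "importance"))
wide
```

Ids with fewer than n attributes get NA in the unused columns, as in the desired output for id 3.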

data preparation part II

There's another problem I encountered which I think is quite interesting:
dt <- data.table(K=c("A","A","A","B","B","B"),A=c(2,3,4,1,3,4),B=c(3,3,3,1,1,1))
dt
K A B
1: A 2 3
2: A 3 3
3: A 4 3
4: B 1 1
5: B 3 1
6: B 4 1
Now I want a somewhat "higher" level view of the data. For each letter in K there should be only one line, and "A_sum" should hold the number of A values that share the same value of B. So there are three values for B=3 and three values for B=1.
Resulting data.table:
dt_new
K A_sum B
1: A 3 3
2: B 3 1
It's not clear how you want to treat K, but here's one option:
dt_new <- dt[, list(A_sum = length(A)), by = list(K, B)]
# K B A_sum
# 1: A 3 3
# 2: B 1 3
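The same count can be written with data.table's .N symbol, which holds the number of rows in each group. This is a sketch of an equivalent form, with setcolorder added only to match the column order shown in the question.

```r
library(data.table)

dt <- data.table(K = c("A", "A", "A", "B", "B", "B"),
                 A = c(2, 3, 4, 1, 3, 4),
                 B = c(3, 3, 3, 1, 1, 1))

# .N is the number of rows in each (K, B) group, equal to length(A) here
dt_new <- dt[, .(A_sum = .N), by = .(K, B)]

# Reorder columns to K, A_sum, B as in the desired output
setcolorder(dt_new, c("K", "A_sum", "B"))
dt_new
```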
