Merging a sum by reference with data.table


Let's say I have two data.tables, dt_a and dt_b, defined as below.
library(data.table)
set.seed(20201111L)
dt_a <- data.table(
foo = c("a", "b", "c")
)
dt_b <- data.table(
bar = sample(c("a", "b", "c"), 10L, replace=TRUE),
value = runif(10L)
)
dt_b[]
## bar value
## 1: c 0.4904536
## 2: c 0.9067509
## 3: b 0.1831664
## 4: c 0.0203943
## 5: c 0.8707686
## 6: a 0.4224133
## 7: a 0.6025349
## 8: b 0.4916672
## 9: a 0.4566726
## 10: b 0.8841110
I want to left join dt_b onto dt_a by reference, summing over the multiple matches. One way to do so is to first create a summary of dt_b (which resolves the multiple matches) and merge it afterwards.
dt_b_summary <- dt_b[, .(value=sum(value)), bar]
dt_a[dt_b_summary, value_good:=value, on=c(foo="bar")]
dt_a[]
## foo value_good
## 1: a 1.481621
## 2: b 1.558945
## 3: c 2.288367
However, this allocates memory for the intermediate object dt_b_summary, which is inefficient.
I would like to get the same result by joining on dt_b directly and summing over the multiple matches. I'm looking for something like the code below, but it does not work: sum(value) is computed once over all matched rows rather than per group.
dt_a[dt_b, value_bad:=sum(value), on=c(foo="bar")]
dt_a[]
## foo value_good value_bad
## 1: a 1.481621 5.328933
## 2: b 1.558945 5.328933
## 3: c 2.288367 5.328933
Does anyone know whether this is possible?

We can use by = .EACHI, which evaluates j once for each row of i:
library(data.table)
dt_b[dt_a, .(value = sum(value)), on = .(bar = foo), by = .EACHI]
# bar value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
If we want to update the original object 'dt_a' by reference:
dt_a[, value := dt_b[.SD, sum(value), on = .(bar = foo), by = .EACHI]$V1]
dt_a
# foo value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
For multiple columns
dt_b$value1 <- dt_b$value
nm1 <- c('value', 'value1')
dt_a[, (nm1) := dt_b[.SD, lapply(.SD, sum),
on = .(bar = foo), by = .EACHI][, .SD, .SDcols = nm1]]
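
As a quick sanity check, the two-step summary join from the question and the by = .EACHI join above can be compared on a small deterministic example. This is only a sketch: the data and the column names value_two_step and value_eachi are made up for illustration, and dt_a is passed directly as i instead of .SD.

```r
library(data.table)

# Small deterministic stand-ins for the tables in the question
dt_a <- data.table(foo = c("a", "b", "c"))
dt_b <- data.table(bar = c("a", "a", "b", "c", "c", "c"),
                   value = c(1, 2, 3, 4, 5, 6))

# Two-step approach: summarise first, then update join
dt_b_summary <- dt_b[, .(value = sum(value)), by = bar]
dt_a[dt_b_summary, value_two_step := i.value, on = c(foo = "bar")]

# One-step approach: sum(value) is evaluated once per row of dt_a
dt_a[, value_eachi := dt_b[dt_a, sum(value), on = .(bar = foo), by = .EACHI]$V1]

# Both columns should hold the same per-group sums
all.equal(dt_a$value_two_step, dt_a$value_eachi)
```

Note that the inner join still materialises a small result (one row per row of dt_a) before the assignment, so this avoids naming the intermediate object rather than avoiding every allocation.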

Related

Add relative complement of two data.table with rbind

I have a keyed data.table to which I would like to add rows from another table of the same key:
library(data.table)
key.cols <- c("ID", "Code")
set.seed(1)
DT1 = data.table(
ID = c("b","b","b","a","a","c"),
Code = LETTERS[seq(1,6)],
Number = runif(6)
);DT1
DT2 = data.table(
ID = c("a","a","c","b","b","b"),
Code = LETTERS[seq(4,9)],
Number = runif(6)
);DT2
I would like to add to DT1 only those rows of DT2 whose keys do not occur in DT1, i.e. rbind a relative complement:
https://en.wikipedia.org/wiki/Complement_(set_theory)#Relative_complement
I can try and use setops and just add the keys letting the non-keyed columns be filled NA and join them afterwards:
DT1 <- rbind(DT1, fsetdiff(DT2[,(key.cols), with=FALSE], DT1[,(key.cols), with=FALSE]), fill=TRUE)
DT1[DT2, Number:=ifelse(is.na(Number), i.Number, Number), on = key.cols];DT1
Is there a less cumbersome way to do it?

Slightly less cumbersome is:
rbind(DT1, DT2[!DT1, on = .(ID, Code)])
ID Code Number
1: b A 0.26550866
2: b B 0.37212390
3: b C 0.57285336
4: a D 0.90820779
5: a E 0.20168193
6: c F 0.89838968
7: b G 0.06178627
8: b H 0.20597457
9: b I 0.17655675
Perhaps more tractable would be to use unique():
unique(rbind(DT1, DT2), by = c("ID", "Code"))
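
To see what the anti-join contributes, here is a minimal sketch on made-up two-row tables (the values are illustrative only). DT2[!DT1, on = ...] keeps the rows of DT2 that have no key match in DT1, and rbind then appends them.

```r
library(data.table)

# Illustrative tables sharing one key (b, B)
DT1 <- data.table(ID = c("a", "b"), Code = c("A", "B"), Number = c(1, 2))
DT2 <- data.table(ID = c("b", "c"), Code = c("B", "C"), Number = c(20, 30))

# Anti-join: rows of DT2 whose (ID, Code) does not occur in DT1
new_rows <- DT2[!DT1, on = .(ID, Code)]

# Append only the new keys; DT1's value for (b, B) is kept
combined <- rbind(DT1, new_rows)

# unique() keeps the first occurrence per key, so it agrees here
via_unique <- unique(rbind(DT1, DT2), by = c("ID", "Code"))
all.equal(combined, via_unique)
```

One difference worth keeping in mind: unique() would also drop duplicated keys that already exist within DT1, whereas the anti-join leaves DT1 untouched.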

Reordering each row of a datatable

I am looking for an efficient way to reorder each row of a data.table in alphabetical order. I assume that all columns hold the same kind of information and are comparable. The example should make this clearer:
test <- data.table(A = c("A", "b", "c"),
B = c(1,"a","d"),
C = c("F", 0, 1))
Expected result:
result <- data.table(t(apply(test,1, sort)))
names(result) <- colnames(test)
In this solution I have to loop through all the rows, can this be prevented?
For 2 columns I found an efficient way to solve this problem:
result <- data.table(A = pmin(test$A, test$B), B = pmax(test$A, test$B) )
But this solution does not work well for more than 2 columns.
EDIT:
Let's add a benchmark of the different solutions on two columns:
test <- data.table(A = sample(c("A","B", "C", "D"), 1000000, replace = T),
B = sample(c("A","B", "C", "D"), 1000000, replace = T))
OptionOne <- function(test){
result <- data.table(A = pmin(test$A, test$B), B = pmax(test$A, test$B) )
}
OptionTwo <- function(test){
test[, names(test) := as.list(sort(unlist(.SD))), 1:nrow(test)][]
}
OptionThree <- function(test){
test[, id := .I]
test <- melt(test, id.vars = "id")
setorder(test, id, value)
test[, variable1 := seq_len(.N), by = id]
dcast(test, id ~ variable1, value.var = "value")
}
system.time(OptionOne(test))
#user system elapsed
#0.13 0.00 0.12
system.time(OptionTwo(test))
# user system elapsed
# 17.58 0.00 18.27
system.time(OptionThree(test))
#user system elapsed
# 0.23 0.00 0.24
It seems that for two columns pmin and pmax are the most efficient option, but for more columns the reshape does a good job.

Your data.table is conceptually in the wrong shape. Sorting over rows (i.e., over variables) does not make sense. Thus, to do this efficiently you need to reshape:
library(data.table)
test <- data.table(A = c("A", "b", "c"),
B = c(1,"a","d"),
C = c("F", 0, 1))
test[, id := .I]
test <- melt(test, id.vars = "id")
setorder(test, id, value)
# id variable value
#1: 1 B 1
#2: 1 A A
#3: 1 C F
#4: 2 C 0
#5: 2 B a
#6: 2 A b
#7: 3 C 1
#8: 3 A c
#9: 3 B d
If you must, you can then reshape again, though I would not recommend that.
test[, variable1 := seq_len(.N), by = id]
dcast(test, id ~ variable1, value.var = "value")
# id 1 2 3
#1: 1 1 A F
#2: 2 0 a b
#3: 3 1 c d
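
The reshape steps above can be checked against the row-wise apply() from the question. The sketch below uses made-up numeric data so that the comparison does not depend on collation (setorder sorts character columns in C-locale, while sort() follows the session locale); the helper column pos is hypothetical and only restores the original column names.

```r
library(data.table)

# Illustrative numeric data; each row should end up sorted ascending
wide <- data.table(A = c(3, 9, 5), B = c(1, 7, 8), C = c(2, 4, 6))

# Reshape to long, sort within each original row, then reshape back
long <- copy(wide)
long[, id := .I]
long <- melt(long, id.vars = "id")
setorder(long, id, value)
long[, pos := names(wide)[seq_len(.N)], by = id]
sorted <- dcast(long, id ~ pos, value.var = "value")
sorted[, id := NULL]

# Reference result from the row-wise loop in the question
expected <- t(apply(as.matrix(wide), 1, sort))
all.equal(as.matrix(sorted), expected, check.attributes = FALSE)
```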

We can try
test[, names(test) := as.list(sort(unlist(.SD))), 1:nrow(test)][]

Get number of same individuals for different groups

I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
ID = rep(1:5, c(3:1, 2:3)),
Group = c("A", "B", "C", "B",
"C", "A", "A", "C",
"A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
I want to know, for each pair of groups, the number of individuals that belong to both.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.

Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
# simplify = FALSE returns a list;
# transpose turns the 3-length list of 2-length vectors
# into a length-2 list of 3-length vectors (efficiently)
transpose(combn(Group, 2L, simplify = FALSE)), by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3
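
The role of transpose() in the code above may be easier to see in isolation. A small sketch, using three group labels as illustrative input:

```r
library(data.table)

# combn(..., simplify = FALSE) returns one length-2 vector per pair
pairs <- combn(c("A", "B", "C"), 2L, simplify = FALSE)
# pairs is a list of 3: c("A","B"), c("A","C"), c("B","C")

# transpose() turns that into one vector per position in the pair,
# which data.table then treats as two columns (V1 and V2) in j
cols <- transpose(pairs)
cols[[1]]  # first members:  "A" "A" "B"
cols[[2]]  # second members: "B" "C" "C"
```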

As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanation
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C
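
If joining on a factor column is undesirable, the same idea might be sketched with an explicit integer code for each group. This is an assumption-laden variant rather than the answer's exact method: the helper column g and the output names Group.1, Group.2 and Sum are introduced here for illustration.

```r
library(data.table)

DT <- data.table(
  ID = rep(1:5, c(3:1, 2:3)),
  Group = c("A", "B", "C", "B", "C", "A", "A", "C", "A", "B", "C")
)

# Hypothetical helper: integer rank of each group label
DT[, g := match(Group, sort(unique(Group)))]

# Self non-equi join: same ID, and x's group code strictly below i's,
# so each unordered pair of groups is produced once per ID
pairs <- DT[DT, on = .(ID, g < g), nomatch = 0L,
            .(ID, Group.1 = x.Group, Group.2 = i.Group)]

# Count how many IDs each pair of groups shares
res <- pairs[, .(Sum = .N), keyby = .(Group.1, Group.2)]
res
```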

We self-join with the same dataset on 'ID', subset the rows where the 'Group' columns are different, get the nrows (.N), grouped by the 'Group' columns, sort the 'Group.1' and 'Group.2' columns by row using pmin/pmax and get the unique value of 'N'.
library(data.table)#v1.9.6+
DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][,
list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
Or, as mentioned in the comments by @MichaelChirico and @Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group' and 'i.Group', and get the number of rows (.N):
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]

Great answers above.
Just an alternative using dplyr in case you, or someone else, is interested.
library(dplyr)
cmb = combn(unique(DT$Group),2)
data.frame(g1 = cmb[1,],
g2 = cmb[2,]) %>%
group_by(g1,g2) %>%
summarise(l=length(intersect(DT[DT$Group==g1,]$ID,
DT[DT$Group==g2,]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3

Yet another solution (base R):
# split the IDs by group; this works for both data.frame and data.table input
tmp <- split(DT$ID, DT$Group)
ans <- apply(combn(LETTERS[1 : 3], 2), 2, FUN = function(ind){
out <- length(intersect(tmp[[ind[1]]], tmp[[ind[2]]]))
c(group1 = ind[1], group2 = ind[2], sum_ = out)
}
)
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
First split the data into a list of groups, then for each unique pairwise combination of two groups count how many subjects they have in common, using length(intersect(...)).

Inconsistent data.table assignment by reference behaviour

When assigning by reference with a data.table using a column from a second data.table, the results are inconsistent. When there are no matches by the key columns of both data.tables, it appears the assignment expression y := y is totally ignored - not even NAs are returned.
library(data.table)
dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")
print(dt1[dt2, y := y])
## id x # Would have also expected column: y
## 1: 1 3 # NA
## 2: 2 4 # NA
However, when there is a partial match, non-matching rows have a placeholder NA.
dt2[, id := 2:3]
print(dt1[dt2, y := y])
## id x y
## 1: 1 3 NA # <-- placeholder NA here
## 2: 2 4 5
This wreaks havoc on later code that assumes a y column exists in all cases. Otherwise I keep having to write cumbersome additional checks to take into account both cases.
Is there an elegant way around this inconsistency?

With this recent commit, this issue, #759, is now fixed in v1.9.7. It works as expected when nomatch=NA (the current default).
require(data.table)
dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")
dt1[dt2, y := y][]
# id x y
# 1: 1 3 NA
# 2: 2 4 NA
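
For code that has to behave the same on versions with and without this fix, one possible workaround (an assumption on my part, not something the fix requires) is to create the column before the update join, so that it exists whether or not any rows match:

```r
library(data.table)

dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")

# Create the column up front so it exists even when nothing matches
dt1[, y := NA_integer_]

# No ids match here, so y stays NA for every row
dt1[dt2, y := i.y, on = "id"]
no_match <- copy(dt1)

# Partial match: only id 2 is updated
dt2[, id := 2:3]
dt1[dt2, y := i.y, on = "id"]
dt1[]
```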

Using merge works:
> dt3 <- merge(dt1, dt2, by='id', all.x=TRUE)
> dt3
id x y
1: 1 3 NA
2: 2 4 NA

Cross-correlation with multiple groups in one data.table

I'd like to calculate the cross-correlations between groups of time series within one data.table. I have time series data in this format:
data = data.table( group = c(rep("a", 5),rep("b",5),rep("c",5)) , Y = rnorm(15) )
group Y
1: a 0.90855520
2: a -0.12463737
3: a -0.45754652
4: a 0.65789709
5: a 1.27632196
6: b 0.98483700
7: b -0.44282527
8: b -0.93169070
9: b -0.21878359
10: b -0.46713392
11: c -0.02199363
12: c -0.67125826
13: c 0.29263953
14: c -0.65064603
15: c -1.41143837
Each group has the same number of observations. What I am looking for is a way to obtain cross correlation between the groups:
group.1 group.2 correlation
a b 0.xxx
a c 0.xxx
b c 0.xxx
I am working on a script to subset each group and append the cross-correlations, but the data size is fairly large. Is there any efficient / zen way to do this?

Does this help?
data[,id:=rep(1:5,3)]
dtw = dcast.data.table(data, id ~ group, value.var="Y" )[, id := NULL]
cor(dtw)
See Correlation between groups in R data.table
Another way would be:
# data
set.seed(45L)
data = data.table( group = c(rep("a", 5),rep("b",5),rep("c",5)) , Y = rnorm(15) )
# method 2
setkey(data, "group")
data2 = data[J(c("b", "c", "a"))][, list(group2=group, Y2=Y)]
data[, c(names(data2)) := data2]
data[, cor(Y, Y2), by=list(group, group2)]
# group group2 V1
# 1: a b -0.2997090
# 2: b c 0.6427463
# 3: c a -0.6922734
And to generalize this "other" way to more than three groups...
data = data.table( group = c(rep("a", 5),rep("b",5),rep("c",5),rep("d",5)) ,
Y = rnorm(20) )
setkey(data, "group")
groups = unique(data$group)
ngroups = length(groups)
library(gtools)
pairs = combinations(ngroups,2,groups)
d1 = data[pairs[,1],,allow.cartesian=TRUE]
d2 = data[pairs[,2],,allow.cartesian=TRUE]
d1[,c("group2","Y2"):=d2]
d1[,cor(Y,Y2), by=list(group,group2)]
# group group2 V1
# 1: a b 0.10742799
# 2: a c 0.52823511
# 3: a d 0.04424170
# 4: b c 0.65407400
# 5: b d 0.32777779
# 6: c d -0.02425053
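
The dcast approach from the start of this answer returns a full correlation matrix. A sketch of how it might be turned into the long group.1 / group.2 / correlation layout asked for in the question, using only base R on top of data.table (the seed and data are illustrative):

```r
library(data.table)

set.seed(45L)
data <- data.table(group = rep(c("a", "b", "c"), each = 5L), Y = rnorm(15L))

# Position of each observation within its group, used as the row id
data[, id := seq_len(.N), by = group]

# One column per group, then the full correlation matrix
wide <- dcast(data, id ~ group, value.var = "Y")
wide[, id := NULL]
cm <- cor(wide)

# Keep each unordered pair once by taking the upper triangle
idx <- which(upper.tri(cm), arr.ind = TRUE)
res <- data.table(
  group.1 = rownames(cm)[idx[, "row"]],
  group.2 = colnames(cm)[idx[, "col"]],
  correlation = cm[idx]
)
res
```

This computes every pairwise correlation in a single cor() call, which should scale better than subsetting each pair of groups separately, provided the wide table fits in memory.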
