Let's say I have two data.tables, dt_a and dt_b, defined as below.
library(data.table)
set.seed(20201111L)
dt_a <- data.table(
foo = c("a", "b", "c")
)
dt_b <- data.table(
bar = sample(c("a", "b", "c"), 10L, replace=TRUE),
value = runif(10L)
)
dt_b[]
## bar value
## 1: c 0.4904536
## 2: c 0.9067509
## 3: b 0.1831664
## 4: c 0.0203943
## 5: c 0.8707686
## 6: a 0.4224133
## 7: a 0.6025349
## 8: b 0.4916672
## 9: a 0.4566726
## 10: b 0.8841110
I want to left join dt_b onto dt_a by reference, summing over the multiple matches. One way to do so is to first create a summary of dt_b (thus solving the multiple-match issue) and merge it afterwards.
dt_b_summary <- dt_b[, .(value=sum(value)), bar]
dt_a[dt_b_summary, value_good:=value, on=c(foo="bar")]
dt_a[]
## foo value_good
## 1: a 1.481621
## 2: b 1.558945
## 3: c 2.288367
However, this allocates memory for the intermediate object dt_b_summary, which is inefficient.
I would like to get the same result by joining directly on dt_b and summing over the multiple matches. I'm looking for something like the code below, but it doesn't work.
dt_a[dt_b, value_bad:=sum(value), on=c(foo="bar")]
dt_a[]
## foo value_good value_bad
## 1: a 1.481621 5.328933
## 2: b 1.558945 5.328933
## 3: c 2.288367 5.328933
Does anyone know if something like this is possible?
We can use .EACHI with by, which evaluates j once for each row of i:
library(data.table)
dt_b[dt_a, .(value = sum(value)), on = .(bar = foo), by = .EACHI]
# bar value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
If we want to update the original object 'dt_a'
dt_a[, value := dt_b[.SD, sum(value), on = .(bar = foo), by = .EACHI]$V1]
dt_a
# foo value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
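As a side note (a minor variant, not required), the aggregate can be named in the inner j to avoid relying on the default V1 column name:
dt_a[, value := dt_b[.SD, .(v = sum(value)), on = .(bar = foo), by = .EACHI]$v]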
For multiple columns
dt_b$value1 <- dt_b$value
nm1 <- c('value', 'value1')
dt_a[, (nm1) := dt_b[.SD, lapply(.SD, sum),
on = .(bar = foo), by = .EACHI][, .SD, .SDcols = nm1]]
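Given the seed above, both new columns should then hold the per-group sums (value1 is an exact copy of value):
dt_a[, .(foo, value, value1)]
##    foo    value   value1
## 1:   a 1.481621 1.481621
## 2:   b 1.558945 1.558945
## 3:   c 2.288367 2.288367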
With data.table, we can join a value in one data set with the nearest value in another by using roll = "nearest". Some example data:
dt1 <- data.table(x = c(15,101), id1 = c("x", "y"))
dt2 <- data.table(x = c(10,50,100,200), id2 = c("a","b","c","d"))
Using roll = "nearest", I can join each 'x' in 'dt1' with the 'x' in dt2 which is nearest:
dt2[dt1, roll = "nearest", on = "x"]
# x id2 id1
# 1: 15 a x
# 2: 101 c y
E.g. for x = 15 in 'dt1', the nearest x value in 'dt2' is x = 10, and we get the corresponding 'id2' which is "a".
But what if instead of getting one nearest value, I want to get n nearest values? For example, if I want the 2 nearest x values, the result would be:
x id2 id1 roll
1: 15 a x nr1
2: 15 b x nr2
3: 101 c y nr1
4: 101 b y nr2
("nr" stands for "nearest")
I want a general approach that I can apply to any "n" (e.g. 2 nearest points, 3 nearest points, etc).
EDIT
I wonder if it is possible to also apply this to a multi-column join, where the join first matches on the preceding columns and then gets the nearest on the last join column. For example:
dt1 <- data.table(group = c(1, 2), x = c(15, 101), id1 = c("x", "y"))
dt2 <- data.table(group = c(1, 2, 2, 3), x = c(10, 50, 100, 200), id2 = c("a", "b", "c", "d"))
If I join on=c("group","x"), the join will first match on "group" and then get the nearest on "x", so I would expect the result to be something like:
x group id2 id1 roll
1: 15 1 a x nr1
2: 101 2 c y nr1
3: 101 2 b y nr2
Here is something very raw (we go row by row):
n <- 2L
sen <- 1L:n
for (i in 1:nrow(dt1)) {
  set(dt1, i, j = "nearest", list(which(frank(abs(dt1$x[i] - dt2$x)) %in% sen)))
}
dt1[, .(id1, nearest = unlist(nearest)), by = x
][, id2 := dt2$id2[nearest]
][, roll := paste0("nr", frank(abs(dt2$x[nearest] - x))), by = x][]
# x id1 nearest id2 roll
# 1: 15 x 1 a nr1
# 2: 15 x 2 b nr2
# 3: 101 y 2 b nr2
# 4: 101 y 3 c nr1
Slightly cleaner:
dt1[, {
  nrank <- frank(abs(x - dt2$x), ties.method = "first")
  nearest <- which(nrank %in% sen)
  .(x = x, id2 = dt2$id2[nearest], roll = paste0("nr", nrank[nearest]))
}, by = id1] # assumes unique ids.
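With the same data as above, this returns the equivalent result, grouped by id1:
#    id1   x id2 roll
# 1:   x  15   a  nr1
# 2:   x  15   b  nr2
# 3:   y 101   b  nr2
# 4:   y 101   c  nr1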
Data:
dt1 <- data.table(x = c(15, 101), id1 = c("x", "y"))
dt2 <- data.table(x = c(10, 50, 100, 200), id2 = c("a", "b", "c", "d"))
EDIT (as suggested/written by OP)
Joining with multiple keys:
dt1[, {
  g <- group
  dt_tmp <- dt2[dt2$group == g]
  nrank <- frank(abs(x - dt_tmp$x), ties.method = "first")
  nearest <- which(nrank %in% sen)
  .(x = x, id2 = dt_tmp$id2[nearest], roll = paste0("nr", nrank[nearest]))
}, by = id1]
Edited for corrected ordering.
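With the multi-key example data this should yield (note that group 1 contains a single candidate, so only nr1 appears for id1 == "x"):
#    id1   x id2 roll
# 1:   x  15   a  nr1
# 2:   y 101   b  nr2
# 3:   y 101   c  nr1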
I don't know that roll= is going to allow nearest-n, but here's a possible workaround:
dt1[, id2 := lapply(x, function(z) {
  r <- head(order(abs(z - dt2$x)), n = 2)  # indices of the 2 nearest rows, closest first
  dt2[r, .(id2, nr = seq_along(r))]
})]
as.data.table(tidyr::unnest(dt1, id2))
#      x id1 id2 nr
# 1:  15   x   a  1
# 2:  15   x   b  2
# 3: 101   y   c  1
# 4: 101   y   b  2
(I'm using tidyr::unnest because I think it fits and works well here, and data.table/#3672 is still open.)
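If you'd rather stay entirely within data.table until that issue is resolved, a possible workaround (a sketch, assuming each (x, id1) pair is unique) is to expand the list column in j, since a data.table returned from j is unnested automatically:
dt1[, id2[[1L]], by = .(x, id1)]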
Second batch of data:
dt1 = data.table(x = c(1, 5, 7), id1 = c("x", "y", "z"))
dt2 = data.table(x = c(2, 5, 6, 10), id2 = c(2, 5, 6, 10))
dt1[, id2 := lapply(x, function(z) {
  r <- head(order(abs(z - dt2$x)), n = 2)
  dt2[r, .(id2, nr = seq_along(r))]
})]
as.data.table(tidyr::unnest(dt1, id2))
#    x id1 id2 nr
# 1: 1   x   2  1
# 2: 1   x   5  2
# 3: 5   y   5  1
# 4: 5   y   6  2
# 5: 7   z   6  1
# 6: 7   z   5  2
Here is another option using rolling join without an additional grouping key (an improvement on my initial naive cross join idea):
#for differentiating rows from both data.tables
dt1[, ID := .I]
dt2[, rn := .I]
#perform rolling join to find closest and
#then retrieve the +-n rows around that index from dt2
n <- 2L
adjacent <- dt2[dt1, on=.(x), roll="nearest", nomatch=0L, by=.EACHI,
                c(.(ID=ID, id1=i.id1, val=i.x),
                  dt2[unique(pmin(pmax(0L, seq(x.rn-n, x.rn+n, by=1L)), .N))])
                ][, (1L) := NULL]
#extract nth nearest
adjacent[order(abs(val-x)), head(.SD, n), keyby=ID]
output:
ID id1 val x id2 rn
1: 1 x 15 10 a 1
2: 1 x 15 50 b 2
3: 2 y 101 100 c 3
4: 2 y 101 50 b 2
And using Henrik's dataset:
dt1 = data.table(x = c(1, 5, 7), id1 = c("x", "y", "z"))
dt2 = data.table(x = c(2, 5, 6, 10), id2 = c(2, 5, 6, 10))
output:
ID id1 val x id2 rn
1: 1 x 1 2 2 1
2: 1 x 1 5 5 2
3: 2 y 5 5 5 2
4: 2 y 5 6 6 3
5: 3 z 7 6 6 3
6: 3 z 7 5 5 2
And also Henrik's 2nd dataset:
dt1 = data.table(x = 3L, id1="x")
dt2 = data.table(x = 1:2, id2=c("a","b"))
output:
ID id1 val x id2 rn
1: 1 x 3 2 b 2
2: 1 x 3 1 a 1
And also joining on an additional grouping key:
dt2[, rn := .I]
#perform rolling join to find closest and
#then retrieve the +-n rows around that index from dt2
n <- 2L
adjacent <- dt2[dt1, on=.(group, x), roll="nearest", by=.EACHI, {
    #indices of the +-n rows around the matched row; out-of-range rn
    #are dropped by nomatch=0L in the join below
    xrn <- unique(pmax(0L, seq(x.rn-n, x.rn+n, by=1L)))
    c(.(id1=id1, x1=i.x),
      dt2[.(group=i.group, rn=xrn), on=.(group, rn), nomatch=0L])
}][, (1L:2L) := NULL]
#extract nth nearest
adjacent[order(abs(x1-x)), head(.SD, 2L), keyby=id1] #use id1 to identify rows if it's unique, otherwise create an ID column as in the previous section
output:
id1 x1 group x id2 rn
1: x 15 1 10 a 1
2: y 101 2 100 c 3
3: y 101 2 50 b 2
data:
library(data.table)
dt1 <- data.table(group=c(1,2), x=(c(15,101)), id1=c("x","y"))
dt2 <- data.table(group=c(1,2,2,3), x=c(10,50,100,200), id2=c("a","b","c","d"))
A k nearest neighbour alternative using nabor::knn:
library(nabor)
k = 2L
dt1[, {
  kn = knn(dt2$x, x, k)
  c(.SD[rep(seq.int(.N), k)],
    dt2[as.vector(kn$nn.idx),
        .(x2 = x, id2, nr = rep(seq.int(k), each = dt1[, .N]))])
}]
# x id1 x2 id2 nr
# 1: 15 x 10 a 1
# 2: 101 y 100 c 1
# 3: 15 x 50 b 2
# 4: 101 y 50 b 2
In common with the answers by @sindri_baldur and @r2evans, an actual join (on=) is not performed; we "only" do something in j.
Timings
On data of rather modest size (nrow(dt1): 1000; nrow(dt2): 10000), knn seems faster:
# Unit: milliseconds
# expr min lq mean median uq max neval
# henrik 8.09383 10.19823 10.54504 10.2835 11.00029 13.72737 20
# chinsoon 2140.48116 2154.15559 2176.94620 2171.5824 2192.54536 2254.20244 20
# r2evans 4496.68625 4562.03011 4677.35214 4680.0699 4751.35237 4935.10655 20
# sindri 4194.93867 4397.76060 4406.29278 4402.7913 4432.76463 4490.82789 20
I also tried one evaluation on 10 times larger data, and the differences were then even more pronounced.
Code for the timing:
v = 1:1e7
n1 = 10^3
n2 = n1 * 10
set.seed(1)
dt1_0 = data.table(x = sample(v, n1))
dt2_0 = data.table(x = sample(v, n2))
setorder(dt1_0, x)
setorder(dt2_0, x)
# unique row id
dt1_0[ , id1 := 1:.N]
# To make it easier to see which `x` values are joined in `dt1` and `dt2`
dt2_0[ , id2 := x]
bm = microbenchmark(
henrik = {
dt1 = copy(dt1_0)
dt2 = copy(dt2_0)
k = 2L
d_henrik = dt1[ , {
kn = knn(dt2$x, x, k)
c(.SD[as.vector(row(kn$nn.idx))],
dt2[as.vector(kn$nn.idx),
.(id2, nr = as.vector(col(kn$nn.idx)))])
}]
},
chinsoon = {
dt1 = copy(dt1_0)
dt2 = copy(dt2_0)
dt1[, ID := .I]
dt2[, rn := .I]
n <- 2L
adjacent <- dt2[dt1, on=.(x), roll="nearest", nomatch=0L, by=.EACHI,
c(.(ID=ID, id1=i.id1, val=i.x),
dt2[unique(pmin(pmax(0L, seq(x.rn-n, x.rn+n, by=1L)), .N))])][,(1L) := NULL]
d_chinsoon = adjacent[order(abs(val-x)), head(.SD, n), keyby=ID]
},
r2evans = {
dt1 = copy(dt1_0)
dt2 = copy(dt2_0)
dt1[, id2 := lapply(x, function(z) { r <- head(order(abs(z - dt2$x)), n = 2); dt2[r, .(id2, nr = seq_along(r))] })]
d_r2evans = as.data.table(tidyr::unnest(dt1, id2))
},
sindri = {
dt1 = copy(dt1_0)
dt2 = copy(dt2_0)
n <- 2L
sen <- 1:n
d_sindri = dt1[ ,
{
nrank <- frank(abs(x - dt2$x), ties.method="first")
nearest <- which(nrank %in% sen)
.(x = x, id2 = dt2$id2[nearest], roll = paste0("nr", nrank[nearest]))
}, by = id1]
}
, times = 20L)
Check for equality, after some sorting:
setorder(d_henrik, x)
all.equal(d_henrik$id2, d_chinsoon$id2)
# TRUE
all.equal(d_henrik$id2, d_r2evans$id2)
# TRUE
setorder(d_sindri, x, roll)
all.equal(d_henrik$id2, d_sindri$id2)
# TRUE
Additional grouping variable
A quick and dirty work-around for an additional join variable; the knn is done by group:
d1 = data.table(g = 1:2, x = c(4, 4))
d2 = data.table(g = c(1L, 1L, 1L, 2L, 2L, 2L, 3L),
                x = c(2, 4, 5, 0, 1, 6, 10))
d1
# g x
# 1: 1 4
# 2: 2 4
d2
# g x
# 1: 1 2
# 2: 1 4 # nr 1
# 3: 1 5 # nr 2
# 4: 2 0
# 5: 2 1 # nr 2
# 6: 2 6 # nr 1
# 7: 3 10
d1[, {
  gg = g
  kn = knn(d2[g == gg, x], x, k)
  c(.SD[rep(seq.int(.N), k)],
    d2[g == gg][as.vector(kn$nn.idx),
                .(x2 = x, nr = rep(seq.int(k), each = d1[g == gg, .N]))])
}, by = g]
# g x x2 nr
# 1: 1 4 4 1
# 2: 1 4 5 2
# 3: 2 4 6 1
# 4: 2 4 1 2
You can use the package distances to get n nearest neighbours:
library(distances)
n <- 2
x <- nearest_neighbor_search(distances(c(dt2$x, dt1$x)), n,
                             nrow(dt2) + seq_len(nrow(dt1)), seq_len(nrow(dt2)))
x[] <- dt2$id2[x]
x <- t(x)
colnames(x) <- paste0("nr", seq_len(n))
cbind(dt1, x)
# x id1 nr1 nr2
#1: 15 x a b
#2: 101 y c b
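If the long format of the earlier answers is preferred, the wide result can be melted (a sketch; the nr1/nr2 column names end up in a 'roll' column):
melt(cbind(dt1, x), id.vars = c("x", "id1"),
     variable.name = "roll", value.name = "id2")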
I have a data.table in which I'd like to complete a column to fill in some missing values; however, I'm having some trouble filling in the other columns.
library(data.table)
library(zoo) # for na.locf
dt = data.table(a = c(1, 3, 5), b = c('a', 'b', 'c'))
dt[, .(a = seq(min(a), max(a), 1), b = na.locf(b))]
# a b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 a
# 5: 5 b
However, I'm looking for something more like this (using tidyr::complete, dplyr::mutate, and zoo::na.locf):
dt %>%
complete(a = seq(min(a), max(a), 1)) %>%
mutate(b = na.locf(b))
# # A tibble: 5 x 2
# a b
# <dbl> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 c
where the last observed value is carried forward.
Another possible solution with only the (rolling) join capabilities of data.table:
dt[.(min(a):max(a)), on = .(a), roll = Inf]
which gives:
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
On large datasets this will probably outperform most other solutions, since it needs only a single rolling join and no intermediate objects.
Courtesy of @Mako212, who gave the hint by using seq in his answer.
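To check the performance claim on your own data, a quick benchmark sketch (the sizes and names here are illustrative, not from the original post):
library(microbenchmark)
library(zoo) # for na.locf
set.seed(1L)
dt_big <- data.table(a = sort(sample(1e6L, 1e4L)), b = sample(letters, 1e4L, TRUE))
microbenchmark(
  roll = dt_big[.(min(a):max(a)), on = .(a), roll = Inf],
  locf = dt_big[dt_big[, .(a = min(a):max(a))], on = "a"][, .(a, b = na.locf(b))],
  times = 10L
)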
First posted solution, which works but gives a warning (Reduce(":", a) evaluates (1:3):5, and ":" uses only the first element of its length-3 left operand):
dt[dt[, .(a = Reduce(":", a))], on = .(a), roll = Inf]
data.table recycles observations by default when you try dt[, .(a = seq(min(a), max(a), 1))], so it never generates any NA values for na.locf to fill. Pretty sure you need to use a join here to "complete" the cases, and then you can use na.locf to fill.
dt[dt[, .(a = min(a):max(a))], on = 'a'][, .(a, b = na.locf(b))]
Not sure if there's a way to skip building the completed table separately, but this gives you the desired result.
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
And I've borrowed @Jaap's min/max line to avoid creating a second table. So basically you can either use his rolling join solution, or, if you want to use na.locf, this gets the same result.
Using my example below, how can I rank multiple columns using different orders, so that, for example, y is ranked descending and z ascending?
require(data.table)
dt <- data.table(x = c(rep("a", 5), rep("b", 5)),
y = abs(rnorm(10)) * 10, z = abs(rnorm(10)) * 10)
cols <- c("y", "z")
dt[, paste0("rank_", cols) := lapply(.SD, function(x) frankv(x, ties.method = "min")), .SDcols = cols, by = .(x)]
data.table's frank() function has some useful features which aren't available in base R's rank() function (see ?frank). E.g., we can reverse the order of the ranking by prepending the variable with a minus sign:
library(data.table)
# create reproducible data
set.seed(1L)
dt <- data.table(x = c(rep("a", 5), rep("b", 5)),
y = abs(rnorm(10)) * 10, z = abs(rnorm(10)) * 10)
# rank y descending, z ascending
dt[, rank_y := frank(-y), x][, rank_z := frank(z), x][]
x y z rank_y rank_z
1: a 6.264538 15.1178117 3 4
2: a 1.836433 3.8984324 5 1
3: a 8.356286 6.2124058 2 2
4: a 15.952808 22.1469989 1 5
5: a 3.295078 11.2493092 4 3
6: b 8.204684 0.4493361 1 2
7: b 4.874291 0.1619026 4 1
8: b 7.383247 9.4383621 2 5
9: b 5.757814 8.2122120 3 4
10: b 3.053884 5.9390132 5 3
If there are many columns which are to be ranked individually, some descending, some ascending, we can do this in two steps
# first rank all columns in descending order
cols_desc <- c("y")
dt[, paste0("rank_", cols_desc) := lapply(.SD, frankv, ties.method = "min", order = -1L),
.SDcols = cols_desc, by = x][]
# then rank all columns in ascending order
cols_asc <- c("z")
dt[, paste0("rank_", cols_asc) := lapply(.SD, frankv, ties.method = "min", order = +1L),
.SDcols = cols_asc, by = x][]
x y z rank_y rank_z
1: a 6.264538 15.1178117 3 4
2: a 1.836433 3.8984324 5 1
3: a 8.356286 6.2124058 2 2
4: a 15.952808 22.1469989 1 5
5: a 3.295078 11.2493092 4 3
6: b 8.204684 0.4493361 1 2
7: b 4.874291 0.1619026 4 1
8: b 7.383247 9.4383621 2 5
9: b 5.757814 8.2122120 3 4
10: b 3.053884 5.9390132 5 3
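Alternatively (a sketch), both directions can be handled in a single call by pairing each column with its sort order:
cols <- c("y", "z")
ords <- c(-1L, 1L) # y descending, z ascending
dt[, paste0("rank_", cols) := Map(function(v, o) frankv(v, order = o, ties.method = "min"),
                                  .SD, ords),
   .SDcols = cols, by = x]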
I'd like to calculate the cross-correlations between groups of time series within one data.table. I have time series data in this format:
data = data.table( group = c(rep("a", 5),rep("b",5),rep("c",5)) , Y = rnorm(15) )
group Y
1: a 0.90855520
2: a -0.12463737
3: a -0.45754652
4: a 0.65789709
5: a 1.27632196
6: b 0.98483700
7: b -0.44282527
8: b -0.93169070
9: b -0.21878359
10: b -0.46713392
11: c -0.02199363
12: c -0.67125826
13: c 0.29263953
14: c -0.65064603
15: c -1.41143837
Each group has the same number of observations. What I am looking for is a way to obtain cross correlation between the groups:
group.1 group.2 correlation
a b 0.xxx
a c 0.xxx
b c 0.xxx
I am working on a script to subset each group and append the cross-correlations, but the data size is fairly large. Is there any efficient / zen way to do this?
Does this help?
data[,id:=rep(1:5,3)]
dtw = dcast.data.table(data, id ~ group, value.var="Y" )[, id := NULL]
cor(dtw)
See Correlation between groups in R data.table
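cor(dtw) returns a symmetric correlation matrix; to get the long three-row format asked for, one option (a sketch) is to blank the lower triangle and melt:
cm <- cor(dtw)
cm[lower.tri(cm, diag = TRUE)] <- NA_real_
melt(as.data.table(cm, keep.rownames = "group.1"),
     id.vars = "group.1", variable.name = "group.2",
     value.name = "correlation")[!is.na(correlation)]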
Another way would be:
# data
set.seed(45L)
data = data.table( group = c(rep("a", 5),rep("b",5),rep("c",5)) , Y = rnorm(15) )
# method 2
setkey(data, "group")
data2 = data[J(c("b", "c", "a"))][, list(group2=group, Y2=Y)]
data[, c(names(data2)) := data2]
data[, cor(Y, Y2), by=list(group, group2)]
# group group2 V1
# 1: a b -0.2997090
# 2: b c 0.6427463
# 3: c a -0.6922734
And to generalize this "other" way to more than three groups...
data = data.table( group = c(rep("a", 5),rep("b",5),rep("c",5),rep("d",5)) ,
Y = rnorm(20) )
setkey(data, "group")
groups = unique(data$group)
ngroups = length(groups)
library(gtools)
pairs = combinations(ngroups,2,groups)
d1 = data[pairs[,1],,allow.cartesian=TRUE]
d2 = data[pairs[,2],,allow.cartesian=TRUE]
d1[,c("group2","Y2"):=d2]
d1[,cor(Y,Y2), by=list(group,group2)]
# group group2 V1
# 1: a b 0.10742799
# 2: a c 0.52823511
# 3: a d 0.04424170
# 4: b c 0.65407400
# 5: b d 0.32777779
# 6: c d -0.02425053
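To match the column header in the desired output exactly, name the statistic in j (a minor variant):
d1[, .(correlation = cor(Y, Y2)), by = .(group, group2)]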