I have two data.tables that I want to join (form a Cartesian product of): a table of dates and a table of customer IDs. I built the dates table in two flavours, one keyed on a Date vector and one on its numeric representation:
# data.table with dates (as numeric)
dtDates2 = data.table(date = as.numeric(seq(from = as.Date('2014/01/01'),
                                            to = as.Date('2014/07/01'),
                                            by = 'weeks')),
                      data1 = rnorm(26))

# data.table with dates
dtDates1 = data.table(date = seq(from = as.Date('2014/01/01'),
                                 to = as.Date('2014/07/01'),
                                 by = 'weeks'),
                      data1 = rnorm(26))

# data.table with customer IDs
dtCustomers = data.table(customerID = seq(1, 100),
                         data2 = rnorm(100))
I setkey and try to cross-join them using CJ:
# cross join the two datatables
setkey(dtCustomers, customerID)
setkey(dtDates1, date)
setkey(dtDates2, date)
CJ(dtCustomers, dtDates1)
CJ(dtCustomers, dtDates2)
but get the following error:
Error in FUN(X[[1L]], ...) :
Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.
Not sure what I am doing wrong.
CJ() expects vectors, not whole data.tables, which is why the call above errors; there is no cross-join functionality for data.tables available out of the box. However, there is a CJ.dt function (CJ-like, but designed for data.tables) that produces the Cartesian product (cross join), available in the optiRum package (on CRAN).
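As an aside, since CJ() does work on plain vectors, the cross product of just the key columns can be built directly; the extra data columns would then still need to be joined back on (a sketch):
# all combinations of customer IDs and dates (key columns only)
keyGrid <- CJ(customerID = dtCustomers$customerID, date = dtDates1$date)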
Or you can create the function yourself:
CJ.dt = function(X, Y) {
  stopifnot(is.data.table(X), is.data.table(Y))
  k = NULL
  X = X[, c(k = 1, .SD)]
  setkey(X, k)
  Y = Y[, c(k = 1, .SD)]
  setkey(Y, NULL)
  X[Y, allow.cartesian = TRUE][, k := NULL][]
}
CJ.dt(dtCustomers, dtDates1)
CJ.dt(dtCustomers, dtDates2)
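A quick sanity check (not part of the original answer): the result should have nrow(X) * nrow(Y) rows and carry the columns of both inputs.
res <- CJ.dt(dtCustomers, dtDates1)
nrow(res) == nrow(dtCustomers) * nrow(dtDates1)   # TRUE
names(res)   # "customerID" "data2" "date" "data1"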
There is also a feature request for a convenient way to perform a cross join, filed as data.table#1717, so you can check there whether a nicer API for cross joins has become available.
Thank you jangorecki for the very useful function. I had to add support for an empty X and/or Y:
CJ.dt = function(X, Y) {
  stopifnot(is.data.table(X), is.data.table(Y))
  if (nrow(X) > 0 & nrow(Y) > 0) {
    k = NULL
    X = X[, c(k = 1, .SD)]
    setkey(X, k)
    Y = Y[, c(k = 1, .SD)]
    setkey(Y, NULL)
    return(X[Y, allow.cartesian = TRUE][, k := NULL][])
  } else {
    duplicatedNames <- names(Y)[names(Y) %in% names(X)]
    if (length(duplicatedNames) > 0) {
      setnames(Y, duplicatedNames, paste0("i.", duplicatedNames))
    }
    setkey(Y)
    setkey(X)
    return(cbind(X[!X], Y[!Y]))
  }
}
# X <- data.table(a = c(1, 2))
# Y <- data.table(a = c(2, 3), b = c(4, 5))
#
# CJ.dt(X, Y)
# CJ.dt(X[a > 2], Y)
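A quick check of the empty case using the commented example above (a sketch, not from the original post). Note that the else branch renames the overlapping columns of Y by reference via setnames(), so the caller's Y is modified as a side effect:
X <- data.table(a = c(1, 2))
Y <- data.table(a = c(2, 3), b = c(4, 5))

empty <- CJ.dt(X[a > 2], Y)   # X[a > 2] has zero rows
nrow(empty)    # 0
names(empty)   # "a" "i.a" "b"
names(Y)       # "i.a" "b"  (Y itself has been renamed)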
library(data.table)
Say you have 3 tables in a star schema, something like this, where I have a dummy table dt1 that depends on dt2 and dt3:
dt1 <- data.table( x1 = 1:10, y1 = 11:20)
dt2 <- data.table( x2 = 1:10, y2 = letters[11:20])
dt3 <- data.table( x3 = letters[1:10], y3 = 11:20)
dt1[dt2, on = c( x1 = "x2")]
dt2[dt1, on = c( x2 = "x1")]
dt3[dt1, on = c( y3 = "y1")]
In the above scenario you have to remember about 4 join combinations for every query, along with their order. Imagine a scenario where you have 10 or 20 such tables coming straight from a database or CSVs, and where ideally you would just write:
dt1[dt2]
dt2[dt1]
dt1[dt3]
dt3[dt1]
One approach is to write a function like this:
rememberJoin <- function(table1, table2) {
  if (table1 == "dt1" && table2 == "dt2") {
    return(c(x1 = "x2"))
  } else if (table1 == "dt2" && table2 == "dt1") {
    return(c(x2 = "x1"))
  } else if (table1 == "dt3" && table2 == "dt1") {
    return(c(y3 = "y1"))
  } else if (table1 == "dt1" && table2 == "dt3") {
    return(c(y1 = "y3"))
  }
}
dt1[dt2, on = rememberJoin("dt1", "dt2")]
dt2[dt1, on = rememberJoin("dt2", "dt1")]
dt3[dt1, on = rememberJoin("dt3", "dt1")]
But this solution is hacky, has too many if statements, and is very hard to read and debug. I used another approach instead, creating a named list:
Join <- list(
dt1dt2 = c( x1 = "x2"),
dt2dt1 = c( x2 = "x1"),
dt3dt1 = c( y3 = "y1"),
dt1dt3 = c( y1 = "y3")
)
dt1[dt2, on = Join$dt1dt2]
dt2[dt1, on = Join$dt2dt1]
dt3[dt1, on = Join$dt3dt1]
It works just fine, but I am still not sure whether this is the optimal solution, or whether there is any package that provides join-aware tables. Any suggestion or alternative will do; I am working with a very complex dataset of more than 25 tables and am looking for a solution.
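If the list grows, a small accessor (a hypothetical helper, not from any package) can at least fail loudly when a pair has no definition:
joinOn <- function(t1, t2) {
  key <- paste0(t1, t2)
  if (is.null(Join[[key]])) stop("no join definition for ", key)
  Join[[key]]
}

dt1[dt2, on = joinOn("dt1", "dt2")]
dt3[dt1, on = joinOn("dt3", "dt1")]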
I just found out about a package that is somewhat related to this issue; mentioning it here for reference:
https://krlmlr.github.io/dm/articles/dm-visualization.html
It keeps a record of all the join relationships between local R data frames. Worth a look.
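For reference, declaring the relationships above with dm might look roughly like this (the function names and arguments here are assumptions based on the dm documentation and may differ between versions, so check the linked vignette):
library(dm)

schema <- dm(dt1, dt2, dt3)
schema <- dm_add_pk(schema, dt2, x2)       # dt2 is referenced via x2
schema <- dm_add_pk(schema, dt3, y3)       # dt3 is referenced via y3
schema <- dm_add_fk(schema, dt1, x1, dt2)  # dt1.x1 -> dt2.x2
schema <- dm_add_fk(schema, dt1, y1, dt3)  # dt1.y1 -> dt3.y3
dm_draw(schema)                            # visualize the declared keys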
You could look at binding all of the lookup tables together.
dt_dummy = dt1
# stack the lookup tables, mapping each table's key column to x2 and its value column to y2
dt_lookup = rbindlist(list(dt2, dt3[, .(x2 = y3, y2 = x3)]), idcol = 'ID')

dt_dummy[dt_lookup, on = .(x1 = x2), nomatch = 0L]
dt_dummy[dt_lookup, on = .(y1 = x2), nomatch = 0L]
If you must have a function call, this takes advantage of R's lazy evaluation. It requires you to provide the lookup vectors ahead of time, but it is more in line with your question. Note that, to reduce the number of logical statements, this solution always reorders your inputs; that is, rememberJoin2(dt1, dt2) evaluates the same as rememberJoin2(dt2, dt1). It would look too messy otherwise.
dt1_lookup = c('x1', 'y1')
dtx_lookup = c('x2', 'y3')

rememberJoin2 <- function(t1, t2) {
  l = list(substitute(t1), substitute(t2))
  # extract the number from dt#
  n <- vapply(l, function(x) as.integer(gsub("dt", "", deparse(x))), 0L)
  if (n[1] == n[2]) stop('must provide different data.tables to join')
  r <- rank(n)
  eval(
    substitute(X[Y, on = .(Xkey == Ykey)],
               list(X = l[[r[1]]],
                    Y = l[[r[2]]],
                    Xkey = as.name(dt1_lookup[max(n) - 1]),
                    Ykey = as.name(dtx_lookup[max(n) - 1])))
  )
}
rememberJoin2(dt1, dt2)
rememberJoin2(dt2, dt1)
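As a quick confirmation of the reordering behavior described in the note above (not part of the original answer):
all.equal(rememberJoin2(dt1, dt2), rememberJoin2(dt2, dt1))   # TRUE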
I noticed the following peculiar behavior when I upgraded from R 3.4.3 to R 3.6.0 (both using data.table 1.12.6). In 3.4.3 the code below makes the all.equal statement TRUE, but in 3.6.0 there is a mean relative difference, which comes from the fact that even though we are trying to access the approxfun calculated from group "a", the values from group "b" are used (probably somehow due to lazy evaluation). In 3.6.0 the issue can be solved by adding a copy statement in the calls to approxfun, based on this question:
Handling of closures in data.table
The fascinating thing to me is that I do not get an error in 3.4.3. Any idea what changed?
library(data.table)
data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(0.0122, 0.01231, 0.01325, 0.01374,
        0.01218, 0.01229, 0.0133, 0.01379)
)

dtFuncs <- data[ , list(
  func = list(stats::approxfun(x, y, rule = 2))
), by = group]

f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  dtResults <- dtResults[ , {
    .g <- group
    f2 <- dtFuncs[group == .g, func][[1]]
    list(x = x, y = f2(x))
  }, by = group]
  dtResults
}

x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)
After running git bisect on the r-source, I was able to deduce that it was this commit that caused the behavior: https://github.com/wch/r-source/commit/adcf18b773149fa20f289f2c8f2e45e6f7b0dbfe
What fundamentally happened is that when the x values passed to approxfun are already ordered, an internal copy is no longer made. If the data had been randomly sorted, the code would have continued to work! (see the snippet below)
The lesson for me is that it's probably best not to mix closures (or other complicated objects) with data.table, as the same environment is reused over and over for each "by" group, or else to be very deliberate with data.table::copy.
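For reference, the copy-based fix mentioned in the question might look like this minimal sketch (copy() forces a duplicate of each vector before it is captured in approxfun's closure); the full sorted/unsorted demonstration follows below.
dtFuncs <- data[ , list(
  func = list(stats::approxfun(copy(x), copy(y), rule = 2))
), by = group]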
## should be run under R > 3.6.0 to see disparity
library(data.table)
## original sorted x (does not work)
data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(0.0122, 0.01231, 0.01325, 0.01374,
        0.01218, 0.01229, 0.0133, 0.01379)
)

dtFuncs <- data[ , {
  print(environment())
  list(
    func = list(stats::approxfun(x, y, rule = 2))
  )
}, by = group]

f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  dtResults <- dtResults[ , {
    .g <- group
    f2 <- dtFuncs[group == .g, func][[1]]
    list(x = x, y = f2(x))
  }, by = group]
  dtResults
}

get("y", environment(dtFuncs$func[[1]]))
get("y", environment(dtFuncs$func[[2]]))

x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)
## unsorted x (works)
data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(0.0122, 0.01231, 0.01325, 0.01374,
        0.01218, 0.01229, 0.0133, 0.01379)
)
set.seed(10)
data <- data[sample(1:.N, .N)]

dtFuncs <- data[ , {
  print(environment())
  list(
    func = list(stats::approxfun(x, y, rule = 2))
  )
}, by = group]

f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  dtResults <- dtResults[ , {
    .g <- group
    f2 <- dtFuncs[group == .g, func][[1]]
    list(x = x, y = f2(x))
  }, by = group]
  dtResults
}

get("y", environment(dtFuncs$func[[1]]))
get("y", environment(dtFuncs$func[[2]]))

x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)

## better approach: maybe safer to avoid mixing objects treated by reference
## (data.table & closures) all together...
fList <- lapply(split(data, by = "group"), function(x) {
  with(x, stats::approxfun(x, y, rule = 2))
})
fList
fList[[1]](.07) != fList[[2]](.07)
I want to do an aggregation on the result of a join (x and i are keyed data.tables) without materializing the whole result. It is similar to what by = .EACHI tries to solve, but for grouping columns that are added by the join.
D1 <- data.table(x = rnorm(1:100), i = sample(1:100, 100, replace = TRUE),
                 k = sample(1:100, 100, replace = TRUE))
D2 <- data.table(x = rnorm(1:100), j = sample(1:100, 100, replace = TRUE),
                 k = sample(1:100, 100, replace = TRUE))
setkey(D1, k)
setkey(D2, k)
I would like to sum x over the result of the join of D1 and D2, grouped by the pairs (i, j), without materializing the result of D1[D2]:
D1[D2, list(x = sum(x * i.x)), by = list(i, j), allow.cartesian = TRUE] ## Fails
D1[D2, allow.cartesian = TRUE][, list(x = sum(x * i.x)), by = list(i, j)] ## Is the result I want but uses up more memory
Is there a way to do this? If not will it be implemented in a future version?
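One partial workaround (a sketch, not a built-in data.table feature) is to process the join in chunks of the key k, aggregate each chunk, and then combine the partial sums, so the full Cartesian result never has to exist at once:
ks <- unique(D2$k)
chunks <- split(ks, cut(seq_along(ks), 4))   # 4 key chunks; tune to available memory
partial <- lapply(chunks, function(kk)
  D1[D2[.(kk)], on = "k", allow.cartesian = TRUE][
    , list(x = sum(x * i.x)), by = list(i, j)])
res <- rbindlist(partial)[, list(x = sum(x)), by = list(i, j)]
Peak memory is then bounded by the largest chunk rather than the full join, at the cost of a second, much smaller aggregation.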
I have the following data:
y <- data.table(cbind(week = rep(1:61, 5352),
                      ID = rep(1:5352, each = 61),
                      w = runif(326472),
                      v = runif(326472)))
y$v[sample(1:326472, 10000, replace = FALSE)] <- NA
for which I'm running the code below, which creates a rolling mean of variable v, ignoring outliers and NAs.
The code works, but with poor performance.
I'm sure there is a more efficient way to run it using apply or something similar, but I've been unsuccessful in creating a faster version. Can anyone shed some light on how to make it more efficient?
library(zoo)   # for rollapplyr

IDs <- unique(y$ID)
y$vol_m12 <- 0

for (i in 1:length(IDs)) {
  x <- y[ID == IDs[i]]
  outlier <- 0.2
  w_outlier <- quantile(x$w, c(outlier), na.rm = T)
  v_outlier <- quantile(x$v, c(1 - outlier), na.rm = T)
  # Ignore outliers
  x$v_temp <- x$v
  x$v_temp[((x$v_temp >= v_outlier) & (x$w <= w_outlier))] <- NA
  # Creating rolling mean
  y$vol_m12[y$ID == IDs[i]] <- x[, rollapplyr(v_temp, 12, (mean), fill = NA, na.rm = T)]
}
Thanks for the replies. Following 42's advice, I've produced the following code:
library(RcppRoll)

outlier <- 0.2   # same threshold as in the loop above

# Ignore outliers
y[, w_out := quantile(w, c(outlier), na.rm = T), by = ID]
y[, v_out := quantile(v, c(1 - outlier), na.rm = T), by = ID]
y[((v <= v_out) & (w >= w_out)), v_temp := v]
y[, w_out := NULL]
y[, v_out := NULL]

y[, v_m12 := roll_mean(as.matrix(v_temp), n = 12L, fill = NA,
                       align = c("right"), normalize = TRUE, na.rm = T), by = ID]
System time is about 0.59 seconds against 10.36 seconds for the solution below, which uses rollapplyr (though it is probably possible to make the outlier removal more efficient).
y[, v_m12 := rollapplyr(v_temp, 12, (mean), fill = NA, na.rm = T), by = ID]
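For reference, a rough way to reproduce the timing comparison (a sketch; the v_m12_fast/v_m12_slow column names are just for this comparison, and absolute times will vary by machine):
library(zoo)   # for rollapplyr

system.time(
  y[, v_m12_fast := roll_mean(as.matrix(v_temp), n = 12L, fill = NA,
                              align = "right", na.rm = TRUE), by = ID]
)
system.time(
  y[, v_m12_slow := rollapplyr(v_temp, 12, mean, fill = NA, na.rm = TRUE), by = ID]
)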
Let's start with some generated data which is pretty realistic:
tmp <- data.table(
label = sprintf( "X%03d", 1:500),
start = sample( 50:950, 500, replace=TRUE ),
length = round( 20 * rf( rep(1, 500), 5, 5 ), 0 )
)
DT <- tmp[ , list( t = seq( start, length.out=length ) ), by = label ]
DT[ , I := sample(1:100, 1) * dbeta( seq(from=0,to=1, length.out=length(t)), sample(3:6,1), sample(5:10,1) ), by = label ]
DT <- DT[ I > 1E-2 ]
DT represents time series data for (in this case) 500 labels:
library(ggplot2)
ggplot( DT[ t %between% c(100,200) ], aes( x = t, y = I, group = label ) ) +
geom_line()
I want to correlate the data for all label pairs, provided that they have sufficient overlap. This is my approach:
# feel free to use just a subset here
labs <- DT[ , unique( label ) ][1:50]
# is needed for fast intersecting
setkey( DT, t )
# just needed for tracking progress
count <- 0
progress <- round(seq(from = 1, to = length(labs) * (length(labs) - 1) / 2,
                      length.out = 100), 0)

corrs <-
  combn(labs, m = 2, simplify = TRUE, minOverlap = 5, FUN = function(x, minOverlap) {
    # progress
    count <<- count + 1
    if (count %in% progress) {
      cat(round(100 * count / max(progress), 0), "..")
    }
    # check overlap and correlate
    a <- DT[label == x[1]]
    b <- DT[label == x[2]]
    iscectT <- intersect(a[, t], b[, t])
    n <- length(iscectT)
    if (n >= minOverlap) {
      R <- cor(a[J(iscectT)][, I], b[J(iscectT)][, I])
      return(c(x[1], x[2], n, min(iscectT), max(iscectT), R))
    } else {
      # only needed because of simplify = TRUE
      return(rep(NA, 6))
    }
  })
This works fine, but is much slower than expected; in this particular case it takes up to 10 minutes on my machine.
Any help on improving the performance of this approach is highly appreciated. Questions that came to my mind:
Do I have to expect any side effects on DT if I deploy one of R's parallelization mechanisms, e.g. foreach? Is there a parallelization interface for data.table, as there is for example for plyr?
Is there a way of using combn with simplify = FALSE without runtimes becoming horrible the longer the process runs? I assume that a lot of list copying takes place because of growing list capacities.
Is there anything I can do on the algorithmic side to make this faster?
As Roland suggested in his comment, using combn just to calculate the combinations of labels and then performing the joins directly on the data.table is orders of magnitude faster:
minOverlap <- 5   # same threshold as in the question

corrs <- as.data.frame(do.call(rbind, combn(labs, m = 2, simplify = FALSE)),
                       stringsAsFactors = FALSE)
names(corrs) <- c("a", "b")
setDT(corrs)
setkey(DT, label)
setkey( corrs, a )
corrs <- corrs[ DT, nomatch = 0, allow.cartesian = TRUE]
setkey(corrs, b, t)
setkey(DT, label, t)
corrs <- corrs[ DT, nomatch = 0 ]
corrs[ , overlap := .N >= minOverlap , by = list(a,b) ]
corrs <- corrs[ (overlap) ]
corrs <- corrs[ ,list( start = min(t), end = max(t), R = cor(I,I.1) ), by = list(a,b) ]