Performance: combn on large data.table

Let's start with some generated data, which are fairly realistic:
tmp <- data.table(
  label  = sprintf("X%03d", 1:500),
  start  = sample(50:950, 500, replace = TRUE),
  length = round(20 * rf(rep(1, 500), 5, 5), 0)
)
DT <- tmp[, list(t = seq(start, length.out = length)), by = label]
DT[, I := sample(1:100, 1) *
     dbeta(seq(from = 0, to = 1, length.out = length(t)),
           sample(3:6, 1), sample(5:10, 1)),
   by = label]
DT <- DT[I > 1E-2]
DT represents time series data for (in this case) 500 labels:
library(ggplot2)
ggplot( DT[ t %between% c(100,200) ], aes( x = t, y = I, group = label ) ) +
geom_line()
I want to correlate the data for all label pairs, given that they have sufficient overlap. This is my approach:
# feel free to use just a subset here
labs <- DT[, unique(label)][1:50]
# is needed for fast intersecting
setkey(DT, t)
# just needed for tracking progress
count <- 0
progress <- round(seq(from = 1, to = length(labs) * (length(labs) - 1) / 2, length.out = 100), 0)
corrs <-
  combn(labs, m = 2, simplify = TRUE, minOverlap = 5, FUN = function(x, minOverlap) {
    # progress
    count <<- count + 1
    if (count %in% progress) {
      cat(round(100 * count / max(progress), 0), "..")
    }
    # check overlap and correlate
    a <- DT[label == x[1]]
    b <- DT[label == x[2]]
    iscectT <- intersect(a[, t], b[, t])
    n <- length(iscectT)
    if (n >= minOverlap) {
      R <- cor(a[J(iscectT)][, I], b[J(iscectT)][, I])
      return(c(x[1], x[2], n, min(iscectT), max(iscectT), R))
    } else {
      # only needed because of simplify = TRUE
      return(rep(NA, 6))
    }
  })
This works, but it is much slower than expected. In this particular case it takes up to 10 minutes on my machine.
Any help on improving the performance of this approach is highly appreciated. Questions that came to my mind:
Do I have to expect any side effects on DT if I deploy one of R's parallelization mechanisms, e.g. foreach (see the sketch after this list)? Is there a parallelization interface for data.table, as there is for example for plyr?
Is there a way of using combn with simplify = FALSE without the runtime getting worse the longer the process runs? I assume that a lot of list copying takes place because of growing list capacities.
Is there anything I can do on the algorithmic side to make this faster?
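For what it's worth, here is a minimal sketch (not from the original post) of how the pairwise loop could be parallelized with base R's parallel package. Since DT is only read inside the worker function, forked workers should not cause side effects on it; note that mclapply relies on forking and therefore does not parallelize on Windows, and the mc.cores value below is just a placeholder:
library(parallel)
pairs <- combn(labs, m = 2, simplify = FALSE)
corrList <- mclapply(pairs, mc.cores = 4, FUN = function(x) {
  a <- DT[label == x[1]]
  b <- DT[label == x[2]]
  iscectT <- intersect(a[, t], b[, t])
  n <- length(iscectT)
  if (n < 5) return(rep(NA, 6))   # same minOverlap threshold as above
  c(x[1], x[2], n, min(iscectT), max(iscectT),
    cor(a[t %in% iscectT, I], b[t %in% iscectT, I]))
})
corrs <- do.call(rbind, corrList)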

As Roland suggested in his comment, using combn just to generate the label combinations and then performing the joins directly on the data.table is orders of magnitude faster:
corrs <- as.data.frame(do.call(rbind, combn(labs, m = 2, simplify = FALSE)), stringsAsFactors = FALSE)
names(corrs) <- c("a", "b")
setDT(corrs)
minOverlap <- 5   # same threshold as in the original approach
setkey(DT, label)
setkey(corrs, a)
corrs <- corrs[DT, nomatch = 0, allow.cartesian = TRUE]
setkey(corrs, b, t)
setkey(DT, label, t)
corrs <- corrs[DT, nomatch = 0]
corrs[, overlap := .N >= minOverlap, by = list(a, b)]
corrs <- corrs[(overlap)]
corrs <- corrs[, list(start = min(t), end = max(t), R = cor(I, I.1)), by = list(a, b)]

Related

Changing behavior for closure stored in data.table between R 3.4.3 and R 3.6.0

I noticed the following peculiar behavior when I upgraded from R 3.4.3 to R 3.6.0 (both using data.table 1.12.6). In 3.4.3 the code below leads to the all.equal statement being TRUE, but in 3.6.0 there is a mean relative difference, which comes from the fact that even though we are trying to access the approxfun calculated from group "a", the values from group "b" are used (probably due to lazy evaluation). In 3.6.0 the issue can be solved by adding a copy statement in the calls to approxfun, based on this question:
Handling of closures in data.table
The fascinating thing to me is that I do not get an error in 3.4.3. Any idea what changed?
library(data.table)
data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)
dtFuncs <- data[, list(
  func = list(stats::approxfun(x, y, rule = 2))
), by = group]
f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  dtResults <- dtResults[, {
    .g <- group
    f2 <- dtFuncs[group == .g, func][[1]]
    list(x = x, y = f2(x))
  }, by = group]
  dtResults
}
x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)
After running git bisect on the r-source, I was able to deduce that it was this commit that caused the behavior: https://github.com/wch/r-source/commit/adcf18b773149fa20f289f2c8f2e45e6f7b0dbfe
What fundamentally happened is that when the x values passed to approxfun were already sorted, an internal copy was no longer made. If the data had been randomly ordered, the code would have continued to work (see the snippet below).
The lesson for me is that it's probably best not to mix complicated objects with data.table, since the same environment is reused for each "by" group (or to be very deliberate with data.table::copy).
## should be run under R >= 3.6.0 to see the disparity
library(data.table)

## original sorted x (does not work)
data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)
dtFuncs <- data[, {
  print(environment())
  list(
    func = list(stats::approxfun(x, y, rule = 2))
  )
}, by = group]
f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  dtResults <- dtResults[, {
    .g <- group
    f2 <- dtFuncs[group == .g, func][[1]]
    list(x = x, y = f2(x))
  }, by = group]
  dtResults
}
get("y", environment(dtFuncs$func[[1]]))
get("y", environment(dtFuncs$func[[2]]))
x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)

## unsorted x (works)
data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)
set.seed(10)
data <- data[sample(1:.N, .N)]
dtFuncs <- data[, {
  print(environment())
  list(
    func = list(stats::approxfun(x, y, rule = 2))
  )
}, by = group]
f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  dtResults <- dtResults[, {
    .g <- group
    f2 <- dtFuncs[group == .g, func][[1]]
    list(x = x, y = f2(x))
  }, by = group]
  dtResults
}
get("y", environment(dtFuncs$func[[1]]))
get("y", environment(dtFuncs$func[[2]]))
x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)

## better approach: maybe safer to avoid mixing objects treated by reference
## (data.table & closures) altogether...
fList <- lapply(split(data, by = "group"), function(x) {
  with(x, stats::approxfun(x, y, rule = 2))
})
fList
fList[[1]](.07) != fList[[2]](.07)
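For completeness, here is a minimal sketch of the copy() workaround mentioned at the top; the explicit copies give each closure its own x and y vectors instead of references into the shared by-group environment:
dtFuncs <- data[, list(
  func = list(stats::approxfun(copy(x), copy(y), rule = 2))   # data.table::copy forces fresh vectors per group
), by = group]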

Parallelization with data.table

I have the following problem. I have a piece-wise linear function described by (xPoints, yPoints) and want to compute quickly, over and over again, the implied y-value for a long list of x's, where x may fall outside the range of xPoints. I have coded a function f_pwl that computes the implied y-value, but it is slow, so I was trying to parallelize its call. However, that is actually slower than using the data.table := syntax. I would appreciate suggestions to speed things up, either by improving my f_pwl function or by implementing an efficient parallelization, as I have access to 20 cores.
Here is some sample code.
# libraries
require(data.table)   # for fread, work with large data
require(abind)        # for abind()
require(foreach)      # for parallel processing, used with doParallel
require(doParallel)   # for parallel processing, used with foreach

f_pwl <- function(x) {
  temp <- as.vector(rep(NA, length = length(x)), mode = "double")
  for (i in seq(from = 1, to = length(x), by = 1)) {
    if (x[i] > max(xPoints) | x[i] < min(xPoints)) {
      # nothing to do, temp[i] <- NA
    } else if (x[i] == max(xPoints)) {
      # value equal max(yPoints)
      temp[i] <- max(yPoints)
    } else {
      # value is f_pwl(x)
      xIndexVector = as.logical(x[i] >= xPoints & abind(xPoints[2:length(xPoints)], max(xPoints)) > x[i])
      xIndexVector_plus1 = shift(xIndexVector, n = 1, fill = FALSE, type = "lag")
      alpha_j = (xPoints[xIndexVector_plus1] - x[i]) / (xPoints[xIndexVector_plus1] - xPoints[xIndexVector])
      temp[i] <- alpha_j %*% yPoints[xIndexVector] + (1 - alpha_j) %*% yPoints[xIndexVector_plus1]
    }
  } # end for i
  as.vector(temp, mode = "double")
}
## Main program
xPoints <- c(4, 9, 12, 15, 18, 21)
yPoints <- c(1, 2, 3, 4, 5, 6)
x <- rnorm(1e4, mean = 12, sd = 5)
dt <- as.data.table(x)
dt[, c("y1", "y2", "y3") := as.vector(mode = "double", NA)]

# data.table := command
system.time({
  dt[, y2 := f_pwl(x)]
})

# mapply
system.time({
  dt[, y1 := mapply(f_pwl, x), by = .I]
})

# parallel
system.time({
  # setup parallel backend to use many processors
  cores <- detectCores()
  cl <- makeCluster(cores[1] - 1, type = "FORK")   # not to overload your computer
  registerDoParallel(cl)
  dt$y3 <- foreach(i = 1:nrow(dt), .combine = cbind) %dopar% {
    tempY <- f_pwl(dt$x[i])
    tempY
  }
  # stop cluster
  stopCluster(cl)
})

summary(dt[, .(y1 - y2, y1 - y3, y2 - y3)])
First, calculate and store the alpha_j's.
Then sort DT by x and cut it into the relevant intervals before performing the linear interpolation:
alpha <- c(NA, diff(yPoints) / diff(xPoints))
DT[order(x),
   y := alpha[.GRP] * (x - xPoints[.GRP - 1L]) + yPoints[.GRP - 1L],
   by = cut(x, xPoints)]
Please let me know how it performs.
data:
library(data.table)
## Main program
set.seed(27L)
xPoints <- c(4, 9, 12, 15, 18, 21)
yPoints <- c(1, 2, 3, 4, 5, 6)
DT <- data.table(x=rnorm(1e4, mean=12, sd=5))
check:
f_pwl <- function(x) {
  temp <- as.vector(rep(NA, length = length(x)), mode = "double")
  for (i in seq(from = 1, to = length(x), by = 1)) {
    if (x[i] > max(xPoints) | x[i] < min(xPoints)) {
      # nothing to do, temp[i] <- NA
    } else if (x[i] == max(xPoints)) {
      # value equal max(yPoints)
      temp[i] <- max(yPoints)
    } else {
      # value is f_pwl(x)
      xIndexVector = as.logical(x[i] >= xPoints & abind(xPoints[2:length(xPoints)], max(xPoints)) > x[i])
      xIndexVector_plus1 = shift(xIndexVector, n = 1, fill = FALSE, type = "lag")
      alpha_j = (xPoints[xIndexVector_plus1] - x[i]) / (xPoints[xIndexVector_plus1] - xPoints[xIndexVector])
      temp[i] <- alpha_j %*% yPoints[xIndexVector] + (1 - alpha_j) %*% yPoints[xIndexVector_plus1]
    }
  } # end for i
  as.vector(temp, mode = "double")
}
system.time({
  DT[, yOP := f_pwl(x)]
})
DT[abs(y-yOP) > 1e-6]
#Empty data.table (0 rows) of 3 cols: x,y,yOP
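As an aside (not part of the original answer), base R's approx() performs the same piecewise-linear interpolation in compiled code, so it doubles as a convenient cross-check; with the default rule = 1 it returns NA outside the range of xPoints, just like f_pwl:
DT[, yApprox := approx(xPoints, yPoints, xout = x)$y]
DT[abs(y - yApprox) > 1e-6]   # expect an empty data.table here as well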

data.table operations with %dopar% are very slow

I run a loop over the elements of the list grouped_data_list using foreach and %dopar%.
The runtime is terribly slow, even though the workers are visibly busy.
If I write a vectorized routine with lapply, without any parallelism, it takes seconds. What is wrong with my %dopar% loop?
library(data.table)
library('doParallel') # parallel cpu implementation
library('foreach') # parallel looping
grouped_data_dt <- data.table(
Who=c("thdeg","mjg","dfdf","system","df","system","system","hegha","ydvw")
, DocumentExtension=c("jpg","com","dug","182","27","pdf","png","xslt","53")
, What_Action=c("added","removed","added","added","added","removed","added","added","added")
, Date=as.Date(c("2017-11-08","2017-10-10","2017-09-14","2017-09-20","2017-09-21","2017-10-20","2017-10-19","2017-08-24","2017-09-17"))
, Count=c(1,2,3,4,5,6,7,8,9)
)
reported_date_seq_dt <- data.table(
reported_date_seq = as.Date(c(
"2017-08-23","2017-08-24","2017-08-25","2017-08-26","2017-08-27","2017-08-28","2017-08-29","2017-08-30","2017-08-31","2017-09-01","2017-09-02"
,"2017-09-03","2017-09-04","2017-09-05","2017-09-06","2017-09-07","2017-09-08","2017-09-09","2017-09-10","2017-09-11","2017-09-12","2017-09-13"
,"2017-09-14","2017-09-15","2017-09-16","2017-09-17","2017-09-18","2017-09-19","2017-09-20","2017-09-21","2017-09-22","2017-09-23","2017-09-24"
,"2017-09-25","2017-09-26","2017-09-27","2017-09-28","2017-09-29","2017-09-30","2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05"
,"2017-10-06","2017-10-07","2017-10-08","2017-10-09","2017-10-10","2017-10-11","2017-10-12","2017-10-13","2017-10-14","2017-10-15","2017-10-16"
,"2017-10-17","2017-10-18","2017-10-19","2017-10-20","2017-10-21","2017-10-22","2017-10-23","2017-10-24","2017-10-25","2017-10-26","2017-10-27"
,"2017-10-28","2017-10-29","2017-10-30","2017-10-31","2017-11-01","2017-11-02","2017-11-03","2017-11-04","2017-11-05","2017-11-06","2017-11-07"
,"2017-11-08","2017-11-09","2017-11-10","2017-11-11","2017-11-12","2017-11-13","2017-11-14","2017-11-15","2017-11-16","2017-11-17","2017-11-18"
,"2017-11-19","2017-11-20","2017-11-21","2017-11-22","2017-11-23","2017-11-24","2017-11-25","2017-11-26","2017-11-27"
))
)
grouped_data_list <-
split(x = grouped_data_dt
, drop = T
, by = c("Who", "DocumentExtension", "What_Action")
, sorted = T
, keep.by = T
)
cl <- makeCluster(4)
registerDoParallel(cl)

## replace NA with zeros in the timeseries
grouped_data_list_2 <- list()
foreach(
  i = 1:length(grouped_data_list)
) %dopar% {
  x <- grouped_data_list[[i]]
  data.table::setkey(x, Date)
  dt_params <- unlist(
    x[1, -c('Date', 'Count'), with = F]
  )
  y <- x[reported_date_seq_dt]
  y[is.na(Count), (colnames(y)[!colnames(y) %in% c('Date', 'Count')]) := lapply(1:length(dt_params), function(x) dt_params[x])]
  y[is.na(Count), Count := 0]
  grouped_data_list_2 <- c(grouped_data_list_2, list(y))
}
stopCluster(cl)
lapply routine:
## after grouped_data_list is created
rm(group_replace_func)
group_replace_func <- function(x) {
  setkey(x, Date)
  dt_params <- unlist(
    x[1, -c('Date', 'Count'), with = F]
  )
  y <- x[reported_date_seq_dt]
  y[is.na(Count), (colnames(y)[!colnames(y) %in% c('Date', 'Count')]) := lapply(1:length(dt_params), function(x) dt_params[x])]
  y[is.na(Count), Count := 0]
  return(y)
}
grouped_data_list_2 <- lapply(
  grouped_data_list,
  group_replace_func
)
A new version that works fast (following Roland's advice): collect the results from the values returned by foreach instead of appending to grouped_data_list_2 inside the workers (which only modifies each worker's own copy), and iterate directly over x = grouped_data_list so each worker receives only the list elements it actually processes rather than the whole list plus an index:
## parallel work
cl <- makeCluster(4)
registerDoParallel(cl)

## replace NA with zeros in the timeseries
grouped_data_list_2 <- list()
grouped_data_list_2 <- foreach(
  x = grouped_data_list
) %dopar% {
  data.table::setkey(x, Date)
  dt_params <- unlist(
    x[1, -c('Date', 'Count'), with = F]
  )
  y <- x[reported_date_seq_dt]
  y[is.na(Count), (colnames(y)[!colnames(y) %in% c('Date', 'Count')]) := lapply(1:length(dt_params), function(x) dt_params[x])]
  y[is.na(Count), Count := 0]
  y
}
stopCluster(cl)
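Since the lapply version already runs in seconds, the per-group work here is small enough that parallel overhead dominates. As a non-parallel aside (a sketch, not from the original post), the same NA fill can also be done in one shot without splitting at all, by expanding each group to the full date sequence and joining:
full_grid <- grouped_data_dt[, .(Date = reported_date_seq_dt$reported_date_seq),
                             by = .(Who, DocumentExtension, What_Action)]
result <- grouped_data_dt[full_grid, on = .(Who, DocumentExtension, What_Action, Date)]
result[is.na(Count), Count := 0]   # one stacked table instead of a list; group columns carry through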

Nested if else statements in R. Keep getting "Error: no function to return from, jumping to top level"

I know there are lots of questions like this around and I have tried the solutions proposed, but I still could not solve the following.
My aim is to create a function in R which correlates pairs of columns in a data frame. Depending on the number of pairwise complete observations it uses slightly different approaches.
The problem is that no matter what I try, while declaring the function I keep getting:
Error: no function to return from, jumping to top level
and
Error: unexpected '}' in "}"
Here is the function:
corr.loop <- function(df, varsA, varsB, normal, nonnormal) {
results <- matrix(ncol = 8)
colnames(results) <- c("varA", "varB", "type", "complete.obs.n", "estimate", "p", "lower.CI", "upper.CI")
for (i in 1:length(varsA)) {
for (j in 1:length(varsB)) {
if (
pairwise.complete.obs.n(df[, varsA[i]], df[, varsB[j]]) < 3
) {
results <- rbind(results,
c(
varsA[i],
varsB[j],
NA,
pairwise.complete.obs.n(df[, varsA[i]], df[, varsB[j]]),
rep(NA, 4)
))
} else {
type <- ifelse( (varsA[i] %in% nonnormal | varsB[j] %in% nonnormal), "spearman", "pearson")
cor.results <- ifelse(
type == "pearson",
cor.test(
x = df[, varsA[i]],
y = df[, varsB[j]],
alternative = "two.sided",
method = "pearson",
exact = TRUE,
conf.level = 0.95,
continuity = TRUE
),
cor.test(
x = df[, varsA[i]],
y = df[, varsB[j]],
alternative = "two.sided",
method = "spearman",
exact = TRUE,
conf.level = 0.95,
continuity = TRUE
)
)
if (
pairwise.complete.obs.n(df[, varsA[i]], df[, varsB[j]]) = 3
) {
results <- rbind(
results,
c(
varsA[i],
varsB[j],
type,
pairwise.complete.obs.n(df[, varsA[i]], df[, varsB[j]]),
cor.results$estimate,
cor.results$p.value,
NA,
NA
)
)
} else {
results <- rbind(
results,
c(
varsA[i],
varsB[j],
type,
pairwise.complete.obs.n(df[, varsA[i]], df[, varsB[j]]),
cor.results$estimate,
cor.results$p.value,
cor.results$conf.int[1],
cor.results$conf.int[2]
)
)
}
}
}
}
results <- as.data.frame(results[-1, ])
results[, 1:ncol(results)] <- lapply(results[, 1:ncol(results)], as.character)
results[, 4:ncol(results)] <- lapply(results[, 4:ncol(results)], as.numeric)
return(results)
}
Is there something obvious I am missing? Seems I just need a fresh eye here. Thank you!
On line 45 you mean == instead of =: the pairwise.complete.obs.n(...) = 3 inside the second if is parsed as an assignment, which is a syntax error and is what produces those messages (see the small illustration below).
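A tiny illustration (hypothetical values, just to show the parser behavior):
n <- 3
# if (n = 3) cat("hit\n")   # Error: unexpected '=' in "if (n ="
if (n == 3) cat("hit\n")    # == is the comparison operator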

create data frame in R

See: Selecting significant cases from a chi-squared test
The model example given in that question is:
f = function(N=1000){
out <- data.frame("Row" = 1:N
, "Column" = 1:N
, "Chi.Square" = runif(N)
, "df"= sample(N, 1:10, replace=T)
, "p.value" = round(runif(N), 3)
)
return(out)
}
but when I apply this to my model I turn it into:
f = function(N=7000){
combos <- combn(ncol(final),2)
adply(combos, 2, function(x) {
test <- chisq.test(final[, x[1]], final[, x[2]])
out <- data.frame("Row" = colnames(final)[x[1]]
, "Column" = colnames(final[x[2]])
, "Chi.Square" = round(test$statistic,3)
, "df"= test$parameter
, "p.value" = round(test$p.value, 3)
)
return(out)
}}
yet R does not see this as a finished command line. Why?
Get yourself a decent editor :-) The adply( isn't closed. Edit: nor is function(...){. It looks like the final } should really be a ) + }.
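For reference, a balanced version might look like the sketch below (assuming plyr is loaded and a data frame final exists, as in the question; the unused N argument is dropped):
library(plyr)
f <- function() {
  combos <- combn(ncol(final), 2)
  adply(combos, 2, function(x) {
    test <- chisq.test(final[, x[1]], final[, x[2]])
    data.frame(Row        = colnames(final)[x[1]],
               Column     = colnames(final)[x[2]],
               Chi.Square = round(test$statistic, 3),
               df         = test$parameter,
               p.value    = round(test$p.value, 3))
  })   # closes adply(
}      # closes function() {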
