Given a data frame or matrix with an arbitrary number of rows and columns, what is the fastest way to apply a function to all pairwise combinations of columns?
For example, if I have a data table:
library(data.table)

N <- 3
K <- 3
data <- data.table(id = seq(N))
for (k in seq(K)) {
  data[[paste0("V", k)]] <- runif(N)  # add K columns of random data
}
And I want to compute the simple difference between all pairs of columns, I could loop (or lapply) over columns:
differences <- data.table(foo = seq(N))  # placeholder column just to set the number of rows
for (var1 in names(data)) {
  for (var2 in names(data)) {
    if (var1 == var2) next
    if (which(names(data) == var1) > which(names(data) == var2)) next  # keep each pair once
    combo <- paste0(var1, var2)
    differences[[combo]] <- data[[var1]] - data[[var2]]
  }
}
But as K gets larger, this becomes absurdly slow.
One solution I've considered is to make two new data tables using combn and subtract them:
combos <- combn(colnames(data), 2)
a <- data[, combos[1, ], with = FALSE]
b <- data[, combos[2, ], with = FALSE]
differences <- a - b
But as N and K get larger, this becomes very memory intensive (though faster than looping).
It seems to me that the outer product of the matrix with itself is probably the best way to go, but I can't piece it together. This is especially hard if I want to apply an arbitrary function (RMSE for example), instead of just the difference.
What's the fastest way?
If it is necessary to have the data in a matrix first, you can do the following:
library(data.table)
data <- matrix(runif(300*500), nrow = 300, ncol = 500)
data.DT <- setkey(
  data.table(c(data),                           # values; left unnamed, so the column is called V1
             colId = rep(1:500, each = 300),
             rowId = rep(1:300, times = 500)),
  colId
)
diff.DT <- data.DT[
  , {
    ccl <- unique(colId)
    vv <- V1
    data.DT[colId > ccl, .(col2 = colId, V1 - vv)]
  }
  , keyby = .(col1 = colId)
]
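For an arbitrary pairwise function such as the RMSE mentioned in the question, one option (just a sketch, not part of the answer above, applied to the small data.table from the question rather than the matrix) is to let combn() iterate over the column-name pairs directly; rmse() here is only an illustrative placeholder:
## sketch: apply an arbitrary function to every pair of columns
rmse <- function(a, b) sqrt(mean((a - b)^2))
pairs <- combn(names(data), 2, simplify = FALSE)   # all column-name pairs
res <- vapply(pairs, function(p) rmse(data[[p[1]]], data[[p[2]]]), numeric(1))
names(res) <- vapply(pairs, paste, character(1), collapse = "_")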
I have a data table that provides the length and composition of a set of vectors, for example:
library(data.table)

set.seed(1)
dt <- data.table(length = c(100, 150),
                 n_A = c(30, 30),
                 n_B = c(20, 100),
                 n_C = c(50, 20))
I need to randomly split each vector into two subsets with 80% and 20% of observations respectively. I can currently do this using a for loop. For example:
dt_80_list <- list()  # create output lists
dt_20_list <- list()

for (i in 1:nrow(dt)) {  # for each row in the data.table
  # create a randomised vector with the given number of each component
  sample_vec <- sample(c(rep("A", dt$n_A[i]),
                         rep("B", dt$n_B[i]),
                         rep("C", dt$n_C[i])))
  # subset 80% of the vector
  sample_vec_80 <- sample_vec[1:floor(length(sample_vec) * 0.8)]
  # count the number of each component in the subset and output to list
  dt_80_list[[i]] <- data.table(length = length(sample_vec_80),
                                n_A = sum(sample_vec_80 == "A"),
                                n_B = sum(sample_vec_80 == "B"),
                                n_C = sum(sample_vec_80 == "C"))
  # subtract the counts in the 80% subset from the totals to get the 20% subset
  dt_20_list[[i]] <- data.table(length = dt$length[i] - dt_80_list[[i]]$length,
                                n_A = dt$n_A[i] - dt_80_list[[i]]$n_A,
                                n_B = dt$n_B[i] - dt_80_list[[i]]$n_B,
                                n_C = dt$n_C[i] - dt_80_list[[i]]$n_C)
}
dt_80 <- do.call("rbind", dt_80_list) # collapse lists to output data.tables
dt_20 <- do.call("rbind", dt_20_list)
However, the dataset I need to apply this to is very large, and this is too slow. Does anyone have any suggestions for how I could improve performance?
Thanks.
(I assumed your dataset consists of many more rows, but only a few columns.)
Here's a version I came up with, with three main changes:
use .N and by= to count the number of "A", "B", "C" drawn in each row
use the size argument in sample
join the original dt and dt_80 to calculate dt_20 without a for loop
## draw training data
dt_80 <- dcast(
  dt[, row := 1:nrow(dt)
     ][, .(draw = sample(c(rep("A80", n_A),
                           rep("B80", n_B),
                           rep("C80", n_C)),
                         size = .8 * length)),
       by = row
       ][, .N, by = .(row, draw)],
  row ~ draw, value.var = "N"
)[, length80 := A80 + B80 + C80]

## draw test data
dt_20 <- dt[dt_80,
            .(A20 = n_A - A80,
              B20 = n_B - B80,
              C20 = n_C - C80),
            on = "row"][, length20 := A20 + B20 + C20]
There is probably still room for optimization, but I hope it already helps :)
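As a quick sanity check (not part of the answer itself), the 80% and 20% counts can be verified to add back up to the original totals; this assumes the column names created above:
## sketch: the two splits should reconstruct the original row totals
stopifnot(all(dt_80$length80 + dt_20$length20 == dt$length))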
EDIT
Here I add my initial idea. I did not post it at first because the code above is much faster, but this one might be more memory-efficient, which seems crucial in your case. So even if you already have a working solution, this might be of interest...
library(data.table)
library(Rfast)

## add row numbers
dt[, row := 1:nrow(dt)]

## sampling function
sampfunc <- function(n_A, n_B, n_C) {
  draw <- sample(c(rep("A80", n_A),
                   rep("B80", n_B),
                   rep("C80", n_C)),
                 size = .8 * (n_A + n_B + n_C))
  out <- Rfast::Table(draw)
  return(as.list(out))
}

## draw training data
dt_80 <- dt[, sampfunc(n_A, n_B, n_C), by = row]
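The 20% part is not repeated for this variant, but it could be derived with the same join as above (a sketch; it assumes Rfast::Table() yields columns named A80, B80 and C80 here):
## sketch: derive the 20% counts from the memory-efficient dt_80 as well
dt_20 <- dt[dt_80,
            .(A20 = n_A - A80,
              B20 = n_B - B80,
              C20 = n_C - C80),
            on = "row"][, length20 := A20 + B20 + C20]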
My understanding regarding the difference between the merge() function (in base R) and the join() functions of plyr and dplyr is that join() is faster and more efficient when working with "large" data sets.
Is there some way to determine a threshold for when to use join() over merge(), without resorting to a heuristic approach?
I am sure you will be hard-pressed to find a "hard and fast" rule for when to switch from one function to another. As others have mentioned, there is a set of tools in R to help you measure performance. object.size and system.time are two such functions that look at memory usage and execution time, respectively. One general approach is to measure the two directly over an arbitrarily expanding data set. Below is one attempt at this. We will create a data frame with an 'id' column and a random set of numeric values, letting the data frame grow and measuring how the two metrics change. I'll use inner_join here as you mentioned dplyr. We will measure time as "elapsed" time.
library(tidyverse)
set.seed(424)

# number of rows in each cycle
growth <- c(100, 1000, 10000, 100000, 1000000, 5000000)

# empty vectors to store results
n <- 1
l1 <- c()
l2 <- c()

# test for inner_join in dplyr
for (i in growth) {
  x <- data.frame("id" = 1:i, "value" = rnorm(i, 0, 1))
  y <- data.frame("id" = 1:i, "value" = rnorm(i, 0, 1))
  test <- inner_join(x, y, by = c('id' = 'id'))
  l1[[n]] <- object.size(test)
  print(system.time(test <- inner_join(x, y, by = c('id' = 'id')))[3])
  l2[[n]] <- system.time(test <- inner_join(x, y, by = c('id' = 'id')))[3]
  n <- n + 1
}
# empty vectors to store results
n <- 1
l3 <- c()
l4 <- c()

# test for merge
for (i in growth) {
  x <- data.frame("id" = 1:i, "value" = rnorm(i, 0, 1))
  y <- data.frame("id" = 1:i, "value" = rnorm(i, 0, 1))
  test <- merge(x, y, by = c('id'))
  l3[[n]] <- object.size(test)
  # print(object.size(test))
  print(system.time(test <- merge(x, y, by = c('id')))[3])
  l4[[n]] <- system.time(test <- merge(x, y, by = c('id')))[3]
  n <- n + 1
}
# plotting output (some coercion may happen, so be it)
plot_df <- bind_rows(data.frame("size_bytes" = l3, "time_sec" = l4, "id" = "merge"),
                     data.frame("size_bytes" = l1, "time_sec" = l2, "id" = "inner_join"))
plot_df$size_MB <- plot_df$size_bytes / 1000000
ggplot(plot_df, aes(x = size_MB, y = time_sec, color = id)) + geom_line()
merge seems to perform worse out of the gate, but the gap really opens up around ~20 MB. Is this the final word on the matter? No. But such testing can give you an idea of how to choose a function.
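For more stable timings at a single size of interest, the comparison could also be run with the microbenchmark package (a sketch, not part of the test above; it assumes microbenchmark and dplyr are installed):
# sketch: repeated timings of merge vs. inner_join at one fixed size
library(microbenchmark)
i <- 100000
x <- data.frame(id = 1:i, value = rnorm(i))
y <- data.frame(id = 1:i, value = rnorm(i))
microbenchmark(
  merge      = merge(x, y, by = "id"),
  inner_join = dplyr::inner_join(x, y, by = "id"),
  times = 20
)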
I thought that the following problem must have been answered or a function must exist to do it, but I was unable to find an answer.
I have a nested loop that takes a row from one 3-column data frame and copies it next to each of the other rows, to form a 6-column data frame (with all possible combinations). This works fine, but with a medium-sized data set (800 rows), the loops take forever to complete the task.
I will demonstrate on a sample data set:
Sdat <- data.frame(
  x = c(10, 20, 30, 40),
  y = c(15, 25, 35, 45),
  ID = c(1, 2, 3, 4)
)
compar <- data.frame(matrix(nrow = 0, ncol = 6))  # to contain all combinations
names(compar) <- c("x", "y", "ID", "x", "y", "ID")
N <- nrow(Sdat)  # how many different points we have

for (i in 1:N) {
  for (j in 1:N) {
    Temp1 <- Sdat[i, ]  # data from 1st point
    Temp2 <- Sdat[j, ]  # data from 2nd point
    C <- cbind(Temp1, Temp2)
    compar <- rbind(C, compar)
  }
}
These loops provide exactly the output that I need for further analysis. Any suggestions for vectorizing this section?
You can do:
ind <- seq_len(nrow(Sdat))
grid <- expand.grid(ind, ind)
compar <- cbind(Sdat[grid[, 1], ], Sdat[grid[, 2], ])
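Note that cbind() leaves you with duplicated column names here; if distinct names are needed, they can be assigned afterwards (the suffixes below are just one possible choice):
names(compar) <- c("x", "y", "ID", "x2", "y2", "ID2")  # second trio renamed for clarity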
A naive solution using rep (assuming you are happy with a data frame output):
compar <- data.frame(x = rep(Sdat$x, each = N),
                     y = rep(Sdat$y, each = N),
                     id = rep(1:N, each = N),
                     x1 = rep(Sdat$x, N),
                     y1 = rep(Sdat$y, N),
                     id_1 = rep(1:N, N))
EDIT: I found out that the Matrix package does everything I need. Super fast and flexible. Specifically, the related functions are
Data <- sparseMatrix(i=Data[,1], j=Data[,2], x=Data[,3])
or simply
Data <- Matrix(data=Data,sparse=T)
Once you have your matrix in this Matrix class, everything should work smoothly like a regular matrix (for the most part, anyway).
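Concretely, once the data are in a sparseMatrix (call it M, as built by the sparseMatrix() call above), the full row-by-row cosine similarity can be obtained from a single cross-product. This is only a sketch; note that the final division yields a dense nrows-by-nrows matrix, which is itself large for 19000 rows:
library(Matrix)
## sketch: cosine similarity between all rows of a sparse matrix M
cp    <- tcrossprod(M)                         # row-by-row dot products (stays sparse)
norms <- sqrt(diag(cp))                        # Euclidean length of each row
sim   <- as.matrix(cp) / outer(norms, norms)   # cosine similarities (dense result)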
======================================================
I have a dataset in "Long format" right now, meaning that it has 3 columns: row name, column name, and value. All of the "missing" row-column pairs are equal to zero.
I need to come up with an efficient way to calculate the cosine similarity (or even just the regular dot product) between all possible pairs of rows. The full data matrix is 19000 x 62000, which is why I need to work with the Long format instead.
I came up with the following method, but it's WAY too slow. Any tips on maximizing efficiency, or any suggestions of a better method overall, would be GREATLY appreciated. Thanks!
Data <- matrix(c(1,1,1,2,2,2,3,3,3,1,2,3,1,2,4,1,4,5,1,2,2,1,1,1,1,3,1),
ncol = 3, byrow = FALSE)
Data <- data.frame(Data)
cosine.sparse <- function(data) {
  a <- Sys.time()
  colnames(data) <- c('V1', 'V2', 'V3')
  nvars <- length(unique(data[, 2]))
  nrows <- length(unique(data[, 1]))
  sim <- matrix(nrow = nrows, ncol = nrows)

  for (i in 1:nrows) {
    data.i <- data[data$V1 == i, ]
    length.i.sq <- sum(data.i$V3^2)
    for (j in i:nrows) {
      data.j <- data[data$V1 == j, ]
      length.j.sq <- sum(data.j$V3^2)
      common.vars <- intersect(data.i$V2, data.j$V2)
      row1 <- data.i[data.i$V2 %in% common.vars, 3]
      row2 <- data.j[data.j$V2 %in% common.vars, 3]
      cos.sim <- sum(row1 * row2) / sqrt(length.i.sq * length.j.sq)
      sim[i, j] <- sim[j, i] <- cos.sim
    }
    if (i %% 500 == 0) {cat(i, "rows have been calculated.\n")}
  }

  b <- Sys.time()
  time.elapsed <- b - a
  print(time.elapsed)
  return(sim)
}

cosine.sparse(Data)
Assume I have a list of length D containing data.table objects. Each data.table has the same columns (X, Y) and same number of rows N. I'd like to construct another table with N rows, with the individual rows taken from the tables specified by an index vector also of length N. Restated, each row in the final table is taken from one and only one of the tables in the array, with the index of the source table specified by an existing vector.
library(data.table)

N = 100  # rows in each table (actual ~1000000 rows)
D = 4    # number of tables in array (actual ~100 tables)

tableArray = vector("list", D)
for (d in 1:D) {
  tableArray[[d]] = data.table(X = rnorm(N), Y = d)  # actual ~100 columns
}

tableIndexVector = sample.int(D, N, replace = TRUE)  # length N of random 1:D

finalTable = copy(tableArray[[1]])  # just for length and column names
for (n in 1:N) {
  finalTable[n] = tableArray[[tableIndexVector[n]]][n]
}
This seems to work the way I want, but the array within array notation is hard to understand, and I presume the performance of the for loop isn't going to be very good. It seems like there should be some elegant way of doing this, but I haven't stumbled across it yet. Is there another way of doing this that is efficient and less arcane?
(In case you are wondering, each table in the array represents simulated counterfactual observations for a subject under a particular regime of treatment, and I want to sample from these with different probabilities to test the behavior of different regression approaches with different ratios of regimes observed.)
for loops work just fine with data.table, but we can improve the performance of your specific loop significantly (I believe) using the following approaches.
Approach # 1
Use set instead, as it avoids the [.data.table overhead
Don't loop over 1:N, because you can simplify your loop to run only over the unique values of tableIndexVector and assign all the corresponding rows at once. This should decrease the run time by a factor of at least ~10K (as N is of size 1MM and D is only of size 100, while length(unique(tableIndexVector)) <= D)
So you could basically convert your loop to the following:
for (i in unique(tableIndexVector)) {
  indx <- which(tableIndexVector == i)
  set(finalTable, i = indx, j = 1:2, value = tableArray[[i]][indx])
}
Approach # 2
Another approach is to use rbindlist to combine all the tables into one big data.table while adding the new idcol parameter in order to identify the different tables within the big table. You will need the development version of data.table for that. This avoids the loop as requested, but the result will be ordered by the tables' appearance.
temp <- rbindlist(tableArray, idcol = "indx")
indx <- temp[, .I[which(tableIndexVector == indx)], by = indx]$V1
finalTable <- temp[indx]
Here's a benchmark on a bigger data set:
N = 100000
D = 10
tableArray = vector("list", D)
set.seed(123)
for (d in 1:D) {
  tableArray[[d]] = data.table(X = rnorm(N), Y = d)
}
set.seed(123)
tableIndexVector = sample.int(D, N, replace=TRUE)
finalTable = copy(tableArray[[1]])
finalTable2 = copy(tableArray[[1]])
## Your approach
system.time(for (n in 1:N) {
  finalTable[n] = tableArray[[tableIndexVector[n]]][n]
})
#    user  system elapsed
#  154.79   33.14  191.57

## My approach # 1
system.time(for (i in unique(tableIndexVector)) {
  indx <- which(tableIndexVector == i)
  set(finalTable2, i = indx, j = 1:2, value = tableArray[[i]][indx])
})
#    user  system elapsed
#    0.01    0.00    0.02

## My approach # 2
system.time({
  temp <- rbindlist(tableArray, idcol = "indx")
  indx <- temp[, .I[which(tableIndexVector == indx)], by = indx]$V1
  finalTable3 <- temp[indx]
})
#    user  system elapsed
#    0.11    0.00    0.11
identical(finalTable, finalTable2)
## [1] TRUE
identical(setorder(finalTable, X), setorder(finalTable3[, indx := NULL], X))
## [1] TRUE
So, to conclude:
My first approach is by far the fastest, running about 15K times faster than your original one. It also returns an identical result.
My second approach is still about 1.5K times faster than your original approach and avoids the loop (which you don't like for some reason). However, its result is ordered by the tables' appearance, so the row order isn't identical to your result.
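If the original row order matters for the second approach, it can be recovered with the same grouping trick (a sketch, not part of the benchmark above):
## sketch: map each row of finalTable3 back to its original position and reorder
pos <- temp[, which(tableIndexVector == indx), by = indx]$V1   # original row numbers
finalTable3 <- finalTable3[order(pos)]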