Speeding up calculation across columns - r

I have a few moderately large data frames and need to do a calculation across different columns in the data; for example, I want to compare column i in one data frame with column i - 1 in another. I currently use a for loop. The calculation involves an element-wise comparison of each pair of values, so it is somewhat slow: e.g. I take each column of data, turn it into a matrix and compare it with the transpose of itself (with some additional complications). In my application (in which the data have about 100 columns and 3000 rows) this currently takes about 95 seconds, and I am looking for ways to make it more efficient. If I were comparing the SAME column of each data frame I would try mapply, but because I need to make comparisons across different columns I don't see how that could work. The current code is something like this:
d1 <- as.data.frame(matrix(rnorm(100000), nrow = 1000))
d2 <- as.data.frame(matrix(rnorm(100000), nrow = 1000))
r <- list()
ptm2 <- proc.time()
for (i in 2:100) {
  t <- matrix(0 + d1[, i] > 0, 1000, 1000)
  u <- matrix(d1[, i], 1000, 1000) * t(matrix(d2[, i - 1], 1000, 1000))
  r[[i]] <- t * u
}
proc.time() - ptm2
This takes about 3 seconds on my computer; as mentioned, the actual calculation is a bit more complicated than this MWE suggests. Obviously one could also improve efficiency in the calculation itself, but I am looking for a solution to the 'compare column i to column i - 1' issue.

Based on your example, if you align d1 and d2 ahead of time according to which columns you are comparing, here is how you could use mapply. It turns out to be only marginally faster, so parallel computing would be a better way to achieve real speed gains (see the sketch after the mapply example below).
d1 <- as.data.frame(matrix(rnorm(100000), nrow = 1000))
d2 <- as.data.frame(matrix(rnorm(100000), nrow = 1000))
r <- list()
ptm2 <- proc.time()
for (i in 2:100) {
  t <- matrix(0 + d1[, i] > 0, 1000, 1000)
  u <- matrix(d1[, i], 1000, 1000) * t(matrix(d2[, i - 1], 1000, 1000))
  r[[i]] <- t * u
}
proc.time() - ptm2
#user system elapsed
#0.90 0.87 1.79
# select the last 99 columns of d1 and the first 99 columns of d2, based on your calcs
d1_99 <- as.data.frame(d1[, 2:100])  # convert to data.frame so mapply loops across columns; a data.frame is simply a list of vectors of equal length
d2_99 <- as.data.frame(d2[, 1:99])
ptm3 <- proc.time()
r_test <- mapply(function(x, y) {
  t <- matrix(x > 0, 1000, 1000)  # didn't understand why you were adding 0 in your example
  u <- matrix(x, 1000, 1000) * t(matrix(y, 1000, 1000))
  t * u
}, x = d1_99, y = d2_99, SIMPLIFY = FALSE)
proc.time() - ptm3
#user system elapsed
#0.91 0.83 1.75
class(r_test)
#[1] "list"
length(r_test)
#[1] 99
#test for equality
all.equal(r[[2]], r_test[[1]])
#[1] TRUE
all.equal(r[[100]], r_test[[99]])
#[1] TRUE

Related

Efficient algorithm to turn matrix subdiagonal to columns r

I have a non-square matrix and need to do some calculations on its subdiagonals. I figured out that the best way is to turn the subdiagonals into columns/rows and use functions like cumprod. Right now I use a for loop and exdiag, defined as below:
exdiag <- function(mat, off=0) {mat[row(mat) == col(mat)+off]}
However, it does not seem to be very efficient. Do you know of any other algorithm to achieve that kind of result?
A little example to show what I am doing:
exdiag <- function(mat, off = 0) {mat[row(mat) == col(mat) + off]}
mat <- matrix(1:72, nrow = 12, ncol = 6)
newmat <- matrix(nrow = 11, ncol = 6)
for (i in 1:11) {
  newmat[i, ] <- c(cumprod(exdiag(mat, i)), rep(0, max(6 - 12 + i, 0)))
}
Best regards,
Artur
The fastest, but by far the most cryptic, solution to get all possible diagonals from a non-square matrix is to treat your matrix as a vector and simply construct an id vector for selection. At the end you can transform the result back to a matrix if you want.
The following function does that:
exdiag <- function(mat) {
  NR <- nrow(mat)
  NC <- ncol(mat)
  smalldim <- min(NC, NR)

  if (NC > NR) {
    id <- seq_len(NR) +
      seq.int(0, NR - 1) * NR +
      rep(seq.int(1, NC - 1), each = NR) * NR
  } else if (NC < NR) {
    id <- seq_len(NC) +
      seq.int(0, NC - 1) * NR +
      rep(seq.int(1, NR - 1), each = NC)
  } else {
    return(diag(mat))
  }

  out <- matrix(mat[id], nrow = smalldim)
  id <- (ncol(out) + 1 - row(out)) - col(out) < 0
  out[id] <- NA
  return(out)
}
Keep in mind you have to take into account how your matrix is formed.
In both cases I follow the same logic: first construct a sequence indicating positions along the smallest dimension, then add 0, 1, 2, ... times the number of rows to that sequence. This creates the first diagonal. After that, you simply add a sequence that shifts the entire previous sequence by 1 (either down or to the right) until you reach the end of the matrix; to shift right, that shifting sequence has to be multiplied by the number of rows.
In the end you can use these indices to select the correct positions from mat, and return all of that as a matrix. Due to the vectorized nature of this code, you have to check that the last subdiagonals are correct: they contain fewer elements than the first ones, so the values that are not part of a given subdiagonal have to be replaced by NA. Here too you can simply use an indexing trick.
You can use it as follows, with amatrix the example matrix from the question:
amatrix <- matrix(1:72, ncol = 6)
> diag1 <- exdiag(amatrix)
> diag2 <- exdiag(t(amatrix))
> identical(diag1, diag2)
[1] TRUE
To arrive at your result:
res <- apply(diag1, 2, cumprod)
res[is.na(res)] <- 0
t(res)
You can also reuse the built-in diag() function, by dropping the first off rows before taking the diagonal.
exdiag <- function(mat, off = 0) {mat[row(mat) == col(mat) + off]}
exdiag2 <- function(mat, off) {diag(mat[-1:-off, ])}
Speed test:
mat <- diag(10, 10000, 10000)
off <- 4
> system.time(exdiag(mat, 4))
   user  system elapsed
  7.083   2.973  10.054
> system.time(exdiag2(mat, 4))
   user  system elapsed
  5.370   0.155   5.524
> system.time(diag(mat))
   user  system elapsed
  0.002   0.000   0.002
It looks like the subsetting from the matrix takes a lot of time, but it still performs better than your implementation. Maybe there are other subsetting approaches that outperform my solution. :)
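For comparison, a compact base R alternative (not necessarily the fastest, but short) is to split the matrix by the row-column offset, which collects each diagonal into its own list element; the positive offsets are the subdiagonals:
mat <- matrix(1:72, nrow = 12, ncol = 6)
# row(mat) - col(mat) is 0 on the main diagonal and k on the k-th subdiagonal
diags <- split(mat, row(mat) - col(mat))
subdiags <- diags[as.integer(names(diags)) > 0]
cumprods <- lapply(subdiags, cumprod)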

Quickly split a large vector into chunks in R

My question is extremely closely related to this one:
Split a vector into chunks in R
I'm trying to split a large vector into chunks of known size and it's slow. A solution for vectors with even remainders is here:
A quick solution when a factor exists is here:
Split dataframe into equal parts based on length of the dataframe
I would like to handle the case of no (large) factor existing, as I would like fairly large chunks.
My example, using a vector much smaller than the one in my real-life application:
d <- 1:6510321
# Sloooow
chunks <- split(d, ceiling(seq_along(d)/2000))
Using llply from the plyr package I was able to reduce the time.
chunks <- function(d, n) {
  chunks <- split(d, ceiling(seq_along(d) / n))
  names(chunks) <- NULL
  return(chunks)
}

require(plyr)
plyrChunks <- function(d, n) {
  is <- seq(from = 1, to = length(d), by = ceiling(n))
  if (tail(is, 1) != length(d)) {
    is <- c(is, length(d))
  }
  chunks <- llply(head(seq_along(is), -1),
                  function(i) {
                    start <- is[i]
                    end <- is[i + 1] - 1
                    d[start:end]
                  })
  lc <- length(chunks)
  td <- tail(d, 1)
  chunks[[lc]] <- c(chunks[[lc]], td)
  return(chunks)
}
# testing
d <- 1:6510321
n <- 2000
system.time(chks <- chunks(d,n))
# user system elapsed
# 5.472 0.000 5.472
system.time(plyrChks <- plyrChunks(d, n))
# user system elapsed
# 0.068 0.000 0.065
identical(chks, plyrChks)
# TRUE
You can speed this up even more using the .parallel parameter of llply, or add a progress bar with the .progress parameter.
A speed improvement from the parallel package:
chunks <- parallel::splitIndices(6510321, ncl = ceiling(6510321/2000))
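Another simple base R sketch that avoids split()'s factor conversion entirely is to index each chunk directly from a vector of start positions (chunkStarts is just an illustrative name):
chunkStarts <- function(d, n) {
  starts <- seq.int(1, length(d), by = n)
  # each chunk runs from its start position to just before the next start
  lapply(starts, function(s) d[s:min(s + n - 1, length(d))])
}
chks3 <- chunkStarts(d, n)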

Efficiently building a large (200 MM line) dataframe

I am attempting to build a large (~200 MM line) dataframe in R. Each entry in the dataframe will consist of approximately 10 digits (e.g. 1234.12345). The code is designed to walk through a list, subtracting the item in position [i] from every item after [i], but not from the items before [i] (if I were putting the output into a matrix, it would be a triangular matrix). The code is simple and works fine on smaller lists, but I am wondering if there is a faster or more efficient way to do this? I assume the first part of the answer is going to entail "don't use a nested for loop," but I am not sure what the alternatives are.
The idea is that this will be an "edge list" for a social network analysis graph. Once I have 'outlist' I will reduce the number of edges based on some criteria (<, >, ==) so the final list (and graph) won't be quite so ponderous.
# Fake data of the same approximate dimensions as the real data
dlist <- sample(1:20, 20, replace = FALSE)
# purge the output list before running the loop
rm(outlist)
outlist <- data.frame()
for (i in 1:(length(dlist) - 1)) {
  for (j in (i + 1):length(dlist)) {
    outlist <- rbind(outlist, c(dlist[i], dlist[j], dlist[j] - dlist[i]))
  }
}
IIUC your final dataset will be ~200 million rows by 3 columns, all of type numeric, which takes a total space of:
200e6 (rows) * 3 (cols) * 8 (bytes) / (1024 ^ 3)
# ~ 4.5GB
That's quite a lot of data, so it's essential to avoid copies wherever possible.
Here's a method that uses the data.table package's unexported (internal) vecseq function (written in C, so it's fast and memory efficient) and makes use of its assignment-by-reference operator := to avoid copies.
fn1 <- function(x) {
  require(data.table)  ## 1.9.2
  lx = length(x)
  vx = as.integer(lx * (lx - 1) / 2)
  # R v3.1.0 doesn't copy on doing list(.) - so it should be even faster there
  ans = setDT(list(v1 = rep.int(head(x, -1L), (lx - 1L):1L),
                   v2 = x[data.table:::vecseq(2:lx, (lx - 1L):1, vx)]))
  ans[, v3 := v2 - v1]
}
Benchmarking:
I'll benchmark with functions from other answers on your data dimensions. Note that my benchmark is on R v3.0.2, but fn1() should give better performance (both speed and memory) on R v3.1.0, because list(.) doesn't result in a copy anymore.
fn2 <- function(x) {
  diffmat <- outer(x, x, "-")
  ss <- which(upper.tri(diffmat), arr.ind = TRUE)
  data.frame(v1 = x[ss[, 1]], v2 = x[ss[, 2]], v3 = diffmat[ss])
}

fn3 <- function(x) {
  idx <- combn(seq_along(x), 2)
  out2 <- data.frame(v1 = x[idx[1, ]], v2 = x[idx[2, ]])
  out2$v3 <- out2$v2 - out2$v1
  out2
}
set.seed(45L)
x = runif(20e3L)
system.time(ans1 <- fn1(x)) ## 18 seconds + ~8GB (peak) memory usage
system.time(ans2 <- fn2(x)) ## 158 seconds + ~19GB (peak) memory usage
system.time(ans3 <- fn3(x)) ## 809 seconds + ~12GB (peak) memory usage
Note that fn2(), due to its use of outer, requires quite a lot of memory (peak memory usage was >= 19GB) and is slower than fn1(). fn3() is just very, very slow (due to combn, and unnecessary copies).
Another way to create that data is
#Sample Data
N <- 20
set.seed(15) #for reproducibility
dlist <- sample(1:N,N, replace=FALSE)
we could do
idx <- combn(1:N,2)
out2 <- data.frame(i=dlist[idx[1, ]], j=dlist[idx[2, ]])
out2$dist <- out2$j-out2$i
This uses combn to create all pairs of indices in the data set rather than doing loops. This allows us to build the data.frame all at once rather than adding a row at a time.
We compare that to
out1 <- data.frame()
for (i in 1:(length(dlist) - 1)) {
  for (j in (i + 1):length(dlist)) {
    out1 <- rbind(out1, c(dlist[i], dlist[j], dlist[j] - dlist[i]))
  }
}
we see that
all(out1==out2)
# [1] TRUE
Plus, if we wrap the two approaches in functions loops() and combdata() and compare them with microbenchmark, we see that
microbenchmark(loops(), combdata())
# Unit: microseconds
# expr min lq median uq max neval
# loops() 30888.403 32230.107 33764.7170 34821.2850 82891.166 100
# combdata() 684.316 800.384 873.5015 940.9215 4285.627 100
The method that doesn't use loops is much faster.
You can always start with a triangular matrix and then make your dataframe directly from that:
vec <- 1:10
diffmat <- outer(vec, vec, "-")
ss <- which(upper.tri(diffmat), arr.ind = TRUE)
data.frame(one = vec[ss[, 1]],
           two = vec[ss[, 2]],
           diff = diffmat[ss])
You need to preallocate your output list; this will significantly increase the speed of your code. By preallocating I mean creating an output structure that already has the desired size, but is filled with, for example, NAs.
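As a minimal sketch of what preallocation could look like here (the loop structure is kept; only the growing data.frame is replaced by a matrix of known size that is filled in place):
n <- length(dlist)
# one row per pair, filled in as the loop runs
out <- matrix(NA_real_, nrow = n * (n - 1) / 2, ncol = 3)
k <- 1
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    out[k, ] <- c(dlist[i], dlist[j], dlist[j] - dlist[i])
    k <- k + 1
  }
}
outlist <- as.data.frame(out)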

How to improve performance of this linear interpolation

For a given column in a dataframe, I want to construct a new vector in which each point is the average of the points on either side of it. For the last observation, however, it will instead be the second-to-last value, and for the first observation it will be the second value. I wrote this R code to solve the issue; however, I am calling it repeatedly and it is extremely slow. Can someone give some tips on how to do it more efficiently? Thanks.
x1 <- c(rep('a', 100), rep('b', 100), rep('c', 100))
x2 <- rnorm(300)
x <- data.frame(x1, x2)
names(x) <- c('col1', 'data1')

a.linear.interpolation <- function(x) {
  require(zoo)
  require(data.table)
  a.dattab <- data.table(x)
  setkey(a.dattab, col1)
  # replace any NA values using LOCF / NOCB
  a.dattab[, data1 := na.locf(data1, na.rm = FALSE), by = list(col1)]
  a.dattab[, data1 := na.locf(data1, na.rm = FALSE, fromLast = TRUE), by = list(col1)]
  # Add a within-group sequence number and a size-of-group field to facilitate
  # row-by-row processing
  a.dattab[, grpseq := seq_len(.N), by = list(col1)]
  a.dattab[, grpseq_max := .N, by = list(col1)]
  # convert back to data.frame
  # data.frame seems faster than data.table for this row-by-row type of processing
  a.df <- data.frame(a.dattab)
  new.col <- vector(length = nrow(a.df))
  for (i in seq(nrow(a.df))) {
    if (a.df[i, "grpseq"] == 1) {
      new.col[i] <- a.df[i + 1, "data1"]
    } else if (a.df[i, "grpseq"] == a.df[i, "grpseq_max"]) {
      new.col[i] <- a.df[i - 1, "data1"]
    } else {
      new.col[i] <- (a.df[i - 1, "data1"] + a.df[i + 1, "data1"]) / 2
    }
  }
  return(new.col)
}
Apart from using rollmean from zoo, the base R filter() function can do this sort of thing as well. E.g.:
linint <- function(vec) {
  c(vec[2], filter(vec, c(0.5, 0, 0.5))[-c(1, length(vec))], vec[length(vec) - 1])
}
x <- c(1, 3, 6, 10, 1)
linint(x)
# [1] 3.0 3.5 6.5 3.5 10.0
And it's pretty quick, chewing through 10M cases in less than a second:
x <- rnorm(1e7)
system.time(linint(x))
#user system elapsed
#0.57 0.18 0.75
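If, as in the original function, the averaging has to stay within the col1 groups, one possibility (a sketch reusing the linint() above; the column name interp is just illustrative) is to apply it per group with ave():
# ave() applies linint() to data1 within each col1 group and returns the
# results in the original row order
x$interp <- ave(x$data1, x$col1, FUN = linint)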

R: Tabulations and insertions with data.table

I am trying to take a very large set of records with multiple indices, calculate an aggregate statistic on groups determined by a subset of the indices, and then insert that into every row in the table. The issue here is that these are very large tables - over 10M rows each.
Code for reproducing the data is below.
The basic idea is that there are a set of indices, say ix1, ix2, ix3, ..., ixK. Generally, I am choosing only a couple of them, say ix1 and ix2. Then, I calculate an aggregation of all the rows with matching ix1 and ix2 values (over all combinations that appear), for a column called val. To keep it simple, I'll focus on a sum.
I have tried the following methods:
Via sparse matrices: convert the values to a coordinate list, i.e. (ix1, ix2, val), then create a sparseMatrix - this nicely sums up everything, and then I need only convert back from the sparse matrix representation to the coordinate list. Speed: good, but it is doing more than is necessary and it doesn't generalize to higher dimensions (e.g. ix1, ix2, ix3) or more general functions than a sum.
Use of lapply and split: by creating a new index that is unique for all (ix1, ix2, ...) n-tuples, I can then use split and apply. The bad thing here is that the unique index is converted by split into a factor, and this conversion is terribly time consuming. Try system.time({zz <- as.factor(1:10^7)}).
I'm now trying data.table, via a command like sumDT <- DT[,sum(val),by = c("ix1","ix2")]. However, I don't yet see how I can merge sumDT with DT, other than via something like DT2 <- merge(DT, sumDT, by = c("ix1","ix2"))
Is there a faster method for this data.table join than via the merge operation I've described?
[I've also tried bigsplit from the bigtabulate package, and some other methods. Anything that converts to a factor is pretty much out - as far as I can tell, that conversion process is very slow.]
Code to generate data. Naturally, it's better to try a smaller N to see that something works, but not all methods scale very well for N >> 1000.
library(data.table)
N <- 10^7
set.seed(2011)
ix1 <- 1 + floor(rexp(N, 0.01))
ix2 <- 1 + floor(rexp(N, 0.01))
ix3 <- 1 + floor(rexp(N, 0.01))
val <- runif(N)
DF <- data.frame(ix1 = ix1, ix2 = ix2, ix3 = ix3, val = val)
DF <- DF[order(DF[, 1], DF[, 2], DF[, 3]), ]
DT <- as.data.table(DF)
Well, it's possible you'll find that doing the merge isn't so bad as long as your keys are properly set.
Let's setup the problem again:
N <- 10^6 ## not 10^7 because RAM is tight right now
set.seed(2011)
ix1 <- 1 + floor(rexp(N, 0.01))
ix2 <- 1 + floor(rexp(N, 0.01))
ix3 <- 1 + floor(rexp(N, 0.01))
val <- runif(N)
DT <- data.table(ix1=ix1, ix2=ix2, ix3=ix3, val=val, key=c("ix1", "ix2"))
Now you can calculate your summary stats
info <- DT[, list(summary=sum(val)), by=key(DT)]
And merge the columns "the data.table way", or just with merge
m1 <- DT[info] ## the data.table way
m2 <- merge(DT, info) ## if you're just used to merge
identical(m1, m2)
[1] TRUE
If either of those ways of merging is too slow, you can try a tricky way to build info at the cost of memory:
info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)]
m3 <- transform(DT, summary=info2$summary)
identical(m1, m3)
[1] TRUE
Now let's see the timing:
#######################################################################
## Using data.table[ ... ] or merge
system.time(info <- DT[, list(summary=sum(val)), by=key(DT)])
user system elapsed
0.203 0.024 0.232
system.time(DT[info])
user system elapsed
0.217 0.078 0.296
system.time(merge(DT, info))
user system elapsed
0.981 0.202 1.185
########################################################################
## Now the two parts of the last version done separately:
system.time(info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)])
user system elapsed
0.574 0.040 0.616
system.time(transform(DT, summary=info2$summary))
user system elapsed
0.173 0.093 0.267
Or you can skip the intermediate info table building if the following doesn't seem too inscrutable for your tastes:
system.time(m5 <- DT[ DT[, list(summary=sum(val)), by=key(DT)] ])
user system elapsed
0.424 0.101 0.525
identical(m5, m1)
# [1] TRUE
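With more recent data.table versions you can also skip building info altogether and assign the grouped sum to every row by reference (a sketch of the grouped := idiom):
# adds a summary column to DT in place; the per-group sum is recycled across
# each group's rows, so no merge is needed
DT[, summary := sum(val), by = list(ix1, ix2)]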
