How to optimize R code by avoiding loops?

I am having trouble with some code that takes a huge amount of time to run. Could someone give me some help? Thanks in advance!
all_dist <- c()
ddim  <- dim(b)[1]
ddimi <- ddim - 1
for (k in 1:dim(b)[2]) {
  for (i in seq(1, ddimi, 1)) {
    for (j in seq(i+1, ddim, 1)) {
      ze <- (b[i,k]) - (b[j,k]) * (b[i,k]) - (b[j,k])
      all_dist <- c(all_dist, ze)
    }
  }
}
Note that:
str(b)
num [1:5, 1:30007] -0.000292 -0.001384 -0.001412 -0.002603 -0.002848 ...
- attr(*, "dimnames")=List of 2
 ..$ : NULL
 ..$ : chr [1:30007] "V1" "V2" "V3" "V4" ...

For-loops and growing your vector with c() are slowing you down. It's best to try to take advantage of vectorization, and to use *apply (or map) functions as much as possible. Here's something that does a little of both by iterating over the columns with sapply, creating combinations and computing the products and differences of those combinations:
mat <- sapply(b, function(x) {y <- combn(x, 2); y[1,] - y[2,] * y[1,] - y[2,]})
It should be fast, though perhaps not quite as fast as user10488504's very efficient solution. The syntax is also very tight, and you may find it useful that the output is a matrix, with each column corresponding to a column of b.
Data:
set.seed(12345)
b <- as.data.frame(matrix(runif(5*30007, -.001, -.0003), byrow = T, nrow = 5))
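If you need the result as one flat vector in the same order as the original all_dist, stacking the columns of mat recovers it; note that sapply(b, ...) iterates over columns because b is a data frame here, so a plain matrix should be wrapped in as.data.frame() first (a sketch using mat and b as defined above):
all_dist_vec <- as.vector(mat)   # columns stacked in the same k, then (i, j), order as the loop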

set.seed(0)
b <- matrix(rnorm(5*30007), nrow=5)
all_dist <- c()
ddim  <- dim(b)[1]
ddimi <- ddim - 1
system.time(
  # With for-loop
  for (k in 1:dim(b)[2]) {
    for (i in seq(1, ddimi, 1)) {
      for (j in seq(i+1, ddim, 1)) {
        ze <- (b[i,k]) - (b[j,k]) * (b[i,k]) - (b[j,k])
        all_dist <- c(all_dist, ze)
      }
    }
  }
)
# user system elapsed
# 104.568 3.636 108.206
#Vectorized with matrix indices
system.time({
  K <- 1:dim(b)[2]       # the vector created by "for (k in 1:dim(b)[2])"
  I <- seq(1, ddimi, 1)  # the vector created by "for (i in seq(1,ddimi,1))"
  J <- unlist(lapply(I+1, function(x) seq(x, ddim, 1)))  # the vector created by "for (j in seq(i+1,ddim,1))"
  # All combinations of I and K, as produced by the nested k and i loops
  IK <- as.matrix(expand.grid(I, K))
  # IK rows need to be repeated, as each (i, k) pair is reused in the "for (j in seq(i+1,ddim,1)){" loop
  IK <- IK[rep(seq_len(nrow(IK)), rep((ddim-1):1, length.out = nrow(IK))), ]
  # All combinations of J and K, as produced by the nested k and j loops
  JK <- as.matrix(expand.grid(j = J, k = K))
  # Now all the loop indices are available as vectors and the calculation can be done in one go
  tt <- b[IK] - b[JK] * b[IK] - b[JK]
})
# user system elapsed
# 0.056 0.000 0.097
identical(all_dist, tt)
#[1] TRUE
Since the k index does not interact with the other loops (each column is processed independently), you can partly vectorize by simply dropping the k loop and the column index:
system.time({
  tt <- c()
  for (i in seq(1, ddimi, 1)) {
    for (j in seq(i+1, ddim, 1)) {
      tt <- c(tt, (b[i,]) - (b[j,]) * (b[i,]) - (b[j,]))
    }
  }
  dim(tt) <- c(30007, 10)
  tt <- as.vector(t(tt))
})
# user system elapsed
# 0.017 0.000 0.017
identical(all_dist, tt)
#[1] TRUE
Or you can replace the two inner for loops with index vectors and use an apply call over the columns instead of the k loop:
system.time({
I <- seq(1,ddimi,1)
J <- unlist(lapply(I+1, function(x) seq(x,ddim,1)))
I <- I[rep(seq_along(I), rep((ddim-1):1,length.out=length(I)))]
tt <- as.vector(apply(b, 2, function(x) {x[I] - x[J] * x[I] - x[J]}))
})
# user system elapsed
# 0.085 0.000 0.085
identical(all_dist, tt)
#[1] TRUE
Timing of the nice solution from gersht:
system.time({
mat <- as.vector(sapply(as.data.frame(b), function(x) {y <- combn(x, 2); y[1,] - y[2,] * y[1,] - y[2,]}))
})
# user system elapsed
# 1.083 0.000 1.082
identical(all_dist, mat)
#[1] TRUE

Related

Optimization of apply

I have existing code that calculates concordance value for a dataframe/matrix. It's basically the number of rows where all the values are the same over the total number of rows.
...
concordance<-new[complete.cases(new),] #removes rows with NAs
TF<-apply(concordance, 1, function(x) if(length(unique(x))>1) F else T)
#outputs vector of T/F if it is concordant
numF<-table(TF)["TRUE"]#gets number of trues
concValue<-numF/NROW(TF) #true/total
...
Above is what I have now. It runs ok but I was wondering if there was any way to make it faster.
Edit: The dimensions of the object are variable, but the number of columns is typically 2-6 and there are typically 1,000,000+ rows. This is part of a package I'm developing, so the input data is variable.
Because the number of rows is much larger than the number of columns, it makes sense to loop over columns instead, dropping rows that contain more than one distinct value as you go:
propIdentical <- function(Mat){
  nrowInit <- nrow(Mat)
  for (i in 1:(ncol(Mat) - 1)) {
    if (!nrow(Mat)) break  # stop if the matrix has no rows
    else {
      # check which elements of column i and column i+1 are equal:
      equals <- Mat[, i] == Mat[, i+1]
      # remove all other rows from the matrix
      Mat <- Mat[equals, , drop = F]
    }
  }
  return(nrow(Mat)/nrowInit)
}
some tests:
set.seed(1)
# normal case
dat <- matrix(sample(1:10, rep = T, size = 3*10^6), nrow = 10^6)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.053 0.017 0.070
[1] 0.009898
# normal case on my pc for comparison:
system.time(app <- mean(apply(dat, 1, function(x) length(unique(x))) == 1L)); app
user system elapsed
12.176 0.036 12.231
[1] 0.009898
# worst case
dat <- matrix(1L, nrow = 10^6, ncol = 6)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.302 0.044 0.348
[1] 1
# worst case on my pc for comparison
system.time(mean(apply(dat, 1, function(x) length(unique(x))) == 1L))
user system elapsed
12.562 0.001 12.578
# testing drop = F and if(!nrow(Mat)) break
dat <- matrix(1:2, ncol = 2)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0 0 0
[1] 0
Note: if you run this on a data.frame make sure to turn it into a matrix first.
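For a quick cross-check, the same proportion can also be computed in one vectorized line by comparing every column against the first (a sketch, assuming Mat is a matrix with no NAs, as after complete.cases):
propIdentical2 <- function(Mat) mean(rowSums(Mat == Mat[, 1]) == ncol(Mat))
It allocates a full logical comparison matrix, so it is simpler but not necessarily cheaper than the column-dropping loop above.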

Find non-first match in R

I know that match(x,y) returns the first match of all elements of x in y.
Assuming that x may contain the same value multiple time, I am looking for a concise way to match the nth occurrence in x with the nth occurrence in y.
For example:
x <- c(3,4,4,3,2,4)
y <- c(1,2,3,4,1,2,3,4)
my.match(x, y)
## 3,4,8,7,2,NA
Using a for loop: match each element, store the index, and overwrite the matched element of y with NA so it cannot be matched again.
idx <- c()
for (i in x) {
  k <- match(i, y)
  idx <- c(idx, k)
  y[k] <- NA
}
idx
#[1] 3 4 8 7 2 NA
The following function is much faster when the vectors are large because it does not walk through x one element at a time; each pass of the repeat loop matches all still-unmatched elements of x against the still-unused elements of y in bulk:
my.match <- function(x, y){
  fidx <- rep(FALSE, length(x))  # TRUE once an element of x has been assigned a match
  fidy <- rep(FALSE, length(y))  # TRUE once an element of y has been used as a match
  ret <- rep(NA, length(x))
  repeat {
    nidx <- which(!fidx)
    nidy <- which(!fidy)
    idx <- match(x[nidx], y[nidy])
    idy <- match(y[nidy], x[nidx])
    ret[nidx] <- nidy[idx]
    fidx[nidx[unique(idy)]] <- TRUE
    fidy[nidy[unique(idx)]] <- TRUE
    if (sum(!is.na(idx)) == 0 | sum(!is.na(idy)) == 0) {
      break
    }
  }
  return(ret)
}
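Applied to the example from the question, it reproduces the expected result:
x <- c(3,4,4,3,2,4)
y <- c(1,2,3,4,1,2,3,4)
my.match(x, y)
#[1]  3  4  8  7  2 NA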
Benchmarking with the other proposed method yields:
my.match1 <- function(x, y){
  idx <- c()
  for (i in x) {
    k <- match(i, y)
    idx <- c(idx, k)
    y[k] <- NA
  }
  return(idx)
}
x <- sample.int(100,10000,replace=T)
y <- sample.int(100,10000,replace=T)
system.time(my.match1(x,y))
## user system elapsed
## 1.016 0.003 1.020
system.time(my.match(x,y))
## user system elapsed
## 0.049 0.000 0.049

Count how many vertices in a vertex's neighbourhood have an attribute in igraph for R

I have a large graph (several, actually) in igraph—on the order of 100,000 vertices—and each vertex has an attribute which is either true or false. For each vertex, I would like to count how many of the vertices directly connected to it have the attribute. My current solution is the following function, which takes as its argument a graph.
attrcount <- function(g) {
  nb <- neighborhood(g, order = 1)
  return(sapply(nb, function(x) { sum(V(g)$attr[x]) }))
}
This returns a vector of counts which is off by 1 for vertices which have the attribute, but I can adjust this easily.
The problem is that this runs incredibly slowly, and it seems like there should be a fast way to do this, since, for instance, computing the degree of each vertex is practically instantaneous with degree(g).
Am I doing this a stupid way?
As an example, suppose this was our graph.
set.seed(42)
g <- erdos.renyi.game(169081, 178058, type="gnm")
V(g)$att <- as.logical(rbinom(vcount(g), 1, 0.5))
Use get.adjlist to query all adjacent vertices, and then use sapply (or tapply, which might be even faster) on this list to get the counts. It is also worth storing the attribute in a plain vector, because then you don't need to extract it from the graph all the time.
With sapply
system.time({
al <- get.adjlist(g)
att <- V(g)$att
res <- sapply(al, function(x) sum(att[x]))
})
# user system elapsed
# 0.571 0.005 0.576
With tapply
system.time({
  al <- get.adjlist(g)
  alv <- unlist(al)
  alf <- factor(rep(seq_along(al), sapply(al, length)),
                levels = seq_along(al))
  att <- V(g)$att
  res2 <- tapply(att[alv], alf, sum)
  res2[is.na(res2)] <- 0
})
# user system elapsed
# 1.121 0.020 1.144
all(res == res2)
# TRUE
Somewhat of a surprise to me, but the tapply solution is actually slower.
If this is still not enough, then I guess you can still make it faster by writing it in C/C++.
For faster computation, use get.adjacency to pull the adjacency matrix, then multiply the matrix by the attribute vector using %*%:
library(igraph)
set.seed(42)
g <- erdos.renyi.game(1000, 1000, type = "gnm")
V(g)$att <- as.logical(rbinom(vcount(g), 1, 0.5))
system.time({
ma <- get.adjacency(g)
att <- V(g)$att
res1 <- as.numeric(ma %*% att)
})
# user system elapsed
# 0.003 0.000 0.003
Compared to using get.adjlist and sapply:
system.time({
al <- get.adjlist(g)
att <- V(g)$att
res2 <- sapply(al, function(x) sum(att[x]))
})
# user system elapsed
# 9.733 0.243 10.107
After modifying the class of res1, the results vector is identical:
res1 <- as.numeric(res1)
identical(res1, res2)
# [1] TRUE

How to vectorize or otherwise speed-up this looping logic in R?

Long time lurker, first time asker.
I'm trying to calculate 'items in common between 2 sets of items' for a 20M+ items dataset. Sample data looks like this.
#serially numbered items
parents <- rep(1:10000)
#generate rnorm # of children items
numchild <- round(rnorm(10000, mean=30, sd=10))
#fill the parent-child list
parent_child <- list()
for (x in 1:length(parents)) {
  if (numchild[x] > 0) {
    f1 <- sample(1:length(parents), size = numchild[x])
    f2 <- list(parents[f1])
    parent_child <- c(parent_child, f2)
  }
  else {
    parent_child <- c(parent_child, list(x+1))  # if numchild = 0, make up something
  }
}
Here is what I want to do: say parent item #1 has 5 children items-- 1,2,3,4,5 and parent item #2 has 3 children item-- 4,10,22.
I want to compute the length(intersection) of every (parent_i, parent_j) combination. In the above case, it would be 1 common item-- 4.
I am doing this for 10M+ parent items that on average have 15-20 children items with a (0,100) range. So that's a 10M x 10M item-item matrix.
I have a foreach loop that I am testing out on a smaller subset that works but doesn't quite scale for the full dataset (64 core machine with 256GB RAM). With the loop below I am already computing only half of the user-user matrix--> (parent_i, parent_j) same as (parent_j, parent_i) for this purpose.
#small subset
a <- parent_child[1:1000]
outerresults <- foreach (i = 1:(length(a)), .combine = rbind,
                         .packages = c('foreach','doParallel')) %dopar% {
  b <- a[[i]]
  rest <- a[i+1:length(a)]
  foreach (j = 1:(length(rest)), .combine = rbind) %dopar% {
    common <- length(intersect(b, rest[[j]]))
    if (common > 0) {g <- data.frame(u1 = i, u2 = j+1, common)}
  }
}
I've been experimenting with variations on this (using Reduce, storing parent-children in a data frame, etc.) but haven't had much luck.
Is there a way to make this scale?
I reversed the split, so that we have a child-parent relationship
len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len),
unlist(parent_child, use.names=FALSE))
Something like the following constructs a string identifier for each pair of parents sharing a child, across all children:
keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], function(x) {
x <- combn(sort(x), 2)
paste(x[1,], x[2,], sep=".")
})
and tallying
table(unlist(int, use.names=FALSE))
or a little more quickly
xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms)), nms)
Packaged as a function:
f1 <- function(parent_child) {
  len <- sapply(parent_child, length)
  child_parent <- split(rep(seq_along(parent_child), len),
                        unlist(parent_child, use.names=FALSE))
  keep <- sapply(child_parent, length) > 1
  int <- lapply(child_parent[keep], function(x) {
    x <- combn(sort(x), 2)
    paste(x[1,], x[2,], sep=".")
  })
  xx <- unlist(int, use.names=FALSE)
  nms <- unique(xx)
  cnt <- match(xx, nms)
  setNames(tabulate(cnt, length(nms)), nms)
}
Running this on all 10000 parent-child elements:
> system.time(ans1 <- f1(parent_child))
user system elapsed
14.625 0.012 14.668
> head(ans1)
542.1611 542.1832 542.2135 542.2435 542.2527 542.2806
1 1 1 1 1 1
I'm not sure that this would really scale to the size of problem you're talking about, though -- it's polynomial in the number of parents per child.
One possibility for speed-up is to 'memoize' the combinatorial calculation, using the length of the argument as a 'key' and storing the combination as 'value'. This reduces the number of times combn is called to the number of unique lengths of elements of child_parent.
combn1 <- local({
  memo <- new.env(parent = emptyenv())
  function(x) {
    key <- as.character(length(x))
    if (!exists(key, memo))
      memo[[key]] <- t(combn(length(x), 2))
    paste(x[memo[[key]][,1]], x[memo[[key]][,2]], sep=".")
  }
})
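For example, on a small vector of parent ids it returns the pasted pairs directly, and any later call with a vector of the same length reuses the cached index matrix:
combn1(c(2, 5, 9))
# [1] "2.5" "2.9" "5.9"
The memoized helper is then used inside f2: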
f2 <- function(parent_child) {
  len <- sapply(parent_child, length)
  child_parent <- split(rep(seq_along(parent_child), len),
                        unlist(parent_child, use.names=FALSE))
  keep <- sapply(child_parent, length) > 1
  int <- lapply(child_parent[keep], combn1)
  xx <- unlist(int, use.names=FALSE)
  nms <- unique(xx)
  cnt <- match(xx, nms)
  setNames(tabulate(cnt, length(nms)), nms)
}
which helps somewhat
> system.time(ans2 <- f2(parent_child))
user system elapsed
5.337 0.000 5.347
> identical(ans1, ans2)
[1] TRUE
The slow part is now paste:
> Rprof(); ans2 <- f2(parent_child); Rprof(NULL); summaryRprof()
$by.self
self.time self.pct total.time total.pct
"paste" 3.92 73.41 3.92 73.41
"match" 0.74 13.86 0.74 13.86
"unique.default" 0.40 7.49 0.40 7.49
"as.character" 0.08 1.50 0.08 1.50
"unlist" 0.08 1.50 0.08 1.50
"combn" 0.06 1.12 0.06 1.12
"lapply" 0.02 0.37 4.00 74.91
"any" 0.02 0.37 0.02 0.37
"setNames" 0.02 0.37 0.02 0.37
$by.total
...
We can avoid this by encoding the parents with shared child id into a single integer; because of the way floating point numbers are represented in R, this will be exact until about 2^21
encode <- function(x, y, n)
(x - 1) * (n + 1) + y
decode <- function(z, n)
list(x=ceiling(z / (n + 1)), y = z %% (n + 1))
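A quick round trip shows the encoding is invertible (here with n = 10000 parents):
encode(3, 7, 10000)
# [1] 20009
decode(20009, 10000)
# $x
# [1] 3
#
# $y
# [1] 7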
and adjusting our combn1 and f2 functions as
combn2 <- local({
  memo <- new.env(parent = emptyenv())
  function(x, encode_n) {
    key <- as.character(length(x))
    if (!exists(key, memo))
      memo[[key]] <- t(combn(length(x), 2))
    encode(x[memo[[key]][,1]], x[memo[[key]][,2]], encode_n)
  }
})
f3 <- function(parent_child) {
  encode_n <- length(parent_child)
  len <- sapply(parent_child, length)
  child_parent <-
    unname(split(rep(seq_along(parent_child), len),
                 unlist(parent_child, use.names=FALSE)))
  keep <- sapply(child_parent, length) > 1
  int <- lapply(child_parent[keep], combn2, encode_n)
  id <- unlist(int, use.names=FALSE)
  uid <- unique(id)
  n <- tabulate(match(id, uid), length(uid))
  do.call(data.frame, c(decode(uid, encode_n), list(n = n)))
}
leading to
> system.time(f3(parent_child))
user system elapsed
2.140 0.000 2.146
This compares very favorably (note that the timing in the previous line is for 10,000 parent-child relations) with jlhoward's revised answer
> system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
user system elapsed
2.465 0.000 2.468
> system.time(f3(parent_child[1:99]))
user system elapsed
0.016 0.000 0.014
and scales in a much more reasonable way.
For what it's worth, the data generation routine falls into the second circle of Patrick Burns's R Inferno, using the 'copy-and-append' algorithm rather than pre-allocating the space and filling it in. Avoid this by writing the body of the for loop as a function and using lapply. Avoid the need for the complicated conditional in the for loop by fixing the issue beforehand:
numchild <- round(rnorm(10000, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
or by sampling from a distribution (rpois, rbinom) that generates positive integer values. Data generation is then
n_parents <- 10000
numchild <- round(rnorm(n_parents, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
parent_child <- lapply(numchild, sample, x=n_parents)
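Or, drawing the counts from rpois directly (a sketch; with lambda = 30 a zero count is vanishingly rare, so no fix-up is needed):
numchild <- rpois(n_parents, lambda = 30)
parent_child <- lapply(numchild, sample, x = n_parents)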
Here is another approach that is about 10X faster than my previous answer, and 17X faster than the original code (also simpler):
ff <- function(u2, u1, a) {
common <- length(intersect(a,parent_child[[u2]]))
if (common>0) {return(data.frame(u1,u2,common))}
}
gg <- function(u1) {
a <- parent_child[[u1]]
do.call("rbind",lapply((u1+1):100,ff,u1,a))
}
system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
user system elapsed
1.04 0.00 1.03
result.3 is identical to result.2 from the previous answer:
max(abs(result.3-result.2))
[1] 0
Well, a small improvement (I think):
Original code (wrapped in function call):
f <- function(n) {
  # small subset
  a <- parent_child[1:n]
  outerresults <- foreach (i = 1:(length(a)),
                           .combine = rbind,
                           .packages = c('foreach','doParallel')) %dopar% {
    b <- a[[i]]
    rest <- a[i+1:length(a)]
    foreach (j = 1:(length(rest)), .combine = rbind) %dopar% {
      common <- length(intersect(b, rest[[j]]))
      if (common > 0) {g <- data.frame(u1 = i, u2 = j+1, common)}
    }
  }
  return(outerresults)
}
Modified code:
g <- function(n) {
  a <- parent_child[1:n]
  outerresults <- foreach (i = 1:n,
                           .combine = rbind,
                           .packages = c('foreach','doParallel')) %dopar% {
    b <- a[[i]]
    foreach (j = (i):n, .combine = rbind) %dopar% {
      if (i != j) {
        c <- a[[j]]
        common <- length(intersect(b, c))
        if (common > 0) {g <- data.frame(u1 = i, u2 = j, common)}
      }
    }
  }
  return(outerresults)
}
Benchmarks:
system.time(result.old<-f(100))
user system elapsed
17.21 0.00 17.33
system.time(result.new<-g(100))
user system elapsed
10.42 0.00 10.47
The numbering for u2 is a little different because of the different approaches, but both produce the same vector of matches:
max(abs(result.old$common-result.new$common))
[1] 0
I tried this with data.table joins replacing intersect(...) and it was actually much slower(!!)

A^k for matrix multiplication in R?

Suppose A is some square matrix. How can I easily exponentiate this matrix in R?
I tried two ways already: Trial 1 with a for-loop hack, and Trial 2, which is a bit more elegant but still a far cry from the simplicity of A^k.
Trial 1
set.seed(10)
t(matrix(rnorm(16),ncol=4,nrow=4)) -> a
for(i in 1:2){a <- a %*% a}
Trial 2
a <- t(matrix(c(0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0),nrow=4))
i <- diag(4)
(function(n) {if (n<=1) a else (i+a) %*% Recall(n-1)})(10)
If A is diagonalizable, you could use eigenvalue decomposition:
matrix.power <- function(A, n) {  # only works for diagonalizable matrices
  e <- eigen(A)
  M <- e$vectors  # matrix for changing basis
  d <- e$values   # eigenvalues
  return(M %*% diag(d^n) %*% solve(M))
}
When A is not diagonalizable, the matrix M (matrix of eigenvectors) is singular. Thus, using it with A = matrix(c(0,1,0,0),2,2) would give Error in solve.default(M) : system is computationally singular.
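For example (a quick check, using matrix.power as defined above):
A <- matrix(c(0,1,0,0), 2, 2)   # nilpotent, not diagonalizable
matrix.power(A, 2)
# Error in solve.default(M) : system is computationally singular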
The expm package has an %^% operator:
library("sos")
findFn("{matrix power}")
install.packages("expm")
library("expm")
?matpow
set.seed(10);t(matrix(rnorm(16),ncol=4,nrow=4))->a
a%^%8
Although Reduce is more elegant, a for-loop solution is faster and seems to be as fast as expm::%^%
m1 <- matrix(1:9, 3)
m2 <- matrix(1:9, 3)
m3 <- matrix(1:9, 3)
system.time(replicate(1000, Reduce("%*%" , list(m1,m1,m1) ) ) )
# user system elapsed
# 0.026 0.000 0.037
mlist <- list(m1,m2,m3)
m0 <- diag(1, nrow=3,ncol=3)
system.time(replicate(1000, for (i in 1:3 ) {m0 <- m0 %*% m1 } ) )
# user system elapsed
# 0.013 0.000 0.014
library(expm) # and I think this may be imported with pkg:Matrix
system.time(replicate(1000, m0%^%3))
# user system elapsed
#0.011 0.000 0.017
On the other hand the matrix.power solution is much, much slower:
system.time(replicate(1000, matrix.power(m1, 4)) )
user system elapsed
0.677 0.013 1.037
@BenBolker is correct (yet again). The for loop appears linear in time as the exponent rises, whereas the expm::%^% function appears to be even better than log(exponent).
> m0 <- diag(1, nrow=3,ncol=3)
> system.time(replicate(1000, for (i in 1:400 ) {m0 <- m0 %*% m1 } ) )
user system elapsed
0.678 0.037 0.708
> system.time(replicate(1000, m0%^%400))
user system elapsed
0.006 0.000 0.006
Indeed, the expm package does use exponentiation by squaring.
In pure R, this can be done rather efficiently like so:
"%^%" <- function(mat,power){
base = mat
out = diag(nrow(mat))
while(power > 1){
if(power %% 2 == 1){
out = out %*% base
}
base = base %*% base
power = power %/% 2
}
out %*% base
}
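A quick sanity check against repeated multiplication (a sketch):
set.seed(1)
m <- matrix(rnorm(9), 3)
all.equal(m %^% 5, m %*% m %*% m %*% m %*% m)
# [1] TRUE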
Timing this,
m0 <- diag(1, nrow=3,ncol=3)
system.time(replicate(10000, m0%^%4000))#expm's %^% function
user system elapsed
0.31 0.00 0.31
system.time(replicate(10000, m0%^%4000))# my %^% function
user system elapsed
0.28 0.00 0.28
So, as expected, they are the same speed because they use the same algorithm. It looks like the overhead of the looping R code does not make a significant difference.
So, if you don't want to use expm, and need that performance, then you can just use this, if you don't mind looking at imperative code.
A shorter solution with eigenvalue decomposition (valid for symmetric S, where t(vectors) is the inverse of the eigenvector matrix):
"%^%" <- function(S, power)
with(eigen(S), vectors %*% (values^power * t(vectors)))
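For example, on a symmetric matrix it agrees with repeated multiplication:
S <- matrix(c(2, 1, 1, 2), 2)
all.equal(S %^% 3, S %*% S %*% S)
# [1] TRUE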
Simple solution
`%^%` <- function(A, n) {
  A1 <- A
  for (i in seq_len(n-1)) {
    A <- A %*% A1
  }
  return(A)
}
