I have a matrix and am looking for an efficient way to replicate it n times (where n is the number of observations in the dataset). For example, if I have a matrix A
A <- matrix(1:15, nrow=3)
then I want an output of the form
rbind(A, A, A, ...) #n times.
Obviously, there are many ways to construct such a large matrix, for example using a for loop or apply or similar functions. However, the call to the "matrix-replication-function" takes place in the very core of my optimization algorithm where it is called tens of thousands of times during one run of my program. Therefore, loops, apply-type of functions and anything similar to that are not efficient enough. (Such a solution would basically mean that a loop over n is performed tens of thousands of times, which is obviously inefficient.) I already tried to use the ordinary rep function, but haven't found a way to arrange the output of rep in a matrix of the desired format.
The solution
do.call("rbind", replicate(n, A, simplify=F))
is also too inefficient because rbind is used too often in this case. (About 30% of the total runtime of my program is then spent performing the rbinds.)
Does anyone know a better solution?
Two more solutions:
The first is a modification of the example in the question
do.call("rbind", rep(list(A), n))
The second involves unrolling the matrix, replicating it, and reassembling it.
matrix(rep(t(A),n), ncol=ncol(A), byrow=TRUE)
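To see why the transpose is needed in the second solution, here is a small sketch on a 2x3 matrix (the sizes are chosen only for illustration). R stores matrices column by column, so flattening t(A) yields the rows of A one after another, which is the order that matrix(..., byrow=TRUE) expects.

```r
A <- matrix(1:6, nrow = 2)      # rows are (1, 3, 5) and (2, 4, 6)
n <- 2

col_major <- as.vector(A)       # 1 2 3 4 5 6: columns of A, one after another
row_major <- as.vector(t(A))    # 1 3 5 2 4 6: rows of A, one after another

# Repeat the row-major contents n times and refill the matrix row by row
stacked <- matrix(rep(t(A), n), ncol = ncol(A), byrow = TRUE)

identical(stacked, rbind(A, A)) # TRUE
```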
Since efficiency is what was requested, benchmarking is necessary (benchmark() comes from the rbenchmark package):
library(rbenchmark)
A <- matrix(1:15, nrow=3)
n <- 10
benchmark(rbind(A, A, A, A, A, A, A, A, A, A),
do.call("rbind", replicate(n, A, simplify=FALSE)),
do.call("rbind", rep(list(A), n)),
apply(A, 2, rep, n),
matrix(rep(t(A),n), ncol=ncol(A), byrow=TRUE),
order="relative", replications=100000)
which gives:
test replications elapsed
1 rbind(A, A, A, A, A, A, A, A, A, A) 100000 0.91
3 do.call("rbind", rep(list(A), n)) 100000 1.42
5 matrix(rep(t(A), n), ncol = ncol(A), byrow = TRUE) 100000 2.20
2 do.call("rbind", replicate(n, A, simplify = FALSE)) 100000 3.03
4 apply(A, 2, rep, n) 100000 7.75
relative user.self sys.self user.child sys.child
1 1.000 0.91 0 NA NA
3 1.560 1.42 0 NA NA
5 2.418 2.19 0 NA NA
2 3.330 3.03 0 NA NA
4 8.516 7.73 0 NA NA
So the fastest is the raw rbind call, but that assumes n is fixed and known ahead of time. If n is not fixed, then the fastest is do.call("rbind", rep(list(A), n)). These timings were for a 3x5 matrix and 10 replications. Different sized matrices might give different orderings.
For n=600, the results are in a different order (leaving out the explicit rbind version):
A <- matrix(1:15, nrow=3)
n <- 600
benchmark(do.call("rbind", replicate(n, A, simplify=FALSE)),
do.call("rbind", rep(list(A), n)),
apply(A, 2, rep, n),
matrix(rep(t(A),n), ncol=ncol(A), byrow=TRUE),
order="relative", replications=10000)
test replications elapsed
4 matrix(rep(t(A), n), ncol = ncol(A), byrow = TRUE) 10000 1.74
3 apply(A, 2, rep, n) 10000 2.57
2 do.call("rbind", rep(list(A), n)) 10000 2.79
1 do.call("rbind", replicate(n, A, simplify = FALSE)) 10000 6.68
relative user.self sys.self user.child sys.child
4 1.000 1.75 0 NA NA
3 1.477 2.54 0 NA NA
2 1.603 2.79 0 NA NA
1 3.839 6.65 0 NA NA
If you include the explicit rbind version, it is slightly faster than the do.call("rbind", rep(list(A), n)) version, but not by much, and slower than either the apply or matrix versions. So a generalization to arbitrary n does not require a loss of speed in this case.
Probably this is more efficient:
apply(A, 2, rep, n)
There's also this way:
rep(1, n) %x% A
You can use indexing
A[rep(seq(nrow(A)), n), ]
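Before timing the alternatives, it may be worth confirming that they all build the same matrix. The following is a minimal sketch of such a check; the names ref, candidates and same are only illustrative. Note that the Kronecker product returns doubles rather than integers, so the comparison is done on shape and values rather than with identical().

```r
A <- matrix(1:15, nrow = 3)
n <- 4                                    # small n, just for checking
ref <- do.call("rbind", rep(list(A), n))  # reference result

candidates <- list(
  unroll    = matrix(rep(t(A), n), ncol = ncol(A), byrow = TRUE),
  apply     = apply(A, 2, rep, n),
  kronecker = rep(1, n) %x% A,            # storage mode is double here
  indexing  = A[rep(seq(nrow(A)), n), ]
)

# Same dimensions and same values as the reference?
same <- vapply(candidates,
               function(m) identical(dim(m), dim(ref)) && all(m == ref),
               logical(1))
same
```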
I came here for the same reason as the original poster and ultimately updated @Brian Diggs's comparison to include all of the other posted answers. Hopefully I did this correctly:
A <- matrix(1:15, nrow=3)
n <- 600
benchmark(do.call("rbind", replicate(n, A, simplify=FALSE)),
do.call("rbind", rep(list(A), n)),
apply(A, 2, rep, n),
matrix(rep(t(A),n), ncol=ncol(A), byrow=TRUE),
A[rep(seq(nrow(A)), n), ],
rep(1, n) %x% A,
apply(A, 2, rep, n),
matrix(rep(as.integer(t(A)), n), nrow = nrow(A) * n, byrow = TRUE),
order="relative", replications=10000)
# test replications elapsed relative user.self sys.self user.child sys.child
#5 A[rep(seq(nrow(A)), n), ] 10000 0.32 1.000 0.33 0.00 NA NA
#8 matrix(rep(as.integer(t(A)), n), nrow = nrow(A) * n, byrow = TRUE) 10000 0.36 1.125 0.35 0.02 NA NA
#4 matrix(rep(t(A), n), ncol = ncol(A), byrow = TRUE) 10000 0.38 1.188 0.37 0.00 NA NA
#3 apply(A, 2, rep, n) 10000 0.59 1.844 0.56 0.03 NA NA
#7 apply(A, 2, rep, n) 10000 0.61 1.906 0.58 0.03 NA NA
#6 rep(1, n) %x% A 10000 1.44 4.500 1.42 0.02 NA NA
#2 do.call("rbind", rep(list(A), n)) 10000 1.67 5.219 1.67 0.00 NA NA
#1 do.call("rbind", replicate(n, A, simplify = FALSE)) 10000 5.03 15.719 5.02 0.01 NA NA
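What about flattening the matrix into a vector, replicating the contents, and creating a new matrix with the updated number of rows? The matrix has to be transposed before flattening so that the values come out in row order, which is what byrow=TRUE expects:
A <- matrix(...)
n = 2 # just a test
a = as.integer(t(A)) # contents of A in row order
multi.a = rep(a,n)
multi.A = matrix(multi.a,nrow=nrow(A)*n,byrow=TRUE)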
I would like to speed up my calculations and obtain the results without using the loop in function m. Reproducible example:
N <- 2500
n <- 500
r <- replicate(1000, sample(N, n))
m <- function(r, N) {
  ic <- matrix(0, nrow = N, ncol = N)
  for (i in 1:ncol(r)) {
    p <- r[, i]
    # every pair of units drawn together in sample i gets one more count
    ic[p, p] <- ic[p, p] + 1
  }
  ic
}
system.time(ic <- m(r, N))
# user system elapsed
# 6.25 0.51 6.76
# [1] TRUE
Each iteration of the for loop operates on a matrix rather than a vector, so how could this be vectorized?
@joel.wilson The purpose of this function is to calculate pairwise frequencies of elements, so that we can afterwards estimate pairwise inclusion probabilities.
Thanks to @Khashaa and @alexis_laz. Benchmarks:
> require(rbenchmark)
> benchmark(m(r, N),
+ m1(r, N),
+ mvec(r, N),
+ alexis(r, N),
+ replications = 10, order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
4 alexis(r, N) 10 4.73 1.000 4.63 0.11 NA NA
3 mvec(r, N) 10 5.36 1.133 5.18 0.18 NA NA
2 m1(r, N) 10 5.48 1.159 5.29 0.19 NA NA
1 m(r, N) 10 61.41 12.983 60.43 0.90 NA NA
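This should be significantly faster, as it avoids the repeated double-index updates of ic[p, p]. Each sample is recorded as a column of an N x ncol(r) indicator matrix, and the pairwise counts then come from a single cross-product:
m1 <- function(r, N) {
  ic <- matrix(0, nrow = N, ncol=ncol(r))
  for (i in 1:ncol(r)) {
    p <- r[, i]
    ic[, i][p] <- 1
  }
  # entry (a, b) counts the samples containing both a and b
  tcrossprod(ic)
}
system.time(ic1 <- m1(r, N))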
# user system elapsed
# 0.53 0.01 0.55
all.equal(ic, ic1)
# [1] TRUE
Simple "counting/adding" operations can almost always be vectorized:
mvec <- function(r, N) {
  ic <- matrix(0, nrow = N, ncol=ncol(r))
  # column index of every element of r, in the same order as as.vector(r)
  i <- rep(1:ncol(r), each=nrow(r))
  ic[cbind(as.vector(r), i)] <- 1
  tcrossprod(ic)
}
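The indicator-matrix idea used in these answers can be checked on a small case. The sketch below uses tiny, arbitrary sizes and hypothetical names (ic_loop, ind, ic_vec); it assumes, as in the question, that each column of r holds distinct units, and compares the explicit loop with the cross-product of the indicator matrix.

```r
set.seed(1)
N <- 6
n <- 3
r <- replicate(4, sample(N, n))   # each column holds n distinct units

# Loop version, as in the question
ic_loop <- matrix(0, nrow = N, ncol = N)
for (i in seq_len(ncol(r))) {
  p <- r[, i]
  ic_loop[p, p] <- ic_loop[p, p] + 1
}

# Indicator matrix: ind[u, i] is 1 when unit u appears in sample i
ind <- matrix(0, nrow = N, ncol = ncol(r))
ind[cbind(as.vector(r), rep(seq_len(ncol(r)), each = nrow(r)))] <- 1

# ind %*% t(ind): entry (a, b) counts the samples containing both a and b
ic_vec <- tcrossprod(ind)

all.equal(ic_loop, ic_vec)
```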
Let me delve right in. Imagine you have data that looks like this:
df <- data.frame(one = c(1, 1, NA, 13),
two = c(2, NA,10, 14),
three = c(NA,NA,11, NA),
four = c(4, 9, 12, NA))
This gives us:
# one two three four
# 1 1 2 NA 4
# 2 1 NA NA 9
# 3 NA 10 11 12
# 4 13 14 NA NA
Each row holds measurements in weeks 1, 2, 3 and 4 respectively. Suppose the numbers represent some measure accumulated since the last time a measurement happened. For example, in row 1, the "4" in column "four" represents a cumulative value for weeks 3 and 4.
Now I want to "even out" these numbers (feel free to correct my terminology here) by spreading each measurement evenly over the preceding weeks in which no measurement took place. For instance, row 1 should read
1 2 2 2
since the 4 in the original data represents the cumulative value of 2 weeks (week "three" and "four"), and 4/2 is 2.
The final end result should look like this:
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
I struggle a bit with how best to approach this. One candidate solution would be to get the indices of all missing values, then count the lengths of the runs (NAs occurring multiple times in a row), and use that to fill in the values somehow. However, my real data is large, and I think such a strategy might be time consuming. Is there an easier and more efficient way?
A base R solution would be to first identify the indices that need to be replaced, then determine groupings of those indices, finally assigning grouped values with the ave function:
clean <- function(x) {
  # positions that are NA or that directly follow an NA
  to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
  # a new group starts after each observed value
  groups <- cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))
  x[to.rep] <- ave(x[to.rep], groups, FUN=function(y) {
    rep(tail(y, 1) / length(y), length(y))
  })
  return(x)
}
t(apply(df, 1, clean))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
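To see what to.rep and groups hold inside clean, it can help to run the same steps on a single row. This is only an illustrative walk-through on a made-up vector, not part of the solution itself.

```r
x <- c(NA, 10, NA, 12)

# Positions that are NA, or that directly follow an NA
to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
# A new group starts right after each observed (non-NA) value
groups <- cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))

to.rep   # 1 2 3 4
groups   # 1 1 2 2

# Within each group, the last value is divided evenly over the group
x[to.rep] <- ave(x[to.rep], groups,
                 FUN = function(y) rep(tail(y, 1) / length(y), length(y)))
x        # 5 5 6 6
```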
If efficiency is important (your question implies it is), then an Rcpp solution could be a good option:
library(Rcpp)
cppFunction(
"NumericVector cleanRcpp(NumericVector x) {
  const int n = x.size();
  NumericVector y(x);
  int consecNA = 0;
  for (int i=0; i < n; ++i) {
    if (R_IsNA(x[i])) {
      ++consecNA;  // extend the current run of NAs
    } else if (consecNA > 0) {
      // spread the value evenly over the NA run and this position
      const double replacement = x[i] / (consecNA + 1);
      for (int j=i-consecNA; j <= i; ++j) {
        y[j] = replacement;
      }
      consecNA = 0;
    } else {
      consecNA = 0;
    }
  }
  return y;
}")
t(apply(df, 1, cleanRcpp))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
We can compare performance on a larger instance (10000 x 100 matrix):
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
all.equal(apply(mat, 1, clean), apply(mat, 1, cleanRcpp))
# [1] TRUE
system.time(apply(mat, 1, clean))
# user system elapsed
# 4.918 0.035 4.992
system.time(apply(mat, 1, cleanRcpp))
# user system elapsed
# 0.093 0.016 0.120
In this case the Rcpp solution provides roughly a 40x speedup compared to the base R implementation.
Here's a base R solution that's nearly as fast as josilber's Rcpp function:
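spread_left <- function(df) {
    nc <- ncol(df)
    # append a -Inf sentinel column, then read the data row by row in reverse
    x <- rev(as.vector(t(as.matrix(cbind(df, -Inf)))))
    ii <- cumsum(!is.na(x))
    f <- tabulate(ii)
    v <- x[!duplicated(ii)]
    xx <- v[ii]/f[ii]
    xx[xx == -Inf] <- NA
    m <- matrix(rev(xx), ncol=nc+1, byrow=TRUE)[,seq_len(nc)]
    # assumed final step: return a data.frame carrying df's column names
    setNames(as.data.frame(m), names(df))
}
spread_left(df)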
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
It manages to be relatively fast by vectorizing everything and completely avoiding time-expensive calls to apply(). (The downside is that it's also relatively obfuscated; to see how it works, do debug(spread_left) and then apply it to the small data.frame df in the OP.)
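As an alternative to stepping through with the debugger, the intermediate vectors can be traced by hand on a single row. The sketch below mirrors the steps of spread_left for one made-up row; the variable names row and result are only illustrative.

```r
row <- c(1, NA, NA, 9)

x  <- rev(c(row, -Inf))     # -Inf  9 NA NA  1   (sentinel marks the row end)
ii <- cumsum(!is.na(x))     #    1  2  2  2  3   (run id for each position)
f  <- tabulate(ii)          #    1  3  1         (length of each run)
v  <- x[!duplicated(ii)]    # -Inf  9  1         (value that starts each run)
xx <- v[ii] / f[ii]         # -Inf  3  3  3  1
xx[xx == -Inf] <- NA        # runs started by the sentinel become NA

# Undo the reversal and drop the sentinel position
result <- rev(xx)[seq_along(row)]
result                      # 1 3 3 3
```

Trailing NAs in a row end up in the same run as the -Inf sentinel, which is why they come back as NA rather than being filled.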
Here are benchmarks for all currently posted solutions:
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
df <- as.data.frame(mat)
## First confirm that it produces the same results
identical(spread_left(df), as.data.frame(t(apply(mat, 1, clean))))
# [1] TRUE
## Then compare its speed
benchmark(josilberR = t(apply(mat, 1, clean)),
josilberRcpp = t(apply(mat, 1, cleanRcpp)),
Josh = spread_left(df),
Henrik = t(apply(df, 1, fn)),
replications = 10)
# test replications elapsed relative user.self sys.self
# 4 Henrik 10 38.81 25.201 38.74 0.08
# 3 Josh 10 2.07 1.344 1.67 0.41
# 1 josilberR 10 57.42 37.286 57.37 0.05
# 2 josilberRcpp 10 1.54 1.000 1.44 0.11
Another base possibility. I first create a grouping variable (grp), over which the 'spread' is then made with ave.
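fn <- function(x){
  # group each NA with the next observed value to its right
  grp <- rev(cumsum(!is.na(rev(x))))
  res <- ave(x, grp, FUN = function(y) sum(y, na.rm = TRUE) / length(y))
  # trailing NAs (group 0) have nothing to spread
  res[grp == 0] <- NA
  res
}
t(apply(df, 1, fn))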
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
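The grouping variable is the key step here, so a short trace on two made-up rows may help. It shows how each NA inherits the group of the next observed value to its right, and why trailing NAs land in group 0.

```r
x <- c(1, NA, NA, 9)

# Count observed values from the right, then flip back
grp <- rev(cumsum(!is.na(rev(x))))
grp   # 2 1 1 1

res <- ave(x, grp, FUN = function(y) sum(y, na.rm = TRUE) / length(y))
res[grp == 0] <- NA
res   # 1 3 3 3

# Trailing NAs have no observed value to their right, so they get group 0
grp_trailing <- rev(cumsum(!is.na(rev(c(13, 14, NA, NA)))))
grp_trailing   # 2 1 0 0
```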
I was thinking that if NAs are relatively rare, it might be better to make the edits by reference. (I'm guessing this is how the Rcpp approach works.) Here's how it can be done in data.table, borrowing @Henrik's function almost verbatim and converting to long format:
require(data.table) # 1.9.5
fill_naseq <- function(df){
# switch to long format
DT <- data.table(id=(1:nrow(df))*ncol(df),df)
mDT <- setkey(melt(DT,id.vars="id"),id)
mDT[,value := as.numeric(value)]
mDT[,badv := is.na(value)]
# subset to rows that need modification
# apply @Henrik's function, more or less
g = ave(!badv,id,FUN=function(x)rev(cumsum(rev(x))))+id
ave(value,g,FUN=function(x){n = length(x); x[n]/n})
# revert to wide format
identical(fill_naseq(df),spread_left(df)) # TRUE
To show the best-case scenario for this approach, I simulated so that NAs are very infrequent:
nr = 1e4
nc = 100
nafreq = 1/1e4
mat <- matrix(sample(
df <- as.data.frame(mat)
# test replications elapsed relative user.self
# 1 F 10 3.82 1.394 3.72
# 2 Josh 10 2.74 1.000 2.70
# I don't have Rcpp installed and so left off josilber's even faster approach
So, it's still slower. However, with data kept in a long format, reshaping wouldn't be necessary:
DT <- data.table(id=(1:nrow(df))*ncol(df),df)
mDT <- setkey(melt(DT,id.vars="id"),id)
mDT[,value := as.numeric(value)]
fill_naseq_long <- function(mDT){
mDT[,badv := is.na(value)]
g = ave(!badv,id,FUN=function(x)rev(cumsum(rev(x))))+id
ave(value,g,FUN=function(x){n = length(x); x[n]/n})
# test replications elapsed relative user.self
# 2 F 10 3.98 8.468 3.81
# 1 F2 10 0.47 1.000 0.45
# 3 Josh 10 2.72 5.787 2.69
Now it's a little faster. And who doesn't like keeping their data in long format? This also has the advantage of working even if we don't have the same number of observations per "id".
I need to merge (left join) two data sets x and y.
merge(x,y, by.x = "z", by.y = "zP", all.x = TRUE)
Not every value of z is present in zP, but each one has a nearest value in zP. So the merge should match each z on its nearest value in zP.
For example
z <- c(0.231, 0.045, 0.632, 0.217, 0.092, ...)
zP <- c(0.010,0.013, 0.017, 0.021, ...)
How can we do this in R?
Based on the information you provided it sounds like you want to keep all the observations in x, and then for each observation in x you want to find the observation in y that minimizes the distance between columns z and zP. If that is what you are looking for then something like this might work
> library(data.table)
> x <- data.table(z = c(0.231, 0.045, 0.632, 0.217, 0.092), k = c("A","A","B","B","B"))
> y <- data.table(zP = c(0.010, 0.813, 0.017, 0.421), m = c(1,2,3,4))
> x
z k
1: 0.231 A
2: 0.045 A
3: 0.632 B
4: 0.217 B
5: 0.092 B
> y
zP m
1: 0.010 1
2: 0.813 2
3: 0.017 3
4: 0.421 4
> find.min.zP <- function(x){
+ y[which.min(abs(x - zP)), zP]
+ }
> x[, zP := find.min.zP(z), by = z]
> x
z k zP
1: 0.231 A 0.421
2: 0.045 A 0.017
3: 0.632 B 0.813
4: 0.217 B 0.017
5: 0.092 B 0.017
> merge(x, y, by="zP", all.x = T, all.y = F)
zP z k m
1: 0.017 0.045 A 3
2: 0.017 0.217 B 3
3: 0.017 0.092 B 3
4: 0.421 0.231 A 4
5: 0.813 0.632 B 2
This is the solution that popped into my head given that I use data.table quite a bit. Please note that using data.table here may or may not be the most elegant way and it may not even be the fastest way (although if x and y are large some solution involving data.table probably will be the fastest). Also note that this is likely an example of using data.table "badly" as I didn't make any effort to optimize for speed. If speed is important I would highly recommend reading the helpful documentation on the github wiki. Hope that helps.
As I suspected, data.table provides a much better way, which Arun pointed out in the comments.
> setkey(x, z)
> setkey(y, zP)
> y[x, roll="nearest"]
zP m k
1: 0.045 3 A
2: 0.092 3 B
3: 0.217 3 B
4: 0.231 4 A
5: 0.632 2 B
The only difference is that the z column is now named zP and the original zP column is gone. If preserving that column is important you can always copy the zP column in y to a new column named z and join on that.
> y[, z := zP]
> setkey(x, z)
> setkey(y, z)
> y[x, roll='nearest']
zP m z k
1: 0.017 3 0.045 A
2: 0.017 3 0.092 B
3: 0.017 3 0.217 B
4: 0.421 4 0.231 A
5: 0.813 2 0.632 B
This is slightly less typing, but the real improvement is in compute times with large datasets.
> x <- data.table(z = runif(100000, 0, 100), k = sample(LETTERS, 100000, replace = T))
> y <- data.table(zP = runif(50000, 0, 100), m = sample(letters, 50000, replace = T))
> start <- proc.time()
> x[, zP := find.min.zP(z), by = z]
> slow <- merge(x, y, by="zP", all.x = T, all.y = F)
> proc.time() - start
user system elapsed
104.849 0.072 106.432
> x[, zP := NULL] # Drop the zP column we added to x doing the merge the slow way
> start <- proc.time()
> y[, z := zP]
> setkey(x, z)
> setkey(y, z)
> fast <- y[x, roll='nearest']
> proc.time() - start
user system elapsed
0.046 0.000 0.045
# Reorder the rows and columns so that we can compare the two data tables
> setkey(slow, z)
> setcolorder(slow, c("z", "zP", "k", "m"))
> setcolorder(fast, c("z", "zP", "k", "m"))
> all.equal(slow, fast)
Notice that the faster method is 2,365 times faster! I would expect the time gains to be even more dramatic for a data set with more than 100,000 observations (which is relatively small these days). This is why reading the data.table documentation is worthwhile if you are working with large data sets. You can often achieve very large speedups by using the built-in methods, but you won't know that they're there unless you look.
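For readers without data.table, the nearest-value lookup itself can also be sketched in base R. This is not how data.table implements roll="nearest"; it is a separate illustration using findInterval, and the helper name nearest_index is hypothetical. It sorts zP, finds the neighbours on either side of each z, and keeps the closer one (ties go to the lower neighbour).

```r
nearest_index <- function(z, zP) {
  ord <- order(zP)
  s   <- zP[ord]                       # sorted lookup values
  lo  <- findInterval(z, s)            # index of the largest s <= z (0 if none)
  lo_c <- pmax(lo, 1L)                 # clamp to a valid lower neighbour
  hi_c <- pmin(lo + 1L, length(s))     # clamp to a valid upper neighbour
  pick_hi <- abs(s[hi_c] - z) < abs(z - s[lo_c])
  ord[ifelse(pick_hi, hi_c, lo_c)]     # map back to positions in the original zP
}

z  <- c(0.231, 0.045, 0.632, 0.217, 0.092)
zP <- c(0.010, 0.813, 0.017, 0.421)

idx <- nearest_index(z, zP)
zP[idx]   # 0.421 0.017 0.813 0.017 0.017
```

The matched values agree with the zP column produced by find.min.zP on the small example, and idx can be used to pull the corresponding rows of y.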
I want to interlace two vectors of same mode and equal length. Say:
a <- rpois(lambda=3,n=5e5)
b <- rpois(lambda=4,n=5e5)
I would like to interweave or interlace these two vectors, to create a vector that would be equivalently c(a[1],b[1],a[2],b[2],...,a[length(a)],b[length(b)])
My first attempt was this:
sapply(X=rep.int(c(3,4),times=5e5),FUN=rpois,n=1)
but it requires rpois to be called far more times than needed.
My best attempt so far has been to transform it into a matrix and reconvert back into a vector:
d <- c(rbind(rpois(lambda=3,n=5e5),rpois(lambda=4,n=5e5)))
d <- c(rbind(a,b))
Is there a better way to go about doing it? Or is there a function in base R that accomplishes the same thing?
Your rbind method should work well. You could also use
rpois(1e6, c(3,4))
because R will automatically replicate the vector of lambda values to the required length. There's not much difference in speed:
# test replications elapsed relative
# 2 c(rbind(rpois(5e+05, 3), rpois(5e+05, 4))) 100 23.390 1.112168
# 1 rpois(1e+06, c(3, 4)) 100 21.031 1.000000
and elegance is in the eye of the beholder ... of course, the c(rbind(...)) method works in general for constructing alternating vectors, while the other solution is specific to rpois or other functions that replicate their arguments in that way.
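Both ideas can be seen on a tiny deterministic example. The sketch below uses made-up vectors a and b: rbind stacks them as the two rows of a matrix, and c() then reads that matrix column by column, which yields the interleaved order. The last line shows the recycling pattern that rpois applies to a length-2 lambda.

```r
a <- c(1, 3, 5)
b <- c(2, 4, 6)

m <- rbind(a, b)   # 2 x 3 matrix: a in the first row, b in the second
d <- c(m)          # read column by column: a[1], b[1], a[2], b[2], ...
d                  # 1 2 3 4 5 6

# Recycling a length-2 vector to length 6 alternates its values
lambda <- rep_len(c(3, 4), 6)
lambda             # 3 4 3 4 3 4
```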
Some speed tests, incorporating Ben Bolker's answer:
1 c(rbind(rpois(lambda = 3, n = 5e+05), rpois(lambda = 4, n = 5e+05)))
2 c(t(sapply(X = list(3, 4), FUN = rpois, n = 5e+05)))
4 rpois(lambda = c(3, 4), n = 1e+06)
5 rpois(lambda = rep.int(c(3, 4), times = 5e+05), n = 1e+06)
3 sapply(X = rep.int(c(3, 4), times = 5e+05), FUN = rpois, n = 1)
replications elapsed relative user.self sys.self user.child sys.child
1 100 6.14 1.000000 5.93 0.15 NA NA
2 100 7.11 1.157980 7.02 0.02 NA NA
4 100 14.09 2.294788 13.61 0.05 NA NA
5 100 14.24 2.319218 13.73 0.21 NA NA
3 100 700.84 114.143322 683.51 0.50 NA NA