Yesterday I asked a very simple vectorization question and got some great answers. Today the question is a bit more complex and I'm wondering if R has a function to speed up the runtime of this loop through vectorization.
The loop is
for(j in 1:N) {
A[j,1] = B[max(which(C[j]>=D))];
}
I tried
A[,1] = B[max(which(C>=D))];
and this dropped the runtime considerably ... but the answer was wrong. Is there a "correct" way to do this in R?
EDIT 1:
Thanks for the questions regarding data. I will give sizes of the arrays here:
We are looping over 1:N
A is N x 1
B is length M
C is length N
D is length M
If it matters in terms of speed, in this example, N = 844, M = 2500.
EDIT 2:
And here are some values for a smaller simulated dataset:
B <- c(1.0000000, 1.0000000, 1.0000000, 0.9565217, 0.9565217, 0.9565217, 0.9565217,
0.9565217, 0.9565217, 0.9565217, 0.8967391, 0.8369565, 0.7771739, 0.7173913,
0.7173913, 0.7173913, 0.7173913, 0.7173913, 0.6277174, 0.6277174, 0.5230978,
0.5230978, 0.3923234, 0.3923234, 0.3923234)
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074, 2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587,
0.58848, 0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164,
1.22956, 1.23760, 1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
The result should be:
> A
[,1]
[1,] 1.0000000
[2,] 0.9565217
[3,] 0.9565217
[4,] 0.9565217
[5,] 0.7173913
[6,] 0.7173913
[7,] 0.7173913
[8,] 0.6277174
[9,] 0.5230978
[10,] 0.3923234
If you are eager to get the answer immediately, jump to Conclusion. I offer you a single line of R code with maximum efficiency. For the details and ideas behind it, read through the following.
Code re-shaping and problem re-definition
OP asks for a vectorization of the following loop:
for(j in 1:N) A[j, 1] <- B[max(which(C[j] >= D))]
The first thing I do is transform it into this two-stage version:
## stage 1: index computation (need vectorization)
id <- integer(N); for(j in 1:N) id[j] <- max(which(D <= C[j]))
## stage 2: shuffling (readily vectorized)
A[, 1] <- B[id]
Now we see that only stage 1 needs to be vectorized. This stage essentially does the following:
D[1] D[2] D[3] ... D[M]
C[1]
C[2]
C[3]
.
.
C[N]
For each row j, find the cut-off location k(j) in D, such that D[k(j) + 1], D[k(j) + 2], ..., D[M] > C[j].
Efficient algorithm based on sorting
There is actually an efficient algorithm to do this:
sort C in ascending order, into CC (record ordering index iC, such that C[iC] == CC)
sort D in ascending order, into DD (record ordering index iD, such that D[iD] == DD)
By sorting, we substantially reduce the work complexity.
If the data are unsorted, then we have to explicitly scan all elements D[1], D[2], ..., D[M] in order to decide on k(j). So there is an O(M) cost for each row, thus an O(MN) cost in total.
However, if the data are sorted, then we only need to do the following:
j = 1: search `D[1], D[2], ..., D[k(1)]`, till `D[k(1) + 1] > C[1]`;
j = 2: search `D[k(1) + 1], D[k(1)+2], ..., D[k(2)]`, till `D[k(2) + 1] > C[2]`;
...
For each row, only a partial search is applied, and the overall scanning cost is only O(M + N): the D vector is touched only once, rather than N times as in the naive implementation. As a result, after sorting, the algorithm is roughly N times faster! For large M and N, this is a huge difference. As you said in another comment, this code will be called millions of times, so we definitely want an O(M + N) algorithm instead of an O(MN) one.
Also note that the memory cost of this approach is O(M + N): we only concatenate two vectors, rather than expanding them into an M-by-N matrix. That storage saving is noticeable, too.
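For illustration, the partial search above could look like the following explicit scan (a rough sketch of my own, not the vectorized solution below; it assumes CC and DD are the sorted copies of C and D, and that every CC[j] has at least one DD value below it):

## single-pass scan over sorted CC and DD; k never moves backwards
k <- 0L
idS <- integer(N)
for (j in 1:N) {
  ## advance k while the next DD value is still <= CC[j]
  while (k < M && DD[k + 1L] <= CC[j]) k <- k + 1L
  idS[j] <- k   # last index with DD[idS[j]] <= CC[j]
}
## idS indexes into DD; map back via the ordering indices iC and iD if needed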
In fact, we can take this one step further, converting the comparison problem into a matching problem, which is easier to vectorize in R.
## version 1:
CCDD <- c(CC, DD) ## combine CC and DD
CCDD <- sort(CCDD, decreasing = TRUE) ## sort into descending order
id0 <- M + N - match(CC, CCDD) + 1
id <- id0 - 1:N
To understand why this works, consider an alternative representation:
## version 2:
CCDD <- c(CC, DD) ## combine CC and DD
CCDD <- sort(CCDD) ## sort into ascending order
id0 <- match(CC, CCDD)
id <- id0 - 1:N
Now the following diagram illustrates what the CCDD vector looks like:
CCDD: D[1] D[2] C[1] D[3] C[2] C[3] D[4] D[5] D[6] C[4] .....
id0: 3 5 6 10 .....
id : 2 3 3 6 .....
So CCDD[id] gives D[2], D[3], D[3], D[6], ...: exactly the last element no greater than C[1], C[2], C[3], C[4], .... Therefore, id is exactly the index vector we want!
You may now wonder why I suggest "version 1" rather than "version 2". When there are tied values in CCDD, "version 2" gives the wrong result, because match() takes the first element that matches and ignores later matches. So instead of matching from left to right (in ascending index), we have to match from right to left (in descending index).
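A tiny toy illustration of the tie problem (my own example, not OP's data):

x <- c(5, 7, 7, 9)
match(7, x)                         # 2 -- the first occurrence only
length(x) - match(7, rev(x)) + 1    # 3 -- the last occurrence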
Using OP's data
With this in mind, I start looking at OP's data. Now amazingly, OP's data are already sorted:
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074, 2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587, 0.58848,
0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164, 1.22956, 1.23760,
1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
M <- length(D); N <- length(C)
is.unsorted(C)
# FALSE
is.unsorted(D)
# FALSE
Furthermore, OP has already combined C and D:
all(C %in% D)
# TRUE
It seems that OP and I have the same idea about efficiency in mind. Presumably OP once had a shorter D vector, and the D vector supplied here is really the CCDD vector I mentioned above!
Now, in this situation, things are entirely simple: we just need a single line:
id <- M - match(C, rev(D)) + 1
Note the rev(): OP has sorted D in ascending order, so I need to reverse it. This single line may look quite different from the "version 1" code, but nothing is wrong here. Remember, the D used here is really the CCDD in the "version 1" code, and the M here is really the M + N there. Also, there is no need to subtract 1:N from id, due to our different definition of D.
Checking the result
Now, the trivial R-loop gives:
id <- integer(N); for(j in 1:N) id[j] <- max(which(D <= C[j]))
id
# [1] 1 4 7 9 14 15 16 19 21 23
And our single-line, vectorized code gives:
id <- M - match(C, rev(D)) + 1
id
# [1] 1 4 7 9 14 15 16 19 21 23
Perfect match, hence we are doing the right thing.
Conclusion
So, Laurbert, this is the answer you want:
A[, 1] <- B[M - match(C, rev(D)) + 1]
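As an aside of mine (not part of the reasoning above): base R's findInterval() computes exactly these indices whenever D is sorted in ascending order, returning for each C[j] the largest k with D[k] <= C[j] (and 0 if there is none):

id <- findInterval(C, D)
id
# [1] 1 4 7 9 14 15 16 19 21 23
A[, 1] <- B[id]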
You can use outer for this.
Your code:
A1 <- matrix(NA_real_, ncol = 1, nrow = length(C))
for(j in seq_along(C)) {
A1[j,1] = B[max(which(C[j]>=D))];
}
Test whether the elements of C are greater than or equal to the elements of D with outer:
test <- outer(C, D, FUN = ">=")
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
#[10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
Note that this can use a lot of memory for large vectors.
Then find the last TRUE value in each row:
ind <- max.col(test, ties.method = "last") * (rowSums(test) > 0)
rowSums(test) > 0 tests whether a row contains any TRUE values and makes the corresponding element of ind 0 otherwise. It's undefined what you'd want to happen in this case. (A 0 index is ignored during subsetting. Possibly, you'd want NA instead in your final result?)
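If NA is what you want, a small tweak would do it (a sketch; ind_na is a hypothetical name of mine, not part of the answer):

ind_na <- max.col(test, ties.method = "last")
ind_na[rowSums(test) == 0] <- NA_integer_  # rows with no TRUE get NA
B[ind_na]                                  # NA then propagates into the result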
Now subset:
A2 <- as.matrix(B[ind], ncol = 1)
# [,1]
# [1,] 1.0000000
# [2,] 0.9565217
# [3,] 0.9565217
# [4,] 0.9565217
# [5,] 0.7173913
# [6,] 0.7173913
# [7,] 0.7173913
# [8,] 0.6277174
# [9,] 0.5230978
#[10,] 0.3923234
Are the results identical?
identical(A2, A1)
#[1] TRUE
The data (please use dput next time to provide example data):
B <- c(1.0000000, 1.0000000, 1.0000000, 0.9565217, 0.9565217, 0.9565217, 0.9565217,
0.9565217, 0.9565217, 0.9565217, 0.8967391, 0.8369565, 0.7771739, 0.7173913,
0.7173913, 0.7173913, 0.7173913, 0.7173913, 0.6277174, 0.6277174, 0.5230978,
0.5230978, 0.3923234, 0.3923234, 0.3923234)
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074,
2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587,
0.58848, 0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164,
1.22956, 1.23760, 1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
After seeing @Roland's answer, I think I understand better what you are asking. To double check: you want to compare each value of C (individually) against all values of D, and get the largest index of D (let's call it k) that holds a value no larger than C[j]. You then want to use it to assign the corresponding value of B to A, thus A[j] = B[k]. Is this correct?
I don't have an answer regarding how to vectorize what you want to do, but do have some suggestions on how to speed it up. Before that, let me ask whether it's actually worth going through the effort. For the larger example you mentioned (N~1000, M~2500), your loop still runs in well under a second on my laptop. Unless this calculation is done many times over inside another loop, it seems like unnecessary optimization...
Also, as @Roland pointed out, it's not clear what should happen if there is a value in C that's smaller than all values in D. These functions (including your original loop) will not work if that happens and would need some slight tweaking.
Anyway, these are my suggestions:
First, let me wrap your loop into a function for convenience.
f_loop <- function(B, C, D){
N <- length(C)
A <- matrix(0, ncol=1, nrow=N)
for(j in 1:N) {
A[j,1] = B[max(which(C[j]>=D))]
}
return(A)
}
If you want it to look a bit more "R-like" you can replace the loop with one of the *apply functions. In this case, it also runs slightly faster than the loop.
vapply(C, function(x) B[max(which(x>=D))], 0)
## Wrapped into a function for easier reference
f_vapply <- function(B, C, D){
vapply(C, function(x) B[max(which(x>=D))], 0)
}
My other suggestion is uglier (and not really "R-like"), but can help speed things up a lot (if that's the end goal here). I used the inline package to create a compiled version of your loop (note that depending on your OS and R setup, you may need to download additional tools or packages to be able to compile code).
## Translate loop into Fortran
loopcode <-
" integer i, j, k
do i = 1, n
k = 0
do j = 1, m
if (C(i) >= D(j)) k = j
end do
A(i) = B(k)
end do
"
## Compile into function
library(inline)
loopfun <- cfunction(
  sig = signature(A = "numeric", B = "numeric", C = "numeric",
                  D = "numeric", n = "integer", m = "integer"),
  dim = c("(n)", "(m)", "(n)", "(m)", "", ""),
  loopcode,
  language = "F95"
)
## Wrap into function for easier reference
f_compiled <- function(B, C, D){
A <- C
n <- length(A)
m <- length(B)
out <- loopfun(A, B, C, D, n, m)
return(as.matrix(out$A, ncol=1))
}
Let's check that the results all match:
cbind(A, f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 1.0000000 1.0000000 1.0000000
## [2,] 0.9565217 0.9565217 0.9565217 0.9565217
## [3,] 0.9565217 0.9565217 0.9565217 0.9565217
## [4,] 0.9565217 0.9565217 0.9565217 0.9565217
## [5,] 0.7173913 0.7173913 0.7173913 0.7173913
## [6,] 0.7173913 0.7173913 0.7173913 0.7173913
## [7,] 0.7173913 0.7173913 0.7173913 0.7173913
## [8,] 0.6277174 0.6277174 0.6277174 0.6277174
## [9,] 0.5230978 0.5230978 0.5230978 0.5230978
## [10,] 0.3923234 0.3923234 0.3923234 0.3923234
And check the speed:
microbenchmark(f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
## Unit: microseconds
## expr min lq mean median uq max neval cld
## f_loop(B, C, D) 52.804 54.8075 57.34588 56.5420 58.4615 83.843 100 c
## f_vapply(B, C, D) 38.677 41.5055 43.21231 42.8825 44.1525 65.355 100 b
## f_compiled(B, C, D) 17.095 18.2775 20.55372 20.1770 21.4710 66.407 100 a
We can also try it with vectors of similar size to the larger ones you mentioned (note the change in units for the results):
## Make the vector larger for benchmark
B <- rep(B, 100) # M = 2500
C <- rep(C, 100) # N = 1000
D <- rep(D, 100) # M = 2500
microbenchmark(f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f_loop(B, C, D) 24.380069 24.85061 25.99855 25.839282 25.952433 62.75721 100 b
## f_vapply(B, C, D) 23.543749 24.18427 25.34881 25.015859 25.179924 62.60746 100 b
## f_compiled(B, C, D) 1.976611 2.01403 2.06750 2.032864 2.057594 3.13658 100 a
EDIT:
I realized that if you always want the largest index of D for which C[j] >= D holds, it of course makes much more sense to loop through D starting from the end of the array and exit as soon as the first match is found (instead of looping through the full array).
This is a small tweak to the Fortran code I wrote above that takes advantage of that.
loopcode <-
" integer i, j, k
do j = 1, n
k = 0
do i = m, 1, -1
if (C(j) >= D(i)) then
k = i
exit
end if
end do
A(j) = B(k)
end do
"
I won't include it in the benchmarks, because it is much more dependent on the actual data points. But it is obvious that the worst-case behavior is the same as for the previous loop (e.g. if the index of interest occurs at the beginning, D is looped through in full), while the best-case behavior almost completely eliminates looping through D (e.g. if the condition holds at the end of the array).
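As a further aside of mine (not part of the answer above): base R's Position() with right = TRUE performs the same right-to-left early-exit search at the R level, though still once per element of C. A hypothetical sketch:

## R-level analogue of the reverse scan; returns NA when no D value qualifies
f_position <- function(B, C, D){
  k <- vapply(C, function(x) Position(function(d) x >= d, D, right = TRUE), 0L)
  matrix(B[k], ncol = 1)
}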
I have a logical vector, for which I wish to insert new elements at particular indexes. I've come up with a clumsy solution below, but is there a neater way?
probes <- rep(TRUE, 15)
ind <- c(5, 10)
probes.2 <- logical(length(probes)+length(ind))
probes.ind <- ind + 1:length(ind)
probes.original <- (1:length(probes.2))[-probes.ind]
probes.2[probes.ind] <- FALSE
probes.2[probes.original] <- probes
print(probes)
gives
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
and
print(probes.2)
gives
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE TRUE TRUE
So it works but is ugly looking - any suggestions?
These are all very creative approaches. I think working with indexes is definitely the way to go (Marek's solution is very nice).
I would just mention that there is a function to do roughly that: append().
probes <- rep(TRUE, 15)
probes <- append(probes, FALSE, after=5)
probes <- append(probes, FALSE, after=11)
Or you could do this in a loop over your indexes (you need to grow the "after" value on each iteration):
probes <- rep(TRUE, 15)
ind <- c(5, 10)
for(i in 0:(length(ind)-1))
probes <- append(probes, FALSE, after=(ind[i+1]+i))
Incidentally, this question was also previously asked on R-Help. As Barry says:
"Actually I'd say there were no ways of doing this, since I dont think you can actually insert into a vector - you have to create a new vector that produces the illusion of insertion!"
You can do some magic with indexes:
First create vector with output values:
probs <- rep(TRUE, 15)
ind <- c(5, 10)
val <- c( probs, rep(FALSE,length(ind)) )
# > val
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [13] TRUE TRUE TRUE FALSE FALSE
Now the trick: each old element gets an integer rank, and each new element gets a half-rank:
id <- c( seq_along(probs), ind+0.5 )
# > id
# [1] 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0
# [16] 5.5 10.5
Then use order() to sort the values into the proper order:
val[order(id)]
# [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
# [13] TRUE TRUE TRUE TRUE TRUE
probes <- rep(TRUE, 1000000)
ind <- c(50:100)
val <- rep(FALSE, length(ind))
new.probes <- vector(mode = "logical", length(probes) + length(val))
new.probes[-ind] <- probes  # old values fill every position except the insertion slots
new.probes[ind] <- val      # new values go into the insertion slots
Some timings:
My method:
#    user  system elapsed
#    0.03    0.00    0.03
Marek's method:
#    user  system elapsed
#    0.18    0.00    0.18
R append() with a for loop:
#    user  system elapsed
#    1.61    0.48    2.10
How about this:
> probes <- rep(TRUE, 15)
> ind <- c(5, 10)
> probes.ind <- rep(NA, length(probes))
> probes.ind[ind] <- FALSE
> new.probes <- as.vector(rbind(probes, probes.ind))
> new.probes <- new.probes[!is.na(new.probes)]
> new.probes
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE TRUE TRUE
That is sort of tricky. Here's one way. It iterates over the vector, inserting one element at a time, so it's not too efficient.
probes <- rep(TRUE, 15)
ind <- c(5, 10)
probes.ind <- ind + 0:(length(ind)-1)
for (i in probes.ind) {
probes <- c(probes[1:i], FALSE, probes[(i+1):length(probes)])
}
> probes
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE TRUE TRUE
This should even work if ind has repeated elements, although ind does need to be sorted for the probes.ind construction to work.
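For example, with a repeated index (a quick check of mine, using the loop above):

probes <- rep(TRUE, 15)
ind <- c(5, 5)                            # insert twice after position 5
probes.ind <- ind + 0:(length(ind) - 1)   # 5 6
for (i in probes.ind) {
  probes <- c(probes[1:i], FALSE, probes[(i + 1):length(probes)])
}
which(!probes)
# [1] 6 7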
Or you can do it using the insertRow function from the miscTools package.
library(miscTools)
probes <- rep(TRUE, 15)
ind <- c(5, 10)
for (i in ind){
  probes <- as.vector(insertRow(as.matrix(probes), i, FALSE))
}
I came up with a good answer that's easy to understand and fairly fast to run, building off Wojciech's answer above. I'll adapt the method for the example here, but it can be easily generalized to pretty much any data type for an arbitrary pattern of missing points (shown below).
probes <- rep(TRUE, 15)
ind <- c(5,10)
probes.final <- rep(FALSE, length(probes)+length(ind))
probes.final[-ind] <- probes
The data I needed this for is sampled at a regular interval, but many samples are thrown out, and the resulting data file only includes the timestamps and measurements for those retained. I needed to produce a vector containing all the timestamps and a data vector with NAs inserted for timestamps that were tossed. I used the "not in" function stolen from here to make it a bit simpler.
`%notin%` <- Negate(`%in%`)
dat <- rnorm(50000) # Data given
times <- seq(from=554.3, by=0.1, length.out=70000) # "Original" time stamps
times <- times[-sample(2:69999, 20000)] # "Given" times with arbitrary points missing from interior
times.final <- seq(from=times[1], to=times[length(times)], by=0.1)
na.ind <- which(times.final %notin% times)
dat.final <- rep(NA, length(times.final))
dat.final[-na.ind] <- dat
Hi, I had the same question, but I couldn't follow what people had answered because I'm still learning the language. So I tried to make my own, and I believe it works! I created a vector and wanted to insert the value 100 after the 3rd, 5th and 6th indexes. This is what I wrote.
vector <- c(0:9)
indexes <- c(6, 3, 5)
indexes <- indexes[order(indexes)]
i <- 1
j <- 0
while(i <= length(indexes)){
vector <- append(vector, 100, after = indexes[i] + j)
i <- i + 1
j <- j + 1
}
vector
The vector "indexes" must be in ascending order for this to work. This is why I put them in order at the third line.
The variable "j" is necessary because at each iteration, the length of the new vector increases and the original values are moved.
In case you wish to insert the new values next to each other, simply repeat the index. For instance, by assigning indexes <- c(3, 5, 5, 5, 6), you should get vector == 0 1 2 100 3 4 100 100 100 5 100 6 7 8 9