More efficient strategy for which() or match() - r

I have a vector of positive and negative numbers
vec<-c(seq(-100,-1), rep(0,20), seq(1,100))
The actual vector is larger than this example and takes on a random set of values. I have to repeatedly find the number of negative numbers in the vector, and my current approach is quite inefficient.
Since I only need to find the number of negative numbers, and the vector is sorted, I only need to know the index of the first 0 or positive number (there may be no 0s in the actual random vectors).
Currently I am using this code to find the length
length(which(vec<0))
but this forces R to scan the entire vector, which should be unnecessary since it is sorted.
I could use
match(0, vec)
but my vector does not always have 0s
So my question is, is there some kind of match() function that applies a condition instead of finding a specific value? Or is there a more efficient way to run my which() code?
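For what it's worth, base R does have a predicate-based cousin of match(): Position() returns the index of the first element satisfying a function and stops at the first hit. A minimal sketch on the example vector (note that Position() still loops element by element in R, so the binary-search answers below remain far faster on long vectors):
vec <- c(seq(-100, -1), rep(0, 20), seq(1, 100))
Position(function(x) x >= 0, vec) - 1L  # index of first non-negative, minus 1
# [1] 100
# If the vector contained no non-negative values, Position() would return NA.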

The solutions offered so far all imply creating a logical(length(vec)) and doing a full or partial scan on this. As you note, the vector is sorted. We can exploit this by doing a binary search. I started thinking I'd be super-clever and implement this in C for even greater speed, but had trouble with debugging the indexing of the algorithm (which is the tricky part!). So I wrote it in R:
f3 <- function(x) {
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= 0)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}
For comparison with the other suggestions
f0 <- function(v) length(which(v < 0))
f1 <- function(v) sum(v < 0)
f2 <- function(v) which.min(v < 0) - 1L
and for fun
library(compiler)
f3.c <- cmpfun(f3)
Leading to
> vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
> identical(f0(vec), f1(vec))
[1] TRUE
> identical(f0(vec), f2(vec))
[1] TRUE
> identical(f0(vec), f3(vec))
[1] TRUE
> identical(f0(vec), f3.c(vec))
[1] TRUE
> microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
Unit: microseconds
expr min lq median uq max neval
f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903 100
f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293 100
f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889 100
f3(vec) 51.715 56.050 75.4495 78.5295 100.730 100
f3.c(vec) 11.612 17.147 28.5570 31.3160 49.781 100
Probably there are some tricky edge cases that I've got wrong! Moving to C, I did
library(inline)
f4 <- cfunction(c(x = "numeric"), "
int imin = 0, imax = Rf_length(x) - 1, imid;
while (imax >= imin) {
imid = imin + (imax - imin) / 2;
if (REAL(x)[imid] >= 0)
imax = imid - 1;
else
imin = imid + 1;
}
return ScalarInteger(imax + 1);
")
with
> identical(f3(vec), f4(vec))
[1] TRUE
> microbenchmark(f3(vec), f3.c(vec), f4(vec))
Unit: nanoseconds
expr min lq median uq max neval
f3(vec) 52096 53192.0 54918.5 55539.0 69491 100
f3.c(vec) 10924 12233.5 12869.0 13410.0 20038 100
f4(vec) 553 796.0 893.5 1004.5 2908 100
findInterval came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec is actually sorted and dealing with NA values. If one wants to live on the edge (arguably no worse than implementing f3 or f4) then
f5.i <- function(v)
.Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))
is nearly as fast as the C implementation, but likely more robust and vectorized (i.e., look up a vector of values in the second argument, for easy range-like calculations).
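If one prefers to stay with the documented interface, a rough (unbenchmarked) equivalent of f5.i using the exported findInterval() would be the sketch below, assuming vec is sorted, NA-free, and has no negative values closer to zero than .Machine$double.neg.eps:
f5 <- function(v) findInterval(-.Machine$double.neg.eps, v)
f5(vec)  # same count of negatives as f0(vec) under the assumptions above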

Use sum() and logical comparison:
sum( vec < 0 )
[1] 100
This will be pretty quick, and when you sum a logical, TRUE is 1 and FALSE is 0 so the total will be the number of negative values.
Uh oh, I feel the need for a benchmarking comparison... :-) Vector length is 2e5
library(microbenchmark)
vec<-c(seq(-100,-1,length.out=1e5), rep(0,20), seq(1,100,length.out=1e5))
microbenchmark( (which.min(vec < 0) - 1L) , (sum( vec < 0 )) )
Unit: milliseconds
expr min lq median uq max neval
(which.min(vec < 0) - 1L) 1.883847 2.130746 2.554725 3.141787 75.943911 100
(sum(vec < 0)) 1.398100 1.500639 1.508688 1.745088 2.662164 100

You could use which.min
which.min(vec < 0) - 1L
which.min() returns the index of the first FALSE value (the minimum of the logical vector), i.e. the first non-negative element, so subtracting 1 gives the count of negative values.
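A quick check on the example vector, plus a caveat that assumes the random vectors always contain at least one non-negative value:
vec <- c(seq(-100, -1), rep(0, 20), seq(1, 100))
which.min(vec < 0) - 1L
# [1] 100
# Caveat: if every element were negative there would be no FALSE,
# which.min() would return 1, and the result would wrongly be 0.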

Keeping vectors (from list of vectors) whose elements do not have a proper subset within that same list (using RCPP)

I have asked this question previously (see here) and received a satisfactory answer using the purrr package. However, this has proved to be a bottleneck in my program, so I would like to rewrite the section using the Rcpp package.
Proper subset: A proper subset S' of a set S is a subset that is strictly contained in S and so excludes S itself (note I am also excluding the empty set).
Suppose you have the following vectors in a list:
a = c(1,2)
b = c(1,3)
c = c(2,4)
d = c(1,2,3,4)
e = c(2,4,5)
f = c(1,2,3)
My aim is to keep only vectors which have no proper subset within the list, which in this example would be a, b and c.
Previous Solution
library(purrr)
possibilities <- list(a,b,c,d,e,f)
keep(possibilities,
     map2_lgl(.x = possibilities,
              .y = seq_along(possibilities),
              ~ !any(map_lgl(possibilities[-.y], function(z) all(z %in% .x)))))
The idea here is to avoid the O(N^3) approach and use something of lower order instead. The other answer provided here will still be slow, since it is greater than O(N^2). Here is a solution below O(N^2), with a worst case of O(N^2) when all the elements are unique.
onlySet <- function(x){
    i <- 1
    repeat{
        y <- sapply(x[-1], function(el) !all(is.element(x[[1]], el)))
        if(all(y)){
            if(i == length(x)) break
            else i <- i + 1
        }
        x <- c(x[-1][y], x[1])
    }
    x
}
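A quick sanity check on the example list (note that the kept sets may come back in a different order than they appear in the input):
possibilities <- list(a, b, c, d, e, f)
onlySet(possibilities)
# keeps c(1,2), c(1,3) and c(2,4), though not necessarily in that order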
Now to show the time difference, check out the following:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
method1 <- function(a){
    mat <- outer(a, a, match_fun)
    a[colSums(mat) == 1]
}
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(method1(poss), onlySet(poss))
Unit: milliseconds
expr min lq mean median uq max neval cld
method1(poss) 840.7919 880.12635 932.255030 889.36380 923.32555 1420.1077 100 b
onlySet(poss) 1.9845 2.07005 2.191647 2.15945 2.24245 3.3656 100 a
Have you tried optimising the solution in base R first? For example, the following reproduces your expected output and uses (faster) base R array routines:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
mat <- outer(possibilities, possibilities, match_fun)
possibilities[colSums(mat) == 1]
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 1 3
#
#[[3]]
#[1] 2 4
Inspired by Onyambu's performant solution, here is another base R option using a recursive function
f_recursive <- function(x, i = 1) {
    if (i > length(x)) return(x)
    idx <- which(sapply(x[-i], function(el) all(x[[i]] %in% el))) + 1
    if (length(idx) == 0) f_recursive(x, i + 1) else f_recursive(x[-idx], i + 1)
}
f_recursive(possibilities)
The performance is on par with Onyambu's solution.
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(
method1(poss),
onlySet(poss),
f_recursive(poss))
#Unit: milliseconds
#              expr        min         lq       mean     median         uq         max neval
#     method1(poss) 682.558602 710.974831 750.325377 730.627996 765.040976 1200.562889   100
#     onlySet(poss)   1.700646   1.782713   1.870972   1.819820   1.918669    2.371646   100
# f_recursive(poss)   1.681120   1.737459   1.884685   1.806384   1.901582    3.217013   100

Iterate through the column and count the rows satisfying the condition in R

I'm trying to write a for loop to determine the number of schools whose room cost (column 34) is higher than their board cost (column 23).
numrows <- dim(schools)[1]
for(ii in 1:numrows){
    if(schools[ii, 34] > schools[ii, 23], na.rm = TRUE){
        nrow(numrows)
    }
}
I'm getting the following error
Error in if (schools[ii, 34] > schools[ii, 23]) { :
missing value where TRUE/FALSE needed
I did notice that some of the board costs are missing and I'd like to omit those from the comparisons. Also, I'm expecting just the number of rows that satisfy the condition.
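A direct vectorized answer along the lines the discussion below builds toward (assuming the column positions from the question) would be:
# count rows where room cost (col 34) exceeds board cost (col 23),
# dropping comparisons involving NA
sum(schools[, 34] > schools[, 23], na.rm = TRUE)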
To further demonstrate my point, here is a simple example based on a 10,000-row sample data.frame
set.seed(2018)
df <- data.frame(one = runif(10^4), two = runif(10^4))
Running a microbenchmark analysis
library(microbenchmark)
res <- microbenchmark(
    vectorised = sum(df[, 1] > df[, 2]),
    for_loop = {
        ss <- 0
        for (i in seq_len(nrow(df))) if (df[i, 1] > df[i, 2]) ss <- ss + 1
        ss
    })
res
# Unit: microseconds
#       expr        min        lq         mean      median          uq        max neval
# vectorised     59.681     65.13     78.33118     72.8305     77.9195    152.172   100
#   for_loop 346250.957 359535.08 398508.54996 379421.2305 426452.4265 608490.869   100
library(ggplot2)
autoplot(res)
Notice the difference of more than three orders of magnitude (a factor of several thousand!) between the for loop and the vectorised operation. Neither surprising nor interesting.
The structure of the data leading to the error
Error in if (schools[ii, 34] > schools[ii, 23]) { :
missing value where TRUE/FALSE needed
occurs when one or both of the values in the comparison is NA, because the NA propagates through the comparison x > y, e.g.,
> test = 1 > NA
> test
[1] NA
and the flow control if (test) {} can't determine whether the test is TRUE (and so the code should be executed) or FALSE
> if (test) {}
Error in if (test) { : missing value where TRUE/FALSE needed
A simple vectorized solution isn't good enough
> set.seed(123)
> n = 10; x = sample(n); y = sample(n); y[5] = NA
> sum(x > y)
[1] NA
though the 'fix' is obvious and inexpensive
> sum(x > y, na.rm = TRUE)
[1] 3
The for loop also fails, and it is not possible (as attempted in part of the original question) to simply add an na.rm = TRUE clause to the if statement
s = 0
for (i in seq_along(x)) {
    if (x[i] > y[i], na.rm = TRUE)
        s <- s + 1
}
s
because this is not syntactically valid
Error: unexpected ',' in:
"for (i in seq_along(x)) {
if (x[i] > y[i],"
so a more creative solution needs to be found, e.g., testing whether the value of the comparison is actually TRUE
s <- 0
for (i in seq_along(x)) {
    if (isTRUE(x[i] > y[i]))
        s <- s + 1
}
s
Of course it is not useful to compare the performance of the incorrect code; one needs to write the correct solutions first
f1 <- function(x, y)
    sum(x > y, na.rm = TRUE)
f2 <- function(x, y) {
    s <- 0
    for (i in seq_along(x))
        if (isTRUE(x[i] > y[i]))
            s <- s + 1
    s
}
f1() seems much more compact and readable compared to f2(), but we need to make sure the results are sensible
> x > y
[1] FALSE TRUE FALSE FALSE NA TRUE FALSE FALSE FALSE TRUE
> f1(x, y)
[1] 3
and check whether the two implementations agree
> identical(f1(x, y), f2(x, y))
[1] FALSE
Hey wait, what's going on? They look the same?
> f2(x, y)
[1] 3
Actually, the results are numerically equal, but f1() returns an integer value whereas f2() returns a numeric
> all.equal(f1(x, y), f2(x, y))
[1] TRUE
> class(f1(x, y))
[1] "integer"
> class(f2(x, y))
[1] "numeric"
and if we're comparing performance we really need the results to be identical -- no sense comparing apples and oranges. We can update f2() to return an integer by making sure the sum s is always an integer -- use a suffix L, e.g., 0L, to create an integer value
> class(0)
[1] "numeric"
> class(0L)
[1] "integer"
and make sure an integer 1L is added to s on each successful iteration
f2a <- function(x, y) {
    s <- 0L
    for (i in seq_along(x))
        if (isTRUE(x[i] > y[i]))
            s <- s + 1L
    s
}
We then have
> f2a(x, y)
[1] 3
> identical(f1(x, y), f2a(x, y))
[1] TRUE
and are now in a position to compare performance
> microbenchmark(f1(x, y), f2a(x, y))
Unit: microseconds
expr min lq mean median uq max neval
f1(x, y) 1.740 1.8965 2.05500 2.023 2.0975 6.741 100
f2a(x, y) 17.505 18.2300 18.67314 18.487 18.7440 34.193 100
Certainly f2a() is much slower, but for a problem of this size (note the unit is 'microseconds') maybe this doesn't matter -- how do the solutions scale with problem size?
> set.seed(123)
> x = lapply(10^(3:7), sample)
> y = lapply(10^(3:7), sample)
> f = f1; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]))
Unit: microseconds
expr min lq mean median uq max neval
f(x[[1]], y[[1]]) 9.655 9.976 10.63951 10.3250 11.1695 17.098 100
f(x[[2]], y[[2]]) 76.722 78.239 80.24091 78.9345 79.7495 125.589 100
f(x[[3]], y[[3]]) 764.034 895.075 914.83722 908.4700 922.9735 1106.027 100
> f = f2a; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]))
Unit: milliseconds
              expr        min         lq       mean     median         uq        max neval
 f(x[[1]], y[[1]])   1.260307   1.296196   1.417762   1.338847   1.393495   3.345009   100
 f(x[[2]], y[[2]])  12.686183  13.167982  14.067785  13.923531  14.666305  17.713220   100
 f(x[[3]], y[[3]]) 133.639508 138.845753 144.152542 143.349102 146.913338 165.990545   100
They both scale linearly (not surprising) but even for lengths of 100,000 f2a() doesn't seem too bad -- about 1/6th of a second -- and might be a candidate in a situation where vectorization obfuscated the code rather than clarified it. The cost of extracting individual elements from columns of a data.frame changes this calculus, but also points to the value of operating on atomic vectors rather than complicated data structures.
For what it's worth one can think of worse implementations, especially
f3 <- function(x, y) {
    s <- logical(0)
    for (i in seq_along(x))
        s <- c(s, isTRUE(x[i] > y[i]))
    sum(s)
}
which scales quadratically
> f = f3; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]), times = 1)
Unit: milliseconds
              expr          min           lq         mean       median           uq          max neval
 f(x[[1]], y[[1]])     7.018899     7.018899     7.018899     7.018899     7.018899     7.018899     1
 f(x[[2]], y[[2]])   371.248504   371.248504   371.248504   371.248504   371.248504   371.248504     1
 f(x[[3]], y[[3]]) 42528.280139 42528.280139 42528.280139 42528.280139 42528.280139 42528.280139     1
(because c(s, ...) copies all of s to add one element to it) and is a pattern found very often in people's code.
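The usual fix for this pattern is to pre-allocate the result and fill it by index instead of growing it; a minimal sketch (the name f3a is mine):
f3a <- function(x, y) {
    s <- logical(length(x))  # pre-allocate once
    for (i in seq_along(x))
        s[i] <- isTRUE(x[i] > y[i])
    sum(s)
}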
A second common slowdown is use of complicated data structures (like the data.frame) rather than simple data structures (like atomic vectors), e.g., comparing
f4 <- function(df) {
    s <- 0L
    x <- df[[1]]
    y <- df[[2]]
    for (i in seq_len(nrow(df))) {
        if (isTRUE(x[i] > y[i]))
            s <- s + 1L
    }
    s
}
f5 <- function(df) {
    s <- 0L
    for (i in seq_len(nrow(df))) {
        if (isTRUE(df[i, 1] > df[i, 2]))
            s <- s + 1L
    }
    s
}
with
> df <- Map(data.frame, x, y)
> identical(f1(x[[1]], y[[1]]), f4(df[[1]]))
[1] TRUE
> identical(f1(x[[1]], y[[1]]), f5(df[[1]]))
[1] TRUE
> microbenchmark(f1(x[[1]], y[[1]]), f2a(x[[1]], y[[1]]), f4(df[[1]]), f5(df[[1]]), times = 10)
Unit: microseconds
                expr       min        lq       mean     median        uq       max neval
  f1(x[[1]], y[[1]])    10.042    10.324    13.3511    13.4425    14.690    20.753    10
 f2a(x[[1]], y[[1]])  1310.186  1316.869  1480.1526  1344.8795  1386.322  2676.030    10
         f4(df[[1]])  1329.307  1336.869  1363.4238  1358.7080  1365.427  1439.402    10
         f5(df[[1]]) 37051.756 37106.026 38187.8278 37876.0940 38416.276 42292.588    10

Finding the number of values above a value and less than a value in a df column without using a loop

Consider the following:
df <- data.frame(X = c(5000, 6000, 5500, 5000, 5300))
count_above <- function(vector)
{
    counts <- vector()
    counts[1] <- 0
    for (i in 2:length(vector))
    {
        temp <- vector[1:i]
        counts <- c(counts, sum(temp < vector[i]))
    }
    return(counts)
}
This gives me the correct output:
count_above(df$X)
[1] 0 1 1 0 2
For instance, the (column) vector here is
5000
6000
5500
5000
5300
At the very top 5000, there are no values above it. So this gives value 0.
At the 6000, there is one value which is above it and is less than 6000: the 5000. So this gives value 1.
At the 5500, there are two values above it, one of which is less than 5500, so this gives value 1, and so forth.
Is there any way I can write this out without using a loop?
Another approach, quite similar to aichao's solution (but a bit shorter)
X <- c(5000, 6000, 5500, 5000, 5300)
indices <- 1:length(X)
count_above <- colSums(outer(X, X, "<") & outer(indices, indices, "<"))
## [1] 0 1 1 0 2
Edit (Performance): Perhaps my idea was selected as the accepted answer because it is short and self-explanatory code - but be careful using it on large vectors! It's the slowest of all the approaches suggested here! Similar to what dracodoc did, I also ran a microbenchmark, but I used a randomly generated vector of 3000 values to get more meaningful run times:
count_above_loop <- function(v)
{
    counts <- integer(length = length(v))
    counts[1] <- 0
    for (i in 2:length(v))
    {
        counts[i] <- sum(v[1:(i-1)] < v[i])
    }
    return(counts)
}
count_above_outer <- function(X) {
    indices <- 1:length(X)
    colSums(outer(X, X, "<") & outer(indices, indices, "<"))
}
count_above_apply <- function(X) {
    sapply(seq_len(length(X)), function(i) sum(X[i:1] < X[i]))
}
X <- runif(3000)
microbenchmark::microbenchmark(count_above_loop(X),
count_above_apply(X),
count_above_outer(X), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
count_above_loop(X) 56.27923 58.17195 62.07571 60.08123 63.92010 77.31658 10 a
count_above_apply(X) 54.41776 55.07511 57.12006 57.22372 58.61982 59.95037 10 a
count_above_outer(X) 121.12352 125.56072 132.45728 130.08141 137.08873 154.28419 10 b
We see that the apply approach on a large vector and without the overhead of a data frame is slightly faster than the for-loop.
My outer product approach takes more than double the time.
So I would recommend the for-loop - it is readable and much faster than my outer-product approach. My approach might still be considered if you want provably correct code, as this one-liner is quite close to a specification of the problem.
Consider a running conditional count with sapply(). Though this is still a loop under the hood, it avoids writing one explicitly:
count_above <- sapply(seq_len(nrow(df)),
function(i) sum(df[i:1, c("X")] < df$X[i]))
count_above
# [1] 0 1 1 0 2
EDIT: I should have used a bigger dataset for the benchmark; the tiny dataset makes the benchmark results a little misleading. See PatrickRoocks's update.
I had just commented that a for loop is not necessarily worse than the apply family, and then I saw this. I did a microbenchmark comparing an optimized for loop with the sapply method; the for loop is about 6 times faster. The sapply method is not wrapped in a proper function; turning it into a function that takes a vector, instead of assuming data frame columns, could improve it a little.
df <- data.frame(X = c(5000, 6000, 5500, 5000, 5300))
count_above <- function(v)
{
    counts <- integer(length = length(v))
    counts[1] <- 0
    for (i in 2:length(v))
    {
        counts[i] <- sum(v[1:(i-1)] < v[i])
    }
    return(counts)
}
count_above(df$X)
microbenchmark::microbenchmark(count_above(df$X), sapply(seq_len(nrow(df)), function(i) sum(df[i:1, c("X")] < df$X[i])), times = 10)
Unit: microseconds
                                                                  expr     min      lq     mean   median      uq     max neval cld
                                                     count_above(df$X)  38.623  41.068  65.0722  55.0010  65.512 142.757    10   a
 sapply(seq_len(nrow(df)), function(i) sum(df[i:1, c("X")] < df$X[i])) 262.045 269.379 368.6231 339.2905 415.067 640.934    10   b
Update:
# modify Parfait's answer into a function, taking vector instead of data frame
count_above_2 <- function(v){
    counts <- sapply(seq_len(length(v)),
                     function(i) sum(v[i:1] < v[i]))
    return(counts)
}
X <- df$X
microbenchmark::microbenchmark(count_above(X), count_above_2(X), {indices <- 1:length(X); colSums(outer(X, X, "<") & outer(indices, indices, "<"))}, times = 100)
Unit: microseconds
                                                                                 expr    min      lq     mean  median      uq     max neval cld
                                                                       count_above(X) 21.023 23.4680 39.02878 26.1565 35.4450 144.224   100   a
                                                                     count_above_2(X) 41.067 49.3785 67.06162 53.2900 70.1565 166.712   100   b
 { indices <- 1:length(X) colSums(outer(X, X, "<") & outer(indices, indices, "<")) } 37.646 40.0900 66.45059 53.0450 72.8455 258.623   100   b
For loop still wins.
Passing a plain vector instead of df$X saves time for every method, so I gave all three solutions the same vector to keep the comparison fair.
Parfait's answer is comparable with PatrickRoocks's.
Besides performance, there is a subtle point of correctness.
The OP's function and Parfait's sum(v[i:1] < v[i]) give the correct answer only because v[i] < v[i] is FALSE. By definition it should use v[1:(i-1)] < v[i].
My function can be written in a more concise version like this:
count_above <- function(v)
{
    counts <- integer(length = length(v))
    for (i in 1:length(v))
    {
        counts[i] <- sum(v[1:(i-1)] < v[i])
    }
    return(counts)
}
It looks better and gives the correct result. This also depends on v[1] < v[1] being FALSE. That is not necessarily wrong, since it only affects the first element, though I would still prefer the longer but more obvious version.
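A quick illustration of why the concise version still works for i = 1: the zero in 1:0 is dropped when indexing, so the only comparison left is the self-comparison, which is FALSE.
v <- c(5000, 6000, 5500)
1:0                 # 1 0
v[1:0]              # 5000  (index 0 is silently dropped)
sum(v[1:0] < v[1])  # 0, as required for the first element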
Another approach (still a loop because of colSums):
xg <- expand.grid(df$X,df$X)
o <- matrix(xg$Var1 < xg$Var2, nrow=length(df$X))
o[lower.tri(o)] <- FALSE
count_above <- colSums(o)
##[1] 0 1 1 0 2
This will most likely not be as efficient as Parfait's answer, but it is an alternative.

Multiple replacements within a matrix in a single line in R

Is there a way to do the following replacement in a single line in R? If possible, would it be more/less efficient?
m <- matrix(rnorm(100), ncol=10)
threshold <- 0.5
# Is there a single-line way to do the following in R
m[m < threshold] <- 0
m[m >= threshold] <- 1
I'm wondering if the ifelse() function can accommodate this, in the sense of if < threshold then 0, else 1
Since you want a vector of 1s and 0s, you could just reverse your condition, convert the logical values to integer, and create a new matrix with the same dimensions as m.
matrix(as.integer(m >= threshold), nrow(m))
You could also just change the matrix's mode. Normally changing modes would be done in two lines, but you can do it in one with
`mode<-`(m >= threshold, "integer")
Additionally, as #nicola points out, the quick and dirty method is
(m >= threshold) + 0L
By adding the zero integer we coerce the entire matrix to integer.
A couple of others (thanks #Frank):
+(m >= threshold)
m[] <- m >= threshold
So basically, yes. All these perform the task in one line and I can almost guarantee they are all faster than ifelse().
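For reference, the ifelse() call the question alludes to does work in one line, though it is generally slower than the options above and is left out of the benchmarks that follow; a sketch (m2 is just an illustrative name):
m2 <- ifelse(m < threshold, 0, 1)  # keeps m's dimensions, returns 0/1 doubles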
Some benchmarks on a larger matrix (with the replacement method left out):
m <- matrix(rnorm(1e7), ncol=100)
threshold <- 0.5
library(microbenchmark)
microbenchmark(
matrix = matrix(as.integer(m >= threshold), nrow(m)),
mode = `mode<-`(m >= threshold, "integer"),
plus0 = (m >= threshold) + 0L,
unary = +(m >= threshold)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# matrix 295.9292 315.4463 351.9149 351.8144 379.9840 453.4915 100
# mode 163.2156 172.0180 208.9348 202.8014 232.4525 347.0616 100
# plus0 170.2059 177.6111 202.3536 192.3516 223.8284 294.8367 100
# unary 144.0128 150.2696 183.2914 173.4010 203.7955 382.2397 100
For the sake of completeness, here is a benchmark of the replacement method, run with times = 1 since the assignment modifies m itself.
microbenchmark(
replacement = { m[] <- m >= threshold },
times = 1
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# replacement 499.4005 499.4005 499.4005 499.4005 499.4005 499.4005 1

Vectorization while using which() function in R

I have 3 vectors and I want to apply separately on each of them the 'which()' function.
I'm trying to find the max index of values less than some given number.
How can I perform this task using vectorization?
My 3 vectors (they may have various lengths):
vec1 <- c(1,2,3,4,5)
vec2 <- c(11,12,13)
vec3 <- c(1,2,3,4,5,6,7,8)
How can I vectorize it?
max(which(vec1<3))
max(which(vec2<12.3))
max(which(vec3<5.7))
The expected result is:
2
2
5
One way to get a speedup would be to use Rcpp to search for elements smaller than your cutoff, starting from the right side of the vector and moving left. You can return as soon as you find the element that meets your criteria, which means that if your target is near the right side of the vector you might avoid looking at most of the vector's elements (meanwhile, which looks at all vector elements and max looks at all values returned by which). The speedup would be largest for long vectors where the target element is close to the end.
library(Rcpp)
rightmost.small <- cppFunction(
  'double rightmostSmall(NumericVector x, const double cutoff) {
    for (int i = x.size() - 1; i >= 0; --i) {
      if (x[i] < cutoff) return i + 1;  // 1-index
    }
    return 0;  // None found
  }')
rightmost.small(vec1, 3)
# [1] 2
rightmost.small(vec2, 12.3)
# [1] 2
rightmost.small(vec3, 5.7)
# [1] 5
Let's look at the performance for a vector where we expect this to give us a big speedup:
set.seed(144)
vec.large <- rnorm(1000000)
all.equal(max(which(vec.large < -1)), rightmost.small(vec.large, -1))
# [1] TRUE
library(microbenchmark)
microbenchmark(max(which(vec.large < -1)), rightmost.small(vec.large, -1))
# Unit: microseconds
# expr min lq mean median uq max neval
# max(which(vec.large < -1)) 4912.016 8097.290 12816.36406 9189.0685 9883.9775 60405.585 100
# rightmost.small(vec.large, -1) 1.643 2.476 8.54274 8.8915 12.8375 58.152 100
For this vector of length 1 million, we see a speedup of about 1000x using the Rcpp code.
This speedup should carry directly over to the case where you have many vectors stored in a list; you can use #JoshO'Brien's mapply code and observe a speedup when you switch from max(which(...)) to the Rcpp code:
f <- function(v,m) max(which(v < m))
l <- list(vec.large)[rep(1, 100)]
m <- rep(-1, 100)
microbenchmark(mapply(f, l, m), mapply(rightmost.small, l, m))
Unit: microseconds
expr min lq mean median uq max neval
mapply(f, l, m) 865287.828 907893.8505 931448.1555 918637.343 935632.0505 1133909.950 100
mapply(rightmost.small, l, m) 253.573 281.6855 344.5437 303.094 335.1675 712.897 100
We see a 3000x speedup by using the Rcpp code here.
l <- list(vec1,vec2,vec3)
m <- c(3, 12.3, 5.7)
f <- function(v,m) max(which(v < m))
mapply(f,l,m)
# [1] 2 2 5
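If the vectors are sorted ascending (as in the example), the binary-search idea from the first question applies here as well. A sketch using findInterval() with left.open = TRUE (available in recent R), which counts elements strictly below the cutoff and so, for a sorted vector with at least one such element, equals the max index sought:
f_sorted <- function(v, m) findInterval(m, v, left.open = TRUE)
mapply(f_sorted, l, m)
# [1] 2 2 5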
