Vectorization while using which() function in R - r

I have 3 vectors and I want to apply the which() function to each of them separately.
I'm trying to find the max index of values less than some given number.
How can I perform this task using vectorization?
My 3 vectors (they may have different lengths):
vec1 <- c(1,2,3,4,5)
vec2 <- c(11,12,13)
vec3 <- c(1,2,3,4,5,6,7,8)
How can I vectorize it?
max(which(vec1<3))
max(which(vec2<12.3))
max(which(vec3<5.7))
The expected result is:
2
2
5

One way to get a speedup would be to use Rcpp to search for elements smaller than your cutoff, starting from the right side of the vector and moving left. You can return as soon as you find the element that meets your criteria, which means that if your target is near the right side of the vector you might avoid looking at most of the vector's elements (meanwhile, which looks at all vector elements and max looks at all values returned by which). The speedup would be largest for long vectors where the target element is close to the end.
library(Rcpp)
rightmost.small <- cppFunction(
  'double rightmostSmall(NumericVector x, const double cutoff) {
     for (int i = x.size() - 1; i >= 0; --i) {
       if (x[i] < cutoff) return i + 1; // 1-indexed
     }
     return 0; // none found
   }')
rightmost.small(vec1, 3)
# [1] 2
rightmost.small(vec2, 12.3)
# [1] 2
rightmost.small(vec3, 5.7)
# [1] 5
Let's look at the performance for a vector where we expect this to give us a big speedup:
set.seed(144)
vec.large <- rnorm(1000000)
all.equal(max(which(vec.large < -1)), rightmost.small(vec.large, -1))
# [1] TRUE
library(microbenchmark)
microbenchmark(max(which(vec.large < -1)), rightmost.small(vec.large, -1))
# Unit: microseconds
# expr min lq mean median uq max neval
# max(which(vec.large < -1)) 4912.016 8097.290 12816.36406 9189.0685 9883.9775 60405.585 100
# rightmost.small(vec.large, -1) 1.643 2.476 8.54274 8.8915 12.8375 58.152 100
For this vector of length 1 million, we see a speedup of about 1000x using the Rcpp code.
This speedup should carry directly over to the case where you have many vectors stored in a list; you can use @JoshO'Brien's mapply code and observe a speedup when you switch from max(which(...)) to the Rcpp code:
f <- function(v,m) max(which(v < m))
l <- list(vec.large)[rep(1, 100)]
m <- rep(-1, 100)
microbenchmark(mapply(f, l, m), mapply(rightmost.small, l, m))
Unit: microseconds
expr min lq mean median uq max neval
mapply(f, l, m) 865287.828 907893.8505 931448.1555 918637.343 935632.0505 1133909.950 100
mapply(rightmost.small, l, m) 253.573 281.6855 344.5437 303.094 335.1675 712.897 100
We see a 3000x speedup by using the Rcpp code here.

l <- list(vec1,vec2,vec3)
m <- c(3, 12.3, 5.7)
f <- function(v,m) max(which(v < m))
mapply(f,l,m)
# [1] 2 2 5
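For completeness, here is a pure base R sketch (my own addition, not from the original answers) of the same right-to-left early-exit idea used in the Rcpp answer: Position() with right = TRUE scans from the end of the vector and stops at the first element satisfying the predicate. It returns NA rather than 0 when nothing matches, and being an R-level loop it will not match the Rcpp speed.
rightmost_small_r <- function(v, m) Position(function(z) z < m, v, right = TRUE)
mapply(rightmost_small_r, list(vec1, vec2, vec3), c(3, 12.3, 5.7))
# [1] 2 2 5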

Related

Keeping vectors (from list of vectors) whose elements do not have a proper subset within that same list (using RCPP)

I have asked this question previously (see here) and received a satisfactory answer using the purrr package. However, this has proved to be a bottleneck in my program, so I would like to rewrite the section using the Rcpp package.
Proper subset: A proper subset S' of a set S is a subset that is strictly contained in S and so excludes S itself (note I am also excluding the empty set).
Suppose you have the following vectors in a list:
a = c(1,2)
b = c(1,3)
c = c(2,4)
d = c(1,2,3,4)
e = c(2,4,5)
f = c(1,2,3)
My aim is to keep only vectors which have no proper subset within the list, which in this example would be a, b and c.
Previous Solution
library(purrr)
possibilities <- list(a,b,c,d,e,f)
keep(possibilities,
     map2_lgl(.x = possibilities,
              .y = seq_along(possibilities),
              ~ !any(map_lgl(possibilities[-.y], function(z) all(z %in% .x)))))
The idea here is to avoid the O(N^3) approach and use a lower-order algorithm instead. The other answer provided here will still be slow, since it is greater than O(N^2). Here is a solution with less than O(N^2) complexity, where the worst case of O(N^2) occurs when all the elements are unique.
onlySet <- function(x){
  i <- 1
  repeat{
    y <- sapply(x[-1], function(el) !all(is.element(x[[1]], el)))
    if(all(y)){
      if(i == length(x)) break
      else i <- i + 1
    }
    x <- c(x[-1][y], x[1])
  }
  x
}
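A quick sanity check on the example list (output from my run; note that onlySet rotates elements to the back as it works, so the survivors come back in a different order than the input):
onlySet(possibilities)
#[[1]]
#[1] 1 3
#
#[[2]]
#[1] 2 4
#
#[[3]]
#[1] 1 2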
Now to show the time difference, check out the following:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
method1 <- function(a){
  mat <- outer(a, a, match_fun)
  a[colSums(mat) == 1]
}
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(method1(poss), onlySet(poss))
Unit: milliseconds
expr min lq mean median uq max neval cld
method1(poss) 840.7919 880.12635 932.255030 889.36380 923.32555 1420.1077 100 b
onlySet(poss) 1.9845 2.07005 2.191647 2.15945 2.24245 3.3656 100 a
Have you tried optimising the solution in base R first? For example, the following reproduces your expected output and uses (faster) base R array routines:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
mat <- outer(possibilities, possibilities, match_fun)
possibilities[colSums(mat) == 1]
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 1 3
#
#[[3]]
#[1] 2 4
Inspired by Onyambu's performant solution, here is another base R option using a recursive function
f_recursive <- function(x, i = 1) {
  if (i > length(x)) return(x)
  idx <- which(sapply(x[-i], function(el) all(x[[i]] %in% el))) + 1
  if (length(idx) == 0) f_recursive(x, i + 1) else f_recursive(x[-idx], i + 1)
}
f_recursive(possibilities)
The performance is on par with Onyambu's solution.
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(
method1(poss),
onlySet(poss),
f_recursive(poss))
#Unit: milliseconds
# expr min lq mean median uq
# method1(poss) 682.558602 710.974831 750.325377 730.627996 765.040976
# onlySet(poss) 1.700646 1.782713 1.870972 1.819820 1.918669
# f_recursive(poss) 1.681120 1.737459 1.884685 1.806384 1.901582
# max neval
# 1200.562889 100
# 2.371646 100
# 3.217013 100

Compare Matrices in R efficiently

I have an array a with some matrices in it. Now I need to efficiently check how many different matrices I have and what indices (in ascending order) they have in the array. My approach is the following: paste the columns of the matrices as character vectors and have a look at the frequency table, like this:
n <- 10 #observations
a <- array(round(rnorm(2*2*n), 1), c(2, 2, n))
paste_a <- apply(a, 3, paste, collapse = " ") # paste each matrix (column-major) into one string
names(paste_a) <- 1:n
freq <- as.numeric( table(paste_a) ) # frequencies of different matrices (in ascending order)
indizes <- as.numeric(names(sort(paste_a[!duplicated(paste_a)])))
nr <- length(freq) #number of different matrices
However, as you increase n to large numbers, this gets very inefficient (it's mainly paste() that's getting slower and slower). Does anyone have a better solution?
Here is a "real" dataset with 100 observations where some matrices are actual duplicates (as opposed to my example above): https://pastebin.com/aLKaSQyF
Thank you very much.
Since your actual data is made up of the integers 0, 1, 2, 3, why not take advantage of base 4? Integers are much faster to compare than entire matrix objects. (All occurrences of a below refer to the real data set from the link.)
Base4Approach <- function() {
  toBase4 <- sapply(1:dim(a)[3], function(x) {
    v <- as.vector(a[,,x])
    pows <- which(v > 0)
    coefs <- v[pows]
    sum(coefs * (4^pows))
  })
  myDupes <- which(duplicated(toBase4))
  a[,,-(myDupes)]
}
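As a quick worked illustration of the encoding (with made-up values): each of the four cells of a 2x2 matrix with entries in 0..3 contributes its value times a distinct power of 4, so two matrices can only map to the same integer when every cell matches. (The code above uses powers 4^1..4^4 because which() is 1-based; that is just a shifted, still unique, encoding.)
m_demo <- matrix(c(2, 0, 3, 1), 2, 2)
v <- as.vector(m_demo)           # 2 0 3 1
sum(v * 4^(seq_along(v) - 1))    # 2*1 + 0*4 + 3*16 + 1*64 = 114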
And since the question is about efficiency, let's benchmark:
MartinApproach <- function() {
  ### commented this out for comparison reasons
  # dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
  a <- a[,,!duplicated(a, MARGIN = 3)]
  nr <- dim(a)[3]
  a
}
identical(MartinApproach(), Base4Approach())
[1] TRUE
microbenchmark(Base4Approach(), MartinApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 291.658 303.525 339.2712 325.4475 352.981 636.361 100
MartinApproach() 983.855 1000.958 1160.4955 1071.9545 1187.321 3545.495 100
The approach by @d.b doesn't really do the same thing as the previous two approaches (it simply identifies duplicates rather than removing them).
DBApproach <- function() {
  a[, , 9] = a[, , 1]
  # Convert to list
  mylist = lapply(1:dim(a)[3], function(i) a[1:dim(a)[1], 1:dim(a)[2], i])
  temp = sapply(mylist, function(x) sapply(mylist, function(y) identical(x, y)))
  temp2 = unique(apply(temp, 1, function(x) sort(which(x))))
  # The indices in 'a' where the matrices are the same
  temp2[lengths(temp2) > 1]
}
However, Base4Approach still dominates:
microbenchmark(Base4Approach(), MartinApproach(), DBApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 298.764 324.0555 348.8534 338.899 356.0985 476.475 100
MartinApproach() 1012.601 1087.9450 1204.1150 1110.662 1162.9985 3224.299 100
DBApproach() 9312.902 10339.4075 11616.1644 11438.967 12413.8915 17065.494 100
Update courtesy of @alexis_laz
As mentioned in the comments by @alexis_laz, we can do much better.
AlexisBase4Approach <- function() {
  toBase4 <- colSums(a * (4 ^ (0:(prod(dim(a)[1:2]) - 1))), dims = 2)
  myDupes <- which(duplicated(toBase4))
  a[,,-(myDupes)]
}
microbenchmark(Base4Approach(), MartinApproach(), DBApproach(), AlexisBase4Approach(), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Base4Approach() 11.67992 10.55563 8.177654 8.537209 7.128652 5.288112 100
MartinApproach() 39.60408 34.60546 27.930725 27.870019 23.836163 22.488989 100
DBApproach() 378.91510 342.85570 262.396843 279.190793 231.647905 108.841199 100
AlexisBase4Approach() 1.00000 1.00000 1.000000 1.000000 1.000000 1.000000 100
## Still gives accurate results
identical(MartinApproach(), AlexisBase4Approach())
[1] TRUE
My first attempt was actually really slow, so here is a slightly changed version of yours:
dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
a <- a[,,!duplicated(a, MARGIN = 3)]
nr <- dim(a)[3] #number of different matrices
idx <- dimnames(a)[[3]] # indices of left over matrices
I don't know if this is exactly what you want, but here is a way you can extract the indices where the matrices are the same. More processing may be necessary to get exactly what you want.
#DATA
n <- 10
a <- array(round(rnorm(2*2*n),1), c(2,2,n))
a[, , 9] = a[, , 1]
temp = unique(apply(X = sapply(1:dim(a)[3], function(i)
                        sapply(1:dim(a)[3], function(j) identical(a[, , i], a[, , j]))),
                    MARGIN = 1,
                    FUN = function(x) sort(which(x))))
temp[lengths(temp) > 1]
#[[1]]
#[1] 1 9

Find the second minimum value in R [duplicate]

R offers max and min, but I do not see a really fast way to find another value in the order, apart from sorting the whole vector and then picking a value x from this vector.
Is there a faster way to get the second highest value, for example?
Use the partial argument of sort(). For the second highest value:
n <- length(x)
sort(x,partial=n-1)[n-1]
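A quick usage example (borrowing the small vector from the next answer):
x <- c(12.45, 34, 4, 0, -234, 45.6, 4)
n <- length(x)
sort(x, partial = n - 1)[n - 1]
# [1] 34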
Slightly slower alternative, just for the records:
x <- c(12.45,34,4,0,-234,45.6,4)
max( x[x!=max(x)] )
min( x[x!=min(x)] )
Rfast has a function called nth that does exactly what you ask.
Further, the methods discussed above that are based on partial sort don't support finding the k smallest values.
Update (28/FEB/21): the kit package offers a faster implementation (topn); see https://stackoverflow.com/a/66367996/4729755 and https://stackoverflow.com/a/53146559/4729755.
Disclaimer: An issue appears to occur when dealing with integers, which can be bypassed by using as.numeric (e.g. Rfast::nth(as.numeric(1:10), 2)); it will be addressed in the next update of Rfast.
Rfast::nth(x, 5, descending = T)
will return the 5th largest element of x, while
Rfast::nth(x, 5, descending = F)
will return the 5th smallest element of x.
Benchmarks below against most popular answers.
For 10 thousand numbers:
N = 10000
x = rnorm(N)
maxN <- function(x, N = 2){
  len <- length(x)
  if(N > len){
    warning('N greater than length(x). Setting N=length(x)')
    N <- length(x)
  }
  sort(x, partial = len - N + 1)[len - N + 1]
}
microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = T),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = T)[5]])
Unit: microseconds
expr min lq mean median uq max neval
Rfast 160.364 179.607 202.8024 194.575 210.1830 351.517 100
maxN 396.419 423.360 559.2707 446.452 487.0775 4949.452 100
order 1288.466 1343.417 1746.7627 1433.221 1500.7865 13768.148 100
For 1 million numbers:
N = 1e6
x = rnorm(N)
microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = T),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = T)[5]])
Unit: milliseconds
expr min lq mean median uq max neval
Rfast 89.7722 93.63674 114.9893 104.6325 120.5767 204.8839 100
maxN 150.2822 207.03922 235.3037 241.7604 259.7476 336.7051 100
order 930.8924 968.54785 1005.5487 991.7995 1031.0290 1164.9129 100
I wrapped Rob's answer up into a slightly more general function, which can be used to find the 2nd, 3rd, 4th (etc.) max:
maxN <- function(x, N = 2){
  len <- length(x)
  if(N > len){
    warning('N greater than length(x). Setting N=length(x)')
    N <- length(x)
  }
  sort(x, partial = len - N + 1)[len - N + 1]
}
maxN(1:10)
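With the default N = 2 this returns the second largest value of 1:10:
# [1] 9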
Here is an easy way to find the indices of the N smallest/largest values in a vector (example for N = 3):
N <- 3
N Smallest:
ndx <- order(x)[1:N]
N Largest:
ndx <- order(x, decreasing = T)[1:N]
So you can extract the values as:
x[ndx]
For nth highest value,
sort(x, TRUE)[n]
Here you go... kit is the obvious winner!
N = 1e6
x = rnorm(N)
maxN <- function(x, N = 2){
  len <- length(x)
  if(N > len){
    warning('N greater than length(x). Setting N=length(x)')
    N <- length(x)
  }
  sort(x, partial = len - N + 1)[len - N + 1]
}
microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = T),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = T)[5]],
  kit = x[kit::topn(x, 5L, decreasing = T)[5L]]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Rfast 12.311168 12.473771 16.36982 12.702134 16.110779 102.749873 100
# maxN 12.922118 13.124358 17.49628 18.977537 20.053139 28.928694 100
# order 50.443100 50.926975 52.54067 51.270163 52.323116 66.561606 100
# kit 1.177202 1.216371 1.29542 1.240228 1.297286 2.771715 100
Edit: I forgot that kit::topn has a hasna option... let's do another run.
microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = T),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = T)[5]],
  kit = x[kit::topn(x, 5L, decreasing = T)[5L]],
  kit2 = x[kit::topn(x, 5L, decreasing = T, hasna = F)[5L]],
  unit = "ms"
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Rfast 13.194314 13.358787 14.7227116 13.4560340 14.551194 24.524105 100
# maxN 7.378960 7.527661 10.0747803 7.7119715 12.217756 67.409526 100
# order 50.088927 50.488832 52.4714347 50.7415680 52.267003 70.062662 100
# kit 1.180698 1.217237 1.2975441 1.2429790 1.278243 3.263202 100
# kit2 0.842354 0.876329 0.9398055 0.9109095 0.944407 2.135903 100
Here is the simplest way I found:
num <- c(5665,1615,5154,65564,69895646)
num <- sort(num, decreasing = F)
tail(num, 1) # Highest number
head(tail(num, 2),1) # Second Highest number
head(tail(num, 3),1) # Third Highest number
head(tail(num, n),1) # General equation for finding the nth Highest number
I found that removing the max element first and then doing another max runs at a comparable speed:
system.time({a=runif(1000000);m=max(a);i=which.max(a);b=a[-i];max(b)})
user system elapsed
0.092 0.000 0.659
system.time({a=runif(1000000);n=length(a);sort(a,partial=n-1)[n-1]})
user system elapsed
0.096 0.000 0.653
dplyr has the function nth, where the first argument is the vector and the second is the position you want. This works with repeated elements as well.
For example:
x = c(1,2, 8, 16, 17, 20, 1, 20)
Finding the second largest value:
nth(unique(x),length(unique(x))-1)
[1] 17
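One caveat worth adding (my note, not part of the original answer): nth() simply picks the element at the given position, so the call above works because unique(x) happens to already be in increasing order here. For an arbitrary vector, sort first:
nth(sort(unique(x)), length(unique(x)) - 1)
# [1] 17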
When I was recently looking for an R function that returns the indices of the top N max/min values in a given vector, I was surprised that there is no such function.
And this is something very similar.
The brute force solution using the base::sort function seems to be the easiest one.
topMaxUsingFullSort <- function(x, N) {
  sort(x, decreasing = TRUE)[1:min(N, length(x))]
}
But it is not the fastest one in case your N value is relatively small compared to the length of the vector x.
On the other hand, if N is really small, you can use the base::which.max function iteratively, replacing the found value with -Inf in each iteration:
# the input vector 'x' must not contain -Inf values
topMaxUsingWhichMax <- function(x, N) {
  vals <- c()
  for(i in 1:min(N, length(x))) {
    idx <- which.max(x)
    vals <- c(vals, x[idx]) # copy-on-modify (not an issue: 'vals' is a relatively small vector)
    x[idx] <- -Inf          # copy-on-modify (this is the issue: the data vector could be huge)
  }
  vals
}
I believe you see the problem: the copy-on-modify nature of R. This will perform better for very small N (1, 2, 3), but it will rapidly slow down for larger N values, and it scans all elements of the vector x in each of the N iterations.
I think the best solution in clean R is to use base::sort with the partial argument.
topMaxUsingPartialSort <- function(x, N) {
  N <- min(N, length(x))
  x[x >= -sort(-x, partial = N)[N]][1:N]
}
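A quick check of the partial-sort version on a made-up vector:
topMaxUsingPartialSort(c(5, 1, 9, 7, 3), 3)
# [1] 5 9 7
It returns the three largest values in their original order (not sorted).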
Then you can select the last (Nth) item from the result of the functions defined above.
Note: the functions defined above are just examples; if you want to use them, you have to check/sanitize the inputs (e.g. N > length(x)).
I wrote a small article about something very similar (getting the indices of the top N max/min values of a vector) at http://palusga.cz/?p=18 - there you can find some benchmarks of functions similar to those defined above.
head(sort(x),..) or tail(sort(x),...) should work
This will find the index of the N'th smallest or largest value in the input numeric vector x. Set bottom=TRUE in the arguments if you want the N'th from the bottom, or bottom=FALSE if you want the N'th from the top. N=1 and bottom=TRUE is equivalent to which.min, N=1 and bottom=FALSE is equivalent to which.max.
FindIndicesBottomTopN <- function(x = c(4,-2,5,-77,99), N = 1, bottom = FALSE)
{
  k1 <- rank(x)
  if(bottom == TRUE){
    Nindex <- which(k1 == N)
    Nindex <- Nindex[1]
  }
  if(bottom == FALSE){
    Nindex <- which(k1 == (length(x) + 1 - N))
    Nindex <- Nindex[1]
  }
  return(Nindex)
}
topn = function(vector, n){
  maxs = c()
  ind = c()
  idx_left = seq_along(vector)   # positions still remaining from the original vector
  for (i in 1:n){
    biggest = which.max(vector)
    ind[i] = idx_left[biggest]   # map back to the position in the original vector
    maxs[i] = vector[biggest]
    vector = vector[-biggest]
    idx_left = idx_left[-biggest]
  }
  mat = cbind(maxs, ind)
  return(mat)
}
This function will return a matrix with the top n values and their indices in the original vector.
Hope it helps.
VDevi-Chou
You can identify the next higher value with cummax(). If you want the location of each new higher value, you can pass your vector of cummax() values to the diff() function to identify the locations at which the cummax() value changed. Say we have the vector
v <- c(4,6,3,2,-5,6,8,12,16)
cummax(v) will give us the vector
4 6 6 6 6 6 8 12 16
Now, if you want to find the location of a change in cummax() you have many options; I tend to use sign(diff(cummax(v))). You have to adjust for the lost first element because of diff(). The complete code for vector v would be:
which(sign(diff(cummax(v)))==1)+1
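For the example vector v this evaluates to (my run):
# [1] 2 7 8 9
i.e. the running maximum increases at positions 2, 7, 8 and 9.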
You can use the sort function like this:
sort(unique(c), decreasing = TRUE)[1:N]
Example:
c <- c(4,2,44,2,1,45,34,2,4,22,244)
sort(unique(c), decreasing = TRUE)[1:5]
will give the first 5 max numbers.

Multiple replacements within a matrix in a single line in R

Is there a way to do the following replacement in a single line in R? If possible, would it be more/less efficient?
m <- matrix(rnorm(100), ncol=10)
threshold <- 0.5
# Is there a single-line way to do the following in R
m[m < threshold] <- 0
m[m >= threshold] <- 1
I'm wondering if the ifelse() function can accommodate this, in the sense of if < threshold then 0, else 1
Since you want a vector of 1s and 0s, you could just reverse your condition, convert the logical values to integer, and create a new matrix with the same dimensions as m.
matrix(as.integer(m >= threshold), nrow(m))
You could also just change the matrix's mode. Normally changing modes would be done in two lines, but you can do it in one with
`mode<-`(m >= threshold, "integer")
Additionally, as @nicola points out, the quick and dirty method is
(m >= threshold) + 0L
By adding the zero integer we coerce the entire matrix to integer.
A couple of others (thanks @Frank):
+(m >= threshold)
m[] <- m >= threshold
So basically, yes. All these perform the task in one line and I can almost guarantee they are all faster than ifelse().
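For reference, the ifelse() version the question alludes to is also a one-liner (ifelse() keeps the attributes of its test argument, so it returns a matrix, but it allocates several intermediate objects and is typically slower than the coercions above):
ifelse(m < threshold, 0, 1)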
Some benchmarks on a larger matrix (with the replacement method left out):
m <- matrix(rnorm(1e7), ncol=100)
threshold <- 0.5
library(microbenchmark)
microbenchmark(
  matrix = matrix(as.integer(m >= threshold), nrow(m)),
  mode = `mode<-`(m >= threshold, "integer"),
  plus0 = (m >= threshold) + 0L,
  unary = +(m >= threshold)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# matrix 295.9292 315.4463 351.9149 351.8144 379.9840 453.4915 100
# mode 163.2156 172.0180 208.9348 202.8014 232.4525 347.0616 100
# plus0 170.2059 177.6111 202.3536 192.3516 223.8284 294.8367 100
# unary 144.0128 150.2696 183.2914 173.4010 203.7955 382.2397 100
For the sake of completeness, here is a benchmark on the replacement method using times = 1.
microbenchmark(
  replacement = { m[] <- m >= threshold },
  times = 1
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# replacement 499.4005 499.4005 499.4005 499.4005 499.4005 499.4005 1

More efficient strategy for which() or match()

I have a vector of positive and negative numbers
vec<-c(seq(-100,-1), rep(0,20), seq(1,100))
The vector is larger than this example and takes on a random set of values. I have to repeatedly find the number of negative numbers in the vector... I am finding this quite inefficient.
Since I only need to find the number of negative numbers, and the vector is sorted, I only need to know the index of the first 0 or positive number (there may be no 0s in the actual random vectors).
Currently I am using this code to find the length
length(which(vec<0))
but this forces R to go through the entire vector, even though, since it is sorted, there is no need to.
I could use
match(0, vec)
but my vector does not always have 0s
So my question is, is there some kind of match() function that applies a condition instead of finding a specific value? Or is there a more efficient way to run my which() code?
The solutions offered so far all imply creating a logical(length(vec)) and doing a full or partial scan on this. As you note, the vector is sorted. We can exploit this by doing a binary search. I started thinking I'd be super-clever and implement this in C for even greater speed, but had trouble with debugging the indexing of the algorithm (which is the tricky part!). So I wrote it in R:
f3 <- function(x) {
  imin <- 1L
  imax <- length(x)
  while (imax >= imin) {
    imid <- as.integer(imin + (imax - imin) / 2)
    if (x[imid] >= 0)
      imax <- imid - 1L
    else
      imin <- imid + 1L
  }
  imax
}
For comparison with the other suggestions
f0 <- function(v) length(which(v < 0))
f1 <- function(v) sum(v < 0)
f2 <- function(v) which.min(v < 0) - 1L
and for fun
library(compiler)
f3.c <- cmpfun(f3)
Leading to
> vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
> identical(f0(vec), f1(vec))
[1] TRUE
> identical(f0(vec), f2(vec))
[1] TRUE
> identical(f0(vec), f3(vec))
[1] TRUE
> identical(f0(vec), f3.c(vec))
[1] TRUE
> microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
Unit: microseconds
expr min lq median uq max neval
f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903 100
f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293 100
f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889 100
f3(vec) 51.715 56.050 75.4495 78.5295 100.730 100
f3.c(vec) 11.612 17.147 28.5570 31.3160 49.781 100
Probably there are some tricky edge cases that I've got wrong! Moving to C, I did
library(inline)
f4 <- cfunction(c(x = "numeric"), "
  int imin = 0, imax = Rf_length(x) - 1, imid;
  while (imax >= imin) {
    imid = imin + (imax - imin) / 2;
    if (REAL(x)[imid] >= 0)
      imax = imid - 1;
    else
      imin = imid + 1;
  }
  return ScalarInteger(imax + 1);
")
with
> identical(f3(vec), f4(vec))
[1] TRUE
> microbenchmark(f3(vec), f3.c(vec), f4(vec))
Unit: nanoseconds
expr min lq median uq max neval
f3(vec) 52096 53192.0 54918.5 55539.0 69491 100
f3.c(vec) 10924 12233.5 12869.0 13410.0 20038 100
f4(vec) 553 796.0 893.5 1004.5 2908 100
findInterval came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec is actually sorted and dealing with NA values. If one wants to live on the edge (arguably no worse than implementing f3 or f4) then
f5.i <- function(v)
  .Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))
is nearly as fast as the C implementation, but likely more robust and vectorized (i.e., look up a vector of values in the second argument, for easy range-like calculations).
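For reference (my note, with a name I am introducing here), the safe exported equivalent of f5.i is simply
f5 <- function(v) findInterval(0 - .Machine$double.neg.eps, v)
which checks its inputs (e.g. that the vector is sorted), at some cost in speed relative to the .Internal call.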
Use sum() and logical comparison:
sum( vec < 0 )
[1] 100
This will be pretty quick, and when you sum a logical, TRUE is 1 and FALSE is 0 so the total will be the number of negative values.
Uh oh, I feel the need for a benchmarking comparison... :-) Vector length is 2e5
library(microbenchmark)
vec<-c(seq(-100,-1,length.out=1e5), rep(0,20), seq(1,100,length.out=1e5))
microbenchmark( (which.min(vec < 0) - 1L) , (sum( vec < 0 )) )
Unit: milliseconds
expr min lq median uq max neval
(which.min(vec < 0) - 1L) 1.883847 2.130746 2.554725 3.141787 75.943911 100
(sum(vec < 0)) 1.398100 1.500639 1.508688 1.745088 2.662164 100
You could use which.min
which.min(vec < 0) - 1L
This will return the index of the first FALSE value, i.e. the first non-negative element, so subtracting 1 gives the number of negative values.
