Faster way to find the first TRUE value in a vector - r

In one function I very often need to use code like:
which(x==1)[1]
which(x>1)[1]
x[x>10][1]
where x is a numeric vector. summaryRprof() shows that I spend >80% of the time on relational operators. I wonder if there is a function that does comparison only till the first TRUE value is reached to speed up my code. For-loop is slower than the options provided above.

I don't know of a pure R way to do this, so I wrote a C function to do it for the quantstrat package. This function was written with a specific purpose in mind, so it's not as general as I would like. For example, you may notice that it only works on real/double/numeric data, so be sure to coerce Data to that before calling the .firstCross function.
#include <R.h>
#include <Rinternals.h>
SEXP firstCross(SEXP x, SEXP th, SEXP rel, SEXP start)
{
int i, int_rel, int_start;
double *real_x=NULL, real_th;
if(ncols(x) > 1)
error("only univariate data allowed");
/* this currently only works for real x and th arguments
* support for other types may be added later */
real_th = asReal(th);
int_rel = asInteger(rel);
int_start = asInteger(start)-1;
switch(int_rel) {
case 1: /* > */
real_x = REAL(x);
for(i=int_start; i<nrows(x); i++)
if(real_x[i] > real_th)
return(ScalarInteger(i+1));
break;
case 2: /* < */
real_x = REAL(x);
for(i=int_start; i<nrows(x); i++)
if(real_x[i] < real_th)
return(ScalarInteger(i+1));
break;
case 3: /* == */
real_x = REAL(x);
for(i=int_start; i<nrows(x); i++)
if(real_x[i] == real_th)
return(ScalarInteger(i+1));
break;
case 4: /* >= */
real_x = REAL(x);
for(i=int_start; i<nrows(x); i++)
if(real_x[i] >= real_th)
return(ScalarInteger(i+1));
break;
case 5: /* <= */
real_x = REAL(x);
for(i=int_start; i<nrows(x); i++)
if(real_x[i] <= real_th)
return(ScalarInteger(i+1));
break;
default:
error("unsupported relationship operator");
}
/* return number of observations if relationship is never TRUE */
return(ScalarInteger(nrows(x)));
}
And here's the R function that calls it:
.firstCross <- function(Data, threshold=0, relationship, start=1) {
rel <- switch(relationship[1],
'>' = ,
'gt' = 1,
'<' = ,
'lt' = 2,
'==' = ,
'eq' = 3,
'>=' = ,
'gte' = ,
'gteq' = ,
'ge' = 4,
'<=' = ,
'lte' = ,
'lteq' = ,
'le' = 5)
.Call('firstCross', Data, threshold, rel, start)
}
Some benchmarks, just for fun.
> library(quantstrat)
> library(microbenchmark)
> firstCross <- quantstrat:::.firstCross
> set.seed(21)
> x <- rnorm(1e6)
> microbenchmark(which(x > 3)[1], firstCross(x,3,">"), times=10)
Unit: microseconds
expr min lq median uq max neval
which(x > 3)[1] 9482.081 9578.072 9597.3870 9690.448 9820.176 10
firstCross(x, 3, ">") 11.370 11.675 31.9135 34.443 38.614 10
> which(x>3)[1]
[1] 919
> firstCross(x,3,">")
[1] 919
Note that firstCross will yield a larger relative speedup the larger Data is (because R's relational operators have to finish comparing the entire vector).
> x <- rnorm(1e7)
> microbenchmark(which(x > 3)[1], firstCross(x,3,">"), times=10)
Unit: microseconds
expr min lq median uq max neval
which(x > 3)[1] 94536.21 94851.944 95799.857 96154.756 113962.794 10
firstCross(x, 3, ">") 5.08 5.507 25.845 32.164 34.183 10
> which(x>3)[1]
[1] 97
> firstCross(x,3,">")
[1] 97
...and it won't be appreciably faster if the first TRUE value is near the end of the vector.
> microbenchmark(which(x==last(x))[1], firstCross(x,last(x),"eq"),times=10)
Unit: milliseconds
expr min lq median uq max neval
which(x == last(x))[1] 92.56311 93.85415 94.38338 98.18422 106.35253 10
firstCross(x, last(x), "eq") 86.55415 86.70980 86.98269 88.32168 92.97403 10
> which(x==last(x))[1]
[1] 10000000
> firstCross(x,last(x),"eq")
[1] 10000000

Base R provides Position and Find for locating the first index and value, respectively, for which a predicate returns a true value. These higher-order functions return immediately upon the first hit.
f<-function(x) {
r<-vector("list",3)
r[[1]]<-which(x==1)[1]
r[[2]]<-which(x>1)[1]
r[[3]]<-x[x>10][1]
return(r)
}
p<-function(f,b) function(a) f(a,b)
g<-function(x) {
r<-vector("list",3)
r[[1]]<-Position(p(`==`,1),x)
r[[2]]<-Position(p(`>`,1),x)
r[[3]]<-Find(p(`>`,10),x)
return(r)
}
The relative performance depends greatly on the probability of finding a hit early relative to the cost of the predicate vs the overhead of Position/Find.
library(microbenchmark)
set.seed(1)
x<-sample(1:100,1e5,replace=TRUE)
microbenchmark(f(x),g(x))
Unit: microseconds
expr min lq mean median uq max neval cld
f(x) 5034.283 5410.1205 6313.861 5798.4780 6948.5675 26735.52 100 b
g(x) 587.463 650.4795 1013.183 734.6375 950.9845 20285.33 100 a
y<-rep(0,1e5)
microbenchmark(f(y),g(y))
Unit: milliseconds
expr min lq mean median uq max neval cld
f(y) 3.470179 3.604831 3.791592 3.718752 3.866952 4.831073 100 a
g(y) 131.250981 133.687454 137.199230 134.846369 136.193307 177.082128 100 b

That's a nice question and answer... just to add any() is no faster than which() or match()but both are quicker than [] which I guess may create a big vector useless T,F's. So I'm guessing No ..short of the answer above.
v=rep('A', 10e6)
v[5e6]='B'
v[10e6]='B'
microbenchmark(which(v=='B')[1])
Unit: milliseconds
expr min lq median uq max neval
which(v == "B")[1] 332.3788 337.6718 344.4076 347.1194 503.4022 100
microbenchmark(any(v=='B'))
Unit: milliseconds
expr min lq median uq max neval
any(v == "B") 334.4466 335.114 335.6714 347.5474 356.0261 100
microbenchmark(v[v=='B'][1])
Unit: milliseconds
expr min lq median uq max neval
v[v == "B"][1] 601.5923 605.3331 609.191 612.0689 707.1409 100
microbenchmark(match("B", v))
Unit: milliseconds
expr min lq median uq max neval
match("B", v) 339.2872 344.7648 350.5444 359.6746 915.6446 100
Any other ideas out there?

Related

Count number of palindromes within a string

I have written the below code to count the number of palindromic strings in a given string:
countPalindromes <- function(str){
len <- nchar(str)
count <- 0
for(i in 1:len){
for(j in i:len){
subs <- substr(str, i, j)
rev <- paste(rev(substring(subs, 1:nchar(subs), 1:nchar(subs))), collapse = "")
if(subs == rev){
count <- count + 1
}
}
}
count
}
This is actually working fine but the code needs to be optimized in such a way so that it executes at a faster rate.
Please suggest some ways to optimize this piece of code.
Here's a solution that uses the wonderful stringi package - just as Andre suggested - together with a wee bit of vectorization.
cp <- function(s) {
lenstr <- stri_length(s) # Get the length
res <- sapply(1:lenstr, function(i) {
# Get all substrings
sub_string <- stringi::stri_sub(s, i, i:lenstr)
# Count matches
sum((sub_string == stringi::stri_reverse(sub_string)))
})
sum(res)
}
This should give the same result as your function
> cp("enafdemderredmedfane")
[1] 30
> countPalindromes("enafdemderredmedfane")
[1] 30
There is not much speedup for short strings, but for longer strings you can really see a benefit:
> microbenchmark::microbenchmark(countPalindromes("howdoyoudo"), cp("howdoyoudo"))
Unit: microseconds
expr min lq mean median uq max neval cld
countPalindromes("howdoyoudo") 480.979 489.6180 508.9044 494.9005 511.201 662.605 100 b
cp("howdoyoudo") 156.117 163.1555 175.4785 169.5640 179.993 324.145 100 a
Compared to
> microbenchmark::microbenchmark(countPalindromes("enafdemderredmedfane"), cp("enafdemderredmedfane"))
Unit: microseconds
expr min lq mean median uq max neval cld
countPalindromes("enafdemderredmedfane") 2031.565 2115.0305 2475.5974 2222.354 2384.151 6696.484 100 b
cp("enafdemderredmedfane") 324.991 357.6055 430.8334 387.242 478.183 1298.390 100 a
Working with a vector the process is faster, I am thinking of eliminating the double for, but I can not find an efficient way.
countPalindromes_new <- function(str){
len <- nchar(str)
strsp <- strsplit(str, "")[[1]]
count <- 0
for(i in 1:len){
for(j in i:len){
if(all(strsp[i:j] == strsp[j:i])){
count <- count + 1
}
}
}
count
}
> microbenchmark::microbenchmark(countPalindromes("howdoyoudo"), cp("howdoyoudo"), countPalindromes_new("howdoyoudo"))
Unit: microseconds
expr min lq mean median uq max neval
countPalindromes("howdoyoudo") 869.121 933.1215 1069.68001 963.201 1022.081 6712.751 100
cp("howdoyoudo") 192.000 202.8805 243.11972 219.308 258.987 477.441 100
countPalindromes_new("howdoyoudo") 49.068 53.3340 62.32815 57.387 63.574 116.481 100
> microbenchmark::microbenchmark(countPalindromes("enafdemderredmedfane"), cp("enafdemderredmedfane"), countPalindromes_new("enafdemderredmedfane"))
Unit: microseconds
expr min lq mean median uq max neval
countPalindromes("enafdemderredmedfane") 3578.029 3800.9620 4170.0888 3987.416 4173.6550 10205.445 100
cp("enafdemderredmedfane") 391.254 438.4010 609.8782 481.708 534.6135 6116.270 100
countPalindromes_new("enafdemderredmedfane") 200.534 214.1875 235.3501 223.148 245.5475 448.854 100
UPDATE (NEW VERSION WIHTOUT LEN 1 COMPARASION):
countPalindromes_new2 <- function(str){
len <- nchar(str)
strsp <- strsplit(str, "")[[1]]
count <- len
for(i in 1:(len-1)){
for(j in (i + 1):len){
if(all(strsp[i:j] == strsp[j:i])){
count <- count + 1
}
}
}
count
}
Simply: normally I'm against using new libraries everywhere. But stringi is THE library for working with strings in R.
string_vec <- c("anna","nothing","abccba")
string_rev <- stringi::stri_reverse(string_vec)
sum(string_vec == string_rev)
#evals 2

Vectorizing double summations using R

I am struggling with translating this function into R using via vectorization technique:
Where all I have been able to do so far is this:
c <- matrix(1:9, 3)
z <- 1:3
sum(abs(outer(z, z,"-")) * c)/sum(c)
But I don't think its necessarily correct. I tried a for-loop version but that is too long and my answer is likely wrong anyway. Anyone keen on this? What am I missing (or doing wrong)? Any help would be appreciated.
Here's a double-loop version:
q =
function(z,c){
num = 0
for(i in 1:length(z)){
for(j in 1:length(z)){
num = num + abs(z[i]-z[j]) * c[i,j]
}
}
num/sum(c)
}
Here's your vectorised version, functionised:
q2 =
function(z,c){sum(c*abs(outer(z,z,'-')) /sum(c))}
Not a great difference in timing between them really for a small matrix:
> microbenchmark::microbenchmark(q(z,c), q2(z,c))
Unit: microseconds
expr min lq mean median uq max neval cld
q(z, c) 15.368 15.7505 16.59644 16.0225 16.6290 30.346 100 b
q2(z, c) 12.232 12.8885 13.79178 13.2225 13.6585 44.085 100 a
But for a larger test it's a big win:
> c2 = matrix(runif(100*100),100,100)
> z2 = runif(100)
> microbenchmark::microbenchmark(q(z2,c2), q2(z2,c2))
Unit: microseconds
expr min lq mean median uq max neval cld
q(z2, c2) 7437.031 7588.131 8046.92272 7794.927 8332.104 10729.799 100 b
q2(z2, c2) 74.742 78.647 94.20153 86.113 100.125 188.428 100 a
>
Numeric difference is within floating point tolerance:
> q(z2,c2) - q2(z2,c2)
[1] 6.661338e-16
So unless anyone has faster code, I'd stick with what you've got.
As perfectly explained by #Spacedman, your approach is very efficient, but if you still want to go faster you could try Rcpp :
library(Rcpp)
sourceCpp(code='
#include <Rcpp.h>
// [[Rcpp::export]]
double qRcpp(const Rcpp::NumericVector z, const Rcpp::NumericMatrix cm){
int zlen = z.length();
if(!(zlen == cm.nrow() && cm.nrow() == cm.ncol()))
Rcpp::stop("Invalid sizes");
double num = 0;
for(int i = 0 ; i < zlen ; i++){
for(int j = 0 ; j < zlen ; j++){
num = num + std::abs(z[i]-z[j]) * cm(i,j);
}
}
return num / Rcpp::sum(cm);
}
')
Benchmark :
c2 = matrix(runif(100*100),100,100)
z2 = runif(100)
microbenchmark::microbenchmark(q(z2,c2), q2(z2,c2),qRcpp(z2,c2))
# Unit: microseconds
# expr min lq mean median uq max neval
# q(z2, c2) 10273.035 10976.3050 11680.85554 11348.763 11765.2010 44115.632 100
# q2(z2, c2) 64.292 67.9455 80.56427 75.543 86.3565 244.019 100
# qRcpp(z2, c2) 21.042 21.9180 25.30515 24.256 26.8860 56.403 100

R: fast method to check if integer value is in sorted integer vector and return its index

Lets say I have vector x that:
is very large ( > 200 000 )
is integer
is sorted
all of it's values are unique
I would like to check if an integer value y is in this vector, and if it is, I would like to get the index of it. I would like to take advantage of the fact, that vector is sorted, so it can be done fast.
How would I accomplish such thing?
Here's some data
set.seed(123)
x = sort(unique(floor(runif(1e6, 1, 1e7))))
y = sample(1e7, 10000)
And a couple of approaches
f0 = function(y, vec) y %in% vec
f1 = function(y, vec) vec[findInterval(y, vec)] == y
The %in% does a full scan; findInterval() does a binary search (I think). They generate the same result
> identical(f0(y, x), f1(y, x))
[1] TRUE
And have approximately similar amortized performance
> library(microbenchmark)
> microbenchmark(f0(y, x), f1(y, x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f0(y, x) 99.35425 100.87319 102.32160 102.20107 103.67718 105.70854 10
f1(y, x) 94.83219 95.05068 95.93625 95.77822 96.72601 97.50961 10
But findInterval() is I think faster for small queries
> microbenchmark(f0(y[1:10], x), f1(y[1:10], x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f0(y[1:10], x) 83.441578 85.116818 86.264751 86.07515 87.13516 89.430801 10
f1(y[1:10], x) 7.731606 7.734207 7.757201 7.75199 7.77210 7.810957 10
David suggests (I think)
f2 = function(x, vec) vec[which.max(x == vec)] == x
which.max() is only good for scalar y, which is seldom (saying this for the benefit of OP) a good use of R. It appears less performant than findInterval()
> microbenchmark(f1(x[1000], x), f2(x[1000], x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f1(x[1000], x) 7.707420 7.709047 7.714576 7.711979 7.718953 7.729688 10
f2(x[1000], x) 9.353225 9.358874 9.381781 9.378680 9.400808 9.426102 10
Contrary to #Laterow I don't see any particular performance difference between which() and which.max() (in current R-devel or R-3-2-branch; also, the results aren't the same, so it's an apples-to-oranges comparison). I have a vague recollection of an R-devel conversation about this in the last 6 months...
> set.seed(123) ; x <- sample(2e5, replace = TRUE)
> microbenchmark(which.max(x == 1e7), which(x == 1e7)[1])
Unit: milliseconds
expr min lq mean median uq max
which.max(x == 1e+07) 4.240606 4.266470 5.975966 5.015947 5.217903 43.78467
which(x == 1e+07)[1] 4.060040 4.132667 5.550078 4.986287 5.059128 43.88074
neval
100
100
Performance of which versus which.max might have changed with this commit, where previously which.max() would coerce logical to numeric vectors before the scan, triggering a copy.

Make cumulative sum faster

I'm trying to take cumulative sums for each column of a matrix. Here's my code in R:
testMatrix = matrix(1:65536, ncol=256);
microbenchmark(apply(testMatrix, 2, cumsum), times=100L);
Unit: milliseconds
expr min lq mean median uq max neval
apply(testMatrix, 2, cumsum) 1.599051 1.766112 2.329932 2.15326 2.221538 93.84911 10000
I used Rcpp for comparison:
cppFunction('NumericMatrix apply_cumsum_col(NumericMatrix m) {
for (int j = 0; j < m.ncol(); ++j) {
for (int i = 1; i < m.nrow(); ++i) {
m(i, j) += m(i - 1, j);
}
}
return m;
}');
microbenchmark(apply_cumsum_col(testMatrix), times=10000L);
Unit: microseconds
expr min lq mean median uq max neval
apply_cumsum_col(testMatrix) 205.833 257.719 309.9949 265.986 276.534 96398.93 10000
So the C++ code is 7.5 times as fast. Is it possible to do better than apply(testMatrix, 2, cumsum) in pure R? It feels like I have an order of magnitude overhead for no reason.
It is difficult to beat C++ with just R code. The fastest way I can think of doing it is if you are willing to split your matrix in to a list. That way, R is using primitive functions and doesn't copy the object with each iteration (apply is essentially a pretty loop). You can see that C++ still wins out but there is a significant speedup with the list approach if you really just want to use R code.
fun1 <- function(){
apply(testMatrix, 2, cumsum)
}
testList <- split(testMatrix, col(testMatrix))
fun2 <- function(){
lapply(testList, cumsum)
}
microbenchmark(fun1(),
fun2(),
apply_cumsum_col(testMatrix),
times=100L)
Unit: microseconds
expr min lq mean median uq max neval
fun1() 3298.534 3411.9910 4376.4544 3477.608 3699.2485 9249.919 100
fun2() 558.800 596.0605 766.2377 630.841 659.3015 5153.100 100
apply_cumsum_col(testMatrix) 219.651 282.8570 576.9958 311.562 339.5680 4915.290 100
EDIT
Please note that this method is slower than fun1 if you include the time to split the matrix in to a list.
Using a byte-compiled for loop is slightly faster than the apply call on my system. I expected it to be faster because it does less work than apply. As expected, the R loop is still slower than the simple C++ function you wrote.
colCumsum <- compiler::cmpfun(function(x) {
for (i in 1:ncol(x))
x[,i] <- cumsum(x[,i])
x
})
testMatrix <- matrix(1:65536, ncol=256)
m <- testMatrix
require(microbenchmark)
microbenchmark(colCumsum(m), apply_cumsum_col(m), apply(m, 2, cumsum), times=100L)
# Unit: microseconds
# expr min lq median uq max neval
# matrixCumsum(m) 1478.671 1540.5945 1586.1185 2199.9530 37377.114 100
# apply_cumsum_col(m) 178.214 192.4375 204.3905 234.8245 1616.030 100
# apply(m, 2, cumsum) 1879.850 1940.1615 1991.3125 2745.8975 4346.802 100
all.equal(colCumsum(m), apply(m, 2, cumsum))
# [1] TRUE
Maybe it is to late but I will write my answer so anyone else can see it.
First of all, in your C++ code you need to clone you matrix otherwise you are write into R's memory and it is forbiden by CRAN. So your code becomes:
rcpp_apply<-cppFunction('NumericMatrix apply_cumsum_col(NumericMatrix m) {
NumericMatrix g=clone(m);
for (int j = 0; j < m.ncol(); ++j) {
for (int i = 1; i < m.nrow(); ++i) {
g(i, j) += g(i - 1, j);
}
}
return g;
}');
Since your matrix is typeof integer then you can change your C++'s argument to be IntegerMatrix.
rcpp_apply_integer<-cppFunction('IntegerMatrix apply_cumsum_col(IntegerMatrix m) {
NumericMatrix g=clone(m);
for (int j = 0; j < m.ncol(); ++j) {
for (int i = 1; i < m.nrow(); ++i) {
g(i, j) += g(i - 1, j);
}
}
return g;
}');
This impoved the code about 2 times. Here is a benchmark:
microbenchmark::microbenchmark(R=apply(testMatrix, 2, cumsum),Rcpp=rcpp_apply(testMatrix),Rcpp_integer=rcpp_apply_integer(testMatrix), times=10)
Unit: microseconds
expr min lq mean median uq max neval
R 1552.217 1706.165 1770.1264 1740.0345 1897.884 1940.989 10
Rcpp 502.900 523.838 637.7188 665.0605 699.134 743.471 10
Rcpp_integer 220.455 274.645 274.9327 275.8770 277.930 316.109 10
all.equal(rcpp_apply(testMatrix),rcpp_apply_integer(testMatrix))
[1] TRUE
If your matrix has large values then you have to use NumericMatrix.

R version of PHP function isset

I often find isset() from PHP very helpful. Trying to create an R version of it, I came up with the following:
isset <- function(x){
a <- try(x, silent = TRUE)
return(!inherits(a, "try-error") && !is.null(a) && (length(a) != 1L || !is.na(a)))
}
It's made by 3 checks:
Does the object exist?
If the case, is NULL?
If not, and is not a vector, is NA?
As written above isset is very slow:
1) Simple case, no need at all for isset
> microbenchmark(isset(NULL), isset(NA))
Unit: microseconds
expr min lq median uq max neval
isset(NULL) 11.867 12.509 12.830 13.310 62.54 100
isset(NA) 12.829 13.471 14.112 14.433 22.45 100
2) More useful, but still doable with is.null()
> test_var <- list(a = 1)
> microbenchmark(isset(test_var$b))
Unit: microseconds
expr min lq median uq max neval
isset(test_var$b) 12.509 13.15 13.791 14.112 112.892 100
3) Here it does help, but it also gets very slow
> rm(test_var)
> microbenchmark(isset(test_var), isset(test_var$b), try(test_var$b, silent = TRUE))
Unit: microseconds
expr min lq median uq max neval
isset(test_var) 736.038 764.1015 780.4575 815.5755 1223.844 100
isset(test_var$b) 737.001 764.7425 786.0700 815.0940 986.837 100
try(test_var$b, silent = TRUE) 732.832 760.2525 779.3345 815.4155 1034.944 100
Almost all the time is spent by try.
Q: Considering that it needs to return FALSE in each of the 3 cases above, how would you re-write isset to make it at least twice faster?
Use the built-in exists:
isset <- function(x) {
is_variable <- make.names(name <- deparse(substitute(x))) == name
if (is_variable && !exists(name)) return(FALSE)
!is.null(x) && (length(x) != 1L || !is.na(x))
}
# > microbenchmark(isset(test_var), isset(test_var$b))
# Unit: microseconds
# expr min lq median uq max neval
# isset(test_var) 28.011 28.9115 29.278 29.6445 41.960 100
# isset(test_var$b) 47.656 49.1515 49.690 50.4170 137.049 100

Resources