R version of PHP function isset

I often find isset() from PHP very helpful. Trying to create an R version of it, I came up with the following:
isset <- function(x) {
  a <- try(x, silent = TRUE)
  return(!inherits(a, "try-error") && !is.null(a) && (length(a) != 1L || !is.na(a)))
}
It performs three checks:
Does the object exist?
If so, is it NULL?
If not NULL and of length one, is it NA?
As written above, isset is very slow:
1) Simple case, no need at all for isset
> microbenchmark(isset(NULL), isset(NA))
Unit: microseconds
expr min lq median uq max neval
isset(NULL) 11.867 12.509 12.830 13.310 62.54 100
isset(NA) 12.829 13.471 14.112 14.433 22.45 100
2) More useful, but still doable with is.null()
> test_var <- list(a = 1)
> microbenchmark(isset(test_var$b))
Unit: microseconds
expr min lq median uq max neval
isset(test_var$b) 12.509 13.15 13.791 14.112 112.892 100
3) Here it does help, but it also gets very slow
> rm(test_var)
> microbenchmark(isset(test_var), isset(test_var$b), try(test_var$b, silent = TRUE))
Unit: microseconds
expr min lq median uq max neval
isset(test_var) 736.038 764.1015 780.4575 815.5755 1223.844 100
isset(test_var$b) 737.001 764.7425 786.0700 815.0940 986.837 100
try(test_var$b, silent = TRUE) 732.832 760.2525 779.3345 815.4155 1034.944 100
Almost all the time is spent in try.
Q: Considering that it needs to return FALSE in each of the three cases above, how would you rewrite isset to make it at least twice as fast?

Use the built-in exists:
isset <- function(x) {
  is_variable <- make.names(name <- deparse(substitute(x))) == name
  if (is_variable && !exists(name)) return(FALSE)
  !is.null(x) && (length(x) != 1L || !is.na(x))
}
# > microbenchmark(isset(test_var), isset(test_var$b))
# Unit: microseconds
# expr min lq median uq max neval
# isset(test_var) 28.011 28.9115 29.278 29.6445 41.960 100
# isset(test_var$b) 47.656 49.1515 49.690 50.4170 137.049 100
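A quick sanity check (a sketch; these calls mirror the three cases from the question and should all return FALSE):
isset(NULL)        # FALSE: the value exists but is NULL
isset(NA)          # FALSE: a length-one NA
test_var <- list(a = 1)
isset(test_var$b)  # FALSE: the list element is missing, so $ returns NULL
rm(test_var)
isset(test_var)    # FALSE: the variable itself does not exist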

Related

Reason to use dplyr::union_all over other row binding methods on a dataframe

Is there a reason to use dplyr::union_all on a dataframe to bind two dataframes together vs other ways to bind rows? When I test it there don't seem to be differences, but I don't know if there might be edge cases where this wouldn't be a good idea, or if there are reasons to choose it over another method. The documentation shows an example on a dataframe.
library(dplyr)
library(data.table)
library(microbenchmark)
data <- data.frame(a=rep('a', 10^6), b= rep(1, 10^6))
dt <- as.data.table(data)
microbenchmark(df1 <- dplyr::union_all(data, data))
microbenchmark(df2 <- rbind(data, data))
microbenchmark(df3 <- dplyr::bind_rows(data, data))
microbenchmark(df4 <- data.table::rbindlist(list(data, data)))
microbenchmark(df5 <- rbind(dt, dt))
all((df1 == df2) && (df2 == df3) && (df3 == as.data.frame(df4)) && (df4 == df5))
# [1] TRUE
They all seem roughly the same speed, except for rbind on a data.frame, which seems particularly slow. There doesn't seem to be a speed reason to choose union_all over other methods.
microbenchmark(df1 <- dplyr::union_all(data, data))
Unit: milliseconds
expr min lq mean median uq max neval
df1 <- dplyr::union_all(data, data) 8.501586 10.19703 13.77899 11.62611 18.16747 25.73479 100
microbenchmark(df2 <- rbind(data, data))
Unit: milliseconds
expr min lq mean median uq max neval
df2 <- rbind(data, data) 48.4319 50.98856 63.70163 52.65343 61.17889 180.8519 100
microbenchmark(df3 <- dplyr::bind_rows(data, data))
Unit: milliseconds
expr min lq mean median uq max neval
df3 <- dplyr::bind_rows(data, data) 9.121883 10.36146 13.38456 11.13614 12.04666 127.5304 100
microbenchmark(df4 <- data.table::rbindlist(list(data, data)))
Unit: milliseconds
expr min lq mean median uq max neval
df4 <- data.table::rbindlist(list(data, data)) 11.2442 11.84408 13.50861 12.37741 13.17539 22.89314 100
microbenchmark(df5 <- rbind(dt, dt))
Unit: milliseconds
expr min lq mean median uq max neval
df5 <- rbind(dt, dt) 11.02781 12.04254 15.0049 12.69404 13.36917 135.747 100
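One place where the methods are assumed to differ is data frames with mismatched columns; a small probe like the following (a sketch, with toy data frames a and b of my own) lets you check that edge case yourself rather than relying on the benchmarks alone.
a <- data.frame(x = 1:2, y = c("p", "q"))
b <- data.frame(x = 3:4, z = c("r", "s"))
try(dplyr::union_all(a, b))                     # set-verb semantics; expected to reject incompatible columns
try(rbind(a, b))                                # base rbind; expected to error because the names differ
dplyr::bind_rows(a, b)                          # pads the non-shared columns with NA
data.table::rbindlist(list(a, b), fill = TRUE)  # fill = TRUE likewise pads with NA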

Count number of palindromes within a string

I have written the code below to count the number of palindromic substrings of a given string:
countPalindromes <- function(str) {
  len <- nchar(str)
  count <- 0
  for (i in 1:len) {
    for (j in i:len) {
      subs <- substr(str, i, j)
      rev <- paste(rev(substring(subs, 1:nchar(subs), 1:nchar(subs))), collapse = "")
      if (subs == rev) {
        count <- count + 1
      }
    }
  }
  count
}
This works correctly, but it needs to run faster.
Please suggest some ways to optimize this piece of code.
Here's a solution that uses the wonderful stringi package - just as Andre suggested - together with a wee bit of vectorization.
cp <- function(s) {
  lenstr <- stringi::stri_length(s)  # get the length
  res <- sapply(1:lenstr, function(i) {
    # get all substrings starting at position i
    sub_string <- stringi::stri_sub(s, i, i:lenstr)
    # count those that equal their own reverse
    sum(sub_string == stringi::stri_reverse(sub_string))
  })
  sum(res)
}
This should give the same result as your function
> cp("enafdemderredmedfane")
[1] 30
> countPalindromes("enafdemderredmedfane")
[1] 30
There is not much speedup for short strings, but for longer strings you can really see a benefit:
> microbenchmark::microbenchmark(countPalindromes("howdoyoudo"), cp("howdoyoudo"))
Unit: microseconds
expr min lq mean median uq max neval cld
countPalindromes("howdoyoudo") 480.979 489.6180 508.9044 494.9005 511.201 662.605 100 b
cp("howdoyoudo") 156.117 163.1555 175.4785 169.5640 179.993 324.145 100 a
Compared to
> microbenchmark::microbenchmark(countPalindromes("enafdemderredmedfane"), cp("enafdemderredmedfane"))
Unit: microseconds
expr min lq mean median uq max neval cld
countPalindromes("enafdemderredmedfane") 2031.565 2115.0305 2475.5974 2222.354 2384.151 6696.484 100 b
cp("enafdemderredmedfane") 324.991 357.6055 430.8334 387.242 478.183 1298.390 100 a
Working with a vector, the process is faster. I am thinking of eliminating the double for loop, but I cannot find an efficient way.
countPalindromes_new <- function(str) {
  len <- nchar(str)
  strsp <- strsplit(str, "")[[1]]
  count <- 0
  for (i in 1:len) {
    for (j in i:len) {
      if (all(strsp[i:j] == strsp[j:i])) {
        count <- count + 1
      }
    }
  }
  count
}
> microbenchmark::microbenchmark(countPalindromes("howdoyoudo"), cp("howdoyoudo"), countPalindromes_new("howdoyoudo"))
Unit: microseconds
expr min lq mean median uq max neval
countPalindromes("howdoyoudo") 869.121 933.1215 1069.68001 963.201 1022.081 6712.751 100
cp("howdoyoudo") 192.000 202.8805 243.11972 219.308 258.987 477.441 100
countPalindromes_new("howdoyoudo") 49.068 53.3340 62.32815 57.387 63.574 116.481 100
> microbenchmark::microbenchmark(countPalindromes("enafdemderredmedfane"), cp("enafdemderredmedfane"), countPalindromes_new("enafdemderredmedfane"))
Unit: microseconds
expr min lq mean median uq max neval
countPalindromes("enafdemderredmedfane") 3578.029 3800.9620 4170.0888 3987.416 4173.6550 10205.445 100
cp("enafdemderredmedfane") 391.254 438.4010 609.8782 481.708 534.6135 6116.270 100
countPalindromes_new("enafdemderredmedfane") 200.534 214.1875 235.3501 223.148 245.5475 448.854 100
UPDATE (new version without the length-1 comparisons):
countPalindromes_new2 <- function(str) {
  len <- nchar(str)
  strsp <- strsplit(str, "")[[1]]
  count <- len
  for (i in 1:(len - 1)) {
    for (j in (i + 1):len) {
      if (all(strsp[i:j] == strsp[j:i])) {
        count <- count + 1
      }
    }
  }
  count
}
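For what it's worth, here is one way to drop the explicit double loop. This is only a sketch (the name countPalindromes_grid is mine): it still examines every (i, j) pair, just via a pre-computed index grid and mapply() instead of nested for loops, so it is not guaranteed to be faster.
countPalindromes_grid <- function(str) {
  strsp <- strsplit(str, "")[[1]]
  len <- length(strsp)
  # all index pairs with i <= j, taken from an upper-triangular mask
  idx <- which(upper.tri(diag(len), diag = TRUE), arr.ind = TRUE)
  sum(mapply(function(i, j) all(strsp[i:j] == strsp[j:i]),
             idx[, "row"], idx[, "col"]))
}
countPalindromes_grid("enafdemderredmedfane")
# expected: 30, matching the functions above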
Simply put: normally I'm against using new libraries everywhere, but stringi is THE library for working with strings in R.
string_vec <- c("anna","nothing","abccba")
string_rev <- stringi::stri_reverse(string_vec)
sum(string_vec == string_rev)
# evaluates to 2

R *apply vector as input; matrix as output

I'd like to apply, to each element of a vector, a function that outputs a vector.
After applying the function to each element of that vector, I should have many vectors, which I'd like to rbind into a matrix.
The code should be equivalent to the following:
my_function <- function(x) x:(x + 10)
my_vec <- 1:10
x <- vector()
for (i in seq_along(my_vec)) {
  x <- rbind(x, my_function(my_vec[i]))
}
Of course, my_function and my_vec are just examples.
try:
tmp <- lapply(my_vec, my_function)
do.call(rbind, tmp)
Or, as Heroka suggested, use sapply. I prefer lapply, and then binding my output the way I like (rbind/cbind) instead of potentially having to transpose.
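For completeness, here is the sapply() route next to the lapply() one: sapply() simplifies column-wise, so its result usually needs a t() to get one row per input element (a minimal sketch using the question's own my_function and my_vec).
my_function <- function(x) x:(x + 10)
my_vec <- 1:10
m_sapply <- t(sapply(my_vec, my_function))
m_lapply <- do.call(rbind, lapply(my_vec, my_function))
identical(m_sapply, m_lapply)
# expected: TRUE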
Here is an alternative:
matrix( unlist(lapply(my_vec,my_function)), length(my_vec), byrow=TRUE )
Speed is almost the same:
library(microbenchmark)
my_function <- function(x) sin(x:(x+10))
for (n in 1:4) {
  my_vec <- 1:10^n
  print(
    microbenchmark(mra68  = matrix(unlist(lapply(my_vec, my_function)), length(my_vec), byrow = TRUE),
                   stas.g = do.call(rbind, lapply(my_vec, my_function)),
                   times  = 1000)
  )
  print("identical?")
  print(identical(matrix(unlist(lapply(my_vec, my_function)), length(my_vec), byrow = TRUE),
                  do.call(rbind, lapply(my_vec, my_function))))
}
Unit: microseconds
expr min lq mean median uq max neval
mra68 38.496 40.307 68.00539 41.213 110.052 282.148 1000
stas.g 41.213 42.572 72.86443 43.930 115.939 445.186 1000
[1] "identical?"
[1] TRUE
Unit: microseconds
expr min lq mean median uq max neval
mra68 793.002 810.212 850.4857 818.3640 865.2375 7231.669 1000
stas.g 876.786 894.901 946.8165 906.2235 966.9100 7051.873 1000
[1] "identical?"
[1] TRUE
Unit: milliseconds
expr min lq mean median uq max neval
mra68 2.605448 3.028442 5.269003 4.020940 7.807512 14.51225 1000
stas.g 2.959604 3.390071 5.823661 4.500546 8.800462 92.54977 1000
[1] "identical?"
[1] TRUE
Unit: milliseconds
expr min lq mean median uq max neval
mra68 27.29810 30.99387 51.44223 41.20167 79.46185 559.0059 1000
stas.g 33.63622 37.22420 60.10224 49.07643 92.94333 395.3315 1000
[1] "identical?"
[1] TRUE

Why does which work faster on a data frame column compared to a matrix column?

I have the following data:
height = 1:10000000
length = -(1:10000000)
body_dim = data.frame(height,length)
body_dim_mat = as.matrix(body_dim)
Why does which() work faster for the data frame compared to the matrix?
> microbenchmark(body_dim[which(body_dim$height==50000),"length"])
Unit: milliseconds
expr min lq median uq max neval
body_dim[which(body_dim$height == 50000), "length"] 124.4586 125.1625 125.9281 127.9496 284.9824 100
> microbenchmark(body_dim_mat[which(body_dim_mat[,1] == 50000),2])
Unit: milliseconds
expr min lq median uq max neval
body_dim_mat[which(body_dim_mat[, 1] == 50000), 2] 251.1282 252.4457 389.7251 400.313 1004.25 100
A data.frame is a list, and a column is a simple vector that is very easy to extract from the list. A matrix is a vector with dimension attributes, so which values belong to a given column has to be calculated from the dimensions. This affects subsetting, which you include in your benchmarks:
library(microbenchmark)
set.seed(42)
m <- matrix(rnorm(1e5), ncol=10)
DF <- as.data.frame(m)
microbenchmark(m[,1], DF[,1], DF$V1)
#Unit: microseconds
# expr min lq median uq max neval
# m[, 1] 80.997 82.536 84.230 87.1560 1147.795 100
#DF[, 1] 15.399 16.939 20.789 22.6365 100.090 100
# DF$V1 1.849 2.772 3.389 4.3130 90.235 100
However, the take-home message is not that you should always use a data.frame, because subsetting where the result is not a vector is much faster on a matrix:
microbenchmark(m[1:10, 1:10], DF[1:10, 1:10])
# Unit: microseconds
# expr min lq median uq max neval
# m[1:10, 1:10] 1.233 1.8490 3.2345 3.697 11.087 100
# DF[1:10, 1:10] 211.267 219.7355 228.2050 252.226 1265.131 100
It seems the problem arises before which(): subsetting a data.frame column is simply faster than subsetting a whole matrix column:
microbenchmark(body_dim$height==50000)
# Unit: milliseconds
# expr min lq median uq max neval
# body_dim$height == 50000 138.2619 148.5132 170.1895 170.8909 249.4592 100
microbenchmark(body_dim_mat[,1]==50000)
# Unit: milliseconds
# expr min lq median uq max neval
# body_dim_mat[, 1] == 50000 299.599 308.6066 310.9036 354.4641 432.7833 100
By the way, this case is where data.table can shine:
require(data.table)
dt <- data.table(body_dim, key="height")
microbenchmark(dt[J(50000)]$length, unit="ms")
# Unit: milliseconds
# expr min lq median uq max neval
# dt[J(50000)]$length 0.96637 0.97908 0.989772 1.025257 2.588402 100
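For comparison (a sketch, no timings claimed): the same lookup written as an ordinary data.table filter, without relying on the key-based binary search.
dt[height == 50000, length]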

Faster way to find the first TRUE value in a vector

In one function I very often need to use code like:
which(x==1)[1]
which(x>1)[1]
x[x>10][1]
where x is a numeric vector. summaryRprof() shows that I spend >80% of the time on relational operators. I wonder if there is a function that does the comparison only until the first TRUE value is reached, to speed up my code. A for loop is slower than the options provided above.
I don't know of a pure R way to do this, so I wrote a C function to do it for the quantstrat package. This function was written with a specific purpose in mind, so it's not as general as I would like. For example, you may notice that it only works on real/double/numeric data, so be sure to coerce Data to that before calling the .firstCross function.
#include <R.h>
#include <Rinternals.h>

SEXP firstCross(SEXP x, SEXP th, SEXP rel, SEXP start)
{
    int i, int_rel, int_start;
    double *real_x = NULL, real_th;

    if (ncols(x) > 1)
        error("only univariate data allowed");

    /* this currently only works for real x and th arguments
     * support for other types may be added later */
    real_th = asReal(th);
    int_rel = asInteger(rel);
    int_start = asInteger(start) - 1;

    switch (int_rel) {
        case 1:  /* >  */
            real_x = REAL(x);
            for (i = int_start; i < nrows(x); i++)
                if (real_x[i] > real_th)
                    return(ScalarInteger(i + 1));
            break;
        case 2:  /* <  */
            real_x = REAL(x);
            for (i = int_start; i < nrows(x); i++)
                if (real_x[i] < real_th)
                    return(ScalarInteger(i + 1));
            break;
        case 3:  /* == */
            real_x = REAL(x);
            for (i = int_start; i < nrows(x); i++)
                if (real_x[i] == real_th)
                    return(ScalarInteger(i + 1));
            break;
        case 4:  /* >= */
            real_x = REAL(x);
            for (i = int_start; i < nrows(x); i++)
                if (real_x[i] >= real_th)
                    return(ScalarInteger(i + 1));
            break;
        case 5:  /* <= */
            real_x = REAL(x);
            for (i = int_start; i < nrows(x); i++)
                if (real_x[i] <= real_th)
                    return(ScalarInteger(i + 1));
            break;
        default:
            error("unsupported relationship operator");
    }
    /* return number of observations if relationship is never TRUE */
    return(ScalarInteger(nrows(x)));
}
And here's the R function that calls it:
.firstCross <- function(Data, threshold = 0, relationship, start = 1) {
  rel <- switch(relationship[1],
                '>'    = ,
                'gt'   = 1,
                '<'    = ,
                'lt'   = 2,
                '=='   = ,
                'eq'   = 3,
                '>='   = ,
                'gte'  = ,
                'gteq' = ,
                'ge'   = 4,
                '<='   = ,
                'lte'  = ,
                'lteq' = ,
                'le'   = 5)
  .Call('firstCross', Data, threshold, rel, start)
}
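A brief usage note on the coercion caveat above (a sketch; it assumes the C routine has been compiled and loaded so .firstCross() can find it): since only double data is handled, coerce integer or logical vectors first.
xi <- 1:10                           # integer vector
.firstCross(as.numeric(xi), 5, ">")  # index of the first element > 5; expected: 6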
Some benchmarks, just for fun.
> library(quantstrat)
> library(microbenchmark)
> firstCross <- quantstrat:::.firstCross
> set.seed(21)
> x <- rnorm(1e6)
> microbenchmark(which(x > 3)[1], firstCross(x,3,">"), times=10)
Unit: microseconds
expr min lq median uq max neval
which(x > 3)[1] 9482.081 9578.072 9597.3870 9690.448 9820.176 10
firstCross(x, 3, ">") 11.370 11.675 31.9135 34.443 38.614 10
> which(x>3)[1]
[1] 919
> firstCross(x,3,">")
[1] 919
Note that firstCross will yield a larger relative speedup the larger Data is (because R's relational operators have to finish comparing the entire vector).
> x <- rnorm(1e7)
> microbenchmark(which(x > 3)[1], firstCross(x,3,">"), times=10)
Unit: microseconds
expr min lq median uq max neval
which(x > 3)[1] 94536.21 94851.944 95799.857 96154.756 113962.794 10
firstCross(x, 3, ">") 5.08 5.507 25.845 32.164 34.183 10
> which(x>3)[1]
[1] 97
> firstCross(x,3,">")
[1] 97
...and it won't be appreciably faster if the first TRUE value is near the end of the vector.
> microbenchmark(which(x==last(x))[1], firstCross(x,last(x),"eq"),times=10)
Unit: milliseconds
expr min lq median uq max neval
which(x == last(x))[1] 92.56311 93.85415 94.38338 98.18422 106.35253 10
firstCross(x, last(x), "eq") 86.55415 86.70980 86.98269 88.32168 92.97403 10
> which(x==last(x))[1]
[1] 10000000
> firstCross(x,last(x),"eq")
[1] 10000000
Base R provides Position and Find for locating the first index and value, respectively, for which a predicate returns a true value. These higher-order functions return immediately upon the first hit.
f <- function(x) {
  r <- vector("list", 3)
  r[[1]] <- which(x == 1)[1]
  r[[2]] <- which(x > 1)[1]
  r[[3]] <- x[x > 10][1]
  return(r)
}
# p() builds a partial application: it fixes the second argument of f
p <- function(f, b) function(a) f(a, b)
g <- function(x) {
  r <- vector("list", 3)
  r[[1]] <- Position(p(`==`, 1), x)
  r[[2]] <- Position(p(`>`, 1), x)
  r[[3]] <- Find(p(`>`, 10), x)
  return(r)
}
The relative performance depends greatly on the probability of finding a hit early relative to the cost of the predicate vs the overhead of Position/Find.
library(microbenchmark)
set.seed(1)
x<-sample(1:100,1e5,replace=TRUE)
microbenchmark(f(x),g(x))
Unit: microseconds
expr min lq mean median uq max neval cld
f(x) 5034.283 5410.1205 6313.861 5798.4780 6948.5675 26735.52 100 b
g(x) 587.463 650.4795 1013.183 734.6375 950.9845 20285.33 100 a
y<-rep(0,1e5)
microbenchmark(f(y),g(y))
Unit: milliseconds
expr min lq mean median uq max neval cld
f(y) 3.470179 3.604831 3.791592 3.718752 3.866952 4.831073 100 a
g(y) 131.250981 133.687454 137.199230 134.846369 136.193307 177.082128 100 b
That's a nice question and answer... just to add: any() is no faster than which() or match(), but both are quicker than [], which I guess may create a big, useless vector of TRUEs and FALSEs. So I'm guessing no, short of the answer above.
v=rep('A', 10e6)
v[5e6]='B'
v[10e6]='B'
microbenchmark(which(v=='B')[1])
Unit: milliseconds
expr min lq median uq max neval
which(v == "B")[1] 332.3788 337.6718 344.4076 347.1194 503.4022 100
microbenchmark(any(v=='B'))
Unit: milliseconds
expr min lq median uq max neval
any(v == "B") 334.4466 335.114 335.6714 347.5474 356.0261 100
microbenchmark(v[v=='B'][1])
Unit: milliseconds
expr min lq median uq max neval
v[v == "B"][1] 601.5923 605.3331 609.191 612.0689 707.1409 100
microbenchmark(match("B", v))
Unit: milliseconds
expr min lq median uq max neval
match("B", v) 339.2872 344.7648 350.5444 359.6746 915.6446 100
Any other ideas out there?
