increase efficiency and speed of R function - r

When using R I always have in mind: "Avoid using loops if possible". However, I am stuck right now; I haven't been able to figure out a CRANTASTIC way to code what I need.
For the record, after several comments, that statement above turns out not to be the right one here: there is no need to avoid the loop to improve efficiency.
I have two string vectors as input, let us call them, a and b - they can only contain the letters "M", "I" and "D".
a = c("M","I","D","D","M","M","M","M","M","M")
b = c("M","M","M","M","M","M","D","M","M")
My desired output is:
d = c("M","I","D","D","M","M","M","M","I","M","M")
The following function gives me such output:
my.function <- function(a, b)
{
  nrow.df = length(a) + length(which(b=="D"))
  my.df = data.frame(a = rep(NA, nrow.df),
                     b = rep(NA, nrow.df),
                     d = rep(NA, nrow.df))
  my.df$a[1:length(a)] = a
  my.df$b[1:length(b)] = b
  for (i in 1:nrow.df)
  {
    if(my.df$a[i] == "D") {
      my.df$d[i] = "D"
      my.df$b[(i+1):nrow.df] = my.df$b[i:(nrow.df-1)]
    } else if (my.df$b[i] == "D") {
      my.df$d[i] = "I"
      my.df$a[(i+1):nrow.df] = my.df$a[i:(nrow.df-1)]
    } else if (my.df$a[i] == "I") {
      my.df$d[i] = "I"
    } else if (my.df$b[i] == "I") {
      my.df$d[i] = "D"
    } else {
      my.df$d[i] = my.df$a[i]
    }
  }
  return(my.df$d)
}
> d = my.function(a,b)
> d
[1] "M" "I" "D" "D" "M" "M" "M" "M" "I" "M" "M"
The function logic is as follows: whenever there is a "D" in a, it puts a "D" in d and shifts the vector b by 1; and vice versa, whenever there is a "D" in b, it puts an "I" in d and shifts a by 1.
Next, when there is an "I" in a but not a "D" in b, it puts an "I" in d; and vice versa, whenever there is an "I" in b and not a "D" in a, it puts a "D" in d. Otherwise, d = a.
It is not a complex function, but I am struggling with how to make it efficient in R. I am applying this function millions of times with mclapply, so a fast implementation would save me a lot of time.
Do you recommend using Rcpp? Would it be much faster? Is there any slowdown from crossing the R/C++ boundary millions of times, or does Rcpp handle that automatically?

Building on my comment, if speed is a concern, step 1 is to not use data.frames unnecessarily. This answer doesn't address the loop (as others have already said, there is nothing wrong with using a loop in R if it is done properly).
Here is a very slightly modified version of your function, using vectors instead of a data.frame to store the data.
my.function.v <- function(a, b) {
  nrow.df = length(a) + length(which(b=="D"))
  A <- B <- D <- vector(length = nrow.df)
  A[1:length(a)] = a
  B[1:length(b)] = b
  for (i in 1:nrow.df)
  {
    if(A[i] == "D") {
      D[i] = "D"
      B[(i+1):nrow.df] = B[i:(nrow.df-1)]
    } else if (B[i] == "D") {
      D[i] = "I"
      A[(i+1):nrow.df] = A[i:(nrow.df-1)]
    } else if (A[i] == "I") {
      D[i] = "I"
    } else if (B[i] == "I") {
      D[i] = "D"
    } else {
      D[i] = A[i]
    }
  }
  return(D)
}
Notice the relative difference in speed below:
library(microbenchmark)
microbenchmark(my.function(a, b), my.function.v(a, b), f(a, b))
# Unit: microseconds
# expr min lq median uq max neval
# my.function(a, b) 1448.416 1490.8780 1511.3435 1547.3880 6674.332 100
# my.function.v(a, b) 157.248 165.8725 171.6475 179.1865 324.722 100
# f(a, b) 168.874 177.5455 184.8775 193.3455 416.551 100
As can be seen, @mrip's function also fares much better than your original function.

I don't see any easy way to avoid a loop here. However, there is still a more efficient way of doing this. The problem is that you are actually shifting a and b every time you come across the character D, and shifting a vector like this is an O(n) operation, so the running time of this loop would actually be O(n^2).
You can simplify the code and get slightly better performance like this:
f <- function(a, b){
  aSkipped <- 0
  bSkipped <- 0
  d <- rep(0, length(a) + sum(b=="D"))
  for(i in 1:length(d)){
    if(a[i-aSkipped] == "D") {
      d[i] = "D"
      bSkipped <- bSkipped + 1
    } else if (b[i-bSkipped] == "D") {
      d[i] = "I"
      aSkipped <- aSkipped + 1
    } else if (a[i-aSkipped] == "I") {
      d[i] = "I"
    } else if (b[i-bSkipped] == "I") {
      d[i] = "D"
    } else {
      d[i] = a[i-aSkipped]
    }
  }
  d
}
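As a quick check, running f on the example vectors from the question reproduces the desired d:
f(a, b)
# [1] "M" "I" "D" "D" "M" "M" "M" "M" "I" "M" "M"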
On edit: you will really see large performance improvements when the input gets big. For small strings, and not too many "D"s, this and Ananda Mahto's solution run in about the same time:
> set.seed(123)
> a<-c(sample(c("M","I"),500,T))
> b<-c(sample(c("M","I"),500,T))
> a[sample(500,50)]<-"D"
> b[sample(500,50)]<-"D"
> microbenchmark(f(a,b),my.function.v(a,b))
Unit: milliseconds
expr min lq median uq max neval
f(a, b) 4.259970 4.324046 4.368018 4.463925 9.694951 100
my.function.v(a, b) 4.442873 4.497172 4.533196 4.639543 9.901044 100
But for strings of length 50000 with 5000 "D"s the difference is substantial:
> set.seed(123)
> a<-c(sample(c("M","I"),50000,T))
> b<-c(sample(c("M","I"),50000,T))
> a[sample(50000,5000)]<-"D"
> b[sample(50000,5000)]<-"D"
> system.time(f(a,b))
user system elapsed
0.460 0.000 0.463
> system.time(my.function.v(a,b))
user system elapsed
7.056 0.008 7.077

OK, here's the Rcpp solution, and as expected, it beats the R solution by a lot:
rcppFun <- "
CharacterVector fcpp(CharacterVector a, CharacterVector b, int size){
    int aSkipped = 0;
    int bSkipped = 0;
    int asize = a.size();
    Rcpp::CharacterVector d(size);
    for(int i=0; i<size; i++){
        if(i-aSkipped<asize && a[i-aSkipped][0] == 'D') {
            d[i] = \"D\";
            bSkipped++;
        } else if (b[i-bSkipped][0] == 'D') {
            d[i] = \"I\";
            aSkipped++;
        } else if (a[i-aSkipped][0] == 'I') {
            d[i] = \"I\";
        } else if (b[i-bSkipped][0] == 'I') {
            d[i] = \"D\";
        } else {
            d[i] = a[i-aSkipped];
        }
    }
    return d;
}"
require("Rcpp")
fcpp<-cppFunction(rcppFun)
f3 <- function(a, b){
  fcpp(a, b, as.integer(length(a) + sum(b=="D")))
}
Warning: that function does no parameter checking at all, so if you feed it bad data you can easily get a seg fault.
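One cheap way to guard against that is to validate the inputs on the R side before dropping into C++. Here is a minimal sketch of such a wrapper (the name f3_safe is hypothetical, not part of the answer):
f3_safe <- function(a, b){
  # refuse anything that is not a character vector over the expected alphabet
  stopifnot(is.character(a), is.character(b),
            all(a %in% c("M", "I", "D")), all(b %in% c("M", "I", "D")))
  fcpp(a, b, as.integer(length(a) + sum(b == "D")))
}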
If you are going to be calling this a lot, Rcpp is definitely the way to go:
> with(ab(10),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
expr min lq median uq max neval
f(a, b) 103.993 107.5155 108.6815 109.7455 178.801 100
f3(a, b) 7.354 8.1305 8.5575 9.1220 18.014 100
f2(a, b) 87.081 90.4150 92.2730 94.2585 146.502 100
my.function.v(a, b) 84.389 86.5140 87.6090 88.8340 109.106 100
> with(ab(100),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
expr min lq median uq max neval
f(a, b) 992.082 1018.9850 1032.0180 1071.0690 2784.710 100
f3(a, b) 12.873 14.3605 14.7370 15.5095 35.582 100
f2(a, b) 119.396 125.4405 129.3015 134.9915 1909.930 100
my.function.v(a, b) 769.618 786.7865 802.2920 824.0820 905.737 100
> with(ab(1000),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
expr min lq median uq max neval
f(a, b) 9816.295 10065.065 10233.1350 10392.696 12383.373 100
f3(a, b) 66.057 67.869 83.9075 87.231 1167.086 100
f2(a, b) 1637.972 1760.258 2667.6985 3138.229 47610.317 100
my.function.v(a, b) 9692.885 10272.425 10997.2595 11402.602 54315.922 100
> with(ab(10000),microbenchmark(f(a,b),f3(a,b),f2(a,b)))
Unit: microseconds
expr min lq median uq max neval
f(a, b) 101644.922 103311.678 105185.5955 108342.4960 144620.777 100
f3(a, b) 607.702 610.039 669.8515 678.1845 785.415 100
f2(a, b) 221305.641 247952.345 254478.1580 341195.5510 656408.378 100

Just for the sake of showing how it might be done, it can be done without a loop in R; here's one way. It's faster when the length is roughly 1000 or less, but slower when larger. One takeaway is that you surely could speed this up with Rcpp.
f2 <- function(a, b) {
  da <- which(a == "D")                    # positions of "D" in a
  db <- which(b == "D")                    # positions of "D" in b
  dif <- outer(da, db, `<`)                # dif[i, j] is TRUE when da[i] < db[j]
  da <- da + rowSums(!dif)                 # map D-in-a positions to output positions
  db <- db + colSums(dif)                  # map D-in-b positions to output positions
  ia <- which(a == "I")
  ia <- ia + colSums(outer(db, ia, `<`))   # shift I-in-a positions past earlier insertions
  ib <- which(b == "I")
  ib <- ib + colSums(outer(da, ib, `<`))   # shift I-in-b positions past earlier insertions
  out <- rep("M", length(a) + length(db))
  out[da] <- "D"
  out[db] <- "I"
  out[ia] <- "I"
  out[ib] <- "D"
  out
}
For generating data
ab <- function(N) {
  set.seed(123)
  a <- c(sample(c("M","I"), N, TRUE))
  b <- c(sample(c("M","I"), N, TRUE))
  a[sample(N, N/10)] <- "D"
  b[sample(N, N/10)] <- "D"
  list(a = a, b = b)
}
Timings:
> library(microbenchmark)
> with(ab(10), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: microseconds
expr min lq median uq max neval
my.function.v(a, b) 79.102 86.9005 89.3680 93.2410 279.761 100
f(a, b) 84.334 91.1055 94.1790 98.2645 215.579 100
f2(a, b) 94.807 101.5405 105.1625 108.9745 226.149 100
> with(ab(100), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: microseconds
expr min lq median uq max neval
my.function.v(a, b) 732.849 750.4480 762.906 845.0835 1953.371 100
f(a, b) 789.380 805.8905 819.022 902.5865 1921.064 100
f2(a, b) 124.442 129.1450 134.543 137.5910 237.498 100
> with(ab(1000), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: milliseconds
expr min lq median uq max neval
my.function.v(a, b) 10.146865 10.387144 10.695895 11.123164 13.08263 100
f(a, b) 7.776286 7.973918 8.266882 8.633563 9.98204 100
f2(a, b) 1.322295 1.355601 1.385302 1.465469 1.85349 100
> with(ab(10000), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b), times=10))
Unit: milliseconds
expr min lq median uq max neval
my.function.v(a, b) 429.4030 435.00373 439.06706 442.51650 465.00124 10
f(a, b) 80.7709 83.71715 85.14887 88.02067 89.00047 10
f2(a, b) 164.7807 170.37608 175.94281 247.78353 251.14653 10


Blend vectors in R [duplicate]

Background
Several SQL dialects (I mostly use PostgreSQL) have a function called coalesce, which returns the first non-NULL element across columns for each row. This can be very efficient when tables contain a lot of NULL elements.
I run into this in a lot of scenarios in R as well, when dealing with less structured data that contains a lot of NAs.
I have made a naive implementation myself, but it is ridiculously slow.
coalesce <- function(...) {
  apply(cbind(...), 1, function(x) {
    x[which(!is.na(x))[1]]
  })
}
Example
a <- c(1, 2, NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7, 8, NA, 9, 10)
coalesce(a,b,c)
# [1] 1 2 NA 4 6
Question
Is there any efficient way to implement coalesce in R?
On my machine, using Reduce gets a 5x performance improvement:
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  },
  list(...))
}
> microbenchmark(coalesce(a,b,c),coalesce2(a,b,c))
Unit: microseconds
expr min lq median uq max neval
coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438 100
coalesce2(a, b, c) 19.601 21.4055 22.8835 23.8315 45.419 100
Looks like coalesce1 is still available
coalesce1 <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    i <- is.na(ans)
    ans[i] <- elt[i]
  }
  ans
}
which is faster still (but more-or-less a hand re-write of Reduce, so less general)
> identical(coalesce(a, b, c), coalesce1(a, b, c))
[1] TRUE
> microbenchmark(coalesce(a,b,c), coalesce1(a, b, c), coalesce2(a,b,c))
Unit: microseconds
expr min lq median uq max neval
coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348 100
coalesce1(a, b, c) 8.287 9.4110 10.9515 12.1295 20.940 100
coalesce2(a, b, c) 37.711 40.1615 42.0885 45.1705 67.258 100
Or for larger data compare
coalesce1a <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    i <- which(is.na(ans))
    ans[i] <- elt[i]
  }
  ans
}
showing that which() can sometimes be effective, even though it implies a second pass through the index.
> aa <- sample(a, 100000, TRUE)
> bb <- sample(b, 100000, TRUE)
> cc <- sample(c, 100000, TRUE)
> microbenchmark(coalesce1(aa, bb, cc),
+ coalesce1a(aa, bb, cc),
+ coalesce2(aa,bb,cc), times=10)
Unit: milliseconds
expr min lq median uq max neval
coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533 10
coalesce1a(aa, bb, cc) 2.906067 2.953266 2.962729 2.971761 3.452251 10
coalesce2(aa, bb, cc) 3.080842 3.115607 3.139484 3.166642 3.198977 10
From data.table >= 1.12.3 you can use fcoalesce.
library(data.table)
fcoalesce(a, b, c)
# [1] 1 2 NA 4 6
fcoalesce can also take "a single plain list, data.table or data.frame". Thus, if the vectors above were columns in a data.frame (or a data.table), we could simply supply the name of the data set:
d = data.frame(a, b, c)
# or d = data.table(a, b, c)
fcoalesce(d)
# [1] 1 2 NA 4 6
For more info, including a benchmark, see NEWS item #18 for development version 1.12.3.
Using dplyr package:
library(dplyr)
coalesce(a, b, c)
# [1] 1 2 NA 4 6
Benchmark; it is not as fast as the accepted solution:
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  },
  list(...))
}
microbenchmark::microbenchmark(
coalesce(a, b, c),
coalesce2(a, b, c)
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# coalesce(a, b, c) 21.951 24.518 27.28264 25.515 26.9405 126.293 100 b
# coalesce2(a, b, c) 7.127 8.553 9.68731 9.123 9.6930 27.368 100 a
But on a larger dataset, it is comparable:
aa <- sample(a, 100000, TRUE)
bb <- sample(b, 100000, TRUE)
cc <- sample(c, 100000, TRUE)
microbenchmark::microbenchmark(
coalesce(aa, bb, cc),
coalesce2(aa, bb, cc))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# coalesce(aa, bb, cc) 1.708511 1.837368 5.468123 3.268492 3.511241 96.99766 100 a
# coalesce2(aa, bb, cc) 1.474171 1.516506 3.312153 1.957104 3.253240 91.05223 100 a
I have a ready-to-use implementation called coalesce.na in my misc package. It seems to be competitive, but not the fastest.
It will also work for vectors of different lengths, and has special treatment for vectors of length one:
expr min lq median uq max neval
coalesce(aa, bb, cc) 990.060402 1030.708466 1067.000698 1083.301986 1280.734389 10
coalesce1(aa, bb, cc) 11.356584 11.448455 11.804239 12.507659 14.922052 10
coalesce1a(aa, bb, cc) 2.739395 2.786594 2.852942 3.312728 5.529927 10
coalesce2(aa, bb, cc) 2.929364 3.041345 3.593424 3.868032 7.838552 10
coalesce.na(aa, bb, cc) 4.640552 4.691107 4.858385 4.973895 5.676463 10
Here's the code:
coalesce.na <- function(x, ...) {
  x.len <- length(x)
  ly <- list(...)
  for (y in ly) {
    y.len <- length(y)
    if (y.len == 1) {
      x[is.na(x)] <- y
    } else {
      if (x.len %% y.len != 0)
        warning('object length is not a multiple of first object length')
      pos <- which(is.na(x))
      x[pos] <- y[(pos - 1) %% y.len + 1]
    }
  }
  x
}
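For example, the length-one treatment makes it convenient for filling NAs with a scalar default (a usage sketch, not from the answer):
coalesce.na(a, 0)
# [1] 1 2 0 4 0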
Of course, as Kevin pointed out, an Rcpp solution might be faster by orders of magnitude.
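To give a feel for what that could look like, here is a minimal two-vector sketch; the name coalesce_rcpp and the restriction to equal-length numeric inputs are assumptions of this sketch, not code from the thread:
library(Rcpp)
cppFunction('
NumericVector coalesce_rcpp(NumericVector x, NumericVector y) {
    // copy x, then fill its NAs from y in a single pass
    NumericVector out = clone(x);
    for (R_xlen_t i = 0; i < out.size(); i++) {
        if (NumericVector::is_na(out[i])) out[i] = y[i];
    }
    return out;
}')
coalesce_rcpp(a, b)
# [1] 1 2 NA 4 6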
A very simple solution is to use the ifelse function from the base package:
coalesce3 <- function(x, y) {
  ifelse(is.na(x), y, x)
}
Although it appears to be slower than coalesce2 above:
test <- function(a, b, func) {
  for (i in 1:10000) {
    func(a, b)
  }
}
system.time(test(a, b, coalesce2))
user system elapsed
0.11 0.00 0.10
system.time(test(a, b, coalesce3))
user system elapsed
0.16 0.00 0.15
You can use Reduce to make it work for an arbitrary number of vectors:
coalesce4 <- function(...) {
  Reduce(coalesce3, list(...))
}
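With the vectors from the question this gives the same result as before:
coalesce4(a, b, c)
# [1] 1 2 NA 4 6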
Here is my solution:
coalesce <- function(x){
  y <- head(x[!is.na(x)], 1)
  return(y)
}
It returns the first value which is not NA, and it works with data.table; for example, if you want to use coalesce on a few columns whose names are stored in a vector of strings:
column_names <- c("col1", "col2", "col3")
how to use:
ranking[, coalesce_column := coalesce( mget(column_names) ), by = 1:nrow(ranking)]
Another apply method, with mapply.
mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]}, a, b, c)
[1] 1 2 NA 4 6
This selects the first non-NA value if more than one exists. The last non-missing element could be selected using tail.
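For instance, a sketch of that last-value variant (using rev() rather than tail() so that an all-NA row still yields NA):
mapply(function(...) {temp <- c(...); rev(temp[!is.na(temp)])[1]}, a, b, c)
# [1] 7 8 NA 9 10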
Maybe a bit more speed could be squeezed out of this alternative using the bare bones .mapply function, which looks a little different.
unlist(.mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]},
dots=list(a, b, c), MoreArgs=NULL))
[1] 1 2 NA 4 6
.mapply differs in important ways from its non-dotted cousin:
it returns a list (like Map) and so must be wrapped in some function like unlist or c to return a vector;
the set of arguments to be fed in parallel to the function in FUN must be given as a list to the dots argument;
finally, unlike mapply, the MoreArgs argument does not have a default, so it must explicitly be fed NULL.
Another option is to use do.call and pmin:
do.call(pmin, c(list(a,b,c), list(na.rm=TRUE)))
Output
[1] 1 2 NA 4 6

Count number of palindromes within a string

I have written the below code to count the number of palindromic strings in a given string:
countPalindromes <- function(str){
  len <- nchar(str)
  count <- 0
  for(i in 1:len){
    for(j in i:len){
      subs <- substr(str, i, j)
      rev <- paste(rev(substring(subs, 1:nchar(subs), 1:nchar(subs))), collapse = "")
      if(subs == rev){
        count <- count + 1
      }
    }
  }
  count
}
This is actually working fine, but the code needs to be optimized so that it executes faster.
Please suggest some ways to optimize this piece of code.
Here's a solution that uses the wonderful stringi package - just as Andre suggested - together with a wee bit of vectorization.
cp <- function(s) {
  lenstr <- stringi::stri_length(s) # Get the length
  res <- sapply(1:lenstr, function(i) {
    # Get all substrings starting at position i
    sub_string <- stringi::stri_sub(s, i, i:lenstr)
    # Count matches
    sum((sub_string == stringi::stri_reverse(sub_string)))
  })
  sum(res)
}
This should give the same result as your function
> cp("enafdemderredmedfane")
[1] 30
> countPalindromes("enafdemderredmedfane")
[1] 30
There is not much speedup for short strings, but for longer strings you can really see a benefit:
> microbenchmark::microbenchmark(countPalindromes("howdoyoudo"), cp("howdoyoudo"))
Unit: microseconds
expr min lq mean median uq max neval cld
countPalindromes("howdoyoudo") 480.979 489.6180 508.9044 494.9005 511.201 662.605 100 b
cp("howdoyoudo") 156.117 163.1555 175.4785 169.5640 179.993 324.145 100 a
Compared to
> microbenchmark::microbenchmark(countPalindromes("enafdemderredmedfane"), cp("enafdemderredmedfane"))
Unit: microseconds
expr min lq mean median uq max neval cld
countPalindromes("enafdemderredmedfane") 2031.565 2115.0305 2475.5974 2222.354 2384.151 6696.484 100 b
cp("enafdemderredmedfane") 324.991 357.6055 430.8334 387.242 478.183 1298.390 100 a
Working with a vector, the process is faster. I am thinking of eliminating the double for loop, but I cannot find an efficient way.
countPalindromes_new <- function(str){
  len <- nchar(str)
  strsp <- strsplit(str, "")[[1]]
  count <- 0
  for(i in 1:len){
    for(j in i:len){
      if(all(strsp[i:j] == strsp[j:i])){
        count <- count + 1
      }
    }
  }
  count
}
> microbenchmark::microbenchmark(countPalindromes("howdoyoudo"), cp("howdoyoudo"), countPalindromes_new("howdoyoudo"))
Unit: microseconds
expr min lq mean median uq max neval
countPalindromes("howdoyoudo") 869.121 933.1215 1069.68001 963.201 1022.081 6712.751 100
cp("howdoyoudo") 192.000 202.8805 243.11972 219.308 258.987 477.441 100
countPalindromes_new("howdoyoudo") 49.068 53.3340 62.32815 57.387 63.574 116.481 100
> microbenchmark::microbenchmark(countPalindromes("enafdemderredmedfane"), cp("enafdemderredmedfane"), countPalindromes_new("enafdemderredmedfane"))
Unit: microseconds
expr min lq mean median uq max neval
countPalindromes("enafdemderredmedfane") 3578.029 3800.9620 4170.0888 3987.416 4173.6550 10205.445 100
cp("enafdemderredmedfane") 391.254 438.4010 609.8782 481.708 534.6135 6116.270 100
countPalindromes_new("enafdemderredmedfane") 200.534 214.1875 235.3501 223.148 245.5475 448.854 100
UPDATE (new version without the length-1 comparisons):
countPalindromes_new2 <- function(str){
  len <- nchar(str)
  strsp <- strsplit(str, "")[[1]]
  count <- len  # every single-character substring is a palindrome
  for(i in 1:(len-1)){
    for(j in (i + 1):len){
      if(all(strsp[i:j] == strsp[j:i])){
        count <- count + 1
      }
    }
  }
  count
}
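As a quick sanity check (using the test string from earlier), the updated version still agrees with the previous results:
countPalindromes_new2("enafdemderredmedfane")
# [1] 30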
Simply: normally I'm against using new libraries everywhere. But stringi is THE library for working with strings in R.
string_vec <- c("anna","nothing","abccba")
string_rev <- stringi::stri_reverse(string_vec)
sum(string_vec == string_rev)
# [1] 2

Remove duplicate pair in string

In a string vector I would like to remove both parts of a duplicate, so that I only select the unique strings. That is:
I have a string vector
MyString <- c("aaa", "bbb", "ccc", "ddd", "aaa", "ddd")
I would like to remove both members of each duplicate pair, and thus select:
[1] "bbb" "ccc"
With no luck I tried:
unique(MyString)
x <- table(MyString)
names(x[x==1])
[1] "bbb" "ccc"
also:
MyString[ !duplicated(MyString) & !duplicated(MyString,fromLast = T) ]
[1] "bbb" "ccc"
Find the set of duplicates
dups = MyString[ duplicated(MyString) ]
and drop all occurrences in the set
MyString[ !MyString %in% dups ]
Alternative:
setdiff(MyString, dups)
The table-based solution from @Moody_Mudskipper provides more flexibility, e.g., to choose strings that occur exactly twice. An alternative (probably faster than, but analogous to, the table() solutions when MyString is long): create an index into the unique strings, find the number of times each unique string is matched (tabulate() == 1), and use these to subset the unique strings:
UString = unique(MyString)
UString[ tabulate(match(MyString, UString)) == 1 ]
or save the need to create UString
MyString[ which(tabulate(match(MyString, MyString)) == 1) ]
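As noted above, the counting approach extends to other multiplicities as well; for example, a sketch (not from the answer) selecting strings that occur exactly twice:
MyString[ which(tabulate(match(MyString, MyString)) == 2) ]
# [1] "aaa" "ddd"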
Alternative: sort and then find runs of length 1.
r = rle(sort(MyString))
r$values[ r$lengths == 1 ]
For performance, here are some functions implementing the various solutions
f0 = function(x) x[ !x %in% x[duplicated(x)] ]
f1 = function(x) setdiff( x, x[duplicated(x)] )
f2 = function(x) { ux = unique(x); ux[ tabulate(match(x, ux)) == 1 ] }
f3 = function(x) x[ which( tabulate( match(x, x) ) == 1 ) ]
f4 = function(x) { r = rle(sort(x)); r$values[ r$lengths == 1] }
f5 = function(x) { x = table(x); names(x)[x==1] }
f6 = function(x) x[ !duplicated(x) & !duplicated(x, fromLast = TRUE) ]
evidence that they produce identical results
> identical(f0(x), f1(x))
[1] TRUE
> identical(f0(x), f2(x))
[1] TRUE
> identical(f0(x), f3(x))
[1] TRUE
> identical(f0(x), f4(x))
[1] TRUE
> identical(f0(x), f5(x))
[1] TRUE
> identical(f0(x), f6(x))
[1] TRUE
f5() (also the original implementation) fails for x = character(0)
> f1(character(0))
character(0)
> f5(character(0))
NULL
f4() and f5() return values in alphabetical order, whereas the others preserve the order in the input, like unique(). All methods but f5() work with vectors of other type, e.g., integer() (f5() always returns a character vector, the others return a vector with the same type as the input). f4() and f5() do not recognize unique occurrences of NA.
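A small demonstration of those differences, on a hypothetical integer vector y:
y <- c(2L, 5L, 2L, NA, 7L)
f3(y)   # integer result, input order preserved: 5 NA 7
f4(y)   # sorted, NA dropped: 5 7
f5(y)   # character result, NA dropped: "5" "7"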
And timings:
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x))
Unit: microseconds
expr min lq mean median uq max neval
f0(x) 9.195 10.9730 12.35724 11.8120 13.0580 29.100 100
f1(x) 20.471 22.6625 50.15586 24.6750 25.9915 2600.307 100
f2(x) 13.708 15.2265 58.58714 16.8180 18.4685 4180.829 100
f3(x) 7.533 8.8775 52.43730 9.9855 11.0060 4252.063 100
f4(x) 74.333 79.4305 124.26233 83.1505 87.4455 4091.371 100
f5(x) 147.744 154.3080 196.05684 158.4880 163.6625 3721.522 100
f6(x) 12.458 14.2335 58.11869 15.4805 17.0440 4250.500 100
Here's performance with 10,000 unique words
> x = readLines("/usr/share/dict/words", 10000)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x), times = 10)
Unit: microseconds
expr min lq mean median uq max neval
f0(x) 848.086 871.359 880.8841 873.637 899.669 916.528 10
f1(x) 1440.904 1460.704 1556.7154 1589.405 1607.048 1640.347 10
f2(x) 2143.997 2257.041 2288.1878 2288.329 2334.494 2372.639 10
f3(x) 1420.144 1548.055 1547.8093 1562.927 1596.574 1601.176 10
f4(x) 11829.680 12141.870 12369.5407 12311.334 12716.806 12952.950 10
f5(x) 15796.546 15833.650 16176.2654 15858.629 15913.465 18604.658 10
f6(x) 1219.036 1356.807 1354.3578 1363.276 1372.831 1407.077 10
And with substantial duplication
> x = sample(head(x, 1000), 10000, TRUE)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x))
Unit: milliseconds
expr min lq mean median uq max neval
f0(x) 1.914699 1.922925 1.992511 1.945807 2.030469 2.246022 100
f1(x) 1.888959 1.909469 2.097532 1.948002 2.031083 5.310342 100
f2(x) 1.396825 1.404801 1.447235 1.420777 1.479277 1.820402 100
f3(x) 1.248126 1.257283 1.295493 1.285652 1.329139 1.427220 100
f4(x) 24.075280 24.298454 24.562576 24.459281 24.700579 25.752481 100
f5(x) 4.044137 4.120369 4.307893 4.174639 4.283030 7.740830 100
f6(x) 1.221024 1.227792 1.264572 1.243201 1.295888 1.462007 100
f0() seems to be the speed winner when duplicates are rare
> x = readLines("/usr/share/dict/words", 100000)
> microbenchmark(f0(x), f1(x), f3(x), f6(x))
Unit: milliseconds
expr min lq mean median uq max neval
f0(x) 11.03298 11.17124 12.17688 11.36114 11.62769 19.83124 100
f1(x) 21.16154 21.33792 22.76237 21.67234 22.26473 31.99544 100
f3(x) 21.15801 21.49355 22.60749 21.77821 22.54203 31.17288 100
f6(x) 18.72260 18.97623 20.29060 19.46875 19.94892 28.17551 100
f3() and f6() look correct and fast; f6() is probably easier to understand (but only handles the special case of keeping words that occur exactly once).

Vectorizing double summations using R

I am struggling with translating this double-summation formula into R using a vectorization technique:
( sum_i sum_j |z_i - z_j| * c_ij ) / ( sum_i sum_j c_ij )
All I have been able to do so far is this:
c <- matrix(1:9, 3)
z <- 1:3
sum(abs(outer(z, z,"-")) * c)/sum(c)
But I don't think it's necessarily correct. I tried a for-loop version, but that is too long and my answer is likely wrong anyway. Is anyone keen on this? What am I missing (or doing wrong)? Any help would be appreciated.
Here's a double-loop version:
q = function(z, c){
  num = 0
  for(i in 1:length(z)){
    for(j in 1:length(z)){
      num = num + abs(z[i]-z[j]) * c[i,j]
    }
  }
  num/sum(c)
}
Here's your vectorised version, functionised:
q2 = function(z, c){ sum(c*abs(outer(z, z, '-')) / sum(c)) }
Not a great difference in timing between them really for a small matrix:
> microbenchmark::microbenchmark(q(z,c), q2(z,c))
Unit: microseconds
expr min lq mean median uq max neval cld
q(z, c) 15.368 15.7505 16.59644 16.0225 16.6290 30.346 100 b
q2(z, c) 12.232 12.8885 13.79178 13.2225 13.6585 44.085 100 a
But for a larger test it's a big win:
> c2 = matrix(runif(100*100),100,100)
> z2 = runif(100)
> microbenchmark::microbenchmark(q(z2,c2), q2(z2,c2))
Unit: microseconds
expr min lq mean median uq max neval cld
q(z2, c2) 7437.031 7588.131 8046.92272 7794.927 8332.104 10729.799 100 b
q2(z2, c2) 74.742 78.647 94.20153 86.113 100.125 188.428 100 a
>
Numeric difference is within floating point tolerance:
> q(z2,c2) - q2(z2,c2)
[1] 6.661338e-16
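For completeness, the idiomatic way to check agreement within floating-point tolerance is all.equal():
all.equal(q(z2, c2), q2(z2, c2))
# [1] TRUE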
So unless anyone has faster code, I'd stick with what you've got.
As perfectly explained by @Spacedman, your approach is very efficient, but if you still want to go faster you could try Rcpp:
library(Rcpp)
sourceCpp(code='
#include <Rcpp.h>
// [[Rcpp::export]]
double qRcpp(const Rcpp::NumericVector z, const Rcpp::NumericMatrix cm){
  int zlen = z.length();
  if(!(zlen == cm.nrow() && cm.nrow() == cm.ncol()))
    Rcpp::stop("Invalid sizes");
  double num = 0;
  for(int i = 0 ; i < zlen ; i++){
    for(int j = 0 ; j < zlen ; j++){
      num = num + std::abs(z[i]-z[j]) * cm(i,j);
    }
  }
  return num / Rcpp::sum(cm);
}
')
Benchmark :
c2 = matrix(runif(100*100),100,100)
z2 = runif(100)
microbenchmark::microbenchmark(q(z2,c2), q2(z2,c2),qRcpp(z2,c2))
# Unit: microseconds
# expr min lq mean median uq max neval
# q(z2, c2) 10273.035 10976.3050 11680.85554 11348.763 11765.2010 44115.632 100
# q2(z2, c2) 64.292 67.9455 80.56427 75.543 86.3565 244.019 100
# qRcpp(z2, c2) 21.042 21.9180 25.30515 24.256 26.8860 56.403 100

What is the fastest way to perform multiple logical comparisons in R?

What is the fastest way to perform multiple logical comparisons in R?
Consider for example the vector x
set.seed(14)
x = sample(LETTERS[1:4], size=10, replace=TRUE)
I want to test whether each entry of x is either an "A" or a "B" (and not anything else). The following works:
x == "A" | x == "B"
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
The above code loops three times through the length of the whole vector. Is there a way in R to loop only once and test for each item whether it satisfies one or another condition?
If your objective is just to make a single pass, that is pretty straightforward to write in Rcpp, even if you don't have much experience with C++:
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::LogicalVector single_pass(Rcpp::CharacterVector x, Rcpp::String a, Rcpp::String b) {
  R_xlen_t i = 0, n = x.size();
  Rcpp::LogicalVector result(n);
  for ( ; i < n; i++) {
    result[i] = (x[i] == a || x[i] == b);
  }
  return result;
}
For such a small object as the one used in your example, the slight overhead of .Call (presumably) masks the speed of the Rcpp version,
r_fun <- function(X) X == "A" | X == "B"
##
cpp_fun <- function(X) single_pass(X, "A", "B")
##
all.equal(r_fun(x), cpp_fun(x))
#[1] TRUE
microbenchmark::microbenchmark(
r_fun(x), cpp_fun(x), times = 1000L)
#Unit: microseconds
#expr min lq mean median uq max neval
#r_fun(x) 1.499 1.584 1.974156 1.6795 1.8535 37.903 1000
#cpp_fun(x) 1.860 2.334 3.042671 2.7450 3.1140 51.870 1000
But for larger vectors (I'm assuming this is your real intention), it is considerably faster:
x2 <- sample(LETTERS, 10E5, replace = TRUE)
##
all.equal(r_fun(x2), cpp_fun(x2))
# [1] TRUE
microbenchmark::microbenchmark(
r_fun(x2), cpp_fun(x2), times = 200L)
#Unit: milliseconds
#expr min lq mean median uq max neval
#r_fun(x2) 78.044518 79.344465 83.741901 80.999538 86.368627 149.5106 200
#cpp_fun(x2) 7.104929 7.201296 7.797983 7.605039 8.184628 10.7250 200
Here's a quick attempt at generalizing the above, if you have any use for it.
