Remove duplicate pair in string - r

In a character vector I would like to remove both elements of every duplicated pair, so that I keep only the strings that occur exactly once. That is:
I have a character vector
MyString <- c("aaa", "bbb", "ccc", "ddd", "aaa", "ddd")
I would like to remove both members of each duplicated pair, and thus select:
[1] "bbb" "ccc"
With no luck I tried:
unique(MyString)

x <- table(MyString)
names(x[x==1])
[1] "bbb" "ccc"
also:
MyString[ !duplicated(MyString) & !duplicated(MyString,fromLast = T) ]
[1] "bbb" "ccc"

Find the set of duplicates
dups = MyString[ duplicated(MyString) ]
and drop all occurrences in the set
MyString[ !MyString %in% dups ]
Alternative:
setdiff(MyString, dups)
The table-based solution from @Moody_Mudskipper provides more flexibility, e.g., to choose strings that occur exactly twice. An alternative (analogous to, but probably faster than, the table() solution when MyString is long) is to create an index into the unique strings, find the number of times each unique string is matched (tabulate(...) == 1), and use that to subset the unique strings:
UString = unique(MyString)
UString[ tabulate(match(MyString, UString)) == 1 ]
or save the need to create UString
MyString[ which(tabulate(match(MyString, MyString)) == 1) ]
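The count-based approach generalises beyond keeping singletons; for instance, a sketch of selecting the strings that occur exactly twice (here "aaa" and "ddd"):
UString[ tabulate(match(MyString, UString)) == 2 ]
[1] "aaa" "ddd"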
Alternative: sort and then find runs of length 1.
r = rle(sort(MyString))
r$values[ r$lengths == 1 ]
For performance, here are some functions implementing the various solutions
f0 = function(x) x[ !x %in% x[duplicated(x)] ]
f1 = function(x) setdiff( x, x[duplicated(x)] )
f2 = function(x) { ux = unique(x); ux[ tabulate(match(x, ux)) == 1 ] }
f3 = function(x) x[ which( tabulate( match(x, x) ) == 1 ) ]
f4 = function(x) { r = rle(sort(x)); r$values[ r$lengths == 1] }
f5 = function(x) { x = table(x); names(x)[x==1] }
f6 = function(x) x[ !duplicated(x) & !duplicated(x, fromLast = TRUE) ]
Evidence that they produce identical results:
> identical(f0(x), f1(x))
[1] TRUE
> identical(f0(x), f2(x))
[1] TRUE
> identical(f0(x), f3(x))
[1] TRUE
> identical(f0(x), f4(x))
[1] TRUE
> identical(f0(x), f5(x))
[1] TRUE
> identical(f0(x), f6(x))
[1] TRUE
f5() (also the original implementation) fails for x = character(0)
> f1(character(0))
character(0)
> f5(character(0))
NULL
f4() and f5() return values in alphabetical order, whereas the others preserve the order in the input, like unique(). All methods but f5() work with vectors of other type, e.g., integer() (f5() always returns a character vector, the others return a vector with the same type as the input). f4() and f5() do not recognize unique occurrences of NA.
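A quick illustration of these differences, using a small input made up for this purpose:
y = c("d", "b", NA, "b")
f0(y)    # "d" NA  (input order preserved; the unique NA is kept)
f4(y)    # "d"     (alphabetical order; sort() silently drops the NA)
f5(y)    # "d"     (same as f4(), and always a character vector)
f0(1:3)  # 1 2 3 as integer, whereas f5(1:3) returns "1" "2" "3"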
And timings:
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x))
Unit: microseconds
expr min lq mean median uq max neval
f0(x) 9.195 10.9730 12.35724 11.8120 13.0580 29.100 100
f1(x) 20.471 22.6625 50.15586 24.6750 25.9915 2600.307 100
f2(x) 13.708 15.2265 58.58714 16.8180 18.4685 4180.829 100
f3(x) 7.533 8.8775 52.43730 9.9855 11.0060 4252.063 100
f4(x) 74.333 79.4305 124.26233 83.1505 87.4455 4091.371 100
f5(x) 147.744 154.3080 196.05684 158.4880 163.6625 3721.522 100
f6(x) 12.458 14.2335 58.11869 15.4805 17.0440 4250.500 100
Here's performance with 10,000 unique words
> x = readLines("/usr/share/dict/words", 10000)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x), times = 10)
Unit: microseconds
expr min lq mean median uq max neval
f0(x) 848.086 871.359 880.8841 873.637 899.669 916.528 10
f1(x) 1440.904 1460.704 1556.7154 1589.405 1607.048 1640.347 10
f2(x) 2143.997 2257.041 2288.1878 2288.329 2334.494 2372.639 10
f3(x) 1420.144 1548.055 1547.8093 1562.927 1596.574 1601.176 10
f4(x) 11829.680 12141.870 12369.5407 12311.334 12716.806 12952.950 10
f5(x) 15796.546 15833.650 16176.2654 15858.629 15913.465 18604.658 10
f6(x) 1219.036 1356.807 1354.3578 1363.276 1372.831 1407.077 10
And with substantial duplication
> x = sample(head(x, 1000), 10000, TRUE)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x))
Unit: milliseconds
expr min lq mean median uq max neval
f0(x) 1.914699 1.922925 1.992511 1.945807 2.030469 2.246022 100
f1(x) 1.888959 1.909469 2.097532 1.948002 2.031083 5.310342 100
f2(x) 1.396825 1.404801 1.447235 1.420777 1.479277 1.820402 100
f3(x) 1.248126 1.257283 1.295493 1.285652 1.329139 1.427220 100
f4(x) 24.075280 24.298454 24.562576 24.459281 24.700579 25.752481 100
f5(x) 4.044137 4.120369 4.307893 4.174639 4.283030 7.740830 100
f6(x) 1.221024 1.227792 1.264572 1.243201 1.295888 1.462007 100
f0() seems to be the speed winner when duplicates are rare
> x = readLines("/usr/share/dict/words", 100000)
> microbenchmark(f0(x), f1(x), f3(x), f6(x))
Unit: milliseconds
expr min lq mean median uq max neval
f0(x) 11.03298 11.17124 12.17688 11.36114 11.62769 19.83124 100
f1(x) 21.16154 21.33792 22.76237 21.67234 22.26473 31.99544 100
f3(x) 21.15801 21.49355 22.60749 21.77821 22.54203 31.17288 100
f6(x) 18.72260 18.97623 20.29060 19.46875 19.94892 28.17551 100
f3() and f6() look correct and fast; f6() is probably easier to understand (but only handles the special case of keeping words that occur exactly once).

Related

Is there a way to find the index of the last element in a vector that has an actual value (i.e. not NA) in R

This is my vector
x <- c("1", "1", "PNP004", "10", "10", NA, NA)
I need to find a way to return the index/value of the final element that is not NA.
Does anyone know a good way of doing this?
Any help is appreciated!
Ind <- max(which(!is.na(yourvec)))
yourvec[Ind]
Using dplyr:
dplyr::last(which(!is.na(yourvec)))
You can use tail on the result of which applied to !is.na of your vector.
tail(which(!is.na(x)), 1)
#[1] 5
Or a for loop.
idxLNNA <- function(x) {
    if (length(x) > 0) {
        for (i in length(x):1) if (!is.na(x[i])) break
        if (i == 1 & is.na(x[i])) 0 else i
    } else 0
}
idxLNNA(x)
#[1] 5
Or using cumsum with which.max, which will return 1 in case there are only NAs.
which.max(cumsum(!is.na(x)))
#[1] 5
Or subtracting the position of the first non-NA in the reversed vector from the length of x.
length(x) - which.min(rev(is.na(x))) + 1 #Will fail in case of only NAs
#length(x) - match(FALSE, is.na(rev(x))) + 1 #Alternative
#[1] 5
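If the all-NA (or zero-length) edge case matters, a small wrapper (the name lastNonNA is made up here) that falls back to NA avoids the failures noted above:
lastNonNA <- function(x) {
    idx <- which(!is.na(x))
    if (length(idx)) idx[length(idx)] else NA_integer_
}
lastNonNA(x)          # 5
lastNonNA(c(NA, NA))  # NA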
Benchmark:
library(microbenchmark)
fun <- alist(Dason = max(which(!is.na(x)))
             , juljo = dplyr::last(which(!is.na(x)))
             , GKiTail = tail(which(!is.na(x)), 1)
             , GKiCum = which.max(cumsum(!is.na(x)))
             , GKiRev = length(x) - which.min(rev(is.na(x))) + 1
             , GKiMatch = length(x) - match(FALSE, is.na(rev(x))) + 1
             , GKiFor = idxLNNA(x)
             )
x <- numeric(1e6)
microbenchmark(list = fun, control=list(order="block"))
#Unit: microseconds
# expr min lq mean median uq max neval cld
# Dason 4855.744 5740.1355 7941.21809 7082.5535 7671.371 107165.201 100 bc
# juljo 4145.322 4616.7815 5756.38147 6134.9200 6625.008 7378.724 100 b
# GKiTail 4082.716 4434.3880 5576.70509 6051.1465 6489.966 7433.579 100 b
# GKiCum 6552.213 7445.1525 8632.12253 8988.8700 9420.481 16791.845 100 c
# GKiRev 4005.929 4138.4735 5845.70457 4212.7470 5851.034 101665.685 100 b
# GKiMatch 5180.600 5483.8545 7507.82723 5998.2760 7373.458 108327.967 100 bc
# GKiFor 1.541 1.5775 2.16462 1.6145 1.724 20.436 100 a
x <- rep(NA, 1e6) #Dason, GKiCum and GKiRev Fail
microbenchmark(list = fun[-c(1,4,5)], control=list(order="block"))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# juljo 3.011272 3.076340 3.525396 3.111676 3.494768 6.367839 100 a
# GKiTail 2.942336 3.014327 3.529691 3.063891 3.809653 6.136984 100 a
# GKiMatch 4.928626 4.975369 7.490588 5.039941 6.823780 98.194653 100 b
# GKiFor 155.078444 159.314918 163.706542 160.168266 163.464146 258.136977 100 c
x <- numeric(0) #Dason Fails
microbenchmark(list = fun[-1], control=list(order="block"))
Unit: nanoseconds
# expr min lq mean median uq max neval cld
# juljo 26794 27324.0 28694.75 27640.0 27933.0 120143 100 d
# GKiTail 6746 7027.5 7396.45 7206.5 7432.5 21898 100 c
# GKiCum 869 880.0 947.72 890.0 948.0 3403 100 a
# GKiRev 2466 2527.0 2657.99 2565.5 2652.0 8071 100 b
# GKiMatch 2739 2807.5 2919.78 2862.5 2935.5 5651 100 b
# GKiFor 492 512.5 671.74 537.5 604.5 9088 100 a

Iterate through the column and count the rows satisfying the condition in R

Trying to write a for loop to determine the number of schools with room costs (column 34) higher than board costs (column 23).
numrows <- dim(schools)[1]
for (ii in 1:numrows) {
    if (schools[ii, 34] > schools[ii, 23], na.rm = TRUE) {
        nrow(numrows)
    }
}
I'm getting the following error
Error in if (schools[ii, 34] > schools[ii, 23]) { :
missing value where TRUE/FALSE needed
I did notice that some of the board costs are missing, and I'd like to omit those from the comparison. Also, I'm expecting just the number of rows that satisfy the condition.
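For reference, the vectorised one-liner that answers the question directly (a sketch, assuming columns 34 and 23 are numeric):
# count rows where room cost exceeds board cost, ignoring rows with NA in either column
sum(schools[, 34] > schools[, 23], na.rm = TRUE)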
To demonstrate why the vectorised form is preferable, here is a simple example based on a 10,000-row sample data.frame
set.seed(2018)
df <- data.frame(one = runif(10^4), two = runif(10^4))
Running a microbenchmark analysis
library(microbenchmark)
res <- microbenchmark(
    vectorised = sum(df[, 1] > df[, 2]),
    for_loop = {
        ss <- 0
        for (i in seq_len(nrow(df))) if (df[i, 1] > df[i, 2]) ss <- ss + 1
        ss
    })
res
# Unit: microseconds
# expr min lq mean median uq
# vectorised 59.681 65.13 78.33118 72.8305 77.9195
# for_loop 346250.957 359535.08 398508.54996 379421.2305 426452.4265
# max neval
# 152.172 100
# 608490.869 100
library(ggplot2)
autoplot(res)
Notice the difference of nearly four orders of magnitude (a factor of several thousand!) between the for loop and the vectorised operation. Neither surprising nor interesting.
The structure of the data leading to the error
Error in if (schools[ii, 34] > schools[ii, 23]) { :
missing value where TRUE/FALSE needed
occurs when one or both of the values in the comparison is NA, because the NA propagates through the comparison x > y, e.g.,
> test = 1 > NA
> test
[1] NA
and the flow control if (test) {} can't determine whether the test is TRUE (and so the code should be executed) or FALSE
> if (test) {}
Error in if (test) { : missing value where TRUE/FALSE needed
A simple vectorized solution isn't good enough
> set.seed(123)
> n = 10; x = sample(n); y = sample(n); y[5] = NA
> sum(x > y)
[1] NA
though the 'fix' is obvious and inexpensive
> sum(x > y, na.rm = TRUE)
[1] 3
The for loop also fails, and it is not possible (as attempted in the original question) to simply add an na.rm = TRUE clause to the if statement
s = 0
for (i in seq_along(x)) {
    if (x[i] > y[i], na.rm = TRUE)
        s <- s + 1
}
s
because this is not syntactically valid
Error: unexpected ',' in:
"for (i in seq_along(x)) {
if (x[i] > y[i],"
so a more creative solution needs to be found, e.g., testing whether the value of the comparison is actually TRUE
s <- 0
for (i in seq_along(x)) {
    if (isTRUE(x[i] > y[i]))
        s <- s + 1
}
s
Of course it is not useful to compare the performance of the incorrect code; one needs to write the correct solutions first
f1 <- function(x, y)
    sum(x > y, na.rm = TRUE)

f2 <- function(x, y) {
    s <- 0
    for (i in seq_along(x))
        if (isTRUE(x[i] > y[i]))
            s <- s + 1
    s
}
f1() seems much more compact and readable compared to f2(), but we need to make sure the results are sensible
> x > y
[1] FALSE TRUE FALSE FALSE NA TRUE FALSE FALSE FALSE TRUE
> f1(x, y)
[1] 3
and the same
> identical(f1(x, y), f2(x, y))
[1] FALSE
Hey wait, what's going on? They look the same?
> f2(x, y)
[1] 3
Actually, the results are numerically equal, but f1() returns an integer value whereas f2() returns a numeric
> all.equal(f1(x, y), f2(x, y))
[1] TRUE
> class(f1(x, y))
[1] "integer"
> class(f2(x, y))
[1] "numeric"
and if we're comparing performance we really need the results to be identical -- no sense comparing apples and oranges. We can update f2() to return an integer by making sure the sum s is always an integer -- use a suffix L, e.g., 0L, to create an integer value
> class(0)
[1] "numeric"
> class(0L)
[1] "integer"
and make sure an integer 1L is added to s on each successful iteration
f2a <- function(x, y) {
    s <- 0L
    for (i in seq_along(x))
        if (isTRUE(x[i] > y[i]))
            s <- s + 1L
    s
}
We then have
> f2a(x, y)
[1] 3
> identical(f1(x, y), f2a(x, y))
[1] TRUE
and are now in a position to compare performance
> microbenchmark(f1(x, y), f2a(x, y))
Unit: microseconds
expr min lq mean median uq max neval
f1(x, y) 1.740 1.8965 2.05500 2.023 2.0975 6.741 100
f2a(x, y) 17.505 18.2300 18.67314 18.487 18.7440 34.193 100
Certainly f2a() is much slower, but for this size problem since the unit is 'microseconds' maybe this doesn't matter -- how do the solutions scale with problem size?
> set.seed(123)
> x = lapply(10^(3:7), sample)
> y = lapply(10^(3:7), sample)
> f = f1; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]))
Unit: microseconds
expr min lq mean median uq max neval
f(x[[1]], y[[1]]) 9.655 9.976 10.63951 10.3250 11.1695 17.098 100
f(x[[2]], y[[2]]) 76.722 78.239 80.24091 78.9345 79.7495 125.589 100
f(x[[3]], y[[3]]) 764.034 895.075 914.83722 908.4700 922.9735 1106.027 100
> f = f2a; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]))
Unit: milliseconds
expr min lq mean median uq
f(x[[1]], y[[1]]) 1.260307 1.296196 1.417762 1.338847 1.393495
f(x[[2]], y[[2]]) 12.686183 13.167982 14.067785 13.923531 14.666305
f(x[[3]], y[[3]]) 133.639508 138.845753 144.152542 143.349102 146.913338
max neval
3.345009 100
17.713220 100
165.990545 100
They both scale linearly (not surprising), but even for lengths of 100,000 f2a() doesn't seem too bad -- 1/6th of a second -- and might be a candidate in a situation where vectorization would obfuscate the code rather than clarify it. The cost of extracting individual elements from columns of a data.frame changes this calculus, but also points to the value of operating on atomic vectors rather than complicated data structures.
For what it's worth one can think of worse implementations, especially
f3 <- function(x, y) {
    s <- logical(0)
    for (i in seq_along(x))
        s <- c(s, isTRUE(x[i] > y[i]))
    sum(s)
}
which scales quadratically
> f = f3; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]), times = 1)
Unit: milliseconds
expr min lq mean median
f(x[[1]], y[[1]]) 7.018899 7.018899 7.018899 7.018899
f(x[[2]], y[[2]]) 371.248504 371.248504 371.248504 371.248504
f(x[[3]], y[[3]]) 42528.280139 42528.280139 42528.280139 42528.280139
uq max neval
7.018899 7.018899 1
371.248504 371.248504 1
42528.280139 42528.280139 1
(because c(s, ...) copies all of s to add one element to it) and is a pattern found very often in people's code.
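For completeness, a sketch of the usual fix: pre-allocate the result vector and fill it in place, which restores linear scaling (f3a is a name introduced here for illustration).
f3a <- function(x, y) {
    s <- logical(length(x))
    for (i in seq_along(x))
        s[i] <- isTRUE(x[i] > y[i])
    sum(s)
}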
A second common slowdown is use of complicated data structures (like the data.frame) rather than simple data structures (like atomic vectors), e.g., comparing
f4 <- function(df) {
    s <- 0L
    x <- df[[1]]
    y <- df[[2]]
    for (i in seq_len(nrow(df))) {
        if (isTRUE(x[i] > y[i]))
            s <- s + 1L
    }
    s
}

f5 <- function(df) {
    s <- 0L
    for (i in seq_len(nrow(df))) {
        if (isTRUE(df[i, 1] > df[i, 2]))
            s <- s + 1L
    }
    s
}
with
> df <- Map(data.frame, x, y)
> identical(f1(x[[1]], y[[1]]), f4(df[[1]]))
[1] TRUE
> identical(f1(x[[1]], y[[1]]), f5(df[[1]]))
[1] TRUE
> microbenchmark(f1(x[[1]], y[[1]]), f2a(x[[1]], y[[1]]), f4(df[[1]]), f5(df[[1]]), times = 10)
Unit: microseconds
expr min lq mean median uq
f1(x[[1]], y[[1]]) 10.042 10.324 13.3511 13.4425 14.690
f2a(x[[1]], y[[1]]) 1310.186 1316.869 1480.1526 1344.8795 1386.322
f4(df[[1]]) 1329.307 1336.869 1363.4238 1358.7080 1365.427
f5(df[[1]]) 37051.756 37106.026 38187.8278 37876.0940 38416.276
max neval
20.753 10
2676.030 10
1439.402 10
42292.588 10

Count number of palindromes within a string

I have written the below code to count the number of palindromic strings in a given string:
countPalindromes <- function(str) {
    len <- nchar(str)
    count <- 0
    for (i in 1:len) {
        for (j in i:len) {
            subs <- substr(str, i, j)
            rev <- paste(rev(substring(subs, 1:nchar(subs), 1:nchar(subs))), collapse = "")
            if (subs == rev) {
                count <- count + 1
            }
        }
    }
    count
}
This works correctly, but it needs to be optimized so that it runs faster.
Please suggest some ways to optimize this piece of code.
Here's a solution that uses the wonderful stringi package - just as Andre suggested - together with a wee bit of vectorization.
cp <- function(s) {
    lenstr <- stringi::stri_length(s)  # Get the length
    res <- sapply(1:lenstr, function(i) {
        # Get all substrings starting at position i
        sub_string <- stringi::stri_sub(s, i, i:lenstr)
        # Count the palindromic ones
        sum(sub_string == stringi::stri_reverse(sub_string))
    })
    sum(res)
}
This should give the same result as your function
> cp("enafdemderredmedfane")
[1] 30
> countPalindromes("enafdemderredmedfane")
[1] 30
There is not much speedup for short strings, but for longer strings you can really see a benefit:
> microbenchmark::microbenchmark(countPalindromes("howdoyoudo"), cp("howdoyoudo"))
Unit: microseconds
expr min lq mean median uq max neval cld
countPalindromes("howdoyoudo") 480.979 489.6180 508.9044 494.9005 511.201 662.605 100 b
cp("howdoyoudo") 156.117 163.1555 175.4785 169.5640 179.993 324.145 100 a
Compared to
> microbenchmark::microbenchmark(countPalindromes("enafdemderredmedfane"), cp("enafdemderredmedfane"))
Unit: microseconds
expr min lq mean median uq max neval cld
countPalindromes("enafdemderredmedfane") 2031.565 2115.0305 2475.5974 2222.354 2384.151 6696.484 100 b
cp("enafdemderredmedfane") 324.991 357.6055 430.8334 387.242 478.183 1298.390 100 a
Working with a vector is faster. I am thinking of eliminating the double for loop but cannot find an efficient way (a fully vectorised sketch is added after the update below).
countPalindromes_new <- function(str) {
    len <- nchar(str)
    strsp <- strsplit(str, "")[[1]]
    count <- 0
    for (i in 1:len) {
        for (j in i:len) {
            if (all(strsp[i:j] == strsp[j:i])) {
                count <- count + 1
            }
        }
    }
    count
}
> microbenchmark::microbenchmark(countPalindromes("howdoyoudo"), cp("howdoyoudo"), countPalindromes_new("howdoyoudo"))
Unit: microseconds
expr min lq mean median uq max neval
countPalindromes("howdoyoudo") 869.121 933.1215 1069.68001 963.201 1022.081 6712.751 100
cp("howdoyoudo") 192.000 202.8805 243.11972 219.308 258.987 477.441 100
countPalindromes_new("howdoyoudo") 49.068 53.3340 62.32815 57.387 63.574 116.481 100
> microbenchmark::microbenchmark(countPalindromes("enafdemderredmedfane"), cp("enafdemderredmedfane"), countPalindromes_new("enafdemderredmedfane"))
Unit: microseconds
expr min lq mean median uq max neval
countPalindromes("enafdemderredmedfane") 3578.029 3800.9620 4170.0888 3987.416 4173.6550 10205.445 100
cp("enafdemderredmedfane") 391.254 438.4010 609.8782 481.708 534.6135 6116.270 100
countPalindromes_new("enafdemderredmedfane") 200.534 214.1875 235.3501 223.148 245.5475 448.854 100
UPDATE (new version without the length-1 comparisons):
countPalindromes_new2 <- function(str) {
    len <- nchar(str)
    strsp <- strsplit(str, "")[[1]]
    count <- len
    for (i in 1:(len - 1)) {
        for (j in (i + 1):len) {
            if (all(strsp[i:j] == strsp[j:i])) {
                count <- count + 1
            }
        }
    }
    count
}
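As mentioned above, here is a sketch of one way to drop the explicit double loop entirely: enumerate all (i, j) substring pairs up front and check them in a single vectorised comparison (cp_vec is a name introduced here; it materialises all n^2 index pairs, so memory grows quadratically with string length). It counts exactly the same substrings as the original double loop.
cp_vec <- function(s) {
    n <- nchar(s)
    ij <- expand.grid(i = seq_len(n), j = seq_len(n))
    ij <- ij[ij$j >= ij$i, ]                  # keep only pairs with j >= i
    subs <- substring(s, ij$i, ij$j)          # all substrings at once
    sum(subs == stringi::stri_reverse(subs))  # count the palindromic ones
}
cp_vec("enafdemderredmedfane")
# [1] 30  (matches countPalindromes() above)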
Simply: normally I'm against using new libraries everywhere. But stringi is THE library for working with strings in R.
string_vec <- c("anna","nothing","abccba")
string_rev <- stringi::stri_reverse(string_vec)
sum(string_vec == string_rev)
# evaluates to 2 ("anna" and "abccba" are palindromes)

Count the number of unique characters in a string

I have a dataframe where one of the columns is of type string.
I would like to count the number of unique/distinct characters in that string.
eg.
"banana" -> 3
'he' -> 2
A reproducible example:
I have a data frame where a column is type string. I would need to filter out those rows where the string has only one distinct character.
col1 col2 col3
new york
qqqq
melbourne
aaaaaa
I would need to have a final data frame like
col1 col2 col3
new york
melbourne
So delete those rows completely.
This makes no assumption about "characters" being in letters and avoids making R data structures:
library(inline)

.char_unique_code <- "
    std::vector < std::string > s = as< std::vector < std::string > >(x);
    unsigned int input_size = s.size();
    std::vector < std::string > chrs(input_size);
    for (unsigned int i = 0; i < input_size; i++) {
        std::string t = s[i];
        for (std::string::iterator chr = t.begin(); chr != t.end(); ++chr) {
            if (chrs[i].find(*chr) == std::string::npos) {
                chrs[i] += *chr;
            }
        }
    }
    return(wrap(chrs));
"

char_unique <-
    rcpp(sig = signature(x = "std::vector < std::string >"),
         body = .char_unique_code,
         includes = c("#include <string>",
                      "#include <iostream>"))
nchar(char_unique("banana"))
## [1] 3
Why avoid making R lists?
library(stringr)
library(microbenchmark)
library(ggplot2)
str_char_ct_unique <- function(x) sum(!!str_count(x, letters))
char_ct_unique <- function(x) nchar(char_unique(x))
r_char_ct_unique <- function(x) length(unique(strsplit(x, "")[[1]]))
microbenchmark(stringr=str_char_ct_unique("banana"),
rcpp=char_ct_unique("banana"),
r=r_char_ct_unique("banana"),
times=1000) -> mb
## Unit: microseconds
## expr min lq mean median uq max neval cld
## stringr 125.978 129.1765 139.271061 130.9415 139.3870 334.563 1000 c
## rcpp 1.458 2.0160 3.002184 2.6345 3.1365 32.244 1000 a
## r 4.797 6.1070 8.292847 7.3380 8.0505 86.709 1000 b
Let's make a vectorized version of Cath's pure R solution (not bothering with the other one since it's way too constrained) and compare against a vector of small random strings:
library(random)
library(purrr)
char_ct_unique <- function(x) nchar(char_unique(x))
r_char_ct_unique <- function(x) map_int(map(x, function(x) unique(strsplit(x, "")[[1]])), length)
tst <- as.vector(randomStrings(n=100, len=20, unique=FALSE))
sum(char_ct_unique(tst) == r_char_ct_unique(tst))
## [1] 100
microbenchmark(rcpp=char_ct_unique(tst),
r=r_char_ct_unique(tst),
times=1000)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## rcpp 53.643 56.2375 66.69311 60.2740 68.178 250.992 1000 a
## r 683.420 759.4070 952.14407 822.8905 922.710 6513.508 1000 b
And, now for the 10,000 character random string:
dat <- readLines("https://gist.githubusercontent.com/hrbrmstr/f80b157b383134b37fb3/raw/534b4c79e7c51710c6db6961bc5dc5ec25c4242b/gistfile1.txt")
digest::digest(dat, "sha1", serialize=FALSE)
## [1] "6c6695dd2f314762c81e6e6891ec1c138a4f3a08"
nchar(dat)
## [1] 10000
char_ct_unique(dat) == r_char_ct_unique(dat)
## [1] TRUE
microbenchmark(rcpp=char_ct_unique(dat),
r=r_char_ct_unique(dat),
times=1000)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## rcpp 73.801 110.681 122.9091 118.330 139.373 308.602 1000 a
## r 377.556 430.703 533.9120 448.631 492.466 4275.568 1000 b
I forgot to do David's "fixed" version:
f_r_char_ct_unique <- function(x) map_int(map(x, function(x) unique(strsplit(x, "", fixed=TRUE)[[1]])), length)
and, let's make it more interesting:
dat <- c(dat, toupper(dat), tolower(dat))
microbenchmark(rcpp=char_ct_unique(dat),
r=r_char_ct_unique(dat),
fr=f_r_char_ct_unique(dat),
times=1000)
## Unit: microseconds
## expr min lq mean median uq max neval
## rcpp 218.299 284.143 331.319 332.281 358.1215 696.907 1000
## r 1266.976 1442.460 1720.320 1494.167 1634.7870 5896.685 1000
## fr 1260.027 1444.298 1769.664 1501.416 1652.8895 78457.729 1000
We can use str_count
library(stringr)
sum(!!str_count(str1, letters))
#[1] 3
Update
Using the new dataset
i1 <- !sapply(df1$col1, function(x) any(str_count(x, letters)>1))
df1[i1,,drop=FALSE]
data
str1 <- "banana"
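Tying this back to the filtering question: a base-R sketch (assuming, as above, the strings live in col1 of df1) that keeps only the rows whose string has more than one distinct character:
keep <- vapply(strsplit(as.character(df1$col1), "", fixed = TRUE),
               function(ch) length(unique(ch)) > 1, logical(1))
df1[keep, , drop = FALSE]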

Match a vector to a list of vectors

I have a list of vectors lis which I need to match to another vector vec
lis <- list(c(2,0,0),c(1,1,0), c(1,0,1), c(0,2,0), c(0,1,1), c(0,0,2))
vec <- c(1,1,0)
So either I would get a logical output
[1] FALSE TRUE FALSE FALSE FALSE FALSE
Or just the position within lis of the match
I've been trying things along these lines:
unlist(lis) %in% vec
but the problem is that the position of each number is important, i.e. I need to distinguish between c(1,1,0) and c(1,0,1), which I haven't been able to do. I would like to avoid for loops as this needs to be quite efficient (fast).
The answers by @agstudy and @Julius involve a loop over the (long) lis object; here's an alternative assuming that all elements of lis are the same length as vec, allowing a vector comparison against the unlisted reference
shortloop <- function(x, lst)
colSums(matrix(unlist(lst) == x, length(x))) == length(x)
which is fast when lis is long compared to vec.
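To see what the comparison builds, here is the intermediate matrix for the toy data: one column per element of lis, TRUE wherever that element matches vec position by position, so a column of all TRUEs marks a full match.
matrix(unlist(lis) == vec, length(vec))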
longloop <- function(x, lst)
sapply(lst, identical, x)
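A quick sanity check on the toy data from the question (both should give the same logical vector shown above):
identical(shortloop(vec, lis), longloop(vec, lis))
# TRUE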
l1 = rep(lis, 1000)
microbenchmark(shortloop(vec, l1), longloop(vec, l1))
## Unit: microseconds
## expr min lq median uq max neval
## shortloop(vec, l1) 793.009 808.2175 896.299 905.8795 1058.79 100
## longloop(vec, l1) 18732.338 21649.0770 21797.646 22107.7805 24788.86 100
Interestingly, using for is not that bad from a performance perspective compared to the implicit loop in sapply (though more complicated and error-prone)
longfor <- function(x, lst) {
    res <- logical(length(lst))
    for (i in seq_along(lst))
        res[[i]] <- identical(x, lst[[i]])
    res
}
library(compiler)
longforc = cmpfun(longfor)
microbenchmark(longloop(vec, l1), longfor(vec, l1), longforc(vec, l1))
## Unit: milliseconds
## expr min lq median uq max neval
## longloop(vec, l1) 18.92824 21.20457 21.71295 21.80009 23.23286 100
## longfor(vec, l1) 23.64756 26.73481 27.43815 27.61699 28.33454 100
## longforc(vec, l1) 17.40998 18.28686 20.47844 20.75303 21.49532 100
sapply(lis, identical, vec)
# [1] FALSE TRUE FALSE FALSE FALSE FALSE
Benchmark:
l1 <- list(1:1000)[rep(1, 10000)]
v1 <- sample(1000)
AG <- function() sapply(l1,function(x)all(x==v1))
J <- function() sapply(l1, identical, v1)
microbenchmark(AG(), J())
# Unit: milliseconds
# expr min lq median uq max neval
# AG() 76.42732 84.26958 103.99233 111.62671 148.2824 100
# J() 32.14965 37.54198 47.34538 50.93195 104.4036 100
sapply(lis,function(x)all(x==vec))
[1] FALSE TRUE FALSE FALSE FALSE FALSE

Resources