Difference of two character vectors with substring - r

I have two character vectors:
a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")
I want to remove the elements of b that contain any of the values in a as a substring, e.g.
grepl("da","dada") # TRUE
How would you go about doing this efficiently?

We can paste the elements of 'a' into a single string with | as the delimiter, use that as the pattern in grepl, and negate (!) the result to subset 'b'.
b[!grepl(paste(a, collapse="|"), b)]
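One caveat with the pattern above: grepl treats the pasted string as a regular expression, so an element of a containing a metacharacter such as . or | would not be matched literally. A minimal sketch of a literal-match alternative, assuming a is non-empty; the names a2, b2, hit and res are made up for illustration:

```r
a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

# One logical vector per element of a (literal matching), combined with OR
hit <- Reduce(`|`, lapply(a, grepl, x = b, fixed = TRUE))
res <- b[!hit]

# Hypothetical data showing why literal matching can matter:
a2 <- "a.c"
b2 <- c("abc", "a.c")
regex_hit   <- grepl(paste(a2, collapse = "|"), b2)  # "." matches any character
literal_hit <- grepl(a2, b2, fixed = TRUE)           # only the literal "a.c"
```

Here res holds the elements of b without a match, while regex_hit flags both strings in b2 and literal_hit flags only the second.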

And another solution using a simple for loop:
sel <- rep(FALSE, length(b))
for (i in seq_along(a)) {
sel <- sel | grepl(a[i], b, fixed = TRUE)
}
b[!sel]
Not as elegant as some of the other solutions (especially the one by akrun), but it shows that a for loop isn't always as slow in R as people believe:
fun1 <- function(a, b) {
sel <- rep(FALSE, length(b))
for (i in seq_along(a)) {
sel <- sel | grepl(a[i], b, fixed = TRUE)
}
b[!sel]
}
fun2 <- function(a, b) {
b[!apply(sapply(a, function(x) grepl(x,b, fixed=TRUE)),1,sum)]
}
fun3 <- function(a, b) {
b[-which(sapply(a, grepl, b, fixed=TRUE), arr.ind = TRUE)[, "row"]]
}
fun4 <- function(a, b) {
b[!grepl(paste(a, collapse="|"), b)]
}
library(stringr)
fun5 <- function(a, b) {
b[!sapply(b, function(u) any(str_detect(u,a)))]
}
a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")
b <- rep(b, length.out = 1E3)
library(microbenchmark)
microbenchmark(fun1(a, b), fun2(a, b), fun3(a,b), fun4(a,b), fun5(a,b))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# fun1(a, b) 389.630 399.128 408.6146 406.007 411.7690 540.969 100 a
# fun2(a, b) 5274.143 5445.038 6183.3945 5544.522 5762.1750 35830.143 100 c
# fun3(a, b) 2568.734 2629.494 2691.8360 2686.552 2729.0840 2956.618 100 b
# fun4(a, b) 482.585 511.917 530.0885 528.993 541.6685 779.679 100 a
# fun5(a, b) 53846.970 54293.798 56337.6531 54861.585 55184.3100 132921.883 100 d

You could try the following:
b[!(+(apply(sapply(a, function(x) grepl(x,b)),1,sum)) > 0)]
[1] "zyc" "ulk" "mae"
'Peeling' the previous call from the inside, the results are as follows. First, obtain a matrix of matches from the grepl call (with sapply):
sapply(a, function(x) grepl(x,b))
# da ba cs dd ek
#[1,] FALSE FALSE FALSE FALSE FALSE
#[2,] FALSE FALSE FALSE FALSE FALSE
#[3,] FALSE FALSE FALSE FALSE FALSE
#[4,] FALSE FALSE TRUE FALSE FALSE
#[5,] FALSE FALSE FALSE TRUE FALSE
#[6,] TRUE FALSE FALSE FALSE FALSE
Note that the columns are the elements of a and the rows are the elements of b.
Then, apply the function sum over the rows (in R, TRUE is 1 and FALSE is 0):
apply(sapply(a, function(x) grepl(x,b)),1,sum)
#[1] 0 0 0 1 1 1
Note that the row sums can be greater than 1 (when an element of b matches more than one element of a), so the result is converted to a logical vector by wrapping the previous call in:
+() > 0
With this, we can subset ([) b, but since we want the elements that did not match, we negate with the ! operator.
#full code:
step.one <- sapply(a, function(x) grepl(x,b))
step.two <- apply(step.one,1,sum)
step.three <- +(step.two > 0)
step.four <- !step.three
#finally:
b[step.four]
As David shows in the comments, this is a much more elegant approach:
b[-which(sapply(a, grepl, b), arr.ind = TRUE)[, "row"]]
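One thing to watch with the -which() idiom: if nothing in b matches, which() returns an empty index and b[-integer(0)] selects nothing rather than everything. A small sketch of the pitfall and a rowSums() alternative, assuming b has more than one element so that sapply returns a matrix (b_clean, m, dropped and kept are illustrative names):

```r
a <- c("da", "ba", "cs", "dd", "ek")
b_clean <- c("zyc", "ulk", "mae")   # no element contains any value of a

m <- sapply(a, grepl, b_clean)      # rows: elements of b_clean, columns: elements of a

# Pitfall: the negative index is empty, so nothing is kept
dropped <- b_clean[-which(m, arr.ind = TRUE)[, "row"]]

# Logical indexing keeps everything when there are no matches
kept <- b_clean[rowSums(m) == 0]
```

With this data, dropped comes back empty while kept still contains all three strings.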

Related

Function to count of consecutive digits in a string vector

I would like to create a function that takes a string object of at least 1 element and contains the numbers 2 through 5, and determine if there are consecutive digits of at least N length where N is the actual digit value.
If so, return the string true, otherwise return the string false.
For Example:
Input: "555123"
Output: false
Because 5 is found only 3 times instead of 5.
Or:
Input: "57333"
Output: true
Because 3 is found exactly 3 times.
Try rle + strsplit if you are working with base R
f <- function(s) {
with(
rle(unlist(strsplit(s, ""))),
any(as.numeric(values) <= lengths & lengths > 1)
)
}
and you will see
> f("555123")
[1] FALSE
> f("57333")
[1] TRUE
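To see what rle contributes, and how the check might be applied to several strings at once, here is a sketch. It assumes the inputs contain only digit characters and, following the question, only counts runs of the digits 2 to 5; has_run is a hypothetical name.

```r
# rle() collapses consecutive repeats into (value, run length) pairs
r <- rle(strsplit("57333", "")[[1]])
# r$values are "5" "7" "3" and r$lengths are 1 1 3

has_run <- function(s) {
  vapply(strsplit(s, ""), function(chars) {
    runs <- rle(chars)
    # a digit d in 2:5 qualifies when its run is at least d long
    any(runs$values %in% as.character(2:5) &
          runs$lengths >= as.numeric(runs$values))
  }, logical(1))
}

res <- has_run(c("555123", "57333", "21112"))
# res is FALSE TRUE FALSE
```

Unlike f above, this version ignores runs of digits outside 2 to 5, so "21112" gives FALSE here, whereas f("21112") would return TRUE because of the run of three 1s.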
Late to the party but maybe still worth your while:
Data:
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
Define vector with allowed numbers:
digits <- 2:5
Define alternation pattern with multiple backreferences:
patt <- paste0("(", digits, ")\\", seq_along(digits), "{", digits - 1, "}", collapse = "|")
Input patt into str_detect:
library(stringr)
str_detect(x, patt)
[1] FALSE TRUE FALSE FALSE TRUE TRUE
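It can help to look at the pattern itself. The sketch below builds an equivalent alternation with sprintf, numbering the backreferences with seq_along(digits) so that each one points at the capture group of its own alternative (that numbering is an assumption about the intent), and uses base grepl with perl = TRUE in place of str_detect:

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# "(d)\\k{d-1}": digit d captured as group k, then repeated d - 1 more times
patt <- paste(sprintf("(%d)\\%d{%d}", digits, seq_along(digits), digits - 1L),
              collapse = "|")
patt
# [1] "(2)\\1{1}|(3)\\2{2}|(4)\\3{3}|(5)\\4{4}"

grepl(patt, x, perl = TRUE)
# [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE
```

Each alternative reads "digit d followed by d - 1 more copies of itself", so the regex only fires on runs of 2 to 5 that are at least as long as the digit's value.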
You could check if the values in table correspond to the names.
x <- c('555123', '57333')
f <- \(x) {
s <- strsplit(x, '')
lapply(s, \(x) {
tb <- table(x)
names(tb) == tb
}) |> setNames(x)
}
f(x)
# $`555123`
# x
# 1 2 3 5
# TRUE FALSE FALSE FALSE
#
# $`57333`
# x
# 3 5 7
# TRUE FALSE FALSE
Another way would be:
my_func <- function(x) {
as.numeric(unlist(strsplit(x, ""))) -> all
table(all[all %in% 2:5]) -> f
any(names(f) == f)
}
# Input <- "555123"
# (my_func(Input))
# FALSE
# Input <- "57333"
# (my_func(Input))
# TRUE

R check if list of lists contains specific list

There are three lists:
a = list(1,2)
b = list(2,3)
c = list(a,b)
The command a %in% c yields FALSE FALSE. The result I would like to see is TRUE since a is an element of list c. How do I achieve this?
Check whether each component is identical to a and return TRUE if any of those comparisons are TRUE.
any(sapply(c, identical, a))
## [1] TRUE
This should also help:
list(a) %in% c
Examples:
a = list(1,2)
b = list(2,3)
c = list(a,b)
y = list(3,4)
z = list(1)
list(a) %in% c # TRUE
list(b) %in% c # TRUE
list(y) %in% c # FALSE
list(z) %in% c # FALSE
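Why a %in% c gives FALSE FALSE: %in% is built on match(), which looks up each element of its left-hand side separately, so the elements 1 and 2 of a are compared with the two inner lists. Wrapping a in list() makes the whole list a single element to look up. A small sketch (lst is used instead of c only to avoid reusing the name of the base function):

```r
a <- list(1, 2)
b <- list(2, 3)
lst <- list(a, b)

a %in% lst               # looks up 1 and 2 individually
# [1] FALSE FALSE
list(a) %in% lst         # looks up the list a as one element
# [1] TRUE
match(list(b), lst)      # position of b inside lst
# [1] 2
```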

Replace elements of vector by vector

I want to replace a few elements of a vector with the whole of a second vector. The condition is that the replaced elements of the first vector are equal to the elements of a third vector. Here is an example:
a <- 1:10
b <- 5:7
v <- rnorm(2, mean = 1, sd = 5)
My output should be
c(a[1:4], v, a[8:10])
I have already tried
replace(a, a == b, v)
a[a == b] <- v
but with little success. Can anyone help?
The == operator is best used to compare vectors of the same length, or when one of the vectors has length 1.
Try this, and notice in neither case do you get the positional match that you desire.
> a == b
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Warning message:
In a == b : longer object length is not a multiple of shorter object length
> b == a
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Warning message:
In b == a : longer object length is not a multiple of shorter object length
Instead, use match() - this gives you the index position where there is a match in the values.
> match(b, a)
[1] 5 6 7
Then:
a <- 1:10
b <- 5:7
v <- rnorm(3, mean=1, sd=5)
a[match(b, a)] <- v
The results:
a
[1] 1.0000000 2.0000000 3.0000000 4.0000000 -4.6843669 0.9014578 -0.7601413 8.0000000
[9] 9.0000000 10.0000000
Here's another option:
a[a %in% b] <- v
Since in the example described in the OP there are three common numbers in the vectors a and b while v <- rnorm(2, mean = 1, sd = 5)
contains only 2 numbers, the vector v will be recycled and a warning will be issued.
The warning and recycling can be prevented, e.g., by defining v as
v <- rnorm(sum(a %in% b), mean = 1, sd = 5)
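If some values of b might not occur in a at all, match() returns NA for them, and those positions have to be dropped before assigning. A minimal sketch with made-up values (42 is deliberately absent from a; idx and ok are illustrative names):

```r
a <- 1:10
b <- c(5, 6, 42)              # 42 does not occur in a
v <- c(0.5, 1.5, 2.5)         # one replacement value per element of b

idx <- match(b, a)            # 5 6 NA
ok  <- !is.na(idx)
a[idx[ok]] <- v[ok]           # positions 5 and 6 now hold 0.5 and 1.5
```

Note that match(b, a) pairs v with the order of b, whereas a[a %in% b] <- v fills in the order in which the values appear in a; in the original example the two agree because b is sorted the same way as a.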

R: find vector in list of vectors

I'm working with R and my goal is to check whether a given vector is in a list of unique vectors.
The list looks like
final_states <- list(c("x" = 5, "y" = 1),
c("x" = 5, "y" = 2),
c("x" = 5, "y" = 3),
c("x" = 5, "y" = 4),
c("x" = 5, "y" = 5),
c("x" = 3, "y" = 5))
Now I want to check whether a given state is in the list. For example:
state <- c("x" = 5, "y" = 3)
As you can see, the vector state is an element of the list final_states. My idea was to check it with %in% operator:
state %in% final_states
But I get this result:
[1] FALSE FALSE
Can anyone tell me what is wrong?
If you just want to determine if the vector is in the list, try
Position(function(x) identical(x, state), final_states, nomatch = 0) > 0
# [1] TRUE
Position() basically works like match(), but on a list. If you set nomatch = 0 and check for Position > 0, you'll get a logical result telling you whether state is in final_states
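Note that identical() also compares names and storage type, which is stricter than an element-wise ==. A short sketch, where in_list is a hypothetical helper and the list is a shortened version of final_states:

```r
final_states <- list(c(x = 5, y = 1), c(x = 5, y = 3), c(x = 3, y = 5))

# Hypothetical helper: TRUE if any element of lst is identical to x
in_list <- function(x, lst) {
  any(vapply(lst, identical, logical(1), x))
}

in_list(c(x = 5, y = 3), final_states)    # same values and names
# [1] TRUE
in_list(c(5, 3), final_states)            # same values, no names
# [1] FALSE
in_list(c(x = 5L, y = 3L), final_states)  # integer instead of double
# [1] FALSE
```

If names or integer/double differences should not matter, a comparison such as all(x == y) is the looser alternative.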
"final_states" is a "list", so you could convert the "state" to list and then do
final_states %in% list(state)
#[1] FALSE FALSE TRUE FALSE FALSE FALSE
or use mapply to check whether all the elements in "state" are present in each of the list elements of "final_states" (assuming that the lengths are the same for the vector and the list elements)
f1 <- function(x,y) all(x==y)
mapply(f1, final_states, list(state))
#[1] FALSE FALSE TRUE FALSE FALSE FALSE
Or rbind the list elements to a matrix and then check whether "state" and the "rows" of "m1" are the same.
m1 <- do.call(rbind, final_states)
!rowSums(m1!=state[col(m1)])
#[1] FALSE FALSE TRUE FALSE FALSE FALSE
Or
m1[,1]==state[1] & m1[,2]==state[2]
#[1] FALSE FALSE TRUE FALSE FALSE FALSE
Update
If you need to get a single TRUE/FALSE
any(mapply(f1, final_states, list(state)))
#[1] TRUE
Or
any(final_states %in% list(state))
#[1] TRUE
Or
list(state) %in% final_states
#[1] TRUE
Or use the "faster" fmatch from fastmatch
library(fastmatch)
fmatch(list(state), final_states) >0
#[1] TRUE
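One caveat with the > 0 comparison: when there is no match, match() returns NA, so the comparison yields NA rather than FALSE. The sketch below uses base match() to show this and two ways around it; it assumes, without verifying here, that fmatch behaves the same way for a missing element. The vector absent and the shortened list are made up.

```r
final_states <- list(c(x = 5, y = 1), c(x = 5, y = 3), c(x = 3, y = 5))
absent <- c(x = 9, y = 9)                              # not in the list

res_na    <- match(list(absent), final_states) > 0     # NA, not FALSE
res_isna  <- !is.na(match(list(absent), final_states)) # FALSE
res_nomat <- match(list(absent), final_states, nomatch = 0L) > 0   # FALSE
```

Either of the last two forms gives a clean TRUE/FALSE that can be used directly in an if() condition.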
Benchmarks
#Richard Sciven's base R function is very fast compared to other solutions except the one with fmatch
set.seed(295)
final_states <- replicate(1e6, sample(1:20, 20, replace=TRUE),
simplify=FALSE)
state <- final_states[[151]]
richard <- function() {Position(function(x) identical(x, state),
final_states, nomatch = 0) > 0}
Bonded <- function(){any( sapply(final_states, identical, state) )}
akrun2 <- function() {fmatch(list(state), final_states) >0}
akrun1 <- function() {f1 <- function(x,y) all(x==y)
any(mapply(f1, final_states, list(state)))}
library(microbenchmark)
microbenchmark(richard(), Bonded(), akrun1(), akrun2(),
unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq
# richard() 35.22635 29.47587 17.49164 15.66833 14.58235
# Bonded() 109440.56885 101382.92450 55252.86141 47734.96467 44289.80309
# akrun1() 167001.23864 138812.85016 75664.91378 61417.59871 62667.94867
# akrun2() 1.00000 1.00000 1.00000 1.00000 1.00000
# max neval cld
# 14.62328 20 a
# 46299.43325 20 b
# 63890.68133 20 c
# 1.00000 20 a
Whenever I see a list object I first think of lapply. It seems to deliver the expected result with identical as the test and 'state' as the second argument:
> lapply(final_states, identical, state)
[[1]]
[1] FALSE
[[2]]
[1] FALSE
[[3]]
[1] TRUE
[[4]]
[1] FALSE
[[5]]
[1] FALSE
[[6]]
[1] FALSE
You get a possibly useful intermediate result with:
lapply(final_states, match, state)
... but it comes back as a series of position vectors, where c(1,2) is the correct result.
If you want the result to come back as a vector, say because you want to use any, then use sapply instead of lapply.
> any( sapply(final_states[-3], identical, state) )
[1] FALSE
> any( sapply(final_states, identical, state) )
[1] TRUE

The diag() function in R

Is there a way to reproduce what diag() does on a matrix without using the built-in function or iteration?
M<-matrix(1:9, ncol=3) # make a matrix
q5b<-function(M){ #function
}
I know that M[1,1], M[2,2], and M[3,3] will give me the same output as diag(M). However, I can't think of a way to do this without a for loop.
My thought process was I should have a condition where row index == column index in the Matrix then print that value. I appreciate any suggestions.
You can use the functions row and col to find the indices where the column number is identical to the row number:
row(M) == col(M)
# [,1] [,2] [,3]
# [1,] TRUE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE TRUE
M[row(M) == col(M)]
# [1] 1 5 9
Just subset based on another matrix:
> diag(M)
[1] 1 5 9
> M[matrix(rep(sequence(ncol(M)), 2), ncol = 2)]
[1] 1 5 9
The above would run into a problem in a non-square matrix, so we modify it as below.
As your function, one answer for question 5b could be:
q5b <- function(M) {
A <- sequence(ncol(M))[sequence(min(nrow(M), ncol(M)))]
M[cbind(A, A)]
}
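The same idea can be written a little more directly with seq_len(), or with a single linear index that steps through the column-major storage in strides of nrow(M) + 1. A sketch; diag_pairs and diag_linear are hypothetical names, and the matrix is assumed to have at least one row and one column:

```r
# Index with (i, i) pairs for i = 1..min(nrow, ncol)
diag_pairs <- function(M) {
  k <- seq_len(min(dim(M)))
  M[cbind(k, k)]
}

# Element (i, i) sits at linear position 1 + (i - 1) * (nrow(M) + 1)
diag_linear <- function(M) {
  M[seq(1, by = nrow(M) + 1, length.out = min(dim(M)))]
}

M  <- matrix(1:9, ncol = 3)
M2 <- matrix(1:8, nrow = 2)     # non-square: 2 rows, 4 columns

diag_pairs(M)     # 1 5 9
diag_linear(M2)   # 1 4
```

Both versions touch only min(nrow, ncol) elements, which is why this style of indexing is much cheaper than building the full row(M) == col(M) mask.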
Update: Benchmarks are always fun
library(microbenchmark)
fun1 <- function(M) diag(M)
fun2 <- function(M) M[row(M) == col(M)]
fun3 <- function(M) {
A <- sequence(ncol(M))[sequence(min(nrow(M), ncol(M)))]
M[cbind(A, A)]
}
set.seed(1)
M <- matrix(rnorm(1000*1000), ncol = 1000)
microbenchmark(fun1(M), fun2(M), fun3(M), times = 100)
# Unit: microseconds
# expr min lq median uq max neval
# fun1(M) 4654.825 4747.408 4822.8865 4912.690 5877.866 100
# fun2(M) 53270.266 54813.606 55059.0695 55749.062 200384.531 100
# fun3(M) 66.284 82.321 118.8835 129.361 191.155 100
