R: Find vector elements containing multiple string matches

I would like to find the elements of a vector (strings) which contain all of the strings specified by another vector. For example,
x <- c("xxxabcxdxexfxx", "xxaxbcdexx", "xaxxxbc")
a <- c("a", "b", "c", "d", "e", "f")
I would like to find the elements of x that contain all the strings in a, so to get
[1] TRUE FALSE FALSE

sapply(x, function(string) all(Vectorize(grepl)(pattern = a, x = string)))
#xxxabcxdxexfxx xxaxbcdexx xaxxxbc
# TRUE FALSE FALSE
OR
rowSums(sapply(a, function(P) grepl(P, x))) == length(a)
#[1] TRUE FALSE FALSE
OR
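# Caveat: sorting only works for single-character patterns, and it fails when
# a character repeats (e.g. "abbcdef" stays "abbcdef", which lacks "abcdef").
grepl(pattern = paste(sort(a), collapse = ""),
      x = sapply(strsplit(x, ""),
                 function(chars) paste(sort(chars), collapse = "")))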
#[1] TRUE FALSE FALSE
OR
# lapply always returns a list; sapply could simplify to a matrix and break lengths()
lengths(lapply(strsplit(x, ""), setdiff, x = a)) == 0
#[1] TRUE FALSE FALSE

Another one:
sapply(strsplit(x,""), function(y) all(a %in% y))

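Using gregexpr (this counts the total number of matches, so it assumes each string in a occurs at most once per element of x):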
lengths(gregexpr(pattern = paste(a, collapse = "|"), text = x)) == length(a)
# [1] TRUE FALSE FALSE
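
If the patterns can be longer than one character, or contain regex metacharacters, a fixed-string variant may be safer. The following is only a minimal sketch and is not taken from the answers above; the helper name contains_all is hypothetical.

```r
x <- c("xxxabcxdxexfxx", "xxaxbcdexx", "xaxxxbc")
a <- c("a", "b", "c", "d", "e", "f")

# Hypothetical helper: TRUE for each element of x that contains every pattern.
# fixed = TRUE treats each pattern as a literal string rather than a regex.
contains_all <- function(x, patterns) {
  hits <- lapply(patterns, grepl, x = x, fixed = TRUE)
  # Start from all TRUE so that an empty pattern vector matches everything
  Reduce(`&`, hits, rep(TRUE, length(x)))
}

contains_all(x, a)
# TRUE FALSE FALSE

# Six matches of one single pattern: counting matches would report TRUE here
contains_all("aaaaaa", a)
# FALSE
```

Because each pattern is tested separately, repeated characters in the input do not affect the result.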

Related

Function to count consecutive digits in a string vector

I would like to create a function that takes a string of at least one character containing the digits 2 through 5, and determines whether any of those digits occurs in a consecutive run of at least N, where N is the digit's own value.
If so, return the string true, otherwise return the string false.
For Example:
Input: "555123"
Output: false
Because 5 is found only 3 times instead of 5.
Or:
Input: "57333"
Output: true
Because 3 is found exactly 3 times.
Try rle + strsplit if you are working with base R
f <- function(s) {
with(
rle(unlist(strsplit(s, ""))),
any(as.numeric(values) <= lengths & lengths > 1)
)
}
and you will see
> f("555123")
[1] FALSE
> f("57333")
[1] TRUE
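
The rle idea above can also be restricted to the digits 2 through 5 and applied to a whole character vector. This is a rough sketch under those assumptions; has_long_run and its digits argument are hypothetical names, not part of the answer above.

```r
# Hypothetical helper: does any digit d in `digits` occur in a run of length >= d?
has_long_run <- function(s, digits = 2:5) {
  vapply(s, function(one) {
    runs <- rle(strsplit(one, "")[[1]])
    # Non-digit characters become NA and are ignored by the %in% test below
    value <- suppressWarnings(as.integer(runs$values))
    any(value %in% digits & runs$lengths >= value)
  }, logical(1), USE.NAMES = FALSE)
}

# "3733" contains three 3s in total, but no run of three consecutive 3s
has_long_run(c("555123", "57333", "3733"))
# FALSE TRUE FALSE
```

Working on run lengths rather than total counts is what separates "57333" from "3733".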
Late to the party but maybe still worth your while:
Data:
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
Define vector with allowed numbers:
digits <- 2:5
Define alternation pattern with multiple backreferences:
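# seq_along(digits) numbers the capture groups 1..4; c(1, digits) had length 5
# and was silently recycled, adding a redundant fifth alternative
patt <- paste0("(", digits, ")\\", seq_along(digits), "{", digits - 1, "}", collapse = "|")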
Input patt into str_detect:
library(stringr)
str_detect(x, patt)
[1] FALSE TRUE FALSE FALSE TRUE TRUE
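
To see what the alternation pattern looks like, it may help to print it and try it with base R. The sketch below assumes one capture group per digit, numbered with seq_along(digits), and uses grepl with perl = TRUE instead of stringr.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# One alternative per digit: capture group i holds the digit, and the
# backreference \i must then repeat it (digit - 1) more times.
patt <- paste0("(", digits, ")\\", seq_along(digits), "{", digits - 1, "}",
               collapse = "|")

cat(patt, "\n")
# (2)\1{1}|(3)\2{2}|(4)\3{3}|(5)\4{4}

grepl(patt, x, perl = TRUE)
# FALSE TRUE FALSE FALSE TRUE TRUE
```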
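You could check whether the counts in table match the names. Note that this compares the total number of occurrences of each digit, not the length of consecutive runs.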
x <- c('555123', '57333')
f <- \(x) {
s <- strsplit(x, '')
lapply(s, \(x) {
tb <- table(x)
names(tb) == tb
}) |> setNames(x)
}
f(x)
# $`555123`
# x
# 1 2 3 5
# TRUE FALSE FALSE FALSE
#
# $`57333`
# x
# 3 5 7
# TRUE FALSE FALSE
Another way would be:
my_func <- function(x) {
  # avoid naming a variable `all`, which masks base::all
  digits <- as.numeric(unlist(strsplit(x, "")))
  counts <- table(digits[digits %in% 2:5])
  any(names(counts) == counts)
}
# Input <- "555123"
# (my_func(Input))
# FALSE
# Input <- "57333"
# (my_func(Input))
# TRUE

Unexpected result comparing strings with `==`

I have two vectors:
a = strsplit("po","")[[1]]
[1] "p" "o"
b = strsplit("polo","")[[1]]
[1] "p" "o" "l" "o"
I'm trying to compare them using ==.
Unfortunately, a==b gives an unexpected result.
a==b
[1] TRUE TRUE FALSE TRUE
While I expect to have:
[1] TRUE TRUE FALSE FALSE
So, what is causing this, and how can one achieve the expected result?
The problem seems to be related to the last element of both vectors being the same: changing b to e.g. polf gives the expected result, while setting b to pooo gives TRUE TRUE FALSE TRUE and not TRUE TRUE TRUE TRUE.
Edit
In other words, I'd expect missing elements (when lengths differ) to be passed as nothing (only "" seems to give TRUE TRUE FALSE FALSE, NA and NULL give different results).
c("p","o","","")==c("p","o","l","o")
[1] TRUE TRUE FALSE FALSE
The problem you've encountered here is due to recycling (not the eco-friendly kind). When applying an operation to two vectors that requires them to be the same length, R often automatically recycles, or repeats, the shorter one until it is long enough to match the longer one. Your unexpected result arises because R recycles the vector c("p", "o") to length 4 (the length of the longer vector), essentially converting it to c("p", "o", "p", "o"). If we compare c("p", "o", "p", "o") and c("p", "o", "l", "o"), we get the unexpected result shown above:
c("p", "o", "p", "o") == c("p", "o", "l", "o")
#> [1] TRUE TRUE FALSE TRUE
It's not exactly clear to me why you would expect the result to be TRUE TRUE FALSE FALSE, as it's somewhat of an ambiguous comparison to compare a length-2 vector to a length-4 vector, and recycling the length-2 vector (which is what R is doing) seems to be the most reasonable default aside from throwing an error.
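
The recycling described above can be made explicit with rep_len. This is just a small illustration; the variable names recycled and res are made up for the example.

```r
a <- c("p", "o")
b <- c("p", "o", "l", "o")

# rep_len() repeats a until it has the same length as b
recycled <- rep_len(a, length(b))
recycled
# "p" "o" "p" "o"

# The comparison with the recycled vector is the same as a == b
identical(a == b, recycled == b)
# TRUE

# When the longer length is not a multiple of the shorter one, R still
# recycles but signals a warning, which is captured here as a string
res <- tryCatch(c("p", "o", "l") == a, warning = function(w) conditionMessage(w))
```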
To get the result shown in the OP, we may put the two vectors in a list, extend both to the maximum length (padding with NAs), and test whether the comparison is %in% TRUE.
list(a, b) |>
(\(.) lapply(., `length<-`, max(lengths(.))))() |>
(\(.) do.call(\(x, y, ...) (x == y) %in% TRUE, .))()
# [1] TRUE TRUE FALSE FALSE
Note: R version 4.1.2 (2021-11-01)
Data:
a <- c("p", "o")
b <- c("p", "o", "l", "o")
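
The same pad-with-NA idea can be written without the anonymous-function pipeline, which may be easier to read. A minimal sketch, where compare_padded is a hypothetical name:

```r
# Hypothetical helper: compare position by position, treating positions
# that exist in only one vector as non-matches.
compare_padded <- function(a, b) {
  n <- max(length(a), length(b))
  # Indexing past the end of a vector yields NA, which pads the shorter one
  a_pad <- a[seq_len(n)]
  b_pad <- b[seq_len(n)]
  # A comparison with NA gives NA; %in% TRUE turns those into FALSE
  (a_pad == b_pad) %in% TRUE
}

compare_padded(c("p", "o"), c("p", "o", "l", "o"))
# TRUE TRUE FALSE FALSE
```

One consequence of this design is that a genuine NA in the same position of both inputs also counts as a non-match.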
We may create a function that pads the shorter string with spaces on the right (stringr::str_pad) before the strsplit, so that both strings have the same number of characters:
checkStrings <- function(s1, s2) {
n1 <- nchar(s1)
n2 <- nchar(s2)
if(n1 != n2) {
n <- max(n1, n2)
i1 <- which.min(c(n1, n2))
if(i1 == 1) {
s1 <- stringr::str_pad(s1, width = n, pad = " ", side = "right")
} else {
s2 <- stringr::str_pad(s2, width = n, pad = " ", side = "right")
}
}
s1v <- strsplit(s1, "")[[1]]
s2v <- strsplit(s2, "")[[1]]
return(s1v == s2v)
}
-testing
> checkStrings(str1, str2)
[1] TRUE TRUE FALSE FALSE
data
str1 <- "po"
str2 <- "polo"
Another way to solve the problem is to create a vector of length(b) and replace the first values with a:
a <- replace(character(length(b)), seq_along(a), a)
a
# [1] "p" "o" "" ""
Then we can appropriately compare the two vectors using ==:
a==b
# [1] TRUE TRUE FALSE FALSE
character(length(b)) creates a vector of "" of length(b). vector(,length(b)) is another option, but it creates a vector of FALSE instead.
If one wants to do it over two or more strings, a possible function is:
matchLength = function(strings){
  l = lapply(strings, \(x) strsplit(x, "")[[1]])
  n = max(lengths(l))
  lapply(l, function(x) replace(character(n), seq_along(x), x))
}
Which gives the desired output:
strings=c("po","polo","polka")
matchLength(strings)
# [[1]]
# [1] "p" "o" "" "" ""
#
# [[2]]
# [1] "p" "o" "l" "o" ""
#
# [[3]]
# [1] "p" "o" "l" "k" "a"

Is there a better way to check if all elements in a list are named?

I want to check if all elements in a list are named. I've come up with this solution, but I wanted to know if there is a more elegant way to check this.
x <- list(a = 1, b = 2)
y <- list(1, b = 2)
z <- list(1, 2)
any(stringr::str_length(methods::allNames(x)) == 0L) # FALSE, all elements are
# named.
any(stringr::str_length(methods::allNames(y)) == 0L) # TRUE, at least one
# element is not named.
# Throw an error here.
any(stringr::str_length(methods::allNames(z)) == 0L) # TRUE, at least one
# element is not named.
# Throw an error here.
I am not sure if the following base R code works for your general cases, but it seems to work for the ones in your post.
Define a function f to check the names
f <- function(lst) length(lst) == sum(names(lst) != "",na.rm = TRUE)
and you will see
> f(x)
[1] TRUE
> f(y)
[1] FALSE
> f(z)
[1] FALSE
We can create a function that checks whether the names attribute is NULL or (|) contains a blank ("") name, and negates (!) the result
f1 <- function(lst1) is.list(lst1) && !(is.null(names(lst1))| '' %in% names(lst1))
-checking
f1(x)
#[1] TRUE
f1(y)
#[1] FALSE
f1(z)
#[1] FALSE
Or with allNames
f2 <- function(lst1) is.list(lst1) && !("" %in% methods::allNames(lst1))
-checking
f2(x)
#[1] TRUE
f2(y)
#[1] FALSE
f2(z)
#[1] FALSE
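
A base R variant using nzchar might look like the sketch below. The name all_named is hypothetical, and treating NA names as "not named" is an assumption of this example rather than something the answers above state.

```r
x <- list(a = 1, b = 2)
y <- list(1, b = 2)
z <- list(1, 2)

# Hypothetical helper: TRUE only if a names attribute exists and every
# name is a non-missing, non-empty string.
all_named <- function(lst) {
  nms <- names(lst)
  !is.null(nms) && !anyNA(nms) && all(nzchar(nms))
}

vapply(list(x, y, z), all_named, logical(1))
# TRUE FALSE FALSE
```

Whether an empty list should count as fully named is a design choice; this sketch returns FALSE for list() because it has no names attribute.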

Behavior of identical() in apply in R

This is weird.
apply( matrix(c(1,NA,2,3,NA,NA,2,4),ncol = 2), 1, function(x) identical(x[1], x[2]) )
#[1] FALSE TRUE TRUE FALSE
apply( data.frame(a = c(1,NA,2,3),b = c(NA,NA,2,4)), 1, function(x) identical(x[1], x[2]) )
#[1] FALSE FALSE FALSE FALSE
apply( as.matrix(data.frame(a = c(1,NA,2,3),b = c(NA,NA,2,4))), 1, function(x) identical(x[1], x[2]) )
#[1] FALSE FALSE FALSE FALSE
This is due to the names attribute as indicated below by joran. I can obtain the result I expected by:
apply( data.frame(a = c(1,NA,2,3),b = c(NA,NA,2,4)), 1, function(x) identical(unname(x[1]), unname(x[2])) )
or:
apply( data.frame(a = c(1,NA,2,3),b = c(NA,NA,2,4)), 1, function(x) identical(x[[1]], x[[2]]) )
Is there a more natural way to approach this? It would seem that there should be an option to ignore attributes, like in all.equal().
Probably
mapply(identical, x$a, x$b)
#[1] FALSE TRUE TRUE FALSE
where x is a data frame.
As an aside, using apply with a data frame is almost always a mistake. It will coerce the data frame to a matrix which often leads to unexpected results.
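
To see the names attribute at work, one can pull a single row out of the coerced matrix, which is roughly what apply hands to the function. The vectorised comparison at the end is a hypothetical alternative that is not from the answers above, and it treats NA and NaN alike, unlike identical.

```r
df <- data.frame(a = c(1, NA, 2, 3), b = c(NA, NA, 2, 4))

# Row 2 of the coerced matrix is a named vector
row2 <- as.matrix(df)[2, ]
names(row2)
# "a" "b"

# Single-bracket subsetting keeps the names, so the two NAs differ
identical(row2[1], row2[2])
# FALSE

# Double-bracket subsetting drops the names
identical(row2[[1]], row2[[2]])
# TRUE

# Hypothetical vectorised alternative: equal values, or both missing
same <- (df$a == df$b) %in% TRUE | (is.na(df$a) & is.na(df$b))
same
# FALSE TRUE TRUE FALSE
```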

Detect the n-th duplication (and n+1-th,...) in a vector

Given a vector like:
x <- c("r", "r", "b", "b", "b", "b", "r", "r", "y", "y")
How can I detect the elements that represent the (at least) n-th duplication of a value?
For this case, if we allow at most two occurrences of each value, this should give:
duplicatedN(x, 2)
# F, F, F, F, T, T, T, T, F, F
In other words: An element i with value v should be labeled as TRUE if there are at least N previous elements with the same value v.
A possible solution using data.table :
library(data.table)
duplicatedN <- function(x,n=2){
DT <- data.table(A=x)
DT[,dup:=1:.N > n,by=A]
return(DT$dup)
}
x <- c("r", "r", "b", "b", "b", "b", "r", "r", "y", "y")
> duplicatedN(x,1)
[1] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
> duplicatedN(x,2)
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
> duplicatedN(x,3)
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
> duplicatedN(x,4)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
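
The same running count per value can be obtained in base R with ave, without data.table. A minimal sketch; duplicatedN_base is a hypothetical name chosen to avoid clashing with the function above.

```r
x <- c("r", "r", "b", "b", "b", "b", "r", "r", "y", "y")

duplicatedN_base <- function(x, n = 2) {
  # ave() applies seq_along within each group of equal values, giving the
  # running occurrence number of every element
  occurrence <- ave(seq_along(x), x, FUN = seq_along)
  occurrence > n
}

duplicatedN_base(x, 2)
# FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
```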
This solution built around table flags every element whose value occurs at least n times in the whole vector.
If you want to return a logical:
duplicateN <- function(x, n){
x %in% names(which(table(x) >= n))
}
> duplicateN(x, 3)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
Or if you directly want to return the actual elements:
atleastN <- function(x, n){
x[x %in% names(which(table(x) >= n))]
}
# x[duplicateN(x, n)] would also work
> atleastN(x, 3)
[1] "r" "r" "b" "b" "b" "b" "r" "r"
Is this what you needed?
