Unexpected result comparing strings with `==` - r

I have two vectors:
a = strsplit("po","")[[1]]
[1] "p" "o"
b = strsplit("polo","")[[1]]
[1] "p" "o" "l" "o"
I'm trying to compare them using ==.
Unfortunately, a==b gives an unexpected result.
a==b
[1] TRUE TRUE FALSE TRUE
While I expect to have:
[1] TRUE TRUE FALSE FALSE
So, what is causing this? and how can one achieve the expected result?
The problem seems to be related to the last element of both vectors being the same: changing b to e.g. polf does give the expected result, while setting b to pooo gives TRUE TRUE FALSE TRUE rather than TRUE TRUE TRUE TRUE.
Edit
In other words, I'd expect missing elements (when lengths differ) to be passed as nothing (only "" seems to give TRUE TRUE FALSE FALSE, NA and NULL give different results).
c("p","o","","")==c("p","o","l","o")
[1] TRUE TRUE FALSE FALSE

The problem you've encountered here is due to recycling (not the eco-friendly kind). When applying an operation to two vectors that requires them to be the same length, R often automatically recycles, or repeats, the shorter one until it is long enough to match the longer one. Your unexpected results arise because R recycles the vector c("p", "o") to length 4 (the length of the longer vector), essentially converting it to c("p", "o", "p", "o"). If we compare c("p", "o", "p", "o") and c("p", "o", "l", "o"), we get the same unexpected result as above:
c("p", "o", "p", "o") == c("p", "o", "l", "o")
#> [1] TRUE TRUE FALSE TRUE
It's not exactly clear to me why you would expect the result to be TRUE TRUE FALSE FALSE, as it's somewhat of an ambiguous comparison to compare a length-2 vector to a length-4 vector, and recycling the length-2 vector (which is what R is doing) seems to be the most reasonable default aside from throwing an error.
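As a minimal sketch of the recycling described above, rep_len() can make the implicit step explicit. The name a_recycled is only illustrative and does not appear in the original answer.

```r
a <- c("p", "o")
b <- c("p", "o", "l", "o")

# What R does implicitly for a == b: repeat the shorter vector
# until it is as long as the longer one.
a_recycled <- rep_len(a, length(b))
a_recycled
#> [1] "p" "o" "p" "o"

# The explicit comparison matches the implicit one.
identical(a == b, a_recycled == b)
#> [1] TRUE
```

R only warns about recycling when the longer length is not a multiple of the shorter one (for example, a length-2 vector against a length-3 vector), which is why a == b runs silently here.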

To get the result shown in OP we may put the two vectors in a list, adapt their lengths to maximum lengths (by adding NA's) and test if the comparison is %in% TRUE.
list(a, b) |>
(\(.) lapply(., `length<-`, max(lengths(.))))() |>
(\(.) do.call(\(x, y, ...) (x == y) %in% TRUE, .))()
# [1] TRUE TRUE FALSE FALSE
Note: R version 4.1.2 (2021-11-01)
Data:
a <- c("p", "o")
b <- c("p", "o", "l", "o")
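The pipeline above can be hard to read, so here is a step-by-step sketch of the same idea, assuming the a and b from the Data block. The intermediate names (n, a_padded, b_padded, cmp) are illustrative only.

```r
a <- c("p", "o")
b <- c("p", "o", "l", "o")

n <- max(length(a), length(b))

# Extending a vector with `length<-` fills the new positions with NA.
a_padded <- `length<-`(a, n)
b_padded <- `length<-`(b, n)

# Comparisons against NA yield NA rather than FALSE ...
cmp <- a_padded == b_padded
cmp
#> [1]  TRUE  TRUE    NA    NA

# ... and `%in% TRUE` maps those NAs to FALSE.
cmp %in% TRUE
#> [1]  TRUE  TRUE FALSE FALSE
```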

We may create a function that pads the shorter string with spaces on the right (stringr::str_pad) before the strsplit, so that both strings have the same number of characters
checkStrings <- function(s1, s2) {
  n1 <- nchar(s1)
  n2 <- nchar(s2)
  if (n1 != n2) {
    n <- max(n1, n2)
    # pad whichever string is shorter
    i1 <- which.min(c(n1, n2))
    if (i1 == 1) {
      s1 <- stringr::str_pad(s1, width = n, pad = " ", side = "right")
    } else {
      s2 <- stringr::str_pad(s2, width = n, pad = " ", side = "right")
    }
  }
  s1v <- strsplit(s1, "")[[1]]
  s2v <- strsplit(s2, "")[[1]]
  return(s1v == s2v)
}
-testing
> checkStrings(str1, str2)
[1] TRUE TRUE FALSE FALSE
data
str1 <- "po"
str2 <- "polo"

Another way to solve the problem is to create a vector of length(b) and replace the first values with a:
a <- replace(character(length(b)), seq(a), a)
a
# [1] "p" "o" "" ""
Then we can appropriately compare the two vectors using ==:
a==b
# [1] TRUE TRUE FALSE FALSE
character(length(b)) creates a vector of "" of length(b). vector(,length(b)) is another option, but it creates a vector of FALSE instead.
If one wants to do it over two or more strings, a possible function is:
matchLength = function(strings){
l = lapply(strings,\(x) strsplit(x,"")[[1]])
larger = which.max(lengths(l))
lapply(l, function(x) replace(character(length(l[[larger]])), seq(x), x))
}
Which gives the desired output:
strings=c("po","polo","polka")
matchLength(strings)
# [[1]]
# [1] "p" "o" "" "" ""
#
# [[2]]
# [1] "p" "o" "l" "o" ""
#
# [[3]]
# [1] "p" "o" "l" "k" "a"

Related

Is there a better way to check if all elements in a list are named?

I want to check if all elements in a list are named. I've come up with this solution, but I wanted to know if there is a more elegant way to check this.
x <- list(a = 1, b = 2)
y <- list(1, b = 2)
z <- list (1, 2)
any(stringr::str_length(methods::allNames(x)) == 0L) # FALSE, all elements are
# named.
any(stringr::str_length(methods::allNames(y)) == 0L) # TRUE, at least one
# element is not named.
# Throw an error here.
any(stringr::str_length(methods::allNames(z)) == 0L) # TRUE, at least one
# element is not named.
# Throw an error here.
I am not sure if the following base R code works for your general cases, but it seems to work for the ones in your post.
Define a function f to check the names
f <- function(lst) length(lst) == sum(names(lst) != "",na.rm = TRUE)
and you will see
> f(x)
[1] TRUE
> f(y)
[1] FALSE
> f(z)
[1] FALSE
We can create a function that checks whether the names attribute is NULL or (|) contains a blank ("") name, and negates (!) the result
f1 <- function(lst1) is.list(lst1) && !(is.null(names(lst1))| '' %in% names(lst1))
-checking
f1(x)
#[1] TRUE
f1(y)
#[1] FALSE
f1(z)
#[1] FALSE
Or with allNames
f2 <- function(lst1) is.list(lst1) && !("" %in% allNames(lst1))
-checking
f2(x)
#[1] TRUE
f2(y)
#[1] FALSE
f2(z)
#[1] FALSE
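For comparison, here is a base R sketch that also treats NA names as unnamed. The helper name all_named is hypothetical and is not taken from any of the answers above.

```r
# Hypothetical helper: TRUE only if every element has a non-empty, non-NA name.
all_named <- function(lst) {
  nms <- names(lst)
  !is.null(nms) && !anyNA(nms) && all(nzchar(nms))
}

x <- list(a = 1, b = 2)
y <- list(1, b = 2)
z <- list(1, 2)

all_named(x)
#> [1] TRUE
all_named(y)
#> [1] FALSE
all_named(z)
#> [1] FALSE
```

Note that this sketch returns FALSE for an empty list, because names(list()) is NULL. Whether that is the right answer depends on the use case.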

R: Find vector elements containing multiple string matches

I would like to find the elements of a vector (strings) which contain all of the strings specified by another vector. For example,
x <- c("xxxabcxdxexfxx", "xxaxbcdexx", "xaxxxbc")
a <- c("a", "b", "c", "d", "e", "f")
I would like to find the elements of x that contain all the strings in a, so to get
[1] TRUE FALSE FALSE
sapply(x, function(string) all(Vectorize(grepl)(pattern = a, x = string)))
#xxxabcxdxexfxx xxaxbcdexx xaxxxbc
# TRUE FALSE FALSE
OR
rowSums(sapply(a, function(P) grepl(P, x))) == length(a)
#[1] TRUE FALSE FALSE
OR
grepl(pattern = paste(sort(a), collapse = ""),
x = sapply(strsplit(x, ""),
function(x) paste(sort(x), collapse = "")))
#[1] TRUE FALSE FALSE
OR
lengths(sapply(strsplit(x,""), setdiff, x = a)) == 0
#[1] TRUE FALSE FALSE
Another one:
sapply(strsplit(x,""), function(y) all(a %in% y))
Using gregexpr (this counts matches, so it assumes that no string in a occurs more than once in an element of x):
lengths(gregexpr(pattern = paste(a, collapse = "|"), text = x)) == length(a)
# [1] TRUE FALSE FALSE
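The answers above treat the entries of a as regular expressions. If they are meant as literal substrings, a sketch along the following lines may be safer. The helper name contains_all is hypothetical, and fixed = TRUE is an assumption about the intent of the question.

```r
x <- c("xxxabcxdxexfxx", "xxaxbcdexx", "xaxxxbc")
a <- c("a", "b", "c", "d", "e", "f")

# Hypothetical helper: does each element of `strings` contain every
# element of `patterns` as a literal substring?
contains_all <- function(strings, patterns) {
  vapply(
    strings,
    function(s) all(vapply(patterns, grepl, logical(1), x = s, fixed = TRUE)),
    logical(1),
    USE.NAMES = FALSE
  )
}

contains_all(x, a)
#> [1]  TRUE FALSE FALSE

# Counting matches can be fooled by repeats: six "a"s give six matches.
lengths(gregexpr(paste(a, collapse = "|"), "aaaaaa")) == length(a)
#> [1] TRUE
contains_all("aaaaaa", a)
#> [1] FALSE
```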

R: find vector in list of vectors

I'm working with R and my goal is to check whether a given vector is in a list of unique vectors.
The list looks like
final_states <- list(c("x" = 5, "y" = 1),
c("x" = 5, "y" = 2),
c("x" = 5, "y" = 3),
c("x" = 5, "y" = 4),
c("x" = 5, "y" = 5),
c("x" = 3, "y" = 5))
Now I want to check whether a given state is in the list. For example:
state <- c("x" = 5, "y" = 3)
As you can see, the vector state is an element of the list final_states. My idea was to check it with %in% operator:
state %in% final_states
But I get this result:
[1] FALSE FALSE
Can anyone tell me, what is wrong?
Greets,
lupi
If you just want to determine if the vector is in the list, try
Position(function(x) identical(x, state), final_states, nomatch = 0) > 0
# [1] TRUE
Position() basically works like match(), but on a list. If you set nomatch = 0 and check for Position > 0, you'll get a logical result telling you whether state is in final_states
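As for why the original attempt returned FALSE FALSE, the following sketch may help. It reuses final_states and state from the question; the name hits is illustrative only.

```r
final_states <- list(c(x = 5, y = 1), c(x = 5, y = 2), c(x = 5, y = 3),
                     c(x = 5, y = 4), c(x = 5, y = 5), c(x = 3, y = 5))
state <- c(x = 5, y = 3)

# %in% asks about each element of its left-hand side separately:
# "is 5 an element of the list?" and "is 3 an element of the list?"
state %in% final_states
#> [1] FALSE FALSE

# Comparing whole vectors, one list element at a time, gives one
# answer per element of final_states.
hits <- vapply(final_states, identical, logical(1), state)
hits
#> [1] FALSE FALSE  TRUE FALSE FALSE FALSE
any(hits)
#> [1] TRUE
```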
"final_states" is a "list", so you could convert the "state" to list and then do
final_states %in% list(state)
#[1] FALSE FALSE TRUE FALSE FALSE FALSE
or use mapply to check whether all the elements in "state" are present in each of the list elements of "final_states" (assuming that the lengths are the same for the vector and the list elements)
f1 <- function(x,y) all(x==y)
mapply(f1, final_states, list(state))
#[1] FALSE FALSE TRUE FALSE FALSE FALSE
Or rbind the list elements to a matrix and then check whether "state" and the "rows" of "m1" are the same.
m1 <- do.call(rbind, final_states)
!rowSums(m1!=state[col(m1)])
#[1] FALSE FALSE TRUE FALSE FALSE FALSE
Or
m1[,1]==state[1] & m1[,2]==state[2]
#[1] FALSE FALSE TRUE FALSE FALSE FALSE
Update
If you need to get a single TRUE/FALSE
any(mapply(f1, final_states, list(state)))
#[1] TRUE
Or
any(final_states %in% list(state))
#[1] TRUE
Or
list(state) %in% final_states
#[1] TRUE
Or use the "faster" fmatch from fastmatch
library(fastmatch)
fmatch(list(state), final_states) >0
#[1] TRUE
Benchmarks
#Richard Sciven's base R function is very fast compared to other solutions except the one with fmatch
set.seed(295)
final_states <- replicate(1e6, sample(1:20, 20, replace=TRUE),
simplify=FALSE)
state <- final_states[[151]]
richard <- function() {Position(function(x) identical(x, state),
final_states, nomatch = 0) > 0}
Bonded <- function(){any( sapply(final_states, identical, state) )}
akrun2 <- function() {fmatch(list(state), final_states) >0}
akrun1 <- function() {f1 <- function(x,y) all(x==y)
any(mapply(f1, final_states, list(state)))}
library(microbenchmark)
microbenchmark(richard(), Bonded(), akrun1(), akrun2(),
unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq
# richard() 35.22635 29.47587 17.49164 15.66833 14.58235
# Bonded() 109440.56885 101382.92450 55252.86141 47734.96467 44289.80309
# akrun1() 167001.23864 138812.85016 75664.91378 61417.59871 62667.94867
# akrun2() 1.00000 1.00000 1.00000 1.00000 1.00000
# max neval cld
# 14.62328 20 a
# 46299.43325 20 b
# 63890.68133 20 c
# 1.00000 20 a
Whenever I see a list object I first think of lapply. It seems to deliver the expected result with identical as the test and 'state' as the second argument:
> lapply(final_states, identical, state)
[[1]]
[1] FALSE
[[2]]
[1] FALSE
[[3]]
[1] TRUE
[[4]]
[1] FALSE
[[5]]
[1] FALSE
[[6]]
[1] FALSE
You get a possibly useful intermediate result with:
lapply(final_states, match, state)
... but it comes back as a series of position vectors where c(1,2) is the correct result.
If you want the result to come back as a vector, say because you want to use any, then use sapply instead of lapply.
> any( sapply(final_states[-3], identical, state) )
[1] FALSE
> any( sapply(final_states, identical, state) )
[1] TRUE

Trouble determining use of & in this function

A friend wrote up this function for determining unique members of a vector. I can't figure out (mentally) what this one line is doing and it's the crux of the function. Any help is greatly appreciated
myUniq <- function(x){
len = length(x) # getting the length of the argument
logical = rep(T, len) # creating a vector of logicals as long as the arg, populating with true
for(i in 1:len){ # for i -> length of the argument
logical = logical & x != x[i] # logical vector = logical vector & arg vector where arg vector != x[i] ??????
logical[i] = T
}
x[logical]
}
This line I can't figure out:
logical = logical & x != x[i]
can anyone explain it to me?
Thanks,
Tom
logical is a logical vector of length len, initially all TRUE. x is a vector of some other data of the same length.
The second part, x != x[i], creates a logical vector that is TRUE where elements of x differ from the current value x[i] for this iteration, and FALSE otherwise.
As a result, both sides of & are logical vectors. & is an element-wise AND: the result is TRUE where the elements of logical and of x != x[i] are both TRUE, and FALSE otherwise. Hence, after the first iteration, logical becomes a logical vector that is TRUE for all elements of x that differ from the first (i = 1) element of x, and FALSE where they are the same.
Here is a bit of an example:
logical <- rep(TRUE, 10)
set.seed(1)
x <- sample(letters[1:4], 10, replace = TRUE)
> x
[1] "b" "b" "c" "d" "a" "d" "d" "c" "c" "a"
> logical
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> x != x[1]
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> logical & x != x[1]
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
This seems very complex. Do you get the same results as:
unique(x)
gives you? If I run my x above through myUniq() and unique() I get the same output:
> myUniq(x)
[1] "b" "d" "c" "a"
> unique(x)
[1] "b" "c" "d" "a"
(well, except for the ordering...)
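The ordering difference follows from the loop itself: each iteration drops earlier copies of x[i] and then keeps position i, so the last occurrence of every value survives. The sketch below restates the function with a hard-coded x (the values printed above) so that it does not depend on the random number generator. The variable name keep replaces logical for readability.

```r
myUniq <- function(x) {
  keep <- rep(TRUE, length(x))
  for (i in seq_along(x)) {
    keep <- keep & x != x[i]  # drop every element equal to x[i] ...
    keep[i] <- TRUE           # ... then keep position i itself
  }
  x[keep]
}

x <- c("b", "b", "c", "d", "a", "d", "d", "c", "c", "a")

myUniq(x)
#> [1] "b" "d" "c" "a"

# Same result: keep only the last occurrence of each value.
x[!duplicated(x, fromLast = TRUE)]
#> [1] "b" "d" "c" "a"

# unique() keeps the first occurrence instead.
unique(x)
#> [1] "b" "c" "d" "a"
```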

Why is R's which function not returning "correct" answer

I'm writing a variant of the Monty Hall problem, building up on another person's code. The difference is that instead of 3 doors, I have "n" doors. Let's say n = 4 for this question. The doors are labeled A, B, C and D.
The code is as follows:
n <- 4
doors <- LETTERS[seq( from = 1, to = n )]
xdata = c()
for(i in 1:10000) {
prize <- sample(doors)[1]
pick <- sample(doors)[1]
open1 <- doors[which(doors != pick & doors != prize)]
open <- sample(open1,n-2)
# the line with the problem
switchyes <- doors[which( doors != open & doors != pick)]
if(pick==prize) {
xdata <- c(xdata, "noswitchwin")
}
if(switchyes==prize) {
xdata=c(xdata, "switchwin")
}
}
When I run the code, I get the warning:
There were 50 or more warnings (use warnings() to see the first 50)
The problem seems to be due to the line:
switchyes <- doors[which( doors != open & doors != pick)]
This should only return 1 item (C) since the statement doors != open and doors != pick eliminates doors A and B and D. However, I'm getting more than one, B and C. Anybody see what's going on?
length(which(xdata == "switchwin"))
# [1] 4728
length(which(xdata == "noswitchwin"))
# [1] 2424
switchyes
# [1] "B" "C"
open
# [1] "B" "D"
open1
# [1] "B" "D"
pick
# [1] "A"
prize
# [1] "C"
The problem you have is the use of != when the lengths of the LHS and RHS differ:
p <- letters[1:4]
# [1] "a" "b" "c" "d"
q <- c("a", "e", "d", "d")
# [1] "a" "e" "d" "d"
p == q
# [1] TRUE FALSE FALSE TRUE
p != q
# [1] FALSE TRUE TRUE FALSE
What is happening? Since p and q are of equal length, each element of p is compared to the value at the corresponding index of q. Now, what if we change q to this:
q <- c("b", "d")
p == q
# [1] FALSE FALSE FALSE TRUE
What's happening here? Since the length of q (RHS) is not equal to p (LHS), q gets recycled to get to the length of p. That is,
# p q p q
a == b, b == d # first two comparisons
c == b, d == d # recycled comparisons
Instead you should use
!(doors %in% open) & !(doors %in% pick)
Also, note that !A AND !B = !(A OR B), so you could rewrite this as
!(doors %in% open | doors %in% pick)
In turn, this could be simplified to use only one %in% as:
!(doors %in% c(open, pick))
Further, you could create a function using Negate, say %nin% (corresponding to !(x %in% y)), and replace the ! and %in% in the above statement as follows:
`%nin%` <- Negate(`%in%`)
doors %nin% c(open, pick) # note the %nin% here
So basically your statement assigning to switchyes could read just:
# using %nin% after defining the function
switchyes <- doors[doors %nin% c(open, pick)]
You don't need to use which here as you are not looking for indices. You can directly use the logicals here to get the result. Hope this helps.
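To check this against the values printed in the question (pick "A", open "B" and "D"), here is a small sketch that runs both versions side by side.

```r
doors <- LETTERS[1:4]
pick  <- "A"
open  <- c("B", "D")

# `open` (length 2) is recycled to B, D, B, D before the comparison,
# so door B is compared with "D" and wrongly survives.
doors[doors != open & doors != pick]
#> [1] "B" "C"

# %in% tests set membership, so the length of `open` no longer matters.
switchyes <- doors[!(doors %in% c(open, pick))]
switchyes
#> [1] "C"
```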
