Compare 2 strings in R

Compare 2 strings in R - r

I have data as below:
vec <- c("ABC|ADC|1","ABC|ADG|2")
I need to check if below substring is present or not
"ADC|DFG", it should return false for this as I need to match exact pattern.
"ABC|ADC|1|5" should return True as this is a child element for the first element in vector.
I tried using grepl but it returns true if I just pass ADC as well, any help is appreciated.

grepl returns true because the pipe character | in regex is a special one. a|b means match a or b. all you need to do is escape it.
frtest<-c("ABC|ADC","ABC|ADC|1|2","ABC|ADG","ABC|ADG|2|5")
# making the last number and it's pipe optional
test <- gsub('(\\|\\d)$', '(\\1)?', frtest)
# escaping all pipes
test<-gsub('\\|' ,'\\\\\\\\|',test)
# testing if any of the strings is in vec
res <- sapply(test, function(x) any(grepl(x, vec)) )
# reassigning the names so they're readable
names(res) <-frtest
#> ABC|ADC ABC|ADC|1|2 ABC|ADG ABC|ADG|2|5
TRUE TRUE TRUE TRUE

For two vectors vec and test, this returns a vector which is TRUE if either the corresponding element of test is the start of one of the elements of vec, or one of the elements of vec is the start of the corresponding element of test.
vec <- c("ABC|ADC|1","ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ADC|1", "ABC|ADC")
colSums(sapply(test, startsWith, vec) | t(sapply(vec, startsWith, test))) > 0
# ADC|DFG ABC|ADC|1|5 ADC|1 ABC|ADC
# FALSE TRUE FALSE TRUE

Related

Substring match when filtering rows

I have strings in file1 that matches part of the strings in file2. I want to filter out the strings from file2 that partly matches those in file1. Please see my try. Not sure how to define substring match in this way.
file1:
V1
species1
species121
species14341
file2
V1
genus1|species1|strain1
genus1|species121|strain1
genus1|species1442|strain1
genus1|species4242|strain1
genus1|species4131|strain1
my try:
file1[!file1$V1 %in% file2$V1]

You cannot use the %in% operator in this way in R. It is used to determine whether an element of a vector is in another vector, not like in in Python which can be used to match a substring: Look at this:
"species1" %in% "genus1|species1|strain1" # FALSE
"species1" %in% c("genus1", "species1", "strain1") # TRUE
You can, however, use grepl for this (the l is for logical, i.e. it returns TRUE or FALSE).
grepl("species1", "genus1|species1|strain1") # TRUE
There's an additional complication here in that you cannot use grepl with a vector, as it will only compare the first value:
grepl(file1$V1, "genus1|species1|strain1")
[1] TRUE
Warning message:
In grepl(file1$V1, "genus1|species1|strain1") :
argument 'pattern' has length > 1 and only the first element will be used
The above simply tells you that the first element of file1$V1 is in "genus1|species1|strain1".
Furthermore, you want to compare each element in file1$V1 to an entire vector of strings, rather than just one string. That's OK but you will get a vector the same length as the second vector as an output:
grepl("species1", file2$V1)
[1] TRUE TRUE TRUE FALSE FALSE
We can just see if any() of those are a match. As you've tagged your question with tidyverse, here's a dplyr solution:
library(dplyr)
file1 |>
rowwise() |> # This makes sure you only pass one element at a time to `grepl`
mutate(
in_v2 = any(grepl(V1, file2$V1))
) |>
filter(!in_v2)
# A tibble: 1 x 2
# Rowwise:
# V1 in_v2
# <chr> <lgl>
# 1 species14341 FALSE

One way to get what you want is using the grepl function. So, you can run the following code:
# Load library
library(qdapRegex)
# Extract the names of file2$V1 you are interested in (those between | |)
v <- unlist(rm_between(file2$V1, "|", "|", extract = T))
# Which of theese elements are in file1$V1?
elem.are <- which(v %in% file1$V1)
# Delete the elements in elem.are
file2$V1[-elem.are]
In v we save the names of file2$V1 we are interested in (those
between | |)
Then we save at elem.are the positions of those names which appear
in file1$V1
Finally, we omit those elements using file2$V1[-elem.are]

How to apply list of regex pattern on list

I have a list of strings and a list of patterns
like:
links <- c(
"http://www.google.com"
,"google.com"
,"www.google.com"
,"http://google.com"
,"http://google.com/"
,"www.google.com/#"
,"www.google.com/xpto"
,"http://google.com/xpto"
,"http://google.com/xpto&utml"
,"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA")
patterns <- c(".com$","/$")
what i want is wipe out all links that matches this patterns.
and get this result:
"www.google.com/#"
"www.google.com/xpto"
"http://google.com/xpto"
"http://google.com/xpto&utml"
"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
if i use
x<-lapply (patterns, grepl, links)
i get
[[1]]
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
what takes me to this 2 lists
> links[!x[[2]]]
[1] "http://www.google.com" "google.com"
[3] "www.google.com" "http://google.com"
[5] "www.google.com/#" "www.google.com/xpto"
[7] "http://google.com/xpto" "http://google.com/xpto&utml"
[9] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
> links[!x[[1]]]
[1] "http://google.com/" "www.google.com/#"
[3] "www.google.com/xpto" "http://google.com/xpto"
[5] "http://google.com/xpto&utml" "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
in this case each result list wiped 1 pattern out.. but i wanted 1 list with all patterns wiped... how to apply the regex to only one result ... or somehow to merge the n boolean vectors always choosing true.
like:
b[1] <- c(TRUE,FALSE,FALSE,TRUE,FALSE)
b[2] <- c(FALSE,FALSE,TRUE,TRUE,FALSE)
b[3] <- c(FALSE,FALSE,FALSE,FALSE,FALSE)
res <- somefunction(b)
res
TRUE,FALSE,TRUE,TRUE,FALSE

In most cases the best solution will be to merge the regular expression patterns, and to apply a single pattern search, as shown in Thomas’ answer.
However, it is also trivial to merge logical vectors by combining them with logical operations. In your case, you want to compute the member-wise logical disjunction. Between two vectors, this can be computed as x | y. Between a list of multiple vectors, it can be computed using Reduce(|, logical_list).
In your case, this results in:
any_matching = Reduce(`|`, lapply(patterns, grepl, links))
result = links[! any_matching]

This should do what you want:
links[!sapply("(\\.com|/)$", grepl, links)]
Explanation:
You can use sapply so you get a vector and not a list
I'd use the pattern "(\\.com|/)$" (i.e. ends with .com OR /).
In the end I negate the resulting boolean vector using !.

You can try the base R code below, using grep
r <- grep(paste0(patterns,collapse = "|"),links,value = TRUE,invert = TRUE)
such that
> r
[1] "www.google.com/#"
[2] "www.google.com/xpto"
[3] "http://google.com/xpto"
[4] "http://google.com/xpto&utml"
[5] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"

You can do this using stringr::str_subset() function.
library(stringr)
str_subset(links, pattern = ".com$|/$", negate = TRUE)

If any value from the first column is matching (partial/ full) in an another column in R

I have two lists, which are like following. I am looking for an output where every row of dat1 will match on complete column in dat, and on the basis of that, I will get the results
dat <- data.frame(v=c('apple', 'le123', 'app', 'being', 'aple',"beiling"))
dat1 <- data.frame(v1=c('app','123', 'be'))
I have tried following two alternatives but without success
test <- mapply(grepl, pattern=dat1$v1, x=dat$v)
str_detect(as.character(dat$v),dat1)
the output I am getting is
TRUE TRUE FALSE FALSE FALSE TRUE
but the desired output I am looking for is
TRUE TRUE TRUE TRUE FALSE TRUE
How can I go ahead with this, every help is important

We can paste the pattern dataset column ('dat1$v1') together by collapseing with "|" and this will look for any matches. It is basically telling that either one of these patterns are in the 'v' column of 'dat'
stringr::str_detect(as.character(dat$v),paste(as.character(dat1$v1), collapse="|"))
#[1] TRUE TRUE TRUE TRUE FALSE TRUE
Note: To avoid any substring mismatches it is better wrap with word boundary (\\b)
pat <- paste0("\\b(", paste(as.character(dat1$v1), collapse="|"), ")\\b")
stringr::str_detect(as.character(dat$v), pat)
which seems to be not the case in the OP's data
Update
If the pattern list is very long, then we can loop over the patterns, get a list of logical vectors and Reduce it to single vector
Reduce(`|`, lapply(as.character(dat1$v1), str_detect, string = as.character(dat$v)))
#[1] TRUE TRUE TRUE TRUE FALSE TRUE

Moreover, you can use sqldf and do this in SQL format:
require(sqldf)
dat <- data.frame(v=c('apple', 'le123', 'app', 'being', 'aple','beiling'))
dat1 <- data.frame(v1=c('app','123', 'be'))
sqldf("SELECT dat.* FROM dat JOIN dat1 on dat.v like ('%' || dat1.v1 || '%')")
And result would be:
v
1 apple
2 le123
3 app
4 being
5 beiling

R: Checking if mutliple elements of a vector appear in vector of strings

I'm trying to create a function that checks if all elements of a vector appear in a vector of strings. The test code is presented below:
test_values = c("Alice", "Bob")
test_list = c("Alice,Chris,Mark", "Alice,Bob,Chris", "Alice,Mark,Zach", "Alice,Bob,Mark", "Mark,Bob,Zach", "Alice,Chris,Bob", "Mark,Chris,Zach")
I would like the output for this to be FALSE TRUE FALSE TRUE FALSE TRUE FALSE.
I first thought I'd be able to switch the | to & in the command grepl(paste(test_values, collapse='|'), test_list) to get when Alice and Bob are both in the string instead of when either of them appear, but I was unable to get the correct answer.
I also would rather not use the command: grepl(test_values[1], test_list) & grepl(test_values[2], test_list) because the test_values vector will change dynamically (varying from length 0 to 3), so I'm looking for something to take that into account.

We can use Reduce with grepl
Reduce(`&`, lapply(test_values, grepl, test_list))
#[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE

R match part of string against vector of strings

I have a comma separated character class
A = "123,456,789"
and I am trying to get a logical vector for when one of the items in the character class are present in a character array.
B <- as.array(c("456", "135", "789", "111"))
I am looking for logical result of size 4 (length of B)
[1] TRUE FALSE TRUE FALSE
Fairly new to R so any help would be appreciated. Thanks in advance.

You can use a combination of sapply and grepl, which returns a logical if matched
sapply(B, grepl, x=A)

Since your comparison vector is comma-separated, you can use this as a non-looping method.
B %in% strsplit(A, ",")[[1]]
# [1] TRUE FALSE TRUE FALSE
And one other looping method would be to use Vectorize with grepl. This uses mapply internally.
Vectorize(grepl, USE.NAMES = FALSE)(B, A)
# [1] TRUE FALSE TRUE FALSE

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Compare 2 strings in R - r

Related

Substring match when filtering rows

How to apply list of regex pattern on list

If any value from the first column is matching (partial/ full) in an another column in R

R: Checking if mutliple elements of a vector appear in vector of strings

R match part of string against vector of strings

Categories

Resources