R match part of string against vector of strings


I have a comma-separated character string
A = "123,456,789"
and I am trying to get a logical vector showing which elements of a character array match one of the items in that string.
B <- as.array(c("456", "135", "789", "111"))
I am looking for a logical result of length 4 (the length of B)
[1] TRUE FALSE TRUE FALSE
Fairly new to R so any help would be appreciated. Thanks in advance.

You can combine sapply and grepl; grepl returns TRUE when the pattern is found in the string.
sapply(B, grepl, x=A)

Since your comparison vector is comma-separated, you can use this as a non-looping method.
B %in% strsplit(A, ",")[[1]]
# [1] TRUE FALSE TRUE FALSE
And one other looping method would be to use Vectorize with grepl. This uses mapply internally.
Vectorize(grepl, USE.NAMES = FALSE)(B, A)
# [1] TRUE FALSE TRUE FALSE
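The approaches above are not interchangeable: grepl looks for a substring anywhere in A, while %in% compares whole comma-separated tokens. A minimal sketch of the difference, assuming an extra element "23" (not part of the original question) is added to B:

```r
A <- "123,456,789"
B <- c("456", "135", "789", "23")  # "23" is a hypothetical extra element

# Substring search: "23" is found inside "123"
substring_hit <- unname(sapply(B, grepl, x = A, fixed = TRUE))
# [1]  TRUE FALSE  TRUE  TRUE

# Whole-token comparison: "23" is not one of the tokens
tokens <- strsplit(A, ",", fixed = TRUE)[[1]]
exact_hit <- B %in% tokens
# [1]  TRUE FALSE  TRUE FALSE
```

If only complete items should count as a match, the token-based version is probably the safer choice.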

Related

Compare 2 strings in R

I have data as below:
vec <- c("ABC|ADC|1","ABC|ADG|2")
I need to check whether the substrings below are present:
"ADC|DFG" should return FALSE, because I need to match the exact pattern.
"ABC|ADC|1|5" should return TRUE, because it is a child element of the first element in the vector.
I tried using grepl, but it also returns TRUE if I pass just ADC. Any help is appreciated.

grepl returns TRUE because the pipe character | is special in a regex: a|b means "match a or b". All you need to do is escape it.
frtest <- c("ABC|ADC","ABC|ADC|1|2","ABC|ADG","ABC|ADG|2|5")
# make the last number and its pipe optional
test <- gsub('(\\|\\d)$', '(\\1)?', frtest)
# escape all pipes: the replacement '\\\\|' inserts one backslash before each pipe
test <- gsub('\\|', '\\\\|', test)
# test whether any of the strings is in vec
res <- sapply(test, function(x) any(grepl(x, vec)))
# reassign the names so they're readable
names(res) <- frtest
#> ABC|ADC ABC|ADC|1|2 ABC|ADG ABC|ADG|2|5
#> TRUE TRUE TRUE TRUE
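If the search string should be taken literally, an alternative to escaping by hand is grepl's fixed = TRUE argument, which switches off regex interpretation. A small sketch using the vec from the question:

```r
vec <- c("ABC|ADC|1", "ABC|ADG|2")

# As a regex, "ADC|DFG" means "ADC or DFG", so the first element matches
as_regex <- grepl("ADC|DFG", vec)
# [1]  TRUE FALSE

# As a literal string, "ADC|DFG" occurs in neither element
as_literal <- grepl("ADC|DFG", vec, fixed = TRUE)
# [1] FALSE FALSE
```

Note that this is still a substring search: a literal "ADC" on its own would match the first element either way.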

For two vectors vec and test, this returns a vector which is TRUE if either the corresponding element of test is the start of one of the elements of vec, or one of the elements of vec is the start of the corresponding element of test.
vec <- c("ABC|ADC|1","ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ADC|1", "ABC|ADC")
colSums(sapply(test, startsWith, vec) | t(sapply(vec, startsWith, test))) > 0
# ADC|DFG ABC|ADC|1|5 ADC|1 ABC|ADC
# FALSE TRUE FALSE TRUE
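One caveat with plain prefix matching is that a partial segment such as "ABC|AD" also counts as the start of "ABC|ADC|1". The sketch below is one possible way to compare whole pipe-delimited segments only, by appending the separator to both sides before calling startsWith. The function name is_related and its sep argument are hypothetical, not part of the answer above.

```r
# Hypothetical helper: TRUE when a test string and any element of vec
# are in a prefix relationship on whole "|"-separated segments.
is_related <- function(test, vec, sep = "|") {
  v <- paste0(vec, sep)
  vapply(test, function(s) {
    s2 <- paste0(s, sep)
    any(startsWith(v, s2) | startsWith(s2, v))
  }, logical(1))
}

vec  <- c("ABC|ADC|1", "ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ABC|ADC", "ABC|AD")
res  <- is_related(test, vec)
#     ADC|DFG ABC|ADC|1|5     ABC|ADC      ABC|AD
#       FALSE        TRUE        TRUE       FALSE
```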

How to apply list of regex pattern on list

I have a list of strings and a list of patterns
like:
links <- c(
"http://www.google.com"
,"google.com"
,"www.google.com"
,"http://google.com"
,"http://google.com/"
,"www.google.com/#"
,"www.google.com/xpto"
,"http://google.com/xpto"
,"http://google.com/xpto&utml"
,"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA")
patterns <- c(".com$","/$")
What I want is to remove all links that match these patterns
and get this result:
"www.google.com/#"
"www.google.com/xpto"
"http://google.com/xpto"
"http://google.com/xpto&utml"
"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
If I use
x <- lapply(patterns, grepl, links)
I get
[[1]]
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
which gives me these two lists:
> links[!x[[2]]]
[1] "http://www.google.com" "google.com"
[3] "www.google.com" "http://google.com"
[5] "www.google.com/#" "www.google.com/xpto"
[7] "http://google.com/xpto" "http://google.com/xpto&utml"
[9] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
> links[!x[[1]]]
[1] "http://google.com/" "www.google.com/#"
[3] "www.google.com/xpto" "http://google.com/xpto"
[5] "http://google.com/xpto&utml" "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
In this case each result removes the matches of only one pattern, but I want a single list with the matches of all patterns removed. How can I apply all the regexes to one result, or merge the n logical vectors so that the result is TRUE wherever any of them is TRUE?
For example:
b <- list(
c(TRUE,FALSE,FALSE,TRUE,FALSE),
c(FALSE,FALSE,TRUE,TRUE,FALSE),
c(FALSE,FALSE,FALSE,FALSE,FALSE))
res <- somefunction(b)
res
TRUE,FALSE,TRUE,TRUE,FALSE

In most cases the best solution will be to merge the regular expression patterns, and to apply a single pattern search, as shown in Thomas’ answer.
However, it is also trivial to merge logical vectors by combining them with logical operations. In your case, you want to compute the element-wise logical disjunction. Between two vectors, this can be computed as x | y. Across a list of multiple vectors, it can be computed using Reduce(`|`, logical_list).
In your case, this results in:
any_matching = Reduce(`|`, lapply(patterns, grepl, links))
result = links[! any_matching]
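To make the Reduce step concrete, here is a minimal sketch applied to the three toy vectors from the question, stored in a list (the name b follows the question):

```r
b <- list(
  c(TRUE,  FALSE, FALSE, TRUE,  FALSE),
  c(FALSE, FALSE, TRUE,  TRUE,  FALSE),
  c(FALSE, FALSE, FALSE, FALSE, FALSE)
)

# Fold the list with element-wise OR: ((b[[1]] | b[[2]]) | b[[3]])
res <- Reduce(`|`, b)
# [1]  TRUE FALSE  TRUE  TRUE FALSE

# An equivalent formulation: bind the vectors as columns, count TRUEs per row
res_alt <- rowSums(do.call(cbind, b)) > 0
```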

This should do what you want:
links[!sapply("(\\.com|/)$", grepl, links)]
Explanation:
sapply simplifies the result, so you get a logical matrix (here a single column, which indexes links like a vector) rather than a list.
I'd use the pattern "(\\.com|/)$" (i.e. ends with .com OR /).
Finally, I negate the resulting logical vector using !.

You can try the base R code below, using grep
r <- grep(paste0(patterns,collapse = "|"),links,value = TRUE,invert = TRUE)
such that
> r
[1] "www.google.com/#"
[2] "www.google.com/xpto"
[3] "http://google.com/xpto"
[4] "http://google.com/xpto&utml"
[5] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
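One detail worth checking when collapsing the patterns: in ".com$" the dot is unescaped, so it matches any character. The sketch below compares the original patterns with an escaped variant on a small set of links; the URL "http://example.org/telecom" is a made-up example, not from the question.

```r
links <- c("http://google.com",
           "http://example.org/telecom",  # hypothetical link ending in "ecom"
           "www.google.com/xpto")

patterns_loose  <- c(".com$", "/$")     # "." matches any character
patterns_strict <- c("\\.com$", "/$")   # "\\." matches a literal dot

keep_loose  <- grep(paste0(patterns_loose,  collapse = "|"), links,
                    value = TRUE, invert = TRUE)
# [1] "www.google.com/xpto"

keep_strict <- grep(paste0(patterns_strict, collapse = "|"), links,
                    value = TRUE, invert = TRUE)
# [1] "http://example.org/telecom" "www.google.com/xpto"
```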

You can do this using the stringr::str_subset() function.
library(stringr)
str_subset(links, pattern = ".com$|/$", negate = TRUE)

Grep when pattern is found exactly n times [duplicate]

I am looking for a regular expression to capture strings where the pattern occurs exactly n times. Here is an example with expected output.
# find sentences with 2 occurrences of the word "is"
z = c("this is what it is and is not", "this is not", "this is it it is")
regex_function(z)
[1] FALSE FALSE TRUE
I have gotten this far:
grepl("(.*\\bis\\b.*){2}",z)
[1] TRUE FALSE TRUE
But this will return TRUE if there are at least 2 matches. How can I force it to look for strings with exactly 2 occurrences?

To find where the word "is" occurs exactly two times, you can remove every "is" with gsub and compare the lengths of the strings with nchar.
nchar(z) - nchar(gsub("(\\bis\\b)", "", z)) == 4
#[1] FALSE FALSE TRUE
or count the hits of gregexpr like:
sapply(gregexpr("\\bis\\b", z), function(x) sum(x>0)) == 2
#[1] FALSE FALSE TRUE
or with a regex in grepl
grepl("^(?!(.*\\bis\\b){3})(.*\\bis\\b){2}.*$", z, perl=TRUE)
#[1] FALSE FALSE TRUE
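The gregexpr counting approach generalizes to any pattern and any n. A minimal sketch in base R follows; the helper name has_exactly_n is hypothetical.

```r
# Hypothetical helper: TRUE where `pattern` occurs exactly `n` times in x
has_exactly_n <- function(x, pattern, n) {
  hits <- gregexpr(pattern, x, perl = TRUE)
  # gregexpr returns -1 when there is no match, so count positive positions
  counts <- vapply(hits, function(m) sum(m > 0), integer(1))
  counts == n
}

z <- c("this is what it is and is not", "this is not", "this is it it is")
two_hits   <- has_exactly_n(z, "\\bis\\b", 2)
# [1] FALSE FALSE  TRUE
three_hits <- has_exactly_n(z, "\\bis\\b", 3)
# [1]  TRUE FALSE FALSE
```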

This is an option that works but needs 2 regex calls. I am still looking for a compact regex call which correctly solves this issue.
grepl("(.*\\bis\\b.*){2}",z) & !grepl("(.*\\bis\\b.*){3}",z)
Basically, this adds a second grepl for n+1 occurrences and keeps only the strings that satisfy the first grepl but not the second.

library(stringi)
stri_count_regex(z, "\\bis\\b") == 2L
# [1] FALSE FALSE TRUE

With stringr:
library(stringr)
library(magrittr)
regex_function = function(str){
  str_extract_all(str,"\\bis\\b") %>%
    lapply(.,function(x){length(x) == 2}) %>%
    unlist()
}
> regex_function(z)
[1] FALSE FALSE TRUE

r language: how to find rows with match in list?

In R, if I have a data frame with a column for which each row entry contains a list, how can I search for those rows containing a match in that list?
For example, how can I return the indices for those rows containing "Algebra" (e.g. rows 1 and 3) in the following:
> df[1:3,]$classes
[[1]]
[1] "Algebra" "Calculus"
[[2]]
[1] "Geometry"
[[3]]
[1] "Geometry" "Quantum Mechanics" "Algebra"

We can use sapply to loop over the list, use grepl to get logical vector, wrap with any to return only a single TRUE/FALSE value per list element.
sapply(df[1:3,]$classes, function(x) any(grepl('Algebra', x)))
#[1] TRUE FALSE TRUE
Or we can use %in% to return only a single TRUE/FALSE per list element.
sapply(df[1:3,]$classes, '%in%', x='Algebra')
#[1] TRUE FALSE TRUE
Another option is is.element
sapply(df[1:3,]$classes, is.element, el='Algebra')
#[1] TRUE FALSE TRUE
Or, as @Richard Scriven mentioned, == can be used
sapply(df[1:3,]$classes, function(x) any(x == "Algebra"))
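Since the question asks for row indices rather than a logical vector, the result can be wrapped in which(). The sketch below rebuilds a small data frame with a list column to make the example self-contained; the id column is an assumption, as the original df is not shown in full.

```r
# Reconstructed example data (the id column is hypothetical)
df <- data.frame(id = 1:3)
df$classes <- list(c("Algebra", "Calculus"),
                   "Geometry",
                   c("Geometry", "Quantum Mechanics", "Algebra"))

# One TRUE/FALSE per row, then convert to row indices
has_algebra <- vapply(df$classes, function(x) "Algebra" %in% x, logical(1))
idx <- which(has_algebra)
# [1] 1 3
```

vapply is used here instead of sapply only to guarantee a logical vector even when the list column is empty.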

How can I check if multiple strings exist in another string?

I have this string:
myStr <- "I am very beautiful btw"
str <- c("very","beauti","bt")
Now I want to check whether myStr includes all strings in str, how can I do this in R? For example above it should be TRUE.
Many Thanks

Yes, you can use grepl (not grep, actually), but you must run it once for each substring:
> sapply(str, grepl, myStr)
very beauti bt
TRUE TRUE TRUE
To get only one result if all of them are true, use all:
> all(sapply(str, grepl, myStr))
[1] TRUE
Edit:
In case you have more than one string to check, say:
myStrings <- c("I am very beautiful btw", "I am not beautiful btw")
You then run the sapply code, which will return a matrix with one row for each string in myStrings. Apply all on each row:
> apply(sapply(str, grepl, myStrings), 1, all)
[1] TRUE FALSE
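The steps above can be folded into one small function. This is only a sketch: the name contains_all is hypothetical, and fixed = TRUE is an added assumption so that the substrings are treated literally rather than as regular expressions.

```r
# Hypothetical helper: for each text, are all needles present as literal substrings?
contains_all <- function(text, needles) {
  vapply(text, function(s) {
    all(vapply(needles, grepl, logical(1), x = s, fixed = TRUE))
  }, logical(1), USE.NAMES = FALSE)
}

needles   <- c("very", "beauti", "bt")
myStrings <- c("I am very beautiful btw", "I am not beautiful btw")
found <- contains_all(myStrings, needles)
# [1]  TRUE FALSE
```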

Using stringr you could do:
library(stringr)
str_detect(myStr, str)
Which returns a result for each substring:
#[1] TRUE TRUE TRUE
Or, as per @thelatemail's suggestion, if you want to know if all of them are true:
all(str_detect(myStr,str))
Which gives:
#[1] TRUE
You could also find the location (start, end) in myStr of the first match of each element of str
str_locate(myStr, str)
Which gives:
# start end
#[1,] 6 9
#[2,] 11 16
#[3,] 21 22
