detect string with both AND and OR boolean operator in R

detect string with both AND and OR boolean operator in R - r

I have a text like this:
text = 'I love apple, pear, grape and peach'
If I want to know if the text contain either apple or pear. I can do the following and works fine:
str_detect(text,"apple|pear")
[1] TRUE
my question is what if I want to use boolean like this (apple OR pear) AND (grape).
Is there anyway that I can put it in str_detect(). Is that possible?
The following is NOT working:
str_detect(text,"(apple|pear) & (grape)" )
[1] FALSE
The reason I want to know this is I want to program to convert a 'boolean query' and feed into the grep or str_detect. something like:
str_detect(text, '(word1|word2) AND (word2|word3|word4) AND (word5|word6) AND .....')
The number of AND varies....
No solution with multiple str_detect please.

You can pass all the patterns to str_detect as a vector and check that they're all TRUE with all.
patterns <- c('apple|pear', 'grape')
all(str_detect(text, patterns))
Or with base R
all(sapply(patterns, grepl, x = text))
Or, you could put the patterns in a list and use map, which would give more detailed output for the ORs (or anything else you may want to put as a list element)
patterns <- list(c('apple', 'pear'), 'peach')
patterns %>%
map(str_detect, string = text)
# [[1]]
# [1] TRUE TRUE
#
# [[2]]
# [1] TRUE
It's also possible to write it as a single regular expression, but I see no reason to do this
patterns <- c('apple|pear', 'grape')
patt_combined <- paste(paste0('(?=.*', patterns, ')'), collapse = '')
str_detect(text, patt_combined)
patt_combined is
# [1] "(?=.*apple|pear)(?=.*grape)"

Related

Accent insensitive regex in R

I'm trying to use filter(grepl()) to match some words in my column. Let's suppose I want to extract the word "Guartelá". In my column, i have variations such as "guartela" "guartelá" and "Guartela". To match upper/lowercase words I'm using (?i). However, I haven't found a good way to match accent/no-accent (i.e., "guartelá" and "guartela").
I know that I can simply substitute á by a, but is there a way to assign the accent-insensitive in the code? It can be base R/tidyverse/any, I don't mind.
Here's how my curent code line is:
cobras <- final %>% filter(grepl("(?i)guartelá", NAME)
| grepl("(?i)guartelá", locality))
Cheers

you can use stri_trans_general fron stringi to remove all accents:
unaccent_chars= stringi::stri_trans_general(c("guartelá","with_é","with_â","with_ô") ,"Latin-ASCII")
unaccent_chars
# [1] "guartela" "with_e" "with_a" "with_o"
# grepl(paste(unaccent_chars,collapse = "|"), string)

You can pass options in OR statements using [ to capture different combinations
> string <- c("Guartelá", "Guartela", "guartela", "guartelá", "any")
> grepl("[Gg]uartel[aá]", string)
[1] TRUE TRUE TRUE TRUE FALSE

Another option using str_detect():
library(tidyverse)
tibble(name = c("guartela","guartelá", "Guartela", "Other")) |>
filter(str_detect(name, "guartela|guartelá|Guartela"))

German keyword search - look for every possible combination

I'm working on a project where I define some nouns like Haus, Boot, Kampf, ... and what to detect every version (singular/plurar) and every combination of these words in sentences. For example, the algorithm should return true if a sentences does contain one of : Häuser, Hausboot, Häuserkampf, Kampfboot, Hausbau, Bootsanleger, ....
Are you familiar with an algorithm that can do such a thing (preferable in R)? Of course I could implement this manually, but I'm pretty sure that something should already exist.
Thanks!

you can use stringr library and the grepl function as it is done in this example:
> # Toy example text
> text1 <- c(" This is an example where Hausbau appears twice (Hausbau)")
> text2 <- c(" Here it does not appear the name")
> # Load library
> library(stringr)
> # Does it appear "Hausbau"?
> grepl("Hausbau", text1)
[1] TRUE
> grepl("Hausbau", text2)
[1] FALSE
> # Number of "Hausbau" in the text
> str_count(text1, "Hausbau")
[1] 2

check <- c("Der Häuser", "Das Hausboot ist", "Häuserkampf", "Kampfboot im Wasser", "NotMe", "Hausbau", "Bootsanleger", "Schauspiel")
base <- c("Haus", "Boot", "Kampf")
unlist(lapply(str_to_lower(stringi::stri_trans_general(check, "Latin-ASCII")), function(x) any(str_detect(x, str_to_lower(base)) == T)))
# [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
Breaking it down
Note the comment of Roland, you will match false TRUE values in words like "Schauspiel"
You need to get rid of the special characters, you can use stri_trans_general to translate them to Latin-ASCII
You need to convert your strings to lowercase (i.e. match Boot in Kampfboot)
Then apply over the strings to test and check if they are in the base list, if any of those values is true. You got a match.

stringr::str_starts returns TRUE when it shouldn't

I am trying to detect whether a string starts with either of the provided strings (separated by | )
name = "KKSWAP"
stringr::str_starts(name, "RTT|SWAP")
returns TRUE, but
str_starts(name, "SWAP|RTT")
returns FALSE
This behaviour seems wrong, as KKSWAP doesn't start with "RTT" or "SWAP". I would expect this to be false in both above cases.

The reason can be found in the code of the function :
function (string, pattern, negate = FALSE)
{
switch(type(pattern), empty = , bound = stop("boundary() patterns are not supported."),
fixed = stri_startswith_fixed(string, pattern, negate = negate,
opts_fixed = opts(pattern)), coll = stri_startswith_coll(string,
pattern, negate = negate, opts_collator = opts(pattern)),
regex = {
pattern2 <- paste0("^", pattern)
attributes(pattern2) <- attributes(pattern)
str_detect(string, pattern2, negate)
})
}
You can see, it pastes '^' in front of the parttern, so in your example it looks for '^RR|SWAP' and finds 'SWAP'.
If you want to look at more than one pattern you should use a vector:
name <- "KKSWAP"
stringr::str_starts(name, c("RTT","SWAP"))
# [1] FALSE FALSE
If you want just one answer, you can combine with any()
name <- "KKSWAP"
stringr::str_starts(name, c("RTT","SWAP"))
# [1] FALSE
The advantage of stringr::str_starts() is the vectorisation of the pattern argument, but if you don't need it grepl('^RTT|^SWAP', name), as suggested by TTS, is a good base R alternative.
Alternatively, the base function startsWith() suggested by jpsmith offers both the vectorized and | options :
startsWith(name, c("RTT","SWAP"))
# [1] FALSE FALSE
startsWith(name, "RTT|SWAP")
# [1] FALSE

I'm not familiar with the stringr version, but the base R version startsWith returns your desired result. If you don't have to use stringr, this may be a solution:
startsWith(name, "RTT|SWAP")
startsWith(name, "SWAP|RTT")
startsWith(name, "KK")
# > startsWith(name, "RTT|SWAP")
# [1] FALSE
# > startsWith(name, "SWAP|RTT")
# [1] FALSE
# > startsWith(name, "KK")
# [1] TRUE

The help text describes str_starts: Detect the presence or absence of a pattern at the beginning or end of a string. This might be why it's not behaving quite as expected.
pattern is the Pattern with which the string starts or ends.
We can add ^ regex to make it search at the beginning of string and get the expected result.
name = 'KKSWAP'
str_starts(name, '^RTT|^SWAP')
I would prefer grepl in this instance because it seems less misleading.
grepl('^RTT|^SWAP', name)

How to specify to grepl to take into account a vowel irrespective of the accent

I am working on data where the words are in French.
I want grepl to take into account the word, whether the vowel has an accent or not.
Here is a part of code:
Here I want grepl to spot all word that are radiothérapie or radiotherapie, that is to ignore the accent
ifelse(grepl("Radiothe(é)rapie",mydata$word),"yes","no")

ifelse(grepl("Radioth[eé]rapie", c("Radiotherapie", "Radiothérapie", "Radio")),"yes","no")

Well the brute force way of doing this would be to use a character class containing all variations of the letter, e.g.
ifelse(grepl("Radioth[eé]rapie", mydata$word), "yes", "no")

possible solution: convert to Latin-ASCII before using grepl.
x <- c("radiothérapie", "radiotherapie")
grepl("radiotherapie", stringi::stri_trans_general(x,"Latin-ASCII"))
[1] TRUE TRUE
this should work for most (all?) accents..

You can take a more universal approach
string = "ábçdêfgàõp"
iconv(string, to='ASCII//TRANSLIT')
# [1] "abcdefgaop"
For your scenario
x <- "Radiotherapie"
y <- c("Radiotherapie", "Radiothérapie", "Radio")
grepl(iconv(x, to='ASCII//TRANSLIT'), iconv(y, to='ASCII//TRANSLIT'))
# [1] TRUE TRUE FALSE

Here is a trick using grepl with regex:
Use caret inside the group which negates your selection:
x <- c("radiothérapie", "radiotherapie")
grepl('radioth[é^e]rapie', x)
[1] TRUE TRUE

Ignore or display NA in a row if the search word is not available in a list- R

How to print or display Not Available if any of my search list in (Table_search) is not available in the list I input. In the input I have three lines and I have 3 keywords to search through these lines and tell me if the keyword is present in those lines or not. If present print that line else print Not available like I showed in the desired output.
My code just prints all the available lines but that doesn't help as I need to know where is the word is missing as well.
Table_search <- list("Table 14", "Source Data:","VERSION")
Table_match_list <- sapply(Table_search, grep, x = tablelist, value = TRUE)
Input:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry
Desired Output:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
NA
#r2evans
sapply(unlist(Table_search), grepl, x = dat)
I get a good output with this code actually, but instead of true or false I would like to print the actual data.

I think a single regex will do it:
replace(dat, !grepl(paste(unlist(Table_search), collapse="|"), dat), NA)
# [1] "Table 14.1.1.1 (Page 1 of 2)" "Source Data: Listing 16.2.1.1.1"
# [3] NA
One problem with using sapply(., grep) is that grep returns integer indices, and if no match is made then it returns a length-0 vector. For sapply (a class-unsafe function), this means that you may or may not get a integer vector in return. Each return may be length 0 (nothing found) or length 1 (something found), and when sapply finds that each return value is not exactly the same length, it returns a list instead (ergo my "class-unsafe" verbiage above).
This doesn't change when you use value=TRUE: change my reasoning above about "0 or 1 logical" into "0 or 1 character", and it's the same exact problem.
Because of this, I suggest grepl: it should always return logical indicating found or not found.
Further, since you don't appear to need to differentiate which of the patterns is found, just "at least one of them", then we can use a single regex, joined with the regex-OR operator |. This works with an arbitrary length of your Table_search list.
If you somehow needed to know which of the patterns was found, then you might want something like:
sapply(unlist(Table_search), grepl, x = dat)
# Table 14 Source Data: VERSION
# [1,] TRUE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE FALSE
and then figure out what to do with the different columns (each row indicates a string within the dat vector).
One way (that is doing the same as my first code suggestion, albeit less efficiently) is
rowSums(sapply(unlist(Table_search), grepl, x = dat)) > 0
# [1] TRUE TRUE FALSE
where the logical return value indicates if something was found. If, for instance, you want to know if two or more of the patterns were found, one might use rowSums(.) >= 2).
Data
Table_search <- list("Table 14", "Source Data:","VERSION")
dat <- c("Table 14.1.1.1 (Page 1 of 2)", "Source Data: Listing 16.2.1.1.1", "Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry")

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

detect string with both AND and OR boolean operator in R - r

Related

Accent insensitive regex in R

German keyword search - look for every possible combination

stringr::str_starts returns TRUE when it shouldn't

How to specify to grepl to take into account a vowel irrespective of the accent

Ignore or display NA in a row if the search word is not available in a list- R

Categories

Resources