Accent insensitive regex in R - r

I'm trying to use filter(grepl()) to match some words in my column. Suppose I want to extract the word "Guartelá". In my column, I have variations such as "guartela", "guartelá" and "Guartela". To match upper/lowercase words I'm using (?i). However, I haven't found a good way to match both accented and unaccented forms (i.e., "guartelá" and "guartela").
I know that I can simply substitute á with a, but is there a way to make the match accent-insensitive in the code itself? It can be base R/tidyverse/anything, I don't mind.
Here's my current code:
cobras <- final %>% filter(grepl("(?i)guartelá", NAME)
| grepl("(?i)guartelá", locality))
Cheers

You can use stri_trans_general() from stringi to remove all accents:
unaccent_chars <- stringi::stri_trans_general(c("guartelá", "with_é", "with_â", "with_ô"), "Latin-ASCII")
unaccent_chars
# [1] "guartela" "with_e" "with_a" "with_o"
# grepl(paste(unaccent_chars,collapse = "|"), string)
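Applied to the original question, the idea could look like the sketch below. It assumes the stringi package is installed, and the locality vector is made-up sample data rather than the asker's real column. The text is transliterated first, then matched with ignore.case = TRUE so that case is covered as well.

```r
library(stringi)

# Hypothetical sample data standing in for the `locality` column
locality <- c("Guartelá", "guartela", "GUARTELA", "Canyon Guartelá", "Curitiba")

# Strip accents first, then match without regard to case
locality_ascii <- stri_trans_general(locality, "Latin-ASCII")
hits <- grepl("guartela", locality_ascii, ignore.case = TRUE)

# Subset the original (still accented) values with the logical index
locality[hits]
```

The same logical vector can be used inside filter(), since the accents are only removed for the comparison and the stored values stay untouched.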

You can list the alternatives for each position in a character class, [ ], to cover the different combinations:
> string <- c("Guartelá", "Guartela", "guartela", "guartelá", "any")
> grepl("[Gg]uartel[aá]", string)
[1] TRUE TRUE TRUE TRUE FALSE
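If writing the classes by hand gets tedious, they could be generated from the plain word. The sketch below uses base R only; accent_classes and accent_insensitive() are hypothetical names, the list of accented variants is an assumption that you would extend for your language, and the word is assumed to contain no regex metacharacters.

```r
# Hypothetical lookup: each plain vowel and the accented forms to accept
accent_classes <- c(a = "[aáàâã]", e = "[eéèê]", i = "[iíìî]",
                    o = "[oóòôõ]", u = "[uúùûü]")

# Replace every plain vowel in `word` with its character class
accent_insensitive <- function(word) {
  chars <- strsplit(word, "")[[1]]
  is_vowel <- chars %in% names(accent_classes)
  chars[is_vowel] <- accent_classes[chars[is_vowel]]
  paste(chars, collapse = "")
}

pattern <- accent_insensitive("guartela")
string <- c("Guartelá", "Guartela", "guartela", "guartelá", "any")
result <- grepl(pattern, string, ignore.case = TRUE)
result
```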

Another option using str_detect():
library(tidyverse)
tibble(name = c("guartela","guartelá", "Guartela", "Other")) |>
filter(str_detect(name, "guartela|guartelá|Guartela"))

Related

stringr::str_starts returns TRUE when it shouldn't

I am trying to detect whether a string starts with either of the provided strings (separated by | )
name = "KKSWAP"
stringr::str_starts(name, "RTT|SWAP")
returns TRUE, but
str_starts(name, "SWAP|RTT")
returns FALSE
This behaviour seems wrong, as KKSWAP doesn't start with "RTT" or "SWAP". I would expect this to be false in both above cases.
The reason can be found in the code of the function :
function (string, pattern, negate = FALSE)
{
    switch(type(pattern), empty = , bound = stop("boundary() patterns are not supported."),
        fixed = stri_startswith_fixed(string, pattern, negate = negate,
            opts_fixed = opts(pattern)), coll = stri_startswith_coll(string,
            pattern, negate = negate, opts_collator = opts(pattern)),
        regex = {
            pattern2 <- paste0("^", pattern)
            attributes(pattern2) <- attributes(pattern)
            str_detect(string, pattern2, negate)
        })
}
You can see that it pastes '^' in front of the pattern, so in your example it looks for '^RTT|SWAP'. Because | has the lowest precedence, this means "RTT at the start, or SWAP anywhere", and it finds 'SWAP'.
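The precedence issue can be reproduced with base R alone. This is a small sketch using grepl() in place of str_starts(); the variable names are only for illustration.

```r
name <- "KKSWAP"

# "^RTT|SWAP" reads as (^RTT) or (SWAP): the anchor covers only the first branch
unanchored <- grepl("^RTT|SWAP", name)

# Grouping the alternation makes the anchor apply to both branches
grouped <- grepl("^(RTT|SWAP)", name)

# A string that really starts with one of the prefixes still matches
grouped_hit <- grepl("^(RTT|SWAP)", "SWAPKK")

c(unanchored, grouped, grouped_hit)
# [1]  TRUE FALSE  TRUE
```

Given the source shown above, wrapping the alternation in parentheses before passing it to str_starts() should produce the same grouped pattern, since the function only prepends '^'.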
If you want to look at more than one pattern you should use a vector:
name <- "KKSWAP"
stringr::str_starts(name, c("RTT","SWAP"))
# [1] FALSE FALSE
If you want just one answer, you can combine it with any():
name <- "KKSWAP"
any(stringr::str_starts(name, c("RTT","SWAP")))
# [1] FALSE
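The advantage of stringr::str_starts() is the vectorisation of the pattern argument, but if you don't need it, grepl('^RTT|^SWAP', name), as suggested by TTS, is a good base R alternative.
Alternatively, the base function startsWith(), suggested by jpsmith, is vectorised over the prefix. Note that it compares literal strings rather than regular expressions, so "RTT|SWAP" is treated as one prefix containing a literal | and returns FALSE for that reason: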
startsWith(name, c("RTT","SWAP"))
# [1] FALSE FALSE
startsWith(name, "RTT|SWAP")
# [1] FALSE
I'm not familiar with the stringr version, but the base R version startsWith returns your desired result. If you don't have to use stringr, this may be a solution:
startsWith(name, "RTT|SWAP")
startsWith(name, "SWAP|RTT")
startsWith(name, "KK")
# > startsWith(name, "RTT|SWAP")
# [1] FALSE
# > startsWith(name, "SWAP|RTT")
# [1] FALSE
# > startsWith(name, "KK")
# [1] TRUE
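One caveat worth checking before relying on this: startsWith() does not interpret its prefix as a regular expression, so the | is matched literally. The short base R sketch below (variable names are illustrative) shows that "RTT|SWAP" returns FALSE even for a string that does start with SWAP, and that a vector of prefixes combined with any() gives the OR behaviour.

```r
# The prefix is a literal string, so "|" is not an OR operator here
literal_miss <- startsWith("SWAPKK", "RTT|SWAP")

# It only matches when the text really begins with the characters RTT|SWAP
literal_hit <- startsWith("RTT|SWAPKK", "RTT|SWAP")

# For "starts with either prefix", pass a vector and reduce with any()
either_prefix <- any(startsWith("SWAPKK", c("RTT", "SWAP")))

c(literal_miss, literal_hit, either_prefix)
# [1] FALSE  TRUE  TRUE
```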
The help text describes str_starts: Detect the presence or absence of a pattern at the beginning or end of a string. This might be why it's not behaving quite as expected.
pattern is the Pattern with which the string starts or ends.
We can add ^ regex to make it search at the beginning of string and get the expected result.
name = 'KKSWAP'
str_starts(name, '^RTT|^SWAP')
I would prefer grepl in this instance because it seems less misleading.
grepl('^RTT|^SWAP', name)

How to specify to grepl to take into account a vowel irrespective of the accent

I am working on data where the words are in French.
I want grepl to take into account the word, whether the vowel has an accent or not.
Here is a part of code:
Here I want grepl to spot all words that are radiothérapie or radiotherapie, that is, to ignore the accent:
ifelse(grepl("Radiothe(é)rapie",mydata$word),"yes","no")
ifelse(grepl("Radioth[eé]rapie", c("Radiotherapie", "Radiothérapie", "Radio")),"yes","no")
Well the brute force way of doing this would be to use a character class containing all variations of the letter, e.g.
ifelse(grepl("Radioth[eé]rapie", mydata$word), "yes", "no")
A possible solution: convert to Latin-ASCII before using grepl.
x <- c("radiothérapie", "radiotherapie")
grepl("radiotherapie", stringi::stri_trans_general(x,"Latin-ASCII"))
[1] TRUE TRUE
This should work for most (all?) accents.
You can take a more universal approach
string = "ábçdêfgàõp"
iconv(string, to='ASCII//TRANSLIT')
# [1] "abcdefgaop"
For your scenario
x <- "Radiotherapie"
y <- c("Radiotherapie", "Radiothérapie", "Radio")
grepl(iconv(x, to='ASCII//TRANSLIT'), iconv(y, to='ASCII//TRANSLIT'))
# [1] TRUE TRUE FALSE
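Here is an option using grepl with a character class:
List both forms of the letter inside the brackets. Note that the caret in the pattern below does not negate anything: a caret negates a class only when it is the first character after [, so here it is just a literal ^ and [eé] would be enough: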
x <- c("radiothérapie", "radiotherapie")
grepl('radioth[é^e]rapie', x)
[1] TRUE TRUE
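The role of the caret can be checked directly. A caret negates a character class only when it comes first; anywhere else in the brackets it is an ordinary character. The sketch below uses base R, assumes a UTF-8 session, and adds two made-up test strings to make the difference visible.

```r
x <- c("radiothérapie", "radiotherapie", "radioth^rapie", "radiothxrapie")

# Caret in the middle of the class: matches é, a literal ^, or e
caret_inside <- grepl("radioth[é^e]rapie", x)

# Plain class without the caret: matches é or e only
plain_class <- grepl("radioth[eé]rapie", x)

# Caret first: negated class, matches any single character except e and é
negated <- grepl("radioth[^eé]rapie", x)

rbind(caret_inside, plain_class, negated)
```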

Count number of dots in character string with str_count?

I am trying to count the number of dots in a character string.
I have tried to use str_count but it gives me the number of letters of the string instead.
ex_str <- "This.is.a.string"
str_count(ex_str, '.')
nchar(ex_str)
. is a special regex symbol, so you need to escape it:
str_count(ex_str, '\\.')
# [1] 3
Using just base R you could do:
nchar(gsub("[^.]", "", ex_str))
Using stringi:
stri_count_fixed(ex_str, '.')
Another base R solution could be:
length(grepRaw(".", ex_str, fixed = TRUE, all = TRUE))
[1] 3
You may also use the base function gregexpr:
sum(gregexpr(".", ex_str, fixed=TRUE)[[1]] > 0)
[1] 3
You can use stringr::str_count with a fixed(...) argument to avoid treating it as a regular expression:
str_count(ex_str, fixed('.'))
See the online R demo:
library(stringr)
ex_str <- "This.is.a.string"
str_count(ex_str, fixed('.'))
## => [1] 3
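For a whole vector of strings, a base R sketch along the same lines is shown below. It contrasts the regex meaning of '.' (any character) with the literal dot; the extra sample strings are made up for illustration.

```r
ex_str <- c("This.is.a.string", "no dots", "a.b")

# As a regex, "." matches every character, so this just counts characters
any_char <- lengths(regmatches(ex_str, gregexpr(".", ex_str)))

# With fixed = TRUE the dot is literal, so this counts actual dots
literal_dot <- lengths(regmatches(ex_str, gregexpr(".", ex_str, fixed = TRUE)))

any_char
# [1] 16  7  3
literal_dot
# [1] 3 0 1
```

Unlike the sum(gregexpr(...) > 0) form, this handles several strings at once and returns 0 for strings without a dot.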

detect string with both AND and OR boolean operator in R

I have a text like this:
text = 'I love apple, pear, grape and peach'
If I want to know whether the text contains either apple or pear, I can do the following, which works fine:
str_detect(text,"apple|pear")
[1] TRUE
My question is: what if I want to use a boolean expression like (apple OR pear) AND (grape)?
Is there any way to put it in a single str_detect() call? Is that possible?
The following is NOT working:
str_detect(text,"(apple|pear) & (grape)" )
[1] FALSE
The reason I ask is that I want to write a program that converts a 'boolean query' and feeds it into grep or str_detect, something like:
str_detect(text, '(word1|word2) AND (word2|word3|word4) AND (word5|word6) AND .....')
The number of AND varies....
No solution with multiple str_detect please.
You can pass all the patterns to str_detect as a vector and check that they're all TRUE with all.
patterns <- c('apple|pear', 'grape')
all(str_detect(text, patterns))
Or with base R
all(sapply(patterns, grepl, x = text))
Or, you could put the patterns in a list and use map, which would give more detailed output for the ORs (or anything else you may want to put as a list element)
patterns <- list(c('apple', 'pear'), 'peach')
patterns %>%
map(str_detect, string = text)
# [[1]]
# [1] TRUE TRUE
#
# [[2]]
# [1] TRUE
It's also possible to write it as a single regular expression using lookaheads, but I see no reason to do this. Each pattern is wrapped in a non-capturing group so that the | stays inside its own lookahead:
patterns <- c('apple|pear', 'grape')
patt_combined <- paste(paste0('(?=.*(?:', patterns, '))'), collapse = '')
str_detect(text, patt_combined)
patt_combined is
# [1] "(?=.*(?:apple|pear))(?=.*(?:grape))"
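The asker's goal of converting a boolean query string into one pattern could be sketched as follows in base R. query_to_regex() is a hypothetical helper, and it assumes the query uses the literal separator " AND " with each OR group written as (a|b). Each group is wrapped in (?:...) so that the alternation cannot leak out of its lookahead, and perl = TRUE is needed because lookaheads are not supported by the default regex engine.

```r
# Hypothetical helper: one lookahead per AND group of the query
query_to_regex <- function(query) {
  groups <- strsplit(query, " AND ", fixed = TRUE)[[1]]
  groups <- gsub("^\\(|\\)$", "", trimws(groups))  # drop the outer parentheses
  paste0("(?=.*(?:", groups, "))", collapse = "")
}

text <- "I love apple, pear, grape and peach"
patt <- query_to_regex("(apple|pear) AND (grape)")
patt
# [1] "(?=.*(?:apple|pear))(?=.*(?:grape))"

full_text <- grepl(patt, text, perl = TRUE)                  # TRUE
pear_only <- grepl(patt, "pear and grape", perl = TRUE)      # TRUE
no_grape  <- grepl(patt, "apple and peach", perl = TRUE)     # FALSE
```

The second test string contains pear but not apple, which is the case where an ungrouped alternation inside the lookahead would fail.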

Exact match with grepl R

I'm trying to extract certain records from a dataframe with grepl.
This is based on a comparison between two columns, Results and Names. The variable is built like "WordNumber", but for the same word I have multiple numbers (more than 30), so when I use the grepl expression to get, for instance, Word1, I also get results that I would like to avoid, like Word12.
Any ideas on how to fix this?
Names <- c("Word1")
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records)
Relationships <- subset(Relationships, grepl(paste(Names, collapse = "|"), Relationships$Results))
This doesn't work: it also returns Word11, Word12 and Word15. If I use fixed = TRUE then it doesn't return any result at all (which is weird). I have also tried concatenating the name part with other numbers like this, but with no success:
Relationships <- subset(Relationships, grepl(paste(paste(Names, '3', sep = ""), collapse = "|"), Relationships$Results))
Since I'm concatenating I'm not really sure of how to use the \b to enforce a full match.
Any suggestions?
In addition to @Richard's solution, there are multiple ways to enforce a full match.
\b
"\b" is an anchor that matches a word boundary, i.e. the position between a word character and a non-word character (or the start/end of the string)
> grepl("\\bWord1\\b",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
\< & \>
"\<" matches the beginning of a word, and "\>" matches the end of a word
> grepl("\\<Word1\\>",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
Use ^ to match the start of the string and $ to match the end of the string
Names <-c('^Word1$')
Or, to apply to the entire names vector
Names <-paste0('^',Names,'$')
I think this is just:
Relationships[Relationships$Results==Names,]
If you end up doing ^Word1$ you're just doing a straight subset.
If you have multiple names, then instead use:
Relationships[Relationships$Results %in% Names,]
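When there are several names and a regex is still wanted, the anchors need to wrap the whole alternation rather than only its first and last branch. A minimal base R sketch, with a second made-up name added for illustration:

```r
Names <- c("Word1", "Word15")
Results <- c("Word1", "Word11", "Word12", "Word15")

# Group the alternation so that ^ and $ apply to every name
pattern <- paste0("^(", paste(Names, collapse = "|"), ")$")
pattern
# [1] "^(Word1|Word15)$"

regex_match <- grepl(pattern, Results)
exact_match <- Results %in% Names

regex_match
# [1]  TRUE FALSE FALSE  TRUE
```

Both approaches select the same rows here; %in% is simpler and avoids any trouble with regex metacharacters in the names.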
