r %in% operator | control case sensitivity [duplicate] - r

This question already has answers here:
Case-insensitive search of a list in R
(7 answers)
Closed 6 years ago.
Is there a way to control the case sensitivity of the %in% operator? In my case I want it to return true no matter the case of the input:
stringList <- c("hello", "world")
"Hello" %in% stringList
"helLo" %in% stringList
"hello" %in% stringList
Consider this code a reproducible example. In my real application, the left-hand side is also a vector of strings, and I check whether its words are present in stringList.

Use grepl instead as it has an ignore.case parameter:
grepl("^HeLLo$",stringList,ignore.case=TRUE)
[1] TRUE FALSE
The first argument is a regular expression, so it gives you more flexibility, but you have to start with ^ and end with $ to avoid picking up sub-strings.
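When several inputs need checking, the same anchored pattern can be built once per input. This is only a minimal sketch: it assumes the inputs contain no regex metacharacters, and the names inputs and found are illustrative rather than taken from the question.

```r
stringList <- c("hello", "world")
inputs <- c("Hello", "helLo", "below")  # hypothetical left-hand values

# For each input, build "^input$" and test it against every element of stringList
found <- vapply(
  inputs,
  function(w) any(grepl(paste0("^", w, "$"), stringList, ignore.case = TRUE)),
  logical(1)
)
unname(found)
# [1]  TRUE  TRUE FALSE
```

If the inputs may contain characters such as . or |, they would need escaping first, which is one reason the tolower approach can be simpler.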

In addition to @James's answer, you can also use tolower if you want to avoid regexes (this assumes stringList is already lowercase):
tolower("HeLLo") %in% stringList
If the left side is also a character vector, apply tolower to both sides, e.g.:
x <- c("Hello", "helLo", "hello", "below")
stringList <- c("heLlo", "world")
tolower(x) %in% tolower(stringList)
# [1] TRUE TRUE TRUE FALSE
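If this comparison is needed in many places, it could be wrapped in a custom infix operator. The name %in_ci% below is hypothetical and not part of base R; this is a sketch of the idea, not an established API.

```r
# Hypothetical helper: a case-insensitive variant of %in%
`%in_ci%` <- function(x, table) {
  tolower(x) %in% tolower(table)
}

x <- c("Hello", "helLo", "hello", "below")
stringList <- c("heLlo", "world")
result <- x %in_ci% stringList
result
# [1]  TRUE  TRUE  TRUE FALSE
```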

Related

How to specify to grepl to take into account a vowel irrespective of the accent

I am working on data where the words are in French.
I want grepl to take into account the word, whether the vowel has an accent or not.
Here is part of my code. I want grepl to flag every word that is radiothérapie or radiotherapie, that is, to ignore the accent:
ifelse(grepl("Radiothe(é)rapie",mydata$word),"yes","no")
ifelse(grepl("Radioth[eé]rapie", c("Radiotherapie", "Radiothérapie", "Radio")),"yes","no")
Well the brute force way of doing this would be to use a character class containing all variations of the letter, e.g.
ifelse(grepl("Radioth[eé]rapie", mydata$word), "yes", "no")
A possible solution: convert to Latin-ASCII before using grepl.
x <- c("radiothérapie", "radiotherapie")
grepl("radiotherapie", stringi::stri_trans_general(x,"Latin-ASCII"))
[1] TRUE TRUE
This should work for most (possibly all) accents.
You can take a more universal approach (note that the result of //TRANSLIT is platform-dependent):
string = "ábçdêfgàõp"
iconv(string, to='ASCII//TRANSLIT')
# [1] "abcdefgaop"
For your scenario
x <- "Radiotherapie"
y <- c("Radiotherapie", "Radiothérapie", "Radio")
grepl(iconv(x, to='ASCII//TRANSLIT'), iconv(y, to='ASCII//TRANSLIT'))
# [1] TRUE TRUE FALSE
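Here is another option using grepl with a character class.
Note that a caret negates a character class only when it is the first character inside the brackets. In [é^e] it is a literal ^, so the class matches é, ^, or e and behaves like [eé] for this data: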
x <- c("radiothérapie", "radiotherapie")
grepl('radioth[é^e]rapie', x)
[1] TRUE TRUE
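A quick check of how the caret behaves inside a character class, as a sketch. The third test string is made up for illustration, and é is written as the escape \u00e9 so the example does not depend on the file encoding.

```r
x <- c("radioth\u00e9rapie", "radiotherapie", "radioth^rapie")

# Caret in the middle of the class: a literal "^", so all three strings match
literal_caret <- grepl("radioth[\u00e9^e]rapie", x)
literal_caret
# [1] TRUE TRUE TRUE

# Caret in first position: negation, so only the string with "^" matches
negated <- grepl("radioth[^e\u00e9]rapie", x)
negated
# [1] FALSE FALSE  TRUE
```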

Compare 2 strings in R

I have data as below:
vec <- c("ABC|ADC|1","ABC|ADG|2")
I need to check if below substring is present or not
"ADC|DFG", it should return false for this as I need to match exact pattern.
"ABC|ADC|1|5" should return True as this is a child element for the first element in vector.
I tried using grepl but it returns true if I just pass ADC as well, any help is appreciated.
grepl returns TRUE because the pipe character | is special in regex: a|b means match a or b. All you need to do is escape it.
frtest<-c("ABC|ADC","ABC|ADC|1|2","ABC|ADG","ABC|ADG|2|5")
# making the last number and its pipe optional
test <- gsub('(\\|\\d)$', '(\\1)?', frtest)
# escaping all pipes: the replacement '\\\\|' yields a single backslash followed by |
test <- gsub('\\|', '\\\\|', test)
# testing if any of the strings is in vec
res <- sapply(test, function(x) any(grepl(x, vec)) )
# reassigning the names so they're readable
names(res) <-frtest
#>     ABC|ADC ABC|ADC|1|2     ABC|ADG ABC|ADG|2|5
#>        TRUE        TRUE        TRUE        TRUE
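When only a literal substring test is needed, fixed = TRUE avoids escaping altogether, because grepl then treats the pattern as plain text. This sketch covers only that simpler case; it does not handle the parent/child matching the question asks for.

```r
vec <- c("ABC|ADC|1", "ABC|ADG|2")

# As a regex, "ADC|DFG" means "ADC" or "DFG", so the first element matches
as_regex <- grepl("ADC|DFG", vec)
as_regex
# [1]  TRUE FALSE

# With fixed = TRUE the pipe is an ordinary character
as_literal <- grepl("ADC|DFG", vec, fixed = TRUE)
as_literal
# [1] FALSE FALSE

grepl("ABC|ADC", vec, fixed = TRUE)
# [1]  TRUE FALSE
```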
For two vectors vec and test, this returns a vector which is TRUE if either the corresponding element of test is the start of one of the elements of vec, or one of the elements of vec is the start of the corresponding element of test.
vec <- c("ABC|ADC|1","ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ADC|1", "ABC|ADC")
colSums(sapply(test, startsWith, vec) | t(sapply(vec, startsWith, test))) > 0
# ADC|DFG ABC|ADC|1|5 ADC|1 ABC|ADC
# FALSE TRUE FALSE TRUE

Sapply grepl data frames exact/complete matches

I have the same problem as in :
How to apply grepl for data frame
But I'm getting undesired matches, as in :
Complete word matching using grepl in R
How do I apply the \< or \b solution in a sapply environment when grepl is looping through vectors?
You can use an anonymous function that is applied to each column of the data frame.
vec1 <- c("I don't want to match this", "This is what I want to match")
vec2 <- c('Why would I match this?', "What is a good match for this?")
df <- data.frame(vec1,vec2)
sapply(df, function(x) grepl("\\<is\\>", x))
vec1 vec2
[1,] FALSE FALSE
[2,] TRUE TRUE
I found a solution myself.
It's sufficient to paste a blank space before and after each element in the vector to be matched with the sentences.
# paste0() adds exactly one space on each side; paste() would also insert its default separator
vector <- paste0(" ", vector, " ")
matches <- sapply(vector, grepl, sentences, ignore.case=TRUE )
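Padding with spaces misses words at the start or end of a sentence, or next to punctuation. A sketch of the same loop using word boundaries instead is shown here; the sample sentences and words are made up for illustration, and the words are assumed to contain no regex metacharacters.

```r
sentences <- c("This is what I want to match", "I don't want to match this")
words <- c("is", "want")  # hypothetical words to look for

# One column per word, one row per sentence; \\b marks a word boundary
matches <- sapply(
  words,
  function(w) grepl(paste0("\\b", w, "\\b"), sentences, ignore.case = TRUE)
)
matches
#         is want
# [1,]  TRUE TRUE
# [2,] FALSE TRUE
```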

R's grepl() to find multiple strings exists [duplicate]

This question already has answers here:
R regex to find two words same string, order and distance may vary
(2 answers)
Closed 2 years ago.
grepl("instance|percentage", labelTest$Text)
will return true if any one of instance or percentage is present.
How will I get true only when both the terms are present?
Text <- c("instance", "percentage", "n",
"instance percentage", "percentage instance")
grepl("instance|percentage", Text)
# TRUE TRUE FALSE TRUE TRUE
grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE TRUE
The latter one works by looking for:
('instance')(any character sequence)('percentage')
OR
('percentage')(any character sequence)('instance')
Naturally if you need to find any combination of more than two words, this will get pretty complicated. Then the solution mentioned in the comments would be easier to implement and read.
Another alternative that might be relevant when matching many words is to use positive look-ahead (can be thought of as a 'non-consuming' match). For this you have to activate perl regex.
# create a vector of word combinations
set.seed(1)
words <- c("instance", "percentage", "element",
"character", "n", "o", "p")
Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))
# grepl with multiple positive look-ahead
longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)",
Text2, perl=TRUE)
# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) &
grepl("percentage", Text2) &
grepl("element", Text2) &
grepl("character", Text2)
# they produce identical results
identical(longperl, longstrd)
Furthermore, if you have the patterns stored in a vector you can condense the expressions significantly, giving you
pat <- c("instance", "percentage", "element", "character")
longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L
As asked for in the comments, if you want to match on exact words, i.e. not match on substrings, we can specify word boundaries using \\b. E.g:
tx <- c("cent element", "percentage element", "element cent", "element centimetre")
grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
# TRUE FALSE TRUE FALSE
grepl("element", tx) & grepl("\\bcent\\b", tx)
# TRUE FALSE TRUE FALSE
This returns TRUE only if both terms occur in an item of the vector labelTest$Text.
I think this is the exact answer to the question and much shorter than the other solutions.
grepl("instance",labelTest$Text) & grepl("percentage",labelTest$Text)
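The same idea extends to any number of terms by combining the individual grepl results with Reduce. A minimal sketch, reusing the Text vector from the earlier answer; the names pat and all_present are illustrative.

```r
Text <- c("instance", "percentage", "n",
          "instance percentage", "percentage instance")
pat <- c("instance", "percentage")

# One logical vector per pattern, then combine them element-wise with &
all_present <- Reduce(`&`, lapply(pat, grepl, x = Text))
all_present
# [1] FALSE FALSE FALSE  TRUE  TRUE
```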
Use intersect and feed it a grep for each word:
# base R indexing is enough to subset the text vector; no extra package is needed
vector_of_text[
intersect(
grep(vector_of_text , pattern = "pattern1"),
grep(vector_of_text , pattern = "pattern2")
)
]
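For reference, a runnable version of the intersect approach with concrete data. The sample vector is made up for illustration, and only base R is used.

```r
vector_of_text <- c("instance", "percentage",
                    "instance percentage", "percentage instance")

# grep() returns the indices of matching elements; intersect() keeps the shared ones
idx <- intersect(
  grep("instance", vector_of_text),
  grep("percentage", vector_of_text)
)
vector_of_text[idx]
# [1] "instance percentage" "percentage instance"
```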

matching first word from a string

I have the following R code.
Test<-"CLC2" %in% "CLC2,CLC2,CLC2"
Test
Test1<-"CLC2" %in% "CLC2"
Test1
In the first case, I also want the condition to be TRUE, because the value matches the first word of the string (which is what I need).
You can find a word in a string and (if necessary) check if it is the first word of a string
gregexpr(pattern = "CLC2","CLC2,CLC2,CLC2")[[1]][1] == 1
Try
"CLC2" %in% c("CLC2", "CLC2", "CLC2")
# [1] TRUE
or
"CLC2" %in% strsplit("CLC2,CLC2,CLC2", ",")[[1]]
# [1] TRUE
The 2nd one splits your string at every , character.
Edit
If you just want to look at the first value, then it should be
"CLC2" %in% strsplit("CLC2,CLC2,CLC2", ",")[[1]][1]
"CLC2" %in% c("CLC2", "CLC2", "CLC2")[1]
as pointed out by @PierreLafortune. In that case, you don't need %in% and could use == instead, as you are just comparing one value to another.
You can also try
grepl('\\<CLC2\\>', unlist(strsplit("CLC2,CLC2,CLC2", ","))[1])
#[1] TRUE
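Two further ways to test only the first comma-separated field, sketched here under the assumption that the comma is always the separator. The variable names are illustrative.

```r
s <- "CLC2,CLC2,CLC2"

# Drop everything from the first comma onward, then compare exactly
first_field <- sub(",.*$", "", s)
first_field == "CLC2"
# [1] TRUE

# Or anchor the pattern: "CLC2" at the start, followed by a comma or the end
grepl("^CLC2(,|$)", s)
# [1] TRUE
grepl("^CLC2(,|$)", "CLC20,CLC2")
# [1] FALSE
```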
