I'm working on a project where I define some nouns like Haus, Boot, Kampf, ... and want to detect every form (singular/plural) and every compound of these words in sentences. For example, the algorithm should return TRUE if a sentence contains one of: Häuser, Hausboot, Häuserkampf, Kampfboot, Hausbau, Bootsanleger, ....
Are you familiar with an algorithm that can do such a thing (preferably in R)? Of course I could implement this manually, but I'm pretty sure that something should already exist.
Thanks!
You can use the stringr library and base R's grepl function, as in this example:
> # Toy example text
> text1 <- c(" This is an example where Hausbau appears twice (Hausbau)")
> text2 <- c(" Here it does not appear the name")
> # Load library
> library(stringr)
> # Does it appear "Hausbau"?
> grepl("Hausbau", text1)
[1] TRUE
> grepl("Hausbau", text2)
[1] FALSE
> # Number of "Hausbau" in the text
> str_count(text1, "Hausbau")
[1] 2
library(stringr)
check <- c("Der Häuser", "Das Hausboot ist", "Häuserkampf", "Kampfboot im Wasser", "NotMe", "Hausbau", "Bootsanleger", "Schauspiel")
base <- c("Haus", "Boot", "Kampf")
# Fold accents to ASCII and lowercase, then test each string against every base noun
folded <- str_to_lower(stringi::stri_trans_general(check, "Latin-ASCII"))
sapply(folded, function(x) any(str_detect(x, str_to_lower(base))), USE.NAMES = FALSE)
# [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
Breaking it down
Note Roland's comment: substring matching gives false positives for words like "Schauspiel", which contains "haus".
You need to get rid of the special characters; you can use stri_trans_general to transliterate them to Latin-ASCII.
You need to convert your strings to lowercase (e.g. to match Boot in Kampfboot).
Then apply over the strings to test and check each one against the base list. If any of those values is TRUE, you have a match.
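The steps above can also be sketched in base R only. This is a minimal sketch, not the answer's own code: fold and contains_base are hypothetical helper names, and it assumes that only the German umlauts need folding (ß and other accented letters are not handled).

```r
# Hypothetical helper: fold German umlauts to plain vowels and lowercase.
# The \u escapes keep the script independent of the file encoding.
fold <- function(x) {
  tolower(chartr("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc", "aouAOU", x))
}

# TRUE for each sentence that contains any base noun as a substring.
contains_base <- function(sentences, base) {
  pattern <- paste(fold(base), collapse = "|")
  grepl(pattern, fold(sentences))
}

check <- c("Der H\u00e4user", "Das Hausboot ist", "NotMe", "Schauspiel")
contains_base(check, c("Haus", "Boot", "Kampf"))
# [1]  TRUE  TRUE FALSE  TRUE
```

"Schauspiel" still matches because it contains "haus"; plain substring matching cannot tell a real compound from an accidental overlap.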
Related
Let's say I have a string "Hello." I want to see if this string contains a period:
text <- "Hello."
results <- grepl(".", text)
This returns results as TRUE, but it would return that as well if text is "Hello" without the period.
I'm confused, I can't find anything about this in the documentation and it only does this for the period.
Any ideas?
See the differences with these examples
> grepl("\\.", "Hello.")
[1] TRUE
> grepl("\\.", "Hello")
[1] FALSE
The . means any character, as pointed out by SimonO101. If you want to look for a literal ., you have to escape it by using \\., which means look for a .
The R documentation on regular expressions is extensive; you can also take a look at this link to understand the use of the dot.
I use Jilber's approach usually but here are two other ways:
> grepl("[.]", "Hello.")
[1] TRUE
> grepl("[.]", "Hello")
[1] FALSE
> grepl(".", "Hello.", fixed = TRUE)
[1] TRUE
> grepl(".", "Hello", fixed = TRUE)
[1] FALSE
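To compare the approaches side by side, here is a small sketch on a few made-up inputs (the test strings are illustrative, not from the question). The unescaped dot matches any non-empty string, while the other three only match a literal period.

```r
x <- c("Hello.", "Hello", "a.b", "")

# One column per way of writing the pattern
res <- data.frame(
  x         = x,
  unescaped = grepl(".", x),               # any character
  escaped   = grepl("\\.", x),             # literal dot via backslash
  class     = grepl("[.]", x),             # literal dot via character class
  fixed     = grepl(".", x, fixed = TRUE)  # no regex interpretation at all
)
res
```

fixed = TRUE is probably the simplest choice when the whole pattern is a literal string, since nothing in it needs escaping.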
I am working on data where the words are in French.
I want grepl to take into account the word, whether the vowel has an accent or not.
Here is part of the code:
Here I want grepl to match every word that is either radiothérapie or radiotherapie, that is, to ignore the accent.
ifelse(grepl("Radiothe(é)rapie",mydata$word),"yes","no")
ifelse(grepl("Radioth[eé]rapie", c("Radiotherapie", "Radiothérapie", "Radio")),"yes","no")
Well the brute force way of doing this would be to use a character class containing all variations of the letter, e.g.
ifelse(grepl("Radioth[eé]rapie", mydata$word), "yes", "no")
A possible solution: convert to Latin-ASCII before using grepl.
x <- c("radiothérapie", "radiotherapie")
grepl("radiotherapie", stringi::stri_trans_general(x,"Latin-ASCII"))
[1] TRUE TRUE
This should work for most (all?) accents.
You can take a more universal approach
string = "ábçdêfgàõp"
iconv(string, to='ASCII//TRANSLIT')
# [1] "abcdefgaop"
For your scenario
x <- "Radiotherapie"
y <- c("Radiotherapie", "Radiothérapie", "Radio")
grepl(iconv(x, to='ASCII//TRANSLIT'), iconv(y, to='ASCII//TRANSLIT'))
# [1] TRUE TRUE FALSE
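Here is a variant using grepl with a character class:
Note that a caret only negates a class when it is the first character after the opening bracket. In [é^e] it is a literal ^, so the class matches é, e or ^, and the pattern works because both é and e are listed: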
x <- c("radiothérapie", "radiotherapie")
grepl('radioth[é^e]rapie', x)
[1] TRUE TRUE
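A quick check of how the caret behaves inside a character class, using ASCII letters so the result does not depend on encoding. The inputs are illustrative only.

```r
# Caret in the middle of the class: a literal "^", no negation
grepl("[a^e]", c("a", "e", "^", "x"))
# [1]  TRUE  TRUE  TRUE FALSE

# Caret as the first character: negation, matches anything except a or e
grepl("[^ae]", c("a", "e", "^", "x"))
# [1] FALSE FALSE  TRUE  TRUE
```

So a class such as [eé] is the more direct way to accept either letter.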
I have a question very similar to this topic: Find matches of a vector of strings in another vector of strings.
I have a vector of clients, and if the name indicates that it is a commercial client, I need to change the type in my data.frame.
So, suppose that:
commercial_names <- c("BAKERY","MARKET", "SCHOOL", "CINEMA")
clients <- c("JOHN XX","REESE YY","BAKERY ZZ","SAMANTHA WW")
I tried the code in the topic cited before, but I had an error:
> grepl(paste(commercial_names, collape="|"),clients)
[1] TRUE TRUE TRUE TRUE
Warning message:
In grepl(paste(commercial_names, collape = "|"), clients) :
argument 'pattern' has length > 1 and only the first element will be used
What am I doing wrong? I would appreciate any help.
Your code is correct but for a typo:
grepl(paste0(commercial_names, collapse = "|"), clients) # typo: collape
[1] FALSE FALSE TRUE FALSE
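Given the typo, the commercial_names are not collapsed: paste treats collape = "|" as one more string to paste, so the result is still a vector of length 4 ("BAKERY |", "MARKET |", ...). grepl then uses only the first element, and "BAKERY |" matches everything because the alternative after | is empty.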
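To see what the misspelled argument does, it helps to print the intermediate pattern. A small sketch using the vectors from the question; bad and good are just illustrative variable names.

```r
commercial_names <- c("BAKERY", "MARKET", "SCHOOL", "CINEMA")
clients <- c("JOHN XX", "REESE YY", "BAKERY ZZ", "SAMANTHA WW")

# Misspelled argument: "|" is pasted onto every element with sep = " "
bad <- paste(commercial_names, collape = "|")
bad
# [1] "BAKERY |" "MARKET |" "SCHOOL |" "CINEMA |"

# Correct argument: one regex with four alternatives
good <- paste(commercial_names, collapse = "|")
good
# [1] "BAKERY|MARKET|SCHOOL|CINEMA"

grepl(good, clients)
# [1] FALSE FALSE  TRUE FALSE
```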
Not sure how to do this with a one-liner but a loop seems to do the trick.
library(stringr)
sapply(clients, function(client) {
  any(str_detect(client, commercial_names))
})
#     JOHN XX    REESE YY   BAKERY ZZ SAMANTHA WW
#       FALSE       FALSE        TRUE       FALSE
I found another way to do this, with the %like% operator from the data.table package:
> clients %like% paste(commercial_names,collapse = "|")
[1] FALSE FALSE TRUE FALSE
You can do something like this too:
clients.first <- gsub(" ..", "", clients)
clients.first %in% commercial_names
This returns:
[1] FALSE FALSE TRUE FALSE
You might need to change the regular expression for gsub if your clients data is more heterogeneous though.
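Since the original goal was to change the type in a data.frame, here is a minimal sketch of that last step. The column names client and type and the labels "other" and "commercial" are assumptions made for illustration; the word boundaries are an optional extra so that a name like "SUPERMARKETING" would not count as MARKET.

```r
commercial_names <- c("BAKERY", "MARKET", "SCHOOL", "CINEMA")
clients <- c("JOHN XX", "REESE YY", "BAKERY ZZ", "SAMANTHA WW")

# Hypothetical data.frame layout: one row per client, default type "other"
df <- data.frame(client = clients, type = "other", stringsAsFactors = FALSE)

# One alternation, wrapped in word boundaries
pattern <- paste0("\\b(", paste(commercial_names, collapse = "|"), ")\\b")

# Overwrite the type only for rows whose name matches
df$type[grepl(pattern, df$client)] <- "commercial"
df
```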
I have a text like this:
text = 'I love apple, pear, grape and peach'
If I want to know whether the text contains either apple or pear, I can do the following and it works fine:
str_detect(text,"apple|pear")
[1] TRUE
My question is: what if I want to use a boolean expression like (apple OR pear) AND (grape)?
Is there any way that I can put it in str_detect()? Is that possible?
The following is NOT working:
str_detect(text,"(apple|pear) & (grape)" )
[1] FALSE
The reason I want to know this is I want to program to convert a 'boolean query' and feed into the grep or str_detect. something like:
str_detect(text, '(word1|word2) AND (word2|word3|word4) AND (word5|word6) AND .....')
The number of AND varies....
No solution with multiple str_detect please.
You can pass all the patterns to str_detect as a vector and check that they're all TRUE with all.
patterns <- c('apple|pear', 'grape')
all(str_detect(text, patterns))
Or with base R
all(sapply(patterns, grepl, x = text))
Or, you could put the patterns in a list and use purrr's map, which would give more detailed output for the ORs (or anything else you may want to put as a list element)
library(purrr)
patterns <- list(c('apple', 'pear'), 'peach')
patterns %>%
  map(str_detect, string = text)
# [[1]]
# [1] TRUE TRUE
#
# [[2]]
# [1] TRUE
It's also possible to write it as a single regular expression, but I see no reason to do this
patterns <- c('apple|pear', 'grape')
# Wrap each alternation in (?:...) so that | stays inside the group after .*
patt_combined <- paste(paste0('(?=.*(?:', patterns, '))'), collapse = '')
str_detect(text, patt_combined)
patt_combined is
# [1] "(?=.*(?:apple|pear))(?=.*(?:grape))"
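For the stated goal of converting a boolean query into one pattern, the lookahead idea can be wrapped in a small function. This is a sketch in base R with perl = TRUE; and_pattern is a hypothetical helper name, and it assumes each AND group is already written as an alternation such as "apple|pear".

```r
# Hypothetical helper: one lookahead per AND group, anchored at the start.
# Each group is wrapped in (?:...) so its | cannot escape the lookahead.
and_pattern <- function(groups) {
  paste0("^", paste0("(?=.*(?:", groups, "))", collapse = ""))
}

text <- "I love apple, pear, grape and peach"

p <- and_pattern(c("apple|pear", "grape"))
p
# [1] "^(?=.*(?:apple|pear))(?=.*(?:grape))"

grepl(p, text, perl = TRUE)          # (apple OR pear) AND grape
# [1] TRUE
grepl(p, "grape pear", perl = TRUE)  # word order does not matter
# [1] TRUE
grepl(and_pattern(c("apple|pear", "kiwi")), text, perl = TRUE)
# [1] FALSE
```

The number of AND groups can vary freely, since each group just adds one more lookahead.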
I am dealing with two strings like this below
x1 <- "Unknown, because not discussed"
x2 <- "Not at goal, no."
How do I use the grepl function to distinguish between these two strings?
When I use grepl("no", x1), it returns TRUE, which is not what I want. This is picking up the no in not or Unknown. How do I use a string parsing function to detect strings that contain the word no explicitly? Any advice is much appreciated.
You can use word boundary \\b to distinguish them. \\bno\\b will match no only without preceding and following word characters:
grepl("\\bno\\b", x1)
# [1] FALSE
grepl("\\bno\\b", x2)
# [1] TRUE
I can think of a couple of options for matching "no" but not "not":
Using the \b "word boundary" pattern:
> x = c("Unknown, because not discussed", "Not at goal, no.")
> grepl("\\bno\\b", x)
[1] FALSE TRUE
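Using [^t] to exclude "not" (note that this needs some character after "no", so it misses a "no" at the very end of the string, and it still matches words like "nobody"):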
> grepl("\\bno[^t]", x)
[1] FALSE TRUE
For matching the word "no" by itself the word boundary option "\\bno\\b" is probably best.
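A small sketch that shows why the word boundary version is the safer of the two. has_word is a hypothetical helper name, ignore.case = TRUE is an added assumption so that "No" also counts, and the last two test strings are illustrative additions.

```r
# Hypothetical helper: TRUE where `word` occurs as a whole word
has_word <- function(x, word) {
  grepl(paste0("\\b", word, "\\b"), x, ignore.case = TRUE)
}

x <- c("Unknown, because not discussed", "Not at goal, no.", "No", "nobody")

has_word(x, "no")
# [1] FALSE  TRUE  TRUE FALSE

# The [^t] variant needs a following character and accepts "nobody"
grepl("\\bno[^t]", c("no", "nobody"))
# [1] FALSE  TRUE
```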