grepl("instance|percentage", labelTest$Text)
returns TRUE if either instance or percentage is present.
How can I get TRUE only when both terms are present?
Text <- c("instance", "percentage", "n",
"instance percentage", "percentage instance")
grepl("instance|percentage", Text)
# TRUE TRUE FALSE TRUE TRUE
grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE TRUE
The latter one works by looking for:
('instance')(any character sequence)('percentage')
OR
('percentage')(any character sequence)('instance')
Naturally, if you need to match any combination of more than two words, this gets complicated quickly, because every possible ordering has to be spelled out. In that case the solution mentioned in the comments (one grepl call per word, combined with &) is easier to implement and read.
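To see how quickly the alternation grows, here is a minimal sketch for three words. The vector Text3 and its contents are made up for illustration. All 3! = 6 orderings have to be listed by hand, whereas the one-grepl-per-word form only grows by one call per word.

```r
# Hypothetical example data (not from the question)
Text3 <- c("instance percentage element",
           "element instance percentage",
           "instance percentage",
           "percentage element x instance")

# every ordering of the three words needs its own alternative
perm_pat <- paste("instance.*percentage.*element",
                  "instance.*element.*percentage",
                  "percentage.*instance.*element",
                  "percentage.*element.*instance",
                  "element.*instance.*percentage",
                  "element.*percentage.*instance",
                  sep = "|")
by_perm <- grepl(perm_pat, Text3)

# one grepl() per word, combined with &
by_and <- grepl("instance", Text3) &
  grepl("percentage", Text3) &
  grepl("element", Text3)

by_perm
# TRUE TRUE FALSE TRUE
identical(by_perm, by_and)
# TRUE
```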
Another alternative that might be relevant when matching many words is to use positive look-ahead (which can be thought of as a 'non-consuming' match). For this you have to enable Perl-compatible regular expressions with perl=TRUE.
# create a vector of word combinations
set.seed(1)
words <- c("instance", "percentage", "element",
"character", "n", "o", "p")
Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))
# grepl with multiple positive look-ahead
longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)",
Text2, perl=TRUE)
# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) &
grepl("percentage", Text2) &
grepl("element", Text2) &
grepl("character", Text2)
# they produce identical results
identical(longperl, longstrd)
Furthermore, if you have the patterns stored in a vector you can condense the expressions significantly, giving you
pat <- c("instance", "percentage", "element", "character")
longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L
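It may help to look at the string that paste0 builds. The sketch below prints it and checks both condensed forms on a small made-up input (Text_demo is an illustrative name, not part of the original answer).

```r
pat <- c("instance", "percentage", "element", "character")

# paste0() recycles the prefix and suffix over pat; collapse = "" glues the pieces together
lookahead <- paste0("(?=.*", pat, ")", collapse = "")
lookahead
# "(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)"

# made-up input: the first string contains all four words, the second only two
Text_demo <- c("character element percentage instance", "instance element")

via_perl <- grepl(lookahead, Text_demo, perl = TRUE)
# sapply() returns a logical matrix (one column per pattern);
# subtracting 1 turns TRUE into 0 and FALSE into -1, so a row sum of 0 means "all matched"
via_sapply <- rowSums(sapply(pat, grepl, Text_demo) - 1L) == 0L

via_perl
# TRUE FALSE
```

Note that the rowSums form assumes more than one input string: with a single string, sapply simplifies to a plain vector and rowSums fails, so the look-ahead form is the safer of the two in that case.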
As asked for in the comments, if you want to match on exact words, i.e. not match on substrings, we can specify word boundaries using \\b. E.g:
tx <- c("cent element", "percentage element", "element cent", "element centimetre")
grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
# TRUE FALSE TRUE FALSE
grepl("element", tx) & grepl("\\bcent\\b", tx)
# TRUE FALSE TRUE FALSE
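The word-boundary variant can be combined with the pattern-vector trick. This is a sketch that assumes every word in pat should be matched as a whole word; whole_word_pat is an illustrative name.

```r
tx <- c("cent element", "percentage element", "element cent", "element centimetre")
pat <- c("cent", "element")

# wrap every word in \\b ... \\b inside its own look-ahead
whole_word_pat <- paste0("(?=.*\\b", pat, "\\b)", collapse = "")
whole_word_pat
# "(?=.*\\bcent\\b)(?=.*\\belement\\b)"

res_whole <- grepl(whole_word_pat, tx, perl = TRUE)
res_whole
# TRUE FALSE TRUE FALSE
```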
This returns TRUE only if both terms occur in an element of the vector labelTest$Text.
I think this is the exact answer to the question and much shorter than the other solutions.
grepl("instance",labelTest$Text) & grepl("percentage",labelTest$Text)
Use intersect and feed it a grep for each word:
# base R is enough here: intersect() keeps the indices returned by both grep() calls
vector_of_text[
  intersect(
    grep("pattern1", vector_of_text),
    grep("pattern2", vector_of_text)
  )
]
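A small runnable version of the intersect idea, using made-up data, shows that it selects the same elements as combining two grepl calls with &:

```r
# Hypothetical example data
vector_of_text <- c("pattern1 only",
                    "pattern1 and pattern2",
                    "pattern2 only",
                    "neither")

# grep() returns integer positions; intersect() keeps positions found by both calls
idx <- intersect(grep("pattern1", vector_of_text),
                 grep("pattern2", vector_of_text))
idx
# 2

vector_of_text[idx]
# "pattern1 and pattern2"

# same selection via logical vectors
idx_logical <- which(grepl("pattern1", vector_of_text) & grepl("pattern2", vector_of_text))
```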
I am new to programming, so I appreciate any help. I have several text documents, and I want to know whether certain keywords occur in them, but in combination with other words and, in some cases, ignoring some words.
For example, given the following data, I would like to find all documents where “Rains” and “Germany” occur together.
list_documents <- c("it rains in Germany", "it rains a lot in the field" , "the sun is shining in Germany")
The output would be something like
[1] TRUE
[2] FALSE
[3] FALSE
Does anyone know which package I should use for that? I tried “str_extract”, but the logical operators do not work on text.
Thanks in advance!
Try the grepl function:
grepl('rains', list_documents, ignore.case = TRUE) & grepl('germany', list_documents, ignore.case = TRUE)
# [1] TRUE FALSE FALSE
OR
grepl('rains.*germany|germany.*rains', list_documents, ignore.case = TRUE)
Here is a Reduce solution.
patterns <- c("Rains", "Germany")
Reduce('&', lapply(patterns, function(x) grepl(x, list_documents, ignore.case = TRUE)))
#[1] TRUE FALSE FALSE
It works by first creating a list of logical vectors, each holding the result of grepl applied to one of the patterns. Then Reduce ANDs the list's members in pairs.
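The two steps can be inspected separately. The sketch below reuses the data from the question; hits is an illustrative name for the intermediate list.

```r
list_documents <- c("it rains in Germany",
                    "it rains a lot in the field",
                    "the sun is shining in Germany")
patterns <- c("Rains", "Germany")

# step 1: one logical vector per pattern
hits <- lapply(patterns, function(p) grepl(p, list_documents, ignore.case = TRUE))
hits[[1]]
# TRUE TRUE FALSE   (documents mentioning "rains")
hits[[2]]
# TRUE FALSE TRUE   (documents mentioning "Germany")

# step 2: element-wise AND across the list
both <- Reduce(`&`, hits)
both
# TRUE FALSE FALSE
```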
Note that the above solution still works if there are more than 2 patterns to search for and AND. The one-liner can be made a function, this time separating the list creation and the Reduce.
and_grepl <- function(pattern, x, ...){
results <- lapply(pattern, function(p) grepl(p, x, ...))
Reduce('&', results)
}
and_grepl(patterns, list_documents, ignore.case = TRUE)
#[1] TRUE FALSE FALSE
pat <- c("A", "B", "C")
new_list <- c("ABCD", "ABCE", "ABDE", "DEFG")
and_grepl(pat, new_list)
#[1] TRUE TRUE FALSE FALSE
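Because and_grepl forwards ... to grepl, other grepl arguments can be passed through as well. As a sketch with made-up data, fixed = TRUE treats each pattern as a literal string, which matters when the search terms contain regex metacharacters such as . or parentheses:

```r
and_grepl <- function(pattern, x, ...) {
  results <- lapply(pattern, function(p) grepl(p, x, ...))
  Reduce(`&`, results)
}

# Hypothetical example data
docs <- c("v1.0 (beta)", "v1x0 beta")
terms <- c("1.0", "(beta)")

# as regexes: "." matches any character and "(beta)" is just a group around "beta"
as_regex <- and_grepl(terms, docs)
as_regex
# TRUE TRUE

# as literal strings
as_fixed <- and_grepl(terms, docs, fixed = TRUE)
as_fixed
# TRUE FALSE
```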
I am trying to find the Spanish words among a number of words in R. I have all the Spanish words in an Excel file that I don't know how to attach to the post (it has more than 80000 words), and I am trying to check whether some words are in it or not.
For example:
words = c("Silla", "Sillas", "Perro", "asdfg")
I tried to use this solution:
grepl(paste(spanish_words, collapse = "|"), words)
But there are too many Spanish words, and it gives me an error.
So... how can I do it? I also tried this:
toupper(words) %in% toupper(spanish_words)
This option only gives TRUE for exact matches, and I need "Sillas" to appear as TRUE as well (it is the plural of silla). That is why I tried grepl first, to catch plurals too.
Any idea?
As a data frame:
library(dplyr)    # tibble(), mutate(), %>%
library(stringr)  # str_detect()
df <- tibble(text = c("some words",
"more words",
"Perro",
"And asdfg",
"Comb perro and asdfg"))
Vector of words:
words <- c("Silla", "Sillas", "Perro", "asdfg")
words <- tolower(paste(words, collapse = "|"))
Then use mutate and str_detect:
df %>%
mutate(
text = tolower(text),
spanish_word = str_detect(text, words)
)
Returns:
  text                 spanish_word
  <chr>                <lgl>
1 some words           FALSE
2 more words           FALSE
3 perro                TRUE
4 and asdfg            TRUE
5 comb perro and asdfg TRUE
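For a word list as large as the one in the question, one possible way to avoid a single huge alternation is to stay with %in% and handle plurals separately. The sketch below is only an illustration: spanish_words is a two-word stand-in for the real list, and stripping a trailing "s" is a deliberately naive assumption about plurals, not real Spanish morphology.

```r
spanish_words <- c("silla", "perro")   # hypothetical stand-in for the full word list
words <- c("Silla", "Sillas", "Perro", "asdfg")

w <- tolower(words)
dict <- tolower(spanish_words)

# naive singular form: drop one trailing "s" (assumption for illustration only)
singular <- sub("s$", "", w)

# a word counts as Spanish if it, or its naive singular, is in the list
is_spanish <- w %in% dict | singular %in% dict
is_spanish
# TRUE TRUE TRUE FALSE
```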
I would like to count how many words in a given string start with a given letter.
For example, in the phrase "that pattern is great but pigs likes milk",
if I want the number of words starting with "g", there is only 1 ("great"), but right now I get 2 ("great" and "pigs").
This is the code I use:
x <- "that pattern is great but pogintless"
sapply(regmatches(x, gregexpr("g", x)), length)
We need a word boundary (or a preceding space) so that the letter does not match characters other than the start of a word. In addition, it may be better to use ignore.case = TRUE, as some words may begin with an uppercase letter.
lengths(regmatches(x, gregexpr("\\bg", x, ignore.case = TRUE)))
The above can be wrapped as a function
fLength <- function(str1, pat){
lengths(regmatches(str1, gregexpr(paste0("\\b", pat), str1, ignore.case = TRUE)))
}
fLength(x, "g")
#[1] 1
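Since gregexpr and regmatches are vectorised, the same function also works on several strings at once. A short sketch with made-up phrases, contrasting the anchored and unanchored counts:

```r
fLength <- function(str1, pat) {
  lengths(regmatches(str1, gregexpr(paste0("\\b", pat), str1, ignore.case = TRUE)))
}

# Hypothetical example phrases
phrases <- c("that pattern is great but pigs likes milk",
             "Good grief, a big dog",
             "nothing here")

# words starting with "g" (case-insensitive)
word_starts <- fLength(phrases, "g")
word_starts
# 1 2 0

# every lowercase "g", wherever it occurs
any_g <- lengths(regmatches(phrases, gregexpr("g", phrases)))
any_g
# 2 3 1
```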
You can also do it with the stringr library:
library(stringr)
# str_count() works on the string directly, so there is no need to split it first
str_count(x, "\\bg")
I am dealing with two strings like this below
x1 <- "Unknown, because not discussed"
x2 <- "Not at goal, no."
How do I use the grepl function to distinguish between these two strings?
When I use grepl("no", x1), it returns TRUE, which is not what I want: it picks up the no in not or Unknown. How do I detect only strings that contain the word no on its own? Any advice is much appreciated.
You can use the word boundary \\b to distinguish them. \\bno\\b matches no only when it is not directly preceded or followed by a word character:
grepl("\\bno\\b", x1)
# [1] FALSE
grepl("\\bno\\b", x2)
# [1] TRUE
I can think of a couple of options for matching "no" but not "not":
Using the \b "word boundary" pattern:
> x = c("Unknown, because not discussed", "Not at goal, no.")
> grepl("\\bno\\b", x)
[1] FALSE TRUE
Using [^t] to exclude "not":
> grepl("\\bno[^t]", x)
[1] FALSE TRUE
For matching the word "no" by itself the word boundary option "\\bno\\b" is probably best.
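A few made-up strings illustrate why the word-boundary form is the safer choice: [^t] needs some character after "no", so it misses "no" at the end of a string and accepts words like "none".

```r
# Hypothetical example strings
x <- c("the answer is no", "none of them", "no")

with_boundary <- grepl("\\bno\\b", x)
with_boundary
# TRUE FALSE TRUE

# [^t] must consume one character after "no"
with_negated_class <- grepl("\\bno[^t]", x)
with_negated_class
# FALSE TRUE FALSE
```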
Is there a way to control the case sensitivity of the %in% operator? In my case I want it to return true no matter the case of the input:
stringList <- c("hello", "world")
"Hello" %in% stringList
"helLo" %in% stringList
"hello" %in% stringList
Consider this code as a reproducible example, however in my real application I am also using a list of strings on the left and check for the presence of words from stringList.
Use grepl instead as it has an ignore.case parameter:
grepl("^HeLLo$",stringList,ignore.case=TRUE)
[1] TRUE FALSE
The first argument is a regular expression, so it gives you more flexibility, but you have to start with ^ and end with $ to avoid picking up sub-strings.
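To get a single TRUE/FALSE like %in% does, the grepl result can be wrapped in any(). The sketch below also handles a vector on the left-hand side; it assumes the strings contain no regex metacharacters, and in_list is an illustrative name.

```r
stringList <- c("hello", "world")

# one value on the left: is there any case-insensitive exact match?
any(grepl("^HeLLo$", stringList, ignore.case = TRUE))
# TRUE

# several values on the left: build one anchored pattern per value
x <- c("Hello", "helLo", "below")
in_list <- vapply(
  x,
  function(s) any(grepl(paste0("^", s, "$"), stringList, ignore.case = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)
in_list
# TRUE TRUE FALSE
```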
In addition to @James's answer, you can also use tolower if you want to avoid regexes:
tolower("HeLLo") %in% stringList
If the left side is also a character vector, apply tolower to both sides, e.g.:
x <- c("Hello", "helLo", "hello", "below")
stringList <- c("heLlo", "world")
tolower(x) %in% tolower(stringList)
# [1] TRUE TRUE TRUE FALSE
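If this comparison is needed in many places, it could be wrapped in a small infix helper. The operator name %in_ci% is made up for this sketch and is not part of base R.

```r
# hypothetical case-insensitive variant of %in%
`%in_ci%` <- function(x, table) tolower(x) %in% tolower(table)

res_ci <- c("Hello", "helLo", "hello", "below") %in_ci% c("heLlo", "world")
res_ci
# TRUE TRUE TRUE FALSE
```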