Returning Specific String found in text [duplicate] - r

I have the following column in a df
c("I love bananas and apples.",
"I hate apples and pears.",
"I love to eat food.",
"I hate lettuce and bananas")
and I have a vector of fruit
fruit <- c("apples", "bananas", "pears")
I know using str_detect can return TRUE or FALSE per observation using
str_detect(df$text, paste(fruit, collapse='|'))
but what I would like is a column that lists the values that matched, like the following
"I love bananas and apples." "bananas","apples"
"I hate apples and pears." "apples","pears"
"I love to eat food."
"I hate lettuce and bananas." "bananas"
Is there a way to accomplish this? Is this outside the scope of str_detect?

v <- df$text  # the text column from the question
sapply(v, function(s){
toString(unlist(lapply(fruit, function(f){
if(grepl(f, s)) f
})))
},
USE.NAMES = FALSE)
#[1] "apples, bananas" "apples, pears" "" "bananas"

We can use str_extract_all to extract all the 'fruit' elements from the 'text' column into a list, loop through the list with map_chr, and paste (toString) the unique matches together to create the 'newtext' column:
library(stringr)
library(dplyr)
library(purrr)
df %>%
mutate(newtext = map_chr(str_extract_all(text,
str_c(fruit, collapse='|')), ~toString(unique(.x))))
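For comparison, here is a minimal base R sketch of the same idea without stringr, dplyr or purrr. It rebuilds the df and fruit objects from the question so that it runs on its own. Note that the matches come back in the order they appear in each text, not in the order of fruit.

```r
# Sample data from the question
df <- data.frame(
  text = c("I love bananas and apples.",
           "I hate apples and pears.",
           "I love to eat food.",
           "I hate lettuce and bananas"),
  stringsAsFactors = FALSE
)
fruit <- c("apples", "bananas", "pears")

# One alternation pattern: "apples|bananas|pears"
pattern <- paste(fruit, collapse = "|")

# gregexpr finds every match per string; regmatches extracts them as a list
matches <- regmatches(df$text, gregexpr(pattern, df$text))

# Collapse the unique matches of each row into one comma-separated string
df$newtext <- vapply(matches, function(m) toString(unique(m)), character(1))
df$newtext
# [1] "bananas, apples" "apples, pears"   ""                "bananas"
```

Rows without any match get an empty string, because toString() of an empty character vector is "".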

Related

Is there a way to show the matching element of a specific case using the grepl function in R?

I checked whether the brands of the data frame "df1"
brands
1 Nike
2 Adidas
3 D&G
are to be found in the elements of the following column of the data frame "df2"
statements
1 I love Nike
2 I don't like Adidas
3 I hate Puma
For this I use the code:
subset_df2 <- df2[grepl(paste(df1$brands, collapse="|"), ignore.case=TRUE, df2$statements), ]
The code works and I get a subset of df2 containing only the lines with the desired brands:
statements
1 I love Nike
2 I don't like Adidas
Is there also a way to display which part of each cell in df2$statements actually matched an entry of df1$brands? For instance, a vector like [Nike, Adidas]. So I only want the Nike and Adidas elements as my output, not the whole statement.
Many thanks in advance!
brands <- c("nike", "adidas", "d&g") # lower-case here
text <- c("I love Nike", "I love Adidas")
ptns <- paste(brands, collapse = "|")
ptns
# [1] "nike|adidas|d&g"
text2 <- text[NA]
hit <- grepl(ptns, text, ignore.case = TRUE)
text2[hit] <- gsub(paste0(".*(", ptns, ").*"), "\\1", text[hit], ignore.case = TRUE)
text2
# [1] "Nike" "Adidas"
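text2 is pre-filled with NA (text[NA] is a character vector of NAs with the same length as text) because gsub returns its input unchanged when the pattern is not found, so a statement without any brand would otherwise come back whole. rep(NA_character_, length(text)) would have the same effect.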
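To make the role of the NA pre-fill concrete, here is a small sketch with one statement that contains no brand. The variable names hit and first_brand are illustrative, not from the original answer; the point is that both sides of the assignment are subset with the same logical index, so unmatched rows stay NA.

```r
brands <- c("nike", "adidas", "d&g")
text <- c("I love Nike", "I hate Puma", "I don't like Adidas")
ptns <- paste(brands, collapse = "|")

# Which statements mention at least one brand?
hit <- grepl(ptns, text, ignore.case = TRUE)

# Start from all-NA, then fill in only the statements that matched
first_brand <- rep(NA_character_, length(text))
first_brand[hit] <- sub(paste0(".*(", ptns, ").*"), "\\1", text[hit],
                        ignore.case = TRUE)
first_brand
# [1] "Nike"   NA       "Adidas"
```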
If you need multiple matches per text, then perhaps
brands <- c("Nike", "Adidas", "d&g")
text <- c("I love nike", "I love Adidas and Nike")
ptns <- paste(brands, collapse = "|")
gre <- gregexpr(ptns, text, ignore.case = TRUE)
sapply(regmatches(text, gre), paste, collapse = ";")
# [1] "nike" "Adidas;Nike"

Extracting words between word/space patterns

I have some data where I have names "sandwiched" between two spaces and the phrase "is a (number from 1-99) y.o". For example:
a <- "SomeOtherText John Smith is a 60 y.o. MoreText"
b <- "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo"
c <- "JustJunkText Billy Smtih III is 5 y/o MoreTextThree"
I'd like to extract the names "John Smith", "Will Smth Jr." and "Billy Smtih III" (the misspellings are there on purpose). I tried using str_extract or gsub, based on answers to similar questions I found on SO, but with no luck.
You can chain multiple calls to stringr::str_remove.
First regex: remove the leading pattern that starts at the beginning of the string (^) with zero or more letters ([:alpha:]*) followed by one or more whitespace characters (\\s+).
Second regex: remove the trailing pattern that consists of a whitespace character (\\s) followed by the sequence is and then any number of non-newline characters (.*) up to the end of the string ($).
str_remove(a, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
# [1] "John Smith"
str_remove(b, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
# [1] "Will Smth Jr."
str_remove(c, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
# [1] "Billy Smtih III"
You can also do it in a single call by using stringr::str_remove_all and joining the two patterns separated by an OR (|) symbol:
str_remove_all(a, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(b, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(c, '^[:alpha:]*\\s+|\\sis.*$')
You can also use sub in base R:
extract_name <- function(x) sub('.*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*', '\\1', x)
extract_name(c(a, b, c))
#[1] "John Smith" "Will Smth Jr." "Billy Smtih III"
\\s{2,} matches two or more whitespace characters.
(.*) is a capture group that captures everything up to
a whitespace character followed by is, then a number and y.o or y/o.
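The \\s{2,} anchor only matches when the name really is preceded by two or more whitespace characters. As a hedged variant for strings that contain only single spaces, the sketch below assumes that the leading junk is a single token without whitespace and that the name is followed by the word "is". The function name extract_name_single is made up for this illustration.

```r
a <- "SomeOtherText John Smith is a 60 y.o. MoreText"
b <- "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo"
c <- "JustJunkText Billy Smtih III is 5 y/o MoreTextThree"

# ^\\S+\\s+   : skip the first whitespace-free token and the space after it
# (.*?)       : lazily capture the name ...
# \\s+is\\s.* : ... up to the first standalone " is " and drop the rest
extract_name_single <- function(x) {
  sub("^\\S+\\s+(.*?)\\s+is\\s.*$", "\\1", x, perl = TRUE)
}

extract_name_single(c(a, b, c))
# [1] "John Smith"      "Will Smth Jr."   "Billy Smtih III"
```

perl = TRUE is used here so that the lazy quantifier (.*?) is handled by PCRE.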

grepl for finding words

I am trying in R to find the Spanish words in a set of words. I have all the Spanish words in an Excel file that I don't know how to attach to the post (it has more than 80000 words), and I am trying to check whether some words are in it or not.
For example:
words = c("Silla", "Sillas", "Perro", "asdfg")
I tried to use this solution:
grepl(paste(spanish_words, collapse = "|"), words)
But there are too many Spanish words, and it gives me an error.
So... how can I do it? I also tried this:
toupper(words) %in% toupper(spanish_words)
With this option, only exact matches give TRUE, and I need "Sillas" to come out as TRUE as well (it is the plural of silla). That was the reason I tried grepl first, to catch plurals as well.
Any idea?
As a df (this uses tibble, dplyr and stringr, e.g. via library(tidyverse)):
df <- tibble(text = c("some words",
"more words",
"Perro",
"And asdfg",
"Comb perro and asdfg"))
Vector of words:
words <- c("Silla", "Sillas", "Perro", "asdfg")
words <- tolower(paste(words, collapse = "|"))
Then use mutate and str_detect:
df %>%
mutate(
text = tolower(text),
spanish_word = str_detect(text, words)
)
Returns:
#   text                 spanish_word
#   <chr>                <lgl>
# 1 some words           FALSE
# 2 more words           FALSE
# 3 perro                TRUE
# 4 and asdfg            TRUE
# 5 comb perro and asdfg TRUE
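With a dictionary of more than 80000 words, one giant alternation pattern may be too large for the regex engine, as the question reports. A possible workaround, sketched below under strong assumptions, is to keep the fast exact lookup with %in% and additionally test each word with a trailing "s" or "es" stripped. The three-word spanish_words vector is a tiny stand-in for the real Excel list, and the plural rule is a naive heuristic, not a real Spanish stemmer.

```r
# Tiny stand-in for the real dictionary (assumption: one word per element)
spanish_words <- c("silla", "perro", "flor")
words <- c("Silla", "Sillas", "Perro", "Flores", "asdfg")

dict <- tolower(spanish_words)
w <- tolower(words)

# Naive plural handling: also look up the word without a final "s" or "es"
without_s  <- sub("s$", "", w)
without_es <- sub("es$", "", w)

is_spanish <- w %in% dict | without_s %in% dict | without_es %in% dict
is_spanish
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
```

Because every lookup is an exact %in% test, this stays cheap however long the dictionary is.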

How to return the index of a vector that contains at least a string in another vector in R

I have a list containing verbs. I have another list containing sentences. How do I return the index of the sentence list that contains at least a verb in the verb list for that sentence?
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
I want it to return indexes 1, 3, and 4
Using no additional packages, we can sort of "or" different search terms together using | as follows:
Original question:
verbList <- list("punching, kicking, jumping, hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
v <- gsub(", ", "|", verbList)
grep(v, sentenceList)
New question:
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
v <- paste(verbList, collapse = "|")
grep(v, sentenceList)
A solution using stringr and rebus. or1 builds an alternation pattern from the elements of verbList, and str_which returns the indices of the sentences that match it.
library(stringr)
library(rebus)
# Check the index
result <- str_which(sentenceList, or1(verbList))
result
# [1] 1 3 4
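If it is also useful to know which verbs were found, a base R sketch along the following lines tests each verb separately. It uses fixed = TRUE so that every verb is matched as a literal string and not as a regular expression. The names hits, idx and found are illustrative.

```r
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping",
                  "I am kicking and jumping")

# Logical matrix: one row per sentence, one column per verb
hits <- sapply(verbList, grepl, x = sentenceList, fixed = TRUE)

# Indices of sentences that contain at least one verb
idx <- which(rowSums(hits) > 0)
idx
# [1] 1 3 4

# The verbs found in each sentence, collapsed to one string per sentence
found <- apply(hits, 1, function(r) toString(verbList[r]))
found
# [1] "punching"         ""                 "hopping"          "kicking, jumping"
```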

Gsub apostrophe in data frame R

I need to remove all apostrophes from my data frame but as soon as I use....
textDataL <- gsub("'","",textDataL)
The data frame gets ruined and the new data frame only contains values and NAs, when I am only looking to remove any apostrophes from any text that might be in there? Am I missing something obvious with apostrophes and data frames?
To keep the structure intact:
dat1 <- data.frame(Col1= c("a woman's hat", "the boss's wife", "Mrs. Chang's house", "Mr Cool"),
Col2= c("the class's hours", "Mr. Jones' golf clubs", "the canvas's size", "Texas' weather"),
stringsAsFactors=FALSE)
I would use
dat1[] <- lapply(dat1, gsub, pattern="'", replacement="")
or
library(stringr)
dat1[] <- lapply(dat1, str_replace_all, "'","")
dat1
# Col1 Col2
# 1 a womans hat the classs hours
# 2 the bosss wife Mr. Jones golf clubs
# 3 Mrs. Changs house the canvass size
# 4 Mr Cool Texas weather
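You don't want to apply gsub directly to a data frame, but column-wise instead. Note that apply returns a character matrix, so convert the result with as.data.frame if you need a data frame back, e.g.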
apply(textDataL, 2, gsub, pattern = "'", replacement = "")
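The data frame gets "ruined" because gsub coerces its input with as.character, which turns each whole column into a single string. When the data frame also has non-character columns, a sketch like the following restricts the replacement to the character columns and leaves the others untouched. The small dat data frame is made up for this illustration.

```r
# Mixed data frame: one text column, one integer column
dat <- data.frame(
  name = c("a woman's hat", "Mr Cool"),
  n = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Find the character columns
is_chr <- vapply(dat, is.character, logical(1))

# Remove apostrophes only there; fixed = TRUE treats "'" as a literal
dat[is_chr] <- lapply(dat[is_chr], gsub, pattern = "'", replacement = "",
                      fixed = TRUE)
dat$name
# [1] "a womans hat" "Mr Cool"
```

Assigning into dat[is_chr] keeps the data frame structure and the original column types.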
