Extract words from text using dplyr and stringr

I'm trying to find an effective way to extract words from a text column in a dataset. The approach I'm using is:
library(dplyr)
library(stringr)
Text = c("A little bird told me about the dog", "A pig in a poke", "As busy as a bee")
data = as.data.frame(Text)
keywords <- paste0(c("bird", "dog", "pig","wolf","cat", "bee", "turtle"), collapse = "|")
data %>% mutate(Word = str_extract(Text, keywords))
This is just an example, but I have more than 2,000 possible words to extract from each row. Will such a big regex make things slow, or does the size of the regex not matter? I also don't know of an alternative approach. I expect at most one of these words to appear in each row, but is there a way to create multiple columns automatically if more than one word appears in a row?

We can use str_extract_all to return a list, convert the list elements to a named list or tibble, and use unnest_wider:
library(purrr)
library(stringr)
library(tidyr)
library(dplyr)
data %>%
  mutate(Words = str_extract_all(Text, keywords),
         Words = map(Words, ~ as.list(unique(.x)) %>%
                       set_names(str_c('col', seq_along(.))))) %>%
  unnest_wider(Words)
# A tibble: 3 x 3
# Text col1 col2
# <fct> <chr> <chr>
#1 A little bird told me about the dog bird dog
#2 A pig in a poke pig <NA>
#3 As busy as a bee bee <NA>

Try intersect, keeping keywords as a vector instead of collapsing it into a regex:
data <- data.frame(
  Text = Text,
  Word = sapply(Text,
                function(v) intersect(unlist(strsplit(v, split = " ")), keywords),
                USE.NAMES = FALSE))
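With 2,000+ keywords it may be worth benchmarking a tokenize-and-match approach instead of one giant alternation regex: split each row into words once and keep only the tokens found in the keyword set. A minimal sketch, assuming the keywords are kept as a plain character vector (keyword_vec is a hypothetical name):
library(dplyr)
library(tidyr)
library(stringr)
# hypothetical keyword vector; in practice it holds the 2,000+ words
keyword_vec <- c("bird", "dog", "pig", "wolf", "cat", "bee", "turtle")
data %>%
  mutate(id = row_number(), Text = as.character(Text)) %>%
  mutate(token = str_split(Text, " ")) %>%
  unnest(token) %>%
  filter(token %in% keyword_vec) %>% # exact set membership, no regex scan
  distinct(id, Text, token) %>%
  group_by(id) %>%
  mutate(col = str_c("col", row_number())) %>%
  ungroup() %>%
  pivot_wider(id_cols = c(id, Text), names_from = col, values_from = token)
Rows with no matching keyword drop out of the result; a left_join back onto data would restore them.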

Regex for at least one instance of each of a list of letters?

I'm trying to sharpen my skills with regular expressions by coming up with some R code to solve the NY Times's Spelling Bee game.
I've done that, but now I'm going one step further and trying to identify specifically what the game calls "pangrams": words that contain at least one instance of each of a set of seven letters.
I was hoping to do this with str_detect() and a regex, but I'm not seeing a way to say "at least one of each of these letters."
Per the second example here, the function can be used over a list of letters, but I'm running into problems when the string I want to compare against is in a tibble with a list of words.
This does not work (to identify "pedagogy" as the pangram):
library(tidyverse)
required_letters <- c("o", "a", "d", "e", "g", "p", "y")
list_of_words <- tibble(word = c("pedagogy", "agog", "apogee", "dodge"))
pangrams <- list_of_words %>%
  filter(all(str_detect(word, required_letters)))
But I was hoping it would work in the way that this does:
all(str_detect("pedagogy", required_letters))
In regex, you can create a pattern using a lookahead for each letter:
pattern <- str_c("(?=.*", required_letters, ")", collapse = "")
list_of_words %>%
  filter(str_detect(word, pattern))
# A tibble: 1 × 1
word
<chr>
1 pedagogy
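For reference, the constructed pattern is a chain of zero-width lookaheads, each requiring its letter to appear somewhere in the word:
pattern
# [1] "(?=.*o)(?=.*a)(?=.*d)(?=.*e)(?=.*g)(?=.*p)(?=.*y)"
Since every lookahead is evaluated at the same position, str_detect returns TRUE only when all seven letters are present.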
You can do str_detect rowwise:
list_of_words %>%
  rowwise() %>%
  filter(all(str_detect(word, required_letters))) %>%
  ungroup()
# # A tibble: 1 × 1
# word
# <chr>
# 1 pedagogy
or use map_lgl from purrr:
list_of_words %>%
  filter(map_lgl(word, ~ all(str_detect(.x, required_letters))))
Here is a combination of rowSums and sapply using str_detect:
library(stringr)
list_of_words |>
  filter(rowSums(sapply(required_letters, str_detect, string = word)) == length(required_letters))
# A tibble: 1 × 1
  word
  <chr>
1 pedagogy
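For comparison, the same check also works in base R without stringr; a minimal sketch using grepl:
# logical matrix: one row per word, one column per required letter
hits <- sapply(required_letters, grepl, x = list_of_words$word)
list_of_words[rowSums(hits) == length(required_letters), ]
# # A tibble: 1 × 1
# word
# <chr>
# 1 pedagogy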

R - Splitting a dataframe by using strsplit, but keep delimiter [duplicate]

I have a dataframe like the following:
ref = c("ab/1bc/1", "dd/1", "cc/1", "2323")
text = c("car", "train", "mouse", "house")
data = data.frame(ref, text)
Which produces this:
       ref  text
1 ab/1bc/1   car
2     dd/1 train
3     cc/1 mouse
4     2323 house
If a cell in the ref column has /1 in it, I want to split it and duplicate the row, i.e. the table above should look like this:
   ref  text
1 ab/1   car
2 bc/1   car
3 dd/1 train
4 cc/1 mouse
5 2323 house
I have the following code, which splits the cell on /1, but it also removes the delimiter. I thought about adding /1 back onto every ref, but not all refs have it.
data1 = data %>%
  mutate(ref = strsplit(as.character(ref), "/1")) %>%
  unnest(ref)
Other answers show how to split while keeping delimiters like &, /, or commas, but not /1. Any ideas?
With separate_rows and look-behind:
library(tidyr)
library(dplyr)
data %>%
  separate_rows(ref, sep = "(?<=/1)") %>%
  filter(ref != "")
Output:
# A tibble: 5 × 2
ref text
<chr> <chr>
1 ab/1 car
2 bc/1 car
3 dd/1 train
4 cc/1 mouse
5 2323 house
Or with strsplit:
data %>%
  mutate(ref = strsplit(ref, "(?<=/1)", perl = TRUE)) %>%
  unnest(ref)
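In both answers, (?<=/1) is a zero-width lookbehind, so the split point sits immediately after each /1 and the delimiter stays attached to the piece on its left. A quick check on a single string:
strsplit("ab/1bc/1", "(?<=/1)", perl = TRUE)
# [[1]]
# [1] "ab/1" "bc/1"
separate_rows keeps a trailing empty piece when a string ends in /1, hence the filter(ref != ""), whereas base strsplit drops the trailing empty string itself.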

R / tidyverse - find intersect between multiple character columns

I have the following problem: I have a tibble with multiple character columns.
I tried to provide an MRE below:
library(tidyverse)
df <- tibble(food = c("pizza, bread, apple", "joghurt, cereal, banana"),
             food2 = c("bread, sausage, strawberry", "joghurt, oat, bacon"),
             food3 = c("ice cream, bread, milkshake", "melon, cake, joghurt"))
df %>%
  # rowwise() %>%
  mutate(allcolumns = map2(
    str_split(food, ", "),
    str_split(food2, ", "),
    # str_split(food3, ", "),
    intersect
  ) %>% unlist()) -> df_new
My goal is to get the words common to all columns; within each column, words are separated by ", ". In the MRE I am able to find the intersect between two columns, but I couldn't get a solution for all of them. I experimented with Reduce but was not able to get it to work.
As an edit: I would also like to append the result to the existing tibble.
We may use map to loop over the columns, apply str_split, and then reduce to get the elementwise intersect:
library(dplyr)
library(purrr)
library(stringr)
df %>%
  purrr::map(str_split, ", ") %>%
  transpose %>%
  purrr::map_chr(reduce, intersect) %>%
  mutate(df, Intersect = .)
Output:
# A tibble: 2 x 4
food food2 food3 Intersect
<chr> <chr> <chr> <chr>
1 pizza, bread, apple bread, sausage, strawberry ice cream, bread, milkshake bread
2 joghurt, cereal, banana joghurt, oat, bacon melon, cake, joghurt joghurt
Or we may also use pmap:
df %>%
  mutate(Intersect = pmap(across(everything(), str_split, ", "),
                          ~ list(...) %>%
                            reduce(intersect)))
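Since the question mentions Reduce, the same idea can be written in base R as well; a minimal sketch, assuming every column is a ", "-separated character column:
df$Intersect <- apply(df, 1, function(row) {
  # split each cell into words, then intersect across all columns
  Reduce(intersect, strsplit(row, ", "))
})
If several words are common to all columns, apply returns a list, so Intersect becomes a list column in that case.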

The most frequent in column of dataframe

I have a 3-column dataframe, where the 3rd (last) column contains a text body, something like one sentence.
Additionally, I have a vector of words.
How can I elegantly find the 15 most frequent words (with their number of occurrences) across the whole 3rd column, restricted to the words that occur in the vector mentioned above?
A sentence can look like:
I like dogs and my father like cats
vector = c("dogs", "like")
Here, the most frequent words are dogs and like.
You can try with this:
library(tidytext)
library(tidyverse)
df %>% # your data
  unnest_tokens(word, text) %>% # clean the data a bit and split the phrases into words
  group_by(word) %>% # group by word
  summarise(Freq = n()) %>% # count them
  arrange(-Freq) %>% # order decreasing
  top_n(2) # here the top 2; you can use 15
Result:
# A tibble: 2 x 2
word Freq
<chr> <int>
1 dogs 3
2 i 2
If the text is already split into words, you can skip the unnest_tokens line.
With data:
df <- data.frame(
  id = c(1, 2, 3),
  group = c(1, 1, 1),
  text = c("I like dogs", "I don't hate dogs", "dogs are the best"),
  stringsAsFactors = FALSE)
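Note that the question also asks to keep only words from a given vector, which the pipeline above does not do. A minimal sketch of that extra step, assuming the vector is named word_vector:
word_vector <- c("dogs", "like")
df %>%
  unnest_tokens(word, text) %>%
  filter(word %in% word_vector) %>% # keep only the words of interest
  count(word, sort = TRUE, name = "Freq") %>%
  slice_head(n = 15)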

replace symbols, in factors, in a data frame, with dplyr mutate

I have a data frame, and for various reasons I need to keep one of the columns as a factor and, maintaining the order of the levels, replace periods in the levels with spaces. Here's an example:
library(tidyverse)
library(stringr)
sandwich <- c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
sandwich_df <- data_frame(sandwich_str = sandwich) %>%
  mutate(sandwich_factor = factor(sandwich)) %>%
  mutate(sandwich2 = factor(sandwich_factor,
                            levels = str_replace_all(levels(sandwich_factor), "\\.", " "))) %>%
  mutate(sandwich3 = str_replace_all(sandwich_str, "\\.", " "))
print(sandwich_df)
# A tibble: 5 x 4
  sandwich_str  sandwich_factor sandwich2 sandwich3
  <chr>         <fctr>          <fctr>    <chr>
1 bread         bread           bread     bread
2 mustard.sauce mustard.sauce   <NA>      mustard sauce
3 tuna.fish     tuna.fish       <NA>      tuna fish
4 lettuce       lettuce         lettuce   lettuce
5 bread         bread           bread     bread
So in this data frame:
sandwich_str is a character column
sandwich_factor is a factor column
in sandwich2 I tried replacing all of the periods in the levels of sandwich_factor. For whatever reason, this returns NA whenever there are periods.
in sandwich3 I take the more simple approach of just replacing all of the periods in strings with spaces. This works substantially better.
So I'm wondering what isn't working in my attempt at sandwich2. I'd like it to look more like sandwich3. Any advice?
Does this suit?
library(tidyverse)
library(stringr)
# Data --------------------------------------------------------------------
sandwich <-
  c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
df <-
  data_frame(sandwich_str = sandwich)
# Convert periods to spaces -----------------------------------------------
df$sandwich_str <-
  df$sandwich_str %>%
  as.character() %>%
  str_replace("\\.", " ") %>% # use str_replace_all if a level can contain several periods
  as.factor()
# Print output ------------------------------------------------------------
df %>%
  print()
Credit to @aosmith for posting this answer as a comment. I'll post it here as an answer so I can accept and close this.
The problem was that the new level names must be supplied through the labels argument rather than levels. So the correct way to write this would be:
library(tidyverse)
library(stringr)
sandwich <- c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
sandwich_df <- data_frame(sandwich_str = sandwich) %>%
  mutate(sandwich_factor = factor(sandwich)) %>%
  mutate(sandwich2 = factor(sandwich_factor,
                            labels = str_replace_all(levels(sandwich_factor), "\\.", " "))) %>%
  mutate(sandwich3 = str_replace_all(sandwich_str, "\\.", " "))
print(sandwich_df)
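Alternatively, forcats (loaded with the tidyverse) provides fct_relabel, which applies a function to the levels while preserving their order; a minimal sketch of the same fix:
library(forcats)
library(stringr)
data_frame(sandwich_str = sandwich) %>%
  mutate(sandwich_factor = factor(sandwich),
         sandwich2 = fct_relabel(sandwich_factor, str_replace_all, "\\.", " "))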
