regex: extract segments of a string containing a word, between symbols - r

Hello, I have a data frame that looks something like this:
dataframe <- data_frame(text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF'))
print(dataframe)
# A tibble: 2 × 1
text
<chr>
1 WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12
2 WUFF;other stuff to keep;WIFF2;yes yes IGWIFF
I want to extract the segment of each string containing the word "keep". Note that these segments can be separated from other parts by different symbols, for example , and ;.
The final dataset should look something like this:
final_dataframe <- data_frame(text = c('some words to keep',
'other stuff to keep'))
print(final_dataframe)
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Does anyone know how I could do this?

With stringr ...
library(stringr)
library(dplyr)
dataframe %>%
mutate(text = trimws(str_extract(text, "(?<=[,;]).*keep")))
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Created on 2022-02-01 by the reprex package (v2.0.1)

I've made great use of the positive lookbehind and positive lookahead group constructs -- check this out: https://regex101.com/r/Sc7h8O/1
If you want to assert that the text you're looking for comes after a character/group -- in your case the apostrophe -- use the lookbehind (?<=').
If you want to assert that the match is followed by ', use the lookahead (?=')
And since you want to match between zero and unlimited characters surrounding "keep", use .* on either side, and you wind up with (?<=').*keep.*(?=')
I did find in my test that a string like text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12', will also match the c(, which I didn't intend. But I assume your strings are all delimited by pairs of apostrophes.
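For reference, these lookarounds also work in base R if you set perl = TRUE; a minimal sketch on the second example string, using ; as the delimiter instead of the apostrophe (the [^;]* is my own tightening, keeping the match between two delimiters):
x <- "WUFF;other stuff to keep;WIFF2;yes yes IGWIFF"
regmatches(x, regexpr("(?<=;)[^;]*keep(?=;)", x, perl = TRUE))
#> [1] "other stuff to keep"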

Related

Grepl for 2 words/phrases in proximity in R (dplyr)

I'm trying to create a filter for a large dataframe. I'm using grepl to search for a series of text within a specific column. I've done this for single words/combinations, but now I want to search for two words in close proximity (i.e. the word tumo(u)r within 3 words of the word colon).
I've checked my regular expression on https://www.regextester.com/109207 and my search works there, but it doesn't work within R.
The error I get is
Error: '\W' is an unrecognized escape in character string starting ""\btumor|tumour)\W"
Example below - trying to search for tumo(u)r within 3 words of colon.
Can anyone help?
library(tibble)
example.df <- tibble(number = 1:4, AB = c('tumor of the colon is a very hard disease to cure', 'breast cancer is also known as a neoplasia of the breast', 'tumour of the colon is bad', 'colon cancer is also bad'))
filtered.df <- example.df %>%
filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB, ignore.case=T)
R uses backslashes as escapes, and the regex engine does, too, so you need to double your backslashes. This is explained in multiple prior questions on StackOverflow as well as in the help page brought up at ?regex. You should try the escaped operators in a simpler set of tests before attempting complex operations, and you should pay closer attention to the proper placement of parentheses and quotes in the pattern argument.
filtered.df <- example.df %>%
#filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB,
# errors here ....^.^..............^..^...^..^.............^.^
filter(grepl( "(\\btumor|tumour)\\W|\\w+(\\w+\\W+){0,3}colon\\b", AB,
ignore.case=T) )
> filtered.df
# A tibble: 2 × 2
number AB
<int> <chr>
1 1 tumor of the colon is a very hard disease to cure
2 3 tumour of the colon is bad
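As suggested above, it pays to sanity-check the escaped operators on trivial inputs before assembling the full pattern (the tumou?r shorthand is my own, matching both spellings):
grepl("\\btumou?r\\b", c("tumor", "tumour", "tumors"))
#> [1]  TRUE  TRUE FALSE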

splitting strings using regex in R

I have a really long list of strings that look like the following, and I want to split each one into several pieces.
strings<-c("https://www.website.com/stats/stat.227.y2020.eon.t879.html",
"https://www.website.com/stats/stat.229.y2019.eoff.t476.html")
and the desired output is as below:
links Year Seas Tour
https://www.website.com/stats/stat.227. y2020 eon t879
https://www.website.com/stats/stat.229. y2019 eoff t476
How can I achieve this using regex?
Using str_match:
stringr::str_match(strings, '.*\\.(y\\d+)\\.(\\w+)\\.(t\\d+)')
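str_match returns a character matrix, with the full match in the first column and one capture group per remaining column:
#      [,1]                                                     [,2]    [,3]   [,4]
# [1,] "https://www.website.com/stats/stat.227.y2020.eon.t879" "y2020" "eon"  "t879"
# [2,] "https://www.website.com/stats/stat.229.y2019.eoff.t476" "y2019" "eoff" "t476"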
You can use the same regex in tidyr::extract if you put strings in a dataframe.
tidyr::extract(data.frame(strings), strings, c("Year","Seas", "Tour"),
'\\.(y\\d+)\\.(\\w+)\\.(t\\d+)', remove = FALSE)
# strings Year Seas Tour
#1 https://www.website.com/stats/stat.227.y2020.eon.t879.html y2020 eon t879
#2 https://www.website.com/stats/stat.229.y2019.eoff.t476.html y2019 eoff t476
Here, we capture data in 3 parts (capture groups):
1st part - 'y' followed by a number
2nd part - next word following part 1
3rd part - 't' followed by a number
You could use {unglue}:
library(unglue)
unglue::unglue_data(
strings, "{links}.{Year=[^.]+}.{Seas=[^.]+}.{Tour=[^.]+}.html")
#> links Year Seas Tour
#> 1 https://www.website.com/stats/stat.227 y2020 eon t879
#> 2 https://www.website.com/stats/stat.229 y2019 eoff t476
Here "[^.]+" means "one or more non-dot characters", which is what we want for Year, Seas, and Tour.
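If you'd rather stay in base R, utils::strcapture() fills a prototype data frame from the capture groups; a sketch with the same regex:
strcapture("\\.(y\\d+)\\.(\\w+)\\.(t\\d+)", strings,
           proto = data.frame(Year = character(),
                              Seas = character(),
                              Tour = character()))
#    Year Seas Tour
# 1 y2020  eon t879
# 2 y2019 eoff t476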

how to use boundary with str_detect (stringr package)

Here is some data.
library(stringr)
library(dplyr)
df <- tibble(sentences)
I want to identify all sentences with the word "her." But this, of course, also returns sentences with words like "there" and "here."
df %>% filter(str_detect(sentences, "her"))
# A tibble: 43 x 1
sentences
<chr>
1 The boy was there when the sun rose.
2 Help the woman get back to her feet.
3 What joy there is in living.
4 There are more than two factors here.
5 Cats and dogs each hate the other.
6 The wharf could be seen at the farther shore.
7 The tiny girl took off her hat.
8 Write a fond note to the friend you cherish.
9 There was a sound of dry leaves outside.
10 Add the column and put the sum here.
The documentation for stringr::str_detect says, "Match character, word, line and sentence boundaries with boundary()." I can't figure out how to do this, nor can I find an example anywhere. All of the documentation examples involve the str_split or str_count functions.
My question is related to this question, but I would specifically like to understand how to use the stringr::boundary function.
We can specify the word boundary (\\b) at the start and end to avoid any partial matches:
library(stringr)
library(dplyr)
df %>%
filter(str_detect(sentences, "\\bher\\b"))
# sentences
#1 Help the woman get back to her feet.
#2 The tiny girl took off her hat.
boundary() itself only accepts a boundary type ("character", "line_break", "sentence" or "word"), not a word to match, so boundary("her") throws an error. To actually use boundary() here, split each sentence on word boundaries and test for membership:
df %>%
filter(sapply(str_split(sentences, boundary("word")), function(x) "her" %in% x))

Separating txt (conversation) into columns with speaker names as variables

I'm new to text mining in R. I have multiple txt files of conversations between the same speakers organized as follows:
speaker one [speakers' names are on their own line]
what speaker one says [paragraph of each speaker's speech after
line break from name]
[empty line]
speaker two
what speaker two says
[empty line]
speaker one
what speaker one replies
[empty line]
speaker three
what speaker three says
...
I want to break up the texts into one row per text with columns as the names of speakers. I want to have everything that speaker one says in each text combined in one cell on each row and the same for other speakers. Something like this:
text "speaker one" "speaker two" ...
text1 everything speaker one said everything speaker two said
text2 everything speaker one said everything speaker two said
...
Any help on how to get started would be appreciated.
Using some tidyverse packages you can get there. First read the text with readr::read_file, then split on the empty line and use readr::read_delim to read the pieces into data.frames. As the data is now in a list, bind_rows will collapse all of it into one data.frame; bind_rows matches on the column names, so all the text of a speaker ends up in the correct column. Depending on which outcome you want, use either the first or the second solution.
Combining multiple text files is largely left up to you, though a possible approach is sketched at the end of this answer.
library(readr)
library(tidyr)
library(dplyr)
# read file into a character vector
text <- readr::read_file("conversation.txt")
# split the text on the empty line
split_text <- strsplit(text, split = "\r\n\r\n")
# read the data in again with read_delim. This will generate a list of data.frames
list_text <- lapply(unlist(split_text), function(x) readr::read_delim(x, col_names = TRUE, delim = "\t"))
# use bind_rows from dplyr to combine everything into 1 tibble. bind_rows matches on the column names.
list_text %>%
bind_rows
# A tibble: 5 x 3
`speaker one` `speaker two` `speaker three`
<chr> <chr> <chr>
1 what speaker one says is in this paragraph. NA NA
2 It might be in multiple lines, but not separated by an empty line. NA NA
3 NA what speaker two says NA
4 what speaker one replies NA NA
5 NA NA what speaker three says.
Collapsing all the text in one line:
This needs a bit more work: first gather the data into a tidy long format, collapse the text, and then spread it wide again. Run the statements in chunks if you want to see what is happening in each step.
list_text %>%
bind_rows %>%
pivot_longer(everything(),
names_to = "speakers",
values_to = "text",
values_drop_na = TRUE) %>%
group_by(speakers) %>%
summarise(text = paste0(text, collapse = " ")) %>%
pivot_wider(names_from = speakers, values_from = text)
# A tibble: 1 x 3
`speaker one` `speaker three` `speaker two`
<chr> <chr> <chr>
1 what speaker one says is in this paragraph. It might be in multiple lines, but not separated b~ what speaker three s~ what speaker two ~
Text used in the text file conversation.txt:
speaker one
what speaker one says is in this paragraph.
It might be in multiple lines, but not separated by an empty line.
speaker two
what speaker two says
speaker one
what speaker one replies
speaker three
what speaker three says.
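As promised above, here is one possible sketch for combining several files: wrap the steps into a function and row-bind the results with an id column. The file names are hypothetical, and every file is assumed to follow the same layout:
library(readr)
library(dplyr)
library(tidyr)
# read one conversation file and return a single wide row
read_conversation <- function(path) {
  text <- readr::read_file(path)
  # split the file on empty lines, one chunk per speech turn
  split_text <- strsplit(text, split = "\r\n\r\n")
  lapply(unlist(split_text),
         function(x) readr::read_delim(x, col_names = TRUE, delim = "\t")) %>%
    bind_rows() %>%
    pivot_longer(everything(), names_to = "speakers", values_to = "text",
                 values_drop_na = TRUE) %>%
    group_by(speakers) %>%
    summarise(text = paste0(text, collapse = " ")) %>%
    pivot_wider(names_from = speakers, values_from = text)
}
files <- c("conversation1.txt", "conversation2.txt")
lapply(files, read_conversation) %>%
  bind_rows(.id = "text") # the "text" column numbers the source files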

Find the most frequently occurring words in a text in R

Can someone help me with how to find the most frequently used two and three words in a text using R?
My text is...
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs) but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
The tidytext package makes this sort of thing pretty simple:
library(tidytext)
library(dplyr)
data_frame(text = text) %>%
unnest_tokens(word, text) %>% # split words
anti_join(stop_words) %>% # take out "a", "an", "the", etc.
count(word, sort = TRUE) # count occurrences
# Source: local data frame [73 x 2]
#
# word n
# (chr) (int)
# 1 phrase 8
# 2 sentence 6
# 3 words 4
# 4 called 3
# 5 common 3
# 6 grammatical 3
# 7 meaning 3
# 8 alex 2
# 9 bird 2
# 10 complete 2
# .. ... ...
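tidytext can also tokenize n-grams directly through the token argument of unnest_tokens; a brief sketch of the same pipeline for bigram counts (the column name bigram is my own choice):
data_frame(text = text) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # split into two-word tokens
count(bigram, sort = TRUE) # count occurrences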
If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:
library(tokenizers)
tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>% # tokenize bigrams and trigrams
as_data_frame() %>% # structure
count(value, sort = TRUE) # count
# Source: local data frame [531 x 2]
#
# value n
# (fctr) (int)
# 1 of the 5
# 2 a phrase 4
# 3 the sentence 4
# 4 as a 3
# 5 in the 3
# 6 may be 3
# 7 a complete 2
# 8 a phrase is 2
# 9 a sentence 2
# 10 a white 2
# .. ... ...
In Natural Language Processing, 2-word phrases are referred to as "bi-grams", 3-word phrases as "tri-grams", and so forth. Generally, a given combination of n words is called an "n-gram".
First, we install the ngram package (available on CRAN):
# Install package "ngram"
install.packages("ngram")
Then, we will find the most frequent two-word and three-word phrases:
library(ngram)
# To find all two-word phrases in the text "text":
ng2 <- ngram(text, n = 2)
# To find all three-word phrases in the text "text":
ng3 <- ngram(text, n = 3)
Finally, we will print the objects (ngrams) using various methods as below:
print(ng2, output = "truncated")
print(ng3, output = "full")
get.phrasetable(ng2)
ngram::ngram_asweka(text, min=2, max=3)
We can also use Markov chains to babble new sequences:
# if we are using ng2 (bi-gram)
lnth = 2
babble(ng = ng2, genlen = lnth)
# if we are using ng3 (tri-gram)
lnth = 3
babble(ng = ng3, genlen = lnth)
We can split the words and use table to summarize the frequency:
words <- strsplit(text, "[ ,.\\(\\)\"]")
sort(table(words, exclude = ""), decreasing = T)
Simplest?
require(quanteda)
# bi-grams
topfeatures(dfm(text, ngrams = 2, verbose = FALSE))
## of_the a_phrase the_sentence may_be as_a in_the in_common phrase_is
## 5 4 4 3 3 3 2 2
## is_usually group_of
## 2 2
# for tri-grams
topfeatures(dfm(text, ngrams = 3, verbose = FALSE))
## a_phrase_is group_of_words of_a_sentence of_the_sentence for_example_in example_in_the
## 2 2 2 2 2 2
## in_the_sentence an_orange_bird orange_bird_with bird_with_a
## 2 2 2 2
Here's a simple base R approach for the 5 most frequent words:
head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)
# a the of in phrase
# 21 18 12 10 8
What it returns is a named integer vector: the values are the frequency counts, and the names of the vector are the words that were counted.
gsub("[[:punct:]]", "", text) to remove punctuation since you don't want to count that, I guess
strsplit(gsub("[[:punct:]]", "", text), " ") to split the string on spaces
table() to count unique elements' frequency
sort(..., decreasing = TRUE) to sort them in decreasing order
head(..., 5) to select only the top 5 most frequent words
