Extract a specified number of words after a string in R

I am trying to extract the 4 words after the string "source:" in this example below.
library(stringr)
x <- data.frame(end = c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
))
x$source = str_extract(x$end, '[^source: ](.)*')
When I try the code above, I can extract all the text after "source:" into a new column. I was wondering if there is a way to extract only the first 4 words following "source:", either using stringr or any other package.

You can use :
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4}'))
#[1] "from animal origin as"       "Eggs, liver, certain fish"
#[3] "Leafy green vegetables such"
(?<=...) is a positive lookbehind: it requires the match to be preceded by 'source:' and a whitespace.
We then capture 4 "words" after it, each with an optional comma and trailing whitespace.
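If the lookbehind regex feels brittle, `stringr::word()` can do the same job: strip the prefix, then take words 1 through 4. A small sketch of that alternative, assuming words are separated by single spaces:

```r
library(stringr)

x <- c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
       "source: Eggs, liver, certain fish species such as sardines")

# Drop everything up to and including "source:", then keep words 1 through 4
word(str_trim(str_remove(x, "^.*source:")), 1, 4)
#> [1] "from animal origin as"     "Eggs, liver, certain fish"
```

This avoids counting words inside the regex itself, at the cost of a second pass over the string.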

Related

Text column in R - Trying to count the keywords sequentially

I am working on a dataset that has a text column. The text has many sentences separated by a semi-colon ';'. I am trying to get a word count in a new column in dataframe for words that match my keyword. However, in one sentence, if there are repeated keywords, they should be considered only once.
For instance -
The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods
Solar panels, Tawian tariffs, trade
Trade issues impacting the solar industry
are the text in one column of the dataframe.
My keywords include solar, solar panels, section 201
I want to count the words in each sentence that match my keywords but if both or all words are there in the sentence, then it is counted only once. Word counts should only consider keywords in different sentences. If one sentence doesn't have a specific keyword, then move towards finding the second keyword.
My output should be -
word_count
2 (as section 201 is mentioned in both sentences, we do not search for solar because the first word in the keyword list matched)
1 (as only solar word is there)
1 (as only solar word is there)
Please suggest a way to resolve this issue. It is a crucial part of my research work. Thanks.
I think you want to treat each example as a separate list item, split each list item into sentences wherever there is a semi-colon, and then look for the first keyword that occurs anywhere in that list item. That becomes the target keyword for the list item. You then count how many of its sentences contain that target keyword, counting each sentence at most once:
library(dplyr)
library(stringr)
# I modified your example sentences to include "section 201" twice in sentence 2 in list item #1 to show it will only count once in that sentence
# I modified the order of your keywords, otherwise solar will be detected first in all sentences
sentences <- list("The section 201 solar trade case on cells and modules; Issues relating to section 201 section 201 tariffs on imported goods",
                  "Solar panels, Tawian tariffs, trade",
                  "Trade issues impacting the solar industry")
keywords <- c("section 201", "solar", "solar panels")
# Go through each list item
lapply(sentences, function(sentence){
  # Split the list item into separate sentences on ";"
  # Also lowercase everything, otherwise solar != Solar
  split_string <- str_split(tolower(sentence), ";")[[1]]
  # For each keyword, detect whether it occurs in any sentence of this list item
  output <- sapply(keywords, function(keyword) any(str_detect(split_string, keyword)))
  # Choose the first keyword that occurs: keywords[output][1]
  target <- keywords[output][1]
  # Count how many sentences contain the target keyword (str_detect is TRUE at
  # most once per sentence, so repeats within a sentence count only once).
  # Name the result with the keyword that was detected.
  setNames(sum(str_detect(split_string, target)), target)
})
This gives a list with, for each list item, the first occurring keyword and the number of sentences within that list item that contain it:
[[1]]
section 201
2
[[2]]
solar
1
[[3]]
solar
1
I would first split your text column into multiple columns using tidyr::separate(); then search for your key phrases within each of these, using stringr::str_detect() or base regex functions; then sum across columns using rowSums(dplyr::across()). Like so:
library(tidyverse)
keywords <- paste("solar", "section 201", sep = "|")
df <- tibble(
  text = c(
    "The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods",
    "Solar panels; Tawian tariffs; trade",
    "Trade issues impacting the solar industry"
  )
)
df <- df %>%
  separate(text, into = c("text1", "text2", "text3"), sep = ";", fill = "right") %>%
  mutate(
    count = rowSums(
      across(text1:text3, ~ str_detect(str_to_lower(.x), keywords)),
      na.rm = TRUE
    )
  )
The count column then contains the results you predicted:
df %>%
  select(count)
# A tibble: 3 x 1
# count
# <dbl>
# 1 2
# 2 1
# 3 1
If the list of keywords is not too long (say 10 or 20 words), then you can look at the count of all the keywords for each text string. I am adding ; at the end of each text string so that a sentence always ends with a ;. The pattern paste0("[^;]*", key, "[^;]*;") identifies any sentence containing the word (stored in) key.
txt <- c("The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods",
         "Solar panels, Tawian tariffs, trade",
         "Trade issues impacting the solar industry")
keys <- c("section 201", "solar panels", "solar")
counts <- sapply(keys, function(key) stringr::str_count(
  paste0(txt, ";"),
  stringr::regex(paste0("[^;]*", key, "[^;]*;"), ignore_case = TRUE)
))
Next you can go over each row of counts and look at the first non-zero element which should be the value you are looking for.
sapply(1:nrow(counts), function(i) {
  a <- counts[i, ]
  a[a != 0][1]
})
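For comparison, here is a base-R sketch of the whole pipeline in one pass (my own variant, not part of the answer above): split each text on ";", find the first keyword present anywhere in it, then count the sentences containing that keyword:

```r
txt <- c("The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods",
         "Solar panels, Tawian tariffs, trade",
         "Trade issues impacting the solar industry")
keys <- c("section 201", "solar panels", "solar")

sapply(txt, function(t) {
  # lowercase and split into sentences
  sentences <- tolower(strsplit(t, ";", fixed = TRUE)[[1]])
  # first keyword that occurs anywhere in this text
  hit <- keys[sapply(keys, function(k) any(grepl(k, sentences, fixed = TRUE)))][1]
  # number of sentences containing it (repeats inside a sentence count once)
  if (is.na(hit)) 0L else sum(grepl(hit, sentences, fixed = TRUE))
}, USE.NAMES = FALSE)
#> [1] 2 1 1
```

Using `fixed = TRUE` keeps multi-word keywords like "section 201" from being interpreted as regular expressions.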

regex: extract segments of a string containing a word, between symbols

Hello, I have a data frame that looks something like this:
dataframe <- data_frame(text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
                                 'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF'))
print(dataframe)
# A tibble: 2 × 1
text
<chr>
1 WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12
2 WUFF;other stuff to keep;WIFF2;yes yes IGWIFF
I want to extract the segment of the strings containing the word "keep". Note that these segments can be separated from other parts by different symbols for example , and ;.
The final dataset should look something like this:
final_dataframe <- data_frame(text = c('some words to keep',
                                       'other stuff to keep'))
print(final_dataframe)
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Does anyone know how I could do this?
With stringr ...
library(stringr)
library(dplyr)
dataframe %>%
  mutate(text = trimws(str_extract(text, "(?<=[,;]).*keep")))
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Created on 2022-02-01 by the reprex package (v2.0.1)
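A base-R equivalent (my own sketch, not part of the answer above): split on the separator characters and keep the piece containing "keep".

```r
text <- c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
          'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF')

pieces <- strsplit(text, "[,;]")   # split on comma or semicolon
trimws(sapply(pieces, function(p) p[grepl("keep", p)][1]))
#> [1] "some words to keep"  "other stuff to keep"
```

The `[1]` keeps only the first matching segment per string, should "keep" appear more than once.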
I've made great use of the positive lookbehind and positive lookahead constructs -- check this out: https://regex101.com/r/Sc7h8O/1
If you want to assert that the text you're looking for comes after a character/group -- in your first case the apostrophe -- use (?<=').
If you want to assert that the match is followed by ', use (?=').
And since you want to match between 0 and unlimited characters surrounding "keep", use .* on either side, and you wind up with (?<=').*keep.*(?=').
I did find in my test that a string like c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12', will also match the c(, which I didn't intend. But I assume your strings are all captured by pairs of apostrophes.
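The same lookaround constructs work in stringr, which uses ICU regular expressions. A minimal sketch:

```r
library(stringr)

# Extract the text between "[" and "]" without including the brackets:
# (?<=\[) asserts a "[" before the match, (?=\]) asserts a "]" after it
str_extract("start[middle]end", "(?<=\\[).*(?=\\])")
#> [1] "middle"
```

Because the lookarounds are zero-width, the delimiters themselves are not part of the returned match.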

R: extract substring with capital letters from string

I have a dataframe with strings in a column. How could I extract only the substrings that are in capital letters and add them to another column?
This is an example:
fecha incident
1 2020-12-01 Check GENERATOR
2 2020-12-01 Check BLADE
3 2020-12-02 Problem in GENERATOR
4 2020-12-01 Check YAW
5 2020-12-02 Alarm in SAFETY SYSTEM
And I would like to create another column as follows:
fecha incident system
1 2020-12-01 Check GENERATOR GENERATOR
2 2020-12-01 Check BLADE BLADE
3 2020-12-02 Problem in GENERATOR GENERATOR
4 2020-12-01 Check YAW YAW
5 2020-12-02 Alarm in SAFETY SYSTEM SAFETY SYSTEM
I have tried with str_sub or str_extract_all using a regex, but I believe I'm doing things wrong.
You can use str_extract if you want to work in a dataframe and tie it into a tidyverse workflow.
The regex asks for two or more consecutive characters that are either capital letters or spaces, so it does not match merely capitalized words like "Check". str_trim removes the whitespace that can get picked up when the capitalized word is not at the end of the string.
Note that this snippet only extracts the first run of capitalized words connected by spaces. If there are capitalized words in different parts of the string, only the first run is returned.
library(tidyverse)
x <- c("CAPITAL and not Capital", "one more CAP word", "MULTIPLE CAPITAL words", "CAP words NOT connected")
cap <- str_trim(str_extract(x, "([:upper:]|[:space:]){2,}"))
cap
#> [1] "CAPITAL" "CAP" "MULTIPLE CAPITAL" "CAP"
Created on 2021-01-08 by the reprex package (v0.3.0)
library(tidyverse)
string <- data.frame(test="does this WORK")
string$new <- str_extract_all(string$test, "[A-Z]+")
string
test new
1 does this WORK WORK
If there are cases when the upper-case letters are not next to each other you can use str_extract_all to extract all the capital letters in a sentence and then paste them together.
sapply(stringr::str_extract_all(df$incident, '[A-Z]{2,}'),paste0, collapse = ' ')
#[1] "GENERATOR" "BLADE" "GENERATOR" "YAW" "SAFETY SYSTEM"
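To put this back into the question's dataframe as a new system column, a small sketch (the df construction below is my own, trimmed to three of the question's rows):

```r
library(stringr)

df <- data.frame(
  fecha    = c("2020-12-01", "2020-12-01", "2020-12-02"),
  incident = c("Check GENERATOR", "Alarm in SAFETY SYSTEM", "Problem in GENERATOR")
)

# extract every run of 2+ capitals per row, then join them with spaces
df$system <- sapply(str_extract_all(df$incident, "[A-Z]{2,}"), paste0, collapse = " ")
df$system
#> [1] "GENERATOR"     "SAFETY SYSTEM" "GENERATOR"
```

`{2,}` avoids picking up the single capital at the start of words like "Check".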

Text-mining including patterns and numbers

The dataset contains a free-text field with information on building plans. I need to split the content of the field into 2 parts: the first part contains only the number of planned buildings, the other only the type of building. I have a reference lexicon list with the types of buildings.
Example
Plans <- c("build 10 houses ", "5 luxury apartments with sea view",
           "renovate 20 cottages", " transform 2 bungalows and a school", "1 hotel")
Reference list
Types <-c("houses", "cottages", "bungalows", "luxury apartments")
Desired output: 2 columns, Number and Type, with this content:
Number Type
10 houses
5 apartments
20 cottages
2 bungalows
Tried
matches <- unique(grep(paste(Types, collapse = "|"), Plans, value = TRUE))
I can match the plans and types, but I can’t extract the numbers and types into two columns.
I tried str_split_fixed and grepl using [:digit:] and [:alpha:], but it isn't working.
Assuming there is only going to be one numeric part in the string, we can extract the numeric part by replacing all the characters to empty strings. We create the Type column by extracting any of the string present in the Plans.
library(stringr)
data.frame(Number = as.numeric(gsub("[[:alpha:]]", "", Plans)),
           Type = str_extract(Plans, paste(Types, collapse = "|")))
# Number Type
#1 10 houses
#2 5 luxury apartments
#3 20 cottages
#4 2 bungalows
#5 1 <NA>
For the 5th row, "hotel" is not present in Types, so the output is NA; if you need to ignore such cases you can filter them out with is.na. The numeric part is extracted by deleting every alphabetic character with gsub.
You can also use strcapture from base R:
strcapture(pattern = paste0("(\\d+)\\s(", paste(Types, collapse = "|"), ")"), x = Plans,
           proto = data.frame(Number = numeric(), Type = character()))
Number Type
1 10 houses
2 5 luxury apartments
3 20 cottages
4 2 bungalows
5 NA <NA>
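If rows with no matching Type should be dropped rather than kept as NA (as hinted above with is.na), a small follow-up sketch:

```r
Plans <- c("build 10 houses ", "5 luxury apartments with sea view",
           "renovate 20 cottages", " transform 2 bungalows and a school", "1 hotel")
Types <- c("houses", "cottages", "bungalows", "luxury apartments")

res <- strcapture(pattern = paste0("(\\d+)\\s(", paste(Types, collapse = "|"), ")"),
                  x = Plans,
                  proto = data.frame(Number = numeric(), Type = character()))
res[!is.na(res$Type), ]   # keep only the rows where a Type was found
```

strcapture fills non-matching rows with NA in every proto column, so filtering on either column works.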

Find the most frequently occurring words in a text in R

Can someone help me with how to find the most frequently used two and three words in a text using R?
My text is...
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning 'tolerate') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs) but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
The tidytext package makes this sort of thing pretty simple:
library(tidytext)
library(dplyr)
data_frame(text = text) %>%
  unnest_tokens(word, text) %>%   # split words
  anti_join(stop_words) %>%       # take out "a", "an", "the", etc.
  count(word, sort = TRUE)        # count occurrences
# Source: local data frame [73 x 2]
#
# word n
# (chr) (int)
# 1 phrase 8
# 2 sentence 6
# 3 words 4
# 4 called 3
# 5 common 3
# 6 grammatical 3
# 7 meaning 3
# 8 alex 2
# 9 bird 2
# 10 complete 2
# .. ... ...
If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:
library(tokenizers)
tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>%  # tokenize bigrams and trigrams
  as_data_frame() %>%                                           # structure
  count(value, sort = TRUE)                                     # count
# Source: local data frame [531 x 2]
#
# value n
# (fctr) (int)
# 1 of the 5
# 2 a phrase 4
# 3 the sentence 4
# 4 as a 3
# 5 in the 3
# 6 may be 3
# 7 a complete 2
# 8 a phrase is 2
# 9 a sentence 2
# 10 a white 2
# .. ... ...
In Natural Language Processing, 2-word phrases are referred to as "bi-gram", and 3-word phrases are referred to as "tri-gram", and so forth. Generally, a given combination of n-words is called an "n-gram".
First, we install the ngram package (available on CRAN)
# Install package "ngram"
install.packages("ngram")
Then, we will find the most frequent two-word and three-word phrases
library(ngram)
# To find all two-word phrases in the test "text":
ng2 <- ngram(text, n = 2)
# To find all three-word phrases in the test "text":
ng3 <- ngram(text, n = 3)
Finally, we will print the objects (ngrams) using various methods as below:
print(ng2, output = "truncated")
print(ng3, output = "full")
get.phrasetable(ng2)
ngram::ngram_asweka(text, min = 2, max = 3)
We can also use Markov Chains to babble new sequences:
# if we are using ng2 (bi-gram)
lnth = 2
babble(ng = ng2, genlen = lnth)
# if we are using ng3 (tri-gram)
lnth = 3
babble(ng = ng3, genlen = lnth)
We can split the words and use table to summarize the frequency:
words <- strsplit(text, "[ ,.\\(\\)\"]")
sort(table(unlist(words), exclude = ""), decreasing = TRUE)
Simplest?
require(quanteda)
# bi-grams
topfeatures(dfm(text, ngrams = 2, verbose = FALSE))
## of_the a_phrase the_sentence may_be as_a in_the in_common phrase_is
## 5 4 4 3 3 3 2 2
## is_usually group_of
## 2 2
# for tri-grams
topfeatures(dfm(text, ngrams = 3, verbose = FALSE))
## a_phrase_is group_of_words of_a_sentence of_the_sentence for_example_in example_in_the
## 2 2 2 2 2 2
## in_the_sentence an_orange_bird orange_bird_with bird_with_a
## 2 2 2 2
Here's a simple base R approach for the 5 most frequent words:
head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)
# a the of in phrase
# 21 18 12 10 8
What it returns is an integer vector with the frequency count and the names of the vector correspond to the words that were counted.
gsub("[[:punct:]]", "", text) to remove punctuation since you don't want to count that, I guess
strsplit(gsub("[[:punct:]]", "", text), " ") to split the string on spaces
table() to count unique elements' frequency
sort(..., decreasing = TRUE) to sort them in decreasing order
head(..., 5) to select only the top 5 most frequent words
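The same base pipeline extends to bigrams: pair the word vector against itself shifted by one before tabulating. A sketch of my own, shown on a short made-up sentence rather than the question's text:

```r
text2 <- "the cat sat on the mat and the cat ran"

# clean, lowercase, and split into words
words <- strsplit(gsub("[[:punct:]]", "", tolower(text2)), "\\s+")[[1]]

# pair word i with word i+1 to form bigrams
bigrams <- paste(head(words, -1), tail(words, -1))
head(sort(table(bigrams), decreasing = TRUE), 3)
```

Here "the cat" occurs twice and tops the table; the same `head(words, -k)`/`tail(words, -k)` trick gives trigrams with k = 2 and two `paste()` arguments shifted accordingly.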
