Text column in R - Trying to count the keywords sequentially

I am working on a dataset that has a text column. The text contains many sentences separated by a semi-colon ';'. I am trying to add a new column to the dataframe with a count of the words that match my keywords. However, if a keyword is repeated within one sentence, it should be counted only once.
For instance -
The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods
Solar panels, Taiwan tariffs, trade
Trade issues impacting the solar industry
are the text in one column of the dataframe.
My keywords include solar, solar panels, section 201
I want to count the keyword matches per sentence, but if a keyword occurs more than once within a sentence, it should be counted only once; counts should only accumulate across different sentences. If a sentence doesn't contain a given keyword, move on to look for the next keyword.
My output should be -
word_count
2 (as section 201 is mentioned in both sentences, we do not search for solar because the first word in the keyword list matched)
1 (as only solar word is there)
1 (as only solar word is there)
Please suggest a way to resolve this issue. It is a crucial part of my research work. Thanks.
Kind Regards,
Preety

I think you want to treat each number in your example as a separate list item, split each list item into separate sentences wherever there is a semi-colon, and then look for the first keyword that occurs anywhere in that list item. That becomes the target keyword for the list item. You then count, for each sentence within the list item, whether that target keyword occurs (at most once per sentence):
library(dplyr)
library(stringr)
# I modified your example sentences to include "section 201" twice in sentence 2 in list item #1 to show it will only count once in that sentence
# I modified the order of your keywords, otherwise solar will be detected first in all sentences
sentences <- list(
  "The section 201 solar trade case on cells and modules; Issues relating to section 201 section 201 tariffs on imported goods",
  "Solar panels, Taiwan tariffs, trade",
  "Trade issues impacting the solar industry"
)
keywords <- c("section 201", "solar", "solar panels")
# Go through each list item
lapply(sentences, function(sentence){
  # Split the list item into separate sentences on ";"
  # Also lowercase everything, otherwise solar != Solar
  split_string <- str_split(tolower(sentence), ";")[[1]]
  # For each keyword, use str_detect from stringr to check whether it
  # occurs anywhere in the list item
  output <- sapply(keywords, function(keyword) any(str_detect(split_string, keyword)))
  # Choose the first keyword that occurs: keywords[output][1]
  target <- keywords[output][1]
  # Detect the target keyword once per sentence (TRUE/FALSE), then sum, so
  # repeats within one sentence count only once; name the result after the keyword
  setNames(sum(str_detect(split_string, target)), target)
})
This gives a list, with the first-occurring keyword as the name and the number of sentences within the list item that contain it:
[[1]]
section 201
2
[[2]]
solar
1
[[3]]
solar
1

I would first split your text column into multiple columns using tidyr::separate(); then search for your key phrases within each of these, using stringr::str_detect() or base regex functions; then sum across columns using rowSums(dplyr::across()). Like so:
library(tidyverse)
keywords <- paste("solar", "section 201", sep = "|")
df <- tibble(
  text = c(
    "The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods",
    "Solar panels; Taiwan tariffs; trade",
    "Trade issues impacting the solar industry"
  )
)
df <- df %>%
  separate(text, into = c("text1", "text2", "text3"), sep = ";", fill = "right") %>%
  mutate(
    count = rowSums(
      across(text1:text3, ~ str_detect(str_to_lower(.x), keywords)),
      na.rm = TRUE
    )
  )
The count column then contains the results you predicted:
df %>%
  select(count)
# A tibble: 3 x 1
# count
# <dbl>
# 1 2
# 2 1
# 3 1

If the list of keywords is not too long (say 10 or 20 words), then you can look at the count of all the keywords for each text string. I am adding ; at the end of each text string so that a sentence always ends with a ;. The pattern paste0("[^;]*", key, "[^;]*;") identifies any sentence containing the word (stored in) key.
txt <- c(
  "The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods",
  "Solar panels, Taiwan tariffs, trade",
  "Trade issues impacting the solar industry"
)
keys <- c("section 201", "solar panels", "solar")
counts <- sapply(keys, function(key) {
  stringr::str_count(paste0(txt, ";"),
                     stringr::regex(paste0("[^;]*", key, "[^;]*;"), ignore_case = TRUE))
})
Next you can go over each row of counts and look at the first non-zero element which should be the value you are looking for.
sapply(1:nrow(counts), function(i) {
  a <- counts[i, ]
  a[a != 0][1]
})
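Assuming the example data above, this should return one value per text string:
# [1] 2 1 1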

Related

How to remove the first words of specific rows that appear in another column?

Is there a way to remove the first n words of the column "content" when there are words present in the "keyword" column?
I am working with a data frame similar to this:
keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee", "My uncle Sam is sitting in the kitchen with my uncle Richard", "Tom is playing with Tom's family's dog", "Cassandra is jogging for her first time", "The librarian is jogging with her")
data <- data.frame(keyword, content)
data
In some cases, the first few words of the "keyword" string are contained in the "content" string.
In others, the "keyword" string remains empty and only "content" is filled.
What I want to achieve here is to remove the first appearance of the word combination in "keyword" that appears in the same row in "content".
Unfortunately, I am only able to create code that deletes all the matching words. But as you can see, some words (like "uncle" or "Tom") appear more than one time in a cell.
I'd like to only delete the first appearance and keep all that come after in the same cell.
My next-best solution was to use the following code:
data$content <- mapply(function(x, y) gsub(x, "", y), gsub(" ", "|", data$keyword), data$content)
This code was designed to remove all of the words from "content" that are present in "keyword" of the same row. (It was initially posted here).
Another option that I tried was to design a function for this:
I first created a new variable which counted the number of words that are included in the "keyword" string of the corresponding line:
numw <- lengths(gregexpr("\\S+", data$keyword))
data <- cbind(data, numw)
Second, I tried to formulate a function to remove the first n words of content[i], with n = numw[i]
shorten <- function(v, z){
  v <- gsub(".*^\\w+", z, v)
}
shorten(data$content, data$numw)
Unfortunately, I am not able to make the function work and the following error message will be generated:
Error in gsub(".*^\w+", z, v) : invalid 'replacement' argument
So, I'd be incredibly grateful if someone could help me formulate a function that deals with the issue more appropriately.
Here is a solution based on str_remove. Since str_remove warns if the pattern is '', the first mutate() replaces empty keywords with NA. Then, where keyword is not NA, it is stripped from content; otherwise content is kept as is.
library(tidyverse)
data |>
  mutate(keyword = na_if(keyword, '')) |>
  mutate(content = case_when(
    !is.na(keyword) ~ str_remove(content, keyword),
    is.na(keyword) ~ content
  ))
#>         keyword                                          content
#> 1     Mr. Jones                               is drinking coffee
#> 2  My uncle Sam  is sitting in the kitchen with my uncle Richard
#> 3           Tom               is playing with Tom's family's dog
#> 4          <NA>          Cassandra is jogging for her first time
#> 5 The librarian                              is jogging with her
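For completeness, here is a sketch of the remove-the-first-n-words idea from the question. The word count must go into the pattern (as a counted repetition), not into the replacement, which is what caused the invalid 'replacement' error:
numw <- lengths(gregexpr("\\S+", data$keyword))
numw[data$keyword == ""] <- 0  # gregexpr() returns -1 for "", so force zero words here
# Strip the first n whitespace-separated words from each content string
mapply(function(n, v) sub(sprintf("^\\s*(\\S+\\s+){%d}", n), "", v),
       numw, data$content)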

Assign an ID based on keywords present in Tweets

I have extracted Tweets by feeding in 44 different keywords, and the output is in a file which consists of 400k tweets in total. The output file has tweets that contain the relevant keywords. How could I create a separate ID column which contains the keyword present in that tweet?
Eg: The tweet is:
Andhra Pradesh is the highest state with crimes against women
the keyword here is "crimes against women"
I would like to create a column that assigns the keyword "crimes against women" to the tweet, a sort of ID column to be precise.
# input column 1
Tweet <- "Andhra Pradesh is the highest state with crimes against women"
# expected output column 2 beside the Tweet column
Keyword <- "crimes against women"
Edit: I do not want to extract any part of the tweet, I just want to be able to assign to the tweet, in a new column, the keyword it contains so it will help me segregate the tweets based on this keyword.
You can perform this analysis with the stringr package; however, I don't think you need to use sapply.
Consider the following keyword list and table with tweets:
keyword_list <- c("crimes against women", "downloading tweets", "r analysis")
tweets <- data.frame(
  tweet = c(
    "Andhra Pradesh is the highest state with crimes against women",
    "I am downloading tweets",
    "I love r analysis",
    "downloading tweets helps with my r analysis"
  )
)
First, you want to combine your keywords into one regular expression that searches for any of the strings.
keyword_pattern <- paste0(
  "(",
  paste0(keyword_list, collapse = "|"),
  ")"
)
keyword_pattern
#> [1] "(crimes against women|downloading tweets|r analysis)"
Finally, we can add a column to the data frame that extracts the keyword from the tweet.
library(stringr)
tweets$keyword <- str_extract(tweets$tweet, keyword_pattern)
tweets
#> tweet keyword
#> 1 Andhra Pradesh is the highest state with crimes against women crimes against women
#> 2 I am downloading tweets downloading tweets
#> 3 I love r analysis r analysis
#> 4 downloading tweets helps with my r analysis downloading tweets
As the final example illustrates, you need to think about what you want to do when a tweet contains multiple keywords. In this case, the keyword returned is simply the first one found in the tweet. However, you can also use str_extract_all to return ALL keywords found in the tweet.
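For instance, a quick sketch of the str_extract_all variant (the all_keywords column is hypothetical, added here for illustration; it becomes a list column with one character vector per tweet):
tweets$all_keywords <- str_extract_all(tweets$tweet, keyword_pattern)
tweets$all_keywords[[4]]
#> [1] "downloading tweets" "r analysis"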
We can use stringr which is very handy for string operations and simply use str_extract, i.e.
str_extract(Tweet, Keyword)
#[1] "crimes against women"
For multiple keywords and multiple strings, collapse the keywords into a single pattern and apply it to each string, i.e.
Keyword <- c("crimes against women", "something")
Tweet <- c("Andhra Pradesh is the highest state with crimes against women",
           "another string with something else")
sapply(Tweet, function(i) str_extract(i, paste(Keyword, collapse = '|')))
# Andhra Pradesh is the highest state with crimes against women another string with something else
#                                         "crimes against women"                       "something"

Matching strings with str_detect and regex

I have a data frame containing unstructured text. In this reproducible example, I'm downloading a 10K company filing directly from the SEC website and loading it with read.table.
dir = getwd(); setwd(dir)
download.file("https://www.sec.gov/Archives/edgar/data/2648/0000002648-96-000013.txt", file.path(dir,"filing.txt"))
filing <- read.table(file=file.path(dir, "filing.txt"), sep="\t", quote="", comment.char="")
droplevels.data.frame(filing)
I want to remove the SEC header in order to focus on the main body of the document (starting in row 216) and divide my text into sections/items.
> filing$V1[216:218]
[1] PART I
[2] Item 1. Business.
[3] A. Organization of Business
Therefore, I'm trying to match strings starting with the word Item (or ITEM) followed by one or more spaces, one or two digits, a dot, one or more spaces and one or more words. For example:
Item 1. Business.
ITEM 1. BUSINESS
Item 1. Business
Item 10. Directors and Executive Officers of
ITEM 10. DIRECTORS AND EXECUTIVE OFFICERS OF THE REGISTRANT
My attempt involves str_detect and regex in order to create a variable count that jumps each time there is a string match.
library(dplyr)
library(stringr)
tidy_filing <- filing %>%
  mutate(count = cumsum(str_detect(V1, regex("^Item [\\d]{1,2}\\.", ignore_case = TRUE)))) %>%
  ungroup()
However, I'm missing the first 9 Items and my count starts only with Item 10.
tidy_filing[c(217, 218,251:254),]
V1 count
217 Item 1. Business. 0
218 A. Organization of Business 3 0
251 PART III 0
252 Item 10. Directors etc. 38 1
253 Item 11. Executive Compens. 38 2
254 Item 12. Security Ownership. 38 3
Any help would be highly appreciated.
The problem is that the single digit items have double spaces in order to align with the two digit ones. You can get round this by changing your regex string to
"^Item\\s+\\d{1,2}\\."

Find the most frequently occurring words in a text in R

Can someone help me with how to find the most frequently used two and three words in a text using R?
My text is...
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs\ but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
The tidytext package makes this sort of thing pretty simple:
library(tidytext)
library(dplyr)
tibble(text = text) %>%         # data_frame() is deprecated; tibble() is the current equivalent
  unnest_tokens(word, text) %>% # split words
  anti_join(stop_words) %>%     # take out "a", "an", "the", etc.
  count(word, sort = TRUE)      # count occurrences
# Source: local data frame [73 x 2]
#
# word n
# (chr) (int)
# 1 phrase 8
# 2 sentence 6
# 3 words 4
# 4 called 3
# 5 common 3
# 6 grammatical 3
# 7 meaning 3
# 8 alex 2
# 9 bird 2
# 10 complete 2
# .. ... ...
If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:
library(tokenizers)
tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>% # tokenize bigrams and trigrams
  as_data_frame() %>%       # structure
  count(value, sort = TRUE) # count
# Source: local data frame [531 x 2]
#
# value n
# (fctr) (int)
# 1 of the 5
# 2 a phrase 4
# 3 the sentence 4
# 4 as a 3
# 5 in the 3
# 6 may be 3
# 7 a complete 2
# 8 a phrase is 2
# 9 a sentence 2
# 10 a white 2
# .. ... ...
In natural language processing, 2-word phrases are referred to as "bigrams", 3-word phrases as "trigrams", and so forth. Generally, a given combination of n words is called an "n-gram".
First, we install the ngram package (available on CRAN)
# Install package "ngram"
install.packages("ngram")
Then, we will find the most frequent two-word and three-word phrases
library(ngram)
# To find all two-word phrases in the text "text":
ng2 <- ngram(text, n = 2)
# To find all three-word phrases in the text "text":
ng3 <- ngram(text, n = 3)
Finally, we will print the objects (ngrams) using various methods as below:
print(ng2, output = "truncated")
print(ng3, output = "full")
get.phrasetable(ng2)
ngram::ngram_asweka(text, min = 2, max = 3)
We can also use Markov Chains to babble new sequences:
# if we are using ng2 (bigram)
babble(ng = ng2, genlen = 2)
# if we are using ng3 (trigram)
babble(ng = ng3, genlen = 3)
We can split the words and use table to summarize the frequency:
words <- unlist(strsplit(text, "[ ,.\\(\\)\"]"))
sort(table(words, exclude = ""), decreasing = TRUE)
Simplest?
require(quanteda)
# bi-grams
topfeatures(dfm(text, ngrams = 2, verbose = FALSE))
## of_the a_phrase the_sentence may_be as_a in_the in_common phrase_is
## 5 4 4 3 3 3 2 2
## is_usually group_of
## 2 2
# for tri-grams
topfeatures(dfm(text, ngrams = 3, verbose = FALSE))
## a_phrase_is group_of_words of_a_sentence of_the_sentence for_example_in example_in_the
## 2 2 2 2 2 2
## in_the_sentence an_orange_bird orange_bird_with bird_with_a
##               2              2                2           2
Here's a simple base R approach for the 5 most frequent words:
head(sort(table(unlist(strsplit(gsub("[[:punct:]]", "", text), " "))), decreasing = TRUE), 5)
# a the of in phrase
# 21 18 12 10 8
What it returns is an integer vector of frequency counts, where the names of the vector are the words that were counted.
gsub("[[:punct:]]", "", text) to remove punctuation, since you don't want to count that, I guess
strsplit(gsub("[[:punct:]]", "", text), " ") to split the string on spaces, and unlist() to flatten the result into one character vector (table() can't handle the list that strsplit() returns)
table() to count unique elements' frequency
sort(..., decreasing = TRUE) to sort them in decreasing order
head(..., 5) to select only the top 5 most frequent words

Count how many times specific words are used

I want to perform text mining on several bank account descriptions. My first step would be to get a ranking of the words that are used most in the descriptions.
So let's say I have a dataframe that looks like this:
  a                       b
1 1          House expenses
2 2 Office furniture bought
3 3 Office supplies ordered
Then I want to create a ranking of the use of the words. Like this:
Name Times
1. Office 2
2. Furniture 1
Etc...
Any thoughts on how I can quickly get an overview of the words that are used most in the description?
Another way around this is using the tm package.
You can create a corpus:
require(tm)
# tm's DataframeSource expects columns named doc_id and text
corpus <- Corpus(DataframeSource(data.frame(doc_id = seq_len(nrow(data)), text = data$b)))
dtm <- DocumentTermMatrix(corpus)
# inspect() only prints the DTM; use as.matrix() to convert it
dtmDataFrame <- as.data.frame(as.matrix(dtm))
By default it computes plain term frequencies (tf, using "weightTf"). I converted the document-term matrix into a data frame.
Now what you have is a row per document and a column for each term, where each value is that term's frequency in that document, so you can create the ranking in a straightforward way by adding all values for each column:
colSums(dtmDataFrame)
You can sort it afterwards too, whatever. The good point of using tm is that you can easily filter words out and preprocess the corpus with a bunch of steps like stop-word removal, punctuation removal, stemming, and removing sparse words in case you need it.
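For example, here is a sketch of the sorted ranking plus a few of those preprocessing steps (applied to the corpus before building the document-term matrix):
# Rank terms by total frequency, most frequent first
sort(colSums(dtmDataFrame), decreasing = TRUE)
# Optional cleanup before DocumentTermMatrix():
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))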
d <- data.frame(
  a = c(1, 2, 3),
  b = c("House expenses", "Office furniture bought", "Office supplies ordered"),
  stringsAsFactors = FALSE
)
e <- unlist(strsplit(d$b, " "))
f <- e[!e %in% c("")]
g <- sapply(f, function(x) sum(f %in% c(x)))
h <- data.frame(Name = names(g), Times = g)
h[!duplicated(h), ]
