Matching strings with str_detect and regex - r

I have a data frame containing unstructured text. In this reproducible example, I'm downloading a 10K company filing directly from the SEC website and loading it with read.table.
dir = getwd(); setwd(dir)
download.file("https://www.sec.gov/Archives/edgar/data/2648/0000002648-96-000013.txt", file.path(dir,"filing.txt"))
filing <- read.table(file=file.path(dir, "filing.txt"), sep="\t", quote="", comment.char="")
droplevels.data.frame(filing)
I want to remove the SEC header in order to focus on the main body of the document (starting in row 216) and divide my text into sections/items.
> filing$V1[216:218]
[1] PART I
[2] Item 1. Business.
[3] A. Organization of Business
Therefore, I'm trying to match strings starting with the word Item (or ITEM) followed by one or more spaces, one or two digits, a dot, one or more spaces and one or more words. For example:
Item 1. Business.
ITEM 1. BUSINESS
Item 1. Business
Item 10. Directors and Executive Officers of
ITEM 10. DIRECTORS AND EXECUTIVE OFFICERS OF THE REGISTRANT
My attempt involves str_detect and regex in order to create a variable count that jumps each time there is a string match.
library(dplyr)
library(stringr)
tidy_filing <- filing %>% mutate(count = cumsum(str_detect(V1, regex("^Item [\\d]{1,2}\\.",ignore_case = TRUE)))) %>% ungroup()
However, I'm missing the first 9 Items and my count starts only with Item 10.
tidy_filing[c(217, 218,251:254),]
V1 count
217 Item 1. Business. 0
218 A. Organization of Business 3 0
251 PART III 0
252 Item 10. Directors etc. 38 1
253 Item 11. Executive Compens. 38 2
254 Item 12. Security Ownership. 38 3
Any help would be highly appreciated.

The problem is that the single digit items have double spaces in order to align with the two digit ones. You can get round this by changing your regex string to
"^Item\\s+\\d{1,2}\\."

Related

Text column in R - Trying to count the keywords sequentially

I am working on a dataset that has a text column. The text has many sentences separated by a semi-colon ';'. I am trying to get a word count in a new column in dataframe for words that match my keyword. However, in one sentence, if there are repeated keywords, they should be considered only once.
For instance -
The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods
Solar panels, Tawian tariffs, trade
Trade issues impacting the solar industry
are the text in one column of the dataframe.
My keywords include solar, solar panels, section 201
I want to count the words in each sentence that match my keywords but if both or all words are there in the sentence, then it is counted only once. Word counts should only consider keywords in different sentences. If one sentence doesn't have a specific keyword, then move towards finding the second keyword.
My output should be -
word_count
2 (as section 201 is mentioned in both sentences, we do not search for solar because the first word in the keyword list matched)
1 (as only solar word is there)
1 (as only solar word is there)
Please suggest a way to resolve this issue. It is a crucial part of my research work. Thanks.
Kind Regards,
Preety
I think for each number in your example, you want to consider that a separate list item. You want to consider each list item as separate sentences wherever there is semi-colon. Then you want to look for the first occurrence of a keyword in that each list item. That becomes the target keyword for that list item. You then want to count the first occurrence only of that target keyword in each sentence within each list item:
library(dplyr)
library(stringr)
# I modified your example sentences to include "section 201" twice in sentence 2 in list item #1 to show it will only count once in that sentence
# I modified the order of your keywords, otherwise solar will be detected first in all sentences
sentences <- list("The section 201 solar trade case on cells and modules; Issues relating to section 201 section 201 tariffs on imported goods", "Solar panels, Tawian tariffs, trade", "Trade issues impacting the solar industry")
keywords <- c("section 201", "solar", "solar panels")
# Go through by each list item
lapply(sentences, function(sentence){
# For each list item, sentences, split into separate sentences by ;
# Also I changed each sentence to lowercase, otherwise solar != Solar
str_split(tolower(sentence), ";") -> split_string
# Apply each keyword and detect if the keyword is in the list_item using str_detect from stringr
sapply(keywords, function(keyword) str_detect(split_string, keyword)) -> output
# Output is the keywords that are in each list item
# Choose the first occurrence of a keyword: keywords[output == T][1]
# Detect in each sentence if the keyword is included, then sum the number of occurrences for each list item. Name as the keyword that was detected
setNames(str_detect(split_string[[1]], keywords[output == T][1]) %>% sum(), keywords[output == T][1])
})
Gives a list, with the first occurring keyword identified and the number of first occurrences of that keyword in each sentence within list item:
[[1]]
section 201
2
[[2]]
solar
1
[[3]]
solar
1
I would first split your text column into multiple columns using tidyr::separate(); then search for your key phrases within each of these, using stringr::str_detect() or base regex functions; then sum across columns using rowSums(dplyr::across()). Like so:
library(tidyverse)
keywords <- paste("solar", "section 201", sep = "|")
df <- tibble(
text = c(
"The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods",
"Solar panels; Tawian tariffs; trade",
"Trade issues impacting the solar industry"
)
)
df <- df %>%
separate(text, into = c("text1", "text2", "text3"), sep = ";", fill = "right") %>%
mutate(
count = rowSums(
across(text1:text3, ~ str_detect(str_to_lower(.x), keywords)),
na.rm = TRUE
)
)
The count column then contains the results you predicted:
df %>%
select(count)
# A tibble: 3 x 1
# count
# <dbl>
# 1 2
# 2 1
# 3 1
If the list of keywords is not too long (say 10 or 20 words), then you can look at the count of all the keywords for each text string. I am adding ; at the end of each text string so that a sentence always ends with a ;. The pattern paste0("[^;]*", key, "[^;]*;") identifies any sentence containing the word (stored in) key.
txt <- c("The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods",
"Solar panels, Tawian tariffs, trade",
"Trade issues impacting the solar industry")
keys <- c("section 201", "solar panels", "solar")
counts <- sapply(keys, function(key) stringr::str_count(paste0(txt, ";"), regex(paste0("[^;]*", key, "[^;]*;"), ignore_case = T)))
Next you can go over each row of counts and look at the first non-zero element which should be the value you are looking for.
sapply(1:nrow(counts), function(i) {
a <- counts[i, ]
a[a != 0][1]
})

Extracting N number of matches from a text string in R?

I am using stringr in R, and I have a string of text that lists titles of news articles. I want to extract these titles, but only the first N-number of titles that appear. In my example string of text, I have three article titles, but I only want to extract the first two.
How can I tell str_extract to only collect the first 2 titles? Thank you.
Here is my current code with the example texts.
library(stringr)
Here is the example text.
texting <- ("Time: Friday, September 14, 2018 4:34:00 PM EDT\r\nJob Number: 73591483\r\nDocuments (100)\r\n 1. U.S. Stocks Rebound Slightly After Tech-Driven Slump\r\n Client/Matter: -None-\r\n Search Terms: trade war or US-China trade or china tariff and not dealbook\r\n Search Type: Terms and Connectors\r\n Narrowed by:\r\n Content Type Narrowed by\r\n News Sources: The New York Times; Content Type: News;\r\n Timeline: Jan 01, 2018 to Dec 31, 2018\r\n 2. Shifting Strategy on Tariffs\r\n Client/Matter: -None-\r\n Search Terms: trade war or US-China trade or china tariff and not dealbook\r\n 100. Example")
titles.1 <- str_extract_all(texting, "\\d+\\.\\s.+")
titles.1
The current code brings back all three matches in the string:
[[1]]
[1] "1. U.S. Stocks Rebound Slightly After Tech-Driven Slump"
[2] "2. Shifting Strategy on Tariffs"
[3] "100. Example"
I only want it to collect the first two matches.
You can use the option simplify = TRUE to get a vector as result, rather than a list. Then, just pick the first N elements from the vector
titles.1 <- str_extract_all(texting, "\\d+\\.\\s.+", simplify = TRUE)[1:2]

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets which have mentions starting with '#', I need to extract all of them and save each mention in that particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to a form a string separated by spaces as mentioned earlier.
Thanks in advance.
I trust it would be best if you used an asis column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy sub setting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows whats inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a list of single CSV terms, then you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
Demo
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[[:alnum:]_]#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice

keeping the row number in a data frame column

I have a bunch of .txt files (articles) in a folder, I use a for cycle in order to get text from all of them on R
input_loc <- "C:/Users/User/Desktop/Folder"
files <- dir(input_loc, full.names = TRUE)
text <- c()
for (f in files) {
text <- c(text, paste(readLines(f), collapse = "\n"))
}
from here, I tokenize per paragraphs and I get each paragraph in each article:
paragraphs <- tokenize_paragraphs(text)
sapply(paragraphs, length)
paragraphs
then I unlist and transform into a dataframe
par_unlisted<-unlist(paragraphs)
par_unlisted
par_unlisted_df<-as.data.frame(par_unlisted)
BUT doing that I no longer have an inter-article separation of paragraph numbers (e.g. first article has 6 paragraphs, before unlisting the first paragraph of the second article would still have a [1] in front, while after unlisting it will have a [7]).
What I would like to do is, once I have the dataframe, having a column with the number of the paragraph, then create another column named "article" with the number of the article.
Thank You in advance
EDIT
this is roughly what I get once I get to paragraphs:
> paragraphs
[[1]]
[1] "The Miami Dolphins have decided to use their non-exclusive franchise
tag on wide receiver Jarvis Landry."
[2] "The Dolphins tweeted the announcement Tuesday, the first day teams
could use their franchise or transition tags. The salary for wide receivers
getting the franchise tag this offseason is expected to be around $16.2
million, which will be quite the raise for Landry, who made $894,000 last
season."
[[2]]
[1] "Despite months of little-to-no movement on contract negotiations,
Jarvis Landry has often stated his desire to stay in Miami."
[2] "The Dolphins used their lone tool to wipe away negotation-driven stress
-- at least in the immediate future -- and ensure Landry won't be lured away
from Miami, placing the franchise tag on the receiver on Tuesday, the team
announced."
I would want to keep the paragraph number ([n]) as a column in the dataframe, because when I unlist them they no longer stay separated per article and then per paragraph, but I get them in sequence, let's say (basically in the example I've just posted I no longer have
[[1]]
[1] ...
[2] ...
[[2]]
[1] ...
[2] ...
but I get
[1] ...
[2] ...
[3] ...
[4] ...
Consider iterating through the paragraphs list and build a list of dataframes with needed article and paragraph numbers with a final row bind through all dataframe elements.
Input Data
paragraphs <- list(
c("The Miami Dolphins have decided to use their non-exclusive franchise tag on wide receiver Jarvis Landry.",
"The Dolphins tweeted the announcement Tuesday, the first day teams could use their franchise or transition tags. The salary for wide receivers
getting the franchise tag this offseason is expected to be around $16.2 million, which will be quite the raise for Landry, who made $894,000 last
season."),
c("Despite months of little-to-no movement on contract negotiations, Jarvis Landry has often stated his desire to stay in Miami.",
"The Dolphins used their lone tool to wipe away negotation-driven stress -- at least in the immediate future -- and ensure Landry won't be lured away
from Miami, placing the franchise tag on the receiver on Tuesday, the team announced."))
Dataframe Build
df_list <- lapply(seq_along(paragraphs), function(i)
setNames(data.frame(i, 1:length(paragraphs[[i]]), paragraphs[[i]]),
c("article_num", "paragraph_num", "paragraph"))
)
final_df <- do.call(rbind, df_list)
Output Result
final_df
# article_num paragraph_num paragraph
# 1 1 1 The Miami Dolphins have decided to use their non-e...
# 2 1 2 The Dolphins tweeted the announcement Tuesday, the...
# 3 2 1 Despite months of little-to-no movement on contrac...
# 4 2 2 The Dolphins used their lone tool to wipe away neg...

row wise count the number of the words in a review text in an R dataframe

I want to count the number of words in each row:
Review_ID Review_Date Review_Content Listing_Title Star Hotel_Name
1 1/25/2016 I booked both the Crosby and Four Seasons but decided to cancel the Four Seasons closer to the arrival date based on reviews. Glad I did. The Crosby is an outstanding hotel. The rooms are immaculate and luxurious, with real attention to detail and none of the bland furnishings you find in even the top chain hotels. Staff on the whole were extremely attentive and seemed to enjoy being there. Breakfast was superb and facilities at ground level gave an intimate and exclusive feel to the hotel. It's a fairly expensive place to stay but is one of those hotels where you feel you're getting what you pay for, helped by an excellent location. Hope to be back! Outstanding 5 Crosby Street Hotel
2 1/18/2016 We've stayed many times at the Crosby Street Hotel and always have an incredible, flawless experience! The staff couldn't be more accommodating, the housekeeping is immaculate, the location's awesome and the rooms are the coolest combination of luxury and chic. During our most recent trip over The New Years holiday, we stayed in the stunning Crosby Suite which has the most extraordinary, gorgeous decor. The Crosby remains our absolute favorite in NYC. Can't wait to return! Always perfect! 5 Crosby Street Hotel
I was thinking something like:
WordFreqRowWise %>%
rowwise() %>%
summarise(n = n())
To get the results something like..
Review_ID Review_Content total_Words Min_occrd_word Max Average
1 .... 230 great: 1 the: 25 total_unique/total_words in the row
But do not have idea, how can I do it....
Here is a method in base R using strsplit and sapply. Let's say the data is stored in a data.frame df and the reviews are stored in the variable Review_Content
# break up the strings in each row by " "
temp <- strsplit(df$Review_Content, split=" ")
# count the number of words as the length of the vectors
df$wordCount <- sapply(temp, length)
In this instance, sapply will return a vector of the counts for each row.
Since the word count is now an object, you can perform analysis you want on it. Here are some examples:
summarize the distribution of word counts: summary(df$wordCount)
maximum word count: max(df$wordCount)
mean word count: mean(df$wordCount)
range of word counts: range(df$wordCount)
interquartile range of word counts: IQR(df$wordCount)
Adding to #lmo's answer above..
Below code will generate a dataframe that consists of all the words, row-wise, and their frequencies:
temp2 <- data.frame()
for (i in 1:length(temp)){
temp1 <- as.data.frame(table(temp[[i]]))
temp1$ID <- paste0("Row_", i)
temp2 <- rbind(temp2, temp1)
temp1 <- NULL
}

Resources