Extracting elements from text files in R

I am trying to get into text analysis in R. I have a text file with the following structure.
HD A YEAR Oxxxx
WC 244 words
PD 28 February 2018
SN XYZ
SC hydt
LA English
CY Copyright 2018
LP Rio de Janeiro, Feb 28
TD
   With recreational cannabis only months away from legalization in Canada, companies are racing to
   prepare for the new market. For many, this means partnerships, supply agreements,
I want to extract the PD and TD elements in R and save them into a table.
I have tried the following, but I am unable to get it right.
Extract PD
library(stringr)
library(tidyverse)
pd <- unlist(str_extract_all(txt, "\\bPD\\b\t[0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s"))
pd <- str_replace_all(pd, "\\bPD\\b\t", "")
if (length(pd) == 0) {
  pd <- as.character(NA)
}
pd <- str_trim(pd)
pd <- as.Date(strptime(pd, format = "%d %B %Y"))
Extract TD
td <- unlist(str_extract_all(txt, "\\bTD\\b[\\t\\s]*?.+?\\bCO\\b"))
td <- str_replace_all(td, "\\bTD\\b[\\t\\s]+?", "")
td <- str_replace_all(td, "\\bCO\\b", "")
td <- str_replace_all(td, "\\s+", " ")
if (length(td) == 0) {
  td <- as.character(NA)
}
I would like a table as follows:
PD                TD
28 February 2018  With recreational cannabis only months away from
                  legalization in Canada, companies are racing to
                  prepare for the new market. For many, this means
                  partnerships, supply agreements, Production hit a
                  record 366.5Mt
Any help would be appreciated. Thank you

I had to add a few characters to the end of your data set, which I inferred from your regexes:
txt <- "HD A YEAR Oxxxx
WC 244 words
PD 28 February 2018
SN XYZ
SC hydt
LA English
CY Copyright 2018
LP Rio de Janeiro, Feb 28
TD
   With recreational cannabis only months away from legalization in Canada, companies are racing to
   prepare for the new market. For many, this means partnerships, supply agreements,
CO ...further stuff"
Dirty
The dirty solution to your problems is probably:
For the date field, fix the regex so that it expects an arbitrary whitespace character rather than a tab after the PD tag. E.g. \\bPD\\b [0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s works for me.
For the TD field, let . also match newlines by using the dotall option (see ?stringr::regex):
td <- unlist(str_extract_all(txt, regex("\\bTD\\b[\\t\\s]*?.+?\\bCO\\b", dotall=TRUE)))
Maybe shorter regexes are better?
However, I would recommend capturing the characteristics of your input format only as fine-grained as needed. For example, I would not check the date format via a regex. Just search for "^PD.*" and let R try to parse the result; it will complain anyway if the format does not match.
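Something along these lines (a sketch; it assumes an English locale so that the month name parses):
pd_line <- str_extract(txt, regex("^PD .*", multiline = TRUE))
pd <- as.Date(str_remove(pd_line, "^PD "), format = "%d %B %Y")
pd
# [1] "2018-02-28"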
To filter for a text block which starts with multiple spaces like after the TD marker, you can use the multiline= option to use ^ to match every (not only the first) line beginning. E.g.
str_extract_all(txt, regex("^TD\\s+(^\\s{3}.*\\n)+", multiline = TRUE))
(note that the regex class \s comprises \n so I do not need to specify that explicitly after matching the TD line)
Careful if fields are missing
Finally, your current approach might assign the wrong dates to the texts if one of the TD or PD fields is ever missing in the input! A for loop in combination with readLines instead of regex matching might help for this.
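A minimal sketch (the file name is a placeholder; I assume one document per file and that any line starting with a two-letter uppercase tag ends the TD block):
lines <- readLines("article.txt")

pd <- NA_character_
td_lines <- character(0)
in_td <- FALSE

for (line in lines) {
  if (grepl("^PD\\s", line)) {
    pd <- trimws(sub("^PD\\s+", "", line))
    in_td <- FALSE
  } else if (grepl("^TD\\b", line)) {
    in_td <- TRUE
  } else if (grepl("^[A-Z]{2}\\s", line)) {
    in_td <- FALSE  # any other field tag (e.g. CO) ends the TD block
  } else if (in_td) {
    td_lines <- c(td_lines, trimws(line))
  }
}

result <- data.frame(
  PD = as.Date(pd, format = "%d %B %Y"),
  TD = if (length(td_lines) > 0) paste(td_lines, collapse = " ") else NA_character_
)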

Related

Extract text from CSV in R

I have an Excel .CSV file in which one column has the transcription of a conversation. Whenever the speaker uses Spanish, the Spanish is written within brackets.
One example sentence:
so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day
Ideally, I'd like to extract the English and Spanish separately, so one file would contain all the Spanish words, and another would contain all the English words.
Any ideas on how to do this? Or which function/package to use?
Edited to add: there are about 100 cells that contain text in this Excel sheet. I guess where I'm confused is: how do I treat this entire CSV as a "string"?
I don't want to copy and paste every cell as a "string" -- I was hoping I could somehow just upload the entire CSV.
To load the CSV into R, you could use readr::read_csv("YOUR_FILE.csv"). There are more options, some of which are available to you if you use the "File -- Import Dataset -- From Text (readr)" menu option in RStudio.
Supposing you have the data loaded, you will likely need to rely on some form of "regex" to parse the text into sections based on the brackets. There are some base R functions for this, but I find the functions in stringr (part of the tidyverse meta-package) to be useful for this. And tidyr::separate_rows is a nice way to split the text into more lines.
In the regex below, there are a few ingredients:
(?=...) means to split before the [ but to keep it.
\\[ is how we refer to [ because brackets have special meaning in regex so we need to "escape" them to treat them as a literal character.
(?<=...) means to split after the ] but keep it.
| in the final pattern ("\\[|\\]") means "or".
(Granted, I'm still a regex beginner, so I expect there are more concise ways to do this.)
So we could do something like:
df1 <- data.frame(text = "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day")
library(tidyverse)
df1 %>%
  mutate(orig_row = row_number()) %>%
  separate_rows(text, sep = "(?=\\[)") %>%
  separate_rows(text, sep = "(?<=\\] )") %>%
  mutate(language = if_else(str_detect(text, "\\[|\\]"), "Espanol", "English"),
         text = str_remove_all(text, "\\[|\\]"))
Result
# A tibble: 5 × 3
text orig_row language
<chr> <int> <chr>
1 "so " 1 English
2 "usualmente " 1 Espanol
3 "maybe " 1 English
4 "me levanto como a las nueve y media " 1 Espanol
5 "like I exercise and the I like either go to class online or in person like it depends on the day" 1 English

Trying to scrape a playoff bracket from Wikipedia using Beautiful Soup. How do I identify the correct columns?

I'm trying to scrape the NHL playoff bracket from Wikipedia, for the years 1988 on, using Beautiful Soup 4 in Python. Inconsistent formatting (sometimes there is more than one team on a row; see https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs) makes this hard. I would like to identify the Team, Round, and Number of Games Won for every series in that year.
Initially, I converted the table to text and used regular expressions to identify the teams and the information, but the ordering shifts depending on whether the brackets allow more than one team per row or not.
Now I'm trying to work my way down the rows and count things like the number of cells/column spans, but the results are inconsistent. I'm missing how the 4th-round teams are identified.
What I have so far is an attempt to count the number of cells before a cell with a team is reached...
import requests
from bs4 import BeautifulSoup as soup

hockeyteams = ['Anaheim', 'Arizona', 'Atlanta', 'Boston', 'Buffalo', 'Calgary', 'Carolina',
               'Chicago', 'Colorado', 'Columbus', 'Dallas', 'Detroit', 'Edmonton', 'Florida',
               'Hartford', 'Los Angeles', 'Minnesota', 'Montreal', 'Nashville', 'New Jersey',
               'Ottawa', 'Philadelphia', 'Pittsburgh', 'Quebec', 'San Jose', 'St. Louis',
               'Tampa Bay', 'Toronto', 'Vancouver', 'Vegas', 'Washington', 'Winnipeg',
               'NY Rangers', 'NY Islanders']

# fetch the content from the url
page_response = requests.get(full_link, timeout=5)
# use the html parser to parse the page
page_content = soup(page_response.content, "html.parser")
tables = page_content.find_all('table')

# identify the appropriate table
for table in tables:
    if ('Semi' in table.text) & ('Stanley Cup Finals' in table.text):
        bracket = table
        break

row_num = 0
for row in bracket.find_all('tr'):
    row_num += 1
    print(row_num, '#')
    colcnt = 0
    for col in row.find_all('td'):
        if "colspan" in col.attrs:
            colcnt += int(col.attrs['colspan'])
        else:
            colcnt += 1
        if col.text.strip(' \n') in str(hockeyteams):
            print(colcnt, col.text)
    print('col width:', colcnt)
Ultimately I'd like something like a dataframe that has:
Round, Team A, Team A Wins, Team B, Team B Wins
1, Tampa Bay, 4, NY Islanders, 1
2, Tampa Bay, 4, Montreal, 0
etc.
That table can be scraped with pandas:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs#Playoff_bracket')
bracket = tables[2].dropna(axis=1, how='all').dropna(axis=0, how='all')
print(bracket)
The output is full of NaNs, but it has what I think you're looking for and you can modify it using standard pandas methods.

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets with mentions starting with '#'. I need to extract all of them and save the mentions in each tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to form a string separated by spaces, as mentioned earlier?
Thanks in advance.
I trust it would be best if you used an AsIs list column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a single comma-separated string per tweet, you may try using sapply for a base R option (use collapse = " " instead if you want the space-separated form from your question):
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice.
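For example, a rough sketch of applying that pattern with stringr (the sample strings are made up):
library(stringr)

pattern <- "(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b"
x <- c("congrats #DineshKarthik", "see http://example.com/#section for details")
m <- str_match_all(x, pattern)
# the second capture group holds the name without the leading '#'
lapply(m, function(g) g[, 3])
# [[1]]
# [1] "DineshKarthik"
#
# [[2]]
# character(0)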

Text summarization in R language

I have a long text file. With the help of R, I want to summarize the text in 10 to 20 lines, or in a few short sentences.
How can I summarize a text in about 10 lines with R?
You may try this (from the LSAfun package):
genericSummary(D,k=1)
whereby 'D' specifies your text document and 'k' the number of sentences to be used in the summary. (Further modifications are shown in the package documentation).
For more information:
http://search.r-project.org/library/LSAfun/html/genericSummary.html
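For example, a minimal sketch (the text is made up; genericSummary splits it into sentences and returns the k most representative ones):
library(LSAfun)

D <- paste("Recreational cannabis will soon be legal in Canada.",
           "Companies are racing to prepare for the new market.",
           "Many firms are signing partnerships and supply agreements.",
           "Analysts expect rapid growth in the first year.")

genericSummary(D, k = 2)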
There's a package called lexRankr that summarizes text in the same way that Reddit's /u/autotldr bot summarizes articles. This article has a full walkthrough on how to use it but just as a quick example so you can test it yourself in R:
#load needed packages
library(xml2)
library(rvest)
library(lexRankr)
#url to scrape
monsanto_url = "https://www.theguardian.com/environment/2017/sep/28/monsanto-banned-from-european-parliament"
#read page html
page = xml2::read_html(monsanto_url)
#extract text from page html using selector
page_text = rvest::html_text(rvest::html_nodes(page, ".js-article__body p"))
#perform lexrank for top 3 sentences
top_3 = lexRankr::lexRank(page_text,
                          # only 1 article; repeat same docid for all of input vector
                          docId = rep(1, length(page_text)),
                          # return 3 sentences to mimic /u/autotldr's output
                          n = 3,
                          continuous = TRUE)
#reorder the top 3 sentences to be in order of appearance in article
order_of_appearance = order(as.integer(gsub("_","",top_3$sentenceId)))
#extract sentences in order of appearance
ordered_top_3 = top_3[order_of_appearance, "sentence"]
> ordered_top_3
[1] "Monsanto lobbyists have been banned from entering the European parliament after the multinational refused to attend a parliamentary hearing into allegations of regulatory interference."
[2] "Monsanto officials will now be unable to meet MEPs, attend committee meetings or use digital resources on parliament premises in Brussels or Strasbourg."
[3] "A Monsanto letter to MEPs seen by the Guardian said that the European parliament was not “an appropriate forum” for discussion on the issues involved."

Text Mining with R

I need help with text mining using R. My data looks like this:
Title    Date           Content
Boy      May 13 2015    "She is pretty", Tom said. Tom is handsome.
Animal   June 14 2015   The penguin is cute, lion added.
Human    March 09 2015  Mr Koh predicted that every human is smart...
Monster  Jan 22 2015    Ms May, a student, said that John has $10.80. May loves you.
I just want to get the opinions from what the people said.
I would also like help with extracting percentages (e.g. 9.8%): when I split the sentences on full stops ("."), I get "His result improved by 0." instead of "His result improved by 0.8%".
Below is the output that I would like to obtain:
Title    Date           Content
Boy      May 13 2015    she is pretty
Animal   June 14 2015   the penguin is cute
Human    March 09 2015  every human is smart
Monster  Jan 22 2015    john has $10.80
Below is the code that I tried, but it didn't give the desired output:
list <- c("said", "added", "predicted")
pattern <- paste (list, collapse = "|")
dataframe <- stack(setNames(lapply(strsplit(dataframe, '(?<=[.])', perl=TRUE), grep, pattern = pattern, value = TRUE), dataframe$Title))[2:1]
You're close, but your regular expression for splitting is wrong. This gave the correct arrangement for the data, modulo your request to extract opinions more exactly:
txt <- '
Title    Date           Content
Boy      May 13 2015    "She is pretty", Tom said. Tom is handsome.
Animal   June 14 2015   The penguin is cute, lion added.
Human    March 09 2015  Mr Koh predicted that every human is smart...
Monster  Jan 22 2015    Ms May, a student, said that John has $10.80. May loves you.
'
txt <- gsub(" {2,}(?=\\S)", "|", txt, perl = TRUE)
dataframe <- read.table(sep = "|", text = txt, header = TRUE, stringsAsFactors = FALSE)
list <- c("said", "added", "predicted")
pattern <- paste (list, collapse = "|")
content <- strsplit(dataframe$Content, '\\.(?= )', perl=TRUE)
opinions <- lapply(content, grep, pattern = pattern, value = TRUE)
names(opinions) <- dataframe$Title
result <- stack(opinions)
In your sample data, all full stops followed by spaces are sentence-ending, so that's what the regular expression \.(?= ) matches. This also takes care of your percentage problem: the first "." in "0.8%" is followed by a digit, not a space, so it is not treated as a sentence boundary. However, that rule will break up sentences like "I was born in the U.S.A. but I live in Canada", so you might have to do additional pre-processing and checking.
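For instance, a quick illustration of that failure mode:
strsplit("I was born in the U.S.A. but I live in Canada.", "\\.(?= )", perl = TRUE)
# [[1]]
# [1] "I was born in the U.S.A" " but I live in Canada."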
Then, assuming the Titles are unique identifiers, you can just merge to add the dates back in (stack stores the titles in a column named ind, hence by.y):
result <- merge(dataframe[c("Title", "Date")], result, by.x = "Title", by.y = "ind")
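That should give output roughly like this (note that read.table's default quote handling strips the inner double quotes around She is pretty):
    Title          Date                                        values
1  Animal  June 14 2015              The penguin is cute, lion added.
2     Boy   May 13 2015                       She is pretty, Tom said
3   Human March 09 2015 Mr Koh predicted that every human is smart...
4 Monster   Jan 22 2015  Ms May, a student, said that John has $10.80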
As mentioned in the comments, the NLP task itself has more to do with text parsing than R programming. You can probably get some mileage out of searching for a pattern like
<optional adjectives> <noun> <verb> <optional adverbs> <adjective> <optional and/or> <optional adjective> ...
which would match your sample data, but I'm far from an expert here. You'd also need a dictionary with lexical categories. A Google search for "extract opinion text" yielded a lot of helpful results on the first page, including this site run by Bing Liu. From what I can tell, Professor Liu literally wrote the book on sentiment analysis.
