How to select part of a text in R

I have an HTML file which consists of 5 different articles, and I would like to extract each of these articles separately in R and run some analysis per article. Each article starts with <doc>, ends with </doc>, and also has a document number. Example:
<doc>
<docno> NA123455-0001 </docno>
<docid> 1 </docid>
<p>
NASA one-year astronaut Scott Kelly speaks after coming home to Houston on
March 3, 2016. Behind Kelly,
from left to right: U.S. Second Lady Jill Biden; Kelly's identical twin
brother, Mark;
John Holdren, Assistant to the President for Science and ...
</p>
</doc>
<doc>
<docno> KA25637-1215 </docno>
<docid> 65 </docid>
<date>
<p>
February 1, 2014, Sunday
</p>
</date>
<section>
<p>
WASHINGTON -- Former Republican presidential nominee Mitt Romney
is charging into the increasingly divisive 2016 GOP
White House sweepstakes Thursday with a harsh takedown of front-runner
Donald Trump, calling him a "phony" and exhorting fellow
</p>
</type>
</doc>
<doc>
<docno> JN1234567-1225 </docno>
<docid> 67 </docid>
<date>
<p>
March 5, 2003
</p>
</date>
<section>
<p>
SEOUL—New U.S.-led efforts to cut funding for North Korea's nuclear weapons
program through targeted
sanctions risk faltering because of Pyongyang's willingness to divert all
available resources to its
military, even at the risk of economic collapse ...
</p>
</doc>
I have read in the URL using the readLines() function and combined all the lines by using
articles<- paste(articles, collapse=" ")
I would like to select the first article, which sits between <doc>..</doc>, and assign it to article1, the second one to article2, and so on.
Could you please advise how to construct the function in order to select each one of these articles separately?

You could use strsplit, which splits strings on whatever text or regex you give it. It returns a list with one item for each part of the string between occurrences of the splitting string, which you can then subset into different variables, if you like. (You could use other regex functions as well, if you prefer.)
splitArticles <- strsplit(articles, '<doc>')
You'll still need to chop out the </doc> tags (plus a lot of other cruft, if you just want the text), but it's a start.
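A minimal sketch of that cleanup, assuming articles is the single collapsed string from the question:
splitArticles <- strsplit(articles, '<doc>')[[1]]
# Drop the empty fragment that precedes the first <doc>
splitArticles <- splitArticles[splitArticles != ""]
# Chop out the closing </doc> tags and anything after them
splitArticles <- gsub("</doc>.*$", "", splitArticles)
article1 <- splitArticles[1]
article2 <- splitArticles[2]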
A more typical way to do the same thing would be to use a package for html scraping/parsing. Using the rvest package, you'd need something like
library(rvest)
read_html(articles) %>% html_nodes('doc') %>% html_text()
which will give you a character vector of the contents of <doc> tags. It may take more cleaning, especially if there is stray whitespace to strip. Picking your selector for html_nodes carefully may help you avoid some of this; it looks like if you used p instead of doc, you would be more likely to get just the text.
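If you also want to keep each article's document number alongside its text, you can query the parsed document twice; a sketch, assuming the same articles string and that every <doc> contains exactly one <docno>:
library(rvest)
docs <- read_html(articles) %>% html_nodes('doc')
# Article bodies and their ids, in matching order
texts <- docs %>% html_text(trim = TRUE)
ids <- docs %>% html_nodes('docno') %>% html_text(trim = TRUE)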

The simplest solution is to use strsplit:
art_list <- unlist(strsplit(s, "<doc>"))
# Drop the empty fragment left before the first <doc>
art_list <- art_list[art_list != ""]
# Pull the <docid> value out of each fragment
ids <- trimws(gsub(".*<docid>|</docid>.*", "", art_list))
for (i in seq_along(art_list)) {
  # Strip the closing </doc> tag and assign to article_<id>
  assign(paste("article", ids[i], sep = "_"),
         gsub("</doc>.*", "", art_list[i]))
}
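A tidier alternative is to keep the articles in a named list rather than creating one variable per article with assign (same assumptions as above):
# Build a named list: articles_by_id[["article_1"]], articles_by_id[["article_65"]], ...
articles_by_id <- setNames(as.list(gsub("</doc>.*", "", art_list)),
                           paste("article", ids, sep = "_"))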

Related

Remove row if string contains more than one "#" using regular expression

I have a data frame with two columns. cnn_handle contains Twitter handles and tweet contains tweets where the Twitter handle in the corresponding row is mentioned. However, most tweets mention at least one other user/handle indicated by #. I want to remove all rows where a tweet contains more than one #.
df
cnn_handle tweet
1 #DanaBashCNN #JohnKingCNN #DanaBashCNN #kaitlancollins #eliehonig #thelauracoates #KristenhCNN CNN you are still FAKE NEWS !!!
2 #DanaBashCNN #DanaBashCNN He could have made the same calls here, from SC.
3 #DanaBashCNN #DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point. Also please refrain from showing a pic of him till you have one in his casket. thank you
4 #brianstelter #eliehonig #brianstelter My apologies to you sir. Just seems like that story disappeared. Imo the nursing home scandal is just as bad.
5 #brianstelter #DrAndrewBaer1 #JGreenblattADL #brianstelter #CNN #TuckerCarlson #FoxNews Anti-Semite are you, Herr Doktor? How very Mengele of you.
6 #brianstelter #ma_makosh #Shortguy1 #brianstelter #ChrisCuomo Liberals, their feelings before facts and their crucifixion of people before due process. Never a presumption of innocence when it concerns the rival party. So un-American.
7 #andersoncooper #BrendonLeslie And Biden was a staunch opponent of “forced busing”. He also said that integrating schools will cause a “racial jungle”. But u won’t hear this on #ChrisCuomo #jaketapper #Acosta #andersoncooper bc they continue to cover up the truth about Biden & his family.
8 #andersoncooper Anderson Cooper revealed that he "wanted a change" when reflecting on his break from news as #TheMole arrives on Netflix.
9 #andersoncooper #johnnydollar01 #newsbusters #drsanjaygupta #andersoncooper He was terrible as a host
I suspect some type of regular expression is needed. However, I am not sure how to combine it with a greater-than sign.
The desired result, i.e. tweets only mentioning the corresponding cnn_handle:
cnn_handle tweet
2 #DanaBashCNN #DanaBashCNN He could have made the same calls here, from SC.
3 #DanaBashCNN #DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point. Also please refrain from showing a pic of him till you have one in his casket. thank you
8 #andersoncooper Anderson Cooper revealed that he "wanted a change" when reflecting on his break from news as #TheMole arrives on Netflix.
A straightforward solution using str_count from stringr, which presupposes that # occurs only in Twitter handles:
base R subsetting:
library(stringr)
df[str_count(df$tweet, "#") <= 1, ]
dplyr:
library(dplyr)
library(stringr)
df %>%
filter(str_count(tweet, "#") <= 1)
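The same filter can also be written in package-free base R (a sketch; gregexpr locates the matches and lengths counts them, with no matches yielding zero):
hash_count <- lengths(regmatches(df$tweet, gregexpr("#", df$tweet)))
df[hash_count <= 1, ]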
Assuming your dataframe is called tweets, just check whether there is more than one match for # followed by text:
pattern <- "#[a-zA-Z.+]"
# TRUE when a tweet contains more than one match for the pattern
multiple_ats <- sapply(tweets$tweet, function(x) length(gregexpr(pattern, x)[[1]]) > 1)
tweets[!multiple_ats, ]
Output:
# A tibble: 3 x 2
cnn_handle tweet
<chr> <chr>
1 #DanaBashCNN "#DanaBashCNN He could have made the same calls here, from SC."
2 #DanaBashCNN "#DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point. Also please refrain from showing a pic of him till you have one in his casket. thank you"
3 #andersoncooper "Anderson Cooper revealed that he \"wanted a change\" when reflecting on his break from news as #TheMole arrives on Netflix."
Edit: You will have to change the pattern if Twitter user names are allowed to start with numbers or special characters. I don't know what the rules are.
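For what it's worth, Twitter usernames may only contain letters, digits, and underscores, with a maximum length of 15 characters, so a stricter pattern along those lines would be:
pattern <- "#[A-Za-z0-9_]{1,15}"  # letters, digits, underscore; up to 15 chars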

Using wildcards to filter out words between semantic tags with R

I have a dataset with a feature body, which holds all the text from an HTML file and includes semantic tags like so:
</strong>earned six Tony nominations this week, including one for Nyong'o (Best Actress in a Leading Role). <em>Eclipsed</em> is also significant for being the first Broadway play to feature a cast and creative team that is entirely black, female, and of African descent. (The play was written by Danai Gurira, who also plays Michonne on <em>The Walking Dead</em>.)</p> \n<p><!-- ######## BEGIN SNIPPET ######## -->
I would like to remove all text between semantic tags using wildcards. Is there a way to do so?
<!-- .--> My logic here is to remove the comment tag along with everything inside of it.
Supposing your data frame looks like this
df <- data.frame(text = '</strong>earned six Tony nominations this week, including one for Nyong\'o (Best Actress in a Leading Role). <em>Eclipsed</em> is also significant for being the first Broadway play to feature a cast and creative team that is entirely black, female, and of African descent. (The play was written by Danai Gurira, who also plays Michonne on <em>The Walking Dead</em>.)</p> \n<p><!-- ######## BEGIN SNIPPET ######## -->')
Then you could use
df$new_text <- gsub("<!--.*-->", "", df$text)
to get your desired output in a new column new_text.
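One caveat: .* is greedy, so if a string contains several comments, this would also delete the legitimate text between them. A lazy quantifier avoids that (a small variation on the same pattern):
# Lazy match: stop at the first --> after each <!--
df$new_text <- gsub("<!--.*?-->", "", df$text)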

How to stop readtext function adding extra quotes in R?

I want to read some Word documents into R and extract the sentences that are contained within quotation marks. When I use the readtext function from the readtext package, it adds extra quotes around the whole string of each article. Is there a way to change this?
library(readtext)
path <- "folder"
mydata <- readtext(paste0(path, "\\*.docx"))
mydata$text
quotes <- vector()
for (i in 1:2) {
  quotes[i] <- sub('.*?"([^"]+)"', "\\1", mydata$text[i])
}
Here's the content of both Word documents:
A Bengal tiger named India that went missing in the US state of Texas, has been found unharmed and now transferred to one of the animal shelter in Houston.
"We got him and he is healthy," said Houston Police Department (HPD) Major Offenders Commander Ron Borza.
A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK.
and this is what my current output looks like
[1] "We got him and he is healthy, said Houston Police Department (HPD) Major Offenders Commander Ron Borza."
[2] "A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK."
It looks like the issue is related to the different types of quotation marks, i.e. curly vs. straight. I can normalise them as follows:
# replace curly quotation marks with their straight equivalents
mydata$text <- gsub("[“”]", "\"", gsub("[‘’]", "'", mydata$text))

How do I parse a movie script for lines of dialogue that have consistent spacing with R?

'''
A stray SKATEBOARD clips her, causing her to stumble and
spill her coffee, as well as the contents of her backpack.
The young RIDER dashes over to help, trembling when he sees
who his board has hit.
RIDER
Hey -- sorry.
Cowering in fear, he attempts to scoop up her scattered
belongings.
KAT
Leave it
He persists.
KAT (continuing)
I said, leave it!
'''
I'm scraping some scripts that I want to do some text analysis with. I want to pull only dialogue from the scripts and it looks like it has a certain amount of spacing.
So for example, I want that line "Hey -- sorry.". I know that the indentation is 20 spaces and that it is consistent throughout the script. So how can I read in only that line and the rest that have equal spacing?
I want to say that I am going to use read.fwf, reading a fixed width.
What do you guys think?
I'm scraping from urls like this:
https://imsdb.com/scripts/10-Things-I-Hate-About-You.html
library(tidytext)
library(tidyverse)
text <- c("PADUA HIGH SCHOOL - DAY
Welcome to Padua High School, your typical urban-suburban
high school in Portland, Oregon. Smarties, Skids, Preppies,
Granolas. Loners, Lovers, the In and the Out Crowd rub sleep
out of their eyes and head for the main building.
PADUA HIGH PARKING LOT - DAY
KAT STRATFORD, eighteen, pretty -- but trying hard not to be
-- in a baggy granny dress and glasses, balances a cup of
coffee and a backpack as she climbs out of her battered,
baby blue '75 Dodge Dart.
A stray SKATEBOARD clips her, causing her to stumble and
spill her coffee, as well as the contents of her backpack.
The young RIDER dashes over to help, trembling when he sees
who his board has hit.
RIDER
Hey -- sorry.
Cowering in fear, he attempts to scoop up her scattered
belongings.
KAT
Leave it
He persists.
KAT (continuing)
I said, leave it!
She grabs his skateboard and uses it to SHOVE him against a
car, skateboard tip to his throat. He whimpers pitifully
and she lets him go. A path clears for her as she marches
through a pack of fearful students and SLAMS open the door,
entering school.
INT. GIRLS' ROOM - DAY
BIANCA STRATFORD, a beautiful sophomore, stands facing the
mirror, applying lipstick. Her less extraordinary, but
still cute friend, CHASTITY stands next to her.
BIANCA
Did you change your hair?
CHASTITY
No.
BIANCA
You might wanna think about it
Leave the girls' room and enter the hallway.
HALLWAY - DAY- CONTINUOUS
Bianca is immediately greeted by an admiring crowd, both
boys
and girls alike.
BOY
(adoring)
Hey, Bianca.
GIRL
Awesome shoes.
The greetings continue as Chastity remains wordless and
unaddressed by her side. Bianca smiles proudly,
acknowledging her fans.
GUIDANCE COUNSELOR'S OFFICE - DAY
CAMERON JAMES, a clean-cut, easy-going senior with an open,
farm-boy face, sits facing Miss Perky, an impossibly cheery
guidance counselor.")
names_stopwords <- c("^(rider|kat|chastity|bianca|boy|girl)")
text %>%
  as_tibble() %>%
  unnest_tokens(text, value, token = "lines") %>%
  filter(str_detect(text, "\\s{15,}")) %>%
  mutate(text = str_trim(text)) %>%
  filter(!str_detect(text, names_stopwords))
Output:
# A tibble: 9 x 1
text
<chr>
1 hey -- sorry.
2 leave it
3 i said, leave it!
4 did you change your hair?
5 no.
6 you might wanna think about it
7 (adoring)
8 hey, bianca.
9 awesome shoes.
You can include further character names in the names_stopwords vector.
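Rather than hand-listing the names, you could also harvest them from the script itself, since character cues are the all-caps lines (a sketch reusing the same text object; str_trim and str_detect come with the tidyverse):
# Candidate cue lines: trimmed lines made up solely of upper-case words
lines <- str_trim(unlist(strsplit(text, "\n")))
cues <- unique(lines[str_detect(lines, "^([A-Z]+\\s?)+$")])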
You can try the following:
library(magrittr) # for the %>% pipe
url <- 'https://imsdb.com/scripts/10-Things-I-Hate-About-You.html'
url %>%
  # Read the webpage line by line
  readLines() %>%
  # Remove '<b>' and '</b>' tags from the strings
  gsub('<b>|</b>', '', .) %>%
  # Keep only lines that begin with at least 20 whitespace characters
  grep('^\\s{20,}', ., value = TRUE) %>%
  # Trim leading and trailing whitespace
  trimws() %>%
  # Drop all-caps strings (character cues and scene headings)
  grep('^([A-Z]+\\s?)+$', ., value = TRUE, invert = TRUE)
#[1] "Hey -- sorry." "Leave it" "KAT (continuing)"
#[4] "I said, leave it!" "Did you change your hair?" "No."
#...
#...
I have tried cleaning this as much as possible, but it might require some more cleaning depending on what you actually want to extract.

Split character vector containing annual report into sentences

I have read Microsoft's filings for 2016 into R. Now I want to clean the file and split it into sentences. I have used the following code:
MSFT <- paste(readLines("https://www.sec.gov/Archives/edgar/data/789019/000156459017014900/0001564590-17-014900.txt"), collapse = " ")
Can someone help me?
This is one way that you can try:
MSFT <- paste(readLines("https://www.sec.gov/Archives/edgar/data/789019/000156459017014900/0001564590-17-014900.txt"), collapse = " ")
Remove everything from the text that is not within a body HTML tag (assumption: everything else is unwanted)
#Remove everything but body(s)
MSFT_body <- substr(MSFT, gregexpr("<body", MSFT)[[1]], gregexpr("</body", MSFT)[[1]])
Within the body, remove everything between < and > to get rid of HTML, CSS, and so on
#Remove all html tags and characters
MSFT_body_html_removed <- gsub("<.*?>|&[A-Za-z]+;|&#[0-9]+;", "", MSFT_body)
Replace all runs of whitespace (i.e. spaces, line breaks, tabs, ...) with a single space
#Remove all whitespace and replace with space
MSFT_body_html_removed <- gsub("\\s+", " ", MSFT_body_html_removed)
You can use the openNLP sentence tokeniser (pretrained) to find sentences:
#Define function to tokenise text to sentences
sentence_tokeniser <- openNLP::Maxent_Sent_Token_Annotator(language = "en")
#convert to String class
text <- NLP::as.String(MSFT_body_html_removed)
Use annotate to apply the tokeniser to the text
#Annotate text
annotated_sentences <- NLP::annotate(text, sentence_tokeniser)
Extract sentences
#extract sentences
sentences <- text[annotated_sentences]
Print first 5 sentences:
# print first 5 sentences
for (i in 1:5) {
  print(paste("Sentence", i))
  cat(paste(sentences[i], "\n"))
}
This will give you:
[1] "Sentence 1"
UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Fiscal Year Ended June 30, 2017 OR TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to Commission File Number 001-37845 MICROSOFT CORPORATION WASHINGTON 91-1144442 (STATE OF INCORPORATION) (I.R.S. ID) ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052-6399 (425) 882-8080 www.microsoft.com/investor
[1] "Sentence 2"
Securities registered pursuant to Section12(b) of the Act: COMMON STOCK, $0.00000625 par value per share NASDAQ Securities registered pursuant to Section12(g) of the Act: NONE Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.YesNo Indicate by check mark if the registrant is not required to file reports pursuant to Section13 or Section15(d) of the Exchange Act.YesNo Indicate by check mark whether the registrant (1)has filed all reports required to be filed by Section13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2)has been subject to such filing requirements for the past 90 days.YesNo Indicate by check mark whether the registrant has submitted electronically and posted on its corporate website, if any, every Interactive Data File required to be submitted and posted pursuant to Rule 405 of Regulat... <truncated>
[1] "Sentence 3"
Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company.
[1] "Sentence 4"
See the definitions of large accelerated filer, accelerated filer, smaller reporting company, and emerging growth company in Rule12b-2 of the Exchange Act.
[1] "Sentence 5"
Large accelerated filer Acceleratedfiler Non-acceleratedfiler (Donotcheckifasmallerreportingcompany) Smallerreportingcompany Emerging growth company If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act.
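If installing the Java-backed openNLP stack is a hurdle, the tokenizers package provides a lighter-weight sentence splitter; a sketch, assuming the cleaned MSFT_body_html_removed from above:
library(tokenizers)
# Split the cleaned text into sentences and flatten to a character vector
sentences <- unlist(tokenize_sentences(MSFT_body_html_removed))
head(sentences, 5)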
