I have read Microsoft's filings for 2016 into R. Now I want to clean the file, and split it into sentences. I have used the following code:
MSFT <- paste(readLines("https://www.sec.gov/Archives/edgar/data/789019/000156459017014900/0001564590-17-014900.txt"), collapse = " ")
Can someone help me?
This is one way that you can try:
MSFT <- paste(readLines("https://www.sec.gov/Archives/edgar/data/789019/000156459017014900/0001564590-17-014900.txt"), collapse = " ")
Remove everything from the text that is not inside the HTML body tag (assumption: everything else is unwanted):
#Remove everything but body(s)
MSFT_body <- substr(MSFT, gregexpr("<body", MSFT)[[1]], gregexpr("</body", MSFT)[[1]])
Within the body, remove everything between < and > to strip HTML tags, CSS, and so on, plus HTML entities:
# Remove all HTML tags and entities
MSFT_body_html_removed <- gsub("<.*?>|&[A-Za-z]+;|&#[0-9]+;", "", MSFT_body)
Replace every run of whitespace (spaces, line breaks, tabs, ...) with a single space:
# Collapse all whitespace runs into a single space
MSFT_body_html_removed <- gsub("\\s+", " ", MSFT_body_html_removed)
You can use the openNLP sentence tokeniser (pretrained) to find sentences:
#Define function to tokenise text to sentences
sentence_tokeniser <- openNLP::Maxent_Sent_Token_Annotator(language = "en")
#convert to String class
text <- NLP::as.String(MSFT_body_html_removed)
Use annotate to apply the tokeniser to the text
#Annotate text
annotated_sentences <- NLP::annotate(text, sentence_tokeniser)
Extract sentences
#extract sentences
sentences <- text[annotated_sentences]
Print first 5 sentences:
# print first 5 sentences
for (i in 1:5) {
  print(paste("Sentence", i))
  cat(paste(sentences[i], "\n"))
}
This will give you:
[1] "Sentence 1"
UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Fiscal Year Ended June 30, 2017 OR TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to Commission File Number 001-37845 MICROSOFT CORPORATION WASHINGTON 91-1144442 (STATE OF INCORPORATION) (I.R.S. ID) ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052-6399 (425) 882-8080 www.microsoft.com/investor
[1] "Sentence 2"
Securities registered pursuant to Section12(b) of the Act: COMMON STOCK, $0.00000625 par value per share NASDAQ Securities registered pursuant to Section12(g) of the Act: NONE Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.YesNo Indicate by check mark if the registrant is not required to file reports pursuant to Section13 or Section15(d) of the Exchange Act.YesNo Indicate by check mark whether the registrant (1)has filed all reports required to be filed by Section13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2)has been subject to such filing requirements for the past 90 days.YesNo Indicate by check mark whether the registrant has submitted electronically and posted on its corporate website, if any, every Interactive Data File required to be submitted and posted pursuant to Rule 405 of Regulat... <truncated>
[1] "Sentence 3"
Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company.
[1] "Sentence 4"
See the definitions of large accelerated filer, accelerated filer, smaller reporting company, and emerging growth company in Rule12b-2 of the Exchange Act.
[1] "Sentence 5"
Large accelerated filer Acceleratedfiler Non-acceleratedfiler (Donotcheckifasmallerreportingcompany) Smallerreportingcompany Emerging growth company If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act.
I have a data frame with two columns. cnn_handle contains Twitter handles and tweet contains tweets where the Twitter handle in the corresponding row is mentioned. However, most tweets mention at least one other user/handle indicated by #. I want to remove all rows where a tweet contains more than one #.
df
cnn_handle tweet
1 #DanaBashCNN #JohnKingCNN #DanaBashCNN #kaitlancollins #eliehonig #thelauracoates #KristenhCNN CNN you are still FAKE NEWS !!!
2 #DanaBashCNN #DanaBashCNN He could have made the same calls here, from SC.
3 #DanaBashCNN #DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point. Also please refrain from showing a pic of him till you have one in his casket. thank you
4 #brianstelter #eliehonig #brianstelter My apologies to you sir. Just seems like that story disappeared. Imo the nursing home scandal is just as bad.
5 #brianstelter #DrAndrewBaer1 #JGreenblattADL #brianstelter #CNN #TuckerCarlson #FoxNews Anti-Semite are you, Herr Doktor? How very Mengele of you.
6 #brianstelter #ma_makosh #Shortguy1 #brianstelter #ChrisCuomo Liberals, their feelings before facts and their crucifixion of people before due process. Never a presumption of innocence when it concerns the rival party. So un-American.
7 #andersoncooper #BrendonLeslie And Biden was a staunch opponent of “forced busingâ€. He also said that integrating schools will cause a “racial jungleâ€. But u won’t hear this on #ChrisCuomo #jaketapper #Acosta #andersoncooper bc they continue to cover up the truth about Biden & his family.
8 #andersoncooper Anderson Cooper revealed that he "wanted a change" when reflecting on his break from news as #TheMole arrives on Netflix.
9 #andersoncooper #johnnydollar01 #newsbusters #drsanjaygupta #andersoncooper He was terrible as a host
I suspect some type of regular expression is needed. However, I am not sure how to combine it with a greater-than sign.
The desired result, i.e. tweets mentioning only the corresponding cnn_handle:
cnn_handle tweet
2 #DanaBashCNN #DanaBashCNN He could have made the same calls here, from SC.
3 #DanaBashCNN #DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point. Also please refrain from showing a pic of him till you have one in his casket. thank you
8 #andersoncooper Anderson Cooper revealed that he "wanted a change" when reflecting on his break from news as #TheMole arrives on Netflix.
A straightforward solution using str_count from stringr, which presupposes that # occurs only in Twitter handles:
base R:
library(stringr)
df[str_count(df$tweet, "#") <= 1, ]
dplyr:
library(dplyr)
library(stringr)
df %>%
  filter(!str_count(tweet, "#") > 1)
Assuming your dataframe is called tweets, just check to see if there is more than one match for # followed by text:
pattern <- "#[a-zA-Z.+]"
multiple_ats <- unlist(lapply(tweets$tweet, function(x) length(gregexpr(pattern, x)[[1]])>1))
tweets[!multiple_ats,]
Output:
# A tibble: 3 x 2
cnn_handle tweet
<chr> <chr>
1 #DanaBashCNN "#DanaBashCNN He could have made the same calls here, from SC."
2 #DanaBashCNN "#DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point.,Also please refrain from showing a pic of him till you have one in his casket.,thank you"
3 #andersoncooper "Anderson Cooper revealed that he \"wanted a change\" when reflecting on his break from news as #TheMole arrives on Netflix."
Edit: You will have to change the pattern if Twitter user names are allowed to start with numbers or special characters. I don't know what the rules are.
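If they are, a slightly broader character class can follow the marker, for example (a sketch only; Twitter handles typically allow letters, digits, and underscores, but check the current rules):
pattern <- "#[A-Za-z0-9_]"  # marker followed by a letter, digit, or underscore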
I want to read some Word documents into R and extract the sentences that are contained within quotation marks. When I use the readtext function from the readtext package, it adds extra quotes around the whole string of each article. Is there a way to change this?
path <- "folder"
mydata <- readtext(paste0(path, "\\*.docx"))
mydata$text
quotes <- vector()
for (i in 1:2) {
  quotes[i] <- sub('.*?"([^"]+)"', "\\1", mydata$text[i])
}
Here's the content of both Word documents:
A Bengal tiger named India that went missing in the US state of Texas, has been found unharmed and now transferred to one of the animal shelter in Houston.
"We got him and he is healthy," said Houston Police Department (HPD) Major Offenders Commander Ron Borza.
A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK.
and this is what my current output looks like
[1] "We got him and he is healthy, said Houston Police Department (HPD) Major Offenders Commander Ron Borza."
[2] "A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK."
It looks like the issue is related to the different types of quotation marks, i.e. curly vs straight. I can normalise them as follows:
# Replace curly quotation marks with their straight equivalents
mydata$text <- gsub("[‘’]", "'", mydata$text)
mydata$text <- gsub("[“”]", "\"", mydata$text)
I am currently trying to pull data from a pdf using the str_match function which is working well. This is an example:
values[[18]] <- str_match(Sprout_textNoLines, "Business Description: (.*?) Renter or Owned:")[,2]
Sprout_textNoLines is just a paragraph of all the characters in the pdf, not separated by lines. The particular case that I'm parsing here is
Business Description: Federal and State Construction Renter or Owned:
The str_match call that I showed earlier returns "Federal and State Construction", which is exactly what I need. However, I am finding cases where some of the pdfs are different and the inputs on the lines aren't separated by a space, for example:
Business Description:Federal and State Construction Renter or Owned:
There is no space between Description: and Federal here, so the earlier call returns NA because the pattern "Business Description: (.*?) Renter or Owned:" expects a space after the colon. I need to automate this process, so is there a regex that could accomplish something similar to
values[[18]] <- str_match(Sprout_textNoLines, "Business Description: (.*?) Renter or Owned:")[,2]
but with adding regex to the (.*?) to account for variability in the amount of spaces between the string that I want to pull and the strings that precede and follow it?
You may use
str_match(Sprout_textNoLines, "Business Description:\\s*(.*?)\\s*Renter or Owned:")[,2]
The changed part is \s*(.*?)\s*: it matches zero or more whitespace characters (\s*), then captures as few characters as possible other than line break characters, and then matches zero or more whitespace characters again.
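A quick check using the two example strings from the question (a sketch, not from the original answer) shows that the same pattern handles both spacings:
library(stringr)
x1 <- "Business Description: Federal and State Construction Renter or Owned:"
x2 <- "Business Description:Federal and State Construction Renter or Owned:"
str_match(c(x1, x2), "Business Description:\\s*(.*?)\\s*Renter or Owned:")[, 2]
# Both elements should be "Federal and State Construction"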
I have a .txt file that contains multiple newspaper articles. Each article has a headline, the author name, etc. I want to read the whole .txt file into R and remove every line that starts with certain words, plus the next 5 lines. I think gsub + a regular expression might be the solution, but I do not know how to write the pattern so that not only the line containing these words is deleted but also the next 5 lines.
Edit:
The txt. file consists of 200 Washington Post articles. Each article ends with:
lydia.depillis#washpost.com
LOAD-DATE: July 14, 2013
LANGUAGE: ENGLISH
PUBLICATION-TYPE: Web Publication
Copyright 2013 Washingtonpost.Newsweek Interactive Company, LLC d/b/a Washington
Post Digital
All Rights Reserved
4 of 200 DOCUMENTS
Washington Post Blogs
In the Loop
June 28, 2013 Friday 3:08 PM EST
Whenever an e-mail address appears, I want to delete everything until the line where a date appears so that there is a smooth transition to the next article. I want to run a sentiment analysis and thus don't need these lines.
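A minimal sketch of the original ask (drop each line containing an e-mail address plus the five lines after it) could look like the following; the file name is hypothetical and the pattern accepts # in place of @, as in the sample above:
lines <- readLines("articles.txt")  # hypothetical file name
hits  <- grep("[[:alnum:]._-]+[@#][[:alnum:].-]+\\.[A-Za-z]{2,}", lines)
drop  <- unique(unlist(lapply(hits, function(i) i:min(i + 5, length(lines)))))
clean <- if (length(drop)) lines[-drop] else lines
Deleting from the e-mail line up to the next line containing a date would instead need a second grep for the date pattern to mark the end of each block.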
I have an HTML file which consists of 5 different articles, and I would like to extract each of these articles separately in R and run some analysis per article. Each article starts with <doc> and ends with </doc> and also has a document number. Example:
<doc>
<docno> NA123455-0001 </docno>
<docid> 1 </docid>
<p>
NASA one-year astronaut Scott Kelly speaks after coming home to Houston on
March 3, 2016. Behind Kelly,
from left to right: U.S. Second Lady Jill Biden; Kelly's identical in
brother, Mark;
John Holdren, Assistant to the President for Science and ...
</p>
</doc>
<doc>
<docno> KA25637-1215 </docno>
<docid> 65 </docid>
<date>
<p>
February 1, 2014, Sunday
</p>
</date>
<section>
<p>
WASHINGTON -- Former Republican presidential nominee Mitt Romney
is charging into the increasingly divisive 2016 GOP
White House sweepstakes Thursday with a harsh takedown of front-runner
Donald Trump, calling him a "phony" and exhorting fellow
</p>
</type>
</doc>
<doc>
<docno> JN1234567-1225 </docno>
<docid> 67 </docid>
<date>
<p>
March 5, 2003
</p>
</date>
<section>
<p>
SEOUL—New U.S.-led efforts to cut funding for North Korea's nuclearweapons
program through targeted
sanctions risk faltering because of Pyongyang's willingness to divert all
available resources to its
military, even at the risk of economic collapse ...
</p>
</doc>
I have loaded the URL with the readLines() function and combined all the lines by using
articles<- paste(articles, collapse=" ")
I would like to select the first article, which is between <doc>..</doc>, and assign it to article1, the second one to article2, and so on.
Could you please advise how to construct the function in order to select each one of these articles separately?
You could use strsplit, which splits strings on whatever text or regex you give it. It will give you a list with one item for each part of the string between occurrences of the splitting string, which you can then subset into different variables if you like. (You could use other regex functions as well, if you prefer.)
splitArticles <- strsplit(articles, '<doc>')
You'll still need to chop out the </doc> tags (plus a lot of other cruft, if you just want the text), but it's a start.
A more typical way to do the same thing would be to use a package for html scraping/parsing. Using the rvest package, you'd need something like
library(rvest)
read_html(articles) %>% html_nodes('doc') %>% html_text()
which will give you a character vector of the contents of <doc> tags. It may take more cleaning, especially if there are whitespace characters that you need to clean. Picking your selector carefully for html_nodes may help you avoid some of this; it looks like if you used p instead of doc, you're more likely to just get the text.
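For example, a sketch (assuming articles holds the full HTML string from the question) that keeps one string per article while pulling only the <p> contents:
library(rvest)
docs <- read_html(articles) %>% html_nodes("doc")
# Collapse the <p> text of each <doc> into a single string per article
article_texts <- sapply(docs, function(d) paste(html_text(html_nodes(d, "p")), collapse = " "))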
The simplest solution is to use strsplit:
# strsplit returns a list, so take its first element; drop any empty piece before the first <doc>
art_list <- strsplit(articles, "<doc>")[[1]]
art_list <- art_list[trimws(art_list) != ""]
# Pull the document id out of each chunk
ids <- trimws(gsub(".*<docid>|</docid>.*", "", art_list))
# Assign each chunk to its own variable, e.g. article_1, article_65, article_67
for (i in seq_along(art_list)) {
  assign(paste("article", ids[i], sep = "_"), gsub("</doc>.*", "", art_list[i]))
}