I want to read some Word documents into R and extract the sentences that are contained within quotation marks. When I use the readtext() function from the readtext package, it adds extra quotes around the whole string of each article. Is there a way to change this?
library(readtext)

path <- "folder"
# read every .docx file in the folder
mydata <- readtext(paste0(path, "\\*.docx"))
mydata$text

# try to extract the first quoted passage from each document
quotes <- character()
for (i in 1:2) {
  quotes[i] <- sub('.*?"([^"]+)"', "\\1", mydata$text[i])
}
Here's the content of both Word documents:
A Bengal tiger named India that went missing in the US state of Texas, has been found unharmed and now transferred to one of the animal shelter in Houston.
"We got him and he is healthy," said Houston Police Department (HPD) Major Offenders Commander Ron Borza.
A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK.
This is what my current output looks like:
[1] "We got him and he is healthy, said Houston Police Department (HPD) Major Offenders Commander Ron Borza."
[2] "A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK."
It looks like the issue is related to the different types of quotation marks, i.e. curly vs. straight. I can normalise them as follows:
# replace curly quotation marks with their straight equivalents
mydata$text <- gsub("[‘’]", "'", mydata$text)
mydata$text <- gsub("[“”]", "\"", mydata$text)
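Once the quotes are normalised, one way to extract every quoted passage (rather than everything after the first quote, which is what the sub() call above returns) is regmatches() with gregexpr(). A minimal sketch, assuming the wrapping quotes added by readtext have already been dealt with:

# pull every "..." span from each article
all_quotes <- regmatches(mydata$text, gregexpr('"[^"]+"', mydata$text))
# strip the surrounding quote characters from each match
all_quotes <- lapply(all_quotes, function(q) gsub('^"|"$', "", q))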
'''
A stray SKATEBOARD clips her, causing her to stumble and
spill her coffee, as well as the contents of her backpack.
The young RIDER dashes over to help, trembling when he sees
who his board has hit.
RIDER
Hey -- sorry.
Cowering in fear, he attempts to scoop up her scattered
belongings.
KAT
Leave it
He persists.
KAT (continuing)
I said, leave it!
RIDER
Hey -- sorry.
'''
I'm scraping some scripts that I want to do text analysis on. I want to pull only the dialogue from the scripts, and it looks like dialogue has a certain amount of leading spacing.
So, for example, I want the line "Hey -- sorry.". I know that the spacing is 20 and that it is consistent throughout the script. How can I read in only that line and the rest that have equal spacing?
I'm thinking of using read.fwf to read fixed-width fields.
What do you guys think?
I'm scraping from URLs like this:
https://imsdb.com/scripts/10-Things-I-Hate-About-You.html
library(tidytext)
library(tidyverse)
text <- c("PADUA HIGH SCHOOL - DAY
Welcome to Padua High School, your typical urban-suburban
high school in Portland, Oregon. Smarties, Skids, Preppies,
Granolas. Loners, Lovers, the In and the Out Crowd rub sleep
out of their eyes and head for the main building.
PADUA HIGH PARKING LOT - DAY
KAT STRATFORD, eighteen, pretty -- but trying hard not to be
-- in a baggy granny dress and glasses, balances a cup of
coffee and a backpack as she climbs out of her battered,
baby blue '75 Dodge Dart.
A stray SKATEBOARD clips her, causing her to stumble and
spill her coffee, as well as the contents of her backpack.
The young RIDER dashes over to help, trembling when he sees
who his board has hit.
RIDER
Hey -- sorry.
Cowering in fear, he attempts to scoop up her scattered
belongings.
KAT
Leave it
He persists.
KAT (continuing)
I said, leave it!
She grabs his skateboard and uses it to SHOVE him against a
car, skateboard tip to his throat. He whimpers pitifully
and she lets him go. A path clears for her as she marches
through a pack of fearful students and SLAMS open the door,
entering school.
INT. GIRLS' ROOM - DAY
BIANCA STRATFORD, a beautiful sophomore, stands facing the
mirror, applying lipstick. Her less extraordinary, but
still cute friend, CHASTITY stands next to her.
BIANCA
Did you change your hair?
CHASTITY
No.
BIANCA
You might wanna think about it
Leave the girls' room and enter the hallway.
HALLWAY - DAY- CONTINUOUS
Bianca is immediately greeted by an admiring crowd, both
boys
and girls alike.
BOY
(adoring)
Hey, Bianca.
GIRL
Awesome shoes.
The greetings continue as Chastity remains wordless and
unaddressed by her side. Bianca smiles proudly,
acknowledging her fans.
GUIDANCE COUNSELOR'S OFFICE - DAY
CAMERON JAMES, a clean-cut, easy-going senior with an open,
farm-boy face, sits facing Miss Perky, an impossibly cheery
guidance counselor.")
# pattern matching character names at the start of a line
names_stopwords <- c("^(rider|kat|chastity|bianca|boy|girl)")

text %>%
  as_tibble() %>%
  # one row per line of the script (tokens are lowercased)
  unnest_tokens(text, value, token = "lines") %>%
  # keep only lines containing a long run of whitespace, i.e. indented dialogue
  filter(str_detect(text, "\\s{15,}")) %>%
  mutate(text = str_trim(text)) %>%
  # drop the character-name lines
  filter(!str_detect(text, names_stopwords))
Output:
# A tibble: 9 x 1
text
<chr>
1 hey -- sorry.
2 leave it
3 i said, leave it!
4 did you change your hair?
5 no.
6 you might wanna think about it
7 (adoring)
8 hey, bianca.
9 awesome shoes.
You can include further character names in the names_stopwords vector.
You can try the following:
url <- 'https://imsdb.com/scripts/10-Things-I-Hate-About-You.html'

url %>%
  #Read the webpage line by line
  readLines() %>%
  #Remove '<b>' and '</b>' tags from the strings
  gsub('<b>|</b>', '', .) %>%
  #Select only lines that begin with 20 or more whitespace characters
  grep('^\\s{20,}', ., value = TRUE) %>%
  #Trim the surrounding whitespace
  trimws() %>%
  #Remove all-caps strings (character names and scene headings)
  grep('^([A-Z]+\\s?)+$', ., value = TRUE, invert = TRUE)
#[1] "Hey -- sorry." "Leave it" "KAT (continuing)"
#[4] "I said, leave it!" "Did you change your hair?" "No."
#...
#...
I have tried to clean this up as much as possible, but it might require some more cleaning depending on what you actually want to extract.
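For instance, the parenthetical stage directions could be dropped with one more filter. A sketch, assuming the result of the pipeline above is stored in a variable named dialogue (a hypothetical name):

#Drop parenthetical stage directions such as "(adoring)"
dialogue <- grep('^\\(.*\\)$', dialogue, value = TRUE, invert = TRUE)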
I have read Microsoft's filings for 2016 into R. Now I want to clean the file, and split it into sentences. I have used the following code:
MSFT <- paste(readLines("https://www.sec.gov/Archives/edgar/data/789019/000156459017014900/0001564590-17-014900.txt"), collapse = " ")
Can someone help me?
Here is one way you can try:
MSFT <- paste(readLines("https://www.sec.gov/Archives/edgar/data/789019/000156459017014900/0001564590-17-014900.txt"), collapse = " ")
Remove everything from the text that is not within a body HTML tag (assumption: everything else is unwanted):
#Remove everything but body(s)
MSFT_body <- substr(MSFT, gregexpr("<body", MSFT)[[1]], gregexpr("</body", MSFT)[[1]])
Within the body, remove everything between < and >, as well as HTML entities, to get rid of HTML markup, CSS, and so on:
#Remove all html tags and characters
MSFT_body_html_removed <- gsub("<.*?>|&[A-Za-z]+;|&#[0-9]+;", "", MSFT_body)
Replace every run of whitespace (spaces, line breaks, tabs, ...) with a single space:
#Collapse all whitespace runs into a single space
MSFT_body_html_removed <- gsub("\\s+", " ", MSFT_body_html_removed)
You can use the openNLP sentence tokeniser (pretrained) to find sentences:
#Define function to tokenise text to sentences
sentence_tokeniser <- openNLP::Maxent_Sent_Token_Annotator(language = "en")
#convert to String class
text <- NLP::as.String(MSFT_body_html_removed)
Use annotate() to apply the tokeniser to the text:
#Annotate text
annotated_sentences <- NLP::annotate(text, sentence_tokeniser)
Extract the sentences:
#extract sentences
sentences <- text[annotated_sentences]
Print first 5 sentences:
# print first 5 sentences
for (i in 1:5) {
print(paste("Sentence", i))
cat(paste(sentences[i], "\n"))
}
This will give you:
[1] "Sentence 1"
UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Fiscal Year Ended June 30, 2017 OR TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to Commission File Number 001-37845 MICROSOFT CORPORATION WASHINGTON 91-1144442 (STATE OF INCORPORATION) (I.R.S. ID) ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052-6399 (425) 882-8080 www.microsoft.com/investor
[1] "Sentence 2"
Securities registered pursuant to Section12(b) of the Act: COMMON STOCK, $0.00000625 par value per share NASDAQ Securities registered pursuant to Section12(g) of the Act: NONE Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.YesNo Indicate by check mark if the registrant is not required to file reports pursuant to Section13 or Section15(d) of the Exchange Act.YesNo Indicate by check mark whether the registrant (1)has filed all reports required to be filed by Section13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2)has been subject to such filing requirements for the past 90 days.YesNo Indicate by check mark whether the registrant has submitted electronically and posted on its corporate website, if any, every Interactive Data File required to be submitted and posted pursuant to Rule 405 of Regulat... <truncated>
[1] "Sentence 3"
Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company.
[1] "Sentence 4"
See the definitions of large accelerated filer, accelerated filer, smaller reporting company, and emerging growth company in Rule12b-2 of the Exchange Act.
[1] "Sentence 5"
Large accelerated filer Acceleratedfiler Non-acceleratedfiler (Donotcheckifasmallerreportingcompany) Smallerreportingcompany Emerging growth company If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act.
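If you want to reuse this, the steps above can be wrapped into a single helper. A sketch, with extract_sentences as a hypothetical name:

#Sketch: combine the cleaning and tokenisation steps into one function
extract_sentences <- function(url) {
  raw <- paste(readLines(url), collapse = " ")
  #Keep only the first <body>...</body> region
  body <- substr(raw, gregexpr("<body", raw)[[1]][1], gregexpr("</body", raw)[[1]][1])
  #Strip tags and entities, then collapse whitespace
  clean <- gsub("\\s+", " ", gsub("<.*?>|&[A-Za-z]+;|&#[0-9]+;", "", body))
  tokeniser <- openNLP::Maxent_Sent_Token_Annotator(language = "en")
  s <- NLP::as.String(clean)
  s[NLP::annotate(s, tokeniser)]
}

sentences <- extract_sentences("https://www.sec.gov/Archives/edgar/data/789019/000156459017014900/0001564590-17-014900.txt")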
I have a question on regex. Suppose I have this string:
"She gained about 55 pounds in...9 months. She was like an eating machine. ”Trump, a man who wants to be president: "
I want to remove every blank space that comes after a period and before the character ”, and delete the ” character as well.
For example, this part of the sentence
She was like an eating machine. ”Trump, a man who wants to be president:
should become
She was like an eating machine.Trump, a man who wants to be president: "
Thanks guys, regex is not easy to learn. I appreciate any help!
P.S. I'm using R, but I think that's irrelevant since regex works in every programming language.
UPDATE
I solved my problem and I'd like to share it; maybe it could help someone else. I have a dataset about Trump and Hillary tweets downloaded from Kaggle.
I had to do some cleaning before importing the data into KNIME (a project at university).
I solved all the encoding issues with gsub except this one. I finally managed to solve it by writing a CSV file in R with UTF-8 encoding; naturally, I read that file into KNIME with the same encoding.
If you need to match any number of whitespace characters (one or more) between a dot and the curly double quote, you may use:
x <- "She gained about 55 pounds in...9 months. She was like an eating machine. ”Trump, a man who wants to be president: "
gsub("\\.\\s+”", ".", x)
## => [1] "She gained about 55 pounds in...9 months. She was like an eating machine.Trump, a man who wants to be president: "
The \\. matches a literal dot, \\s+ matches one or more whitespace characters, and ” matches a literal ”.
If there is only 1 regular space between the dot and the quote, you may use a fixed string replacement:
gsub(". ”", ".", x, fixed=TRUE)
Maybe this could help (JavaScript):
var str = 'She was like an eating machine. "Trump, a man who wants to be president. "New value';
var result = str.replace(/\.\s"/g, ".");
http://regexr.com/ is a great tool for learning and testing regular expressions.
The only thing I'd add to Wiktor's answer is that \\s+ won't match the zero-space case, as in "machine.”Trump". To match any number of spaces (including none) after a dot and before the quote, use the * quantifier:
x <- "She gained about 55 pounds in...9 months. She was like an eating machine. ”Trump, a man who wants to be president: "
gsub("\\.\\s*”", ".", x)
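For the sample string this produces the same result as the \\s+ version, but it would also handle "machine.”Trump":

## => [1] "She gained about 55 pounds in...9 months. She was like an eating machine.Trump, a man who wants to be president: "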
I have an HTML file which consists of 5 different articles, and I would like to extract each of these articles separately in R and run some analysis per article. Each article starts with <doc> and ends with </doc>, and also has a document number. Example:
<doc>
<docno> NA123455-0001 </docno>
<docid> 1 </docid>
<p>
NASA one-year astronaut Scott Kelly speaks after coming home to Houston on
March 3, 2016. Behind Kelly,
from left to right: U.S. Second Lady Jill Biden; Kelly's identical twin
brother, Mark;
John Holdren, Assistant to the President for Science and ...
</p>
</doc>
<doc>
<docno> KA25637-1215 </docno>
<docid> 65 </docid>
<date>
<p>
February 1, 2014, Sunday
</p>
</date>
<section>
<p>
WASHINGTON -- Former Republican presidential nominee Mitt Romney
is charging into the increasingly divisive 2016 GOP
White House sweepstakes Thursday with a harsh takedown of front-runner
Donald Trump, calling him a "phony" and exhorting fellow
</p>
</type>
</doc>
<doc>
<docno> JN1234567-1225 </docno>
<docid> 67 </docid>
<date>
<p>
March 5, 2003
</p>
</date>
<section>
<p>
SEOUL—New U.S.-led efforts to cut funding for North Korea's nuclear weapons
program through targeted
sanctions risk faltering because of Pyongyang's willingness to divert all
available resources to its
military, even at the risk of economic collapse ...
</p>
</doc>
I have read the file with the readLines() function and combined all the lines together using
articles <- paste(articles, collapse = " ")
I would like to select the first article, which is between <doc>..</doc>, and assign it to article1, the second one to article2, and so on.
Could you please advise how to construct a function that selects each of these articles separately?
You could use strsplit, which splits strings on whatever text or regex you give it. It will give you a list with one item for each part of the string between occurrences of the splitting pattern, which you can then subset into different variables, if you like. (You could use other regex functions as well, if you prefer.)
splitArticles <- strsplit(articles, '<doc>')
You'll still need to chop out the </doc> tags (plus a lot of other cruft, if you just want the text), but it's a start.
A more typical way to do the same thing would be to use a package for html scraping/parsing. Using the rvest package, you'd need something like
library(rvest)
read_html(articles) %>% html_nodes('doc') %>% html_text()
which will give you a character vector of the contents of <doc> tags. It may take more cleaning, especially if there are whitespace characters that you need to clean. Picking your selector carefully for html_nodes may help you avoid some of this; it looks like if you used p instead of doc, you're more likely to just get the text.
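To get per-article text, one option is to take each <doc> node and collapse its <p> children. A sketch (article_texts is a hypothetical name):

docs <- read_html(articles) %>% html_nodes("doc")
# collapse the <p> children of each <doc> into one string per article
article_texts <- sapply(docs, function(d) {
  d %>% html_nodes("p") %>% html_text() %>% paste(collapse = " ")
})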
The simplest solution is to use strsplit:
# split the combined string on the opening tag; drop the empty first piece
art_list <- unlist(strsplit(articles, "<doc>"))
art_list <- art_list[art_list != ""]
# extract the <docid> value from each piece
ids <- trimws(gsub(".*<docid>|</docid>.*", "", art_list))
# assign each article to a variable named article_<id>
for (i in seq_along(art_list)) {
  assign(paste("article", ids[i], sep = "_"),
         gsub("</doc>.*", "", art_list[i]))
}
I have some strings like:
Sample Input:
Also known as temple of the city,
xxx as Pune Banglore as kolkata Delhi India,
as Mumbai India or as Bombay India,
Calcutta,India is now know as Kolkata,India,
From the above, I want to convert "as xxx xxxx xx," to "as xxx_xxxx_xx,", and the replacement should apply to the text after the last as in each line.
Sample output for above:
Also known as temple_of_the_city,
xxx as Pune Banglore as kolkata_Delhi_India,
as Mumbai India or as Bombay_India,
Calcutta,India is now know as Kolkata,India,
After the change, there should be no space-separated string left after the last as in a line.
Please let me know if it is not clear.
Thanks
Paul is right that it's not really a simple task. This is a sed solution that I put together:
sed 's/\(.*as \)/\1\n/;h;y/ /_/;G;s/.*\n\(.*\)\n\(.*\)\n.*/\2\1/' file.txt
It inserts a newline after the last "as ", saves a copy of the line in the hold space, turns every space into an underscore, appends the saved copy, and finally stitches the untouched head together with the underscored tail.
Demonstration on your data:
$ echo 'Also known as temple of the city,
> xxx as Pune Banglore as kolkata Delhi India,
> as Mumbai India or as Bombay India,
> Calcutta,India is now know as Kolkata,India,' | \
> sed 's/\(.*as \)/\1\n/;h;y/ /_/;G;s/.*\n\(.*\)\n\(.*\)\n.*/\2\1/'
Also known as temple_of_the_city,
xxx as Pune Banglore as kolkata_Delhi_India,
as Mumbai India or as Bombay_India,
Calcutta,India is now know as Kolkata,India,
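Since the rest of this page is R-centric, here is a base-R sketch of the same split-and-underscore idea (underscore_tail is a hypothetical helper):

x <- c("Also known as temple of the city,",
       "xxx as Pune Banglore as kolkata Delhi India,",
       "as Mumbai India or as Bombay India,",
       "Calcutta,India is now know as Kolkata,India,")

underscore_tail <- function(line) {
  # greedy ".*as " captures everything up to and including the last "as "
  parts <- regmatches(line, regexec("^(.*as )(.*)$", line))[[1]]
  if (length(parts) == 0) return(line)  # no "as " in this line
  paste0(parts[2], gsub(" ", "_", parts[3]))
}

vapply(x, underscore_tail, character(1), USE.NAMES = FALSE)
# [1] "Also known as temple_of_the_city,"
# [2] "xxx as Pune Banglore as kolkata_Delhi_India,"
# [3] "as Mumbai India or as Bombay_India,"
# [4] "Calcutta,India is now know as Kolkata,India,"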
I'd be inclined to use Perl, the Swiss Army chainsaw, but sed is also an option. In either case you're looking at a substantial learning curve.
The replacement you've described is probably complex enough that you'd be better off writing a script rather than trying to do it as a one-liner.
If you're going to write a script and don't already know Perl, there's no reason why you shouldn't pick your scripting language of choice (Python, Ruby, etc.) as long as it has some sort of text pattern matching syntax.
I don't know of a simple, shallow-learning-curve method for doing a complex pattern match and replacement of this sort. Is this a one-time thing where you need to do this replacement only, or are you going to be doing similar sorts of complicated pattern replacements in the future? If you're going to be doing this frequently, you really should invest the time in learning some scripting language, but I won't impose my Perl bias on you. Just pick any language that seems accessible.