E-Mail Text Parsing and Extracting via Regular Expression in R

After over a year of struggling to no avail, I'm turning to the SO community for help. I've used various RegEx creator sites, standalone RegEx creator software, and manual editing, all in a futile attempt to create a pattern that parses and extracts dynamic data from the e-mail samples below (sanitized to protect the innocent):
Action to Take: Buy shares of Facebook (Nasdaq: FB) at market. Use a 20% trailing stop to protect yourself. ...
Action to Take: Buy Google (Nasdaq: GOOG) at $42.34 or lower. If the stock is above $42.34, don't chase it. Wait for it to come down. Place a stop at $35.75. ...
***Action to Take***
 
Buy International Business Machines (NYSE: IBM) at market. And use a protective stop at $51. ...
Both forms of the "Action to Take" section need to be parsed, and the extracted data must include the direction (i.e. buy or sell, though I'm only concerned with buys here), the ticker, the limit price (if applicable), and the stop value as either a percentage or a number (if applicable). Sometimes there are also multiple "Action to Take" sections in a single e-mail.
Here are examples of what the pattern should not match (or ideally should be flexible enough to deal with):
Action to Take: Sell half of your Apple (NYSE: AAPL) April $46 calls for $15.25 or higher. If the spread between the bid and the ask is $0.20 or more, place your order between the bid and the ask - even if the bid is higher than $15.25.
Action to Take: Raise your stop on Apple (NYSE: AAPL) to $75.15.
Action to Take: Sell one-quarter of your Facebook (Nasdaq: FB) position at market. ...
Here's my R code with the latest Perl-compatible pattern I came up with (so I can use lookaround in R). It sort of works, but not consistently or across multiple saved e-mails:
library(stringr)

filenames <- list.files("R:/TBIRD", pattern = "\\.eml$", full.names = TRUE)

parse_email <- function(input) {
  text <- readLines(input, warn = FALSE)
  text <- paste(text, collapse = "")
  # isolate the plain-text MIME part; the HTML part is too inconsistent to parse
  trim <- regmatches(text, regexpr("Content-Type: text/plain.*Content-Type: text/html", text, perl = TRUE))
  pattern <- "(?is-)(?<=Action to Take).*(?i-s)(Buy|Sell).*(?:\\((?:NYSE|Nasdaq)\\:\\s(\\w+)\\)).*(?:for|at)\\s(\\$\\d*\\.\\d* or|market)\\s"
  # match within the plain-text portion
  str_match(trim, pattern)
}

results <- lapply(filenames, parse_email)
table <- do.call(rbind, results)
table <- data.frame(table)
table <- table[rowSums(is.na(table)) < 1, ]   # drop rows with any unmatched field
table <- subset(table, select = c("X2", "X3", "X4"))
The parsing has to operate on the plain-text copy because the HTML is far too complicated to handle, given the lack of standardization from e-mail to e-mail. Unfortunately, the text copy also commonly has different line endings than the regexp expects, which greatly complicates things.
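For what it's worth, one direction that might be easier to maintain than a single monolithic regex is to split the message into "Action to Take" blocks first and then extract each field with its own small pattern. The sketch below is untested, and the field patterns are assumptions derived only from the samples above (extract_actions is a hypothetical helper name):
library(stringr)

# hypothetical helper: split on the "Action to Take" heading (either form),
# then pull each field out of its block with a small, targeted pattern
extract_actions <- function(text) {
  blocks <- str_split(text, regex("\\*{0,3}Action to Take\\*{0,3}:?", ignore_case = TRUE))[[1]][-1]
  data.frame(
    direction  = str_match(blocks, regex("\\b(Buy|Sell)\\b", ignore_case = TRUE))[, 2],
    ticker     = str_match(blocks, "\\((?:NYSE|Nasdaq):\\s*(\\w+)\\)")[, 2],
    limit      = str_match(blocks, "at\\s+(market|\\$\\d+(?:\\.\\d+)?)")[, 2],
    stop_price = str_match(blocks, "stop\\s+at\\s+(\\$\\d+(?:\\.\\d+)?)")[, 2],
    stop_pct   = str_match(blocks, "(\\d+%)\\s+trailing stop")[, 2],
    stringsAsFactors = FALSE
  )
}
Rows where direction is not "Buy" (or where a field is NA) can then be filtered out after extraction.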

Related

How can I use the `@` symbol in a twitter bot without mentioning a user?

Using Twitter in the normal way, you can post an @ symbol without tagging someone by using @@. E.g. @_wurli would show as a mention, but @@_wurli, while giving the same text, would not be a mention.
I want to know how I can achieve this using a bot. I am using the {rtweet} package in the R language.
I'm not sure whether there's an intended approach for dealing with this. However, a good workaround I've found is to simply insert a zero-width space between the @ and the rest of the text. This also works for hashtags:
tweets <- c("#not_a_user", "#notatag")
zero_width_space <- "\U200B"
tweets <- tweets |>
gsub(x = _, "#", paste0("#", zero_width_space)) |>
gsub(x = _, "#", paste0("#", zero_width_space))
The only potential issue is that this adds to the number of characters in the tweet. However, if you're only transforming a few characters per tweet, this is unlikely to be a problem.
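If it helps, a short usage sketch; it assumes you have already set up rtweet authentication for the session (e.g. via auth_setup_default()):
library(rtweet)

# assumes authentication has already been configured for this session
post_tweet(status = tweets[1])  # posts the escaped text without triggering a mention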

How to search for tweets containing either one hashtag or another

I need to get tweets that contain at least one of the following hashtags: #EUwahl #Euwahlen #Europawahl #Europawahlen. This means I am looking for tweets containing at least one of those hashtags, but they can also contain more of them. Furthermore, one out of seven specific users (e.g. @AfD) must also be mentioned in each of these tweets.
So far I only know how to search Twitter for a single hashtag or for all of several hashtags. That is, I am familiar with the AND operator (using a + between the hashtags) but not with the operator for OR.
This is an example of a code I have used so far to do any searches in Twitter:
euelection <- searchTwitter("#EUwahl", n=1000, since = "2019-05-01",until = "2019-05-26")
I can install twitteR, but it requires an authentication key, which is not very easy for me to get.
The principle is to search using OR with spaces in between. Here is an example with rtweet:
library(rtweet)

# your tags
TAGS <- c("#EUwahl", "#Europawahl")
# build the search term: "#EUwahl OR #Europawahl"
SEARCH <- paste(TAGS, collapse = " OR ")

# do the search (you can also use twitteR)
test <- search_tweets(SEARCH, n = 100)
# the text of the tweets that were found
head(test$text)

# check which tweet contains which tag
tab <- sapply(TAGS, function(i) as.numeric(grepl(i, test$text, ignore.case = TRUE)))
# all of them contain either #EUwahl or #Europawahl
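The question also asks that one of seven specific accounts be mentioned in each tweet. Here is a hedged sketch of two ways to add that condition; the USERS vector below is a hypothetical stand-in for the real handles, and while I believe the search syntax supports grouping with parentheses, the R-side filter works either way:
# hypothetical stand-in for the seven required accounts
USERS <- c("@AfD", "@CDU")

# option 1: let Twitter do the filtering by combining two OR groups
QUERY <- paste0("(", paste(TAGS, collapse = " OR "), ") (", paste(USERS, collapse = " OR "), ")")
test2 <- search_tweets(QUERY, n = 100)

# option 2: filter the earlier result in R
has_user <- rowSums(sapply(USERS, function(u) grepl(u, test$text, ignore.case = TRUE))) > 0
test_filtered <- test[has_user, ]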

How to read a txt file in R which indicates page number and paragraphs in each page

I have read a .txt file using readLines() in R. I have not added line numbers (i.e. displayed line numbers) in the txt file.
The txt file is in this form.
page1:
paragraph1:Banks were early adopters, but now the range of applications
and organizations using predictive analytics successfully have multiplied. Direct marketing and sales.
Leads coming in from a company’s website can be scored to determine the probability of a
sale and to set the proper follow-up priority.
paragraph2: Campaigns can be targeted to the candidates most
likely to respond. Customer relationships.Customer characteristics and behavior are strongly
predictive of attrition (e.g., mobile phone contracts and credit cards). Attrition or “churn”
models help companies set strategies to reduce churn rates via communications and special offers.
Pricing optimization. With sufficient data, the relationship between demand and price can be modeled for
any product and then used to determine the best pricing strategy.
Similarly, page2 in the .txt file has paragraphs.
But I couldn't differentiate between pages and paragraphs, since a .txt file doesn't distinguish pages. Is there any way to identify pages and paragraphs in R?
The answer given by Edward Carney is just right for this. But if I'm not using "paragraph(No.)", how do I identify the paragraphs using tabs/spaces?
This method uses the stripWhitespace function from the tm library, but, other than that, it's basic R.
First, read in the text and locate the page#: lines using grep.
library(tm)  # for stripWhitespace()

x <- readLines('text2.txt')
page_locs <- grep('page\\d:', x)
# add an element with the last line of the text plus 1
page_locs[length(page_locs) + 1] <- length(x) + 1
# strip out the whitespace
x <- stripWhitespace(x)
# break the text into a list of pages, eliminating the `page#:` lines
pages <- list()
# grab each page's lines into successive list elements
for (i in 1:(length(page_locs) - 1)) {
  pages[[i]] <- x[(page_locs[i] + 1):(page_locs[i + 1] - 1)]
}
Then process each page into a list of paragraphs.
for (i in 1:length(pages)) {
  # get the locations of the paragraph markers
  para_locs <- grep('paragraph\\d:', pages[[i]])
  # add an end element
  para_locs[length(para_locs) + 1] <- length(pages[[i]]) + 1
  # delete the paragraph marker
  curr_page <- gsub('paragraph\\d:', '', pages[[i]])
  curr_paras <- list()
  # step through the paragraphs in each page
  for (j in 1:(length(para_locs) - 1)) {
    # collapse the vectors for each paragraph
    curr_paras[[j]] <- paste(curr_page[para_locs[j]:(para_locs[j + 1] - 1)], collapse = '')
    # delete leading spaces for each paragraph if desired
    curr_paras[[j]] <- gsub('^ ', '', curr_paras[[j]])
  }
  # store the list of paragraphs back into the pages list
  pages[[i]] <- curr_paras
}
You might need some additional clean up depending on your text.
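Regarding the follow-up about files without paragraph(No.) labels: a minimal sketch, assuming paragraphs are instead separated by blank lines. If they are marked by a leading tab or indent instead, detect those lines with grepl() on the leading whitespace rather than testing for blank lines (see the last comment).
library(tm)  # for stripWhitespace()

x <- readLines('text2.txt')
x <- stripWhitespace(x)
# treat blank lines as paragraph boundaries
blank <- !nzchar(trimws(x))
# assign a running paragraph id to every non-blank line
para_id <- cumsum(blank)[!blank]
# collapse the lines of each paragraph into one string
paras <- unname(tapply(x[!blank], para_id, paste, collapse = " "))
# if paragraphs start with a tab or indent instead, use grepl("^[ \t]", x)
# to mark the boundaries rather than the `blank` test above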

Retrieve whole lyrics from URL

I am trying to retrieve the whole lyrics of a band from the web.
I have noticed that they build URLs using ".../firstletter/bandname/songname.html"
Here is an example.
http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html
I was thinking about creating a function that would read the URLs with read.csv.
That part was kind of easy because I can get the titles with a simple copy-paste and save them as .csv. Then I can pass that vector through the function to construct each URL.
But when I tried to read the first one just to see what it looks like, I found there would be too much "cleaning the data" to do if my goal is to build a csv file with each lyric.
x <-read.csv(url("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html"))
I think my approach is not the best (or maybe I need a better data cleaning strategy)
The HTML page has a tell on where the lyrics begin:
Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that.
Taking advantage of that, you can detect this string, and then read everything up to the end of the div:
m <- readLines("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")
giveaway <- "Sorry about that."
#You can add the full line in case you think one of the lyrics might have this sentence in it.
start <- grep(giveaway, m) + 1 # Where the lyric starts
end <- grep("</div>", m[start:length(m)])[1] + start
# Take the first </div> after the start of the lyric, and then fix the position by adding the start
lyrics <- paste(gsub("<br>|</div>", "", m[start:end]), collapse = "\n")
#This is just an example of how to clear the remaining tags and join the text.
And then:
> cat(lyrics) #using cat() prints the line breaks
Ridin' down the highway
Goin' to a show
Stop in all the byways
Playin' rock 'n' roll
.
.
.
Well it's a long way
It's a long way, you should've told me
It's a long way, such a long way
Assuming that "cleaning the data" means you would be parsing through html tags. I recommend using DOM scraping library that would extract only the text lyrics from the page and save those lyrics to CSV, database or wherever. That way you wouldn't have to do any data cleaning. I don't know what programming language your using, but a simple google search will show you a lot of DOM querying and parsing libraries for any language.
Here is an example with PHP
http://simplehtmldom.sourceforge.net/manual.htm
$html = file_get_html('http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html');
// Find the div that follows the ringtone div; it holds the lyrics
$lyrics = $html->find('div.ringtone', 1)->next_sibling();
print($lyrics->innertext);
Now you have the lyrics. Save them. (Code not tested.)
If you're using the R language, use the library below (rvest). You will be able to query the DOM and extract the lyrics easily.
https://github.com/hadley/rvest
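For instance, a minimal rvest sketch (untested, and assuming the lyrics still live in the unnamed <div> that follows <div class="ringtone">, as the PHP example above relies on):
library(rvest)

page <- read_html("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")
# mirror the PHP selector: the first <div> sibling after <div class="ringtone">
lyrics <- html_element(page, xpath = "//div[@class='ringtone']/following-sibling::div[1]")
cat(html_text2(lyrics))  # html_text2() keeps the line breaks from the <br> tags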

xpath node determination

I'm all new to scraping and I'm trying to understand XPath using R. My objective is to create a vector of people from this website. I'm able to do it using:
library(XML)   # htmlTreeParse, xmlValue
library(plyr)  # ldply

r <- htmlTreeParse(e)  ## e is the page source from getURL
g.k <- r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]]
l <- g.k[names(g.k) == "text"]
u <- ldply(l, function(x) {
  w <- xmlValue(x)
  return(w)
})
However, this is cumbersome and I'd prefer to use XPath. How do I go about referencing the path detailed above? Is there a function for this, or can I somehow submit my path referenced as above?
I've come up with
kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE), "//body//text//div//div//p//text()", function(k) xmlValue(k))
But this leaves me a lot of cleaning up to do, and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the unclearness, but I'm all new to this and rather confused. Unfortunately, the XML document is too large to paste. I guess my question is whether there is some easy way to find the names of these nodes and the structure of the document, besides using view source? I've come a little closer to what I'd like:
e2 <- getNodeSet(htmlTreeParse(e, useInternalNodes = TRUE), "//p")[[5]]
gives me the list of what I want, though still as XML with br tags. I thought running
kk <- xpathApply(e2, "//text()", function(k) xmlValue(k))
would provide a list that could later be unlisted. However, it provides a list with more garbage than e2 displays.
Is there a way to do this directly:
kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE), "//p[5]//text()", function(k) xmlValue(k))
Link to the web page is below: I'm trying to get the names, and only the names, from the page.
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subsetting in XPath), which leaves a final line that is not a name. One could do the text processing in XPath too, but then one would have to iterate at the R level.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
  xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
Use a mixture of XPath and string manipulation.
#Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable, which contains the page's source, tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use XPath to retrieve the <p> elements.
# Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or more prettily with stringr.
str_split_fixed(all_names, ", ", 2)
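For completeness, a hedged rvest version of the same idea (assuming the page still serves the names in the fifth <p>, separated by <br /> tags):
library(rvest)
library(stringr)

page <- read_html("http://legeforeningen.no/id/1712")
# the fifth <p> in the document holds the names (per the inspection above)
raw <- html_text2(html_element(page, xpath = "(//p)[5]"))
# html_text2() turns each <br /> into a newline, so split on those
all_names <- strsplit(raw, "\n")[[1]]
all_names <- all_names[nzchar(all_names)]
str_split_fixed(all_names, ", ", 2)  # separate the names from the locations, as above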
