I am working with Twitter data that I then feed into an HTML document. The text often contains special characters, such as emojis, that aren't properly encoded for HTML. For example, the tweet:
If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥
would become:
If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥
when fed into an html document.
Working manually, I could use a tool like https://www.textfixer.com/html/html-character-encoding.php to encode the tweet so that it looks like:
If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be &#55357;&#56613; &#55357;&#56613; &#55357;&#56613;
which I could then feed to an HTML document and have the emojis show up. Is there a package or function in R that can take text and HTML-encode it in the same way as the web tool above?
Here's a function that encodes non-ASCII characters as numeric HTML character references.
entity_encode <- function(x) {
  cp <- utf8ToInt(x)  # code points of a single string
  rr <- vector("character", length(cp))
  ucp <- cp > 127  # non-ASCII code points (ASCII ends at 127)
  rr[ucp] <- paste0("&#", as.character(cp[ucp]), ";")
  rr[!ucp] <- intToUtf8(cp[!ucp], multiple = TRUE)  # ASCII characters pass through unchanged
  paste0(rr, collapse="")
}
This returns
[1] "If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be &#128293; &#128293; &#128293;"
for your input, which differs from the web tool's output, but the two seem to be equivalent encodings.
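Because utf8ToInt accepts only one string at a time, a function like the one above has to be applied element by element when the tweets live in a vector or a data frame column. The following is a minimal sketch of that, not a definitive implementation: encode_tweets and the sample tweets are hypothetical names and data introduced here, and the encoder treats every code point above 127 as non-ASCII.

```r
# Encode one string: non-ASCII code points become numeric character references.
entity_encode <- function(x) {
  cp <- utf8ToInt(x)
  non_ascii <- cp > 127
  out <- intToUtf8(cp, multiple = TRUE)          # one element per character
  out[non_ascii] <- paste0("&#", cp[non_ascii], ";")
  paste0(out, collapse = "")
}

# Hypothetical wrapper: apply the encoder to every element of a character vector.
encode_tweets <- function(tweets) {
  vapply(tweets, entity_encode, character(1), USE.NAMES = FALSE)
}

# Hypothetical sample data.
tweets <- c("will be \U0001F525", "caf\u00e9", "plain ascii")
encoded <- encode_tweets(tweets)
print(encoded)
```

Note that this sketch does not escape the HTML-special ASCII characters &, < and >; those would need separate handling if the tweets can contain them.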
Related
I have a dataset with a feature, body, whose values are text from an HTML file and include semantic tags, like so:
</strong>earned six Tony nominations this week, including one for Nyong'o (Best Actress in a Leading Role). <em>Eclipsed</em> is also significant for being the first Broadway play to feature a cast and creative team that is entirely black, female, and of African descent. (The play was written by Danai Gurira, who also plays Michonne on <em>The Walking Dead</em>.)</p> \n<p><!-- ######## BEGIN SNIPPET ######## -->
I would like to remove all text between semantic tags using wildcards. Is there a way to do so?
<!-- .--> My logic here is to remove the comment tag with everything inside of it.
Supposing your data frame looks like this
df <- data.frame(text = '</strong>earned six Tony nominations this week, including one for Nyong\'o (Best Actress in a Leading Role). <em>Eclipsed</em> is also significant for being the first Broadway play to feature a cast and creative team that is entirely black, female, and of African descent. (The play was written by Danai Gurira, who also plays Michonne on <em>The Walking Dead</em>.)</p> \n<p><!-- ######## BEGIN SNIPPET ######## -->')
Then you could use
df$new_text <- gsub("<!--.*-->", "", df$text)
to get your desired output in a new column new_text.
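One caveat: .* is greedy, so when a string contains more than one comment, that pattern removes everything from the first <!-- to the last -->, including the text in between. A minimal sketch of a lazy variant follows, assuming PCRE via perl = TRUE; the (?s) flag lets . match newlines so multi-line comments are covered. The variables html and multi are hypothetical sample strings.

```r
# Hypothetical sample with two comments on one line.
html <- "<p>first<!-- note one --> second<!-- note two --> third</p>"

# Greedy: swallows the text between the two comments as well.
greedy <- gsub("<!--.*-->", "", html)

# Lazy: stops at the first closing --> after each opening <!--.
lazy <- gsub("(?s)<!--.*?-->", "", html, perl = TRUE)

# A comment that spans a line break is removed too, thanks to (?s).
multi <- "a<!-- line one\nline two -->b"
lazy_multi <- gsub("(?s)<!--.*?-->", "", multi, perl = TRUE)

print(greedy)
print(lazy)
print(lazy_multi)
```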
I extracted tweets from Twitter related to #TrumpCaved!!
In my tweets I wanted to remove the emoticons, URLs, and all other special characters. One of the tweets is as follows:
#mitchellvii #AnnCoulter Hey all you #MAGA people, how did you like
watching #realDonaldTrump cave today?
… HTTP content [I could not post the actual http link here]
I tried the following code, but it doesn't work for me.
In my case, removing the URLs works, but when I then run the next line to remove the emoticons, the emoticons are removed and the URLs come back. Can anyone help me remove all the unwanted characters from the text, especially the URLs and emoticons?
First I tried to remove the http using gsub function
Corpus = gsub("https.*","", tweets_text$Tweets)
O/p : #mitchellvii #AnnCoulter Hey all you #MAGA people, how did you like watching #realDonaldTrump cave today? <U+0001F602><U+0001F923><U+0001F602><U+0001F923>…
Next I tried to remove the emoticons using gsub function
Corpus = gsub("[^[:alnum:]///' ]","", tweets_text$Tweets)
O/P : mitchellvii AnnCoulter Hey all you MAGA people how did you like watching realDonaldTrump cave today https//tco/vmUCJvTnEO
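Both gsub calls above read from tweets_text$Tweets, so the second call starts again from the original text and the URL removal from the first call is lost. A minimal sketch of chaining the steps instead, so that each one works on the previous result: the sample tweet is made up for illustration (the URL is the one visible in the output above), and the variable names are hypothetical.

```r
# Hypothetical sample tweet with two emoji and a shortened URL.
tweet <- "Hey all you #MAGA people, how did you like watching cave today? \U0001F602\U0001F923 https://t.co/vmUCJvTnEO"

# Step 1: remove URLs (http or https, up to the next whitespace).
no_url <- gsub("https?://\\S+", "", tweet, perl = TRUE)

# Step 2: on the RESULT of step 1, drop everything except letters, digits,
# whitespace and apostrophes; this removes emoji, '#' and punctuation.
no_special <- gsub("[^[:alnum:][:space:]']", "", no_url, perl = TRUE)

# Step 3: collapse runs of whitespace and trim the ends.
clean <- trimws(gsub("\\s+", " ", no_special, perl = TRUE))
print(clean)
```

The URL has to go first: once punctuation is stripped, the link degrades into fragments such as https//tco/... that no longer match a URL pattern.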
I've been grappling with regex on the following string:
"Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... https://t dot co/hUradDaNVX"
I am unable to remove the entire \x...\x pattern from the above string.
I am also unable to remove the https URL from it.
My regular expressions are:
gsub('http.* *', '', twts_array)
gsub("\\x.*\\x..","",twts_array)
My output is:
"Just beautiful let’s see how the next few days go \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... httpstcohUradDaNVX"
My expected output is:
Just beautiful, let’s see how the next few days go. Long term buying opportunities could be around the corner
P.S.: As you can see, neither problem got solved. I also wrote "dot" in place of "." in https://t dot co/hUradDaNVX because StackOverflow does not allow me to post shortened URLs. Can someone help me tackle this problem?
On Linux you can do the following:
twts_array <- "Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... https://t dot co/hUradDaNVX"
twts_array_str <- enc2utf8(twts_array)
twts_array_str <- gsub('<..>', '', twts_array_str)
twts_array_str <- gsub('http.*', '', twts_array_str)
twts_array_str
# "Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner ... "
enc2utf8 renders any byte sequences it cannot interpret as <..> placeholders. The first gsub call then removes those placeholders, and the second removes the URL.
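The pattern '<..>' matches any two characters between angle brackets, so it would also delete short tags such as <br>. A slightly tighter sketch, assuming the unreadable bytes have already been rendered as two-digit lowercase hex placeholders like <ed>; the string rendered is a hypothetical stand-in for such output. A tag whose name happens to be two hex letters, such as <dd>, would still be removed.

```r
# Hypothetical text in which invalid bytes already appear as <xx> placeholders.
rendered <- "around the corner <ed><a0><bd><ed><b2><b0>... <br> more text"

# Remove only placeholders made of exactly two lowercase hex digits.
cleaned <- gsub("<[0-9a-f]{2}>", "", rendered)
print(cleaned)
```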
After over a year of struggling to no avail, I'm turning to the SO community for help. I've used various RegEx creator sites, standalone RegEx creator software, and manual editing, all in a futile attempt to create a pattern to parse and extract dynamic data from the e-mail samples below (sanitized to protect the innocent):
Action to Take: Buy shares of Facebook (Nasdaq: FB) at market. Use a 20% trailing stop to protect yourself. ...
Action to Take: Buy Google (Nasdaq: GOOG) at $42.34 or lower. If the stock is above $42.34, don't chase it. Wait for it to come down. Place a stop at $35.75. ...
***Action to Take***
Buy International Business Machines (NYSE: IBM) at market. And use a protective stop at $51. ...
What needs to be parsed is both forms of "Action to Take" sections and the resulting extracted data must include the direction (i.e. buy or sell, but just concerned about buys here), the ticker, the limit price (if applicable) and the stop value as either a percentage or number (if applicable). Sometimes there's also multiple "Action to Take"'s in a single e-mail as well.
Here are examples of what the pattern should not match (or ideally be flexible enough to deal with):
Action to Take: Sell half of your Apple (NYSE: AAPL) April $46 calls for $15.25 or higher. If the spread between the bid and the ask is $0.20 or more, place your order between the bid and the ask - even if the bid is higher than $15.25.
Action to Take: Raise your stop on Apple (NYSE: AAPL) to $75.15.
Action to Take: Sell one-quarter of your Facebook (Nasdaq: FB) position at market. ...
Here's my R code with the latest Perl pattern (to be able to use lookaround in R) that I came up with that sort of works, but not consistently or over multiple saved e-mails:
library(httr)
library("stringr")
filenames <- list.files("R:/TBIRD", pattern="*.eml", full.names=TRUE)
parse <- function(input) {
  text <- readLines(input, warn = FALSE)
  text <- paste(text, collapse = "")
  trim <- regmatches(text, regexpr("Content-Type: text/plain.*Content-Type: text/html", text, perl=TRUE))
  pattern <- "(?is-)(?<=Action to Take).*(?i-s)(Buy|Sell).*(?:\\((?:NYSE|Nasdaq)\\:\\s(\\w+)\\)).*(?:for|at)\\s(\\$\\d*\\.\\d* or|market)\\s"
  df <- str_match(text,pattern)
  return(df)
}
list <- lapply(filenames, function(x){ parse(x) })
table <- do.call(rbind,list)
table <- data.frame(table)
table <- table[rowSums(is.na(table)) < 1, ]
table <- subset(table, select=c("X2","X3","X4"))
The parsing has to operate on the plain-text copy because the HTML is too inconsistent from e-mail to e-mail to parse reliably. Unfortunately, the text copy also commonly has line endings that differ from what the regexp expects, which greatly aggravates things.
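One way to make a task like this more tractable is to stop asking a single pattern for everything and extract each field with its own small regex. The following is only a sketch under stated assumptions, not a tested solution for real e-mails: it runs on the sample snippets quoted above, capture_first is a hypothetical helper, and it assumes stops are always phrased as "N% trailing stop" or "stop at $N".

```r
# Sample snippets copied from the e-mails quoted above.
actions <- c(
  "Action to Take: Buy shares of Facebook (Nasdaq: FB) at market. Use a 20% trailing stop to protect yourself.",
  "Action to Take: Buy Google (Nasdaq: GOOG) at $42.34 or lower. If the stock is above $42.34, don't chase it. Wait for it to come down. Place a stop at $35.75.",
  "***Action to Take***\nBuy International Business Machines (NYSE: IBM) at market. And use a protective stop at $51.",
  "Action to Take: Raise your stop on Apple (NYSE: AAPL) to $75.15."
)

# Hypothetical helper: first capture group per element, NA when nothing matches.
capture_first <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_, character(1))
}

# One small pattern per field.
direction <- capture_first("Action to Take\\W*(Buy|Sell)\\b", actions)
ticker    <- capture_first("\\((?:NYSE|Nasdaq):\\s*(\\w+)\\)", actions)
limit     <- capture_first("\\bat\\s+(\\$\\d+(?:\\.\\d+)?|market)", actions)
stop_pct  <- capture_first("(\\d+(?:\\.\\d+)?%)\\s+trailing stop", actions)
stop_px   <- capture_first("stop at (\\$\\d+(?:\\.\\d+)?)", actions)

parsed <- data.frame(
  direction, ticker, limit,
  stop_value = ifelse(is.na(stop_pct), stop_px, stop_pct),
  stringsAsFactors = FALSE
)

# Keep only the buy recommendations; "Raise your stop" has no direction.
parsed <- parsed[!is.na(parsed$direction) & parsed$direction == "Buy", ]
print(parsed)
```

Because \W* skips any run of non-word characters, the direction pattern covers both the "Action to Take:" form and the starred form followed by a line break. For e-mails with several "Action to Take" sections, the text would first have to be split into one element per section.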
I read a text into R using the readChar() function. I want to test the hypothesis that the sentences of the text contain as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which helped me a great deal with things such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now I need to know the number of sentences in the whole text. Does R have a function that can help me do that? Thank you very much!
Thank you @gui11aume for your answer. A very good package I just found that can do the job is {openNLP}. This is the code:
install.packages("openNLP") ## Installs the required natural language processing (NLP) package
install.packages("openNLPmodels.en") ## Installs the model files for the English language
library(openNLP) ## Loads the package for use in the task
library(openNLPmodels.en) ## Loads the model files for the English language
text = "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again." ## This sentence has unusual punctuation as suggested by @gui11aume
x = sentDetect(text, language = "en") ## sentDetect() is the function to use. It detects and separates sentences in a text. The first argument is the string vector (or text) and the second argument is the language.
x ## Displays the different sentences in the string vector (or text).
[1] "Dr. Brown and Mrs. Theresa will be away from a very long time!!! "
[2] "I can't wait to see them again."
length(x) ## Displays the number of sentences in the string vector (or text).
[1] 2
The {openNLP} package is really great for natural language processing in R and you can find a good and short intro to it here or you can check out the package's documentation here.
Three more languages are supported in the package. You just need to install and load the corresponding model files.
{openNLPmodels.es} for Spanish
{openNLPmodels.ge} for German
{openNLPmodels.th} for Thai
What you are looking for is sentence tokenization, and it is not as straightforward as it seems, even in English (sentences like "I met Dr. Bennett, the ex husband of Mrs. Johson." can contain full stops).
R is definitely not the best choice for natural language processing. If you are proficient in Python, I suggest you have a look at the nltk module, which covers this and many other topics. You can also copy the code from this blog post, which does sentence tokenization and word tokenization.
If you want to stick to R, I would suggest you count the end-of-sentence characters (., ?, !), since you are able to count characters. A way of doing it with a regular expression is like so:
text <- 'Hello world!! Here are two sentences for you...'
length(gregexpr('[[:alnum:] ][.!?]', text)[[1]])
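One detail to watch: when there is no match at all, gregexpr returns a single -1, so taking length() of the result reports 1 rather than 0. A minimal sketch that wraps the same regular expression and handles that case; count_sentences is a hypothetical name, and the approach still counts abbreviations such as "Dr." as sentence ends.

```r
# Hypothetical wrapper around the regex above: count sentence-ending marks
# that directly follow a letter, digit or space.
count_sentences <- function(text) {
  m <- gregexpr("[[:alnum:] ][.!?]", text)[[1]]
  if (m[1] == -1) 0L else length(m)   # gregexpr signals "no match" with -1
}

n_two  <- count_sentences("Hello world!! Here are two sentences for you...")
n_none <- count_sentences("no terminator here")
print(c(n_two, n_none))
```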