How to extract the longest string between patterns in HTML [in R]

I am extracting text from the HTML of a series of articles. However, I have yet to get the articles into a format I am happy with. More specifically, I want to find the longest string between occurrences of a pattern ("\n").
The code I use now is the following:
library(newsanchor)
library(htm2txt)
library(RCurl)
library(XML)

results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
test$txt <- NA

# loop over the 22 articles in the result set
for (i in 1:22) {
  tryCatch({
    # download the article page and pull the text of all <p> nodes
    html <- getURL(test$url[i], followlocation = TRUE)
    doc <- htmlParse(html, asText = TRUE)
    plain.text <- xpathSApply(doc, "//p", xmlValue)
    test$txt[i] <- paste(plain.text, collapse = "\n")
  }, error = function(e) {})
  print(i)
}
The result looks something like this:
[1] "EDITION\nUS President Donald Trump has made his first meaningful remarks on the Huawei firestorm since his administration blacklisted the Chinese tech giant last week.\nThe president was speaking at a news conference announcing a $US16 billion aid package for farmers caught up in the China trade war when he addressed Huawei, which has been placed on a list that means US firms need permission to do business with the Chinese company.\nTrump started out by saying that Huawei poses a huge security threat to the US. US officials have long floated suspicions that Huawei acts as a conduit for Chinese surveillance.\n“Huawei is something that’s very dangerous. You look at what they have done from a security standpoint, from a military standpoint, it’s very dangerous,” the president told reporters.\n Read more: Here are all the companies that have cut ties with Huawei, dealing the Chinese tech giant a crushing blow\nHe then immediately switched gears to suggest that Huawei could form part of a trade deal with America and China. “So it’s possible that Huawei even would be included in some kind of a trade deal. If we made a deal, I could imagine Huawei being possibly included in some form,” he said.\n\"Huawei is very dangerous,\" Trump says, adding that an exception for the company could be made in a trade deal with China pic.twitter.com/TFlClewBNt\n— TicToc by Bloomberg (#tictoc) May 23, 2019\n\nTrump: “Huawei is something that’s very dangerous. You look at what they have done from a security standpoint, from a military standpoint, it’s very dangerous. So, it’s possible that Huawei even would be included in some kind of a trade deal. If we made a deal, I could imagine Huawei being possibly included in some form of, or some part of a trade deal.”\nJournalist: “How would that look?”\nTrump: “It would look very good for us.”\nJournalist:
I hope to extract the most important part: the actual article body. I am not sure how best to do this, but I think it might be to find the longest string between two occurrences of "\n". Can anyone help with that, or perhaps suggest a better method?

Edit: #user101 explained that nchar is vectorized. Here is a simpler solution (where article is one element of test$txt):
splitarticle <- unlist(strsplit(article, "\n"))
splitarticle[which.max(nchar(splitarticle))]
Something like this could work, unless I misunderstood what you're trying to do.
splitarticle <- unlist(strsplit(article, "\n"))
lengths <- unlist(lapply(splitarticle, nchar))
splitarticle[match(max(lengths), lengths)]
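For completeness, here is the whole idea on a tiny inline example (the sample string is made up; in the real code you would pass one element of test$txt):

```r
# made-up sample; in the real code this would be one element of test$txt
article <- "EDITION\nShort line.\nThis is the much longer body paragraph of the article.\nByline"
parts <- unlist(strsplit(article, "\n", fixed = TRUE))
# nchar() is vectorized, so which.max() picks the longest line directly
longest <- parts[which.max(nchar(parts))]
longest
# [1] "This is the much longer body paragraph of the article."
```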

Related

R: Calculating the Cosine Similarity Between Restaurant Reviews

I am working with the R programming language.
Suppose I have the following data frame that contains data on 8 different restaurant reviews (taken from https://www.consumeraffairs.com/food/mcd.html?page=2#scroll_to_reviews=true):
text = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.",
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.",
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.",
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.",
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!",
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))
I would like to find out which reviews are similar to each other (e.g. perhaps 1, 2, and 3 are similar to each other, 5, 7, 1 are similar to each other, 7, 2 are similar to each other, etc.). I researched whether there is some method that can be used to accomplish this task, and in particular I found something called "Cosine Distance", which is apparently often used for similar tasks in NLP and text mining.
I tried to follow the instructions here to accomplish this task: Cosine Similarity Matrix in R
library(tm)
library(proxy)

text = text[, 2]
corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf)))
occurrence <- apply(X = tdm,
                    MARGIN = 1,
                    FUN = function(x) sum(x > 0) / ncol(tdm))
tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])
dist(tdm_mat, method = "cosine", upper = TRUE)
My Question: The above code seems to run without errors, but I am not sure whether it indicates which restaurant reviews are similar to one another.
Can someone please show me how to do this?
Thanks!
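One thing to check: dist() computes distances between the rows of its input, and in a TermDocumentMatrix the rows are terms, not documents, so the code above likely compares terms; to compare reviews you would transpose the matrix (or use DocumentTermMatrix). As a minimal base-R sketch of what cosine similarity between documents looks like (the toy counts and words below are made up, not derived from the question's data):

```r
# toy document-term counts (rows = reviews); the words are illustrative only
dtm <- rbind(
  review1 = c(card = 2, refund = 1, chicken = 0),
  review2 = c(card = 2, refund = 1, chicken = 0),
  review3 = c(card = 0, refund = 0, chicken = 3)
)
# cosine similarity of two count vectors: dot product over product of norms
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# pairwise similarity matrix: 1 = same direction, 0 = no shared terms
sim <- outer(seq_len(nrow(dtm)), seq_len(nrow(dtm)),
             Vectorize(function(i, j) cosine(dtm[i, ], dtm[j, ])))
round(sim, 2)
```

With proxy's dist(..., method = "cosine"), the analogous similarity is 1 minus the reported distance, so small distances mark the similar review pairs.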

How does the "sentimentr" package split a paragraph into more than one sentence?

I am trying to run sentiment analysis in R using the "sentimentr" package. I fed in a list of comments and the output contained element_id, sentence_id, word_count, and sentiment. Comments with long phrases are getting split into single sentences. I want to know the logic the package uses to do that.
I have 4 main categories for my comments: Food, Atmosphere, Price, and Service. I have also set bigrams for those themes; I am trying to split sentences based on the themes.
install.packages("sentimentr")
library(sentimentr)
data <- read.csv("Comments.csv")
data_new <- as.matrix(data)
scores <- sentiment(data_new)
#scores
write.csv(scores,"results.csv")
For example, "We had a large party of about 25, so some issues were understandable. But the servers seemed totally overwhelmed. There are so many issues I cannot even begin to explain. Simply stated food took over an hour to be served, it was overcooked when it arrived, my son had a steak that was charred, manager came to table said they were now out of steak, I could go on and on. We were very disappointed" got split up into 5 sentences:
1) We had a large party of about 25, so some issues were understandable
2) But the servers seemed totally overwhelmed.
3) There are so many issues I cannot even begin to explain.
4) Simply stated food took over an hour to be served, it was overcooked when it arrived, my son had a steak that was charred, manager came to table said they were now out of steak, I could go on and on.
5) We were very disappointed
I want to know whether there is any semantic logic behind the splitting, or whether it is just based on full stops.
It uses textshape::split_sentence(), see https://github.com/trinker/sentimentr/blob/e70f218602b7ba0a3f9226fb0781e9dae28ae3bf/R/get_sentences.R#L32
A bit of searching found the logic is here:
https://github.com/trinker/textshape/blob/13308ed9eb1c31709294e0c2cbdb22cc2cac93ac/R/split_sentence.R#L148
I.e., yes, it splits on ?, ., and !, but then uses a bunch of regexes to look for exceptions, such as "No.7" and "Philip K. Dick".
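As a rough base-R illustration of the first stage only (the real textshape code additionally patches up abbreviations, initials, numbers, and other exceptions), a naive split on sentence-final punctuation looks like this:

```r
para <- paste("We had a large party of about 25, so some issues were understandable.",
              "But the servers seemed totally overwhelmed.",
              "We were very disappointed.")
# split after ., ! or ? when followed by whitespace (naive: no exception handling)
sentences <- unlist(strsplit(para, "(?<=[.!?])\\s+", perl = TRUE))
length(sentences)
# [1] 3
```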

Text mining on sentences with the tm.package in R

I'm working with the tm package in R.
I have several txt.files in a folder and a list of 30 sentences.
Now I have to check whether my files contain these sentences.
How can I write a program that considers sentences rather than single words?
Below is a potential approach. Also you may want to look into the readtext package for quickly reading in an entire directory of files as text in one function call.
library(tidytext)
library(stringr)
sample_text <- "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth."
# this must be lower-case because tidytext will tokenize to lower-case by default
sentence_to_match <- "we are met on a great battle-field of that war."
sentences_df <- tibble(text = sample_text) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_match = str_detect(sentence, sentence_to_match))
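If you prefer base R, a minimal sketch of the same check, i.e. which of a vector of sentences occur verbatim in a file's text, could look like this (the file contents below are invented; in practice file_text would come from readLines on each file):

```r
# invented file contents; in practice:
# file_text <- tolower(paste(readLines(path), collapse = " "))
file_text <- tolower("We are met on a great battle-field of that war. It is altogether fitting and proper.")
sentences <- tolower(c("we are met on a great battle-field of that war.",
                       "this sentence is not in the file."))
# fixed = TRUE: match literally, so punctuation in the sentences is not treated as regex
found <- vapply(sentences, grepl, logical(1), x = file_text, fixed = TRUE)
found
```

Looping this over every file in the folder gives a sentences-by-files logical matrix of matches.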

How to read a text document in R?

I want to read a text document in R under the following condition:
it should read the sentences and, whenever it finds the keyword in a sentence ended with a full stop (.), store only those sentences in a list.
Output: a list containing only the sentences that include the particular keyword.
I tried the scan function like this:
b <- scan("cbt14-Short Stories For Children.txt", what = "char", sep = '.', nlines = 50)
The scan function has so many parameters that I am unable to understand them all right now.
Can we achieve the above output using the scan function?
keyword = "ship"
Input:
this article u can read from "www.google.com/ship".
Illustrated by Subir Roy and Geeta Verma Man Overboard
I stood on the deck of S.S. Rajula. As she slowly moved out of Madras harbour, I waved to my grandparents till I could see them no more. I was thrilled to be on board a ship. It was a new experience for me.
"Are you travelling alone?" asked the person standing next to me.
"Yes, Uncle, I'm going back to my parents in Singapore," I replied.
"What's your name?" he asked. "Vasantha," I replied. I spent the day exploring the ship. It looked just like a big house. There were furnished rooms, a swimming pool, a room for indoor games, and a library. Yet, there was plenty of room to 11111 around. The next morning the passengers were seated in the dining hall, having breakfast. The loudspeaker spluttered noisily and then the captain's voice came loud and clear. "Friends we have just received a message that a storm is brewing in the Indian Ocean. I request all of you to keep calm. Do not panic. Those who are inclined to sea-
3
Output list:
[1] this article u can read from "www.google.com/ship".
[2] I was thrilled to be on board a ship.
[3] I spent the day exploring the ship.
The difficult part of this problem is properly separating the sentences. In this case I am using a period followed by a space (". ") to define a sentence boundary. In this sample it does produce a sentence with a single word, "Rajula", but this may be acceptable depending on your final application.
# split the text into sentences using ". "
sentences <- strsplit(b, "\\. ")
# find the sentences containing the word "ship"
finallist <- sentences[[1]][grepl("ship", sentences[[1]])]
The above code uses base R. Looking into the stringi or stringr libraries, there may be a function that handles splitting a string into sentences more robustly.
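Putting it together on an inline sample (in practice b would come from scan()/readLines(); this variant keeps each trailing full stop by splitting on the whitespace after it, so the output matches the requested list):

```r
b <- paste('this article u can read from "www.google.com/ship".',
           "I was thrilled to be on board a ship. It was a new experience for me.",
           "I spent the day exploring the ship. It looked just like a big house.")
# split on whitespace that follows a period, so sentences keep their full stop;
# periods inside "www.google.com/ship" are not followed by a space and survive
sentences <- unlist(strsplit(b, "(?<=\\.)\\s+", perl = TRUE))
# keep only the sentences containing the keyword
finallist <- sentences[grepl("ship", sentences)]
finallist
```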

Preparing the corpus for n-gram model word prediction in R

I have the following function to predict the next word using trigrams. The libraries that I am using are: ngrams, RWeka and tm.
f <- function(queryHistoryTab, query, n = 2) {
  require(tau)
  trigrams <- sort(textcnt(rep(tolower(names(queryHistoryTab)), queryHistoryTab),
                           method = "string",
                           n = length(scan(text = query, what = "character", quiet = TRUE)) + 1))
  query <- tolower(query)
  idx <- which(substr(names(trigrams), 0, nchar(query)) == query)
  res <- head(names(sort(trigrams[idx], decreasing = TRUE)), n)
  res <- substr(res, nchar(query) + 2, nchar(res))
  return(res)
}
In order to feed the function I have to set up a corpus. For this purpose I am using a data set that consists of textual data extracted from US blogs:
text1 <- readLines("en_US.news.txt", encoding = "UTF-8")
corpus <- Corpus(VectorSource(text1))
The class of the corpus is
> class(corpus)
[1] "VCorpus" "Corpus"
Nevertheless, when I try to predict the two most likely next words of a sentence I get the following error:
f(corpus, "I will like a")
Error in textcnt(rep(tolower(names(queryHistoryTab)), queryHistoryTab), :
(list) object cannot be coerced to type 'integer'
Here are the first lines of the en_US.news.txt in case you want to test it yourselves:
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let's hear it! :)
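One likely cause of the error: f() expects queryHistoryTab to be a named count vector (names are text snippets, values their frequencies), so that rep(tolower(names(queryHistoryTab)), queryHistoryTab) can repeat each snippet by its count; a VCorpus is a list and cannot be coerced to the integer times argument of rep(). A minimal sketch of building such a table (the sample lines are invented stand-ins for readLines("en_US.news.txt")):

```r
# stand-in for text1 <- readLines("en_US.news.txt", encoding = "UTF-8")
text1 <- c("we love you mr brown", "we love you too", "we love you mr brown")
# named count vector: names are the lines, values how often each line occurs
queryHistoryTab <- table(text1)
names(queryHistoryTab)       # the text snippets rep() will repeat
as.integer(queryHistoryTab)  # the counts rep() needs
# then call f(queryHistoryTab, "we love") instead of passing the corpus
```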
