I am working with the R programming language.
Suppose I have the following data frame that contains data on 8 different restaurant reviews (taken from here: https://www.consumeraffairs.com/food/mcd.html?page=2#scroll_to_reviews=true) :
text = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.",
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.",
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.",
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.",
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!",
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))
I would like to find out which reviews are similar to each other (e.g. perhaps 1,2 and 3 are similar to each other, 5,7,1 are similar to each other, 7,2 are similar to each other, etc.). I tried to research to see if there is some method that can be used to accomplish this task - in particular, I found out about something called the "Cosine Distance" which is apparently used often for similar tasks in NLP and Text Mining.
I tried to follow the instructions here to accomplish this task: Cosine Similarity Matrix in R
library(tm)
library(proxy)
text = text[,2]
corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf)))
occurrence <- apply(X = tdm,
                    MARGIN = 1,
                    FUN = function(x) sum(x > 0) / ncol(tdm))
tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])
dist(tdm_mat, method = "cosine", upper = TRUE)
My Question: The above code seems to run without errors, but I am not sure if this code is able to indicate which restaurant reviews are similar to one another.
Can someone please show me how to do this?
Thanks!
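One thing worth checking in the code above: `dist()` from proxy compares the rows of the matrix it is given, and the rows of a term-document matrix are terms, not reviews. To compare reviews, the documents have to be the rows (for example by transposing the matrix first). The following is a minimal base R sketch of cosine similarity between documents; the three short texts in `docs` are made-up stand-ins rather than the real reviews, and the tokenization is deliberately crude.

```r
# Made-up stand-ins for reviews (assumption: not the real data).
docs <- c(
  "the crispy chicken sandwich was soggy",
  "the new crispy chicken sandwich had no flavor",
  "they charged my card twice and refused a refund"
)

# Tokenize: lower-case, then split on anything that is not a letter or apostrophe.
tokens <- strsplit(tolower(docs), "[^a-z']+")

# Document-term matrix of raw counts (rows = documents, columns = terms).
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))
dimnames(dtm) <- list(paste0("review_", seq_along(docs)), vocab)

# Cosine similarity: dot products divided by the product of the vector norms.
norms <- sqrt(rowSums(dtm^2))
cos_sim <- (dtm %*% t(dtm)) / (norms %o% norms)
round(cos_sim, 2)
```

Values close to 1 mean two reviews share most of their vocabulary, and values close to 0 mean they share almost none. With tm and proxy, `1 - as.matrix(dist(t(tdm_mat), method = "cosine"))` should give the same kind of matrix, assuming `tdm_mat` still has terms in rows.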
Related
I'm working with the tm package in R.
I have several .txt files in a folder and a list of 30 sentences.
Now I need to check whether my files contain these sentences.
How can I write a program that matches whole sentences rather than single words?
Below is a potential approach. Also you may want to look into the readtext package for quickly reading in an entire directory of files as text in one function call.
library(dplyr)     # provides tibble(), mutate() and %>%
library(tidytext)
library(stringr)
sample_text <- "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth."
# this must be lower-case because tidytext will tokenize to lower-case by default
sentence_to_match <- "we are met on a great battle-field of that war."
sentences_df <- tibble(text = sample_text) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  # fixed() matches the sentence literally, so "." is not treated as a regex wildcard
  mutate(sentence_match = str_detect(sentence, fixed(sentence_to_match)))
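To cover the original case of several files and a list of sentences, the same idea can be applied to every file/sentence pair. This is a rough base R sketch: `files` is a hypothetical stand-in for text already read from disk (for instance with `readLines()`), and `sentences` holds only three made-up entries instead of 30.

```r
# Hypothetical stand-ins for file contents (assumption: already read into memory).
files <- list(
  a.txt = "We are met on a great battle-field of that war. It is fitting.",
  b.txt = "The world will little note, nor long remember what we say here."
)

# Sentences to look for, written in lower case to match the lower-cased text.
sentences <- c("we are met on a great battle-field of that war.",
               "the world will little note",
               "four score and seven years ago")

# Rows = sentences, columns = files; TRUE where the sentence occurs in the file.
# fixed = TRUE makes grepl() search for the literal string, not a regex.
found <- sapply(files, function(txt) {
  vapply(sentences,
         function(s) grepl(s, tolower(txt), fixed = TRUE),
         logical(1))
})
found
```

Literal matching like this is sensitive to differences in whitespace and punctuation, so some normalization of both the files and the sentences may be needed on real data.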
I want to read a text document in R under the following condition:
Given certain keywords, find every sentence that contains a keyword and ends with a full stop (.), and store only those sentences in a list.
Output: a list containing only the sentences that include the keyword.
I tried the scan function like this:
b<-scan("cbt14-Short Stories For Children.txt",what = "char",sep = '.', nlines = 50)
The scan function has many parameters that I do not understand yet.
Can the above output be achieved using the scan function?
keyword = "ship"
input--
this article u can read from "www.google.com/ship".
Illustrated by Subir Roy and Geeta Verma Man Overboard
I stood on the deck of S.S. Rajula. As she slowly moved out of Madras harbour, I waved to my grandparents till I could see them no more. I was thrilled to be on board a ship. It was a new experience for me.
"Are you travelling alone?" asked the person standing next to me.
"Yes, Uncle, I'm going back to my parents in Singapore," I replied.
"What's your name?" he asked. "Vasantha," I replied. I spent the day exploring the ship. It looked just like a big house. There were furnished rooms, a swimming pool, a room for indoor games, and a library. Yet, there was plenty of room to 11111 around. The next morning the passengers were seated in the dining hall, having breakfast. The loudspeaker spluttered noisily and then the captain's voice came loud and clear. "Friends we have just received a message that a storm is brewing in the Indian Ocean. I request all of you to keep calm. Do not panic. Those who are inclined to sea-
3
output list--
[1]this article u can read from "www.google.com/ship".
[2]I was thrilled to be on board a ship.
[3] I spent the day exploring the ship.
The difficult part of this problem is properly separating the sentences. In this case I am using the period followed by a space ". " to define a sentence. In this sample it does produce a sentence with a single word - "Rajula" but this may be acceptable depending on your final application.
# split the text into sentences on ". " (unlist() also covers b being a vector of several strings)
sentences <- unlist(strsplit(b, "\\. "))
# keep the sentences that contain the word "ship"
finallist <- sentences[grepl("ship", sentences)]
The above code uses base R. The stringi or stringr packages may offer a function that handles sentence splitting better.
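As a variation on the approach above, here is a small base R sketch that keeps the sentence-ending punctuation by splitting on the whitespace that follows ".", "!" or "?". The vector `txt` is a shortened stand-in for the lines of the story file, and the splitting rule is an assumption that still breaks on abbreviations such as "S.S.".

```r
# Shortened stand-in for the lines of the input file (assumption).
txt <- c('this article u can read from "www.google.com/ship".',
         "I was thrilled to be on board a ship. It was a new experience for me.",
         "I spent the day exploring the ship. It looked just like a big house.")

# Collapse the lines into one string, then split on whitespace that follows
# sentence-ending punctuation; the lookbehind keeps the punctuation attached.
one_string <- paste(txt, collapse = " ")
sentences <- unlist(strsplit(one_string, "(?<=[.!?])\\s+", perl = TRUE))

# Keep only the sentences containing the keyword, stored as a list.
keyword <- "ship"
matches <- as.list(sentences[grepl(keyword, sentences, fixed = TRUE)])
matches
```

For a real file, `txt` could come from `readLines()`, which avoids having to work out the `sep` and `nlines` arguments of `scan()`.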
I am scraping Amazon customer reviews using R and have come across a bug that I was hoping someone might have some insight into.
I have noticed that R fails to scrape the specified node (found by using SelectorGadget) from all reviews. Each time I run the script I retrieve a different number of reviews, but never all of them. This is very frustrating since the goal is to scrape the reviews and compile them into csv files that can later be manipulated using R. Essentially, if a product has 200 reviews, when I run the script, sometimes I will get 150 reviews, sometimes 75 reviews, etc., but never the entire 200. This issue seems to happen after I have done repeated scraping.
I have also gotten a few timeout errors, specifically "Error in open.connection(x, "rb") : Timeout was reached".
How do I get around this to continue scraping? I am a beginner but any help or insight is greatly appreciated!!
library(rvest)  # read_html(), html_nodes(), html_text() and %>%
url <- "https://www.amazon.com/Match-Mens-Wild-Cargo-Pants/product-reviews/B009HLOZ9U/ref=cm_cr_arp_d_show_all?ie=UTF8&reviewerType=all_reviews&pageNumber="
N_pages <- 204
A <- NULL
for (j in 1:N_pages) {
  pant <- read_html(paste0(url, j))
  B <- cbind(pant %>% html_nodes(".review-text") %>% html_text())
  A <- rbind(A, B)
}
tail(A)
print(j)
Is this not working for you?
Setting the URL as below and running the same loop:
library(rvest)
url <- "https://www.amazon.com/Match-Mens-Wild-Cargo-Pants/product-reviews/B009HLOZ9U/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&reviewerType=avp_only_reviews&sortBy=recent&pageNumber="
N_pages <- 204
A <- NULL
for (j in 1:N_pages) {
  pant <- read_html(paste0(url, j))
  B <- cbind(pant %>% html_nodes(".review-text") %>% html_text())
  A <- rbind(A, B)
}
tail(A)
[,1]
[1938,] "This is really a good item to get. Trendy, probably you can choose a different color, it fits good but I wouldn't say perfect."
[1939,] "I don't write reviews for most products, but I felt the need to do so for these pants for a couple reasons. First, they are great pants! Solid material, well-made, and they fit great. Second, I want to echo those who say you need to go up in size when you order. I wear anywhere from 32-34, depending on the brand. I ordered these in a 36 and they fit like a 33 or 34. I really love the look and feel of these, and will be ordering more!"
[1940,] "I bought the green one before, it is good quality and looks nice, than I purchased the similar one, but the khaki color, but received absolutely different product, different material. really disappointed."
[1941,] "These pants are great! I have been looking to update my wardrobe with a more edgy style; these cargo pants deliver on that. Paired with some casual sneakers or a decent nubuck leather boot completes the look from the waist down. The lazy-casual look is great when traveling, as are the many pockets. I wore these pants on a recent day trip to NYC and traveled comfortably with essential items contained in the 8 pockets. I placed a second order shortly after my first pair arrived because I like them so much. Shipping and delivery is also fairly fast, considering these pants ship from China!"
[1942,] "Pants are awesome, just like the picture. The size runs small, so if you order them I would order them bigger than normal. I usually wear a 34inch waist because i dont like my pants snug, these pants fit more like a 32 inch waist.Other than that i love them!"
[1943,] "the good:Pants are made from the durable cotton that has a nice feel; have a lot of useful features and roomy well placed pockets; durable stitching.the bad:Pants will shrink and drier/hot water is not recommended. Would have been better if the cotton was pretreated to prevent shrinking. I would gladly gave up the belt if I wouldn't have to wary about how to wash these pants.the ugly:faux pocket with a zipper. useless feature. on my pair came with a bright gold zipper, unlike a silver in a picture."
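For the intermittent "Timeout was reached" errors mentioned in the question, one common workaround is to retry a failed request after a pause instead of letting the loop stop. The sketch below is an illustration only: `with_retry()` and `flaky()` are hypothetical helper names, and `flaky()` merely simulates a request that fails twice so the example runs without network access.

```r
# Hypothetical helper: call fun() up to max_tries times, pausing `wait`
# seconds between attempts, and return the first successful result.
with_retry <- function(fun, max_tries = 3, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait)
  }
  stop("All ", max_tries, " attempts failed")
}

# Stand-in for a request that times out twice before succeeding.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Timeout was reached")
  "page content"
}

out <- with_retry(flaky, max_tries = 5)
out
```

Inside the scraping loop this could be used as `pant <- with_retry(function() read_html(paste0(url, j)), wait = 5)`. Pausing between pages may also help if the site throttles repeated requests, although that is an assumption about the cause of the missing reviews.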
I am trying to automatically make a big corpus into a numeric list. One number per line. For example I have the following data:
Df.txt =
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let's hear it! :)
First I read the text using the command readLines:
text <- readLines("Df.txt", encoding = "UTF-8")
Second, I convert the text to lower case and remove unnecessary spacing:
library(stringr)
## Convert input to lower case:
lower_text <- tolower(text)
## Remove leading and trailing spaces:
Spaces_remove <- str_trim(lower_text)
From here, I would like to assign each line a number, e.g.:
"In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." = 1
"We love you Mr. Brown." = 2
...
"If you have an alternative argument, let's hear it! :)" = 6
Any ideas?
You already do kinda have numeric line # associations with the vector (it's indexed numerically), but…
text_input <- 'In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one\'s life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE\'s new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let\'s hear it! :)'
library(dplyr)
library(purrr)
library(stringi)
textConnection(text_input) %>%
  readLines(encoding = "UTF-8") %>%
  stri_trans_tolower() %>%
  stri_trim() -> corpus
# data frame with explicit line # column (tibble() replaces the deprecated data_frame())
df <- tibble(line_number = seq_along(corpus), text = corpus)
# list with an explicit line number field
lst <- map(seq_along(corpus), ~list(line_number = ., text = corpus[.]))
# implicit list numeric ids
as.list(corpus)
# explicit list numeric id's (but they're really string keys)
setNames(as.list(corpus), seq_along(corpus))
# named vector
set_names(corpus, seq_along(corpus))
There are a plethora of R packages that significantly ease the burden of text processing/NLP ops. Doing this work outside of them is likely to be reinventing the wheel. The CRAN NLP Task View lists many of them.
I have the following function to predict the next word using trigrams. The libraries that I am using are: ngrams, RWeka and tm.
f <- function(queryHistoryTab, query, n = 2) {
  require(tau)
  trigrams <- sort(textcnt(rep(tolower(names(queryHistoryTab)), queryHistoryTab),
                           method = "string",
                           n = length(scan(text = query, what = "character", quiet = TRUE)) + 1))
  query <- tolower(query)
  idx <- which(substr(names(trigrams), 0, nchar(query)) == query)
  res <- head(names(sort(trigrams[idx], decreasing = TRUE)), n)
  res <- substr(res, nchar(query) + 2, nchar(res))
  return(res)
}
To feed the function I have to build a corpus. For this purpose I am using a data set that consists of textual data extracted from US blogs:
text1 <- readLines("en_US.news.txt", encoding = "UTF-8")
corpus <- Corpus(VectorSource(text1))
The class of the corpus is
>class(corpus)
[1] "VCorpus" "Corpus"
Nevertheless, when I try to guess the two most common next words of a sentence, I get the following error:
f(corpus, "I will like a")
Error in textcnt(rep(tolower(names(queryHistoryTab)), queryHistoryTab), :
(list) object cannot be coerced to type 'integer'
Here are the first lines of the en_US.news.txt in case you want to test it yourselves:
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let's hear it! :)
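A possible reading of the error, offered as an assumption rather than a confirmed diagnosis: inside `f`, the call `rep(tolower(names(queryHistoryTab)), queryHistoryTab)` uses `queryHistoryTab` as the repeat counts, so it seems to expect a named vector of integer counts (each distinct text with its frequency), not a tm corpus, which is a list. The sketch below builds such a vector from a few of the sample lines with `table()`; the names `lines_in` and `counts` are made up for illustration, and tau is not needed to run it.

```r
# A few sample lines, with one repeated so the counts differ (assumption).
lines_in <- c("We love you Mr. Brown.",
              "We love you Mr. Brown.",
              "If you have an alternative argument, let's hear it! :)")

# Named integer vector: names are the distinct lines, values are their counts.
tab <- table(lines_in)
counts <- setNames(as.integer(tab), names(tab))

# This is the expression from f() that failed when given a corpus;
# with a named count vector, rep() can use the values as repeat counts.
expanded <- rep(tolower(names(counts)), counts)
expanded
```

Calling `f(counts, "we love")` should then get past the coercion error, assuming tau is installed, although whether the predictions are useful depends on how the text is split into n-grams.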