Scraping Amazon Customer Reviews

I am scraping Amazon customer reviews using R and have come across a bug that I was hoping someone might have some insight into.
I have noticed that R fails to scrape the specified node (found using SelectorGadget) from all reviews. Each time I run the script I retrieve a different number of reviews, but never all of them. For example, if a product has 200 reviews, one run returns 150 reviews and another returns 75, but never the full 200. This is very frustrating, since the goal is to scrape the reviews and compile them into CSV files that can later be manipulated in R. The issue seems to appear after I have scraped repeatedly.
I have also gotten a few timeout errors, specifically "Error in open.connection(x, "rb") : Timeout was reached".
How do I get around this to continue scraping? I am a beginner, but any help or insight is greatly appreciated!
library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
url <- "https://www.amazon.com/Match-Mens-Wild-Cargo-Pants/product-reviews/B009HLOZ9U/ref=cm_cr_arp_d_show_all?ie=UTF8&reviewerType=all_reviews&pageNumber="
N_pages <- 204
A <- NULL
for (j in 1:N_pages) {
  # read review page j and extract the text of every review on it
  pant <- read_html(paste0(url, j))
  B <- cbind(pant %>% html_nodes(".review-text") %>% html_text())
  A <- rbind(A, B)
}
tail(A)
print(j)

Is this not working for you? I set the URL as follows and ran the same loop:
library(rvest)
url <- "https://www.amazon.com/Match-Mens-Wild-Cargo-Pants/product-reviews/B009HLOZ9U/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&reviewerType=avp_only_reviews&sortBy=recent&pageNumber="
N_pages <- 204
A <- NULL
for (j in 1:N_pages) {
  pant <- read_html(paste0(url, j))
  B <- cbind(pant %>% html_nodes(".review-text") %>% html_text())
  A <- rbind(A, B)
}
tail(A)
[,1]
[1938,] "This is really a good item to get. Trendy, probably you can choose a different color, it fits good but I wouldn't say perfect."
[1939,] "I don't write reviews for most products, but I felt the need to do so for these pants for a couple reasons. First, they are great pants! Solid material, well-made, and they fit great. Second, I want to echo those who say you need to go up in size when you order. I wear anywhere from 32-34, depending on the brand. I ordered these in a 36 and they fit like a 33 or 34. I really love the look and feel of these, and will be ordering more!"
[1940,] "I bought the green one before, it is good quality and looks nice, than I purchased the similar one, but the khaki color, but received absolutely different product, different material. really disappointed."
[1941,] "These pants are great! I have been looking to update my wardrobe with a more edgy style; these cargo pants deliver on that. Paired with some casual sneakers or a decent nubuck leather boot completes the look from the waist down. The lazy-casual look is great when traveling, as are the many pockets. I wore these pants on a recent day trip to NYC and traveled comfortably with essential items contained in the 8 pockets. I placed a second order shortly after my first pair arrived because I like them so much. Shipping and delivery is also fairly fast, considering these pants ship from China!"
[1942,] "Pants are awesome, just like the picture. The size runs small, so if you order them I would order them bigger than normal. I usually wear a 34inch waist because i dont like my pants snug, these pants fit more like a 32 inch waist.Other than that i love them!"
[1943,] "the good:Pants are made from the durable cotton that has a nice feel; have a lot of useful features and roomy well placed pockets; durable stitching.the bad:Pants will shrink and drier/hot water is not recommended. Would have been better if the cotton was pretreated to prevent shrinking. I would gladly gave up the belt if I wouldn't have to wary about how to wash these pants.the ugly:faux pocket with a zipper. useless feature. on my pair came with a bright gold zipper, unlike a silver in a picture."
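The timeout errors mentioned in the question can at least be caught and retried instead of aborting the loop. The following is only a minimal sketch in base R: with_retry, max_tries and wait_seconds are hypothetical names, not part of rvest, and the idea that pausing between requests helps on this particular site is an assumption rather than something confirmed in the thread.

```r
# Hypothetical helper: call fetch(...) and retry on error,
# pausing a little longer after each failed attempt.
with_retry <- function(fetch, ..., max_tries = 3, wait_seconds = 1) {
  for (attempt in seq_len(max_tries)) {
    # turn an error into a value so the loop can inspect it
    result <- tryCatch(fetch(...), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) {
      Sys.sleep(wait_seconds * attempt)
    }
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Possible use inside the loop from the question (assumes rvest is loaded):
# pant <- with_retry(read_html, paste0(url, j))
```

Because with_retry takes the fetching function as an argument, it can be tried on a dummy function first, without sending any requests.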

Related

R: Calculating the Cosine Similarity Between Restaurant Reviews

I am working with the R programming language.
Suppose I have the following data frame that contains data on 8 different restaurant reviews (taken from here: https://www.consumeraffairs.com/food/mcd.html?page=2#scroll_to_reviews=true) :
text = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.",
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.",
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.",
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.",
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!",
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))
I would like to find out which reviews are similar to each other (e.g. perhaps 1,2 and 3 are similar to each other, 5,7,1 are similar to each other, 7,2 are similar to each other, etc.). I tried to research to see if there is some method that can be used to accomplish this task - in particular, I found out about something called the "Cosine Distance" which is apparently used often for similar tasks in NLP and Text Mining.
I tried to follow the instructions here to accomplish this task: Cosine Similarity Matrix in R
library(tm)
library(proxy)
text = text[,2]
corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf)))
occurrence <- apply(X = tdm,
                    MARGIN = 1,
                    FUN = function(x) sum(x > 0) / ncol(tdm))
tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])
dist(tdm_mat, method = "cosine", upper = TRUE)
My Question: The above code seems to run without errors, but I am not sure if this code is able to indicate which restaurant reviews are similar to one another.
Can someone please show me how to do this?
Thanks!
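One thing that may be worth checking: in a term-document matrix the rows are terms, so a row-wise distance would compare terms rather than reviews. As an illustration only, here is a minimal base R sketch of cosine similarity between documents. The three short review strings are made-up stand-ins, not the data above, and the tokenization is deliberately crude.

```r
# Made-up stand-in reviews (assumption: not the data from the question)
reviews <- c(
  "the crispy chicken sandwich was soggy",
  "the crispy chicken sandwich was soggy",
  "they charged my card twice"
)

# Split each review into lowercase word tokens, dropping empty strings
tokens <- lapply(tolower(reviews), function(r) {
  tok <- strsplit(r, "[^a-z']+")[[1]]
  tok[nzchar(tok)]
})
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per review, one column per word
counts <- vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
)
dtm <- t(counts)

# Cosine similarity: dot product divided by the product of the row norms
norms <- sqrt(rowSums(dtm^2))
cos_sim <- (dtm %*% t(dtm)) / (norms %o% norms)
round(cos_sim, 2)
```

Entry [i, j] of cos_sim is the similarity between review i and review j: values near 1 mean the two reviews use much the same words, values near 0 mean they share almost none.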

How does "sentimentr" package split a paragraph or sentences into more than 1 sentences?

I am trying to run sentiment analysis in R using the "sentimentr" package. I fed in a list of comments and the output contained element_id, sentence_id, word_count and sentiment. Comments with long phrases are being split into individual sentences. I want to know what logic the package uses to do that.
I have 4 main categories for my comments: Food, Atmosphere, Price and Service. I have also set bigrams for those themes, and I am trying to split sentences based on themes.
install.packages("sentimentr")
library(sentimentr)
data <- read.csv("Comments.csv")
data_new <- as.matrix(data)
scores <- sentiment(data_new)
#scores
write.csv(scores,"results.csv")
For example, " We had a large party of about 25, so some issues were understandable. But the servers seemed totally overwhelmed. There are so many issues I cannot even begin to explain. Simply stated food took over an hour to be served, it was overcooked when it arrived, my son had a steak that was charred, manager came to table said they were now out of steak, I could go on and on. We were very disappointed" got split up into 5 sentences:
1) We had a large party of about 25, so some issues were understandable
2) But the servers seemed totally overwhelmed.
3) There are so many issues I cannot even begin to explain.
4) Simply stated food took over an hour to be served, it was overcooked when it arrived, my son had a steak that was charred, manager came to table said they were now out of steak, I could go on and on.
5) We were very disappointed
Is there any semantic logic behind the splitting, or is it just based on full stops?

It uses textshape::split_sentence(), see https://github.com/trinker/sentimentr/blob/e70f218602b7ba0a3f9226fb0781e9dae28ae3bf/R/get_sentences.R#L32
A bit of searching found the logic is here:
https://github.com/trinker/textshape/blob/13308ed9eb1c31709294e0c2cbdb22cc2cac93ac/R/split_sentence.R#L148
I.e. yes it is splitting on ?.!, but then it is using a bunch of regexes to look for exceptions, such as "No.7" and "Philip K. Dick".
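To make the idea concrete, here is a deliberately simplified base R sketch of "split on ?.! but protect known exceptions". It is not the textshape implementation: split_sentences_naive, its abbreviation list and the <prd> placeholder are all made up for illustration.

```r
# Hypothetical, simplified splitter (not textshape's actual code)
split_sentences_naive <- function(x, abbreviations = c("Mr", "Mrs", "Dr")) {
  # Protect the period after each known abbreviation with a placeholder
  for (a in abbreviations) {
    x <- gsub(paste0("\\b", a, "\\."), paste0(a, "<prd>"), x, perl = TRUE)
  }
  # Split on whitespace that directly follows ., ? or !
  parts <- strsplit(x, "(?<=[.?!])\\s+", perl = TRUE)[[1]]
  # Put the protected periods back
  gsub("<prd>", ".", parts, fixed = TRUE)
}

split_sentences_naive("We love you Mr. Brown. But the servers seemed overwhelmed! Why?")
```

The call above returns three sentences, with "Mr. Brown." kept together in the first one. A long comment that is joined only by commas stays a single sentence under this scheme, which matches sentence 4 in the question.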

making a text document a numeric list [closed]

I am trying to automatically make a big corpus into a numeric list. One number per line. For example I have the following data:
Df.txt =
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let's hear it! :)
First I read the text using the command readLines:
text <- readLines("Df.txt", encoding = "UTF-8")
Second, I convert the text to lower case and remove unnecessary spacing:
library(stringr)  # provides str_trim()
## Convert input to lower case:
lower_text <- tolower(text)
## Remove leading and trailing spaces:
Spaces_remove <- str_trim(lower_text)
From here, I would like to assign each line a number, e.g.:
"In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." = 1
"We love you Mr. Brown." = 2
...
"If you have an alternative argument, let's hear it! :)" = 6
Any ideas?

You already do kinda have numeric line # associations with the vector (it's indexed numerically), but…
text_input <- 'In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one\'s life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE\'s new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let\'s hear it! :)'
library(dplyr)
library(purrr)
library(stringi)
textConnection(text_input) %>%
readLines(encoding="UTF-8") %>%
stri_trans_tolower() %>%
stri_trim() -> corpus
# data frame with explicit line # column
df <- tibble(line_number=seq_along(corpus), text=corpus)  # data_frame() is deprecated in favour of tibble()
# list with an explicit line number field
lst <- map(1:length(corpus), ~list(line_number=., text=corpus[.]))
# implicit list numeric ids
as.list(corpus)
# explicit list numeric id's (but they're really string keys)
setNames(as.list(corpus), 1:length(corpus))
# named vector
set_names(corpus, 1:length(corpus))
There are a plethora of R packages that significantly ease the burden of text processing/NLP ops. Doing this work outside of them is likely to be reinventing the wheel. The CRAN NLP Task View lists many of them.

Preparing the corpus for n-gram model word prediction in R

I have the following function to predict the next word using trigrams. The libraries that I am using are: ngrams, RWeka and tm.
f <- function(queryHistoryTab, query, n = 2) {
  require(tau)
  trigrams <- sort(textcnt(rep(tolower(names(queryHistoryTab)), queryHistoryTab), method = "string", n = length(scan(text = query, what = "character", quiet = TRUE)) + 1))
  query <- tolower(query)
  idx <- which(substr(names(trigrams), 0, nchar(query)) == query)
  res <- head(names(sort(trigrams[idx], decreasing = TRUE)), n)
  res <- substr(res, nchar(query) + 2, nchar(res))
  return(res)
}
In order to feed the function I have to set up a corpus. For this purpose I am using a data set that consists of textual data extracted from US blogs:
text1 <- readLines("en_US.news.txt", encoding = "UTF-8")
corpus <- Corpus(VectorSource(text1))
The class of the corpus is
>class(corpus)
[1] "VCorpus" "Corpus"
Nevertheless, when I try to guess the two most common words that follow a sentence, I get the following error:
f(corpus, "I will like a")
Error in textcnt(rep(tolower(names(queryHistoryTab)), queryHistoryTab), :
(list) object cannot be coerced to type 'integer'
Here are the first lines of the en_US.news.txt in case you want to test it yourselves:
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let's hear it! :)
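One possible reading of the error: f calls rep(tolower(names(queryHistoryTab)), queryHistoryTab), so it appears to expect a named vector of counts (text as names, frequency as values), whereas a tm corpus is a list and cannot be used as the times argument of rep(). The sketch below only illustrates that input shape in base R. The three sample lines and the names line_table and query_counts are assumptions made for illustration, and whether f then returns useful predictions is not verified here.

```r
# Sample lines standing in for the contents of en_US.news.txt (assumption)
text1 <- c("We love you Mr. Brown.",
           "we love you mr. brown.",
           "If you have an alternative argument, let's hear it! :)")

# Count how often each distinct lower-cased line occurs
line_table <- table(tolower(text1))

# Named integer vector: names are the lines, values are their counts
query_counts <- setNames(as.integer(line_table), names(line_table))

# The expression that failed when a corpus was passed in now works
expanded <- rep(names(query_counts), query_counts)
length(expanded)  # 3, one entry per original line

# f(query_counts, "we love") could then be tried, assuming tau is installed
```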

A weird word appears with topic analysis in r

I have a paragraph:
disgusting do at was horrific we have stayed please to at traveler photos ironic i did post those witnessed each every thing in pictures gave us fist free then moved us to rooms were any better we slept with clothes on entire there never once took off shoes to walk on carpet shower etc holes in wall stains on bedding curtains couch chair no working electric in lamps cords nothing could be plugged in when we called down to fix it so we no lighting except bathroom light tv toilets constantly plugged up shower drain.
It reads a little oddly because I cleaned the paragraph. I use the following code to extract word frequencies:
library(tm)
# create corpus (example holds the paragraph above)
docs<-Corpus(VectorSource(example))
# stem document
docs<-tm_map(docs,stemDocument)
# create document-term matrix
dtm<-DocumentTermMatrix(docs)
# convert row names
rownames(dtm)<-"example"
# collapse matrix by summing over columns
freq<-colSums(as.matrix(dtm))
# length should be total number of terms
length(freq)
# create sort order (descending)
ord<-order(freq,decreasing=TRUE)
# list all terms in decreasing order of freq and write to disk
freq[ord]
Then the freq[ord] is:
I am wondering why the word "ani" appears here, since it does not appear in my paragraph. Thanks.
I just figured out the problem: the following line turns "any" into "ani". Does anyone know how to avoid that?
docs<-tm_map(docs,stemDocument)

It's the word "any" after having been stemmed. The (in this case faulty) logic of the underlying function, wordStem, which uses Dr. Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball, changed the y to an i.
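For illustration, the rule responsible can be sketched in a few lines of base R. This is a simplified rendering of step 1c of the original Porter algorithm (a final y becomes i when the rest of the word contains a vowel). The function name porter_step_1c is made up, the vowel check ignores Porter's special handling of y as a vowel, and the stemmer that tm actually calls may implement a revised variant of this rule.

```r
# Hypothetical, simplified version of Porter's step 1c (illustration only)
porter_step_1c <- function(word) {
  stem <- sub("y$", "", word)
  # replace a final "y" with "i" only if the remaining stem has a vowel
  if (grepl("y$", word) && grepl("[aeiou]", stem)) {
    paste0(stem, "i")
  } else {
    word
  }
}

porter_step_1c("any")  # "ani": the stem "an" contains a vowel
porter_step_1c("by")   # "by": the stem "b" has no vowel, so nothing changes
```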
