Scraping all the reviews of an IMDB movie in R

I wrote the code below to scrape the reviews and the detailed review text for a movie.
But it only scrapes information that has already been loaded on the page. (For example: if there are 1000 reviews, the web page initially shows only the first 10; the other reviews only display after clicking "Load More".)
require(rvest)
require(dplyr)

# parse the first page of reviews
MOVIE_URL <- read_html("https://www.imdb.com/title/tt0167260/reviews?ref_=tt_urv")
# review titles
ex_review <- MOVIE_URL %>%
  html_nodes(".lister-item a") %>%
  html_text()
# detailed review text
detailed <- MOVIE_URL %>%
  html_nodes(".content") %>%
  html_text()
Is there a way to scrape the information of every review?

This is similar to a previous question (How to scrape all the movie reviews from IMDB using rvest), though the answer no longer works.
Now, when you are looking at a single page of reviews, say https://www.imdb.com/title/tt0167260/reviews, you can load the next page of reviews via the URL:
movie_url <- paste0("https://www.imdb.com/title/tt0167260/reviews/_ajax?&paginationKey=", pagination_key)
where pagination_key is the data-key hidden in the html under:
<div class="load-more-data" data-key="g4xolermtiqhejcxxxgs753i36t52q343andv6xeade6qp6qwx57ziim2edmxvqz2tftug54" data-ajaxurl="/title/tt0167260/reviews/_ajax">.
So if you retrieve the html from movie_url = "https://www.imdb.com/title/tt0167260/reviews/_ajax?&paginationKey=g4xolermtiqhejcxxxgs753i36t52q343andv6xeade6qp6qwx57ziim2edmxvqz2tftug54" you will get the second page of reviews.
To then access the third page, you repeat the process: look for the pagination key in this second page, fetch the next _ajax page with it, and so on until no key remains.
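Putting that together, a minimal sketch of the loop (the _ajax endpoint and .load-more-data node come from above; the .content selector is the one from the question and is an assumption about IMDB's current markup):

library(rvest)

ajax_url <- "https://www.imdb.com/title/tt0167260/reviews/_ajax?&paginationKey="

# start from the normal reviews page
page <- read_html("https://www.imdb.com/title/tt0167260/reviews?ref_=tt_urv")
all_reviews <- character(0)

repeat {
  # collect the review text on this page
  all_reviews <- c(all_reviews,
                   page %>% html_nodes(".content") %>% html_text())

  # the data-key on the load-more node unlocks the next page
  more <- page %>% html_nodes(".load-more-data")
  if (length(more) == 0) break   # no load-more node: last page reached
  page <- read_html(paste0(ajax_url, html_attr(more[[1]], "data-key")))
}

length(all_reviews)   # should now cover every review, not just the first 10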

Related

Web Scraping within iframe using R

I am very new to the rvest library and this is the first time I am trying to scrape something. I am trying to scrape the very first table on this web page, https://mup.gov.hr/promet-na-granicnim-prijelazima-282198/282198, titled PUTNICI (translated: PASSENGERS), which sits within an iframe, but I am struggling to do that.
In the top left corner there is also a date picker, where one can select the specific day, month, and year one wants to see.
Is there any chance I can scrape that very first table for a specific time period, let's say the whole of January 2022, or, if not, at least scrape the very first table?
This is my code at the moment:
"https://mup.gov.hr/promet-na-granicnim-prijelazima-282198/282198" %>%
read_html() %>%
html_nodes("iframe") %>%
extract(1) %>%
html_attr("src") %>%
read_html() %>%
html_node("#prometGranicniPrijelaz") %>%
html_text()
I would be really thankful if someone helped me on this subject!
If you open your browser's Developer Tools - Network tab - Fetch/XHR and then change the date on the website, you will see the request that happens in the backend and loads the data you are looking for. You can make your queries to that backend URL and loop through the dates:
https://granica.mup.hr/default.inc.aspx?ajaxq=PrometPoDatumu&odDat=06.02.2022
https://granica.mup.hr/default.inc.aspx?ajaxq=PrometPoDatumu&odDat=07.02.2022
etc
I don't believe it's in an iframe; the response is an HTML table with class = "desktop promet", and you can parse the data you are looking for from there.
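A minimal sketch of that date loop (assuming the endpoint keeps the odDat=DD.MM.YYYY format shown above and the table keeps its "desktop promet" class):

library(rvest)

# every day of January 2022, in the DD.MM.YYYY format the endpoint expects
dates <- format(seq(as.Date("2022-01-01"), as.Date("2022-01-31"), by = "day"),
                "%d.%m.%Y")

tables <- lapply(dates, function(d) {
  url <- paste0("https://granica.mup.hr/default.inc.aspx?ajaxq=PrometPoDatumu&odDat=", d)
  read_html(url) %>%
    html_node("table.desktop.promet") %>%  # assumption: the first such table is PUTNICI
    html_table()
})
names(tables) <- dates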

What to do if rvest isn't recognizing a node in R?

I am trying to scrape referee game data using rvest. See the code below:
library(rvest)

page_ref <- read_html("https://www.pro-football-reference.com/officials/HittMa0r.htm")
ref_tab <- page_ref %>%
  html_node("#games") %>%
  html_text()
  # html_table()
But rvest does not recognize any of the nodes for the "Games" table in the link. It can pull the data from the first table, "Season Totals", just fine. So, am I missing something? In general, what does it mean if rvest doesn't recognize a node that SelectorGadget identifies and that is clearly visible in the developer tools?
It is because the first table is in the html you get from the server, while the other tables are filled in by JavaScript. rvest can only get you what is actually present in the html response from the server. If you want the data filled in by JS, you need to use some other tool, such as Selenium or Puppeteer; a minimal RSelenium sketch follows the links below.
Selenium
Puppeteer / gepetto
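For example, with RSelenium (a sketch, assuming a local Selenium driver can be started; see the RSelenium docs for setup):

library(RSelenium)
library(rvest)

rd <- rsDriver(browser = "firefox", verbose = FALSE)
remote <- rd$client
remote$navigate("https://www.pro-football-reference.com/officials/HittMa0r.htm")

# once JavaScript has run, #games exists in the rendered DOM
ref_tab <- remote$getPageSource()[[1]] %>%
  read_html() %>%
  html_node("#games") %>%
  html_table()

remote$close()
rd$server$stop()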

Scrape an Amazon search to get prices and products with R (rvest)

I'm trying to scrape any Amazon search to get the products and their prices, so I'm working with the rvest library in R.
For example, for this search:
Amazon Search
I want to extract all product names and their prices. I tried the following:
library(rvest)

link <- 'https://www.amazon.com.mx/s?k=gtx+1650+super&__mk_es_MX=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2'
simple <- read_html(link)
simple %>%
  html_nodes("[class='a-size-base-plus a-color-base a-text-normal']") %>%
  html_text()
Inspecting in Chrome, class 'a-size-base-plus a-color-base a-text-normal' is where the product name is stored.
That code works fine and I get all the product names. So I was trying to get their prices with this:
simple %>% html_nodes("[class='a-offscreen']") %>% html_text()
Inspecting in Chrome, class 'a-offscreen' is where the price is stored.
That code returns every price in the search, but as you can see in the search results, not all products have a price. So the code only returns the prices that exist, and I can't match products with their prices.
Is there a way to make this work? Maybe by filtering only those products that have class 'a-offscreen' and getting their prices?
Thanks.
You need to scrape the item nodes first, and then, within each node, scrape the product name and the price, as sketched below. Similar to this question: RVEST package seems to collect data in random order.
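A minimal sketch of that per-node approach (the div.s-result-item container selector is an assumption about Amazon's current markup; the two inner selectors are the ones from the question; html_node() yields NA text when a container has no price, which keeps names and prices aligned):

library(rvest)

link <- 'https://www.amazon.com.mx/s?k=gtx+1650+super&__mk_es_MX=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2'
simple <- read_html(link)

# one node per search result, so name and price stay paired
items <- simple %>% html_nodes("div.s-result-item")

results <- do.call(rbind, lapply(items, function(item) {
  data.frame(
    name  = item %>% html_node(".a-size-base-plus.a-color-base.a-text-normal") %>% html_text(),
    price = item %>% html_node(".a-offscreen") %>% html_text(),  # NA when no price
    stringsAsFactors = FALSE
  )
}))

# drop containers that are not products (ads, separators)
results <- results[!is.na(results$name), ]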

How to scrape Google with specific search criteria using R?

I'm trying to scrape Google with a specific search term, from a specific site, in a specific date range using R.
Example
Search term: "Miroslava Breach Velducea"
Site: www.jornada.com.mx
Dates: 1/1/2011 - 1/1/2012
The link for that specific search is: https://www.google.com/search?q=Miroslava+Breach+Velducea+site:www.jornada.com.mx&tbas=0&tbs=cdr:1,cd_min:1/1/2011,cd_max:1/1/2012&ei=UqCzW6LZC8OK5wKg97vYDA&start=10&sa=N&biw=1137&bih=474
When I code that in R, I can scrape Google for that search term on that site, but not for those dates.
library(rvest)

web_address <- 'https://www.google.com/search?q=miroslava+breach+velducea+site%3Awww.jornada.com.mx&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2010%2Ccd_max%3A12%2F31%2F2011'
webpage_code <- read_html(web_address)
Título <- html_text(html_nodes(webpage_code, '.r a'))
Título
Does anyone know how to scrape Google general search for specific dates?
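One difference worth checking: the working browser link above leaves the tbs date-range parameter unencoded (cdr:1,cd_min:1/1/2011,cd_max:1/1/2012), while the R string percent-encodes its commas and slashes. A hedged sketch that builds the URL the way the browser writes it (Google's result markup and rate limiting change often, so the '.r a' selector may also need updating):

library(rvest)

query <- "Miroslava+Breach+Velducea+site:www.jornada.com.mx"
tbs <- "cdr:1,cd_min:1/1/2011,cd_max:1/1/2012"   # dates from the example above

web_address <- paste0("https://www.google.com/search?q=", query, "&tbs=", tbs)
webpage_code <- read_html(web_address)
Título <- html_text(html_nodes(webpage_code, '.r a'))
Título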

R: Submit Form then Scrape Results

I'm trying to write code in R that will allow me to submit a query on http://nbawowy.com/ and scrape the resulting data. I'd like to be able to input values for at least the "Select Team", "Select Players On", and "Select Players Off" fields and then submit the form. As an example, if I select the 76ers as my team, and Ben Simmons as the "Player On", the resulting query is found here: http://nbawowy.com/#/z31mjvm5ss. I've tried using the following code, but it provides me with an unknown field names error:
library(rvest)

url <- "http://nbawowy.com/#/l0krk654imh"
session <- html_session(url)
form <- html_form(read_html(url))[[1]]
filled_form <- set_values(form,
                          "s2id_autogen1_search" = "76ers",
                          "s2id_autogen2" = "Ben Simmons")
# note: html_session()/set_values()/submit_form() are the pre-1.0 rvest names;
# current rvest calls these session(), html_form_set(), and session_submit()
session1 <- submit_form(session, filled_form, submit = 'submit')
Since I can't seem to get past this initial part, I'm looking to the community for some help. I'd ultimately like to navigate the session to the resulting URL and scrape the data.
This is not a problem with your code. If you check the form returned by the website, you will see that not only are there no list elements named "s2id_autogen1_search" or "s2id_autogen2", the whole form is in fact unnamed. Furthermore, what looks like one form in the browser is actually several forms (so the players' names cannot even be entered into the html_form(read_html(url))[[1]] element). Get to know the object you are trying to set values on with str(form), as below. Alternatively, try scraping with RSelenium.
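For instance, a quick inspection using only the URL from the question:

library(rvest)

url <- "http://nbawowy.com/#/l0krk654imh"
forms <- html_form(read_html(url))

length(forms)    # several forms, not the single one visible in the browser
str(forms[[1]])  # unnamed fields, so set_values() has nothing to match on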
