Web Scraping within iframe using R

I am very new to the rvest library and this is the first time I am trying to scrape something. I am trying to scrape the very first table on this web page, https://mup.gov.hr/promet-na-granicnim-prijelazima-282198/282198, titled PUTNICI (translated: PASSENGERS), which sits inside an iframe, but I am struggling to do that.
In the top left corner there is also a date picker, which lets you choose the specific day, month and year you want to see.
Is there any chance I can scrape that very first table for a specific time period, let's say the whole of January 2022, or, if not, at least scrape the very first table?
This is my code at the moment:
"https://mup.gov.hr/promet-na-granicnim-prijelazima-282198/282198" %>%
read_html() %>%
html_nodes("iframe") %>%
extract(1) %>%
html_attr("src") %>%
read_html() %>%
html_node("#prometGranicniPrijelaz") %>%
html_text()
I would be really thankful if someone could help me with this!

If you open your browser's Developer Tools > Network tab > Fetch/XHR and then change the date on the website, you will see the request that happens in the backend and loads the data you are looking for. You can make your own queries to that backend URL and loop through the dates:
https://granica.mup.hr/default.inc.aspx?ajaxq=PrometPoDatumu&odDat=06.02.2022
https://granica.mup.hr/default.inc.aspx?ajaxq=PrometPoDatumu&odDat=07.02.2022
etc
I don't believe it's in an iframe; rather, it's an HTML table with class = "desktop promet", and you can parse the data you are looking for from there.
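For example, here is a minimal rvest sketch of that idea for January 2022. The backend URL comes from the Network tab above; the "desktop promet" table class and the assumption that the column layout is identical every day may need adjusting:

library(rvest)
library(purrr)

# all dates in January 2022, in the DD.MM.YYYY format the backend expects
dates <- format(seq(as.Date("2022-01-01"), as.Date("2022-01-31"), by = "day"),
                "%d.%m.%Y")

passengers <- map_dfr(dates, function(d) {
  url <- paste0("https://granica.mup.hr/default.inc.aspx?ajaxq=PrometPoDatumu&odDat=", d)
  tbl <- read_html(url) %>%
    html_element("table.desktop.promet") %>%  # first table: PUTNICI (passengers)
    html_table()
  tbl$datum <- d                              # keep track of the date
  tbl
})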

Related

Scraping all the reviews of an IMDB movie in R

I wrote code to scrape the review and the detailed review for a movie.
However, it only scrapes information that has already been loaded on the page. (For example, if there are 1000 reviews, the web page only shows the first 10 reviews; the other reviews are displayed after clicking "Load More".)
require(rvest)
require(dplyr)
MOVIE_URL <- read_html("https://www.imdb.com/title/tt0167260/reviews?ref_=tt_urv")
ex_review <- MOVIE_URL %>%
  html_nodes(".lister-item a") %>%
  html_text()
detailed <- MOVIE_URL %>%
  html_nodes(".content") %>%
  html_text()
Is there a way to scrape the information of every review?
This is similar to a previous question (How to scrape all the movie reviews from IMDB using rvest), though that answer no longer works.
When you are looking at a single page of reviews, say https://www.imdb.com/title/tt0167260/reviews, you can load the next page of reviews via the URL:
movieurl = "https://www.imdb.com/title/tt0167260/reviews/_ajax?&paginationKey="+pagination_key
where pagination_key is the data-key hidden in the html under:
<div class="load-more-data" data-key="g4xolermtiqhejcxxxgs753i36t52q343andv6xeade6qp6qwx57ziim2edmxvqz2tftug54" data-ajaxurl="/title/tt0167260/reviews/_ajax">.
So if you retrieve the html from movie_url = "https://www.imdb.com/title/tt0167260/reviews/_ajax?&paginationKey=g4xolermtiqhejcxxxgs753i36t52q343andv6xeade6qp6qwx57ziim2edmxvqz2tftug54" you will get the second page of reviews.
To then access the third page you need to repeat the process i.e. look for the pagination key from this second page and repeat.
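A rough sketch of that loop with rvest, reusing the .content selector from the question and the .load-more-data / data-key attribute described above. The stopping condition is an assumption: the loop ends when no pagination key is found on the current page:

library(rvest)

base_url <- "https://www.imdb.com/title/tt0167260/reviews"
ajax_url <- "https://www.imdb.com/title/tt0167260/reviews/_ajax?&paginationKey="

page    <- read_html(base_url)
reviews <- character(0)

repeat {
  # collect the detailed reviews on the current page
  reviews <- c(reviews, page %>% html_nodes(".content") %>% html_text())
  # look for the pagination key that points to the next page
  key <- page %>% html_node(".load-more-data") %>% html_attr("data-key")
  if (is.na(key)) break                      # no more pages to load
  page <- read_html(paste0(ajax_url, key))
}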

What to do if rvest isn't recognizing a node in R?

I am trying to scrape referee game data using rvest. See the code below:
library(rvest)
page_ref <- read_html("https://www.pro-football-reference.com/officials/HittMa0r.htm")
ref_tab <- page_ref %>%
  html_node("#games") %>%
  html_text()
  # html_table()
But rvest does not recognize any of the nodes for the "Games" table at that link. It can pull the data from the first table, "Season Totals", just fine. Am I missing something? In general, what does it mean if rvest doesn't recognize a node that SelectorGadget identifies and that is clearly visible in the developer tools?
It is because the first table is in the HTML you get from the server, while the other tables are filled in by JavaScript. rvest can only get you what is present in the HTML response from the server. If you want the data filled in by JS, you need to use some other tool such as Selenium or Puppeteer, for example.
Selenium
Puppeteer / gepetto
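For example, a rough RSelenium sketch, assuming a working local Selenium setup with Firefox; this is just one way to get the rendered HTML, not the only one:

library(RSelenium)
library(rvest)

rD    <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client

remDr$navigate("https://www.pro-football-reference.com/officials/HittMa0r.htm")
Sys.sleep(5)  # give the JavaScript time to build the remaining tables

# parse the fully rendered page source instead of the raw server response
page_ref <- read_html(remDr$getPageSource()[[1]])
ref_tab  <- page_ref %>%
  html_node("#games") %>%
  html_table()

remDr$close()
rD$server$stop()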

Scrape in Amazon search to get prices and products with R (rvest)

I'm trying to scrape Amazon search results to get products and their prices, so I'm working with the rvest library in R.
For example, for this Amazon search, I want to extract all product names and their prices. I tried the following:
library(rvest)
link <- 'https://www.amazon.com.mx/s?k=gtx+1650+super&__mk_es_MX=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2'
simple <- read_html(link)
simple %>%
  html_nodes("[class='a-size-base-plus a-color-base a-text-normal']") %>%
  html_text()
In Chrome, class 'a-size-base-plus a-color-base a-text-normal' is where the product name is stored.
That code works fine and I get all the product names. So I tried to get their prices with this:
simple %>% html_nodes("[class='a-offscreen']") %>% html_text()
In Chrome, class 'a-offscreen' is where the price is stored.
That code returns every price in the search, but as you can see in the results, not all products have a price. So it only returns the products that do have a price, and I can't match products with their prices.
Is there a way to make this work? Maybe it is possible to filter only those products that have the class 'a-offscreen' and get their prices?
Thanks.
You need to scrape the item nodes first and then, for each node, scrape the product name and the price. This is similar to this question: RVEST package seems to collect data in random order.
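A sketch of that per-item approach (rvest >= 1.0.0 for html_element(); the s-search-result selector is an assumption about Amazon's current markup and may need adjusting):

library(rvest)

link   <- 'https://www.amazon.com.mx/s?k=gtx+1650+super&__mk_es_MX=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2'
simple <- read_html(link)

# one node per product in the result list
items <- simple %>% html_nodes("div[data-component-type='s-search-result']")

# html_element() keeps exactly one result per item and returns NA when a
# product has no price, so names and prices stay aligned
products <- data.frame(
  name  = items %>% html_element(".a-size-base-plus.a-color-base.a-text-normal") %>% html_text(),
  price = items %>% html_element(".a-offscreen") %>% html_text()
)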

Data missing while scraping website

I am trying to scrape a website (please refer to the URLs in the code).
From the website, I am trying to scrape all the information and transfer the data to a JSON file.
scrapy shell http://www.narakkalkuries.com/intimation.html
To extract the information from the website:
response.xpath('//table[@class="MsoTableGrid"]/tr/td[1]//text()').re(r'[0-9,-/]+|[0-9]+')
I am able to retrieve most of the information from the website.
Concern:
I am able to scrape the data under "Intimation", except for 'Intimation For September 2017'; I cannot scrape the information under that tab.
Finding:
For 'Intimation For September 2017', the value is stored in a span tag:
/html/body/div[4]/div[2]/div/table/tbody/tr[32]/td[1]/table/tbody/tr[1]/td[1]/p/b/span
For the remaining months, the values are stored in a font tag:
/html/body/div[4]/div[2]/div/table/tbody/tr[35]/td[1]/table/tbody/tr[2]/td[1]/p/b/span/font
How can I extract the information for "Intimation For September 2017"?
Your tables use different @class values (MsoTableGrid and MsoNormalTable), so you need some way to process all of them:
for table in response.xpath('//table[@width="519"]'):
    for row in table.xpath('./tr[position() > 1]'):
        for cell in row.xpath('./td'):
            # you can stringify the cell value
            cell_value = cell.xpath('string(.)').extract_first()

Rvest - Mismatch between retrieved html and actual site html

I have a script that regularly checks (once a week) a website it scrapes. The website is updated once a week (or once every two weeks). Up till now, I've had no issue retrieving the correct information.
Yet today, for some unknown reason, I am getting the wrong information from read_html.
library(rvest)
urlATP <- "http://www.tennisleader.fr/classement/1"
checkDate <- as.Date(gsub("Au ", "",
                          html_text(html_nodes(read_html(urlATP), ".date-atp"))),
                     format = "%d/%m/%Y")
print(checkDate)
Which returns this
[1] "2018-07-02"
But when I go to the website, the date is different.
<p class="date-atp">Au 16/07/2018</p>
What could explain this mismatch and, more importantly, how do I get rid of it?
Additional information:
The date retrieved was the one that was present on the website last week.
I have tried clearing memory and refreshing the R session, but with no success.
