R Selenium webdriver not loading element even after wait and scroll down

I'm trying to design a scraper for a page in R using the Selenium webdriver package, and the part of the page I want to scrape is not loading, no matter how long I wait for it. It may have to do with JavaScript, which I admittedly know nothing about.
I've tried forcing it to scroll down to load the element (in this case a table) but to no avail.
It loads fine in normal browsers.
This is about the umpteenth site where this has happened, so I thought I'd pop my Stack Overflow cherry and ask the experts.
Sorry I have no reprex as I just don't know where the issue is coming from!
The link to the page is
https://jdih.kemenkeu.go.id/#/home
(Image: what Selenium says it sees - the yellow highlighted area is where the table should load.)
(Image: how the page is supposed to display, shown in Firefox.)
Thanks for reading!

(18 months later and I can answer my own question!)
The issue was that the page is loading content dynamically using an API request.
When scraping with a direct GET request to a URL to extract the page contents, that initial request alone may not load the desired content.
In this case, I pinned down the exact issue by reloading the page with the developer tools open (F12) and the 'Network' (or similar) tab selected.
This then shows all the requests made when the browser loads the page.
One of these will be a request for the desired data - in this case by filtering on XHR requests only, I was able to identify one which loaded content through an internal API.
Right-click the request and open it in a new tab, and voilà: you have a URL which you can use in the usual way with this scraping method to get the page content you need.
Sometimes the URL alone is not enough; you will also need the correct request headers sent with the request. These can be seen in the request details in the browser's developer tools, as mentioned above; right-click the request and select 'Copy headers' or similar to grab them.
In this case, i.e. when using R, the httr package can be used to send GET requests with specific headers, like so:
library(httr)
library(magrittr)   # for the %>% pipe

headers <- c(
  "Host" = "website.com"
  # [other headers here]
)

page <- httr::GET(url = "https://www.website.com/data",
                  httr::add_headers(.headers = headers)) %>%
  httr::content()
Once you have the page content, you can parse the HTML or whatever else is required as usual.
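For example, if the endpoint returns an HTML fragment containing the table, rvest can pull it out. This is a minimal sketch that reuses the page object from the snippet above; the "table" selector is a placeholder, not the actual structure of this particular endpoint:
library(rvest)

page %>%
  html_element("table") %>%   # grab the (first) table in the response
  html_table()                # convert it to a data frame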

Related

Web scraping using rvest works partially ok

I'm new to web scraping using rvest in R and I'm trying to access the match names in the left column of this betting site using XPath. I know the names are under a span tag, but I can't access them using the following code:
html="https://www.supermatch.com.uy/live#/5009370062"
a=read_html(html)
a %>% html_nodes(xpath="//span") %>% html_text()
But I only get some of the text. I was reading that this may be because the website dynamically pulls data from databases using JavaScript and jQuery. Do you know how I can access these match names? Thank you in advance.
Some generic notes about basic scraping strategies
The following refers to Google Chrome and Chrome DevTools, but the same concepts apply to other browsers and their built-in developer tools too. One thing to remember about rvest is that it can only handle the response delivered for that specific request, i.e. content that is not fetched / transformed / generated by JavaScript running on the client side.
Loading the page and inspecting elements to extract an XPath or CSS selector for rvest seems to be the most common approach, though the static content behind that URL and the rendered page (with its elements in the inspector) can be quite different. To take some guesswork out of the process, it's better to start by checking what content rvest might actually receive: open the page source and skim through it, or just search for a term you are interested in. At the time of writing Viettel is playing, but they are not listed anywhere in the source:
Meaning there's no reason to expect that rvest would be able to extract that data.
You could also disable JavaScript for that particular site in your browser and check if that particular piece of information is still there. If not, it's not there for rvest either.
If you want to go a step further and/or suspect that rvest receives something different compared to your browser session (the target site checks request headers and delivers an anti-scraping notice when it doesn't like the user agent, for example), you can always check the actual content rvest was able to retrieve: for example, read_html(some_url) %>% as.character() to dump the whole response, read_html(some_url) %>% xml2::html_structure() to get the formatted structure of the page, or read_html(some_url) %>% xml2::write_html("temp.html") to save the page content and inspect it in an editor or browser.
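As plain code, those checks look something like this (some_url standing in for whatever page you are probing):
library(rvest)

read_html(some_url) %>% as.character()                  # dump the whole response as text
read_html(some_url) %>% xml2::html_structure()          # print the formatted structure of the page
read_html(some_url) %>% xml2::write_html("temp.html")   # save it for inspection in an editor or browser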
Coming back to Supermatch & DevTools: that data in the left pane must be coming from somewhere. What usually works is a search in the Network pane - open it, clear the current content, refresh the page and make sure it is fully loaded, then run a search (for "Viettel", for example):
And you'll have the URL from there. There are some IDs in that request (https://www.supermatch.com.uy/live_recargar_menu/32512079?_=1656070333214) and it's wise to assume those values might be tied to the current session or just short-lived. So sometimes it's worth trying what happens if we clean the URL up a bit, i.e. remove 32512079?_=1656070333214. In this case it happens to work.
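A minimal sketch of that trimmed-down request (the endpoint responded at the time of writing, but it may change or start requiring those IDs again):
library(rvest)

menu <- read_html("https://www.supermatch.com.uy/live_recargar_menu/")  # cleaned-up XHR URL from DevTools
menu %>% html_elements("span") %>% html_text()                          # same span extraction as in the question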
While here it's just a fragment of HTML and it makes sense to parse it with rvest, in most cases you'll end up landing on JSON and the process turns into working with APIs. When that happens it's time to switch from rvest to something more appropriate for JSON - jsonlite + httr, for example.
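A rough jsonlite + httr sketch for the JSON case (the URL here is just a placeholder for whatever XHR endpoint you find in DevTools, not a real one):
library(httr)
library(jsonlite)

resp <- GET("https://www.example.com/api/live-events")            # the XHR URL found in the Network pane
stop_for_status(resp)                                             # fail loudly on HTTP errors
data <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))  # parse the JSON body into R structures
str(data)                                                         # inspect the structure, then drill down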
Sometimes plain rvest is not enough and you either want or need to work with the page as it would have been rendered in your JavaScript-enabled browser. For this there's RSelenium.
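A rough RSelenium sketch (it assumes a working Selenium / browser driver setup; the fixed wait and the span selector are placeholders, not tuned values):
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client
remDr$navigate("https://www.supermatch.com.uy/live")
Sys.sleep(5)                                   # crude wait for the JavaScript to render
page <- read_html(remDr$getPageSource()[[1]])  # hand the rendered DOM over to rvest
page %>% html_elements("span") %>% html_text()
remDr$close()
driver$server$stop()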

Getting the download url of a video from a site

I'm trying to build a web scraper which downloads videos from "fmovies.se".
I was not able to fully extract the video url given the webpage.
The webpage I'm considering is "https://fmovies.se/film/la-cage-doree.5283j".
Two queries are required to retrieve the video url.
The initial one is 'https://fmovies.se/ajax/episode/info?ts=1483027200&_=2399&id=9076jn&update=0'.
The query is composed of "ts", "_", "id" and "update" parameters. Everything except the "_" part was mentioned in the HTML code of the webpage.
I couldn't work out where the "_=2399" part was coming from.
Can anyone help me with this?
Even if you figure out how those parameters are computed, they can change their algorithm at any moment, which this site specifically has done in the past, see this thread.
You need a long-lasting solution: a headless browser.
You can use a headless browser to simulate user interactions programmatically and intercept the XHR request that you are looking for (e.g. https://fmovies.se/ajax/episode/info?ts=1483027200&_=2399&id=9076jn&update=0).
One of the best headless browsers out there is Puppeteer and there's a lot of information on how to use it.

How to scrape ajax calls in PHP

Please let me know: is it possible to scrape some info after AJAX has loaded, using PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for the advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick. A quick Google search finds numerous examples of how to set it up. Here is one.
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an AJAX call by using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools console.
You should then see the dev tools window. Hit the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site. You can then filter out specific requests.
For example, if you are interested in AJAX calls you can select XHR.
You can then click on any of the listed items in the table to get more information.
file_get_contents on an AJAX call
Depending on how robust the APIs behind these AJAX calls are, you could do something like the following.
<?php
// Request the AJAX endpoint directly, as the browser would
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the return is JSON, then add:
$data = json_decode($content);
However, you are going to have to do this for each AJAX request on the site. Beyond that, you are going to have to use a solution similar to the ones presented [here].
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific AJAX calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also uses AJAX to manipulate the document, then you will NOT be able to use SIMPLE_HTML_DOM.
Finally I worked around my problem: I just took the POST URL with all its parameters from the AJAX call and made the same request using the SIMPLE_HTML_DOM class.

How to make sure that content is only rendered within an iframe

I am wondering how to make sure that I only ever show/render the content (send the code to the client) if the content is loaded in an iframe in a real browser, similar to the way Facebook checks when to display their like buttons and other social utilities.
There, when trying to simply load the content using curl, even when sending cookies, session details and user-agent details, it still returns nothing. When trying to load the content outside an iframe, one also receives nothing. How can that be achieved? I guess it is anything but a simple process and involves multiple steps. I am especially interested in the first one, namely how to detect that the request is really sent from a browser and not simply curled.
Thanks.
There is no way for your server to detect whether the request was sent by a browser or by curl, as the headers are easily forged.

Screen Scraping - how to get AJAX based filtered data

I am working on screen scraping. It's easy when the filtering is in the query string, but the problem is with AJAX-based filtering.
E.g. here is a sample URL.
When you open this page, enter a hotel name and click Go, the AJAX filter runs and shows the results accordingly; or if you click Next Page, the next records are shown, again loaded via AJAX.
Please suggest how to handle these kinds of issues when doing screen scraping.
Thanks a lot.
You may want to try 2 Firefox add-ons. They are "Firebug" and "Tamper Data".
The "Console" window of Firebug shows the AJAX request and response.
You can then write scripts using the PHP/cURL library to mimic the request.
Do an HTTP request as you normally would for any link or form submit, but use the URL used by the AJAX call. Sometimes you may need to read the JavaScript source to determine how the URL is built.
