I'm trying to get the data behind a Flash website. To do this, I used Firebug and found a POST method under the Net tab that contains the data I want. I can see this data just by clicking on the POST method in Firebug, but when I do so the URL displayed is no different from that of the whole webpage. Is there a persistent URL for this data? If so, what is it? If not, how can I get to this data without using Firebug? I'm going to want to scrape it using Ruby.
A POST sends data to the server, and the server sends back the page whose URL you see.
You can see what data was sent from the browser to the server; to get that response again, you have to send the same request again - there is no separate URL for it.
This data is either derived from a form on the (previous) page or created by JavaScript on the page. The first kind you enter into the form yourself; the second kind requires analyzing the page's JavaScript code.
BTW: in Firebug you see two kinds of data - data sent from the browser to the server (the so-called POST data, sent as the Request) and data sent from the server to the browser (sent as the Response), which is what you see as the page in the browser.
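To fetch this data without Firebug, you can replay the same POST request from a script. Below is a minimal sketch using Python's requests library (the same idea carries over to Ruby with Net::HTTP); the endpoint URL and form fields here are hypothetical stand-ins for whatever Firebug's Net tab shows for your site.

import requests

# Hypothetical endpoint and form fields - copy the real ones from
# the POST entry in Firebug's Net tab.
post_url = "http://www.example.com/data_endpoint"
form_data = {"field1": "value1", "field2": "value2"}

resp = requests.post(post_url, data=form_data)
print(resp.text)  # the same data Firebug shows in the Response pane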
I'm trying to design a scraper for a page in R using the Selenium WebDriver package, and the part of the page I want to scrape is not loading, no matter how long I wait. It may be to do with JavaScript, which I admittedly know nothing about.
I've tried forcing it to scroll down to load the element (in this case a table) but to no avail.
It loads fine in normal browsers.
It's like the severalth site for which this has happened so I thought I'd pop my stackoverflow cherry and ask the experts.
Sorry I have no reprex as I just don't know where the issue is coming from!
The link to the page is https://jdih.kemenkeu.go.id/#/home
(Image: what Selenium says it sees - the yellow highlighted area is where the table should load.)
(Image: how it is supposed to display, shown in Firefox.)
Thanks for reading!
(18 months later and I can answer my own question!)
The issue was that the page is loading content dynamically using an API request.
When scraping with direct GET requests of a URL to extract the page contents, this initial request alone may not load the desired content.
In this case, I found the exact issue by reloading the page with the developer interface open (F12) and the 'Network' (or similar) tab selected.
This then shows all the requests made when the browser loads the page.
One of these will be a request for the desired data - in this case by filtering on XHR requests only, I was able to identify one which loaded content through an internal API.
Right-click the request, open in new tab, and voilĂ , you have a URL which you can use in the same way you would normally with this scraping method that will provide the page content required.
Sometimes the URL alone may not be enough; you will also need the correct request headers sent with the request. These can be seen in the request data in the browser's developer interface, as mentioned above. Right-click the request and select 'Copy headers' or similar to get them.
In this case, i.e. when using R, the httr package can be used to send GET requests with specific headers, thus:
library(magrittr)  # provides the %>% pipe used below

headers <- c(
  "Host" = "website.com"
  # [other headers here]
)

page <- httr::GET(url = "https://www.website.com/data",
                  httr::add_headers(.headers = headers)) %>%
  httr::content()
When you have the page content, it is possible to parse the HTML or whatever else is required as usual.
Is it possible for me to scrape the data from the pop-up that appears after clicking the link? The website is https://ngodarpan.gov.in/index.php/home/statewise_ngo/61/35
Of course it's possible, it's just a table with pagination.
But you'd better check the legal side before scraping a website, especially a governmental one.
Yes, you have to follow exactly what the browser does. Watch the network traffic in your browser's dev tools.
First, you have to send a request to https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf in order to get a token like this: {"csrf_token":"0d1c59184c7df788dc4b8759f6da40c6"}
Then send another POST request to https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info. As parameters you have to send csrf_test_name, which is equal to the csrf_token, and id, which is found in the onclick attribute of each link.
You will get JSON as the response; just parse it as you need.
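For illustration, here is a minimal sketch of that two-step flow using Python's requests library. The endpoint paths and parameter names are taken from the description above; the id value is a hypothetical example, and a real session may also require extra headers (e.g. X-Requested-With), so treat this as a starting point rather than a finished scraper.

import requests

session = requests.Session()

# Step 1: fetch a fresh CSRF token.
token = session.get(
    "https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf"
).json()["csrf_token"]

# Step 2: POST the token plus the id taken from the link's onclick attribute.
resp = session.post(
    "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info",
    data={"csrf_test_name": token, "id": "12345"},  # hypothetical id
)
print(resp.json())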
Please let me know: is it possible to scrape some info after AJAX has loaded, using PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for any advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick; a quick Google search turns up numerous examples of how to set it up.
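For example, here is a minimal sketch with Python's selenium package (assuming a Chrome/chromedriver setup; the URL and the element waited for are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.example.com")  # hypothetical URL

# Wait until the dynamically loaded element actually appears in the DOM.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, "table"))
)
html = driver.page_source  # now includes the AJAX-generated markup
driver.quit()

Because Selenium drives a real browser, the JavaScript runs and the dynamic content is rendered before you read page_source.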
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an AJAX call using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools console.
You should then see the dev tools window. Hit the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site. You can then filter out specific requests.
For example, if you are interested in AJAX calls, you can select XHR.
You can then click on any of the listed items in the table to get more information.
file_get_contents on an AJAX call
Depending on how robust the APIs behind these AJAX calls are, you could do something like the following.
<?php
// Call the AJAX endpoint directly and grab the raw response body.
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the return is JSON, then add
$data = json_decode($content);
However, you are going to have to do this for each AJAX request on a site. Beyond that you are going to have to use a solution similar to the ones presented [here].
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific AJAX calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also uses AJAX to manipulate the document, then you will NOT be able to use SIMPLE_HTML_DOM.
Finally, I worked around my problem. I just took the POST URL with all its parameters from the AJAX call and made the same request using the SIMPLE_HTML_DOM class.
The overall goal is to perform a search on the following webpage http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx with a container value of CMAU1173561. I have tried two approaches: the PHP extension cURL and Python's mechanize. The PHP approach involves performing a POST submit using the input fields found on the page (NOTE: these are really ugly on the ASP.NET page). The returned page does not contain any of the search results. The second approach involves using Python's mechanize module. In this approach I load the page, select the form, then change the text field ctl00$ContentPlaceBody$TextSearch to the container value. When I load the response again, no search results.
I am at a real dead end. Any help would be appreciated, because as it stands my next step is to become an ASP.NET expert, which I'd prefer not to.
The source of that page is pretty scary (giant viewstate, tables all over the place, inline CSS, styles that look like they were copied from Word).
Regardless...an ASP.NET form still passes the same raw data to the server as any other form (though this is abstracted away from the developer).
It's very possible that you are missing the cookies which go along with the request. If the search page (or any piece of the site) uses session state, the ASP.Net session cookie must be included in the request. You will be able to tell it from its name (contains "asp.net" and "session").
I assume that you have used a tool like Firebug or Chrome to view the complete outgoing request when the page is submitted. From my quick test, it looks like the request may be performed with a GET, not a POST. I submitted a form, looked at the request, and pasted the URL into a new browser window.
Example: http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx?ContNum=CMAU1173561&T=57201202648
This may be all you need to do.
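As a sketch, here is the same replay in Python's requests library; the query parameters come straight from the example URL above, and a Session object is used so that any ASP.NET session cookie set by the first visit is carried along automatically:

import requests

session = requests.Session()
# Visit the page once so any session cookie gets set.
session.get("http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx")

# Replay the search as a GET with the observed query string.
resp = session.get(
    "http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx",
    params={"ContNum": "CMAU1173561", "T": "57201202648"},
)
print(resp.text)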
I am working on screen scraping. It's easy when the filtering is in the query string, but the problem is with AJAX-based filtering.
e.g. here is a sample URL
When you open this page, enter a hotel name, and click Go, the AJAX filter works and shows the results accordingly; or if you click on Next Page, it shows the next records, also via AJAX.
Please suggest how to handle these kinds of issues when working on screen scraping.
Thanks a lot.
You may want to try two Firefox add-ons: "Firebug" and "Tamper Data".
The "Console" window of Firebug shows the AJAX request and response.
You can then write scripts using the PHP/cURL library to mimic the request.
Do an HTTP request as you normally would for any link or form submit, but use the URL used by the AJAX call. Sometimes you may need to read the JavaScript source to determine how the URL is built.
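As an illustrative sketch (in Python rather than PHP/cURL, but the request is the same), once you know how the AJAX URL is built you can request it directly; the endpoint and parameters here are hypothetical, and some servers also check for an X-Requested-With header:

import requests

resp = requests.get(
    "http://www.example.com/hotels/search",      # hypothetical AJAX endpoint
    params={"name": "Hilton", "page": 2},        # reconstructed from the page's JavaScript
    headers={"X-Requested-With": "XMLHttpRequest"},
)
print(resp.text)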