How to scrape ajax calls in PHP - web-scraping

Please let me know: is it possible to scrape some info after AJAX content has loaded, using PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for any advice.

Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick; a quick Google search turns up numerous examples of how to set it up.
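As a rough illustration, here is a minimal sketch using the php-webdriver bindings. This assumes the facebook/php-webdriver package is installed via Composer and a Selenium server is listening on localhost:4444; the URL is a placeholder.
<?php
// Minimal sketch: drive a real browser so JavaScript (and AJAX) runs,
// then read back the fully rendered HTML.
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

$driver = RemoteWebDriver::create(
    'http://localhost:4444/wd/hub',        // Selenium server address (assumption)
    DesiredCapabilities::chrome()
);
$driver->get('http://www.example.com');     // placeholder URL
$html = $driver->getPageSource();           // rendered HTML, AJAX content included
$driver->quit();
?>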
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an AJAX call by using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools console.
Hit the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site. You can then filter out specific requests.
For example, if you are interested in AJAX calls you can select XHR.
You can then click on any of the listed items in the table to get more information.
file_get_contents on an AJAX call
Depending on how robust the APIs behind these AJAX calls are, you could do something like the following.
<?php
// Request the AJAX endpoint directly, as discovered in the Network tab
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the response is JSON, then add:
$data = json_decode($content);
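Putting the two steps together, a minimal sketch with basic error handling might look like this (the endpoint URL and its ajax parameter are placeholders you would discover via the Network tab):
<?php
// Sketch only: the endpoint URL is a placeholder.
$url = "http://www.example.com/test.php?ajax=call";

$content = file_get_contents($url);    // returns false on failure
if ($content === false) {
    die("Request failed");
}

$data = json_decode($content);          // null on invalid JSON
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    die("Invalid JSON: " . json_last_error_msg());
}

var_dump($data);
?>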
However, you are going to have to do this for each AJAX request on the site. Beyond that, you will need a solution similar to the ones presented [here].
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific AJAX calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also uses AJAX to manipulate the document, then you will NOT be able to use SIMPLE_HTML_DOM.

Finally, I worked around my problem. I just took the POST URL, with all its parameters, from the AJAX call and made the same request using the SIMPLE_HTML_DOM class.
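For reference, a minimal sketch of that workaround (the endpoint URL and POST parameters are placeholders, and the simple_html_dom library is assumed to be available):
<?php
// Sketch only: endpoint and parameters are placeholders copied from the
// AJAX request seen in the browser's dev tools.
include 'simple_html_dom.php';

$params  = http_build_query(array('page' => 1, 'filter' => 'hotels'));
$context = stream_context_create(array(
    'http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => $params,
    ),
));

// Make the same POST request the page's AJAX call would make
$response = file_get_contents('http://www.example.com/ajax/endpoint.php', false, $context);

// Parse the returned HTML fragment with SIMPLE_HTML_DOM
$html = str_get_html($response);
foreach ($html->find('td') as $cell) {
    echo $cell->plaintext, "\n";
}
?>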

Related

R selenium webdriver not loading element even after wait and scroll down

I'm trying to design a scraper for a page in R using the Selenium WebDriver package, and the part of the page I want to scrape is not loading, no matter how long I wait for it. It may be to do with JavaScript, which I admittedly know nothing about.
I've tried forcing it to scroll down to load the element (in this case a table), but to no avail.
It loads fine in normal browsers.
It's like the severalth site for which this has happened, so I thought I'd pop my Stack Overflow cherry and ask the experts.
Sorry, I have no reprex, as I just don't know where the issue is coming from!
The link to the page is
https://jdih.kemenkeu.go.id/#/home
(Image: what Selenium says it sees; the yellow highlighted area is where the table should load.)
(Image: how the page is supposed to display, shown in Firefox.)
Thanks for reading!
(18 months later and I can answer my own question!)
The issue was that the page is loading content dynamically using an API request.
When scraping using direct GET requests of a URL to extract the page contents, this initial request alone may not load the desired content.
In this case, I found the exact issue by reloading the page with the developer interface (F12) open on the 'Network' (or similar) tab.
This then shows all the requests made when the browser loads the page.
One of these will be a request for the desired data - in this case by filtering on XHR requests only, I was able to identify one which loaded content through an internal API.
Right-click the request, open it in a new tab, and voilà: you have a URL which you can use in the same way you normally would with this scraping method, and it will provide the page content required.
Sometimes the URL alone may not be enough; you will need the correct request headers sent with the request. These can be seen in the request data in the browser's developer interface, as mentioned above. Right-click the request and select 'Copy headers' or similar to get them.
In this case, i.e. when using R, the httr package can be used to send GET requests with specific headers, like so:
library(httr)     # for GET() and add_headers()
library(magrittr) # for the %>% pipe

headers <- c(
  "Host" = "website.com"
  # [other headers here]
)

page <- httr::GET(url = "www.website.com/data",
                  httr::add_headers(.headers = headers)) %>%
  httr::content()
When you have the page content, it is possible to parse the HTML or whatever else is required as usual.

Webscraping a tricky asp.net page

The overall goal is to perform a search on the following webpage, http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx, with a container value of CMAU1173561. I have tried two approaches: the PHP extension cURL, and Python's mechanize. The PHP approach involves performing a POST submit using the input fields found on the page (NOTE: these are really ugly on the ASP.NET page). The returned page does not contain any of the search results. The second approach involves using Python's mechanize module. In this approach I load the page, select the form, then change the text field ctl00$ContentPlaceBody$TextSearch to the container value. When I load the response, again no search results.
I am at a real dead end. Any help would be appreciated, because as it stands my next step is to become an ASP.NET expert, which I prefer not to do.
The source of that page is pretty scary (giant viewstate, tables all over the place, inline CSS, styles that look like they were copied from Word).
Regardless...an ASP.NET form still passes the same raw data to the server as any other form (though it is abstracted away from the developer).
It's very possible that you are missing the cookies which go along with the request. If the search page (or any piece of the site) uses session state, the ASP.NET session cookie must be included in the request. You will be able to tell it by its name (it contains "asp.net" and "session").
I assume that you have used a tool like Firebug or Chrome to view the complete outgoing request when the page is submitted. From my quick test, it looks like the request may be performed with a GET, not a POST. I submitted a form, looked at the request, and pasted the URL into a new browser window.
Example: http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx?ContNum=CMAU1173561&T=57201202648
This may be all you need to do.
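If cookies do turn out to be required, a minimal PHP/cURL sketch of that GET approach might look like this (the cookie-jar path is a placeholder):
<?php
// Sketch only: perform the GET observed above while carrying cookies,
// in case the site requires an ASP.NET session. Cookie-jar path is a placeholder.
$url = "http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx"
     . "?ContNum=CMAU1173561&T=57201202648";

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);            // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);            // follow any redirects
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');   // store cookies received
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');  // send them back on later requests

$html = curl_exec($ch);
curl_close($ch);
?>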

Correct way to link to new page in canvas page app php-sdk

I'm running into a couple of issues and wondered if anyone had any insight. I'm using the latest PHP SDK to develop a canvas app that has a number of different steps. These steps are spread across multiple pages. When I first enter the app everything seems to work fine: the access token is there and I can call the API functions. On the second page (which is linked to in the same iframe) I get OAuth errors. Now if I use this on the second page:
$me = $facebook->getUser();
var_dump($me);
it returns the correct user id, but I still get errors when trying to run an API query (specifically an FQL one in this instance).
Now, bear in mind these links are within the iframe, so I was assuming the signed_request is getting lost somewhere; I know Facebook normally issues this via a POST. If I set all my links to target="_parent" with a URL such as http://apps.facebook.com/myapp/page2.php, then everything works fine. Facebook clearly posts the correct info this time. Subsequently, when I use links that only redirect the iframe, it seems to work fine again (implying a cookie is being set somewhere).
Now I've seen other apps that don't use target="_parent" yet seem to work correctly, only ever loading the iframe on subsequent clicks and not the full Facebook site. So I can only assume they are storing this info somewhere. I've tried to inspect these apps using HttpFox, but I can't see anything obvious. Does anyone have any links for best practice with multiple-page apps? I know I can get around this using full URLs and target="_blank", but I would like to know what's going on here. I've looked through the developer docs and the canvas page examples, but there's nothing obvious to me.
Any help or info would be appreciated
Many Thanks
There are a couple of ways to achieve this:
using the Facebook JavaScript SDK (which will set a cookie for you, so the PHP SDK can rely on it)
issuing a POST request to your pages that includes the signed_request from the initial page loaded in the canvas (see the sketch below)
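As a rough sketch of the second approach, the idea is simply to persist the signed_request that Facebook POSTs to the first canvas page, so later pages loaded inside the iframe can reuse it; everything here beyond the standard PHP session functions is an assumption about your app's structure:
<?php
// Sketch only: store the signed_request Facebook POSTs to the first
// canvas page, so later pages in the iframe can fall back to it.
session_start();

if (isset($_POST['signed_request'])) {
    // First load: Facebook POSTed the signed_request to the canvas page
    $_SESSION['signed_request'] = $_POST['signed_request'];
}

// On page2.php and later, use the POSTed value if present, else the stored one
$signed_request = isset($_POST['signed_request'])
    ? $_POST['signed_request']
    : (isset($_SESSION['signed_request']) ? $_SESSION['signed_request'] : null);
?>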

How do I submit a form with a large file upload and send ajax requests?

This works in other browsers but not in Chrome. I am trying to allow users to upload large files and have an AJAX call update them on the progress of the file upload.
So a unique ID is generated on the client side and added to the action of the form before sending. Then the form is submitted (the form only contains a file upload input) and an AJAX call is made to get the progress of the upload.
The ajax call goes to another page and uses the ID to lookup the upload.
I am using jQuery 1.5.1. Debugging this and putting something in the error function gives me nothing other than "error". Not very helpful. I used Chrome's debugger and it just says failed to load resource xxxx.aspx. xxx.aspx is the URL I needed. It turns out that there seems to be some sort of conflict between the form and the AJAX call.
Is there some way to get around this?
You should really look at SWFUpload, a great Flash-based uploader with concurrent upload and progress-bar support. It also makes things really easy on the server side; you don't need to implement the upload-percentage view yourself, as it is client-side based.
Not exactly an answer to your question, but here is a link to a tool that can really help you drill down, find good error messages, and step through JavaScript code: Firebug for Chrome. I got the IE and Chrome versions working and use them very regularly; it has been a life saver and has greatly decreased my debugging time:
http://getfirebug.com/releases/lite/chrome/
I would suggest making firebug a common tool in your debugging arsenal.
Use SlickUpload
It is a server control and module that does exactly what you are looking for, and it takes less than 10 minutes to set up.
Documentation: http://krystalware.com/Products/SlickUpload/Documentation/overview/

Screen Scraping - how to get AJAX based filtered data

I am working on screen scraping. It's easy when the filtering is in the query string, but the problem is with AJAX-based filtering.
E.g., here is a sample URL.
When you open this page, enter a hotel name, and click Go, the AJAX filter runs and shows the results accordingly; or if you click Next Page, it shows the next records, again via AJAX.
Please suggest how to handle these kinds of issues when working on screen scraping.
Thanks a lot.
You may want to try two Firefox add-ons: "Firebug" and "Tamper Data".
The "Console" window of Firebug shows the AJAX request and response.
You can then write scripts using the PHP/cURL library to mimic the request.
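For instance, a minimal PHP/cURL sketch of mimicking such a request (the endpoint URL and POST fields are placeholders you would copy from what Firebug shows for the AJAX call):
<?php
// Sketch only: endpoint and fields are placeholders taken from the
// request/response pair observed in Firebug.
$ch = curl_init('http://www.example.com/ajax/search');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'hotelName' => 'Hilton',   // hypothetical filter field
    'page'      => 2,          // hypothetical paging field
)));
// Some endpoints check this header to decide whether the request came via AJAX
curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));

$response = curl_exec($ch);
curl_close($ch);
?>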
Do an HTTP request as you normally would for any link or form submit, but use the URL used by the AJAX call. Sometimes you may need to read the JavaScript source to determine how the URL is built.
