Waiting for the loading page with Scrapy

I'm trying to fetch the content of a webpage, using FormRequest to get past a form. The problem is that after the form there is a page with a loading bar, and only after the bar fills does the site show the content I want. The Scrapy script returns the loading page in the Response object, not the final page with the results. What can I do to solve this? I believe I may need to set a timer to make the crawler wait for the loading page to finish its work.

There's no concept of waiting when doing basic HTML scraping. Scrapy makes a request to a web server and receives a response; that response is all you get.
In all likelihood, the loading bar on the page is using JavaScript to render the results. An ordinary browser appears to wait on the page, but under the hood it's running JavaScript and likely making more requests to a web server before it has enough information to render the content.
To replicate the result programmatically, you will have to somehow render that JavaScript. Unfortunately, Scrapy does not have that capability built in.
Some options you have include the following (a minimal sketch using one of them follows the list):
http://www.seleniumhq.org/
https://github.com/scrapinghub/splash
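For example, here is a minimal sketch using the scrapy-splash plugin, assuming a Splash instance is running and the plugin's middlewares are enabled in settings.py per its README; the URL, wait time, and selector are placeholders, not part of the original question:

import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash

class ResultsSpider(scrapy.Spider):
    name = "results"

    def start_requests(self):
        # Splash runs the page's JavaScript and returns the rendered HTML.
        # args={"wait": 5} gives the loading bar five seconds to finish.
        yield SplashRequest(
            "http://example.com/results",  # placeholder URL
            callback=self.parse,
            args={"wait": 5},
        )

    def parse(self, response):
        # response.text is now the rendered page, not the loading screen.
        for text in response.css("div.result::text").getall():
            yield {"text": text}

The same idea works with Selenium: drive a real browser, wait for the results element to appear, then hand the rendered HTML to your parser.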

Related

R selenium webdriver not loading element even after wait and scroll down

I'm trying to design a scraper for a page in R using the Selenium webdriver package, and the part of the page I want to scrape is not loading, no matter how long I wait. It may be something to do with JavaScript, which I admittedly know nothing about.
I've tried forcing it to scroll down to load the element (in this case a table) but to no avail.
It loads fine in normal browsers.
It's something like the severalth site where this has happened, so I thought I'd pop my Stack Overflow cherry and ask the experts.
Sorry I have no reprex as I just don't know where the issue is coming from!
The link to the page is
https://jdih.kemenkeu.go.id/#/home
[Image: what Selenium says it sees; the yellow highlighted area is where the table should load.]
[Image: how it is supposed to display, shown in Firefox.]
Thanks for reading!
(18 months later and I can answer my own question!)
The issue was that the page is loading content dynamically using an API request.
When scraping with a direct GET request for a URL, that initial request alone may not return the desired content.
In this case, I found the exact issue by reloading the page with the developer tools open (F12) and the 'Network' (or similar) tab selected.
This shows all the requests made when the browser loads the page.
One of these will be a request for the desired data; in this case, by filtering on XHR requests only, I was able to identify one which loaded the content through an internal API.
Right-click that request, open it in a new tab, and voilà: you have a URL you can request directly, just as you normally would, and it returns the required content.
Sometimes the URL alone is not enough; you also need the correct request headers sent with the request. These can be seen in the request data in the browser's developer tools, as mentioned above. Right-click the request and select 'Copy headers' or similar to get them.
In this case, i.e. when using R, the httr package can be used to send GET requests with specific headers, like so:
library(magrittr)  # provides the %>% pipe

headers <- c(
  "Host" = "website.com"
  # [other headers here]
)

page <- httr::GET(url = "https://www.website.com/data",
                  httr::add_headers(.headers = headers)) %>%
  httr::content()
When you have the page content, it is possible to parse the HTML or whatever else is required as usual.

How to scrape ajax calls in PHP

Please let me know: is it possible to scrape some info after AJAX has loaded it, with PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for the advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar; most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick, and a quick search turns up numerous examples of how to set it up. A minimal sketch follows.
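For instance, here is a sketch using Selenium's Python bindings (bindings for other languages, including PHP, exist as well); it assumes Chrome with a matching chromedriver is installed, and the URL and element id are hypothetical:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get("http://www.example.com/test.php")
    # Wait until the element filled in by the AJAX call exists,
    # rather than sleeping for a fixed amount of time.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "ajax-content"))
    )
    print(element.text)  # the dynamically loaded content
finally:
    driver.quit()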
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an AJAX call using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools console.
Open the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site. You can then filter out specific requests.
For example, if you are interested in AJAX calls, you can select the XHR filter.
You can then click on any of the listed requests to get more information.
Using file_get_contents on an AJAX call
Depending on how robust the APIs behind these AJAX calls are, you could do something like the following.
<?php
// Request the AJAX endpoint directly (example URL).
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the response is JSON, then add
$data = json_decode($content);
However, you are going to have to do this for each AJAX request on the site. Beyond that, you will need a browser-automation solution like the ones mentioned above.
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific AJAX calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also uses AJAX to manipulate the document, then you will NOT be able to use SIMPLE_HTML_DOM.
In the end I worked around my problem: I grabbed the POST URL with all its parameters from the AJAX call and made the same request using the SIMPLE_HTML_DOM class (a sketch of the same idea follows).
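For illustration, here is the same idea in Python with the requests library; the endpoint, parameters, and header are hypothetical stand-ins for whatever your browser's Network tab shows:

import requests

# Replay the AJAX POST exactly as the browser sent it (copy the URL,
# form parameters, and any required headers from the Network tab).
response = requests.post(
    "http://www.example.com/ajax/search",            # hypothetical endpoint
    data={"query": "shoes", "page": 1},              # hypothetical parameters
    headers={"X-Requested-With": "XMLHttpRequest"},  # often expected by AJAX endpoints
)
response.raise_for_status()
print(response.json())  # if the endpoint returns JSON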

Seamlessly load images to ASP.NET page

I've been trying to load an image onto an ASPX page without revealing its request. I mean: when I monitor the page with Telerik's Fiddler, I can see the image in the requests list. If I refresh the page, the request is no longer shown (apparently the image was cached the first time).
Question: Is it possible to load/cache the image silently, without the user even knowing it had been requested?
After a bunch of hours messing with various techniques involving requests, server-side operations, and whatnot, it turned out to be as simple as it gets.
In order to hide the image loading request I modified the previous page.
I have, for example, pages A and B. I need to cloak an image request for page B.
Page A loads a bunch of images already, so let's squeeze in one more. The image gets loaded and cached, and when the user opens page B, the image is still in the cache and is used instead of requesting a new one. All we had to do was put this
<img src="sampleImage.png" style="display:none" />
code somewhere in page A, so the image got requested, loaded, and cached, but never shown. Too simple a solution; I'm more embarrassed than proud to have solved it this way, but that's how I learn, I guess :D

Facebook Page Load Concept

I know the concept of reloading page content via AJAX without a page refresh. But Facebook pages reload through what looks like a normal page load, yet the sidebar doesn't reload; only the content area does.
How is this possible?
Thanks in advance, friends.
Facebook uses BigPipe.
The general idea is to decompose web pages into small chunks called pagelets and pipeline them through several execution stages inside web servers and browsers. It is implemented entirely in PHP and JavaScript.
Clicking or taking some action on the page initializes/executes a pagelet; the response comes back via an iframe or via AJAX. The response is read and rendered into a small chunk of the page, so the page itself is not refreshed.
I believe they are using the new history.pushState functionality in HTML5.

Openfire fastpath chat causing site to load slowly

We are adding Openfire Fastpath chat to our site. It will determine when live chat is available and display an appropriate image indicating the current status, with links for each state.
The JavaScript call hits a function on another box, and this function uses document.write to output the HTML to the page. I know there is a delay because it makes a request to another server and waits for a result to be returned. The pause is only about half a second, but it holds up the rest of the page load.
Has anyone experienced a similar issue, or can anyone offer tips for getting this to load without blocking the page? I tried putting it into an ASPX AJAX panel, but that seemed to cause other issues.
I ended up using an iframe; that is the only thing I could find that seemed to work.
