I want to crawl web pages like Quora, Pinterest, etc. All the examples I've found use Selenium to simulate the scrolling action, but opening a new browser window for scrolling and crawling is slow and inefficient. Is there a more efficient method for crawling an infinite-scrolling page?
If the website uses Ajax to transfer its data, my answer could help you:
https://stackoverflow.com/a/34802775/5246180
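The idea behind that answer can be sketched as follows: open DevTools, watch the Network tab for the XHR the page fires as you scroll, then call that endpoint directly instead of driving a browser. Everything below is a made-up example of such an endpoint (the `after` cursor parameter, the `{items, nextCursor}` response shape); real sites name these differently.

```javascript
// Pure helper: build the next page URL from the cursor the previous
// response returned (null cursor means "first page").
function nextPageUrl(base, cursor, limit) {
  const url = new URL(base);
  url.searchParams.set('limit', String(limit));
  if (cursor !== null) url.searchParams.set('after', cursor);
  return url.toString();
}

// Paging loop (Node 18+ global fetch); stops when the API stops
// returning a cursor. Assumes a {items, nextCursor} JSON shape.
async function crawlAll(base, limit) {
  const items = [];
  let cursor = null;
  do {
    const res = await fetch(nextPageUrl(base, cursor, limit));
    const page = await res.json();
    items.push(...page.items);
    cursor = page.nextCursor ?? null;
  } while (cursor !== null);
  return items;
}
```

This avoids rendering the page at all, which is why it is so much faster than simulated scrolling.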
I'm trying to get the content of a webpage, using FormRequest to submit a form. The problem is that after this form there is a page with a loading bar, and only once the bar fills does the site show me the content I want. The Scrapy script is returning the loading page in the Response object, not the final webpage with the results I want. What can I do to solve this? I believe I may need to set a timer to make the crawler wait for the loading page to finish its work.
There's no concept of waiting when doing basic HTML scraping. Scrapy makes a request to a web server and receives a response; that response is all you get.
In all likelihood, the loading bar on the page is using JavaScript to render the results. An ordinary browser will appear to wait on the page; under the hood it's running JavaScript and likely making more requests to the web server before it has enough information to render the page.
To replicate the result programmatically, you will have to render that JavaScript somehow. Unfortunately, Scrapy does not have that capability built in.
Some options you have include:
http://www.seleniumhq.org/
https://github.com/scrapinghub/splash
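For illustration, here is a minimal sketch of the first option driven from Node with the `selenium-webdriver` bindings. The `#results` selector and the 15-second timeout are placeholders for whatever element the loading bar eventually fills in, and a chromedriver binary must be installed; the dependency is required lazily so the file can be read and loaded without it.

```javascript
// Sketch: let a real browser run the page's JavaScript, then wait for the
// element the loading bar eventually produces before reading it.
async function renderAndExtract(url) {
  // required lazily so the sketch can be loaded without the package installed
  const { Builder, By, until } = require('selenium-webdriver');
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get(url);
    // block until the page's JS has rendered the element we want
    const el = await driver.wait(until.elementLocated(By.css('#results')), 15000);
    return await el.getText();
  } finally {
    await driver.quit();
  }
}
```

Splash works the same way conceptually but runs as a separate HTTP service that Scrapy can call, which integrates better with an existing spider.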
I'm looking to get structured article data from webpage URLs. So far I've found these two services: http://www.diffbot.com/ and http://embed.ly/extract/demos/nlp. Are there better alternatives, or is it worthwhile to write the code to do this myself?
If you'd like to skip the code, and are looking for simple software for web scraping / ETL applications, I'd suggest Foxtrot. It's easy enough to use and doesn't require coding. I use it to scrape data from certain gov't websites and dump it into an Excel spreadsheet for reporting purposes.
I have done web scraping / content extraction for quite some time now.
For me the best approach is to write a Chrome extension and automate the browser through its APIs. This requires that you know JavaScript and HTML. In one of my recent projects I used a background page with a couple of editable divs to configure the scraping session, plus some buttons on the background page to start the process. The background page loads a JS script which listens for click events on those buttons.
When one of the buttons is clicked, I open a new tab for the scraping session with chrome.tabs.create. The background JS also registers a chrome.tabs.onUpdated.addListener callback to inject content scripts when the tab URL contains a specific page/domain name.
The content script then does the scraping job, for example selecting elements with jQuery or regular expressions, and finally sends an object back to the background JS using chrome.runtime.sendMessage. The background script listens for messages with chrome.runtime.onMessage.addListener and acts on the extracted content.
The extension also automates web databases by clicking, for example, the next-page links.
I have added a timing setting to control the number of links clicked / tabs opened per minute, so that access is deliberately slowed down and excessive crawling is avoided.
Finally, the results are uploaded with an AJAX call and inserted into MySQL by a PHP page.
The next time the extension runs, it fetches the keys/links that already exist in the database with another AJAX call and ensures that only new information is extracted.
I have also built extensions like the above for Firefox, but the best and easiest solution for me is a Chrome/Chromium extension.
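A minimal sketch of the content-script step described above. The extraction is kept as a plain function so it can run outside the browser; the regexes, field names, and the `type: 'scraped'` message shape are illustrative, not a fixed API.

```javascript
// Naive regex extraction; inside a real page you would use DOM APIs
// or jQuery as described above.
function extractArticle(html) {
  const title = (html.match(/<h1[^>]*>([^<]*)<\/h1>/i) || [, ''])[1].trim();
  const links = [...html.matchAll(/href="([^"]+)"/g)].map(m => m[1]);
  return { title, links };
}

// Inside the extension this runs against the live page and hands the
// result to the background page (guarded: chrome.* only exists there).
if (typeof chrome !== 'undefined' && chrome.runtime) {
  chrome.runtime.sendMessage({
    type: 'scraped',
    payload: extractArticle(document.body.innerHTML),
  });
}
```

The background page would then match on `msg.type` in its chrome.runtime.onMessage.addListener callback and forward the payload to the PHP endpoint.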
My ASP.NET web site's home page takes 15-17 seconds between typing the address, pressing Enter, and anything at all becoming visible on the page. The page loads its data dynamically.
Why is this happening, and is there any solution to prevent it?
Boy is this open ended!
Have you investigated:
Volume of data being returned
Size of page being returned
Speed of getting the dynamically generated data
Speed of your internet connection
Whether you are returning any images and, if so, whether they are large files
Without knowing more about what your site is doing in the backend, there is little for us to suggest in any real sense.
Please provide more information.
Maybe if you post the URL?
I know the concept of reloading part of a page via Ajax without a full page refresh.
But Facebook pages appear to go through a normal page load, yet the sidebar doesn't reload; only the content area does.
How is this possible?
Thanks in advance, friends.
Facebook uses BigPipe.
The general idea is to decompose web pages into small chunks called pagelets, and pipeline them through several execution stages inside web servers and browsers. It is implemented entirely in PHP and JavaScript.
Clicking or taking some action on the webpage initializes/executes a pagelet; the response comes back from an iframe or via Ajax. The response is read and rendered into one small chunk of the page, so the whole page is not refreshed.
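The pagelet mechanics can be sketched roughly like this (illustrative only, not Facebook's actual code). The server flushes small payloads such as `{id, html}`, and a client-side helper drops each one into its placeholder div; the document object is passed in explicitly so the helper can be exercised outside a browser.

```javascript
// Inject one pagelet payload into its placeholder without refreshing
// the page. Returns false if no placeholder with that id exists.
function applyPagelet(doc, pagelet) {
  const slot = doc.getElementById(pagelet.id);
  if (slot) slot.innerHTML = pagelet.html;
  return Boolean(slot);
}
```

In the browser you would call it as `applyPagelet(document, payload)` for each payload as it arrives, which is why the sidebar can stay put while the content area updates.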
I believe they are using the new history.pushState functionality in HTML5.
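A rough sketch of that approach: swap only the content area and rewrite the address bar with history.pushState, so the sidebar never reloads. The `/content/...` fragment URL scheme and the `#content` element are assumptions for illustration.

```javascript
// Map a real URL path to the server endpoint that returns just the
// content-area fragment (hypothetical scheme).
function contentUrlFor(path) {
  return '/content' + (path.startsWith('/') ? path : '/' + path);
}

// Fetch the fragment, swap it in, and update the URL bar without a reload.
async function navigate(path) {
  const html = await (await fetch(contentUrlFor(path))).text();
  document.getElementById('content').innerHTML = html;
  history.pushState({ path }, '', path);
}

// Back/forward buttons fire popstate instead of a page load:
if (typeof window !== 'undefined') {
  window.addEventListener('popstate', e => {
    if (e.state && e.state.path) navigate(e.state.path);
  });
}
```

This is why the browser history and the URL keep working even though most navigation never triggers a full page load.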
I'm about to build a web application (not a web presentation) which will load its content through AJAX (jQuery) into a specific div. There will be a menu above the div, and when a user clicks an item in the menu, the appropriate page will be loaded into the main div.
I'd like to know if there are any cons and pros of choosing this pattern for a web application.
So far I'm aware that the browser back button and history/URL will be gone.
Two possible downsides are that it could make it difficult for users to bookmark content on your site and difficult for search engines to differentiate pages on your site.
You should probably provide more information on your reasons for taking this approach. You might have good reasons or it might be a case of using a technology (AJAX) because it is cool to use.
If you want to give users the impression of fast responsiveness, then yes, AJAX-load your pages, but still have a different URL for each page. This will take more code, but it will solve both issues I mentioned.
http://yourdomain.com/home.aspx //loads its own content via AJAX
http://yourdomain.com/contact.aspx //loads its own content via AJAX
etc
This is really only appropriate if you have a lot of content, or where the content involves time-consuming calculations, such as on a financial site. In most cases it would be less trouble to just load your pages normally or break your content into paged chunks.
The main con of this approach is that it will make your site very difficult for search engines to crawl. They don't read JavaScript, so your content won't be seen or indexed by them. Try to use progressive enhancement so that they (and any users who don't run JavaScript, e.g. with screen readers) don't get left behind.
On the other hand, you can keep browser history functionality. This can be done using the URL hash, e.g. http://www.example.com/#home vs http://www.example.com/#about-us. The nicest way to do this is to get Ben Alman's hashchange plugin and then use the hashchange event:
$(window).hashchange(function() {
    var location = window.location.hash;
    // do your processing here based on the contents of location
});
This will allow your users to use the history function and the bookmarking function of their browsers. See the documentation on his site for more information.
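As a concrete example of the "processing" step above, a small pure router can map the hash to a page key before you fetch and display the matching content. The page names here are illustrative.

```javascript
// Turn the raw location hash into a known page key, falling back to
// 'home' for an empty or unrecognized hash.
function pageFromHash(hash) {
  const key = (hash || '#home').replace(/^#/, '');
  const known = ['home', 'about-us', 'contact'];
  return known.includes(key) ? key : 'home';
}
```

Inside the hashchange handler you would then call something like `loadPage(pageFromHash(location))` to AJAX-load the right content into the main div.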