How to scrape data from a website that uses AJAX and JavaScript? - web-scraping

If a website uses AJAX and JavaScript to load content, scraping it can be difficult. The data may be generated dynamically or held in a JavaScript variable, so it never appears as markup in the initial HTML that a simple scraper downloads.

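This question is very broad and could have many different answers. Tools that can help include Selenium, BeautifulSoup, and Scrapy. Selenium drives a real browser, so the page's JavaScript runs; BeautifulSoup and Scrapy parse the HTML they are given and do not execute JavaScript themselves, so they are usually paired with a browser tool or pointed at the AJAX endpoints the page calls.
This guide can help you understand how to use Scrapy on pages that rely on AJAX and JavaScript.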
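When the data sits in a JavaScript variable inside an inline script, it can sometimes be read without running any JavaScript. The following is a minimal sketch, assuming the value is a valid JSON literal; the sample page and the variable name `pageData` are hypothetical, and the helper `extract_js_variable` is only an illustration, not part of any of the tools named above.

```python
import json
import re

# Hypothetical page source; the variable name `pageData` is an assumption.
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script>
  var pageData = {"items": [{"name": "Widget", "price": 9.5}, {"name": "Gadget", "price": 12}]};
</script>
</body></html>
"""


def extract_js_variable(html, var_name):
    """Return the JSON value assigned to `var_name` in an inline script, or None."""
    # Match "var NAME =", "let NAME =" or "const NAME =" and stop right before the value.
    pattern = re.compile(r"\b(?:var|let|const)\s+" + re.escape(var_name) + r"\s*=\s*")
    match = pattern.search(html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (for example the trailing ";").
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


if __name__ == "__main__":
    data = extract_js_variable(SAMPLE_HTML, "pageData")
    print([item["name"] for item in data["items"]])
```

This only works when the assigned value is strict JSON. A JavaScript object literal with unquoted keys, or data that is fetched later by an AJAX call, needs a different approach, such as requesting the AJAX endpoint directly or rendering the page in a browser.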

Related

Manage ads inside a Single Page Application

I'm developing a Single Page Application (SPA), so I refresh the page's HTML content dynamically using Ajax requests.
I'd like to register for the DoubleClick for Publishers (DFP) program, but I'm wondering whether my SPA can integrate advertising, given that its dynamic content is loaded without refreshing the page.
I saw this link: https://support.google.com/dfp_sb/answer/3058726
So I assume it's ok. But I'd like to be certain before starting to use DFP. Could someone confirm please?
Then, sometimes I'm using external HTML pages that I still load using Ajax. Should I consider writing the advertising banners' JavaScript inside these external views, or directly inside the master page of my app?
Last question: How can I manage users who have ad-blocking software installed? Am I allowed to detect the presence of an ad blocker using JavaScript and then execute some specific code for these users?
I'm working on a SPA and using DFP successfully. Here is my feedback on your questions:
"So I assume it's ok. But I'd like to be certain before starting to use DFP. Could someone confirm please?"
Yes, you can refresh the banners using the method described in the link you shared.
"Then, sometimes I'm using external HTML pages that I still load using Ajax. Should I consider writing the advertising banners' JavaScript inside these external views, or directly inside the master page of my app?"
Loading them from the external views will hurt performance. You can control everything from the main page, and you will get better results.
"Last question: How can I manage users who have ad-blocking software installed? Am I allowed to detect the presence of an ad blocker using JavaScript and then execute some specific code for these users?"
I have not started working on this yet, but you can detect an ad blocker (as forbes.com does on its website), and there are also projects that deal with this.

Web Scraping in Asp.net? Any library?

I have used HTML Agility Pack, but it does not let me crawl pages. I also found WatiN, but its website is not working at the moment. Can anybody suggest a list of libraries?
I have to fill in some information, then click a button, and then extract some information from the response pages.
You can try this open source web crawler: http://code.google.com/p/abot/

How to index a web site

I'm asking on behalf of somebody, so I don't have too many details.
What options are available for indexing site content in an ASP.NET web site? I suspect SQL Server's Full Text index may be used if the page content is stored in the database. How would I index dynamic and static content if that content isn't stored in the DB, but in html and aspx pages themselves?
We purchased Karamasoft Ultimate Search several years ago. It is a search engine add-on for your web site. I like it because it is a simple tool that taught us searching on our site. It is pretty inexpensive and we knew we could buy later if we needed more or different features. We needed something that would give us searching without having to do a lot of programming.
Specifically, this tool is a web crawler. It runs on your web server, acts like an end user, and navigates through your site keeping a record of your web pages, so when a real user searches, they are shown the pages that have the content they want.
Keep in mind that it acts like an end user, so your dynamic data is indexed right along with the static content, because it indexes the final web page. We needed this feature, and it is what appealed to us the most.
You can use a web crawler to crawl the site and add the content to a database, which is then full-text indexed. There are a number of web crawlers out there.
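The crawl-then-index idea can be sketched roughly as follows. This is a minimal illustration, assuming the crawl has already produced a mapping from URL to final HTML; the names `TextExtractor`, `build_index`, and `search`, and the sample pages, are hypothetical. An in-memory inverted index stands in for the database's full-text index.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_text(html):
    """Return the visible text of one HTML page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_index(pages):
    """Map each lowercase word to the set of URLs whose page text contains it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for word in re.findall(r"[a-z0-9]+", page_text(html).lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    return sorted(set.intersection(*(index.get(w, set()) for w in words)))


# Hypothetical crawl result: URL -> HTML as served to an end user.
PAGES = {
    "/products.aspx": "<html><body><h1>Products</h1><p>Blue widget in stock</p>"
                      "<script>var x = 'hidden';</script></body></html>",
    "/about.html": "<html><body><p>About our widget company</p></body></html>",
}

if __name__ == "__main__":
    idx = build_index(PAGES)
    print(search(idx, "blue widget"))
```

Because the index is built from the HTML the server returns, static .html pages and dynamically generated .aspx pages are treated the same way. A real deployment would more likely hand the extracted text to a full-text engine than keep it in a Python dictionary.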
Lucene is a well known open source tool that would help you here. The main branch is Java based but there is a .Net port too.
Main site: http://lucene.apache.org/
.Net port: http://incubator.apache.org/lucene.net/
Having used several alternatives I would be loath to do anything other than Google Site Search.
The only reason I use SQL Full Text Search is to search through multiple columns. It's really hard to implement it in any effective manner.

Dynamically loading content (ajax) other than using page methods

I'm working on a site at the moment that loads all of its browser popups by using page methods. This approach works but it's starting to get messy. I also view page methods as ways to perform small tasks, username availability comes to mind.
What other options are there besides page methods and the update panel?
You should look at the jQuery Ajax functionality: http://api.jquery.com/category/ajax/
You can point the URL of the Ajax request to an aspx page, an HTML page, or pretty much any web resource you like, as long as the request is handled on the server by some kind of HttpHandler and your callback is able to handle and display the returned resource.
Page methods sound fine to me (they are essentially web services).
You could pull more data per request and use cache more. You could build a better JavaScript wrapper which satisfies the need for more tidiness.
You could choose another library: How to call a web service from jQuery

Rss and external feed

I want to build an app similar to this one: http://community.livejournal.com/ohnotheydidnt/32551171.html
using a LiveJournal RSS feed. Is there any way of retrieving an external feed (meaning a feed from a different domain than the one hosting my web application, which the same-origin policy restricts)? I've built a parser, but I would like to use Dashcode for simple HTML building.
Across domains, if the data is only available via RSS and you don't have control of the other domain, then your best option is a server-side proxy.
If you have control over the other domain, you can create a page containing a JavaScript function which uses XMLHttpRequest to pull the RSS and return it. Then you can use a cross-domain messaging library like easyXDM to call that script.
You also might want to check whether the RSS feed's website supports JSONP as an alternate format, which would allow you to get the RSS data via JavaScript. Make sure you trust the site if you do this, though, since the site can execute JavaScript inside your page!
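The server-side proxy option mentioned at the start of this answer could look roughly like this. It is a minimal sketch, assuming a plain RSS 2.0 feed with `item`, `title`, and `link` elements; the function names `rss_to_items` and `proxy_feed` and the sample feed are hypothetical. The idea is that your own server fetches the remote feed and returns it from your own domain, so the browser never makes a cross-domain request.

```python
import json
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Hypothetical feed content used to exercise the parser without a network call.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""


def rss_to_items(rss_xml):
    """Parse RSS 2.0 text (str or bytes) into a list of {title, link} dicts."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


def proxy_feed(feed_url):
    """Fetch a remote feed on the server and return its items as a JSON string.

    The feed URL should be fixed or whitelisted on the server rather than
    taken from the client, so the proxy cannot be used to fetch arbitrary URLs.
    """
    with urlopen(feed_url, timeout=10) as response:
        rss_xml = response.read()
    return json.dumps(rss_to_items(rss_xml))


if __name__ == "__main__":
    print(json.dumps(rss_to_items(SAMPLE_RSS), indent=2))
```

A request handler on your own domain would call something like `proxy_feed` and send the result back to the page, which can then consume it with an ordinary same-origin Ajax call.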

Resources