Best scraper/crawler needed for multipage forms (Nokogiri, Scrapy, other?) - web-scraping

I've read that Nokogiri/Mechanize (Ruby), for example, are not good at traversing multiple pages, but may be better with sites that use Ajax.
The sites I want to scrape are multi-page forms, with some Ajax overlays. Speed is important. These sites all display prices, so I am making a price aggregator.

I use Capybara with WebKit as a headless browser.
You'll need to install the capybara gem, and the capybara-webkit gem as well.
https://github.com/thoughtbot/capybara-webkit
The syntax is very simple.
agent.visit 'some url'
agent.execute_script 'javascript here'
The gem also has page management, or you may simply go back to the previous page by executing JavaScript:
agent.execute_script("window.history.go(-1)")
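A minimal sketch of a full session, assuming a hypothetical multi-page price form (the URL, field name and selector below are made up, and capybara-webkit's API may vary slightly between versions):

require 'capybara'
require 'capybara/webkit'

session = Capybara::Session.new(:webkit)

# Hypothetical first page of a multi-page quote form.
session.visit 'http://example.com/quote/step1'
session.fill_in 'zip_code', with: '90210'
session.click_button 'Next'

# Capybara waits for matching elements to appear, which helps with Ajax overlays.
price = session.find('.price').text

# Arbitrary JavaScript can be run at any point, e.g. to go back a page.
session.execute_script('window.history.go(-1)')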

Related

Is it easier to scrape the AMP versions of webpages?

I'm working on a web scraper that aggregates newspaper articles. I know the AMP protocol mandates a stripped-down version of JavaScript, and I also know that JavaScript (in part) enables website administrators to detect/prevent scraping. So logically, I figured it would be easier to scrape AMP websites. On the other hand, if this is true, I presume Stack Overflow would be on top of it, but I haven't found a single thread reaffirming my inference. Am I correct, or am I overlooking something?
I would say that AMP pages are definitely easier to scrape, because there is virtually no custom JS code. Many sites insert content with JS or AJAX; AMP limits the libraries you can use, so an AMP page carries far less script than a regular site.
Furthermore, if you want to scrape content rendered by JavaScript, you can use Selenium. If not, PHP is the way to go (IMHO), or BeautifulSoup in Python.
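If you happen to be in Ruby (the language of the question at the top of this page), a minimal Nokogiri sketch of the same idea might look like this; the AMP URL and selectors are hypothetical, and real sites will differ:

require 'nokogiri'
require 'open-uri'

# Hypothetical AMP URL; many newspapers expose articles under an /amp/ path
# or advertise one via <link rel="amphtml"> on the regular article page.
doc = Nokogiri::HTML(URI.open('https://example.com/some-article/amp/'))

# AMP keeps the article text in plain HTML, so no JavaScript rendering step
# is needed; simple CSS selectors are usually enough.
headline   = doc.at_css('h1')&.text
paragraphs = doc.css('article p').map(&:text)

puts headline
puts paragraphs.join("\n")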
Happy scraping!

.NET library to interact with a webpage programmatically

I need to parse some information from a website.
I know there are some testing tools that do that, but I need it to be done in real time, from ASP.NET web page code. So I guess I need a .NET library.
Previously I was just able to generate a specific URL and parse the HTML, but now I have to get information from a page that is generated by AJAX calls after form clicks. So I have to simulate those clicks, right?
BTW, I only have (limited) experience with server-side ASP.NET. Maybe the optimal solution is quite different from what I suspect?
You may take a look at WatiN.
From the homepage:
Automate all major HTML elements with ease
Find elements by multiple attributes
Native support for Page and Control model.
Supports AJAX website testing
Supports creating screenshots of webpages
Supports frames (cross domain) and iframes
Handles popup dialogs like alert, confirm, login etc..
Supports HTML dialogs (modal and modeless)
Easy to integrate with your favorite (unit) test tool
Works with Internet Explorer 6, 7, 8, 9 and FireFox 2 and 3
Can be used with any .Net Language
Licensed under the Apache License 2.0
Downloaded more than 120,000 times.
And since its open source you can add and contribute new features yourself!
Take a look at Coypu. It wraps WatiN and Selenium giving you helper methods that make this kind of work a lot easier.

Does Aptana support custom page templates?

I'm fairly new to using Aptana, but I'm wondering if it supports a feature that I used to use a lot in Dreamweaver, where you could create a page template, i.e. the header and footer of a page that would stay the same for each page on a website, leaving only the content to be coded.
I found this feature really useful as you only needed to change code once for it to propagate to all pages.
I've searched for this feature in Aptana, but I'm not sure of the exact terminology.
Not currently. There have been requests for it, but it has not yet been implemented:
https://aptana.lighthouseapp.com/projects/35272/tickets/2140-allow-for-dreamweaver-like-templates

Facebook iFrame application using Facebook style sheet

I am working on a Facebook iFrame application, and have a question about styling.
I want the application content to look like the rest of facebook. So the most obvious approach I could think of was to use a stylesheet provided by Facebook for application development that includes such styles. However I cannot seem to find anything about this on developers.facebook.com or any other site for that matter.
I have created some FBML applications earlier, and these were able to use Facebook styles directly, since the application content was rendered within the Facebook pages. But iframes do not inherit the stylesheet from the parent content (nor should they), so I was wondering how (or possibly if) this can be done.
I have found some posts/blogs that simply tell you to create an application stylesheet that mimics the Facebook look. But I don't think this is a very good idea, as this CSS must be updated every time anything changes on Facebook. It also seems that all the Facebook wiki pages regarding CSS (which I have used before) have been removed.
The reason I do not want to use FBML Canvas is that Facebook is in the process of deprecating this approach. They recommend that new applications be created using iframes.
http://developers.facebook.com/docs/reference/fbml/
I really hope someone has some good ideas on this.
There is no official way. For some reason, FB shards their styles to a ridiculous degree. They also change the filename rather than appending a version parameter every time they make a change, to prevent downstream caching. Here's an example of today's stylesheets:
http://static.ak.fbcdn.net/rsrc.php/y-/r/40PDtAkbl8D.css
http://b.static.ak.fbcdn.net/rsrc.php/yE/r/u7RMVVYiOcY.css
http://static.ak.fbcdn.net/rsrc.php/yT/r/P-HsvhlyVjJ.css
http://static.ak.fbcdn.net/rsrc.php/yT/r/CFyyRO05F0N.css
http://static.ak.fbcdn.net/rsrc.php/y0/r/k00rCIzSCMA.css
http://b.static.ak.fbcdn.net/rsrc.php/yv/r/BJI6bizfXHL.css
http://b.static.ak.fbcdn.net/rsrc.php/yD/r/rmbhh_xQwEk.css
http://b.static.ak.fbcdn.net/rsrc.php/yn/r/xlsrXFt9-vD.css
http://b.static.ak.fbcdn.net/rsrc.php/yN/r/Uuokrl6Xv3c.css
http://b.static.ak.fbcdn.net/rsrc.php/y0/r/klTGALEjWM8.css
http://b.static.ak.fbcdn.net/rsrc.php/yN/r/mlYhlJwnCdr.css
http://b.static.ak.fbcdn.net/rsrc.php/yT/r/uFI2FW2LitH.css
http://b.static.ak.fbcdn.net/rsrc.php/yh/r/5Bzj1255G1S.css
http://b.static.ak.fbcdn.net/rsrc.php/yp/r/5UteuBI1b8_.css
You can automate this process fairly easily in either PHP or .NET, using the existing solutions Minify and Combiner respectively.
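As a rough illustration of the same idea in plain Ruby rather than PHP or .NET (the two URLs below are copied from the list above and will go stale whenever Facebook rotates its filenames):

require 'net/http'
require 'uri'

# Stylesheet URLs taken from the list above; refresh them whenever
# Facebook ships new CSS, since the filenames change each time.
urls = [
  'http://static.ak.fbcdn.net/rsrc.php/y-/r/40PDtAkbl8D.css',
  'http://b.static.ak.fbcdn.net/rsrc.php/yE/r/u7RMVVYiOcY.css'
]

# Download each file and append it to a single combined local stylesheet.
File.open('facebook-combined.css', 'w') do |out|
  urls.each do |url|
    out.puts "/* #{url} */"
    out.puts Net::HTTP.get(URI(url))
  end
end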
A simpler method would be to use the Web Developer toolbar for Firefox: go to Facebook and choose the toolbar's "View CSS" option, which will bunch all the CSS up for you. Copy and paste it into your own local stylesheet, and you only have to update it when Facebook makes a major change.
So while there is no simple way (that I am aware of), there are methods for you to take care of it in a fairly speedy manner.

Which Javascript history back implementation is the best?

There are implementations of history.back in Microsoft AJAX and jQuery (http://www.asual.com/jquery/address/).
I already have jQuery and ASP.NET AJAX included in my project, but I am not sure which implementation of history.back is better.
Better for me is:
Already used by some large projects
Wide browser support
Easy to implement
Little footprint
Does anybody know which one is better?
EDIT:
Another jQuery plugin is http://plugins.jquery.com/project/history (recommended in the book jQuery Cookbook). This one has worked well so far.
One alternative to jQuery Address is the nice jQuery history plugin. There are also URL Utils.
Reference: AJAX History and Bookmarks.
If you are building an ASP.NET application, then using the ASP.NET AJAX Framework gives you many advantages and a nice, simple API to use server-side.
Below you can find an example that uses browser history with ASP.NET AJAX:
Create a Facebook-like AJAX image gallery
Both have wide browser support.
For me it is easier to integrate the Microsoft AJAX Framework in an ASP.NET page, so again, if you have an .aspx page it might be easier to work with ASP.NET AJAX.
If you don't need AJAX exactly, i.e. if updating only parts of the site on request is sufficient for you, then you can use an invisible iframe as the target for loading a generated HTML file containing only a JS script that updates/resets the "updateable" parts of the site. This is a cross-browser solution and doesn't need address polling.
Example, but not in ASP: kociszkowo.pl (Polish site)
When you click a section icon there and your browser supports JavaScript, the link is modified before being fetched: the target is changed to the iframe and the href is suffixed with .dhtml to inform the server that we're interested in a special version of the page. If you press Back in your JS-equipped browser, the previously fetched iframe page is loaded from the cache. Simple, but it requires some decisions at the architectural level.
This link modification is irrelevant here; it's just the result of combining the JS and non-JS worlds.
In my experience, your best bet is to use the same one that is doing most (if not all) of your AJAX calls. For instance, if you're using asp:UpdatePanel controls, use the MS one; if you're using jQuery.ajax, use the jQuery history plugin. If you're doing a mix (which I've tried to avoid in my projects), I'd personally test with both and see which behaves better; if they both test fine, then it's a bit of a preference. Some may argue the Microsoft one has better support, but jQuery's history plugin may get more use and be more mature.
http://msdn.microsoft.com/en-us/library/system.web.ui.updatepanel.aspx
http://docs.jquery.com/Ajax/jQuery.ajax#options
