How to scrape websites such as Hype Machine? - web-scraping

I'm curious about website scraping (i.e. how it's done etc..), specifically that I'd like to write a script to perform the task for the site Hype Machine.
I'm actually a Software Engineering Undergraduate (4th year) however we don't really cover any web programming so my understanding of Javascript/RESTFul API/All things Web are pretty limited as we're mainly focused around theory and client side applications.
Any help or directions greatly appreciated.

The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. Looks like there is an RSS feed of latest songs. If that's what you're looking for, it would be good to start there.
You can use a scripting language to download the feed and parse it. I use python, but you could pick a different scripting language if you like. Here's some docs on how you might download a url in python and parse XML in python.
Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.

You may want to check the following books:
"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204
"HTTP Programming Recipes for C# Bots"
http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677
"HTTP Programming Recipes for Java Bots"
http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669

I believe that the most important thing you must analyze is which kind of information do you want to extract. If you want to extract entire websites like google does probably your best option is to analyze tools like nutch from Apache.org or flaptor solution http://ww.hounder.org If you need to extract particular areas on unstructured data documents - websites, docs, pdf - probably you can extend nutch plugins to fit particular needs. nutch.apache.org
On the other hand if you need to extract particular text or clipping areas of a website where you set rules using DOM of the page probably what you need to check is more related to tools like mozenda.com. with those tools you will be able to set up extraction rules in order to scrap particular information on a website. You must take into consideration that any change on a webpage will give you an error on your robot.
Finally, If you are planning to develop a website using information sources you could purchase information from companies such as spinn3r.com were they sell particular niches of information ready to be consume. You will be able to save lots of money on infrastructure.
hope it helps!.
sebastian.

Python has the feedparser module, located at feedparser.org that actually handles RSS in its various flavours and ATOM in its various flavours. No reason to reinvent the wheel.

Related

SCORM, building without an authoring tool?

I've a couple training courses that I sometimes provide to clients. They are both a few text pages + 20-25 videos + a link out to take an exam in my LMS.
My preference has always been to provide embed links to the videos, as it allows us to easily push out updates. Then the client embeds that in their own LMS / training package (however they want). But two clients are requesting the courses are delivered in a SCORM package to be delivered on their LMS.
I'm familiar with the authoring tools like Captivate and Storyline Articulate. I'm not a huge fan as they feel like canned powerpoints. I'm also not sure that's what the client wants.
Two questions:
(1) My understanding is I can package a SCORM file manually. How does that content present itself when put into an LMS? Would it present slide-by-slide in a single panel (similar to how I see storyline work) or is it distributed based on how the LMS is set-up?
(2) Would doing it manually be advantageous in any way?
Couple things to watch out for -
Simply zipping your content folder with a imsmanifest.xml maybe not put the files in the root of the zip like the LMS may possibly expect. Instead you end up with folder/imsmanifest.xml vs /imsmanifest.xml. Something to watch out for, but some LMS's do not care.
You can manually edit the imsmanifest.xml, but I'd work to validate it. Performing this task in editors that do not provide feedback can lead to validation issues.
Content Packaging is part of it, but the LMS may have its own subset of additional options like how to launch the content (popup, new window/tab, or in a IFRAME). They may have additional scoring/attempts and other settings outside of what the SCORM CAM provides.
Aside from familiarizing yourself with the IMS Manifest format and the above cautionary items you can do do this without to many hiccups.
There is a packager on https://cybercussion.com if you'd like to demo that after you have prepared your Shareable Content Object.
GL

Import.io - Can it replace Kimonolabs

I use Kimonolabs right now for scraping data from websites that have the same goal. To make it easy, lets say these websites are online shops selling stuff online (actually they are job websites with online application possibilities, but technically it looks a lot like a webshop).
This works great. For each website an scraper-API is created that goes trough the available advanced search page to crawl all product-url's. Let's call this API the 'URL list'. Then a 'product-API' is created for the product-detail-page that scrapes all necessary elements. E.g. the title, product text and specs like the brand, category, etc. The product API is set to crawl daily using all the URL's gathered in the 'URL list'.
Then the gathered information for all product's is fetched using Kimonolabs JSON endpoint using our own service.
However, Kimonolabs will quit its service end of february 2016 :-(. So, I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product-URL's from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy proces as Kimonolabs. Only, its unclear to me if paginating the URL's necesarry for the product-API and automatically keeping it up to date are supported.
Any import.io users here that can give advice if import.io is a usefull alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. Its a free web scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS you can write CasperJS Endpoints and integrate any logic that you need to extract your data. It has a similar goal as Kimonolabs and can solve the same problems (if not more since its programmable).
If Extracty does not solve your needs you can checkout these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that much fond of Import.io, but seems to me it allows pagination through bulk input urls. Read here.
So far not much progress in getting the whole website thru API:
Chain more than one API/Dataset It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example if I want data that is found within category pages or paginated lists. I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor.Once set up once, I would like to be able to do this in one click more automatically.
P.S. If you are somehow familiar with JS you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing this for myself after migrating from kimonolabs...You can enable this for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab select "Bulk Extract" and add your URLs after this the scheduler will run daily or weekly.

Is there a scraper application like KimonoLabs?

I have used scrapy and beautiful soup many times, however find kimonolabs solution much easier and faster. The only problem is that sometimes jobs do need a bit of tweaking, which is not possible (e.g., crawling using a unique pattern).
Is there any other solution which combines the ease with optional complexity? Mainly I want to define a page scraping template using a WYSIWYG interface, and then programatically write the crawler.
Use an Import.io extractor.
Download the Import.io browser
Create an extractor (what you call a "scraping template")
From your code use the extractor's REST API
Full disclosure: I'm one of the founders of ParseHub.
ParseHub tries to solve exactly this problem. It gives you a gui and powerful tools for defining templates visually, and falls back to a subset of javascript if you need more fine-grained control. All of the programming primitives that you're familiar with (if, for, break, recursion, etc.) are available.
You can find it at www.parsehub.com
Try Agenty
Agenty has exact same feature to scrape websites, and the Chrome extension to setup the scraping agents. You can just install the extension and create agents to scrape any site.
FYI : We also have plan to launch hosted solution and REST API by April, 2016 (Update - API is available now)
You may see more details on website (www.datascraping.co) now Agenty.com
Disclosure : I'm one of the founding member

Categorized Document Management System

At the company I work for, we have an intranet that provides employees with access to a wide variety of documents. These documents fall into several categories and subcategories, and each of these categories have their own web page. Below is one such page (each of the links shown will link to a similar view for that category):
http://img16.imageshack.us/img16/9800/dmss.jpg
We currently store each document as a file on the web server and hand-code links to these documents whenever we need to add a new document. This is tedious and error-prone, and it also means we lack any sort of security for accessing these documents. I began looking into document management systems (like KnowledgeTree and OpenKM), however, none of these systems seem to provide a categorized view like in the preview above.
My question is ... does anyone know of any Document Management System that allow for the type of flexibility we currently have with hand-coding links to our documents into various webpages (major and minor , while also providing security, ease of use, and (less important) version control? Or do you think I'd be better off developing such a system from scratch?
If you are trying to categorize the files or folders in the document management system, That's not a difficult task. You only need to access to admin panel to maintain the folders or categorize the folders
In Laserfiche, You can easily categorize your folders regarding the departments and can also be subcategorized them
You should look into Alfresco. It's extremely extensible and provides a lot of ways of accessing the repository.
Note: click the "Developers" tab for the community edition.
My question is ... does anyone know of
any Document Management System that
allow for the type of flexibility we
currently have with hand-coding links
to our documents into various webpages
(major and minor , while also
providing security, ease of use, and
(less important) version control?
Or do you think I'd be better off developing such a system from scratch?
Well there are companies that make a living selling doc management software. Anything you can get off the shelf is going to be a huge time saver, and its going to be better than anything you could reasonably develop by hand.
I've used a few systems:
Sharepoint: although I hear some people don't like it, I didn't either ;)
HyperOffice worked really well for my company of around 150 employees and has all the features you describe.
Current company uses Confluence, I like it :) But its probably one of those tools whose pricetag isn't worth it, especially if you're only using a subset of its features like doc management.
I haven't used it, but one guy I know raves about Alfresco, a free and open source doc management system. I looked at its website, seems simple enough to use.
We also faced a similar problem. However version control was more on our priority and we did look into many solutions in and around. We found Globodox extremely easy to install and use and more important the support team was absolutely fantastic
Try Mayan EDMS, it's Django based, and open source, used it as a base and build the custom features you wish on top of it.
Code location: https://gitlab.com/mayan-edms/mayan-edms
Homepage at: http://www.mayan-edms.com
The project is also available via PyPI at: https://pypi.python.org/pypi/mayan-edms/

Is Wiki Content Portable?

I'm thinking of starting a wiki, probably on a low cost LAMP hosting account. I'd like the option of exporting my content later in case I want to run it on IIS/ASP.NET down the line. I know in the weblog world, there's an open standard called BlogML which will let you export your blog content to an XML based format on one site and import it into another. Is there something similar with wikis?
The correct answer is ... "it depends".
It depends on which wiki you're using or planning to use. I've used various over the years MoinMoin was ok, used files rather than database, Ubuntu seem to like it. MediaWiki, everyone knows about and JAMWiki is a java clone(ish) of MediaWiki with the aim to be markup compatible with MediaWiki, both use databases and you can generally connect whichever database you want, JAMWiki is pre-configured to use an internal HSQLDB instance.
I recently converted about 80 pages from a MoinMoin wiki into JAMWiki pages and this was probably 90% handled by a tiny perl script I found somewhere (I'll provide a link if I can find it again). The other 10% was unfortunately a by-hand experience (they were of the utmost importance with them being recipies for the missus) ;-)
I also recently setup a Mediawiki instance for work and that took all of about 8 minutes to do. So that'd be my choice.
To answer your question I don't believe that there's such a standard as WikiML as Till called it.
As strange as it sounds, I've investigated screen scraping a wiki for a co-worker to help him port it to another wiki engine. It turned out that screen scraping would have been easier, quicker and more efficient to write to move this particular file based wiki to another one or a CMS.
Given the context that you wrote the question in I would bite the bullet now and pay the little extra for a windows hosted account and put Screwturn wiki on it. You're got the option of using file based or SQL Server based back end for it but because one of your requirements is low cost I'm guessing that you would use file based now for a cheaper hosted account and then you can always upscale the back end to SQL Server.
I haven't heard of WikiML.
I think your biggest obstacle is gonna be converting one wiki markup to another. For example, some wikis use markdown (which is what Stack Overflow uses), others use another markup syntax (e.g. BBCode, ...), etc.. The bottom line is - assuming the contents are databased it's not impossible to export and parse it to make it "fit" in another system. It might just be a pain in the ass.
And if the contents are not databased, it's gonna be a royal pain in the ass. :D
Another solution would be to stay with the same system. I am not sure what the reason is for changing the technology later on. It's not like a growing project requires IIS/ASP.NET all of the sudden. (It might just be the other way around.) But for example, if you could stick with PHP for a while, you could also run that on IIS.

Resources