Nutch possibilities - web-scraping

I am new to Nutch and am using Nutch 1.9. Right now I am doing a POC on a sample site (shaadi.com). I have a few questions; can somebody help me out with them?
I can't access the URLs that require login authentication (form-based), even though I set up the configuration in httpclient-auth.xml, nutch-site.xml, and so on.
I know Nutch fetches only the whole content of a page. But is it possible to get only specific pieces of information, like first name, address, etc., from a page using Nutch? (I think this is more like scraping; it's what Python's Scrapy does.)
Thanks in advance.

You will need to use a plugin to extract the specific data and add it to the Nutch document while indexing.
This article describes a plugin that can be used to extract data:
www.atlantbh.com/precise-data-extraction-with-apache-nutch/
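Nutch plugins themselves are written in Java, and the article above describes an XPath-based extraction plugin. Purely as a conceptual illustration (and since you mention Scrapy), here is a hedged Python sketch of the same idea: evaluate XPath expressions against the fetched page and keep only named fields. The URL and selectors are hypothetical placeholders.

```python
# Conceptual sketch only: a Nutch indexing plugin would do this in Java, but
# the idea is the same -- evaluate XPath expressions against the fetched page
# and emit named fields instead of the whole document.
import requests
from lxml import html

def extract_profile_fields(url):
    """Fetch a page and pull out specific fields with XPath."""
    page = html.fromstring(requests.get(url, timeout=10).text)
    return {
        # Placeholder XPath expressions -- inspect the real page to find the
        # elements that actually hold the data you want.
        "first_name": page.xpath("string(//span[@class='first-name'])"),
        "address": page.xpath("string(//div[@class='address'])"),
    }

if __name__ == "__main__":
    print(extract_profile_fields("https://example.com/profile/123"))
```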

Related

MLS / IDX Data in the form of XML

I have an issue I have been struggling with for weeks, almost a month. I am working on a website for a real estate agent in Toronto, Ontario, and the last thing I have to do is get the listings onto her website. We are using the WordPress wp-residence theme. This theme is compatible with iHomefinder, but the data I get from iHomefinder does not work with the theme's features (maps, searches, and even styles). I read somewhere that the only way to use these theme features is to add the listings manually. I found this plugin that will import all the listings:
https://en-ca.wordpress.org/plugins/wp-residence-add-on-for-wp-all-import/
But now I need to get my listings in the form of XML, CSV, or XLS. I have all my login info and the URL to get my listings; however, the instructions say I need to connect via a RETS client, which I do not have. Is there anyone out there who can point me in the right direction?
You'll need to use an IDX plugin.
I think you should also look at https://mlsimport.com. This plugin should help you import data from the MLS, but it is paid.

Import.io - Can it replace Kimonolabs

I currently use Kimonolabs for scraping data from websites that share the same goal. To keep it simple, let's say these websites are online shops selling stuff (actually they are job sites with online application options, but technically they look a lot like webshops).
This works great. For each website a scraper API is created that goes through the available advanced search pages to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all the necessary elements, e.g. the title, product text, and specs like brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
Then the gathered information for all products is fetched through the Kimonolabs JSON endpoint by our own service.
However, Kimonolabs will shut down its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc.)?
Does it support fetching all product URLs from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. It's just unclear to me whether paginating through the URLs needed for the product API and automatically keeping it up to date are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
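Under the hood Portia runs on Scrapy, so the setup you describe maps onto a small spider: crawl the paginated search pages to build the URL list, then follow each product URL and scrape the fields. A rough, hedged sketch (the start URL and CSS selectors are placeholders you would adapt to the real sites):

```python
# A hand-rolled Scrapy equivalent of the "URL list" + "product API" setup:
# follow pagination links on the search page, then scrape each detail page.
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/search?page=1"]  # placeholder

    def parse(self, response):
        # Collect all product detail URLs on this results page.
        for href in response.css("a.product-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)
        # Follow the pagination link, if there is one.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "brand": response.css(".brand::text").get(),
            "category": response.css(".category::text").get(),
        }
```

You could run such a spider with `scrapy runspider products.py -o products.json` and schedule it (cron, or a hosted service) to get the daily updates.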
Maybe you want to give Extracty a try. It's a free web scraping tool that allows you to create endpoints that extract any information and return it as JSON. It can easily handle paginated searches.
If you know a bit of JS, you can write CasperJS endpoints and integrate any logic that you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not meet your needs, you can check out these other market players that aim at similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of Import.io, but it seems to me it allows pagination through bulk input URLs. Read here.
So far there has not been much progress in getting a whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor. Once it is set up, I would like to be able to do this more automatically, in one click.
P.S. If you are somewhat familiar with JS, you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing it for myself after migrating from Kimonolabs... You can enable it for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this, the scheduler will run daily or weekly.

Search multiple plone site indexes

I need to implement a central search for multiple Plone sites on different servers/machines. A way to select which sites to search would be a plus, but it is not the primary concern. A few ways I came upon to go about this:
- Export the ZCatalog indexes to an XML file and periodically use a crawler to fetch all the XML files so a search can be done on them, but this way does not allow for live searching.
- There is a way to use a common catalog, but it's not optimal and cannot be implemented on the sites I am working on because of some requirements.
- I read somewhere that people have used Solr for this, but I need help on how to use it.
Ideally I want to use the existing ZCatalog indexes rather than create another index, which I think is what using Solr entails, because of the extra overhead and the extra index that has to be maintained. But I will use it if no other solution is possible. I am a beginner at search, so please give as much detail as possible.
You should really look into collective.solr:
https://pypi.python.org/pypi/collective.solr/4.1.0
Searching multiple sites is a complex use case and you most likely need a solution that scales. In the end it will require far less effort to go with Solr than to come up with your own solution. Solr is built for this kind of requirement.
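One practical point: collective.solr hooks into the existing portal_catalog, so once it is installed and activated, ordinary catalog queries are answered by Solr and your existing search code keeps working. A minimal, hedged sketch of such a query (this is just standard catalog usage, nothing Solr-specific):

```python
# Once collective.solr is installed and activated, ordinary portal_catalog
# queries are dispatched to Solr behind the scenes, so existing search code
# keeps working. Run this e.g. from a browser view or a site script.
from plone import api

def central_search(text):
    catalog = api.portal.get_tool("portal_catalog")
    # With collective.solr enabled, this SearchableText query is answered
    # by Solr rather than the local ZCatalog indexes.
    brains = catalog(SearchableText=text, sort_on="Date", sort_order="descending")
    return [(brain.Title, brain.getURL()) for brain in brains]
```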
As an alternative, you can also use collective.elasticindex, an extension to index Plone content into ElasticSearch, for this.
According to its documentation:
This doesn't replace the Plone catalog with ElasticSearch, nor interact with the Plone catalog at all; it merely indexes content inside ElasticSearch when it is modified or published.
In addition to this, it provides a simple search page called search.html that queries ElasticSearch using JavaScript (so Plone is not involved in searching) and offers the same features as the default Plone search page. A search portlet lets you redirect people to this new search page as well.
That can be an advantage over collective.solr.
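Because Plone is not involved in searching, the central search itself can be any piece of code that talks to ElasticSearch directly. A hedged sketch using the requests library; the host, index name, and field name are assumptions, so use whatever collective.elasticindex was actually configured with:

```python
# Query the ElasticSearch index that collective.elasticindex populates.
# Host, index name ("plone"), and field name ("title") are assumptions.
import requests

def search_es(text, host="http://localhost:9200", index="plone"):
    query = {
        "query": {"match": {"title": text}},  # adjust field to your mapping
        "size": 10,
    }
    resp = requests.post(f"{host}/{index}/_search", json=query, timeout=10)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
```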

Create own custom URL shortener

I am currently trying to code a simple ASP.NET URL shortener which allows me to customise the shortened URL. I am also not allowed to use open source, which means I cannot use any of the existing URL shortening services; I am required to develop one on my own.
But this is the first time I am doing this, so I have no idea how to start (excluding the UI).
I understand that similar questions have already been asked, but I've read through the posts and couldn't understand what they were about. I've also tried to Google for a solution, but that hasn't worked out.
I would really appreciate any help given to me.
P.S. I am fairly new to programming and not strong in any of the programming languages.
You would need:
A system to store pairs of shortened URLs and their full version.
A page which takes the shortened URL parameter (e.g. short.aspx?q=SHORTENED), looks it up in your data store, and redirects to the full URL.
Some interface to edit your data store, add new URLs, etc.
That should be it, really; a minimal sketch of the lookup-and-redirect step follows below. If this is too difficult, it might be smarter to start with a basic programming course first.
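To make the lookup-and-redirect idea concrete, here is a minimal sketch in Python. The logic is language-agnostic; in your ASP.NET project it would live in the short.aspx handler, and the dictionary would be a database table.

```python
# Minimal sketch of the lookup-and-redirect step of a URL shortener.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Data store: shortened code -> full URL. In a real app this is a DB table.
URLS = {"docs": "https://example.com/some/very/long/documentation/url"}

class Shortener(BaseHTTPRequestHandler):
    def do_GET(self):
        code = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        target = URLS.get(code)
        if target:
            self.send_response(302)           # redirect to the full URL
            self.send_header("Location", target)
        else:
            self.send_response(404)           # unknown shortened code
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Shortener).serve_forever()
```

Visiting http://localhost:8080/?q=docs would then redirect to the stored full URL.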

Crawling Wikipedia

I'm crawling Wikipedia using a website downloader for Windows. I was looking through all the options in this tool for an option to download Wikipedia pages from a specific period, for example from 2005 until now.
Does anyone have any idea about crawling the website for a specific period of time?
Why not download the SQL database containing all of Wikipedia?
You can then query it using SQL.
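If you go this route, the query itself is straightforward once the dump is imported. A hedged sketch, assuming the dump has been loaded into a local MySQL database with the standard MediaWiki schema (page and revision tables); double-check the column names against the dump you actually import:

```python
# Sketch: query an imported Wikipedia SQL dump for articles whose latest
# revision falls after 2005. Assumes the standard MediaWiki schema and a
# local MySQL database named "enwiki"; credentials are placeholders.
import pymysql

SQL = """
SELECT p.page_title, r.rev_timestamp
FROM page AS p
JOIN revision AS r ON r.rev_id = p.page_latest
WHERE p.page_namespace = 0                 -- article namespace only
  AND r.rev_timestamp >= '20050101000000'
LIMIT 100
"""

conn = pymysql.connect(host="localhost", user="wiki", password="wiki", database="enwiki")
with conn.cursor() as cur:
    cur.execute(SQL)
    for title, ts in cur.fetchall():
        print(title, ts)
conn.close()
```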
Give the Wikipedia API and your programming skills a try.
There should be no need to do web scraping; use the MediaWiki API to directly request the information you want. I'm not sure what you mean by "Wikipedia pages for a specific period" - do you mean pages last edited at a certain time? If so, while skimming, I noticed an API call that lets you look at the last n revisions; just ask for the last revision and see what its date is.
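For reference, that revisions call looks roughly like this in Python; the parameters follow the public MediaWiki action API (action=query, prop=revisions):

```python
# Ask for a page's most recent revision and look at its timestamp before
# deciding whether to download the page.
import requests

API = "https://en.wikipedia.org/w/api.php"

def last_edit_timestamp(title):
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp",
        "rvlimit": 1,
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=10).json()
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["timestamp"]  # e.g. "2016-01-15T09:30:00Z"

print(last_edit_timestamp("Apache Nutch"))
```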
It depends on whether the website in question offers an archive, and most don't, so it is not possible in a straightforward way to crawl a sample starting from a specific date. But you can implement some intelligence in your crawler to read the page's creation date or something like that.
You can also look at the Wikipedia API at http://en.wikipedia.org/w/api.php
