I am currently planning to build a search engine website. I want my engine to search certain other sites (let's say 10 sites) and return results from them. One way to achieve that is to do it from scratch: build spiders that scan the sites, build a database of products, index them, and then add a simple search mechanism to return results. Is there an easier way to do this (maybe using plugins in a CMS like WordPress), free or paid? Which approach would be more efficient and faster to build?
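For a sense of scale, the from-scratch route described above can be sketched in a few lines. This toy version hard-codes page contents where a real spider would fetch and parse them (e.g. with requests plus an HTML parser), and the shop URLs are invented for illustration:

```python
import re
from collections import defaultdict

# Toy "from scratch" pipeline: index a few fetched pages into an
# inverted index, then answer keyword queries against it. The page
# contents are hard-coded placeholders for what a spider would fetch.
pages = {
    "https://shop-a.example/p/1": "red running shoes size 42",
    "https://shop-b.example/p/7": "blue trail running shoes",
    "https://shop-c.example/p/3": "leather office shoes brown",
}

def build_index(pages):
    """Map each lowercase token to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(url)
    return index

def search(index, query):
    """Return URLs containing every query token (AND semantics)."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    result = index.get(tokens[0], set()).copy()
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

index = build_index(pages)
print(sorted(search(index, "running shoes")))
```

Even this minimal version shows where the real work goes: fetching and parsing the 10 sites robustly is far more effort than the index and query steps.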
I am currently working on a job-portal kind of project in which we generate links (official government website links) related to jobs through a customized search engine. Is there any way to extract the data from these generated links?
I have tried web scraping, but the structure of each website is different, so I need a generic method to extract the data from these websites.
I've started using Bing Custom Search API and many of the top search results I get are... surprisingly old and irrelevant.
The Custom Search interface allows you to rank slices of websites higher than others and to boost some results, but it remains URL-based and doesn't go into weighting of actual page contents or metadata such as date, keywords, author and so on.
Will "classic" SEO tips such as using one h1, optimizing page title/description/keywords, etc. help improve result relevance?
I guess it boils down to asking "does Bing Custom Search API use the regular Bing search engine behind the scenes?", but if it is more complex than that, any answer to my main problem will do.
Bing Custom Search is basically an indexing and ranking mechanism similar to the Bing search engine. The only difference is that Bing Custom Search restricts results to certain sites and/or lets you control the ranking of results. So anything that helps improve page quality (and hence ranking in Web Search) will also help improve Bing Custom Search results.
This actually becomes more important because the candidate pool to select from is very small in Custom Search (or any such API) compared to the full-fledged Web Search API, which has billions of pages to choose from.
The only caveat is that it takes time to improve page quality and hence ranking, so until then you may have to pin/block/boost results.
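For context, a call to the Custom Search endpoint can be assembled as below. The endpoint and parameter names follow Microsoft's v7 REST API; the subscription key and custom configuration ID are placeholders:

```python
from urllib.parse import urlencode

# Bing Custom Search v7 endpoint (per Microsoft's documentation).
ENDPOINT = "https://api.cognitive.microsoft.com/bingcustomsearch/v7.0/search"

def build_request(query, custom_config_id, subscription_key):
    """Assemble the URL and headers for a Custom Search call."""
    params = urlencode({"q": query, "customconfig": custom_config_id})
    headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    return ENDPOINT + "?" + params, headers

url, headers = build_request("solar panels", "YOUR-CONFIG-ID", "YOUR-KEY")
print(url)
# A real call would then be e.g.: requests.get(url, headers=headers).json()
```

The ranking applied to the results of such a call is what the answer above describes: the same page-quality signals as regular Bing, filtered to your configured sites.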
I use Kimonolabs right now for scraping data from websites that have the same goal. To make it easy, let's say these websites are online shops selling stuff (actually they are job websites with online application possibilities, but technically they look a lot like webshops).
This works great. For each website a scraper API is created that goes through the available advanced-search page to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all necessary elements, e.g. the title, product text, and specs like the brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
The gathered information for all products is then fetched from the Kimonolabs JSON endpoint by our own service.
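The two-stage setup described above (a 'URL list' crawl over paginated search pages, then a product scrape per URL) can be sketched roughly like this. The fetch functions are injected stubs standing in for real HTTP requests, and all URLs are invented:

```python
# Stage 1: walk a paginated search, collecting product URLs.
def collect_product_urls(fetch, page=1):
    """Follow ?page=N until a page yields no product links."""
    urls = []
    while True:
        links = fetch(f"https://shop.example/search?page={page}")
        if not links:
            break
        urls.extend(links)
        page += 1
    return urls

# Stage 2: run the product extractor over every collected URL.
def scrape_products(fetch_detail, urls):
    return [fetch_detail(u) for u in urls]

# Fake fetchers standing in for HTTP + HTML parsing:
listing = {1: ["/p/1", "/p/2"], 2: ["/p/3"], 3: []}
fetch = lambda url: listing[int(url.rsplit("=", 1)[1])]
fetch_detail = lambda u: {"url": u, "title": f"Product {u[-1]}"}

urls = collect_product_urls(fetch)
products = scrape_products(fetch_detail, urls)
print(len(products))  # 3
```

Any replacement service needs to support both stages: following pagination to build the URL list, and re-running the detail scrape on a schedule.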
However, Kimonolabs will shut down its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc.)?
Does it support fetching all product URLs from a paginated advanced-search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. It's just unclear to me whether the pagination of URLs needed for the product API, and automatically keeping the data up to date, are supported.
Are there any import.io users here who can say whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
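As a rough illustration of the pagination-by-URL-pattern idea, the crawler only follows links matching a pattern you define. The pattern below is an invented example for a hypothetical shop, not actual Portia syntax:

```python
import re

# Only URLs matching this pattern are treated as pagination links;
# everything else on the page is ignored by the crawl.
PAGINATION = re.compile(r"^https://shop\.example/search\?page=\d+$")

urls = [
    "https://shop.example/search?page=2",
    "https://shop.example/about",
    "https://shop.example/search?page=10",
]
to_follow = [u for u in urls if PAGINATION.match(u)]
print(to_follow)
```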
Maybe you want to give Extracty a try. It's a free web-scraping tool that allows you to create endpoints that extract any information and return it as JSON. It can easily handle paginated searches.
If you know a bit of JS, you can write CasperJS endpoints and integrate any logic you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not meet your needs, you can check out these other market players that aim at similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of Import.io, but it seems to allow pagination through bulk input URLs. Read here.
So far there is not much progress in getting a whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with the Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an imported data set, and then chain it to another Extractor. Once set up, I would like to be able to do this more automatically, in one click.
P.S. If you are somehow familiar with JS you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing this for myself after migrating from Kimonolabs... You can enable this for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this the scheduler will run daily or weekly.
I need to implement a central search across multiple Plone sites on different servers/machines. A way to select which sites to search would be a plus, but is not the primary concern. A few approaches I have come across:
- Export the ZCatalog indexes to an XML file and use a crawler to periodically fetch all the XML files so a search can be run over them; but this way does not allow live searching.
- There is a way to use a common catalog, but it is not optimal and cannot be implemented on the sites I am working on because of some requirements.
- I read somewhere that others used Solr, but I need help on how to use it.
However, I need a way to use the existing ZCatalog indexes rather than create another index, which I think is what Solr requires, because of the extra overhead and the extra index that would have to be maintained. I will use it if no other solution is possible. I am a beginner at search, so please give as much detail as possible.
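As a rough sketch of the first option (periodic XML exports merged by a central service), assuming an invented export schema for illustration rather than any standard ZCatalog format:

```python
import xml.etree.ElementTree as ET

# Each site periodically exports its catalog entries as XML; a central
# service merges the files and searches over the merged records. The
# <records>/<record> schema here is an assumption for illustration.
site_a = """<records>
  <record url="https://site-a/doc1" title="Budget report 2015"/>
  <record url="https://site-a/doc2" title="Meeting minutes"/>
</records>"""
site_b = """<records>
  <record url="https://site-b/doc9" title="Annual budget overview"/>
</records>"""

def load_records(xml_text):
    """Parse one site's export into a list of attribute dicts."""
    return [r.attrib for r in ET.fromstring(xml_text).iter("record")]

def central_search(exports, term):
    """Case-insensitive title search over all merged site exports."""
    merged = [rec for xml in exports for rec in load_records(xml)]
    return [r["url"] for r in merged if term.lower() in r["title"].lower()]

print(central_search([site_a, site_b], "budget"))
```

As the question notes, results are only as fresh as the last export, which is exactly why this approach cannot provide live search.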
You should really look into collective.solr:
https://pypi.python.org/pypi/collective.solr/4.1.0
Searching multiple sites is a complex use case and you most likely need a solution that scales. In the end it will require far less effort to go with Solr than to come up with your own solution. Solr is built for exactly this kind of requirement.
As an alternative, you can also use collective.elasticindex, an extension that indexes Plone content into ElasticSearch.
According to its documentation:
This doesn't replace the Plone catalog with ElasticSearch, nor interact with the Plone catalog at all; it merely indexes content into ElasticSearch when it is modified or published.
In addition, it provides a simple search page called search.html that queries ElasticSearch using JavaScript (so Plone is not involved in searching) and offers the same features as the default Plone search page. A search portlet lets you redirect people to this new search page as well.
That can be an advantage over collective.solr.
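For a sense of what such a client-side search sends, here is a hedged sketch of a full-text ElasticSearch query body built in Python; the index name and field names are assumptions, not what collective.elasticindex actually configures:

```python
import json

def build_query(term, size=10):
    """Build a simple multi_match query over assumed title/text fields."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["title", "text"],
            }
        },
    }

body = json.dumps(build_query("budget report"))
print(body)
# A real search would POST this body to an ElasticSearch _search
# endpoint, e.g. http://localhost:9200/plone/_search (URL assumed).
```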
I want to create a shopping search engine that shows products from many websites, and I wonder how I can retrieve information about products from those sites.
I am not interested in the search engine part, but in extracting product information from web pages in an automated manner using auto-generated templates. Does anybody know good algorithms for this, or papers to read?
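As a toy illustration of the template-induction idea: given two pages generated from the same template, the parts shared by both are the template, and the parts that differ are the data slots to extract. Real systems operate on HTML tag structure rather than flat strings, but the intuition can be shown with a plain diff:

```python
import difflib

# Two pages assumed to come from the same (made-up) template.
page1 = "<h1>Red Shoes</h1><p>Price: $40</p>"
page2 = "<h1>Blue Hat</h1><p>Price: $15</p>"

def extract_slots(a, b):
    """Return, for each page, the substrings not in the shared template."""
    matcher = difflib.SequenceMatcher(None, a, b)
    slots_a, slots_b = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # anything not shared is candidate data
            slots_a.append(a[i1:i2])
            slots_b.append(b[j1:j2])
    return slots_a, slots_b

print(extract_slots(page1, page2))
```

With more sample pages per site, the shared template becomes more reliable and the slots can be labeled (title, price, etc.), which is roughly what the template-generation literature automates.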
Dapper looks pretty close to what you are looking for. http://open.dapper.net/