I am tinkering with Freebase and trying things out in the query editor, and everything looks great. I'm still reading the fine manual, but I can't figure out whether this can be used as a web search replacement for showing refined data to the user. The main question is:
If {q1, q2, q3, ...} is the query the user submits, how do I programmatically map each query term to a Freebase query key:value pair?
I am not sure it can entirely replace current search engines as of now. I have just written a blog post on handling a basic query (more will follow) in C# for Freebase.
http://2guysfrommumbai.wordpress.com/
If you like the Java stuff, you can go to
https://github.com/narphorium/freebase-java-api
This is a more complete API, and I have used it with good success.
You can append multiple queries using q1, q2, q3 as parameters; more details are available on the Freebase developer site.
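To make the mapping from a user term to an MQL key:value pair a bit more concrete, here is a rough, untested C# sketch. The endpoint, envelope, and property names are only illustrative; check the Freebase developer docs for the exact mqlread URL and the q1/q2/q3 batch syntax before relying on them.

```csharp
using System;
using System.Net;

class FreebaseQuerySketch
{
    static void Main()
    {
        // A user term such as "The Dark Knight" becomes the value of an MQL
        // key:value pair ("name": ...); properties set to null are the ones
        // we want Freebase to fill in for us.
        string userTerm = "The Dark Knight";
        string mql = "[{\"type\":\"/film/film\",\"name\":\"" + userTerm
                     + "\",\"directed_by\":null}]";

        // Illustrative mqlread URL only; see the Freebase developer site for
        // the exact endpoint, API key handling, and multi-query parameters.
        string url = "https://www.googleapis.com/freebase/v1/mqlread?query="
                     + Uri.EscapeDataString(mql);

        using (var client = new WebClient())
        {
            Console.WriteLine(client.DownloadString(url)); // raw JSON result
        }
    }
}
```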
Hope this helps.
I am setting up a database of certain events that have occurred in the past, and I need to search the internet for a number of terms to retrieve as many pages as possible that contain terms related to the happenings I want to document.
First I looked into achieving this with Google's Custom Search API, after reading this question:
Need to access Google Custom search api through R
I did manage to get a JSON of search results through the browser, but not through R, so I moved on.
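For reference, the request that returned JSON in the browser had roughly the shape below (sketched in C# purely to show the endpoint and parameters; the key and cx values are placeholders):

```csharp
using System;
using System.Net;

class CustomSearchSketch
{
    static void Main()
    {
        // Placeholder credentials: a real API key and custom search engine
        // id (cx) are required.
        string apiKey = "YOUR_API_KEY";
        string cx = "YOUR_SEARCH_ENGINE_ID";
        string query = Uri.EscapeDataString("terms related to my events");

        string url = "https://www.googleapis.com/customsearch/v1"
                     + "?key=" + apiKey + "&cx=" + cx + "&q=" + query;

        using (var client = new WebClient())
        {
            // The response is a JSON document whose "items" array holds the
            // individual search results.
            Console.WriteLine(client.DownloadString(url));
        }
    }
}
```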
When I saw that the Custom Search API was using OpenSearch, and found the rOpenSearch package for R, I wanted to try going down this path:
http://terradue.github.io/rOpenSearch/
After reading through the documentation, I found that it only provides examples of searching sites that publish OpenSearch descriptions. Since I need to search as many websites as possible, it seems I would need an OpenSearch description for a search engine like Google, but I can't find one anywhere.
Is there any way to search the internet via R using OpenSearch and collect the results in a data table?
If you know of a better solution to my problem, I'd appreciate it if you could point me in another direction.
If I read you correctly, you are looking for something called web scraping via R.
I currently use Kimonolabs for scraping data from websites with the same goal. To keep it simple, let's say these websites are online shops selling stuff (actually they are job websites with online application options, but technically they look a lot like a webshop).
This works great. For each website, a scraper API is created that goes through the available advanced search pages to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page, which scrapes all the necessary elements, e.g. the title, the product text, and specs like brand, category, etc. The product API is set to crawl daily using all the URLs gathered by the 'URL list'.
The gathered information for all products is then fetched through the Kimonolabs JSON endpoint by our own service.
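To picture the workflow, the sketch below does roughly what the two Kimonolabs APIs do: collect product URLs from the paginated search pages, then visit each detail page. The URLs and the regex are hypothetical stand-ins; a real scraper would use a proper HTML parser and would be scheduled daily by a cron job or task scheduler.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class ScraperSketch
{
    static void Main()
    {
        var client = new WebClient();
        var productUrls = new List<string>();

        // Step 1: the 'URL list' -- walk the paginated advanced search pages
        // and collect all product detail links.
        for (int page = 1; page <= 5; page++)
        {
            string html = client.DownloadString(
                "http://shop.example.com/search?page=" + page);

            foreach (Match m in Regex.Matches(html, "href=\"(/product/[^\"]+)\""))
            {
                productUrls.Add("http://shop.example.com" + m.Groups[1].Value);
            }
        }

        // Step 2: the 'product API' -- fetch each detail page and pull out
        // the fields of interest (only the title is shown here).
        foreach (string url in productUrls)
        {
            string html = client.DownloadString(url);
            Match title = Regex.Match(html, "<title>([^<]*)</title>");
            Console.WriteLine(url + " -> " + title.Groups[1].Value);
        }
    }
}
```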
However, Kimonolabs will shut down its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product URLs from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data through the same easy process as Kimonolabs. It's just unclear to me whether paginating through the URLs needed for the product API, and automatically keeping everything up to date, are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. It's a free web scraping tool that allows you to create endpoints that extract any information and return it as JSON. It can easily handle paginated searches.
If you know a bit of JS, you can write CasperJS endpoints and integrate any logic you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not meet your needs, you can check out these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not particularly fond of import.io, but it seems to me that it allows pagination through bulk input URLs. Read here.
So far there has not been much progress in getting a whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with the Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor. Once set up, I would like to be able to do this more automatically, in one click.
P.S. If you are somewhat familiar with JS, you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing it myself after migrating from Kimonolabs... You can enable it for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this, the scheduler will run daily or weekly.
I have a task to optimize the search engine of an ASP.NET e-commerce store based on the nopCommerce template.
I would like to hear what I should pay most attention to in order to improve the search engine and deliver results faster, since the current search engine takes forever to display results.
Full Text Search is one of the options to be implemented too.
Thanks in advance, Laziale
Make sure that all search queries are executed in the database
Make sure that all the search fields have the proper indexes
Return as little information as needed (probably via stored procedures; see the sketch after this list)
Look at your search queries; perhaps they can be rewritten and optimized
Profile your .NET code, find the places where it is slow, and optimize them
Cache your results or even the SQL query results
For full-text search, look at Lucene.NET
Skip EF and write your own data layer, at least for the purpose of search optimization
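To make the stored-procedure and caching points a bit more concrete, here is a minimal sketch; the procedure name, parameter, and cache policy are hypothetical and only illustrate the idea of returning the minimum data and reusing it.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Runtime.Caching;

class ProductSearch
{
    static readonly MemoryCache Cache = MemoryCache.Default;

    // Calls a hypothetical stored procedure that returns only the columns
    // the results page actually needs, and caches the result for 10 minutes.
    public static DataTable Search(string connectionString, string term)
    {
        var cached = Cache.Get("search:" + term) as DataTable;
        if (cached != null)
            return cached;

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.SearchProducts", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.Add("@Term", SqlDbType.NVarChar, 200).Value = term;

            var results = new DataTable();
            new SqlDataAdapter(cmd).Fill(results); // opens/closes the connection
            Cache.Set("search:" + term, results, DateTimeOffset.Now.AddMinutes(10));
            return results;
        }
    }
}
```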
I think the best way is to read this document provided by Google, which tells you the most important tweaks to pay attention to. I used it myself and found it very rewarding indeed:
http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en//webmasters/docs/search-engine-optimization-starter-guide.pdf
I'm crawling Wikipedia using a website downloader for Windows. I was looking through all the options in this tool to find a way to download Wikipedia pages for a specific period, for example from 2005 until now.
Does anyone have any idea about crawling the website for a specific period of time?
Why not download the SQL database containing all of Wikipedia?
You can then query it using SQL.
Give the Wikipedia API and your programming skills a try.
There should be no need to do web scraping; use the MediaWiki API to directly request the information you want. I'm not sure what you mean by "wikipedia pages for a specific period" - do you mean last edited at a certain time? If so, while skimming, I noticed an API call that lets you get a look at the last n revisions; just ask for the last revision and see what its date is.
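For example, a request along these lines (sketched in C#; the page title is just an example) returns the timestamp of the most recent revision of a page, which you can compare against your cut-off date:

```csharp
using System;
using System.Net;

class WikiLastRevision
{
    static void Main()
    {
        // Ask the MediaWiki API for the newest revision of a page and print
        // the raw JSON, which contains that revision's timestamp.
        string title = Uri.EscapeDataString("Albert Einstein");
        string url = "https://en.wikipedia.org/w/api.php"
                     + "?action=query&prop=revisions&rvprop=timestamp&rvlimit=1"
                     + "&format=json&titles=" + title;

        using (var client = new WebClient())
        {
            // Wikipedia asks API clients to send a descriptive User-Agent.
            client.Headers.Add("User-Agent", "example-crawler/0.1");
            Console.WriteLine(client.DownloadString(url));
        }
    }
}
```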
It depends on whether the website in question offers an archive, and most don't, so it's not possible in a straightforward way to crawl a sample starting from a specific date. But you can implement some intelligence in your crawler to read the page creation date or something like that.
But you can also look at Wikipedia API at http://en.wikipedia.org/w/api.php
I have been looking for an autosuggest search script and I have finally found one that I like, the only problem is that I cannot find a way to get it to run off our database results.
Is there any way to customize this script so that it runs from our own database, and not off the freebase pre-defined data types?
http://www.freebase.com/docs/suggest
Have you tried overriding service_url and service_path? There are also corresponding params for the flyout service. This is documented in the docs that you pointed to.
As masouras says, you can override service_url and service_path, but that's not particularly helpful unless you have another service which provides the same APIs as Freebase.
Dae Park recently posted a recipe to the Freebase mailing list which might help - however, I'm not aware of anyone who's actually managed to get Suggest working with anything other than Freebase.