I want to create a shopping search engine that shows products from many websites, and I wonder how I can retrieve information about products from those sites.
I am not interested in the search engine part, but in extracting product information from web pages in an automated manner using auto-generated templates. Does anybody know of good algorithms for this, or papers to read?
Dapper looks pretty close to what you are looking for. http://open.dapper.net/
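If you want to see what such a template-based extractor boils down to, here is a minimal sketch: one hand-written (or machine-learned) template of CSS selectors per site, applied to each product page. The selectors, domain, and URL are hypothetical placeholders; in the research literature, learning such templates automatically is known as wrapper induction.

```python
# Minimal sketch of template-driven product extraction. The TEMPLATES
# entries below are hypothetical; a real system would generate them
# per site (by hand or via wrapper induction).
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

# One "template" per site: field name -> CSS selector (placeholder values).
TEMPLATES = {
    "exampleshop.com": {
        "title": "h1.product-title",
        "price": "span.price",
        "brand": "div.specs .brand",
    },
}

def extract_product(url: str) -> dict:
    """Fetch a product page and apply the matching site template."""
    host = urlparse(url).netloc.removeprefix("www.")
    template = TEMPLATES[host]
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    product = {}
    for field, selector in template.items():
        node = soup.select_one(selector)
        # Missing fields come back as None rather than raising.
        product[field] = node.get_text(strip=True) if node else None
    return product

print(extract_product("https://www.exampleshop.com/product/123"))
```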
One of my applications includes user-generated posts and functions in a similar way to Instagram. When a user opens the app they see a feed of posts sorted by date. This works when there is just one small demographic using the app, but as the user base becomes more diverse, not everyone is interested in the same posts. This is why apps like TikTok and Instagram have algorithms to decide which posts to show to a user.

Where do I even start with this? I understand that each post needs tags describing what it is about (this is where I think I can use machine learning), and that each user's information needs to include their interests (I'm not sure what can be used to update these as they like or dislike posts). Is there a simple pre-built way of doing this, or any examples? It seems to be a pretty big secret that mostly big tech companies understand and use.
You could use Google's Cloud Vision API (for images): https://cloud.google.com/vision and the Video Intelligence API (for videos): https://cloud.google.com/video-intelligence/docs.
The Video Intelligence API can handle images too, from a byte stream.
Build a Firebase function that analyses posted media with these APIs.
Build the rest of the logic from there: find a way to detect each user's interests from the posts they react to, and save those interests.
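A minimal sketch of those two steps, assuming the google-cloud-vision Python client and credentials are already set up. The interest model at the bottom is a deliberately naive illustration (tag weights nudged up on likes, down on dislikes), not a production recommender, and the in-memory store stands in for whatever database (e.g. Firebase) you actually use:

```python
from collections import defaultdict

from google.cloud import vision

client = vision.ImageAnnotatorClient()

def tag_image(image_bytes: bytes) -> list[str]:
    """Step 1: let Cloud Vision label the posted image."""
    response = client.label_detection(image=vision.Image(content=image_bytes))
    # Keep only reasonably confident labels as the post's tags.
    return [l.description for l in response.label_annotations if l.score > 0.7]

# Step 2: track per-user interests as tag weights (in-memory placeholder;
# in the question's setup this would live in Firebase instead).
user_interests: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))

def record_reaction(user_id: str, post_tags: list[str], liked: bool) -> None:
    """Nudge the user's weight for each tag up on a like, down on a dislike."""
    delta = 1.0 if liked else -0.5
    for tag in post_tags:
        user_interests[user_id][tag] += delta

def score_post(user_id: str, post_tags: list[str]) -> float:
    """Rank candidate posts by summed tag weights for this user."""
    return sum(user_interests[user_id][tag] for tag in post_tags)
```

Ranking a user's feed then becomes sorting candidate posts by score_post instead of by date.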
I am setting up a database of certain events that have occurred in the past, and need to search the internet for a number of terms to retrieve as many pages as possible that contain terms related to the happenings I want to document.
First I looked into achieving this using Google's Custom Search API, after reading this question:
Need to access Google Custom search api through R
I did manage to get a JSON of search results through the browser, but not through R, so I moved on.
When I saw that the Custom Search API was using OpenSearch, and found the rOpenSearch package for R, I wanted to try going down this path:
http://terradue.github.io/rOpenSearch/
After reading through the documentation, I found it only provides examples of searching sites that supply OpenSearch descriptions. As I need to search as many websites as possible, it seems I would need an OpenSearch description for a search engine like Google, but I can't find that anywhere.
Is there any way to search the internet via R using OpenSearch and collect the results in a data table?
If you know of a better solution to my problem, I'd appreciate it if you could point me in another direction.
If I read this correctly, you are looking for something called web scraping via R.
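Since you already got JSON results from the Custom Search API through the browser, driving that API directly and flattening the results may be simpler than scraping the result pages. Below is a minimal sketch, written in Python for illustration; in R, the httr and jsonlite packages play the same role. API_KEY and CX are placeholders you get from the Google developer console and your custom search engine setup:

```python
import csv

import requests

API_KEY = "your-api-key"   # placeholder
CX = "your-engine-id"      # placeholder

def search(term: str, start: int = 1) -> list[dict]:
    """Fetch one page (up to 10 results) of Custom Search results for a term."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": term, "start": start},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])  # absent when there are no results
    return [{"term": term, "title": i["title"], "link": i["link"],
             "snippet": i.get("snippet", "")} for i in items]

# Collect several result pages per term into one flat table.
rows = []
for term in ["first event", "second event"]:   # your search terms
    for start in (1, 11, 21):                  # first three result pages
        rows.extend(search(term, start))

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["term", "title", "link", "snippet"])
    writer.writeheader()
    writer.writerows(rows)
```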
I've started using the Bing Custom Search API, and many of the top search results I get are... surprisingly old and irrelevant.
The Custom Search interface allows you to rank slices of websites higher than others and to boost some results, but it remains URL-based and doesn't go into weighting of actual page contents or metadata such as date, keywords, author and so on.
Will "classic" SEO tips such as using one h1, optimizing page title/description/keywords, etc. help improve result relevance?
I guess it boils down to asking "does the Bing Custom Search API use the regular Bing search engine behind the scenes?", but if it's more complex than that, any answer to my main problem will do.
Bing Custom Search basically uses an indexing and ranking mechanism similar to the Bing search engine. The only difference is that Bing Custom Search restricts results to certain sites and/or lets you control the ranking of results. So, anything that helps improve page quality (and hence ranking in Web Search) will also help improve Bing Custom Search results.
This actually becomes more important as the candidate pool to select from is very small in Custom Search (or any such API) compared to the full-fledged Web Search API, which has billions of pages to select from.
The only caveat is that it takes time to improve page quality and hence ranking, so until then you may have to pin/block/boost results.
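For reference, here is a minimal sketch of querying the Bing Custom Search v7 endpoint; the subscription key and custom configuration ID are placeholders from the Azure portal and Custom Search portal. Inspecting dateLastCrawled on each result can help diagnose why you are getting surprisingly old pages:

```python
import requests

SUBSCRIPTION_KEY = "your-key"     # placeholder
CUSTOM_CONFIG_ID = "your-config"  # placeholder

resp = requests.get(
    "https://api.cognitive.microsoft.com/bingcustomsearch/v7.0/search",
    headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
    params={"q": "example query", "customconfig": CUSTOM_CONFIG_ID},
    timeout=10,
)
resp.raise_for_status()
# Each web result carries the crawl date alongside title, URL and snippet.
for page in resp.json().get("webPages", {}).get("value", []):
    print(page["name"], page["url"], page.get("dateLastCrawled"))
```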
I currently use Kimonolabs for scraping data from websites that share the same goal. To keep it simple, let's say these websites are online shops (actually they are job websites with online application options, but technically they look a lot like webshops).
This works great. For each website, a scraper API is created that goes through the available advanced search page to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all necessary elements, e.g. the title, product text, and specs like brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
The gathered information for all products is then fetched by our own service via Kimonolabs' JSON endpoint.
However, Kimonolabs will shut down its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product URLs from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. It's just unclear to me whether the pagination of URLs necessary for the product API, and automatically keeping it up to date, are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
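To make the comparison concrete: Portia is built on Scrapy, so the two-stage crawl you describe (URL list plus product API) corresponds roughly to the minimal Scrapy spider sketched below. The selectors and start URL are placeholders for the real site:

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/search?page=1"]  # placeholder

    def parse(self, response):
        # Stage 1 ("URL list"): collect product URLs from the search results.
        for href in response.css("a.product-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)
        # Follow the "next page" link until pagination runs out.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        # Stage 2 ("product API"): scrape the product detail page.
        yield {
            "title": response.css("h1::text").get(),
            "brand": response.css(".specs .brand::text").get(),
            "category": response.css(".specs .category::text").get(),
            "text": " ".join(response.css(".description ::text").getall()),
        }
```

Running such a spider on a schedule (the "automatic updates" requirement) is exactly what the hosted service's periodic jobs provide.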
Maybe you want to give Extracty a try. It's a free web scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS, you can write CasperJS endpoints and integrate any logic that you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not meet your needs, you can check out these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of import.io, but it seems to allow pagination through bulk input URLs. Read here.
So far there's not much progress in getting a whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example, if I want data found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor. Once set up, I would like to be able to do this more automatically, in one click.
P.S. If you are somewhat familiar with JS, you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing it for myself after migrating from Kimonolabs... You can enable this for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this, the scheduler will run daily or weekly.
I want to create an automated web-based book publication system.
My company publishes reports written in Word, and there is a lot of hassle in formatting the Word documents.
So we want to develop a system where different users can log into the system and then create different chapters and different parts within those chapters.
Can anyone suggest any open-source project or guidelines to achieve the things mentioned above?
Some sort of wiki seems like the way to go in my opinion; you could pretty easily set it up so users can only edit their own content. There are some example ASP.NET wikis and other projects that might interest you on this page.