Finding similar items using Microsoft Cognitive Services

Which Microsoft Cognitive Services (or Azure Machine Learning) service is the best, and the least work to use, for solving the problem of finding similar articles given an article? An article is a string of text, and assume I do not have user-interaction data about the articles.
Is there anything in Microsoft Cognitive Services that can solve this problem out of the box? It seems I cannot use the Recommendations API, since I don't have interaction/user data.
Anthony

I am not sure the Text Analytics API is a good fit for this scenario, at least not yet.
There are really two types of similarity:
1. Surface similarity (lexical) – similarity based on the presence of the same words/characters
If you are looking for surface similarity, try fuzzy matching/lookup (SQL Server Integration Services provides a component for this) or approximate string-similarity functions (Jaro-Winkler distance, Levenshtein distance, etc.). This is the easier option, as it does not require you to create a custom machine learning model.
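For illustration, here is a minimal Python sketch of surface similarity using only the standard library; the SSIS fuzzy lookup component or a dedicated package such as jellyfish would be production alternatives:

```python
# Minimal lexical-similarity sketch: classic Levenshtein edit distance plus
# difflib's ratio, both from the Python standard library.
from difflib import SequenceMatcher

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

print(levenshtein("nuclear", "nucular"))                     # 2
print(SequenceMatcher(None, "nuclear", "nucular").ratio())   # similarity ratio in [0, 1]
```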
2. Semantic similarity – Similarity by meaning of words
If you are looking for semantic similarity, then you need to go for semantic clustering, word embeddings, DSSM (Deep Semantic Similarity Model), etc.
This is harder to do, as it would require you to train your own machine learning model on an annotated corpus.
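As a rough illustration of the embedding approach, the sketch below averages per-word vectors and compares documents by cosine similarity; the tiny embeddings dictionary is a made-up placeholder, and in practice you would load pretrained vectors (e.g. word2vec or GloVe) or train a model such as DSSM:

```python
# Semantic-similarity sketch: average word vectors per document, then compare
# documents with cosine similarity. The embeddings below are purely illustrative.
import numpy as np

embeddings = {                      # placeholder vectors, not real embeddings
    "atomic":  np.array([0.9, 0.1, 0.3]),
    "nuclear": np.array([0.8, 0.2, 0.3]),
    "energy":  np.array([0.1, 0.9, 0.5]),
    "power":   np.array([0.2, 0.8, 0.6]),
}

def doc_vector(text):
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(3)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

print(cosine(doc_vector("atomic energy"), doc_vector("nuclear power")))  # close to 1.0
```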
Luis Cabrera | Text Analytics Program Manager | Cloud AI Platform, Microsoft

Yes, you can use the Text Analytics API.
Examples are available here: https://www.microsoft.com/cognitive-services/en-us/text-analytics-api

I would suggest you use the Text Analytics API [1], as @Narasimha suggested. You would put your strings through the Topic Detection API and then come up with a metric (say, Similarity = count(matching topics) - count(non-matching topics)) that could order each string against the others for similarity. This would just require one API call and a little JSON parsing.
[1] https://www.microsoft.com/cognitive-services/en-us/text-analytics-api
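For illustration only, here is a small Python sketch of that metric, assuming you have already called the Topic Detection API and parsed the returned JSON into one set of topic strings per article (the topic sets below are made up):

```python
# Rank articles by the suggested metric: matching topics minus non-matching topics.
def topic_similarity(topics_a, topics_b):
    matching = len(topics_a & topics_b)
    non_matching = len(topics_a ^ topics_b)      # symmetric difference
    return matching - non_matching

article_topics = {                               # placeholder parsed API output
    "article1": {"cloud", "machine learning", "azure"},
    "article2": {"cloud", "azure", "pricing"},
    "article3": {"football", "transfers"},
}

query = article_topics["article1"]
ranked = sorted(article_topics,
                key=lambda name: topic_similarity(query, article_topics[name]),
                reverse=True)
print(ranked)    # article1 first, then article2, then article3
```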

Sentence similarity or semantic textual similarity is a measure of how similar two pieces of text are, or to what degree they express the same meaning.
This Microsoft GitHub repo for NLP provides some samples which can be used from an Azure VM or Azure ML: https://github.com/microsoft/nlp/tree/master/examples/sentence_similarity
This folder contains examples and best practices, written in Jupyter notebooks, for building sentence similarity models. The gensen and pretrained embeddings utility scripts are used to speed up the model building process in the notebooks.
The sentence similarity scores can be used in a wide variety of applications, such as search/retrieval, nearest-neighbor or kernel-based classification methods, recommendations, and ranking tasks.
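As a rough, self-contained illustration of sentence-level similarity, the sketch below uses the separate sentence-transformers library with one commonly used pretrained model name, rather than the notebooks in the repo above:

```python
# Sentence-similarity sketch: encode sentences into vectors and compare them
# with cosine similarity. Model choice and sentences are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Stock prices fell sharply today.",
]
vectors = model.encode(sentences)            # one embedding per sentence

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors[0], vectors[1]))        # high: same meaning
print(cosine(vectors[0], vectors[2]))        # low: unrelated topics
```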

Related

Cross data matching algorithm (separate datasets) in R or any machine learning platform

I have two datasets, one with details of contracts and the other with details of organizations. For example, one dataset has the fields Company name, Description, and Company type; the other dataset has Contract name, Contract description, and CPV code.
I want an algorithm that can: 1) given a company, find the top 10 contracts that are most closely related to, or potentially interesting to, this company; 2) given a contract, find the companies most likely to bid for or win that contract.
This might be a one-off, real-time algorithm that matches one row of the first dataset to a best-matching cluster in the second dataset.
Is it possible to do this type of row-by-row cross-matching between two different datasets? Is it possible to use text descriptions for this kind of matching?
It would be of great help if someone has code examples. Thank you.
I am also attaching example datasets here.
Company data
Contract data
Your question is effectively "Will someone do ~10K worth of data science for me for free?" What you are looking for is a recommender system, and more specifically a content-based filtering system. In order for these to work, you are going to have to look at your two datasets and develop features that can quantitatively describe the contracts and the clients. If you have information about previous contracts the organizations were interested in, you can use a hybrid algorithm that incorporates aspects of collaborative filtering.
R has a package, recommenderlab, that can help you work on these types of problems. I haven't used it, but skimming over it, it seems solid. If you want something a little more plug-and-play, though with fewer options, I would recommend checking out Azure ML. It uses GUI interfaces to help guide users through the data science process, including a recommender tutorial. You may also be able to use some of their text classifier tutorial to help engineer features from your fields containing free-form text.
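As a rough illustration of the content-based idea (in Python rather than recommenderlab or Azure ML), the sketch below vectorises the free-text descriptions with TF-IDF and ranks contracts for a company by cosine similarity; the rows and descriptions are made up:

```python
# Content-based matching sketch: TF-IDF over descriptions, cosine similarity
# between one company and all contracts, highest score first.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

companies = ["We build offshore wind turbines and grid connections."]
contracts = [
    "Supply and installation of wind turbine blades.",
    "Catering services for a government office.",
    "High-voltage grid connection maintenance contract.",
]

vectoriser = TfidfVectorizer(stop_words="english")
matrix = vectoriser.fit_transform(companies + contracts)
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

for i in scores.argsort()[::-1]:          # contract indices, best match first
    print(f"{scores[i]:.2f}  {contracts[i]}")
```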
Best of luck.

Graph database vertex/edge inference from text (i.e. an informal graph 'schema') using Natural Language Processing (NLP) - does this exist?

Caveat emptor: I'm neither a linguist nor a graph theorist; however, I am a [Java] developer wishing to use a graph database for persistence, and the following topic is of interest to me, and I hope to others.
OK, the idea is to have some application or code to:
recognise the embedded relationship structures between named entities within a given piece of text
apply or expose these discovered relationships to usage within a Graph database structure.
In such a system, the text might essentially form a basic, layman-written graph schema of sorts. To better visualise this, here is some [very] basic text:
Andrew is married to Jane
Using the online CLAWS parts-of-speech tagger (POS), I'm given the following:
Andrew_NP0 is_VBZ married_AJ0 to_SENT Jane_NP0
According to 'The BNC Basic (C5) Tagset' from Oxford University, NP0 = 'Proper noun', which is a name (as you know), but these NP0-tagged entries would lend themselves to becoming graph vertex instances/nodes (the end user could be further prompted to give these entries an encompassing 'type/description'). The verb(s), VBZ, and adjective(s), AJ0, might highlight graph relationships.
Once the end user has confirmed their graph representation, they might export it to GraphML, for re-import into a graph database such as Titan or Neo4j.
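To make the pipeline concrete, here is a rough sketch in Python (for brevity, even though I work in Java) using NLTK, whose tagger uses Penn Treebank tags such as NNP rather than the CLAWS C5 tags above, and networkx for the GraphML export; it only handles this one trivial sentence pattern:

```python
# Toy pipeline: POS-tag a sentence, turn proper nouns into vertices and the
# words between them into an edge label, then export the graph as GraphML.
import networkx as nx
import nltk

# First run only: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
text = "Andrew is married to Jane"
tagged = nltk.pos_tag(nltk.word_tokenize(text))   # e.g. [('Andrew', 'NNP'), ('is', 'VBZ'), ...]

noun_positions = [i for i, (_, tag) in enumerate(tagged) if tag == "NNP"]
graph = nx.DiGraph()
if len(noun_positions) == 2:
    first, second = noun_positions
    relation = " ".join(word for word, _ in tagged[first + 1:second])
    graph.add_edge(tagged[first][0], tagged[second][0], label=relation)

nx.write_graphml(graph, "schema.graphml")         # for re-import into a graph database
```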
So, the overall idea is to have a tool that allows a layman end user the ability to create Graph-theory-based database structures, using everyday language.
Does such a tool exist already?
Some of my observations above were influenced, in some way, by the following tools (amongst others):
http://www.plantuml.com <- UML diagrams defined using a simple and intuitive language
http://www.planttext.com <- See plantuml
http://www.acqualia.com/soulver <- An NLP-based calculator and currency exchange tool, using natural sentence phrases
http://nlp.stanford.edu/software/tagger.shtml <- Stanford Log-linear Part-Of-Speech Tagger
Yes, this exists in many different places. Examples include OpenCalais (which was created by Reuters) and AlchemyAPI. There are a bunch of other toolkits and APIs, like NLTK and IBM's UIMA, that don't present you with a finished solution but rather a set of tools necessary to build a bespoke solution.
This is a very deep area, subject to ongoing research. I can't cover all of it here, but one thing to keep in mind is that solutions in this space are often highly specific to a certain "corpus" of documents. Software which does any arbitrary English text well doesn't really exist. Instead what you see is solutions that do it really well for business press releases. Or intelligence reports. Or newspaper articles. Or medical alerts. But not any, arbitrary text.
The area is also rife with problems; one of the big ones is known as "named entity recognition":
Andrew is married to Jane. Andrew bought eggs yesterday.
How many people are being discussed here? Is the second Andrew the same as the first? That's a very complicated and contextual question. But you better get it right, otherwise you might have more or fewer "person" nodes in your resulting graph than you expect.

How to perform semantic similarity in documents

I am doing a project in which I need to rank text documents against a search query, like a search engine, but I need to rank documents by the semantic similarity of words or sentences. I am unsure how to start finding semantic similarity using Java. Is there any link or paper through which I can start finding the semantic similarity of words in documents, or any idea?
The standard way to represent documents in term-space is to treat the terms as mutually orthogonal, or independent of each other; e.g. the terms "atomic" and "nuclear", although synonymous and hence interchangeable, are treated as distinct, whereas the semantic similarity between this pair of words should be fairly high.
Thus, for implementing a semantic similarity based score, you need to know the relation between a pair of words, for which you can use either of the following.
An external resource such as WordNet, or a semantic similarity library such as DISCO (see the sketch after this list).
A corpus analysis methodology such as Latent Semantic Analysis (LSA) which reduces the dimensionality of the term space by combining semantically similar terms such as "atomic" and "nuclear".
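A minimal sketch of the first option, in Python rather than Java, using NLTK's WordNet interface to score word-pair relatedness; the word pairs are illustrative, and nltk.download("wordnet") is needed on first use:

```python
# WordNet-based word relatedness: take the best path similarity over all
# synset pairs for the two words (None means no common hypernym path).
from nltk.corpus import wordnet as wn

def max_path_similarity(word_a, word_b):
    scores = [
        s1.path_similarity(s2)
        for s1 in wn.synsets(word_a)
        for s2 in wn.synsets(word_b)
    ]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

print(max_path_similarity("car", "automobile"))   # 1.0 (shared synset)
print(max_path_similarity("car", "banana"))       # much lower
```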
Have a look at this demo for semantic similarity.
It shows demos of different algorithms; you can see which one works for you and go with it. The "semilar" module can also be used from Java, I think. You can try using it; I haven't tried it yet, but the demo on that page covers the same thing. Thanks :)

How to get a handle on all this middleware?

My organization has recently been wrestling with the question of whether we should incorporate different middleware products/concepts into our applications. Products we are looking at include Pegasystems, Oracle BPM/BPEL, BizTalk, Fair Isaac Blaze, etc.
But I'm having a hard time getting a handle on all this. Before I go forward with evaluating the usefulness (positive or negative) of these different products I'm trying to get an understanding of all the different concepts in this space. I'm overwhelmed with an alphabet soup of BPM, ESB, SOA, CEP, WF, BRE, ERP, etc. Some products seem to cover one or more of those aspects, others focus on doing one. The terms all seem very ambiguous and conflated with each other.
Is there a good resource out there to get a handle on all these different middleware concepts / patterns? A book? A website? An article that sums it up well? Bonus points if there is a resource that maps the various popular products into which pattern(s) they address.
Thanks,
~ Justin
I've spent the last 3-4 years blogging on the topics you mentioned (http://www.UdiDahan.com) as well as writing my own lightweight ESB (http://www.NServiceBus.com) and many more years working and consulting in this space. The main conclusion that I've come to is that strong business analysis and technologically-agnostic architecture is needed - no tool or technology can prevent a mess by itself.
There is the Enterprise Integration Patterns book, which provides a good catalog of the technical patterns involved but doesn't touch on the necessary business analysis. I've found that Value Networks (http://en.wikipedia.org/wiki/Value_network_analysis) can be used as a good start for identifying business boundaries to which IT boundaries can then be aligned, resulting in the benefits of SOA; the use of an ESB across those boundaries is then justified.
CEP, WF, and BRE should be used within a boundary and not across them.
ERP packages tend to cross boundaries and, as such, should be integrated piecemeal into the boundaries mentioned - DDD anti-corruption layers can be used to insulate custom logic from those apps.
Hope that helps.
IBM and Oracle have SOA certifications. Since they're the leaders in the marketplace (Gartner Magic Quadrant), I would read about how they define SOA and ESBs (along with methodology and the components needed to support SOA like Governance, Registry, etc etc). It'll give you the high level overview that you're looking for and the use cases "all this middleware" is trying to solve.

How to get started on Information Extraction?

Could you recommend a training path to start with and become very good at Information Extraction? I started reading about it for one of my hobby projects and soon realized that I would have to be good at math (algebra, stats, probability). I have read some introductory books on different math topics (and it's so much fun). Looking for some guidance. Please help.
Update: Just to answer one of the comment. I am more interested in Text Information Extraction.
Depending on the nature of your project, natural language processing and computational linguistics can both come in handy: they provide tools to measure and extract features from textual information, and to apply training, scoring, or classification.
Good introductory books include O'Reilly's Programming Collective Intelligence (the chapters on searching and ranking, document filtering, and maybe decision trees).
Suggested projects utilizing this knowledge: POS (part-of-speech) tagging and named entity recognition (the ability to recognize names, places, and dates in plain text). You can use Wikipedia as a training corpus, since most of the target information is already extracted in infoboxes; this might provide you with a limited amount of measurement feedback.
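As a rough illustration of those two starter projects, here is a minimal NLTK sketch of POS tagging followed by named entity recognition; it uses a made-up sentence rather than Wikipedia text, and the NLTK models must be downloaded on first use:

```python
# POS tagging + named entity recognition with NLTK's built-in chunker.
import nltk

# First run only:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

sentence = "Barack Obama visited Paris in 2015."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)        # part-of-speech tags
tree = nltk.ne_chunk(tagged)         # named entity chunks

for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "GPE", "ORGANIZATION"):
        print(subtree.label(), " ".join(word for word, _ in subtree.leaves()))
```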
The other big hammer in IE is search, a field not to be underestimated. Again, O'Reilly's book provides some introduction to basic ranking; once you have a large corpus of indexed text, you can do some real IE tasks with it. Check out Peter Norvig: Theorizing from Data as a starting point, and a very good motivator; maybe you could reimplement some of their results as a learning exercise.
As a forewarning, I think I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage point for IE tasks usually grows exponentially, in both development and research time. It's also quite underdocumented: most of the high-quality info is currently in obscure white papers (Google Scholar is your friend); do check them out once you've got your hand burned a couple of times. But most importantly, do not let these obstacles throw you off; there are certainly big opportunities to make progress in this area.
I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).
I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.
You don't need to be good at math to do IE. Just understand how the algorithm works, experiment on the cases for which you need optimal performance and the scale at which you need to achieve your target accuracy level, and work with that. You are basically working with algorithms, programming, and aspects of CS/AI/machine learning theory, not writing a PhD thesis on a new machine-learning algorithm where you have to convince someone, by way of mathematical principles, why the algorithm works, so I totally disagree with that notion. There is a difference between practice and theory; as we all know, mathematicians are stuck more on theory than on the practicality of algorithms for producing workable business solutions. You would, however, need to do some background reading, both books on NLP and journal papers, to find out what people have found in their results. IE is a very context-specific domain, so you would need to define first in what context you are trying to extract information. How would you define this information? What is your structured model? Suppose you are extracting from semi-structured and unstructured data sets. You would then also want to weigh whether you want to approach your IE with a standard hand-crafted approach, which involves things like regular expressions and pattern matching, or with statistical machine-learning approaches like Markov chains. You can even look at hybrid approaches.
A standard process model you can follow to do your extraction is to adapt a data/text mining approach:
pre-processing - define and standardize your data for extraction from various or specific sources, cleansing the data
segmentation/classification/clustering/association - the black box where most of your extraction work will be done
post-processing - cleanse the extracted data into the form in which you want to store it or represent it as information
Also, you need to understand the difference between data and information, as you can reuse your discovered information as a source of data to build further information maps/trees/graphs. It is all very contextual.
These are the standard steps: input -> process -> output (a minimal pipeline is sketched below).
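For illustration, here is a minimal Python sketch of that input -> process -> output pipeline using the simple pattern-matching approach mentioned earlier; the patterns and output structure are made up for the example:

```python
# Tiny pre-process -> extract -> post-process pipeline based on regular expressions.
import re

def preprocess(raw):
    """Standardize the input: collapse whitespace."""
    return re.sub(r"\s+", " ", raw).strip()

def extract(text):
    """The 'black box': here, just regexes for ISO dates and e-mail addresses."""
    return {
        "dates": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
        "emails": re.findall(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", text),
    }

def postprocess(record):
    """Cleanse the extracted fields before storing or serving them."""
    return {field: sorted(set(values)) for field, values in record.items()}

raw = "Contact  jane.doe@example.com  before 2015-06-01 or 2015-06-01."
print(postprocess(extract(preprocess(raw))))
# {'dates': ['2015-06-01'], 'emails': ['jane.doe@example.com']}
```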
If you are using Java/C++, there are loads of frameworks and libraries available that you can work with.
Perl would be an excellent language to do your NLP extraction work with if you want to do a lot of standard text extraction.
You may want to represent your data as XML or even as RDF graphs (Semantic Web), and for your defined contextual model you can build up relationship and association graphs that will most likely change as you make more and more extraction requests. Deploy it as a RESTful service if you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted search, say using Solr.
Good sources to read are:
Handbook of Computational Linguistics and Natural Language Processing
Foundations of Statistical Natural Language Processing
Information Extraction Applications in Prospect
An Introduction to Language Processing with Perl and Prolog
Speech and Language Processing (Jurafsky)
Text Mining Application Programming
The Text Mining Handbook
Taming Text
Algorithms of Intelligent Web
Building Search Applications
IEEE Journal
Make sure you do a thorough evaluation before deploying such applications/algorithms into production, as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering and Mahout for large-scale classification, amongst others. Store your datasets in MongoDB, or put unstructured dumps into Jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can base your training on, such as the Reuters corpus, TIPSTER, TREC, etc. You can even check out AlchemyAPI, GATE, UIMA, OpenNLP, etc.
Extracting from plain text is easier than from, say, a web document, so representation at the pre-processing step becomes even more crucial for defining exactly what you are trying to extract from a standardized document representation.
Standard evaluation measures include precision, recall, and the F1 measure, amongst others.
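As a quick worked example of those measures, computed from illustrative counts of extracted versus gold-standard entities:

```python
# Precision, recall, and F1 from raw true-positive / false-positive / false-negative counts.
def precision_recall_f1(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. the extractor returned 50 entities, 40 of them correct, and missed 20.
p, r, f = precision_recall_f1(true_positives=40, false_positives=10, false_negatives=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")   # precision=0.80 recall=0.67 f1=0.73
```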
I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math and PCI gives you a false sense of confidence. For example, when it talks of SVM, it just says that libSVM is a good way of implementing them.
Now, libSVM is definitely a good package, but who cares about packages? What you need to know is why SVMs give the terrific results that they give and how they are fundamentally different from the Bayesian way of thinking (and how Vapnik is a legend).
IMHO, there is no one solution to it. You should have a good grip on linear algebra, probability, and Bayesian theory. Bayes, I should add, is as important for this as oxygen is for human beings (that's a little exaggerated, but you get what I mean, right?). Also, get a good grip on machine learning. Just using other people's work is perfectly fine, but the moment you want to know why something was done the way it was, you will have to know something about ML.
Check these two for that:
http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html
http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
Okay, now that's three of them :)
The Wikipedia Information Extraction article is a quick introduction.
At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.
Take a look here if you need an enterprise-grade NER service. Developing an NER system (and training sets) is a very time-consuming and highly skilled task.
This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.
