Grakn: how can I construct a knowledge graph from a collection of texts? - vaticle-typedb

I have several documents (PDF and TXT) in my notebook and I want to construct a knowledge graph from them using Grakn.
Through Google I found the blog post, but there is no documentation or readme explaining how to do that.
The blog also says "The script to mine text can be found on our GitHub repo here", but I am failing to understand what I have to do.
Can someone here advise me on how to construct a knowledge graph from text using Grakn?

Grakn is a knowledge engine/network that represents knowledge through well-defined entities and relations (ontologies), so you need NLP (Natural Language Processing) to make human language accessible to the graph network. You also need OCR (Optical Character Recognition) to convert text embedded in images into plain text, and you should teach the network basic ontologies so it can understand the texts. You are essentially heading toward the Singularity era.

To give an example of how to go from a collection of text to a knowledge graph, let us assume that all of your text is concerned with a certain domain of knowledge - in the example of the blog post you mention, we are dealing with biomedical research publications.
A first step could be to find entities, or defined "things", in the text. To stick with the biomedical example, we could look for drugs and genes mentioned in the publications. This is called named-entity-recognition (NER), a technique applied in text-mining.
If a certain drug is often mentioned in the same publication as a particular gene, they "co-occur" and are likely related in some way. This would be an example of a relationship. The automated extraction of exactly how they are related is a difficult problem and is called relationship-extraction (RE).
Solutions for both NER and RE are usually domain-specific (ranging from simple matching of dictionary terms to AI models).
If you are interested in text-mining, a good place to start in python is NLTK.
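As a minimal sketch of that first step, assuming NLTK's bundled tokenizer, tagger, and chunker (the example sentence is made up, and NLTK's generic chunker would have to be swapped for a domain-specific model or dictionary to actually catch drugs and genes):

    import nltk

    # one-time downloads for the tokenizer, POS tagger, and NE chunker
    for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
        nltk.download(pkg, quiet=True)

    def extract_entities(text):
        """Return the named entities NLTK's chunker finds in a piece of text."""
        entities = []
        for sentence in nltk.sent_tokenize(text):
            tokens = nltk.word_tokenize(sentence)
            tagged = nltk.pos_tag(tokens)
            tree = nltk.ne_chunk(tagged)
            for node in tree:
                # chunked entities are subtrees labelled PERSON, ORGANIZATION, GPE, ...
                if hasattr(node, "label"):
                    entities.append(" ".join(word for word, tag in node.leaves()))
        return entities

    print(extract_entities("Aspirin inhibits COX-1 according to Smith et al."))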
The idea of a knowledge graph is to put defined things, called entities, in defined relationships to one another to create context. After you have a list of entities that you have found in all your documents, as well as their relationships (as in the example above, co-occurrence in a document or even in a single sentence), you can define a schema, upload the entities and relationships into Grakn, and use all of its functionality to analyze your data.
For a tutorial on how to use Grakn with already extracted data, see here
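To make that last step concrete, here is a rough sketch of what loading extracted entities and co-occurrence relationships could look like. The schema, keyspace, and example values are invented for illustration, and the client calls follow the older grakn-client 1.x Python style, so expect the exact API and Graql syntax to differ between Grakn/TypeDB versions:

    # Rough sketch only: connection URI, keyspace, and schema are invented,
    # and the client API shown is the older grakn-client 1.x style.
    from grakn.client import GraknClient

    SCHEMA = """
    define
      name sub attribute, value string;
      drug sub entity, has name, plays participant;
      gene sub entity, has name, plays participant;
      co-occurrence sub relation, relates participant;
    """

    with GraknClient(uri="localhost:48555") as client:
        with client.session(keyspace="biomed") as session:
            # load the schema once
            with session.transaction().write() as tx:
                tx.query(SCHEMA)
                tx.commit()
            # insert two entities and a co-occurrence relation between them
            with session.transaction().write() as tx:
                tx.query('insert $d isa drug, has name "aspirin";')
                tx.query('insert $g isa gene, has name "COX-1";')
                tx.query(
                    'match $d isa drug, has name "aspirin"; '
                    '$g isa gene, has name "COX-1"; '
                    'insert (participant: $d, participant: $g) isa co-occurrence;'
                )
                tx.commit()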

Related

Documentation tool for Ada software

Brief: I'm looking for some kind of tool to produce a software description from the comments in existing software source code.
In more detail: I've got existing source code written in Ada. Changes need to be made to this source code and I also need to generate a document containing a description of the software as a whole and all of its packages, routines etc. (if possible as PDF). For the existing routines these source code comments already exist and contain sufficient detail for my needs.
The description shall include at least
overall software design
textual description of packages, routines, variables, constants etc.
call and caller graphs
For projects based on C I'd do this using Doxygen. Doxygen itself, however, does not cope with software written in Ada. My thought was to (automatically) convert existing comments in the source code so that Doxygen can read them. The conversion itself was not a problem (using Doxygen's filter mechanism), but as keywords and syntax differ a lot between C and Ada, this did not produce any usable output.
I then had a look at Understand from SciTools. While it analyses the software in good detail and generates nice metrics, I was not able to get anything out of it that resembles a document with what I need.
I want to avoid (manually) writing a separate document, and instead would like to generate it from the code base. I will have to put all the necessary information (perhaps with the exception of a general overview) there anyhow, so why not use it for documentation purposes as well.
Is there any tool that is able to do what I need?
There's a tool called "AdaDoc", which seems to do a part of what you're asking for. You can of course use "a2ps" for the textual part of your needs (I like that better than what AdaDoc generates).
There are several UML tools ("Umbrello" is one name I remember), which offer to create graphs of inter-package relations, but for a seriously sized project, the best option is to use the original design documents, and simply verify that the source text actually matches that design.
For languages not supported by Doxygen, I've written my own "general purpose" filter.
It's very basic, but useful for me.
https://github.com/malkev/doxphp
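To sketch the mechanism (this is not the doxphp code): Doxygen's INPUT_FILTER setting runs a program on each source file, passing the file path as an argument and reading the transformed text from stdout. A minimal, purely illustrative Python filter that rewrites Ada '--' comments could look like the following; a usable Ada filter would also have to emit declarations in a form Doxygen can parse:

    #!/usr/bin/env python3
    # Hooked up in the Doxyfile with:  INPUT_FILTER = "python3 ada_filter.py"
    # Doxygen passes the source file path as the only argument and reads the
    # filtered text from stdout. The rewrite rule below is illustrative only.
    import sys

    def filter_file(path):
        with open(path, encoding="utf-8", errors="replace") as src:
            for line in src:
                stripped = line.lstrip()
                if stripped.startswith("--"):
                    # turn an Ada comment into a C++-style Doxygen comment
                    indent = line[: len(line) - len(stripped)]
                    sys.stdout.write(indent + "///" + stripped[2:])
                else:
                    sys.stdout.write(line)

    if __name__ == "__main__":
        filter_file(sys.argv[1])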

Graph database vertex/edge inference from a text (i.e. an informal Graph 'schema'), using Natural Language Processing (NLP) - does this exist?

Caveat Emptor - I'm neither a linguist nor a Graph theorist, however, I am a [Java] developer wishing to use a Graph database for persistence and the following topic is of interest to me, and I hope to others.
OK, the idea is to have some application or code to:
recognise the embedded relationship structures between named entities within a given piece of text
apply or expose these discovered relationships to usage within a Graph database structure.
In such a system, the text might essentially form a basic, layman-written graph schema of sorts. To better visualise this, here is some [very] basic text:
Andrew is married to Jane
Using the online CLAWS parts-of-speech tagger (POS), I'm given the following:
Andrew_NP0 is_VBZ married_AJ0 to_SENT Jane_NP0
According to 'The BNC Basic (C5) Tagset' (Oxford University), NP0 = 'Proper noun', which is a name (as you know), but these NP0-tagged entries would lend themselves to becoming graph vertex instances/nodes (the end user could be further prompted to give these entries an encompassing 'type/description'). The verbs (VBZ) and adjectives (AJ0) might highlight graph relationships.
Once the end user has confirmed their graph representation, they might export it to GraphML, for re-import into a graph database such as Titan or Neo4j.
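As a rough proof of concept of that pipeline, here is a sketch that substitutes NLTK's Penn Treebank tagger for CLAWS (so proper nouns come back as NNP rather than NP0) and uses the networkx library for the GraphML export; the edge-labelling rule is a naive invention of mine:

    import nltk
    import networkx as nx

    for pkg in ["punkt", "averaged_perceptron_tagger"]:
        nltk.download(pkg, quiet=True)

    text = "Andrew is married to Jane"
    tagged = nltk.pos_tag(nltk.word_tokenize(text))  # [('Andrew', 'NNP'), ('is', 'VBZ'), ...]

    # proper nouns become vertices; the words between two proper nouns
    # become the label of the edge that connects them
    graph = nx.Graph()
    last_noun, pending = None, []
    for word, tag in tagged:
        if tag.startswith("NNP"):            # Penn Treebank proper-noun tags
            graph.add_node(word)
            if last_noun is not None:
                graph.add_edge(last_noun, word, label=" ".join(pending))
            last_noun, pending = word, []
        elif last_noun is not None:
            pending.append(word)

    nx.write_graphml(graph, "schema.graphml")  # re-import into Titan, Neo4j, ...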
So, the overall idea is to have a tool that allows a layman end user the ability to create Graph-theory-based database structures, using everyday language.
Does such a tool exist already?
Some of my observations above were influenced, in some way, by the following tools (amongst others):
http://www.plantuml.com <- UML diagrams defined using a simple and intuitive language
http://www.planttext.com <- See plantuml
http://www.acqualia.com/soulver <- An NLP-based calculator and currency exchange tool, using natural sentence phrases
http://nlp.stanford.edu/software/tagger.shtml <- Stanford Log-linear Part-Of-Speech Tagger
Yes, this exists in many different places. Examples include OpenCalais (which was created by Reuters) and the AlchemyAPI. There are a bunch of other toolkits and APIs like NLTK and IBM's UIMA that don't present you with a finished solution, but a bunch of tools necessary to build a bespoke solution.
This is a very deep area, subject to ongoing research. I can't cover all of it here, but one thing to keep in mind is that solutions in this space are often highly specific to a certain "corpus" of documents. Software which does any arbitrary English text well doesn't really exist. Instead what you see is solutions that do it really well for business press releases. Or intelligence reports. Or newspaper articles. Or medical alerts. But not any, arbitrary text.
The area is also rife with problems; one of the big ones is known as "Named Entity Recognition", together with the closely related problem of deciding whether two mentions refer to the same entity:
Andrew is married to Jane. Andrew bought eggs yesterday.
How many people are being discussed here? Is the second Andrew the same as the first? That's a very complicated and contextual question. But you better get it right, otherwise you might have more or fewer "person" nodes in your resulting graph than you expect.

Graph Traversing algorithms in Semantic web

I am asking about algorithms that would be useful for querying a Semantic Web database to get all the RDF resources related to an original object.
For example, if the original object is the movie "Inception", I want an algorithm to build queries that fetch the RDF resources for the movie's cast, the studio, the country, etc., so that I can build a relationship graph.
The closest example is the answer to this question, especially this class. I want similar algorithms, or titles to search for, in order to produce such an algorithm. I am thinking maybe some modifications of graph traversal algorithms could work, but I'm not sure.
NOTE: My project is in ASP.NET, so it would help to use existing .NET libraries.
You should be able to do a simple breadth-first-search to get all the objects that are a certain distance away from a given node.
You'll need to know something about the schema because some neighboring nodes are more meaningful than others. For example, in Freebase, we have intermediate nodes that link a film to an actor and a role. You need to know to go 2-ply deep to get at the actor and the role because just saying that the film is related to the intermediate nodes is not very interesting.
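A sketch of that breadth-first expansion over an RDF graph; the data file, URI, and depth limit are made up, and it uses Python with rdflib purely for illustration (on .NET, dotNetRDF gives you the same triple-level access):

    from collections import deque
    from rdflib import Graph, URIRef

    g = Graph()
    g.parse("movies.ttl", format="turtle")  # hypothetical local RDF dump

    def neighbours(graph, node):
        # follow triples in both directions so subjects and objects are reached
        for _, _, obj in graph.triples((node, None, None)):
            yield obj
        for subj, _, _ in graph.triples((None, None, node)):
            yield subj

    def related(graph, start, max_depth=2):
        """All resources within max_depth hops of the start node."""
        seen = {start}
        frontier = deque([(start, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_depth:
                continue
            for nxt in neighbours(graph, node):
                if isinstance(nxt, URIRef) and nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        return seen

    inception = URIRef("http://example.org/film/Inception")  # hypothetical URI
    print(related(g, inception))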
Did you take a look at "property paths"?
Property Paths give a more succinct way to write parts of basic graph
patterns and also extend matching of triple pattern to arbitrary
length paths. Property paths do not invalidate or change any existing
SPARQL query.
Triple stores and SPARQL engines such as OWLIM and AllegroGraph support them.
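For illustration, a property path collapses the "go 2-ply deep" traversal into a single triple pattern. The ex: vocabulary below is invented, and rdflib is only used here as a convenient local SPARQL engine; the same query shape works on OWLIM, AllegroGraph, or any other SPARQL 1.1 endpoint:

    from rdflib import Graph

    g = Graph()
    g.parse("movies.ttl", format="turtle")  # hypothetical local RDF dump

    # ex:castMember/ex:actor/ex:name is a property path: it follows the chain of
    # predicates without naming the intermediate "casting" node in the query
    QUERY = """
    PREFIX ex: <http://example.org/vocab/>
    SELECT ?actorName WHERE {
      ?film ex:title "Inception" .
      ?film ex:castMember/ex:actor/ex:name ?actorName .
    }
    """

    for row in g.query(QUERY):
        print(row.actorName)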

Data Visualisation

UPDATE: I had posted this on UI.stackexchange also, for views on different kinds of visualisation. I am posting it here to find out the programming techniques and tools required to do so.
Let us have the following three sets of information
Now I want to combine all of this data and show it all together. Telling it like a story. Giving inter-relations. Showing similarities in terms, concepts etc. to get the following (Note that in the diagram below, the colored relations may not be exact, they are merely indicative of a node of information)
Situation: I need to tell somebody the relation between two or more important things through the commonness of concepts, keywords, behaviours in those things.
One way that I figured out would be to use circles for concepts.
So that all concepts connected to thing A would be connected to it and all concept related to B would be connected to it. And the common concepts would be connected to both. That way 2 things can be easily compared.
Problem: To build such a graph/visualisation manually would be cumbersome. Especially to add, arrange, update and manipulate.
Question: Is there a good way to do it. Also, Is there a tool available for doing this?
I hope this make the question much more clear. :)
Where does this data (the concepts, keywords, and relations between them etc...) come from? If it's in a database somewhere you could write some code to generate a Graphviz file and then open it in a Graphviz visualizer. There might be some tools out there that allow interactive editing of a Graphviz graph; it looks like WebDot may, and there are probably others.
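For example, if the concepts and their relations were sitting in a database or a spreadsheet, a few lines of Python could emit a Graphviz DOT file for a viewer to render; the sample data below is invented:

    # concept -> things it belongs to, e.g. pulled from a database query
    concepts = {
        "Cause for concern": ["Penguin", "Koala"],
        "Linked to movie": ["Koala", "Tulips"],
        "Some kind of animal": ["Penguin", "Koala"],
    }

    with open("concepts.dot", "w") as out:
        out.write("graph concepts {\n")
        out.write("  node [shape=ellipse];\n")
        for concept, things in concepts.items():
            for thing in things:
                out.write(f'  "{thing}" -- "{concept}";\n')
        out.write("}\n")

    # render with:  dot -Tpng concepts.dot -o concepts.png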
See also: How to display hierarchical data in a user interface.
You're talking about Venn diagrams. I think there should be plenty of online and offline tools that can help making these.
Graphviz has been mentioned already, although that would be used more to show the flow of a system, or a treeview.
When you're talking about software development and want to display a design through diagrams, a complete diagram solution already exists in UML, and there are plenty of UML tools that can help here. A commercial option is Altova UModel, which has some very nice features. You could probably use use cases as the most logical diagram type.
Also see Wikipedia for more info about use case diagrams. Reconsidering the image you've added, I do tend to consider it a use case. Since UML models can be serialized as XML (XMI), it should be possible to transform your data through a stylesheet to UML, then use any UML tool to display the diagrams. To convert your data to XML, well... if it's in Excel then exporting it to XML should not be too difficult.
Why is your sample image a use case? Well, you have actors (Penguin, Koala, Tulips) and you have actions (well, kind of actions: cause for concern, some kind of animal, linked to movie, bites your nose off...). And finally, there are associations between the actors and the actions connecting them all in some way. Thus Data--(export)->XML--(Stylesheet)->UML--(UML tool)->Diagram.
D3: Data-Driven Documents JS library

How to get started on Information Extraction?

Could you recommend a training path to start and become very good at Information Extraction? I started reading about it for one of my hobby projects and soon realized that I would have to be good at math (algebra, stats, probability). I have read some of the introductory books on different math topics (and it's so much fun). Looking for some guidance. Please help.
Update: Just to answer one of the comments: I am more interested in Text Information Extraction.
Depending on the nature of your project, natural language processing and computational linguistics can both come in handy - they provide tools to measure and extract features from the textual information and to apply training, scoring, or classification.
Good introductory books include O'Reilly's Programming Collective Intelligence (the chapters on "Searching and Ranking", document filtering, and maybe decision trees).
Suggested projects utilizing this knowledge: POS (part-of-speech) tagging and named entity recognition (the ability to recognize names, places, and dates in plain text). You can use Wikipedia as a training corpus since most of the target information is already extracted in infoboxes - this might provide you with a limited amount of measurement feedback.
The other big hammer in IE is search, a field not to be underestimated. Again, O'Reilly's book provides some introduction to basic ranking; once you have a large corpus of indexed text, you can do some really interesting IE tasks with it. Check out Peter Norvig: Theorizing from Data as a starting point, and a very good motivator - maybe you could reimplement some of their results as a learning exercise.
As a forewarning, I think I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage point usually grows exponentially - in development and research time. It's also quite underdocumented - most of the high-quality info is currently in obscure white papers (Google Scholar is your friend) - so do check them out once you've got your hands burned a couple of times. But most importantly, do not let these obstacles throw you off - there are certainly big opportunities to make progress in this area.
I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).
I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.
You don't need to be good at math to do IE. Just understand how the algorithm works, experiment on the cases for which you need optimal performance and the scale at which you need to hit your target accuracy level, and work with that. You are basically working with algorithms, programming, and aspects of CS/AI/machine-learning theory, not writing a PhD paper on building a new machine-learning algorithm where you have to convince someone by way of mathematical principles why the algorithm works, so I totally disagree with that notion. There is a difference between practice and theory - as we all know, mathematicians are stuck more on theory than on the practicability of algorithms to produce workable business solutions.
You would, however, need to do some background reading, both books on NLP and journal papers, to find out what people found in their results. IE is a very context-specific domain, so you would need to define first in what context you are trying to extract information: How would you define this information? What is your structured model? Suppose you are extracting from semi-structured and unstructured data sets. You would then also want to weigh whether you want to approach your IE with a standard hand-crafted approach, which involves things like regular expressions and pattern matching, or with statistical machine-learning approaches like Markov chains. You can even look at hybrid approaches.
A standard process model you can follow to do your extraction is to adapt a data/text mining approach:
pre-processing - define and standardize your data for extraction from various or specific sources, cleansing the data as you go
segmentation/classification/clustering/association - your black box where most of your extraction work will be done
post-processing - cleansing your data back to where you want to store it or represent it as information
Also, you need to understand the difference between data and information, since you can reuse your discovered information as a source of data to build further information maps/trees/graphs. It is all very contextual.
These are the standard steps: input -> process -> output.
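As a toy illustration of that input -> process -> output shape (the regex patterns and sample text are invented; a statistical model or dictionary matcher would replace the middle step in a real system):

    import re

    def preprocess(raw):
        # normalise whitespace before extraction
        return re.sub(r"\s+", " ", raw).strip()

    def extract(text):
        # the "black box": here just regex matchers for dates and emails
        return {
            "dates": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
            "emails": re.findall(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", text),
        }

    def postprocess(records):
        # deduplicate and shape the output for storage
        return {kind: sorted(set(values)) for kind, values in records.items()}

    raw = "Contact bob@example.com  or alice@example.com before 2011-03-01."
    print(postprocess(extract(preprocess(raw))))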
If you are using Java/C++ there are loads of frameworks and libraries available you can work with.
Perl would be an excellent language to do your NLP extraction work with if you want to do a lot of standard text extraction.
You may want to represent your data as XML or even as RDF graphs (Semantic Web), and for your defined contextual model you can build up relationship and association graphs that will most likely change as you make more and more extraction requests. Deploy it as a RESTful service if you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted search, say using Solr.
Good sources to read are:
Handbook of Computational Linguistics and Natural Language Processing
Foundations of Statistical Natural Language Processing
Information Extraction Applications in Prospect
An Introduction to Language Processing with Perl and Prolog
Speech and Language Processing (Jurafsky)
Text Mining Application Programming
The Text Mining Handbook
Taming Text
Algorithms of the Intelligent Web
Building Search Applications
IEEE Journal
Make sure you do a thorough evaluation before deploying such applications/algorithms into production, as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering and Mahout for large-scale classification, amongst others. Store your datasets in MongoDB, or put unstructured dumps into Jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can base your training on, say the Reuters corpus, TIPSTER, TREC, etc. You can even check out AlchemyAPI, GATE, UIMA, OpenNLP, etc.
Building extractions from plain text is easier than from, say, a web document, so representation at the pre-processing step becomes even more crucial for defining exactly what it is you are trying to extract from a standardized document representation.
Standard evaluation measures include precision, recall, and F1 measure, amongst others.
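For reference, those measures reduce to a few lines once you have counts of true positives, false positives, and false negatives (the counts in the example are made up):

    def precision_recall_f1(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # e.g. an extractor that found 40 correct entities, 10 spurious ones,
    # and missed 20 that were in the gold standard
    print(precision_recall_f1(tp=40, fp=10, fn=20))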
I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math and PCI gives you a false sense of confidence. For example, when it talks of SVM, it just says that libSVM is a good way of implementing them.
Now, libSVM is definitely a good package, but who cares about packages. What you need to know is why SVM gives the terrific results that it gives and how it is fundamentally different from the Bayesian way of thinking (and how Vapnik is a legend).
IMHO, there is no one solution to it. You should have a good grip on linear algebra, probability, and Bayesian theory. Bayes, I should add, is as important for this as oxygen is for human beings (it's a little exaggerated, but you get what I mean, right?). Also, get a good grip on machine learning. Just using other people's work is perfectly fine, but the moment you want to know why something was done the way it was, you will have to know something about ML.
Check these two for that :
http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html
http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
Okay, now that's three of them :)
The Wikipedia Information Extraction article is a quick introduction.
At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.
Take a look here if you need an enterprise-grade NER service. Developing a NER system (and its training sets) is a very time-consuming and highly skilled task.
This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.
