Weaviate Search Graph vs. GA of IBM Graph

Descriptions of the IBM Graph service only cover how it adds and stores properties as key/value pairs associated with the data, for both the vertices and the edges that connect them, rather than the more traditional form of storing the data in tabular form using rows and columns. But how does the GA of IBM Graph compare to a knowledge graph built with Weaviate Search Graph (GraphQL, RESTful, semantic search, semantic classification, knowledge representation)?

This answer is based on the assumption that you mean JanusGraph, because of this article.
JanusGraph is a Graph DBMS optimized for distributed clusters.
Weaviate is a GraphQL based semantic search engine.
The main difference is that JanusGraph focuses on large graphs, whereas Weaviate focuses on search, with results represented in graph format.
You might pick JanusGraph if you want to store and analyze a large graph.
You might pick Weaviate if you are building a search engine and/or storing data in graph format.
Something Weaviate can do that other search engines (graph-based or not) cannot is index your data semantically; e.g., you can find a data object about the publication Vogue by searching for "fashion".
Weaviate query example:
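A minimal sketch of what such a semantic query looks like in Weaviate's GraphQL API (the Publication class and name property are illustrative schema names, not built-ins):

{
  Get {
    Publication(nearText: { concepts: ["fashion"] }) {
      name
    }
  }
}

This asks Weaviate for Publication objects semantically close to the concept "fashion", which would surface the Vogue object mentioned above.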
Disclaimer:
I'm part of the team working on Weaviate.

Related

Proper way to store graphs with Neo4J

I'm building a system which allows the user to call N number of different graphs through an API. Currently I have a working prototype which pulls graphs from CouchDB. However, for obvious reasons, I would like to move to a graph DB. My understanding is that Neo4j can only handle one graph at a time or requires some sort of tagging system to keep graphs from mixing. Neither of those approaches seems optimal. What's the best-practice approach for this?
A few more things to note: I will be calling these graphs and manipulating them with something like networkx, and I've considered storing the graphs in a "regular" DB then moving them to Neo4J as the requests come in, which seems pretty intense.
Neo4j does not have a concept of multiple databases like most relational databases do (using CREATE DATABASE). In Neo4j there is one graph space which you can use.
So you have 2 options:
use separate Neo4j instances (single or clustered) for each graph; using Neo4j in embedded mode may be helpful here
use one Neo4j instance (single or clustered) and store your data in distinct subgraphs. If the subgraphs need some interconnections, you can use labels to identify which subgraph a certain node belongs to (see the sketch below).
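A minimal Cypher sketch of the second option (the labels and properties are illustrative):

// Tag every node with a label naming its subgraph
CREATE (:GraphA:Person {name: 'Alice'})-[:KNOWS]->(:GraphA:Person {name: 'Bob'});
CREATE (:GraphB:Person {name: 'Carol'});

// Work with one subgraph at a time by filtering on its label
MATCH (n:GraphA)-[r]->(m:GraphA) RETURN n, r, m;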

Titan geo data on Cassandra

I'm looking at using Titan to create a scalable geospatial data store (I'm thinking R-trees). In the documentation there is a GeoShape query, and the docs say that Titan can do geo data with Lucene or Elasticsearch. However, it seems like this would be very slow, because traversing nodes in Cassandra is essentially doing join queries in Cassandra, which is a really bad idea. I think I might be misunderstanding the data representation.
I read the Titan Data Model doc, and I still don't quite get it. If all the edges are stored in a Cassandra row, then Titan would still have to "join" on a vertex table. One way to solve this would be to make the column value equal to the edge property data; then you could neatly package the vertex data and the edge data into the row. However, this breaks down when you want to do queries deeper than one node, and we're back to the joining problem again.
So: is Titan emulating join queries in Cassandra? And how performant is it at geo lookups under these conditions?
I think the question conflates edge traversal with geospatial index lookups. These are separate at both the API and implementation levels. The index is not illustrated in the data model pictures.
Let's make this a little bit more specific. Say I run Titan with ES and Cassandra using Murmur3Partitioner or RandomPartitioner. I declare an ES geospatial index over edges called "place", as documented in the Getting Started page. Looking up edges by geospatial queries, such as this "WITHIN" in the Getting Started docs, first hits ES. ES returns IDs Titan can use to lookup the associated vertex/edge data in Cassandra quickly, without doing an analog to relational joins.
The cost of these edge lookups by geospatial data should be roughly equivalent to the cost of ES's WITHIN implementation (which I think is delegated to Spatial4j), plus the lookups Titan makes on Cassandra after getting IDs, which should be roughly linear in the number of edges found by ES. This is just back-of-the-envelope estimation, so please take it with a big grain of salt.
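For illustration, such a lookup looks roughly like this in the Gremlin console (following the Graph of the Gods example from the Titan docs; the 'place' key and coordinates come from that example, and the exact syntax depends on the Titan/Gremlin version):

// Find all edges whose 'place' property lies within a 50 km circle;
// the candidate IDs come from the ES index, not from scanning Cassandra
g.E().has('place', geoWithin(Geoshape.circle(37.97, 23.72, 50)))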
After I get place edges by geo matching, if I then want to run arbitrary traversals in the neighborhood of each edge in the set, then I would have a look at rooting a MultiQuery on the head/tail vertices and enabling database-level caching. If the query misses cache or cache is cold/disabled, then Titan will still attempt to retrieve all edges the traversal cares about in a single Cassandra slice per vertex, when possible. If you're concerned about Titan's edge traversal efficiency, then you might find Boutique Graph Data with Titan interesting.
HTH

Can Riak do facet queries?

Riak has both secondary indexes and (solr-ish) search.
But can it do faceted searches like Solr does? That is:
fetch facets that are applicable to the results returned
drill down into facets by constraining facet values
bucket the ranges (e.g., cities that start with a C)
The Riak 2.0 release coming later this year includes integrated Solr support, i.e. it ships with Solr 4.x included. The project is called "Yokozuna" and has been under development for the last year. If enabled, it allows you to create indexes and associate a Riak bucket with an index; all objects stored under that bucket will then be converted to Solr documents and shipped to Solr for indexing. You can then query via a pass-through HTTP interface (which allows you to use standard Solr clients) or via Riak's protobuf search interface. Basically, it combines the distributed and highly available aspects of Riak with the robust search capabilities of Solr. Here are various links to learn more.
Code: https://github.com/basho/yokozuna
Slides Berlin Buzzwords June 2013: https://speakerdeck.com/rzezeski/yokozuna-scaling-solr-with-riak
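Since queries pass straight through to Solr, faceting should work as it does in plain Solr. A sketch against the HTTP interface (the index and field names are illustrative; the endpoint path follows the Riak 2.0 docs):

curl "http://localhost:8098/search/query/myindex?q=type_s:city&facet=true&facet.field=country_s&wt=json"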
Riak's Solr-compatible interface is more of a marketing feature than something usable in real applications. Secondary indexes are simple exact-match and value-range queries. So out of the box Riak cannot do it; some time ago this was clearly stated in the official wiki, but that sentence is gone, and only some traces are left: http://news.ycombinator.com/item?id=2377680.
But this functionality can be implemented fairly easily, either using MapReduce with the search results as input, or simply on the client by running through the search results and generating a data structure with the possible filters and the counts of available items matching each criterion.
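A minimal sketch of the client-side variant in Python (the document fields are illustrative):

from collections import Counter

def facet_counts(results, facet_fields):
    # results: list of dicts as returned by the search query
    facets = {field: Counter() for field in facet_fields}
    for doc in results:
        for field in facet_fields:
            if field in doc:
                facets[field][doc[field]] += 1
    return facets

results = [
    {'city': 'Cologne', 'category': 'shop'},
    {'city': 'Boston', 'category': 'shop'},
    {'city': 'Chicago', 'category': 'cafe'},
]
print(facet_counts(results, ['city', 'category']))

# Drill down by constraining a facet value (e.g. cities starting with 'C')
narrowed = [d for d in results if d.get('city', '').startswith('C')]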

Storing multiple graphs in Neo4J

I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011' or (assuming that each contact is part of a group) 'which group has the most number of connections to other groups' etc.
I quickly found that the SQL for generating these stats became unwieldy.
So I wrote a script that, for any given date, will generate a graph in memory. I could then run whatever stat I wanted against that graph. Much easier to understand and, in general, much more performant too -- except for the graph-generation part.
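A sketch of that script with networkx (column names follow the table above; simplified to keep a single edge per contact pair):

import networkx as nx

def graph_as_of(rows, date):
    # rows: (contact_id, other_contact_id, strength, recorded_at) tuples
    g = nx.Graph()
    for contact_id, other_contact_id, strength, recorded_at in rows:
        if recorded_at <= date:
            g.add_edge(contact_id, other_contact_id, strength=strength)
    return g

rows = [(1, 2, 3, '2011-01-15'), (2, 3, 2, '2011-01-20'), (1, 3, 4, '2011-02-02')]
g = graph_as_of(rows, '2011-01-31')

# e.g. the number of 2-way connections of strength 3 or better
strong = sum(1 for _, _, d in g.edges(data=True) if d['strength'] >= 3)
print(strong)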
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: eg for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached which worked great until the graphs grew > 1 MB.
So now I'm thinking about using a graph database like Neo4J.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4j and retrieve/interact with them separately? I would then create and store separate social graphs for each date.
or
add valid-to and valid-from timestamps to each edge and filter the graph appropriately: so if I wanted a graph for "May 1st" I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st, then those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.
Right now you can store just one graph database in a single Neo4j instance, but this one graphdb can contain as many different sub-graphs as you like. You only have to keep that in mind when doing global operations (like index queries), but there you can do compound queries that also include timestamped properties to limit the results.
One way of doing that is, as you said, adding temporal information to the edges to represent the structure of the graph at a given date; you can then traverse the structure of the graph as it was back then.
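A sketch of that filter in Cypher (the labels, relationship type, and property names are illustrative):

// Only follow edges that were valid at the reference time $ts
MATCH (a:Contact)-[r:KNOWS]->(b:Contact)
WHERE r.valid_from <= $ts AND (r.valid_to IS NULL OR $ts < r.valid_to)
RETURN a, b, r.strength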
Note that "reference node" has a different, specific meaning in Neo4j.
Using category nodes per day (linking them together, and also aggregating them into higher-level timespans) is the more graphy way of categorizing nodes than indexed properties. (Effectively these are in-graph indexes that you can easily include in your traversals and graph queries.)
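A sketch of such an in-graph index (names are illustrative):

// One category node per day, linked to everything recorded that day
CREATE (d:Day {date: '2011-05-01'});
MATCH (d:Day {date: '2011-05-01'}), (c:Contact {id: 42})
CREATE (d)-[:RECORDED]->(c);

// Traversals can then start from the day node instead of an index lookup
MATCH (:Day {date: '2011-05-01'})-[:RECORDED]->(c:Contact) RETURN c;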
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes themselves also differ (e.g. changing properties), you could either duplicate them (effectively creating different subgraphs) or create a linked list of history nodes on each node that contains just the changes (or the full snapshot, depending on your requirements).
Your domain sounds very fitting for the graph database. If you have more and detailed questions feel free to join the Neo4j mailing list.
Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it in memory for the query, and closes it after you get your answer. You could also configure a proxy server and send two parameters to your loader: your query (which I presume is a Cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer. But if it is simply for storing and doing some analytics over data sets, it can definitely answer your needs.
This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).

Which technology is best suited to store and query a huge readonly graph?

I have a huge directed graph: It consists of 1.6 million nodes and 30 million edges. I want the users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have stored the graph in a PostgreSQL database. But that solution is not very efficient and elegant, I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).
It was suggested to me to use a graph DB like Neo4j or AllegroGraph. However, the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j, on the other hand, has only a very low-level API (and the Python interface is not mature yet). Both of them seem to be more suited to problems where nodes and edges are frequently added to or removed from a graph. For a simple search on a graph, these graph DBs seem too complex.
One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.
Another idea would be to have a server process storing the whole graph (500 MB to 1 GB) in memory. The clients could then query the server process and traverse the graph very quickly, since the graph is stored in memory. Is there an easy way to write such a server (preferably in Python) using some existing framework?
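A minimal sketch of such a server, assuming networkx and only the standard library (the edge-list file, endpoint, and parameter names are illustrative):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

import networkx as nx

# Load the whole graph into memory once at startup
G = nx.read_edgelist('graph.edgelist', create_using=nx.DiGraph)
U = G.to_undirected()  # so incoming and outgoing edges both count

class ShortestPathHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)
        src, dst = qs['from'][0], qs['to'][0]
        try:
            paths = list(nx.all_shortest_paths(U, src, dst))
        except nx.NetworkXNoPath:
            paths = []
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({'paths': paths}).encode())

HTTPServer(('', 8000), ShortestPathHandler).serve_forever()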
Which technology would you use to store and query such a huge readonly graph?
LinkedIn have to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.
There is also OrientDB, an open-source document-graph DBMS with a commercially friendly license (Apache 2). It has a simple API, an SQL-like language, ACID transactions, and support for the Gremlin graph language.
The SQL has extensions for trees and graphs. Example:
select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')
This returns all the Accounts with at least one friend who lives in New Zealand, where "friends" is traversed recursively up to the 7th level of depth.
I have a directed graph for which I (mis)used Lucene.
Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.
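In code, that layout looks roughly like this (Lucene's Java API, 5.x era; the field names are mine):

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class EdgeIndex {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();  // in-memory; see the note below
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()));

        // One document per edge, with the endpoints as searchable fields
        Document edge = new Document();
        edge.add(new StringField("from", "nodeA", Field.Store.YES));
        edge.add(new StringField("to", "nodeB", Field.Store.YES));
        writer.addDocument(edge);
        writer.close();

        // Outbound links of nodeA = every edge document whose 'from' field matches
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        TopDocs outbound = searcher.search(new TermQuery(new Term("from", "nodeA")), 1000);
        System.out.println(outbound.totalHits);
    }
}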
It performs well enough, and query times for fetching the in- and outbound links of a node would be acceptable to a user using it as a web-based tool. But for computationally intensive batch calculations, where I am doing many hundreds of thousands of queries, I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene, so I'm working on a second, Berkeley DB based implementation so that I can do a side-by-side comparison of the two. If I get a chance to post the results here, I will.
However, my data requirements are much larger than yours at > 3GB, more than could fit in my available memory. As a result the Lucene index I used was on disk, but with Lucene you can use a "RAMDirectory" index in which case the whole thing will be stored in memory, which may well suit your needs.
Correct me if I'm wrong, but since each node is a list of the linked nodes, it seems to me a DB with a schema is more of a burden than an advantage.
It also sounds like Google App Engine would be right up your alley:
It's optimized for reading - and there's memcached if you want it even faster
it's distributed - so the size doesn't affect efficiency
Of course, if you somehow rely on a relational DB to find the path, it won't work for you...
And I just noticed that the question is 4 months old.
So you have a graph as your data and want to perform a classic graph operation. I can't see what other technology could fit better than a graph database.
