Method to export Incidence Matrix from Grakn? - vaticle-typedb

We often use GraphBLAS for graph processing so we need to use the incidence matrix. I haven't been able to find a way to export this from Grakn to a csv or any file. Is this possible?

There isn't a built-in way to dump data to CSV in Grakn right now. However, we do highly encourage our community to contribute open-source tooling for these kinds of tasks! Feel free to chat to us about it on our Discord.
As to how it can be done, conceptually it's pretty easy:
Query to stream all hyper-relations out:
match $r isa relation;
and then for each relation, we can pipeline another query (possibly in a new transaction if you wish to keep memory usage lower):
match $r iid <iid of $r from previous query>; $r ($x); get $x;
which will get you everything playing a role in this particular hyper-relation $r.
If you also wish to extract attributes that are attached to the hyper-relation, you can use the following:
match $r iid <iid of $r from first query>; $r has $a; get $a;
In effect we can use these steps to build up each column of the incidence matrix A.
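To make that concrete, here is a minimal sketch of the pipeline in Python. It assumes an already-open read transaction tx from the Grakn 2.0 Python client (the query().match(...) and get_iid() calls mirror the ones in the snippet further down this thread), and it uses scipy purely as one convenient way to hold and export the sparse matrix; adapt the names to your own setup.
from scipy.sparse import dok_matrix
from scipy.io import mmwrite

# Step 1: stream all relation iids (one column of A per relation).
rel_iids = [ans.get("r").get_iid()
            for ans in tx.query().match("match $r isa relation;")]

# Step 2: for each relation, fetch the iids of everything playing a role in it.
columns = {}
for rel_iid in rel_iids:
    answers = tx.query().match(f"match $r iid {rel_iid}; $r ($x); get $x;")
    columns[rel_iid] = [ans.get("x").get_iid() for ans in answers]

# Index role players (rows) and relations (columns), then fill A (|V| x |E|).
vertex_iids = sorted({iid for players in columns.values() for iid in players})
v_index = {iid: i for i, iid in enumerate(vertex_iids)}
e_index = {iid: j for j, iid in enumerate(columns)}

A = dok_matrix((len(vertex_iids), len(columns)), dtype=int)
for rel_iid, players in columns.items():
    for player_iid in players:
        A[v_index[player_iid], e_index[rel_iid]] = 1

# Export for GraphBLAS, e.g. in Matrix Market format (or write your own CSV).
mmwrite("incidence.mtx", A.tocoo())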
There are a couple of important caveats I should bring up:
What you'll end up with will exclude all type information: the types of the hyper-relations, the types of the role players in those relations, the actual roles being played, and the types of the attributes owned.
==> It would be interesting to hear/discuss how one could encode type information for use in GraphBLAS
In Graql, it's entirely possible to have relations participating in other relations. In the worst case, this means all hyper-edges E will also be present in the vertex set V. In practice only a few relations play a role in other relations, so only a subset of E may be in V.

So the incidence matrix is equivalent to the nodes/edges arrays used in force-graph visualisation. In this case it is pretty straightforward.
My approach would be slightly different from the above, as all I need to do is pull all of the things in the db (entities, relations, attributes) with:
match $ting isa thing;
Now when I get my transaction back, for each $ting I want to pull all of the available properties (using both local and remote methods) if I am building a force-graph viz, but for your incidence matrix I really only care about pulling 3 bits of data:
The iid of the thing
The attributes the thing may own.
The role players for each role, if the thing is a relation
Essentially one tests each returned object to find out its type (e.g. entity, attribute, relation), and then uses some of the local and remote methods to get the data one wants. In Python, the code for pulling the data for relations looks like this:
# pull relation data
elif thing.is_relation():
    rel = {}
    rel['type'] = 'relation'
    rel['symbol'] = key
    rel['G_id'] = thing.get_iid()
    rel['G_name'] = thing.get_type().get_label().name()
    # attributes owned by this relation
    att_obj = thing.as_remote(r_tx).get_has()
    att = []
    for a in att_obj:
        att.append(a.get_iid())
    rel['has'] = att
    # role players, keyed by role type
    links = thing.as_remote(r_tx).get_players_by_role_type()
    logger.debug(f' links are -> {links}')
    edges = {}
    for edge_key, edge_thing in links.items():
        edge_players = list(edge_thing)
        logger.debug(f' edge key is -> {edge_key}')
        logger.debug(f' edge_thing is -> {edge_players}')
        edges[edge_key.get_label().name()] = [e.get_iid() for e in edge_players]
    rel['edges'] = edges
    res.append(rel)
    layer.append(rel)
    logger.debug(f'rel -> {rel}')
This then gives us a node array, which we can easily process to build an edges array (i.e. the links joining an object and the attributes it owns, or the links joining a relation to its role players). Thus, exporting your incidence matrix is pretty straightforward.
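As a rough illustration of that last processing step (assuming node records shaped like the rel dicts in the snippet above; the helper names here are just for the example), the edges array and a CSV export could look like this:
import csv

# Build an edges/links array from the node records collected above.
# Each record is assumed to carry 'G_id', 'has' (owned attribute iids)
# and, for relations, 'edges' (role name -> list of role-player iids).
def build_links(nodes):
    links = []
    for node in nodes:
        # ownership links: the thing -> each attribute it owns
        for att_iid in node.get('has', []):
            links.append({'source': node['G_id'], 'target': att_iid, 'kind': 'has'})
        # role-player links: a relation -> each player, labelled with the role
        for role, player_iids in node.get('edges', {}).items():
            for player_iid in player_iids:
                links.append({'source': node['G_id'], 'target': player_iid, 'kind': role})
    return links

# Dump the links to CSV, from which the incidence matrix can be assembled.
def write_links_csv(links, path='links.csv'):
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['source', 'target', 'kind'])
        writer.writeheader()
        writer.writerows(links)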

Related

How to get a path from one node to another including all other nodes and relationships involved in between

I have designed a model in Neo4j in order to get paths from one station to another, including the platforms/legs involved. The model is depicted below. Basically, I need a query that takes me from NBW to RD and also shows the platforms and legs involved. I am struggling with the query; I get no result. I'd appreciate it if someone could help.
Here is my cypher statement:
MATCH p = (a:Station)-[r:Goto|can_board|can_alight|has_platfrom*0..]->(c:Station)
WHERE (a.name='NBW')
AND c.name='RD'
RETURN p
Model: (diagram not reproduced here)
As mentioned in the comments, in Cypher you can't use a directed variable-length relationship that uses differing directions for some of the relationships.
However, APOC Procedures just added the ability to expand based on sequences of relationships. You can give this a try:
MATCH (start:station), (end:station)
WHERE start.name='NBW' AND end.name='THT'
CALL apoc.path.expandConfig(start, {terminatorNodes:[end], limit:1,
relationshipFilter:'has_platform>, can_board>, goto>, can_alight>, <has_platform'}) YIELD path
RETURN path
I added a limit so that only the first (and shortest) path to your end station will be returned. Removing the limit isn't advisable, since this will continue to repeat the relationships in the expansion, going from station to station, until it finds all possible ways to get to your end station, which could hang your query.
EDIT
Regarding the new model changes, the reason the above will not work is because relationship sequences can't contain a variable-length sequence within them. You have 2 goto> relationships to traverse, but only one is specified in the sequence.
Here's an alternative that doesn't use sequences, just a whitelisting of allowed relationships. The spanningTree() procedure uses NODE_GLOBAL uniqueness so there will only be a single unique path to each node found (paths will not backtrack or revisit previously-visited nodes).
MATCH (start:station), (end:station)
WHERE start.name='NBW' AND end.name='RD'
CALL apoc.path.spanningTree(start, {terminatorNodes:[end], limit:1,
relationshipFilter:'has_platform>|can_board>|goto>|can_alight>|<has_platform'}) YIELD path
RETURN path
Your query is directed --> and not all of the relationships between your two stations run in the same direction. If you remove the relationship direction you will get a result.
Then once you have a result, I think something like this could point you in the right direction for extracting the particular details from the resulting path.
Essentially, I am assuming that everything you are interested in is in the path that is returned; you just need to filter out the different pieces.
As @InverseFalcon points out, this query should be limited in a larger graph or it could easily run away.
MATCH p = (a:Station)-[r:Goto|can_board|can_alight|has_platfrom*0..]-(c:Station)
WHERE (a.name='NBW')
AND c.name='THT'
RETURN filter( n in nodes(p) WHERE 'Platform' in labels(n)) AS Platforms

ArangoDB performance: edge vs. DOCUMENT()

I'm new to ArangoDB with graphs. I simply want to know if it is faster to build edges or to use 'DOCUMENT()' for very simple 1:1 connections where querying the graph is not needed.
LET a = DOCUMENT(#from)
FOR v IN OUTBOUND a
CollectionAHasCollectionB
RETURN MERGE(a,{b:v})
vs
LET a = DOCUMENT(#from)
RETURN MERGE(a, {b: DOCUMENT(a.bId)})
A simple benchmark you can try:
Create the collections products, categories and an edge collection has_category. Then generate some sample data:
FOR i IN 1..10000
  INSERT {_key: TO_STRING(i), name: CONCAT("Product ", i)} INTO products

FOR i IN 1..10000
  INSERT {_key: TO_STRING(i), name: CONCAT("Category ", i)} INTO categories

FOR p IN products
  LET random_categories = (
    FOR c IN categories
      SORT RAND()
      LIMIT 5
      RETURN c._id
  )
  LET category_subset = SLICE(random_categories, 0, RAND()*5+1)
  UPDATE p WITH {
    categories: category_subset,
    categoriesEmbedded: DOCUMENT(category_subset)[*].name
  } INTO products
  FOR cat IN category_subset
    INSERT {_from: p._id, _to: cat} INTO has_category
Then compare the query times for the different approaches.
Graph traversal (depth 1..1):
FOR p IN products
  RETURN {
    product: p.name,
    categories: (FOR v IN OUTBOUND p has_category RETURN v.name)
  }
Look-up in categories collection using DOCUMENT():
FOR p IN products
  RETURN {
    product: p.name,
    categories: DOCUMENT(p.categories)[*].name
  }
Using the directly embedded category names:
FOR p IN products
  RETURN {
    product: p.name,
    categories: p.categoriesEmbedded
  }
Graph traversal is the slowest of all 3; the look-up in another collection is faster than the traversal, but by far the fastest query is the one with embedded category names.
If you query the categories for just one or a few products however, the response times should be in the sub-millisecond area regardless of the data model and query approach and therefore not pose a performance problem.
The graph approach should be chosen if you need to query for paths with variable depth, long paths, shortest path etc. For your use case, it is not necessary. Whether the embedded approach is suitable or not is something you need to decide:
Is it acceptable to duplicate information, and potentially have inconsistencies in the data? (If you want to change a category name, you need to change it in all product records instead of in just one category document that products refer to via its immutable ID.)
Is there a lot of additional information per category? If so, all of that data needs to be embedded into every product document that has that category - basically trading memory / storage space for performance.
Do you need to retrieve a list of all (distinct) categories often? You can do this type of query really cheaply with the separate categories collection. With the embedded approach, it will be much less efficient, because you need to go over all products and collect the category info.
Bottom line: you should choose the data model and approach that fits your use case best. Thanks to ArangoDB's multi-model nature you can easily try another approach if your use case changes or you run into performance issues.
Generally speaking, the latter variant
LET a = DOCUMENT(#from)
RETURN MERGE(a, {b: DOCUMENT(a.bId)})
should have lower overhead than the full-featured traversal variant. This is because the DOCUMENT variant will do a point lookup of a document whereas the traversal variant is very general purpose: it can return zero to many results from a variable number of collections, needs to keep track of the path seen etc.
When I tried both variants in a local test case, the non-traversal variant was also a lot faster, supporting this claim.
However, the traversal-based variant is more flexible: it can also be used should there be multiple edges (no 1:1 mapping) and for longer paths.

How do I get all nodes in the graph on a certain relationship type

I have built a small graph where all the screens are connected and the flow of the screens varies based on the system/user. So the system/user is the relationship type.
I am looking to fetch all nodes that are linked via a certain relationship from a starting screen. I don't care about the depth, since I don't know the depth of the graph.
Something like this, but the query below takes forever to get the result and it returns incorrect connections that don't match the attribute {path:'CC'}:
match (n:screen {isStart:true})-[r:NEXT*0..{path:'CC'}]-()
return r,n
A few suggestions:
Make sure you have created an index for :screen(isStart):
CREATE INDEX ON :screen(isStart);
Are you sure you want to include 0-length paths? If not, take out 0.. from your query.
You did not specify the directionality of the :NEXT relationships, so the DB has to look at both incoming and outgoing :NEXT relationships. If appropriate, specify the directionality.
To minimize the number of result rows, add a WHERE clause that ensures that the current path cannot be extended further.
Here is a proposed query that combines the last 3 suggestions (fix it up to suit your needs):
MATCH (n:screen {isStart:true})-[r:NEXT* {path:'CC'}]->(x)
WHERE NOT (x)-[:NEXT {path:'CC'}]->()
return r,n;

How to iterate maps in insertion order?

I have a navbar as a map:
var navbar = map[string]navbarTab{
}
Where navbarTab has various properties, child items and so on. When I try to render the navbar (with for tabKey := range navbar) it shows up in a random order. I'm aware that range iterates over a map in a random order, but there appears to be no way to get an ordered list of keys or to iterate in insertion order.
The playground link is here: http://play.golang.org/p/nSL1zhadg5 although it seems to not exhibit the same behavior.
How can I iterate over this map without breaking the insertion order?
The general concept of the map data structure is that it is a collection of key-value pairs. "Ordered" or "sorted" is nowhere mentioned.
Wikipedia definition:
In computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears just once in the collection.
The map is one of the most useful data structures in computer science, so Go provides it as a built-in type. However, the language specification only specifies a general map (Map types):
A map is an unordered group of elements of one type, called the element type, indexed by a set of unique keys of another type, called the key type. The value of an uninitialized map is nil.
Note that the language specification not only leaves out the "ordered" or "sorted" words, it explicitly states the opposite: "unordered". But why? Because this gives greater freedom to the runtime to implement the map type. The language specification allows any map implementation to be used, such as a hash map, tree map, etc. Note that the current (and previous) versions of Go use a hash map implementation, but you don't need to know that to use it.
The blog post Go maps in action is a must read regarding to this question.
Before Go 1, when a map was not changed, the runtime returned the keys in the same order when you iterated over its keys/entries multiple times. Note that this order could have changed if the map was modified, as the implementation might have needed to do a rehash to accommodate more entries. People started to rely on the stable iteration order (when the map was not changed), so starting with Go 1 the runtime deliberately randomizes map iteration order to get the attention of developers: the order is not defined and can't be relied on.
What to do then?
If you need a sorted dataset (be it a collection of key-value pairs or anything else), whether by insertion order, by a natural order defined by the key type, or by an arbitrary order, a map is not the right choice. If you need a predefined order, slices (and arrays) are your friends. And if you need to be able to look up the elements by a predefined key, you may additionally build a map from the slice to allow fast look-up of the elements by key.
Whether you build the map first and then a slice in the proper order, or the slice first and then build a map from it, is entirely up to you.
The aforementioned Go maps in action blog post has a section dedicated to Iteration order:
When iterating over a map with a range loop, the iteration order is not specified and is not guaranteed to be the same from one iteration to the next. Since Go 1 the runtime randomizes map iteration order, as programmers relied on the stable iteration order of the previous implementation. If you require a stable iteration order you must maintain a separate data structure that specifies that order. This example uses a separate sorted slice of keys to print a map[int]string in key order:
import "sort"
var m map[int]string
var keys []int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
fmt.Println("Key:", k, "Value:", m[k])
}
P.S.:
...although it seems to not exhibit the same behavior.
Seemingly you see the "same iteration order" on the Go Playground because the outputs of applications/code on the Go Playground are cached. Once a new, not-yet-seen piece of code is executed, its output is saved as new. When the same code is executed again, the saved output is presented without running the code again. So it's not really the same iteration order that you see; it's the exact same output, produced without executing any of the code again.
P.S. #2
Although the iteration order with for range is "random", there are notable exceptions in the standard lib that do process maps in sorted key order, namely the encoding/json, text/template, html/template and fmt packages. For more details, see In Golang, why are iterations over maps random?
Go maps do not maintain the insertion order; you will have to implement this behavior yourself.
Example:
type NavigationMap struct {
    m    map[string]navbarTab
    keys []string
}

func NewNavigationMap() *NavigationMap { ... }

func (n *NavigationMap) Set(k string, v navbarTab) {
    n.m[k] = v
    n.keys = append(n.keys, k)
}
This example is not complete and does not cover all use-cases (e.g. updating the insertion order on duplicate keys).
If your use-case includes re-inserting the same key multiple times (this will not update insertion order for key k if it was already in the map):
func (n *NavigationMap) Set(k string, v navbarTab) {
    _, present := n.m[k]
    n.m[k] = v
    if !present {
        n.keys = append(n.keys, k)
    }
}
Choose the simplest thing that satisfies your requirements.

How to store and retrieve different types of Vertices with the Tinkerpop/Blueprints graph API?

When looking at the TinkerPop Blueprints API it is quite straightforward to use one type of vertex, but how can I store two? E.g. users and their interests?
And how can I get a Vertex by id? I mean, there could be a user named 'timetabling' as well as an interest 'timetabling' - how do I handle that id conflict?
I know that the first problem could be solved by introducing an index on a type property, and for the second problem I could auto-generate the id and create another index on the name property. BUT why would I then need the vertex id at all? E.g. for the in-memory graph there is a HashMap of all vertices, which would be of no use and a waste of memory! (I could solve the problem differently by combining type and name as the id, but then it would be inefficient if I e.g. list all users.)
Hmmh, ok. I'm just using the combined id (name+type) for the vertices and a separate index for type. Better solutions?
In general it is best to rely on the automatic ID system of the underlying graph database (e.g. Neo4j, InfiniteGraph, OrientDB, etc.). The way in which you would add the information you want is as follows:
Vertex v = graph.addVertex(null)            // the 'timetabling' interest vertex
v.setProperty("name","timetabling")
Vertex marko = graph.addVertex(null)        // a user vertex
graph.addEdge(null, marko, v, "hasInterest")
Vertex aType = graph.addVertex(null)        // a vertex representing the type
graph.addEdge(null, aType, v, "hasType")
In short, the ID of a vertex/edge is a non-domain-specific way of retrieving vertices/edges. Generally, it is best to use properties in your domain model for indexing.
Hope that speaks to your question,
Marko.
http://markorodriguez.com
