ArangoDB performance: edge vs. DOCUMENT() - graph

I'm new to ArangoDB with graphs. I simply want to know if it is faster to build edges or use 'DOCUMENT()' for very simple 1:1 connections where querying the graph is not needed?
LET a = DOCUMENT(#from)
FOR v IN OUTBOUND a CollectionAHasCollectionB
    RETURN MERGE(a, {b: v})
vs
LET a = DOCUMENT(#from)
RETURN MERGE(a, {b: DOCUMENT(a.bId)})

A simple benchmark you can try:
Create the collections products, categories and an edge collection has_category. Then generate some sample data:
FOR i IN 1..10000
    INSERT {_key: TO_STRING(i), name: CONCAT("Product ", i)} INTO products

FOR i IN 1..10000
    INSERT {_key: TO_STRING(i), name: CONCAT("Category ", i)} INTO categories

FOR p IN products
    LET random_categories = (
        FOR c IN categories
            SORT RAND()
            LIMIT 5
            RETURN c._id
    )
    LET category_subset = SLICE(random_categories, 0, RAND()*5+1)
    UPDATE p WITH {
        categories: category_subset,
        categoriesEmbedded: DOCUMENT(category_subset)[*].name
    } INTO products
    FOR cat IN category_subset
        INSERT {_from: p._id, _to: cat} INTO has_category
Then compare the query times for the different approaches.
Graph traversal (depth 1..1):
FOR p IN products
    RETURN {
        product: p.name,
        categories: (FOR v IN OUTBOUND p has_category RETURN v.name)
    }
Look-up in categories collection using DOCUMENT():
FOR p IN products
    RETURN {
        product: p.name,
        categories: DOCUMENT(p.categories)[*].name
    }
Using the directly embedded category names:
FOR p IN products
    RETURN {
        product: p.name,
        categories: p.categoriesEmbedded
    }
The graph traversal is the slowest of all three; the lookup in the categories collection is faster than the traversal, and the query with the embedded category names is by far the fastest.
If you query the categories for just one or a few products, however, the response times should be in the sub-millisecond range regardless of the data model and query approach, and therefore not pose a performance problem.
The graph approach should be chosen if you need to query for paths with variable depth, long paths, shortest paths etc. For your use case it is not necessary. Whether the embedded approach is suitable or not is something you need to decide:
Is it acceptable to duplicate information, and potentially have inconsistencies in the data? (If you want to change a category name, you need to change it in all product records instead of in just one category document, which products can refer to via its immutable ID.)
Is there a lot of additional information per category? If so, all that data needs to be embedded into every product document that has that category - basically trading memory / storage space for performance
Do you need to retrieve a list of all (distinct) categories often? You can do this type of query really cheaply with the separate categories collection. With the embedded approach it will be much less efficient, because you need to go over all products and collect the category info (compare the two queries sketched below).
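For illustration, with the separate categories collection from the benchmark above, the distinct-category list is a single scan over the (much smaller) categories collection:

FOR c IN categories
    RETURN DISTINCT c.name

whereas with the embedded names you have to touch every product document, roughly like this:

FOR p IN products
    FOR name IN p.categoriesEmbedded
        RETURN DISTINCT name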
Bottom line: you should choose the data model and approach that fits your use case best. Thanks to ArangoDB's multi-model nature you can easily try another approach if your use case changes or you run into performance issues.

Generally speaking, the latter variant
LET a = DOCUMENT(#from)
RETURN MERGE(a, {b: DOCUMENT(a.bId)})
should have lower overhead than the full-featured traversal variant. This is because the DOCUMENT variant will do a point lookup of a document whereas the traversal variant is very general purpose: it can return zero to many results from a variable number of collections, needs to keep track of the path seen etc.
When I tried both variants in a local test case, the non-traversal variant was also a lot faster, supporting this claim.
However, the traversal-based variant is more flexible: it can also be used should there be multiple edges (no 1:1 mapping) and for longer paths.
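For instance, a deeper traversal, which the DOCUMENT() variant cannot express, could look roughly like this (reusing the names from the question; the depth range 1..3 is just an example):

LET a = DOCUMENT(#from)
FOR v IN 1..3 OUTBOUND a CollectionAHasCollectionB
    RETURN v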

Related

Method to export Incidence Matrix from Grakn?

We often use GraphBLAS for graph processing, so we need to use the incidence matrix. I haven't been able to find a way to export this from Grakn to a CSV or any other file. Is this possible?
There isn't a built-in way to dump data to CSV in Grakn right now. However, we do highly encourage our community to contribute open source tooling for these kinds of tasks! Feel free to chat to us about it on our Discord.
As to how it can be done, conceptually it's pretty easy:
Query to stream all hyper-relations out:
match $r isa relation;
and then for each relation, we can pipeline another query (possibly in a new transaction if you wish to keep memory usage lower):
match $r iid <iid of $r from previous query>; $r ($x); get $x;
which will get you everything playing a role in this particular hyper-relation $r.
If you also wish to extract attributes that are attached to the hyper-relation, you can use the following:
match $r iid <iid of $r from first query>; $r has $a; get $a;
In effect, we can use these steps to build up each column in the incidence matrix A.
There are a couple of important caveats I should bring up:
What you'll end up with will exclude all type information: the types of the hyper-relations, of the role players in the relations, of the actual roles being played, and of the attributes owned.
==> It would be interesting to hear/discuss how one could encode type information for use in GraphBLAS
In Graql it's entirely possible to have relations participating in relations. In the worst case, this means all hyper-edges E will also be present in the set V. In practice only a few relations will play a role in other relations, so only a subset of E may be in V.
So the incidence matrix is equivalent to the nodes/edges arrays used in force-graph visualisation. In this case it is pretty straightforward.
My approach would be slightly different from the above, as all I need to do is pull all of the things in the db (entities, relations, attributes) with
match $ting isa thing;
Now when I get my transaction back, for each $ting I want to pull all of the available properties using both local and remote methods if I am building a force-graph viz, but for your incidence matrix I really only care about pulling 3 bits of data:
The iid of the thing
The attributes the thing may own.
The role players, grouped by role, if the thing is a relation
Essentially one tests each returned object to find out the type (e.g. entity, attribute, relation), and then uses some of the local and remote methods to get the data one wants. In Python, the code for pulling the data for relations looks like
# pull relation data
elif thing.is_relation():
    rel = {}
    rel['type'] = 'relation'
    rel['symbol'] = key
    rel['G_id'] = thing.get_iid()
    rel['G_name'] = thing.get_type().get_label().name()
    att_obj = thing.as_remote(r_tx).get_has()
    att = []
    for a in att_obj:
        att.append(a.get_iid())
    rel['has'] = att
    links = thing.as_remote(r_tx).get_players_by_role_type()
    logger.debug(f' links are -> {links}')
    edges = {}
    for edge_key, edge_thing in links.items():
        logger.debug(f' edge key is -> {edge_key}')
        logger.debug(f' edge_thing is -> {list(edge_thing)}')
        edges[edge_key.get_label().name()] = [e.get_iid() for e in list(edge_thing)]
    rel['edges'] = edges
    res.append(rel)
    layer.append(rel)
    logger.debug(f'rel -> {rel}')
This then gives us a node array, which we can easily process to build an edges array (i.e. the links joining an object and the attributes it owns, or the links joining a relation to its role players). Thus, exporting your incidence matrix is pretty straightforward.
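To sketch that last step (purely illustrative; it assumes nodes is the list of dicts built as above, each carrying 'type', 'G_id' and 'edges' keys), one could write the incidence matrix to CSV with one column per relation:

import csv

def write_incidence_csv(nodes, path):
    # hyper-edges: one column per relation
    relations = [n for n in nodes if n['type'] == 'relation']
    # rows: every iid we have collected (entities, attributes and relations alike)
    row_iids = sorted({n['G_id'] for n in nodes})
    row_index = {iid: i for i, iid in enumerate(row_iids)}

    matrix = [[0] * len(relations) for _ in row_iids]
    for col, rel in enumerate(relations):
        for players in rel['edges'].values():
            for iid in players:
                if iid in row_index:
                    matrix[row_index[iid]][col] = 1  # thing plays a role in this relation

    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['iid'] + [r['G_id'] for r in relations])
        for iid, row in zip(row_iids, matrix):
            writer.writerow([iid] + row)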

Avoiding Salesforce governor limits on SOQL queries getting group members for each group?

I am working in Apex on the Salesforce platform. I have this loop to grab all group names, Ids, and their respective group members, place them in an object that collects all this info, and then put that object in a list so I have a list of all groups and all the information I need:
List<groupInfo> memberList = new List<groupInfo>();
for (Id key : groupMap.keySet()) {
    groupInfo newGroup = new groupInfo();
    Group g = groupMap.get(key);
    if (g.Name != null) {
        Set<Id> memberSet = getGroupEventRelations(new Set<Id>{g.Id});
        if (memberSet.size() != 0) {
            newGroup.groupId = g.Id;
            newGroup.groupName = g.Name;
            newGroup.groupMemberIds = memberSet;
            memberList.add(newGroup);
        }
    }
}
My getGroupEventRelations method is as such:
global static Set<Id> getGroupEventRelations(Set<Id> groupIds) {
    Set<Id> nestedIds = new Set<Id>();
    Set<Id> returnIds = new Set<Id>();
    List<GroupMember> members = [SELECT Id, GroupId, UserOrGroupId FROM GroupMember WHERE GroupId IN :groupIds];
    for (GroupMember member : members) {
        if (Schema.Group.SObjectType == member.UserOrGroupId.getSObjectType()) {
            nestedIds.add(member.UserOrGroupId);
        } else {
            returnIds.add(member.UserOrGroupId);
        }
    }
    if (nestedIds.size() > 0) {
        returnIds.addAll(getGroupEventRelations(nestedIds));
    }
    return returnIds;
}
getGroupEventRelations contains a SOQL query, and it is called inside a loop over groups. If someone has over 100 groups with group members, or possibly a series of 100 groups nested inside groups, then this is going to hit Salesforce's governor limit on SOQL queries pretty quickly.
I am wondering if anyone knows of a way to get rid of the SOQL query inside getGroupEventRelations, and thus of the query in the loop. When I want group members for a specific group, I am not really seeing a way around this without more loops inside loops, where I could risk running into Salesforce's CPU timeout governor limit :(
Thank you in advance for any help!
At large enough numbers there's no solution; you'll run into SOME governor limit. But you can certainly make your code work with bigger numbers than it does now. Here's a quick little cheat to cut the nesting 5-fold: instead of just looking at the immediate parent (a single level of children), look for parent, grandparent, great-grandparent, etc., all in one query.
[SELECT Id, GroupId, UserOrGroupId
 FROM GroupMember
 WHERE (GroupId IN :groupIds
        OR Group.GroupId IN :groupIds
        OR Group.Group.GroupId IN :groupIds
        OR Group.Group.Group.GroupId IN :groupIds
        OR Group.Group.Group.Group.GroupId IN :groupIds
        OR Group.Group.Group.Group.Group.GroupId IN :groupIds)
   AND Id NOT IN :returnIds];
You just got 5 (or is it 6?) levels of children in one SOQL call, so you can support that many times more nest levels now. Note that I added a 'NOT IN' clause to make sure you don't repeat children that you already have, since you won't know which Ids came from the bottom level.
You can also make your very first call for all groups instead of one group at a time. So if someone has 100 groups you'll make just one call instead of 100.
List<Group> groups = groupMap.values();
List<GroupMember> allMembers = [SELECT Id, GroupId, UserOrGroupId FROM GroupMember WHERE GroupId IN :groups];
Lastly, you could query all GroupMembers in a single SOQL call and then iterate yourself. Like you said, you risk running into the 10 second limit here, but if the number of groups isn't in the millions you'll likely be just fine, especially if you do some O(n) analysis and choose good data structures and algorithms. On the plus side, you won't have to worry about SOQL limits regardless of the nesting and the tree complexity. This answer should be very helpful; they are doing almost exactly what you'd have to do if you pulled all members in one call:
How to efficiently build a tree from a flat structure?
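A rough, untested sketch of that single-query approach in Apex (reusing the names from the question) might look like this:

// Build the membership index once, before the loop over groups (a single SOQL query in total):
Map<Id, List<GroupMember>> membersByGroup = new Map<Id, List<GroupMember>>();
for (GroupMember gm : [SELECT GroupId, UserOrGroupId FROM GroupMember]) {
    if (!membersByGroup.containsKey(gm.GroupId)) {
        membersByGroup.put(gm.GroupId, new List<GroupMember>());
    }
    membersByGroup.get(gm.GroupId).add(gm);
}

// Then resolve a group's (possibly nested) users purely in memory, breadth-first:
global static Set<Id> getGroupEventRelations(Set<Id> groupIds, Map<Id, List<GroupMember>> membersByGroup) {
    Set<Id> userIds = new Set<Id>();
    Set<Id> toVisit = new Set<Id>(groupIds);
    Set<Id> visited = new Set<Id>();
    while (!toVisit.isEmpty()) {
        Set<Id> next = new Set<Id>();
        for (Id gId : toVisit) {
            if (visited.contains(gId) || !membersByGroup.containsKey(gId)) {
                continue;
            }
            visited.add(gId);
            for (GroupMember gm : membersByGroup.get(gId)) {
                if (gm.UserOrGroupId.getSObjectType() == Schema.Group.SObjectType) {
                    next.add(gm.UserOrGroupId);    // nested group: expand it on the next pass
                } else {
                    userIds.add(gm.UserOrGroupId); // a user
                }
            }
        }
        toVisit = next;
    }
    return userIds;
}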

Graph DB: get the next best recommended node in Neo4j Cypher

I have a graph in Neo4j and am currently trying to build a simple recommendation system that is better than text-based search.
Nodes are created with the labels: Album, People, Type, Chart
Relationships are created such as:
People - [:role] -> Album
where roles are: Artist, Producer, Songwriter
Album-[:is_a_type_of]->Type (type is basically Pop, Rock, Disco...)
People -[:POPULAR_ON]->Chart (Chart is which Billboard chart they might have been popular on)
People -[:SIMILAR_TO]->People (Predetermined similarity connection)
I have written the following cypher:
MATCH (a:Album { id: { id } })-[:is_a_type_of]->(t)<-[:is_a_type_of]-(recommend)
WITH recommend, t, a
MATCH (recommend)<-[:ARTIST_OF]-(p)
OPTIONAL MATCH (p)-[:POPULAR_ON]->()
RETURN recommend, count(DISTINCT t) AS type
ORDER BY type DESC
LIMIT 25;
It works; however, it easily repeats itself if an album has only one type of music connected to it and therefore has the same neighbors.
Is there a suggested way to say:
Find me the next best album that has the most similar connected relationships to the starting album.
Any recommendation for a tie-breaker scenario? Right now it is ordered by type (so an album with more than one type of music is valued more, but if every album has the same number of types, there is nothing more significant to order by).
- I made the [:SIMILAR_TO] link to enforce a priority, to mark that relationship as important, but I haven't had a working Cypher query with it.
- Same goes for [:POPULAR_ON] (maybe drop this relationship?)
You can use 4 configurations and order albums by the resulting value, higher values first. Keep each configuration weight between 0 and 1 (e.g. 0.6):
a. People Popular on Chart and People are similar
b. People Popular on Chart and People are Not similar
c. People Not Popular on Chart and People are similar
d. People Not Popular on Chart and People are Not similar
Calculate and sum these 4 values for each album. The higher the value, the more highly recommended the album.
I have temporarily set the weights to a = 1, b = 0.8, c = 0.6, d = 0.4, and assumed some relationship is present which indicates that a person likes an album. If you are basing the logic on Chart only, then use a & b only.
MATCH (me:People)
WHERE id(me) = 123
MATCH (album:Album { id: 456 })-[:is_a_type_of]->(t:Type)<-[:is_a_type_of]-(recommend)
OPTIONAL MATCH (recommend)<-[:ARTIST_OF]-(a:People)-[:POPULAR_ON]->(:Chart)
  WHERE exists((me)-[:SIMILAR_TO]->(a))
OPTIONAL MATCH (recommend)<-[:ARTIST_OF]-(b:People)-[:POPULAR_ON]->(:Chart)
  WHERE NOT exists((me)-[:SIMILAR_TO]->(b))
OPTIONAL MATCH (recommend)<-[:LIKES]-(c:People)
  WHERE exists((me)-[:SIMILAR_TO]->(c))
OPTIONAL MATCH (recommend)<-[:LIKES]-(d:People)
  WHERE NOT exists((me)-[:SIMILAR_TO]->(d))
RETURN recommend, (count(a)*1 + count(b)*0.8 + count(c)*0.6 + count(d)*0.4) AS rec_order
ORDER BY rec_order DESC
LIMIT 10;

CouchDB: Merging Objects in Reduce Function

I'm new to CouchDB, so bear with me. I searched SO for an answer, but couldn't narrow it down to this specifically.
I have a mapper function which creates values for a user. The users have seen different product pages, and we want to tally the type and products they've seen.
var emit_values = {};
emit_values.name = doc.name;
...
emit_values.productsViewed = {};
emit_values.productsViewed[doc.product] = 1
emit([doc.id, doc.customer], emit_values);
In the reduce function, I want to gather different values into that productsViewed object for that given user. So after the reduce, I have this:
productsViewed: {
    book1: 1,
    book3: 2,
    book8: 1
}
Unfortunately, doing this creates a reduce overflow error. According to the other posts, this is because the productsViewed object is growing in size in the reduce function, and Couch doesn't like that. Specifically:
A common mistake new CouchDB users make is attempting to construct complex aggregate values with a reduce function. Full reductions should result in a scalar value, like 5, and not, for instance, a JSON hash with a set of unique keys and the count of each.
So, I understand this is not the right way to do this in Couch. Does anyone have any insight into how to properly gather values into a document after reduce?
You simply build a view with the customer as the key:
emit(doc.customer, doc.product);
Then you can call
/:db/_design/:name/_view/:name?key=":customer"
to get all products a user has viewed.
If a customer can have viewed a product several times, you can build a multi-key view
emit([doc.customer, doc.product], null);
and reduce it with the built-in function _count
/:db/_design/:name/_view/:name?startkey=[":customer","\u0000"]&endkey=[":customer","\u9999"]&reduce=true&group_level=2
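With the _count reduce and group_level=2, the response then contains one row per customer/product pair, roughly like this (the values here are illustrative):

{"rows": [
    {"key": ["customer1", "book1"], "value": 1},
    {"key": ["customer1", "book3"], "value": 2},
    {"key": ["customer1", "book8"], "value": 1}
]}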
You have to accept that you cannot
construct complex aggregate values
with CouchDB by requesting the view. If you want a data structure like your desired payload
productsViewed: {
    book1: 1,
    book3: 2,
    book8: 1
}
I recommend using an _update handler on the customer doc. Every request that logs a product visit then adds a value to the customer.productsViewed property instead of creating a new doc.
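A minimal sketch of such an update handler (the handler name, doc field and query parameter are just assumptions here):

function (doc, req) {
    // called as e.g. POST /:db/_design/:name/_update/log_view/:customer_doc_id?product=book1
    var product = req.query.product;
    if (!doc.productsViewed) {
        doc.productsViewed = {};
    }
    doc.productsViewed[product] = (doc.productsViewed[product] || 0) + 1;
    return [doc, 'logged view of ' + product];
}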

How to iterate maps in insertion order?

I have a navbar as a map:
var navbar = map[string]navbarTab{
}
Where navbarTab has various properties, child items and so on. When I try to render the navbar (with for tabKey := range navbar) it shows up in a random order. I'm aware that range iterates in a random order, but there appears to be no way to get an ordered list of keys or to iterate in insertion order.
The playground link is here: http://play.golang.org/p/nSL1zhadg5 although it seems to not exhibit the same behavior.
How can I iterate over this map without breaking the insertion order?
The general concept of the map data structure is that it is a collection of key-value pairs. "Ordered" or "sorted" is nowhere mentioned.
Wikipedia definition:
In computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears just once in the collection.
The map is one of the most useful data structures in computer science, so Go provides it as a built-in type. However, the language specification only specifies a general map (Map types):
A map is an unordered group of elements of one type, called the element type, indexed by a set of unique keys of another type, called the key type. The value of an uninitialized map is nil.
Note that the language specification not only leaves out the words "ordered" or "sorted", it explicitly states the opposite: "unordered". But why? Because this gives greater freedom to the runtime to implement the map type. The language specification allows the use of any map implementation, like a hash map, tree map etc. Note that the current (and previous) versions of Go use a hash map implementation, but you don't need to know that to use it.
The blog post Go maps in action is a must-read regarding this question.
Before Go 1, when a map was not changed, the runtime returned the keys in the same order when you iterated over its keys/entries multiple times. Note that this order could have changed if the map was modified, as the implementation might have needed to do a rehash to accommodate more entries. People started to rely on the same iteration order (when the map was not changed), so starting with Go 1 the runtime randomizes map iteration order on purpose, to get the attention of developers that the order is not defined and can't be relied on.
What to do then?
If you need a sorted dataset (be it a collection of key-value pairs or anything else), whether by insertion order, the natural order defined by the key type, or an arbitrary order, a map is not the right choice. If you need a predefined order, slices (and arrays) are your friends. And if you need to be able to look up the elements by a predefined key, you may additionally build a map from the slice to allow fast look-up of the elements by key.
Whether you build the map first and then a slice in the proper order, or the slice first and then build a map from it, is entirely up to you.
The aforementioned Go maps in action blog post has a section dedicated to Iteration order:
When iterating over a map with a range loop, the iteration order is not specified and is not guaranteed to be the same from one iteration to the next. Since Go 1 the runtime randomizes map iteration order, as programmers relied on the stable iteration order of the previous implementation. If you require a stable iteration order you must maintain a separate data structure that specifies that order. This example uses a separate sorted slice of keys to print a map[int]string in key order:
import "sort"
var m map[int]string
var keys []int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
fmt.Println("Key:", k, "Value:", m[k])
}
P.S.:
...although it seems to not exhibit the same behavior.
You seemingly see the "same iteration order" on the Go Playground because the output of applications/code run on the Go Playground is cached. Once a new, not-yet-seen piece of code is executed, its output is saved. When the same code is executed again, the saved output is presented without running the code again. So basically it's not the same iteration order that you see, it's the exact same output, served without executing any of the code again.
P.S. #2
Although with for range the iteration order is "random", there are notable exceptions in the standard library that do process maps in sorted key order, namely the encoding/json, text/template, html/template and fmt packages. For more details, see In Golang, why are iterations over maps random?
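For example, encoding/json marshals a map's keys in sorted order, even though iterating over the same map with for range is randomized:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    m := map[string]int{"b": 2, "c": 3, "a": 1}
    out, err := json.Marshal(m)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(out)) // always {"a":1,"b":2,"c":3}, regardless of range order
}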
Go maps do not maintain the insertion order; you will have to implement this behavior yourself.
Example:
type NavigationMap struct {
    m    map[string]navbarTab
    keys []string
}

func NewNavigationMap() *NavigationMap { ... }

func (n *NavigationMap) Set(k string, v navbarTab) {
    n.m[k] = v
    n.keys = append(n.keys, k)
}
This example is not complete and does not cover all use-cases (e.g. updating insertion order on duplicate keys).
If your use-case includes re-inserting the same key multiple times (this will not update insertion order for key k if it was already in the map):
func (n *NavigationMap) Set(k string, v navbarTab) {
    _, present := n.m[k]
    n.m[k] = v
    if !present {
        n.keys = append(n.keys, k)
    }
}
Choose the simplest thing that satisfies your requirements.
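For completeness, a small runnable sketch of the idea (navbarTab reduced to a single field here just for illustration):

package main

import "fmt"

type navbarTab struct {
    Title string
}

type NavigationMap struct {
    m    map[string]navbarTab
    keys []string
}

func NewNavigationMap() *NavigationMap {
    return &NavigationMap{m: make(map[string]navbarTab)}
}

func (n *NavigationMap) Set(k string, v navbarTab) {
    if _, present := n.m[k]; !present {
        n.keys = append(n.keys, k)
    }
    n.m[k] = v
}

func main() {
    nav := NewNavigationMap()
    nav.Set("home", navbarTab{Title: "Home"})
    nav.Set("blog", navbarTab{Title: "Blog"})
    nav.Set("about", navbarTab{Title: "About"})
    for _, k := range nav.keys { // iterate the keys slice, not the map
        fmt.Println(k, "->", nav.m[k].Title) // always in insertion order
    }
}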
