Gremlin - avoid traversing already seen edges


I have the following model, which I'd like to represent as a graph in Azure Cosmos DB.
A user can be in multiple groups. A user can have multiple permissions attached, and a group can also have multiple permissions attached.
I want an efficient query that, starting from a user, returns all the permissions attached to it (either directly or via a group).
One thing to add is that user and group may be assigned to the same permission (and I want to get it just once).
I came up with the query:
g.V().hasLabel('user').has('userid', '0_2147483647').repeat(out().simplePath()).until(hasLabel('permission'))
This query is not very efficient when there is a lot of data, so the question is: can we make it better?

I don't see a reason to use repeat() here as the depth of your traversal is known. I would just do:
g.V().has('user', 'userid', '0_2147483647').
  union(out('has'),
        out('isingroup').out('has')).
  dedup()
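To see why a fixed-depth union followed by dedup() returns each permission only once, here is a minimal in-memory sketch in Python. It does not talk to Cosmos DB. The sample vertices and the helper functions are made up for illustration, and the edge labels 'has' and 'isingroup' are simply the ones assumed in the query above.

```python
# Hypothetical sample data: (source vertex, edge label) -> target vertices.
EDGES = {
    ("user:alice", "isingroup"): ["group:admins", "group:staff"],
    ("user:alice", "has"): ["perm:read"],
    ("group:admins", "has"): ["perm:read", "perm:write"],
    ("group:staff", "has"): ["perm:read"],
}


def out(vertices, label):
    """Follow outgoing edges with the given label, like Gremlin's out(label)."""
    result = []
    for vertex in vertices:
        result.extend(EDGES.get((vertex, label), []))
    return result


def permissions_for(user):
    """Mimic union(out('has'), out('isingroup').out('has')).dedup()."""
    direct = out([user], "has")
    via_groups = out(out([user], "isingroup"), "has")
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(direct + via_groups))


print(permissions_for("user:alice"))
```

Each branch walks at most two hops, so there is no loop to guard against, and the final deduplication plays the role of dedup().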

Related

Capture path of all accessed nodes in arbitrary Gremlin query

Assuming I have an arbitrary Gremlin query I don't control as input, and a graph database that I run it against, how can I capture the paths of all accessed nodes in the graph, as in, how can I see what parts of the graph are needed by an arbitrary query?
Clarification:
If I run the arbitrary query, how can I capture all the data accessed while it runs, not just the result?

Different databases may have explain plan options that give some insight into how a query will run but really the only way to know what a Gremlin query is going to need to visit in the graph is to run it. If you know the schema of the graph you could potentially write some code that analyzes the query to look at the various steps and labels used to make an estimate of what the query will touch but I am not aware of any existing tools that do that.
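As a rough illustration of the "analyze the query" idea, the sketch below scans a Gremlin query string with regular expressions and collects the labels it mentions. This is only an assumption about how such a tool might start: it is not an existing library, the step lists are incomplete, and it only sees labels written as string literals, so it estimates what the query names rather than what the query actually visits.

```python
import re

# Steps whose string arguments name edge labels or element labels.
# Both sets are assumptions for this sketch and are far from complete.
EDGE_STEPS = {"out", "in", "both", "outE", "inE", "bothE"}
LABEL_STEPS = {"hasLabel"}

# Matches a step call whose argument list contains no nested parentheses.
STEP_RE = re.compile(r"\.?(\w+)\(([^()]*)\)")
STRING_RE = re.compile(r"'([^']*)'")


def estimate_touched(query):
    """Return (element_labels, edge_labels) named in a Gremlin query string."""
    element_labels, edge_labels = set(), set()
    for step, args in STEP_RE.findall(query):
        literals = STRING_RE.findall(args)
        if step in EDGE_STEPS:
            edge_labels.update(literals)
        elif step in LABEL_STEPS:
            element_labels.update(literals)
    return element_labels, edge_labels


query = "g.V().hasLabel('user').union(out('has'), out('isingroup').out('has')).dedup()"
print(estimate_touched(query))
```

A step such as out() with no label would have to be treated as "any edge label", which is where knowledge of the schema comes in.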

GraphDB account modeling: user access relationship attribute or relationship?

I am attempting to model account access in a graph DB.
The account can have multiple users and multiple features. A user can have access to many accounts. Each account can give access to only part of the features for each user.
One way I see it is to represent access for each user through relationship attributes, this allows having a shared feature node.
user_1 has access to account_1-feature_1 and account_2-feature_2. user_1 does not have access to account_1-feature_2 even though it is enabled for the account.
Another way to model the same access, but without relationship attribute is to create account specific feature nodes.
Question 1: which of these 2 ways is a more 'proper' modeling in the graph DB world?
Now to make things more interesting the account can also have parts which can be accessed by multiple accounts and a certain feature should be able to be scoped down to only be accessible for specific part by user.
In this example user_1 can access account_1 only for part_a feature_1.
To me it seems like defining an attribute on the relationship is the way to go for scoping down user access by feature and by part of the account. However, according to Neo4j presentations, relationships having "Lots of attribute-like properties" is a code smell. Is there a better way to approach such a problem in a graph?

I could be wrong here, but here are my thoughts. Option 1 definitely sounds like the better way from a modeling perspective; however, I don't see how you can keep the data consistent without building heavy machinery to do it. For example, if someone deletes Account1.Feature1 and does not update the edge from User1 -> Account1, you end up with stale RBAC rules in the system. You think you have access to something, but in reality that "thing" does not exist anymore. Option 2 may not seem very attractive from a data model perspective, but it does keep your data consistent. If you delete Account1.Feature1, the edge is automatically deleted in the same transaction.
The only con is the additional cost at insertion, since you need to insert a lot more nodes than with Option 1. For an RBAC system, I think it's a fair compromise.
The same comment applies to the second half of your question as well.
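The consistency argument for Option 2 can be sketched with a toy in-memory graph in Python. The class, the edge labels HAS_FEATURE and HAS_ACCESS, and the node names are hypothetical, and whether a real database removes incident edges for you when a node is deleted depends on the product.

```python
class Graph:
    """Toy graph that stores edges as (source, label, target) tuples."""

    def __init__(self):
        self.nodes = set()
        self.edges = set()

    def add_edge(self, source, label, target):
        self.nodes.update((source, target))
        self.edges.add((source, label, target))

    def delete_node(self, node):
        """Remove a node together with every edge that touches it."""
        self.nodes.discard(node)
        self.edges = {e for e in self.edges if node not in (e[0], e[2])}

    def can_access(self, user, feature_node):
        return (user, "HAS_ACCESS", feature_node) in self.edges


g = Graph()
# Option 2: the feature node is specific to the account.
g.add_edge("account_1", "HAS_FEATURE", "account_1.feature_1")
g.add_edge("user_1", "HAS_ACCESS", "account_1.feature_1")
before = g.can_access("user_1", "account_1.feature_1")

g.delete_node("account_1.feature_1")
after = g.can_access("user_1", "account_1.feature_1")
print(before, after)
```

Because the access rule is an edge to the account-specific node, removing that node leaves no stale rule behind.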

Gremlin query to get in and out edges for a given Vertex

I’m just playing with the Graph API in Cosmos DB, which uses the Gremlin syntax for queries.
I have a number of users (vertices) in the graph, and each has ‘knows’ edges to other users. Some of these are out edges (outE) and others are in edges (inE), depending on how the relationship was created.
I’m now trying to create a query which will return all ‘knows’ relationships for a given user (Vertex).
I can easily get the ID of either inE or outE via:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').inE('knows')
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').outE('knows')
where '7112138f-fae6-4272-92d8-4f42e331b5e1' is the Id of the user I’m querying, but I don’t know ahead of time whether this is an in or out edge, so want to get both (e.g. if the user has in and out edges with the ‘knows’ label).
I’ve tried using a projection and OR operator and various combinations of things e.g.:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').where(outE('knows').or().inE('knows'))
but it's not getting me back the data I want.
All I want out is a list of the IDs of all inE and outE that have the label ‘knows’ for a given vertex.
Or is there a simpler/better way to model bi-directional associations such as ‘knows’ or ‘friendOf’?
Thanks

You can use the bothE step in this case:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').bothE('knows')
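For intuition, bothE('knows') selects every 'knows' edge that either starts or ends at the vertex, whereas a where(...) step only filters the vertex and passes the vertex itself through. A small in-memory sketch in Python, with made-up edge records and a hypothetical helper both_e:

```python
# Hypothetical edges: (edge_id, label, from_vertex, to_vertex).
EDGES = [
    ("e1", "knows", "alice", "bob"),
    ("e2", "knows", "carol", "alice"),
    ("e3", "likes", "alice", "carol"),
    ("e4", "knows", "bob", "carol"),
]


def both_e(vertex, label):
    """Return ids of edges with this label that start or end at vertex."""
    return [
        edge_id
        for edge_id, edge_label, source, target in EDGES
        if edge_label == label and vertex in (source, target)
    ]


print(both_e("alice", "knows"))
```

Here e1 is an out edge and e2 is an in edge for alice, and both come back from a single call.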

Firebase: How flat should my data structure be?

I'm building an app that tracks the user's location and updates Firebase. I've read the documentation about structuring data but still have a few questions.
I'm considering structuring the data in one of two ways, but can't determine which one.
users
$id
-position
-other attr
vs:
user_position
$id
users
$id
-other attr.
In what scenario would the first design work best, and in what scenario the second?

If you only keep one position per user (as seems to be the case, since you use the singular user_position), there is no useful difference between the two structures. A user's position in that case is just another attribute, one that happens to have two values (lat and lon).
But if you want to keep multiple positions per user, then your first structure is mixing entity types: users and user_positions. This is an anti-pattern when it comes to Firebase Database.
The two most common reasons are:
Say you want to show a list of user names (or any specific, single-value attribute). With the first structure you will also need to read the list of all positions of all users, just to get the list of names. With the second structure, you just read the user's attributes. If that is still much more data than you need, consider also keeping a list of /user_names for optimal read performance.
Many developers end up wanting different access rules for the user positions and the other user attributes. In the first structure that is only possible by pushing the read permission from the top /users down to lower in the tree. In the second structure, you can just give separate permissions to /users and /user_positions.
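The first reason can be made concrete with plain Python dictionaries standing in for the JSON tree. All keys and values below are invented, and the serialized size is only a rough stand-in for what a client would download when it reads /users, on the assumption that reading a node returns its whole subtree.

```python
import json

# First structure: positions nested under each user (hypothetical data).
nested = {
    "users": {
        "u1": {
            "name": "Ann",
            "positions": {
                "p1": {"lat": 1.0, "lon": 2.0},
                "p2": {"lat": 1.1, "lon": 2.1},
            },
        },
        "u2": {"name": "Bob", "positions": {"p1": {"lat": 5.0, "lon": 6.0}}},
    }
}

# Second structure: users and their positions in separate top-level lists.
flat = {
    "users": {"u1": {"name": "Ann"}, "u2": {"name": "Bob"}},
    "user_positions": {
        "u1": {"p1": {"lat": 1.0, "lon": 2.0}, "p2": {"lat": 1.1, "lon": 2.1}},
        "u2": {"p1": {"lat": 5.0, "lon": 6.0}},
    },
}


def names(users_node):
    """List user names from whatever was read at /users."""
    return sorted(user["name"] for user in users_node.values())


# Size of the /users subtree in each structure.
nested_bytes = len(json.dumps(nested["users"]))
flat_bytes = len(json.dumps(flat["users"]))
print(names(flat["users"]), flat_bytes, nested_bytes)
```

Both structures yield the same names, but the nested one drags every position along with them.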

Storing multiple graphs in Neo4J

I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011' or (assuming that each contact is part of a group) 'which group has the most number of connections to other groups' etc.
I quickly found that the SQL for generating these stats became unwieldy real fast.
So I wrote a script that generates a graph in memory for any given date. I could then run whatever stat I wanted against that graph. This was much easier to understand and, in general, much more performant too, except for the graph generation step.
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: eg for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached which worked great until the graphs grew > 1 MB.
So now I'm thinking about using a graph database like Neo4J.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4J and retrieve/interact with them separately? I would then create and store separate social graphs for each date.
or
add valid-to and valid-from timestamps to each edge and filter the graph appropriately: so if I wanted a graph for "May 1st" I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st then those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.

Right now you can store just one graph database in a single Neo4j instance, but this one graphdb can contain as many different sub-graphs as you like. You only have to keep that in mind when doing global operations (like index queries) but there you can do compound queries that include timestamped properties as well to limit the results.
One way of doing that is, as you said, adding temporal information to the edges. To see the structure of the graph on a given date, you then traverse only the edges that were valid on that date.
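A minimal sketch of that filtering in plain Python, independent of Neo4j. The field names valid_from and valid_to and the sample rows are assumptions for illustration, loosely following the (contact_id, other_contact_id, strength, recorded_at) table from the question.

```python
from datetime import date

# Hypothetical edge records:
# (contact_id, other_contact_id, strength, valid_from, valid_to).
# A valid_to of None means the edge is still current.
EDGES = [
    ("a", "b", 2, date(2011, 1, 5), date(2011, 3, 1)),
    ("a", "b", 4, date(2011, 3, 1), None),
    ("b", "c", 3, date(2011, 6, 1), None),
]


def snapshot(edges, as_of):
    """Return the edges that were valid on the given date."""
    return [
        (src, dst, strength)
        for src, dst, strength, valid_from, valid_to in edges
        if valid_from <= as_of and (valid_to is None or as_of < valid_to)
    ]


print(snapshot(EDGES, date(2011, 5, 1)))
```

For May 1st only the newer a-b edge is returned, and b-c does not exist yet, which matches the behaviour described in the question.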
Reference node has a different meaning in Neo4j.
Using category nodes per day (and linking them and also aggregating them for higher level timespans) is the more graphy way of categorizing nodes than indexed properties. (Effectively these are in-graph indices that you can easily include in your traversals and graph queries).
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes also differ over time (e.g. changing properties), you could either duplicate them, effectively creating different subgraphs, or create a connected list of history nodes on each node that contain just the changes (or the full snapshot, depending on your requirements).
Your domain sounds very fitting for the graph database. If you have more and detailed questions feel free to join the Neo4j mailing list.

Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it in memory for the query, and closes it after you get your answer. You could also configure a proxy server and send two parameters to your loader: your query (which I presume is a Cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer. But if it is simply for storing data sets and doing some analytics over them, it can definitely answer your needs.

This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).
