Neo4j Match and Create takes too long in a 10000 node graph - graph

I have a data model like this:
Person node
Email node
OWNS relationship
LISTS relationship
KNOWS relationship
each Person can OWN one Email and LISTS multiple Emails (like a contact list, 200 contacts is assumed per Person).
The query I am trying to perform is finding all the Persons that OWN an Email that a Contact LISTS and create a KNOWS relationship between them.
MATCH (n:Person {uid:'123'}) -[r1:LISTS]-> (m:Email) <-[r2:OWNS]- (l:Person)
CREATE UNIQUE (n)-[:KNOWS]->[l]
The counts of my current database is as follows:
Number of Person nodes: 10948
Number of Email nodes: 1951481
Number of OWNS rels: 21882
Number of LISTS rels: 4376340 (Each Person has 200 unique LISTS rels)
Now my problem is that running the said query on this current database takes something between 4.3 to 4.8 seconds which is unacceptable for my need. I wanted to know if this is normal timing considering my data model or am I doing something wrong with the query (or even model).
Any help would be much appreciated. Also if this is normal for Neo4j please feel free to suggest other graph databases that can handle this kind of model better.
Thank you very much in advance
UPDATE:
My query is: profile match (n: {uid: '4692'}) -[:LISTS]-> (:Email) <-[:OWNS]- (l) create unique (n)-[r:KNOWS]->(l)
The PROFILE command on my query returns this:
Cypher version: CYPHER 2.2, planner: RULE. 3919222 total db hits in 2713 ms.

Yes, 4.5 seconds to match one person from index along with its <=100 listed email addresses and merging a relationship from user to the single owner of each email, is slow.
The first thing is to make sure you have an index for uid property on nodes with :Person label. Check your indices with SCHEMA command and if missing create such an index with CREATE INDEX ON :Person(uid).
Secondly, CREATE UNIQUE may or may not do the work fine, but you will want to use MERGE instead. CREATE UNIQUE is deprecated and though they are sometimes equivalent, the operation you want performed should be expressed with MERGE.
Thirdly, to find out why the query is slow you can profile it:
PROFILE
MATCH (n:Person {uid:'123'})-[:LISTS]->(m:Email)<-[:OWNS]-(l:Person)
MERGE (n)-[:KNOWS]->[l]
See 1, 2 for details. You may also want to profile your query while forcing the use of one or other of the cost and rule based query planners to compare their plans.
CYPHER planner=cost
PROFILE
MATCH (n:Person {uid:'123'})-[:LISTS]->(m:Email)<-[:OWNS]-(l:Person)
MERGE (n)-[:KNOWS]->[l]
With these you can hopefully find and correct the problem, or update your question with the information to help others help you find it.

Related

Querying details from GraphDB

We are trying to implement Customer oriented details in Graphdb, were with a single query we can fetch the details of a customer such as his address,phone,email etc. We have build it using had address, has email edges..
g.addV('member').property('id','CU10611972').property('CustomerId', 'CU10611972').property('TIN', 'xxxx').property('EntityType', 'Person').property('pk', 'pk')
g.addV('email').property('id','CU10611972E').property('pk', 'pk')
g.addV('primary').property('id','CU10611972EP').property('EmailPreference','Primary').property('EmailType', 'Home').property('EmailAddress', 'SNEHA#GMAIL.COM').property('pk', 'pk')
g.V('CU10611972').addE('has Email').to(g.V('CU10611972E'))
g.V('CU10611972E').addE('has Primary Email').to(g.V('CU10611972EP')
This is how we have build email relation to the customer.. Similarly we have relations with Address and Phone. So right now we are using this command to fetch the json related to this customer for email,
g.V('CU10611972').out('has Email').out('has Primary Email')
And for complete Customer details we are using union for each Vertex, Phone,Emaiul and address..
Could you please suggest if there is an efficient way to query this detail?
This comes down really to two things.
General graph data modelling
Things the graph DB you are using does and does not support.
With Gremlin there are a few ways to model this data for a single vertex.
If the database supports it, have a list of names like ['home','mobile'] and use metaproperties to attach a phone number to each.
A lot of the Gremlin implementations I am aware of have chosen not to support meta properties. In these cases you have a couple of options.
(a) Have a property for 'Home' and another for 'Mobile'. If either is not known you could either not create that property or give it a value such as "unknown"
(b) Use prefixed strings such as ["Home:123456789","Mobile:123456789] and store them in a set or list (multi properties) and access them in Gremlin using the startingWith predicate. Such as g.V(id).properties('phone').hasValue(startingWith('Mobile')).value()

How can I use a gremlin query to filter based on a users permissions?

I am fairly new to graph databases, however I have used SQL Server and document databases (Lucene, DocumentDb, etc.) extensively. It's completely possible that I am approaching this query the wrong way, since I am new to graph databases. I am trying to convert some logic to a graph database (CosmosDB Graph via Gremlins to be specific) that we currently are using SQL Server for. The reason for the change is that this problem set is not really what SQL Server is great at and so our SQL query (which we have optimized as good as we can) is really starting to be the hot spot of our application.
To give a very brief overview of our logic, we run a web shop that allows admins to configure products and users with several levels of granular permissions (described below). Based on these permissions, we show the user only the products they are allowed to see.
Entities:
Region: A region consists of multiple countries
Country: A country has many markets and many regions
Market: A market is a group of stores in a single country
Store: A store is belongs to a single market
Users have the following set of permissions and each set can contain multiple values:
can-view-region
can-view-country
can-view-market
can-view-store
Products have the following set of permissions and each set can contain multiple values:
visible-to-region
visible-to-country
visible-to-market
visible-to-store
After trying for a few days, this is the query that I have come up with. This query does work and returns the correct products for the given user, however it takes about 25 seconds to execute.
g.V().has('user','username', 'john.doe').union(
__.out('can-view-region').out('contains-country').in('in-market').hasLabel('store'),
__.out('can-view-country').in('in-market').hasLabel('store'),
__.out('can-view-market').in('in-market').hasLabel('store'),
__.out('can-view-store')
).dedup().union(
__.out('in-market').in('contains-country').in('visible-to-region').hasLabel('product'),
__.out('in-market').in('visible-to-country').hasLabel('product'),
__.out('in-market').in('visible-to-market').hasLabel('product'),
__.in('visible-to-store').hasLabel('product')
).dedup()
Is there a better way to do this? Is this problem maybe not best suited with a graph database?
Any help would be greatly appreciated!
Thanks,
Chris
I don't think this is going to help a lot, but here's an improved version of your query:
g.V().has('user','username', 'john.doe').union(
__.out('can-view-region').out('contains-country').in('in-market').hasLabel('store'),
__.out('can-view-country','can-view-market').in('in-market').hasLabel('store'),
__.out('can-view-store')
).dedup().union(
__.out('in-market').union(
__.in('contains-country').in('visible-to-region'),
__.in('visible-to-country','visible-to-market')).hasLabel('product'),
__.in('visible-to-store').hasLabel('product')
).dedup()
I wonder if the hasLabel() checks are really necessary. If, for example, .in('in-market') can only lead a store vertex, then remove the extra check.
Furthermore it might be worth to create shortcut edges. This would increase write times whenever you mutate the permissions, but should significantly increase the read times for the given query. Since the reads are likely to occur way more often than permission updates, this might be a good trade-off.
CosmosDB Graph team is looking into improvements that can done on union step in particular.
Other options that haven't already been suggested:
Reduce the number of edges that are traversed per hop with additional predicates. e.g:
g.V('1').outE('market').has('prop', 'value').inV()
Would it be possible to split the traversal up and do parallel request in your client code? Since you are using .NET, you could take each result in first union, and execute parallel requests for the traversals in the second union. Something like this (untested code):
string firstUnion = #"g.V().has('user','username', 'john.doe').union(
__.out('can-view-region').out('contains-country').in('in-market').hasLabel('store'),
__.out('can-view-country').in('in-market').hasLabel('store'),
__.out('can-view-market').in('in-market').hasLabel('store'),
__.out('can-view-store')
).dedup()"
string[] secondUnionTraversals = new[] {
"g.V({0}).out('in-market').in('contains-country').in('visible-to-region').hasLabel('product')",
"g.V({0}).out('in-market').in('visible-to-country').hasLabel('product')",
"g.V({0}).out('in-market').in('visible-to-market').hasLabel('product')",
"g.V({0}).in('visible-to-store').hasLabel('product')",
};
var response = client.CreateGremlinQuery(col, firstUnion);
while (response.HasMoreResults)
{
var results = await response.ExecuteNextAsync<Vertex>();
foreach (Vertex v in results)
{
Parallel.ForEach(secondUnionTraversals, (traversal) =>
{
var secondResponse = client.CreateGremlinQuery<Vertex>(col, string.Format(traversal, v.Id));
while (secondResponse.HasMoreResults)
{
concurrentColl.Add(secondResponse);
}
});
}
}

How to obtain list of common users from different groups in firebase DB

Is there any way I can find if users is present in both the groups here: user 1 so that notification/data can be sent to only that set of common users only?
As DB grows I think it will be inefficient to check if every user in one group is present in another or not.
Yes there is. You can create a list of users from GroupA, then create another list of users from GroupsB and then just simply use this line of code using Java8:
!Collections.disjoint(list1, list2);

Firebase query for bi-directional link

I'm designing a chat app much like Facebook Messenger. My two current root nodes are chats and users. A user has an associated list of chats users/user/chats, and the chats are added by autoID in the chats node chats/a151jl1j6. That node stores information such as a list of the messages, time of the last message, if someone is typing, etc.
What I'm struggling with is where to make the definition of which two users are in the chat. Originally, I put a reference to the other user as the value of the chatId key in the users/user/chats node, but I thought that was a bad idea incase I ever wanted group chats.
What seems more logical is to have a chats/chat/members node in which I define userId: true, user2id: true. My issue with this is how to efficiently query it. For example, if the user is going to create a new chat with a user, we want to check if a chat already exists between them. I'm not sure how to do the query of "Find chat where members contains currentUserId and friendUserId" or if this is an efficient denormalized way of doing things.
Any hints?
Although the idea of having ids in the format id1---||---id2 definitely gets the job done, it may not scale if you expect to have large groups and you have to account for id2---||---id1 comparisons which also gets more complicated when you have more people in a conversation. You should go with that if you don't need to worry about large groups.
I'd actually go with using the autoId chats/a151jl1j6 since you get it for free. The recommended way to structure the data is to make the autoId the key in the other nodes with related child objects. So chats/a151jl1j6 would contain the conversation metadata, members/a151jl1j6 would contain the members in that conversation, messages/a151jl1j6 would contain the messages and so on.
"chats":{
"a151jl1j6":{}}
"members":{
"a151jl1j6":{
"user1": true,
"user2": true
}
}
"messages":{
"a151jl1j6":{}}
The part where this gets is little "inefficient" is the querying for conversations that include both user1 and user2. The recommended way is to create an index of conversations for each user and then query the members data.
"user1":{
"chats":{
"a151jl1j6":true
}
}
This is a trade-off when it comes to querying relationships with a flattened data structure. The queries are fast since you are only dealing with a subset of the data, but you end up with a lot of duplicate data that need to be accounted for when you are modifying/deleting i.e. when the user leaves the chat conversation, you have to update multiple structures.
Reference: https://firebase.google.com/docs/database/ios/structure-data#flatten_data_structures
I remember I had similar issue some time ago. The way how I solved it:
user 1 has an unique ID id1
user 2 has an unique ID id2
Instead of adding a new chat by autoId chats/a151jl1j6 the ID of the chat was id1---||---id2 (superoriginal human-readable delimeter)
(which is exactly what you've originally suggested)
Originally, I put a reference to the other user as the value of the chatId key in the users/user/chats node, but I thought that was a bad idea in case I ever wanted group chats.
There is a saying: https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it
There might a limitation of how many userIDs can live in the path - you can always hash the value...

Riak and index - how to check the index a particular bucket has?

I am very much new to riak and I am currently using it with an API. So, let me explain my scenario.
There is an account - say example#riaktest.com created under accounts bucket.
So, on doing
localhost:8098/buckets/accounts/keys?keys=true it returns {{"keys":["example#riaktest.com"]}}
So, i basically have one 'key' with other data fields inside.
Now, I have a functionality that a user can join a particular group. Lets say, 'group1'.
He logs in, joins group1. I give the snowflake generated ID of the account as the index.
Question is:
How do I see a particular key's list of index?
localhost:8098/riak/group/group1 will give me the values and the main account it is tied to.

Resources