How can I use a gremlin query to filter based on a users permissions? - gremlin

I am fairly new to graph databases, though I have used SQL Server and document databases (Lucene, DocumentDB, etc.) extensively, so it's entirely possible that I am approaching this query the wrong way. I am trying to move some logic to a graph database (CosmosDB Graph via Gremlin, to be specific) that we currently use SQL Server for. The reason for the change is that this problem set is not really what SQL Server is great at, and our SQL query (which we have optimized as well as we can) is really starting to be the hot spot of our application.
To give a very brief overview of our logic, we run a web shop that allows admins to configure products and users with several levels of granular permissions (described below). Based on these permissions, we show the user only the products they are allowed to see.
Entities:
Region: A region consists of multiple countries
Country: A country has many markets and many regions
Market: A market is a group of stores in a single country
Store: A store belongs to a single market
Users have the following set of permissions and each set can contain multiple values:
can-view-region
can-view-country
can-view-market
can-view-store
Products have the following set of permissions and each set can contain multiple values:
visible-to-region
visible-to-country
visible-to-market
visible-to-store
After trying for a few days, this is the query that I have come up with. This query does work and returns the correct products for the given user, however it takes about 25 seconds to execute.
g.V().has('user','username', 'john.doe').union(
__.out('can-view-region').out('contains-country').in('in-market').hasLabel('store'),
__.out('can-view-country').in('in-market').hasLabel('store'),
__.out('can-view-market').in('in-market').hasLabel('store'),
__.out('can-view-store')
).dedup().union(
__.out('in-market').in('contains-country').in('visible-to-region').hasLabel('product'),
__.out('in-market').in('visible-to-country').hasLabel('product'),
__.out('in-market').in('visible-to-market').hasLabel('product'),
__.in('visible-to-store').hasLabel('product')
).dedup()
Is there a better way to do this? Is this problem maybe not best suited with a graph database?
Any help would be greatly appreciated!
Thanks,
Chris

I don't think this is going to help a lot, but here's an improved version of your query:
g.V().has('user','username', 'john.doe').union(
__.out('can-view-region').out('contains-country').in('in-market').hasLabel('store'),
__.out('can-view-country','can-view-market').in('in-market').hasLabel('store'),
__.out('can-view-store')
).dedup().union(
__.out('in-market').union(
__.in('contains-country').in('visible-to-region'),
__.in('visible-to-country','visible-to-market')).hasLabel('product'),
__.in('visible-to-store').hasLabel('product')
).dedup()
I wonder if the hasLabel() checks are really necessary. If, for example, .in('in-market') can only lead to a store vertex, then you can remove the extra check.
Furthermore, it might be worth creating shortcut edges. This would increase write times whenever you mutate permissions, but should significantly reduce read times for the given query. Since reads are likely to occur far more often than permission updates, this could be a good trade-off.
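To sketch the shortcut-edge idea (the 'can-view-store-direct' label below is an assumption, not part of your model): a maintenance traversal, re-run whenever permissions change, could materialize direct user-to-store edges, after which the read query starts from those edges instead of repeating the permission hops:

```gremlin
// On permission change: materialize user -> store shortcut edges
// ('can-view-store-direct' is a hypothetical label)
g.V().has('user','username','john.doe').as('u').
  union(
    __.out('can-view-region').out('contains-country').in('in-market').hasLabel('store'),
    __.out('can-view-country','can-view-market').in('in-market').hasLabel('store'),
    __.out('can-view-store')).
  dedup().
  addE('can-view-store-direct').from('u')

// Read path: one hop to stores, then on to products
g.V().has('user','username','john.doe').
  out('can-view-store-direct').
  union(
    __.out('in-market').union(
      __.in('contains-country').in('visible-to-region'),
      __.in('visible-to-country','visible-to-market')),
    __.in('visible-to-store')).
  hasLabel('product').dedup()
```

The write-side traversal then becomes part of whatever mutates permissions, which is the trade-off described above.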

The CosmosDB Graph team is looking into improvements that can be done on the union step in particular.
Other options that haven't already been suggested:
Reduce the number of edges that are traversed per hop with additional predicates, e.g.:
g.V('1').outE('market').has('prop', 'value').inV()
Would it be possible to split the traversal up and make parallel requests in your client code? Since you are using .NET, you could take each result from the first union and execute parallel requests for the traversals in the second union. Something like this (untested code):
string firstUnion = @"g.V().has('user','username', 'john.doe').union(
__.out('can-view-region').out('contains-country').in('in-market').hasLabel('store'),
__.out('can-view-country').in('in-market').hasLabel('store'),
__.out('can-view-market').in('in-market').hasLabel('store'),
__.out('can-view-store')
).dedup()";
string[] secondUnionTraversals = new[] {
"g.V({0}).out('in-market').in('contains-country').in('visible-to-region').hasLabel('product')",
"g.V({0}).out('in-market').in('visible-to-country').hasLabel('product')",
"g.V({0}).out('in-market').in('visible-to-market').hasLabel('product')",
"g.V({0}).in('visible-to-store').hasLabel('product')",
};
var response = client.CreateGremlinQuery(col, firstUnion);
while (response.HasMoreResults)
{
var results = await response.ExecuteNextAsync<Vertex>();
foreach (Vertex v in results)
{
Parallel.ForEach(secondUnionTraversals, (traversal) =>
{
var secondResponse = client.CreateGremlinQuery<Vertex>(col, string.Format(traversal, v.Id));
while (secondResponse.HasMoreResults)
{
// Drain each page of results; calling ExecuteNextAsync advances the cursor
var secondResults = secondResponse.ExecuteNextAsync<Vertex>().Result;
foreach (var secondVertex in secondResults)
{
concurrentColl.Add(secondVertex);
}
}
});
}
}


How to model complex relational data in Firestore while limiting composite indexes?

First of all thank you to anybody reading through this and offering any advice and help. It is much appreciated.
I'm developing a small custom CRM (ouch) for my father's business (specialty contractor) and I'm using Firestore for my database. It is supposed to be very lean, with not much "bling", but streamlined to his specialty contracting business, which is very hard to fit any other custom CRM to. I have gotten quite far and have a decent-sized implementation, but am now running into some very fundamental issues as everything is expanding.
I admit that having experience only with relational databases (and not much of that, either) left me scratching my head a few times when setting up my database structure, and I am running into some issues with Firestore. I'm also a fairly novice developer, and I feel I'm tackling something that is way out of my league (but there's not much turning around now, being a year into this journey).
As of right now I'm using Top Level Collections for what I am presenting here. I recently started using Sub-Collections for some other minor features and started questioning if I should apply that for everything.
A big problem that I foresee is because I want to query in a multitude of ways, I am already consuming almost 100 composite indexes at this time. There is still lots to add, so I need to reduce the amount of composite indexes that my current and future data structure needs.
So I am somewhat certain that my data model is probably deeply flawed and needs to be improved/optimized/changed. (Which I don't mind doing, if that's what it takes, but I'm lost on "how".) I don't need a specific solution, but maybe just some pointers, generally speaking, of what approaches are available. I think I might be lacking an "aha" moment. If I understand a pattern, I can usually apply it further in other areas.
I will make my "Sales Leads Collection" a central concern of this post, as it has the most variations of querying.
So I have a mostly top-level collection structure like the one below. I should also preface this by saying that, besides writing IDs to other documents, I "stash" an entire "Customer" or "Sales Rep" object/document with other documents, and I have Cloud Functions that iterate through certain documents when there are updates, etc. (This avoids extra reads: when I read a SalesLead, I don't need to read the SalesRep and Customer documents, as they are also stashed/nested with the SalesLead.)
| /sales_reps //SalesReps Collection
| /docId //Document ID
| + salesRepId (document id)
| + firstName
| + lastName
| + other employee/salesRep related info etc.
| /customers //Customers Collection
| /docId //Document ID
| + customerId (document id)
| + firstName
| + lastName
| + address
| + other customer-specific info such as contact info (phone, email) etc.
Logically Sales Leads are of course linked to a Customer (one to many, one Customer can have many leads).
All the Fields mentioned below I need to be able to "query" and "filter"
| /sales_leads //SalesLeads Collection
| /docId //Document ID
| + customerId (document id) <- this is what I would query by to look for leads for a specific customer
| + salesRepId (document id) <- this is what I would query by to look for leads for a specific sales Rep
| + status <- (String: "Open", "Sold", "Lost", "On Hold")
| + progress <- (String: "Started", "Appointment scheduled", "Estimates created", etc. etc., )
| + type <- (String: New Construction or Service/Repair)
| + jobType <- (String: different types of jobs related to what type of structure they are; 8-10 types right now)
| + reference <- (String: How the lead was referred to the company, i.e. Facebook, Google, etc. etc. );
| + many other (non queryable) data related to a lead, but not relevant here...
SalesEstimates are related to Leads in a one-to-many relationship (one lead can have many estimates). Estimates are not all that relevant for this discussion, but I wanted to include them anyhow. I query and filter estimates in a very similar way to leads, though (similar fields, etc.).
| /sales_estimates //SalesEstimates Collection
| /docId //Document ID
| + salesLeadId (document id) <- this is what I would query by to look for estimates for a specific lead
| + customerId (document id) <- this is what I would query by to look for estimates for a specific customer
| + salesRepId (document id) <- this is what I would query by to look for estimates for a specific sales Rep
| + specific sales Lead related data etc....
In my "Sales Lead List" on the client, I have some drop-down boxes as filters that contain values (e.g. Sales Reps), but also have an option/value "All" to negate any filtering.
So I would start assembling a query:
Query query = db.collection("sales_leads");
//Rep
if (!salesRepFilter.equals("All")) { //Typically only Managers/Supervisors would be able to see "all leads", whereas for a SalesRep this would be set to his own ID by default.
query = query.whereEqualTo("salesRepId", salesRepId);
}
//Lead Status (Open, Sold, Lost, On Hold)
if (!statusFilter.contains("All")) {
query = query.whereEqualTo("status", statusFilter);
}
//Lead Progress
if (!progressFilter.contains("All")) {
query = query.whereEqualTo("progress", progressFilter);
}
//Lead Type
if (!typeFilter.contains("All")) {
query = query.whereEqualTo("leadType", typeFilter);
}
//Job Type
if (!jobTypeFilter.contains("All")) {
query = query.whereArrayContains("jobTypes", jobTypeFilter);
}
//Reference
if (!referenceFilter.contains("All")) {
query = query.whereEqualTo("reference", referenceFilter);
}
Additionally, I might want to reduce the whole query to a single customer (this typically means that all other filters are skipped and all leads for this customer are shown). This would happen if the user opens the Customer Page/Details and clicks on something like "Show Leads for this Customer".
//Filter by Customer (when entering my SalesLead List from a Customer Card/Page where user clicked on "Show Leads for this Customer")
if (filterByCustomer) {
query = query.whereEqualTo("customerId", customerFilter);
}
//Finally, I want to be able to query by a date range (when the lead was created) and also sort by "oldest" or "newest"
//Date Range
query = query.whereGreaterThan("leadCreatedOnDate", filterFromDate)
.whereLessThan("leadCreatedOnDate", filterToDate);
//Sort Newest vs Oldest
if (sortByNewest) { //either newest or oldest; newest first means descending dates
query = query.orderBy("leadCreatedOnDate", Query.Direction.DESCENDING);
} else {
query = query.orderBy("leadCreatedOnDate", Query.Direction.ASCENDING);
}
And that completes my query on sales leads. This all works great right now, but I am anxious about going forward and ultimately hitting the composite index limitation. I don't have an exact number, but I am probably using 25-30 composite indexes just for my sales_leads collection. (Yikes!)
Not only are there many fields to query by, the amount of composite indexes required is multiplied by the combination of possible filters set. (UGH)
I need to be able to query all leads and then filter them by the fields mentioned above (when describing my sales_leads collection).
So instead of keeping all these collections as top-level collections, I am guessing that I should somehow restructure my database using subcollections, but I have tried modeling this with different approaches and always seem to hit a wall.
I suppose I could have "sales_leads" as a subcollection under each customer object and could use a collection group query to retrieve "all leads", but those require composite indexes too, right? So it would just be a trade-off for that one searchable field. (..hits wall..)
Sorry for the length. I hope it's readable. I appreciate any help, feedback and input. I'm in a very anxious and frustrated position.
If this doesn't work, I might need to consider professional consultation.
Thanks!
Here are a few things I think will help you.
First, watch the AWS re:Invent 2018: Amazon DynamoDB Deep Dive on YouTube. It's about DynamoDB but DynamoDB is a NoSQL database very similar to Firestore and the concepts universally apply. Midway through the video, Rick uses a company like yours as an example and you may be surprised to see how effectively he can reduce query count simply through data modeling.
Second, familiarize yourself with Firestore's index merging. In situations like yours, it may be better to manually create your composite indices, or at least manually audit them, because Firestore's automatic indexing doesn't guarantee the most efficient menu of composite indices. Remember, composite indices are automatically created based on the order you execute queries and if you execute a query later that could be better structured by voiding a previous index, Firestore will not go back and delete it for you—you have to.
I'm highly suspicious of the fact that the sales-lead query consumes 25-30 composite indices; that number seems far too high to me given how many fields in the documents are indexed. Before you do anything—after having watched the video and studied index merging, of course—I'd focus entirely on this collection. You must be completely certain of the maximum number of composite indices this collection needs to consume. Perhaps create a dummy collection and experiment with index merging and really understand how it works because this alone may solve all of your problems. I would be shocked if Firestore couldn't handle your company's use case.
Third, don't be afraid to denormalize your data. The fundamental premise of NoSQL is really denormalization—that is, data storage really should be your least concern and computation/operation really should be your greatest concern. If you can reduce your query count by duplicating data over multiple documents in multiple collections, that is simply what you must do if the alternative is hitting 200 composite indices.
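As a concrete sketch of that trade-off (the collection layout below is hypothetical, not taken from your schema): a read-optimized copy of each lead could be kept under the sales rep who owns it, maintained by the same kind of Cloud Function you already use for stashing, so the per-rep list view queries a much smaller collection with fewer filter fields:

```text
| /sales_reps //SalesReps Collection
| /repId //Document ID
| /leads //Subcollection: duplicated lead summaries, synced by a Cloud Function
| /leadId //Document ID
| + status, progress, type, reference, leadCreatedOnDate //copied from /sales_leads
```

Here the salesRepId filter disappears entirely (it is implied by the path), which removes one field from every composite index this view needs.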

xquery finding records with specific security role

I'm trying to find records with a specific security role, and I can't seem to find a way to do it using cts:search (which should be faster than a for loop). Here is the for loop:
let $validRoleList := (
xdmp:role("myRole1"),
xdmp:role("myRole2")
)
for $recordUri in cts:uris((), (), cts:collection-query("bigCollection"))
let $documentPermissions := xdmp:document-get-permissions($recordUri)/sec:role-id/fn:string()
let $intPermissions :=
for $permissionValue in $documentPermissions
return xs:unsignedLong($permissionValue)
where $intPermissions = $validRoleList
return $recordUri
With my "bigCollection" being in the 15 million record range, even on the task server it's taking over an hour. Is there any easier way to find a record by its permission role name?
I found this function somewhere years ago, and I don't know how it works, but it does. I've used it in production systems for years, and it works great for your question of "How do I query for documents that have a particular permission?" It's in XQuery, but I believe there's a JS equivalent for each XQuery function.
declare function permission-query($role, $capability)
{
cts:term-query(
xdmp:add64(
xdmp:mul64(xdmp:add64(xdmp:mul64(xdmp:role($role), 5), xdmp:hash64($capability)), 5),
xdmp:hash64("permission()")
)
)
};
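For what it's worth, here is a sketch of how the helper might be combined with cts:uris (assuming the "read" capability and that the function is declared in the local: namespace), so the whole lookup stays in the indexes and never fetches documents:

```xquery
cts:uris((), (),
  cts:and-query((
    cts:collection-query("bigCollection"),
    cts:or-query((
      local:permission-query("myRole1", "read"),
      local:permission-query("myRole2", "read")
    ))
  ))
)
```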
This looping approach is inherently slow because it's going to pull every document off disk to extract its permissions. 15 million docs means 15 million disk fetches. No matter the code, that's slow.
The fastest and easiest way to answer this would be to create a user with those two roles, become that user, and run a cts:uris query for all the URIs in the database; the answer will automatically and efficiently be limited to the URIs visible to those two roles.
If you need it more dynamic without creating such a user, it's possible for an admin user to xdmp:login with a list of roles.

Firebase - Structuring Data For Efficient Indexing

I've read almost everywhere about structuring one's Firebase Database for efficient querying, but I am still a little confused between two alternatives that I have.
For example, let's say I want to get all of a user's "maxBenchPressSessions" from the past 7 days or so.
I'm stuck between picking between these two structures:
In the first structure, I use the user's id as the attribute name, with a boolean value to index on. In the second, I use "userId" as the attribute name, whose value is the user's id.
Is one faster than the other, or would they be indexed in relatively the same manner? I'm kind of new to database design, so I want to make sure that I'm following correct practices.
PROGRESS
I have come up with a solution that will both flatten my database AND allow me to add a ListenerForSingleValueEvent using orderBy ONLY once, but only when I want to check if a user has a session saved for a specific day.
I can have each maxBenchPressSession object have a key in the format of userId_dateString. However, if I want to get all the user's sessions from the last 7 days, I don't know how to do it in one query.
Any ideas?
I recommend watching these videos; they explain how to structure the data very well. They are from the Firebase 3.0 playlist:
Firebase 3.0: Data Modelling
Firebase 3.0: Node Client
As I understand it, the principle for using Firebase effectively is to keep each query for data as small as possible; it doesn't matter how many requests you make.
Your request is possible with this approach, but we'll have to add another field, "negativeDate", to the database.
This field allows you to get the last seven entries. Here's a video -
https://www.youtube.com/watch?v=nMR_JPfL4qg&feature=youtu.be&t=4m36s
.limitToLast(7) - 7 entries
.orderByChild('negativeDate') - sort by date
Example of a request:
const ref = firebase.database().ref('maxBenchPressSession');
ref.orderByChild('negativeDate').limitToLast(7).on('value', function(snap){ })
To scope this to a single user, nest the sessions under the user ID:
const ref = firebase.database().ref('maxBenchPressSession/' + userId);
ref.orderByChild('negativeDate').limitToLast(7).on('value', function(snap){ })
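For completeness, "negativeDate" is just the negated creation timestamp, so that ascending order (the only order Firebase queries support) puts the newest sessions first. A minimal sketch (the field names other than negativeDate are assumptions):

```javascript
// Store the negated timestamp alongside the session so that sorting
// ascending on 'negativeDate' yields newest-first.
function withNegativeDate(session) {
  return Object.assign({}, session, { negativeDate: -session.date });
}

// Hypothetical write:
// firebase.database().ref('maxBenchPressSession/' + userId)
//   .push(withNegativeDate({ date: Date.now(), weight: 225 }));
```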

How to load first 50 objects in firebase, stop the loading and then filter results?

My Firebase database is more than 800 MB and contains more than 100,000 objects (news articles).
What I want to do is fetch just the first 50 objects (most recent) and then sort the objects returned by the first query according to child parameters.
So, for example, when the page is loaded, I need AngularFire / Firebase to load just the first 50 objects and stop loading the rest of the objects in the database. Then I want to filter just these 50 objects (articles) based on the category child, e.g. music.
So far, my first query seems to be fine (though if there is a better way to ask Firebase to load X objects and stop, I would appreciate it). But the second part I can't figure out, because Firebase throws an error.
The error is:
Query: Can't combine startAt(), endAt(), and limit(). Use limitToFirst() or limitToLast() instead
Here is my sample code:
var myArticlesRef = new Firebase(FIREBASE_URL + 'articles/');
var latestArticlesRef = myArticlesRef.limitToFirst(20); // is this the recommended way to ask Firebase to stop loading?
var latestArticlesOrder = latestArticlesRef.orderByChild('category').equalTo('Music'); // <- how to do something similar?
var latestArticlesInfo = $firebaseArray(latestArticlesOrder);
$scope.latestArticles = latestArticlesInfo;
console.log($scope.latestArticles);
This should work:
var query = myArticlesRef.orderByChild('category').equalTo('Music').limitToFirst(20);
So you're asking Firebase to return the first 20 articles in the Music category.
While it is common to think of queries like this when coming from a relational/SQL mindset, I recommend that you consider this alternative data structure:
articles_by_category
music
article1: { }
article2: { }
article3: { }
...
technology
article4: { }
...
So instead of storing the articles in one big list, store them by category. That way to access the articles about music, you only have to do:
var query = ref.child('articles_by_category').child('music').limitToFirst(20);
With this approach the database doesn't have to execute any query and it can scale to a much higher number of users.
This is something you'll see regularly in a NoSQL world: you end up modeling your data for the way your application wants to query it. For a great introduction, see this article on NoSQL data modeling. Also read the Firebase documentation on data modeling and indexes.
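One way to maintain such a structure (a sketch; the path names are assumptions) is a multi-location update that writes each article both to the master list and to its category list in one atomic call:

```javascript
// Build a fan-out object for ref.update(): the article is written to the
// flat 'articles' list and mirrored under its category.
function articleFanOut(articleId, article) {
  return {
    ['articles/' + articleId]: article,
    ['articles_by_category/' + article.category + '/' + articleId]: article
  };
}

// Hypothetical usage:
// firebase.database().ref().update(
//   articleFanOut('article1', { title: 'New album out', category: 'music' }));
```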

Neo4j Match and Create takes too long in a 10000 node graph

I have a data model like this:
Person node
Email node
OWNS relationship
LISTS relationship
KNOWS relationship
Each Person can OWN one Email and can LIST multiple Emails (like a contact list; 200 contacts are assumed per Person).
The query I am trying to perform is finding all the Persons that OWN an Email that a Contact LISTS and create a KNOWS relationship between them.
MATCH (n:Person {uid:'123'}) -[r1:LISTS]-> (m:Email) <-[r2:OWNS]- (l:Person)
CREATE UNIQUE (n)-[:KNOWS]->(l)
The counts of my current database is as follows:
Number of Person nodes: 10948
Number of Email nodes: 1951481
Number of OWNS rels: 21882
Number of LISTS rels: 4376340 (Each Person has 200 unique LISTS rels)
Now my problem is that running this query on the current database takes somewhere between 4.3 and 4.8 seconds, which is unacceptable for my needs. I wanted to know if this timing is normal considering my data model, or whether I am doing something wrong with the query (or even the model).
Any help would be much appreciated. Also if this is normal for Neo4j please feel free to suggest other graph databases that can handle this kind of model better.
Thank you very much in advance
UPDATE:
My query is: profile match (n: {uid: '4692'}) -[:LISTS]-> (:Email) <-[:OWNS]- (l) create unique (n)-[r:KNOWS]->(l)
The PROFILE command on my query returns this:
Cypher version: CYPHER 2.2, planner: RULE. 3919222 total db hits in 2713 ms.
Yes, 4.5 seconds to match one person from index along with its <=100 listed email addresses and merging a relationship from user to the single owner of each email, is slow.
The first thing is to make sure you have an index for uid property on nodes with :Person label. Check your indices with SCHEMA command and if missing create such an index with CREATE INDEX ON :Person(uid).
Secondly, CREATE UNIQUE may or may not do the job, but you will want to use MERGE instead. CREATE UNIQUE is deprecated, and though the two are sometimes equivalent, the operation you want performed is expressed with MERGE.
Thirdly, to find out why the query is slow you can profile it:
PROFILE
MATCH (n:Person {uid:'123'})-[:LISTS]->(m:Email)<-[:OWNS]-(l:Person)
MERGE (n)-[:KNOWS]->(l)
See the Neo4j documentation on PROFILE and query tuning for details. You may also want to profile your query while forcing the use of either the cost-based or the rule-based query planner, to compare their plans.
CYPHER planner=cost
PROFILE
MATCH (n:Person {uid:'123'})-[:LISTS]->(m:Email)<-[:OWNS]-(l:Person)
MERGE (n)-[:KNOWS]->(l)
With these you can hopefully find and correct the problem, or update your question with the information to help others help you find it.
