DynamoDB modelling advice, duplicating a field for lookup - amazon-dynamodb

I am trying to model a relationship in a sports betting app.
For a given game of sports users can predict who they think is going to win.
I am thinking about building the landing page for this app, where users can view all active pools that they either own or have participated in (that is, made a prediction in).
My data model then looks like this
So for example, we have a Pool id a1, owned by user b2 with two predictions by users b1 and b2.
To get all active pools owned by a given user is simple: I just add a GSI on OwnerId and filter by IsActive.
However, I am unsure how to also get all active pools that the user does not own but has made a prediction for.
Would the best option here be to duplicate the IsActive flag onto the Predictions and add OwnerId to the Predictions, so I could first fetch by OwnerId and filter by SK starting with Prediction to get the Pool ids, and then fetch the Pool profiles via these ids?

You are not taking full advantage of the PK like that.
Here is how I would design the 2 entities you've described (hopefully I understood you correctly)
Pool
PK | SK | Other attributes
POOL_OWNER#(UserId) | Active#True#POOL_NAME(Or ID) | ...
This way you can
Get all pools for a given user (Query by partition key)
Get all active/inactive pools of a given user (Query where the SK begins with "Active#True" or "Active#False")
Prediction
PK | SK | Prediction attributes
PREDICTION_USER#(user id) | POOL_ACTIVE#True#POOL_NAME#(pool name) | ...
This way you can
Get all predictions for a user by querying the PK
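Get all predictions for a user for a specific pool, or for all active/inactive pools
Now, when "deactivating" a pool you will have to update all of its entries. Because the active flag is part of the SK, and key attributes cannot be updated in place, that means deleting each item and re-inserting it under the new key. If you are worried about data consistency you can use transactions, but keep in mind that a transaction was originally limited to 25 items (the limit has since been raised to 100).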
If you need to get all predictions for a pool, you can add a GSI for it
GSI1PK | GSI1SK
PREDICTION_POOL_NAME#(pool name) | PREDICTION_USER#(user id)
It is not a good idea to place data that you want to use directly into the PK or SK. If for example you need the UserID don't extract it from a SK but rather have a separate attribute UserId, this way changing and overloading PK and SK is much easier
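As a rough illustration of the key layout in this answer, here is a minimal in-memory sketch. The helper names (`pool_item`, `prediction_item`, `query`) and the `Pick` attribute are hypothetical, and `query` only imitates a DynamoDB Query with a `begins_with` condition on the SK; it is not a call into any AWS SDK.

```python
# Hypothetical helpers that build items following the key layout above.
def pool_item(owner_id, pool_id, active):
    return {
        "PK": f"POOL_OWNER#{owner_id}",
        "SK": f"Active#{active}#{pool_id}",
        # Plain attributes, so nothing has to be parsed out of PK/SK.
        "OwnerId": owner_id,
        "PoolId": pool_id,
        "IsActive": active,
    }


def prediction_item(user_id, pool_id, active, pick):
    return {
        "PK": f"PREDICTION_USER#{user_id}",
        "SK": f"POOL_ACTIVE#{active}#POOL_NAME#{pool_id}",
        "GSI1PK": f"PREDICTION_POOL_NAME#{pool_id}",
        "GSI1SK": f"PREDICTION_USER#{user_id}",
        "UserId": user_id,
        "PoolId": pool_id,
        "Pick": pick,  # assumed attribute, for illustration only
    }


def query(items, pk, sk_prefix=""):
    """In-memory stand-in for a Query on PK with begins_with on SK."""
    matches = [i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)]
    return sorted(matches, key=lambda i: i["SK"])


# Pool a1 is owned by b2 and has predictions from b1 and b2.
table = [
    pool_item("b2", "a1", True),
    pool_item("b2", "a2", False),
    prediction_item("b1", "a1", True, "home"),
    prediction_item("b2", "a1", True, "away"),
]

# Active pools owned by b2.
owned_active = query(table, "POOL_OWNER#b2", "Active#True#")
# Active pools in which b1 has made a prediction.
predicted_active = query(table, "PREDICTION_USER#b1", "POOL_ACTIVE#True#")
```

The landing page would then need two queries per user, one on each partition, and would read the pool ids from the `PoolId` attribute rather than from the key strings.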

Related

What's the best way to store users in DynamoDB so I can get one efficiently, and a related group as well?

I have users for my website that need to log in. In order to do that, I have to check the database for them, by email address or a hash of their email.
Some of my users have an online course in common.
Others are all on the same project.
There are multiple projects and courses.
How might I set up my table so that I can grab individual users, and efficiently query related groups of users?
I'm thinking...
PK = user#mysite
SK = user#email.com
projects = [1,2,3]
courses = [101,202,303]
I can get any user with a get on PK = user#mysite, SK = user#email.com.
But if I query, I have to filter two attributes, and I feel like I'm no longer very efficient.
If I set up users like this on the other hand:
PK = user#email.com
SK = 1#2#3#101#202#303
projects = [1,2,3]
courses = [101,202,303]
Then I can get PK = user#email.com and that's unique on its own.
And I can query SK contains 101 for example if I want all the 101 course students.
But I have to maintain this weird #-delimited list of things in the SK string.
Am I thinking about this the right way?
You want to find items which possess a value in an attribute holding a list of values. So do I sometimes! But there is not an index for that.
You can, however, solve this by adding new items to the table.
Your main item would have the email address as both the PK and the SK. It includes attributes listing the courses and projects, and all the other metadata about that user.
For each course, you insert additional items where the course id is the PK and the member emails are the various SKs in that item collection. Same for projects.
Given an email, you can find everything about that user with a GetItem. Given a course or project, you can find all matching emails with a Query against the course or project id. Then do a BatchGetItem if you need all the data about each email.
When someone adds or drops a course or project, you update the main item as well as add/remove the additional indexed items.
Should you want to query by course X and project Y you can pull the matching results to the client and join in the client on email address.
In one of your designs you're proposing a contains against the SK, which is not a supported operator against SKs so that design wouldn't work.
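The item layout described in this answer can be sketched in memory as follows. The `course#` and `project#` prefixes and the function names are assumptions made for the example; the answer only says that the course or project id is the PK and the member email is the SK.

```python
def user_items(email, courses, projects):
    """Build the main user item plus one membership item per course/project."""
    main = {"PK": email, "SK": email, "courses": list(courses), "projects": list(projects)}
    memberships = [{"PK": f"course#{c}", "SK": email} for c in courses]
    memberships += [{"PK": f"project#{p}", "SK": email} for p in projects]
    return [main] + memberships


def members(table, group_pk):
    """Stand-in for a Query on the group's partition: returns member emails."""
    return {item["SK"] for item in table if item["PK"] == group_pk}


table = (
    user_items("ann@email.com", courses=[101, 202], projects=[1])
    + user_items("bob@email.com", courses=[101], projects=[1, 2])
)

# Everyone in course 101.
in_101 = members(table, "course#101")

# Course 101 AND project 2: fetch both result sets and join on email client-side.
in_101_and_project_2 = members(table, "course#101") & members(table, "project#2")
```

When a user adds or drops a course, both the list on the main item and the matching membership item have to change together.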

How to model complex relational data in Firestore while limiting composite indexes?

First of all thank you to anybody reading through this and offering any advice and help. It is much appreciated.
I'm developing a small custom CRM (ouch) for my father's business (specialty contractor) and I'm using Firestore for my database. It is supposed to be very lean, without much "bling", but streamlined for his specialty contracting business, whose process is very hard to fit into any off-the-shelf CRM. I have gotten quite far and have a decent-sized implementation, but am now running into some very fundamental issues as everything is expanding.
I admit that only having experience with relational databases (and not much of that either) left me scratching my head a few times when properly setting up my database structure and am running into some issues with Firestore. I'm also a fairly novice developer and I feel I'm tackling something that is just way out of my league. (but there's not much turning around now being a year into this journey)
As of right now I'm using Top Level Collections for what I am presenting here. I recently started using Sub-Collections for some other minor features and started questioning if I should apply that for everything.
A big problem that I foresee is that, because I want to query in a multitude of ways, I am already consuming almost 100 composite indexes. There is still lots to add, so I need to reduce the number of composite indexes that my current and future data structure needs.
So I am somewhat certain, that my data model probably is deeply flawed and needs to be improved/optimized/changed. (Which I don't mind doing, if that's what it takes, but I'm lost on "how") I don't need a specific solution, but maybe just some pointers, generally speaking, of what approaches are available. I think I might be lacking an "aha" moment. If I understand a pattern, I can usually apply that further in other areas.
I will make my "Sales Leads Collection" a central concern of this post, as it has the most variations of querying.
So I have a mostly top-level collection structure like this. I also want to preface it by saying that, besides writing the IDs to other Documents, I will "stash" an entire "Customer" or "Sales Rep" Object/Document with other Documents, and I have Cloud Functions that will iterate through certain documents when there are updates, etc. (This is to avoid extra reads, i.e. when I read a SalesLead, I don't need to read the SalesRep and Customer Document, as they are also stashed/nested with the SalesLead.)
| /sales_reps //SalesReps Collection
|   /docId //Document ID
|     + salesRepId (document id)
|     + firstName
|     + lastName
|     + other employee/salesRep related info etc.
| /customers //Customers Collection
|   /docId //Document ID
|     + customerId (document id)
|     + firstName
|     + lastName
|     + address
|     + other customer specific related info such as contact info (phone, email) etc.
Logically Sales Leads are of course linked to a Customer (one to many, one Customer can have many leads).
All the Fields mentioned below I need to be able to "query" and "filter"
| /sales_leads //SalesLeads Collection
|   /docId //Document ID
|     + customerId (document id) <- this is what I would query by to look for leads for a specific customer
|     + salesRepId (document id) <- this is what I would query by to look for leads for a specific sales Rep
|     + status <- (String: "Open", "Sold", "Lost", "On Hold")
|     + progress <- (String: "Started", "Appointment scheduled", "Estimates created", etc.)
|     + type <- (String: New Construction or Service/Repair)
|     + jobType <- (String: different types of jobs, related to what type of structures they are; 8-10 types right now)
|     + reference <- (String: how the lead was referred to the company, i.e. Facebook, Google, etc.)
|     + many other (non queryable) data related to a lead, but not relevant here...
SalesEstimates are related to Leads in a one to many relationship. (one lead can have many estimates) But Estimates are not all that relevant for this discussion, but just wanted to include it anyhow. I query and filter estimates in a very similar way I do with leads, though. (similar fields etc.)
| /sales_estimates //SalesEstimates Collection
|   /docId //Document ID
|     + salesLeadId (document id) <- this is what I would query by to look for estimates for a specific lead
|     + customerId (document id) <- this is what I would query by to look for estimates for a specific customer
|     + salesRepId (document id) <- this is what I would query by to look for estimates for a specific sales Rep
|     + specific sales Lead related data etc....
In my "Sales Lead List" on the client, I have some drop-down boxes as filters that contain values (i.e. Sales Reps) but also have an option/value "All" to negate any filtering.
So I would start assembling a query:
Query query = db.collection("sales_leads");
//Rep
if (!salesRepFilter.equals("All")) { //Typically only Managers/Supervisors would be able to see "all leads" whereas for a SalesRep this would be set to his own ID by default.
query = query.whereEqualTo("salesRepId", salesRepId);
}
//Lead Status (Open, Sold, Lost, On Hold)
if (!statusFilter.contains("All")) {
query = query.whereEqualTo("status", statusFilter);
}
//Lead Progress
if (!progressFilter.contains("All")) {
query = query.whereEqualTo("progress", progressFilter);
}
//Lead Type
if (!typeFilter.contains("All")) {
query = query.whereEqualTo("leadType", typeFilter);
}
//Job Type
if (!jobTypeFilter.contains("All")) {
query = query.whereArrayContains("jobTypes", jobTypeFilter);
}
//Reference
if (!referenceFilter.contains("All")) {
query = query.whereEqualTo("reference", referenceFilter);
}
Additionally, I might want to reduce the whole query to a single customer (this typically means that all other filters are skipped and "all leads for this customer" are shown). This would happen if the user opens the Customer Page/Details and clicks on something like "Show Leads for this customer".
//Filter by Customer (when entering my SalesLead List from a Customer Card/Page where user clicked on "Show Leads for this Customer")
if (filterByCustomer) {
query = query.whereEqualTo("customerId", customerFilter);
}
//And at last I want to be able to query the date Range (when the lead was created) and also sort by "oldest" or "newest"
//Date Range
query = query.whereGreaterThan("leadCreatedOnDate", filterFromDate)
.whereLessThan("leadCreatedOnDate", filterToDate);
//Sort Newest vs Oldest
if (sortByNewest) { //either newest or oldest
query = query.orderBy("leadCreatedOnDate", Query.Direction.DESCENDING); //newest first
} else {
query = query.orderBy("leadCreatedOnDate", Query.Direction.ASCENDING); //oldest first
}
And that would complete my query on sales leads. That all works great right now, but I am anxious about going forward and ultimately hitting the composite index limitation. I don't have an exact number, but I am probably entertaining 25-30 composite indexes just for my collection of sales_leads. (Yikes!)
Not only are there many fields to query by, the number of composite indexes required is multiplied by the combinations of filters that can be set. (UGH)
I need to be able to query all leads and then filter them by the fields mentioned above (when describing my sales_leads collection).
So instead of keeping all these collections as top level collections I am guessing that somehow I should restructure my database by entertaining sub collections, but I tried modeling this with different approaches and always seem to hit a wall.
I suppose I could have "sales_leads" as a subcollection under each customer object and could use a collection group query to retrieve "all leads", but those require composite indexes too, right? So it would just be a tradeoff for that one searchable field. (..hits wall..)
Sorry for the length. I hope it's readable. I appreciate any help, feedback and input. I'm in a very anxious and frustrated position.
If this doesn't work, I might need to consider professional consultation.
Thanks!
Here are a few things I think will help you.
First, watch the AWS re:Invent 2018: Amazon DynamoDB Deep Dive on YouTube. It's about DynamoDB but DynamoDB is a NoSQL database very similar to Firestore and the concepts universally apply. Midway through the video, Rick uses a company like yours as an example and you may be surprised to see how effectively he can reduce query count simply through data modeling.
Second, familiarize yourself with Firestore's index merging. In situations like yours, it is worth auditing your composite indices by hand, because the set you accumulate is not guaranteed to be the most efficient one. Remember, Firestore only creates single-field indices automatically; composite indices get created one at a time, typically from the link in a failed query's error message, so the set you end up with depends on the order in which you ran your queries. If a later query could be served by merging existing indices, or makes an earlier index redundant, Firestore will not go back and delete anything for you; you have to.
I'm highly suspicious of the fact that the sales-lead query consumes 25-30 composite indices; that number seems far too high to me given how many fields in the documents are indexed. Before you do anything—after having watched the video and studied index merging, of course—I'd focus entirely on this collection. You must be completely certain of the maximum number of composite indices this collection needs to consume. Perhaps create a dummy collection and experiment with index merging and really understand how it works because this alone may solve all of your problems. I would be shocked if Firestore couldn't handle your company's use case.
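To get a feel for why index merging matters here, the following back-of-the-envelope sketch counts composite indices for the sales-lead filters. It rests on assumptions that should be checked against the Firestore documentation: that every filter in the list can take part in merging (the array filter on `jobTypes` may not), that the date range plus sort behaves like a plain `orderBy` for merging purposes, and that each sort direction needs its own index. The function names are made up for the example.

```python
from itertools import combinations

# Equality-style filters taken from the question's query code.
equality_fields = ["salesRepId", "status", "progress", "leadType", "jobTypes", "reference"]
sort_directions = 2  # leadCreatedOnDate ascending and descending (assumed separate)


def indexes_without_merging(fields, directions):
    """One composite index per non-empty combination of filters, per direction."""
    subsets = [c for r in range(1, len(fields) + 1) for c in combinations(fields, r)]
    return len(subsets) * directions


def indexes_with_merging(fields, directions):
    """One (filter field, sort field) index per filter, per direction."""
    return len(fields) * directions


worst_case = indexes_without_merging(equality_fields, sort_directions)
merged_case = indexes_with_merging(equality_fields, sort_directions)
```

Under these assumptions the worst case is 126 indices and the merged case is 12, which is why auditing this one collection first seems worthwhile.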
Third, don't be afraid to denormalize your data. The fundamental premise of NoSQL is really denormalization—that is, data storage really should be your least concern and computation/operation really should be your greatest concern. If you can reduce your query count by duplicating data over multiple documents in multiple collections, that is simply what you must do if the alternative is hitting 200 composite indices.

How do I model this in DynamoDB?

I am testing out DynamoDB for a serverless app I am building. I have successfully modeled all of my application's query patterns except one. I was hoping someone could provide some guidance. Here are the details:
Data Model
There are three simple entities: User (~1K records), Product (~100K), ActionItem (~100/product).
A User has a many-to-many relationship with Product.
A Product has a one-to-many relationship with ActionItem.
The Workflow
There's no concept of "Team" for this app. Instead, a user is assigned a set of products which they (and others) are responsible for managing. The user picks the oldest items from their products' action item list, services the item and then closes it.
The use case I am trying to model is: As a user, show me all action items for products to which I am assigned.
Any help would be greatly appreciated.
Really only two options...
If you can store the list of products within the 400KB limit of a DDB item, then you could have a single item like so...
Hash key: userID
Sort key: "ASSIGNED_PRODUCTS"
Otherwise, use one item per assigned product:
Hash key: userID
Sort key: "#PRODUCT#10001-54502"
userID in the above might be the raw userid, or if using a GSI, might be something like "#USER#user-id"
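A minimal in-memory sketch of the second option, one item per assigned product, might look like this. The answer does not say how action items are keyed, so the `#ACTION#` sort key with a creation date is purely an assumption, as are the function names and sample data; `query` only imitates a DynamoDB Query with `begins_with` on the sort key.

```python
def assignment_item(user_id, product_id):
    # One item per (user, product) pair, as in the second option.
    return {"PK": user_id, "SK": f"#PRODUCT#{product_id}"}


def action_item(product_id, created_at, title):
    # Assumed layout: action items live under the product's partition and
    # sort by creation date, so the oldest come back first.
    return {"PK": f"#PRODUCT#{product_id}", "SK": f"#ACTION#{created_at}", "title": title}


def query(table, pk, sk_prefix=""):
    """In-memory stand-in for a Query on PK with begins_with on SK."""
    rows = [i for i in table if i["PK"] == pk and i["SK"].startswith(sk_prefix)]
    return sorted(rows, key=lambda i: i["SK"])


def action_items_for_user(table, user_id):
    """Read the user's assignments, then fan out one query per product."""
    items = []
    for assignment in query(table, user_id, "#PRODUCT#"):
        items.extend(query(table, assignment["SK"], "#ACTION#"))
    return sorted(items, key=lambda i: i["SK"])  # oldest first across products


table = [
    assignment_item("user-1", "10001"),
    assignment_item("user-1", "10002"),
    assignment_item("user-2", "10003"),
    action_item("10001", "2024-01-03", "restock"),
    action_item("10002", "2024-01-01", "fix listing"),
    action_item("10003", "2024-01-02", "review price"),
]

titles = [i["title"] for i in action_items_for_user(table, "user-1")]
```

Note that this costs one query per assigned product, so it fits best when each user is responsible for a modest number of products.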

Neo4j Match and Create takes too long in a 10000 node graph

I have a data model like this:
Person node
Email node
OWNS relationship
LISTS relationship
KNOWS relationship
Each Person can OWN one Email and LISTS multiple Emails (like a contact list; 200 contacts are assumed per Person).
The query I am trying to perform is finding all the Persons that OWN an Email that a Contact LISTS and create a KNOWS relationship between them.
MATCH (n:Person {uid:'123'}) -[r1:LISTS]-> (m:Email) <-[r2:OWNS]- (l:Person)
CREATE UNIQUE (n)-[:KNOWS]->(l)
The counts in my current database are as follows:
Number of Person nodes: 10948
Number of Email nodes: 1951481
Number of OWNS rels: 21882
Number of LISTS rels: 4376340 (Each Person has 200 unique LISTS rels)
Now my problem is that running this query on the current database takes between 4.3 and 4.8 seconds, which is unacceptable for my needs. I wanted to know whether this is normal timing considering my data model, or whether I am doing something wrong with the query (or even the model).
Any help would be much appreciated. Also if this is normal for Neo4j please feel free to suggest other graph databases that can handle this kind of model better.
Thank you very much in advance
UPDATE:
My query is: profile match (n:Person {uid: '4692'}) -[:LISTS]-> (:Email) <-[:OWNS]- (l) create unique (n)-[r:KNOWS]->(l)
The PROFILE command on my query returns this:
Cypher version: CYPHER 2.2, planner: RULE. 3919222 total db hits in 2713 ms.
Yes, 4.5 seconds is slow for matching one person from an index along with its roughly 200 listed email addresses and merging a relationship from that person to the single owner of each email.
The first thing is to make sure you have an index for the uid property on nodes with the :Person label. Check your indices with the SCHEMA command and, if it is missing, create such an index with CREATE INDEX ON :Person(uid).
Secondly, CREATE UNIQUE may or may not do the work fine, but you will want to use MERGE instead. CREATE UNIQUE is deprecated and though they are sometimes equivalent, the operation you want performed should be expressed with MERGE.
Thirdly, to find out why the query is slow you can profile it:
PROFILE
MATCH (n:Person {uid:'123'})-[:LISTS]->(m:Email)<-[:OWNS]-(l:Person)
MERGE (n)-[:KNOWS]->(l)
You may also want to profile your query while forcing the use of one or the other of the cost-based and rule-based query planners to compare their plans.
CYPHER planner=cost
PROFILE
MATCH (n:Person {uid:'123'})-[:LISTS]->(m:Email)<-[:OWNS]-(l:Person)
MERGE (n)-[:KNOWS]->(l)
With these you can hopefully find and correct the problem, or update your question with the information to help others help you find it.

Challenges with adding a 1-1 relational table (Users and UserSettings)

My issue is related to this question: Entity Framework One-To-One Mapping Issues
I have a Users table that already has a bunch of records.
Users (Id, UserName, Password, FullName, Gender)
I need to add a bunch of notification options for each user:
NotifyForNewComment
NotifyForNewPost
NotifyWhenFriendSignsUp
I may have to add more options later, but there will always be a 1-1 relationship, so my question is whether to store these in a separate table, say UserSettings, or just add them as columns to the Users table.
In the linked question (above), the advice was to create a new table and make the UserId in the UserSettings table as the primary key (because otherwise, Entity Framework doesn't like it). If that's what I have to do, then I have a few questions regarding it:
All my tables have an Id column. The UserSettings will not have an Id column then, since the UserId will be the primary key?
I'd have to turn on Identity Insert for the UserSettings table, so that when I insert a new User record, I can also insert a UserSettings record with the UserId?
Given that I already have a bunch of records in the Users table, what do I have to do if I'm going to introduce the new UserSettings table now which will have a 1-1 relationship with the Users table? Do I just run a script to add records for each user in the Users table with some default values? Or do I make it into a 0-1 relationship?
Since it's a 1-1 relationship, should I not worry about a new table, and instead just add them as columns to the existing Users table?
I think you are missing the point of a UserSettings table. It would have columns like:
UserSettingsId
UserId
Notification
It might also contain things like when the notification was created, whether it is currently enabled, and other information.
If you know exactly what the notifications are, and they are not going to change, then you might consider adding them as separate columns in the user table.
On the other hand, this is a natural 1-N relationship, and you should probably implement it as such.
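The 1-N layout suggested in this answer can be tried out with a small sketch. It uses SQLite through Python's standard library rather than SQL Server and Entity Framework, so it only illustrates the table shape; the `Enabled` column, the uniqueness constraint, and the sample data are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE Users (
        Id INTEGER PRIMARY KEY,
        UserName TEXT NOT NULL
    );
    -- One row per (user, notification) instead of one column per option.
    CREATE TABLE UserSettings (
        UserSettingsId INTEGER PRIMARY KEY,
        UserId INTEGER NOT NULL REFERENCES Users(Id),
        Notification TEXT NOT NULL,
        Enabled INTEGER NOT NULL DEFAULT 1,  -- assumed extra column
        UNIQUE (UserId, Notification)
    );
    """
)

conn.execute("INSERT INTO Users (Id, UserName) VALUES (1, 'alice')")
conn.executemany(
    "INSERT INTO UserSettings (UserId, Notification) VALUES (?, ?)",
    [(1, "NotifyForNewComment"), (1, "NotifyForNewPost")],
)


def notifications_for(user_id):
    """Return the names of the notifications currently enabled for a user."""
    rows = conn.execute(
        "SELECT Notification FROM UserSettings "
        "WHERE UserId = ? AND Enabled = 1 ORDER BY Notification",
        (user_id,),
    )
    return [row[0] for row in rows]
```

With this shape, adding a new notification type later is an insert rather than a schema change, and existing users simply have no row until they opt in or a default is backfilled.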

Resources