I'm trying to figure out how to model data in Riak. Let's say you are building something like a CMS with two features, news and products. You need to be able to store this information for multiple clients X and Y. How would you typically structure this?
1. One bucket per client with two keys, news and products; store multiple objects under each key and use map/reduce to order them.
2. One bucket per client (one for X, one for Y), storing both news items and product items in the same bucket, each under a new autogenerated key.
3. One bucket per client/feature combination, i.e. the buckets would be X-news, X-products, Y-news and Y-products; then use map/reduce on the whole bucket to return the results in order.
Which would be the best way to handle this problem?
I'd create 2 buckets: news and products.
Then I'd prefix keys in each bucket with client names.
I'd probably also include dates in news keys for easy date ranging.
news/acme_2011-02-23_01
news/acme_2011-02-23_02
news/bigcorp_2011-02-21_01
And optionally prefix product names with category names:
products/acme_blacksmithing_anvil
products/bigcorp_databases_oracle
Then in your map/reduce you could use key filtering:
// BigCorp News items
{
  "inputs": {
    "bucket": "news",
    "key_filters": [["starts_with", "bigcorp"]]
  }
  // ... rest of mapreduce job
}
// Acme Blacksmithing items
{
  "inputs": {
    "bucket": "products",
    "key_filters": [["starts_with", "acme_blacksmithing"]]
  }
  // ... rest of mapreduce job
}
// News for all clients from Feb 12th to 19th
{
  "inputs": {
    "bucket": "news",
    "key_filters": [["tokenize", "_", 2],
                    ["between", "2011-02-12", "2011-02-19"]]
  }
  // ... rest of mapreduce job
}
An even more efficient approach than key filtering (as in Kev Burns's answer) is to use Secondary Indexes (2i) or Riak Search to model this scenario.
Take a look at my answers to Which clustered NoSQL DB for a Message Storing purpose? and Links in Riak: what can they do/not do, compared to graph databases? for a discussion of similar cases.
You have several decisions to make, depending on your use case. In all cases, you would start out with a company bucket, so that each company has a unique key.
1) Whether to store the items of interest in 2 separate buckets (news and products) or in one (something like items_of_interest) depends on your preference and ease of querying. If you're always going to query for both news and products for a company in a single query, you might as well store them in a single bucket. But I recommend 2 separate buckets, to keep track of them more easily, especially if you'll have something like separate tabs or pages for "Company X - Products" and "Company X - News". If you then need to combine them into a single feed, make 2 queries (one for news, one for products) and merge them in client code (by date or whatever).
2) If a news/product item can belong to one and only one company, create a secondary index on company_key for each item. That way, you can easily fetch all news or products for a company via a secondary index (2i) query for that company's key.
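For example, a 2i query can be used directly as the input to a map/reduce job. A minimal sketch, assuming the items were stored with a binary index (the company_key_bin index name is an assumption, following Riak's _bin suffix convention):

// All news items for company "acme", via its secondary index
{
  "inputs": {
    "bucket": "news",
    "index": "company_key_bin",
    "key": "acme"
  }
  // ... rest of mapreduce job
}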
3) If there's a many-to-many relationship (a news/product item can belong to several companies; perhaps a news item covers a joint venture between 2 companies), then I recommend modeling the relationship as a separate Riak object. For example, you could create a mentions bucket, and for each company mentioned in a news story, insert a Mention object with its own unique key, a secondary index on company_key, and a value containing a type ('news' or 'product') and an item_key (news key or product key).
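A Mention object might look like this (a sketch; the key scheme, index name, and field names are illustrative):

// Bucket: mentions, key: auto-generated
// Secondary index: company_key_bin = "acme"
{
  "type": "news",
  "item_key": "acme_2011-02-23_01"
}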
Extracting relationships to separate Riak objects like this allows you to do a lot of interesting things -- tag them arbitrarily using Riak Search, query them for subscription event notifications, etc.
I've viewed the Firestore docs and Google's I/O 2019 webinar, but I'm still not clear on the right data modeling for my particular use case.
The app lets professional service providers register and publish one or more of their services in pre-defined categories (Stay, Sports, Wellness...) and at pre-defined price points ($50, $75, $100...).
Users on the homepage first filter with a price point slider (see wireframe), e.g. €199, then optionally by category, e.g. all Sports (at €199), and by location, e.g. all Sports at €199 in the UK. Optionally, because users can also build their list with a button as soon as the price is selected; the same 'build list' button appears after the category selection and after the location selection. So 3 depths of filtering are possible.
What would be the ideal data structure, given that I want to avoid thousands of reads each time there's filtering?
Three root-level collections (service providers, price points, service categories?) with their relevant documents? I understand and accept denormalization for the purpose of my filtering.
Here's the wireframe for a better understanding of the filtering: [wireframe image not included]
The app lets professional service providers register and publish one or more of their services in pre-defined categories (Stay, Sports, Wellness...) and at pre-defined price points ($50, $75, $100...).
Since you're having pre-defined categories, prices, and locations, then the simplest solution for modeling such a database would be to have a single collection of products:
Firestore-root
   |
   --- products (collection)
         |
         --- $productId (document)
               |
               --- name: "Running Shoe"
               |
               --- category: "Sport"
               |
               --- price: 199
               |
               --- location: "Europe"
               |
               --- country: "France"
In this way, you can perform all the queries you need. Since you didn't specify a programming language, I'll write the queries in Java, but you can easily convert them into any other language. For example, to query all products with a particular price:
FirebaseFirestore db = FirebaseFirestore.getInstance();
Query queryByPrice = db.collection("products").whereEqualTo("price", 199);
If you need to query by price, category and location, then you have to chain multiple whereEqualTo() methods:
Query queryByPrice = db.collection("products")
.whereEqualTo("price", 199)
.whereEqualTo("category", "Sport")
.whereEqualTo("location", "Europe");
If, however, you need to order the results ascending or descending, don't forget to also create a composite index.
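For instance, an ordered variant might look like this (a sketch; combining an equality filter with orderBy() on a different field is what requires the composite index):

// All Sport products, most expensive first.
// Requires a composite index on (category ASC, price DESC).
Query orderedQuery = db.collection("products")
        .whereEqualTo("category", "Sport")
        .orderBy("price", Query.Direction.DESCENDING);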
What would be the ideal data structure, given that I want to avoid thousands of reads each time there's filtering?
If you don't need all the results at once, then you should implement pagination. If you need to know the number of products in, say, the Sports category ahead of time, that is not possible without performing a query and counting the available products. I have written an article on this topic called:
How to count the number of documents in a Firestore collection?
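Pagination itself is a matter of combining limit() with startAfter(). A minimal sketch (the page size of 15 is arbitrary):

// First page: 15 products at the selected price, ordered by name.
Query firstPage = db.collection("products")
        .whereEqualTo("price", 199)
        .orderBy("name")
        .limit(15);
firstPage.get().addOnSuccessListener(snapshots -> {
    if (snapshots.isEmpty()) return;
    DocumentSnapshot lastVisible =
            snapshots.getDocuments().get(snapshots.size() - 1);
    // Next page: continue after the last visible document,
    // using the same filter and ordering.
    Query nextPage = db.collection("products")
            .whereEqualTo("price", 199)
            .orderBy("name")
            .startAfter(lastVisible)
            .limit(15);
});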
Another feasible solution would be to create a single document that contains all those numbers: in other words, exactly what you're displaying to the users in those screenshots. This way, you'll only have to pay for a single read operation. Only when a user clicks on a particular category should you perform the actual search.
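A sketch of that single read (the counters document and its field names are illustrative assumptions; you would keep them up to date with Cloud Functions or batched writes):

// One read for everything the homepage needs to display.
db.collection("counters").document("homepage").get()
        .addOnSuccessListener(snapshot -> {
            Long sportsCount = snapshot.getLong("sports");     // hypothetical field
            Long wellnessCount = snapshot.getLong("wellness"); // hypothetical field
            // Render the homepage numbers from this single document.
        });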
I understand and accept denormalization for the purpose of my filtering.
In this case, there is no need to denormalize the data. For more info regarding this kind of operation, please check my answer below:
What is denormalization in Firebase Cloud Firestore?
I have users for my website that need to log in. In order to do that, I have to check the database for them, by email address or a hash of their email.
Some of my users have an online course in common.
Others are all on the same project.
There are multiple projects and courses.
How might I set up my table so that I can grab individual users, and efficiently query related groups of users?
I'm thinking...
PK = user#mysite
SK = user#email.com
projects = [1,2,3]
courses = [101,202,303]
I can get any user with a get on PK = user#mysite, SK = user#email.com.
But if I query, I have to filter on two attributes, and I feel like I'm no longer very efficient.
If I set up users like this on the other hand:
PK = user#email.com
SK = 1#2#3#101#202#303
projects = [1,2,3]
courses = [101,202,303]
Then I can get PK = user#email.com, and that's unique on its own.
And I can query SK contains 101, for example, if I want all the course 101 students.
But I have to maintain this weird #-delimited list of things in the SK string.
Am I thinking about this the right way?
You want to find items which possess a value in an attribute holding a list of values. So do I sometimes! But there is not an index for that.
You can, however, solve this by adding new items to the table.
Your main item would have the email address as both the PK and the SK. It includes attributes listing the courses and projects, and all the other metadata about that user.
For each course, you insert additional items where the course ID is the PK and the member emails are the various SKs in that item collection. Do the same for projects.
Given an email, you can find everything about that user with a GetItem. Given a course or project, you can find all matching emails with a Query against the course or project ID. Follow with a BatchGetItem if you then need the full data behind each email.
When someone adds or drops a course or project, you update the main item as well as add/remove the additional indexed items.
Should you want to query by course X and project Y, you can pull the matching results to the client and join them on email address client-side.
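To make the layout concrete, here's a sketch using the AWS SDK for Java v2; the table name and the course#/user# key formats are illustrative assumptions:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class CourseMembers {
    public static void main(String[] args) {
        // Item layout (illustrative):
        //   PK = user#a@email.com  SK = user#a@email.com  projects=[1,2,3], courses=[101,202,303]
        //   PK = course#101        SK = user#a@email.com  (membership item)
        //   PK = course#101        SK = user#b@email.com  (membership item)
        //   PK = project#1         SK = user#a@email.com  (membership item)
        DynamoDbClient ddb = DynamoDbClient.create();

        // All students in course 101: a single Query on the partition key.
        QueryRequest request = QueryRequest.builder()
                .tableName("app-table") // assumed table name
                .keyConditionExpression("PK = :pk")
                .expressionAttributeValues(
                        Map.of(":pk", AttributeValue.builder().s("course#101").build()))
                .build();
        QueryResponse response = ddb.query(request);
        response.items().forEach(item ->
                System.out.println(item.get("SK").s())); // each SK is a member email
    }
}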
In one of your designs you're proposing a contains against the SK; contains is not a supported operator in a key condition, so that design wouldn't work.
I can't figure out the best way to organize my database for my app:
My users can create items identified by a unique ID.
The queries I need :
- Query 1: Get all the items created by a user
- Query 2: From the UID of an item, get its creator
My database is organized as following :
Users database

user1 : {
    item1_uid,
    item2_uid
},
user2 : {
    item3_uid
}

Items database

item1_uid : {
    title,
    description
},
item2_uid : {
    title,
    description
},
item3_uid : {
    title,
    description
}
Query 1 is quite simple, but for query 2 I need to scan the whole users database and go through every item ID to find the one I'm looking for. It works right now, but I'm afraid the request time will grow as the database does.
Should I add a field with the user ID to the items data? If so, the query becomes simpler, but I've heard that I'm not supposed to store the same data twice in the database because it can lead to conflicts when adding or removing items.
Should I add a field with the user ID to the items data?
Yes, this is a very common approach in the NoSQL world and is called denormalization. Denormalization is described, in this "famous" post about NoSQL data modeling, as "copying of the same data into multiple documents in order to simplify/optimize query processing or to fit the user’s data into a particular data model". In other words, the main driver of your data model design is the queries you plan to execute.
More concretely, you could have an extra field in your item documents that contains the ID of the creator. You could even have another one with, e.g., the name of the creator; this way, in one query, you can display the items and their creators.
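In the question's own notation, a denormalized item might look like this (the field names are illustrative):

item1_uid : {
    title,
    description,
    creator_uid,      // ID of the user who created the item
    creator_name      // optional duplicate, for display without a second read
}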
Now, to keep these different documents in sync (for example, if a user changes their name, you want the change reflected in the corresponding items), you can either use a Batched Write to modify several documents in one atomic operation, or rely on one or more Cloud Functions that detect changes in the user documents and propagate them to the item documents.
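A minimal sketch of the batched-write option (Android/Java Firestore SDK; the collection and field names follow the example above and are assumptions):

FirebaseFirestore db = FirebaseFirestore.getInstance();
WriteBatch batch = db.batch();

// Update the user document and every item that duplicates the name,
// all in one atomic operation.
batch.update(db.collection("users").document("user1"), "name", "New Name");
batch.update(db.collection("items").document("item1_uid"), "creator_name", "New Name");
batch.update(db.collection("items").document("item2_uid"), "creator_name", "New Name");

batch.commit().addOnSuccessListener(aVoid -> {
    // All three writes were applied atomically.
});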
First of all thank you to anybody reading through this and offering any advice and help. It is much appreciated.
I'm developing a small custom CRM (ouch) for my father's business (specialty contractor), and I'm using Firestore for my database. It is supposed to be very lean, with not much "bling", but streamlined for his specialty contracting business, which makes it very hard to get any other custom CRM to fit his process. I have gotten quite far and have a decent-sized implementation, but am now running into some very fundamental issues as everything expands.
I admit that having experience only with relational databases (and not much of that, either) left me scratching my head a few times when setting up my database structure, and I'm running into some issues with Firestore. I'm also a fairly novice developer, and I feel I'm tackling something way out of my league (but there's not much turning around now, a year into this journey).
As of right now I'm using top-level collections for everything presented here. I recently started using sub-collections for some other minor features and began questioning whether I should apply that everywhere.
A big problem I foresee is that, because I want to query in a multitude of ways, I am already consuming almost 100 composite indexes. There is still a lot to add, so I need to reduce the number of composite indexes my current and future data structure needs.
So I am fairly certain that my data model is probably deeply flawed and needs to be improved/optimized/changed. (Which I don't mind doing if that's what it takes, but I'm lost on the "how".) I don't need a specific solution, just some general pointers on which approaches are available. I think I might be lacking an "aha" moment: once I understand a pattern, I can usually apply it elsewhere.
I will make my "Sales Leads" collection the central concern of this post, as it has the most variations of querying.
So I have a mostly top-level collection structure like this. I should preface that, besides writing the IDs to other documents, I also "stash" an entire Customer or Sales Rep object/document inside other documents, and I have Cloud Functions that iterate through certain documents when there are updates, etc. (This avoids extra reads: when I read a SalesLead, I don't need to read the SalesRep and Customer documents, as they are also stashed/nested within the SalesLead.)
| /sales_reps //SalesReps Collection
| /docId //Document ID
| + salesRepId (document id)
| + firstName
| + lastName
| + other employee/salesRep related info etc.
| /customers //Customers Collection
| /docId //Document ID
| + customerId (document id)
| + firstName
| + lastName
| + address, plus other customer-specific info such as contact details (phone, email), etc.
Logically, Sales Leads are of course linked to a Customer (one-to-many: one Customer can have many leads).
All the fields mentioned below need to be queryable and filterable:
| /sales_leads //SalesLeads Collection
| /docId //Document ID
| + customerId (document id) <- this is what I would query by to look for leads for a specific customer
| + salesRepId (document id) <- this is what I would query by to look for leads for a specific sales Rep
| + status <- (String: "Open", "Sold", "Lost", "On Hold")
| + progress <- (String: "Started", "Appointment scheduled", "Estimates created", etc.)
| + type <- (String: New Construction or Service/Repair)
| + jobType <- (String: different types of jobs relating to what type of structure is involved; 8-10 types right now)
| + reference <- (String: how the lead was referred to the company, i.e. Facebook, Google, etc.)
| + many other (non queryable) data related to a lead, but not relevant here...
SalesEstimates relate to leads in a one-to-many relationship (one lead can have many estimates). Estimates are not all that relevant to this discussion, but I wanted to include them anyhow. I query and filter estimates in a very similar way to leads (similar fields, etc.).
| /sales_estimates //SalesEstimates Collection
| /docId //Document ID
| + salesLeadId (document id) <- this is what I would query by to look for estimates for a specific lead
| + customerId (document id) <- this is what I would query by to look for estimates for a specific customer
| + salesRepId (document id) <- this is what I would query by to look for estimates for a specific sales Rep
| + specific sales Lead related data etc....
In my "Sales Lead List" on the client, I have some drop-down boxes as filters that contain values (i.e. Sales Reps) but also have an option/value "All" to negate any filtering.
So I would start assembling a query:
Query query = db.collection("sales_leads");

//Rep: typically only managers/supervisors can see "all leads"; for a
//sales rep this would be set to his own ID by default.
if (!salesRepFilter.equals("All")) {
    query = query.whereEqualTo("salesRepId", salesRepId);
}
//Lead Status (Open, Sold, Lost, On Hold)
if (!statusFilter.contains("All")) {
query = query.whereEqualTo("status", statusFilter);
}
//Lead Progress
if (!progressFilter.contains("All")) {
query = query.whereEqualTo("progress", progressFilter);
}
//Lead Type
if (!typeFilter.contains("All")) {
query = query.whereEqualTo("leadType", typeFilter);
}
//Job Type
if (!jobTypeFilter.contains("All")) {
query = query.whereArrayContains("jobTypes", jobTypeFilter);
}
//Reference
if (!referenceFilter.contains("All")) {
query = query.whereEqualTo("reference", referenceFilter);
}
Additionally, I might want to reduce the whole query to a single customer (this typically means all other filters are skipped and all leads for this customer are shown). This happens when the user opens the customer page/details and clicks on something like "Show Leads for this Customer".
//Filter by Customer (when entering my SalesLead List from a Customer Card/Page where user clicked on "Show Leads for this Customer")
if (filterByCustomer) {
query = query.whereEqualTo("customerId", customerFilter);
}
//And lastly I want to be able to query by date range (when the lead was created) and sort by "oldest" or "newest"

//Date Range
query = query.whereGreaterThan("leadCreatedOnDate", filterFromDate)
             .whereLessThan("leadCreatedOnDate", filterToDate);
//Sort Newest vs Oldest
if (sortByNewest) { //either newest or oldest
query = query.orderBy("leadCreatedOnDate", Query.Direction.ASCENDING);
} else {
query = query.orderBy("leadCreatedOnDate", Query.Direction.DESCENDING);
}
And that completes my query on sales leads. It all works great right now, but I am anxious about going forward and ultimately hitting the composite index limitation. I don't have an exact number, but I am probably using 25-30 composite indexes just for my sales_leads collection. (Yikes!)
Not only are there many fields to query by; the number of composite indexes required is multiplied by the combinations of filters that can be set. (Ugh.)
I need to be able to query all leads and then filter them by the fields mentioned above (when describing my sales_leads collection).
So instead of keeping all of these as top-level collections, I'm guessing I should restructure my database using sub-collections, but I've tried modeling this with different approaches and always seem to hit a wall.
I suppose I could make "sales_leads" a sub-collection under each customer document and use a collection group query to retrieve "all leads", but those require composite indexes too, right? So it would just be a trade-off for that one searchable field. (..hits wall..)
Sorry for the length. I hope it's readable. I appreciate any help, feedback and input. I'm in a very anxious and frustrated position.
If this doesn't work, I might need to consider professional consultation.
Thanks!
Here are a few things I think will help you.
First, watch AWS re:Invent 2018: Amazon DynamoDB Deep Dive on YouTube. It's about DynamoDB, but DynamoDB is a NoSQL database very similar to Firestore, and the concepts apply universally. Midway through the video, Rick uses a company like yours as an example, and you may be surprised to see how effectively he can reduce query count simply through data modeling.
Second, familiarize yourself with Firestore's index merging. In situations like yours, it may be better to create your composite indexes manually, or at least to audit them manually, because Firestore's automatic index creation doesn't guarantee the most efficient set of composite indexes. Remember, composite indexes are created automatically based on the queries you execute, and if a later query could be served better by replacing an earlier index, Firestore will not go back and delete the old one for you; you have to.
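To illustrate index merging (a sketch; the field names follow the sales_leads example, and whether merging applies depends on your exact queries): Firestore can serve a query that combines several equality filters with a single orderBy by merging smaller composite indexes that share that orderBy, so you might maintain one small index per filter field instead of one large index per filter combination. In firestore.indexes.json that could look like:

{
  "indexes": [
    {
      "collectionGroup": "sales_leads",
      "queryScope": "COLLECTION",
      "fields": [
        { "fieldPath": "status", "order": "ASCENDING" },
        { "fieldPath": "leadCreatedOnDate", "order": "ASCENDING" }
      ]
    },
    {
      "collectionGroup": "sales_leads",
      "queryScope": "COLLECTION",
      "fields": [
        { "fieldPath": "salesRepId", "order": "ASCENDING" },
        { "fieldPath": "leadCreatedOnDate", "order": "ASCENDING" }
      ]
    }
  ]
}

With those two indexes, a query filtering on both status and salesRepId and ordering by leadCreatedOnDate can be answered by merging them, rather than by a third three-field index.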
I'm highly suspicious of the claim that the sales-lead query consumes 25-30 composite indexes; that number seems far too high to me given how many fields in the documents are indexed. Before you do anything (after having watched the video and studied index merging, of course), I'd focus entirely on this collection. You must be completely certain of the maximum number of composite indexes this collection needs. Perhaps create a dummy collection and experiment with index merging to really understand how it works, because this alone may solve all of your problems. I would be shocked if Firestore couldn't handle your company's use case.
Third, don't be afraid to denormalize your data. The fundamental premise of NoSQL really is denormalization: data storage should be your least concern, and computation/operations your greatest. If you can reduce your query count by duplicating data across multiple documents in multiple collections, that is simply what you must do if the alternative is hitting 200 composite indexes.
I have a data model like this:
Person node
Email node
OWNS relationship
LISTS relationship
KNOWS relationship
Each Person can OWN one Email and LISTS multiple Emails (like a contact list; assume 200 contacts per Person).
The query I am trying to perform finds all the Persons that OWN an Email that a contact LISTS, and creates a KNOWS relationship between them.
MATCH (n:Person {uid:'123'})-[r1:LISTS]->(m:Email)<-[r2:OWNS]-(l:Person)
CREATE UNIQUE (n)-[:KNOWS]->(l)
The counts of my current database is as follows:
Number of Person nodes: 10948
Number of Email nodes: 1951481
Number of OWNS rels: 21882
Number of LISTS rels: 4376340 (Each Person has 200 unique LISTS rels)
Now my problem is that running this query on the current database takes between 4.3 and 4.8 seconds, which is unacceptable for my needs. I wanted to know if this is normal timing considering my data model, or if I am doing something wrong with the query (or even the model).
Any help would be much appreciated. Also, if this is normal for Neo4j, please feel free to suggest other graph databases that can handle this kind of model better.
Thank you very much in advance
UPDATE:
My query is: profile match (n: {uid: '4692'}) -[:LISTS]-> (:Email) <-[:OWNS]- (l) create unique (n)-[r:KNOWS]->(l)
The PROFILE command on my query returns this:
Cypher version: CYPHER 2.2, planner: RULE. 3919222 total db hits in 2713 ms.
Yes, 4.5 seconds to match one person from index along with its <=100 listed email addresses and merging a relationship from user to the single owner of each email, is slow.
The first thing is to make sure you have an index on the uid property for nodes with the :Person label. Check your indexes with the schema command and, if it's missing, create it with CREATE INDEX ON :Person(uid).
Secondly, CREATE UNIQUE may or may not do the work fine, but you will want to use MERGE instead. CREATE UNIQUE is deprecated, and though the two are sometimes equivalent, the operation you want is expressed with MERGE.
Thirdly, to find out why the query is slow you can profile it:
PROFILE
MATCH (n:Person {uid:'123'})-[:LISTS]->(m:Email)<-[:OWNS]-(l:Person)
MERGE (n)-[:KNOWS]->(l)
See 1, 2 for details. You may also want to profile your query while forcing the use of either the cost-based or the rule-based query planner, to compare their plans.
CYPHER planner=cost
PROFILE
MATCH (n:Person {uid:'123'})-[:LISTS]->(m:Email)<-[:OWNS]-(l:Person)
MERGE (n)-[:KNOWS]->(l)
With these you can hopefully find and correct the problem, or update your question with the information to help others help you find it.