This is my first Firestore – and NoSQL – project, and I'm struggling with modeling my data.
I have a number of objects (in the order of 500 to 1000) that travel physically around the globe. They periodically (about once a day) check in to send their geolocation along with some extra data.
In other words, there are a thousand streams of slowly accumulating tracking data.
How do I best structure my data to optimize for the following query?
For each of the thousand objects, give me the last N tracking locations, sorted from newest to oldest. I assume N to be around 100 to 300.
EDIT: To clarify, this would return about 1000 x (100 to 300) tracking locations. Can this be accomplished without 1000 queries (i.e. one for each of the objects)?
The following database structure should work for your use case.
Firestore-root
|
--- drivers (collection)
| |
| --- driverId (document)
| |
| --- //other driver details
|
--- data (collection)
| |
| --- driverId (document)
| |
| --- driverData (collection)
| |
| --- driverDataId (document) //Same object as below
| |
| --- geoPoint: [[48.858376° N, 2.294537° E]]
| |
| --- date: Oct 11, 2018 at 6:16:58 PM UTC+3
| |
| --- driverId: "DriverUserId"
| |
| --- //other extra data
|
--- allData (collection)
|
--- driverDataId (document) //Same object as above
|
--- geoPoint: [[48.858376° N, 2.294537° E]]
|
--- date: Oct 11, 2018 at 6:16:58 PM UTC+3
|
--- driverId: "DriverUserId"
|
--- //other extra data
They periodically (about once a day) check in to send their geolocation along with some extra data.
Assuming that you have a model class for the data that each driver sends once a day, the object should be written to the database in two different locations:
data (collection) -> driverId (document) -> driverData (collection) -> driverDataId (document)
and
allData (collection) -> driverDataId (document)
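For example, something along these lines on Android (a rough sketch; the field names follow the schema above, the id value is just an example, and a batched write keeps both copies consistent):
FirebaseFirestore rootRef = FirebaseFirestore.getInstance();
String driverId = "DriverUserId";

// Use one generated id for both copies so they stay linked.
String driverDataId = rootRef.collection("allData").document().getId();

Map<String, Object> driverData = new HashMap<>();
driverData.put("geoPoint", new GeoPoint(48.858376, 2.294537));
driverData.put("date", FieldValue.serverTimestamp());
driverData.put("driverId", driverId);
// ... other extra data

WriteBatch batch = rootRef.batch();
batch.set(rootRef.collection("data").document(driverId)
        .collection("driverData").document(driverDataId), driverData);
batch.set(rootRef.collection("allData").document(driverDataId), driverData);
batch.commit();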
For all objects give me the last N tracking locations, sorted from newest to oldest.
To get all of those objects, a query like this is needed:
FirebaseFirestore rootRef = FirebaseFirestore.getInstance();
CollectionReference allDataRef = rootRef.collection("allData");
Query query = allDataRef.orderBy("date", Query.Direction.DESCENDING).limit(n);
If you also want the driver details, you need to make an extra get() call. You can achieve this using the driverId that exists as a property within the driver data object.
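For example, a minimal sketch, assuming driverDataDocument is one of the DocumentSnapshot objects returned by the query above:
String driverId = driverDataDocument.getString("driverId");
rootRef.collection("drivers").document(driverId)
        .get()
        .addOnSuccessListener(driverSnapshot -> {
            // Use the driver details here, e.g. driverSnapshot.getString("name").
        });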
If you want to get all those objects from a single driver, you should use the following query:
FirebaseFirestore rootRef = FirebaseFirestore.getInstance();
CollectionReference driverDataRef = rootRef.collection("data").document(driverId).collection("driverData");
Query query = driverDataRef.orderBy("date", Query.Direction.DESCENDING).limit(n);
This practice is called denormalization, and it is a common practice when it comes to Firebase. For a better understanding, I recommend you watch this video, Denormalization is normal with the Firebase Database. It is about the Firebase Realtime Database, but the same principles apply to Cloud Firestore.
Also, when you are duplicating data, there is one thing you need to keep in mind: in the same way you add the data, you need to maintain it. In other words, if you want to update/delete an item, you need to do it in every place where it exists.
Edit:
According to your comment, I now understand what you mean. In this case you can consider the allData collection a feed, to which you add driver data objects. Let's say that n = 100. This means that every time you add a new object beyond the 100th, you need to delete the oldest one, which implies an extra delete operation. In this way you'll keep only 100 objects per user in that feed. And yes, if you have 1000 users and every user has 100 data objects, you'll need to query a collection that has 100k documents. So if you want to read all that data at once, 100k reads will be performed.
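A rough sketch of that trimming step, reusing rootRef and driverId from above and assuming a feed size of 100 per user (the whereEqualTo() + orderBy() combination requires a composite index):
rootRef.collection("allData")
        .whereEqualTo("driverId", driverId)
        .orderBy("date", Query.Direction.DESCENDING)
        .limit(101) // feed size + 1
        .get()
        .addOnSuccessListener(querySnapshot -> {
            List<DocumentSnapshot> docs = querySnapshot.getDocuments();
            if (docs.size() > 100) {
                // The last document in the result set is the oldest one.
                docs.get(docs.size() - 1).getReference().delete();
            }
        });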
Edit2:
There is another schema I can think of, but it requires some testing, because I don't know how big your driver data objects can be. Please see the schema below:
Firestore-root
|
--- drivers (collection)
|
--- driverId (document)
|
--- //other driver details
|
--- driverData (map)
|
--- driverDataId (document) //Same object as below
|
--- geoPoint: [[48.858376° N, 2.294537° E]]
|
--- date: Oct 11, 2018 at 6:16:58 PM UTC+3
|
--- driverId: "DriverUserId"
|
--- //other extra data
As you can see, I have changed the driverData collection into a map within the driver object. In this case, you should also maintain only those 100 objects within the map. With this schema, only 1000 document reads are needed, and they can return 100k driver data objects. But pay attention: documents have limits on how much data you can put into them. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB of data in a single document. When it comes to storing text, you can store quite a lot, but as your map of objects gets bigger, be careful about this limitation.
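If you go this route, adding one object to the map is a single dot-path update on the driver document; a minimal sketch, reusing the names from above (to drop the oldest entry you would update its path with FieldValue.delete()):
Map<String, Object> driverData = new HashMap<>();
driverData.put("geoPoint", new GeoPoint(48.858376, 2.294537));
driverData.put("date", FieldValue.serverTimestamp());
driverData.put("driverId", driverId);

rootRef.collection("drivers").document(driverId)
        .update("driverData." + driverDataId, driverData);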
Related
I've seen older posts around this but hoping to bring this topic up again. I have a table in DynamoDB that has a UUID for the primary key and I created a global secondary index (GSI) for a more business-friendly key. For example:
| account_id  | email           | first_name | last_name |
|-------------|-----------------|------------|-----------|
| 4f9cb231... | linda@gmail.com | Linda      | James     |
| a0302e59... | bruce@gmail.com | Bruce      | Thomas    |
| 3e0c1dde... | harry@gmail.com | Harry      | Styles    |
If account_id is my primary key and email is my GSI, how do I query the table to get accounts with email in ('linda@gmail.com', 'harry@gmail.com')? I looked at the IN conditional expression but it doesn't appear to work with a GSI. I'm using the Go SDK v2 library but will take any guidance. Thanks.
Short answer, you can't.
DDB is designed to return a single item, via GetItem(), or a set of related items, via Query(). Related meaning that you're using a composite primary key (hash key & sort key) and the related items all have the same hash key (aka partition key).
Another way to think of it, you can't Query() a DDB Table/index. You can only Query() a specific partition in a table or index.
Scan() is the only operation that works across partitions in one shot. But scanning is very inefficient and costly since it reads the entire table every time.
You'll need to issue a GetItem() for every email you want returned.
Luckily, DDB now offers BatchGetItem(), which allows you to send multiple GetItem() requests, up to 100, in a single call. It saves a little bit of network time and automatically runs the requests in parallel, but otherwise it is little different from what your application could do itself directly with GetItem(). Make no mistake, BatchGetItem() is making individual GetItem() requests behind the scenes. In fact, the requests in a BatchGetItem() don't even have to be against the same tables/indexes. The cost for each request in a batch will be the same as if you'd used GetItem() directly.
One difference to make note of: BatchGetItem() can only return 16 MB of data. So if your DDB items are large, you may not get as many returned as you requested.
For example, if you ask to retrieve 100 items, but each individual item is 300 KB in size, the system returns 52 items (so as not to exceed the 16 MB limit). It also returns an appropriate UnprocessedKeys value so you can get the next page of results. If desired, your application can include its own logic to assemble the pages of results into one dataset.
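For completeness, a rough sketch of the BatchGetItem() mechanics with the AWS SDK for Java v2 (the question mentions the Go SDK, but the call shape is the same). Note that the keys must be the table's primary key (account_id here), not the GSI attribute, and the table name and id values below are placeholders:
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.*;

DynamoDbClient ddb = DynamoDbClient.create();

Map<String, KeysAndAttributes> requestItems = Map.of(
        "accounts", // placeholder table name
        KeysAndAttributes.builder()
                .keys(List.of(
                        Map.of("account_id", AttributeValue.builder().s("ACCOUNT-ID-1").build()),
                        Map.of("account_id", AttributeValue.builder().s("ACCOUNT-ID-2").build())))
                .build());

BatchGetItemResponse response = ddb.batchGetItem(
        BatchGetItemRequest.builder().requestItems(requestItems).build());

response.responses().get("accounts").forEach(System.out::println);
// response.unprocessedKeys() holds anything that didn't fit in the 16 MB limit.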
Because you have a GSI with a PK of email (from what I understand), you can use a PartiQL command to get your batch of emails back. The API is called ExecuteStatement and you use a SQL-like syntax:
SELECT * FROM "mytable"."myindex" WHERE email IN ['email@email.com','email1@email.com']
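With the AWS SDK for Java v2, for example, that statement can be run through ExecuteStatement roughly like this (table and index names are placeholders):
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.*;

DynamoDbClient ddb = DynamoDbClient.create();

ExecuteStatementResponse response = ddb.executeStatement(
        ExecuteStatementRequest.builder()
                .statement("SELECT * FROM \"mytable\".\"myindex\" "
                        + "WHERE email IN ['linda@gmail.com','harry@gmail.com']")
                .build());

response.items().forEach(System.out::println);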
I am trying to find what's causing the higher RU usage on the Cosmos DB. I enabled the Log Analytics on the Doc DB and ran the below Kusto query to get the RU consumption by Collection Name.
AzureDiagnostics
| where TimeGenerated >= ago(24hr)
| where Category == "DataPlaneRequests"
| summarize ConsumedRUsPer15Minute = sum(todouble(requestCharge_s)) by collectionName_s, _ResourceId, bin(TimeGenerated, 15m)
| project TimeGenerated , ConsumedRUsPer15Minute , collectionName_s, _ResourceId
| render timechart
We have only one collection on the DocDB account (prd-entities), which is represented by the red line in the chart. I am not able to figure out what the blue line represents.
Is there a way to get more details about the RU usage with the empty collection name (i.e., the blue line)?
I'm not sure, but I don't think an empty collection actually costs RUs.
Per my testing on my side, when I execute your Kusto query I also get the 'empty collection', but when I look at the line details, I find that all of those rows correspond to operations that actually exist. What I mean here is that we shouldn't summarize by collectionName_s, especially since you only have one collection in total; you may try to use requestResourceId_s instead.
When using requestResourceId_s, there are still some rows that have no id, but they cost 0 RUs.
AzureDiagnostics
| where TimeGenerated >= ago(24hr)
| where Category == "DataPlaneRequests"
| summarize ConsumedRUsPer15Minute = sum(todouble(requestCharge_s)) by requestResourceId_s, bin(TimeGenerated, 15m)
| project TimeGenerated , ConsumedRUsPer15Minute , requestResourceId_s
| render timechart
Actually, you can check which operations the requestCharge_s values come from: look at the details in Results (not in Chart) and order by collectionName_s. You'll then see the requests that produce the 'empty collection' entries, and you can judge whether those requests exist in your collection.
How does one return a list of unique users from a DynamoDB table with the following (simplified) schema? Does it require a GSI? This is for an app with a small number of users, and I can think of ways that will work for my needs without creating a GSI (like scanning and filtering on SK, or creating a new item with a list of user ids inside). But what is the scalable solution?
| pk      | sk                     | amount | balance |
|---------|------------------------|--------|---------|
| "user1" | "2021-01-01T12:00:00Z" | 7      |         |
| "user1" | "2021-01-03T12:00:00Z" | 5      |         |
| "user2" | "2021-01-01T12:00:00Z" | 3      |         |
| "user2" | "2021-01-03T12:00:00Z" | 2      |         |
| "user1" | "user1"                |        | 12      |
| "user2" | "user2"                |        | 5       |
Your data model isn't designed to fetch all unique users efficiently.
You certainly could use a scan operation and filter with your current data model, but that is inefficient.
If you want to fetch all users in a single query, you'll need to get all user information into a single partition. As you've identified, you could do this with a GSI. You could also re-organize your data model to accommodate this access pattern.
For example, you mentioned that the application has a small number of users. If the number of users is small enough, you could create a partition that stores a list of all users (e.g. PK=USERS). If you can keep that item under DynamoDB's 400 KB item size limit, that may be a viable solution.
The idiomatic solution is to create a global secondary index.
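As a rough sketch of that index: stamp each user's metadata item (the row where pk equals sk) with a constant attribute such as entity_type = "USER", create a GSI partitioned on that attribute (called user-index here; both names are assumptions), and query the index to list every user:
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.*;

DynamoDbClient ddb = DynamoDbClient.create();

QueryResponse response = ddb.query(QueryRequest.builder()
        .tableName("balances")      // assumed table name
        .indexName("user-index")    // assumed GSI name
        .keyConditionExpression("entity_type = :u")
        .expressionAttributeValues(
                Map.of(":u", AttributeValue.builder().s("USER").build()))
        .build());

// Each returned item carries the table keys, so pk is the user id.
response.items().forEach(item -> System.out.println(item.get("pk").s()));
Since the GSI partition key is a constant, all user items land in one index partition; that is fine for a small user base like yours, but it is something to watch as the number of users grows.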
I'm just creating an Instagram clone app for testing.
My data structure is below:
--- users (root collection)
|
--- uid (one of documents)
|
--- name: "name"
|
--- email: "email#email.com"
|
--- following (sub collection)
| |
| --- uid (one of documents)
| |
| --- customUserId : "blahblah"
| |
| --- name : "name"
| |
| --- pictureStorageUrl : "https://~~"
|
--- followers (sub collection)
| |
| --- uid (one of documents)
| |
| --- customUserId : "blahblah"
| |
| --- name : "name"
| |
| --- pictureStorageUrl : "https://~~"
|
Assume user A has 1 million followers. If user A edits their picture, name, or customUserId, should the documents in each of those 1 million followers' "following" subcollections be modified?
Should there be 1 million updates? Is there a more efficient approach? And if there is no other good way, is it appropriate to batch the data modifications through a Cloud Functions database trigger?
Should the documents in each of those 1 million followers' "following" subcollections be modified? Should there be 1 million updates?
That's entirely up to you to decide. If you don't want to update them, then don't. But if you want the data to stay in sync, then you will have to find and update all of the documents where that data is copied.
Is there a more efficient approach?
To update 1 million documents? No. If you have 1 million documents to update, then you will have to find and update them each individually.
And if there is no other good way, is it appropriate to batch the data modifications through a Cloud Functions database trigger?
Doing the updates in Cloud Functions still costs 1 million updates. There aren't any shortcuts to this work - it's the same on both the frontend and the backend. Cloud Functions will just let you trigger that work to happen on the backend automatically.
If you want to avoid 1 million updates, then you should instead not copy the data 1 million times. Just store a UID, and do a second query to look up information about that user.
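A minimal sketch of that approach (myUid is assumed to be the signed-in user's uid): the "following" documents hold only the followed uid as their document id, and the profile fields are read from /users/{uid} when needed:
FirebaseFirestore db = FirebaseFirestore.getInstance();

db.collection("users").document(myUid)
        .collection("following")
        .get()
        .addOnSuccessListener(querySnapshot -> {
            for (DocumentSnapshot followingDoc : querySnapshot.getDocuments()) {
                String followedUid = followingDoc.getId();
                db.collection("users").document(followedUid).get()
                        .addOnSuccessListener(userDoc -> {
                            String name = userDoc.getString("name");
                            // Render the profile; edits to /users/{uid} need no fan-out.
                        });
            }
        });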
I was planning to use a Dynamo table as a sort of replication log, so I have a table that looks like this:
+--------------+--------+--------+
| Sequence Num | Action | Thing |
+--------------+--------+--------+
| 0 | ADD | Thing1 |
| 1 | DEL | Thing1 |
| 2 | ADD | Thing2 |
+--------------+--------+--------+
Each of my processes keeps track of the last sequence number it read. Then on an interval it issues a Scan against the table with ExclusiveStartKey set to that sequence number. I assumed this would result in reading everything after that sequence, but instead I am seeing inconsistent results.
For example, given the table above, if I do a Scan(ExclusiveStartKey=1), I get zero results when I am expecting to see the 3rd row (seq=2).
I have a feeling it has to do with the internal hashing DynamoDB uses to partition the items and that I am misusing the ExclusiveStartKey option.
Is this the wrong tool for the job?
Alternatively, each process could issue a Query for seq+1 on each interval (looping if anything was found), which would consume the same read throughput, but would require N API calls instead of the roughly N/1MB calls I would get with a Scan.
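Roughly what I have in mind, sketched with the AWS SDK for Java v2 (table and attribute names are placeholders, and the "Query for seq+1" idea boils down to a keyed GetItem per sequence number):
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.*;

DynamoDbClient ddb = DynamoDbClient.create();
long lastSeenSeq = 1; // last sequence number this process has applied (example value)

while (true) {
    GetItemResponse response = ddb.getItem(GetItemRequest.builder()
            .tableName("replication-log") // placeholder table name
            .key(Map.of("SequenceNum",
                    AttributeValue.builder().n(Long.toString(lastSeenSeq + 1)).build()))
            .build());
    if (!response.hasItem()) {
        break; // caught up; try again on the next interval
    }
    // apply response.item() to local state here ...
    lastSeenSeq++;
}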
When you do a DynamoDB Scan operation, it does not proceed in sorted order of the hash key; items come back in the order of DynamoDB's internal hash of the partition key. So using ExclusiveStartKey does not allow you to start from an arbitrary point in your sequence.
For this example table with the Sequence ID, what I want can be accomplished with a Kinesis stream.