I'm building an Instagram clone app for testing purposes.
My data structure is below:
--- users (root collection)
|
--- uid (one of documents)
|
--- name: "name"
|
--- email: "email@email.com"
|
--- following (sub collection)
| |
| --- uid (one of documents)
| |
| --- customUserId : "blahblah"
| |
| --- name : "name"
| |
| --- pictureStorageUrl : "https://~~"
|
--- followers (sub collection)
| |
| --- uid (one of documents)
| |
| --- customUserId : "blahblah"
| |
| --- name : "name"
| |
| --- pictureStorageUrl : "https://~~"
|
Assume user A has 1 million followers. If user A edits their picture, name, or customUserId, does the corresponding document in the "following" subcollection of each of those 1 million followers need to be modified?
Would that mean 1 million updates? Is there a more efficient approach? And if there is no better way, is it appropriate to batch the data modification through a Cloud Functions database trigger?
Does the corresponding document in the "following" subcollection of each of those 1 million followers need to be modified? Would that mean 1 million updates?
That's entirely up to you to decide. If you don't want to update them, then don't. But if you want the data to stay in sync, then you will have to find and update all of the documents where that data is copied.
Is there a more efficient approach?
To update 1 million documents? No. If you have 1 million documents to update, then you will have to find and update them each individually.
And if there is no better way, is it appropriate to batch the data modification through a Cloud Functions database trigger?
Doing the updates in Cloud Functions still costs 1 million updates. There aren't any shortcuts to this work - it's the same on both the frontend and the backend. Cloud Functions will just let you trigger that work to happen on the backend automatically.
If you want to avoid 1 million updates, then you should instead not copy the data 1 million times. Just store a UID, and do a second query to look up information about that user.
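For illustration, here is a rough sketch of that UID-only approach using the Android SDK. It assumes the profile fields (name, customUserId, pictureStorageUrl) live only on the user's root document; myUid and followedUid are placeholder variables:
FirebaseFirestore db = FirebaseFirestore.getInstance();

// The "following" document stores nothing that can go stale; its ID is the followed user's uid.
db.collection("users").document(myUid)
        .collection("following").document(followedUid)
        .set(Collections.singletonMap("uid", followedUid));

// When rendering the following list, resolve each uid with a second read of that profile.
db.collection("users").document(followedUid).get()
        .addOnSuccessListener(snapshot -> {
            String name = snapshot.getString("name");
            String customUserId = snapshot.getString("customUserId");
            String pictureStorageUrl = snapshot.getString("pictureStorageUrl");
            // A profile edit now touches exactly one document instead of one per follower.
        });
With this shape, an edit to user A's profile is a single document write, at the cost of one extra read per entry when displaying a following list.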
Related
What is a solid DynamoDB access pattern for storing data from a bunch of receipts of identical format? I would use SQL for maximum flexibility on more advanced analytics, but as a learning exercise I want to see how far one can go with DynamoDB here. For starters, I'd like to query for aggregate overall and per-product spending for a given time range, track product price history, sort receipts by total, and things along those lines. But I also want it to be as flexible as possible for future queries I haven't thought of yet. Would something like this, plus some GSIs, work?
----------------------------------------------------------------------------------------------------------
| pk          | sk                     | unit $ | qty | total $ | receipt total | items
----------------------------------------------------------------------------------------------------------
| "product a" | "2021-01-01T12:00:00Z" | 2      | 2   | 4       |               |
| "product b" | "2021-01-01T12:00:00Z" | 2      | 3   | 6       |               |
| "receipt"   | "2021-01-01T12:00:00Z" |        |     |         | 10            | array of above item data
| "product a" | "2021-01-02T12:00:00Z" | 1.75   | 3   | 5.25    |               |
| "product c" | "2021-01-02T12:00:00Z" | 2      | 2   | 4       |               |
| "receipt"   | "2021-01-02T12:00:00Z" |        |     |         | 9.25          | array of above item data
----------------------------------------------------------------------------------------------------------
You have to decide your access patterns and build the DynamoDB design off of them, not the other way around. No one outside your team/product can tell you what your access patterns are; that depends entirely on your product's needs.
You have to ask: what pieces of information do you have, and what do you need to retrieve when you have them? Then decide which queries will be run most often and craft your PK/SK combinations around those. If you can't fit all your queries into just one or two pieces of information, you may want to set up an index, but indexes should generally be reserved for less frequently accessed queries.
If you need to, it's also accepted practice to store the same information twice, in two items in the table, since one extra write is often cheaper than doing multiple reads later (a write is roughly one WCU per item, while a query or scan can consume multiple RCUs even if you only need part of the result). Also, because indexes are replications of the table, there is a chance of reading stale data if you write and read too quickly or touch the same item in parallel calls.
Take your time now to sit down and consider everything your app will need to query DynamoDB for. The more you can figure out now, the better, and if you can set your PK to something that will almost always be available to the calling function, you will be in a much better state.
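To make the access-pattern discussion concrete, here is a rough sketch of one of the queries from the question (per-product spending over a time range) using the AWS SDK for Java v2. The table name Receipts and the attribute name total are assumptions for illustration, not part of the proposed design:
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import java.util.Map;

public class ProductSpendQuery {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // Query every "product a" item whose sort key (the ISO-8601 timestamp) falls in January 2021.
        QueryRequest request = QueryRequest.builder()
                .tableName("Receipts")  // assumed table name
                .keyConditionExpression("pk = :p AND sk BETWEEN :from AND :to")
                .expressionAttributeValues(Map.of(
                        ":p", AttributeValue.builder().s("product a").build(),
                        ":from", AttributeValue.builder().s("2021-01-01T00:00:00Z").build(),
                        ":to", AttributeValue.builder().s("2021-01-31T23:59:59Z").build()))
                .build();

        double spend = 0;
        for (Map<String, AttributeValue> item : ddb.query(request).items()) {
            spend += Double.parseDouble(item.get("total").n());  // assumed name for the "total $" attribute
        }
        System.out.println("January spend on product a: " + spend);
    }
}
Because pk is the product and sk is the timestamp, this pattern stays a single Query per product with no scan.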
How does one return a list of unique users from a DynamoDB table with the following (simplified) schema? Does it require a GSI? This is for an app with a small number of users, and I can think of ways that will work for my needs without creating a GSI (like scanning and filtering on SK, or creating a new item with a list of user IDs inside). But what is the scalable solution?
-----------------------------------------------------
| pk      | sk                     | amount | balance
-----------------------------------------------------
| "user1" | "2021-01-01T12:00:00Z" | 7      |
| "user1" | "2021-01-03T12:00:00Z" | 5      |
| "user2" | "2021-01-01T12:00:00Z" | 3      |
| "user2" | "2021-01-03T12:00:00Z" | 2      |
| "user1" | "user1"                |        | 12
| "user2" | "user2"                |        | 5
Your data model isn't designed to fetch all unique users efficiently.
You certainly could use a scan operation and filter with your current data model, but that is inefficient.
If you want to fetch all users in a single query, you'll need to get all user information into a single partition. As you've identified, you could do this with a GSI. You could also re-organize your data model to accommodate this access pattern.
For example, you mentioned that the application has a small number of users. If the number of users is small enough, you could create a partition that stores a list of all users (e.g. PK=USERS). If you can keep that item under the 400 KB item size limit, that may be a viable solution.
The idiomatic solution is to create a global secondary index.
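For illustration, a sketch of the GSI route: give each per-user summary item (the rows where pk equals sk) a constant attribute such as entityType = "USER" and back it with a sparse GSI keyed on that attribute. The table, index, and attribute names below are assumptions, using the AWS SDK for Java v2:
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import java.util.Map;

public class ListUniqueUsers {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // One query against the sparse GSI returns exactly the per-user summary items.
        QueryRequest request = QueryRequest.builder()
                .tableName("Balances")          // assumed table name
                .indexName("entityType-index")  // assumed GSI name
                .keyConditionExpression("entityType = :t")
                .expressionAttributeValues(Map.of(
                        ":t", AttributeValue.builder().s("USER").build()))
                .build();

        ddb.query(request).items()
           .forEach(item -> System.out.println(item.get("pk").s()));
    }
}
Because only the summary items carry the entityType attribute, the index stays small and a single Query returns one item per user.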
This is my first Firestore – and NoSQL – project, and I'm struggling with modeling my data.
I have a number of objects (in the order of 500 to 1000) that travel physically around the globe. They periodically (about once a day) check in to send their geolocation along with some extra data.
In other words, there are a thousand streams of slowly accumulating tracking data.
How do I best structure my data to optimize for the following query?
For each of the thousand objects, give me the last N tracking locations, sorted from newest to oldest. I assume N to be around 100 to 300.
EDIT: To clarify, this would return about 1000 x (100 to 300) tracking locations. Can this be accomplished without 1000 queries (i.e. one for each of the objects)?
The following database structure should work for your use-case.
Firestore-root
|
--- drivers (collection)
| |
| --- driverId (document)
| |
| --- //other driver details
|
--- data (collection)
| |
| --- driverId (document)
| |
| --- driverData (collection)
| |
| --- driverDataId (document) //Same object as below
| |
| --- geoPoint: [[48.858376° N, 2.294537° E]]
| |
| --- date: Oct 11, 2018 at 6:16:58 PM UTC+3
| |
| --- driverId: "DriverUserId"
| |
| --- //other extra data
|
--- allData (collection)
|
--- driverDataId (document) //Same object as above
|
--- geoPoint: [[48.858376° N, 2.294537° E]]
|
--- date: Oct 11, 2018 at 6:16:58 PM UTC+3
|
--- driverId: "DriverUserId"
|
--- //other extra data
They periodically (about once a day) check in to send their geolocation along with some extra data.
Assuming that you have a model class for the data that the driver sends once a day, the object should be written to the database in two different locations:
data (collection) -> driverId (document) -> driverData (collection) -> driverDataId (document)
and
allData (collection) -> driverDataId (document)
For all objects give me the last N tracking locations, sorted from newest to oldest.
To get all of those objects, a query like this is needed:
FirebaseFirestore rootRef = FirebaseFirestore.getInstance();
CollectionReference allDataRef = rootRef.collection("allData");
// Newest first, limited to the most recent n locations
Query query = allDataRef.orderBy("date", Query.Direction.DESCENDING).limit(n);
If you also want the driver details, you need to make an extra get() call. You can achieve this using the driverId that exists as a property on the driver data object, as sketched below.
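A rough sketch of that extra call, assuming dataSnapshot is one of the documents returned by the query above and the field read at the end is illustrative:
// dataSnapshot is one DocumentSnapshot from the allData query results
String driverId = dataSnapshot.getString("driverId");
rootRef.collection("drivers").document(driverId).get()
        .addOnSuccessListener(driverSnapshot -> {
            // read the driver details here, e.g. driverSnapshot.getString("name")
        });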
If you want to get those objects for a single driver, you should use the following query:
FirebaseFirestore rootRef = FirebaseFirestore.getInstance();
CollectionReference driverDataRef = rootRef.collection("data").document(driverId).collection("driverData");
// Newest first for this driver only
Query query = driverDataRef.orderBy("date", Query.Direction.DESCENDING).limit(n);
This practice is called denormalization, and it is a common practice when it comes to Firebase. For a better understanding, I recommend you watch the video Denormalization is normal with the Firebase Database. It is about the Firebase Realtime Database, but the same principles apply to Cloud Firestore.
Also, when you are duplicating data, there is one thing to keep in mind: in the same way you add the data, you need to maintain it. In other words, if you want to update or delete an item, you need to do it in every place where it exists.
Edit:
According to your comment, I now understand what you mean. In this case you can consider the allData collection a feed to which you add driver data objects. Let's say that n = 100. This means that every time you add a new object beyond the 100th, you need to delete the oldest one, which implies an extra delete operation. This way you'll keep only 100 objects per user in that feed. And yes, if you have 1,000 users and every user has 100 data objects, you'll be querying a collection that has 100k documents, so reading all of that data at once will cost 100k reads. A sketch of the trimming step is shown below.
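A rough sketch of that extra delete, to be run after adding a new document once a driver already has 100 entries in allData (rootRef as defined above; this query needs a composite index on driverId and date):
Query oldest = rootRef.collection("allData")
        .whereEqualTo("driverId", driverId)
        .orderBy("date", Query.Direction.ASCENDING)
        .limit(1);
oldest.get().addOnSuccessListener(snapshots -> {
    if (!snapshots.isEmpty()) {
        // Remove the single oldest entry so the feed stays at 100 documents per driver
        snapshots.getDocuments().get(0).getReference().delete();
    }
});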
Edit2:
There is another schema I can think of, but it requires some testing because I don't know how big your driver data objects can be. Please see the schema below:
Firestore-root
|
--- drivers (collection)
|
--- driverId (document)
|
--- //other driver details
|
--- driverData (map)
|
--- driverDataId (map key) //Same object as above
|
--- geoPoint: [[48.858376° N, 2.294537° E]]
|
--- date: Oct 11, 2018 at 6:16:58 PM UTC+3
|
--- driverId: "DriverUserId"
|
--- //other extra data
As you can see, I have changed the driverData collection into a map within the driver document. In this case, you also have to maintain those 100 objects within the map yourself. Only 1,000 document reads (one per driver) are then needed to get back all 100k driver data objects. But pay attention: documents have limits on how much data they can hold. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB of data in a single document. When it comes to storing text, that is quite a lot, but as your map of objects gets bigger, be careful about this limitation. A sketch of writing into such a map follows.
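A rough sketch of writing one entry into such a map using a dotted field path; driverDataId here is just a client-generated key, and trimming the map back to the most recent 100 entries (and staying under 1 MiB) remains your responsibility:
Map<String, Object> entry = new HashMap<>();
entry.put("geoPoint", new GeoPoint(48.858376, 2.294537));
entry.put("date", FieldValue.serverTimestamp());
entry.put("driverId", driverId);

// Dots in the field path address keys nested inside the driverData map
rootRef.collection("drivers").document(driverId)
        .update("driverData." + driverDataId, entry);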
I've been trying to research this for a while now, and what I want is very simple. I'm trying to compare two phone numbers and check whether they match, because I'm trying to implement something similar to Telegram: notify a user when one of the people in his contact list creates an account.
My problem is the following:
If I saved my contact using the format 0791234567 and my contact joined using the number +962791234567, both numbers are the same, but the first uses the local format and the second the international format. Does Telegram treat these two numbers as a match and send me a notification indicating that my contact has joined the network?
I tried to use Google's library for parsing the numbers, but unfortunately the library doesn't always parse numbers in every format, especially if the region is not provided.
Any hints? Or is this just not possible, and must all numbers be in a specific format to be able to find a match?
I think you should have two fields, country_code and phone_number, and when registering, logging in, changing the mobile number, etc., collect each field individually.
For example:
id | first_name | last_name | password | country_code | phone_number | ...
---------------------------------------------------------------------------
1  | alihossein | shahabi   | XXXXX    | +98          | 9377548654   |
Or two tables, users and phone_numbers:
id | first_name | last_name | password |
------------------------------------------
1  | alihossein | shahabi   | XXXXX    |

id | user_id | country_code | phone_number | active
----------------------------------------------------
1  | 1       | +98          | 9377541258   | 1
2  | 1       | +98          | 9377543333   | 0
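As a complement, the Google library mentioned in the question (libphonenumber) can normalize both formats to E.164 before comparison, provided you pass a default region for numbers stored in local format. A rough sketch, using "JO" (Jordan) purely as an example region:
import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;

public class PhoneMatch {
    public static void main(String[] args) throws NumberParseException {
        PhoneNumberUtil util = PhoneNumberUtil.getInstance();

        // The default region ("JO" here) is what lets the local 07... format be resolved.
        PhoneNumber local = util.parse("0791234567", "JO");
        PhoneNumber international = util.parse("+962791234567", "JO");

        String a = util.format(local, PhoneNumberUtil.PhoneNumberFormat.E164);
        String b = util.format(international, PhoneNumberUtil.PhoneNumberFormat.E164);
        System.out.println(a + " equals " + b + ": " + a.equals(b));  // both become +962791234567

        // The library can also compare parsed numbers directly.
        System.out.println(util.isNumberMatch(local, international));  // EXACT_MATCH
    }
}
Storing the country code separately, as suggested above, is what makes the default region known at parse time.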
I was planning to use a Dynamo table as a sort of replication log, so I have a table that looks like this:
+--------------+--------+--------+
| Sequence Num | Action | Thing |
+--------------+--------+--------+
| 0 | ADD | Thing1 |
| 1 | DEL | Thing1 |
| 2 | ADD | Thing2 |
+--------------+--------+--------+
Each of my processes keeps track of the last sequence number it read. Then on an interval it issues a Scan against the table with ExclusiveStartKey set to that sequence number. I assumed this would result in reading everything after that sequence, but instead I am seeing inconsistent results.
For example, given the table above, if I do a Scan(ExclusiveStartKey=1), I get zero results when I am expecting to see the 3rd row (seq=2).
I have a feeling it has to do with the internal hashing DynamoDB uses to partition the items and that I am misusing the ExclusiveStartKey option.
Is this the wrong tool for the job?
Alternatively, each process could issue a Query for seq+1 on each interval (looping as long as something is found), which would consume the same read throughput but would require N API calls instead of the roughly N / 1 MB calls I would get with a Scan.
A DynamoDB Scan does not proceed in sorted order by the hash key; items come back in the internal order of the hashed partition keys, so ExclusiveStartKey is only meant to be the LastEvaluatedKey from a previous page, not an arbitrary starting point.
For this example table keyed on the sequence number, what I want can be accomplished with a Kinesis stream.
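If you do stay on DynamoDB, here is a rough sketch of the per-sequence read loop mentioned in the question, using the AWS SDK for Java v2. The table name ReplicationLog and the attribute names SequenceNum, Action, and Thing are assumptions based on the example table:
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.GetItemResponse;
import java.util.Map;

public class ReplicationLogReader {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();
        long next = 2;  // one past the last sequence number this process has processed

        // Read forward one sequence number at a time until there is nothing newer.
        while (true) {
            GetItemResponse response = ddb.getItem(GetItemRequest.builder()
                    .tableName("ReplicationLog")  // assumed table name
                    .key(Map.of("SequenceNum", AttributeValue.builder().n(Long.toString(next)).build()))
                    .consistentRead(true)
                    .build());
            if (!response.hasItem()) {
                break;  // caught up; remember next - 1 as the last sequence read
            }
            System.out.println(response.item().get("Action").s() + " " + response.item().get("Thing").s());
            next++;
        }
    }
}
This is the N-API-call approach described above; a DynamoDB Stream or Kinesis stream avoids the polling entirely.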