My Firebase project is growing fast and I have a lot of children in some paths (the structure is kept as flat as possible).
My tree looks like this:
/client/section
-key: value
-key: value
...
In some sections I've got 80k+ children and growing fast (I could easily hit 1 million+ in a few months). I was thinking of splitting the sections into section1, section2, ... but the problem is that I'd have to do a numChildren() check (which of course loads all the children) before inserting another child, to keep each section inside the desired limit.
Another idea was to change the section key to a date (Y-m-d, or just Y-m), but that would again generate a lot of paths.
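To make the date idea concrete, here is roughly what a write could look like. This is only a sketch using the modular Firebase JS SDK (the legacy new Firebase(...) API supports the same child/push pattern), and the client/section path names are just placeholders:

import { initializeApp } from "firebase/app";
import { getDatabase, ref, push, set } from "firebase/database";

const app = initializeApp({ databaseURL: "https://<your-project>.firebaseio.com" });
const db = getDatabase(app);

// Write each record under /client/section/YYYY-MM/<pushId>, so no single node
// ever has to be counted or loaded in full before inserting another child.
async function addRecord(client: string, section: string, value: unknown) {
  const bucket = new Date().toISOString().slice(0, 7); // "Y-m", e.g. "2015-06"
  const recordRef = push(ref(db, `${client}/${section}/${bucket}`));
  await set(recordRef, value);
  return recordRef.key;
}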
With the new schema I'd also like to add some properties to the current children (so I have to do some work on the schema anyway).
Another idea was to feed this data into a relational DB.
I'd like your input on how to structure the schema for the future.
Thank you!
I currently have a need to add a local secondary index to a DynamoDB table but I see that they can't be added after the table is created. It's fine for me to re-create the table now while my project is in development, but it would be painful to do that later if I need another index when the project is publicly deployed.
That's made me wonder whether it would be sensible to re-create the table with the maximum number of secondary indexes allowed even though I don't need them now. The indexes would have generically-named attributes that I am not currently using as their sort keys. That way, if I ever need another local secondary index on this table in the future, I could just bring one of the unused ones into service.
I don't think it would be a waste of storage or a performance problem, because I understand that the indexes will only be written to when an item is written that includes the attribute they index on.
I Googled to see if this idea was a common practice, but haven't found anyone talking about it. Is there some reason why this wouldn't be a good idea?
Don’t do that. If a table has any LSIs it follows different rules and cannot grow an item collection beyond 10 GB or isolate hot items within an item collection. Why incur these downsides if you don’t need to? Plus later you can always create a GSI instead of an LSI.
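For reference, adding a GSI later is a single UpdateTable call, and unlike an LSI it can be done at any point in the table's lifetime. A rough sketch with the AWS SDK for JavaScript v3 (the table, index and attribute names here are made up):

import { DynamoDBClient, UpdateTableCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

async function addStatusIndex() {
  await client.send(new UpdateTableCommand({
    TableName: "MyTable",
    AttributeDefinitions: [{ AttributeName: "status", AttributeType: "S" }],
    GlobalSecondaryIndexUpdates: [{
      Create: {
        IndexName: "status-index",
        KeySchema: [{ AttributeName: "status", KeyType: "HASH" }],
        Projection: { ProjectionType: "ALL" },
        // Omit ProvisionedThroughput if the table uses on-demand capacity.
        ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 },
      },
    }],
  }));
}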
Some background:
My question is very similar to this clarification question about denormalization, but I want to change the situation a bit.
In the Considerations section of this blog post on denormalization, the Firebase people say the following about updating data.
Let’s discuss some consequences of a [denormalized data structure]. You will need to ensure that every time some data is created (in this case, a comment) it is put in the right places.
The example includes three paths, one to store the comment's data, and two paths under which to store pointers to that comment.
...
Modification of comments is easy: just set the value of the comment under /comments to the new content. For deletion, simply delete the comment from /comments — and whenever you come across a comment ID elsewhere in your code that doesn’t exist in /comments, you can assume it was deleted and proceed normally:
But this only works because, as the answer to the other question says,
The structure detailed in the blog post does not store duplicate comments. We store comments once under /comments then store the name of those comments under /links and /users. These function as pointers to the actual comment data.
Basically, the content is only stored in one location.
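For reference, the shape described there is roughly the following (keys and values are illustrative only):

// Comment bodies live once under /comments; /links and /users hold only IDs.
const exampleTree = {
  comments: {
    comment1: { author: "user2", text: "Great link!" },
  },
  links: {
    link1: {
      comments: { comment1: true }, // pointer to /comments/comment1, not a copy
    },
  },
  users: {
    user2: {
      comments: { comment1: true }, // pointer, not a copy
    },
  },
};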
The question:
What if the situation were such that storing duplicate data is necessary? In that case, what is the recommended way to update data?
My attempt at an answer:
An answer to this question exists, but it is directed at MongoDB, and I'm not sure it quite addresses the issue in Firebase.
The most sensible way I could think of, just for reference, is as follows.
I have a helper class to which I give a catalog of paths in Firebase, which somewhat resembles a schema. This class has methods that wrap Firebase methods, so that I can perform writes and updates under all the paths specified by my schema. The helper class iterates over every path where there is a reference to the object, and at each location performs a write, update, or delete. In my case, no more than 4 paths exist for any individual operation like that, and most have 2.
Example:
Imagine I have three top-level keys, Users, Events, and Events-Metadata. Users post Images to Events, and both Events and Users have a nested record for all their respective Images. Events-Metadata is its own top-level key for the case where I want to display a bunch of events on a page, but I don't want to pull down potentially hundreds of Image records along with them.
Images can have captions, and thus, when updating an Image's caption, I should update these paths:
new Firebase("path/to/eventID/images/imageID/caption"),
and
new Firebase("path/to/userID/images/imageID/caption")
I give my helper class both of those paths and a wrapper method, so that any time a caption is updated, I can call helperclass.updateCaption(imageObj, newCaptionData) and it iteratively updates the data at each path.
Images are stored with attributes including eventID, userID, and imageID, so that the skeletons of those paths can be filled in correctly.
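Concretely, the wrapper boils down to something like this. It is only a sketch with hypothetical path names, written against the modular Firebase SDK and assuming the default app is already initialized; a single multi-location update() with all the paths in one object would make the writes atomic instead of iterative:

import { getDatabase, ref, set } from "firebase/database";

interface ImageRecord { imageID: string; eventID: string; userID: string; }

// The "catalog of paths" for a caption: templates filled in from the image's own IDs.
function captionPaths(img: ImageRecord): string[] {
  return [
    `events/${img.eventID}/images/${img.imageID}/caption`,
    `users/${img.userID}/images/${img.imageID}/caption`,
  ];
}

async function updateCaption(img: ImageRecord, newCaption: string) {
  const db = getDatabase();
  // Iterate over every location that duplicates the caption and write it there.
  await Promise.all(captionPaths(img).map(p => set(ref(db, p), newCaption)));
}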
Is this a recommended and/or appropriate way to approach this issue? Am I doing this wrong?
Consider a set of data called Library, which contains a set of Books and each book contains a set of Pages.
Let's say you are using Riak to store this data, and you need to be able to access the data in two possible ways:
- Query for a particular page (with a unique id)
- Query for all pages in a particular book (with a unique name)
Additionally, you need to be able to easily update and delete pages of a particular Book.
What would be the best way to accomplish this in Riak?
Obviously Riak Search would do the trick, but it may be inefficient for what I am trying to do. I am wondering if it makes sense to set up buckets where each bucket is a Book (which would make for potentially millions of "Book" buckets). Maybe that is a bad idea...
Can this be accomplished with secondary indexes?
I am trying to keep this simple...
I am new to Riak and I am trying to find the best way to accomplish something that is probably relatively simple. I would appreciate any help from the Stack Overflow community. Thanks!
A common way to model master-detail relationships in Riak is to have the master record contain a list of detail record IDs, possibly together with some information about the detail record that may be useful when deciding which detail records to retrieve.
In your example, you could have two buckets called 'books' and 'pages'. The master record in the 'books' bucket would contain metadata and information about the book as a whole, together with a list of the pages included in the book. Each entry in that list would contain the ID of the 'pages' record holding the page data as well as the corresponding page number. If you e.g. wanted to be able to query by chapter, you could also add information about which chapter a certain page belongs to.
The 'pages' bucket would contain the text of the page and possibly links to images and other media data that are included on that page. This data could be stored in yet another bucket.
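As a sketch, the two record types could look something like this (all field names are just illustrative):

interface BookRecord {
  title: string;
  author: string;
  // Ordered list of detail-record keys plus a little per-page metadata.
  pages: Array<{ pageKey: string; pageNumber: number; chapter?: string }>;
}

interface PageRecord {
  bookKey: string;
  text: string;
  imageKeys?: string[]; // media stored in yet another bucket
}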
In order to get a specific page or a range of pages, one would first retrieve the master record from the 'books' bucket and then, based on the contents of that record, retrieve the appropriate pages. Even though this requires several GET operations, they are all direct lookups based on keys, which is the most efficient and scalable way to retrieve data from Riak, so it will perform and scale well.
This approach also makes it simple to change the order of pages and/or chapters, as only the master record needs to be updated. Adding, deleting or modifying pages would however require one or more detail records to be updated, added or deleted in addition to the master record.
You can most certainly also solve this problem by adding secondary indexes to the objects and querying based on those. Secondary index queries in Riak do however have to process a covering set (generally ring size / n_val) of partitions in order to fulfil the request, and therefore put a bit more load on the system and generally result in higher latencies than retrieving a single object containing keys through a direct key lookup (which only needs to involve the partitions where the object is actually stored).
Although maintaining a separate object containing indexes adds a bit of extra work when inserting or deleting pages/entries, this approach will generally result in more efficient reads, as only direct key lookups are required. If your application is heavy on reads, it probably makes sense to use this approach, while secondary indexes could be more efficient for a write heavy application as inserts and modifications are made cheaper at the expense of more expensive reads. You can however always add secondary indexes just in case in order to keep your options open.
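To illustrate the difference, here is a rough sketch of the two access patterns over Riak's HTTP interface (host, bucket and index names are placeholders; a client library would normally wrap these calls):

const RIAK = "http://localhost:8098";

// Direct key lookup: the cheapest, most scalable operation.
async function getPage(pageKey: string) {
  const res = await fetch(`${RIAK}/buckets/pages/keys/${pageKey}`);
  return res.json();
}

// Secondary-index query: has to touch a covering set of partitions.
async function getPageKeysForBook(bookName: string) {
  const res = await fetch(`${RIAK}/buckets/pages/index/book_bin/${bookName}`);
  return (await res.json()).keys as string[];
}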
In cases like this I would usually recommend performing some benchmarks to test the solutions and check which one best matches your particular performance and scaling requirements.
The most efficient way is to store the whole book as one object and duplicate its pages as separate objects.
Pros:
- you will be able to select any object by its key (the cheapest operation in Riak is a key/value lookup)
- any query will have predictable latency
- this is the natural way of storing data in Riak
Cons:
- If you need to update any page you must update the whole book and then the page. As Riak doesn't have atomic operations, you have to think about how to recover from failure situations (e.g. the book was updated but the page was not).
- Riak is about availability and predictable latency, so if you use something like 2i to collect results, you will get queries with unpredictable latency that grows with the number of pages.
I have about 20 different tables that each have a different parent / child relationship built into them. I've recently been asked to create a breadcrumb and Site Map for our website based off of all of these tables.
One idea I had was to remove the parent/child relationship from each of these tables and create basically one table that holds the id and parentId. Whenever I need to pull the parent/child relationship, I would just join the parent_child_relationships table to whatever table I was pulling from.
Does this make sense?
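For concreteness, a breadcrumb could then be built by walking the parentId pointers up to the root; a rough sketch (names hypothetical):

interface Node { id: number; parentId: number | null; title: string; }

// nodes: the parent_child_relationships rows, keyed by id and joined to their titles.
function buildBreadcrumb(nodes: Map<number, Node>, startId: number): Node[] {
  const trail: Node[] = [];
  for (let node = nodes.get(startId); node; node = node.parentId != null ? nodes.get(node.parentId) : undefined) {
    trail.unshift(node); // prepend each ancestor until we reach the root
  }
  return trail;
}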
Anyway, the problem with this idea is that I don't like it, haha.
Does anyone else have any other ideas of how this could be done? Or what is the correct way of building a breadcrumb and site map for a site made up of 20 or so tables?
If it helps, my site is built with ASP.NET and ColdFusion and uses an MSSQL database.
Thanks!
Do not let the implementation of the UI affect the design of your model, and especially not your DB. Prototype the front end, involve your customer(s), give them a voice. Build your breadcrumbs and site map without them initially tied into your actual DB. Once your customer says "that's what we want, just like that", freeze the prototype, then work on the actual implementation: how will your app request the data, what type of data object will you use, and then build your DB.
"One idea I had, was to remove the parent / child relationship from each of these tables and create basically one table that holds the id and parentId"
This is not a very scalable solution; do not reverse-normalize your DB. Follow standard relational database modeling/normalization techniques: lots of small cohesive tables with lots of association tables.
I have this much:
A table which is used to store the "Folders". Each folder may contain subfolders and files, so if I click a folder, I have to list the contents of that folder.
The table representing the folder listing is something like the following:
FolderID Name Type Desc ParentID
In the case of subfolders, ParentID refers to the FolderID of the parent folder.
Now, my questions are:
1.
a. There are 3 types of folders, and I use 3 data lists to categorize them. Should I load the entire table in a single fetch and then use LINQ to split it into the types?
OR
b. Or load each category by passing 'Type' to a stored procedure, which means 3 database calls.
2.
a. If I click the parent folder, use LINQ to filter the contents of the folder (because we have the entire table in memory).
OR
b. If I click the parent folder, pass the FolderID of the parent folder and then fetch the content.
In the two cases above, which options make more sense, and which are best in terms of performance?
There are a number of considerations you need to make.
What is the size of the folder tree? If it is not currently large, could it potentially become very large?
What is the likelihood that the folder table will be modified whilst a user is using/viewing it? If there is a high chance then it may be worthwhile to make smaller, more frequent calls to the DB so that the user is aware of any changes which have been made by other users.
Will users be working with one folder type at a time? Or will they be switching between these three different trees?
As an instinctive answer, I would be drawn towards loading 1 or 2 levels at a time. For example, start by loading the root folder and its immediate children; as the user navigates down into the tree, retrieve more children...
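A rough sketch of that level-at-a-time approach, written here in TypeScript for illustration (the data-access call is hypothetical and would map to a parameterised query or stored procedure in your stack):

interface Folder { folderID: number; name: string; type: number; parentID: number | null; }

// Hypothetical data access, e.g. SELECT FolderID, Name, Type, ParentID
// FROM Folders WHERE ParentID = @parentID (or IS NULL for the roots).
declare function fetchChildren(parentID: number | null): Promise<Folder[]>;

async function loadInitialTree() {
  const roots = await fetchChildren(null);                       // top-level folders only
  const children = await Promise.all(roots.map(r => fetchChildren(r.folderID)));
  return roots.map((r, i) => ({ ...r, children: children[i] })); // one level deep
}

// Later, when the user expands a folder:
// const grandChildren = await fetchChildren(clickedFolder.folderID);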
When you are asking about performance, the only available answer is: measure it! Implement both scenarios and look at how they load your system.
Think about how you will cache your data to prevent high database load.
Everything is fast for a small n, so we can't say anything for sure.
If your data is small and does not change frequently, then use caching and LINQ-based queries against your cached data.
If your data can't be stored in the cache because it is huge, or it changes constantly, then cache the results of your queries, create cache dependencies for them, and again: measure it!
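A minimal sketch of caching query results, shown in TypeScript for illustration (in ASP.NET the same idea maps onto the built-in Cache with cache dependencies; a proper invalidation hook would replace the simple time-based expiry used here):

const cache = new Map<string, { value: unknown; expires: number }>();

async function cachedQuery<T>(key: string, run: () => Promise<T>, ttlMs = 60_000): Promise<T> {
  const hit = cache.get(key);
  if (hit && hit.expires > Date.now()) return hit.value as T; // serve from cache
  const value = await run();                                   // hit the database only on a miss
  cache.set(key, { value, expires: Date.now() + ttlMs });
  return value;
}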