On my database, I have Post and User Models
User Model has a lot of information, but when I load the posts, I only need 3 out of like 20 parameters.
What I am currently doing is just loading the entire node. This is obviously not really efficient.
My question: is it more efficient to observe the 3 values individually (making 3 connections) or to observe the entire node once (making only a single connection)?
I don't know exactly which would be more expensive: three connections are probably not better than one, but the single read transfers data I don't need.
Kind regards
Edit
Firebase always loads complete nodes. While it is possible to get a subset of nodes with queries, that doesn't apply here.
So you will either have to load all nodes and do the subselection client-side, or you'll have to create another higher-level node that only contains the three properties that you're interested in.
Which one to choose depends highly on your use case, and is (honestly) largely subjective. The main options:
You can reduce bandwidth a bit by only loading the three properties, but if you store them as a duplicate you'll then end up paying for the storage of duplicate information.
You can also store the three properties separately, but not duplicate them. That means, though, that if you need all properties you'll need to execute two read operations, which adds some overhead and complicates the code.
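As a rough illustration (Android Java SDK assumed; the paths and field names users, users_public, name, avatarUrl and joinedAt are made up, not from the question), the two options could look like this:

```java
import com.google.firebase.database.DataSnapshot;
import com.google.firebase.database.DatabaseError;
import com.google.firebase.database.DatabaseReference;
import com.google.firebase.database.FirebaseDatabase;
import com.google.firebase.database.ValueEventListener;

public class PostAuthorLoader {

    // Option 1: load the whole user node once and keep only the three fields client-side.
    public void loadAuthor(String userId) {
        DatabaseReference userRef = FirebaseDatabase.getInstance()
                .getReference("users").child(userId);
        userRef.addListenerForSingleValueEvent(new ValueEventListener() {
            @Override
            public void onDataChange(DataSnapshot snapshot) {
                // The whole node travels over the wire; the subselection happens here.
                String name      = snapshot.child("name").getValue(String.class);
                String avatarUrl = snapshot.child("avatarUrl").getValue(String.class);
                Long   joinedAt  = snapshot.child("joinedAt").getValue(Long.class);
                // ... bind these three values to the post view
            }

            @Override
            public void onCancelled(DatabaseError error) {
                // handle the failed read
            }
        });
    }

    // Option 2: attach the same listener to a slim duplicate node, e.g.
    // FirebaseDatabase.getInstance().getReference("users_public").child(userId),
    // which only ever contains the three properties. Same code, different path;
    // the bandwidth saving comes from the smaller node.
}
```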
Related
First off, I know how Firestore works and have spent a lot of time evaluating different approaches for a good structure. Still, I am considering the following scenario:
There is a database of known recipes. Users can add recipes, but they have to be confirmed to be real recipes and not just variations of existing ones. Every user can choose recipes from the user-generated list of recipes to state that they know how to cook them (or add new ones).
Now I want users to share their list of recipes with others, but this is where I am not sure how this can best be accomplished using Firestore. The trick is that I want to show all the recipes at once and don't want to paginate them.
I am currently evaluating two possibilities:
Subcollections
Whenever a user shares his list, the user looking at said list will have to load the entire list of the recipes which can result in a high amount of document reads (I suppose realistically ~50, in very rare cases maybe 1000).
Pros:
More natural structure
Easier to maintain (e.g. deleting a recipe, checking if a specific one exists)
Easier to add fields (e.g. timeOfCreation, comment, personalRating, ...)
Cons:
Can result in a high number of reads in the long run
Arrays
I could save every known recipe (the id and an imageURL) inside the user's document (or in a single subdocument "KnownRecipes") within an array. This array could take the form of:
recipesKnown: [
  { rid: "293ndwa", imageURL: "image1.com", timeAdded: 8371201332 },
  { rid: "9012831", imageURL: "image1.com", timeAdded: 8371201871 },
  { rid: "jd812da", imageURL: "image1.com", timeAdded: 8371201118 },
  ...
]
Pros:
I only need one document read whenever someone wants to see another user's list
Reading a user's list is probably faster
Cons:
It's hard to update a specific recipe (e.g. someone wants to change the imageURL): I need to change the list locally and send the entire document as an update to the server, since I cannot just change a single element of the array in place (see the sketch after this list)
When a user ends up with around 1000 recipes (this will maybe never happen, but it could), Firestore's 1 MiB document size limit could be reached. A possible workaround would be to create a separate document and split the array across the two documents.
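A rough sketch of the array variant (Android Firestore SDK in Java; the field values are made up): adding an entry is a single atomic update with FieldValue.arrayUnion, while "editing" an entry still means removing the old map and adding the corrected one, because the element has to match exactly.

```java
import com.google.firebase.firestore.FieldValue;
import com.google.firebase.firestore.FirebaseFirestore;
import java.util.HashMap;
import java.util.Map;

public class KnownRecipesArray {

    public void addKnownRecipe(String userId, String rid, String imageURL) {
        Map<String, Object> entry = new HashMap<>();
        entry.put("rid", rid);
        entry.put("imageURL", imageURL);
        entry.put("timeAdded", System.currentTimeMillis());

        // One document write appends the entry to the array.
        FirebaseFirestore.getInstance()
                .collection("users").document(userId)
                .update("recipesKnown", FieldValue.arrayUnion(entry));

        // To change an existing entry later: FieldValue.arrayRemove(oldEntry)
        // followed by arrayUnion(newEntry) -- the old map must be reproduced
        // exactly to be removed, or the whole array has to be rewritten.
    }
}
```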
For me, the idea with Subcollections seems to be the "cleaner" solution to this problem, but maybe I am missing some arguments as to why one of these solutions is superior to the other.
My most common queries are as follows (ordered descending by importance):
Which recipes can a user cook
Add a recipe a user can cook to the user's list
Who can cook a specific recipe (there is a Recipe -> Cooks subcollection)
Update an existing recipe a user can cook
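To make the comparison concrete, here is a hedged sketch of how these queries could look with the subcollection layout (Android Firestore SDK in Java; the collection names users/{uid}/knownRecipes and recipes/{rid}/Cooks are assumptions based on the structure described above):

```java
import com.google.firebase.firestore.FirebaseFirestore;
import java.util.Map;

public class KnownRecipesSubcollection {

    private final FirebaseFirestore db = FirebaseFirestore.getInstance();

    // 1. Which recipes can a user cook (N document reads, one per known recipe).
    public void loadKnownRecipes(String userId) {
        db.collection("users").document(userId).collection("knownRecipes").get();
    }

    // 2. Add a recipe the user can cook (document ID = recipe ID keeps it duplicate-free).
    public void addKnownRecipe(String userId, String rid, Map<String, Object> data) {
        db.collection("users").document(userId)
          .collection("knownRecipes").document(rid).set(data);
    }

    // 3. Who can cook a specific recipe, via the Recipe -> Cooks subcollection.
    public void loadCooks(String rid) {
        db.collection("recipes").document(rid).collection("Cooks").get();
    }

    // 4. Update an existing known recipe, e.g. change its imageURL only.
    public void updateImageUrl(String userId, String rid, String newUrl) {
        db.collection("users").document(userId)
          .collection("knownRecipes").document(rid).update("imageURL", newUrl);
    }
}
```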
The answer to your question depends on the level of scalability you want to achieve.
If by design the amount of sub-data you want to store is limited and very low, you should use arrays, since you reduce the number of document reads, which means lower costs.
If your sub-data is supposed to increase "unlimitedly" over time, you should use sub-collections.
If you're building a database which is not supposed to scale in any direction (proof of concept, very small business, etc.), just go with whichever you feel more comfortable with.
I'm researching the same question...
One of the questions is whether the data held in the document will ever go past 1 MB, which is the limit for a document. Researching a bit on how much plain text fits into 1 MB: it's a hell of a lot. Still, if the data grew far beyond that, it would eventually fail. So if you are thinking really big, use sub-collections.
If we follow Firebase's own data-modelling logic, the answer would be sub-collections.
Still, I guess the major point is how much data gets pulled. If you fetch the user document, you will directly be pulling down that whole MB of data. With a sub-collection it won't be loaded until you ask for it, and even then you can lazy-load it.
I guess for the kind of setup you are doing, sub-collections.
The document key is an additional pro/con of the sub-collection approach:
a key could help to avoid duplicates, but this requires deciding what counts as a duplicate (a definition which might change);
the array's no-key behaviour could be emulated via auto-IDs.
p.s. #Thomas's list of pros/cons in the question has been quite helpful.
What is the best way to add a huge amount of documents to Riak? Let's say there are millions of product records, which change very often (prices, ...), and we want to update all of them very frequently. Is there a better way than replacing the keys one by one in Riak? Something like a bulk set of 1000 documents at once...
There are unfortunately no bulk operations available in Riak, so this has to be done by updating each object individually. If your updates arrive in bulks, however, it may be worthwhile revisiting your data model. If you can de-normalise your products, perhaps by storing a range of products in a single object, it might be possible to reduce the number of updates that need to be performed by grouping them, thereby reducing the load on the cluster.
When modelling data in Riak you usually need to look at access and query patterns in addition to the structure of the data, and make sure that the model supports all types of queries and latency requirements. This quite often means de-normalising your model by either grouping or duplicating data in order to ensure that updates and queries can be performed as efficiently as possible, ideally through direct K/V access.
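As a rough sketch of the grouping idea (Riak Java client 2.x assumed; the ProductGroup class, bucket and key names are made up), a whole batch of price changes can be folded into a single store of one grouped object:

```java
import com.basho.riak.client.api.RiakClient;
import com.basho.riak.client.api.commands.kv.StoreValue;
import com.basho.riak.client.core.query.Location;
import com.basho.riak.client.core.query.Namespace;
import java.util.HashMap;
import java.util.Map;

public class GroupedPriceUpdate {

    // Hypothetical grouped value: one Riak object holding many product prices.
    public static class ProductGroup {
        public Map<String, Double> prices = new HashMap<>();
    }

    public static void main(String[] args) throws Exception {
        RiakClient client = RiakClient.newClient("127.0.0.1");
        try {
            ProductGroup group = new ProductGroup();
            group.prices.put("product-1001", 9.99);
            group.prices.put("product-1002", 14.50);
            // ... the rest of the prices that arrived in the same bulk update

            // One PUT replaces one object covering the whole group,
            // instead of one PUT per product record.
            Location loc = new Location(new Namespace("product_groups"), "group-10");
            client.execute(new StoreValue.Builder(group).withLocation(loc).build());
        } finally {
            client.shutdown();
        }
    }
}
```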
I have an Application that allows people to bet on the result of soccer games.
The score of each single bet (= entity) is calculated by comparing the predicted score of the bet with the actual result of the game (= entity). Bets are placed within Betrounds. Betrounds are organisations where groups bet on gamegroups (groups of games, e.g. single matchdays). A single Usergroup can have several Betrounds.
To summarize the relational model:
UserGroup 1:N BetRounds 1:N Bets N:1 Game
Within each betround I create a resulttable where I show every user with their result points and position.
In order to calculate the position of one user I need to calculate the points of every user within a betround.
These points from the single betrounds are aggregated into groups and within the group there is again a resulttable.
Example
A Usergroup with: 20 users
One Season has 34 matchdays
One matchday has 9 games
In order to calculate the points for this usergroup I would need to calculate the points from 20 * 34 * 9 = 6120 bets.
Since this is a lot to calculate, I don't want to do it every time I show the resulttable.
I currently see two options in order to save some calculation time:
Cache
Save interim results (e.g. on the bet entity) in the database
Maybe a mix of both.
1. Cache
If caching is the right approach, I am not sure at which level to cache and how to invalidate it.
There are several options what to cache:
- pointresult of single bets
- pointresults of single users within a betround
- whole result table of a betround (points & position)
- pointresult of single user within usergroup
- whole resulttable of usergroup
I am unsure in what form to cache that data:
- just the integer values for positions and points
- whole entities (e.g. bets)
- temporary, non-persistent entities (e.g. to represent the resulttables)
- the html output of the table
Then, depending on the format, how to cache it:
- html views could be cached via reverse proxies
- values / entities probably via redis / memcache etc.
In the future we might change to a single-page app where data is only served via a REST API; then caching of HTML output is not an option.
Depending on the caching strategy, the question arises how to invalidate the cache and optionally warm it, so that the result is never calculated during a request, but only recalculated when the cache is invalidated and then immediately replaced by the new result.
I have read very often that cache invalidation is evil. I am not sure if this applies to my use case since all points/results/tables etc. only change when my interface updates the result of the games. This is the only time when points change.
2. Save interim results (e.g. on the bet entity) in the database
I am not sure if this scenario is applicable at all levels. I first thought about saving the actual result on a bet instead of always comparing the bet's scores with the actual scores. This would make my data model a little bit redundant, and I'd have increased complexity if a wrong result is fetched by my interface, the correct one comes in later, and my points are not recalculated.
On all other levels I would need to create new interim entities to store table results persistently.
3. Mix of both
I am not sure how mixing both would look like and if it makes sense at all, but I thought it might be an option.
Any advice, input or experience would be highly appreciated.
I only mildly understand betting, so hopefully this helps.
It sounds like you are asking two questions:
When do I calculate results?
How much caching should I use?
To me it sounds like there are very clear events that happen, after which you can successfully calculate your results. Your design should take advantage of this and be evented in nature. You should have background processes that can detect when a game is complete. The results of the game should be written, and additional background jobs should be triggered to calculate the results of any bets that depend on that game.
This would also be the point at which any caches that involve that game, results from that game, or results from any bets on that game, should be invalidated and/or refreshed.
How much you cache should be based on how much you need to. Caching should be considered separately from computing results: precomputing results when a game ends and storing them is not caching, it is computing results and persisting them. You should definitely not be calculating results during a page-view request; that work should be done ahead of time, when the corresponding event (a game ending) triggers the calculation.
Your database should pretty much always represent the latest information you have on everything. You should avoid doing any calculations on-the-fly if possible.
I would get all the events and background stuff working first, then see what kind of performance you get. At that point your app should be doing little more than taking the results and sticking them into a view for each page view. If that part is going too slow, then you should start looking at caching your views/templates/html. As mentioned before, these caches could be invalidated by your background workers when they encounter new results.
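As a sketch of that evented flow (all names here, such as GameResult, Bet, BetRepository and the 3/1/0 scoring rule, are hypothetical stand-ins rather than a specific framework): a background worker reacts to a final result, recalculates only the affected bets, stores the interim points on the bet entity, and invalidates the dependent cache entries.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class GameFinishedHandler {

    // Minimal stand-ins for the real domain model.
    public record GameResult(long gameId, int homeGoals, int awayGoals) {}
    public interface Bet {
        long betroundId();
        int homeGoals();
        int awayGoals();
        void setPoints(int points);
    }
    public interface BetRepository {
        List<Bet> findByGameId(long gameId);
        void save(Bet bet);
    }

    // Cached rendered resulttables, keyed by betround ID.
    private final ConcurrentHashMap<Long, String> betroundTableCache = new ConcurrentHashMap<>();
    private final BetRepository bets;

    public GameFinishedHandler(BetRepository bets) {
        this.bets = bets;
    }

    // Triggered by a background worker when the result interface delivers a final score.
    public void onGameFinished(GameResult result) {
        for (Bet bet : bets.findByGameId(result.gameId())) {
            bet.setPoints(score(bet, result));           // interim result stored on the bet entity
            bets.save(bet);
            betroundTableCache.remove(bet.betroundId()); // invalidate; rebuild lazily or warm it here
        }
    }

    private int score(Bet bet, GameResult result) {
        if (bet.homeGoals() == result.homeGoals() && bet.awayGoals() == result.awayGoals()) return 3;
        return Integer.signum(bet.homeGoals() - bet.awayGoals())
                == Integer.signum(result.homeGoals() - result.awayGoals()) ? 1 : 0;
    }
}
```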
Consider a set of data called Library, which contains a set of Books and each book contains a set of Pages.
Let's say you are using Riak to store this data, and you need to access the data in two possible ways:
- Query for a particular page (with a unique id)
- Query for all pages in a particular book (with a unique name)
Additionally, you need to be able to easily update and delete pages of a particular Book.
What would be the best way to accomplish this in Riak?
Obviously Riak Search will do the trick, but it may be inefficient for what I am trying to do. I am wondering if it makes sense to set up buckets where each bucket is a Book (which would make for potentially millions of "Book" buckets). Maybe that is a bad idea...
Can this be accomplished with secondary indexes?
I am trying to keep this simple...
I am new to Riak and I am trying to find the best way to accomplish something that is probably relatively simple. I would appreciate any help from the Stack Overflow community. Thanks!
A common way to model master-detail relationships in Riak is to have the master record contain a list of detail record IDs, possibly together with some information about the detail record that may be useful when deciding which detail records to retrieve.
In your example, you could have two buckets called 'books' and 'pages'. The master record in the 'books' bucket would contain metadata and information about the book as a whole, together with a list of the pages included in the book. Each entry in that list would contain the key of the 'pages' record holding the page data as well as the corresponding page number. If you e.g. wanted to be able to query by chapter, you could also add information about which chapter a certain page belongs to.
The 'pages' bucket would contain the text of the page and possibly links to images and other media data that are included on that page. This data could be stored in yet another bucket.
In order to get a specific page or a range of pages, one would first retrieve the master record from the 'books' bucket and then, based on the contents of that record, the appropriate pages. Even though this requires several GET operations, they are all direct lookups based on keys, which is the most efficient and scalable way to retrieve data from Riak, so it will perform and scale well.
This approach also makes it simple to change the order of pages and/or chapters as only the master record needs to be updated. Adding, deleting or modifying pages would however require both the master record as well as one or more detail records to be updated, added or deleted.
You can most certainly also solve this problem by adding secondary indexes to the objects and querying based on these. Secondary index queries in Riak do however have to involve a covering set (generally ring size / n_val) of partitions in order to fulfil the request, and therefore put a bit more load on the system and generally result in higher latencies than retrieving a single object containing keys through a direct key lookup (which only needs to involve the partitions where the object is actually stored).
Although maintaining a separate object containing indexes adds a bit of extra work when inserting or deleting pages/entries, this approach will generally result in more efficient reads, as only direct key lookups are required. If your application is heavy on reads, it probably makes sense to use this approach, while secondary indexes could be more efficient for a write heavy application as inserts and modifications are made cheaper at the expense of more expensive reads. You can however always add secondary indexes just in case in order to keep your options open.
In cases like this I would usually recommend running some benchmarks to test the solutions and check which one best matches your particular performance and scaling requirements.
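As a hedged sketch of the master-detail lookup described above (Riak Java client 2.x; the Book, PageRef and Page classes and the bucket names are assumptions, not part of Riak itself):

```java
import com.basho.riak.client.api.RiakClient;
import com.basho.riak.client.api.commands.kv.FetchValue;
import com.basho.riak.client.core.query.Location;
import com.basho.riak.client.core.query.Namespace;
import java.util.List;

public class LibraryLookup {

    // Hypothetical POJOs; only the bucket/key scheme matters here.
    public static class PageRef { public int number; public String pageKey; }
    public static class Book    { public String title; public List<PageRef> pages; }
    public static class Page    { public String text; }

    public static void main(String[] args) throws Exception {
        RiakClient client = RiakClient.newClient("127.0.0.1");
        try {
            Namespace books = new Namespace("books");
            Namespace pages = new Namespace("pages");

            // 1. Direct key lookup of the master record for the book.
            Book book = client.execute(
                    new FetchValue.Builder(new Location(books, "moby-dick")).build())
                    .getValue(Book.class);

            // 2. Direct key lookups of the pages listed in the master record
            //    (a single page can also be fetched directly by its own key).
            for (PageRef ref : book.pages) {
                Page page = client.execute(
                        new FetchValue.Builder(new Location(pages, ref.pageKey)).build())
                        .getValue(Page.class);
                System.out.println("Page " + ref.number + ": " + page.text);
            }
        } finally {
            client.shutdown();
        }
    }
}
```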
The most efficient way will be to store the whole book as one object, and duplicate its pages as separate objects.
Pros:
you will be able to select any object by its key (the cheapest operation in Riak is a KV query)
every query will have predictable latency
this is the natural way of storing data in Riak
Cons:
If you need to update any page you must update the whole book, and then the page. As Riak doesn't have atomic operations, you have to think about how to recover from failure situations (such as: the book was updated, but the page was not).
Riak is about availability and predictable latency, so if you use something like 2i to collect results, query times become unpredictable and will grow with the number of pages.
I have several graphs. The breadth and depth of each graph can vary and will undergo changes and alterations during runtime. See example graph.
There is a root node to get hold of the whole graph (i.e. tree). A node can have several children and each child serves a special purpose. Furthermore a node can access all its direct children in order to retrieve certain information. On the other hand, a child node may not be aware of its own parent node, nor of its siblings. Nothing spectacular so far.
Storing each graph and updating it with an object database (in this case DB4O) looks pretty straightforward. I could have used a relational database to accomplish data persistence (including database triggers, etc.) but I wanted to realize it with an object database instead.
There is one peculiar thing with my graphs. See another example graph.
To properly perform calculations, some nodes require information from other nodes. These other nodes may be siblings, children/grandchildren or related in some other way. In this case a specific node knows the other relevant nodes as well (and thus can get the required information directly from them). For the sake of simplicity the first image didn't show all potential connections.
If one node has a change of state (e.g. triggered by an internal timer or by some other node) it will inform other nodes (interested observers, see also the observer pattern) about the change. Each informed node will then take appropriate actions to update its own state (and in turn inform other observers as needed). A root node will not know about every change that occurs, since only the involved nodes will know that something has changed. If such a chain of events is triggered by the root node then of course it's not much of an issue.
The aim is to assure data persistence with an object database. Data in memory should be in sync with data stored within the database. What adds to the complexity is the fact that the graphs don't consist of simple (and stupid) data nodes, but that lots of functionality is integrated in each node (i.e. events that trigger state changes throughout a graph).
I have several rough ideas on how to cope with the presented issue (e.g. (1) a stronger separation of data and functionality, (2) a stronger integration of the database, or (3) setting an arbitrary time interval to update data and accepting that data may be out of sync for a period of time). I'm looking for some more input and options concerning this key issue (which will definitely leave significant footprints on a concrete implementation).
(edited)
There is another aspect I forgot to mention. A graph should not reside in memory all the time. Graphs that are not needed will only be present in the database and are thus put in a state of suspension. This is another issue which needs consideration. While in suspension the update mechanisms will probably be put to sleep as well, and this is not intended.
In the case of db4o check out "transparent activation" to automatically load objects on demand as you traverse the graph (this way the graph doesn't have to be all in memory) and check out "transparent persistence" to allow each node to persist itself after a state change.
http://www.gamlor.info/wordpress/2009/12/db4o-transparent-persistence/
Moreover you can use db4o "callbacks" to trigger custom behavior during db4o operations.
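A minimal sketch of enabling this in db4o for Java (assuming the embedded mode and a file database; transparent persistence implies transparent activation as well, and node classes would additionally be made Activatable, usually via db4o's bytecode enhancer):

```java
import com.db4o.Db4oEmbedded;
import com.db4o.ObjectContainer;
import com.db4o.config.EmbeddedConfiguration;
import com.db4o.ta.TransparentPersistenceSupport;

public class GraphStore {
    public static void main(String[] args) {
        EmbeddedConfiguration config = Db4oEmbedded.newConfiguration();
        // Enable transparent activation + transparent persistence for the container.
        config.common().add(new TransparentPersistenceSupport());

        ObjectContainer db = Db4oEmbedded.openFile(config, "graphs.db4o");
        try {
            // Traverse the stored graph here: nodes are activated lazily as they
            // are reached, and modified nodes are stored automatically on commit,
            // so the whole graph never has to be in memory at once.
            db.commit();
        } finally {
            db.close();
        }
    }
}
```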
HTH
German
What's the exact question? Here are a few comments:
As #German already mentioned: For complex object graphs you probably want to use transparent persistence.
Also as #German mentioned: callbacks can help you to do additional things when objects are read/written etc. on the database.
Regarding the observer pattern: are you on .NET or Java? Usually you don't want to store the observers in the database, since the observers are usually part of your business logic, GUI etc. On .NET, events are automatically not stored. On Java, make sure that you mark the field holding the observer references as transient.
In case you actually want to store observers, for example because they are just other elements in your object graph: on .NET you cannot store delegates/closures, so you need to introduce an interface for calling the observer. On Java we often use anonymous inner classes as listeners; while db4o can store those, I would NOT recommend it, because an anonymous inner class gets a generated name which can change, and db4o will then not find that class later if you've changed your code.
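A small illustrative sketch of both points (the NodeObserver interface and GraphNode class are made up for this example): keep runtime-only observers out of the stored graph by marking the field transient, and if observers must be stored, use a named class implementing an observer interface rather than an anonymous inner class.

```java
import java.util.ArrayList;
import java.util.List;

public class GraphNode {

    // Named observer interface; concrete observers should be named (top-level or
    // static nested) classes if they ever need to be stored.
    public interface NodeObserver {
        void nodeChanged(GraphNode node);
    }

    // Not persisted by db4o on Java because the field is transient.
    private transient List<NodeObserver> observers = new ArrayList<>();

    private int state;

    public void addObserver(NodeObserver observer) {
        if (observers == null) {
            observers = new ArrayList<>(); // transient field is null after loading from db4o
        }
        observers.add(observer);
    }

    public void setState(int newState) {
        this.state = newState;
        if (observers != null) {
            for (NodeObserver observer : observers) {
                observer.nodeChanged(this); // propagate the change through the graph
            }
        }
    }
}
```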
That's it. Ask more detailed questions if you want to know more.