Quoting from the MongoDB docs on Capped Collections:
Once the space is fully utilized,
newly added objects will replace the
oldest objects in the collection.
Is there any way to capture a capped collection's "dropped" objects before they are overwritten? What I am interested in doing is implementing a series of rollup collections, e.g.
Hourly --> Daily --> Weekly --> Monthly etc.
So when an object is dropped from the Hourly collection, I want to capture it and aggregate it up into the Daily collection.
Thanks in advance.
//Nicholas
You would have to implement that functionality in code rather than in MongoDB.
I don't think Capped Collections are the right solution for your use-case.
You could insert into a capped collection and, at the same time, insert into a "normal" collection, then aggregate the latter into hourly, daily, weekly, monthly (etc.) rollups using map reduce.
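A minimal sketch of the hourly-to-daily rollup idea, written in plain Python rather than against the MongoDB map reduce API. The document shape (`ts`, `value`) and the function name `rollup_daily` are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime


def rollup_daily(hourly_docs):
    """Group hourly documents by calendar day and sum their values."""
    totals = defaultdict(int)
    for doc in hourly_docs:
        # "map" step: key each document by its day;
        # "reduce" step: sum the values that share a key.
        totals[doc["ts"].date()] += doc["value"]
    return dict(totals)


# Hypothetical hourly documents (field names are assumptions).
hourly = [
    {"ts": datetime(2011, 4, 14, 9), "value": 3},
    {"ts": datetime(2011, 4, 14, 10), "value": 5},
    {"ts": datetime(2011, 4, 15, 0), "value": 2},
]
daily = rollup_daily(hourly)
```

The same grouping could be applied again to the daily results to produce weekly or monthly rollups.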
As per the MongoDB developers, you can't do this:
http://groups.google.com/group/mongodb-user/browse_thread/thread/aec8d0c85f58d89e/d6701df083eb4679?fwc=1
What I am interested in doing is implementing a series of rollup collections.
As alex said, one way to solve this is to use MapReduce. Another way is to use a separate collection per period, e.g. one per day such as logs20110414, and have your application manage reads/writes to the appropriate collection.
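One way the application could pick the per-day collection is sketched below. The `logs` prefix follows the logs20110414 example above; the helper names `collection_for` and `collections_between` are hypothetical, and no database driver is involved.

```python
from datetime import date, timedelta


def collection_for(day: date, prefix: str = "logs") -> str:
    """Return the per-day collection name, e.g. logs20110414."""
    return f"{prefix}{day:%Y%m%d}"


def collections_between(start: date, end: date, prefix: str = "logs"):
    """List the collection names covering start..end (inclusive)."""
    names = []
    day = start
    while day <= end:
        names.append(collection_for(day, prefix))
        day += timedelta(days=1)
    return names
```

Writes would go to `collection_for(today)`, while a read over a date range would visit each name returned by `collections_between`. Old days can then be aggregated and dropped as whole collections.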
I'm working on an app where users create certain events in a calendar.
I was thinking on structuring the calendar events data as follows:
allEventsEver/{yearId}/months/{monthId}/events/{eventId}
I understand that
Firestore is optimized for storing large collections of small documents
but the structure above would mean that this would be an ever-growing collection. Is this something I should worry about? Would it be better to create a new collection for each year, e.g.:
2022/months/{monthId}/events/{eventId}
2023/months/{monthId}/events/{eventId}
Also, should I avoid using year/month value as document id (e.g. 2022) as those would be considered sequential ids that could cause hotspots that impact latency? If yes, what other approach do you suggest?
The most important/unique performance guarantee Firestore gives is that its query performance is independent of the number of documents in the collection. Query performance only depends on how much data you return, not on how much data needs to be considered.
So an ever-growing collection is not a concern on Firestore. As long as you put a limit on how many results your query can return, you'll have an upper bound on how much time it will take.
I have been using Postgres to store time-series sensor data, but I am weighing the cost of using Firestore because I prefer its serverless nature. My only concern is the cost of Firestore, because I pay for every read. I want to be able to display this sensor information in my web app. Right now I take data every 10 seconds and there are over 400 sensor points (400 columns per row in my Postgres table).
Currently, if a user queries for a week's worth of data, that's about 60,000 rows, but I optimise it by taking every nth value to "feather" the data. By taking every 20th row, for example, I reduce the result to 3,000 rows, which is manageable, and the chart still shows a clear trend.
I want to be able to do this in Firestore to save costs, because if a user queries for a week's data, I am paying for 60,000 document reads even though I can't display all of those data points in the web app anyway. I have tried searching for ways to query Firestore for every Nth row of data, but haven't found any concrete solutions.
Does anybody have any recommendation how I can optimise my Firestore costs for time series data or perhaps any other cheap serverless methods to manage this data?
Firestore doesn't offer any way to "feather" data from queries, as you say. What you could do instead is put an integer in each document that describes its "Nth" value, then query for only those "N" that you want.
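A rough sketch of that idea in plain Python, with no Firestore calls. It assumes each document can be given a running sequence number at write time; the field name `nth`, the stride of 20, and both helper functions are hypothetical. The list comprehension in `feathered` stands in for an equality filter on the stored field.

```python
def with_nth_marker(rows, n=20):
    """Attach a precomputed 'nth' field (position modulo n) to each row."""
    return [{**row, "nth": i % n} for i, row in enumerate(rows)]


def feathered(docs):
    """Stand-in for an equality query on the stored field (nth == 0)."""
    return [d for d in docs if d["nth"] == 0]


# One week of readings at 10-second intervals, rounded to 60,000 rows.
week = [{"ts": 10 * i, "value": i} for i in range(60_000)]
docs = with_nth_marker(week)
sample = feathered(docs)
```

Because the marker is stored on the document, only the matching documents would be returned and billed, rather than all 60,000.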
Happy Holidays everyone!
tl;dr: I need to aggregate movie rental information that is being stored in one DynamoDB table and store running total of the aggregation in another table. How do I ensure exactly-once aggregation?
I currently store movie rental information in a DynamoDB table named MovieRentals:
{movie_title, rental_period_in_days, order_date, rent_amount}
We have millions of movie rentals happening on any given day. Our web application needs to display the aggregated rental amount for any given movie title.
I am planning to use Flink to aggregate rental amounts by movie_title on the MovieRental DynamoDB stream and store the aggregated rental amounts in another DynamoDB table named RentalAmountsByMovie:
{movie_title, total_rental_amount}
How do I ensure that the RentalAmountsByMovie amounts are always accurate, i.e. how do I prevent the results from any checkpoint from updating the RentalAmountsByMovie table records more than once?
Approach 1: I store the checkpoint ids in the RentalAmountsByMovie table and do conditional updates to handle the scenario described above?
Approach 2: I can possibly implement the TwoPhaseCommitSinkFunction that uses DynamoDB Transactions. However, according to Flink documentation the commit function can be called more than once and hence needs to be idempotent. So even this solution requires checkpoint-ids to be stored in the target data store.
Approach 3: Another pattern seems to be just storing the time-window aggregation results in the RentalAmountsByMovie table: {movie_title, rental_amount_for_checkpoint, checkpoint_id}. This way the writes from Flink to DynamoDB will be idempotent (Flink is not doing any updates; it is only doing inserts into the target DDB table). However, the webapp will have to compute the running total on the fly by aggregating results from the RentalAmountsByMovie table. I don't like this solution because of its latency implications for the webapp.
Approach 4: Maybe I can use Flink's Queryable State feature. However, that feature seems to be in Beta:
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/queryable_state.html
I imagine this is a very common aggregation use case. How do folks usually handle updating aggregated results in Flink external sinks?
I appreciate any pointers. Happy to provide more details if needed.
Thanks!
Typically the issue you are concerned about is a non-issue, because folks are using idempotent writes to capture aggregated results in external sinks.
You can rely on Flink to always have accurate information for RentalAmountsByMovie in Flink's internal state. After that it's just a matter of mirroring that information out to DynamoDB.
In general, if your sinks are idempotent, that makes things pretty straightforward. The state held in Flink will consist of some sort of pointer into the input (e.g., offsets or timestamps) combined with the aggregates that result from having consumed the input up to that point. You will need to bootstrap the state; this can be done by processing all of the historic data, or by using the state processor API to create a savepoint that establishes a starting point.
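To illustrate what an idempotent write means here, the sketch below uses an in-memory dictionary as a stand-in for the RentalAmountsByMovie table. It is not Flink or DynamoDB code; the class name `IdempotentTotalsSink`, the movie title, and the amounts are made up for the example. The point is that the sink overwrites each record with the absolute total held in Flink's state instead of incrementing it.

```python
class IdempotentTotalsSink:
    """In-memory stand-in for the RentalAmountsByMovie table."""

    def __init__(self):
        self.table = {}

    def write(self, movie_title, total_rental_amount):
        # Overwrite with the absolute running total; never increment.
        # Replaying the same write leaves the table unchanged.
        self.table[movie_title] = total_rental_amount


sink = IdempotentTotalsSink()
sink.write("Heat", 120)
sink.write("Heat", 120)  # same result emitted again after a restore
sink.write("Heat", 150)  # a later, larger running total
```

With increments, the replayed write would have been counted twice; with overwrites, replays are harmless.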
I have many (order of 100s) pieces of data that I want to associate with a document in CosmosDB. Each piece of data is small (order of 100s of bytes).
My first solution was to store the data as an array inside the document. This works okay, but in order to append a new item to the array I need to read the document from CosmosDB, add the element, then replace the document back into CosmosDB.
Instead of doing this I would like to store each piece of data as its own document in the same partition. What are the drawbacks of having many tiny documents vs the one aggregated document?
What are the drawbacks of having many tiny documents vs the one aggregated document?
I suggest storing each piece of data as its own document instead of using one aggregated document.
Reason 1: As you mentioned in your question, if you want to add an element to the document, you need to read the document from Cosmos DB and then replace it, because partial updates were not supported by Cosmos DB at the time of writing. (Please refer to this feedback item and follow it if you need to: https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/6693091-be-able-to-do-partial-updates-on-document) That is a lot of tedious work.
Reason 2: If you store the pieces of data separately, you can query them flat (select * from c).
With a single document holding an array, you need a join to access the nested items (select a from c join a in c.array).
Reason 3: If you store the pieces of data separately, you can spread them across different partitions. Even if you don't need that now, why not keep the option open for the future?
Reason 4: As for cost, it depends on RUs and storage, and every request to Cosmos DB consumes RUs. If you store the pieces of data separately, you only need to access the specific document you want, which I think is more economical.
Depends on your use case.
For frequent add operations, you first read the document and then write it back (2 operations), which costs more than creating a new document (1 operation).
However, if the documents have some sort of relationship (like foreign keys in traditional SQL), getting the data would require multiple queries if you go with approach #1 above (higher cost); otherwise, you get the complete data in a single query (lower cost).
I'd recommend reading this and this post, which will give you better insight into which approach to choose.
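The operation counts for adding an item, as described in this answer, can be sketched with a toy in-memory store. This only counts round trips and says nothing about actual RU charges; `CountingStore`, the document ids, and the helper names are all hypothetical and do not correspond to the Cosmos DB SDK.

```python
class CountingStore:
    """Toy document store that counts read and write operations."""

    def __init__(self):
        self.docs = {}
        self.operations = 0

    def read(self, doc_id):
        self.operations += 1
        return self.docs[doc_id]

    def upsert(self, doc_id, body):
        self.operations += 1
        self.docs[doc_id] = body


def append_to_array(store, doc_id, item):
    """Aggregated document: read it, add the element, write it back."""
    doc = store.read(doc_id)
    store.upsert(doc_id, {**doc, "items": doc["items"] + [item]})


def add_as_document(store, parent_id, item_id, item):
    """Many small documents: a single write, no prior read."""
    store.upsert(f"{parent_id}:{item_id}", {"parent": parent_id, **item})


store = CountingStore()
store.docs["order-1"] = {"items": []}  # seeded directly, not counted
append_to_array(store, "order-1", {"v": 1})
array_ops = store.operations  # one read plus one replace

store.operations = 0
add_as_document(store, "order-1", "item-2", {"v": 2})
document_ops = store.operations  # one create
```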
I'm facing this question right now and I want to leave my contribution here. I have to store some statuses; a status is a metric that I get once per hour, so I have two options:
Create a record per status -> 24 records per day
Create a record per day and add each status to it -> 1 record per day with 24 statuses in an array
I chose the second one because:
Both options require the same number of database operations
I'm using this data in Power BI, and in my tests the data from the second option was smaller after import
I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table in my DB each time a user enters a specific zone (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought maybe writing a service that will somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data. Instead of calculating them "online" when the page that displays them loads, you calculate them offline, either via a nightly batch process or incrementally when each log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
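As a rough illustration of the incremental variant, the sketch below keeps one counter per (zone, session) pair instead of one row per hit. It is plain Python rather than database code, and the names `hit_counts`, `record_hit`, and `unique_sessions` are assumptions for the example.

```python
from collections import Counter

# One counter per (zone, session) pair instead of one row per hit.
hit_counts = Counter()


def record_hit(zone, session_id):
    """Update the aggregate at write time instead of storing every hit."""
    hit_counts[(zone, session_id)] += 1


def unique_sessions(zone):
    """Number of distinct sessions seen in a zone."""
    return sum(1 for (z, _session) in hit_counts if z == zone)


record_hit("Category", "s1")
record_hit("Category", "s1")
record_hit("Category", "s2")
record_hit("Minisite", "s1")
```

The report then reads a handful of counters rather than scanning every hit, at the cost of a slightly more expensive write.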
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index covers every row, so it grows with your data, whereas each partition holds only one day's rows).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
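The effect of range partitioning by date can be sketched with a toy in-memory table that keeps one list of rows per day. A real database does this internally once the table is declared as partitioned; the class `DayPartitionedTable` and its methods are hypothetical and only show why rows outside the requested range are never touched.

```python
from collections import defaultdict
from datetime import date, timedelta


class DayPartitionedTable:
    """Toy table split into one sub-table (list of rows) per day."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, day, row):
        self.partitions[day].append(row)

    def query(self, start, end):
        """Return rows with start <= day <= end, visiting only those days."""
        rows = []
        day = start
        while day <= end:
            # .get avoids creating empty partitions for days with no data
            rows.extend(self.partitions.get(day, []))
            day += timedelta(days=1)
        return rows


table = DayPartitionedTable()
table.insert(date(2023, 1, 1), {"zone": "Website"})
table.insert(date(2023, 1, 2), {"zone": "Category"})
table.insert(date(2023, 1, 3), {"zone": "Minisite"})
recent = table.query(date(2023, 1, 2), date(2023, 1, 3))
```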