I have around 200 tables in an SQLite database with hundreds to millions rows. These tables are queried by many concurrent processes of an OLTP application.
Each table needs to be occasionally updated by a full refresh -- delete all rows and then insert different ones, all within a transaction. For largest tables this takes almost one minute, but such refresh only happens few times a day for a table.
I need to make sure that the readers do not need to wait for the refresh to finish -- should use old version of table data until the transaction completes, then use the new version. Any waits, if need be, should be in terms of milliseconds rather than seconds or minutes.
Can this be achieved, i.e. avoid any database locks or table locks blocking the readers? I am not concerned about writers, they can wait and serialize.
It is a Python application with pysqlite.
Use WAL mode:
WAL provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently.
Related
(Note: sorry if I am using the relational DB terms here.)
Let's say I have ten clients that are connected to a database. This database has a sustained throughput of about 1k updates per second. Obviously sending 1k updates per second to a web-browser (let's say 1MB data changes per second) is not going to be a good experience for the end-user. Does Firebase have any controls as to how much data a client can 'accept' before it starts throttling it? I understand it may batch requests, but my point here is, Google can accept data/updates faster than a browser can (potentially from a phone on a weak internet connection), so what controls or techniques are there in place to control this experience for the end-user?
The only items I see from the docs are:
You should not update a single document more than once per second. If you update a document too quickly, then your application will experience contention, including higher latency, timeouts, and other errors.
https://firebase.google.com/docs/firestore/best-practices#updates_to_a_single_document
This topic is covered in here, keeping the language used to code in aside, the linked code in that answer can assist.
In a general explanation, if your client application is configured to listen for Firestore updates, it will receive all the update events to that listener (just like you mentioned is happening).
You can consider polling Firebase for changes. The poll can even be an extension of the client application code where the code tracks the frequency of the updates being received and has a maximum value of updates per second which, when reached, results in the client disconnecting as a listener and performing periodic polls for the data.
The listener could then be re-established after a period to continue the normal workflow when there are fewer updates again.
The above being said, this is not optimal and treats the symptom rather than the cause. If a listener is returning too many updates, you should consider the structure of the data and look to isolate the updates to only require updates to listeners that require it.
Similarly, the large updates can be mitigated by ensuring smaller records contain the changes resulting in less data.
A generalized example is where two fields of data are updated, but the record is 150 fields in size. Rather than returning the full 150 fields, shard the fields into different data sets, so the two fields are in their own record with an additional reference field used to correlate with a second data set of the remaining 148 fields (plus the reference field).
When the smaller record is updated, the client application receives the small update, determines if the update is applicable to itself, and if so, fetches the corresponding larger record.
To prevent high volumes of writes from overwhelming the client's snapshot listeners, you could periodically duplicate the writes to a proxy collection that the client watches instead.
Documents would need a field to record the time of the last duplicate write to the proxy collection, and the process performing the writes should avoid making writes to the proxy collection until after the frequency duration has elapsed.
A small number of unnecessary writes may still occur due to any concurrent processes you have, but these might be insignificant in practice (with a reasonably long duplication frequency).
If the data belongs to a user, rather than being global data, then you could conceivably adjust the frequency of writes per user to suit their connection, either dynamically or based on user configuration.
In this way, your processes get to control the frequency of writes seen by clients, without needing to throttle or otherwise reject ingress writes (which would presumably be bad news for the upstream processes).
Relevant part of the documentation below.
https://firebase.google.com/docs/firestore/best-practices#realtime_updates
Limit the collection write rate
1,000 operations/second
Keep the rate of write operations for an individual collection under 1,000 operations/second.
Limit the individual client push rate
1 document/second
Keep the rate of documents the database pushes to an individual client under 1 document/second.
We use Cosmos DB to track all our devices and also data that is related to the device (and not stored in the device document itself) is stored in the same container with the same partition ID.
Both the device document and the related documents have /deviceId as the partition key. When a device is removed, then I remove the device document. I actually want to remove the entire partition, but this doesn't seem to be possible. So I revert to a query that queries for all items with this partition key and remove them from the database.
This works fine, but may consume a lot of RUs if there is a lot of related data (which may be true in some cases). I would rather just remove the device and schedule all related data for removal later (it doesn't hurt to have them in the database for a while). When RU utilization is low, then I start removing these items. Is there a standard solution to do this?
The best solution would be to schedule this and that Cosmos DB would process these commands when it has spare RUs, just like with the TTL deletion. Is this even possible?
A feature is now in preview to delete all items by partition key using fire and forget background processing model with a limited amount of available throughput. There's a signup link in the feature request page to get access to preview.
Currently, the API looks like a new DeleteAllItemsByPartitionKey method in the SDK.
It definitely is possible to set a TTL and then let Cosmos handle expiring data out of the container when it is idle. However, the cost to update the document in the first place is about what it costs to delete it anyway so you're not gaining much.
An approach as you suggest, may be to have a separate container (or even a queue) where you insert a new item with the deviceId to retire. Then in the evenings or during a time when you know the system is idle. Run a job that reads the next deviceId in the queue, queries for all the items with that partition key, then deletes the data or sets the TTL to expire the data.
There is a feature to delete an entire partition in the works that would be perfect for this scenario (in fact, it's designed for it) but no ETA on availability.
I have a SQL Azure database on which I need to perform some data archiving operation.
Plan is to move all the irrelevant data from the actual tables into Archive_* tables.
I have tables which have up to 8-9 million records.
One option is to write a stored procedure and insert data in to the new Archive_* tables and also delete from the actual tables.
But this operation is really time consuming and running for more than 3 hrs.
I am in a situation where I can't have more than an hour's downtime.
How can I make this archiving faster?
You can use Azure Automation to schedule execution of a stored procedure every day at the same time, during maintenance window, where this stored procedure will archive the oldest one week or one month of data only, each time it runs. The store procedure should archive data older than X number of weeks/months/years only. Please read this article to create the runbook. In a few days you will have all the old data archived and the Runbook will continue to do the job from now and on.
You can't make it faster, but you can make it seamless. The first option is to have a separate task that moves data in portions from the source to the archive tables. In order to prevent table lock escalations and overall performance degradation I would suggest you to limit the size of a single transaction. E.g. start transaction, insert N records into the archive table, delete these records from the source table, commit transaction. Continue for a few days until all the necessary data is transferred. The advantage of that way is that if there is some kind of a failure, you may restart the archival process and it will continue from the point of the failure.
The second option that does not exclude the first one really depends on how critical the performance of the source tables for you and how many updates are happening with them. It if is not a problem you can write triggers that actually pour every inserted/updated record into an archive table. Then, when you want a cleanup all you need to do is to delete the obsolete records from the source tables, their copies will already be in the archive tables.
In the both cases you will not need to have any downtime.
I have a messaging app, where all messages are arranged into seasons by creation time. There could be billions of messages each season. I have a task to delete messages of old seasons. I thought of a solution, which involves DynamoDB table creation/deletion like this:
Each table contains messages of only one season
When season becomes 'old' and messages no longer needed, table is deleted
Is it a good pattern and does it encouraged by Amazon?
ps: I'm asking, because I'm afraid of two things, met in different Amazon services -
In Amazon S3 you have to delete each item before you can fully delete bucket. When you have billions of items, it becomes a real pain.
In Amazon SQS there is a notion of 'unwanted behaviour'. When using SQS api you can act badly regarding SQS infrastructure (for example not polling messages) and thus could be penalized for it.
Yes, this is an acceptable design pattern, it actually follows a best practice put forward by the AWS team, but there are things to consider for your specific use case.
AWS has a limit of 256 tables per region, but this can be raised. If you are expecting to need multiple orders of magnitude more than this you should probably re-evaluate.
You can delete a table a DynamoDB table that still contains records, if you have a large number of records you have to regularly delete this is actually a best practice by using a rolling set of tables
Creating and deleting tables is an asynchronous operation so you do not want to have your application depend on the time it takes for these operations to complete. Make sure you create tables well in advance of you needing them. Under normal circumstances tables create in just a few seconds to a few minutes, but under very, very rare outage circumstances I've seen it take hours.
The DynamoDB best practices documentation on Understand Access Patterns for Time Series Data states...
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
It's perfectly acceptable to split your data the way you describe. You can delete a DynamoDB table regardless of its size of how many items it contains.
As far as I know there are no explicit SLAs for the time it takes to delete or create tables (meaning there is no way to know if it's going to take 2 seconds or 2 minutes or 20 minutes) but as long your solution does not depend on this sort of timing you're fine.
In fact the idea of sharding your data based on age has the potential of significantly improving the performance of your application and will definitely help you control your costs.
I want to implement a views counter like most forums, Youtube and several others have. So every time a user reads an article, that is stored and remembered. I also want to know who looked at the article.
My queston is: How do you implement this efficiently? What is the best practice?
One way would be to call a stored procedure for every view, but that would result in a lot of unneeded calls to the database.
Another way would be to store this to some global application object, and then store in DB every 5 minutes or so (and can you even do that in a good way?)
What's the best way to do this?
Database operations are surprisingly cheap and really are not worth worrying about. In the event that a DB operation was even marginally expensive then you can always delegate the blocking operation to a new thread thus freeing-up your page-generation thread (you can trivially do this for UPDATE and INSERT operations that return nothing from the database - they are inconsequential).
Sprocs aren't really in-fashion right now - the performance advantage they might have had from pre-computed execution plans is almost eliminated because modern servers cache plans from all previous queries, and for trivial SELECT, INSERT, and UPDATEs you begin to suffer from increased code complexity. There's nothing wrong with inline SQL commands now.
Anyway, back on-topic and in summary: your assumptions are wrong. There is nothing wrong with running UPDATE Pages SET ViewCount = ViewCount + 1 WHERE PageId = #pageId on every page-view. There is also nothing wrong with doing this either: INSERT INTO UserPageviews (UserId, PageId, DateTime) VALUES ( #userId, #pageId, NOW() ). Both operations are very cheap and will execute in under 2-3 miliseconds on even an old and aged database server.
Another way would be to store this to some global application object,
and then store in DB every 5 minutes or so (and can you even do that
in a good way?)
This method is very prone to data loss unless you use a durable queueing mechanism (like MSMQ). Unless you anticipate massive traffic, I wouldn't even think about this approach.
Writes of this nature are inexpensive and hundreds of operations per second are not a big deal. I recently built a comment/rating framework that acheives throughput of 3000+ complete transactions per second just on my local all-in-one workstation. This included processing the request, validation, and creating multiple records within a transaction.
As a note, you should take steps to ensure that your statistics data isn't vulnerable to artificial inflation/manipulation. This part of the process will probably be more complex than the view tracking itself. For example, a user should not be able to sit and hold down the F5 key and inflate the number of views on their video. Nor should these values be manipulable by HTTP (e.g. creating a small script to send an AJAX request over and over).
This suggests that each INSERT would be preceded by a SELECT to ensure that the same user ID or IP hadn't already been recorded in some period of time. Of course, this isn't foolproof (unless you invest a great deal of effort), but it errs on the side of conservatism which is usually a good approach.
One way would be to call a stored procedure for every view, but that
would result in a lot of unneeded calls to the database.
I regularly have to remind myself (and other developers) to not fear the database. People (me included) sometimes go to great lengths to avoid a few simple database calls. Keep your tables narrow and well-indexed, and operations like this are faster than you might think.