We want to keep certain documents in our DB for only a short duration. Once a document is created, it should be deleted after, say, X time units, no matter how often it is modified in between.
We looked at Time to Live (TTL) in Cosmos DB, but it seems to measure the TTL from the last edit rather than from creation.
One approach we are considering is to reduce the TTL on every update, based on the current time versus the document's last update time. It is hacky and prone to errors due to clock skew.
Is there a better/more accurate approach to achieving expiry from creation time? Our next option is to set up a Service Bus event that triggers document deletion, but even that is more of a best-effort approach than an accurate TTL.
Every time you update a record, you can derive a new TTL from the current TTL and the _ts field: first read the item, compute the new (smaller) TTL, then write the update together with it.
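For illustration, a minimal sketch in Python with the azure-cosmos SDK; the container handle, /deviceId partition key, and the assumption that items carry an explicit ttl field are all hypothetical:

import time

def update_preserving_expiry(container, item_id, device_id, changes):
    # Read the current item; _ts is the server-side last-modified time (epoch seconds).
    item = container.read_item(item=item_id, partition_key=device_id)

    # Remaining lifetime = old ttl minus the time elapsed since the last write.
    # Comparing local time with the server-written _ts still assumes modest clock skew.
    elapsed = int(time.time()) - item["_ts"]
    item["ttl"] = max(item["ttl"] - elapsed, 1)  # keep at least 1 second so it still expires

    item.update(changes)
    return container.replace_item(item=item, body=item)

The expiry therefore stays anchored (approximately) to the original creation time, even though each write resets _ts.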
I have diagnostic data for devices being written to Cosmos; some devices write thousands of messages a day while others write just a few. I always want there to be diagnostics data regardless of when it was added, but I don't want to retain all of it forever. A TTL of 90 days works fine for the very active devices: they will always have diagnostics data because they send it in daily. The less active devices, however, will lose their diagnostics logs after the TTL.
Is there a way to use the TTL feature of Cosmos DB but always keep at least n records?
I am looking for something like keeping only the records from the last 90 days (TTL) but always retaining at least 100 documents, regardless of the last-updated timestamp.
There are no built-in quantity-based filters for TTL: you either have collection-level TTL, or collection plus item TTL (item-level TTL overriding the default set on the collection).
You'd need to create something yourself, where you'd mark eligible documents for deletion (based on time period, perhaps?), and then run a periodic cleanup routine based on item counts, age of delete-eligible items, etc.
Alternatively, you could treat low-volume and high-volume devices differently, with high-volume device telemetry written to TTL-based collections, and low-volume device telemetry written to no-TTL collections (or something like that)...
tl;dr this isn't something built-in.
Short answer: there's no such built-in functionality.
You could create your own Function App, running on a timer (schedule) trigger, that fires a query such as:
SELECT *
FROM c
WHERE NOT IS_DEFINED(c.ttl) --only update items that have no ttl
ORDER BY c._ts DESC
OFFSET 100 LIMIT 2147483647 --skip the newest 100
and then updates the items it returns by setting a ttl on them. That way you can be sure that the newest 100 records remain available (assuming you don't have another process deleting them), while the other items are cleaned up periodically. Keep in mind that the update resets the ttl clock, since _ts will be updated.
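A minimal sketch of that timer-triggered function in Python with the azure-cosmos SDK; the 90-day window, the cross-partition query, and the container handle are assumptions for illustration:

NINETY_DAYS = 90 * 24 * 60 * 60  # ttl is expressed in seconds

QUERY = (
    "SELECT * FROM c "
    "WHERE NOT IS_DEFINED(c.ttl) "
    "ORDER BY c._ts DESC "
    "OFFSET 100 LIMIT 2147483647"
)

def expire_all_but_newest(container):
    # Scope the query to a single partition (device) if you need
    # "at least 100 per device" rather than 100 overall.
    for item in container.query_items(QUERY, enable_cross_partition_query=True):
        item["ttl"] = NINETY_DAYS
        container.replace_item(item=item, body=item)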
We use Cosmos DB to track all our devices, and data related to a device (but not stored in the device document itself) is kept in the same container under the same partition key value.
Both the device document and the related documents use /deviceId as the partition key. When a device is removed, I remove the device document. I actually want to remove the entire logical partition, but this doesn't seem to be possible, so I fall back to querying for all items with that partition key and removing them from the database one by one.
This works fine, but it may consume a lot of RUs when there is a lot of related data (which is true in some cases). I would rather just remove the device document and schedule the related data for removal later (it doesn't hurt to keep it in the database for a while), then start removing those items when RU utilization is low. Is there a standard solution for this?
The ideal solution would be to schedule the deletions and have Cosmos DB process them when it has spare RUs, just as it does with TTL deletion. Is that even possible?
A feature is now in preview to delete all items by partition key using a fire-and-forget background processing model with a limited amount of the available throughput. There's a sign-up link on the feature request page to get access to the preview.
Currently, the API looks like a new DeleteAllItemsByPartitionKey method in the SDK.
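For reference, a minimal sketch of what that could look like from the Python SDK; the method name delete_all_items_by_partition_key, the account URL, and the database/container names are assumptions based on the preview and may change before general availability:

from azure.cosmos import CosmosClient

client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

# Fire-and-forget: the service removes the documents in the background
# using a limited amount of throughput, similar to TTL deletion.
container.delete_all_items_by_partition_key("<partition-key-value>")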
It definitely is possible to set a TTL and then let Cosmos handle expiring data out of the container when it is idle. However, the cost of updating a document is about the same as the cost of deleting it, so you're not gaining much.
An approach, as you suggest, could be to have a separate container (or even a queue) into which you insert a new item with the deviceId to retire. Then, in the evenings or at a time when you know the system is idle, run a job that reads the next deviceId from the queue, queries for all the items with that partition key, and either deletes the data or sets a TTL to expire it.
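A minimal sketch of that off-peak job in Python with the azure-cosmos SDK, assuming a small "retirements" container (partitioned on /deviceId) that holds the deviceIds to clean up; all names are illustrative:

ONE_DAY = 24 * 60 * 60  # let TTL expire the data within a day

def retire_next_device(retirements, data_container):
    pending = list(retirements.query_items(
        "SELECT TOP 1 * FROM c", enable_cross_partition_query=True))
    if not pending:
        return
    ticket = pending[0]
    device_id = ticket["deviceId"]

    # Stamp a ttl on every related document instead of deleting outright,
    # so Cosmos expires them later when it has spare capacity.
    for item in data_container.query_items("SELECT * FROM c", partition_key=device_id):
        item["ttl"] = ONE_DAY
        data_container.replace_item(item=item, body=item)

    retirements.delete_item(item=ticket["id"], partition_key=device_id)

As noted above, stamping the TTL costs roughly as many RUs as deleting outright, so the main benefit is shifting the work to a quiet window rather than avoiding it.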
There is a feature in the works to delete an entire partition that would be perfect for this scenario (in fact, it's designed for it), but there is no ETA on availability.
I use a Cosmos change feed processor to read from my CosmosDb.
How can I see, in the lease container, which record was read last?
There are two documents in the lease container, but I don't understand how they point to a Cosmos record, if they do at all.
There is no simple way to know. The leases don't point at a particular record; they point at the last point in time at which you consumed the Change Feed and the Processor checkpointed. So if the Processor starts again (or continues execution), it will look for changes after that last saved Continuation.
You cannot correlate the saved Continuation with a particular document. The leases do have a Timestamp, though, which indicates when they were last saved; you could use that to get a temporal notion of progress.
SHORT VERSION: How can I trigger events regularly in AWS Lambda?
LONG VERSION: My situation is that I have events in a database that expire within a certain time. I want to run a function (send push notifications, delete rows, etc.) whenever I figure out that an event has expired. I know that setting up a timer for every single event created would be impractical, but is there something that would scan my database every minute or so and look for expired events to run my code on? If not, is there some alternative approach?
You could store your events in a DynamoDB table keyed on a UUID, and add a hash-range GSI to this table where the hash key is an expiry-time bucket, such as the hour an event expires (20150701T04Z), and the range key is the exact expiry timestamp (Unix time). That way, for a given expiry bucket, you can run a range Query on the hour you are expiring events for and take advantage of key conditions to limit the read to the time range you are interested in. GSIs do not enforce uniqueness, so you are still OK even if multiple events expire at the same Unix time. By projecting ALL attributes instead of KEYS_ONLY or INCLUDE, you can drive your event expiry off the GSI without a second read back to the base table. By adjusting the size of your expiry buckets (hours, minutes, or days are all good candidates), you greatly reduce the chance that your writes to the base table and queries on the GSI get throttled, because the expiry buckets, having different hash keys, are spread evenly throughout the hash key space.
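A minimal sketch of that GSI query using boto3; the table name ("events"), index name ("expiry-index"), and attribute names ("expiry_bucket", "expiry_ts") are illustrative, not from the question:

import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
events = dynamodb.Table("events")

def expired_events_in_bucket(bucket, now=None):
    # bucket is the hour being swept, e.g. "20150701T04Z"
    now = now or int(time.time())
    resp = events.query(
        IndexName="expiry-index",
        KeyConditionExpression=Key("expiry_bucket").eq(bucket) & Key("expiry_ts").lte(now),
    )
    return resp["Items"]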
Regarding event processing and the use of Lambda: first, you could have an EC2 instance perform the queries and delete items from the event table as they expire (or tombstone them by marking them as expired). Deleting the event items keeps the size of your table manageable and helps you avoid IOPS dilution in the table's partitions; if the number of items grows without bound, the partitions keep splitting, leaving smaller and smaller amounts of provisioned throughput on each one unless you up-provision the table. Next in the pipeline, you could enable a DynamoDB stream on the event table with a stream view type that includes old and new images, and attach a Lambda function to that stream to do the event-driven processing (push notifications, etc.). You can have the Lambda function fire notifications when the old image is populated and the new image is null, or when the difference between the old and new images indicates that an event was tombstoned.
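A minimal sketch of the stream-triggered Lambda handler, assuming the stream view type includes old and new images and that notification delivery is a hypothetical helper:

def handler(event, context):
    for record in event["Records"]:
        images = record["dynamodb"]
        old_image = images.get("OldImage")
        new_image = images.get("NewImage")

        # A delete (old image present, no new image), or a modify that sets an
        # "expired" flag, indicates the event has expired.
        deleted = old_image and new_image is None
        tombstoned = old_image and new_image and new_image.get("expired", {}).get("BOOL")
        if deleted or tombstoned:
            send_push_notification(old_image)  # hypothetical helper, e.g. an SNS publish

def send_push_notification(old_image):
    # Attribute values arrive in DynamoDB JSON form, e.g. {"S": "..."}
    print("expired event:", old_image)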
There's support now for scheduled Lambda jobs I believe, but I haven't tried it yet. https://aws.amazon.com/about-aws/whats-new/2015/10/aws-lambda-supports-python-versioning-scheduled-jobs-and-5-minute-functions/
I have created a BPEL process and added a DB adapter that polls a table for newly added rows.
My polling interval is 60 seconds, but my process creates a new instance every 60 seconds; ideally it should create a work item in the application only when the table has actually changed.
Please guide me if I am doing anything wrong.
I presume that if you look at the instances being created, you will notice that you are getting the same data back each time.
This occurs when the DB adapter cannot tell which records have already been read.
The simplest way to handle this is to have the DB adapter mark each record as read after processing it. One solution is to add an indicator column to your schema that is set to read or unread.
But effectively, without further information, your issue is most likely that you are re-reading the same record on every iteration, so you need to use one of the adapter's options for determining that a record has been read.