With DynamoDB I'm never sure what is happening on the other side. This makes me wonder: what exactly happens once I fire a DeleteItem request?
I want to delete a record containing sensitive information once and for good, ideally so that it doesn't even stay in the underlying file system. Is it something DynamoDB can do?
I suspect the answer is no, given the distributed nature of DDB.
In fact, the DDB Whitepaper says
Using this reconciliation mechanism, an “add to cart” operation is never
lost. However, deleted items can resurface.
But I also suspect it does not matter given the list of regulations DDB is compliant with; see Amazon Web Services: Risk and Compliance.
Even if DDB did "wipe" the sectors of disk where a given record was stored on a delete, the data would still be in your various backups.
I suspect the right answer is to make sure you encrypt your DDB tables.
I have a use case where I write data to two DynamoDB tables, say t1 and t2, in a transaction. My app needs to read data from these tables many times (1 write, at least 4 reads). I am considering DAX vs ElastiCache. Does anyone have any suggestions?
Thanks in advance
K
ElastiCache is not intended for use with DynamoDB.
DAX is good for read-heavy apps, like yours. But be aware that DAX is only good for eventually consistent reads, so don't use it with banking apps, etc. where the info always needs to be perfectly up to date. Without further info it's hard to tell more, these are just two general points to consider.
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache that can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second. While DynamoDB offers consistent single-digit millisecond latency, DynamoDB with DAX takes performance to the next level with response times in microseconds for millions of requests per second for read-heavy workloads. With DAX, your applications remain fast and responsive, even when a popular event or news story drives unprecedented request volumes your way. No tuning required. https://aws.amazon.com/dynamodb/dax/
AWS recommends that you use DAX as the solution for this requirement.
ElastiCache is an older approach; it is used to store session state in addition to cached data.
DAX is used extensively for read-intensive workloads served through eventually consistent reads, and for latency-sensitive applications. DAX also maintains two caches:
Item cache - populated with items based on GetItem results.
Query cache - populated based on the parameters passed to the Query or Scan method.
Cheers!
I'd recommend using DAX with DynamoDB, provided most of your reads go through the item-level API (such as GetItem) and NOT the query-level API.
Why? DAX has one odd behavior. From the AWS documentation:
"Every write to DAX alters the state of the item cache. However, writes to the item cache don't affect the query cache. (The DAX item cache and query cache serve different purposes, and operate independently from one another.)"
To elaborate: if a query result is cached and you then perform a write that affects that result, the cached query result will be out of date until its cache entry expires.
This out-of-sync issue is also discussed here.
I find DAX useful only for cached queries, put item and get item. In general very difficult to find a use case for it.
DAX caches queries and scans separately from CRUD operations on individual items. That means that if you update an item and then run a query/scan whose result is already cached, the result will not reflect the change.
You can't invalidate the cache; entries are only dropped when their TTL is reached or when a node's memory is full and it evicts old items.
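To make the out-of-sync behavior concrete, here is a toy model of it in Python. It is not the DAX client or its API; `TwoCacheStore` and its methods are made-up names, a plain dict stands in for the table, and TTL expiry is left out. It only shows why a write that refreshes the item cache can leave a cached query result stale.

```python
class TwoCacheStore:
    """Toy model of a write-through cache with separate item and query caches."""

    def __init__(self):
        self.table = {}        # stands in for the DynamoDB table
        self.item_cache = {}   # key -> item
        self.query_cache = {}  # query parameter -> cached list of keys

    def put_item(self, key, item):
        # Write-through: the table and the item cache are both updated,
        # but the query cache is deliberately left untouched.
        self.table[key] = item
        self.item_cache[key] = item

    def get_item(self, key):
        if key not in self.item_cache:
            self.item_cache[key] = self.table.get(key)
        return self.item_cache[key]

    def query(self, status):
        # Results are cached per query parameter (expiry omitted in this sketch).
        if status not in self.query_cache:
            self.query_cache[status] = sorted(
                k for k, v in self.table.items() if v["status"] == status
            )
        return self.query_cache[status]


store = TwoCacheStore()
store.put_item("order-1", {"status": "open"})
first = store.query("open")                      # caches ["order-1"]
store.put_item("order-1", {"status": "closed"})  # updates table + item cache
item = store.get_item("order-1")                 # sees the new value
stale = store.query("open")                      # still served from the query cache
```

After the second write, `get_item` returns the closed order while the cached query still lists it as open, which is the mismatch described above.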
Takeaways:
Doing puts/updates and then queries - two separate caches, so they get out of sync.
Looking for a single item - you are left with only the primary key and default index and a GetItem request (no Query with limit 1). You can't use any indexes for gets/updates/deletes.
Using the ConsistentRead option with Query to get the latest data - it works, but only for the primary index.
Writing through DAX is slower than writing directly to DynamoDB, since you have an extra hop in the middle.
XRay does not work with DAX
Use Case
You have queries whose results you don't really mind being out of date.
You are doing few PutItem/UpdateItem calls and a lot of GetItem calls.
The data in our vault is manageable for now. Eventually, we will accumulate a large volume. It is not possible to retain that much data for everyday transactions. We would want to periodically archive or warehouse the data, so that query performance is maintained.
May I know if you have thought about handling large-scale datasets, and what your advice would be?
From the corda-dev mailing list:
Yep, we should do some design work around this. As you note it’s not a pressing issue right now but may become one in future.
Our current implementation is actually designed to keep data around even when it’s no longer ‘current’ on the ledger. The ORM mapped vault tables prefer to mark a row as obsolete rather than actually delete the data from the underlying database. Also, the transaction store has no concept of garbage collection or pruning so it never deletes data either. This has clear benefits from the perspective of understanding the history of the ledger and how it got into its current state, but it poses operational issues as well.
I think people will have different preferences here depending on their resources and jurisdiction. Let’s tackle the two data stores separately:
Making the relationally mapped tables delete data is easy, it’s just a policy change. Instead of marking a row as gone, we actually issue a SQL DELETE call.
The transaction store is trickier. Corda benefits from its blockless design here; in theory we can garbage collect old transactions. The devil is in the details however because for nodes that use SGX the tx store will be encrypted. Thus not only do we need to develop a parallel GC for the tx graph, but also, run it entirely inside the enclaves. A fun systems engineering problem.
If the concern is just query performance, one obvious move is to shift the tx store into a scalable K/V store like Cassandra, hosted BigTable etc. There’s no deep reason the tx store must be in the same RDBMS as the rest of the data, it’s just convenient to have a single database to backup. Scalable K/V stores don’t really lose query performance as the dataset grows, so, this is also a nice solution.
W.R.T. things like the GDPR, being able to delete data might help or it might be irrelevant. As with all things GDPR related nobody knows because the EU didn’t bother to define any answers - auditing a distributed ledger might count as a “legitimate need” for data, or it might not, depending on who the judge is on the day of the case.
It is at any rate only an issue when personal data is stored on ledger, which is not most use cases today.
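The "policy change" mentioned above for the relationally mapped tables (mark a row as obsolete versus issue a real SQL DELETE) can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table name `vault_states`, the `consumed` column and the `retire_state` helper are hypothetical and are not Corda's actual schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: 'consumed' marks a row as no longer current.
conn.execute(
    "CREATE TABLE vault_states ("
    "id INTEGER PRIMARY KEY, payload TEXT, consumed INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO vault_states (id, payload) VALUES (?, ?)",
    [(1, "state-a"), (2, "state-b")],
)


def retire_state(conn, state_id, hard_delete):
    """Apply the retention policy to a state that is no longer current."""
    if hard_delete:
        # Policy A: actually remove the row.
        conn.execute("DELETE FROM vault_states WHERE id = ?", (state_id,))
    else:
        # Policy B: keep the row for history, just mark it obsolete.
        conn.execute(
            "UPDATE vault_states SET consumed = 1 WHERE id = ?", (state_id,)
        )


retire_state(conn, 1, hard_delete=False)
retire_state(conn, 2, hard_delete=True)
rows = conn.execute(
    "SELECT id, consumed FROM vault_states ORDER BY id"
).fetchall()
```

With both policies applied, only row 1 remains, flagged as consumed. Note that this covers the mapped tables only, not the transaction store discussed in the email.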
I have a website that allows users to query for specific recipes using various search criteria. For example, you can say "Show me all recipes that I can make in under 30 minutes that will use chicken, garlic and pasta but not olive oil."
This query is sent to the web server over JSON, and deserialized into a SearchQuery object (which has various properties, arrays, etc).
The actual database query itself is fairly expensive, and there's a lot of default search templates that would be used quite frequently. For this reason, I'd like to start caching common queries. I've done a little investigation into various caching technologies and read plenty of other SO posts on the subject, but I'm still looking for advice on which way to go. Right now, I'm considering the following options:
Built in System.Web.Caching: This would provide a lot of control over how many items are in the cache, when they expire, and their priority. However, cached objects are keyed by a string, rather than a hashable object. Not only would I need to be able to convert a SearchQuery object into a string, but the hash would have to be perfect and not produce any collisions.
Develop my own InMemory cache: What I'd really like is a Dictionary<SearchQuery, Results> object that persists in memory across all sessions. Since search results can start to get fairly large, I'd want to be able to cap how many queries would be cached and provide a way for older queries to expire. Something like a FIFO queue would work well here. I'm worried about things like thread safety, and am wondering if writing my own cache is worth the effort here.
I've also looked into some other third party cache providers such as NCache and Velocity. These are both distributed cache providers and are probably completely overkill for what I need at the moment. Plus, it seems every cache system I've seen still requires objects to be keyed by a string. Ideally, I want something that holds a cache in process, allows me to key by an object's hash value, and allows me to control expiration times and priorities.
I'd appreciate any advice or references to free and preferably open source solutions that could help me out here. Thanks!
Based on what you are saying, I recommend you use System.Web.Caching and build that into your DataAccess layer, shielding it from the rest of your system. When called, you can make your real-time query or pull from a cached object based on your business/application needs. I do this today, but with Memcached.
An in-memory cache should be pretty easy to implement. I can't think of any reason why you should have particular concerns about validating the uniqueness of a SearchQuery object versus any other - that is, while the key must be a string, you can just store the original object along with the results in the cache, and validate equality directly after you've got a hit on the hash. I would use System.Web.Caching for the benefits you've noted (expiration, etc.). If there happened to be a collision, then the 2nd one would just not get cached. But this would be extremely rare.
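As a rough sketch of the idea in this answer (string key derived from a hash, original query object stored next to the results, equality checked after a hit), here is a Python version. It is not System.Web.Caching; a plain dict stands in for the string-keyed cache, and `SearchQuery`'s fields, `cache_key`, `put` and `get` are all assumptions made up for the example.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SearchQuery:
    # Hypothetical fields; the real SearchQuery has more properties.
    max_minutes: int
    include: tuple
    exclude: tuple


def cache_key(query):
    # Serialize to canonical JSON, then hash to a fixed-length string key.
    canonical = json.dumps(asdict(query), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


cache = {}  # stands in for a cache that only accepts string keys


def put(query, results):
    key = cache_key(query)
    existing = cache.get(key)
    if existing is not None and existing[0] != query:
        return False  # hash collision: the second query is simply not cached
    cache[key] = (query, results)
    return True


def get(query):
    entry = cache.get(cache_key(query))
    # Validate equality on the stored object, not just on the hash.
    if entry is not None and entry[0] == query:
        return entry[1]
    return None


q = SearchQuery(30, ("chicken", "garlic", "pasta"), ("olive oil",))
put(q, [101, 205])
hit = get(q)
miss = get(SearchQuery(30, ("chicken",), ()))
```

Expiration and priorities are left to the underlying cache in this sketch; only the keying and the equality check are shown.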
Also, the amount of memory needed to store search results should be trivial. You don't need to keep the data of every single field, of every single row, in complete detail. You just need to keep a fast way to access each result, e.g. an int primary key.
Finally, if there are possibly thousands of results for a search that could be cached, you don't even need to keep an ID for each one - just keep the first 100 or something (as well as the total number of hits). I suspect if you analyzed how people use search results, it's a rare person that goes beyond a few pages. If someone did, then you can just run the query again.
So basically you're just storing a primary key for the first X records of each common search, and then if you get a hit on your cache, all you have to do is run a very inexpensive lookup of a handful of indexed keys.
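A minimal sketch of that storage scheme, again in Python rather than .NET: each cached search keeps only the first N primary keys plus the total hit count, and the number of cached searches is capped with FIFO eviction as the question suggested. `ResultPageCache` and its parameters are hypothetical names, and the sketch is not thread-safe.

```python
from collections import OrderedDict


class ResultPageCache:
    """Keeps the first `max_ids` primary keys and the total hit count per search."""

    def __init__(self, max_queries=1000, max_ids=100):
        self.max_queries = max_queries
        self.max_ids = max_ids
        self._entries = OrderedDict()  # insertion order doubles as FIFO order

    def store(self, key, all_ids):
        # Keep only a prefix of the IDs, plus how many hits there were in total.
        self._entries[key] = (list(all_ids[: self.max_ids]), len(all_ids))
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_queries:
            self._entries.popitem(last=False)  # evict the oldest search

    def lookup(self, key):
        return self._entries.get(key)


page_cache = ResultPageCache(max_queries=2, max_ids=3)
page_cache.store("chicken+garlic+pasta", list(range(10)))
entry = page_cache.lookup("chicken+garlic+pasta")
```

On a hit, the cached IDs can be fetched with a cheap indexed lookup; a request beyond the cached prefix would just re-run the full query.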
Take a quick look at the Enterprise Library Caching Application Block. Assuming you want a web-application-wide cache, this might be the solution you're looking for.
I'm assuming that generating a database query from a SearchQuery object is not expensive, and you want to cache the result (i.e. rowset) obtained from executing the query.
You could generate the query text from your SearchQuery object and use that text as the key for a lookup using System.Web.Caching.
From a quick reading of the documentation for the Cache class, it appears that the keys have to be unique, which they would be if you used the query text rather than a hash of it as the key.
EDIT
If you are concerned about long cache keys then check the following links:
Cache key length in asp.net
Maximum length of cache keys in HttpRuntime.Cache object?
It seems that the Cache class stores the cached items in an internal dictionary, which uses the key's hash. Keys (query text) with the same hash would end up in the same bucket in the dictionary, where it's just a quick linear search to find the required one when you do a cache lookup. So I think you'd be okay with long key strings.
The asp.net caching is pretty well thought out, and I don't think this is a case where you need something else.
I have asked a few questions today as I try to think through to the solution of a problem.
We have a complex data structure where all of the various entities are tightly interconnected, with almost all entities heavily reliant/dependent upon entities of other types.
The project is a website (MVC3, .NET 4), and all of the logic is implemented using LINQ-to-SQL (2008) in the business layer.
What we need to do is have a user "lock" the system while they make their changes (there are other reasons for this which I won't go into here that are not database related). While this user is making their changes we want to be able to show them the original state of entities which they are updating, as well as a "preview" of the changes they have made. When finished, they need to be able to rollback/commit.
We have considered these options:
Holding open a transaction for the length of time a user takes to make multiple changes stinks, so that's out.
Holding a copy of all the data in memory (or cached to disk) is an option but there is heck of a lot of it, so seems unreasonable.
Maintaining a set of secondary tables, or attempting to use session state to store changes, but this is complex and difficult to maintain.
Using two databases, flipping between them by connection string, and using T-SQL to manage replication, putting them back in sync after commit/rollback. I.e. switching on/off, forcing snapshot, reversing direction etc.
We're a bit stumped for a solution that is relatively easy to maintain. Any suggestions?
Our solution to a similar problem is to use a locking table that holds locks per entity type in our system. When the client application wants to edit an entity, we do a "GetWithLock" which gets the client the most up-to-date version of the entity's data as well as obtaining a lock (a GUID that is stored in the lock table along with the entity type and the entity ID). This prevents other users from editing the same entity. When you commit your changes with an update, you release the lock by deleting the lock record from the lock table. Since stored procedures are the api we use for interacting with the database, this allows a very straight forward way to lock/unlock access to specific entities.
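A minimal sketch of such a lock table, assuming one lock per (entity type, entity ID) pair and a GUID token handed to the lock holder. It uses Python's sqlite3 instead of stored procedures, and the table name `entity_locks` and the helper functions are made up for illustration; the real "GetWithLock" also returns the entity data, which is omitted here.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_locks ("
    "entity_type TEXT NOT NULL, entity_id INTEGER NOT NULL, "
    "lock_token TEXT NOT NULL, "
    "PRIMARY KEY (entity_type, entity_id))"
)


def acquire_lock(conn, entity_type, entity_id):
    """Return a lock token, or None if the entity is already locked."""
    token = str(uuid.uuid4())
    try:
        conn.execute(
            "INSERT INTO entity_locks VALUES (?, ?, ?)",
            (entity_type, entity_id, token),
        )
    except sqlite3.IntegrityError:
        # The primary key already exists: someone else holds the lock.
        return None
    return token


def release_lock(conn, entity_type, entity_id, token):
    """Delete the lock row; only the holder of the token can release it."""
    cur = conn.execute(
        "DELETE FROM entity_locks "
        "WHERE entity_type = ? AND entity_id = ? AND lock_token = ?",
        (entity_type, entity_id, token),
    )
    return cur.rowcount == 1


first = acquire_lock(conn, "order", 7)        # succeeds
second = acquire_lock(conn, "order", 7)       # None: already locked
released = release_lock(conn, "order", 7, first)
third = acquire_lock(conn, "order", 7)        # succeeds again after release
```

Relying on the primary key constraint makes the acquire step atomic, so two clients cannot both insert a lock for the same entity.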
On the client side, we implement IEditableObject on the UI model classes. Our model classes hold a reference to the instance of the service entity that was retrieved on the service call. This allows the UI to do a Begin/End/Cancel Edit and do the commit or rollback as necessary. By holding the instance of the original service entity, we are able to see the original and current data, which would allow the user to get that "preview" you're looking for.
While our solution does not implement LINQ, I don't believe there's anything unique in our approach that would prevent you from using LINQ as well.
HTH
Consider this:
Long transactions make the system less scalable. If you issue an UPDATE command, the update locks last until commit/rollback, preventing other transactions from proceeding.
Secondary tables/databases can be modified by concurrent transactions, so you cannot rely on the data in them. The only way is to lock them => see point 1.
Serializable transactions in some data engines use versions of the data in your tables. So after the first command is executed, the transaction sees exactly the data that was available at command execution time. This might help you show the changes made by the user, but you have no guarantee of being able to save them back into storage.
DataSets contain old/new versions of the data. But that is unfortunately outside your chosen technology.
Use a set of secondary tables.
The problem is that your connection should see two versions of data while the other connections should see only one (or two, one of them being their own).
While it is possible theoretically and is implemented in Oracle using flashbacks, SQL Server does not support it natively, since it has no means to query previous versions of the records.
You can issue a query like this:
SELECT *
FROM mytable
AS OF TIMESTAMP
TO_TIMESTAMP('2010-01-17')
in Oracle but not in SQL Server.
This means that you need to implement this functionality yourself (placing the new versions of rows into your own tables).
Sounds like an ugly problem, and raises a whole lot of questions you won't be able to go into on SO. I got the following idea while reading your problem, and while it "smells" as bad as the others you list, it may help you work up an eventual solution.
First, have some kind of locking system, as described by #user580122, to flag/record the fact that one of these transactions is going on. (Be sure to include some kind of periodic automated check, to test for lost or abandoned transactions!)
Next, for every change you make to the database, log it somehow, either in the application or in a dedicated table somewhere. The idea is, given a copy of the database at state X, you could re-run the steps submitted by the user at any time.
Next up is figuring out how to use database snapshots. Read up on these in BOL; the general idea is you create a point-in-time snapshot of the database, do whatever you want with it, and eventually throw it away. (Only available in SQL 2005 and up, Enterprise edition only.)
So:
A user comes along and initiates one of these meta-transactions.
A flag is marked in the database showing what is going on. A new transaction cannot be started if one is already in process. (Again, check for lost transactions now and then!)
Every change made to the database is tracked and recorded in such a fashion that it could be repeated.
If the user decides to cancel the transaction, you just drop the snapshot, and nothing is changed.
If the user decides to keep the transaction, you drop the snapshot, and then immediately re-apply the logged changes to the "real" database. This should work, since your requirements imply that, while someone is working on one of these, no one else can touch the related parts of the database.
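The log-and-replay part of this idea can be sketched in a few lines of Python. This is a loose in-memory analogy only: a dict stands in for the "real" database, a deep copy stands in for the scratch copy the user previews against, and `MetaTransaction` is a made-up name. It does not use SQL Server snapshots or any database API.

```python
import copy


class MetaTransaction:
    """Records every change so it can be replayed on commit or dropped on cancel."""

    def __init__(self, live):
        self.live = live                     # stands in for the real database
        self.preview = copy.deepcopy(live)   # scratch copy the user works against
        self.log = []                        # ordered list of (key, value) changes

    def set(self, key, value):
        # Log the change in a repeatable form, then apply it to the preview only.
        self.log.append((key, value))
        self.preview[key] = value

    def commit(self):
        # Re-apply the logged changes, in order, to the real data.
        for key, value in self.log:
            self.live[key] = value
        self.log.clear()

    def cancel(self):
        # Throw the scratch copy away; the real data was never touched.
        self.log.clear()
        self.preview = copy.deepcopy(self.live)


db = {"price": 10, "stock": 5}
tx = MetaTransaction(db)
tx.set("price", 12)
before_commit = (db["price"], tx.preview["price"])  # original vs preview
tx.commit()
after_commit = db["price"]
```

Until `commit` is called, the original and the previewed values are both available, which matches the "show the original state plus a preview" requirement in the question.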
Yep, this sure smells, and it may not apply too well to your problem. Hopefully the ideas here help you work something out.
I need to manage the acquisition of many records per hour, about 1000000. Every second, I need to get the last inserted value for every primary key. It works quite well with sharding. I was thinking of trying a capped collection to keep only the last record for every primary key. To do this, I currently make two separate inserts. Is there a way in MongoDB to create some kind of trigger that propagates an insert into one collection to another collection?
MongoDB does not have any support for triggers or similar behavior.
The only way to do this is to make it happen in your code. So the code that writes the first entry should also write the second.
People have definitely requested triggers. If they are necessary for your solution, please cast a vote on the feature request.
I disagree that triggers are needed. MongoDB was created to be very fast and to provide functionality that is as basic as possible. That is the strength of this solution.
I think the best thing here is to create the triggers inside your application, as part of the data access layer.
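A minimal sketch of what such an application-level "trigger" could look like: one data access method that performs both writes, so callers never insert into just one collection. To keep it self-contained, a list and a dict stand in for the two MongoDB collections; `ReadingsDao` and its fields are hypothetical names, and no MongoDB driver calls are shown.

```python
class ReadingsDao:
    """Data access layer that writes every record to both 'collections'."""

    def __init__(self):
        self.history = []  # stands in for the collection holding every record
        self.latest = {}   # stands in for the "last value per key" collection

    def insert(self, key, value, timestamp):
        doc = {"key": key, "value": value, "ts": timestamp}
        # First write: append to the full history.
        self.history.append(doc)
        # Second write (the "trigger"): keep only the newest record per key.
        current = self.latest.get(key)
        if current is None or timestamp >= current["ts"]:
            self.latest[key] = doc


dao = ReadingsDao()
dao.insert("sensor-1", 20.5, timestamp=1)
dao.insert("sensor-1", 21.0, timestamp=2)
dao.insert("sensor-2", 3.3, timestamp=1)
last_value = dao.latest["sensor-1"]["value"]
```

With a real database the two writes are not atomic unless you add your own handling, so a failure between them would leave the "latest" collection behind the history.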