Does adding GSIs to a DynamoDB table slow down write performance? For instance, if I create a DynamoDB table with 5 GSIs, are writes slower than writes to a table with no GSIs? (Assuming GSI updates are eventually consistent.)
No. DynamoDB applies the write to the base table and then propagates it to the GSIs asynchronously, hence the eventual consistency.
You can request strongly consistent reads from the table itself. However, GSIs don't support them; as the DynamoDB documentation puts it:
Strongly consistent reads are not supported on global secondary indexes.
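As a quick illustration, here is a minimal boto3 sketch (the table name, key attributes, and index name are all hypothetical) showing the asymmetry: the ConsistentRead flag works against the base table but is rejected on a GSI.

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("MyTable")  # hypothetical table

    # Strongly consistent read against the base table: supported.
    table.query(
        KeyConditionExpression=Key("pk").eq("customer-123"),
        ConsistentRead=True,
    )

    # The same flag against a GSI raises a ValidationException,
    # because GSIs only serve eventually consistent reads.
    table.query(
        IndexName="my-gsi",  # hypothetical GSI
        KeyConditionExpression=Key("gsi_pk").eq("customer-123"),
        ConsistentRead=True,
    )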
Related
Case 1) When we create a GSI with a partition key different from the main table's partition key, DynamoDB replicates the data into another table under the hood. That much is understood.
Case 2) What if I create a GSI with the same partition key as the main table's PK but just with a different sort key? Will it replicate the data the same way as in Case 1? This situation sounds similar to an LSI because they also share the partition key with the main table. If I created an LSI instead, would it save me any data replication and hence the cost associated with it?
Yes, it replicates the same as Case 1. In general people should use GSIs unless they absolutely require LSIs.
Pros of an LSI:
Enables strongly consistent reads out of the index
Cons of an LSI:
Cannot be added or deleted after table creation
Prevents an item collection (items having the same PK) from growing beyond 10 GB (because to maintain strong reads the item collection has to be co-located)
Prevents adaptive capacity from isolating hot items in the item collection across different partitions (again, due to the need to be co-located)
Increases the likelihood of a hot partition because the base table write and LSI writes always go to the same partition, limiting write throughput to that partition (whereas a GSI has its own write capacity)
It's not actually true to say LSIs don't cost extra. They still consume write capacity, just out of the base table's allotment.
Any GSI regardless of the key is a separate table you pay extra for.
An LSI doesn't cost quite as much extra as a GSI, especially if you're using a provisioned table. Additionally, an LSI offers strongly consistent reads just like the base table; GSIs only offer eventually consistent reads.
However, the downside to using an LSI instead of a GSI is that a table with an LSI limits each item collection (all items sharing a partition key) to 10 GB.
In other words, if you try to store more than 10 GB of data under a single partition (hash) key and the table has any LSIs, the writes will fail.
If there are no LSIs, they will succeed.
Item collection size limit
The maximum size of any item collection for a table which has one or more local secondary indexes is 10 GB. This does not apply to item collections in tables without local secondary indexes, and also does not apply to item collections in global secondary indexes. Only tables that have one or more local secondary indexes are affected.
So depending on your data, it might behoove you to pay for the GSI even if an LSI would work instead.
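To make Case 2 concrete, here is a rough boto3 sketch of a table whose LSI and GSI both reuse the base table's partition key but add a different sort key; every table, attribute, and index name below is made up.

    import boto3

    dynamodb = boto3.client("dynamodb")

    dynamodb.create_table(
        TableName="Orders",
        AttributeDefinitions=[
            {"AttributeName": "customer_id", "AttributeType": "S"},
            {"AttributeName": "order_date", "AttributeType": "S"},
            {"AttributeName": "order_total", "AttributeType": "N"},
        ],
        KeySchema=[
            {"AttributeName": "customer_id", "KeyType": "HASH"},
            {"AttributeName": "order_date", "KeyType": "RANGE"},
        ],
        # LSI: same partition key, different sort key; must be defined at table creation.
        LocalSecondaryIndexes=[
            {
                "IndexName": "by_total_lsi",
                "KeySchema": [
                    {"AttributeName": "customer_id", "KeyType": "HASH"},
                    {"AttributeName": "order_total", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        ],
        # GSI: can use the same partition key, but is still a separate replicated index.
        GlobalSecondaryIndexes=[
            {
                "IndexName": "by_total_gsi",
                "KeySchema": [
                    {"AttributeName": "customer_id", "KeyType": "HASH"},
                    {"AttributeName": "order_total", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        ],
        BillingMode="PAY_PER_REQUEST",
    )

Either index answers the same query; the trade-offs are the ones listed above: the LSI keeps the item collection co-located and capped at 10 GB, while the GSI is replicated and billed separately.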
If I know the primary keys of the items, which approach is best:
Scan with a FilterExpression using the IN operator
BatchGetItem with all the keys in the request parameters
Please recommend a solution in terms of both latency and partition impact.
Probably neither. Of course it all depends on the key schema and the data in the table, but you probably want to create a Global Secondary Index (GSI) for your most frequently used queries.
Having said that, performing scans is highly discouraged, especially when working with large volumes of data. So if you know the primary keys of the items you're interested in, go for BatchGetItem over doing a scan.
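For example, here is a minimal boto3 sketch of the BatchGetItem approach (the table name and key attribute are hypothetical); note that a single call accepts at most 100 keys, and any UnprocessedKeys returned have to be retried.

    import boto3

    dynamodb = boto3.resource("dynamodb")

    # Fetch specific items directly by key instead of scanning the whole table.
    response = dynamodb.batch_get_item(
        RequestItems={
            "MyTable": {  # hypothetical table keyed on "pk"
                "Keys": [{"pk": "item-1"}, {"pk": "item-2"}, {"pk": "item-3"}],
            }
        }
    )
    items = response["Responses"]["MyTable"]
    unprocessed = response.get("UnprocessedKeys", {})  # retry these if non-empty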
I am looking at the Cosmos DB partitioning facility, and what I have gathered so far is that it is good for performance and can really help us avoid fan-out queries. But I am stuck on one question about partitioning for writes: if I have documents of many different types, potentially thousands of them, belonging to the same partition, the writes will be slow; but if I give them different partition keys, I lose the transactional behaviour, because stored procedures are scoped to one transaction within a single partition.
My use case is that I have different types of documents within the same collection, and at any given time I will be updating and inserting thousands of documents of different types, and I have to do that within the same transaction. That means I have to use the same partition key, but if I do, I will be doing hot writes, which is not recommended in Cosmos DB. Any help on how to work around this will be appreciated.
People use stored procedures to batch their documents, and today that does constrain you to one partition. However, be aware of the other consideration: your partition key should be chosen so that your documents fan out across different partitions. So one batch can be for one partition key and the next batch for another.
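A rough sketch of that batching pattern with the azure-cosmos Python SDK, assuming a hypothetical stored procedure named bulkUpsert that writes the documents it receives (the account, database, container, and partition key values are made up too):

    from azure.cosmos import CosmosClient

    client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("mydb").get_container_client("docs")

    # Each execution runs as one transaction scoped to a single partition key value,
    # so group the documents per partition key and call the sproc once per group.
    batches = {
        "tenant-1": [{"id": "a", "pk": "tenant-1"}, {"id": "b", "pk": "tenant-1"}],
        "tenant-2": [{"id": "c", "pk": "tenant-2"}],
    }
    for pk, docs in batches.items():
        container.scripts.execute_stored_procedure(
            sproc="bulkUpsert",   # hypothetical sproc that upserts the docs passed in
            partition_key=pk,
            params=[docs],
        )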
Read more here: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Hope this helps.
Rafat
It's tricky. I do have a large set of docs within a single partition at the moment; maybe later on I will need to redesign the collection. Right now I am using a bulk insert/update library for Cosmos DB: https://learn.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview. It is way faster for large data inserts/updates and is a Microsoft-backed library; it does support transactional behaviour, but only within a single partition. So at the moment, I am safe.
We are using DynamoDB and have some complex queries that would be very easily handled using code instead of trying to write a complicated DynamoDB scan operation. Is it better to write a scan operation or just pull the minimal amount of data using a query operation (query on the hash key or a secondary index) and perform further filtering and reduction in the calling code itself? Is this considered bad practice or something that is okay to do in NoSQL?
Unfortunately, it depends.
If you have even a modestly large table, a table scan is not practical.
If you have complicated query needs, the best way to tackle them in DynamoDB is with Global Secondary Indexes (GSIs) acting as projections on the fields you want. You can use techniques such as sparse indexes (creating a GSI on fields that only exist on a subset of the items) and composite attribute keys (concatenating two or more attributes and using the result as a new attribute to create a GSI on).
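As a small illustration of the composite-attribute and sparse-index ideas (the table, attributes, and index name here are hypothetical):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("Orders")  # hypothetical table

    # Composite attribute: concatenate two fields into one value and key a GSI on it.
    # Because only some items carry "status_country", that GSI is also sparse.
    table.put_item(
        Item={
            "pk": "ORDER#1001",
            "status": "SHIPPED",
            "country": "DE",
            "status_country": "SHIPPED#DE",  # composite attribute used as the GSI key
        }
    )

    # Query the GSI directly for one narrow slice instead of scanning and filtering.
    response = table.query(
        IndexName="status_country_index",  # hypothetical GSI on the composite attribute
        KeyConditionExpression=Key("status_country").eq("SHIPPED#DE"),
    )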
However, to directly address the question "Is it okay to filter using code instead of the NoSQL database?": yes, that is an acceptable approach. The reason for performing filters in DynamoDB is not to reduce the "cost" of the query (that is actually the same), but to decrease unnecessary data transfer over the network.
The ideal solution is to use a GSI to reduce the scope of what is returned to as close to what you want as possible, but if necessary, some additional filtering is fine to eliminate the remaining records, either through a filter expression in DynamoDB or in your own code.
Here's my problem.
I want to ingest lots and lots of data .... right now millions and later billions of rows.
I have been using MySQL and I am playing around with PostgreSQL for now.
Inserting is easy, but before I insert I want to check whether that particular record exists or not; if it does, I don't want to insert it. As the DB grows this operation (obviously) takes longer and longer.
If my data were in a hashmap the lookup would be O(1), so I thought I'd create a hash index to help with lookups. But then I realised that if I have to compute the hash again every time, I will slow the process down massively (and if I don't compute the index, I don't have O(1) lookup).
So I am in a quandary. Is there a simple solution? Or a complex one? I am happy to try other datastores, however I need to be able to do reasonably complex queries, e.g. something similar to SELECT statements with WHERE clauses, so I am not sure whether NoSQL solutions are applicable.
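For concreteness, the check-then-insert pattern being described looks roughly like this with psycopg2 (the table and columns are made up); every row pays for an extra lookup, which gets slower as the table and its index grow.

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
    cur = conn.cursor()

    row = ("2024-01-01", "sensor-7", 42.0)

    # Look the row up first; only insert if it is not already there.
    cur.execute(
        "SELECT 1 FROM readings WHERE recorded_at = %s AND sensor_id = %s",
        (row[0], row[1]),
    )
    if cur.fetchone() is None:
        cur.execute(
            "INSERT INTO readings (recorded_at, sensor_id, value) VALUES (%s, %s, %s)",
            row,
        )
    conn.commit()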
I am very much a novice, so I wouldn't be surprised if there is a trivial solution.
NoSQL stores are good at handling huge volumes of inserts and updates.
MongoDB has a really good feature for update/insert (called upsert) that acts based on whether the document already exists.
Check out this page from the Mongo docs:
http://www.mongodb.org/display/DOCS/Updating#Updating-UpsertswithModifiers
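A minimal pymongo sketch of that upsert pattern (database, collection, and field names are hypothetical); the filter doubles as the existence check, so there is no separate lookup before the insert.

    from pymongo import MongoClient

    collection = MongoClient()["mydb"]["readings"]  # hypothetical database/collection

    # Insert the document only if no match exists; otherwise leave the existing one alone.
    collection.update_one(
        {"sensor_id": "sensor-7", "recorded_at": "2024-01-01"},
        {"$setOnInsert": {"value": 42.0}},
        upsert=True,
    )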
Also, you can check out the safe mode on the Mongo connection, which you can set to false to get more efficient inserts.
http://www.mongodb.org/display/DOCS/Connections
You could use CouchDB. It's NoSQL, so you can't run SQL queries per se, but you can create design documents that allow you to run map/reduce functions over your data.
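As a rough sketch against CouchDB's HTTP API (the database name, design document, and view are all hypothetical), a map/reduce view can stand in for a simple GROUP BY-style query:

    import requests

    base = "http://localhost:5984/mydb"  # hypothetical local CouchDB database

    # Store a design document with a map/reduce view; the map function itself is
    # JavaScript kept as a string inside the design doc.
    requests.put(
        f"{base}/_design/stats",
        json={
            "views": {
                "by_type": {
                    "map": "function (doc) { emit(doc.type, 1); }",
                    "reduce": "_count",
                }
            }
        },
    )

    # Query the view with grouping, roughly a SELECT type, count(*) ... GROUP BY type.
    result = requests.get(
        f"{base}/_design/stats/_view/by_type", params={"group": "true"}
    ).json()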