I have a use case where I have to write 5000 records into a DynamoDB table in one shot. I am using the batchSave API of the DynamoDBMapper library; it can write up to 25 records in one go.
Can I pass the list of 5000 records to it, so that it internally converts them into batches of 25 records and writes them to the DynamoDB table? Or will I have to handle this in my own code with some chunking logic and pass only 25 records at a time to batchSave?
According to its documentation, batchSave():
Saves the objects given using one or more calls to the AmazonDynamoDB.batchWriteItem
Indeed, it splits the items you give it into appropriately sized batches (25 items) and writes them using the DynamoDB BatchWriteItem operation, so you can pass the full list of 5000 records to a single mapper.batchSave(...) call and let it make the 200 underlying requests. Note that batchSave returns a List<FailedBatch>, which you should check for any items that could not be written.
You can see the code that does this in batchWrite() in DynamoDBMapper.java:
/** The max number of items allowed in a BatchWrite request */
static final int MAX_ITEMS_PER_BATCH = 25;

// Break into chunks of 25 items and make service requests to DynamoDB
for (final StringListMap<WriteRequest> batch :
        requestItems.subMaps(MAX_ITEMS_PER_BATCH, true)) {
    List<FailedBatch> failedBatches = writeOneBatch(batch, config.getBatchWriteRetryStrategy());
    ...
Here are the methods I use to achieve this. I manage to do it by first chunking the data array into small arrays (of length 25):
const queryChunk = (arr, size) => {
  const tempArr = []
  for (let i = 0, len = arr.length; i < len; i += size) {
    tempArr.push(arr.slice(i, i + size));
  }
  return tempArr
}

const batchWriteManyItems = async (tableName, itemObjs, chunkSize = 25) => {
  return await Promise.all(queryChunk(itemObjs, chunkSize).map(async chunk => {
    await dynamoDB.batchWriteItem({RequestItems: {[tableName]: chunk}}).promise()
  }))
}
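For illustration, here is a hypothetical call site. The table name and attribute names below are made up, and each element passed in must already be a WriteRequest object in DynamoDB's low-level attribute-value format:

const AWS = require('aws-sdk')
const dynamoDB = new AWS.DynamoDB()

// Hypothetical source records.
const records = [{id: '1', note: 'a'}, {id: '2', note: 'b'}]

// Wrap each record as a WriteRequest for batchWriteItem.
const itemObjs = records.map(r => ({
  PutRequest: {Item: {id: {S: r.id}, note: {S: r.note}}}
}))

batchWriteManyItems('MyTable', itemObjs)
  .then(() => console.log('all chunks written'))
  .catch(err => console.error(err))

One caveat: batchWriteItem can return UnprocessedItems when the table is throttled, and the helper above does not retry those, so inspect the responses if you need stronger delivery guarantees.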
I have been trying to learn Firebase Cloud Functions recently, and I have written an HTTP function that takes the itemName, sellerUid, and quantity. Then I have a background trigger (an onWrite) that finds the item price for the provided sellerUid and itemName, computes the total (item price * quantity), and then writes it into a document in Firestore.
My question is:
With what I have right now, suppose my client purchases N items; does this mean that I will have:
N reads (from the N items' price lookups),
2 writes (one initial write for the N items and 1 for the total amount after computation),
and N lookups from the Cloud Function?
I am not exactly sure how Cloud Functions count towards reads and writes, or how much compute time this needs (it's all just text, so it should be negligible?).
Would love to hear your thoughts on whether what I have is already good enough, or if there is a much more efficient way of going about this.
Thanks!
exports.itemAdded = functions.firestore.document('CurrentOrders/{documentId}').onWrite(async (change, context) => {
  const snapshot = change.after.data();
  var total = 0;
  for (const [key, value] of Object.entries(snapshot)) {
    if (value['Item Name'] != undefined) {
      await admin.firestore().collection('Items')
          .doc(key).get().then((dataValue) => {
            const itemData = dataValue.data();
            if (!dataValue.exists) {
              console.log('This is empty');
            } else {
              total += (parseFloat(value['Item Quantity']) * parseFloat(itemData[value['Item Name']]['Item Price']));
            }
          });
      console.log('This is in total: ', total);
    }
  }
  snapshot['Total'] = total;
  console.log('This is snapshot afterwards: ', snapshot);
  return change.after.ref.set(snapshot);
});
With your current approach you will be billed with:
N reads (from the N items' price searching);
1 write that triggers your onWrite function;
1 write that persists the total value;
A better approach I can think of is to compare the list of values in change.before.data() and change.after.data(), read the current total value (0 if this is the first run), and then add only the values that were newly added in change.after.data() instead of reading all N items, which would potentially result in you being charged for fewer reads, as sketched below.
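A minimal sketch of that delta approach, reusing the field names from your code (the early returns are an assumption on my part; the one on newKeys also keeps the function from re-triggering itself on the write that stores the total):

exports.itemAdded = functions.firestore.document('CurrentOrders/{documentId}').onWrite(async (change, context) => {
  if (!change.after.exists) return null; // document was deleted

  const before = change.before.exists ? change.before.data() : {};
  const after = change.after.data();

  // Only the entries that were not present before need a price read.
  const newKeys = Object.keys(after)
      .filter((key) => after[key]['Item Name'] !== undefined && !(key in before));
  if (newKeys.length === 0) return null; // nothing new to price

  // Start from the previously computed total (0 on the first run).
  let total = after['Total'] || 0;
  for (const key of newKeys) {
    const dataValue = await admin.firestore().collection('Items').doc(key).get();
    if (dataValue.exists) {
      const itemData = dataValue.data();
      total += parseFloat(after[key]['Item Quantity']) *
          parseFloat(itemData[after[key]['Item Name']]['Item Price']);
    }
  }
  return change.after.ref.set({'Total': total}, {merge: true});
});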
For the actual pricing, if you check this documentation for Cloud Functions, you will see that in your case only invocation and compute billing apply. However, there is a free tier for both, so if you are using this only to learn and the app does not get a lot of use, you should stay within the free tier with either approach.
Let me know if you need any more information.
I am trying to fetch the items from a DynamoDB table to put them in a CSV file. Following is the code:
ArrayList<String> ids = new ArrayList<String>();
ScanResult result = null;
do {
    ScanRequest req = new ScanRequest();
    req.setTableName("table");
    req.withLimit(10);
    if (result != null) {
        req.setExclusiveStartKey(result.getLastEvaluatedKey());
    }
    AmazonDynamoDBClient client = new AmazonDynamoDBClient(awsCreds);
    result = client.scan(req);
    List<Map<String, AttributeValue>> rows = result.getItems();
    for (Map<String, AttributeValue> map : rows) {
        try {
            AttributeValue v = map.get("prod_number");
            String id = v.getS();
            ids.add(id);
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage());
        }
    }
} while (result.getLastEvaluatedKey() != null);
System.out.println("Result size: " + ids.size());
I want to know why req.withLimit(10) has no impact on the number of results. The scan still fetches all the records.
The limit property of ScanRequest means:
The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed dataset size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Working with Queries in the Amazon DynamoDB Developer Guide.
So, it limits only the size of the portion of data returned by a single request, not the whole scan operation. And since you issue requests in a loop until LastEvaluatedKey is null, you keep getting more data; if you only want 10 items in total, stop the loop after the first request, or once you have collected enough items, as sketched below.
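For illustration, a sketch of that early stop using the JavaScript SDK (the same change applies to the Java loop in the question; the table name and prod_number attribute mirror it, and the scanAtMost helper is hypothetical):

const AWS = require('aws-sdk')
const client = new AWS.DynamoDB()

// Collect at most `max` values of prod_number, scanning page by page.
async function scanAtMost(max) {
  const ids = []
  let lastKey
  do {
    const params = {TableName: 'table', Limit: 10}
    if (lastKey) params.ExclusiveStartKey = lastKey
    const result = await client.scan(params).promise()
    result.Items.forEach((item) => ids.push(item.prod_number.S))
    lastKey = result.LastEvaluatedKey
  } while (lastKey && ids.length < max) // stop early once we have enough
  return ids.slice(0, max)
}

scanAtMost(10).then((ids) => console.log('Result size:', ids.length))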
From Firestore docs, we get that the maximum size for a Firestore document is:
Maximum size for a document 1 MiB (1,048,576 bytes)
QUESTION
How can I know the current size of a single doc, to check whether I'm approaching that 1 MiB limit?
Example:
var docRef = db.collection("cities").doc("SF");
docRef.get().then(function(doc) {
if (doc.exists) {
console.log("Document data:", doc.data());
// IS THERE A PROPERTY THAT CAN DISPLAY THE DOCUMENT FILE SIZE?
} else {
// doc.data() will be undefined in this case
console.log("No such document!");
}
}).catch(function(error) {
console.log("Error getting document:", error);
});
The calculations used to compute the size of a document are fully documented here. There is a lot of text there, so please navigate there to read it; it's not worthwhile to copy all of it here.
If you're having to manually compute the size of a document as it grows, my opinion is that you're probably not modeling your data scalably. If you have lists of data that can grow unbounded, you probably shouldn't be using a list field, and instead put that data in documents in a new collection or subcollection. There are some exceptions to this rule, but generally speaking, you should not have to worry about computing the size of a document in your client code.
I've published an npm package that calculates the size of a Firestore document.
Other packages like sizeof or object-sizeof that calculate the size of a JS object will not give you a precise result, because some primitives in Firestore have a different byte size. For example, a boolean in JS is stored in 4 bytes, while in a Firestore document it's 1 byte; null is 0 bytes in JS, but 1 byte in Firestore.
In addition, Firestore has its own unique types with fixed byte sizes: geo point, date, reference.
A reference is a large object. Packages like sizeof will traverse all the methods/properties of the reference instead of doing the right thing here, which is to sum the string value of the document name, plus the path to it, plus 16 bytes. Also, if the reference points to a parent doc, sizeof or object-sizeof will not detect the circular reference, which might spell even bigger trouble than an incorrect size.
For Android users who want to check the size of a document against the maximum quota of 1 MiB (1,048,576 bytes), there is a library I have made that can help you calculate it:
https://github.com/alexmamo/FirestoreDocument-Android/tree/master/firestore-document
This way, you'll always be able to stay below the limit. The algorithm behind this library is the one explained in the official documentation regarding storage size.
I was looking in the Firebase reference expecting that the metadata would have such an attribute, but it doesn't. You can check it here.
So my next approach would be to estimate the size of the object as an approximation. The sizeOf library seems to have a reasonable API for it.
So it would be something like:
sizeof.sizeof(doc.data());
I wouldn't use the document snapshot, because it contains metadata, like whether there are pending saves. On the other hand, overestimating could be better in some cases.
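Putting that together, a sketch (assuming the sizeof npm package used above and the docRef from the question):

const sizeof = require('sizeof')

docRef.get().then(function(doc) {
  if (doc.exists) {
    // Approximate the stored size from the raw data only, not the
    // snapshot wrapper, which carries metadata such as pending writes.
    console.log('Approximate size in bytes:', sizeof.sizeof(doc.data()))
  }
})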
[UPDATE] Thanks to Doug Stevenson for the wonderful insight
So I was curious how big the difference would actually be, and with my clunky JS I made a quick-and-dirty comparison; you can see the demo here.
Considering this object:
{
  "boolean": true,
  "number": 1,
  "text": "example"
}
And discounting the id, this is the result:
| Method | Bytes |
|---------|-------|
| FireDoc | 37 |
| sizeOf | 64 |
So the sizeOf library could be a good predictor if we want to overestimate (assuming the calculations are fine and will behave more or less the same for more complex entities). But as explained in the comment, it is a rough estimation.
For Swift users:
If you want to estimate the document size, I use the following. It returns the estimated size of the document in bytes. It's not 100% accurate, but it gives a solid estimate. Basically, it just converts each key and value in the data map to a string and adds up the bytes of each string + 1. You can see the following link for details on how Firebase determines doc size: https://firebase.google.com/docs/firestore/storage-size.
// Rough estimate of a document's size in bytes: each key or string
// value counts as its character count + 1. Numbers, booleans, and
// non-string array elements are ignored, hence only an estimate.
func getDocumentSize(data: [String : Any]) -> Int {
    var size = 0
    for (k, v) in data {
        size += k.count + 1
        if let map = v as? [String : Any] {
            // Nested maps are measured recursively.
            size += getDocumentSize(data: map)
        } else if let array = v as? [String] {
            for a in array {
                size += a.count + 1
            }
        } else if let s = v as? String {
            size += s.count + 1
        }
    }
    return size
}
You can use this calculator (code snippet), which I wrote myself.
Source: https://firebase.google.com/docs/firestore/storage-size
<!DOCTYPE html>
<html>
<head>
<title>Calculte Firestore Size</title>
</head>
<body>
<h1>Firestore Document Size Calculator</h1>
<h2 id="response" style="color:red">This is a Heading</h2>
<textarea id="id" style="width: 100%" placeholder="Firestore Doc Ref"></textarea>
<textarea id="json" style="width: 100%; min-height: 200px" placeholder="Firestore Doc Value JSON STRING"></textarea>
<textarea id="quantity" style="width: 100%;" placeholder="How Many repeat this value?"></textarea>
<script>
  // Pre-fill the demo values from the storage-size documentation example.
  document.getElementById("json").value = '{"type": "Personal","done": false , "priority": 1 , "description": "Learn Cloud Firestore"}';
  document.getElementById("id").value = 'users/jeff/tasks/my_task_id';
  calculate();

  // yuzdeBul ("find the percentage" in Turkish) returns what percent
  // of `total` the given `number` is, rounded up.
  function yuzdeBul(total, number) {
    if (number == 0) {
      return 0;
    }
    const sonuc = Math.ceil(parseInt(number) / (parseInt(total) / 100));
    return sonuc;
  }

  function calculate() {
    var quantity = parseInt(document.getElementById("quantity").value || 1);
    var firestoreId = document.getElementById("id").value;
    // Document name: each path segment costs its length + 1 byte,
    // plus 16 bytes for the name as a whole.
    var refTotal = firestoreId
      .split("/")
      .map((v) => v.length + 1)
      .reduce((a, b) => a + b, 0) + 16;
    var idTotal = 0;
    var parseJson = JSON.parse(document.getElementById("json").value);
    idTotal += calculateObj(parseJson);
    // Every document carries 32 additional bytes of overhead.
    idTotal += 32;
    // Multiply by how many times this value is repeated.
    idTotal *= quantity;
    idTotal += refTotal;
    document.getElementById("response").innerHTML =
      idTotal + "/" + 1048576 + " %" + yuzdeBul(1048576, idTotal);
  }

  function calculateObj(myObj) {
    var total = Object.keys(myObj).map((key) => {
      // Field names cost their length + 1 byte.
      var keySize = key.toString().length + 1;
      var findType = typeof myObj[key];
      if (findType == "string") {
        // Strings cost their length + 1 byte.
        keySize += myObj[key].length + 1;
      } else if (findType == "boolean") {
        // Booleans cost 1 byte.
        keySize += 1;
      }
      if (findType == "number") {
        // Integers and doubles cost 8 bytes.
        keySize += 8;
      }
      if (findType == "object") {
        // Maps (and arrays) are measured recursively.
        keySize += calculateObj(myObj[key]);
      }
      return keySize;
    });
    return total.reduce((a, b) => a + b, 0);
  }

  document.getElementById("json").addEventListener("change", calculate);
  document.getElementById("id").addEventListener("change", calculate);
  document.getElementById("quantity").addEventListener("change", calculate);
</script>
</body>
</html>
So I was looking for a way to reduce unnecessary document reads by accumulating data in arrays, and got worried about the size.
Turns out I wasn't even close to the limit.
Here's what you can do:
Create a new collection and add a document with the worst-case scenario for live data, then export that collection using the Cloud console; the export will show you the document size.
Here is a screenshot of my export
Assuming all the documents are equal in size, each is 0.0003MB
You can also see whether any documents exceed the 1024-byte limit:
[Screenshot: a document exceeding the limit, shown in the console]
Note: you can only export when you have billing enabled.
After reading the documentation, I'm having a hard time conceptualizing the change feed. Let's take the code from the documentation below. The second change feed call picks up the changes made since the last run via the checkpoints. Let's say it is being used to create summary data, there was an issue, and it needs to be re-run from a prior time. I don't understand the following:
How do I specify a particular time the checkpoint should start from? I understand I can save the checkpoint dictionary and use it for each run, but how do you get the changes from time X, say, to rerun some summary data?
Secondly, let's say we are rerunning some summary data and we save the last checkpoint used for each summarized item, so we know where that one left off. How does one know whether a record is in or before that checkpoint?
Code that runs from collection beginning and then from last checkpoint:
Dictionary<string, string> checkpoints = await GetChanges(client, collection, new Dictionary<string, string>());

await client.CreateDocumentAsync(collection, new DeviceReading {
    DeviceId = "xsensr-201", MetricType = "Temperature", Unit = "Celsius", MetricValue = 1000
});
await client.CreateDocumentAsync(collection, new DeviceReading {
    DeviceId = "xsensr-212", MetricType = "Pressure", Unit = "psi", MetricValue = 1000
});

// Returns only the two documents created above.
checkpoints = await GetChanges(client, collection, checkpoints);

private async Task<Dictionary<string, string>> GetChanges(
    DocumentClient client,
    string collection,
    Dictionary<string, string> checkpoints)
{
    List<PartitionKeyRange> partitionKeyRanges = new List<PartitionKeyRange>();
    FeedResponse<PartitionKeyRange> pkRangesResponse;

    do
    {
        pkRangesResponse = await client.ReadPartitionKeyRangeFeedAsync(collection);
        partitionKeyRanges.AddRange(pkRangesResponse);
    }
    while (pkRangesResponse.ResponseContinuation != null);

    foreach (PartitionKeyRange pkRange in partitionKeyRanges)
    {
        string continuation = null;
        checkpoints.TryGetValue(pkRange.Id, out continuation);

        IDocumentQuery<Document> query = client.CreateDocumentChangeFeedQuery(
            collection,
            new ChangeFeedOptions
            {
                PartitionKeyRangeId = pkRange.Id,
                StartFromBeginning = true,
                RequestContinuation = continuation,
                MaxItemCount = 1
            });

        while (query.HasMoreResults)
        {
            FeedResponse<DeviceReading> readChangesResponse = query.ExecuteNextAsync<DeviceReading>().Result;

            foreach (DeviceReading changedDocument in readChangesResponse)
            {
                Console.WriteLine(changedDocument.Id);
            }

            checkpoints[pkRange.Id] = readChangesResponse.ResponseContinuation;
        }
    }

    return checkpoints;
}
DocumentDB supports checkpointing only by the logical timestamp returned by the server. If you would like to retrieve all changes from X minutes ago, you would have to "remember" the logical timestamp corresponding to that clock time (the ETag returned for the collection in the REST API, ResponseContinuation in the SDK), then use it to retrieve changes.
Change feed uses logical time in place of clock time because it can be different across various servers/partitions. If you would like to see change feed support based on clock time (with some caveats on skew), please propose/upvote at https://feedback.azure.com/forums/263030-documentdb/.
To save the last checkpoint per partition key/document, you can just save the corresponding version of the batch in which it was last seen (the ETag returned for the collection in the REST API, ResponseContinuation in the SDK), as Fred suggested in his answer.
How to specify a particular time the checkpoint should start.
You could try providing a logical version/ETag (such as 95488) instead of a null value as the RequestContinuation property of ChangeFeedOptions; in the code above, that means seeding checkpoints[pkRange.Id] with the saved value before calling GetChanges.
I have this code
using (var contents = connection.CreateCommand())
{
    contents.CommandText = "SELECT [subject],[note] FROM tasks";
    var r = contents.ExecuteReader();
    int zaehler = 0;
    int zielzahl = 5;

    while (r.Read())
    {
        if (zaehler == zielzahl)
        {
            // access r["subject"].ToString()
        }
        zaehler++;
    }
}
I want to make it faster by accessing record number zielzahl directly, something like r[zielzahl], instead of iterating through all entries. But
r[zielzahl]["subject"]
does not work, and neither does
r["subject"][zielzahl]
How do I access the column subject of result number zielzahl?
To get only the sixth record, use the OFFSET clause:
SELECT subject, note
FROM tasks
LIMIT 1 OFFSET 5
Please note that the order of returned records is not guaranteed unless you use the ORDER BY clause.